💡💡💡 Exclusive improvement in this article: most existing attention mechanisms neglect the modeling of multi-scale feature representations, structural information, and long-range channel dependencies. Multi-scale Spatial Pyramid Attention (MSPA) addresses exactly these problems.
💡💡💡 Innovations: a hierarchical (HPC) module that uses hierarchical residual connections to extract multi-scale spatial information at a finer granularity, plus a spatial pyramid recalibration (SPR) module that integrates structural regularization and structural information through an adaptive combination mechanism.
💡💡💡 How it is combined with YOLOv8: 1) innovatively combined with C2f.
💡💡💡 In my experiments it brings clear accuracy gains on small-object datasets; subscribers have also reported gains on common datasets such as NEU-DET defect detection and agricultural disease detection.
The structure diagram of Improvement 1 is shown below:

💡💡💡 Web-exclusive first-release innovation (original work), suitable for papers!!!
💡💡💡 Innovations from 2024 top computer-vision conferences, applicable to YOLOv5, YOLOv7, YOLOv8, and the other YOLO series. The column articles provide every step and the source code, making it easy to start modifying the network!!!
💡💡💡 Key point: after reading this column, you will be able to design your own network modifications, altering different parts of the network (Backbone, head, detect, loss, etc.) to achieve innovation!!!
1. Principle Overview

Paper: Multi-scale spatial pyramid attention mechanism for image recognition: An effective approach - ScienceDirect

Abstract: Attention mechanisms have gradually become necessary for enhancing the representational power of convolutional neural networks (CNNs). Despite recent progress in attention-mechanism research, some problems remain unsolved. Most existing methods neglect the modeling of multi-scale feature representations, structural information, and long-range channel dependencies, which are essential for producing more discriminative attention maps. This study proposes a novel, low-overhead, high-performance attention mechanism with strong generalization ability across various networks and datasets. This mechanism, called Multi-scale Spatial Pyramid Attention (MSPA), can be used to address the limitations of other attention methods. For the key components of MSPA, we not only develop the hierarchical (HPC) module, which uses hierarchical residual connections to extract multi-scale spatial information at a finer granularity, but also design the spatial pyramid recalibration (SPR) module, which integrates structural regularization and structural information through an adaptive combination mechanism, while employing the Softmax operation to build long-range channel dependencies. The proposed MSPA is a powerful tool that can be conveniently embedded into various CNNs as a plug-and-play component. Accordingly, by replacing the 3 × 3 convolution in the bottleneck residual blocks of ResNets with MSPA, we create a family of simple yet efficient backbones called MSPANet, which naturally inherit the advantages of MSPA. Extensive experimental results on CIFAR-100 and ImageNet-1K image recognition show that our method substantially outperforms other state-of-the-art methods on all evaluation metrics. When applying MSPA to ResNet-50, our model achieves top-1 classification accuracies of 81.74% and 78.40% on the CIFAR-100 and ImageNet-1K benchmarks, exceeding the corresponding baselines by 3.95% and 2.27%, respectively. Compared with the competitive EPSANet50, we also obtain performance improvements of 1.15% and 0.91%. Moreover, empirical results in an autonomous-driving engineering application demonstrate that our method can significantly improve the accuracy and real-time performance of image recognition at low overhead.
Figure 1. Overall architecture of the proposed MSPA module. It contains three core components: the HPC module, the SPR module, and the Softmax operation. The HPC module is designed to extract multi-scale spatial information. The SPR module is responsible for learning channel attention weights and building cross-dimension interaction. The Softmax operation recalibrates the channel attention weights to establish long-range channel dependencies. Here, the feature maps are denoted as X ∈ ℝ^(C×H×W), where C, H, and W denote the number of channels, the height, and the width of the feature maps, respectively. ⊕ denotes element-wise summation and ⊗ denotes element-wise multiplication.

Figure 2. Details of the proposed HPC module, where Split denotes an even split along the channel dimension, Conv denotes a standard 3 × 3 convolutional layer followed by batch normalization, and Concat denotes concatenating features along the channel dimension.

Figure 3. Schematic of the proposed SPR module. It consists of two basic components: a spatial pyramid aggregation block and a channel interaction block. The spatial pyramid aggregation block uses a pyramid-like two-level adaptive average pooling of different output sizes to combine structural regularization with structural information in the attention path. The channel interaction block learns attention maps from the output of the spatial pyramid aggregation structure. Here, AAP(·) denotes adaptive average pooling, Up-sampling denotes up-sampling with nearest-neighbor interpolation, and PW Conv denotes point-wise convolution.
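To make the pyramid-pooling arithmetic concrete, here is a minimal sketch of my own (the author's actual SPRModule implementation follows in Section 2.1): pooling a C-channel map at output sizes 1 and 2 yields C + 4C = 5C values, which is why the first point-wise convolution in the implementation below takes channels * 5 inputs.

import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)   # B=1, C=64 feature map
p1 = nn.AdaptiveAvgPool2d(1)(x)  # (1, 64, 1, 1): C pooled values
p2 = nn.AdaptiveAvgPool2d(2)(x)  # (1, 64, 2, 2): 4C pooled values
pooled = torch.cat((p1.flatten(1), p2.flatten(1)), dim=1)
print(pooled.shape)              # torch.Size([1, 320]) = 5C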

Figure 4. Comparison between the original bottleneck residual block (a) and the proposed basic building block of MSPANet (b).
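Per the abstract, MSPANet is obtained by swapping the 3 × 3 convolution of a ResNet bottleneck for MSPA. Below is a minimal sketch of that building block (my own illustration, not the paper's code; the layer names and the generic mspa_module argument are assumptions, and the module is assumed to preserve channel count and spatial size):

import torch.nn as nn

class MSPABottleneck(nn.Module):
    """ResNet bottleneck whose 3x3 conv is replaced by an MSPA-style module (sketch)."""

    def __init__(self, c_in, c_mid, mspa_module):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(c_in, c_mid, 1, bias=False),
                                    nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True))
        self.mspa = mspa_module  # stands in for the original 3x3 convolution
        self.expand = nn.Sequential(nn.Conv2d(c_mid, c_in, 1, bias=False),
                                    nn.BatchNorm2d(c_in))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.expand(self.mspa(self.reduce(x))))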

2. Adding MSAM to YOLOv8
2.1 Add ultralytics/nn/attention/mspanet.py
import torch
import torch.nn as nn
from ultralytics.nn.modules import Conv, C3, Bottleneck, C2f


class SPRModule(nn.Module):
    """Spatial Pyramid Recalibration (SPR): learns channel attention weights from a
    two-level spatial pyramid of pooled features."""

    def __init__(self, channels, reduction=16):
        super(SPRModule, self).__init__()
        self.avg_pool1 = nn.AdaptiveAvgPool2d(1)  # 1x1 pooling -> C values
        self.avg_pool2 = nn.AdaptiveAvgPool2d(2)  # 2x2 pooling -> 4C values
        # 1x1 + 2x2 pooling yields C + 4C = 5C inputs for the first point-wise conv
        self.fc1 = nn.Conv2d(channels * 5, channels // reduction, kernel_size=1, padding=0)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Conv2d(channels // reduction, channels, kernel_size=1, padding=0)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out1 = self.avg_pool1(x).view(x.size(0), -1, 1, 1)
        out2 = self.avg_pool2(x).view(x.size(0), -1, 1, 1)
        out = torch.cat((out1, out2), 1)

        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)
        weight = self.sigmoid(out)
        return weight


def conv3x3(in_planes, out_planes, stride=1):
    """3x3 convolution with padding."""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride, padding=1, bias=False)


def conv1x1(in_planes, out_planes, stride=1):
    """1x1 convolution."""
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)


def convdilated(in_planes, out_planes, kSize=3, stride=1, dilation=1):
    """3x3 convolution with dilation."""
    padding = int((kSize - 1) / 2) * dilation
    return nn.Conv2d(in_planes, out_planes, kernel_size=kSize, stride=stride, padding=padding,
                     dilation=dilation, bias=False)


class MSAModule(nn.Module):
    """Multi-scale spatial attention: HPC-style multi-scale extraction followed by SPR
    channel attention, with a Softmax across the scale groups."""

    def __init__(self, inplanes, scale=1, stride=1, stype='normal'):
        """
        Args:
            inplanes: channel dimensionality of each scale group.
            scale: number of scale groups.
            stride: conv stride.
            stype: 'normal' for a regular block, 'stage' for the first block of a new stage.
        """
        super(MSAModule, self).__init__()
        self.width = inplanes
        self.nums = scale
        self.stride = stride
        assert stype in ['stage', 'normal'], "Only 'stage' or 'normal' is supported"
        self.stype = stype

        self.convs = nn.ModuleList([])
        self.bns = nn.ModuleList([])
        for i in range(self.nums):
            if self.stype == 'stage' and self.stride != 1:
                # Down-sampling blocks use dilated convs with growing dilation rates
                self.convs.append(convdilated(self.width, self.width, stride=stride, dilation=int(i + 1)))
            else:
                self.convs.append(conv3x3(self.width, self.width, stride))
            self.bns.append(nn.BatchNorm2d(self.width))

        self.attention = SPRModule(self.width)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        batch_size = x.shape[0]

        # HPC: hierarchical residual connections feed each scale group with the
        # previous group's output before its own 3x3 conv
        spx = torch.split(x, self.width, 1)
        for i in range(self.nums):
            if i == 0 or (self.stype == 'stage' and self.stride != 1):
                sp = spx[i]
            else:
                sp = sp + spx[i]
            sp = self.convs[i](sp)
            sp = self.bns[i](sp)
            out = sp if i == 0 else torch.cat((out, sp), 1)
        feats = out.view(batch_size, self.nums, self.width, out.shape[2], out.shape[3])

        # SPR attention weights per scale group, recalibrated by Softmax across groups
        sp_inp = torch.split(out, self.width, 1)
        attn_weight = torch.cat([self.attention(inp) for inp in sp_inp], dim=1)
        attn_vectors = self.softmax(attn_weight.view(batch_size, self.nums, self.width, 1, 1))

        # Reweight each scale group and concatenate back along the channel dimension
        feats_weight = feats * attn_vectors
        for i in range(self.nums):
            x_attn_weight = feats_weight[:, i, :, :, :]
            out = x_attn_weight if i == 0 else torch.cat((out, x_attn_weight), 1)
        return out


class Bottleneck_MSAModule(Bottleneck):
    """Standard YOLOv8 bottleneck with its second conv replaced by MSAModule."""

    def __init__(self, c1, c2, shortcut=True, g=1, k=(3, 3), e=0.5):  # ch_in, ch_out, shortcut, groups, kernels, expand
        super().__init__(c1, c2, shortcut, g, k, e)
        c_ = int(c2 * e)  # hidden channels
        self.cv2 = MSAModule(c_)


class C2f_MSAM(C2f):
    """C2f block whose bottlenecks use the multi-scale attention module."""

    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):
        super().__init__(c1, c2, n, shortcut, g, e)
        self.m = nn.ModuleList(
            Bottleneck_MSAModule(self.c, self.c, shortcut, g, k=(3, 3), e=1.0) for _ in range(n)
        )
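A quick smoke test (my own addition, not from the original post) to confirm the module builds and preserves the feature-map shape:

import torch
from ultralytics.nn.attention.mspanet import C2f_MSAM

x = torch.randn(2, 64, 40, 40)            # dummy P3-level feature map
m = C2f_MSAM(64, 64, n=1, shortcut=True)  # same in/out channels as a backbone C2f
print(m(x).shape)                         # expected: torch.Size([2, 64, 40, 40])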
2.2 Modify tasks.py
This improvement is based on the latest official version (which already includes newly added modules such as C2fAttn).
Download: GitHub - ultralytics/ultralytics: NEW - YOLOv8 🚀 in PyTorch > ONNX > OpenVINO > CoreML > TFLite
1) First, register the module at the top of ultralytics/nn/tasks.py:

from ultralytics.nn.attention.mspanet import C2f_MSAM

2) Then modify def parse_model(d, ch, verbose=True):  # model_dict, input_channels(3)
You only need to add C2f_MSAM on top of the stock source; the other modules in the tuple below come from the author's other improvement articles.
n = n_ = max(round(n * depth), 1) if n > 1 else n  # depth gain
if m in (
    Classify,
    Conv,
    ConvTranspose,
    GhostConv,
    Bottleneck,
    GhostBottleneck,
    SPP,
    SPPF,
    DWConv,
    Focus,
    BottleneckCSP,
    C1,
    C2,
    C2f,
    C2fAttn,
    C3,
    C3TR,
    C3Ghost,
    nn.ConvTranspose2d,
    DWConvTranspose2d,
    C3x,
    RepC3,
    C2f_MSAM,
):
    c1, c2 = ch[f], args[0]
    if c2 != nc:  # if c2 not equal to number of classes (i.e. for Classify() output)
        c2 = make_divisible(min(c2, max_channels) * width, 8)
    args = [c1, c2, *args[1:]]
    if m in (BottleneckCSP, C1, C2, C2f, C2fAttn, C3, C3TR, C3Ghost, C3x, RepC3, C2f_MSAM):
        args.insert(2, n)  # number of repeats
        n = 1
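A worked example of what this rescaling does (my own sketch; make_divisible here mirrors the helper parse_model uses): for the 'n' scale below (width = 0.25, max_channels = 1024), the backbone entry - [-1, 3, C2f_MSAM, [128, True]] is built as C2f_MSAM(c1, 32, ...):

import math

def make_divisible(x, divisor=8):
    # Mirrors Ultralytics' helper: round channel counts up to a multiple of 8
    return math.ceil(x / divisor) * divisor

width, max_channels = 0.25, 1024  # YOLOv8n compound-scaling constants
c2 = make_divisible(min(128, max_channels) * width, 8)
print(c2)  # 32 -> args become [c1, 32, True], then n is inserted as the repeat count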
2.3 yolov8_C2f_MSAM.yaml

# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLOv8 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n.yaml' will call yolov8.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.33, 0.25, 1024] # YOLOv8n summary: 225 layers, 3157200 parameters, 3157184 gradients, 8.9 GFLOPs
  s: [0.33, 0.50, 1024] # YOLOv8s summary: 225 layers, 11166560 parameters, 11166544 gradients, 28.8 GFLOPs
  m: [0.67, 0.75, 768] # YOLOv8m summary: 295 layers, 25902640 parameters, 25902624 gradients, 79.3 GFLOPs
  l: [1.00, 1.00, 512] # YOLOv8l summary: 365 layers, 43691520 parameters, 43691504 gradients, 165.7 GFLOPs
  x: [1.00, 1.25, 512] # YOLOv8x summary: 365 layers, 68229648 parameters, 68229632 gradients, 258.5 GFLOPs

# YOLOv8.0n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 3, C2f_MSAM, [128, True]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 6, C2f_MSAM, [256, True]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 6, C2f_MSAM, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 3, C2f_MSAM, [1024, True]]
  - [-1, 1, SPPF, [1024, 5]] # 9

# YOLOv8.0n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 3, C2f, [512]] # 12
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 3, C2f, [256]] # 15 (P3/8-small)
  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 12], 1, Concat, [1]] # cat head P4
  - [-1, 3, C2f, [512]] # 18 (P4/16-medium)
  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 9], 1, Concat, [1]] # cat head P5
  - [-1, 3, C2f, [1024]] # 21 (P5/32-large)
  - [[15, 18, 21], 1, Detect, [nc]] # Detect(P3, P4, P5)
You can also 1) replace the C2f blocks in the neck, or 2) replace the C2f blocks in both the backbone and the neck.
Which variant gains the most depends on your dataset, so run multiple experiments.
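Finally, a usage sketch for building and training the modified model via the standard Ultralytics API (the dataset yaml and hyper-parameters are placeholders; substitute your own):

from ultralytics import YOLO

# Build the model from the config above; C2f_MSAM must already be registered in tasks.py
model = YOLO('yolov8_C2f_MSAM.yaml')
model.info()  # check that the C2f_MSAM layers parse and inspect parameters/GFLOPs
model.train(data='coco128.yaml', epochs=100, imgsz=640)  # replace with your dataset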