1. Introduction
This post introduces an improvement that combines YOLOv11 with MLLA (Mamba-Like Linear Attention), an attention mechanism derived from the Mamba architecture, which has been promoted as a challenger to the Transformer. The idea behind MLLA is to fold Mamba's core design choices into linear attention: specifically the forget gate and the block design, the two factors identified as the main reasons for Mamba's success. Instead of the recurrent forget gate, MLLA supplies positional information through rotary position embedding (RoPE), so it keeps parallel computation and fast inference while remaining order-aware, which makes it a better fit for non-autoregressive vision tasks. The YOLOv11 integration below is my own exclusive write-up, first published here.
2. How MLLA Works
Official paper: click here to open it
Official code: click here to open it
In the paper, MLLA (Mamba-Like Linear Attention) improves linear attention by folding in some of Mamba's core designs. Concretely, MLLA integrates two key factors from Mamba: the forget gate and the block design, which the authors identify as the main reasons behind Mamba's success.
A closer look at the ingredients:
1. Forget gate:
- The forget gate provides local bias and positional information. Every forget-gate element lies strictly between 0 and 1, so the model keeps decaying the previous hidden state after each new input, which makes it sensitive to the order of the input sequence.
- That local bias and positional information matter for image tasks, but the forget gate forces the computation into a recurrent form, which hurts parallel efficiency.
2. Block design:
- Mamba's block design improves performance while keeping roughly the same FLOPs once the attention sub-module is swapped for linear attention. The experiments show that adopting this block design alone gives a clear boost.
3. Improved linear attention:
- Linear attention is redesigned to absorb the forget gate and the block design; the resulting model is called MLLA. Experiments show that MLLA outperforms a range of vision Mamba models on both image classification and high-resolution dense prediction.
4. Parallel computation and fast inference:
- MLLA replaces the forget gate with rotary position embedding (RoPE), which supplies the necessary positional information while preserving parallel computation and fast inference. This makes MLLA more effective on non-autoregressive vision tasks.
With these changes MLLA inherits the strengths of Mamba while sidestepping its limitation on parallel computation, making it better suited to vision tasks. It also shows that, with the right design, linear attention can outperform strong existing models.
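For reference, the non-autoregressive linear attention that MLLA builds on can be written in the standard kernelized form; this matches what the code in the next section computes, with RoPE additionally applied to the queries and keys:

$$O_i=\frac{\phi(q_i)\Big(\sum_{j=1}^{N}\phi(k_j)^{\top}v_j\Big)}{\phi(q_i)\sum_{j=1}^{N}\phi(k_j)^{\top}},\qquad \phi(x)=\mathrm{ELU}(x)+1$$

Because the summary term over keys and values is shared by every query, the cost grows linearly with the number of tokens N instead of quadratically, which is what keeps the module affordable at detection feature-map resolutions.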
3. Core Code
The code below includes the RoPE module mentioned above, but I redesigned it: the original implementation required the image height and width to be passed in at construction time, whereas this version computes the rotations on the fly from the input shape. If you are interested, compare it with the open-source code.
import torch
import torch.nn as nn

__all__ = ['MLLAttention', 'C2PSAMLLA']


class Mlp(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x


class ConvLayer(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=0, dilation=1, groups=1,
                 bias=True, dropout=0, norm=nn.BatchNorm2d, act_func=nn.ReLU):
        super(ConvLayer, self).__init__()
        self.dropout = nn.Dropout2d(dropout, inplace=False) if dropout > 0 else None
        self.conv = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size=(kernel_size, kernel_size),
            stride=(stride, stride),
            padding=(padding, padding),
            dilation=(dilation, dilation),
            groups=groups,
            bias=bias,
        )
        self.norm = norm(num_features=out_channels) if norm else None
        self.act = act_func() if act_func else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.dropout is not None:
            x = self.dropout(x)
        x = self.conv(x)
        if self.norm:
            x = self.norm(x)
        if self.act:
            x = self.act(x)
        return x
class RoPE(torch.nn.Module):
    r"""Rotary Positional Embedding.

    This version infers the spatial size from the input at runtime instead of requiring it at construction time.
    """

    def __init__(self, base=10000):
        super(RoPE, self).__init__()
        self.base = base

    def generate_rotations(self, x):
        # Read the spatial and feature dimensions from the input tensor
        *channel_dims, feature_dim = x.shape[1:-1][0], x.shape[-1]
        k_max = feature_dim // (2 * len(channel_dims))
        assert feature_dim % k_max == 0, "Feature dimension must be divisible by 2 * k_max"

        # Generate the rotation angles
        theta_ks = 1 / (self.base ** (torch.arange(k_max, dtype=x.dtype, device=x.device) / k_max))
        angles = torch.cat([t.unsqueeze(-1) * theta_ks for t in
                            torch.meshgrid([torch.arange(d, dtype=x.dtype, device=x.device) for d in channel_dims],
                                           indexing='ij')], dim=-1)

        # Real and imaginary parts of the rotation matrix
        rotations_re = torch.cos(angles).unsqueeze(dim=-1)
        rotations_im = torch.sin(angles).unsqueeze(dim=-1)
        rotations = torch.cat([rotations_re, rotations_im], dim=-1)
        return rotations

    def forward(self, x):
        # Build the rotations for the current input shape
        rotations = self.generate_rotations(x)
        # View x as complex numbers (pairs of channels)
        x_complex = torch.view_as_complex(x.reshape(*x.shape[:-1], -1, 2))
        # Apply the rotations
        pe_x = torch.view_as_complex(rotations) * x_complex
        # Back to real values, flattening the last two dimensions
        return torch.view_as_real(pe_x).flatten(-2)
class MLLAttention(nn.Module):
    r""" Linear Attention with LePE and RoPE.
    Args:
        dim (int): Number of input channels.
        num_heads (int): Number of attention heads.
        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True
    """

    def __init__(self, dim=3, input_resolution=[160, 160], num_heads=4, qkv_bias=True, **kwargs):
        super().__init__()
        self.dim = dim
        self.input_resolution = input_resolution
        self.num_heads = num_heads
        self.qk = nn.Linear(dim, dim * 2, bias=qkv_bias)
        self.elu = nn.ELU()
        self.lepe = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.rope = RoPE()

    def forward(self, x):
        """
        Args:
            x: input features with shape of (B, C, H, W); the feature map is assumed to be square
        """
        x = x.flatten(2).transpose(1, 2)  # (B, C, H, W) -> (B, N, C)
        b, n, c = x.shape
        h = int(n ** 0.5)
        w = int(n ** 0.5)
        num_heads = self.num_heads
        head_dim = c // num_heads

        qk = self.qk(x).reshape(b, n, 2, c).permute(2, 0, 1, 3)
        q, k, v = qk[0], qk[1], x
        # q, k, v: b, n, c

        # Kernel feature map of linear attention: elu(x) + 1 keeps values positive
        q = self.elu(q) + 1.0
        k = self.elu(k) + 1.0

        # RoPE supplies the positional information that replaces Mamba's forget gate
        q_rope = self.rope(q.reshape(b, h, w, c)).reshape(b, n, num_heads, head_dim).permute(0, 2, 1, 3)
        k_rope = self.rope(k.reshape(b, h, w, c)).reshape(b, n, num_heads, head_dim).permute(0, 2, 1, 3)
        q = q.reshape(b, n, num_heads, head_dim).permute(0, 2, 1, 3)
        k = k.reshape(b, n, num_heads, head_dim).permute(0, 2, 1, 3)
        v = v.reshape(b, n, num_heads, head_dim).permute(0, 2, 1, 3)

        # Linear attention: a shared key-value summary, then per-query normalisation
        z = 1 / (q @ k.mean(dim=-2, keepdim=True).transpose(-2, -1) + 1e-6)
        kv = (k_rope.transpose(-2, -1) * (n ** -0.5)) @ (v * (n ** -0.5))
        x = q_rope @ kv * z
        x = x.transpose(1, 2).reshape(b, n, c)

        # LePE: depth-wise convolution over V as local positional enhancement
        v = v.transpose(1, 2).reshape(b, h, w, c).permute(0, 3, 1, 2)
        x = x + self.lepe(v).permute(0, 2, 3, 1).reshape(b, n, c)

        x = x.transpose(2, 1).reshape((b, c, h, w))
        return x

    def extra_repr(self) -> str:
        return f'dim={self.dim}, num_heads={self.num_heads}'
def autopad(k, p=None, d=1):  # kernel, padding, dilation
    """Pad to 'same' shape outputs."""
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p


class Conv(nn.Module):
    """Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)."""

    default_act = nn.SiLU()  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        """Initialize Conv layer with given arguments including activation."""
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        """Apply convolution, batch normalization and activation to input tensor."""
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        """Apply convolution and activation without batch normalization (used after BN fusion)."""
        return self.act(self.conv(x))


class PSABlock(nn.Module):
    """
    PSABlock class implementing a Position-Sensitive Attention block for neural networks.

    This class encapsulates the functionality for applying multi-head attention and feed-forward neural network layers
    with optional shortcut connections.

    Attributes:
        attn (MLLAttention): MLLA attention module.
        ffn (nn.Sequential): Feed-forward neural network module.
        add (bool): Flag indicating whether to add shortcut connections.

    Methods:
        forward: Performs a forward pass through the PSABlock, applying attention and feed-forward layers.

    Examples:
        Create a PSABlock and perform a forward pass
    """

    def __init__(self, c, attn_ratio=0.5, num_heads=4, shortcut=True) -> None:
        """Initializes the PSABlock with attention and feed-forward layers for enhanced feature extraction."""
        super().__init__()
        self.attn = MLLAttention(c)
        self.ffn = nn.Sequential(Conv(c, c * 2, 1), Conv(c * 2, c, 1, act=False))
        self.add = shortcut

    def forward(self, x):
        """Executes a forward pass through PSABlock, applying attention and feed-forward layers to the input tensor."""
        x = x + self.attn(x) if self.add else self.attn(x)
        x = x + self.ffn(x) if self.add else self.ffn(x)
        return x


class C2PSAMLLA(nn.Module):
    """
    C2PSA module with MLLA attention for enhanced feature extraction and processing.

    This module implements a convolutional block with attention mechanisms to enhance feature extraction and processing
    capabilities. It includes a series of PSABlock modules for self-attention and feed-forward operations.

    Attributes:
        c (int): Number of hidden channels.
        cv1 (Conv): 1x1 convolution layer to reduce the number of input channels to 2*c.
        cv2 (Conv): 1x1 convolution layer to reduce the number of output channels to c.
        m (nn.Sequential): Sequential container of PSABlock modules for attention and feed-forward operations.

    Methods:
        forward: Performs a forward pass through the C2PSA module, applying attention and feed-forward operations.

    Notes:
        This module is essentially the same as the stock PSA/C2PSA module, but refactored to allow stacking more
        PSABlock modules, with the attention replaced by MLLAttention.
    """

    def __init__(self, c1, c2, n=1, e=0.5):
        """Initializes the C2PSA module with specified input/output channels, number of layers, and expansion ratio."""
        super().__init__()
        assert c1 == c2
        self.c = int(c1 * e)
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv(2 * self.c, c1, 1)
        self.m = nn.Sequential(*(PSABlock(self.c, attn_ratio=0.5, num_heads=self.c // 64) for _ in range(n)))

    def forward(self, x):
        """Processes the input tensor 'x' through a series of PSA blocks and returns the transformed tensor."""
        a, b = self.cv1(x).split((self.c, self.c), dim=1)
        b = self.m(b)
        return self.cv2(torch.cat((a, b), 1))
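A quick shape check of the modules above (illustrative only). Note that MLLAttention infers h = w = sqrt(N) internally, so it assumes square feature maps:

import torch

x = torch.randn(1, 256, 40, 40)            # (B, C, H, W), e.g. a P3-sized feature map
attn = MLLAttention(dim=256, num_heads=4)
print(attn(x).shape)                        # expected: torch.Size([1, 256, 40, 40])

block = C2PSAMLLA(256, 256, n=1)
print(block(x).shape)                       # expected: torch.Size([1, 256, 40, 40])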
4. Step-by-Step: Adding MLLA
4.1 Step 1
First, create the files. Go to the ultralytics/nn folder and create a directory named 'Addmodules' (if you are using the files from my group, it already exists and you can skip this). Inside it, create a new .py file and paste the core code above into it.
4.2 Step 2
Second, create a file named '__init__.py' in that directory (again, already present if you use the group files), and import the new module inside it, as in the sketch below.
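For example, if the file from step 4.1 was saved as MLLA.py (the file name is up to you; adjust the import to whatever you chose), the __init__.py would contain something like:

# ultralytics/nn/Addmodules/__init__.py  -- "MLLA" is a hypothetical file name from step 4.1
from .MLLA import MLLAttention, C2PSAMLLA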
4.3 Step 3
Third, open 'ultralytics/nn/tasks.py' and import and register our module there (if you use the group files, this is already done and you can jump straight to step 4).
From now on all of my tutorials will follow this format, because I assume everyone is working from the files shared in the group!
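A minimal sketch of the import near the top of ultralytics/nn/tasks.py (adjust it to however your Addmodules package exposes the classes):

# Near the other module imports at the top of ultralytics/nn/tasks.py
from ultralytics.nn.Addmodules import MLLAttention, C2PSAMLLA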
4.4 Step 4
Register the two modules inside parse_model the same way I do; a sketch follows below.
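The exact surrounding code in parse_model differs between ultralytics releases, so treat the following as a sketch of the two branches you need rather than a drop-in patch. The variable names (m, f, n, ch, args, width, max_channels) follow the ones already used inside parse_model; MLLAttention only needs the incoming channel count, while C2PSAMLLA behaves like the stock C2PSA (same input/output channels, with the repeat count inserted into its arguments):

# Sketch only -- add inside the module-dispatch if/elif chain of parse_model() in tasks.py
if m is MLLAttention:
    c2 = ch[f]                      # attention keeps the channel count unchanged
    args = [c2, *args]              # first positional arg of MLLAttention is `dim`
elif m is C2PSAMLLA:
    c1, c2 = ch[f], args[0]
    c2 = make_divisible(min(c2, max_channels) * width, 8)  # respect the scale's width multiplier
    args = [c1, c2, *args[1:]]
    args.insert(2, n)               # number of stacked PSABlocks
    n = 1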
4.5 Step 5
Find the build_dataset function of the DetectionTrainer class in ultralytics/models/yolo/detect/train.py and change rect=mode == 'val' to rect=False.
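A sketch of what the changed function looks like; the body may differ slightly between ultralytics versions, and the only edit is the rect argument:

# ultralytics/models/yolo/detect/train.py -> DetectionTrainer.build_dataset (sketch)
def build_dataset(self, img_path, mode="train", batch=None):
    gs = max(int(de_parallel(self.model).stride.max() if self.model else 0), 32)
    # original: rect=mode == "val"  ->  changed to rect=False
    return build_yolo_dataset(self.args, img_path, batch, self.data, mode=mode, rect=False, stride=gs)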
That completes the modifications; you can now copy one of the yaml files below and run it.
5. MLLA yaml Files and Training Records
5.1 MLLA yaml file 1
Training info for this version: YOLO11-C2PSA-MLLA summary: 312 layers, 2,577,691 parameters, 2,577,675 gradients, 6.5 GFLOPs
# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLO11 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
  m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
  l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
  x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs

# YOLO11n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 2, C3k2, [256, False, 0.25]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 2, C3k2, [512, False, 0.25]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 2, C3k2, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 2, C3k2, [1024, True]]
  - [-1, 1, SPPF, [1024, 5]] # 9
  - [-1, 2, C2PSAMLLA, [1024]] # 10

# YOLO11n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3k2, [512, False]] # 13
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [256, False]] # 16 (P3/8-small)
  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 13], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3k2, [512, False]] # 19 (P4/16-medium)
  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 10], 1, Concat, [1]] # cat head P5
  - [-1, 2, C3k2, [1024, True]] # 22 (P5/32-large)
  - [[16, 19, 22], 1, Detect, [nc]] # Detect(P3, P4, P5)
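To confirm the yaml parses and reproduces the parameter count quoted above, you can build the model without training. The file name 'yolo11-C2PSAMLLA.yaml' is just an example, save the config above under any name you like:

from ultralytics import YOLO

# Hypothetical file name -- use whatever you saved the yaml above as
model = YOLO('yolo11-C2PSAMLLA.yaml')
model.info()  # prints the layers / parameters / GFLOPs summary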
5.2 MLLA yaml file 2
Training info for this version: YOLO11-MLLA summary: 334 layers, 2,772,123 parameters, 2,772,107 gradients, 6.8 GFLOPs
# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLO11 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
  m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
  l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
  x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs

# YOLO11n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 2, C3k2, [256, False, 0.25]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 2, C3k2, [512, False, 0.25]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 2, C3k2, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 2, C3k2, [1024, True]]
  - [-1, 1, SPPF, [1024, 5]] # 9
  - [-1, 2, C2PSA, [1024]] # 10

# YOLO11n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3k2, [512, False]] # 13
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [256, False]] # 16 (P3/8-small)
  - [-1, 1, MLLAttention, []] # 17 (P3/8-small) attention added at the small-object detection branch
  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 13], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3k2, [512, False]] # 20 (P4/16-medium)
  - [-1, 1, MLLAttention, []] # 21 (P4/16-medium) attention added at the medium-object detection branch
  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 10], 1, Concat, [1]] # cat head P5
  - [-1, 2, C3k2, [1024, True]] # 24 (P5/32-large)
  - [-1, 1, MLLAttention, []] # 25 (P5/32-large) attention added at the large-object detection branch
  # I add three attention layers here, but in practice usually only one of them needs to stay active, so feel free
  # to comment the others out and experiment; I recommend keeping just one of the three, but the 'from' indices must stay aligned.
  # Which level gets the attention is best chosen according to your own dataset and scenario.
  # If you configure the attention positions yourself, make sure the from indices [17, 21, 25] still point to the
  # corresponding detection layers (e.g. with only layer 17 kept, the later layers shift and Detect's from becomes [17, 20, 23])!
  - [[17, 21, 25], 1, Detect, [nc]] # Detect(P3, P4, P5)
5.3 Training code
Create a .py file, paste the code below into it, set your own file paths, and run it.
import warnings
warnings.filterwarnings('ignore')
from ultralytics import YOLO

if __name__ == '__main__':
    model = YOLO('yolo11-MLLA.yaml')
    # To switch model scales, change the yaml name above: e.g. 'yolo11s-MLLA.yaml' uses the s scale.
    # In general, for an improvement whose yaml is named yolo11-XXX.yaml, switching to another scale means using
    # yolo11l-XXX.yaml etc. -- you change the name passed to YOLO above, not the config file itself.
    # model.load('yolo11n.pt')  # whether to load pretrained weights; for research I do not recommend it, otherwise it is hard to show a gain
    model.train(data=r"C:\Users\Administrator\PycharmProjects\yolov5-master\yolov5-master\Construction Site Safety.v30-raw-images_latestversion.yolov8\data.yaml",
                # For other tasks, find 'task' in 'ultralytics/cfg/default.yaml' and set it to detect, segment, classify or pose
                cache=False,
                imgsz=640,
                epochs=150,
                single_cls=False,  # single-class detection or not
                batch=16,
                close_mosaic=0,
                workers=0,
                device='0',
                optimizer='SGD',  # using SGD
                # resume='runs/train/exp21/weights/last.pt',  # to resume training, point this at your last.pt
                amp=True,  # turn amp off if the training loss becomes NaN
                project='runs/train',
                name='exp',
                )
5.4 Screenshot of the MLLA training run
6. Summary
That wraps up the content shared in this post. Here I would like to recommend my YOLOv11 effective-improvement column: it is newly opened with an average quality score of 98, and going forward I will reproduce papers from the latest top conferences as well as back-fill some older improvement mechanisms. If this post helped you, consider subscribing to the column and following for more updates.