
RT-DETR Improvement Strategies [RT-DETR & Mamba] | MLLA: Mamba-Like Linear Attention, an Attention Mechanism That Fuses Mamba's Design Strengths

1. Introduction

This post documents how to optimize the RT-DETR object detection network with the MLLA module. MLLA has a distinct advantage over conventional blocks: it models local features efficiently while also learning long-range interactions. Typical modules are strong at local feature processing but weak at long-range interaction, or the reverse; MLLA overcomes this trade-off. By fusing the strengths of the Mamba model and the linear attention mechanism through a distinctive block design, it models local features precisely and captures long-range interactions while keeping computation efficient. Here we use it to improve RT-DETR and to build a secondary innovation on top of it, so that the network attends more to the important feature regions of an image, suppresses interference from background and other irrelevant information, and highlights the key features of target objects.



2. Introduction to the MLLA Module

Demystify Mamba in Vision: A Linear Attention Perspective

2.1 Motivation

While exploring the relationship between Mamba, linear attention, and the Transformer, the authors found that among Mamba's special design choices, the forget gate and the block design contribute most to performance. The MLLA module fuses these two key designs into linear attention to boost its performance on vision tasks while keeping the advantages of parallel computation and fast inference.

2.2 Principle

2.2.1 Connecting the Selective State Space Model (Selective SSM) and Linear Attention

  • The selective SSM recurrence $h_{i}=\tilde{A}_{i}\odot h_{i-1}+B_{i}(\Delta_{i}\odot x_{i})$, $y_{i}=C_{i}h_{i}+D\odot x_{i}$ closely parallels single-head linear attention, $S_{i}=1\odot S_{i-1}+K_{i}^{\top}(1\odot V_{i})$, $y_{i}=Q_{i}S_{i}/Q_{i}Z_{i}+0\odot x_{i}$. The figure below visualizes this structural similarity, which is the basis for viewing the selective SSM as a special variant of linear attention.
  • Here $\Delta_{i}$ is the input gate, $\tilde{A}_{i}$ the forget gate, and $D\odot x_{i}$ a shortcut; the selective SSM uses no normalization and resembles a single-head design.

[Figure: structural comparison of the selective SSM and single-head linear attention]
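
To make the correspondence concrete, the following minimal sketch (our illustration, not code from the paper) verifies that single-head causal linear attention computed step by step, i.e. the recurrence above with the forget gate fixed to 1, matches the familiar parallel masked form:

import torch

def linear_attention_recurrent(q, k, v):
    # S_i = S_{i-1} + k_i^T v_i,  z_i = z_{i-1} + k_i,  y_i = q_i S_i / (q_i z_i)
    # -- the selective-SSM recurrence with the forget gate fixed to 1
    n, d = q.shape
    S, z, ys = torch.zeros(d, d), torch.zeros(d), []
    for i in range(n):
        S = S + torch.outer(k[i], v[i])
        z = z + k[i]
        ys.append((q[i] @ S) / (q[i] @ z))
    return torch.stack(ys)

def linear_attention_parallel(q, k, v):
    # equivalent parallel form: a causal mask on the unnormalized attention map
    attn = torch.tril(q @ k.T)
    return (attn @ v) / attn.sum(-1, keepdim=True)

q, k, v = (torch.rand(8, 4) + 0.1 for _ in range(3))  # positive features, e.g. elu(x) + 1
print(torch.allclose(linear_attention_recurrent(q, k, v),
                     linear_attention_parallel(q, k, v), atol=1e-5))  # True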

2.2.2 Properties and Role of the Forget Gate

  • The elements of the forget gate $\tilde{A}_{i}$ lie between 0 and 1; it introduces a local bias and provides positional information. Figure (b) below, which plots the average forget gate value at different layers, helps illustrate how its role changes with depth.

[Figure (b): average forget gate values across layers]

  • However, the forget gate must be computed recurrently, which lowers throughput and makes it unsuitable for non-autoregressive vision models; it can instead be replaced with a positional encoding such as APE, LePE, CPE, or RoPE. The table below compares model performance with the forget gate against different positional encodings, showing that the replacement is feasible; a small sketch of why the gated recurrence is inherently sequential follows the table.

[Table: model performance with the forget gate vs. different positional encodings]
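
To see why the forget gate hurts throughput, consider this minimal sketch (our illustration): with a per-step gate $a_i \in (0,1)$ the hidden state must be updated one step at a time, whereas with the gate removed ($a_i = 1$) the same recurrence collapses to a cumulative sum that parallel hardware can evaluate at once:

import torch

def gated_scan(a, u):
    # h_i = a_i * h_{i-1} + u_i : inherently sequential when a_i varies per step
    h, ys = torch.zeros_like(u[0]), []
    for a_i, u_i in zip(a, u):
        h = a_i * h + u_i
        ys.append(h)
    return torch.stack(ys)

n, d = 16, 8
u = torch.randn(n, d)
# with the gate fixed to 1, the scan is just a parallelizable cumulative sum
print(torch.allclose(gated_scan(torch.ones(n, d), u), torch.cumsum(u, dim=0)))  # True

This is why MLLA swaps the forget gate for positional encodings such as RoPE, which keep the positional bias without the sequential dependency.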

2.2.3 Improvements to the Block Design

  • The Mamba block design, which combines H3 and Gated Attention and integrates multiple operations, is more effective than the conventional Transformer block design.
  • The MLLA module replaces the attention sub-block of the Transformer block with Mamba's block design, substitutes linear attention for the selective SSM, and adjusts the parameters accordingly.

[Figure: Mamba block, Transformer block, and MLLA block designs]

2.3 Structure

The MLLA module's structure follows the principles above: it contains input/output projections, $Q/K$ projections, a gate projection, linear attention, a depth-wise convolution (DWConv), and a multi-layer perceptron (MLP). The architecture diagram below shows where each component sits in the block and how they connect.

Data is first projected, then aggregated by linear attention, next processed by the depth-wise convolution and gating mechanism, and finally transformed non-linearly by the MLP to produce the output; this pattern repeats block after block.

[Figure: MLLA block architecture]

2.4 Advantages

  1. Better accuracy
    • It performs strongly on image classification (ImageNet-1K), object detection (COCO), and semantic segmentation (ADE20K), surpassing a range of vision Mamba models.
  2. High computational efficiency
    • It keeps parallel computation and fast inference. Compared with Mamba-based models, MLLA's inference speed improves substantially: it runs about 4.5x faster than Mamba2D and 1.5x faster than VMamba with better accuracy, giving it an edge on tasks such as high-resolution images.

Paper: https://arxiv.org/pdf/2405.16605
Source code: https://github.com/LeapLabTHU/MLLA

3. MLLA Implementation Code

The implementation of MLLA and its improvements follows:

# This code implements concepts and methods described in the paper:
# "Demystify Mamba in Vision: A Linear Attention Perspective"
# 
# Authors: Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, Gao Huang
# 
# Original Paper: https://arxiv.org/abs/2405.16605

import torch
import torch.nn as nn
import torch.nn.functional as F

def drop_path(x, drop_prob: float = 0., training: bool = False, scale_by_keep: bool = True):
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    random_tensor = x.new_empty(shape).bernoulli_(keep_prob)
    if keep_prob > 0.0 and scale_by_keep:
        random_tensor.div_(keep_prob)
    return x * random_tensor

class DropPath(nn.Module):
    """Drop paths (Stochastic Depth) per sample  (when applied in main path of residual blocks).
    """
    def __init__(self, drop_prob: float = 0., scale_by_keep: bool = True):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob
        self.scale_by_keep = scale_by_keep
 
    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training, self.scale_by_keep)
 
    def extra_repr(self):
        return f'drop_prob={round(self.drop_prob,3):0.3f}'
 
class Mlp(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)
 
    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x

class ConvLayer(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=0, dilation=1, groups=1,
                 bias=True, dropout=0, norm=nn.BatchNorm2d, act_func=nn.ReLU):
        super(ConvLayer, self).__init__()
        self.dropout = nn.Dropout2d(dropout, inplace=False) if dropout > 0 else None
        self.conv = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size=(kernel_size, kernel_size),
            stride=(stride, stride),
            padding=(padding, padding),
            dilation=(dilation, dilation),
            groups=groups,
            bias=bias,
        )
        self.norm = norm(num_features=out_channels) if norm else None
        self.act = act_func() if act_func else None
 
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.dropout is not None:
            x = self.dropout(x)
        x = self.conv(x)
        if self.norm:
            x = self.norm(x)
        if self.act:
            x = self.act(x)
        return x

class RoPE(torch.nn.Module):
    r"""Rotary Positional Embedding.
    """
 
    def __init__(self, base=10000):
        super(RoPE, self).__init__()
        self.base = base
 
    def generate_rotations(self, x):
        # unpack the input shape into the spatial dims and the feature dim
        channel_dims, feature_dim = x.shape[1:-1], x.shape[-1]
        k_max = feature_dim // (2 * len(channel_dims))

        assert feature_dim % k_max == 0, "feature_dim must be divisible by k_max"
 
        # generate the per-position rotation angles
        theta_ks = 1 / (self.base ** (torch.arange(k_max, dtype=x.dtype, device=x.device) / k_max))
        angles = torch.cat([t.unsqueeze(-1) * theta_ks for t in
                            torch.meshgrid([torch.arange(d, dtype=x.dtype, device=x.device) for d in channel_dims],
                                           indexing='ij')], dim=-1)
 
        # real and imaginary parts of the rotations
        rotations_re = torch.cos(angles).unsqueeze(dim=-1)
        rotations_im = torch.sin(angles).unsqueeze(dim=-1)
        rotations = torch.cat([rotations_re, rotations_im], dim=-1)
 
        return rotations
 
    def forward(self, x):
        # build the rotation tensor
        rotations = self.generate_rotations(x)
 
        # view x as pairs of channels forming complex numbers
        x_complex = torch.view_as_complex(x.reshape(*x.shape[:-1], -1, 2))
 
        # apply the rotations
        pe_x = torch.view_as_complex(rotations) * x_complex
 
        # back to the real representation and flatten the last two dims
        return torch.view_as_real(pe_x).flatten(-2)

class LinearAttention(nn.Module):
    r""" Linear Attention with LePE and RoPE.
    Args:
        dim (int): Number of input channels.
        num_heads (int): Number of attention heads.
        qkv_bias (bool, optional):  If True, add a learnable bias to query, key, value. Default: True
    """
 
    def __init__(self, dim,  num_heads=4, qkv_bias=True, **kwargs):
 
        super().__init__()
        self.dim = dim
        self.num_heads = num_heads
        self.qk = nn.Linear(dim, dim * 2, bias=qkv_bias)
        self.elu = nn.ELU()
        self.lepe = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.rope = RoPE()
 
    def forward(self, x):
        """
        Args:
            x: input features with shape of (B, N, C)
        """
        b, n, c = x.shape
        h = int(n ** 0.5)  # assumes a square token grid (H == W)
        w = int(n ** 0.5)
        num_heads = self.num_heads
        head_dim = c // num_heads
 
        qk = self.qk(x).reshape(b, n, 2, c).permute(2, 0, 1, 3)
        q, k, v = qk[0], qk[1], x
        # q, k, v: b, n, c
 
        q = self.elu(q) + 1.0
        k = self.elu(k) + 1.0
        q_rope = self.rope(q.reshape(b, h, w, c)).reshape(b, n, num_heads, head_dim).permute(0, 2, 1, 3)
        k_rope = self.rope(k.reshape(b, h, w, c)).reshape(b, n, num_heads, head_dim).permute(0, 2, 1, 3)
        q = q.reshape(b, n, num_heads, head_dim).permute(0, 2, 1, 3)
        k = k.reshape(b, n, num_heads, head_dim).permute(0, 2, 1, 3)
        v = v.reshape(b, n, num_heads, head_dim).permute(0, 2, 1, 3)
 
        # z: reciprocal of q_i . mean(k), the linear-attention normalizer
        z = 1 / (q @ k.mean(dim=-2, keepdim=True).transpose(-2, -1) + 1e-6)
        # kv: the global K^T V aggregate, scaled by 1/n for numerical stability
        kv = (k_rope.transpose(-2, -1) * (n ** -0.5)) @ (v * (n ** -0.5))
        x = q_rope @ kv * z
 
        x = x.transpose(1, 2).reshape(b, n, c)
        v = v.transpose(1, 2).reshape(b, h, w, c).permute(0, 3, 1, 2)
        x = x + self.lepe(v).permute(0, 2, 3, 1).reshape(b, n, c)
 
        return x
 
    def extra_repr(self) -> str:
        return f'dim={self.dim}, num_heads={self.num_heads}'

class MLLABlock(nn.Module):
    r""" MLLA Block.
    Args:
        dim (int): Number of input channels.
        num_heads (int): Number of attention heads.
        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True
        drop (float, optional): Dropout rate. Default: 0.0
        drop_path (float, optional): Stochastic depth rate. Default: 0.0
        act_layer (nn.Module, optional): Activation layer. Default: nn.GELU
        norm_layer (nn.Module, optional): Normalization layer.  Default: nn.LayerNorm
    """
 
    def __init__(self, dim, num_heads=4, mlp_ratio=4., qkv_bias=True, drop=0., drop_path=0.,
                 act_layer=nn.GELU, norm_layer=nn.LayerNorm, **kwargs):
        super().__init__()
        self.dim = dim
        self.num_heads = num_heads
        self.mlp_ratio = mlp_ratio
 
        self.cpe1 = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.norm1 = norm_layer(dim)
        self.in_proj = nn.Linear(dim, dim)
        self.act_proj = nn.Linear(dim, dim)
        self.dwc = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.act = nn.SiLU()
        self.attn = LinearAttention(dim=dim, num_heads=num_heads, qkv_bias=qkv_bias)
        self.out_proj = nn.Linear(dim, dim)
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
 
        self.cpe2 = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.norm2 = norm_layer(dim)
        self.mlp = Mlp(in_features=dim, hidden_features=int(dim * mlp_ratio), act_layer=act_layer, drop=drop)
 
    def forward(self, x):
        # flatten the (B, C, H, W) feature map into token form (B, N, C)
        x = x.flatten(2).transpose(1, 2)
        b, n, c = x.shape
        H = int(n ** 0.5)  # assumes a square feature map
        W = int(n ** 0.5)
        B, L, C = x.shape
        assert L == H * W, "input feature has wrong size"
 
        x = x + self.cpe1(x.reshape(B, H, W, C).permute(0, 3, 1, 2)).flatten(2).permute(0, 2, 1)
        shortcut = x
 
        x = self.norm1(x)
        act_res = self.act(self.act_proj(x))
        x = self.in_proj(x).view(B, H, W, C)
        x = self.act(self.dwc(x.permute(0, 3, 1, 2))).permute(0, 2, 3, 1).view(B, L, C)
 
        # Linear Attention
        x = self.attn(x)
 
        x = self.out_proj(x * act_res)
        x = shortcut + self.drop_path(x)
        x = x + self.cpe2(x.reshape(B, H, W, C).permute(0, 3, 1, 2)).flatten(2).permute(0, 2, 1)
 
        # FFN
        x = x + self.drop_path(self.mlp(self.norm2(x)))
        x = x.transpose(2, 1).reshape((b, c, H, W))  # back to (B, C, H, W)
        return x
 
    def extra_repr(self) -> str:
        return f"dim={self.dim}, num_heads={self.num_heads}, mlp_ratio={self.mlp_ratio}"

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    """Pad to 'same' shape outputs."""
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p

class Conv(nn.Module):
    """Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)."""
    default_act = nn.SiLU()  # default activation
 
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        """Initialize Conv layer with given arguments including activation."""
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()
 
    def forward(self, x):
        """Apply convolution, batch normalization and activation to input tensor."""
        return self.act(self.bn(self.conv(x)))
 
    def forward_fuse(self, x):
        """Perform transposed convolution of 2D data."""
        return self.act(self.conv(x))

class ResNetBlock(nn.Module):
    """ResNet block with standard convolution layers."""

    def __init__(self, c1, c2, s=1, e=4):
        """Initialize convolution with given parameters."""
        super().__init__()
        c3 = e * c2
        self.cv1 = Conv(c1, c2, k=1, s=1, act=True)
        self.cv2 = Conv(c2, c2, k=3, s=s, p=1, act=True)
        self.cv3 = Conv(c2, c3, k=1, act=False)
        self.cv4 = MLLABlock(c2)
        self.shortcut = nn.Sequential(Conv(c1, c3, k=1, s=s, act=False)) if s != 1 or c1 != c3 else nn.Identity()

    def forward(self, x):
        """Forward pass through the ResNet block."""
        return F.relu(self.cv3(self.cv4(self.cv2(self.cv1(x)))) + self.shortcut(x))

class ResNetLayer_MLLA(nn.Module):
    """ResNet layer with multiple ResNet blocks."""

    def __init__(self, c1, c2, s=1, is_first=False, n=1, e=4):
        """Initializes the ResNetLayer given arguments."""
        super().__init__()
        self.is_first = is_first

        if self.is_first:
            self.layer = nn.Sequential(
                Conv(c1, c2, k=7, s=2, p=3, act=True), nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
            )
        else:
            blocks = [ResNetBlock(c1, c2, s, e=e)]
            blocks.extend([ResNetBlock(e * c2, c2, 1, e=e) for _ in range(n - 1)])
            self.layer = nn.Sequential(*blocks)

    def forward(self, x):
        """Forward pass through the ResNet layer."""
        return self.layer(x)
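
A quick sanity check (our example, not part of the original code): MLLABlock consumes and returns a (B, C, H, W) feature map, so it can be dropped into a convolutional backbone as long as the feature map is square:

if __name__ == "__main__":
    x = torch.randn(2, 64, 20, 20)      # B, C, H, W with H == W
    block = MLLABlock(dim=64, num_heads=4)
    print(block(x).shape)               # torch.Size([2, 64, 20, 20])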


4. Module Innovations

4.1 Improvement 1 ⭐

Module improvement method: add the MLLABlock module directly (the integration steps are explained in Section 5).

After MLLABlock is added, the structure looks like this:

[Figure: model structure after adding MLLABlock]

4.2 Improvement 2 ⭐

Module improvement method: improve ResNetLayer based on the MLLABlock module (the integration steps are explained in Section 5).

The second improvement modifies the ResNetLayer module of RT-DETR, inserting MLLABlock into the ResNetLayer's blocks.

The modified code is as follows:

ResNetLayer 中加入 MLLABlock模块 ,并重命名为 ResNetLayer_MLLA

class ResNetBlock(nn.Module):
    """ResNet block with standard convolution layers."""

    def __init__(self, c1, c2, s=1, e=4):
        """Initialize convolution with given parameters."""
        super().__init__()
        c3 = e * c2
        self.cv1 = Conv(c1, c2, k=1, s=1, act=True)
        self.cv2 = Conv(c2, c2, k=3, s=s, p=1, act=True)
        self.cv3 = Conv(c2, c3, k=1, act=False)
        self.cv4 = MLLABlock(c2)
        self.shortcut = nn.Sequential(Conv(c1, c3, k=1, s=s, act=False)) if s != 1 or c1 != c3 else nn.Identity()

    def forward(self, x):
        """Forward pass through the ResNet block."""
        return F.relu(self.cv3(self.cv4(self.cv2(self.cv1(x)))) + self.shortcut(x))

class ResNetLayer_MLLA(nn.Module):
    """ResNet layer with multiple ResNet blocks."""

    def __init__(self, c1, c2, s=1, is_first=False, n=1, e=4):
        """Initializes the ResNetLayer given arguments."""
        super().__init__()
        self.is_first = is_first

        if self.is_first:
            self.layer = nn.Sequential(
                Conv(c1, c2, k=7, s=2, p=3, act=True), nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
            )
        else:
            blocks = [ResNetBlock(c1, c2, s, e=e)]
            blocks.extend([ResNetBlock(e * c2, c2, 1, e=e) for _ in range(n - 1)])
            self.layer = nn.Sequential(*blocks)

    def forward(self, x):
        """Forward pass through the ResNet layer."""
        return self.layer(x)


Note ❗: the module names that need to be declared in Section 5 are MLLABlock and ResNetLayer_MLLA.


5. Integration Steps

5.1 Step 1

① Under the ultralytics/nn/ directory, create an AddModules folder to hold the module code.

② In the AddModules folder, create MLLA.py and paste the code from Section 3 into it.


5.2 Step 2

AddModules 文件夹下新建 __init__.py (已有则不用新建),在文件内导入模块: from .MLLA import *


5.3 Step 3

ultralytics/nn/modules/tasks.py 文件中,需要在两处位置添加各模块类名称。

First, import the modules:

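The original shows this step as a screenshot; a minimal sketch of the import, assuming the AddModules package created in Sections 5.1 and 5.2:

# in tasks.py (sketch): import the custom modules added above
from ultralytics.nn.AddModules import MLLABlock, ResNetLayer_MLLA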

Next, register the ResNetLayer_MLLA module in the parse_model function:

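The registration itself is shown only as screenshots in the original; the following is a hedged sketch of what it might look like, modeled on how ultralytics' parse_model handles the stock ResNetLayer (the YAML args in Section 6.2 are [c1, c2, s, is_first, n]):

# inside parse_model (sketch): derive the output channels of ResNetLayer_MLLA
elif m is ResNetLayer_MLLA:
    c2 = args[1] if args[3] else args[1] * 4  # first stage outputs c2; later stages output e * c2 with e=4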

Finally, add the following code in the parse_model function:

elif m in {MLLABlock}:
    c2 = ch[f]
    args = [c2, *args]



6. YAML Model Files

6.1 Improved Model, Version 1

Taking ultralytics/cfg/models/rt-detr/rtdetr-l.yaml as the example, create a model file for training on your own dataset in the same directory, named rtdetr-l-MLLA.yaml.

rtdetr-l.yaml 中的内容复制到 rtdetr-l-MLLA.yaml 文件下,修改 nc 数量等于自己数据中目标的数量。

📌 The modification adds the MLLABlock module to the backbone network.

# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-l object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/models/rtdetr

# Parameters
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
  # [depth, width, max_channels]
  l: [1.00, 1.00, 1024]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, HGStem, [32, 48]] # 0-P2/4
  - [-1, 6, HGBlock, [48, 128, 3]] # stage 1

  - [-1, 1, DWConv, [128, 3, 2, 1, False]] # 2-P3/8
  - [-1, 6, HGBlock, [96, 512, 3]] # stage 2

  - [-1, 1, DWConv, [512, 3, 2, 1, False]] # 4-P4/16
  - [-1, 6, HGBlock, [192, 1024, 5, True, False]] # cm, c2, k, light, shortcut
  - [-1, 6, HGBlock, [192, 1024, 5, True, True]]
  - [-1, 6, HGBlock, [192, 1024, 5, True, True]] # stage 3

  - [-1, 1, DWConv, [1024, 3, 2, 1, False]] # 8-P5/32
  - [-1, 1, MLLABlock, []] # stage 4
  - [-1, 6, HGBlock, [384, 2048, 5, True, False]] # stage 4

head:
  - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 11 input_proj.2
  - [-1, 1, AIFI, [1024, 8]]
  - [-1, 1, Conv, [256, 1, 1]] # 13, Y5, lateral_convs.0

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [7, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 15 input_proj.1
  - [[-2, -1], 1, Concat, [1]]
  - [-1, 3, RepC3, [256]] # 17, fpn_blocks.0
  - [-1, 1, Conv, [256, 1, 1]] # 18, Y4, lateral_convs.1

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [3, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 20 input_proj.0
  - [[-2, -1], 1, Concat, [1]] # cat backbone P4
  - [-1, 3, RepC3, [256]] # X3 (22), fpn_blocks.1

  - [-1, 1, Conv, [256, 3, 2]] # 23, downsample_convs.0
  - [[-1, 18], 1, Concat, [1]] # cat Y4
  - [-1, 3, RepC3, [256]] # F4 (25), pan_blocks.0

  - [-1, 1, Conv, [256, 3, 2]] # 26, downsample_convs.1
  - [[-1, 13], 1, Concat, [1]] # cat Y5
  - [-1, 3, RepC3, [256]] # F5 (28), pan_blocks.1

  - [[22, 25, 28], 1, RTDETRDecoder, [nc]] # Detect(P3, P4, P5)

6.2 Improved Model, Version 2 ⭐

Taking ultralytics/cfg/models/rt-detr/rtdetr-resnet50.yaml as the example, create a model file for training on your own dataset in the same directory, named rtdetr-ResNetLayer_MLLA.yaml.

rtdetr-resnet50.yaml 中的内容复制到 rtdetr-ResNetLayer_MLLA .yaml 文件下,修改 nc 数量等于自己数据中目标的数量。

📌 The modification replaces the ResNetLayer modules in the backbone with ResNetLayer_MLLA modules.

# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-ResNet50 object detection model with P3-P5 outputs.

# Parameters
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
  # [depth, width, max_channels]
  l: [1.00, 1.00, 1024]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, ResNetLayer_MLLA, [3, 64, 1, True, 1]] # 0
  - [-1, 1, ResNetLayer_MLLA, [64, 64, 1, False, 3]] # 1
  - [-1, 1, ResNetLayer_MLLA, [256, 128, 2, False, 4]] # 2
  - [-1, 1, ResNetLayer_MLLA, [512, 256, 2, False, 6]] # 3
  - [-1, 1, ResNetLayer_MLLA, [1024, 512, 2, False, 3]] # 4

head:
  - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 5
  - [-1, 1, AIFI, [1024, 8]]
  - [-1, 1, Conv, [256, 1, 1]] # 7

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [3, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 9
  - [[-2, -1], 1, Concat, [1]]
  - [-1, 3, RepC3, [256]] # 11
  - [-1, 1, Conv, [256, 1, 1]] # 12

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [2, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 14
  - [[-2, -1], 1, Concat, [1]] # cat backbone P4
  - [-1, 3, RepC3, [256]] # X3 (16), fpn_blocks.1

  - [-1, 1, Conv, [256, 3, 2]] # 17, downsample_convs.0
  - [[-1, 12], 1, Concat, [1]] # cat Y4
  - [-1, 3, RepC3, [256]] # F4 (19), pan_blocks.0

  - [-1, 1, Conv, [256, 3, 2]] # 20, downsample_convs.1
  - [[-1, 7], 1, Concat, [1]] # cat Y5
  - [-1, 3, RepC3, [256]] # F5 (22), pan_blocks.1

  - [[16, 19, 22], 1, RTDETRDecoder, [nc]] # Detect(P3, P4, P5)
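
With the YAML files in place, the improved models can be loaded and trained through the usual ultralytics API; a minimal sketch (the paths assume the files created above, and coco8.yaml is just a placeholder dataset):

from ultralytics import RTDETR

model = RTDETR("ultralytics/cfg/models/rt-detr/rtdetr-l-MLLA.yaml")  # or rtdetr-ResNetLayer_MLLA.yaml
model.info()  # prints the layer table shown in Section 7
# model.train(data="coco8.yaml", epochs=100, imgsz=640)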


7. Successful Run Results

Printing the network model shows that MLLABlock and ResNetLayer_MLLA have been added, and training can now proceed.

rtdetr-l-MLLA

rtdetr-l-MLLA summary: 702 layers, 46,494,915 parameters, 46,494,915 gradients, 118.9 GFLOPs

                   from  n    params  module                                       arguments                     
  0                  -1  1     25248  ultralytics.nn.modules.block.HGStem          [3, 32, 48]                   
  1                  -1  6    155072  ultralytics.nn.modules.block.HGBlock         [48, 48, 128, 3, 6]           
  2                  -1  1      1408  ultralytics.nn.modules.conv.DWConv           [128, 128, 3, 2, 1, False]    
  3                  -1  6    839296  ultralytics.nn.modules.block.HGBlock         [128, 96, 512, 3, 6]          
  4                  -1  1      5632  ultralytics.nn.modules.conv.DWConv           [512, 512, 3, 2, 1, False]    
  5                  -1  6   1695360  ultralytics.nn.modules.block.HGBlock         [512, 192, 1024, 5, 6, True, False]
  6                  -1  6   2055808  ultralytics.nn.modules.block.HGBlock         [1024, 192, 1024, 5, 6, True, True]
  7                  -1  6   2055808  ultralytics.nn.modules.block.HGBlock         [1024, 192, 1024, 5, 6, True, True]
  8                  -1  1     11264  ultralytics.nn.modules.conv.DWConv           [1024, 1024, 3, 2, 1, False]  
  9                  -1  1  13686784  ultralytics.nn.AddModules.MLLA.MLLABlock     [1024]                        
 10                  -1  6   6708480  ultralytics.nn.modules.block.HGBlock         [1024, 384, 2048, 5, 6, True, False]
 11                  -1  1    524800  ultralytics.nn.modules.conv.Conv             [2048, 256, 1, 1, None, 1, 1, False]
 12                  -1  1    789760  ultralytics.nn.modules.transformer.AIFI      [256, 1024, 8]                
 13                  -1  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]              
 14                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 15                   7  1    262656  ultralytics.nn.modules.conv.Conv             [1024, 256, 1, 1, None, 1, 1, False]
 16            [-2, -1]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 17                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]                 
 18                  -1  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]              
 19                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 20                   3  1    131584  ultralytics.nn.modules.conv.Conv             [512, 256, 1, 1, None, 1, 1, False]
 21            [-2, -1]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 22                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]                 
 23                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 24            [-1, 18]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 25                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]                 
 26                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 27            [-1, 13]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 28                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]                 
 29        [22, 25, 28]  1   7303907  ultralytics.nn.modules.head.RTDETRDecoder    [1, [256, 256, 256]]          
rtdetr-l-MLLA summary: 702 layers, 46,494,915 parameters, 46,494,915 gradients, 118.9 GFLOPs

rtdetr-ResNetLayer_MLLA

rtdetr-ResNetLayer_MLLA summary: 929 layers, 59,313,827 parameters, 59,313,827 gradients, 175.1 GFLOPs

                   from  n    params  module                                       arguments                     
  0                  -1  1      9536  ultralytics.nn.AddModules.MLLA.ResNetLayer_MLLA[3, 64, 1, True, 1]           
  1                  -1  1    385920  ultralytics.nn.AddModules.MLLA.ResNetLayer_MLLA[64, 64, 1, False, 3]         
  2                  -1  1   2099200  ultralytics.nn.AddModules.MLLA.ResNetLayer_MLLA[256, 128, 2, False, 4]       
  3                  -1  1  12293120  ultralytics.nn.AddModules.MLLA.ResNetLayer_MLLA[512, 256, 2, False, 6]       
  4                  -1  1  25271296  ultralytics.nn.AddModules.MLLA.ResNetLayer_MLLA[1024, 512, 2, False, 3]      
  5                  -1  1    524800  ultralytics.nn.modules.conv.Conv             [2048, 256, 1, 1, None, 1, 1, False]
  6                  -1  1    789760  ultralytics.nn.modules.transformer.AIFI      [256, 1024, 8]                
  7                  -1  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]              
  8                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
  9                   3  1    262656  ultralytics.nn.modules.conv.Conv             [1024, 256, 1, 1, None, 1, 1, False]
 10            [-2, -1]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 11                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]                 
 12                  -1  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]              
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14                   2  1    131584  ultralytics.nn.modules.conv.Conv             [512, 256, 1, 1, None, 1, 1, False]
 15            [-2, -1]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 16                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]                 
 17                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 18            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 19                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]                 
 20                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 21             [-1, 7]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 22                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]                 
 23        [16, 19, 22]  1   7303907  ultralytics.nn.modules.head.RTDETRDecoder    [1, [256, 256, 256]]          
rtdetr-ResNetLayer_MLLA summary: 929 layers, 59,313,827 parameters, 59,313,827 gradients, 175.1 GFLOPs