RT-DETR Improvement Strategy [Exclusive Fusion Improvement] | AssemFormer + HS-FPN: Reduce the Impact of Object Scale Variation and Strengthen Multi-Scale Learning

1. Introduction

This article documents how AssemFormer is used to optimize the RT-DETR object detection network. In this fusion improvement, AssemFormer strengthens multi-scale learning during feature propagation and fusion, while HS-FPN reduces detection errors caused by scale variation and insufficient features, noticeably improving RT-DETR's accuracy and stability across detection tasks.



2. AssemFormer

Exploiting Scale-Variant Attention for Segmenting Small Medical Objects

2.1 Design Motivation

  • Compensating for the shortcomings of traditional methods: conventional deep learning algorithms face many challenges when handling small objects in medical images. For example, convolution and pooling cause information loss and compression artifacts, and these problems become more pronounced for small medical objects as the network deepens. Traditional attention mechanisms usually produce attention maps of fixed dimensions, which are often insufficient for analyzing medical images because they focus mainly on central features and ignore the rich contextual information in the background, which is crucial for clinical interpretation.
  • Combining the strengths of convolution and Transformers: convolution focuses on learning local and generic features of medical objects, such as corners, edges, angles and color, while the Transformer block uses multi-head self-attention to extract global information about medical objects, including morphology, depth and color distribution, and also learns their positional relationships. The AssemFormer module was designed to combine the advantages of both.

2.2 Principle

2.2.1 Structure

The AssemFormer module contains a 3×3 convolution and a 1×1 convolution, followed by two Transformer blocks and two further convolution operations. It connects the convolution and Transformer paths by stacking and splitting feature maps.
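To make the "stacking and splitting" of feature maps concrete, here is a minimal shape-level sketch (not the paper's code) of how a feature map is unfolded into patches for the Transformer blocks and folded back afterwards, assuming a 2×2 patch size and a 64-channel input:

import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)                    # [B, C, H, W] feature map from the convolutions
patches = F.unfold(x, kernel_size=2, stride=2)    # [B, C*4, 256]: 2x2 pixels per patch, 256 patches
patches = patches.reshape(1, 64, 4, -1)           # [B, C, P, N], the layout the Transformer blocks expect
# ... the two Transformer blocks would operate on `patches` here ...
x_back = F.fold(patches.reshape(1, 64 * 4, -1),   # fold the patches back into a feature map
                output_size=(32, 32), kernel_size=2, stride=2)
print(x_back.shape)                               # torch.Size([1, 64, 32, 32])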

2.2.2 Attention Mechanism

AssemFormer uses multi-head self-attention (MHSA), defined as $\mathcal{A}_{ViT}(q,k,v)=\mathrm{softmax}\left(\frac{qk^{T}}{\sqrt{D_{h}}}\right)v$, where $q$, $k$ and $v$ are the query, key and value vectors of the input sequence $z\in\mathbb{R}^{N\times D}$, $N$ is the number of patches, $D$ is the patch size, and $D_{h}=D/m$ when $m$ self-attention heads are used. This mechanism facilitates interaction between patches and enriches contextual information.
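As a worked example of this formula, the following sketch computes multi-head self-attention on toy dimensions (all names and sizes are illustrative rather than taken from the paper's code):

import torch
import torch.nn.functional as F

N, D, m = 16, 64, 4                    # number of patches, patch size, attention heads
D_h = D // m                           # per-head dimension

q = torch.randn(m, N, D_h)             # query
k = torch.randn(m, N, D_h)             # key
v = torch.randn(m, N, D_h)             # value

# A_ViT(q, k, v) = softmax(q k^T / sqrt(D_h)) v
attn = F.softmax(q @ k.transpose(-2, -1) / D_h ** 0.5, dim=-1)   # [m, N, N]
out = attn @ v                                                   # [m, N, D_h]
print(out.shape)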


2.3 Characteristics

  • Fusing multi-scale features: by stacking and splitting feature maps, the module learns local and global representations of the input medical image simultaneously, capturing features at different scales and providing more comprehensive information for accurately segmenting small medical objects.
  • Improving segmentation performance: the ablation studies show that AssemFormer significantly improves SvANet's segmentation performance; across different datasets, SvANet with AssemFormer achieves strong results on all evaluation metrics, demonstrating its effectiveness for segmenting small medical objects.
  • Strengthening feature representations: the evolution of the feature maps shows that AssemFormer progressively highlights ever-smaller regions that align more accurately with the ground truth. Across different medical object segmentation scenarios it focuses better on the target regions, improving the visibility and precise localization of small medical objects and accentuating their morphological details and exact positions.

Paper: https://arxiv.org/abs/2407.07720
Code: https://github.com/anthonyweidai/SvANet

3. HS-FPN

Accurate Leukocyte Detection Based on Deformable-DETR and Multi-Level Feature Fusion for Aiding Diagnosis of Blood Diseases

The HS-FPN structure consists of a feature selection module and a feature fusion module:

  • In the feature selection module, the CA module first processes the input feature maps, determining per-channel weights through pooling and an activation function to filter the feature maps, after which the DM module reduces the dimensionality of the feature maps at the different scales;
  • In the feature fusion module, the SFF mechanism uses the high-level features as weights to select the semantic information of the low-level features before fusing them, improving the model's detection capability.

3.1 Motivation

In leukocyte datasets, the recognition task faces a multi-scale problem: different types of leukocytes usually differ in diameter, and the same leukocyte can appear at different sizes under different microscopes, which makes accurate recognition difficult. HS-FPN was therefore designed to perform multi-scale feature fusion and help the model capture more comprehensive leukocyte feature information.

3.2 Structure and Principle

  • Feature selection module: composed of a CA module and a DM module. For an input feature map $f_{in}\in\mathbb{R}^{C\times H\times W}$, the CA module first applies global average pooling and global max pooling, combines the two results, and passes them through a Sigmoid activation to obtain per-channel weights $f_{CA}\in\mathbb{R}^{C\times 1\times 1}$; multiplying these weights with the feature map of the corresponding scale yields the filtered feature map. Because feature maps at different scales have different channel counts, the DM module uses a 1×1 convolution to reduce every scale to 256 channels. (A code sketch of both modules follows this list.)

  • Feature fusion module: among the multi-scale feature maps produced by the backbone, high-level features carry rich semantic information but localize objects coarsely, while low-level features localize precisely but carry limited semantics. Direct pixel-wise summation is a flawed fusion strategy, so the SFF module uses the high-level features as weights to select the key semantic information in the low-level features. For an input high-level feature $f_{high}\in\mathbb{R}^{C\times H\times W}$ and low-level feature $f_{low}\in\mathbb{R}^{C\times H_{1}\times W_{1}}$, the high-level feature is first expanded with a transposed convolution (stride 2, kernel 3×3) and then resized by bilinear interpolation to obtain $f_{att}\in\mathbb{R}^{C\times H_{1}\times W_{1}}$; the CA module converts the high-level feature into attention weights that filter the low-level feature, and the two are finally fused into $f_{out}\in\mathbb{R}^{C\times H_{1}\times W_{1}}$. The fusion process is $f_{att}=BL\left(T\text{-}Conv\left(f_{high}\right)\right)$ and $f_{out}=f_{low}*CA\left(f_{att}\right)+f_{att}$.
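The sketch below walks through the feature-selection and SFF steps just described with toy shapes; the channel_attention helper is a simplified stand-in for the CA module and is purely illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

C, H, W = 256, 20, 20
f_high = torch.randn(1, C, H, W)            # high-level feature (rich semantics, coarse localization)
f_low = torch.randn(1, C, 2 * H, 2 * W)     # low-level feature, already reduced to 256 channels by the DM 1x1 conv

# f_att = BL(T-Conv(f_high)): stride-2, kernel-3 transposed conv, then bilinear resize
t_conv = nn.ConvTranspose2d(C, C, kernel_size=3, stride=2, padding=1, output_padding=1)
f_att = F.interpolate(t_conv(f_high), size=f_low.shape[2:], mode="bilinear")

def channel_attention(x):
    """Toy CA module: avg- and max-pooled channel statistics squashed by a sigmoid."""
    return torch.sigmoid(F.adaptive_avg_pool2d(x, 1) + F.adaptive_max_pool2d(x, 1))

# f_out = f_low * CA(f_att) + f_att
f_out = f_low * channel_attention(f_att) + f_att
print(f_out.shape)                          # torch.Size([1, 256, 40, 40])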


3.3 Effect

HS-FPN uses the channel attention module to filter low-level features with high-level semantic features as weights, then adds the filtered features element-wise to the high-level features, achieving multi-scale feature fusion. This improves the model's feature representation, helps it detect subtle features, and enhances its detection capability.

Paper: https://arxiv.org/pdf/2212.11677
Code: https://github.com/Barrett-python/DuAT

4. Implementation of AssemFormer and HS-FPN

The implementation of the AssemFormer module is as follows:

import numpy as np
from typing import Union, Sequence, Tuple, Optional
import torch
from torch import nn, Tensor
import torch.nn.functional as F
from typing import Any, Callable
from torchvision.ops import StochasticDepth as StochasticDepthTorch

from ultralytics.nn.modules.conv import LightConv
from ultralytics.utils.torch_utils import fuse_conv_and_bn

class Dropout(nn.Dropout):
    def __init__(self, p: float=0.5, inplace: bool=False):
        super(Dropout, self).__init__(p=p, inplace=inplace)

class StochasticDepth(StochasticDepthTorch):
    def __init__(self, p: float, Mode: str="row") -> None:
        super().__init__(p, Mode)

def pair(Val):
    return Val if isinstance(Val, (tuple, list)) else (Val, Val)

def makeDivisible(v: float, divisor: int, min_value: Optional[int] = None) -> int:
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v

class LinearSelfAttention(nn.Module):
    """
    This layer applies a self-attention with linear complexity, as described in `MobileViTv2 <https://arxiv.org/abs/2206.02680>`_ paper.
    This layer can be used for self- as well as cross-attention.

    Args:
        DimEmbed (int): :math:`C` from an expected input of size :math:`(N, C, H, W)`
        AttnDropRate (Optional[float]): Dropout value for context scores. Default: 0.0
        bias (Optional[bool]): Use bias in learnable layers. Default: True

    Shape:
        - Input: :math:`(N, C, P, N)` where :math:`N` is the batch size, :math:`C` is the input channels,
        :math:`P` is the number of pixels in the patch, and :math:`N` is the number of patches
        - Output: same as the input

    .. note::
        For MobileViTv2, we unfold the feature map [B, C, H, W] into [B, C, P, N] where P is the number of pixels
        in a patch and N is the number of patches. Because channel is the first dimension in this unfolded tensor,
        we use point-wise convolution (instead of a linear layer). This avoids a transpose operation (which may be
        expensive on resource-constrained devices) that may be required to convert the unfolded tensor from
        channel-first to channel-last format in case of a linear layer.
    """

    def __init__(
        self,
        DimEmbed: int,
        AttnDropRate: Optional[float]=0.0,
        Bias: Optional[bool]=True,
    ) -> None:
        super().__init__()

        self.qkv_proj = BaseConv2d(DimEmbed, 1 + (2 * DimEmbed), 1, bias=Bias)

        self.AttnDropRate = Dropout(p=AttnDropRate)
        self.out_proj = BaseConv2d(DimEmbed, DimEmbed, 1, bias=Bias)
        self.DimEmbed = DimEmbed

    def forward(self, x: Tensor) -> Tensor:
        # [B, C, P, N] --> [B, h + 2d, P, N]
        qkv = self.qkv_proj(x)

        # Project x into query, key and value
        # Query --> [B, 1, P, N]
        # value, key --> [B, d, P, N]
        query, key, value = torch.split(
            qkv, split_size_or_sections=[1, self.DimEmbed, self.DimEmbed], dim=1
        )

        # apply softmax along N dimension
        context_scores = F.softmax(query, dim=-1)
        # Uncomment below line to visualize context scores
        # self.visualize_context_scores(context_scores=context_scores)
        context_scores = self.AttnDropRate(context_scores)

        # Compute context vector
        # [B, d, P, N] x [B, 1, P, N] -> [B, d, P, N]
        context_vector = key * context_scores
        # [B, d, P, N] --> [B, d, P, 1]
        context_vector = torch.sum(context_vector, dim=-1, keepdim=True)

        # combine context vector with values
        # [B, d, P, N] * [B, d, P, 1] --> [B, d, P, N]
        out = F.relu(value) * context_vector.expand_as(value)
        out = self.out_proj(out)
        return out

class LinearAttnFFN(nn.Module):
    def __init__(
            self,
            DimEmbed: int,
            DimFfnLatent: int,
            AttnDropRate: Optional[float] = 0.0,
            DropRate: Optional[float] = 0.1,
            FfnDropRate: Optional[float] = 0.0,
    ) -> None:
        super().__init__()
        AttnUnit = LinearSelfAttention(DimEmbed, AttnDropRate, Bias=True)

        self.PreNormAttn = nn.Sequential(
            nn.BatchNorm2d(DimEmbed),
            AttnUnit,
            Dropout(DropRate),
        )

        self.PreNormFfn = nn.Sequential(
            nn.BatchNorm2d(DimEmbed),
            BaseConv2d(DimEmbed, DimFfnLatent, 1, 1, ActLayer=nn.SiLU),
            Dropout(FfnDropRate),
            BaseConv2d(DimFfnLatent, DimEmbed, 1, 1),
            Dropout(DropRate),
        )

        self.DimEmbed = DimEmbed

    def forward(self, x: Tensor) -> Tensor:
        # self-attention
        x = x + self.PreNormAttn(x)

        # Feed forward network
        x = x + self.PreNormFfn(x)
        return x

class BaseConv2d(nn.Module):
    def __init__(
            self,
            in_channels: int,
            out_channels: int,
            kernel_size: int,
            stride: Optional[int] = 1,
            padding: Optional[int] = None,
            groups: Optional[int] = 1,
            bias: Optional[bool] = None,
            BNorm: bool = False,
            # norm_layer: Optional[Callable[..., nn.Module]]=nn.BatchNorm2d,
            ActLayer: Optional[Callable[..., nn.Module]] = None,
            dilation: int = 1,
            Momentum: Optional[float] = 0.1,
            **kwargs: Any
    ) -> None:
        super(BaseConv2d, self).__init__()
        if padding is None:
            padding = int((kernel_size - 1) // 2 * dilation)

        if bias is None:
            bias = not BNorm

        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        self.groups = groups
        self.bias = bias

        self.Conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size, stride, padding, dilation, groups, bias, **kwargs)

        self.Bn = nn.BatchNorm2d(out_channels, eps=0.001, momentum=Momentum) if BNorm else nn.Identity()

        if ActLayer is not None:
            if isinstance(list(ActLayer().named_modules())[0][1], nn.Sigmoid):
                self.Act = ActLayer()
            else:
                self.Act = ActLayer(inplace=True)
        else:
            self.Act = ActLayer

    def forward(self, x: Tensor) -> Tensor:
        x = self.Conv(x)
        x = self.Bn(x)
        if self.Act is not None:
            x = self.Act(x)
        return x

class BaseFormer(nn.Module):
    def __init__(
            self,
            InChannels: int,
            FfnMultiplier: Optional[Union[Sequence[Union[int, float]], int, float]] = 2.0,
            NumAttnBlocks: Optional[int] = 2,
            AttnDropRate: Optional[float] = 0.0,
            DropRate: Optional[float] = 0.0,
            FfnDropRate: Optional[float] = 0.0,
            PatchRes: Optional[int] = 2,
            Dilation: Optional[int] = 1,
            ViTSELayer: Optional[nn.Module] = None,
            **kwargs: Any,
    ) -> None:
        DimAttnUnit = InChannels // 2
        DimCNNOut = DimAttnUnit

        Conv3x3In = BaseConv2d(
            InChannels, InChannels, 3, 1, dilation=Dilation,
            BNorm=True, ActLayer=nn.SiLU,
        )  # depth-wise separable convolution
        ViTSELayer = ViTSELayer(InChannels, **kwargs) if ViTSELayer is not None else nn.Identity()
        Conv1x1In = BaseConv2d(InChannels, DimCNNOut, 1, 1, bias=False)

        super(BaseFormer, self).__init__()
        self.LocalRep = nn.Sequential(Conv3x3In, ViTSELayer, Conv1x1In)

        self.GlobalRep, DimAttnUnit = self.buildAttnLayer(
            DimAttnUnit, FfnMultiplier, NumAttnBlocks, AttnDropRate, DropRate, FfnDropRate,
        )
        self.ConvProj = BaseConv2d(DimCNNOut, InChannels, 1, 1, BNorm=True)

        self.DimCNNOut = DimCNNOut

        self.HPatch, self.WPatch = pair(PatchRes)
        self.PatchArea = self.WPatch * self.HPatch

    def buildAttnLayer(
            self,
            DimModel: int,
            FfnMult: Union[Sequence, int, float],
            NumAttnBlocks: int,
            AttnDropRate: float,
            DropRate: float,
            FfnDropRate: float,
    ) -> Tuple[nn.Module, int]:

        if isinstance(FfnMult, Sequence) and len(FfnMult) == 2:
            DimFfn = (
                    np.linspace(FfnMult[0], FfnMult[1], NumAttnBlocks, dtype=float) * DimModel
            )
        elif isinstance(FfnMult, Sequence) and len(FfnMult) == 1:
            DimFfn = [FfnMult[0] * DimModel] * NumAttnBlocks
        elif isinstance(FfnMult, (int, float)):
            DimFfn = [FfnMult * DimModel] * NumAttnBlocks
        else:
            raise NotImplementedError

        # ensure that dims are multiple of 16
        DimFfn = [makeDivisible(d, 16) for d in DimFfn]

        GlobalRep = [
            LinearAttnFFN(DimModel, DimFfn[block_idx], AttnDropRate, DropRate, FfnDropRate)
            for block_idx in range(NumAttnBlocks)
        ]
        GlobalRep.append(nn.BatchNorm2d(DimModel))
        return nn.Sequential(*GlobalRep), DimModel

    def unfolding(self, FeatureMap: Tensor) -> Tuple[Tensor, Tuple[int, int]]:
        B, C, H, W = FeatureMap.shape

        # [B, C, H, W] --> [B, C, P, N]
        Patches = F.unfold(
            FeatureMap,
            kernel_size=(self.HPatch, self.WPatch),
            stride=(self.HPatch, self.WPatch),
        )
        Patches = Patches.reshape(
            B, C, self.HPatch * self.WPatch, -1
        )

        return Patches, (H, W)

    def folding(self, Patches: Tensor, OutputSize: Tuple[int, int]) -> Tensor:
        B, C, P, N = Patches.shape  # BatchSize, DimIn, PatchSize, NumPatches

        # [B, C, P, N]
        Patches = Patches.reshape(B, C * P, N)

        FeatureMap = F.fold(
            Patches,
            output_size=OutputSize,
            kernel_size=(self.HPatch, self.WPatch),
            stride=(self.HPatch, self.WPatch),
        )

        return FeatureMap

    def forward(self, x: Tensor, *args, **kwargs) -> Tensor:
        Fm = self.LocalRep(x)

        # convert feature map to patches
        Patches, OutputSize = self.unfolding(Fm)

        # learn global representations on all patches
        Patches = self.GlobalRep(Patches)

        # [B x Patch x Patches x C] --> [B x C x Patches x Patch]
        Fm = self.folding(Patches, OutputSize)
        Fm = self.ConvProj(Fm)

        return Fm

#AssemFormer, a method that combines convolution with a vision transformer by assembling tensors.
class AssemFormer(BaseFormer):
    """
    Inspired by MobileViTv3.
    Adapted from https://github.com/micronDLA/MobileViTv3/blob/main/MobileViTv3-v2/cvnets/modules/mobilevit_block.py
    """

    def __init__(
            self,
            InChannels: int,
            FfnMultiplier: Optional[Union[Sequence[Union[int, float]], int, float]] = 2.0,
            NumAttnBlocks: Optional[int] = 2,
            AttnDropRate: Optional[float] = 0.0,
            DropRate: Optional[float] = 0.0,
            FfnDropRate: Optional[float] = 0.0,
            PatchRes: Optional[int] = 2,
            Dilation: Optional[int] = 1,
            SDProb: Optional[float] = 0.0,
            ViTSELayer: Optional[nn.Module] = None,
            **kwargs: Any,
    ) -> None:
        super().__init__(InChannels, FfnMultiplier, NumAttnBlocks, AttnDropRate,
                         DropRate, FfnDropRate, PatchRes, Dilation, ViTSELayer, **kwargs)
        # AssembleFormer: input changed from just global to local + global
        self.ConvProj = BaseConv2d(2 * self.DimCNNOut, InChannels, 1, 1, BNorm=True)

        self.Dropout = StochasticDepth(SDProb)

    def forward(self, x: Tensor) -> Tensor:
        FmConv = self.LocalRep(x)

        # convert feature map to patches
        Patches, OutputSize = self.unfolding(FmConv)

        # learn global representations on all patches
        Patches = self.GlobalRep(Patches)

        # [B x Patch x Patches x C] --> [B x C x Patches x Patch]
        Fm = self.folding(Patches, OutputSize)

        # AssembleFormer: local + global instead of only global
        Fm = self.ConvProj(torch.cat((Fm, FmConv), dim=1))

        # AssembleFormer: skip connection
        return x + self.Dropout(Fm)

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    """Pad to 'same' shape outputs."""
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p

class Conv(nn.Module):
    """Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)."""
 
    default_act = nn.SiLU()  # default activation
 
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        """Initialize Conv layer with given arguments including activation."""
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()
 
    def forward(self, x):
        """Apply convolution, batch normalization and activation to input tensor."""
        return self.act(self.bn(self.conv(x)))
 
    def forward_fuse(self, x):
        """Perform transposed convolution of 2D data."""
        return self.act(self.conv(x))

class HGBlock_AssemFormer(nn.Module):
    """
    HG_Block of PPHGNetV2 with 2 convolutions and LightConv.

    https://github.com/PaddlePaddle/PaddleDetection/blob/develop/ppdet/modeling/backbones/hgnet_v2.py
    """

    def __init__(self, c1, cm, c2, k=3, n=6, lightconv=False, shortcut=False, act=nn.ReLU()):
        """Initializes a CSP Bottleneck with 1 convolution using specified input and output channels."""
        super().__init__()
        block = LightConv if lightconv else Conv
        self.m = nn.ModuleList(block(c1 if i == 0 else cm, cm, k=k, act=act) for i in range(n))
        self.sc = Conv(c1 + n * cm, c2 // 2, 1, 1, act=act)  # squeeze conv
        self.ec = Conv(c2 // 2, c2, 1, 1, act=act)  # excitation conv
        self.add = shortcut and c1 == c2
        self.cv = AssemFormer(c2)
        
    def forward(self, x):
        """Forward pass of a PPHGNetV2 backbone layer."""
        y = [x]
        y.extend(m(y[-1]) for m in self.m)
        y = self.cv(self.ec(self.sc(torch.cat(y, 1))))
        return y + x if self.add else y
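
# A quick sanity check of the two blocks above (illustrative; it assumes the classes are
# defined in this file and that `ultralytics` is installed so LightConv can be imported).
if __name__ == "__main__":
    x = torch.randn(2, 512, 40, 40)

    # AssemFormer preserves the channel count and spatial size of its input.
    block = AssemFormer(512)
    print(block(x).shape)      # torch.Size([2, 512, 40, 40])

    # HGBlock_AssemFormer as configured for stage 3 of the backbone:
    # c1=512 -> cm=192 -> c2=1024, k=5, n=6, LightConv, no shortcut.
    hg = HGBlock_AssemFormer(512, 192, 1024, k=5, n=6, lightconv=True, shortcut=False)
    print(hg(x).shape)         # torch.Size([2, 1024, 40, 40])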

The implementation of the HS-FPN module is as follows:

import torch
import torch.nn as nn

class ChannelAttention_HSFPN(nn.Module):
    def __init__(self, in_planes, ratio=4, flag=True):
        super(ChannelAttention_HSFPN, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
 
        self.conv1 = nn.Conv2d(in_planes, in_planes // ratio, 1, bias=False)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(in_planes // ratio, in_planes, 1, bias=False)
        self.flag = flag
        self.sigmoid = nn.Sigmoid()
 
        nn.init.xavier_uniform_(self.conv1.weight)
        nn.init.xavier_uniform_(self.conv2.weight)
 
    def forward(self, x):
        avg_out = self.conv2(self.relu(self.conv1(self.avg_pool(x))))
        max_out = self.conv2(self.relu(self.conv1(self.max_pool(x))))
        out = avg_out + max_out
        return self.sigmoid(out) * x if self.flag else self.sigmoid(out)

class Multiply(nn.Module):
    def __init__(self) -> None:
        super().__init__()
 
    def forward(self, x):
        return x[0] * x[1]

class Add(nn.Module):
    def __init__(self):
        super().__init__()
 
    def forward(self, x):
        return torch.sum(torch.stack(x, dim=0), dim=0)
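
# A minimal usage sketch of the three modules above, mirroring one SFF fusion step from
# the YAML head in Section 6 (layers 15-20); the tensor shapes are illustrative only.
if __name__ == "__main__":
    p5 = torch.randn(1, 256, 20, 20)      # high-level feature (after the lateral conv)
    p4 = torch.randn(1, 1024, 40, 40)     # low-level backbone feature (stage 3)

    f_att = nn.ConvTranspose2d(256, 256, 3, 2, 1, 1)(p5)                  # expand the high-level feature
    f_low = nn.Conv2d(1024, 256, 1)(ChannelAttention_HSFPN(1024)(p4))     # CA-filter, then 1x1 DM conv to 256 channels
    weights = ChannelAttention_HSFPN(256, flag=False)(f_att)              # attention weights only (flag=False)

    # f_out = f_low * CA(f_att) + f_att
    f_out = Add()([Multiply()([weights, f_low]), f_att])
    print(f_out.shape)                                                    # torch.Size([1, 256, 40, 40])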

5. Integration Steps

For the steps to add AssemFormer, refer to:

For the steps to add HS-FPN, refer to:


6. YAML Model File

6.1 Model Modification ⭐

Once the code is in place, configure the model's YAML file.

Taking ultralytics/cfg/models/rt-detr/rtdetr-l.yaml as an example, create a model file named rtdetr-AssemFormer-HSFPN.yaml in the same directory for training on your own dataset.

Copy the contents of rtdetr-l.yaml into the rtdetr-AssemFormer-HSFPN.yaml file and set nc to the number of classes in your dataset.

# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-l object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/models/rtdetr

# Parameters
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
  # [depth, width, max_channels]
  l: [1.00, 1.00, 1024]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, HGStem, [32, 48]] # 0-P2/4
  - [-1, 6, HGBlock, [48, 128, 3]] # stage 1

  - [-1, 1, DWConv, [128, 3, 2, 1, False]] # 2-P3/8
  - [-1, 6, HGBlock, [96, 512, 3]] # stage 2

  - [-1, 1, DWConv, [512, 3, 2, 1, False]] # 4-P4/16
  - [-1, 6, HGBlock_AssemFormer, [192, 1024, 5, True, False]] # cm, c2, k, light, shortcut
  - [-1, 6, HGBlock_AssemFormer, [192, 1024, 5, True, True]]
  - [-1, 6, HGBlock_AssemFormer, [192, 1024, 5, True, True]] # stage 3

  - [-1, 1, DWConv, [1024, 3, 2, 1, False]] # 8-P5/32
  - [-1, 6, HGBlock, [384, 2048, 5, True, False]] # stage 4

head:
  - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]]  # 10 input_proj.2
  - [-1, 1, AIFI, [1024, 8]] # 11
  - [-1, 1, Conv, [256, 1, 1]]  # 12, Y5, lateral_convs.0

  - [-1, 1, ChannelAttention_HSFPN, []] # 13
  - [-1, 1, nn.Conv2d, [256, 1]] # 14
  - [-1, 1, nn.ConvTranspose2d, [256, 3, 2, 1, 1]] # 15

  - [7, 1, ChannelAttention_HSFPN, []] # 16
  - [-1, 1, nn.Conv2d, [256, 1]] # 17
  - [15, 1, ChannelAttention_HSFPN, [4, False]] # 18
  - [[-1, -2], 1, Multiply, []] # 19
  - [[-1, 15], 1, Add, []] # 20
  - [-1, 3, RepC3, [256]] # 21 P4/16

  - [15, 1, nn.ConvTranspose2d, [256, 3, 2, 1, 1, 16]] # 22
  - [3, 1, ChannelAttention_HSFPN, []] # 23
  - [-1, 1, nn.Conv2d, [256, 1]] # 24
  - [22, 1, ChannelAttention_HSFPN, [4, False]] # 25
  - [[-1, -2], 1, Multiply, []] # 26
  - [[-1, 22], 1, Add, []] # 27
  - [-1, 3, RepC3, [256]] # 28 P3/8

  - [[28, 21, 14], 1, RTDETRDecoder, [nc]]  # Detect(P3, P4, P5)


7. Successful Run Results

Printing the network model shows that the AssemFormer module and the HS-FPN structure have been incorporated into the model, which is now ready for training.
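A hedged sketch of how this printout can be produced with the Ultralytics API, assuming the new modules have been registered with the framework and the YAML file is on the path:

from ultralytics import RTDETR

# Build the modified model from the custom YAML and print its layer table.
model = RTDETR("rtdetr-AssemFormer-HSFPN.yaml")
model.info()

# Training then starts as usual, for example:
# model.train(data="your_dataset.yaml", epochs=100, imgsz=640)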

rtdetr-AssemFormer-HSFPN

                   from  n    params  module                                       arguments                     
  0                  -1  1     25248  ultralytics.nn.modules.block.HGStem          [3, 32, 48]                   
  1                  -1  6    155072  ultralytics.nn.modules.block.HGBlock         [48, 48, 128, 3, 6]           
  2                  -1  1      1408  ultralytics.nn.modules.conv.DWConv           [128, 128, 3, 2, 1, False]    
  3                  -1  6    839296  ultralytics.nn.modules.block.HGBlock         [128, 96, 512, 3, 6]          
  4                  -1  1      5632  ultralytics.nn.modules.conv.DWConv           [512, 512, 3, 2, 1, False]    
  5                  -1  6  16391810  ultralytics.nn.AddModules.AssemFormer.HGBlock_AssemFormer[512, 192, 1024, 5, 6, True, False]
  6                  -1  6  16752258  ultralytics.nn.AddModules.AssemFormer.HGBlock_AssemFormer[1024, 192, 1024, 5, 6, True, True]
  7                  -1  6  16752258  ultralytics.nn.AddModules.AssemFormer.HGBlock_AssemFormer[1024, 192, 1024, 5, 6, True, True]
  8                  -1  1     11264  ultralytics.nn.modules.conv.DWConv           [1024, 1024, 3, 2, 1, False]  
  9                  -1  6   6708480  ultralytics.nn.modules.block.HGBlock         [1024, 384, 2048, 5, 6, True, False]
 10                  -1  1    524800  ultralytics.nn.modules.conv.Conv             [2048, 256, 1, 1, None, 1, 1, False]
 11                  -1  1    789760  ultralytics.nn.modules.transformer.AIFI      [256, 1024, 8]                
 12                  -1  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]              
 13                  -1  1     32768  ultralytics.nn.AddModules.HSFPN.ChannelAttention_HSFPN[256]                         
 14                  -1  1     65792  torch.nn.modules.conv.Conv2d                 [256, 256, 1]                 
 15                  -1  1    590080  torch.nn.modules.conv.ConvTranspose2d        [256, 256, 3, 2, 1, 1]        
 16                   7  1    524288  ultralytics.nn.AddModules.HSFPN.ChannelAttention_HSFPN[1024]                        
 17                  -1  1    262400  torch.nn.modules.conv.Conv2d                 [1024, 256, 1]                
 18                  15  1     32768  ultralytics.nn.AddModules.HSFPN.ChannelAttention_HSFPN[256, 4, False]               
 19            [-1, -2]  1         0  ultralytics.nn.AddModules.HSFPN.Multiply     []                            
 20            [-1, 15]  1         0  ultralytics.nn.AddModules.HSFPN.Add          []                            
 21                  -1  3   2101248  ultralytics.nn.modules.block.RepC3           [256, 256, 3]                 
 22                  15  1     37120  torch.nn.modules.conv.ConvTranspose2d        [256, 256, 3, 2, 1, 1, 16]    
 23                   3  1    131072  ultralytics.nn.AddModules.HSFPN.ChannelAttention_HSFPN[512]                         
 24                  -1  1    131328  torch.nn.modules.conv.Conv2d                 [512, 256, 1]                 
 25                  22  1     32768  ultralytics.nn.AddModules.HSFPN.ChannelAttention_HSFPN[256, 4, False]               
 26            [-1, -2]  1         0  ultralytics.nn.AddModules.HSFPN.Multiply     []                            
 27            [-1, 22]  1         0  ultralytics.nn.AddModules.HSFPN.Add          []                            
 28                  -1  3   2101248  ultralytics.nn.modules.block.RepC3           [256, 256, 3]                 
 29        [28, 21, 14]  1   7303907  ultralytics.nn.modules.head.RTDETRDecoder    [1, [256, 256, 256]]          
rtdetr-AssemFormer-HSFPN summary: 817 layers, 72,370,121 parameters, 72,370,121 gradients, 237.9 GFLOPs