

RT-DETR Improvement Strategy [Exclusive Fusion Improvement] | RepViT + ASF-YOLO: Lightweight Accuracy Gains, Applicable to All Backbone Replacements in This Column

1. Introduction

This article documents a lightweight improvement of RT-DETR based on RepViT. RepViT reduces inference-time computation and memory cost through separate token mixers and channel mixers, lowers latency by reducing the expansion ratio and increasing the width, and compensates for the large reduction in parameters by doubling the channels, which improves accuracy. On top of this, the neck of RT-DETR is rebuilt with the ASF-YOLO structure, so the model can effectively fuse multi-scale features, capture fine-grained information about small objects, and use attention to focus on small-object-related features, noticeably improving model accuracy.



2. RepViT Architecture in Detail

2.1 Motivation

In computer vision, designing lightweight models is essential for deploying vision models on resource-constrained mobile devices. In recent years, lightweight Vision Transformers (ViTs) have shown strong performance and low latency on mobile devices, yet the significant differences between ViTs and lightweight Convolutional Neural Networks (CNNs) in block structure and in macro and micro design had not been studied thoroughly. This work revisits the efficient design of lightweight CNNs from a ViT perspective, aiming to explore better model architectures for mobile devices, and proposes the RepViT model.

2.2 Principles

2.2.1 Design ideas borrowed from ViTs

  • Block design
    • Separating the token mixer and the channel mixer: an important design feature of lightweight ViT blocks is that the token mixer and the channel mixer are separate. Research has found that the effectiveness of ViTs comes mainly from this general token-mixer/channel-mixer structure (the MetaFormer architecture). In MobileNetV3-L the original block couples the two; by moving the DW convolution and the optional squeeze-and-excitation (SE) layer, the two are successfully separated, and structural re-parameterization is used to strengthen learning. This reduces inference-time computation and memory cost and lowers latency; the resulting block is named the RepViT block (a simplified sketch follows this list).
    • Reducing the expansion ratio and increasing the width: in ViTs the expansion ratio of the channel mixer is usually large and consumes substantial compute. In the RepViT block, the channel-mixer expansion ratio is set to 2 in all stages, lowering latency, and the channels are doubled in each stage to compensate for the large reduction in parameters, improving accuracy.

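The following is a minimal sketch of this separated design for a stride-1 block. It is only an illustration under simplified assumptions; the full RepViTBlock with structural re-parameterization and SE layers appears in Section 4.

import torch
import torch.nn as nn

# Illustration only: token mixer = depthwise 3x3 convolution (spatial mixing),
# channel mixer = two 1x1 convolutions with an expansion ratio of 2 (channel mixing),
# each wrapped in its own residual connection.
class SimpleRepViTBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.token_mixer = nn.Sequential(
            nn.Conv2d(dim, dim, 3, 1, 1, groups=dim),
            nn.BatchNorm2d(dim),
        )
        self.channel_mixer = nn.Sequential(
            nn.Conv2d(dim, 2 * dim, 1),
            nn.GELU(),
            nn.Conv2d(2 * dim, dim, 1),
        )

    def forward(self, x):
        x = x + self.token_mixer(x)
        x = x + self.channel_mixer(x)
        return x

x = torch.randn(1, 48, 56, 56)
print(SimpleRepViTBlock(48)(x).shape)  # torch.Size([1, 48, 56, 56])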

2.2.2 Macro design

  • Early convolutions for the stem: ViTs usually use a patchify operation as the stem, which tends to cause suboptimal optimization and sensitivity to the training recipe, while MobileNetV3-L uses a complex stem that creates a latency bottleneck and limits representational capacity. The study adopts the early-convolution approach, stacking two stride-2 3×3 convolutions as the stem, which reduces latency and improves accuracy (see the stem sketch after this list).
  • Deeper downsampling layers: ViTs use a separate patch-merging layer for spatial downsampling, which helps increase network depth and reduce information loss, whereas MobileNetV3-L downsamples only through an inverted bottleneck block and may lack sufficient depth. The study uses a DW convolution and a 1×1 convolution for spatial downsampling and channel adjustment, places a RepViT block in front to deepen the downsampling stage further, and adds an FFN module to retain more latent information, improving accuracy while lowering latency.
  • Simple classifier: the classifier of lightweight ViTs usually consists of a global average pooling layer and a linear layer, which is latency-friendly, while MobileNetV3-L uses a complex classifier that adds latency. Since the RepViT block design leaves more channels in the final stage, the study replaces it with a simple classifier; accuracy drops slightly but latency decreases.
  • Overall stage ratio: adjusting the ratio of block counts across stages to 1:1:7:1 and increasing network depth improves accuracy while lowering latency.
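
A minimal sketch of the early-convolution stem described above (channel numbers here are illustrative; the RepViT code in Section 4 builds the same structure with Conv2d_BN and GELU):

import torch
import torch.nn as nn

# Two stride-2 3x3 convolutions give a 4x spatial reduction.
stem = nn.Sequential(
    nn.Conv2d(3, 24, 3, 2, 1), nn.BatchNorm2d(24), nn.GELU(),
    nn.Conv2d(24, 48, 3, 2, 1), nn.BatchNorm2d(48),
)
print(stem(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 48, 56, 56])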

2.2.3 Micro design

  • Kernel size selection: the performance and latency of CNNs are affected by the convolution kernel size. Large-kernel convolutions can bring performance gains but are unfriendly to mobile devices. MobileNetV3-L mainly uses 3×3 convolutions; the study prioritizes 3×3 convolutions in all modules, maintaining accuracy while lowering latency.
  • Squeeze-and-excitation (SE) layer placement: SE layers can compensate for the limitations of convolution, but the way MobileNetV3-L applies them in certain blocks is problematic. The study designs a cross-block strategy, using SE layers in the 1st, 3rd, 5th, ... block of each stage, obtaining the largest accuracy gain for the smallest latency increase (see the flag sketch below).

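The cross-block SE placement can be expressed as a simple alternating flag per block. The helper below is illustrative only, but it roughly matches how the fourth column (the SE flag) of the cfgs lists in Section 4 alternates within the deeper stages:

# Illustration only: SE enabled in the 1st, 3rd, 5th, ... block of a stage.
def se_flags(num_blocks):
    return [1 if i % 2 == 0 else 0 for i in range(num_blocks)]

print(se_flags(7))  # [1, 0, 1, 0, 1, 0, 1]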

2.3 Architecture

The RepViT model is a new family of pure lightweight CNNs. Its structure follows the ViT-like MetaFormer design but consists entirely of re-parameterized convolutions. It comes in several variants, such as RepViT-M0.9/M1.0/M1.1/M1.5/M2.3, which differ in the number of channels and blocks per stage.

2.4 Advantages

  1. Superior performance
    • In ImageNet-1K image classification, RepViT reaches state-of-the-art performance across model sizes.

For example, RepViT-M1.0 achieves over 80% top-1 accuracy with 1.0 ms latency on an iPhone 12, the first time a lightweight model has reached this level. Even without knowledge distillation, it clearly outperforms competing models.

  2. Low latency
    • RepViT shows good latency across a range of vision tasks.

For example, in object detection and instance segmentation, RepViT-M1.1 has lower latency than the EfficientFormer-L1 backbone at a similar model size; in semantic segmentation, RepViT-M1.5 cuts latency by nearly 50% compared with EfficientFormerV2-S2 while achieving better performance.

  3. Suited to mobile devices
    • RepViT's design fully accounts for the resource constraints of mobile devices. By borrowing the efficient architectural designs of ViTs to optimize lightweight CNNs, it achieves good accuracy and latency on mobile devices, providing a better model choice for mobile vision tasks.

Paper: https://arxiv.org/pdf/2307.09283
Code: https://github.com/THU-MIG/RepViT

3. Introduction to ASF-YOLO

ASF-YOLO is a novel YOLO-based framework that combines spatial and scale features for accurate and fast segmentation. Its attentional scale-sequence fusion design involves the following key aspects:

3.1 Motivation

  • Addressing the challenge of small-object segmentation: cell instance segmentation demands high accuracy because cells are small, dense, overlapping, and have blurry boundaries. Traditional CNN-based methods and some existing architectures fall short on such small-object segmentation, so a method that better fuses multi-scale features and attends to small-object-related information is needed.
  • Optimizing the YOLO architecture: although the YOLO family has advantages for real-time instance segmentation, its architecture can be further optimized for small objects in medical images (such as cells). The attentional scale-sequence fusion design improves the model's ability to handle small objects across scales and its segmentation performance.

3.2 Principles

3.2.1 Multi-scale feature fusion

  • SSFF module: normalizes, up-samples, and stacks feature maps of different scales (P3, P4, P5), then combines the multi-scale features with a 3D convolution, so that objects of different sizes, orientations, and aspect ratios can be handled effectively in a scale-space representation, making the model more robust to scale changes of small objects.
  • TFE module: concatenates large, medium, and small feature maps along the spatial dimension to capture fine spatial details of small objects at different scales, overcoming the limitation that the FPN in YOLOv5 cannot fully exploit the correlation between pyramid feature maps.

3.2.2 Attention mechanism

  • CPAM module: integrates the feature information from the SSFF and TFE modules. Through a channel attention network and a positional attention network, it captures informative channels related to small objects and refines their spatial localization, so the model can adaptively attend to the channels and spatial positions relevant to small objects at different scales, improving detection and segmentation accuracy.

3.3 Structure

3.3.1 SSFF module structure

  • First, apply a 1×1 convolution to the P4 and P5 feature maps to change their channel number to 256, then resize them to the P3 size with nearest-neighbor interpolation.
  • Then use unsqueeze to add a dimension, turning the 3D tensors into 4D tensors, and concatenate the 4D feature maps along the depth dimension to form a 3D feature volume.
  • Finally, apply a 3D convolution, 3D batch normalization, and a SiLU activation to complete the scale-sequence feature extraction (a shape check follows below).

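A quick shape check of the ScalSeq implementation given in Section 4. This assumes the ScalSeq class from Section 4 is in scope; the channel sizes below are illustrative, not the exact backbone values.

import torch

p3 = torch.randn(1, 96, 80, 80)    # P3: highest resolution
p4 = torch.randn(1, 192, 40, 40)   # P4
p5 = torch.randn(1, 384, 20, 20)   # P5: lowest resolution
ssff = ScalSeq([96, 192, 384], 256)
print(ssff([p3, p4, p5]).shape)    # torch.Size([1, 256, 80, 80]) -- fused at the P3 resolution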

3.3.2 TFE module structure

  • For the large feature map (Large), a convolution module adjusts the channel number to 1C, and a hybrid of max pooling and average pooling performs the down-sampling.
  • For the small feature map (Small), a convolution module adjusts the channels and nearest-neighbor interpolation performs the up-sampling.
  • Finally, the large, medium, and small feature maps, now of equal size, are concatenated along the channel dimension and output (a shape check of the code version follows below).

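A quick shape check of the Zoom_cat implementation given in Section 4, which realizes this TFE behavior. This assumes the Zoom_cat class from Section 4 is in scope; the channel sizes are illustrative.

import torch

l = torch.randn(1, 96, 80, 80)    # large (high-resolution) map, pooled down to the m size
m = torch.randn(1, 192, 40, 40)   # medium map -- the target scale
s = torch.randn(1, 384, 20, 20)   # small (low-resolution) map, up-sampled to the m size
print(Zoom_cat()([l, m, s]).shape)  # torch.Size([1, 672, 40, 40]) -- 96 + 192 + 384 channels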

3.3.3 CPAM module structure

  • It contains a channel attention network and a positional attention network. The channel attention network takes the feature map output by the TFE module and uses an attention mechanism without dimensionality reduction, capturing non-linear cross-channel interaction by considering each channel and its k nearest neighbors.
  • The positional attention network takes the feature map obtained by adding the channel-attention output to the SSFF output, and extracts the key positional information of each cell through pooling along the horizontal and vertical axes, convolution, and splitting (a usage sketch follows below).

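A quick shape check of the attention_model (CPAM) implementation given in Section 4. This assumes the attention_model class from Section 4 is in scope; the sizes are illustrative.

import torch

detail = torch.randn(1, 256, 80, 80)    # input 1: detail features, passed through channel attention
ssff_feat = torch.randn(1, 256, 80, 80) # input 2: multi-scale features from the SSFF branch
cpam = attention_model(ch=256)
print(cpam([detail, ssff_feat]).shape)  # torch.Size([1, 256, 80, 80])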

3.4 Advantages

  • Higher segmentation accuracy: the SSFF module effectively fuses multi-scale features, the TFE module captures fine details of small objects, and the CPAM attention mechanism focuses on small-object-related features, significantly improving cell instance segmentation accuracy; results on the DSB2018 and BCC datasets surpass other state-of-the-art methods.
  • Stronger robustness: the way the SSFF module fuses multi-scale features makes the model more robust to scale variation of small objects in cell images under different conditions.
  • Balanced accuracy and speed: while achieving high segmentation accuracy, the model keeps a fast inference speed, reaching 47.3 FPS on the DSB2018 dataset, which meets real-time processing needs.

Paper: https://arxiv.org/pdf/2312.06458
Code: https://github.com/mkang315/ASF-YOLO

4. Implementation Code for RepViT and ASF-YOLO

The implementation of the RepViT module is as follows:

 
import torch.nn as nn
import numpy as np
from timm.models.layers import SqueezeExcite
import torch

__all__ = ['repvit_m0_9', 'repvit_m1_0', 'repvit_m1_1', 'repvit_m1_5', 'repvit_m2_3']

def replace_batchnorm(net):
    for child_name, child in net.named_children():
        if hasattr(child, 'fuse_self'):
            fused = child.fuse_self()
            setattr(net, child_name, fused)
            replace_batchnorm(fused)
        elif isinstance(child, torch.nn.BatchNorm2d):
            setattr(net, child_name, torch.nn.Identity())
        else:
            replace_batchnorm(child)

def _make_divisible(v, divisor, min_value=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    :param v:
    :param divisor:
    :param min_value:
    :return:
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v

class Conv2d_BN(torch.nn.Sequential):
    def __init__(self, a, b, ks=1, stride=1, pad=0, dilation=1,
                 groups=1, bn_weight_init=1, resolution=-10000):
        super().__init__()
        self.add_module('c', torch.nn.Conv2d(
            a, b, ks, stride, pad, dilation, groups, bias=False))
        self.add_module('bn', torch.nn.BatchNorm2d(b))
        torch.nn.init.constant_(self.bn.weight, bn_weight_init)
        torch.nn.init.constant_(self.bn.bias, 0)

    @torch.no_grad()
    def fuse_self(self):
        c, bn = self._modules.values()
        w = bn.weight / (bn.running_var + bn.eps)**0.5
        w = c.weight * w[:, None, None, None]
        b = bn.bias - bn.running_mean * bn.weight / \
            (bn.running_var + bn.eps)**0.5
        m = torch.nn.Conv2d(w.size(1) * self.c.groups, w.size(
            0), w.shape[2:], stride=self.c.stride, padding=self.c.padding, dilation=self.c.dilation, groups=self.c.groups,
            device=c.weight.device)
        m.weight.data.copy_(w)
        m.bias.data.copy_(b)
        return m

class Residual(torch.nn.Module):
    def __init__(self, m, drop=0.):
        super().__init__()
        self.m = m
        self.drop = drop

    def forward(self, x):
        if self.training and self.drop > 0:
            return x + self.m(x) * torch.rand(x.size(0), 1, 1, 1,
                                              device=x.device).ge_(self.drop).div(1 - self.drop).detach()
        else:
            return x + self.m(x)
    
    @torch.no_grad()
    def fuse_self(self):
        if isinstance(self.m, Conv2d_BN):
            m = self.m.fuse_self()
            assert(m.groups == m.in_channels)
            identity = torch.ones(m.weight.shape[0], m.weight.shape[1], 1, 1)
            identity = torch.nn.functional.pad(identity, [1,1,1,1])
            m.weight += identity.to(m.weight.device)
            return m
        elif isinstance(self.m, torch.nn.Conv2d):
            m = self.m
            assert(m.groups != m.in_channels)
            identity = torch.ones(m.weight.shape[0], m.weight.shape[1], 1, 1)
            identity = torch.nn.functional.pad(identity, [1,1,1,1])
            m.weight += identity.to(m.weight.device)
            return m
        else:
            return self

class RepVGGDW(torch.nn.Module):
    def __init__(self, ed) -> None:
        super().__init__()
        self.conv = Conv2d_BN(ed, ed, 3, 1, 1, groups=ed)
        self.conv1 = torch.nn.Conv2d(ed, ed, 1, 1, 0, groups=ed)
        self.dim = ed
        self.bn = torch.nn.BatchNorm2d(ed)
    
    def forward(self, x):
        return self.bn((self.conv(x) + self.conv1(x)) + x)
    
    @torch.no_grad()
    def fuse_self(self):
        conv = self.conv.fuse_self()
        conv1 = self.conv1
        
        conv_w = conv.weight
        conv_b = conv.bias
        conv1_w = conv1.weight
        conv1_b = conv1.bias
        
        conv1_w = torch.nn.functional.pad(conv1_w, [1,1,1,1])

        identity = torch.nn.functional.pad(torch.ones(conv1_w.shape[0], conv1_w.shape[1], 1, 1, device=conv1_w.device), [1,1,1,1])

        final_conv_w = conv_w + conv1_w + identity
        final_conv_b = conv_b + conv1_b

        conv.weight.data.copy_(final_conv_w)
        conv.bias.data.copy_(final_conv_b)

        bn = self.bn
        w = bn.weight / (bn.running_var + bn.eps)**0.5
        w = conv.weight * w[:, None, None, None]
        b = bn.bias + (conv.bias - bn.running_mean) * bn.weight / \
            (bn.running_var + bn.eps)**0.5
        conv.weight.data.copy_(w)
        conv.bias.data.copy_(b)
        return conv

class RepViTBlock(nn.Module):
    def __init__(self, inp, hidden_dim, oup, kernel_size, stride, use_se, use_hs):
        super(RepViTBlock, self).__init__()
        assert stride in [1, 2]

        self.identity = stride == 1 and inp == oup
        assert(hidden_dim == 2 * inp)

        if stride == 2:
            self.token_mixer = nn.Sequential(
                Conv2d_BN(inp, inp, kernel_size, stride, (kernel_size - 1) // 2, groups=inp),
                SqueezeExcite(inp, 0.25) if use_se else nn.Identity(),
                Conv2d_BN(inp, oup, ks=1, stride=1, pad=0)
            )
            self.channel_mixer = Residual(nn.Sequential(
                    # pw
                    Conv2d_BN(oup, 2 * oup, 1, 1, 0),
                    nn.GELU() if use_hs else nn.GELU(),
                    # pw-linear
                    Conv2d_BN(2 * oup, oup, 1, 1, 0, bn_weight_init=0),
                ))
        else:
            assert(self.identity)
            self.token_mixer = nn.Sequential(
                RepVGGDW(inp),
                SqueezeExcite(inp, 0.25) if use_se else nn.Identity(),
            )
            self.channel_mixer = Residual(nn.Sequential(
                    # pw
                    Conv2d_BN(inp, hidden_dim, 1, 1, 0),
                    nn.GELU() if use_hs else nn.GELU(),
                    # pw-linear
                    Conv2d_BN(hidden_dim, oup, 1, 1, 0, bn_weight_init=0),
                ))

    def forward(self, x):
        return self.channel_mixer(self.token_mixer(x))

class RepViT(nn.Module):
    def __init__(self, cfgs):
        super(RepViT, self).__init__()
        # setting of inverted residual blocks
        self.cfgs = cfgs

        # building first layer
        input_channel = self.cfgs[0][2]
        patch_embed = torch.nn.Sequential(Conv2d_BN(3, input_channel // 2, 3, 2, 1), torch.nn.GELU(),
                           Conv2d_BN(input_channel // 2, input_channel, 3, 2, 1))
        layers = [patch_embed]
        # building inverted residual blocks
        block = RepViTBlock
        for k, t, c, use_se, use_hs, s in self.cfgs:
            output_channel = _make_divisible(c, 8)
            exp_size = _make_divisible(input_channel * t, 8)
            layers.append(block(input_channel, exp_size, output_channel, k, s, use_se, use_hs))
            input_channel = output_channel
        self.features = nn.ModuleList(layers)
        self.channel = [i.size(1) for i in self.forward(torch.randn(1, 3, 640, 640))]
        
    def forward(self, x):
        input_size = x.size(2)
        scale = [4, 8, 16, 32]
        features = [None, None, None, None]
        for f in self.features:
            x = f(x)
            if input_size // x.size(2) in scale:
                features[scale.index(input_size // x.size(2))] = x
        return features
    
    def switch_to_deploy(self):
        replace_batchnorm(self)

def update_weight(model_dict, weight_dict):
    idx, temp_dict = 0, {}
    for k, v in weight_dict.items():
        # k = k[9:]
        if k in model_dict.keys() and np.shape(model_dict[k]) == np.shape(v):
            temp_dict[k] = v
            idx += 1
    model_dict.update(temp_dict)
    print(f'loading weights... {idx}/{len(model_dict)} items')
    return model_dict

def repvit_m0_9(weights=''):
    """
    Constructs a MobileNetV3-Large model
    """
    cfgs = [
        # k, t, c, SE, HS, s 
        [3,   2,  48, 1, 0, 1],
        [3,   2,  48, 0, 0, 1],
        [3,   2,  48, 0, 0, 1],
        [3,   2,  96, 0, 0, 2],
        [3,   2,  96, 1, 0, 1],
        [3,   2,  96, 0, 0, 1],
        [3,   2,  96, 0, 0, 1],
        [3,   2,  192, 0, 1, 2],
        [3,   2,  192, 1, 1, 1],
        [3,   2,  192, 0, 1, 1],
        [3,   2,  192, 1, 1, 1],
        [3,   2, 192, 0, 1, 1],
        [3,   2, 192, 1, 1, 1],
        [3,   2, 192, 0, 1, 1],
        [3,   2, 192, 1, 1, 1],
        [3,   2, 192, 0, 1, 1],
        [3,   2, 192, 1, 1, 1],
        [3,   2, 192, 0, 1, 1],
        [3,   2, 192, 1, 1, 1],
        [3,   2, 192, 0, 1, 1],
        [3,   2, 192, 1, 1, 1],
        [3,   2, 192, 0, 1, 1],
        [3,   2, 192, 0, 1, 1],
        [3,   2, 384, 0, 1, 2],
        [3,   2, 384, 1, 1, 1],
        [3,   2, 384, 0, 1, 1]
    ]
    model = RepViT(cfgs)
    if weights:
        model.load_state_dict(update_weight(model.state_dict(), torch.load(weights)['model']))
    return model

def repvit_m1_0(weights=''):
    """
    Constructs a MobileNetV3-Large model
    """
    cfgs = [
        # k, t, c, SE, HS, s 
        [3,   2,  56, 1, 0, 1],
        [3,   2,  56, 0, 0, 1],
        [3,   2,  56, 0, 0, 1],
        [3,   2,  112, 0, 0, 2],
        [3,   2,  112, 1, 0, 1],
        [3,   2,  112, 0, 0, 1],
        [3,   2,  112, 0, 0, 1],
        [3,   2,  224, 0, 1, 2],
        [3,   2,  224, 1, 1, 1],
        [3,   2,  224, 0, 1, 1],
        [3,   2,  224, 1, 1, 1],
        [3,   2, 224, 0, 1, 1],
        [3,   2, 224, 1, 1, 1],
        [3,   2, 224, 0, 1, 1],
        [3,   2, 224, 1, 1, 1],
        [3,   2, 224, 0, 1, 1],
        [3,   2, 224, 1, 1, 1],
        [3,   2, 224, 0, 1, 1],
        [3,   2, 224, 1, 1, 1],
        [3,   2, 224, 0, 1, 1],
        [3,   2, 224, 1, 1, 1],
        [3,   2, 224, 0, 1, 1],
        [3,   2, 224, 0, 1, 1],
        [3,   2, 448, 0, 1, 2],
        [3,   2, 448, 1, 1, 1],
        [3,   2, 448, 0, 1, 1]
    ]
    model = RepViT(cfgs)
    if weights:
        model.load_state_dict(update_weight(model.state_dict(), torch.load(weights)['model']))
    return model

def repvit_m1_1(weights=''):
    """
    Constructs a MobileNetV3-Large model
    """
    cfgs = [
        # k, t, c, SE, HS, s 
        [3,   2,  64, 1, 0, 1],
        [3,   2,  64, 0, 0, 1],
        [3,   2,  64, 0, 0, 1],
        [3,   2,  128, 0, 0, 2],
        [3,   2,  128, 1, 0, 1],
        [3,   2,  128, 0, 0, 1],
        [3,   2,  128, 0, 0, 1],
        [3,   2,  256, 0, 1, 2],
        [3,   2,  256, 1, 1, 1],
        [3,   2,  256, 0, 1, 1],
        [3,   2,  256, 1, 1, 1],
        [3,   2, 256, 0, 1, 1],
        [3,   2, 256, 1, 1, 1],
        [3,   2, 256, 0, 1, 1],
        [3,   2, 256, 1, 1, 1],
        [3,   2, 256, 0, 1, 1],
        [3,   2, 256, 1, 1, 1],
        [3,   2, 256, 0, 1, 1],
        [3,   2, 256, 1, 1, 1],
        [3,   2, 256, 0, 1, 1],
        [3,   2, 256, 0, 1, 1],
        [3,   2, 512, 0, 1, 2],
        [3,   2, 512, 1, 1, 1],
        [3,   2, 512, 0, 1, 1]
    ]
    model = RepViT(cfgs)
    if weights:
        model.load_state_dict(update_weight(model.state_dict(), torch.load(weights)['model']))
    return model

def repvit_m1_5(weights=''):
    """
    Constructs a MobileNetV3-Large model
    """
    cfgs = [
        # k, t, c, SE, HS, s 
        [3,   2,  64, 1, 0, 1],
        [3,   2,  64, 0, 0, 1],
        [3,   2,  64, 1, 0, 1],
        [3,   2,  64, 0, 0, 1],
        [3,   2,  64, 0, 0, 1],
        [3,   2,  128, 0, 0, 2],
        [3,   2,  128, 1, 0, 1],
        [3,   2,  128, 0, 0, 1],
        [3,   2,  128, 1, 0, 1],
        [3,   2,  128, 0, 0, 1],
        [3,   2,  128, 0, 0, 1],
        [3,   2,  256, 0, 1, 2],
        [3,   2,  256, 1, 1, 1],
        [3,   2,  256, 0, 1, 1],
        [3,   2,  256, 1, 1, 1],
        [3,   2,  256, 0, 1, 1],
        [3,   2,  256, 1, 1, 1],
        [3,   2,  256, 0, 1, 1],
        [3,   2,  256, 1, 1, 1],
        [3,   2, 256, 0, 1, 1],
        [3,   2, 256, 1, 1, 1],
        [3,   2, 256, 0, 1, 1],
        [3,   2, 256, 1, 1, 1],
        [3,   2, 256, 0, 1, 1],
        [3,   2, 256, 1, 1, 1],
        [3,   2, 256, 0, 1, 1],
        [3,   2, 256, 1, 1, 1],
        [3,   2, 256, 0, 1, 1],
        [3,   2, 256, 1, 1, 1],
        [3,   2, 256, 0, 1, 1],
        [3,   2, 256, 1, 1, 1],
        [3,   2, 256, 0, 1, 1],
        [3,   2, 256, 1, 1, 1],
        [3,   2, 256, 0, 1, 1],
        [3,   2, 256, 1, 1, 1],
        [3,   2, 256, 0, 1, 1],
        [3,   2, 256, 0, 1, 1],
        [3,   2, 512, 0, 1, 2],
        [3,   2, 512, 1, 1, 1],
        [3,   2, 512, 0, 1, 1],
        [3,   2, 512, 1, 1, 1],
        [3,   2, 512, 0, 1, 1]
    ]
    model = RepViT(cfgs)
    if weights:
        model.load_state_dict(update_weight(model.state_dict(), torch.load(weights)['model']))
    return model

def repvit_m2_3(weights=''):
    """
    Constructs a MobileNetV3-Large model
    """
    cfgs = [
        # k, t, c, SE, HS, s 
        [3,   2,  80, 1, 0, 1],
        [3,   2,  80, 0, 0, 1],
        [3,   2,  80, 1, 0, 1],
        [3,   2,  80, 0, 0, 1],
        [3,   2,  80, 1, 0, 1],
        [3,   2,  80, 0, 0, 1],
        [3,   2,  80, 0, 0, 1],
        [3,   2,  160, 0, 0, 2],
        [3,   2,  160, 1, 0, 1],
        [3,   2,  160, 0, 0, 1],
        [3,   2,  160, 1, 0, 1],
        [3,   2,  160, 0, 0, 1],
        [3,   2,  160, 1, 0, 1],
        [3,   2,  160, 0, 0, 1],
        [3,   2,  160, 0, 0, 1],
        [3,   2,  320, 0, 1, 2],
        [3,   2,  320, 1, 1, 1],
        [3,   2,  320, 0, 1, 1],
        [3,   2,  320, 1, 1, 1],
        [3,   2,  320, 0, 1, 1],
        [3,   2,  320, 1, 1, 1],
        [3,   2,  320, 0, 1, 1],
        [3,   2,  320, 1, 1, 1],
        [3,   2, 320, 0, 1, 1],
        [3,   2, 320, 1, 1, 1],
        [3,   2, 320, 0, 1, 1],
        [3,   2, 320, 1, 1, 1],
        [3,   2, 320, 0, 1, 1],
        [3,   2, 320, 1, 1, 1],
        [3,   2, 320, 0, 1, 1],
        [3,   2, 320, 1, 1, 1],
        [3,   2, 320, 0, 1, 1],
        [3,   2, 320, 1, 1, 1],
        [3,   2, 320, 0, 1, 1],
        [3,   2, 320, 1, 1, 1],
        [3,   2, 320, 0, 1, 1],
        [3,   2, 320, 1, 1, 1],
        [3,   2, 320, 0, 1, 1],
        [3,   2, 320, 1, 1, 1],
        [3,   2, 320, 0, 1, 1],
        [3,   2, 320, 1, 1, 1],
        [3,   2, 320, 0, 1, 1],
        [3,   2, 320, 1, 1, 1],
        [3,   2, 320, 0, 1, 1],
        [3,   2, 320, 1, 1, 1],
        [3,   2, 320, 0, 1, 1],
        [3,   2, 320, 1, 1, 1],
        [3,   2, 320, 0, 1, 1],
        [3,   2, 320, 1, 1, 1],
        [3,   2, 320, 0, 1, 1],
        # [3,   2, 320, 1, 1, 1],
        # [3,   2, 320, 0, 1, 1],
        [3,   2, 320, 0, 1, 1],
        [3,   2, 640, 0, 1, 2],
        [3,   2, 640, 1, 1, 1],
        [3,   2, 640, 0, 1, 1],
        # [3,   2, 640, 1, 1, 1],
        # [3,   2, 640, 0, 1, 1]
    ]
    model = RepViT(cfgs)
    if weights:
        model.load_state_dict(update_weight(model.state_dict(), torch.load(weights)['model']))
    return model

if __name__ == '__main__':
    model = repvit_m2_3('repvit_m2_3_distill_450e.pth')
    inputs = torch.randn((1, 3, 640, 640))
    res = model(inputs)
    for i in res:
        print(i.size())

The implementation of the ASF-YOLO modules is as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
 
def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p

class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    default_act = nn.SiLU()  # default activation
 
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()
 
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
 
    def forward_fuse(self, x):
        return self.act(self.conv(x))

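# Zoom_cat corresponds to the TFE behavior described in Section 3.3.2: the large map is
# down-sampled with max + average pooling, the small map is up-sampled with nearest
# interpolation, and all three maps are concatenated along the channel dimension.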
class Zoom_cat(nn.Module):
    def __init__(self):
        super().__init__()
        # self.conv_l_post_down = Conv(in_dim, 2*in_dim, 3, 1, 1)
 
    def forward(self, x):
        """l,m,s表示大中小三个尺度,最终会被整合到m这个尺度上"""
        l, m, s = x[0], x[1], x[2]
        tgt_size = m.shape[2:]
        l = F.adaptive_max_pool2d(l, tgt_size) + F.adaptive_avg_pool2d(l, tgt_size)
        # l = self.conv_l_post_down(l)
        # m = self.conv_m(m)
        # s = self.conv_s_pre_up(s)
        s = F.interpolate(s, m.shape[2:], mode='nearest')
        # s = self.conv_s_post_up(s)
        lms = torch.cat([l, m, s], dim=1)
        return lms

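# ScalSeq corresponds to the SSFF module described in Section 3.3.1: 1x1 convs unify the
# channels, P4/P5 are up-sampled to the P3 size, the maps are stacked along a depth axis,
# and a 3D conv + 3D BN + activation extracts the scale-sequence features.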
class ScalSeq(nn.Module):
    def __init__(self, inc, channel):
        super(ScalSeq, self).__init__()
        self.conv0 = Conv(inc[0], channel, 1)
        self.conv1 = Conv(inc[1], channel, 1)
        self.conv2 = Conv(inc[2], channel, 1)
        self.conv3d = nn.Conv3d(channel, channel, kernel_size=(1, 1, 1))
        self.bn = nn.BatchNorm3d(channel)
        self.act = nn.LeakyReLU(0.1)
        self.pool_3d = nn.MaxPool3d(kernel_size=(3, 1, 1))
 
    def forward(self, x):
        p3, p4, p5 = x[0], x[1], x[2]
        p3 = self.conv0(p3)
        p4_2 = self.conv1(p4)
        p4_2 = F.interpolate(p4_2, p3.size()[2:], mode='nearest')
        p5_2 = self.conv2(p5)
        p5_2 = F.interpolate(p5_2, p3.size()[2:], mode='nearest')
        p3_3d = torch.unsqueeze(p3, -3)
        p4_3d = torch.unsqueeze(p4_2, -3)
        p5_3d = torch.unsqueeze(p5_2, -3)
        combine = torch.cat([p3_3d, p4_3d, p5_3d], dim=2)
        conv_3d = self.conv3d(combine)
        bn = self.bn(conv_3d)
        act = self.act(bn)
        x = self.pool_3d(act)
        x = torch.squeeze(x, 2)
        return x

class Add(nn.Module):
    # Element-wise addition of two feature maps
    def __init__(self, ch=256):
        super().__init__()
 
    def forward(self, x):
        input1, input2 = x[0], x[1]
        x = input1 + input2
        return x

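# channel_att is the channel attention branch of CPAM: attention without dimensionality
# reduction, using a 1D convolution over the globally pooled channel descriptor.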
class channel_att(nn.Module):
    def __init__(self, channel, b=1, gamma=2):
        super(channel_att, self).__init__()
        kernel_size = int(abs((math.log(channel, 2) + b) / gamma))
        kernel_size = kernel_size if kernel_size % 2 else kernel_size + 1
 
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=kernel_size, padding=(kernel_size - 1) // 2, bias=False)
        self.sigmoid = nn.Sigmoid()
 
    def forward(self, x):
        y = self.avg_pool(x)
        y = y.squeeze(-1)
        y = y.transpose(-1, -2)
        y = self.conv(y).transpose(-1, -2).unsqueeze(-1)
        y = self.sigmoid(y)
        return x * y.expand_as(x)

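# local_att is the positional attention branch of CPAM: mean-pooling along the horizontal
# and vertical axes, a shared 1x1 convolution, then separate sigmoid gates for height and width.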
class local_att(nn.Module):
    def __init__(self, channel, reduction=16):
        super(local_att, self).__init__()
 
        self.conv_1x1 = nn.Conv2d(in_channels=channel, out_channels=channel // reduction, kernel_size=1, stride=1,
                                  bias=False)
 
        self.relu = nn.ReLU()
        self.bn = nn.BatchNorm2d(channel // reduction)
 
        self.F_h = nn.Conv2d(in_channels=channel // reduction, out_channels=channel, kernel_size=1, stride=1,
                             bias=False)
        self.F_w = nn.Conv2d(in_channels=channel // reduction, out_channels=channel, kernel_size=1, stride=1,
                             bias=False)
 
        self.sigmoid_h = nn.Sigmoid()
        self.sigmoid_w = nn.Sigmoid()
 
    def forward(self, x):
        _, _, h, w = x.size()
 
        x_h = torch.mean(x, dim=3, keepdim=True).permute(0, 1, 3, 2)
        x_w = torch.mean(x, dim=2, keepdim=True)
 
        x_cat_conv_relu = self.relu(self.bn(self.conv_1x1(torch.cat((x_h, x_w), 3))))
 
        x_cat_conv_split_h, x_cat_conv_split_w = x_cat_conv_relu.split([h, w], 3)
 
        s_h = self.sigmoid_h(self.F_h(x_cat_conv_split_h.permute(0, 1, 3, 2)))
        s_w = self.sigmoid_w(self.F_w(x_cat_conv_split_w))
 
        out = x * s_h.expand_as(x) * s_w.expand_as(x)
        return out

class attention_model(nn.Module):
    # CPAM: channel attention on input 1, add input 2, then positional attention
    def __init__(self, ch=256):
        super().__init__()
        self.channel_att = channel_att(ch)
        self.local_att = local_att(ch)
 
    def forward(self, x):
        input1, input2 = x[0], x[1]
        input1 = self.channel_att(input1)
        x = input1 + input2
        x = self.local_att(x)
        return x


5. Modification Steps

RepViT configuration steps:

ASF-YOLO configuration steps:


6. YAML Model File

6.1 Model improvement ⭐

After the code has been set up, configure the model's YAML file.

Taking ultralytics/cfg/models/rt-detr/rtdetr-l.yaml as an example, create a model file for training on your own dataset in the same directory and name it rtdetr-RepVit-ASF.yaml.

Copy the contents of rtdetr-l.yaml into rtdetr-RepVit-ASF.yaml and change nc to the number of object classes in your dataset.

📌 Resulting fused configuration:

# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-l object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/models/rtdetr

# Parameters
nc: 1  # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
  # [depth, width, max_channels]
  l: [1.00, 1.00, 1024]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, repvit_m0_9, []]  # 4

head:
  - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]]  # 5 input_proj.2
  - [-1, 1, AIFI, [1024, 8]] # 6
  - [-1, 1, Conv, [256, 1, 1]]  # 7, Y5, lateral_convs.0

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 8
  - [2, 1, Conv, [256, 1, 1, None, 1, 1, False]]  # 9 input_proj.1
  - [[-1, 3, -2], 1, Zoom_cat, []] # 10
  - [-1, 3, RepC3, [256]]  # 11, fpn_blocks.0
  - [-1, 1, Conv, [256, 1, 1]]   # 12, Y4, lateral_convs.1

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 13
  - [1, 1, Conv, [256, 1, 1, None, 1, 1, False]]  # 14 input_proj.0
  - [[-1, 2, -2], 1, Zoom_cat, []]  # 15 cat backbone P4
  - [-1, 3, RepC3, [256]]    # X3 (16), fpn_blocks.1

  - [-1, 1, Conv, [256, 3, 2]]   # 17, downsample_convs.0
  - [[-1, 12], 1, Concat, [1]]  # 18 cat Y4
  - [-1, 3, RepC3, [256]]    # F4 (19), pan_blocks.0

  - [-1, 1, Conv, [256, 3, 2]]   # 20, downsample_convs.1
  - [[-1, 7], 1, Concat, [1]]  # 21 cat Y5
  - [-1, 3, RepC3, [256]]    # F5 (22), pan_blocks.1

  - [[2, 3, 5], 1, ScalSeq, [256]] # 23
  - [[16, -1], 1, Add, []] # 24
  # - [[16, -1], 1, asf_attention_model, []] # 24

  - [[23, 19, 22], 1, RTDETRDecoder, [nc, 256, 300, 4, 8, 3]]  # Detect(P3, P4, P5)
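
Assuming the new modules have been registered with the framework as outlined in Section 5, training with this configuration follows the usual Ultralytics workflow; the data path and hyper-parameters below are placeholders for your own setup:

from ultralytics import RTDETR

# Illustrative launch only; adjust data, epochs, image size and batch to your dataset.
model = RTDETR('ultralytics/cfg/models/rt-detr/rtdetr-RepVit-ASF.yaml')
model.train(data='your_dataset.yaml', epochs=100, imgsz=640, batch=8)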


7. Successful Run Results

Printing the network model confirms that RepViT and ASF have been added to the model and that training can proceed.


                   from  n    params  module                                       arguments                     
  0                  -1  1   4717792  repvit_m0_9                                  []                            
  1                  -1  1     98816  ultralytics.nn.modules.conv.Conv             [384, 256, 1, 1, None, 1, 1, False]
  2                  -1  1    789760  ultralytics.nn.modules.transformer.AIFI      [256, 1024, 8]                
  3                  -1  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]              
  4                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
  5                   2  1     25088  ultralytics.nn.modules.conv.Conv             [96, 256, 1, 1, None, 1, 1, False]
  6         [-1, 3, -2]  1         0  ultralytics.nn.AddModules.ASF.Zoom_cat       []                            
  7                  -1  3   2330624  ultralytics.nn.modules.block.RepC3           [704, 256, 3]                 
  8                  -1  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]              
  9                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 10                   1  1     12800  ultralytics.nn.modules.conv.Conv             [48, 256, 1, 1, None, 1, 1, False]
 11         [-1, 2, -2]  1         0  ultralytics.nn.AddModules.ASF.Zoom_cat       []                            
 12                  -1  3   2281472  ultralytics.nn.modules.block.RepC3           [608, 256, 3]                 
 13                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 14            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 15                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]                 
 16                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 17             [-1, 7]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 18                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]                 
 19           [2, 3, 5]  1    207104  ultralytics.nn.AddModules.ASF.ScalSeq        [[96, 192, 256], 256]         
 20            [16, -1]  1         0  ultralytics.nn.AddModules.HSFPN.Add          []                            
 21        [23, 19, 22]  1   3917684  ultralytics.nn.modules.head.RTDETRDecoder    [1, [256, 256, 256], 256, 300, 4, 8, 3]
rtdetr-RepVit-ASF summary: 839 layers, 20,158,548 parameters, 20,158,548 gradients, 74.0 GFLOPs