
RT-DETR Improvement Strategies [Neck] | NeurIPS 2023: Fusing the GOLD-YOLO Neck Structure to Strengthen Small-Object Detection

1. Introduction

This article uses the neck structure of GOLD-YOLO to improve the RT-DETR network. The GD mechanism in the GOLD-YOLO neck draws on the idea of global information fusion: through its dedicated module design it efficiently fuses features from different levels without significantly increasing latency. Applied to RT-DETR, it lets the model integrate multi-scale features more effectively, reduce information loss, and strengthen the feature representation of objects of different sizes, thereby improving detection accuracy and localization precision in complex scenes.



2. Introduction to GOLD-YOLO

Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism

The GOLD-YOLO neck is designed to address the shortcomings of traditional information-fusion approaches and improve model performance. Below it is introduced from four aspects: motivation, structural principle, module composition, and advantages.

2.1 Motivation

Traditional YOLO necks adopt PAFPN, which loses information when fusing across levels.

For example, for level-1 to obtain level-3 information, the level-2 and level-3 features must first be fused, so the interaction can only pass on the information selected by the intermediate level; anything not selected is discarded in transit, limiting the overall fusion quality. The Gather-and-Distribute (GD) mechanism was built to avoid this information loss.

2.2 Structural Principle

The GD mechanism gathers and fuses information from all levels through a unified module and then distributes it back to the different levels. The process is realized by three modules: the Feature Alignment Module (FAM), the Information Fusion Module (IFM), and the Information Injection Module (Inject).

Specifically, FAM collects and aligns the features of all levels; IFM fuses the aligned features to generate global information; and the Inject module distributes the global information to each level, injecting it via a simple attention operation to strengthen each branch's detection capability. To improve detection of objects of different sizes, two branches were developed, a low-stage one (Low-GD) and a high-stage one (High-GD), which extract and fuse large and small feature maps respectively. The neck's inputs are the backbone feature maps B2, B3, B4, and B5.
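
To make the gather-and-distribute flow concrete, below is a minimal runnable sketch; it is a schematic only, with 1×1-conv stand-ins for FAM/IFM/Inject and illustrative channel sizes (the paper's actual blocks appear in Section 3):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGD(nn.Module):
    # Schematic gather-and-distribute over a list of multi-scale feature maps
    def __init__(self, in_chs, mid_ch=64):
        super().__init__()
        self.fuse = nn.Conv2d(sum(in_chs), mid_ch, 1)                         # IFM stand-in
        self.inject = nn.ModuleList(nn.Conv2d(mid_ch, c, 1) for c in in_chs)  # Inject stand-ins

    def forward(self, feats):
        size = feats[-1].shape[-2:]
        # gather: align every level to one resolution (FAM), concatenate, fuse globally
        g = self.fuse(torch.cat([F.adaptive_avg_pool2d(f, size) for f in feats], dim=1))
        outs = []
        for f, proj in zip(feats, self.inject):
            w = F.interpolate(proj(g), size=f.shape[-2:], mode='bilinear', align_corners=False)
            outs.append(f * torch.sigmoid(w) + w)  # distribute: attention-gated injection
        return outs

feats = [torch.randn(1, c, s, s) for c, s in [(64, 80), (128, 40), (256, 20)]]
outs = TinyGD([64, 128, 256])(feats)  # per-level shapes are preserved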

2.3 Module Composition

  • Low-stage gather-and-distribute branch (Low-GD): selects the backbone output features B2, B3, B4, and B5 and fuses them to obtain high-resolution features that retain small-object information.
    • It contains the low-stage feature alignment module (Low-FAM), which uses average pooling to downsample the input features to a unified size (with $R_{B4} = \frac{1}{4}R$ as the target alignment size);
    • the low-stage information fusion module (Low-IFM), which consists of multi-layer reparameterized convolution blocks (RepBlock) and a split operation;
    • and the information injection module, which draws on practice from segmentation models and uses an attention operation to efficiently inject the global information into the different levels, as illustrated in the paper.


  • High-stage gather-and-distribute branch (High-GD): fuses the {P3, P4, P5} features generated by Low-GD.
    • It includes the high-stage feature alignment module (High-FAM), which uses average pooling to reduce the input features to a unified size (with $R_{P5} = \frac{1}{8}R$ as the target);
    • the high-stage information fusion module (High-IFM), which consists of Transformer blocks and a split operation;
    • and an information injection module identical to the one in Low-GD, as illustrated in the paper.


  • Enhanced cross-layer information flow module (Inject-LAF): inspired by the PAFPN module, it augments the information injection module with a lightweight adjacent-layer fusion (LAF) module.
    • A low-level LAF variant and a high-level LAF variant were designed, used for low-level injection (fusing the two adjacent levels) and high-level injection (fusing the one adjacent level) respectively.
    • Using only bilinear interpolation, average pooling, and 1×1 convolutions, the module adds information-flow paths between levels without significantly increasing latency, balancing accuracy and speed; a sketch of the idea follows this list.
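
The following hedged sketch illustrates the low-level LAF idea, assuming equal channel counts across levels (the real module's channel handling differs; only the cheap-op fusion pattern is shown):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLAF(nn.Module):
    # Illustrative low-level LAF: fuse a level with its two neighbours using only
    # average pooling, bilinear upsampling and a 1x1 conv
    def __init__(self, ch):
        super().__init__()
        self.proj = nn.Conv2d(ch * 3, ch, 1)

    def forward(self, x_high_res, x, x_low_res):
        size = x.shape[-2:]
        down = F.adaptive_avg_pool2d(x_high_res, size)  # finer neighbour, pooled down
        up = F.interpolate(x_low_res, size=size, mode='bilinear', align_corners=False)  # coarser neighbour, upsampled
        return self.proj(torch.cat([down, x, up], dim=1))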


2.4 Advantages

Through the GD mechanism and the Inject-LAF module, the GOLD-YOLO neck effectively improves information-fusion capability and avoids the information loss of traditional FPN structures. Experiments show that, across model sizes, this neck improves detection performance on objects of different sizes without significantly increasing latency.

Paper: https://arxiv.org/abs/2309.11331v4
Code: https://github.com/huawei-noah/Efficient-Computing/tree/master/Detection/Gold-YOLO

3. GOLD-YOLO Implementation Code

The implementation code of the GOLD-YOLO modules is as follows:

import numpy as np
import torch
import torch.nn as nn  
import torch.nn.functional as F

def autopad(k, p=None, d=1):
    """
    Pads kernel to 'same' output shape, adjusting for optional dilation; returns padding size.
    `k`: kernel, `p`: padding, `d`: dilation.
    """
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p

class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    default_act = nn.SiLU()  # default activation
 
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        """Initializes a standard convolution layer with optional batch normalization and activation."""
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()
 
    def forward(self, x):
        """Applies a convolution followed by batch normalization and an activation function to the input tensor `x`."""
        return self.act(self.bn(self.conv(x)))
 
    def forward_fuse(self, x):
        """Applies a fused convolution and activation function to the input tensor `x`."""
        return self.act(self.conv(x))

def conv_bn(in_channels, out_channels, kernel_size, stride, padding, groups=1, bias=False):
    '''Basic cell for rep-style block, including conv and bn'''
    result = nn.Sequential()
    result.add_module('conv', nn.Conv2d(in_channels=in_channels, out_channels=out_channels,
                                        kernel_size=kernel_size, stride=stride, padding=padding, groups=groups,
                                        bias=bias))
    result.add_module('bn', nn.BatchNorm2d(num_features=out_channels))
    return result

class RepVGGBlock(nn.Module):
    '''RepVGGBlock is a basic rep-style block, including training and deploy status
    This code is based on https://github.com/DingXiaoH/RepVGG/blob/main/repvgg.py
    '''
    
    def __init__(self, in_channels, out_channels, kernel_size=3,
                 stride=1, padding=1, dilation=1, groups=1, padding_mode='zeros', deploy=False, use_se=False):
        super(RepVGGBlock, self).__init__()
        """ Initialization of the class.
        Args:
            in_channels (int): Number of channels in the input image
            out_channels (int): Number of channels produced by the convolution
            kernel_size (int or tuple): Size of the convolving kernel
            stride (int or tuple, optional): Stride of the convolution. Default: 1
            padding (int or tuple, optional): Zero-padding added to both sides of
                the input. Default: 1
            dilation (int or tuple, optional): Spacing between kernel elements. Default: 1
            groups (int, optional): Number of blocked connections from input
                channels to output channels. Default: 1
            padding_mode (string, optional): Default: 'zeros'
            deploy: Whether to be deploy status or training status. Default: False
            use_se: Whether to use se. Default: False
        """
        self.deploy = deploy
        self.groups = groups
        self.in_channels = in_channels
        self.out_channels = out_channels
        
        assert kernel_size == 3
        assert padding == 1
        
        padding_11 = padding - kernel_size // 2
        
        self.nonlinearity = nn.ReLU()
        
        if use_se:
            raise NotImplementedError("se block not supported yet")
        else:
            self.se = nn.Identity()
        
        if deploy:
            self.rbr_reparam = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size,
                                         stride=stride,
                                         padding=padding, dilation=dilation, groups=groups, bias=True,
                                         padding_mode=padding_mode)
        
        else:
            self.rbr_identity = nn.BatchNorm2d(
                    num_features=in_channels) if out_channels == in_channels and stride == 1 else None
            self.rbr_dense = conv_bn(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size,
                                     stride=stride, padding=padding, groups=groups)
            self.rbr_1x1 = conv_bn(in_channels=in_channels, out_channels=out_channels, kernel_size=1, stride=stride,
                                   padding=padding_11, groups=groups)
    
    def forward(self, inputs):
        '''Forward process'''
        if hasattr(self, 'rbr_reparam'):
            return self.nonlinearity(self.se(self.rbr_reparam(inputs)))
        
        if self.rbr_identity is None:
            id_out = 0
        else:
            id_out = self.rbr_identity(inputs)
        
        return self.nonlinearity(self.se(self.rbr_dense(inputs) + self.rbr_1x1(inputs) + id_out))
    
    def get_equivalent_kernel_bias(self):
        kernel3x3, bias3x3 = self._fuse_bn_tensor(self.rbr_dense)
        kernel1x1, bias1x1 = self._fuse_bn_tensor(self.rbr_1x1)
        kernelid, biasid = self._fuse_bn_tensor(self.rbr_identity)
        return kernel3x3 + self._pad_1x1_to_3x3_tensor(kernel1x1) + kernelid, bias3x3 + bias1x1 + biasid
    
    def _pad_1x1_to_3x3_tensor(self, kernel1x1):
        if kernel1x1 is None:
            return 0
        else:
            return torch.nn.functional.pad(kernel1x1, [1, 1, 1, 1])
    
    def _fuse_bn_tensor(self, branch):
        if branch is None:
            return 0, 0
        if isinstance(branch, nn.Sequential):
            kernel = branch.conv.weight
            running_mean = branch.bn.running_mean
            running_var = branch.bn.running_var
            gamma = branch.bn.weight
            beta = branch.bn.bias
            eps = branch.bn.eps
        else:
            assert isinstance(branch, nn.BatchNorm2d)
            if not hasattr(self, 'id_tensor'):
                input_dim = self.in_channels // self.groups
                kernel_value = np.zeros((self.in_channels, input_dim, 3, 3), dtype=np.float32)
                for i in range(self.in_channels):
                    kernel_value[i, i % input_dim, 1, 1] = 1
                self.id_tensor = torch.from_numpy(kernel_value).to(branch.weight.device)
            kernel = self.id_tensor
            running_mean = branch.running_mean
            running_var = branch.running_var
            gamma = branch.weight
            beta = branch.bias
            eps = branch.eps
        std = (running_var + eps).sqrt()
        t = (gamma / std).reshape(-1, 1, 1, 1)
        return kernel * t, beta - running_mean * gamma / std
    
    def switch_to_deploy(self):
        if hasattr(self, 'rbr_reparam'):
            return
        kernel, bias = self.get_equivalent_kernel_bias()
        self.rbr_reparam = nn.Conv2d(in_channels=self.rbr_dense.conv.in_channels,
                                     out_channels=self.rbr_dense.conv.out_channels,
                                     kernel_size=self.rbr_dense.conv.kernel_size, stride=self.rbr_dense.conv.stride,
                                     padding=self.rbr_dense.conv.padding, dilation=self.rbr_dense.conv.dilation,
                                     groups=self.rbr_dense.conv.groups, bias=True)
        self.rbr_reparam.weight.data = kernel
        self.rbr_reparam.bias.data = bias
        for para in self.parameters():
            para.detach_()
        self.__delattr__('rbr_dense')
        self.__delattr__('rbr_1x1')
        if hasattr(self, 'rbr_identity'):
            self.__delattr__('rbr_identity')
        if hasattr(self, 'id_tensor'):
            self.__delattr__('id_tensor')
        self.deploy = True

def onnx_AdaptiveAvgPool2d(x, output_size):
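    """ONNX-exportable stand-in for F.adaptive_avg_pool2d: derives a fixed
    kernel/stride from the input and output sizes and applies nn.AvgPool2d."""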
    stride_size = np.floor(np.array(x.shape[-2:]) / output_size).astype(np.int32)
    kernel_size = np.array(x.shape[-2:]) - (output_size - 1) * stride_size
    avg = nn.AvgPool2d(kernel_size=list(kernel_size), stride=list(stride_size))
    x = avg(x)
    return x

def get_avg_pool():
    if torch.onnx.is_in_onnx_export():
        avg_pool = onnx_AdaptiveAvgPool2d
    else:
        avg_pool = nn.functional.adaptive_avg_pool2d
    return avg_pool

class SimFusion_3in(nn.Module):
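    """Fuses three adjacent levels at the middle level's resolution: the larger map
    is average-pooled down, the smaller map is bilinearly upsampled, channels are
    aligned with 1x1 convs, and the result is fused by concatenation + 1x1 conv."""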
    def __init__(self, in_channel_list, out_channels):
        super().__init__()
        self.cv1 = Conv(in_channel_list[0], out_channels, act=nn.ReLU()) if in_channel_list[0] != out_channels else nn.Identity()
        self.cv2 = Conv(in_channel_list[1], out_channels, act=nn.ReLU()) if in_channel_list[1] != out_channels else nn.Identity()
        self.cv3 = Conv(in_channel_list[2], out_channels, act=nn.ReLU()) if in_channel_list[2] != out_channels else nn.Identity()
        self.cv_fuse = Conv(out_channels * 3, out_channels, act=nn.ReLU())
        self.downsample = nn.functional.adaptive_avg_pool2d
    
    def forward(self, x):
        N, C, H, W = x[1].shape
        output_size = (H, W)
        
        if torch.onnx.is_in_onnx_export():
            self.downsample = onnx_AdaptiveAvgPool2d
            output_size = np.array([H, W])
        
        x0 = self.cv1(self.downsample(x[0], output_size))
        x1 = self.cv2(x[1])
        x2 = self.cv3(F.interpolate(x[2], size=(H, W), mode='bilinear', align_corners=False))
        return self.cv_fuse(torch.cat((x0, x1, x2), dim=1))

class SimFusion_4in(nn.Module):
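    """Low-FAM: aligns the four backbone levels (B2, B3, B4, B5) to the B4
    resolution via average pooling / bilinear upsampling, then concatenates them."""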
    def __init__(self):
        super().__init__()
        self.avg_pool = nn.functional.adaptive_avg_pool2d
    
    def forward(self, x):
        x_l, x_m, x_s, x_n = x
        B, C, H, W = x_s.shape
        output_size = np.array([H, W])
        
        if torch.onnx.is_in_onnx_export():
            self.avg_pool = onnx_AdaptiveAvgPool2d
        
        x_l = self.avg_pool(x_l, output_size)
        x_m = self.avg_pool(x_m, output_size)
        x_n = F.interpolate(x_n, size=(H, W), mode='bilinear', align_corners=False)
        
        out = torch.cat([x_l, x_m, x_s, x_n], 1)
        return out

class IFM(nn.Module):
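    """Low-IFM: 1x1 conv embedding, a stack of RepVGG blocks, then a 1x1 conv that
    expands to sum(ouc) channels so the output can later be split per level."""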
    def __init__(self, inc, ouc, embed_dim_p=96, fuse_block_num=3) -> None:
        super().__init__()
        
        self.conv = nn.Sequential(
            Conv(inc, embed_dim_p),
            *[RepVGGBlock(embed_dim_p, embed_dim_p) for _ in range(fuse_block_num)],
            Conv(embed_dim_p, sum(ouc))
        )
    
    def forward(self, x):
        return self.conv(x)

class h_sigmoid(nn.Module):
    def __init__(self, inplace=True):
        super(h_sigmoid, self).__init__()
        self.relu = nn.ReLU6(inplace=inplace)
    
    def forward(self, x):
        return self.relu(x + 3) / 6

class InjectionMultiSum_Auto_pool(nn.Module):
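    """Information injection: slices the global feature by `global_inp`, gates the
    local embedding with a sigmoid-activated global attention map, and adds the
    resized global embedding on top."""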
    def __init__(
            self,
            inp: int,
            oup: int,
            global_inp: list,
            flag: int
    ) -> None:
        super().__init__()
        self.global_inp = global_inp
        self.flag = flag
        self.local_embedding = Conv(inp, oup, 1, act=False)
        self.global_embedding = Conv(global_inp[self.flag], oup, 1, act=False)
        self.global_act = Conv(global_inp[self.flag], oup, 1, act=False)
        self.act = h_sigmoid()
    
    def forward(self, x):
        '''
        x_g: global features
        x_l: local features
        '''
        x_l, x_g = x
        B, C, H, W = x_l.shape
        g_B, g_C, g_H, g_W = x_g.shape
        use_pool = H < g_H
        
        global_info = x_g.split(self.global_inp, dim=1)[self.flag]
        
        local_feat = self.local_embedding(x_l)
        
        global_act = self.global_act(global_info)
        global_feat = self.global_embedding(global_info)
        
        if use_pool:
            avg_pool = get_avg_pool()
            output_size = np.array([H, W])
            
            sig_act = avg_pool(global_act, output_size)
            global_feat = avg_pool(global_feat, output_size)
        
        else:
            sig_act = F.interpolate(self.act(global_act), size=(H, W), mode='bilinear', align_corners=False)
            global_feat = F.interpolate(global_feat, size=(H, W), mode='bilinear', align_corners=False)
        
        out = local_feat * sig_act + global_feat
        return out

def get_shape(tensor):
    shape = tensor.shape
    if torch.onnx.is_in_onnx_export():
        shape = [i.cpu().numpy() for i in shape]
    return shape

class PyramidPoolAgg(nn.Module):
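    """High-FAM: average-pools every input level to roughly 1/stride of the last
    (smallest) map's size, concatenates them, and projects with a 1x1 conv."""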
    def __init__(self, inc, ouc, stride, pool_mode='torch'):
        super().__init__()
        self.stride = stride
        if pool_mode == 'torch':
            self.pool = nn.functional.adaptive_avg_pool2d
        elif pool_mode == 'onnx':
            self.pool = onnx_AdaptiveAvgPool2d
        self.conv = Conv(inc, ouc)
    
    def forward(self, inputs):
        B, C, H, W = get_shape(inputs[-1])
        H = (H - 1) // self.stride + 1
        W = (W - 1) // self.stride + 1
        
        output_size = np.array([H, W])
        
        if not hasattr(self, 'pool'):
            self.pool = nn.functional.adaptive_avg_pool2d
        
        if torch.onnx.is_in_onnx_export():
            self.pool = onnx_AdaptiveAvgPool2d
        
        out = [self.pool(inp, output_size) for inp in inputs]
        
        return self.conv(torch.cat(out, dim=1))

def drop_path(x, drop_prob: float = 0., training: bool = False):
    """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
    This is the same as the DropConnect impl I created for EfficientNet, etc networks, however,
    the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
    See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for
    changing the layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use
    'survival rate' as the argument.
    """
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    random_tensor.floor_()  # binarize
    output = x.div(keep_prob) * random_tensor
    return output

class Mlp(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = Conv(in_features, hidden_features, act=False)
        self.dwconv = nn.Conv2d(hidden_features, hidden_features, 3, 1, 1, bias=True, groups=hidden_features)
        self.act = nn.ReLU6()
        self.fc2 = Conv(hidden_features, out_features, act=False)
        self.drop = nn.Dropout(drop)
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.dwconv(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x

class DropPath(nn.Module):
    """Drop paths (Stochastic Depth) per sample  (when applied in main path of residual blocks).
    """
    
    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob
    
    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)

class GOLDYOLO_Attention(torch.nn.Module):
    def __init__(self, dim, key_dim, num_heads, attn_ratio=4):
        super().__init__()
        self.num_heads = num_heads
        self.scale = key_dim ** -0.5
        self.key_dim = key_dim
        self.nh_kd = nh_kd = key_dim * num_heads  # num_head key_dim
        self.d = int(attn_ratio * key_dim)
        self.dh = int(attn_ratio * key_dim) * num_heads
        self.attn_ratio = attn_ratio
        
        self.to_q = Conv(dim, nh_kd, 1, act=False)
        self.to_k = Conv(dim, nh_kd, 1, act=False)
        self.to_v = Conv(dim, self.dh, 1, act=False)
        
        self.proj = torch.nn.Sequential(nn.ReLU6(), Conv(self.dh, dim, act=False))
    
    def forward(self, x):  # x: (B, C, H, W)
        B, C, H, W = get_shape(x)
        
        qq = self.to_q(x).reshape(B, self.num_heads, self.key_dim, H * W).permute(0, 1, 3, 2)
        kk = self.to_k(x).reshape(B, self.num_heads, self.key_dim, H * W)
        vv = self.to_v(x).reshape(B, self.num_heads, self.d, H * W).permute(0, 1, 3, 2)
        
        attn = torch.matmul(qq, kk)
        attn = attn.softmax(dim=-1)  # dim = k
        
        xx = torch.matmul(attn, vv)
        
        xx = xx.permute(0, 1, 3, 2).reshape(B, self.dh, H, W)
        xx = self.proj(xx)
        return xx

class top_Block(nn.Module):
    
    def __init__(self, dim, key_dim, num_heads, mlp_ratio=4., attn_ratio=2., drop=0.,
                 drop_path=0.):
        super().__init__()
        self.dim = dim
        self.num_heads = num_heads
        self.mlp_ratio = mlp_ratio
        
        self.attn = GOLDYOLO_Attention(dim, key_dim=key_dim, num_heads=num_heads, attn_ratio=attn_ratio)
        
        # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, drop=drop)
    
    def forward(self, x1):
        x1 = x1 + self.drop_path(self.attn(x1))
        x1 = x1 + self.drop_path(self.mlp(x1))
        return x1

class TopBasicLayer(nn.Module):
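    """High-IFM: a stack of transformer blocks followed by a 1x1 conv that expands
    to sum(ouc_list) channels for later per-level splitting."""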
    def __init__(self, embedding_dim, ouc_list, block_num=2, key_dim=8, num_heads=4,
                 mlp_ratio=4., attn_ratio=2., drop=0., attn_drop=0., drop_path=0.):
        super().__init__()
        self.block_num = block_num
        
        self.transformer_blocks = nn.ModuleList()
        for i in range(self.block_num):
            self.transformer_blocks.append(top_Block(
                    embedding_dim, key_dim=key_dim, num_heads=num_heads,
                    mlp_ratio=mlp_ratio, attn_ratio=attn_ratio,
                    drop=drop, drop_path=drop_path[i] if isinstance(drop_path, list) else drop_path))
        self.conv = nn.Conv2d(embedding_dim, sum(ouc_list), 1)
        
    def forward(self, x):
        # run the stacked transformer blocks, then project to sum(ouc_list) channels
        for i in range(self.block_num):
            x = self.transformer_blocks[i](x)
        return self.conv(x)

class AdvPoolFusion(nn.Module):
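    """Average-pools x1 down to x2's spatial size and concatenates the two along
    the channel dimension."""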
    def forward(self, x):
        x1, x2 = x
        if torch.onnx.is_in_onnx_export():
            self.pool = onnx_AdaptiveAvgPool2d
        else:
            self.pool = nn.functional.adaptive_avg_pool2d
        
        N, C, H, W = x2.shape
        output_size = np.array([H, W])
        x1 = self.pool(x1, output_size)
        
        return torch.cat([x1, x2], 1)
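
As a quick sanity check, the blocks above can be wired together on dummy tensors; the channel sizes below mirror the rtdetr-l backbone from Section 5 (strides 4/8/16/32 on a 640×640 input), but are otherwise just an illustrative sketch:

if __name__ == '__main__':
    # Dummy B2..B5 feature maps matching the rtdetr-l backbone channels
    b2, b3, b4, b5 = (torch.randn(1, c, s, s) for c, s in
                      [(128, 160), (512, 80), (1024, 40), (2048, 20)])
    x = SimFusion_4in()((b2, b3, b4, b5))     # align to B4 size and concat -> (1, 3712, 40, 40)
    g = IFM(x.shape[1], [64, 32])(x)          # global info, 64 + 32 channels
    inj = InjectionMultiSum_Auto_pool(512, 256, [64, 32], 0)
    p3 = inj([b3, g])                         # inject slice 0 of g into B3 -> (1, 256, 80, 80)
    print(x.shape, g.shape, p3.shape)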


4. Integration Steps

4.1 Modification 1

① Create an AddModules folder under the ultralytics/nn/ directory to hold the module code.

② Create GoldYolo.py in the AddModules folder and paste the code from Section 3 into it.


4.2 Modification 2

AddModules 文件夹下新建 __init__.py (已有则不用新建),在文件内导入模块: from .GoldYolo import *


4.3 Modification 3

ultralytics/nn/modules/tasks.py 文件中,需要在两处位置添加各模块类名称。

First, import the modules:

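For example (this exact line is an assumption based on the folder layout from Section 4.1; adjust it to where you placed the file):

from ultralytics.nn.AddModules import *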

Then, add the following code in the parse_model function to register the modules:

elif m is IFM:
    c1 = ch[f]
    c2 = sum(args[0])
    args = [c1, *args]
elif m is InjectionMultiSum_Auto_pool:
    c1 = ch[f[0]]
    c2 = args[0]
    args = [c1, *args]
elif m is PyramidPoolAgg:
    c2 = args[0]
    args = [sum([ch[f_] for f_ in f]), *args]
elif m is TopBasicLayer:
    c2 = sum(args[1])
elif m in {SimFusion_4in, AdvPoolFusion}:
    c2 = sum(ch[x] for x in f)
elif m is SimFusion_3in:
    c2 = args[0]
    if c2 != nc:  
        c2 = make_divisible(min(c2, max_channels) * width, 8)
    args = [[ch[f_] for f_ in f], c2]



5. YAML Model File

5.1 Improved Model Version ⭐

Taking ultralytics/cfg/models/rt-detr/rtdetr-l.yaml as an example, create a model file for training on your own dataset, rtdetr-l-GoldYolo.yaml, in the same directory.

rtdetr-l.yaml 中的内容复制到 rtdetr-l-GoldYolo.yaml 文件下,修改 nc 数量等于自己数据中目标的数量。

📌 The modification replaces the neck network with the Gold-YOLO structure.

# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-l object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/models/rtdetr

# Parameters
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
  # [depth, width, max_channels]
  l: [1.00, 1.00, 1024]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, HGStem, [32, 48]] # 0-P2/4
  - [-1, 6, HGBlock, [48, 128, 3]] # stage 1

  - [-1, 1, DWConv, [128, 3, 2, 1, False]] # 2-P3/8
  - [-1, 6, HGBlock, [96, 512, 3]] # stage 2

  - [-1, 1, DWConv, [512, 3, 2, 1, False]] # 4-P4/16
  - [-1, 6, HGBlock, [192, 1024, 5, True, False]] # cm, c2, k, light, shortcut
  - [-1, 6, HGBlock, [192, 1024, 5, True, True]]
  - [-1, 6, HGBlock, [192, 1024, 5, True, True]] # stage 3

  - [-1, 1, DWConv, [1024, 3, 2, 1, False]] # 8-P5/32
  - [-1, 6, HGBlock, [384, 2048, 5, True, False]] # stage 4

head:
  - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]]  # 10 input_proj.2
  - [-1, 1, AIFI, [1024, 8]] # 11
  - [-1, 1, Conv, [256, 1, 1]]  # 12, Y5, lateral_convs.0

  - [[1, 3, 7, 9], 1, SimFusion_4in, []] # 13
  - [-1, 1, IFM, [[64, 32]]] # 14

  - [12, 1, Conv, [256, 1, 1]] # 15
  - [[3, 7, -1], 1, SimFusion_3in, [256]] # 16
  - [[-1, 14], 1, InjectionMultiSum_Auto_pool, [256, [64, 32], 0]] # 17
  - [-1, 3, RepC3, [256]] # 18

  - [7, 1, Conv, [256, 1, 1]] # 19
  - [[1, 3, -1], 1, SimFusion_3in, [256]] # 20
  - [[-1, 14], 1, InjectionMultiSum_Auto_pool, [256, [64, 32], 1]] # 21
  - [-1, 3, RepC3, [256]] # 22

  - [[18, 18, 12], 1, PyramidPoolAgg, [352, 2]] # 23
  - [-1, 1, TopBasicLayer, [352, [64, 128]]] # 24

  - [[22, 19], 1, AdvPoolFusion, []] # 25
  - [[-1, 24], 1, InjectionMultiSum_Auto_pool, [256, [64, 128], 0]] # 26
  - [-1, 3, RepC3, [256]] # 27

  - [[-1, 15], 1, AdvPoolFusion, []] # 28
  - [[-1, 24], 1, InjectionMultiSum_Auto_pool, [256, [64, 128], 1]] # 29
  - [-1, 3, RepC3, [256]] # 30

  - [[22, 27, 30], 1, RTDETRDecoder, [nc]] # 31


6. Successful Run Results

Printing the network model shows that Gold-YOLO has been added to the model, and training can proceed.
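
A minimal sketch to reproduce this printout (the model yaml path follows Section 5; the dataset yaml is a placeholder for your own config):

from ultralytics import RTDETR

model = RTDETR('ultralytics/cfg/models/rt-detr/rtdetr-l-GoldYolo.yaml')  # yaml from Section 5
model.info()  # prints the layer table below
model.train(data='your_dataset.yaml', epochs=100, imgsz=640)  # 'your_dataset.yaml' is a placeholder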


                   from  n    params  module                                       arguments                     
  0                  -1  1     25248  ultralytics.nn.modules.block.HGStem          [3, 32, 48]                   
  1                  -1  6    155072  ultralytics.nn.modules.block.HGBlock         [48, 48, 128, 3, 6]           
  2                  -1  1      1408  ultralytics.nn.modules.conv.DWConv           [128, 128, 3, 2, 1, False]    
  3                  -1  6    839296  ultralytics.nn.modules.block.HGBlock         [128, 96, 512, 3, 6]          
  4                  -1  1      5632  ultralytics.nn.modules.conv.DWConv           [512, 512, 3, 2, 1, False]    
  5                  -1  6   1695360  ultralytics.nn.modules.block.HGBlock         [512, 192, 1024, 5, 6, True, False]
  6                  -1  6   2055808  ultralytics.nn.modules.block.HGBlock         [1024, 192, 1024, 5, 6, True, True]
  7                  -1  6   2055808  ultralytics.nn.modules.block.HGBlock         [1024, 192, 1024, 5, 6, True, True]
  8                  -1  1     11264  ultralytics.nn.modules.conv.DWConv           [1024, 1024, 3, 2, 1, False]  
  9                  -1  6   6708480  ultralytics.nn.modules.block.HGBlock         [1024, 384, 2048, 5, 6, True, False]
 10                  -1  1    524800  ultralytics.nn.modules.conv.Conv             [2048, 256, 1, 1, None, 1, 1, False]
 11                  -1  1    789760  ultralytics.nn.modules.transformer.AIFI      [256, 1024, 8]                
 12                  -1  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]              
 13        [1, 3, 7, 9]  1         0  ultralytics.nn.AddModules.GoldYolo.SimFusion_4in[]                            
 14                  -1  1    644160  ultralytics.nn.AddModules.GoldYolo.IFM       [3712, [64, 32]]              
 15                  12  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]              
 16          [3, 7, -1]  1    591360  ultralytics.nn.AddModules.GoldYolo.SimFusion_3in[[512, 1024, 256], 256]       
 17            [-1, 14]  1     99840  ultralytics.nn.AddModules.GoldYolo.InjectionMultiSum_Auto_pool[256, 256, [64, 32], 0]       
 18                  -1  3   2101248  ultralytics.nn.modules.block.RepC3           [256, 256, 3]                 
 19                   7  1    262656  ultralytics.nn.modules.conv.Conv             [1024, 256, 1, 1]             
 20          [1, 3, -1]  1    361984  ultralytics.nn.AddModules.GoldYolo.SimFusion_3in[[128, 512, 256], 256]        
 21            [-1, 14]  1     83456  ultralytics.nn.AddModules.GoldYolo.InjectionMultiSum_Auto_pool[256, 256, [64, 32], 1]       
 22                  -1  3   2101248  ultralytics.nn.modules.block.RepC3           [256, 256, 3]                 
 23        [18, 18, 12]  1    271040  ultralytics.nn.AddModules.GoldYolo.PyramidPoolAgg[768, 352, 2]                 
 24                  -1  1   2222528  ultralytics.nn.AddModules.GoldYolo.TopBasicLayer[352, [64, 128]]              
 25            [22, 19]  1         0  ultralytics.nn.AddModules.GoldYolo.AdvPoolFusion[]                            
 26            [-1, 24]  1    165376  ultralytics.nn.AddModules.GoldYolo.InjectionMultiSum_Auto_pool[512, 256, [64, 128], 0]      
 27                  -1  3   2101248  ultralytics.nn.modules.block.RepC3           [256, 256, 3]                 
 28            [-1, 15]  1         0  ultralytics.nn.AddModules.GoldYolo.AdvPoolFusion[]                            
 29            [-1, 24]  1    198144  ultralytics.nn.AddModules.GoldYolo.InjectionMultiSum_Auto_pool[512, 256, [64, 128], 1]      
 30                  -1  3   2101248  ultralytics.nn.modules.block.RepC3           [256, 256, 3]                 
 31        [22, 27, 30]  1   7303907  ultralytics.nn.modules.head.RTDETRDecoder    [1, [256, 256, 256]]          
rtdetr-l-GoldYolo summary: 867 layers, 35,609,475 parameters, 35,609,475 gradients, 112.1 GFLOPs