RT-DETR Improvement Strategy [Neck] | GFPN: Going Beyond BiFPN by Improving the RT-DETR Neck with Skip-Layer and Cross-Scale Connections
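1. Introduction
This article documents how to use the GFPN neck structure to optimize the RT-DETR object detection model. The GFPN-based neck uses skip-layer connections to mitigate vanishing gradients during backpropagation, and adds cross-scale connections so that features from different levels and layers are fused thoroughly. High-level semantic information and low-level spatial information are exchanged sufficiently, which improves detection performance in scenes with large scale variation.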
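2. GFPN Overview
GiraffeDet: A Heavy-Neck Paradigm for Object Detection
2.1 Design Motivation
- The conventional FPN and its variants have several limitations. The original FPN (Lin et al., 2017a) has only a single, top-down information path for fusing multi-scale features. PANet (Liu et al., 2018) adds a bottom-up path aggregation network, but at a higher computational cost. BiFPN (Tan et al., 2020) optimizes the nodes and connections, but lacks inner-block connections. GFPN was designed to overcome these problems while fusing multi-scale information efficiently, so that the detector can cope with large scale variation among objects.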
2.2 Principle
2.2.1 Skip-layer Connection
- Purpose: reduce vanishing gradients when backpropagating through a neck as deep and complex as the "giraffe" neck.
- Approach: two feature-linking methods are proposed, dense-link and $\log_2 n$-link.
  - dense-link: inspired by DenseNet (Huang et al., 2017). For each scale feature $P_k^l$ at level $k$, layer $l$ receives the feature maps of all preceding layers: $P_k^l = \mathrm{Conv}\left(\mathrm{Concat}\left(P_k^0, \ldots, P_k^{l-1}\right)\right)$.
  - $\log_2 n$-link: at each level $k$, layer $l$ receives feature maps from at most $\log_2 l + 1$ preceding layers, and these input layers are spaced exponentially (base 2) away from depth $l$: $P_k^l = \mathrm{Conv}\left(\mathrm{Concat}\left(P_k^{l-2^n}, \ldots, P_k^{l-2^1}, P_k^{l-2^0}\right)\right)$, where $l - 2^n \ge 0$. Compared with dense-link, the time complexity of $\log_2 n$-link at depth $l$ is only $O(l \cdot \log_2 l)$ instead of $O(l^2)$. It also increases the distance between layers during backpropagation only slightly, so it scales to deeper networks.
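The difference between the two linking rules is easiest to see by listing which earlier layers feed a given layer. The sketch below is only an index-level illustration of the two formulas above; the function names `dense_link_inputs` and `log2n_link_inputs` are made up for this example and do not come from the GiraffeDet code.

```python
def dense_link_inputs(l):
    """Earlier layers feeding layer l under dense-link: every layer 0..l-1."""
    return list(range(l))


def log2n_link_inputs(l):
    """Earlier layers feeding layer l under log2n-link.

    Returns l - 2^0, l - 2^1, ... for as long as the index stays >= 0,
    i.e. at most floor(log2(l)) + 1 inputs.
    """
    inputs = []
    step = 1
    while l - step >= 0:
        inputs.append(l - step)
        step *= 2
    return inputs


for depth in (1, 4, 8):
    print(depth, dense_link_inputs(depth), log2n_link_inputs(depth))

# Total number of skip connections for a 16-layer stack under each rule.
dense_total = sum(len(dense_link_inputs(l)) for l in range(1, 17))
log2n_total = sum(len(log2n_link_inputs(l)) for l in range(1, 17))
print(dense_total, log2n_total)
```

At depth 8, dense-link concatenates eight earlier feature maps while $\log_2 n$-link concatenates four (layers 7, 6, 4 and 0), which is where the gap between quadratic and $l \cdot \log_2 l$ growth comes from.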
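2.2.2 Cross-scale Connection
- Purpose: sufficient information exchange is needed to cope with large scale variation, which requires cross-scale connections in addition to skip-layer connections.
- Approach: a new cross-scale fusion method, Queen-fusion, is proposed. It takes features from both the same level and the neighboring levels into account. For example, the Queen-fusion connection at $P_5$ combines the downsampled $P_4$ of the previous layer, the upsampled $P_6$ of the previous layer, $P_5$ of the previous layer, and $P_4$ of the current layer. In the implementation, bilinear interpolation is used for upsampling and max pooling for downsampling.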
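The $P_5$ example can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the GiraffeDet implementation: the function name `queen_fusion_p5`, the 3x3 stride-2 pooling window, the use of channel concatenation as the merge step, and the decision to downsample the current-layer $P_4$ so that it matches the $P_5$ resolution are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def queen_fusion_p5(prev_p4, prev_p5, prev_p6, cur_p4):
    """Gather the four Queen-fusion inputs for a P5 node and concatenate them.

    Assumed shapes: P4 is 2x the spatial size of P5, and P6 is half of it.
    """
    target_size = prev_p5.shape[-2:]
    # Downsampling with max pooling (window and stride are assumed values).
    down_prev_p4 = F.max_pool2d(prev_p4, kernel_size=3, stride=2, padding=1)
    down_cur_p4 = F.max_pool2d(cur_p4, kernel_size=3, stride=2, padding=1)
    # Upsampling with bilinear interpolation.
    up_prev_p6 = F.interpolate(prev_p6, size=target_size, mode="bilinear",
                               align_corners=False)
    return torch.cat([down_prev_p4, prev_p5, up_prev_p6, down_cur_p4], dim=1)


channels = 8
prev_p4 = torch.randn(1, channels, 40, 40)
prev_p5 = torch.randn(1, channels, 20, 20)
prev_p6 = torch.randn(1, channels, 10, 10)
cur_p4 = torch.randn(1, channels, 40, 40)

fused = queen_fusion_p5(prev_p4, prev_p5, prev_p6, cur_p4)
print(fused.shape)
```

In a full neck the concatenated tensor would then pass through a convolutional block that restores the channel width.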
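2.3 Structure
- GFPN combines the skip-layer connections described above (dense-link and $\log_2 n$-link) with the cross-scale connection (Queen-fusion). Unlike other FPN designs such as PANet and BiFPN, each GFPN layer represents a single depth, whereas one PANet or BiFPN layer contains two depths.
2.4 Advantages
- Efficient information transmission: at the same FLOPs level, the $\log_2 n$-link skip connection transmits information more effectively. Compared with dense-link, it avoids potentially redundant information transmission and lets the network scale to greater depth.
- Thorough information fusion: the Queen-fusion cross-scale connection fuses features from different levels and layers thoroughly, exchanging enough high-level semantic information and low-level spatial information to improve detection performance under large scale variation.
- Performance: the experiments show that, across different FLOPs levels, GFPN lets GiraffeDet reach a good balance between accuracy and efficiency and outperform methods built on other backbones and FPN structures. For example, on the COCO dataset, GiraffeDet-D29 with GFPN reaches 54.1% mAP with multi-scale testing, exceeding other SOTA methods.
Paper: https://arxiv.org/pdf/2202.04256
Code: https://github.com/damo-cv/GiraffeDet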
3. GFPN Implementation Code
The implementation code of the GFPN module is as follows:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_bn(in_channels, out_channels, kernel_size, stride, padding, groups=1):
    '''Basic cell for rep-style block, including conv and bn'''
    result = nn.Sequential()
    result.add_module(
        'conv',
        nn.Conv2d(in_channels=in_channels,
                  out_channels=out_channels,
                  kernel_size=kernel_size,
                  stride=stride,
                  padding=padding,
                  groups=groups,
                  bias=False))
    result.add_module('bn', nn.BatchNorm2d(num_features=out_channels))
    return result


class RepConv(nn.Module):
    '''RepConv is a basic rep-style block, including training and deploy status
    Code is based on https://github.com/DingXiaoH/RepVGG/blob/main/repvgg.py
    '''
    def __init__(self,
                 in_channels,
                 out_channels,
                 kernel_size=3,
                 stride=1,
                 padding=1,
                 dilation=1,
                 groups=1,
                 padding_mode='zeros',
                 deploy=False,
                 act='relu',
                 norm=None):
        super(RepConv, self).__init__()
        self.deploy = deploy
        self.groups = groups
        self.in_channels = in_channels
        self.out_channels = out_channels

        assert kernel_size == 3
        assert padding == 1

        # Padding of the parallel 1x1 branch (0 when padding == 1).
        padding_11 = padding - kernel_size // 2

        if isinstance(act, str):
            self.nonlinearity = get_activation(act)
        else:
            self.nonlinearity = act

        if deploy:
            # Inference form: a single fused 3x3 conv with bias.
            self.rbr_reparam = nn.Conv2d(in_channels=in_channels,
                                         out_channels=out_channels,
                                         kernel_size=kernel_size,
                                         stride=stride,
                                         padding=padding,
                                         dilation=dilation,
                                         groups=groups,
                                         bias=True,
                                         padding_mode=padding_mode)
        else:
            # Training form: parallel 3x3 and 1x1 conv-bn branches.
            self.rbr_identity = None
            self.rbr_dense = conv_bn(in_channels=in_channels,
                                     out_channels=out_channels,
                                     kernel_size=kernel_size,
                                     stride=stride,
                                     padding=padding,
                                     groups=groups)
            self.rbr_1x1 = conv_bn(in_channels=in_channels,
                                   out_channels=out_channels,
                                   kernel_size=1,
                                   stride=stride,
                                   padding=padding_11,
                                   groups=groups)

    def forward(self, inputs):
        '''Forward process'''
        if hasattr(self, 'rbr_reparam'):
            return self.nonlinearity(self.rbr_reparam(inputs))

        if self.rbr_identity is None:
            id_out = 0
        else:
            id_out = self.rbr_identity(inputs)

        return self.nonlinearity(
            self.rbr_dense(inputs) + self.rbr_1x1(inputs) + id_out)

    def get_equivalent_kernel_bias(self):
        kernel3x3, bias3x3 = self._fuse_bn_tensor(self.rbr_dense)
        kernel1x1, bias1x1 = self._fuse_bn_tensor(self.rbr_1x1)
        kernelid, biasid = self._fuse_bn_tensor(self.rbr_identity)
        return kernel3x3 + self._pad_1x1_to_3x3_tensor(
            kernel1x1) + kernelid, bias3x3 + bias1x1 + biasid

    def _pad_1x1_to_3x3_tensor(self, kernel1x1):
        if kernel1x1 is None:
            return 0
        else:
            return torch.nn.functional.pad(kernel1x1, [1, 1, 1, 1])

    def _fuse_bn_tensor(self, branch):
        if branch is None:
            return 0, 0
        if isinstance(branch, nn.Sequential):
            kernel = branch.conv.weight
            running_mean = branch.bn.running_mean
            running_var = branch.bn.running_var
            gamma = branch.bn.weight
            beta = branch.bn.bias
            eps = branch.bn.eps
        else:
            assert isinstance(branch, nn.BatchNorm2d)
            if not hasattr(self, 'id_tensor'):
                # Build a 3x3 kernel that acts as the identity mapping.
                input_dim = self.in_channels // self.groups
                kernel_value = np.zeros((self.in_channels, input_dim, 3, 3),
                                        dtype=np.float32)
                for i in range(self.in_channels):
                    kernel_value[i, i % input_dim, 1, 1] = 1
                self.id_tensor = torch.from_numpy(kernel_value).to(
                    branch.weight.device)
            kernel = self.id_tensor
            running_mean = branch.running_mean
            running_var = branch.running_var
            gamma = branch.weight
            beta = branch.bias
            eps = branch.eps
        # Fold the BN statistics into the kernel and bias.
        std = (running_var + eps).sqrt()
        t = (gamma / std).reshape(-1, 1, 1, 1)
        return kernel * t, beta - running_mean * gamma / std

    def switch_to_deploy(self):
        if hasattr(self, 'rbr_reparam'):
            return
        kernel, bias = self.get_equivalent_kernel_bias()
        self.rbr_reparam = nn.Conv2d(
            in_channels=self.rbr_dense.conv.in_channels,
            out_channels=self.rbr_dense.conv.out_channels,
            kernel_size=self.rbr_dense.conv.kernel_size,
            stride=self.rbr_dense.conv.stride,
            padding=self.rbr_dense.conv.padding,
            dilation=self.rbr_dense.conv.dilation,
            groups=self.rbr_dense.conv.groups,
            bias=True)
        self.rbr_reparam.weight.data = kernel
        self.rbr_reparam.bias.data = bias
        for para in self.parameters():
            para.detach_()
        self.__delattr__('rbr_dense')
        self.__delattr__('rbr_1x1')
        if hasattr(self, 'rbr_identity'):
            self.__delattr__('rbr_identity')
        if hasattr(self, 'id_tensor'):
            self.__delattr__('id_tensor')
        self.deploy = True


class Swish(nn.Module):
    def __init__(self, inplace=True):
        super(Swish, self).__init__()
        self.inplace = inplace

    def forward(self, x):
        # torch.sigmoid replaces the deprecated F.sigmoid.
        if self.inplace:
            x.mul_(torch.sigmoid(x))
            return x
        else:
            return x * torch.sigmoid(x)


def get_activation(name='silu', inplace=True):
    if name is None:
        return nn.Identity()

    if isinstance(name, str):
        if name == 'silu':
            module = nn.SiLU(inplace=inplace)
        elif name == 'relu':
            module = nn.ReLU(inplace=inplace)
        elif name == 'lrelu':
            module = nn.LeakyReLU(0.1, inplace=inplace)
        elif name == 'swish':
            module = Swish(inplace=inplace)
        elif name == 'hardsigmoid':
            module = nn.Hardsigmoid(inplace=inplace)
        elif name == 'identity':
            module = nn.Identity()
        else:
            raise AttributeError('Unsupported act type: {}'.format(name))
        return module
    elif isinstance(name, nn.Module):
        return name
    else:
        raise AttributeError('Unsupported act type: {}'.format(name))


def get_norm(name, out_channels, inplace=True):
    if name == 'bn':
        module = nn.BatchNorm2d(out_channels)
    else:
        raise NotImplementedError
    return module


class ConvBNAct(nn.Module):
    """A Conv2d -> Batchnorm -> silu/leaky relu block"""
    def __init__(
        self,
        in_channels,
        out_channels,
        ksize,
        stride=1,
        groups=1,
        bias=False,
        act='silu',
        norm='bn',
        reparam=False,
    ):
        super().__init__()
        # same padding
        pad = (ksize - 1) // 2
        self.conv = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size=ksize,
            stride=stride,
            padding=pad,
            groups=groups,
            bias=bias,
        )
        if norm is not None:
            self.bn = get_norm(norm, out_channels, inplace=True)
        if act is not None:
            self.act = get_activation(act, inplace=True)
        self.with_norm = norm is not None
        self.with_act = act is not None

    def forward(self, x):
        x = self.conv(x)
        if self.with_norm:
            x = self.bn(x)
        if self.with_act:
            x = self.act(x)
        return x

    def fuseforward(self, x):
        return self.act(self.conv(x))


class BasicBlock_3x3_Reverse(nn.Module):
    def __init__(self,
                 ch_in,
                 ch_hidden_ratio,
                 ch_out,
                 act='relu',
                 shortcut=True):
        super(BasicBlock_3x3_Reverse, self).__init__()
        assert ch_in == ch_out
        ch_hidden = int(ch_in * ch_hidden_ratio)
        self.conv1 = ConvBNAct(ch_hidden, ch_out, 3, stride=1, act=act)
        self.conv2 = RepConv(ch_in, ch_hidden, 3, stride=1, act=act)
        self.shortcut = shortcut

    def forward(self, x):
        # "Reverse" order: the RepConv (conv2) runs before the plain conv (conv1).
        y = self.conv2(x)
        y = self.conv1(y)
        if self.shortcut:
            return x + y
        else:
            return y


class SPP(nn.Module):
    def __init__(
        self,
        ch_in,
        ch_out,
        k,
        pool_size,
        act='swish',
    ):
        super(SPP, self).__init__()
        self.pool = []
        for i, size in enumerate(pool_size):
            pool = nn.MaxPool2d(kernel_size=size,
                                stride=1,
                                padding=size // 2,
                                ceil_mode=False)
            self.add_module('pool{}'.format(i), pool)
            self.pool.append(pool)
        self.conv = ConvBNAct(ch_in, ch_out, k, act=act)

    def forward(self, x):
        outs = [x]
        for pool in self.pool:
            outs.append(pool(x))
        y = torch.cat(outs, dim=1)
        y = self.conv(y)
        return y


class CSPStage(nn.Module):
    def __init__(self,
                 ch_in,
                 ch_out,
                 n=1,
                 block_fn='BasicBlock_3x3_Reverse',
                 ch_hidden_ratio=1.0,
                 act='silu',
                 spp=False):
        super(CSPStage, self).__init__()

        # Split the output width into a shortcut half and a processed half.
        split_ratio = 2
        ch_first = int(ch_out // split_ratio)
        ch_mid = int(ch_out - ch_first)
        self.conv1 = ConvBNAct(ch_in, ch_first, 1, act=act)
        self.conv2 = ConvBNAct(ch_in, ch_mid, 1, act=act)
        self.convs = nn.Sequential()

        next_ch_in = ch_mid
        for i in range(n):
            if block_fn == 'BasicBlock_3x3_Reverse':
                self.convs.add_module(
                    str(i),
                    BasicBlock_3x3_Reverse(next_ch_in,
                                           ch_hidden_ratio,
                                           ch_mid,
                                           act=act,
                                           shortcut=True))
            else:
                raise NotImplementedError
            if i == (n - 1) // 2 and spp:
                self.convs.add_module(
                    'spp', SPP(ch_mid * 4, ch_mid, 1, [5, 9, 13], act=act))
            next_ch_in = ch_mid
        # conv3 expects n block outputs plus the shortcut branch. With spp=True,
        # forward appends one extra tensor and this width no longer matches;
        # the neck in this article keeps the default spp=False.
        self.conv3 = ConvBNAct(ch_mid * n + ch_first, ch_out, 1, act=act)

    def forward(self, x):
        y1 = self.conv1(x)
        y2 = self.conv2(x)

        # Keep every intermediate block output and concatenate them all.
        mid_out = [y1]
        for conv in self.convs:
            y2 = conv(y2)
            mid_out.append(y2)
        y = torch.cat(mid_out, dim=1)
        y = self.conv3(y)
        return y
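
The re-parameterization step inside `RepConv._fuse_bn_tensor` folds a BatchNorm layer into the preceding convolution. The sketch below isolates that one step so it can be checked numerically. It is a standalone illustration under eval-mode assumptions; the helper name `fuse_conv_bn` and the random BN statistics are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fuse_conv_bn(conv, bn):
    """Fold BN (using its running statistics) into a bias-free conv.

    Same arithmetic as the conv-bn branch in _fuse_bn_tensor:
    kernel' = kernel * gamma / std, bias' = beta - mean * gamma / std.
    """
    std = (bn.running_var + bn.eps).sqrt()
    scale = (bn.weight / std).reshape(-1, 1, 1, 1)
    kernel = conv.weight * scale
    bias = bn.bias - bn.running_mean * bn.weight / std
    return kernel, bias


torch.manual_seed(0)
conv = nn.Conv2d(4, 8, kernel_size=3, padding=1, bias=False)
bn = nn.BatchNorm2d(8)

# Give BN non-trivial statistics so the check is meaningful.
with torch.no_grad():
    bn.running_mean.uniform_(-1.0, 1.0)
    bn.running_var.uniform_(0.5, 2.0)
    bn.weight.uniform_(0.5, 1.5)
    bn.bias.uniform_(-0.5, 0.5)

conv.eval()
bn.eval()

x = torch.randn(2, 4, 16, 16)
with torch.no_grad():
    reference = bn(conv(x))
    kernel, bias = fuse_conv_bn(conv, bn)
    fused = F.conv2d(x, kernel, bias, padding=1)

max_abs_diff = (reference - fused).abs().max().item()
print(max_abs_diff)
```

The equivalence only holds in eval mode, where BN uses its running statistics; this is why `switch_to_deploy` is meant to be called after training.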
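4. Integration Steps
4.1 Modification 1
① Create a new folder named AddModules under the ultralytics/nn/ directory to hold the module code.
② Create a new file CSPStage.py in the AddModules folder and paste the code from Section 3 into it.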
4.2 Modification 2
Create a new file __init__.py in the AddModules folder (skip this if it already exists) and import the module in it:
from .CSPStage import *
4.3 Modification 3
In the ultralytics/nn/tasks.py file, the module class name has to be added in two places.
First: import the module.
Second: register the CSPStage module in the parse_model function.
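
The exact edit to parse_model depends on the Ultralytics version, so the snippet below does not reproduce that function. It only sketches the argument handling the registration needs to achieve: the input channel count of the previous layer is prepended to the YAML args, so `[512]` becomes `[512, 512]` as shown in the printed model further down. The helper `resolve_channel_args` and the sample channel list are hypothetical, and the import line in the comment is an assumption based on the AddModules package created in 4.1 and 4.2.

```python
# Assumed edits in tasks.py (adapt to the installed Ultralytics version):
#   1) from ultralytics.nn.AddModules import *
#   2) inside parse_model, treat CSPStage like the other modules whose
#      args are rewritten to [c1, c2, ...].


def resolve_channel_args(ch, f, args):
    """Hypothetical helper mirroring the [c1, c2, *rest] rewrite.

    ch   : output channels of the layers built so far
    f    : 'from' index of the current layer (-1 means the previous layer)
    args : args list taken from the YAML entry, args[0] is the output width
    """
    c1 = ch[f]
    c2 = args[0]
    return [c1, c2, *args[1:]]


# Sample state just before layer 16: the previous layer (Concat) outputs 512.
ch = [256, 256, 512]
layer16_args = resolve_channel_args(ch, -1, [512])
print(layer16_args)

# Layer 30 follows a Concat that outputs 1024 channels.
layer30_args = resolve_channel_args([512, 256, 256, 1024], -1, [1024])
print(layer30_args)
```

Without this rewrite, `CSPStage(512)` would be called with only one positional argument and fail, because the class requires both `ch_in` and `ch_out`.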
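5. YAML Model File
5.1 Improved Model Version ⭐
Taking ultralytics/cfg/models/rt-detr/rtdetr-l.yaml as an example, create a model file named rtdetr-l-CSPStage.yaml in the same directory for training on your own dataset.
Copy the contents of rtdetr-l.yaml into rtdetr-l-CSPStage.yaml and set nc to the number of object classes in your dataset.
📌 The model is modified by changing the neck network.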
# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-l object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/models/rtdetr

# Parameters
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
  # [depth, width, max_channels]
  l: [1.00, 1.00, 1024]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, HGStem, [32, 48]] # 0-P2/4
  - [-1, 6, HGBlock, [48, 128, 3]] # stage 1
  - [-1, 1, DWConv, [128, 3, 2, 1, False]] # 2-P3/8
  - [-1, 6, HGBlock, [96, 512, 3]] # stage 2
  - [-1, 1, DWConv, [512, 3, 2, 1, False]] # 4-P4/16
  - [-1, 6, HGBlock, [192, 1024, 5, True, False]] # cm, c2, k, light, shortcut
  - [-1, 6, HGBlock, [192, 1024, 5, True, True]]
  - [-1, 6, HGBlock, [192, 1024, 5, True, True]] # stage 3
  - [-1, 1, DWConv, [1024, 3, 2, 1, False]] # 8-P5/32
  - [-1, 6, HGBlock, [384, 2048, 5, True, False]] # stage 4

head:
  - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 10, input_proj.2
  - [-1, 1, AIFI, [1024, 8]] # 11
  - [-1, 1, Conv, [256, 1, 1]] # 12, Y5, lateral_convs.0
  - [-1, 1, Conv, [256, 1, 1]] # 13
  - [7, 1, Conv, [256, 3, 2]] # 14, backbone P4 downsampled to P5 resolution
  - [[-1, 13], 1, Concat, [1]] # 15
  - [-1, 3, CSPStage, [512]] # 16, P5 resolution
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]] # 17, up to P4 resolution
  - [3, 1, Conv, [256, 3, 2]] # 18, backbone P3 downsampled to P4 resolution
  - [[17, -1, 7], 1, Concat, [1]] # 19, cat with backbone P4
  - [-1, 3, RepC3, [256]] # 20, P4 resolution
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]] # 21, up to P3 resolution
  - [[-1, 3], 1, Concat, [1]] # 22, cat backbone P3
  - [-1, 3, RepC3, [256]] # 23, P3 output
  - [-1, 1, Conv, [256, 3, 2]] # 24, downsample to P4 resolution
  - [[-1, 20], 1, Concat, [1]] # 25, cat layer 20
  - [-1, 3, CSPStage, [512]] # 26, P4 output
  - [20, 1, Conv, [256, 3, 2]] # 27, layer 20 downsampled to P5 resolution
  - [26, 1, Conv, [256, 3, 2]] # 28, layer 26 downsampled to P5 resolution
  - [[16, 27, -1], 1, Concat, [1]] # 29
  - [-1, 3, CSPStage, [1024]] # 30, P5 output
  - [[23, 26, 30], 1, RTDETRDecoder, [nc]] # Detect(P3, P4, P5)
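
Because the neck concatenates tensors from several earlier layers, it helps to check the channel arithmetic before training. The short script below traces the output width of each head layer by hand, assuming the `l` scale leaves the YAML channel values unchanged (width multiplier 1.00). The dictionary `out_ch` and the helper `concat` are written just for this check.

```python
# Output channels of the layers the neck reads from (taken from the YAML args).
out_ch = {3: 512, 7: 1024, 13: 256, 14: 256}


def concat(*layers):
    """Concat along the channel axis adds the channel counts of its inputs."""
    return sum(out_ch[i] for i in layers)


out_ch[15] = concat(14, 13)       # Concat
out_ch[16] = 512                  # CSPStage
out_ch[17] = out_ch[16]           # Upsample keeps the channel count
out_ch[18] = 256                  # Conv on backbone layer 3
out_ch[19] = concat(17, 18, 7)    # Concat
out_ch[20] = 256                  # RepC3
out_ch[21] = out_ch[20]           # Upsample
out_ch[22] = concat(21, 3)        # Concat
out_ch[23] = 256                  # RepC3
out_ch[24] = 256                  # Conv, stride 2
out_ch[25] = concat(24, 20)       # Concat
out_ch[26] = 512                  # CSPStage
out_ch[27] = 256                  # Conv on layer 20
out_ch[28] = 256                  # Conv on layer 26
out_ch[29] = concat(16, 27, 28)   # Concat
out_ch[30] = 1024                 # CSPStage

decoder_in = [out_ch[i] for i in (23, 26, 30)]
for idx in (15, 19, 22, 25, 29):
    print(idx, out_ch[idx])
print(decoder_in)
```

The concatenation widths (512, 1792, 768, 512, 1024) are the input channel counts that the following CSPStage and RepC3 layers must accept, and the last list is what the RTDETRDecoder receives.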
6. Successful Run Results
Printing the network model shows that the neck has been modified and the model is ready for training.
rtdetr-l-CSPStage:
rtdetr-l-CSPStage summary: 857 layers, 65,611,459 parameters, 65,611,459 gradients, 145.4 GFLOPs
                   from  n    params  module                                       arguments
  0                  -1  1     25248  ultralytics.nn.modules.block.HGStem          [3, 32, 48]
  1                  -1  6    155072  ultralytics.nn.modules.block.HGBlock         [48, 48, 128, 3, 6]
  2                  -1  1      1408  ultralytics.nn.modules.conv.DWConv           [128, 128, 3, 2, 1, False]
  3                  -1  6    839296  ultralytics.nn.modules.block.HGBlock         [128, 96, 512, 3, 6]
  4                  -1  1      5632  ultralytics.nn.modules.conv.DWConv           [512, 512, 3, 2, 1, False]
  5                  -1  6   1695360  ultralytics.nn.modules.block.HGBlock         [512, 192, 1024, 5, 6, True, False]
  6                  -1  6   2055808  ultralytics.nn.modules.block.HGBlock         [1024, 192, 1024, 5, 6, True, True]
  7                  -1  6   2055808  ultralytics.nn.modules.block.HGBlock         [1024, 192, 1024, 5, 6, True, True]
  8                  -1  1     11264  ultralytics.nn.modules.conv.DWConv           [1024, 1024, 3, 2, 1, False]
  9                  -1  6   6708480  ultralytics.nn.modules.block.HGBlock         [1024, 384, 2048, 5, 6, True, False]
 10                  -1  1    524800  ultralytics.nn.modules.conv.Conv             [2048, 256, 1, 1, None, 1, 1, False]
 11                  -1  1    789760  ultralytics.nn.modules.transformer.AIFI      [256, 1024, 8]
 12                  -1  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]
 13                  -1  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]
 14                   7  1   2359808  ultralytics.nn.modules.conv.Conv             [1024, 256, 3, 2]
 15            [-1, 13]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 16                  -1  3   5319168  ultralytics.nn.AddModules.CSPStage.CSPStage  [512, 512]
 17                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 18                   3  1   1180160  ultralytics.nn.modules.conv.Conv             [512, 256, 3, 2]
 19         [17, -1, 7]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 20                  -1  3   2887680  ultralytics.nn.modules.block.RepC3           [1792, 256, 3]
 21                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 22             [-1, 3]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 23                  -1  3   2363392  ultralytics.nn.modules.block.RepC3           [768, 256, 3]
 24                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]
 25            [-1, 20]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 26                  -1  3   5319168  ultralytics.nn.AddModules.CSPStage.CSPStage  [512, 512]
 27                  20  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]
 28                  26  1   1180160  ultralytics.nn.modules.conv.Conv             [512, 256, 3, 2]
 29        [16, 27, -1]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 30                  -1  3  21255168  ultralytics.nn.AddModules.CSPStage.CSPStage  [1024, 1024]
 31        [23, 26, 30]  1   7566051  ultralytics.nn.modules.head.RTDETRDecoder    [1, [256, 512, 1024]]
rtdetr-l-CSPStage summary: 857 layers, 65,611,459 parameters, 65,611,459 gradients, 145.4 GFLOPs
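
As a quick sanity check on the printout, the per-layer parameter counts can be summed and compared with the total in the summary line. The numbers below are copied from the table above; the variable names are arbitrary.

```python
# Parameter counts for layers 0..31, copied from the printed model table.
layer_params = [
    25248, 155072, 1408, 839296, 5632, 1695360, 2055808, 2055808,
    11264, 6708480, 524800, 789760, 66048, 66048, 2359808, 0,
    5319168, 0, 1180160, 0, 2887680, 0, 0, 2363392,
    590336, 0, 5319168, 590336, 1180160, 0, 21255168, 7566051,
]

total_params = sum(layer_params)

# Layers 16, 26 and 30 are the CSPStage blocks added to the neck.
cspstage_params = sum(layer_params[i] for i in (16, 26, 30))
cspstage_share = cspstage_params / total_params

print(total_params)
print(cspstage_params)
print(f"{cspstage_share:.1%}")
```

The total matches the 65,611,459 parameters reported in the summary, and the three CSPStage blocks account for a little under half of them.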