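RT-DETR Improvement Strategies [Conv and Transformer] | CVPR 2021 Bottleneck Transformers: A Simple and Efficient Self-Attention Module
1. Introduction
This post documents how to use Bottleneck Transformers (BoT) to improve the RT-DETR object detection model. Standard convolutions capture local information effectively but are limited on tasks that require integrating global information, whereas self-attention models long-range dependencies well, which motivates bringing it into vision architectures.
This post uses the BoT module to combine standard convolution with self-attention and thereby improve the model's global perception.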
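2. Introduction to Bottleneck Transformers
Bottleneck Transformers for Visual Recognition
Bottleneck Transformers (BoTNet) is a backbone architecture that incorporates self-attention into computer vision tasks. Its design principles and advantages are as follows: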
2.1 Principles
2.1.1 Architecture
A BoT block is built by replacing the spatial 3×3 convolution in a ResNet bottleneck block with a Multi-Head Self-Attention (MHSA) layer (as shown in the figure).
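2.1.2 MHSA Layer
The MHSA layer performs global (all2all) self-attention over a 2D feature map (as shown in the figure). To make the attention operation position-aware, relative position encodings are used. The attention logits are computed as $qk^{T} + qr^{T}$, where $q$, $k$, and $r$ denote the queries, keys, and relative position encodings, respectively. The MHSA layer also uses multiple heads, and the relative position encodings and the value projection are its main differences from the Non-Local layer.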
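For reference, the logits above can be written out together with their shapes. This is a sketch under stated assumptions: $H \times W$ is the feature-map size, $d$ is the per-head channel count, $R_h$ and $R_w$ are the learned height and width position embeddings (rel_h_weight and rel_w_weight in the code in Section 3), and the output is a softmax over the key axis applied to the value projection $v$. No $1/\sqrt{d}$ scaling appears because neither the formula above nor the Section 3 code applies one.

```latex
\begin{aligned}
q,\, k,\, v &\in \mathbb{R}^{HW \times d}, \qquad r = R_h + R_w \in \mathbb{R}^{HW \times d} \\
A &= \operatorname{softmax}\!\left(qk^{T} + qr^{T}\right) \in \mathbb{R}^{HW \times HW} \\
\mathrm{out} &= A\,v \in \mathbb{R}^{HW \times d}
\end{aligned}
```

Each head computes its own $A$, and the per-head outputs are concatenated along the channel axis to restore the original channel count.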
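2.2 Advantages
- Performance gains:
  - On the COCO instance segmentation benchmark, BoTNet significantly improves performance, with gains across different training configurations and data augmentations.
  - Detection of small objects improves markedly, and experiments across backbones of the ResNet family also demonstrate its applicability.
  - Compared with Non-Local Neural Networks, the BoT block in BoTNet is better designed and yields larger performance gains.
- Scalability: by adjusting and scaling the BoTNet architecture, high accuracy can be reached on the ImageNet validation set (the paper reports 84.7% top-1) while remaining computationally efficient.
- Simple and effective: the BoT block is simple in design and builds on the existing ResNet architecture, so it is easy to implement and apply. Despite its simple construction it performs well and provides a strong baseline for future use of self-attention in vision architectures.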
Paper: https://arxiv.org/pdf/2101.11605
Code: https://github.com/tensorflow/tpu/tree/master/models/official/detection
3. Implementation of Bottleneck Transformers
The Bottleneck Transformers module is implemented as follows:
import torch
import torch.nn as nn
from ultralytics.nn.modules.conv import LightConv
from ultralytics.utils.torch_utils import fuse_conv_and_bn


def autopad(k, p=None, d=1):  # kernel, padding, dilation
    """Pad to 'same' shape outputs."""
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p


class Conv(nn.Module):
    """Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)."""

    default_act = nn.SiLU()  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        """Initialize Conv layer with given arguments including activation."""
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        """Apply convolution, batch normalization and activation to input tensor."""
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        """Apply convolution and activation only (used once conv and BN have been fused)."""
        return self.act(self.conv(x))


class MHSA(nn.Module):
    """Multi-head self-attention over a 2D feature map, with optional position embeddings."""

    def __init__(self, n_dims, width=14, height=14, heads=4, pos_emb=False):
        super().__init__()
        self.heads = heads
        # 1x1 convolutions act as the query, key and value projections
        self.query = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.key = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.value = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.pos = pos_emb
        if self.pos:
            # Learned embeddings along each spatial axis; broadcast-added in forward()
            self.rel_h_weight = nn.Parameter(torch.randn([1, heads, n_dims // heads, 1, int(height)]), requires_grad=True)
            self.rel_w_weight = nn.Parameter(torch.randn([1, heads, n_dims // heads, int(width), 1]), requires_grad=True)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        n_batch, C, width, height = x.size()  # the last two dims are the spatial dims
        # Flatten the spatial dims: (N, heads, C // heads, tokens)
        q = self.query(x).view(n_batch, self.heads, C // self.heads, -1)
        k = self.key(x).view(n_batch, self.heads, C // self.heads, -1)
        v = self.value(x).view(n_batch, self.heads, C // self.heads, -1)
        content_content = torch.matmul(q.permute(0, 1, 3, 2), k)  # (N, heads, tokens, tokens)
        c1, c2, c3, c4 = content_content.size()
        if self.pos:
            content_position = (self.rel_h_weight + self.rel_w_weight).view(1, self.heads, C // self.heads, -1).permute(0, 1, 3, 2)  # 1,4,1024,64
            content_position = torch.matmul(content_position, q)  # ([1, 4, 1024, 256])
            # Crop the position term if the embedding grid is larger than the feature map
            content_position = content_position if content_content.shape == content_position.shape else content_position[:, :, :c3]
            assert content_content.shape == content_position.shape
            energy = content_content + content_position
        else:
            energy = content_content
        attention = self.softmax(energy)
        out = torch.matmul(v, attention.permute(0, 1, 3, 2))  # 1,4,256,64
        out = out.view(n_batch, C, width, height)
        return out


class BottleneckTransformer(nn.Module):
    """Bottleneck block whose 3x3 convolution can be replaced by MHSA."""

    def __init__(self, c1, c2, stride=1, heads=4, mhsa=True, resolution=(20, 20), expansion=1):
        super().__init__()
        c_ = int(c2 * expansion)
        self.cv1 = Conv(c1, c_, 1, 1)
        if not mhsa:
            self.cv2 = Conv(c_, c2, 3, 1)
        else:
            layers = [MHSA(c2, width=int(resolution[0]), height=int(resolution[1]), heads=heads)]
            if stride == 2:
                layers.append(nn.AvgPool2d(2, 2))
            self.cv2 = nn.Sequential(*layers)
        self.shortcut = c1 == c2  # identity shortcut when shapes already match
        if stride != 1 or c1 != expansion * c2:
            # Projection shortcut for a channel or stride mismatch
            self.shortcut = nn.Sequential(
                nn.Conv2d(c1, expansion * c2, kernel_size=1, stride=stride),
                nn.BatchNorm2d(expansion * c2),
            )
        self.fc1 = nn.Linear(c2, c2)  # not used in forward(); kept so parameter counts are unchanged

    def forward(self, x):
        out = self.cv2(self.cv1(x))
        if isinstance(self.shortcut, nn.Module):
            return self.shortcut(x) + out  # apply the projection before adding
        return x + out if self.shortcut else out


class HGBlock_BoT(nn.Module):
    """
    HG_Block of PPHGNetV2 with 2 convolutions and LightConv, followed by a BottleneckTransformer.
    https://github.com/PaddlePaddle/PaddleDetection/blob/develop/ppdet/modeling/backbones/hgnet_v2.py
    """

    def __init__(self, c1, cm, c2, k=3, n=6, lightconv=False, shortcut=False, act=nn.ReLU()):
        """Initialize the HG block and the BottleneckTransformer applied to its output."""
        super().__init__()
        block = LightConv if lightconv else Conv
        self.m = nn.ModuleList(block(c1 if i == 0 else cm, cm, k=k, act=act) for i in range(n))
        self.sc = Conv(c1 + n * cm, c2 // 2, 1, 1, act=act)  # squeeze conv
        self.ec = Conv(c2 // 2, c2, 1, 1, act=act)  # excitation conv
        self.add = shortcut and c1 == c2
        # self.cv receives the c2-channel output of self.ec, so its input width is c2
        # (identical to the original BottleneckTransformer(c1, c2) whenever c1 == c2).
        self.cv = BottleneckTransformer(c2, c2)

    def forward(self, x):
        """Forward pass of a PPHGNetV2 backbone layer."""
        y = [x]
        y.extend(m(y[-1]) for m in self.m)
        y = self.cv(self.ec(self.sc(torch.cat(y, 1))))
        return y + x if self.add else y
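4. Novel Modules
4.1 Improvements ⭐
Module improvement methods:
1️⃣ Add the BottleneckTransformer module.
[Figure: model structure after adding the BottleneckTransformer module]
2️⃣ Add an HGBlock built on the BottleneckTransformer module. Using BottleneckTransformer to modify the HGBlock module yields a hybrid structure: it exploits the efficiency of convolutions at learning abstract, low-resolution feature maps, and it uses global self-attention to process and aggregate the information in the feature maps that the convolutions capture.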
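A rough way to see why self-attention is applied only to low-resolution feature maps: all2all attention builds a (H·W) × (H·W) logit matrix per head, so its size grows with the fourth power of the feature-map side length. The sketch below counts those logits at a few strides. The 640×640 input size is an assumption (it is not stated in this post), and attention_logits is a hypothetical helper, not part of Ultralytics.

```python
def attention_logits(img_size: int, stride: int, heads: int = 4) -> int:
    """Count attention logits for all2all attention on a square feature map."""
    side = img_size // stride        # feature-map height and width
    tokens = side * side             # one token per spatial position
    return heads * tokens * tokens   # one (tokens x tokens) matrix per head


if __name__ == "__main__":
    # Assumed 640x640 input; strides 8, 16 and 32 correspond to P3, P4 and P5.
    for stride in (8, 16, 32):
        print(f"stride {stride:>2}: {attention_logits(640, stride):,} logits")
```

In the YAML in Section 6, HGBlock_BoT sits after the P4/16 downsampling layer, where this count is 16 times smaller than it would be at P3/8.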
The modified code is as follows:
Modify the HGBlock module by adding the BottleneckTransformer module, and rename it HGBlock_BoT.
class HGBlock_BoT(nn.Module):
    """
    HG_Block of PPHGNetV2 with 2 convolutions and LightConv, followed by a BottleneckTransformer.
    https://github.com/PaddlePaddle/PaddleDetection/blob/develop/ppdet/modeling/backbones/hgnet_v2.py
    """

    def __init__(self, c1, cm, c2, k=3, n=6, lightconv=False, shortcut=False, act=nn.ReLU()):
        """Initialize the HG block and the BottleneckTransformer applied to its output."""
        super().__init__()
        block = LightConv if lightconv else Conv
        self.m = nn.ModuleList(block(c1 if i == 0 else cm, cm, k=k, act=act) for i in range(n))
        self.sc = Conv(c1 + n * cm, c2 // 2, 1, 1, act=act)  # squeeze conv
        self.ec = Conv(c2 // 2, c2, 1, 1, act=act)  # excitation conv
        self.add = shortcut and c1 == c2
        # self.cv receives the c2-channel output of self.ec, so its input width is c2
        # (identical to the original BottleneckTransformer(c1, c2) whenever c1 == c2).
        self.cv = BottleneckTransformer(c2, c2)

    def forward(self, x):
        """Forward pass of a PPHGNetV2 backbone layer."""
        y = [x]
        y.extend(m(y[-1]) for m in self.m)
        y = self.cv(self.ec(self.sc(torch.cat(y, 1))))
        return y + x if self.add else y
Note ❗: the module name to declare in Sections 5.2 and 5.3 is HGBlock_BoT.
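5. Steps to Add the Module
5.1 Modification 1
① Create an AddModules folder under ultralytics/nn/ to hold the module code.
② Create BoT.py in the AddModules folder and paste the code from Section 3 into it.
5.2 Modification 2
In the AddModules folder, create __init__.py (skip this if it already exists) and import the module in it: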
from .BoT import *
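Steps 5.1 and 5.2 can also be scripted. The following is a minimal sketch using only the standard library. add_bot_module is a hypothetical helper name, and repo_root is assumed to be the root of a local Ultralytics source checkout. It creates the AddModules folder, creates an empty BoT.py if none exists, and appends the import line to __init__.py only when it is not already there.

```python
from pathlib import Path

IMPORT_LINE = "from .BoT import *"


def add_bot_module(repo_root: str) -> Path:
    """Create ultralytics/nn/AddModules with BoT.py and an __init__.py import."""
    pkg = Path(repo_root) / "ultralytics" / "nn" / "AddModules"
    pkg.mkdir(parents=True, exist_ok=True)

    # Placeholder file; paste the Section 3 code into it afterwards.
    (pkg / "BoT.py").touch(exist_ok=True)

    init_file = pkg / "__init__.py"
    existing = init_file.read_text(encoding="utf-8") if init_file.exists() else ""
    if IMPORT_LINE not in existing.splitlines():
        # Append rather than overwrite so imports of other modules are preserved.
        prefix = existing if existing.endswith("\n") or not existing else existing + "\n"
        init_file.write_text(prefix + IMPORT_LINE + "\n", encoding="utf-8")
    return pkg
```

Running it twice is safe: the second call finds the import line and leaves __init__.py unchanged.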
5.3 Modification 3
In ultralytics/nn/tasks.py, the module class names must be added in two places.
First, import the module.
Second, register HGBlock_BoT in the parse_model function:
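The screenshots for these two edits are not reproduced here, so the following is a sketch under assumptions rather than the exact change. The import is typically a line such as from ultralytics.nn.AddModules import * at the top of tasks.py (assumed from the folder created in 5.1). For the registration, the model printout in Section 7 shows HGBlock_BoT receiving arguments in the same layout as HGBlock, which suggests adding it to the parse_model branch that handles HGBlock; check this against your installed version. The stand-alone function below imitates that argument handling (build_hgblock_args is a hypothetical name, not an Ultralytics function) to show how a YAML row becomes constructor arguments.

```python
def build_hgblock_args(c_in: int, yaml_args: list, repeats: int) -> list:
    """Imitate how an HGBlock-style YAML row is expanded into constructor args.

    yaml_args is [cm, c2, k, ...] as written in the model YAML; the result is
    [c1, cm, c2, k, n, ...], matching HGBlock_BoT.__init__(c1, cm, c2, k, n, ...).
    """
    cm, c2 = yaml_args[0], yaml_args[1]
    args = [c_in, cm, c2, *yaml_args[2:]]
    args.insert(4, repeats)  # the YAML "repeats" column becomes the n argument
    return args


if __name__ == "__main__":
    # Row from the YAML in Section 6: [-1, 6, HGBlock_BoT, [192, 512, 5, True, False]]
    print(build_hgblock_args(512, [192, 512, 5, True, False], repeats=6))
    # prints [512, 192, 512, 5, 6, True, False]
```

The printed list matches the arguments column for layer 5 in the Section 7 printout.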
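6. YAML Model File
6.1 Improved Model Version
Taking ultralytics/cfg/models/rt-detr/rtdetr-l.yaml as an example, create a model file named rtdetr-l-HGBlock_BoT.yaml in the same directory for training on your own dataset.
Copy the contents of rtdetr-l.yaml into rtdetr-l-HGBlock_BoT.yaml and set nc to the number of object classes in your dataset.
📌 The model is modified by adding the HGBlock_BoT module to the backbone (stage 3, layers 5 to 7 in the YAML below).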
# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-l object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/models/rtdetr

# Parameters
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
  # [depth, width, max_channels]
  l: [1.00, 1.00, 1024]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, HGStem, [32, 48]] # 0-P2/4
  - [-1, 6, HGBlock, [48, 128, 3]] # stage 1

  - [-1, 1, DWConv, [128, 3, 2, 1, False]] # 2-P3/8
  - [-1, 6, HGBlock, [96, 512, 3]] # stage 2

  - [-1, 1, DWConv, [512, 3, 2, 1, False]] # 4-P4/16
  - [-1, 6, HGBlock_BoT, [192, 512, 5, True, False]] # cm, c2, k, light, shortcut
  - [-1, 6, HGBlock_BoT, [192, 512, 5, True, True]]
  - [-1, 6, HGBlock_BoT, [192, 512, 5, True, True]] # stage 3

  - [-1, 1, DWConv, [1024, 3, 2, 1, False]] # 8-P5/32
  - [-1, 6, HGBlock, [384, 2048, 5, True, False]] # stage 4

head:
  - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 10 input_proj.2
  - [-1, 1, AIFI, [1024, 8]]
  - [-1, 1, Conv, [256, 1, 1]] # 12, Y5, lateral_convs.0

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [7, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 14 input_proj.1
  - [[-2, -1], 1, Concat, [1]]
  - [-1, 3, RepC3, [256]] # 16, fpn_blocks.0
  - [-1, 1, Conv, [256, 1, 1]] # 17, Y4, lateral_convs.1

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [3, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 19 input_proj.0
  - [[-2, -1], 1, Concat, [1]] # cat backbone P4
  - [-1, 3, RepC3, [256]] # X3 (21), fpn_blocks.1

  - [-1, 1, Conv, [256, 3, 2]] # 22, downsample_convs.0
  - [[-1, 17], 1, Concat, [1]] # cat Y4
  - [-1, 3, RepC3, [256]] # F4 (24), pan_blocks.0

  - [-1, 1, Conv, [256, 3, 2]] # 25, downsample_convs.1
  - [[-1, 12], 1, Concat, [1]] # cat Y5
  - [-1, 3, RepC3, [256]] # F5 (27), pan_blocks.1

  - [[21, 24, 27], 1, RTDETRDecoder, [nc]] # Detect(P3, P4, P5)
7. Successful Run
Printing the network shows that HGBlock_BoT has been added to the model, which is now ready for training.
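One way to check the model is to build it from the new YAML through the Ultralytics Python API. This is a sketch, assuming the ultralytics package with the modifications from Section 5 is importable and that the path below matches where you saved the YAML file. The dataset file name in the training comment is a placeholder.

```python
def build_model(cfg: str = "ultralytics/cfg/models/rt-detr/rtdetr-l-HGBlock_BoT.yaml"):
    """Build RT-DETR from a YAML config and print its summary."""
    # Imported lazily so this file can be loaded even where ultralytics is absent.
    from ultralytics import RTDETR

    model = RTDETR(cfg)  # builds the architecture from the YAML
    model.info()         # prints the summary line (layers, parameters, GFLOPs)
    return model


# Typical use (not executed here):
#   model = build_model()
#   model.train(data="your_dataset.yaml", epochs=100, imgsz=640)
```

If the module was not registered correctly in Section 5, building the model is where the error will surface, so this is a quick check before starting a long training run.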
rtdetr-l-HGBlock_BoT :
rtdetr-l-HGBlock_BoT summary: 716 layers, 33,435,331 parameters, 33,435,331 gradients, 107.6 GFLOPs
              from  n    params  module                                       arguments
  0             -1  1     25248  ultralytics.nn.modules.block.HGStem          [3, 32, 48]
  1             -1  6    155072  ultralytics.nn.modules.block.HGBlock         [48, 48, 128, 3, 6]
  2             -1  1      1408  ultralytics.nn.modules.conv.DWConv           [128, 128, 3, 2, 1, False]
  3             -1  6    839296  ultralytics.nn.modules.block.HGBlock         [128, 96, 512, 3, 6]
  4             -1  1      5632  ultralytics.nn.modules.conv.DWConv           [512, 512, 3, 2, 1, False]
  5             -1  6   2188416  ultralytics.nn.AddModules.BoT.HGBlock_BoT    [512, 192, 512, 5, 6, True, False]
  6             -1  6   2188416  ultralytics.nn.AddModules.BoT.HGBlock_BoT    [512, 192, 512, 5, 6, True, True]
  7             -1  6   2188416  ultralytics.nn.AddModules.BoT.HGBlock_BoT    [512, 192, 512, 5, 6, True, True]
  8             -1  1     11264  ultralytics.nn.modules.conv.DWConv           [512, 1024, 3, 2, 1, False]
  9             -1  6   6708480  ultralytics.nn.modules.block.HGBlock         [1024, 384, 2048, 5, 6, True, False]
 10             -1  1    524800  ultralytics.nn.modules.conv.Conv             [2048, 256, 1, 1, None, 1, 1, False]
 11             -1  1    789760  ultralytics.nn.modules.transformer.AIFI      [256, 1024, 8]
 12             -1  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]
 13             -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 14              7  1    131584  ultralytics.nn.modules.conv.Conv             [512, 256, 1, 1, None, 1, 1, False]
 15       [-2, -1]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 16             -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]
 17             -1  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]
 18             -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 19              3  1    131584  ultralytics.nn.modules.conv.Conv             [512, 256, 1, 1, None, 1, 1, False]
 20       [-2, -1]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 21             -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]
 22             -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]
 23       [-1, 17]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 24             -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]
 25             -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]
 26       [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 27             -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]
 28   [21, 24, 27]  1   7303907  ultralytics.nn.modules.head.RTDETRDecoder    [1, [256, 256, 256]]