RT-DETR Improvement Strategy [Exclusive Fusion Improvement] | MobileNetV4 + BiFPN: Lightweight Backbone + Weighted Feature Fusion for Fewer Parameters and Higher Accuracy
1. Introduction
This article documents a lightweight improvement of the RT-DETR object detector built on MobileNetV4 and BiFPN.
MobileNetV4 lowers computational cost without sacrificing accuracy. BiFPN, through optimized cross-scale connections and weighted feature fusion, fuses multi-scale features more effectively and strengthens feature representation. Combining the two delivers fewer parameters and higher accuracy.
2. MobileNetV4 Design Principles
MobileNetV4: Universal Models for the Mobile Ecosystem
MobileNetV4 is a family of universally efficient models for the mobile ecosystem. The following sections describe the motivation, principles, structure, and advantages of its lightweight design.
2.1 Design Motivation
- Balancing accuracy and efficiency: mobile devices have limited compute, so models must stay accurate while becoming efficient enough for fast, real-time, interactive experiences, and they keep private data from being sent over public networks.
- Hardware generality: the models are designed to be efficient across diverse mobile hardware (CPUs, DSPs, GPUs, and various accelerators), so that they run well on a wide range of devices.
2.2 Design Principles
- Roofline-model-based analysis (a small sketch of the operational-intensity calculation follows this list)
  - Understanding hardware bottlenecks: the roofline model compares a layer's operational intensity, LayerMACs_i / (WeightBytes_i + ActivationBytes_i), against the theoretical limits of the processor and memory system, to determine whether a model is limited by memory bandwidth or by compute on a given piece of hardware.
  - Optimization strategy: the architecture is shaped around each hardware class (on low-RP hardware, reduce MACs to gain speed; on high-RP hardware, exploit the small data-movement bottleneck to add model capacity), so that MobileNetV4 stays close to Pareto-optimal across ridge points (RP) from 0 to 500 MACs/byte.
- Attention mechanism optimization
  - Operational intensity matters: accelerator compute has grown far faster than memory bandwidth, so the attention design explicitly considers operational intensity, i.e. the ratio of arithmetic operations to memory accesses.
  - MQA mechanism: Mobile MQA shares keys and values across heads to cut memory-bandwidth demand and raise operational intensity, and additionally uses strategies such as asymmetric spatial downsampling to improve efficiency further.
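To make the roofline analysis concrete, below is a minimal sketch (not from the paper) of estimating operational intensity for a single convolution layer; the layer shapes and the 500 MACs/byte ridge point are illustrative assumptions.

# Minimal sketch: estimating operational intensity (MACs per byte moved) for a conv layer.
# The layer shapes below are illustrative assumptions, not values from the MobileNetV4 paper.
def conv_operational_intensity(c_in, c_out, k, h_out, w_out, bytes_per_elem=1):
    macs = c_in * c_out * k * k * h_out * w_out                    # LayerMACs_i
    weight_bytes = c_in * c_out * k * k * bytes_per_elem           # WeightBytes_i
    act_bytes = (c_in + c_out) * h_out * w_out * bytes_per_elem    # ActivationBytes_i (in + out, ignoring stride)
    return macs / (weight_bytes + act_bytes)

ridge_point = 500  # example ridge point in MACs/byte (upper end of the range studied in the paper)
oi = conv_operational_intensity(c_in=64, c_out=128, k=3, h_out=40, w_out=40)
print(f"operational intensity ~ {oi:.1f} MACs/byte ->",
      "compute-bound" if oi > ridge_point else "memory-bandwidth-bound")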
2.3 Structure
2.3.1 Universal Inverted Bottleneck (UIB) Block
- Structural features: UIB is a unified, flexible block that extends MobileNet's inverted bottleneck (IB) by adding two optional depthwise convolutions (DW), one before the expansion layer and one between the expansion and projection layers. It unifies the Inverted Bottleneck (IB), ConvNext, the Feed Forward Network (FFN), and a new Extra Depthwise (ExtraDW) variant.
- Block instantiations: the two optional depthwise convolutions yield four possible instantiations, each with a different trade-off. For example, ExtraDW increases network depth and receptive field, combining the strengths of ConvNext-like and IB blocks. A small sketch of the four instantiations follows.
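The sketch below expresses the four instantiations with the UniversalInvertedBottleneckBlock class defined in Section 4 (assumed to be in scope); the mapping of kernel-size arguments to variant names is this article's reading of the block, and the kernel sizes themselves are example values.

# Sketch: the four UIB instantiations. A kernel size of 0 disables the corresponding optional
# depthwise conv; variant names and kernel sizes are illustrative, not an official configuration.
import torch

x = torch.randn(1, 64, 40, 40)
variants = {
    "IB (middle DW only)":         dict(start_dw_kernel_size=0, middle_dw_kernel_size=3),
    "ConvNext-like (start DW)":    dict(start_dw_kernel_size=3, middle_dw_kernel_size=0),
    "FFN (no DW, 1x1 convs only)": dict(start_dw_kernel_size=0, middle_dw_kernel_size=0),
    "ExtraDW (both DWs)":          dict(start_dw_kernel_size=3, middle_dw_kernel_size=5),
}
for name, kw in variants.items():
    block = UniversalInvertedBottleneckBlock(inp=64, oup=64, middle_dw_downsample=True,
                                             stride=1, expand_ratio=4, **kw)
    print(name, block(x).shape)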
2.3.2 Mobile MQA Block
- Base structure: an attention block that simplifies multi-head self-attention (MHSA) by sharing keys and values across heads, which reduces memory-bandwidth demand.
- Optimized structure: it further applies asymmetric spatial downsampling (in the spirit of SRA), downsampling the key and value resolution in the optimized MQA block while keeping the query at full resolution, which improves model efficiency. A rough estimate of the resulting savings follows.
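A back-of-the-envelope sketch (illustrative numbers, not figures from the paper) of why sharing K/V across heads and downsampling them reduces memory traffic:

# Back-of-the-envelope sketch: K/V element counts under multi-head attention vs. shared
# multi-query attention, with and without 2x spatial downsampling of keys/values.
h, w, num_heads, key_dim = 24, 24, 4, 64

mhsa_kv_elems   = 2 * h * w * num_heads * key_dim    # per-head K and V
mqa_kv_elems    = 2 * h * w * key_dim                # single shared K and V
mqa_ds_kv_elems = 2 * (h // 2) * (w // 2) * key_dim  # shared K/V, 2x downsampled

print("MHSA K/V elements:            ", mhsa_kv_elems)
print("Mobile MQA K/V elements:      ", mqa_kv_elems, f"({mhsa_kv_elems / mqa_kv_elems:.0f}x smaller)")
print("Mobile MQA + downsampled K/V: ", mqa_ds_kv_elems, f"({mhsa_kv_elems / mqa_ds_kv_elems:.0f}x smaller)")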
2.4 Advantages
- Performance
  - Pareto-optimal: by integrating UIB, Mobile MQA, and a refined NAS strategy, MobileNetV4 models are mostly Pareto-optimal across mobile CPUs, DSPs, GPUs, and various accelerators, i.e. no single metric can be improved further without degrading another.
  - Cross-hardware consistency: the models behave consistently across hardware platforms, which the other tested models do not. For example, on ImageNet-1K classification, MNv4-Conv-M is more than 50% faster than MobileOne-S4 and FastViT-S12, and at comparable latency its Top-1 accuracy is 1.5% higher than MobileNetV2's.
- Efficiency
  - Computational efficiency: the UIB block offers flexible spatial and channel mixing and can optionally enlarge the receptive field, improving computational efficiency. For example, the ExtraDW variant adds network depth and receptive field without significantly increasing compute.
  - Inference speed: the Mobile MQA block delivers more than a 39% inference speedup on mobile accelerators, substantially improving runtime efficiency.
- Model construction
  - NAS optimization: a refined neural architecture search (NAS) strategy, with a two-stage search (coarse-grained then fine-grained) and an offline distillation dataset, improves search efficiency and model quality and enables building larger models than previous state-of-the-art approaches.
  - Distillation: a new distillation technique dynamically mixes datasets with different augmentation strategies and adds class-balanced in-class data, further improving accuracy and generalization. For example, MNv4-Hybrid-Large reaches 87% accuracy on ImageNet-1K while running in only 3.8 ms on a Pixel 8 EdgeTPU.
Paper: https://arxiv.org/pdf/2404.10518
Source code: https://github.com/tensorflow/models/blob/master/official/vision/modeling/backbones/mobilenet.py
3. BiFPN Overview
EfficientDet: Scalable and Efficient Object Detection
BiFPN (weighted bi-directional feature pyramid network) is the structure proposed in that paper for efficient multi-scale feature fusion. Its design principles and advantages are as follows.
3.1 BiFPN Principle
- Problem formulation: multi-scale feature fusion aims to aggregate features at different resolutions. Given a list of multi-scale features $\overrightarrow{P}_{in} = (P_{in}^{1}, P_{in}^{2}, \ldots)$, where $P_{in}^{i}$ denotes the feature at level $i$, the goal is to find a transformation $f$ such that $\overrightarrow{P}_{out} = f(\overrightarrow{P}_{in})$. The conventional FPN, for example, aggregates multi-scale features in a top-down manner: $P_{7}^{out} = Conv(P_{7}^{in})$, $P_{6}^{out} = Conv(P_{6}^{in} + Resize(P_{7}^{out}))$, and so on. A minimal sketch of this top-down aggregation follows.
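The following is a minimal PyTorch sketch of the top-down FPN aggregation written above; the channel count and feature sizes are illustrative assumptions.

# Minimal sketch of top-down FPN aggregation for two adjacent levels.
import torch
import torch.nn as nn
import torch.nn.functional as F

c = 64
conv6, conv7 = nn.Conv2d(c, c, 3, padding=1), nn.Conv2d(c, c, 3, padding=1)
p6_in, p7_in = torch.randn(1, c, 20, 20), torch.randn(1, c, 10, 10)

p7_out = conv7(p7_in)                                                            # P7_out = Conv(P7_in)
p6_out = conv6(p6_in + F.interpolate(p7_out, scale_factor=2, mode="nearest"))    # P6_out = Conv(P6_in + Resize(P7_out))
print(p7_out.shape, p6_out.shape)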
3.1.1 Cross-Scale Connections
- To overcome the limitation of the conventional FPN's one-way (top-down) information flow, PANet adds an extra bottom-up path-aggregation network, and other work has explored cross-scale connections further. After studying the performance and efficiency of FPN, PANet, and NAS-FPN, the paper optimizes the cross-scale connections as follows:
  - Remove nodes that have only a single input edge, since such nodes contribute little to a feature network whose purpose is fusing different features; this yields a simplified bi-directional network.
  - Add an extra edge from the original input to the output node when they sit at the same level, so that more features are fused without adding much cost.
  - Treat each bi-directional (top-down and bottom-up) path as one feature-network layer and repeat that layer multiple times to enable more high-level feature fusion; hence the name bi-directional feature pyramid network (BiFPN).
3.1.2 Weighted Feature Fusion
When fusing features of different resolutions, the common approach is to resize them to the same resolution and then add them, but this treats every input as equally important. To address this, BiFPN adds a learnable weight to each input so that the network learns the importance of each input feature. Three weighted-fusion schemes were considered (a small numerical comparison follows this list):
- Unbounded fusion: $O = \sum_{i} w_{i} \cdot I_{i}$, where $w_{i}$ is a learnable weight that can be a scalar (per feature), a vector (per channel), or a multi-dimensional tensor (per pixel). Because the weights are unbounded, weight normalization is needed to constrain their range and avoid training instability.
- Softmax-based fusion: $O = \sum_{i} \frac{e^{w_{i}}}{\sum_{j} e^{w_{j}}} \cdot I_{i}$, which applies a softmax to the weights so that they become probabilities between 0 and 1 representing the importance of each input. The softmax, however, causes a noticeable slowdown on GPU hardware.
- Fast normalized fusion: $O = \sum_{i} \frac{w_{i}}{\epsilon + \sum_{j} w_{j}} \cdot I_{i}$, where a ReLU is applied after each $w_{i}$ to ensure $w_{i} \geq 0$, and $\epsilon = 0.0001$ is a small constant to avoid numerical instability. Each normalized weight again lies between 0 and 1, but there is no softmax operation, so this variant is more efficient.
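A small numerical comparison of the softmax-based and fast normalized weights; the raw weight values are arbitrary examples.

# Compare softmax-based vs. fast normalized fusion weights for the same raw weights.
import torch
import torch.nn.functional as F

w = torch.tensor([0.5, 1.0, 2.0])
softmax_w = torch.softmax(w, dim=0)
fast_w = F.relu(w) / (F.relu(w).sum() + 1e-4)

print("softmax weights:        ", softmax_w.tolist())
print("fast normalized weights:", fast_w.tolist())
print("both sum to ~1:", softmax_w.sum().item(), fast_w.sum().item())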
The final BiFPN integrates bi-directional cross-scale connections with fast normalized fusion. For example, the two fused features at level 6 of the BiFPN in Figure 2(d) are computed as:
- $P_{6}^{td} = Conv\left(\frac{w_{1} \cdot P_{6}^{in} + w_{2} \cdot Resize(P_{7}^{in})}{w_{1} + w_{2} + \epsilon}\right)$
- $P_{6}^{out} = Conv\left(\frac{w_{1}' \cdot P_{6}^{in} + w_{2}' \cdot P_{6}^{td} + w_{3}' \cdot Resize(P_{5}^{out})}{w_{1}' + w_{2}' + w_{3}' + \epsilon}\right)$
To improve efficiency further, depthwise separable convolutions are used for feature fusion, with batch normalization and an activation added after each convolution. A minimal sketch of one such fusion node follows.
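Below is a minimal sketch of a single BiFPN fusion node, combining fast normalized weights with a depthwise separable convolution, batch normalization, and activation. It is an illustrative re-implementation, not the EfficientDet reference code, and the channel count is an assumption.

# Minimal sketch of one BiFPN fusion node (fast normalized weights + depthwise separable conv).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiFPNFusionNode(nn.Module):
    def __init__(self, channels, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))   # one learnable weight per input branch
        self.eps = eps
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
        self.pw = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, inputs):  # inputs: list of tensors already resized to the same shape
        w = F.relu(self.w)
        w = w / (w.sum() + self.eps)                     # fast normalized fusion weights
        fused = sum(w[i] * inputs[i] for i in range(len(inputs)))
        return self.act(self.bn(self.pw(self.dw(fused))))

node = BiFPNFusionNode(channels=256, num_inputs=2)
p6_in, p7_up = torch.randn(1, 256, 20, 20), torch.randn(1, 256, 20, 20)
print(node([p6_in, p7_up]).shape)   # corresponds to P6_td in the formula above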
3.2 Advantages
- By optimizing cross-scale connections and weighted feature fusion, BiFPN fuses multi-scale features more effectively and improves feature representation.
- Compared with other feature networks, BiFPN reaches similar accuracy with fewer parameters and less computation, improving model efficiency.
- The fast normalized fusion shows learning behavior and accuracy very close to softmax-based fusion while running faster on GPUs.
Paper: https://arxiv.org/pdf/1911.09070.pdf
Source code: https://github.com/google/automl/tree/master/efficientdet
4. Implementation of MobileNetV4 and BiFPN
The implementation of the MobileNetV4 module is as follows:
from typing import Optional
import torch
import torch.nn as nn
import torch.nn.functional as F
__all__ = ['MobileNetV4ConvLarge', 'MobileNetV4ConvSmall', 'MobileNetV4ConvMedium', 'MobileNetV4HybridMedium', 'MobileNetV4HybridLarge']
MNV4ConvSmall_BLOCK_SPECS = {
"conv0": {
"block_name": "convbn",
"num_blocks": 1,
"block_specs": [
[3, 32, 3, 2]
]
},
"layer1": {
"block_name": "convbn",
"num_blocks": 2,
"block_specs": [
[32, 32, 3, 2],
[32, 32, 1, 1]
]
},
"layer2": {
"block_name": "convbn",
"num_blocks": 2,
"block_specs": [
[32, 96, 3, 2],
[96, 64, 1, 1]
]
},
"layer3": {
"block_name": "uib",
"num_blocks": 6,
"block_specs": [
[64, 96, 5, 5, True, 2, 3],
[96, 96, 0, 3, True, 1, 2],
[96, 96, 0, 3, True, 1, 2],
[96, 96, 0, 3, True, 1, 2],
[96, 96, 0, 3, True, 1, 2],
[96, 96, 3, 0, True, 1, 4],
]
},
"layer4": {
"block_name": "uib",
"num_blocks": 6,
"block_specs": [
[96, 128, 3, 3, True, 2, 6],
[128, 128, 5, 5, True, 1, 4],
[128, 128, 0, 5, True, 1, 4],
[128, 128, 0, 5, True, 1, 3],
[128, 128, 0, 3, True, 1, 4],
[128, 128, 0, 3, True, 1, 4],
]
},
"layer5": {
"block_name": "convbn",
"num_blocks": 2,
"block_specs": [
[128, 960, 1, 1],
[960, 1280, 1, 1]
]
}
}
MNV4ConvMedium_BLOCK_SPECS = {
"conv0": {
"block_name": "convbn",
"num_blocks": 1,
"block_specs": [
[3, 32, 3, 2]
]
},
"layer1": {
"block_name": "fused_ib",
"num_blocks": 1,
"block_specs": [
[32, 48, 2, 4.0, True]
]
},
"layer2": {
"block_name": "uib",
"num_blocks": 2,
"block_specs": [
[48, 80, 3, 5, True, 2, 4],
[80, 80, 3, 3, True, 1, 2]
]
},
"layer3": {
"block_name": "uib",
"num_blocks": 8,
"block_specs": [
[80, 160, 3, 5, True, 2, 6],
[160, 160, 3, 3, True, 1, 4],
[160, 160, 3, 3, True, 1, 4],
[160, 160, 3, 5, True, 1, 4],
[160, 160, 3, 3, True, 1, 4],
[160, 160, 3, 0, True, 1, 4],
[160, 160, 0, 0, True, 1, 2],
[160, 160, 3, 0, True, 1, 4]
]
},
"layer4": {
"block_name": "uib",
"num_blocks": 11,
"block_specs": [
[160, 256, 5, 5, True, 2, 6],
[256, 256, 5, 5, True, 1, 4],
[256, 256, 3, 5, True, 1, 4],
[256, 256, 3, 5, True, 1, 4],
[256, 256, 0, 0, True, 1, 4],
[256, 256, 3, 0, True, 1, 4],
[256, 256, 3, 5, True, 1, 2],
[256, 256, 5, 5, True, 1, 4],
[256, 256, 0, 0, True, 1, 4],
[256, 256, 0, 0, True, 1, 4],
[256, 256, 5, 0, True, 1, 2]
]
},
"layer5": {
"block_name": "convbn",
"num_blocks": 2,
"block_specs": [
[256, 960, 1, 1],
[960, 1280, 1, 1]
]
}
}
MNV4ConvLarge_BLOCK_SPECS = {
"conv0": {
"block_name": "convbn",
"num_blocks": 1,
"block_specs": [
[3, 24, 3, 2]
]
},
"layer1": {
"block_name": "fused_ib",
"num_blocks": 1,
"block_specs": [
[24, 48, 2, 4.0, True]
]
},
"layer2": {
"block_name": "uib",
"num_blocks": 2,
"block_specs": [
[48, 96, 3, 5, True, 2, 4],
[96, 96, 3, 3, True, 1, 4]
]
},
"layer3": {
"block_name": "uib",
"num_blocks": 11,
"block_specs": [
[96, 192, 3, 5, True, 2, 4],
[192, 192, 3, 3, True, 1, 4],
[192, 192, 3, 3, True, 1, 4],
[192, 192, 3, 3, True, 1, 4],
[192, 192, 3, 5, True, 1, 4],
[192, 192, 5, 3, True, 1, 4],
[192, 192, 5, 3, True, 1, 4],
[192, 192, 5, 3, True, 1, 4],
[192, 192, 5, 3, True, 1, 4],
[192, 192, 5, 3, True, 1, 4],
[192, 192, 3, 0, True, 1, 4]
]
},
"layer4": {
"block_name": "uib",
"num_blocks": 13,
"block_specs": [
[192, 512, 5, 5, True, 2, 4],
[512, 512, 5, 5, True, 1, 4],
[512, 512, 5, 5, True, 1, 4],
[512, 512, 5, 5, True, 1, 4],
[512, 512, 5, 0, True, 1, 4],
[512, 512, 5, 3, True, 1, 4],
[512, 512, 5, 0, True, 1, 4],
[512, 512, 5, 0, True, 1, 4],
[512, 512, 5, 3, True, 1, 4],
[512, 512, 5, 5, True, 1, 4],
[512, 512, 5, 0, True, 1, 4],
[512, 512, 5, 0, True, 1, 4],
[512, 512, 5, 0, True, 1, 4]
]
},
"layer5": {
"block_name": "convbn",
"num_blocks": 2,
"block_specs": [
[512, 960, 1, 1],
[960, 1280, 1, 1]
]
}
}
def mhsa(num_heads, key_dim, value_dim, px):
if px == 24:
kv_strides = 2
elif px == 12:
kv_strides = 1
query_h_strides = 1
query_w_strides = 1
use_layer_scale = True
use_multi_query = True
use_residual = True
return [
num_heads, key_dim, value_dim, query_h_strides, query_w_strides, kv_strides,
use_layer_scale, use_multi_query, use_residual
]
MNV4HybridConvMedium_BLOCK_SPECS = {
"conv0": {
"block_name": "convbn",
"num_blocks": 1,
"block_specs": [
[3, 32, 3, 2]
]
},
"layer1": {
"block_name": "fused_ib",
"num_blocks": 1,
"block_specs": [
[32, 48, 2, 4.0, True]
]
},
"layer2": {
"block_name": "uib",
"num_blocks": 2,
"block_specs": [
[48, 80, 3, 5, True, 2, 4],
[80, 80, 3, 3, True, 1, 2]
]
},
"layer3": {
"block_name": "uib",
"num_blocks": 8,
"block_specs": [
[80, 160, 3, 5, True, 2, 6],
[160, 160, 0, 0, True, 1, 2],
[160, 160, 3, 3, True, 1, 4],
[160, 160, 3, 5, True, 1, 4, mhsa(4, 64, 64, 24)],
[160, 160, 3, 3, True, 1, 4, mhsa(4, 64, 64, 24)],
[160, 160, 3, 0, True, 1, 4, mhsa(4, 64, 64, 24)],
[160, 160, 3, 3, True, 1, 4, mhsa(4, 64, 64, 24)],
[160, 160, 3, 0, True, 1, 4]
]
},
"layer4": {
"block_name": "uib",
"num_blocks": 12,
"block_specs": [
[160, 256, 5, 5, True, 2, 6],
[256, 256, 5, 5, True, 1, 4],
[256, 256, 3, 5, True, 1, 4],
[256, 256, 3, 5, True, 1, 4],
[256, 256, 0, 0, True, 1, 2],
[256, 256, 3, 5, True, 1, 2],
[256, 256, 0, 0, True, 1, 2],
[256, 256, 0, 0, True, 1, 4, mhsa(4, 64, 64, 12)],
[256, 256, 3, 0, True, 1, 4, mhsa(4, 64, 64, 12)],
[256, 256, 5, 5, True, 1, 4, mhsa(4, 64, 64, 12)],
[256, 256, 5, 0, True, 1, 4, mhsa(4, 64, 64, 12)],
[256, 256, 5, 0, True, 1, 4]
]
},
"layer5": {
"block_name": "convbn",
"num_blocks": 2,
"block_specs": [
[256, 960, 1, 1],
[960, 1280, 1, 1]
]
}
}
MNV4HybridConvLarge_BLOCK_SPECS = {
"conv0": {
"block_name": "convbn",
"num_blocks": 1,
"block_specs": [
[3, 24, 3, 2]
]
},
"layer1": {
"block_name": "fused_ib",
"num_blocks": 1,
"block_specs": [
[24, 48, 2, 4.0, True]
]
},
"layer2": {
"block_name": "uib",
"num_blocks": 2,
"block_specs": [
[48, 96, 3, 5, True, 2, 4],
[96, 96, 3, 3, True, 1, 4]
]
},
"layer3": {
"block_name": "uib",
"num_blocks": 11,
"block_specs": [
[96, 192, 3, 5, True, 2, 4],
[192, 192, 3, 3, True, 1, 4],
[192, 192, 3, 3, True, 1, 4],
[192, 192, 3, 3, True, 1, 4],
[192, 192, 3, 5, True, 1, 4],
[192, 192, 5, 3, True, 1, 4],
[192, 192, 5, 3, True, 1, 4, mhsa(8, 48, 48, 24)],
[192, 192, 5, 3, True, 1, 4, mhsa(8, 48, 48, 24)],
[192, 192, 5, 3, True, 1, 4, mhsa(8, 48, 48, 24)],
[192, 192, 5, 3, True, 1, 4, mhsa(8, 48, 48, 24)],
[192, 192, 3, 0, True, 1, 4]
]
},
"layer4": {
"block_name": "uib",
"num_blocks": 14,
"block_specs": [
[192, 512, 5, 5, True, 2, 4],
[512, 512, 5, 5, True, 1, 4],
[512, 512, 5, 5, True, 1, 4],
[512, 512, 5, 5, True, 1, 4],
[512, 512, 5, 0, True, 1, 4],
[512, 512, 5, 3, True, 1, 4],
[512, 512, 5, 0, True, 1, 4],
[512, 512, 5, 0, True, 1, 4],
[512, 512, 5, 3, True, 1, 4],
[512, 512, 5, 5, True, 1, 4, mhsa(8, 64, 64, 12)],
[512, 512, 5, 0, True, 1, 4, mhsa(8, 64, 64, 12)],
[512, 512, 5, 0, True, 1, 4, mhsa(8, 64, 64, 12)],
[512, 512, 5, 0, True, 1, 4, mhsa(8, 64, 64, 12)],
[512, 512, 5, 0, True, 1, 4]
]
},
"layer5": {
"block_name": "convbn",
"num_blocks": 2,
"block_specs": [
[512, 960, 1, 1],
[960, 1280, 1, 1]
]
}
}
MODEL_SPECS = {
"MobileNetV4ConvSmall": MNV4ConvSmall_BLOCK_SPECS,
"MobileNetV4ConvMedium": MNV4ConvMedium_BLOCK_SPECS,
"MobileNetV4ConvLarge": MNV4ConvLarge_BLOCK_SPECS,
"MobileNetV4HybridMedium": MNV4HybridConvMedium_BLOCK_SPECS,
"MobileNetV4HybridLarge": MNV4HybridConvLarge_BLOCK_SPECS
}
def make_divisible(
value: float,
divisor: int,
min_value: Optional[float] = None,
round_down_protect: bool = True,
) -> int:
"""
This function is copied from here
"https://github.com/tensorflow/models/blob/master/official/vision/modeling/layers/nn_layers.py"
This is to ensure that all layers have channels that are divisible by 8.
Args:
value: A `float` of original value.
divisor: An `int` of the divisor that need to be checked upon.
min_value: A `float` of minimum value threshold.
round_down_protect: A `bool` indicating whether round down more than 10%
will be allowed.
Returns:
The adjusted value in `int` that is divisible against divisor.
"""
if min_value is None:
min_value = divisor
new_value = max(min_value, int(value + divisor / 2) // divisor * divisor)
# Make sure that round down does not go down by more than 10%.
if round_down_protect and new_value < 0.9 * value:
new_value += divisor
return int(new_value)
def conv_2d(inp, oup, kernel_size=3, stride=1, groups=1, bias=False, norm=True, act=True):
conv = nn.Sequential()
padding = (kernel_size - 1) // 2
conv.add_module('conv', nn.Conv2d(inp, oup, kernel_size, stride, padding, bias=bias, groups=groups))
if norm:
conv.add_module('BatchNorm2d', nn.BatchNorm2d(oup))
if act:
conv.add_module('Activation', nn.ReLU6())
return conv
class InvertedResidual(nn.Module):
def __init__(self, inp, oup, stride, expand_ratio, act=False, squeeze_excitation=False):
super(InvertedResidual, self).__init__()
self.stride = stride
assert stride in [1, 2]
hidden_dim = int(round(inp * expand_ratio))
self.block = nn.Sequential()
if expand_ratio != 1:
self.block.add_module('exp_1x1', conv_2d(inp, hidden_dim, kernel_size=3, stride=stride))
if squeeze_excitation:
self.block.add_module('conv_3x3',
conv_2d(hidden_dim, hidden_dim, kernel_size=3, stride=stride, groups=hidden_dim))
self.block.add_module('red_1x1', conv_2d(hidden_dim, oup, kernel_size=1, stride=1, act=act))
self.use_res_connect = self.stride == 1 and inp == oup
def forward(self, x):
if self.use_res_connect:
return x + self.block(x)
else:
return self.block(x)
class UniversalInvertedBottleneckBlock(nn.Module):
def __init__(self,
inp,
oup,
start_dw_kernel_size,
middle_dw_kernel_size,
middle_dw_downsample,
stride,
expand_ratio
):
"""An inverted bottleneck block with optional depthwises.
Referenced from here https://github.com/tensorflow/models/blob/master/official/vision/modeling/layers/nn_blocks.py
"""
super().__init__()
# Starting depthwise conv.
self.start_dw_kernel_size = start_dw_kernel_size
if self.start_dw_kernel_size:
stride_ = stride if not middle_dw_downsample else 1
self._start_dw_ = conv_2d(inp, inp, kernel_size=start_dw_kernel_size, stride=stride_, groups=inp, act=False)
# Expansion with 1x1 convs.
expand_filters = make_divisible(inp * expand_ratio, 8)
self._expand_conv = conv_2d(inp, expand_filters, kernel_size=1)
# Middle depthwise conv.
self.middle_dw_kernel_size = middle_dw_kernel_size
if self.middle_dw_kernel_size:
stride_ = stride if middle_dw_downsample else 1
self._middle_dw = conv_2d(expand_filters, expand_filters, kernel_size=middle_dw_kernel_size, stride=stride_,
groups=expand_filters)
# Projection with 1x1 convs.
self._proj_conv = conv_2d(expand_filters, oup, kernel_size=1, stride=1, act=False)
# Ending depthwise conv.
# this not used
# _end_dw_kernel_size = 0
# self._end_dw = conv_2d(oup, oup, kernel_size=_end_dw_kernel_size, stride=stride, groups=inp, act=False)
def forward(self, x):
if self.start_dw_kernel_size:
x = self._start_dw_(x)
# print("_start_dw_", x.shape)
x = self._expand_conv(x)
# print("_expand_conv", x.shape)
if self.middle_dw_kernel_size:
x = self._middle_dw(x)
# print("_middle_dw", x.shape)
x = self._proj_conv(x)
# print("_proj_conv", x.shape)
return x
class MultiQueryAttentionLayerWithDownSampling(nn.Module):
def __init__(self, inp, num_heads, key_dim, value_dim, query_h_strides, query_w_strides, kv_strides,
dw_kernel_size=3, dropout=0.0):
"""Multi Query Attention with spatial downsampling.
Referenced from here https://github.com/tensorflow/models/blob/master/official/vision/modeling/layers/nn_blocks.py
3 parameters are introduced for the spatial downsampling:
1. kv_strides: downsampling factor on Key and Values only.
2. query_h_strides: vertical strides on Query only.
3. query_w_strides: horizontal strides on Query only.
This is an optimized version.
1. Projections in Attention is explict written out as 1x1 Conv2D.
2. Additional reshapes are introduced to bring a up to 3x speed up.
"""
super().__init__()
self.num_heads = num_heads
self.key_dim = key_dim
self.value_dim = value_dim
self.query_h_strides = query_h_strides
self.query_w_strides = query_w_strides
self.kv_strides = kv_strides
self.dw_kernel_size = dw_kernel_size
self.dropout = dropout
self.head_dim = key_dim // num_heads
if self.query_h_strides > 1 or self.query_w_strides > 1:
self._query_downsampling_norm = nn.BatchNorm2d(inp)
self._query_proj = conv_2d(inp, num_heads * key_dim, 1, 1, norm=False, act=False)
if self.kv_strides > 1:
self._key_dw_conv = conv_2d(inp, inp, dw_kernel_size, kv_strides, groups=inp, norm=True, act=False)
self._value_dw_conv = conv_2d(inp, inp, dw_kernel_size, kv_strides, groups=inp, norm=True, act=False)
self._key_proj = conv_2d(inp, key_dim, 1, 1, norm=False, act=False)
self._value_proj = conv_2d(inp, key_dim, 1, 1, norm=False, act=False)
self._output_proj = conv_2d(num_heads * key_dim, inp, 1, 1, norm=False, act=False)
self.dropout = nn.Dropout(p=dropout)
def forward(self, x):
batch_size, seq_length, _, _ = x.size()
if self.query_h_strides > 1 or self.query_w_strides > 1:
            # downsample the query spatially before projecting it
            q = F.avg_pool2d(x, kernel_size=(self.query_h_strides, self.query_w_strides))
q = self._query_downsampling_norm(q)
q = self._query_proj(q)
else:
q = self._query_proj(x)
px = q.size(2)
q = q.view(batch_size, self.num_heads, -1, self.key_dim) # [batch_size, num_heads, seq_length, key_dim]
if self.kv_strides > 1:
k = self._key_dw_conv(x)
k = self._key_proj(k)
v = self._value_dw_conv(x)
v = self._value_proj(v)
else:
k = self._key_proj(x)
v = self._value_proj(x)
        k = k.view(batch_size, 1, self.key_dim, -1)  # [batch_size, 1, key_dim, seq_length]; single K shared across heads
        v = v.view(batch_size, 1, -1, self.key_dim)  # [batch_size, 1, seq_length, key_dim]; single V shared across heads
# calculate attn score
attn_score = torch.matmul(q, k) / (self.head_dim ** 0.5)
attn_score = self.dropout(attn_score)
attn_score = F.softmax(attn_score, dim=-1)
context = torch.matmul(attn_score, v)
context = context.view(batch_size, self.num_heads * self.key_dim, px, px)
output = self._output_proj(context)
return output
class MNV4LayerScale(nn.Module):
def __init__(self, init_value):
"""LayerScale as introduced in CaiT: https://arxiv.org/abs/2103.17239
Referenced from here https://github.com/tensorflow/models/blob/master/official/vision/modeling/layers/nn_blocks.py
As used in MobileNetV4.
Attributes:
init_value (float): value to initialize the diagonal matrix of LayerScale.
"""
super().__init__()
self.init_value = init_value
def forward(self, x):
gamma = self.init_value * torch.ones(x.size(-1), dtype=x.dtype, device=x.device)
return x * gamma
class MultiHeadSelfAttentionBlock(nn.Module):
def __init__(
self,
inp,
num_heads,
key_dim,
value_dim,
query_h_strides,
query_w_strides,
kv_strides,
use_layer_scale,
use_multi_query,
use_residual=True
):
super().__init__()
self.query_h_strides = query_h_strides
self.query_w_strides = query_w_strides
self.kv_strides = kv_strides
self.use_layer_scale = use_layer_scale
self.use_multi_query = use_multi_query
self.use_residual = use_residual
self._input_norm = nn.BatchNorm2d(inp)
if self.use_multi_query:
self.multi_query_attention = MultiQueryAttentionLayerWithDownSampling(
inp, num_heads, key_dim, value_dim, query_h_strides, query_w_strides, kv_strides
)
else:
self.multi_head_attention = nn.MultiheadAttention(inp, num_heads, kdim=key_dim)
if self.use_layer_scale:
self.layer_scale_init_value = 1e-5
self.layer_scale = MNV4LayerScale(self.layer_scale_init_value)
def forward(self, x):
# Not using CPE, skipped
# input norm
shortcut = x
x = self._input_norm(x)
# multi query
if self.use_multi_query:
x = self.multi_query_attention(x)
else:
x = self.multi_head_attention(x, x)
# layer scale
if self.use_layer_scale:
x = self.layer_scale(x)
# use residual
if self.use_residual:
x = x + shortcut
return x
def build_blocks(layer_spec):
if not layer_spec.get('block_name'):
return nn.Sequential()
block_names = layer_spec['block_name']
layers = nn.Sequential()
if block_names == "convbn":
schema_ = ['inp', 'oup', 'kernel_size', 'stride']
for i in range(layer_spec['num_blocks']):
args = dict(zip(schema_, layer_spec['block_specs'][i]))
layers.add_module(f"convbn_{i}", conv_2d(**args))
elif block_names == "uib":
schema_ = ['inp', 'oup', 'start_dw_kernel_size', 'middle_dw_kernel_size', 'middle_dw_downsample', 'stride',
'expand_ratio', 'msha']
for i in range(layer_spec['num_blocks']):
args = dict(zip(schema_, layer_spec['block_specs'][i]))
msha = args.pop("msha") if "msha" in args else 0
layers.add_module(f"uib_{i}", UniversalInvertedBottleneckBlock(**args))
if msha:
msha_schema_ = [
"inp", "num_heads", "key_dim", "value_dim", "query_h_strides", "query_w_strides", "kv_strides",
"use_layer_scale", "use_multi_query", "use_residual"
]
args = dict(zip(msha_schema_, [args['oup']] + (msha)))
layers.add_module(f"msha_{i}", MultiHeadSelfAttentionBlock(**args))
elif block_names == "fused_ib":
schema_ = ['inp', 'oup', 'stride', 'expand_ratio', 'act']
for i in range(layer_spec['num_blocks']):
args = dict(zip(schema_, layer_spec['block_specs'][i]))
layers.add_module(f"fused_ib_{i}", InvertedResidual(**args))
else:
raise NotImplementedError
return layers
class MobileNetV4(nn.Module):
def __init__(self, model):
# MobileNetV4ConvSmall MobileNetV4ConvMedium MobileNetV4ConvLarge
# MobileNetV4HybridMedium MobileNetV4HybridLarge
"""Params to initiate MobilenNetV4
Args:
model : support 5 types of models as indicated in
"https://github.com/tensorflow/models/blob/master/official/vision/modeling/backbones/mobilenet.py"
"""
super().__init__()
assert model in MODEL_SPECS.keys()
self.model = model
self.spec = MODEL_SPECS[self.model]
# conv0
self.conv0 = build_blocks(self.spec['conv0'])
# layer1
self.layer1 = build_blocks(self.spec['layer1'])
# layer2
self.layer2 = build_blocks(self.spec['layer2'])
# layer3
self.layer3 = build_blocks(self.spec['layer3'])
# layer4
self.layer4 = build_blocks(self.spec['layer4'])
# layer5
self.layer5 = build_blocks(self.spec['layer5'])
self.width_list = [i.size(1) for i in self.forward(torch.randn(1, 3, 640, 640))]
def forward(self, x):
x0 = self.conv0(x)
x1 = self.layer1(x0)
x2 = self.layer2(x1)
x3 = self.layer3(x2)
x4 = self.layer4(x3)
# x5 = self.layer5(x4)
# x5 = nn.functional.adaptive_avg_pool2d(x5, 1)
return [x1, x2, x3, x4]
def MobileNetV4ConvSmall():
model = MobileNetV4('MobileNetV4ConvSmall')
return model
def MobileNetV4ConvMedium():
model = MobileNetV4('MobileNetV4ConvMedium')
return model
def MobileNetV4ConvLarge():
model = MobileNetV4('MobileNetV4ConvLarge')
return model
def MobileNetV4HybridMedium():
model = MobileNetV4('MobileNetV4HybridMedium')
return model
def MobileNetV4HybridLarge():
model = MobileNetV4('MobileNetV4HybridLarge')
return model
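A quick sanity check of the backbone above (a sketch, assuming it is run in the same file as the code):

# Sanity check: build the small variant and print the four feature maps returned by forward()
# (strides 4/8/16/32 relative to the 640x640 input); width_list holds their channel counts.
if __name__ == "__main__":
    backbone = MobileNetV4ConvSmall()
    features = backbone(torch.randn(1, 3, 640, 640))
    for f in features:
        print(f.shape)
    print("width_list:", backbone.width_list)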
The implementation of the BiFPN module is as follows:
import torch.nn as nn
import torch
class swish(nn.Module):
def forward(self, x):
return x * torch.sigmoid(x)
class BiFPN(nn.Module):
def __init__(self, length):
super().__init__()
self.weight = nn.Parameter(torch.ones(length, dtype=torch.float32), requires_grad=True)
self.swish = swish()
self.epsilon = 0.0001
def forward(self, x):
weights = self.weight / (torch.sum(self.swish(self.weight), dim=0) + self.epsilon)
weighted_feature_maps = [weights[i] * x[i] for i in range(len(x))]
stacked_feature_maps = torch.stack(weighted_feature_maps, dim=0)
result = torch.sum(stacked_feature_maps, dim=0)
return result
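A minimal usage example of the weighted-fusion module above; the feature shapes are illustrative, and in the YAML below the module receives the layers listed in its `from` field.

# Example: fuse two same-shaped feature maps with one learnable weight per input branch.
if __name__ == "__main__":
    fuse = BiFPN(length=2)
    p4 = torch.randn(1, 256, 40, 40)
    p5_up = torch.randn(1, 256, 40, 40)   # e.g. an upsampled higher-level feature
    out = fuse([p4, p5_up])               # weighted sum, same shape as each input
    print(out.shape)                      # torch.Size([1, 256, 40, 40])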
5. Modification Steps
MobileNetV4 modification steps, for reference:
BiFPN modification steps, for reference:
6. YAML Model File
6.1 Model Improvement ⭐
After the code is configured, set up the model's YAML file.
Taking ultralytics/cfg/models/rt-detr/rtdetr-l.yaml as an example, create a model file for training on your own dataset, rtdetr-MobileNetv4-BiFPN.yaml, in the same directory.
Copy the contents of rtdetr-l.yaml into rtdetr-MobileNetv4-BiFPN.yaml, and change nc to the number of object classes in your dataset.
📌 The model configuration is as follows:
# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-l object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/models/rtdetr
# Parameters
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
# [depth, width, max_channels]
l: [1.00, 1.00, 1024]
backbone:
# [from, repeats, module, args]
- [-1, 1, MobileNetV4ConvSmall, []] # 4
head:
- [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 5 input_proj.2
- [-1, 1, AIFI, [1024, 8]]
- [-1, 1, Conv, [256, 1, 1]] # 7, Y5, lateral_convs.0
- [2, 1, Conv, [256]] # 8-P3/8
- [3, 1, Conv, [256]] # 9-P4/16
- [7, 1, Conv, [256]] # 10-P5/32
- [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 11 P5->P4
- [[-1, 9], 1, BiFPN, []] # 12
- [-1, 3, RepC3, [256]] # 13-P4/16
- [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 14 P4->P3
- [[-1, 8], 1, BiFPN, []] # 15
- [-1, 3, RepC3, [256]] # 16-P3/8
- [1, 1, Conv, [256, 3, 2]] # 17 P2->P3
- [[-1, 8, 16], 1, BiFPN, []] # 18
- [-1, 3, RepC3, [256]] # 19-P3/8
- [-1, 1, Conv, [256, 3, 2]] # 20 P3->P4
- [[-1, 9, 13], 1, BiFPN, []] # 21
- [-1, 3, RepC3, [256]] # 22-P4/16
- [-1, 1, Conv, [256, 3, 2]] # 23 P4->P5
- [[-1, 10], 1, BiFPN, []] # 24
- [-1, 3, RepC3, [256]] # 25-P5/32
- [[19, 22, 25], 1, RTDETRDecoder, [nc, 256, 300, 4, 8, 3]] # Detect(P3, P4, P5)
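Once the new modules are registered in the framework (the registration itself follows the modification-step references in Section 5), training with this configuration typically looks like the sketch below; the dataset YAML path and hyperparameters are placeholders, not values used in this article.

# Training sketch. MobileNetV4ConvSmall and BiFPN must already be registered in the
# ultralytics task parser for this YAML to build; data path and hyperparameters are placeholders.
from ultralytics import RTDETR

model = RTDETR("ultralytics/cfg/models/rt-detr/rtdetr-MobileNetv4-BiFPN.yaml")
model.train(data="path/to/your_dataset.yaml", epochs=100, imgsz=640, batch=8)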
7. Successful Run Results
Printing the network shows that the MobileNetV4 and BiFPN modules have been added to the model, and it is ready for training.
rtdetr-MobileNetV4-BiFPN:
from n params module arguments
0 -1 1 2493024 MobileNetV4ConvSmall []
1 -1 1 328192 ultralytics.nn.modules.conv.Conv [1280, 256, 1, 1, None, 1, 1, False]
2 -1 1 789760 ultralytics.nn.modules.transformer.AIFI [256, 1024, 8]
3 -1 1 66048 ultralytics.nn.modules.conv.Conv [256, 256, 1, 1]
4 2 1 16896 ultralytics.nn.modules.conv.Conv [64, 256]
5 3 1 25088 ultralytics.nn.modules.conv.Conv [96, 256]
6 7 1 66048 ultralytics.nn.modules.conv.Conv [256, 256]
7 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
8 [-1, 9] 1 2 ultralytics.nn.AddModules.BiFPN.BiFPN [2]
9 -1 3 2101248 ultralytics.nn.modules.block.RepC3 [256, 256, 3]
10 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
11 [-1, 8] 1 2 ultralytics.nn.AddModules.BiFPN.BiFPN [2]
12 -1 3 2101248 ultralytics.nn.modules.block.RepC3 [256, 256, 3]
13 1 1 74240 ultralytics.nn.modules.conv.Conv [32, 256, 3, 2]
14 [-1, 8, 16] 1 3 ultralytics.nn.AddModules.BiFPN.BiFPN [3]
15 -1 3 2101248 ultralytics.nn.modules.block.RepC3 [256, 256, 3]
16 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
17 [-1, 9, 13] 1 3 ultralytics.nn.AddModules.BiFPN.BiFPN [3]
18 -1 3 2101248 ultralytics.nn.modules.block.RepC3 [256, 256, 3]
19 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
20 [-1, 10] 1 2 ultralytics.nn.AddModules.BiFPN.BiFPN [2]
21 -1 3 2101248 ultralytics.nn.modules.block.RepC3 [256, 256, 3]
22 [19, 22, 25] 1 3917684 ultralytics.nn.modules.head.RTDETRDecoder [1, [256, 256, 256], 256, 300, 4, 8, 3]
rtdetr-MobileNetv4-BiFPN summary: 526 layers, 19,463,904 parameters, 19,463,904 gradients, 97.2 GFLOPs