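RT-DETR improvement strategies [model lightweighting] | MobileNetV3: a lightweight network built with search techniques and novel architecture design
1. Introduction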
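This article documents a method for making RT-DETR object detection lighter by using MobileNet V3 as the backbone. The MobileNet V3 architecture was obtained through network search. Its basic building block combines the depthwise separable convolution of MobileNet V1, the linear bottleneck and inverted residual structure of MobileNet V2, and the lightweight squeeze-and-excitation attention module from MnasNet, which gives the model a good balance of accuracy, efficiency, and flexibility.
The replacement backbone is configured with both the small and large models from the original paper, to suit different needs.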
| Model | Parameters | FLOPs | Inference time |
|---|---|---|---|
| rtdetr-l | 32.8M | 108.0GFLOPs | 11.6ms |
| Improved | 19.4M | 61.3GFLOPs | 10.6ms |
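The relative savings implied by the table can be worked out directly. The following is a minimal sketch that uses only the figures reported above; the dictionary and function names are chosen here for illustration.

```python
# Figures copied from the comparison table above.
baseline = {"params_M": 32.8, "gflops": 108.0, "latency_ms": 11.6}
improved = {"params_M": 19.4, "gflops": 61.3, "latency_ms": 10.6}


def reduction_percent(before: float, after: float) -> float:
    """Relative reduction in percent, rounded to one decimal place."""
    return round((before - after) / before * 100, 1)


reductions = {key: reduction_percent(baseline[key], improved[key]) for key in baseline}
print(reductions)
```

Parameters and FLOPs drop by roughly 41% and 43%, while the reported inference time improves by a little under 9%.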
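2. MobileNet V3 design principles
MobileNet V3 is a next-generation network built on a set of complementary search techniques and novel architecture design. Its design principles and advantages are summarized below.
2.1 Principles
2.1.1 Network search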
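- Platform-aware NAS: searches the global network structure by optimizing each network block. For the large mobile model, the MnasNet-A1 structure is reused as the starting point, and NetAdapt and other optimizations are applied on top of it. For the small mobile model, the authors observed that the original reward design was not tuned for small models, so they adjusted the weight factor w and reran the architecture search to find the initial seed model.
- NetAdapt: searches the number of filters layer by layer and complements platform-aware NAS. Starting from the seed architecture found by platform-aware NAS, it generates new proposals, selects the best one according to a metric, and fine-tunes individual layers step by step until the target latency is reached. For proposal selection, the algorithm was modified to minimize the ratio of latency change to accuracy change; in the paper's terms, it picks the proposal that maximizes ΔAcc / |Δlatency|.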
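The ratio-based selection rule can be sketched as follows. This only illustrates the criterion ΔAcc / |Δlatency| and is not the NetAdapt implementation; the Proposal class, the proposal names, and all numbers are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    name: str
    delta_acc: float      # accuracy change vs. the current model (usually negative)
    delta_latency: float  # latency change in ms (negative means faster)


def pick_proposal(proposals):
    """Return the proposal with the largest delta_acc / |delta_latency|."""
    valid = [p for p in proposals if p.delta_latency != 0]
    return max(valid, key=lambda p: p.delta_acc / abs(p.delta_latency))


# Made-up numbers, for illustration only.
candidates = [
    Proposal("shrink_layer_3", delta_acc=-0.30, delta_latency=-1.0),
    Proposal("shrink_layer_7", delta_acc=-0.10, delta_latency=-0.8),
    Proposal("shrink_layer_9", delta_acc=-0.50, delta_latency=-1.2),
]
best = pick_proposal(candidates)
print(best.name)
```

With these numbers the second candidate wins because it loses the least accuracy per millisecond saved, even though it is not the proposal with the largest latency gain.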
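2.1.2 Network improvements
- Redesigning expensive layers: some expensive layers at the end and at the beginning of the network are modified. At the end, the layer that produces the final features is moved after the final average pooling, which lowers latency while keeping the high-dimensional features. The projection and filtering layers of the preceding bottleneck are then no longer needed and are removed, which further reduces computation. For the initial filter layer, experiments showed that using the hard-swish nonlinearity and reducing the number of filters from 32 to 16 keeps accuracy while reducing latency and computation.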
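- Nonlinearity: introduces h-swish, a modified version of the swish nonlinearity that is faster to compute and more quantization-friendly. The sigmoid is replaced with a piecewise-linear hard analogue, giving h-swish[x] = x * ReLU6(x + 3) / 6. h-swish is used only in the second half of the network, which reduces the computational cost with no noticeable accuracy difference from the original version.
- Large squeeze-and-excite: the size of the squeeze-and-excite bottleneck is fixed to 1/4 of the number of channels in the expansion layer. This improves accuracy with only a modest increase in parameters and no discernible latency cost.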
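To make the h-swish formula concrete, the sketch below evaluates swish and h-swish side by side on a few scalar inputs using only the standard library. The function names are chosen for this example; in the PyTorch code of Section 3 the same function is provided by nn.Hardswish.

```python
import math


def relu6(x: float) -> float:
    """Clamp x to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def swish(x: float) -> float:
    """Original swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def h_swish(x: float) -> float:
    """Hard swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  swish={swish(x):+.4f}  h-swish={h_swish(x):+.4f}")
```

h-swish is exactly zero for x <= -3 and exactly x for x >= 3, and it needs no exponential, which is what makes it cheaper and easier to quantize.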
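2.1.3 Efficient mobile building blocks
The building block combines the depthwise separable convolution of MobileNet V1, the linear bottleneck and inverted residual structure of MobileNet V2, and the lightweight squeeze-and-excitation attention module from MnasNet. These layers are also upgraded with the modified swish nonlinearity to improve efficiency.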
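To illustrate why the depthwise separable convolution is cheap, the sketch below compares the number of convolution weights with a standard convolution of the same shape. The channel and kernel sizes are illustrative choices, and bias and batch-norm parameters are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a standard k x k convolution."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a depthwise k x k convolution followed by a 1 x 1 convolution."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


c_in, c_out, k = 240, 240, 5  # illustrative sizes
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(std, sep, round(std / sep, 1))
```

For these sizes the separable version needs about 22.6 times fewer weights than the standard convolution.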
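2.2 Advantages
- Through network search and architectural improvements, MobileNet V3 combines the strengths of several techniques. It offers a good trade-off between accuracy, efficiency, and flexibility, and suits a range of computer vision tasks on mobile devices. The paper defines two models, MobileNetV3-Large and MobileNetV3-Small, targeting high-resource and low-resource use cases respectively, so the variant can be chosen to match the requirements.
Paper: https://arxiv.org/abs/1905.02244.pdf
Source code: https://github.com/d-li14/mobilenetv3.pytorch
3. MobileNetV3 module implementation
The implementation of the MobileNetV3 module is as follows: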
"""A from-scratch implementation of MobileNetV3 paper ( for educational purposes ).

Paper
    Searching for MobileNetV3 - https://arxiv.org/abs/1905.02244v5

author : shubham.aiengineer@gmail.com
"""

import torch
from torch import nn

# from torchsummary import summary


class SqueezeExitationBlock(nn.Module):
    def __init__(self, in_channels: int):
        """Constructor for SqueezeExitationBlock.

        Args:
            in_channels (int): Number of input channels.
        """
        super().__init__()

        self.pool1 = nn.AdaptiveAvgPool2d(1)
        self.linear1 = nn.Linear(
            in_channels, in_channels // 4
        )  # divide by 4 is mentioned in the paper, 5.3. Large squeeze-and-excite
        self.act1 = nn.ReLU()
        self.linear2 = nn.Linear(in_channels // 4, in_channels)
        self.act2 = nn.Hardsigmoid()

    def forward(self, x):
        """Forward pass for SqueezeExitationBlock."""
        identity = x

        x = self.pool1(x)
        x = torch.flatten(x, 1)
        x = self.linear1(x)
        x = self.act1(x)
        x = self.linear2(x)
        x = self.act2(x)

        x = identity * x[:, :, None, None]

        return x


class ConvNormActivationBlock(nn.Module):
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: list,
        stride: int = 1,
        padding: int = 0,
        groups: int = 1,
        bias: bool = False,
        activation: torch.nn = nn.Hardswish,
    ):
        """Constructs a block containing a convolution, batch normalization and activation layer

        Args:
            in_channels (int): number of input channels
            out_channels (int): number of output channels
            kernel_size (list): size of the convolutional kernel
            stride (int, optional): stride of the convolutional kernel. Defaults to 1.
            padding (int, optional): padding of the convolutional kernel. Defaults to 0.
            groups (int, optional): number of groups for depthwise separable convolution. Defaults to 1.
            bias (bool, optional): whether to use bias. Defaults to False.
            activation (torch.nn, optional): activation function. Defaults to nn.Hardswish.
        """
        super().__init__()

        self.conv = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size,
            stride=stride,
            padding=padding,
            groups=groups,
            bias=bias,
        )
        self.norm = nn.BatchNorm2d(out_channels)
        self.activation = activation()

    def forward(self, x):
        """Perform forward pass."""
        x = self.conv(x)
        x = self.norm(x)
        x = self.activation(x)

        return x


class InverseResidualBlock(nn.Module):
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int,
        expansion_size: int = 6,
        stride: int = 1,
        squeeze_exitation: bool = True,
        activation: nn.Module = nn.Hardswish,
    ):
        """Constructs an inverse residual block

        Args:
            in_channels (int): number of input channels
            out_channels (int): number of output channels
            kernel_size (int): size of the convolutional kernel
            expansion_size (int, optional): size of the expansion factor. Defaults to 6.
            stride (int, optional): stride of the convolutional kernel. Defaults to 1.
            squeeze_exitation (bool, optional): whether to add squeeze and exitation block or not. Defaults to True.
            activation (nn.Module, optional): activation function. Defaults to nn.Hardswish.
        """
        super().__init__()

        # The skip connection is only valid when input and output shapes match
        self.residual = in_channels == out_channels and stride == 1
        self.squeeze_exitation = squeeze_exitation

        self.conv1 = (
            ConvNormActivationBlock(
                in_channels, expansion_size, (1, 1), activation=activation
            )
            if in_channels != expansion_size
            else nn.Identity()
        )  # If it's not the first layer, then we need to add a 1x1 convolutional layer to expand the number of channels
        self.depthwise_conv = ConvNormActivationBlock(
            expansion_size,
            expansion_size,
            (kernel_size, kernel_size),
            stride=stride,
            padding=kernel_size // 2,
            groups=expansion_size,
            activation=activation,
        )
        if self.squeeze_exitation:
            self.se = SqueezeExitationBlock(expansion_size)

        self.conv2 = nn.Conv2d(
            expansion_size, out_channels, (1, 1), bias=False
        )  # bias is false because we are using batch normalization, which already has bias
        self.norm = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        """Perform forward pass."""
        identity = x

        x = self.conv1(x)
        x = self.depthwise_conv(x)

        if self.squeeze_exitation:
            x = self.se(x)

        x = self.conv2(x)
        x = self.norm(x)

        if self.residual:
            x = x + identity

        return x


class MobileNetV3(nn.Module):
    def __init__(
        self,
        n_classes: int = 1000,
        input_channel: int = 3,
        config: str = "small",
        dropout: float = 0.8,
    ):
        """Constructs MobileNetV3 architecture

        Args:
            `n_classes`: An integer count of output neuron in last layer, default 1000
            `input_channel`: An integer value input channels in first conv layer, default is 3.
            `config`: A string value indicating the configuration of MobileNetV3, either `large` or `small`, default is `small`.
            `dropout` [0, 1] : A float parameter for dropout in last layer, between 0 and 1, default is 0.8.

        Note: `n_classes` and `dropout` are not used in this backbone variant,
        which has no classifier head.
        """
        super().__init__()

        # The configuration of MobileNetv3.
        # input channels, kernel size, expansion size, output channels, squeeze exitation, activation, stride
        RE = nn.ReLU
        HS = nn.Hardswish
        configs_dict = {
            "small": (
                (16, 3, 16, 16, True, RE, 2),
                (16, 3, 72, 24, False, RE, 2),
                (24, 3, 88, 24, False, RE, 1),
                (24, 5, 96, 40, True, HS, 2),
                (40, 5, 240, 40, True, HS, 1),
                (40, 5, 240, 40, True, HS, 1),
                (40, 5, 120, 48, True, HS, 1),
                (48, 5, 144, 48, True, HS, 1),
                (48, 5, 288, 96, True, HS, 2),
                (96, 5, 576, 96, True, HS, 1),
                (96, 5, 576, 96, True, HS, 1),
            ),
            "large": (
                (16, 3, 16, 16, False, RE, 1),
                (16, 3, 64, 24, False, RE, 2),
                (24, 3, 72, 24, False, RE, 1),
                (24, 5, 72, 40, True, RE, 2),
                (40, 5, 120, 40, True, RE, 1),
                (40, 5, 120, 40, True, RE, 1),
                (40, 3, 240, 80, False, HS, 2),
                (80, 3, 200, 80, False, HS, 1),
                (80, 3, 184, 80, False, HS, 1),
                (80, 3, 184, 80, False, HS, 1),
                (80, 3, 480, 112, True, HS, 1),
                (112, 3, 672, 112, True, HS, 1),
                (112, 5, 672, 160, True, HS, 2),
                (160, 5, 960, 160, True, HS, 1),
                (160, 5, 960, 160, True, HS, 1),
            ),
        }

        # Stem: 3x3 convolution with stride 2 and 16 filters
        self.model = nn.Sequential(
            ConvNormActivationBlock(
                input_channel, 16, (3, 3), stride=2, padding=1, activation=nn.Hardswish
            ),
        )

        for (
            in_channels,
            kernel_size,
            expansion_size,
            out_channels,
            squeeze_exitation,
            activation,
            stride,
        ) in configs_dict[config]:
            self.model.append(
                InverseResidualBlock(
                    in_channels=in_channels,
                    out_channels=out_channels,
                    kernel_size=kernel_size,
                    expansion_size=expansion_size,
                    stride=stride,
                    squeeze_exitation=squeeze_exitation,
                    activation=activation,
                )
            )

        hidden_channels = 576 if config == "small" else 960
        _out_channel = 1024 if config == "small" else 1280  # unused: classifier head is omitted

        self.model.append(
            ConvNormActivationBlock(
                out_channels,
                hidden_channels,
                (1, 1),
                bias=False,
                activation=nn.Hardswish,
            )
        )

        # Channel counts of the four feature maps returned to the detector
        if config == 'small':
            self.index = [16, 24, 48, 576]
        else:
            self.index = [24, 40, 112, 960]

        # Run a dummy forward pass to record the output channel widths
        self.width_list = [i.size(1) for i in self.forward(torch.randn(1, input_channel, 640, 640))]

    def forward(self, x):
        """Perform forward pass."""
        results = [None, None, None, None]

        for model in self.model:
            x = model(x)
            if x.size(1) in self.index:
                position = self.index.index(x.size(1))  # Find the position in the index list
                results[position] = x
                # results.append(x)

        return results
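The forward pass selects feature maps by channel count, so it helps to check which stride each selected map ends up with. The sketch below replays the small configuration in plain Python without torch. The feature_levels helper and the string activation labels are made up for this example, and the stem and final 1x1 layer are assumed to match the code above (16 channels at stride 2, and 576 channels at unchanged resolution).

```python
# (in_channels, kernel, expansion, out_channels, use_se, activation, stride)
SMALL = (
    (16, 3, 16, 16, True, "RE", 2),
    (16, 3, 72, 24, False, "RE", 2),
    (24, 3, 88, 24, False, "RE", 1),
    (24, 5, 96, 40, True, "HS", 2),
    (40, 5, 240, 40, True, "HS", 1),
    (40, 5, 240, 40, True, "HS", 1),
    (40, 5, 120, 48, True, "HS", 1),
    (48, 5, 144, 48, True, "HS", 1),
    (48, 5, 288, 96, True, "HS", 2),
    (96, 5, 576, 96, True, "HS", 1),
    (96, 5, 576, 96, True, "HS", 1),
)


def feature_levels(config, index, stem_channels=16, stem_stride=2, last_channels=576):
    """Return (channels, total_stride) for each entry of `index`.

    Like forward(), a later layer with the same channel count
    overwrites an earlier one.
    """
    stride = stem_stride
    latest = {stem_channels: stride}
    for _in, _k, _exp, out_channels, _se, _act, s in config:
        stride *= s
        latest[out_channels] = stride
    latest[last_channels] = stride  # the final 1x1 conv keeps the resolution
    return [(c, latest[c]) for c in index]


levels = feature_levels(SMALL, index=[16, 24, 48, 576])
print(levels)
print([640 // s for _, s in levels])  # spatial size for a 640 x 640 input
```

For the small model the four returned maps sit at strides 4, 8, 16, and 32, which is why the last three can serve as the inputs of the RT-DETR head.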
4. Modification steps
4.1 Step 1
① Create a new AddModules folder under ultralytics/nn/ to hold the module code.
② Create MoblieNetV3.py in the AddModules folder and paste the code from Section 3 into it.
4.2 Step 2
Create __init__.py in the AddModules folder (skip this if it already exists) and import the module in it:
from .MoblieNetV3 import *
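4.3 Step 3
In ultralytics/nn/tasks.py, the module class name needs to be added in two places.
① First, import the module at the top of the file, e.g. from ultralytics.nn.AddModules import *
② Next, add the following branch inside the model-parsing function (parse_model):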
elif m in {MobileNetV3}:
    m = m(*args)
    c2 = m.width_list
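This uses the MobileNetV3 small model. To use the MobileNetV3 large model instead, find class MobileNetV3(nn.Module): in the source code and change config: str = "small" to config: str = "large".
③ Then replace the whole block that was highlighted (red box in the original screenshot) with the following code: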
# NOTE: this block reads is_backbone (for non-backbone layers) and t (for the
# backbone layer), so both must already be defined earlier in parse_model.
if isinstance(c2, list):
    is_backbone = True
    m_ = m
    m_.backbone = True
else:
    m_ = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args)  # module
    t = str(m)[8:-2].replace('__main__.', '')  # module type

m.np = sum(x.numel() for x in m_.parameters())  # number params
m_.i, m_.f, m_.type = i + 4 if is_backbone else i, f, t  # attach index, 'from' index, type
if verbose:
    LOGGER.info(f'{i:>3}{str(f):>20}{n_:>3}{m.np:10.0f}  {t:<45}{str(args):<30}')  # print
save.extend(x % (i + 4 if is_backbone else i) for x in ([f] if isinstance(f, int) else f) if
            x != -1)  # append to savelist
layers.append(m_)
if i == 0:
    ch = []
if isinstance(c2, list):
    ch.extend(c2)
    for _ in range(5 - len(ch)):
        ch.insert(0, 0)
else:
    ch.append(c2)
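The channel bookkeeping at the end of this block is easier to follow in isolation. The sketch below mimics only the ch handling in plain Python; extend_channels is a name invented for this example, and the width list is assumed to be the one produced by the small model.

```python
def extend_channels(ch, c2):
    """Mimic how the replaced block updates the channel list for one layer."""
    if isinstance(c2, list):          # backbone: one entry per returned feature map
        ch.extend(c2)
        for _ in range(5 - len(ch)):  # left-pad with zeros up to 5 entries
            ch.insert(0, 0)
    else:                             # ordinary layer: a single channel count
        ch.append(c2)
    return ch


ch = extend_channels([], [16, 24, 48, 576])  # assumed width_list of the small model
print(ch)

ch = extend_channels(ch, 256)  # first head layer (256 output channels)
print(ch, len(ch) - 1)
```

After the backbone, the list holds five entries, so the single backbone module occupies layer indices 0 to 4. A head layer that uses 3 or 2 as its from index therefore reads the 48-channel or 24-channel feature map, and the first head layer gets index 5, which matches the i + 4 offset in the code above.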
④ In the same file, find the _predict_once method of BaseModel and replace it with the following code.
def _predict_once(self, x, profile=False, visualize=False, embed=None):
    y, dt, embeddings = [], [], []  # outputs
    for m in self.model:
        if m.f != -1:  # if not from previous layer
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
        if profile:
            self._profile_one_layer(m, x, dt)
        if hasattr(m, 'backbone'):
            x = m(x)
            if len(x) != 5:  # 0 - 5
                x.insert(0, None)
            for index, i in enumerate(x):
                if index in self.save:
                    y.append(i)
                else:
                    y.append(None)
            x = x[-1]  # pass the last output on to the next layer
        else:
            x = m(x)  # run
            y.append(x if m.i in self.save else None)  # save output
        if visualize:
            feature_visualization(x, m.type, m.i, save_dir=visualize)
        if embed and m.i in embed:
            embeddings.append(nn.functional.adaptive_avg_pool2d(x, (1, 1)).squeeze(-1).squeeze(-1))  # flatten
            if m.i == max(embed):
                return torch.unbind(torch.cat(embeddings, 1), dim=0)
    return x
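The backbone branch of _predict_once can be traced with placeholders instead of tensors. In the sketch below, run_backbone_step is a helper invented for this example, the strings P2 to P5 stand in for the four feature maps, and the save set is an assumption about what the YAML in Section 5 would produce.

```python
def run_backbone_step(features, save):
    """Mimic the `hasattr(m, 'backbone')` branch for one forward pass."""
    x = list(features)
    if len(x) != 5:
        x.insert(0, None)  # pad so that the indices run from 0 to 4
    # Keep only the outputs that later layers refer to by index
    y = [f if idx in save else None for idx, f in enumerate(x)]
    return x[-1], y


# Assumed save list; strings stand in for tensors.
save = {2, 3, 7, 8, 12, 13, 16, 19, 22}
x, y = run_backbone_step(["P2", "P3", "P4", "P5"], save)
print(x)
print(y)
```

The last feature map flows on to the next layer as x, while the maps at indices 2 and 3 are kept in y so that later head layers can fetch them by index.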
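That completes the code changes; the model can now be configured and trained.
5. YAML model file
5.1 Model modification ⭐
After the code changes, configure the model YAML file.
Taking ultralytics/cfg/models/rt-detr/rtdetr-l.yaml as an example, create a model file rtdetr-MoblieNetv3.yaml in the same directory for training on your own dataset.
Copy the contents of rtdetr-l.yaml into rtdetr-MoblieNetv3.yaml and set nc to the number of object classes in your dataset.
📌 The modification replaces the backbone network with MobileNetV3.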
# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-l object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/models/rtdetr

# Parameters
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
  # [depth, width, max_channels]
  l: [1.00, 1.00, 1024]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, MobileNetV3, []] # 4

head:
  - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 5 input_proj.2
  - [-1, 1, AIFI, [1024, 8]] # 6
  - [-1, 1, Conv, [256, 1, 1]] # 7, Y5, lateral_convs.0

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 8
  - [3, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 9 input_proj.1
  - [[-2, -1], 1, Concat, [1]] # 10
  - [-1, 3, RepC3, [256]] # 11, fpn_blocks.0
  - [-1, 1, Conv, [256, 1, 1]] # 12, Y4, lateral_convs.1

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 13
  - [2, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 14 input_proj.0
  - [[-2, -1], 1, Concat, [1]] # 15 cat backbone P4
  - [-1, 3, RepC3, [256]] # X3 (16), fpn_blocks.1

  - [-1, 1, Conv, [256, 3, 2]] # 17, downsample_convs.0
  - [[-1, 12], 1, Concat, [1]] # 18 cat Y4
  - [-1, 3, RepC3, [256]] # F4 (19), pan_blocks.0

  - [-1, 1, Conv, [256, 3, 2]] # 20, downsample_convs.1
  - [[-1, 7], 1, Concat, [1]] # 21 cat Y5
  - [-1, 3, RepC3, [256]] # F5 (22), pan_blocks.1

  - [[16, 19, 22], 1, RTDETRDecoder, [nc]] # Detect(P3, P4, P5)
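With the YAML in place, building and training the model might look like the sketch below. It assumes the ultralytics package with the modifications from Section 4 is installed. The build_and_train wrapper, the dataset file name my_dataset.yaml, and the epoch count are placeholders to adapt, not values from this article.

```python
def build_and_train(
    cfg="ultralytics/cfg/models/rt-detr/rtdetr-MoblieNetv3.yaml",
    data="my_dataset.yaml",  # placeholder: path to your own dataset YAML
    epochs=100,              # placeholder value
    imgsz=640,
):
    """Build the modified RT-DETR from the YAML file and start training."""
    # Imported lazily so that this file can be loaded without ultralytics installed
    from ultralytics import RTDETR

    model = RTDETR(cfg)
    model.info()  # prints the layer and parameter summary
    return model.train(data=data, epochs=epochs, imgsz=imgsz)


# Example call (uncomment once the dataset YAML exists):
# build_and_train(data="path/to/your_dataset.yaml")
```

Calling model.info() before training is a quick way to confirm that MobileNetV3 appears as the first layer, as in the output shown in the next section.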
6. Successful run output
Printing the network model shows that the MobileNetV3 module has been added to the model and that training can proceed.
rtdetr-MobileNetV3:
rtdetr-MobileNetV3 summary: 552 layers, 19,424,529 parameters, 19,424,529 gradients, 61.3 GFLOPs
from n params module arguments
0 -1 1 921390 MobileNetV3 []
1 -1 1 147968 ultralytics.nn.modules.conv.Conv [576, 256, 1, 1, None, 1, 1, False]
2 -1 1 789760 ultralytics.nn.modules.transformer.AIFI [256, 1024, 8]
3 -1 1 66048 ultralytics.nn.modules.conv.Conv [256, 256, 1, 1]
4 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
5 3 1 12800 ultralytics.nn.modules.conv.Conv [48, 256, 1, 1, None, 1, 1, False]
6 [-2, -1] 1 0 ultralytics.nn.modules.conv.Concat [1]
7 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
8 -1 1 66048 ultralytics.nn.modules.conv.Conv [256, 256, 1, 1]
9 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
10 2 1 6656 ultralytics.nn.modules.conv.Conv [24, 256, 1, 1, None, 1, 1, False]
11 [-2, -1] 1 0 ultralytics.nn.modules.conv.Concat [1]
12 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
13 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
14 [-1, 12] 1 0 ultralytics.nn.modules.conv.Concat [1]
15 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
16 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
17 [-1, 7] 1 0 ultralytics.nn.modules.conv.Concat [1]
18 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
19 [16, 19, 22] 1 7303907 ultralytics.nn.modules.head.RTDETRDecoder [1, [256, 256, 256]]