RT-DETR Improvement Strategies [Backbone] | CVPR 2025: Replacing the Backbone with MambaOut to Remove Redundant Structure and Tap the Potential of Vision Mamba
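1. Introduction
This article documents a method for improving the RT-DETR backbone network based on MambaOut.
MambaOut proposes a model architecture built from Gated CNN blocks, obtained by removing the core token mixer (SSM) from the Mamba block. When MambaOut is applied to RT-DETR, its simple Gated CNN block structure avoids the redundancy that the SSM brings to tasks that are neither long-sequence nor autoregressive, image classification being the clearest case.
On top of RT-DETR, this article configures the five models from the original paper, mambaout_femto, mambaout_kobe, mambaout_tiny, mambaout_small, and mambaout_base, to meet different requirements.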
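As a quick reference, the five variants differ only in per-stage depth and width. The sketch below copies the depths and dims from the factory functions in Section 3 into a plain dictionary and prints the total number of Gated CNN blocks together with the channel widths of the last three stages, which are the ones a detection neck typically consumes. The names VARIANTS and describe are illustrative and not part of MambaOut.

```python
# Depths and dims copied from the factory functions in Section 3.
# VARIANTS and describe() are illustrative helpers, not part of MambaOut.
VARIANTS = {
    "mambaout_femto": {"depths": [3, 3, 9, 3], "dims": [48, 96, 192, 288]},
    "mambaout_kobe": {"depths": [3, 3, 15, 3], "dims": [48, 96, 192, 288]},
    "mambaout_tiny": {"depths": [3, 3, 9, 3], "dims": [96, 192, 384, 576]},
    "mambaout_small": {"depths": [3, 4, 27, 3], "dims": [96, 192, 384, 576]},
    "mambaout_base": {"depths": [3, 4, 27, 3], "dims": [128, 256, 512, 768]},
}


def describe(name):
    """Return (total number of blocks, channels of the last three stages)."""
    cfg = VARIANTS[name]
    return sum(cfg["depths"]), cfg["dims"][-3:]


for name in VARIANTS:
    blocks, neck_channels = describe(name)
    print(f"{name:<15} blocks={blocks:<3} last-three-stage channels={neck_channels}")
```

The smaller variants keep the head's input projections cheap, while the deeper ones add blocks mainly in the third stage.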
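2. How MambaOut Works
MambaOut: Do We Really Need Mamba for Vision?
MambaOut is a model built from Gated CNN blocks. Its design follows from an analysis of Mamba's characteristics and of the nature of vision tasks, which leads to a distinctive, simpler structure with advantages in several respects.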
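2.1 Design Motivation
Mamba's token mixer is based on a structured state space model (SSM), which is suited to long sequences and to causal token mixing. Vision tasks do not fit this profile well. ImageNet image classification is neither a long-sequence task nor one that requires causal token mixing, while object detection, instance segmentation, and semantic segmentation are long-sequence tasks but still do not require causal token mixing.
The SSM in Mamba may therefore be unnecessary for vision tasks. This leads to two hypotheses: SSM is not necessary for ImageNet image classification, whereas for detection and segmentation it is still worth exploring. To test these hypotheses, the MambaOut model was built by removing the core token mixer (SSM) from the Mamba block, so that the necessity of Mamba for visual recognition can be evaluated.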
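To make "long sequence" concrete, the sketch below counts the tokens produced by a given input resolution and compares the count with a threshold. The rule tokens > 6 × D used here is, to my understanding, the rough criterion from the MambaOut paper (the point where the quadratic term of a Transformer block starts to dominate its cost). Treat the rule, the patch size of 16, the width D = 384, and the example resolutions as assumptions to be checked against the paper, not as statements from this article.

```python
def num_tokens(height, width, patch=16):
    """Number of tokens when the image is split into patch x patch cells."""
    return (height // patch) * (width // patch)


def is_long_sequence(tokens, channel_dim):
    """Assumed rule of thumb: a sequence is 'long' when tokens > 6 * channel_dim."""
    return tokens > 6 * channel_dim


D = 384  # assumed channel width, for illustration only
examples = {
    "classification, 224x224": (224, 224),
    "detection, 800x1280": (800, 1280),
}
for task, (h, w) in examples.items():
    n = num_tokens(h, w)
    print(f"{task}: {n} tokens, long sequence: {is_long_sequence(n, D)}")
```

Under these assumptions the classification input stays far below the threshold, while the larger detection input exceeds it, which matches the distinction drawn above.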
2.2 Structure
MambaOut adopts a 4-stage hierarchical architecture similar to ResNet, built by stacking Gated CNN blocks in each stage.
The main difference between a Gated CNN block and a Mamba block is that the Gated CNN block has no SSM.
Its meta-architecture can be viewed as a simplified integration of MetaFormer's token mixer and MLP. Given an input $X \in \mathbb{R}^{N \times D}$, the block first normalizes it, $X' = \mathrm{Norm}(X)$, and then applies the transformation
$$Y = \left( \mathrm{TokenMixer}\left(X' W_{1}\right) \odot \sigma\left(X' W_{2}\right) \right) W_{3} + X,$$
where $W_{1}$, $W_{2}$, and $W_{3}$ are learnable weights, $\sigma$ is an activation function, and $\mathrm{TokenMixer}_{\mathrm{GatedCNN}}(Z) = \mathrm{Conv}(Z)$ is a depthwise convolution with a $7 \times 7$ kernel. To improve practical speed, the depthwise convolution is applied to only a subset of the channels.
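The "subset of the channels" idea can be made concrete with a little arithmetic. The sketch below mirrors how the GatedCNNBlock in Section 3 sizes its layers: the first linear layer produces 2 × hidden channels, which are split into a gate branch, an identity part that is left untouched, and the part that goes through the 7×7 depthwise convolution. The function name and the conv_ratio=0.5 example are illustrative; the formulas are taken from the block's constructor.

```python
def gated_cnn_split(dim, expansion_ratio=8 / 3, conv_ratio=1.0):
    """Return (gate, identity, conv) channel counts, as in GatedCNNBlock.__init__."""
    hidden = int(expansion_ratio * dim)
    conv_channels = int(conv_ratio * dim)
    # fc1 outputs 2 * hidden channels: `hidden` for the gate branch and `hidden`
    # for the value branch, of which only `conv_channels` are convolved.
    return hidden, hidden - conv_channels, conv_channels


for ratio in (1.0, 0.5):
    gate, identity, conv = gated_cnn_split(dim=48, conv_ratio=ratio)
    assert gate + identity + conv == 2 * gate  # matches fc1's output width
    print(f"conv_ratio={ratio}: gate={gate}, identity={identity}, conv={conv}")
```

Note that conv_ratio is measured against dim, not hidden, so even with conv_ratio=1.0 only part of the expanded value branch is convolved.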
Paper: https://arxiv.org/pdf/2405.07992
Code: https://github.com/yuweihao/MambaOut
3. MambaOut Implementation
The implementation of MambaOut is as follows:
"""
MambaOut models for image classification.
Some implementations are modified from:
timm (https://github.com/rwightman/pytorch-image-models),
MetaFormer (https://github.com/sail-sg/metaformer),
InceptionNeXt (https://github.com/sail-sg/inceptionnext)
"""
from functools import partial
import torch
import torch.nn as nn
import torch.nn.functional as F
from timm.layers import trunc_normal_, DropPath
from timm.data import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
__all__ = ['mambaout_femto', 'mambaout_kobe', 'mambaout_tiny', 'mambaout_small', 'mambaout_base']


def _cfg(url='', **kwargs):
    return {
        'url': url,
        'num_classes': 1000, 'input_size': (3, 224, 224), 'pool_size': None,
        'crop_pct': 1.0, 'interpolation': 'bicubic',
        'mean': IMAGENET_DEFAULT_MEAN, 'std': IMAGENET_DEFAULT_STD, 'classifier': 'head',
        **kwargs
    }


default_cfgs = {
    'mambaout_femto': _cfg(
        url='https://github.com/yuweihao/MambaOut/releases/download/model/mambaout_femto.pth'),
    'mambaout_kobe': _cfg(
        url='https://github.com/yuweihao/MambaOut/releases/download/model/mambaout_kobe.pth'),
    'mambaout_tiny': _cfg(
        url='https://github.com/yuweihao/MambaOut/releases/download/model/mambaout_tiny.pth'),
    'mambaout_small': _cfg(
        url='https://github.com/yuweihao/MambaOut/releases/download/model/mambaout_small.pth'),
    'mambaout_base': _cfg(
        url='https://github.com/yuweihao/MambaOut/releases/download/model/mambaout_base.pth'),
}


class StemLayer(nn.Module):
    r""" Code modified from InternImage:
        https://github.com/OpenGVLab/InternImage
    """

    def __init__(self,
                 in_channels=3,
                 out_channels=96,
                 act_layer=nn.GELU,
                 norm_layer=partial(nn.LayerNorm, eps=1e-6)):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels,
                               out_channels // 2,
                               kernel_size=3,
                               stride=2,
                               padding=1)
        self.norm1 = norm_layer(out_channels // 2)
        self.act = act_layer()
        self.conv2 = nn.Conv2d(out_channels // 2,
                               out_channels,
                               kernel_size=3,
                               stride=2,
                               padding=1)
        self.norm2 = norm_layer(out_channels)

    def forward(self, x):
        x = self.conv1(x)
        x = x.permute(0, 2, 3, 1)  # [B, C, H, W] -> [B, H, W, C] for LayerNorm
        x = self.norm1(x)
        x = x.permute(0, 3, 1, 2)  # back to [B, C, H, W] for the second conv
        x = self.act(x)
        x = self.conv2(x)
        x = x.permute(0, 2, 3, 1)  # output stays channels-last: [B, H, W, C]
        x = self.norm2(x)
        return x


class DownsampleLayer(nn.Module):
    r""" Code modified from InternImage:
        https://github.com/OpenGVLab/InternImage
    """

    def __init__(self, in_channels=96, out_channels=198, norm_layer=partial(nn.LayerNorm, eps=1e-6)):
        super().__init__()
        self.conv = nn.Conv2d(in_channels,
                              out_channels,
                              kernel_size=3,
                              stride=2,
                              padding=1)
        self.norm = norm_layer(out_channels)

    def forward(self, x):
        # input and output are channels-last; the conv itself runs in [B, C, H, W]
        x = self.conv(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        x = self.norm(x)
        return x


class MlpHead(nn.Module):
    """ MLP classification head
    """

    def __init__(self, dim, num_classes=1000, act_layer=nn.GELU, mlp_ratio=4,
                 norm_layer=partial(nn.LayerNorm, eps=1e-6), head_dropout=0., bias=True):
        super().__init__()
        hidden_features = int(mlp_ratio * dim)
        self.fc1 = nn.Linear(dim, hidden_features, bias=bias)
        self.act = act_layer()
        self.norm = norm_layer(hidden_features)
        self.fc2 = nn.Linear(hidden_features, num_classes, bias=bias)
        self.head_dropout = nn.Dropout(head_dropout)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.norm(x)
        x = self.head_dropout(x)
        x = self.fc2(x)
        return x


class GatedCNNBlock(nn.Module):
    r""" Our implementation of Gated CNN Block: https://arxiv.org/pdf/1612.08083
    Args:
        conv_ratio: control the number of channels to conduct depthwise convolution.
            Conduct convolution on partial channels can improve practical efficiency.
            The idea of partial channels is from ShuffleNet V2 (https://arxiv.org/abs/1807.11164) and
            also used by InceptionNeXt (https://arxiv.org/abs/2303.16900) and FasterNet (https://arxiv.org/abs/2303.03667)
    """

    def __init__(self, dim, expansion_ratio=8/3, kernel_size=7, conv_ratio=1.0,
                 norm_layer=partial(nn.LayerNorm, eps=1e-6),
                 act_layer=nn.GELU,
                 drop_path=0.,
                 **kwargs):
        super().__init__()
        self.norm = norm_layer(dim)
        hidden = int(expansion_ratio * dim)
        self.fc1 = nn.Linear(dim, hidden * 2)
        self.act = act_layer()
        conv_channels = int(conv_ratio * dim)
        # gate branch, untouched identity part, and the part that is convolved
        self.split_indices = (hidden, hidden - conv_channels, conv_channels)
        self.conv = nn.Conv2d(conv_channels, conv_channels, kernel_size=kernel_size, padding=kernel_size//2, groups=conv_channels)
        self.fc2 = nn.Linear(hidden, dim)
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

    def forward(self, x):
        shortcut = x  # [B, H, W, C]
        x = self.norm(x)
        g, i, c = torch.split(self.fc1(x), self.split_indices, dim=-1)
        c = c.permute(0, 3, 1, 2)  # [B, H, W, C] -> [B, C, H, W]
        c = self.conv(c)
        c = c.permute(0, 2, 3, 1)  # [B, C, H, W] -> [B, H, W, C]
        x = self.fc2(self.act(g) * torch.cat((i, c), dim=-1))
        x = self.drop_path(x)
        return x + shortcut


class LayerNormGeneral(nn.Module):
    r""" General LayerNorm for different situations.

    Args:
        affine_shape (int, list or tuple): The shape of affine weight and bias.
            Usually the affine_shape=C, but in some implementation, like torch.nn.LayerNorm,
            the affine_shape is the same as normalized_dim by default.
            To adapt to different situations, we offer this argument here.
        normalized_dim (tuple or list): Which dims to compute mean and variance.
        scale (bool): Flag indicates whether to use scale or not.
        bias (bool): Flag indicates whether to use bias or not.

        We give several examples to show how to specify the arguments.

        LayerNorm (https://arxiv.org/abs/1607.06450):
            For input shape of (B, *, C) like (B, N, C) or (B, H, W, C),
                affine_shape=C, normalized_dim=(-1, ), scale=True, bias=True;
            For input shape of (B, C, H, W),
                affine_shape=(C, 1, 1), normalized_dim=(1, ), scale=True, bias=True.

        Modified LayerNorm (https://arxiv.org/abs/2111.11418)
            that is identical to partial(torch.nn.GroupNorm, num_groups=1):
            For input shape of (B, N, C),
                affine_shape=C, normalized_dim=(1, 2), scale=True, bias=True;
            For input shape of (B, H, W, C),
                affine_shape=C, normalized_dim=(1, 2, 3), scale=True, bias=True;
            For input shape of (B, C, H, W),
                affine_shape=(C, 1, 1), normalized_dim=(1, 2, 3), scale=True, bias=True.

        For the several metaformer baselines,
            IdentityFormer, RandFormer and PoolFormerV2 utilize Modified LayerNorm without bias (bias=False);
            ConvFormer and CAFormer utilize LayerNorm without bias (bias=False).
    """

    def __init__(self, affine_shape=None, normalized_dim=(-1, ), scale=True,
                 bias=True, eps=1e-5):
        super().__init__()
        self.normalized_dim = normalized_dim
        self.use_scale = scale
        self.use_bias = bias
        self.weight = nn.Parameter(torch.ones(affine_shape)) if scale else None
        self.bias = nn.Parameter(torch.zeros(affine_shape)) if bias else None
        self.eps = eps

    def forward(self, x):
        c = x - x.mean(self.normalized_dim, keepdim=True)
        s = c.pow(2).mean(self.normalized_dim, keepdim=True)
        x = c / torch.sqrt(s + self.eps)
        if self.use_scale:
            x = x * self.weight
        if self.use_bias:
            x = x + self.bias
        return x


class GatedCNNBlock_BCHW(nn.Module):
    r""" Our implementation of Gated CNN Block: https://arxiv.org/pdf/1612.08083
    Channels-first variant: expects [B, C, H, W] input and uses 1x1 convolutions in place of linear layers.
    Args:
        conv_ratio: control the number of channels to conduct depthwise convolution.
            Conduct convolution on partial channels can improve practical efficiency.
            The idea of partial channels is from ShuffleNet V2 (https://arxiv.org/abs/1807.11164) and
            also used by InceptionNeXt (https://arxiv.org/abs/2303.16900) and FasterNet (https://arxiv.org/abs/2303.03667)
    """

    def __init__(self, dim, expansion_ratio=8/3, kernel_size=7, conv_ratio=1.0,
                 norm_layer=partial(LayerNormGeneral, eps=1e-6, normalized_dim=(1, 2, 3)),
                 act_layer=nn.GELU,
                 drop_path=0.,
                 **kwargs):
        super().__init__()
        self.norm = norm_layer((dim, 1, 1))
        hidden = int(expansion_ratio * dim)
        self.fc1 = nn.Conv2d(dim, hidden * 2, 1)
        self.act = act_layer()
        conv_channels = int(conv_ratio * dim)
        self.split_indices = (hidden, hidden - conv_channels, conv_channels)
        self.conv = nn.Conv2d(conv_channels, conv_channels, kernel_size=kernel_size, padding=kernel_size//2, groups=conv_channels)
        self.fc2 = nn.Conv2d(hidden, dim, 1)
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

    def forward(self, x):
        shortcut = x  # [B, C, H, W]
        x = self.norm(x)
        g, i, c = torch.split(self.fc1(x), self.split_indices, dim=1)
        # the tensor is already channels-first, so no permute is needed around the conv
        c = self.conv(c)
        x = self.fc2(self.act(g) * torch.cat((i, c), dim=1))
        x = self.drop_path(x)
        return x + shortcut


r"""
downsampling (stem) for the first stage is two layer of conv with k3, s2 and p1
downsamplings for the last 3 stages is a layer of conv with k3, s2 and p1
DOWNSAMPLE_LAYERS_FOUR_STAGES format: [Downsampling, Downsampling, Downsampling, Downsampling]
use `partial` to specify some arguments
"""
DOWNSAMPLE_LAYERS_FOUR_STAGES = [StemLayer] + [DownsampleLayer]*3


class MambaOut(nn.Module):
    r""" MetaFormer
        A PyTorch impl of : `MetaFormer Baselines for Vision`  -
          https://arxiv.org/abs/2210.13452

    Args:
        in_chans (int): Number of input image channels. Default: 3.
        num_classes (int): Number of classes for classification head. Default: 1000.
        depths (list or tuple): Number of blocks at each stage. Default: [3, 3, 9, 3].
        dims (list or tuple): Feature dimension at each stage. Default: [96, 192, 384, 576].
        downsample_layers: (list or tuple): Downsampling layers before each stage.
        drop_path_rate (float): Stochastic depth rate. Default: 0.
        output_norm: norm before classifier head. Default: partial(nn.LayerNorm, eps=1e-6).
        head_fn: classification head. Default: MlpHead.
        head_dropout (float): dropout for MLP classifier. Default: 0.
    """

    def __init__(self, in_chans=3, num_classes=1000,
                 depths=[3, 3, 9, 3],
                 dims=[96, 192, 384, 576],
                 downsample_layers=DOWNSAMPLE_LAYERS_FOUR_STAGES,
                 norm_layer=partial(nn.LayerNorm, eps=1e-6),
                 act_layer=nn.GELU,
                 conv_ratio=1.0,
                 kernel_size=7,
                 drop_path_rate=0.,
                 output_norm=partial(nn.LayerNorm, eps=1e-6),
                 head_fn=MlpHead,
                 head_dropout=0.0,
                 **kwargs,
                 ):
        super().__init__()
        self.num_classes = num_classes

        if not isinstance(depths, (list, tuple)):
            depths = [depths]  # it means the model has only one stage
        if not isinstance(dims, (list, tuple)):
            dims = [dims]

        num_stage = len(depths)
        self.num_stage = num_stage

        if not isinstance(downsample_layers, (list, tuple)):
            downsample_layers = [downsample_layers] * num_stage
        down_dims = [in_chans] + dims
        self.downsample_layers = nn.ModuleList(
            [downsample_layers[i](down_dims[i], down_dims[i+1]) for i in range(num_stage)]
        )

        dp_rates = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]

        self.stages = nn.ModuleList()
        cur = 0
        for i in range(num_stage):
            stage = nn.Sequential(
                *[GatedCNNBlock(dim=dims[i],
                                norm_layer=norm_layer,
                                act_layer=act_layer,
                                kernel_size=kernel_size,
                                conv_ratio=conv_ratio,
                                drop_path=dp_rates[cur + j],
                                ) for j in range(depths[i])]
            )
            self.stages.append(stage)
            cur += depths[i]

        # norm and head are not used by forward() below; they are still created,
        # so the parameter names match those of the classification model
        self.norm = output_norm(dims[-1])

        if head_dropout > 0.0:
            self.head = head_fn(dims[-1], num_classes, head_dropout=head_dropout)
        else:
            self.head = head_fn(dims[-1], num_classes)

        self.apply(self._init_weights)
        # channels of the stage outputs, obtained from a dummy forward pass; read by parse_model
        self.channel = [i.size(1) for i in self.forward(torch.randn(1, 3, 640, 640))]

    def _init_weights(self, m):
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            trunc_normal_(m.weight, std=.02)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        # returns one [B, C, H, W] feature map per stage instead of class logits
        outs = []
        for i in range(self.num_stage):
            x = self.downsample_layers[i](x)
            x = self.stages[i](x)
            outs.append(x.permute(0, 3, 1, 2).contiguous())
        return outs


###############################################################################
# a series of MambaOut model
def mambaout_femto(pretrained=False, **kwargs):
    model = MambaOut(
        depths=[3, 3, 9, 3],
        dims=[48, 96, 192, 288],
        **kwargs)
    model.default_cfg = default_cfgs['mambaout_femto']
    if pretrained:
        state_dict = torch.hub.load_state_dict_from_url(
            url=model.default_cfg['url'], map_location="cpu", check_hash=True)
        model.load_state_dict(state_dict)
    return model


# Kobe Memorial Version with 24 Gated CNN blocks
def mambaout_kobe(pretrained=False, **kwargs):
    model = MambaOut(
        depths=[3, 3, 15, 3],
        dims=[48, 96, 192, 288],
        **kwargs)
    model.default_cfg = default_cfgs['mambaout_kobe']
    if pretrained:
        state_dict = torch.hub.load_state_dict_from_url(
            url=model.default_cfg['url'], map_location="cpu", check_hash=True)
        model.load_state_dict(state_dict)
    return model


def mambaout_tiny(pretrained=False, **kwargs):
    model = MambaOut(
        depths=[3, 3, 9, 3],
        dims=[96, 192, 384, 576],
        **kwargs)
    model.default_cfg = default_cfgs['mambaout_tiny']
    if pretrained:
        state_dict = torch.hub.load_state_dict_from_url(
            url=model.default_cfg['url'], map_location="cpu", check_hash=True)
        model.load_state_dict(state_dict)
    return model


def mambaout_small(pretrained=False, **kwargs):
    model = MambaOut(
        depths=[3, 4, 27, 3],
        dims=[96, 192, 384, 576],
        **kwargs)
    model.default_cfg = default_cfgs['mambaout_small']
    if pretrained:
        state_dict = torch.hub.load_state_dict_from_url(
            url=model.default_cfg['url'], map_location="cpu", check_hash=True)
        model.load_state_dict(state_dict)
    return model


def mambaout_base(pretrained=False, **kwargs):
    model = MambaOut(
        depths=[3, 4, 27, 3],
        dims=[128, 256, 512, 768],
        **kwargs)
    model.default_cfg = default_cfgs['mambaout_base']
    if pretrained:
        state_dict = torch.hub.load_state_dict_from_url(
            url=model.default_cfg['url'], map_location="cpu", check_hash=True)
        model.load_state_dict(state_dict)
    return model
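
For detection, what matters is the list of feature maps the backbone returns. The following sketch reproduces, with plain integer arithmetic, the spatial sizes and channels that MambaOut.forward yields for a 640×640 input, which is the dummy input the constructor uses to fill self.channel. It assumes the standard convolution output-size formula and the kernel 3, stride 2, padding 1 layers of StemLayer and DownsampleLayer shown above; the helper names are illustrative.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def backbone_shapes(dims, input_size=640):
    """Return [(channels, height_or_width), ...] for the four stage outputs."""
    shapes = []
    size = input_size
    for stage, channels in enumerate(dims):
        # The stem stacks two stride-2 convolutions; later stages use one.
        for _ in range(2 if stage == 0 else 1):
            size = conv_out(size)
        shapes.append((channels, size))
    return shapes


femto = backbone_shapes([48, 96, 192, 288])
for channels, size in femto:
    print(f"C={channels:<4} HxW={size}x{size} stride={640 // size}")
```

The four outputs therefore sit at strides 4, 8, 16, and 32, and the last three are the ones the RT-DETR head in Section 5 reads.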
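4. Modification Steps
4.1 Modification 1
① Create an AddModules folder under the ultralytics/nn/ directory to hold the module code.
② In the AddModules folder, create mambaout.py and paste the code from Section 3 into it.
4.2 Modification 2
In the AddModules folder, create __init__.py (skip this if it already exists) and import the module in it:
from .mambaout import *
4.3 Modification 3
The module classes need to be registered in the ultralytics/nn/tasks.py file.
① First, import the module.
② In the predict function of the BaseModel class, remove the embed argument at the following two places:
③ Replace the _predict_once function of the BaseModel class with the following code: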
    def _predict_once(self, x, profile=False, visualize=False):
        """
        Perform a forward pass through the network.

        Args:
            x (torch.Tensor): The input tensor to the model.
            profile (bool): Print the computation time of each layer if True, defaults to False.
            visualize (bool): Save the feature maps of the model if True, defaults to False.

        Returns:
            (torch.Tensor): The last output of the model.
        """
        y, dt = [], []  # outputs
        for m in self.model:
            if m.f != -1:  # if not from previous layer
                x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
            if profile:
                self._profile_one_layer(m, x, dt)
            x = m(x)  # run
            y.append(x if m.i in self.save else None)  # save output
            if visualize:
                feature_visualization(x, m.type, m.i, save_dir=visualize)
        return x
④ Replace the predict function of the RTDETRDetectionModel class in full:
    def predict(self, x, profile=False, visualize=False, batch=None, augment=False):
        """
        Perform a forward pass through the model.

        Args:
            x (torch.Tensor): The input tensor.
            profile (bool, optional): If True, profile the computation time for each layer. Defaults to False.
            visualize (bool, optional): If True, save feature maps for visualization. Defaults to False.
            batch (dict, optional): Ground truth data for evaluation. Defaults to None.
            augment (bool, optional): If True, perform data augmentation during inference. Defaults to False.

        Returns:
            (torch.Tensor): Model's output tensor.
        """
        y, dt = [], []  # outputs
        for m in self.model[:-1]:  # except the head part
            if m.f != -1:  # if not from previous layer
                x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
            if profile:
                self._profile_one_layer(m, x, dt)
            if hasattr(m, 'backbone'):
                x = m(x)  # list of stage feature maps
                for _ in range(5 - len(x)):
                    x.insert(0, None)  # left-pad so the outputs occupy indices 0-4
                for i_idx, i in enumerate(x):
                    if i_idx in self.save:
                        y.append(i)
                    else:
                        y.append(None)
                # for i in x:
                #     if i is not None:
                #         print(i.size())
                x = x[-1]  # the deepest feature map feeds the next layer
            else:
                x = m(x)  # run
                y.append(x if m.i in self.save else None)  # save output
            if visualize:
                feature_visualization(x, m.type, m.i, save_dir=visualize)
        head = self.model[-1]
        x = head([y[j] for j in head.f], batch)  # head inference
        return x
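
The backbone branch of this predict function is easier to follow on toy data. In the sketch below, strings stand in for feature maps: the four backbone outputs are left-padded with None to five entries, so that they occupy layer indices 0 to 4 and later layers can refer to them by index. The save set and the function name are made up for the illustration.

```python
def store_backbone_outputs(outputs, save, slots=5):
    """Mimic the `hasattr(m, 'backbone')` branch: pad, then keep only saved indices."""
    outputs = list(outputs)
    for _ in range(slots - len(outputs)):
        outputs.insert(0, None)  # left-pad so the last output lands at index 4
    y = [feat if idx in save else None for idx, feat in enumerate(outputs)]
    x = outputs[-1]  # the deepest feature map feeds the next layer
    return y, x


# Hypothetical save list: later layers read backbone indices 2 and 3.
y, x = store_backbone_outputs(["P2", "P3", "P4", "P5"], save={2, 3})
print(y)
print(x)
```

With this layout, index 2 holds the stride-8 map, index 3 the stride-16 map, and the stride-32 map is passed straight on as x.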
⑤ In the parse_model function, replace the code at the following location with:
    if verbose:
        LOGGER.info(f"\n{'':>3}{'from':>20}{'n':>3}{'params':>10} {'module':<45}{'arguments':<30}")
    ch = [ch]
    layers, save, c2 = [], [], ch[-1]  # layers, savelist, ch out
    is_backbone = False
    backbone = False  # initialised here so that `i + 4 if backbone else i` below is defined for every YAML
    for i, (f, n, m, args) in enumerate(d['backbone'] + d['head']):  # from, number, module, args
        try:
            if m == 'node_mode':
                m = d[m]
                if len(args) > 0:
                    if args[0] == 'head_channel':
                        args[0] = int(d[args[0]])
            t = m
            m = getattr(torch.nn, m[3:]) if 'nn.' in m else globals()[m]  # get module
        except:
            pass
        for j, a in enumerate(args):
            if isinstance(a, str):
                with contextlib.suppress(ValueError):
                    try:
                        args[j] = locals()[a] if a in locals() else ast.literal_eval(a)
                    except:
                        args[j] = a
After the replacement, the code looks as follows:
⑥ In the parse_model function, add the following code.
        elif m in {mambaout_femto, mambaout_kobe, mambaout_tiny, mambaout_small, mambaout_base}:
            m = m(*args)  # build the backbone once
            c2 = m.channel  # list of per-stage output channels
            backbone = True
⑦ In the parse_model function, replace the code at the following location with:
        if isinstance(c2, list):
            backbone = True
            m_ = m
            m_.backbone = True
        else:
            m_ = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args)  # module
            t = str(m)[8:-2].replace('__main__.', '')  # module type
        m_.np = sum(x.numel() for x in m_.parameters())  # number params
        m_.i, m_.f, m_.type = i + 4 if backbone else i, f, t  # attach index, 'from' index, type
        if verbose:
            LOGGER.info(f'{i:>3}{str(f):>20}{n_:>3}{m_.np:10.0f} {t:<45}{str(args):<30}')  # print
        save.extend(x % (i + 4 if backbone else i) for x in ([f] if isinstance(f, int) else f) if x != -1)  # append to savelist
        layers.append(m_)
        if i == 0:
            ch = []
        if isinstance(c2, list):
            ch.extend(c2)
            for _ in range(5 - len(ch)):
                ch.insert(0, 0)
        else:
            ch.append(c2)
    return nn.Sequential(*layers), sorted(save)
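
Two bookkeeping details in this snippet are worth spelling out: the channel list ch is left-padded with zeros to five entries, and every module is attached under index i + 4 once a backbone is present. The sketch below replays both on the femto channel list. The variable names follow the snippet, while the helper functions are illustrative only.

```python
def pad_channels(backbone_channels, slots=5):
    """Left-pad the backbone's channel list with zeros, as in parse_model."""
    ch = list(backbone_channels)
    for _ in range(slots - len(ch)):
        ch.insert(0, 0)
    return ch


def layer_index(i, backbone=True):
    """Index attached to the module at YAML position i."""
    return i + 4 if backbone else i


ch = pad_channels([48, 96, 192, 288])
print(ch)
print(ch[2], ch[3], ch[-1])  # channels seen by `from` = 2, 3 and -1
print([layer_index(i) for i in range(4)])  # YAML positions 0..3 -> layer indices
```

This is why a head layer with `from` set to 3 receives 192 channels and one with `from` set to 2 receives 96 channels when mambaout_femto is the backbone.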
⑧ In the ultralytics/nn/autobackend.py file, replace the forward function of the AutoBackend class in full with the following code:
    def forward(self, im, augment=False, visualize=False):
        """
        Runs inference on the YOLOv8 MultiBackend model.

        Args:
            im (torch.Tensor): The image tensor to perform inference on.
            augment (bool): whether to perform data augmentation during inference, defaults to False
            visualize (bool): whether to visualize the output predictions, defaults to False

        Returns:
            (tuple): Tuple containing the raw output tensor, and processed output for visualization (if visualize=True)
        """
        b, ch, h, w = im.shape  # batch, channel, height, width
        if self.fp16 and im.dtype != torch.float16:
            im = im.half()  # to FP16
        if self.nhwc:
            im = im.permute(0, 2, 3, 1)  # torch BCHW to numpy BHWC shape(1,320,192,3)

        if self.pt or self.nn_module:  # PyTorch
            y = self.model(im, augment=augment, visualize=visualize) if augment or visualize else self.model(im)
        elif self.jit:  # TorchScript
            y = self.model(im)
        elif self.dnn:  # ONNX OpenCV DNN
            im = im.cpu().numpy()  # torch to numpy
            self.net.setInput(im)
            y = self.net.forward()
        elif self.onnx:  # ONNX Runtime
            im = im.cpu().numpy()  # torch to numpy
            y = self.session.run(self.output_names, {self.session.get_inputs()[0].name: im})
        elif self.xml:  # OpenVINO
            im = im.cpu().numpy()  # FP32
            y = list(self.ov_compiled_model(im).values())
        elif self.engine:  # TensorRT
            if self.dynamic and im.shape != self.bindings['images'].shape:
                i = self.model.get_binding_index('images')
                self.context.set_binding_shape(i, im.shape)  # reshape if dynamic
                self.bindings['images'] = self.bindings['images']._replace(shape=im.shape)
                for name in self.output_names:
                    i = self.model.get_binding_index(name)
                    self.bindings[name].data.resize_(tuple(self.context.get_binding_shape(i)))
            s = self.bindings['images'].shape
            assert im.shape == s, f"input size {im.shape} {'>' if self.dynamic else 'not equal to'} max model size {s}"
            self.binding_addrs['images'] = int(im.data_ptr())
            self.context.execute_v2(list(self.binding_addrs.values()))
            y = [self.bindings[x].data for x in sorted(self.output_names)]
        elif self.coreml:  # CoreML
            im = im[0].cpu().numpy()
            im_pil = Image.fromarray((im * 255).astype('uint8'))
            # im = im.resize((192, 320), Image.BILINEAR)
            y = self.model.predict({'image': im_pil})  # coordinates are xywh normalized
            if 'confidence' in y:
                raise TypeError('Ultralytics only supports inference of non-pipelined CoreML models exported with '
                                f"'nms=False', but 'model={w}' has an NMS pipeline created by an 'nms=True' export.")
                # TODO: CoreML NMS inference handling
                # from ultralytics.utils.ops import xywh2xyxy
                # box = xywh2xyxy(y['coordinates'] * [[w, h, w, h]])  # xyxy pixels
                # conf, cls = y['confidence'].max(1), y['confidence'].argmax(1).astype(np.float32)
                # y = np.concatenate((box, conf.reshape(-1, 1), cls.reshape(-1, 1)), 1)
            elif len(y) == 1:  # classification model
                y = list(y.values())
            elif len(y) == 2:  # segmentation model
                y = list(reversed(y.values()))  # reversed for segmentation models (pred, proto)
        elif self.paddle:  # PaddlePaddle
            im = im.cpu().numpy().astype(np.float32)
            self.input_handle.copy_from_cpu(im)
            self.predictor.run()
            y = [self.predictor.get_output_handle(x).copy_to_cpu() for x in self.output_names]
        elif self.ncnn:  # ncnn
            mat_in = self.pyncnn.Mat(im[0].cpu().numpy())
            ex = self.net.create_extractor()
            input_names, output_names = self.net.input_names(), self.net.output_names()
            ex.input(input_names[0], mat_in)
            y = []
            for output_name in output_names:
                mat_out = self.pyncnn.Mat()
                ex.extract(output_name, mat_out)
                y.append(np.array(mat_out)[None])
        elif self.triton:  # NVIDIA Triton Inference Server
            im = im.cpu().numpy()  # torch to numpy
            y = self.model(im)
        else:  # TensorFlow (SavedModel, GraphDef, Lite, Edge TPU)
            im = im.cpu().numpy()
            if self.saved_model:  # SavedModel
                y = self.model(im, training=False) if self.keras else self.model(im)
                if not isinstance(y, list):
                    y = [y]
            elif self.pb:  # GraphDef
                y = self.frozen_func(x=self.tf.constant(im))
                if len(y) == 2 and len(self.names) == 999:  # segments and names not defined
                    ip, ib = (0, 1) if len(y[0].shape) == 4 else (1, 0)  # index of protos, boxes
                    nc = y[ib].shape[1] - y[ip].shape[3] - 4  # y = (1, 160, 160, 32), (1, 116, 8400)
                    self.names = {i: f'class{i}' for i in range(nc)}
            else:  # Lite or Edge TPU
                details = self.input_details[0]
                integer = details['dtype'] in (np.int8, np.int16)  # is TFLite quantized int8 or int16 model
                if integer:
                    scale, zero_point = details['quantization']
                    im = (im / scale + zero_point).astype(details['dtype'])  # de-scale
                self.interpreter.set_tensor(details['index'], im)
                self.interpreter.invoke()
                y = []
                for output in self.output_details:
                    x = self.interpreter.get_tensor(output['index'])
                    if integer:
                        scale, zero_point = output['quantization']
                        x = (x.astype(np.float32) - zero_point) * scale  # re-scale
                    if x.ndim > 2:  # if task is not classification
                        # Denormalize xywh by image size. See https://github.com/ultralytics/ultralytics/pull/1695
                        # xywh are normalized in TFLite/EdgeTPU to mitigate quantization error of integer models
                        x[:, [0, 2]] *= w
                        x[:, [1, 3]] *= h
                    y.append(x)
            # TF segment fixes: export is reversed vs ONNX export and protos are transposed
            if len(y) == 2:  # segment with (det, proto) output order reversed
                if len(y[1].shape) != 4:
                    y = list(reversed(y))  # should be y = (1, 116, 8400), (1, 160, 160, 32)
                y[1] = np.transpose(y[1], (0, 3, 1, 2))  # should be y = (1, 116, 8400), (1, 32, 160, 160)
            y = [x if isinstance(x, np.ndarray) else x.numpy() for x in y]

        # for x in y:
        #     print(type(x), len(x)) if isinstance(x, (list, tuple)) else print(type(x), x.shape)  # debug shapes
        if isinstance(y, (list, tuple)):
            return self.from_numpy(y[0]) if len(y) == 1 else [self.from_numpy(x) for x in y]
        else:
            return self.from_numpy(y)
The modifications are now complete, and the model can be configured for training.
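5. YAML Model File
5.1 Model Modification ⭐
Once the code is in place, configure the model's YAML file.
This example starts from ultralytics/cfg/models/rt-detr/rtdetr-l.yaml (the neck is the same in the other versions, so any of them can be used instead). In the same directory, create a model file for training on your own dataset, named rtdetr-mambaout.yaml.
Copy the contents of rtdetr-l.yaml into rtdetr-mambaout.yaml and set nc to the number of object classes in your dataset.
📌 The modification replaces the backbone with mambaout_femto.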
# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-l object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/models/rtdetr

# Parameters
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
  # [depth, width, max_channels]
  l: [1.00, 1.00, 1024]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, mambaout_femto, []] # 4

head:
  - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 5 input_proj.2
  - [-1, 1, AIFI, [1024, 8]] # 6
  - [-1, 1, Conv, [256, 1, 1]] # 7, Y5, lateral_convs.0

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 8
  - [3, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 9 input_proj.1
  - [[-2, -1], 1, Concat, [1]] # 10
  - [-1, 3, RepC3, [256]] # 11, fpn_blocks.0
  - [-1, 1, Conv, [256, 1, 1]] # 12, Y4, lateral_convs.1

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 13
  - [2, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 14 input_proj.0
  - [[-2, -1], 1, Concat, [1]] # 15 cat backbone P4
  - [-1, 3, RepC3, [256]] # X3 (16), fpn_blocks.1

  - [-1, 1, Conv, [256, 3, 2]] # 17, downsample_convs.0
  - [[-1, 12], 1, Concat, [1]] # 18 cat Y4
  - [-1, 3, RepC3, [256]] # F4 (19), pan_blocks.0

  - [-1, 1, Conv, [256, 3, 2]] # 20, downsample_convs.1
  - [[-1, 7], 1, Concat, [1]] # 21 cat Y5
  - [-1, 3, RepC3, [256]] # F5 (22), pan_blocks.1

  - [[16, 19, 22], 1, RTDETRDecoder, [nc]] # Detect(P3, P4, P5)
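
A quick way to sanity-check the `from` fields in this YAML is to resolve them by hand. The sketch below assumes the indexing convention set up in Section 4 (backbone outputs at indices 0 to 4, head layers starting at 5) and the femto channels [48, 96, 192, 288]. The abbreviated HEAD list and the helper are illustrative and only track channel counts, not tensors.

```python
# (from, module, out_channels) for the first head layers of the YAML above;
# out_channels is None where the module keeps or combines its input channels.
HEAD = [
    (-1, "Conv", 256),           # 5
    (-1, "AIFI", None),          # 6
    (-1, "Conv", 256),           # 7
    (-1, "Upsample", None),      # 8
    (3, "Conv", 256),            # 9
    ([-2, -1], "Concat", None),  # 10
    (-1, "RepC3", 256),          # 11
]


def trace_channels(backbone_channels, head):
    """Return [(layer index, module, input channels, output channels), ...]."""
    ch = [0] * (5 - len(backbone_channels)) + list(backbone_channels)
    rows = []
    for f, module, c2 in head:
        idx = len(ch)  # index this layer will receive
        sources = f if isinstance(f, list) else [f]
        c1 = [ch[s] for s in sources]  # negative values index from the end
        if module == "Concat":
            c2 = sum(c1)
        elif c2 is None:
            c2 = c1[0]
        rows.append((idx, module, c1, c2))
        ch.append(c2)
    return rows


rows = trace_channels([48, 96, 192, 288], HEAD)
for row in rows:
    print(row)
```

Layer 9 reads the 192-channel backbone output at index 3, and the RepC3 at index 11 receives 512 channels from the Concat before it.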
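6. Successful Run Results
Printing the network shows that mambaout_femto has been added to the model and that the model is ready for training. Note that the first column of the printout is the YAML position i, whereas modules are attached under index i + 4, which is why the from column refers to indices such as 12 or [16, 19, 22] that exceed the row numbers.
rtdetr-l-mambaout: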
rtdetr-mambaout summary: 525 layers, 25,789,243 parameters, 25,789,243 gradients, 79.4 GFLOPs
                   from  n    params  module                                       arguments
  0                  -1  1   7304536  mambaout_femto                               []
  1                  -1  1     74240  ultralytics.nn.modules.conv.Conv             [288, 256, 1, 1, None, 1, 1, False]
  2                  -1  1    789760  ultralytics.nn.modules.transformer.AIFI      [256, 1024, 8]
  3                  -1  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]
  4                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
  5                   3  1     49664  ultralytics.nn.modules.conv.Conv             [192, 256, 1, 1, None, 1, 1, False]
  6            [-2, -1]  1         0  ultralytics.nn.modules.conv.Concat           [1]
  7                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]
  8                  -1  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]
  9                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 10                   2  1     25088  ultralytics.nn.modules.conv.Conv             [96, 256, 1, 1, None, 1, 1, False]
 11            [-2, -1]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 12                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]
 13                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]
 14            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 15                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]
 16                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]
 17             [-1, 7]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 18                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]
 19        [16, 19, 22]  1   7303907  ultralytics.nn.modules.head.RTDETRDecoder    [1, [256, 256, 256]]
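
The parameter counts in this printout can be cross-checked by hand. The sketch below assumes that an Ultralytics Conv is a bias-free Conv2d followed by BatchNorm2d, whose learnable parameters are one weight and one bias per output channel. It recomputes a few Conv rows and sums the params column to compare with the reported total. The helper name is illustrative.

```python
def conv_bn_params(c1, c2, k=1):
    """Parameters of a bias-free k x k convolution plus BatchNorm weight and bias."""
    return c1 * c2 * k * k + 2 * c2


print(conv_bn_params(288, 256))       # 1x1 projection of the 288-channel stage
print(conv_bn_params(192, 256))       # 1x1 projection of the 192-channel stage
print(conv_bn_params(256, 256, k=3))  # 3x3 stride-2 downsample conv

# params column of the printout, top to bottom
PARAMS = [7304536, 74240, 789760, 66048, 0, 49664, 0, 2232320, 66048, 0,
          25088, 0, 2232320, 590336, 0, 2232320, 590336, 0, 2232320, 7303907]
print(f"{sum(PARAMS):,}")
```

The recomputed values agree with the corresponding rows, and the column sums to the total reported in the summary line, with the backbone accounting for roughly 7.3M of the 25.8M parameters.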