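RT-DETR Improvement Strategies [Model Lightweighting] | EMO: ICCV 2023, a Lightweight Self-Attention Model with a Simple Structure
1. Introduction
This post documents a lightweight improvement to RT-DETR based on EMO. EMO has a simple design: a 4-stage architecture built only from iRMB blocks, with no complex operations or modules and no need for fine-grained hyperparameter tuning. Inside iRMB, DW-Conv and EW-MHSA model short- and long-range dependencies respectively, which reduces computation while preserving accuracy. Using EMO as the RT-DETR backbone aims to improve object detection performance while keeping the model lightweight.
This post configures the four models from the original paper, EMO_1M, EMO_2M, EMO_5M, and EMO_6M, on top of RT-DETR to cover different needs.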
| Model | Parameters | FLOPs |
|---|---|---|
| rtdetr-l | 32.8M | 108.0 GFLOPs |
| Improved (EMO_1M backbone) | 22.9M | 64.6 GFLOPs |
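The relative savings follow directly from the two rows of the table. The sketch below uses only those numbers; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Values copied from the table above.
params_m = {"rtdetr-l": 32.8, "improved": 22.9}   # millions of parameters
gflops = {"rtdetr-l": 108.0, "improved": 64.6}    # GFLOPs

param_saving = relative_reduction(params_m["rtdetr-l"], params_m["improved"])
flops_saving = relative_reduction(gflops["rtdetr-l"], gflops["improved"])

print(f"parameters: -{param_saving:.1%}")  # parameters: -30.2%
print(f"FLOPs: -{flops_saving:.1%}")       # FLOPs: -40.2%
```

In other words, the modified model has roughly 30% fewer parameters and 40% fewer FLOPs than rtdetr-l.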
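2. How EMO Works
Rethinking Mobile Block for Efficient Attention-based Models
EMO aims to provide efficient, attention-based lightweight models for mobile applications, and it performs strongly on several vision tasks. The sections below cover its motivation, structure, and advantages.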
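2.1 Motivation
- Demand is growing for efficient vision models in mobile applications with limited storage and compute, but traditional CNN-based models are constrained by the static, built-in inductive bias of convolution, which leaves room for better accuracy.
- Attention-based models have advantages, but the cost of multi-head self-attention (MHSA) grows quadratically with the number of tokens, so they are resource-hungry.
- In addition, current efficient hybrid models tend to have complex structures or many different modules, which makes them harder to optimize for deployment.
Therefore, it is worth exploring an IRB-like lightweight building block for attention-based models.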
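The quadratic growth of MHSA is easy to see with a little arithmetic. This is a minimal sketch, assuming one token per feature-map position and counting only query-key pairs (a proxy for cost, not a FLOPs measurement); `attention_pairs` is a hypothetical helper.

```python
def attention_pairs(height: int, width: int) -> int:
    """Number of query-key pairs for global self-attention over an H x W map."""
    tokens = height * width
    return tokens * tokens


small = attention_pairs(20, 20)   # 400 tokens
large = attention_pairs(40, 40)   # 1600 tokens: twice the resolution per side

# Doubling each side gives 4x the tokens but 16x the attention pairs.
ratio = large // small
print(small, large, ratio)  # 160000 2560000 16
```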
2.2 Structure
- Meta Mobile Block (MMB): The authors revisit the Inverted Residual Block (IRB) of MobileNetV2 and the core Transformer modules, MHSA and FFN, and abstract them into a Meta Mobile Block. For an image input $X \in \mathbb{R}^{C \times H \times W}$, MMB first expands the channel dimension with an expansion $MLP_e$ whose output/input ratio is $\lambda$, giving $X_e = MLP_e(X) \in \mathbb{R}^{\lambda C \times H \times W}$. An efficient operator $\mathcal{F}$ then enhances the image features, giving $X_f = \mathcal{F}(X_e) \in \mathbb{R}^{\lambda C \times H \times W}$. Finally, a shrinking $MLP_s$ whose input/output ratio is $\lambda$ reduces the channel dimension, giving $X_s = MLP_s(X_f) \in \mathbb{R}^{C \times H \times W}$, and a residual connection produces the final output $Y = X + X_s \in \mathbb{R}^{C \times H \times W}$.
- Inverted Residual Mobile Block (iRMB): Building on MMB, iRMB models $\mathcal{F}$ as a cascade of MHSA and convolution, $\mathcal{F}(\cdot) = \mathrm{Conv}(\mathrm{MHSA}(\cdot))$. To keep the cost down, it uses efficient window MHSA (WMHSA) and depth-wise convolution (DW-Conv) with a residual connection, and proposes an improved EW-MHSA with $Q = K = X \in \mathbb{R}^{C \times H \times W}$ and $V \in \mathbb{R}^{\lambda C \times H \times W}$. The operator becomes $\mathcal{F}(\cdot) = (\text{DW-Conv}, \text{Skip})(\text{EW-MHSA}(\cdot))$.
- EMO overall architecture: A ResNet-like 4-stage efficient model (EMO) is built from a stack of iRMBs. The whole framework consists only of iRMB, with no diversified modules. iRMB contains only standard convolution and multi-head self-attention, with no other complex operators; it handles down-sampling through its stride and needs no position embeddings. The expansion ratio and the channel count increase gradually across stages.
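The MMB data flow can be traced with shapes alone. The sketch below is an illustration under assumptions, not the authors' implementation: it treats $MLP_e$ and $MLP_s$ as bias-free 1×1 convolutions and $\mathcal{F}$ as a bias-free depth-wise k×k convolution, and the function name `mmb_trace` is hypothetical.

```python
def mmb_trace(c: int, h: int, w: int, lam: float, k: int = 3) -> dict:
    """Trace tensor shapes and weight counts through a Meta Mobile Block."""
    c_mid = int(c * lam)              # expanded channels, lambda * C
    shapes = {
        "X": (c, h, w),
        "X_e": (c_mid, h, w),         # after the expansion MLP_e
        "X_f": (c_mid, h, w),         # after operator F (shape unchanged)
        "X_s": (c, h, w),             # after the shrinking MLP_s
        "Y": (c, h, w),               # residual output: Y = X + X_s
    }
    params = {
        "MLP_e": c * c_mid,           # assumed 1x1 conv, C -> lambda*C
        "F": c_mid * k * k,           # assumed depth-wise k x k conv
        "MLP_s": c_mid * c,           # assumed 1x1 conv, lambda*C -> C
    }
    return {"shapes": shapes, "params": params}


trace = mmb_trace(c=80, h=40, w=40, lam=3.0, k=5)
print(trace["shapes"]["X_e"])  # (240, 40, 40)
print(trace["params"])         # {'MLP_e': 19200, 'F': 6000, 'MLP_s': 19200}
```

The residual addition requires Y to have the same shape as X, which is why the shrinking step returns to C channels.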
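2.3 Advantages
- Strong performance: EMO performs well on benchmarks such as ImageNet-1K, COCO2017, and ADE20K.
- Computational efficiency: compared with other models, EMO has an advantage in parameter count and computational cost.
- Simple design: following simple design criteria, the model consists only of iRMB and avoids complex operations and modules, which makes it easier to optimize and deploy.
Paper: https://arxiv.org/pdf/2301.01146
Code: https://github.com/zhangzjn/EMO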
3. EMO Implementation
The implementation of EMO is as follows:
from timm.models.layers import trunc_normal_
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from functools import partial
from einops import rearrange, reduce
from timm.models.layers import DropPath
inplace = True
__all__ = ['EMO_1M', 'EMO_2M', 'EMO_5M', 'EMO_6M']


class SELayerV2(nn.Module):
    def __init__(self, in_channel, reduction=4):
        super(SELayerV2, self).__init__()
        assert in_channel >= reduction and in_channel % reduction == 0, 'invalid in_channel in SaElayer'
        self.reduction = reduction
        self.cardinality = 4
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # cardinality 1
        self.fc1 = nn.Sequential(
            nn.Linear(in_channel, in_channel // self.reduction, bias=False),
            nn.ReLU(inplace=True)
        )
        # cardinality 2
        self.fc2 = nn.Sequential(
            nn.Linear(in_channel, in_channel // self.reduction, bias=False),
            nn.ReLU(inplace=True)
        )
        # cardinality 3
        self.fc3 = nn.Sequential(
            nn.Linear(in_channel, in_channel // self.reduction, bias=False),
            nn.ReLU(inplace=True)
        )
        # cardinality 4
        self.fc4 = nn.Sequential(
            nn.Linear(in_channel, in_channel // self.reduction, bias=False),
            nn.ReLU(inplace=True)
        )
        self.fc = nn.Sequential(
            nn.Linear(in_channel // self.reduction * self.cardinality, in_channel, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)
        y1 = self.fc1(y)
        y2 = self.fc2(y)
        y3 = self.fc3(y)
        y4 = self.fc4(y)
        y_concate = torch.cat([y1, y2, y3, y4], dim=1)
        y_ex_dim = self.fc(y_concate).view(b, c, 1, 1)
        return x * y_ex_dim.expand_as(x)


def get_act(act_layer='relu'):
    act_dict = {
        'none': nn.Identity,
        'relu': nn.ReLU,
        'relu6': nn.ReLU6,
        'silu': nn.SiLU,
        'gelu': nn.GELU
    }
    return act_dict[act_layer]


class LayerNorm2d(nn.Module):
    def __init__(self, normalized_shape, eps=1e-6, elementwise_affine=True):
        super().__init__()
        self.norm = nn.LayerNorm(normalized_shape, eps, elementwise_affine)

    def forward(self, x):
        x = rearrange(x, 'b c h w -> b h w c').contiguous()
        x = self.norm(x)
        x = rearrange(x, 'b h w c -> b c h w').contiguous()
        return x


def get_norm(norm_layer='in_1d'):
    eps = 1e-6
    norm_dict = {
        'none': nn.Identity,
        'in_1d': partial(nn.InstanceNorm1d, eps=eps),
        'in_2d': partial(nn.InstanceNorm2d, eps=eps),
        'in_3d': partial(nn.InstanceNorm3d, eps=eps),
        'bn_1d': partial(nn.BatchNorm1d, eps=eps),
        'bn_2d': partial(nn.BatchNorm2d, eps=eps),
        # 'bn_2d': partial(nn.SyncBatchNorm, eps=eps),
        'bn_3d': partial(nn.BatchNorm3d, eps=eps),
        'gn': partial(nn.GroupNorm, eps=eps),
        'ln_1d': partial(nn.LayerNorm, eps=eps),
        'ln_2d': partial(LayerNorm2d, eps=eps),
    }
    return norm_dict[norm_layer]


class LayerScale(nn.Module):
    def __init__(self, dim, init_values=1e-5, inplace=True):
        super().__init__()
        self.inplace = inplace
        self.gamma = nn.Parameter(init_values * torch.ones(1, 1, dim))

    def forward(self, x):
        return x.mul_(self.gamma) if self.inplace else x * self.gamma


class LayerScale2D(nn.Module):
    def __init__(self, dim, init_values=1e-5, inplace=True):
        super().__init__()
        self.inplace = inplace
        self.gamma = nn.Parameter(init_values * torch.ones(1, dim, 1, 1))

    def forward(self, x):
        return x.mul_(self.gamma) if self.inplace else x * self.gamma


class ConvNormAct(nn.Module):
    def __init__(self, dim_in, dim_out, kernel_size, stride=1, dilation=1, groups=1, bias=False,
                 skip=False, norm_layer='bn_2d', act_layer='relu', inplace=True, drop_path_rate=0.):
        super(ConvNormAct, self).__init__()
        self.has_skip = skip and dim_in == dim_out
        padding = math.ceil((kernel_size - stride) / 2)
        self.conv = nn.Conv2d(dim_in, dim_out, kernel_size, stride, padding, dilation, groups, bias)
        self.norm = get_norm(norm_layer)(dim_out)
        # NOTE: act_layer and inplace are not used here; the activation is always GELU.
        self.act = nn.GELU()
        self.drop_path = DropPath(drop_path_rate) if drop_path_rate else nn.Identity()

    def forward(self, x):
        shortcut = x
        x = self.conv(x)
        x = self.norm(x)
        x = self.act(x)
        if self.has_skip:
            x = self.drop_path(x) + shortcut
        return x


# ========== Multi-Scale Populations, for down-sampling and inductive bias ==========
class MSPatchEmb(nn.Module):
    def __init__(self, dim_in, emb_dim, kernel_size=2, c_group=-1, stride=1, dilations=[1, 2, 3],
                 norm_layer='bn_2d', act_layer='silu'):
        super().__init__()
        self.dilation_num = len(dilations)
        assert dim_in % c_group == 0
        c_group = math.gcd(dim_in, emb_dim) if c_group == -1 else c_group
        self.convs = nn.ModuleList()
        for i in range(len(dilations)):
            padding = math.ceil(((kernel_size - 1) * dilations[i] + 1 - stride) / 2)
            self.convs.append(nn.Sequential(
                nn.Conv2d(dim_in, emb_dim, kernel_size, stride, padding, dilations[i], groups=c_group),
                get_norm(norm_layer)(emb_dim),
                get_act(act_layer)(emb_dim)))

    def forward(self, x):
        if self.dilation_num == 1:
            x = self.convs[0](x)
        else:
            x = torch.cat([self.convs[i](x).unsqueeze(dim=-1) for i in range(self.dilation_num)], dim=-1)
            x = reduce(x, 'b c h w n -> b c h w', 'mean').contiguous()
        return x


class iRMB(nn.Module):
    def __init__(self, dim_in, dim_out, norm_in=True, has_skip=True, exp_ratio=1.0, norm_layer='bn_2d',
                 act_layer='relu', v_proj=True, dw_ks=3, stride=1, dilation=1, se_ratio=0.0, dim_head=64, window_size=7,
                 attn_s=True, qkv_bias=False, attn_drop=0., drop=0., drop_path=0., v_group=False, attn_pre=False):
        super().__init__()
        self.norm = get_norm(norm_layer)(dim_in) if norm_in else nn.Identity()
        dim_mid = int(dim_in * exp_ratio)
        self.has_skip = (dim_in == dim_out and stride == 1) and has_skip
        self.attn_s = attn_s
        if self.attn_s:
            assert dim_in % dim_head == 0, 'dim should be divisible by num_heads'
            self.dim_head = dim_head
            self.window_size = window_size
            self.num_head = dim_in // dim_head
            self.scale = self.dim_head ** -0.5
            self.attn_pre = attn_pre
            self.qk = ConvNormAct(dim_in, int(dim_in * 2), kernel_size=1, bias=qkv_bias, norm_layer='none',
                                  act_layer='none')
            self.v = ConvNormAct(dim_in, dim_mid, kernel_size=1, groups=self.num_head if v_group else 1, bias=qkv_bias,
                                 norm_layer='none', act_layer=act_layer, inplace=inplace)
            self.attn_drop = nn.Dropout(attn_drop)
        else:
            if v_proj:
                self.v = ConvNormAct(dim_in, dim_mid, kernel_size=1, bias=qkv_bias, norm_layer='none',
                                     act_layer=act_layer, inplace=inplace)
            else:
                self.v = nn.Identity()
        self.conv_local = ConvNormAct(dim_mid, dim_mid, kernel_size=dw_ks, stride=stride, dilation=dilation,
                                      groups=dim_mid, norm_layer='bn_2d', act_layer='silu', inplace=inplace)
        self.se = SELayerV2(dim_mid)
        self.proj_drop = nn.Dropout(drop)
        self.proj = ConvNormAct(dim_mid, dim_out, kernel_size=1, norm_layer='none', act_layer='none', inplace=inplace)
        self.drop_path = DropPath(drop_path) if drop_path else nn.Identity()

    def forward(self, x):
        shortcut = x
        x = self.norm(x)
        B, C, H, W = x.shape
        if self.attn_s:
            # padding
            if self.window_size <= 0:
                window_size_W, window_size_H = W, H
            else:
                window_size_W, window_size_H = self.window_size, self.window_size
            pad_l, pad_t = 0, 0
            pad_r = (window_size_W - W % window_size_W) % window_size_W
            pad_b = (window_size_H - H % window_size_H) % window_size_H
            x = F.pad(x, (pad_l, pad_r, pad_t, pad_b, 0, 0,))
            n1, n2 = (H + pad_b) // window_size_H, (W + pad_r) // window_size_W
            x = rearrange(x, 'b c (h1 n1) (w1 n2) -> (b n1 n2) c h1 w1', n1=n1, n2=n2).contiguous()
            # attention
            b, c, h, w = x.shape
            qk = self.qk(x)
            qk = rearrange(qk, 'b (qk heads dim_head) h w -> qk b heads (h w) dim_head', qk=2, heads=self.num_head,
                           dim_head=self.dim_head).contiguous()
            q, k = qk[0], qk[1]
            attn_spa = (q @ k.transpose(-2, -1)) * self.scale
            attn_spa = attn_spa.softmax(dim=-1)
            attn_spa = self.attn_drop(attn_spa)
            if self.attn_pre:
                x = rearrange(x, 'b (heads dim_head) h w -> b heads (h w) dim_head', heads=self.num_head).contiguous()
                x_spa = attn_spa @ x
                x_spa = rearrange(x_spa, 'b heads (h w) dim_head -> b (heads dim_head) h w', heads=self.num_head, h=h,
                                  w=w).contiguous()
                x_spa = self.v(x_spa)
            else:
                v = self.v(x)
                v = rearrange(v, 'b (heads dim_head) h w -> b heads (h w) dim_head', heads=self.num_head).contiguous()
                x_spa = attn_spa @ v
                x_spa = rearrange(x_spa, 'b heads (h w) dim_head -> b (heads dim_head) h w', heads=self.num_head, h=h,
                                  w=w).contiguous()
            # unpadding
            x = rearrange(x_spa, '(b n1 n2) c h1 w1 -> b c (h1 n1) (w1 n2)', n1=n1, n2=n2).contiguous()
            if pad_r > 0 or pad_b > 0:
                x = x[:, :, :H, :W].contiguous()
        else:
            x = self.v(x)
        x = x + self.se(self.conv_local(x)) if self.has_skip else self.se(self.conv_local(x))
        x = self.proj_drop(x)
        x = self.proj(x)
        x = (shortcut + self.drop_path(x)) if self.has_skip else x
        return x


class EMO(nn.Module):
    def __init__(self, dim_in=3, num_classes=1000, img_size=224,
                 depths=[1, 2, 4, 2], stem_dim=16, embed_dims=[64, 128, 256, 512], exp_ratios=[4., 4., 4., 4.],
                 norm_layers=['bn_2d', 'bn_2d', 'bn_2d', 'bn_2d'], act_layers=['relu', 'relu', 'relu', 'relu'],
                 dw_kss=[3, 3, 5, 5], se_ratios=[0.0, 0.0, 0.0, 0.0], dim_heads=[32, 32, 32, 32],
                 window_sizes=[7, 7, 7, 7], attn_ss=[False, False, True, True], qkv_bias=True,
                 attn_drop=0., drop=0., drop_path=0., v_group=False, attn_pre=False, pre_dim=0):
        super().__init__()
        self.num_classes = num_classes
        assert num_classes > 0
        dprs = [x.item() for x in torch.linspace(0, drop_path, sum(depths))]
        self.stage0 = nn.ModuleList([
            MSPatchEmb(  # down to 112
                dim_in, stem_dim, kernel_size=dw_kss[0], c_group=1, stride=2, dilations=[1],
                norm_layer=norm_layers[0], act_layer='none'),
            iRMB(  # ds
                stem_dim, stem_dim, norm_in=False, has_skip=False, exp_ratio=1,
                norm_layer=norm_layers[0], act_layer=act_layers[0], v_proj=False, dw_ks=dw_kss[0],
                stride=1, dilation=1, se_ratio=1,
                dim_head=dim_heads[0], window_size=window_sizes[0], attn_s=False,
                qkv_bias=qkv_bias, attn_drop=attn_drop, drop=drop, drop_path=0.,
                attn_pre=attn_pre
            )
        ])
        emb_dim_pre = stem_dim
        for i in range(len(depths)):
            layers = []
            dpr = dprs[sum(depths[:i]):sum(depths[:i + 1])]
            for j in range(depths[i]):
                if j == 0:
                    stride, has_skip, attn_s, exp_ratio = 2, False, False, exp_ratios[i] * 2
                else:
                    stride, has_skip, attn_s, exp_ratio = 1, True, attn_ss[i], exp_ratios[i]
                layers.append(iRMB(
                    emb_dim_pre, embed_dims[i], norm_in=True, has_skip=has_skip, exp_ratio=exp_ratio,
                    norm_layer=norm_layers[i], act_layer=act_layers[i], v_proj=True, dw_ks=dw_kss[i],
                    stride=stride, dilation=1, se_ratio=se_ratios[i],
                    dim_head=dim_heads[i], window_size=window_sizes[i], attn_s=attn_s,
                    qkv_bias=qkv_bias, attn_drop=attn_drop, drop=drop, drop_path=dpr[j], v_group=v_group,
                    attn_pre=attn_pre
                ))
                emb_dim_pre = embed_dims[i]
            self.__setattr__(f'stage{i + 1}', nn.ModuleList(layers))
        self.norm = get_norm(norm_layers[-1])(embed_dims[-1])
        if pre_dim > 0:
            self.pre_head = nn.Sequential(nn.Linear(embed_dims[-1], pre_dim), get_act(act_layers[-1])(inplace=inplace))
            self.pre_dim = pre_dim
        else:
            self.pre_head = nn.Identity()
            self.pre_dim = embed_dims[-1]
        self.head = nn.Linear(self.pre_dim, num_classes)
        self.apply(self._init_weights)
        self.width_list = [i.size(1) for i in self.forward(torch.randn(1, 3, 640, 640))]

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight, std=.02)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, (nn.LayerNorm, nn.GroupNorm,
                            nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d,
                            nn.InstanceNorm1d, nn.InstanceNorm2d, nn.InstanceNorm3d)):
            nn.init.zeros_(m.bias)
            nn.init.ones_(m.weight)

    @torch.jit.ignore
    def no_weight_decay(self):
        return {'token'}

    @torch.jit.ignore
    def no_weight_decay_keywords(self):
        return {'alpha', 'gamma', 'beta'}

    @torch.jit.ignore
    def no_ft_keywords(self):
        # return {'head.weight', 'head.bias'}
        return {}

    @torch.jit.ignore
    def ft_head_keywords(self):
        return {'head.weight', 'head.bias'}, self.num_classes

    def get_classifier(self):
        return self.head

    def reset_classifier(self, num_classes):
        self.num_classes = num_classes
        self.head = nn.Linear(self.pre_dim, num_classes) if num_classes > 0 else nn.Identity()

    def check_bn(self):
        for name, m in self.named_modules():
            if isinstance(m, nn.modules.batchnorm._NormBase):
                m.running_mean = torch.nan_to_num(m.running_mean, nan=0, posinf=1, neginf=-1)
                m.running_var = torch.nan_to_num(m.running_var, nan=0, posinf=1, neginf=-1)

    def forward(self, x):
        unique_tensors = {}
        for blk in self.stage0:
            x = blk(x)
        width, height = x.shape[2], x.shape[3]
        unique_tensors[(width, height)] = x
        for blk in self.stage1:
            x = blk(x)
        width, height = x.shape[2], x.shape[3]
        unique_tensors[(width, height)] = x
        for blk in self.stage2:
            x = blk(x)
        width, height = x.shape[2], x.shape[3]
        unique_tensors[(width, height)] = x
        for blk in self.stage3:
            x = blk(x)
        width, height = x.shape[2], x.shape[3]
        unique_tensors[(width, height)] = x
        for blk in self.stage4:
            x = blk(x)
        width, height = x.shape[2], x.shape[3]
        unique_tensors[(width, height)] = x
        result_list = list(unique_tensors.values())[-4:]
        return result_list


def EMO_1M(pretrained=False, **kwargs):
    model = EMO(
        # dim_in=3, num_classes=1000, img_size=224,
        depths=[2, 2, 8, 3], stem_dim=24, embed_dims=[32, 48, 80, 168], exp_ratios=[2., 2.5, 3.0, 3.5],
        norm_layers=['bn_2d', 'bn_2d', 'ln_2d', 'ln_2d'], act_layers=['silu', 'silu', 'gelu', 'gelu'],
        dw_kss=[3, 3, 5, 5], dim_heads=[16, 16, 20, 21], window_sizes=[7, 7, 7, 7], attn_ss=[False, False, True, True],
        qkv_bias=True, attn_drop=0., drop=0., drop_path=0.04036, v_group=False, attn_pre=True, pre_dim=0,
        **kwargs)
    return model


def EMO_2M(pretrained=False, **kwargs):
    model = EMO(
        # dim_in=3, num_classes=1000, img_size=224,
        depths=[3, 3, 9, 3], stem_dim=24, embed_dims=[32, 48, 120, 200], exp_ratios=[2., 2.5, 3.0, 3.5],
        norm_layers=['bn_2d', 'bn_2d', 'ln_2d', 'ln_2d'], act_layers=['silu', 'silu', 'gelu', 'gelu'],
        dw_kss=[3, 3, 5, 5], dim_heads=[16, 16, 20, 20], window_sizes=[7, 7, 7, 7], attn_ss=[False, False, True, True],
        qkv_bias=True, attn_drop=0., drop=0., drop_path=0.05, v_group=False, attn_pre=True, pre_dim=0,
        **kwargs)
    return model


def EMO_5M(pretrained=False, **kwargs):
    model = EMO(
        # dim_in=3, num_classes=1000, img_size=224,
        depths=[3, 3, 9, 3], stem_dim=24, embed_dims=[48, 72, 160, 288], exp_ratios=[2., 3., 4., 4.],
        norm_layers=['bn_2d', 'bn_2d', 'ln_2d', 'ln_2d'], act_layers=['silu', 'silu', 'gelu', 'gelu'],
        dw_kss=[3, 3, 5, 5], dim_heads=[24, 24, 32, 32], window_sizes=[7, 7, 7, 7], attn_ss=[False, False, True, True],
        qkv_bias=True, attn_drop=0., drop=0., drop_path=0.05, v_group=False, attn_pre=True, pre_dim=0,
        **kwargs)
    return model


def EMO_6M(pretrained=False, **kwargs):
    model = EMO(
        # dim_in=3, num_classes=1000, img_size=224,
        depths=[3, 3, 9, 3], stem_dim=24, embed_dims=[48, 72, 160, 320], exp_ratios=[2., 3., 4., 5.],
        norm_layers=['bn_2d', 'bn_2d', 'ln_2d', 'ln_2d'], act_layers=['silu', 'silu', 'gelu', 'gelu'],
        dw_kss=[3, 3, 5, 5], dim_heads=[16, 24, 20, 32], window_sizes=[7, 7, 7, 7], attn_ss=[False, False, True, True],
        qkv_bias=True, attn_drop=0., drop=0., drop_path=0.05, v_group=False, attn_pre=True, pre_dim=0,
        **kwargs)
    return model


if __name__ == "__main__":
    # Generating Sample image
    image_size = (1, 3, 640, 640)
    image = torch.rand(*image_size)
    # Model
    model = EMO_6M()
    out = model(image)
    print(len(out))
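Two details of `iRMB.forward` are worth isolating: how a feature map is padded to a multiple of the window size, and how many windows and heads result. The sketch below repeats that arithmetic in plain Python. The helper name `window_layout` is hypothetical, and the 40×40 and 20×20 map sizes assume a 640×640 input, where the stride-2 stem and the stride-2 first block of each stage give overall strides of 16 and 32 for stages 3 and 4.

```python
def window_layout(height: int, width: int, window: int, dim_in: int, dim_head: int) -> dict:
    """Mirror the padding and partition arithmetic used in iRMB.forward."""
    pad_r = (window - width % window) % window    # right padding
    pad_b = (window - height % window) % window   # bottom padding
    n1 = (height + pad_b) // window               # window count along height
    n2 = (width + pad_r) // window                # window count along width
    assert dim_in % dim_head == 0, "dim_in must be divisible by dim_head"
    return {
        "pad": (pad_r, pad_b),
        "windows": n1 * n2,
        "tokens_per_window": window * window,
        "heads": dim_in // dim_head,
    }


# EMO_1M, stage 3: 80 channels, dim_head 20, window size 7 on a 40x40 map.
stage3 = window_layout(40, 40, window=7, dim_in=80, dim_head=20)
# EMO_1M, stage 4: 168 channels, dim_head 21, window size 7 on a 20x20 map.
stage4 = window_layout(20, 20, window=7, dim_in=168, dim_head=21)
print(stage3)  # {'pad': (2, 2), 'windows': 36, 'tokens_per_window': 49, 'heads': 4}
print(stage4)  # {'pad': (1, 1), 'windows': 9, 'tokens_per_window': 49, 'heads': 8}
```

Each window attends over only 49 tokens, so the attention cost per window stays fixed while the number of windows grows with the feature-map area.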
4. Modification Steps
4.1 Modification 1
① Create an AddModules folder under ultralytics/nn/ to hold the module code.
② Create EMO.py in the AddModules folder and paste the code from Section 3 into it.
4.2 Modification 2
Create __init__.py in the AddModules folder (skip this if it already exists) and import the module in it:
from .EMO import *
4.3 Modification 3
In ultralytics/nn/tasks.py, the module class names need to be added in two places.
① First, import the modules.
② Next, add the following two lines to the parse_model function:
backbone = False
t=m
③ Then add the following branch to the same function:
elif m in {EMO_1M, EMO_2M, EMO_5M, EMO_6M, }:
    m = m(*args)
    c2 = m.width_list
    backbone = True
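④ Then replace the block highlighted in the original screenshot (from the construction of m_ through the update of ch) with the following code: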
if isinstance(c2, list):
    backbone = True
    m_ = m
    m_.backbone = True
else:
    m_ = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args)  # module
    t = str(m)[8:-2].replace('__main__.', '')  # module type
m.np = sum(x.numel() for x in m_.parameters())  # number params
m_.i, m_.f, m_.type = i + 4 if backbone else i, f, t  # attach index, 'from' index, type
if verbose:
    LOGGER.info(f'{i:>3}{str(f):>20}{n_:>3}{m.np:10.0f}  {t:<45}{str(args):<30}')  # print
save.extend(x % (i + 4 if backbone else i) for x in ([f] if isinstance(f, int) else f) if
            x != -1)  # append to savelist
layers.append(m_)
if i == 0:
    ch = []
if isinstance(c2, list):
    ch.extend(c2)
    for _ in range(5 - len(ch)):
        ch.insert(0, 0)
else:
    ch.append(c2)
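The tail of the block above decides how `ch` is filled when the backbone returns several feature maps. This is a minimal standalone sketch of that logic, assuming `c2` is the backbone's `width_list`; the values `[32, 48, 80, 168]` are the `embed_dims` of `EMO_1M`, and `update_channels` is a hypothetical helper.

```python
def update_channels(ch: list, c2, i: int) -> list:
    """Replicate how the snippet extends `ch` for list-valued `c2`."""
    if i == 0:
        ch = []                   # reset on the first layer
    if isinstance(c2, list):
        ch.extend(c2)             # one entry per backbone feature map
        for _ in range(5 - len(ch)):
            ch.insert(0, 0)       # left-pad so the backbone fills slots 0..4
    else:
        ch.append(c2)             # ordinary layer: a single channel count
    return ch


ch = update_channels([3], [32, 48, 80, 168], i=0)  # backbone layer
ch = update_channels(ch, 256, i=1)                 # first head layer
print(ch)  # [0, 32, 48, 80, 168, 256]
```

With this layout, a `from` index of 3 resolves to 80 channels and an index of 2 to 48 channels, which matches the `Conv` arguments `[80, 256, ...]` and `[48, 256, ...]` in the printed model in Section 6.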
⑤ In the same file, find _predict_once in the BaseModel class and replace it with the following code.
def _predict_once(self, x, profile=False, visualize=False, embed=None):
    y, dt, embeddings = [], [], []  # outputs
    for m in self.model:
        if m.f != -1:  # if not from previous layer
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
        if profile:
            self._profile_one_layer(m, x, dt)
        if hasattr(m, 'backbone'):
            x = m(x)
            if len(x) != 5:  # 0 - 5
                x.insert(0, None)
            for index, i in enumerate(x):
                if index in self.save:
                    y.append(i)
                else:
                    y.append(None)
            x = x[-1]  # pass the last output to the next layer
        else:
            x = m(x)  # run
            y.append(x if m.i in self.save else None)  # save output
        if visualize:
            feature_visualization(x, m.type, m.i, save_dir=visualize)
        if embed and m.i in embed:
            embeddings.append(nn.functional.adaptive_avg_pool2d(x, (1, 1)).squeeze(-1).squeeze(-1))  # flatten
            if m.i == max(embed):
                return torch.unbind(torch.cat(embeddings, 1), dim=0)
    return x
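The backbone branch above can be exercised without a model. The sketch below is a simplified stand-in, not Ultralytics code: strings replace tensors, the labels P2 to P5 are assumed names for the stride-4 to stride-32 maps, `save` is a hypothetical set of indices that later layers read from, and `collect_backbone_outputs` is a hypothetical helper.

```python
def collect_backbone_outputs(features: list, save: set) -> tuple:
    """Mimic the backbone branch: pad to 5 slots, keep only saved indices."""
    x = list(features)            # copy so the caller's list is untouched
    y = []
    if len(x) != 5:
        x.insert(0, None)         # placeholder so the outputs occupy 0..4
    for index, feat in enumerate(x):
        y.append(feat if index in save else None)
    return y, x[-1]               # stored outputs, input for the next layer


feats = ["P2", "P3", "P4", "P5"]  # four maps returned by EMO.forward
y, nxt = collect_backbone_outputs(feats, save={2, 3})
print(y)    # [None, None, 'P3', 'P4', None]
print(nxt)  # P5
```

Only the feature maps that later layers reference are kept in `y`; the deepest map is handed directly to the next layer.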
This completes the code changes; the model can now be configured and trained.
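5. YAML Model File
5.1 Model Improvement ⭐
Once the code changes are in place, configure the model's YAML file.
Taking ultralytics/cfg/models/rt-detr/rtdetr-l.yaml as the example, create a model file named rtdetr-l-EMO.yaml in the same directory for training on your own dataset.
Copy the contents of rtdetr-l.yaml into rtdetr-l-EMO.yaml and set nc to the number of object classes in your dataset.
📌 The modification replaces the backbone with EMO.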
# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-l object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/models/rtdetr

# Parameters
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
  # [depth, width, max_channels]
  l: [1.00, 1.00, 1024]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, EMO_1M, []] # 4

head:
  - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 5 input_proj.2
  - [-1, 1, AIFI, [1024, 8]] # 6
  - [-1, 1, Conv, [256, 1, 1]] # 7, Y5, lateral_convs.0

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 8
  - [3, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 9 input_proj.1
  - [[-2, -1], 1, Concat, [1]] # 10
  - [-1, 3, RepC3, [256]] # 11, fpn_blocks.0
  - [-1, 1, Conv, [256, 1, 1]] # 12, Y4, lateral_convs.1

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 13
  - [2, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 14 input_proj.0
  - [[-2, -1], 1, Concat, [1]] # 15 cat backbone P4
  - [-1, 3, RepC3, [256]] # X3 (16), fpn_blocks.1

  - [-1, 1, Conv, [256, 3, 2]] # 17, downsample_convs.0
  - [[-1, 12], 1, Concat, [1]] # 18 cat Y4
  - [-1, 3, RepC3, [256]] # F4 (19), pan_blocks.0

  - [-1, 1, Conv, [256, 3, 2]] # 20, downsample_convs.1
  - [[-1, 7], 1, Concat, [1]] # 21 cat Y5
  - [-1, 3, RepC3, [256]] # F5 (22), pan_blocks.1

  - [[16, 19, 22], 1, RTDETRDecoder, [nc]] # Detect(P3, P4, P5)
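Because the single backbone entry occupies indices 0 to 4, the head layers start at index 5. The sketch below recomputes the indices from the module order in the YAML and checks which layers the decoder reads from; the list is transcribed by hand from the file above, and the variable names are illustrative.

```python
# Head modules in the order they appear in rtdetr-l-EMO.yaml.
head = [
    "Conv", "AIFI", "Conv", "Upsample", "Conv", "Concat", "RepC3",
    "Conv", "Upsample", "Conv", "Concat", "RepC3",
    "Conv", "Concat", "RepC3",
    "Conv", "Concat", "RepC3",
    "RTDETRDecoder",
]

BACKBONE_LAST_INDEX = 4  # the single EMO entry is treated as layers 0..4
index_of = {BACKBONE_LAST_INDEX + 1 + k: name for k, name in enumerate(head)}

decoder_inputs = [16, 19, 22]  # 'from' field of RTDETRDecoder
print([index_of[i] for i in decoder_inputs])  # ['RepC3', 'RepC3', 'RepC3']
print(max(index_of))                          # 23
```

The same mapping confirms that the `Concat` entries referencing 12 and 7 point at the two lateral `Conv` layers marked Y4 and Y5.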
6. Successful Run Output
Printing the network shows that EMO has been added to the model and that training can proceed.
rtdetr-l-EMO:
rtdetr-l-EMO summary: 1,023 layers, 22,863,235 parameters, 22,863,235 gradients, 64.6 GFLOPs
             from  n    params  module                                     arguments
  0            -1  1   4450208  EMO_1M                                     []
  1            -1  1     43520  ultralytics.nn.modules.conv.Conv           [168, 256, 1, 1, None, 1, 1, False]
  2            -1  1    789760  ultralytics.nn.modules.transformer.AIFI    [256, 1024, 8]
  3            -1  1     66048  ultralytics.nn.modules.conv.Conv           [256, 256, 1, 1]
  4            -1  1         0  torch.nn.modules.upsampling.Upsample       [None, 2, 'nearest']
  5             3  1     20992  ultralytics.nn.modules.conv.Conv           [80, 256, 1, 1, None, 1, 1, False]
  6      [-2, -1]  1         0  ultralytics.nn.modules.conv.Concat         [1]
  7            -1  3   2232320  ultralytics.nn.modules.block.RepC3         [512, 256, 3]
  8            -1  1     66048  ultralytics.nn.modules.conv.Conv           [256, 256, 1, 1]
  9            -1  1         0  torch.nn.modules.upsampling.Upsample       [None, 2, 'nearest']
 10             2  1     12800  ultralytics.nn.modules.conv.Conv           [48, 256, 1, 1, None, 1, 1, False]
 11      [-2, -1]  1         0  ultralytics.nn.modules.conv.Concat         [1]
 12            -1  3   2232320  ultralytics.nn.modules.block.RepC3         [512, 256, 3]
 13            -1  1    590336  ultralytics.nn.modules.conv.Conv           [256, 256, 3, 2]
 14      [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat         [1]
 15            -1  3   2232320  ultralytics.nn.modules.block.RepC3         [512, 256, 3]
 16            -1  1    590336  ultralytics.nn.modules.conv.Conv           [256, 256, 3, 2]
 17       [-1, 7]  1         0  ultralytics.nn.modules.conv.Concat         [1]
 18            -1  3   2232320  ultralytics.nn.modules.block.RepC3         [512, 256, 3]
 19  [16, 19, 22]  1   7303907  ultralytics.nn.modules.head.RTDETRDecoder  [1, [256, 256, 256]]
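With the YAML in place, training can be started from Python. The following is a minimal sketch, assuming the `ultralytics` package with the modifications above is installed; the dataset file `my_dataset.yaml` and the hyperparameter values are placeholders, not settings from this post, and the helper names are hypothetical.

```python
from pathlib import Path

MODEL_YAML = Path("ultralytics/cfg/models/rt-detr/rtdetr-l-EMO.yaml")


def train_settings(data_yaml: str = "my_dataset.yaml") -> dict:
    """Collect keyword arguments for training (placeholder values)."""
    return {"data": data_yaml, "epochs": 100, "imgsz": 640, "batch": 4}


def train_emo_rtdetr(data_yaml: str = "my_dataset.yaml"):
    """Build the modified RT-DETR from the YAML and train it."""
    # Imported lazily so this file can be read and tested without ultralytics.
    from ultralytics import RTDETR

    model = RTDETR(str(MODEL_YAML))  # build from the YAML, no pretrained weights
    model.info()                     # print the layer summary
    return model.train(**train_settings(data_yaml))


if __name__ == "__main__":
    print(train_settings())
```

Call `train_emo_rtdetr()` once the dataset YAML exists; `model.info()` should print a summary comparable to the one shown above.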