
RT-DETR Improvement Strategies [Model Lightweighting] | EMO: ICCV 2023, a Lightweight Self-Attention Model with a Simple Structure


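1. Introduction

This article describes a lightweight modification of RT-DETR based on EMO. EMO has a simple design: a 4-stage architecture built solely from iRMB blocks, with no complex operations or modules and no need for careful hyperparameter tuning. Within iRMB, DW-Conv and EW-MHSA model short-range and long-range dependencies respectively, which reduces computation while preserving accuracy. Using EMO as the backbone of RT-DETR keeps the model lightweight while aiming to improve its performance on object detection.

On top of RT-DETR, this article configures the four models from the original paper, EMO_1M, EMO_2M, EMO_5M, and EMO_6M, to cover different requirements.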

Model      Parameters   Computation
rtdetr-l   32.8M        108.0 GFLOPs
Improved   22.9M        64.6 GFLOPs
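
To put the table into relative terms, the short sketch below computes the percentage reduction for both columns. The helper name `relative_reduction` is made up for this illustration; the four figures are copied from the table.

```python
# Figures copied from the table above.
baseline = {"params_m": 32.8, "gflops": 108.0}   # rtdetr-l
improved = {"params_m": 22.9, "gflops": 64.6}    # rtdetr-l with the EMO backbone


def relative_reduction(before: float, after: float) -> float:
    """Return the reduction from `before` to `after` as a percentage."""
    return (before - after) / before * 100.0


param_cut = relative_reduction(baseline["params_m"], improved["params_m"])
flop_cut = relative_reduction(baseline["gflops"], improved["gflops"])
print(f"parameters: -{param_cut:.1f}%  GFLOPs: -{flop_cut:.1f}%")
# prints: parameters: -30.2%  GFLOPs: -40.2%
```

In other words, the modified model has roughly 30% fewer parameters and roughly 40% less computation than the rtdetr-l baseline.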


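2. How EMO Works

Rethinking Mobile Block for Efficient Attention-based Models

EMO aims to provide an efficient, attention-based lightweight model for mobile applications, and it reports strong results on several vision tasks. The following covers its motivation, structure, and advantages.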

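2.1 Motivation

  1. Demand is growing for efficient vision models in mobile applications with limited storage and compute, yet traditional CNN-based models are constrained by the inductive bias of static convolutions, so their accuracy leaves room for improvement.
  2. Attention-based models have advantages, but the cost of multi-head self-attention (MHSA) grows quadratically with the number of tokens, which makes them resource-hungry.
  3. Current efficient hybrid models tend to have complex structures or many different modules, which makes them hard to optimize for deployment.

It is therefore worth exploring a lightweight, IRB-like building block for attention-based models.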
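
The quadratic growth mentioned in point 2 can be made concrete with a little arithmetic. The sketch below only counts the entries of one head's attention score matrix when every token attends to every other token; the function name and the feature-map sizes are chosen for illustration and are not taken from the paper.

```python
def global_attention_entries(height: int, width: int) -> int:
    """Entries in one head's score matrix when all tokens attend to all tokens."""
    tokens = height * width
    return tokens * tokens


# Doubling the side length quadruples the token count and
# multiplies the score-matrix size by 16.
for side in (20, 40, 80):
    print(side, global_attention_entries(side, side))
# prints:
# 20 160000
# 40 2560000
# 80 40960000
```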

2.2 Structure

  • Meta Mobile Block (MMB): the paper revisits the inverted residual block (IRB) of MobileNetV2 and the core Transformer modules MHSA and FFN, and abstracts them into the Meta Mobile Block (MMB).
  • Taking an image input $X \in \mathbb{R}^{C \times H \times W}$ as an example, MMB first expands the channel dimension with an expansion $MLP_{e}$ whose output/input ratio is $\lambda$, giving $X_{e} = MLP_{e}(X) \in \mathbb{R}^{\lambda C \times H \times W}$. An efficient operator $\mathcal{F}$ then enhances the image features, giving $X_{f} = \mathcal{F}(X_{e}) \in \mathbb{R}^{\lambda C \times H \times W}$. Finally, a shrinking $MLP_{s}$ whose input/output ratio is $\lambda$ reduces the channel dimension, giving $X_{s} = MLP_{s}(X_{f}) \in \mathbb{R}^{C \times H \times W}$, and a residual connection produces the final output $Y = X + X_{s} \in \mathbb{R}^{C \times H \times W}$.


  • Inverted Residual Mobile Block (iRMB): building on MMB, iRMB models the operator $\mathcal{F}$ as a cascade of MHSA and convolution, $\mathcal{F}(\cdot) = Conv(MHSA(\cdot))$. To keep the cost down, it uses efficient window MHSA (WMHSA) and depthwise separable convolution (DW-Conv) together with a residual connection. It also proposes an improved EW-MHSA, in which $Q = K = X \in \mathbb{R}^{C \times H \times W}$ while $V \in \mathbb{R}^{\lambda C \times H \times W}$, so that $\mathcal{F}(\cdot) = (\text{DW-Conv}, \text{Skip})(\text{EW-MHSA}(\cdot))$.


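  • Overall EMO architecture: EMO is a ResNet-like 4-stage efficient model built from a series of iRMB blocks. The whole framework consists only of iRMB, with no diversified modules. iRMB itself contains only standard convolution and multi-head self-attention, needs no other complex operators, handles down-sampling through its stride, and requires no position embeddings. The expansion ratio and the channel count increase gradually from stage to stage.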
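
The channel bookkeeping of the MMB described above can be traced with plain arithmetic. This is a minimal sketch and not the authors' implementation: it assumes both MLPs are bias-free 1x1 convolutions and that the operator F is a bias-free k x k depthwise convolution. The function name `mmb_trace` is hypothetical, and the example values (C = 80, λ = 3.0, k = 5, a 40 x 40 map) are picked to resemble the third stage of the EMO_1M configuration with a 640 x 640 input.

```python
def mmb_trace(c: int, h: int, w: int, lam: float, k: int = 3):
    """Return (step, output shape, weight count) for each step of an MMB.

    Assumptions for illustration only: both MLPs are bias-free 1x1
    convolutions and the operator F is a bias-free k x k depthwise convolution.
    """
    c_mid = int(c * lam)  # expanded channel count, lambda * C
    return [
        ("X", (c, h, w), 0),
        ("X_e = MLP_e(X)", (c_mid, h, w), c * c_mid),
        ("X_f = F(X_e)", (c_mid, h, w), k * k * c_mid),
        ("X_s = MLP_s(X_f)", (c, h, w), c_mid * c),
        ("Y = X + X_s", (c, h, w), 0),
    ]


steps = mmb_trace(c=80, h=40, w=40, lam=3.0, k=5)
for name, shape, weights in steps:
    print(f"{name:<18} {shape}  weights={weights}")
total_weights = sum(weights for _, _, weights in steps)
print(total_weights)  # 44400
```

Under these assumptions the depthwise operator contributes only 6,000 of the 44,400 weights, whereas a dense 5 x 5 convolution over the 240 expanded channels would need 5 * 5 * 240 * 240 = 1,440,000.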


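2.3 Advantages

  • Strong performance: EMO performs well on benchmarks such as ImageNet-1K, COCO2017, and ADE20K.
  • Computational efficiency: compared with other models, EMO is more favorable in both parameter count and computation.
  • Simple design: following simple design criteria, the model is built only from iRMB and avoids complex operations and modules, which makes it easier to optimize and deploy.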

Paper: https://arxiv.org/pdf/2301.01146
Source code: https://github.com/zhangzjn/EMO

3. EMO Implementation Code

The EMO implementation is as follows:

import math
from functools import partial

import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange, reduce
from timm.models.layers import DropPath, trunc_normal_
 
inplace = True
 
__all__ = ['EMO_1M', 'EMO_2M', 'EMO_5M', 'EMO_6M']
 
class SELayerV2(nn.Module):
    def __init__(self, in_channel, reduction=4):
        super(SELayerV2, self).__init__()
        assert in_channel >= reduction and in_channel % reduction == 0, 'invalid in_channel in SaElayer'
        self.reduction = reduction
        self.cardinality = 4
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # cardinality 1
        self.fc1 = nn.Sequential(
            nn.Linear(in_channel, in_channel // self.reduction, bias=False),
            nn.ReLU(inplace=True)
        )
        # cardinality 2
        self.fc2 = nn.Sequential(
            nn.Linear(in_channel, in_channel // self.reduction, bias=False),
            nn.ReLU(inplace=True)
        )
        # cardinality 3
        self.fc3 = nn.Sequential(
            nn.Linear(in_channel, in_channel // self.reduction, bias=False),
            nn.ReLU(inplace=True)
        )
        # cardinality 4
        self.fc4 = nn.Sequential(
            nn.Linear(in_channel, in_channel // self.reduction, bias=False),
            nn.ReLU(inplace=True)
        )
 
        self.fc = nn.Sequential(
            nn.Linear(in_channel // self.reduction * self.cardinality, in_channel, bias=False),
            nn.Sigmoid()
        )
 
    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)
        y1 = self.fc1(y)
        y2 = self.fc2(y)
        y3 = self.fc3(y)
        y4 = self.fc4(y)
        y_concate = torch.cat([y1, y2, y3, y4], dim=1)
        y_ex_dim = self.fc(y_concate).view(b, c, 1, 1)
 
        return x * y_ex_dim.expand_as(x)
 
def get_act(act_layer='relu'):
    act_dict = {
        'none': nn.Identity,
        'relu': nn.ReLU,
        'relu6': nn.ReLU6,
        'silu': nn.SiLU,
        'gelu': nn.GELU
    }
    return act_dict[act_layer]

class LayerNorm2d(nn.Module):
 
    def __init__(self, normalized_shape, eps=1e-6, elementwise_affine=True):
        super().__init__()
        self.norm = nn.LayerNorm(normalized_shape, eps, elementwise_affine)
 
    def forward(self, x):
        x = rearrange(x, 'b c h w -> b h w c').contiguous()
        x = self.norm(x)
        x = rearrange(x, 'b h w c -> b c h w').contiguous()
        return x

def get_norm(norm_layer='in_1d'):
    eps = 1e-6
    norm_dict = {
        'none': nn.Identity,
        'in_1d': partial(nn.InstanceNorm1d, eps=eps),
        'in_2d': partial(nn.InstanceNorm2d, eps=eps),
        'in_3d': partial(nn.InstanceNorm3d, eps=eps),
        'bn_1d': partial(nn.BatchNorm1d, eps=eps),
        'bn_2d': partial(nn.BatchNorm2d, eps=eps),
        # 'bn_2d': partial(nn.SyncBatchNorm, eps=eps),
        'bn_3d': partial(nn.BatchNorm3d, eps=eps),
        'gn': partial(nn.GroupNorm, eps=eps),
        'ln_1d': partial(nn.LayerNorm, eps=eps),
        'ln_2d': partial(LayerNorm2d, eps=eps),
    }
    return norm_dict[norm_layer]

class LayerScale(nn.Module):
    def __init__(self, dim, init_values=1e-5, inplace=True):
        super().__init__()
        self.inplace = inplace
        self.gamma = nn.Parameter(init_values * torch.ones(1, 1, dim))
 
    def forward(self, x):
        return x.mul_(self.gamma) if self.inplace else x * self.gamma

class LayerScale2D(nn.Module):
    def __init__(self, dim, init_values=1e-5, inplace=True):
        super().__init__()
        self.inplace = inplace
        self.gamma = nn.Parameter(init_values * torch.ones(1, dim, 1, 1))
 
    def forward(self, x):
        return x.mul_(self.gamma) if self.inplace else x * self.gamma

class ConvNormAct(nn.Module):
 
    def __init__(self, dim_in, dim_out, kernel_size, stride=1, dilation=1, groups=1, bias=False,
                 skip=False, norm_layer='bn_2d', act_layer='relu', inplace=True, drop_path_rate=0.):
        super(ConvNormAct, self).__init__()
        self.has_skip = skip and dim_in == dim_out
        padding = math.ceil((kernel_size - stride) / 2)
        self.conv = nn.Conv2d(dim_in, dim_out, kernel_size, stride, padding, dilation, groups, bias)
        self.norm = get_norm(norm_layer)(dim_out)
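        # NOTE: act_layer and inplace are ignored here; GELU is always applied,
        # including when callers pass act_layer='none'.
        self.act = nn.GELU()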
        self.drop_path = DropPath(drop_path_rate) if drop_path_rate else nn.Identity()
 
    def forward(self, x):
        shortcut = x
        x = self.conv(x)
        x = self.norm(x)
        x = self.act(x)
        if self.has_skip:
            x = self.drop_path(x) + shortcut
        return x

# ========== Multi-Scale Populations, for down-sampling and inductive bias ==========
class MSPatchEmb(nn.Module):
 
    def __init__(self, dim_in, emb_dim, kernel_size=2, c_group=-1, stride=1, dilations=[1, 2, 3],
                 norm_layer='bn_2d', act_layer='silu'):
        super().__init__()
        self.dilation_num = len(dilations)
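        # Resolve the default group count first so the check runs on the real value.
        c_group = math.gcd(dim_in, emb_dim) if c_group == -1 else c_group
        assert dim_in % c_group == 0, 'dim_in must be divisible by c_group'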
        self.convs = nn.ModuleList()
        for i in range(len(dilations)):
            padding = math.ceil(((kernel_size - 1) * dilations[i] + 1 - stride) / 2)
            self.convs.append(nn.Sequential(
                nn.Conv2d(dim_in, emb_dim, kernel_size, stride, padding, dilations[i], groups=c_group),
                get_norm(norm_layer)(emb_dim),
                get_act(act_layer)()))  # activations take no channel argument
 
    def forward(self, x):
        if self.dilation_num == 1:
            x = self.convs[0](x)
        else:
            x = torch.cat([self.convs[i](x).unsqueeze(dim=-1) for i in range(self.dilation_num)], dim=-1)
            x = reduce(x, 'b c h w n -> b c h w', 'mean').contiguous()
        return x

class iRMB(nn.Module):
 
    def __init__(self, dim_in, dim_out, norm_in=True, has_skip=True, exp_ratio=1.0, norm_layer='bn_2d',
                 act_layer='relu', v_proj=True, dw_ks=3, stride=1, dilation=1, se_ratio=0.0, dim_head=64, window_size=7,
                 attn_s=True, qkv_bias=False, attn_drop=0., drop=0., drop_path=0., v_group=False, attn_pre=False):
        super().__init__()
        self.norm = get_norm(norm_layer)(dim_in) if norm_in else nn.Identity()
        dim_mid = int(dim_in * exp_ratio)
        self.has_skip = (dim_in == dim_out and stride == 1) and has_skip
        self.attn_s = attn_s
        if self.attn_s:
            assert dim_in % dim_head == 0, 'dim should be divisible by num_heads'
            self.dim_head = dim_head
            self.window_size = window_size
            self.num_head = dim_in // dim_head
            self.scale = self.dim_head ** -0.5
            self.attn_pre = attn_pre
            self.qk = ConvNormAct(dim_in, int(dim_in * 2), kernel_size=1, bias=qkv_bias, norm_layer='none',
                                  act_layer='none')
            self.v = ConvNormAct(dim_in, dim_mid, kernel_size=1, groups=self.num_head if v_group else 1, bias=qkv_bias,
                                 norm_layer='none', act_layer=act_layer, inplace=inplace)
            self.attn_drop = nn.Dropout(attn_drop)
        else:
            if v_proj:
                self.v = ConvNormAct(dim_in, dim_mid, kernel_size=1, bias=qkv_bias, norm_layer='none',
                                     act_layer=act_layer, inplace=inplace)
            else:
                self.v = nn.Identity()
        self.conv_local = ConvNormAct(dim_mid, dim_mid, kernel_size=dw_ks, stride=stride, dilation=dilation,
                                      groups=dim_mid, norm_layer='bn_2d', act_layer='silu', inplace=inplace)
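        # NOTE: se_ratio is not consulted in this port; SELayerV2 is always applied to the expanded features.
        self.se = SELayerV2(dim_mid)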
 
        self.proj_drop = nn.Dropout(drop)
        self.proj = ConvNormAct(dim_mid, dim_out, kernel_size=1, norm_layer='none', act_layer='none', inplace=inplace)
        self.drop_path = DropPath(drop_path) if drop_path else nn.Identity()
 
    def forward(self, x):
        shortcut = x
        x = self.norm(x)
        B, C, H, W = x.shape
        if self.attn_s:
            # padding
            if self.window_size <= 0:
                window_size_W, window_size_H = W, H
            else:
                window_size_W, window_size_H = self.window_size, self.window_size
            pad_l, pad_t = 0, 0
            pad_r = (window_size_W - W % window_size_W) % window_size_W
            pad_b = (window_size_H - H % window_size_H) % window_size_H
            x = F.pad(x, (pad_l, pad_r, pad_t, pad_b, 0, 0,))
            n1, n2 = (H + pad_b) // window_size_H, (W + pad_r) // window_size_W
            x = rearrange(x, 'b c (h1 n1) (w1 n2) -> (b n1 n2) c h1 w1', n1=n1, n2=n2).contiguous()
            # attention
            b, c, h, w = x.shape
            qk = self.qk(x)
            qk = rearrange(qk, 'b (qk heads dim_head) h w -> qk b heads (h w) dim_head', qk=2, heads=self.num_head,
                           dim_head=self.dim_head).contiguous()
            q, k = qk[0], qk[1]
            attn_spa = (q @ k.transpose(-2, -1)) * self.scale
            attn_spa = attn_spa.softmax(dim=-1)
            attn_spa = self.attn_drop(attn_spa)
            if self.attn_pre:
                x = rearrange(x, 'b (heads dim_head) h w -> b heads (h w) dim_head', heads=self.num_head).contiguous()
                x_spa = attn_spa @ x
                x_spa = rearrange(x_spa, 'b heads (h w) dim_head -> b (heads dim_head) h w', heads=self.num_head, h=h,
                                  w=w).contiguous()
                x_spa = self.v(x_spa)
            else:
                v = self.v(x)
                v = rearrange(v, 'b (heads dim_head) h w -> b heads (h w) dim_head', heads=self.num_head).contiguous()
                x_spa = attn_spa @ v
                x_spa = rearrange(x_spa, 'b heads (h w) dim_head -> b (heads dim_head) h w', heads=self.num_head, h=h,
                                  w=w).contiguous()
            # unpadding
            x = rearrange(x_spa, '(b n1 n2) c h1 w1 -> b c (h1 n1) (w1 n2)', n1=n1, n2=n2).contiguous()
            if pad_r > 0 or pad_b > 0:
                x = x[:, :, :H, :W].contiguous()
        else:
            x = self.v(x)
 
        x = x + self.se(self.conv_local(x)) if self.has_skip else self.se(self.conv_local(x))
 
        x = self.proj_drop(x)
        x = self.proj(x)
 
        x = (shortcut + self.drop_path(x)) if self.has_skip else x
        return x

class EMO(nn.Module):
 
    def __init__(self, dim_in=3, num_classes=1000, img_size=224,
                 depths=[1, 2, 4, 2], stem_dim=16, embed_dims=[64, 128, 256, 512], exp_ratios=[4., 4., 4., 4.],
                 norm_layers=['bn_2d', 'bn_2d', 'bn_2d', 'bn_2d'], act_layers=['relu', 'relu', 'relu', 'relu'],
                 dw_kss=[3, 3, 5, 5], se_ratios=[0.0, 0.0, 0.0, 0.0], dim_heads=[32, 32, 32, 32],
                 window_sizes=[7, 7, 7, 7], attn_ss=[False, False, True, True], qkv_bias=True,
                 attn_drop=0., drop=0., drop_path=0., v_group=False, attn_pre=False, pre_dim=0):
        super().__init__()
        self.num_classes = num_classes
        assert num_classes > 0
        dprs = [x.item() for x in torch.linspace(0, drop_path, sum(depths))]
        self.stage0 = nn.ModuleList([
            MSPatchEmb(  # down to 112
                dim_in, stem_dim, kernel_size=dw_kss[0], c_group=1, stride=2, dilations=[1],
                norm_layer=norm_layers[0], act_layer='none'),
            iRMB(  # ds
                stem_dim, stem_dim, norm_in=False, has_skip=False, exp_ratio=1,
                norm_layer=norm_layers[0], act_layer=act_layers[0], v_proj=False, dw_ks=dw_kss[0],
                stride=1, dilation=1, se_ratio=1,
                dim_head=dim_heads[0], window_size=window_sizes[0], attn_s=False,
                qkv_bias=qkv_bias, attn_drop=attn_drop, drop=drop, drop_path=0.,
                attn_pre=attn_pre
            )
        ])
        emb_dim_pre = stem_dim
        for i in range(len(depths)):
            layers = []
            dpr = dprs[sum(depths[:i]):sum(depths[:i + 1])]
            for j in range(depths[i]):
                if j == 0:
                    stride, has_skip, attn_s, exp_ratio = 2, False, False, exp_ratios[i] * 2
                else:
                    stride, has_skip, attn_s, exp_ratio = 1, True, attn_ss[i], exp_ratios[i]
                layers.append(iRMB(
                    emb_dim_pre, embed_dims[i], norm_in=True, has_skip=has_skip, exp_ratio=exp_ratio,
                    norm_layer=norm_layers[i], act_layer=act_layers[i], v_proj=True, dw_ks=dw_kss[i],
                    stride=stride, dilation=1, se_ratio=se_ratios[i],
                    dim_head=dim_heads[i], window_size=window_sizes[i], attn_s=attn_s,
                    qkv_bias=qkv_bias, attn_drop=attn_drop, drop=drop, drop_path=dpr[j], v_group=v_group,
                    attn_pre=attn_pre
                ))
                emb_dim_pre = embed_dims[i]
            self.__setattr__(f'stage{i + 1}', nn.ModuleList(layers))
 
        self.norm = get_norm(norm_layers[-1])(embed_dims[-1])
        if pre_dim > 0:
            self.pre_head = nn.Sequential(nn.Linear(embed_dims[-1], pre_dim), get_act(act_layers[-1])(inplace=inplace))
            self.pre_dim = pre_dim
        else:
            self.pre_head = nn.Identity()
            self.pre_dim = embed_dims[-1]
        self.head = nn.Linear(self.pre_dim, num_classes)
        self.apply(self._init_weights)
        # Run one dummy forward pass to record the channel count of each returned feature map.
        self.width_list = [i.size(1) for i in self.forward(torch.randn(1, dim_in, 640, 640))]

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight, std=.02)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, (nn.LayerNorm, nn.GroupNorm,
                            nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d,
                            nn.InstanceNorm1d, nn.InstanceNorm2d, nn.InstanceNorm3d)):
            nn.init.zeros_(m.bias)
            nn.init.ones_(m.weight)
 
    @torch.jit.ignore
    def no_weight_decay(self):
        return {'token'}
 
    @torch.jit.ignore
    def no_weight_decay_keywords(self):
        return {'alpha', 'gamma', 'beta'}
 
    @torch.jit.ignore
    def no_ft_keywords(self):
        # return {'head.weight', 'head.bias'}
        return {}
 
    @torch.jit.ignore
    def ft_head_keywords(self):
        return {'head.weight', 'head.bias'}, self.num_classes
 
    def get_classifier(self):
        return self.head
 
    def reset_classifier(self, num_classes):
        self.num_classes = num_classes
        self.head = nn.Linear(self.pre_dim, num_classes) if num_classes > 0 else nn.Identity()
 
    def check_bn(self):
        for name, m in self.named_modules():
            if isinstance(m, nn.modules.batchnorm._NormBase):
                m.running_mean = torch.nan_to_num(m.running_mean, nan=0, posinf=1, neginf=-1)
                m.running_var = torch.nan_to_num(m.running_var, nan=0, posinf=1, neginf=-1)
 
    def forward(self, x):
        # Keep only the last feature map produced at each spatial resolution.
        unique_tensors = {}
        for stage in (self.stage0, self.stage1, self.stage2, self.stage3, self.stage4):
            for blk in stage:
                x = blk(x)
                height, width = x.shape[2], x.shape[3]
                unique_tensors[(height, width)] = x
        # Return the four deepest feature maps for the detection head.
        result_list = list(unique_tensors.values())[-4:]
        return result_list

def EMO_1M(pretrained=False, **kwargs):
    model = EMO(
        # dim_in=3, num_classes=1000, img_size=224,
        depths=[2, 2, 8, 3], stem_dim=24, embed_dims=[32, 48, 80, 168], exp_ratios=[2., 2.5, 3.0, 3.5],
        norm_layers=['bn_2d', 'bn_2d', 'ln_2d', 'ln_2d'], act_layers=['silu', 'silu', 'gelu', 'gelu'],
        dw_kss=[3, 3, 5, 5], dim_heads=[16, 16, 20, 21], window_sizes=[7, 7, 7, 7], attn_ss=[False, False, True, True],
        qkv_bias=True, attn_drop=0., drop=0., drop_path=0.04036, v_group=False, attn_pre=True, pre_dim=0,
        **kwargs)
    return model

def EMO_2M(pretrained=False, **kwargs):
    model = EMO(
        # dim_in=3, num_classes=1000, img_size=224,
        depths=[3, 3, 9, 3], stem_dim=24, embed_dims=[32, 48, 120, 200], exp_ratios=[2., 2.5, 3.0, 3.5],
        norm_layers=['bn_2d', 'bn_2d', 'ln_2d', 'ln_2d'], act_layers=['silu', 'silu', 'gelu', 'gelu'],
        dw_kss=[3, 3, 5, 5], dim_heads=[16, 16, 20, 20], window_sizes=[7, 7, 7, 7], attn_ss=[False, False, True, True],
        qkv_bias=True, attn_drop=0., drop=0., drop_path=0.05, v_group=False, attn_pre=True, pre_dim=0,
        **kwargs)
    return model

def EMO_5M(pretrained=False, **kwargs):
    model = EMO(
        # dim_in=3, num_classes=1000, img_size=224,
        depths=[3, 3, 9, 3], stem_dim=24, embed_dims=[48, 72, 160, 288], exp_ratios=[2., 3., 4., 4.],
        norm_layers=['bn_2d', 'bn_2d', 'ln_2d', 'ln_2d'], act_layers=['silu', 'silu', 'gelu', 'gelu'],
        dw_kss=[3, 3, 5, 5], dim_heads=[24, 24, 32, 32], window_sizes=[7, 7, 7, 7], attn_ss=[False, False, True, True],
        qkv_bias=True, attn_drop=0., drop=0., drop_path=0.05, v_group=False, attn_pre=True, pre_dim=0,
        **kwargs)
    return model

def EMO_6M(pretrained=False, **kwargs):
    model = EMO(
        # dim_in=3, num_classes=1000, img_size=224,
        depths=[3, 3, 9, 3], stem_dim=24, embed_dims=[48, 72, 160, 320], exp_ratios=[2., 3., 4., 5.],
        norm_layers=['bn_2d', 'bn_2d', 'ln_2d', 'ln_2d'], act_layers=['silu', 'silu', 'gelu', 'gelu'],
        dw_kss=[3, 3, 5, 5], dim_heads=[16, 24, 20, 32], window_sizes=[7, 7, 7, 7], attn_ss=[False, False, True, True],
        qkv_bias=True, attn_drop=0., drop=0., drop_path=0.05, v_group=False, attn_pre=True, pre_dim=0,
        **kwargs)
    return model

if __name__ == "__main__":

    # Generating Sample image
    image_size = (1, 3, 640, 640)
    image = torch.rand(*image_size)
 
    # Model
    model = EMO_6M()
 
    out = model(image)
    print(len(out))
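
The padding arithmetic at the start of the attention branch in `iRMB.forward` can be checked in isolation. The sketch below reproduces only that arithmetic in plain Python; the helper name `window_partition_plan` is hypothetical, and the 40 x 40 and 20 x 20 sizes assume a 640 x 640 input at strides 16 and 32, which are the two stages with `attn_ss` set to `True` in the configurations above.

```python
def window_partition_plan(height: int, width: int, window_size: int = 7):
    """Reproduce the padding arithmetic used before attention in iRMB.forward.

    Returns (pad_bottom, pad_right, n1, n2), where n1 * n2 is the number of
    token groups the padded map is split into.
    """
    # A non-positive window size means a single group spanning the whole map.
    win_h = height if window_size <= 0 else window_size
    win_w = width if window_size <= 0 else window_size
    pad_r = (win_w - width % win_w) % win_w
    pad_b = (win_h - height % win_h) % win_h
    n1 = (height + pad_b) // win_h
    n2 = (width + pad_r) // win_w
    return pad_b, pad_r, n1, n2


# Assumed feature-map sizes for a 640x640 input at strides 16 and 32.
for side in (40, 20):
    pad_b, pad_r, n1, n2 = window_partition_plan(side, side)
    groups = n1 * n2
    windowed = groups * (7 * 7) ** 2   # score entries per head, summed over groups
    full = (side * side) ** 2          # score entries per head, global attention
    print(side, (pad_b, pad_r), groups, windowed, full)
# prints:
# 40 (2, 2) 36 86436 2560000
# 20 (1, 1) 9 21609 160000
```

Each group holds 7 x 7 = 49 tokens, so the score matrix stays small however large the feature map becomes; only the number of groups grows.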

4. Modification Steps

4.1 Step 1

① Create a new AddModules folder under ultralytics/nn/ to hold the module code.

② Create EMO.py in the AddModules folder and paste the code from Section 3 into it.


4.2 Step 2

In the AddModules folder, create __init__.py (skip this if it already exists) and import the module in it: from .EMO import *


4.3 Step 3

In ultralytics/nn/tasks.py, the module class names need to be added in two places.

① First, import the module at the top of the file.


② Next, add the following two lines in the parse_model function (the exact position was shown in the original screenshot):


backbone = False
t=m


③ Then add the following branch further down in the same function:

elif m in {EMO_1M, EMO_2M, EMO_5M, EMO_6M, }:
    m = m(*args)
    c2 = m.width_list
    backbone = True


④ Then replace all of the code below it (marked with a red box in the original screenshot) with the following:

if isinstance(c2, list):
   backbone = True
   m_ = m
   m_.backbone = True
else:
   m_ = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args)  # module
   t = str(m)[8:-2].replace('__main__.', '')  # module type
m.np = sum(x.numel() for x in m_.parameters())  # number params
m_.i, m_.f, m_.type = i + 4 if backbone else i, f, t  # attach index, 'from' index, type
if verbose:
   LOGGER.info(f'{i:>3}{str(f):>20}{n_:>3}{m.np:10.0f}  {t:<45}{str(args):<30}')  # print
save.extend(x % (i + 4 if backbone else i) for x in ([f] if isinstance(f, int) else f) if
           x != -1)  # append to savelist
layers.append(m_)
if i == 0:
   ch = []
if isinstance(c2, list):
   ch.extend(c2)
   for _ in range(5 - len(ch)):
       ch.insert(0, 0)
else:
   ch.append(c2)
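
The channel bookkeeping at the end of the snippet above can be traced on its own. In this sketch the helper name `backbone_channel_list` is hypothetical, and the width list `[32, 48, 80, 168]` is an assumption taken from the `embed_dims` of the EMO_1M configuration; it is consistent with the input channels 48, 80, and 168 that appear in the run log later in this article.

```python
def backbone_channel_list(width_list, slots: int = 5):
    """Left-pad a backbone's output widths with zeros up to `slots` entries,
    following the ch.extend(c2) / ch.insert(0, 0) lines in the snippet."""
    ch = []
    ch.extend(width_list)
    for _ in range(slots - len(ch)):
        ch.insert(0, 0)
    return ch


# Assumed width_list: embed_dims of EMO_1M.
ch = backbone_channel_list([32, 48, 80, 168])
print(ch)                    # [0, 32, 48, 80, 168]
print(ch[2], ch[3], ch[-1])  # 48 80 168
```

This is why a `from` value of 2 or 3 in the YAML head picks up the 48-channel and 80-channel feature maps, while `-1` on the first head layer receives the 168-channel map.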


⑤ In the same file, find the _predict_once method of BaseModel and replace it with the following code.

def _predict_once(self, x, profile=False, visualize=False, embed=None):
    y, dt, embeddings = [], [], []  # outputs
    for m in self.model:
        if m.f != -1:  # if not from previous layer
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
        if profile:
            self._profile_one_layer(m, x, dt)
        if hasattr(m, 'backbone'):
            x = m(x)
            if len(x) != 5:  # 0 - 5
                x.insert(0, None)
            for index, i in enumerate(x):
                if index in self.save:
                    y.append(i)
                else:
                    y.append(None)
            x = x[-1]  # pass the last output on to the next layer
        else:
            x = m(x)  # run
            y.append(x if m.i in self.save else None)  # save output
        if visualize:
            feature_visualization(x, m.type, m.i, save_dir=visualize)
        if embed and m.i in embed:
            embeddings.append(nn.functional.adaptive_avg_pool2d(x, (1, 1)).squeeze(-1).squeeze(-1))  # flatten
            if m.i == max(embed):
                return torch.unbind(torch.cat(embeddings, 1), dim=0)
    return x
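
How the backbone branch of `_predict_once` fills the output cache can be mimicked with plain Python values. This is only an illustration: the function name `spread_backbone_outputs`, the feature labels, and the `save` set are made up, and strings stand in for tensors.

```python
def spread_backbone_outputs(features, save):
    """Mimic the backbone branch of _predict_once on plain Python values.

    `features` stands in for the list returned by the backbone and `save`
    for self.save; both are made-up inputs for illustration.
    """
    outputs = list(features)
    if len(outputs) != 5:
        outputs.insert(0, None)  # pad so the list fills slots 0..4
    # Keep only the slots that later layers refer to; blank out the rest.
    y = [feat if index in save else None for index, feat in enumerate(outputs)]
    return y, outputs[-1]        # the cache, and the input to the next layer


y, x = spread_backbone_outputs(["P2", "P3", "P4", "P5"], save={2, 3})
print(y)  # [None, None, 'P3', 'P4', None]
print(x)  # P5
```

The single backbone module therefore occupies five cache slots, which is why the head layers in the YAML file are numbered from 5.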


That completes the modifications; the model can now be configured and trained.


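5. YAML Model File

5.1 Model Modification ⭐

Once the code is in place, configure the model's YAML file.

Taking ultralytics/cfg/models/rt-detr/rtdetr-l.yaml as the example, create a model file named rtdetr-l-EMO.yaml in the same directory for training on your own dataset.

Copy the contents of rtdetr-l.yaml into rtdetr-l-EMO.yaml and set nc to the number of object classes in your dataset.

📌 The modification replaces the backbone network with EMO.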

# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-l object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/models/rtdetr

# Parameters
nc: 1  # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
  # [depth, width, max_channels]
  l: [1.00, 1.00, 1024]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, EMO_1M, []]  # 4

head:
  - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]]  # 5 input_proj.2
  - [-1, 1, AIFI, [1024, 8]] # 6
  - [-1, 1, Conv, [256, 1, 1]]  # 7, Y5, lateral_convs.0

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 8
  - [3, 1, Conv, [256, 1, 1, None, 1, 1, False]]  # 9 input_proj.1
  - [[-2, -1], 1, Concat, [1]] # 10
  - [-1, 3, RepC3, [256]]  # 11, fpn_blocks.0
  - [-1, 1, Conv, [256, 1, 1]]   # 12, Y4, lateral_convs.1

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 13
  - [2, 1, Conv, [256, 1, 1, None, 1, 1, False]]  # 14 input_proj.0
  - [[-2, -1], 1, Concat, [1]]  # 15 cat backbone P4
  - [-1, 3, RepC3, [256]]    # X3 (16), fpn_blocks.1

  - [-1, 1, Conv, [256, 3, 2]]   # 17, downsample_convs.0
  - [[-1, 12], 1, Concat, [1]]  # 18 cat Y4
  - [-1, 3, RepC3, [256]]    # F4 (19), pan_blocks.0

  - [-1, 1, Conv, [256, 3, 2]]   # 20, downsample_convs.1
  - [[-1, 7], 1, Concat, [1]]  # 21 cat Y5
  - [-1, 3, RepC3, [256]]    # F5 (22), pan_blocks.1

  - [[16, 19, 22], 1, RTDETRDecoder, [nc]]  # Detect(P3, P4, P5)
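
The `from` column of the head mixes relative (negative) and absolute indices. The sketch below resolves them to absolute layer indices, assuming the backbone fills indices 0-4 so that the first head entry is layer 5, as the comments in the YAML suggest. `HEAD_FROM` is transcribed from the file above, and `resolve_sources` is a hypothetical helper, not part of Ultralytics.

```python
BACKBONE_SLOTS = 5  # assumption: the backbone fills output indices 0-4

# 'from' column of the head entries above, in file order.
HEAD_FROM = [-1, -1, -1, -1, 3, [-2, -1], -1, -1, -1, 2, [-2, -1], -1,
             -1, [-1, 12], -1, -1, [-1, 7], -1, [16, 19, 22]]


def resolve_sources(layer_index: int, source):
    """Turn relative (negative) 'from' values into absolute layer indices."""
    sources = [source] if isinstance(source, int) else source
    return [layer_index + s if s < 0 else s for s in sources]


resolved = {
    BACKBONE_SLOTS + offset: resolve_sources(BACKBONE_SLOTS + offset, source)
    for offset, source in enumerate(HEAD_FROM)
}
print(resolved[9], resolved[10])   # [3] [8, 9]
print(resolved[18], resolved[23])  # [17, 12] [16, 19, 22]
```

Layer 9 reads backbone slot 3 directly, layer 10 concatenates layers 8 and 9, and the decoder at layer 23 consumes layers 16, 19, and 22.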


6. Successful Run Output

Printing the network shows that EMO has been added to the model and that it is ready for training.

rtdetr-l-EMO

rtdetr-l-EMO summary: 1,023 layers, 22,863,235 parameters, 22,863,235 gradients, 64.6 GFLOPs

                  from  n    params  module                                       arguments                     
  0                  -1  1   4450208  EMO_1M                                       []                            
  1                  -1  1     43520  ultralytics.nn.modules.conv.Conv             [168, 256, 1, 1, None, 1, 1, False]
  2                  -1  1    789760  ultralytics.nn.modules.transformer.AIFI      [256, 1024, 8]                
  3                  -1  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]              
  4                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
  5                   3  1     20992  ultralytics.nn.modules.conv.Conv             [80, 256, 1, 1, None, 1, 1, False]
  6            [-2, -1]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
  7                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]                 
  8                  -1  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]              
  9                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 10                   2  1     12800  ultralytics.nn.modules.conv.Conv             [48, 256, 1, 1, None, 1, 1, False]
 11            [-2, -1]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 12                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]                 
 13                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 14            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 15                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]                 
 16                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 17             [-1, 7]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 18                  -1  3   2232320  ultralytics.nn.modules.block.RepC3           [512, 256, 3]                 
 19        [16, 19, 22]  1   7303907  ultralytics.nn.modules.head.RTDETRDecoder    [1, [256, 256, 256]]          
rtdetr-l-EMO summary: 1,023 layers, 22,863,235 parameters, 22,863,235 gradients, 64.6 GFLOPs
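
As a sanity check on the log, the per-layer counts can be summed and compared with the total in the summary line. The sketch also recomputes the Conv rows under the assumption that the Ultralytics Conv module is a bias-free convolution followed by BatchNorm; that assumption reproduces the logged values. `conv_bn_params` is a hypothetical helper name.

```python
def conv_bn_params(c_in: int, c_out: int, kernel: int) -> int:
    """Parameters of a bias-free kernel x kernel convolution plus BatchNorm scale and shift."""
    return c_in * c_out * kernel * kernel + 2 * c_out


# 'params' column of the log above, in row order.
layer_params = [4450208, 43520, 789760, 66048, 0, 20992, 0, 2232320, 66048, 0,
                12800, 0, 2232320, 590336, 0, 2232320, 590336, 0, 2232320, 7303907]

print(sum(layer_params))            # 22863235
print(conv_bn_params(168, 256, 1))  # 43520
print(conv_bn_params(80, 256, 1))   # 20992
print(conv_bn_params(256, 256, 3))  # 590336
```

The sum matches the 22,863,235 parameters in the summary, of which 4,450,208 belong to the EMO_1M backbone.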