[YOLOv10 Multimodal Fusion Improvement] | Improving CGA Fusion with the iRMB Inverted Residual Mobile Block
一、Introduction
This post documents how to improve the multimodal fusion part of YOLOv10 with the iRMB module, and more generally how existing modules can be used for a second round of improvement on the fusion stage. The value of the iRMB (Inverted Residual Mobile Block) is that it overcomes a common limitation: ordinary blocks cannot simultaneously absorb the efficiency of CNNs for modeling local features and exploit the Transformer's dynamic modeling capacity to learn long-range interactions between modalities. Compared with methods that rely on complex structures or stacks of mixed modules, it strikes a better balance between model cost and accuracy. Here it is plugged into the CGA Fusion module as a secondary innovation, highlighting the important features of each modality more effectively and improving model performance.
二、The iRMB Attention Module
Rethinking Mobile Block for Efficient Attention-based Models
2.1 Design Motivation
- Unify the strengths of CNNs and Transformers: starting from the efficient Inverted Residual Block (IRB) and the effective components of the Transformer, the goal is to integrate the advantages of both at the level of infrastructure design and build an IRB-like lightweight base structure for attention models.
- Fix problems of existing models: current methods introduce complex structures or many mixed modules, which hurts deployment optimization. By rethinking the IRB and the Transformer components, a simple yet effective block can be constructed.
2.2 Principle
- Based on the Meta Mobile Block (MMB): the MMB is obtained by rethinking and abstracting the IRB from MobileNetv2 together with the core MHSA and FFN modules from the Transformer. With a parameterized expansion ratio λ and an efficient operator F, it can be instantiated as different modules (IRB, MHSA, FFN), revealing the shared underlying formulation of all of them.
- Follows general criteria for efficient models: the design obeys usability (simple implementation, no exotic operators, easy to optimize for deployment), uniformity (few core modules, lower model complexity, faster deployment), effectiveness (good performance on classification and dense prediction), and efficiency (few parameters and little computation, traded off against accuracy).
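The MMB abstraction can be sketched in a few lines of PyTorch. This is my own minimal illustration, not the paper's code: `MetaMobileBlock`, `lam`, and `op` are made-up names. Instantiating `op` as a convolution/activation yields an IRB-like block, while an attention operator yields an MHSA-like one.

```python
import torch
import torch.nn as nn

class MetaMobileBlock(nn.Module):
    """Expand channels by ratio lam -> apply operator F -> shrink back -> residual."""
    def __init__(self, dim, lam=2.0, op=None):
        super().__init__()
        dim_mid = int(dim * lam)
        self.expand = nn.Conv2d(dim, dim_mid, 1)           # MLP_e with ratio lam
        self.op = op if op is not None else nn.Identity()  # efficient operator F
        self.shrink = nn.Conv2d(dim_mid, dim, 1)           # MLP_s, inverted ratio
    def forward(self, x):
        return x + self.shrink(self.op(self.expand(x)))    # residual connection

x = torch.randn(1, 16, 8, 8)
block = MetaMobileBlock(16, lam=2.0, op=nn.ReLU())  # IRB-like instantiation
y = block(x)  # same shape as x
```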
2.3 Structure
2.3.1 Components
At the micro level, the iRMB consists of a Depth-Wise Convolution (DW-Conv) and an improved Expanded Window MHSA (EW-MHSA).
2.3.2 Operation Flow
- First, following the MMB, an expansion MLP ($MLP_e$) with an output/input ratio equal to λ expands the channel dimension: $X_e = MLP_e(X) \in \mathbb{R}^{\lambda C \times H \times W}$.
- Then, the intermediate operator $F$ further enhances the image features. Here $F$ is modeled as cascaded MHSA and convolution operations, $F(\cdot) = \mathrm{Conv}(\mathrm{MHSA}(\cdot))$, implemented concretely as the combination of DW-Conv and EW-MHSA. EW-MHSA computes the attention matrix from $Q = K = X \in \mathbb{R}^{C \times H \times W}$, while the expanded features $X_e$ are used as $V \in \mathbb{R}^{\lambda C \times H \times W}$.
- Finally, a shrink MLP ($MLP_s$) with the inverted input/output ratio equal to λ shrinks the channel dimension back: $X_s = MLP_s(X_f) \in \mathbb{R}^{C \times H \times W}$ (where $X_f$ is the output of $F$), and a residual connection gives the final output $Y = X + X_s \in \mathbb{R}^{C \times H \times W}$.
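The role of the expansion ratio in EW-MHSA can be checked with a small shape experiment (a sketch of my own, single head and no windowing; `X_e` simply stands in for $MLP_e(X)$): Q and K keep the original C channels, so the cost of the $HW \times HW$ attention map is independent of λ, while V carries the expanded λC channels.

```python
import torch

B, C, H, W, lam = 1, 8, 4, 4, 2
X = torch.randn(B, C, H, W)
X_e = torch.randn(B, lam * C, H, W)             # stand-in for MLP_e(X)

q = k = X.flatten(2).transpose(1, 2)            # (B, HW, C): unexpanded channels
v = X_e.flatten(2).transpose(1, 2)              # (B, HW, lam*C): expanded channels
attn = (q @ k.transpose(-2, -1)) * C ** -0.5    # (B, HW, HW), cost independent of lam
attn = attn.softmax(dim=-1)
out = (attn @ v).transpose(1, 2).reshape(B, lam * C, H, W)  # expanded output
```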
2.4 Advantages
- Combines the strengths of CNNs and Transformers: it absorbs the efficiency of CNNs for modeling local features while using the Transformer's dynamic modeling capacity to learn long-range interactions.
- Lower model cost: by adopting an efficient Window-MHSA (W-MHSA) and Depth-Wise Convolution (DW-Conv) with skip connections, it balances model cost against accuracy. The design is also flexible: different depths can use different settings, meeting performance needs while keeping the structure simple.
- Strong performance: in ImageNet-1K image classification, replacing a standard Transformer structure with the iRMB improves performance with fewer parameters and less computation under the same training settings. On downstream tasks such as object detection and semantic segmentation, the EMO models built from iRMBs achieve highly competitive results on multiple benchmarks, surpassing contemporary SoTA methods.
Paper: https://arxiv.org/pdf/2301.01146.pdf
Code: https://github.com/zhangzjn/EMO
三、Implementation of iRMBFusion
The implementation of iRMBFusion is as follows:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from functools import partial
from einops import rearrange
from timm.models._efficientnet_blocks import SqueezeExcite
from timm.models.layers import DropPath
inplace = True
class LayerNorm2d(nn.Module):
def __init__(self, normalized_shape, eps=1e-6, elementwise_affine=True):
super().__init__()
self.norm = nn.LayerNorm(normalized_shape, eps, elementwise_affine)
def forward(self, x):
x = rearrange(x, 'b c h w -> b h w c').contiguous()
x = self.norm(x)
x = rearrange(x, 'b h w c -> b c h w').contiguous()
return x
def get_norm(norm_layer='in_1d'):
eps = 1e-6
norm_dict = {
'none': nn.Identity,
'in_1d': partial(nn.InstanceNorm1d, eps=eps),
'in_2d': partial(nn.InstanceNorm2d, eps=eps),
'in_3d': partial(nn.InstanceNorm3d, eps=eps),
'bn_1d': partial(nn.BatchNorm1d, eps=eps),
'bn_2d': partial(nn.BatchNorm2d, eps=eps),
# 'bn_2d': partial(nn.SyncBatchNorm, eps=eps),
'bn_3d': partial(nn.BatchNorm3d, eps=eps),
'gn': partial(nn.GroupNorm, eps=eps),
'ln_1d': partial(nn.LayerNorm, eps=eps),
'ln_2d': partial(LayerNorm2d, eps=eps),
}
return norm_dict[norm_layer]
def get_act(act_layer='relu'):
act_dict = {
'none': nn.Identity,
'relu': nn.ReLU,
'relu6': nn.ReLU6,
'silu': nn.SiLU
}
return act_dict[act_layer]
class ConvNormAct(nn.Module):
def __init__(self, dim_in, dim_out, kernel_size, stride=1, dilation=1, groups=1, bias=False,
skip=False, norm_layer='bn_2d', act_layer='relu', inplace=True, drop_path_rate=0.):
super(ConvNormAct, self).__init__()
self.has_skip = skip and dim_in == dim_out
padding = math.ceil((kernel_size - stride) / 2)
self.conv = nn.Conv2d(dim_in, dim_out, kernel_size, stride, padding, dilation, groups, bias)
self.norm = get_norm(norm_layer)(dim_out)
self.act = get_act(act_layer)(inplace=inplace)
self.drop_path = DropPath(drop_path_rate) if drop_path_rate else nn.Identity()
def forward(self, x):
shortcut = x
x = self.conv(x)
x = self.norm(x)
x = self.act(x)
if self.has_skip:
x = self.drop_path(x) + shortcut
return x
class iRMB(nn.Module):
def __init__(self, dim_in, norm_in=True, has_skip=True, exp_ratio=1.0, norm_layer='bn_2d',
act_layer='relu', v_proj=True, dw_ks=3, stride=1, dilation=1, se_ratio=0.0, dim_head=8, window_size=7,
attn_s=True, qkv_bias=False, attn_drop=0., drop=0., drop_path=0., v_group=False, attn_pre=False):
super().__init__()
dim_out = dim_in
self.norm = get_norm(norm_layer)(dim_in) if norm_in else nn.Identity()
dim_mid = int(dim_in * exp_ratio)
self.has_skip = (dim_in == dim_out and stride == 1) and has_skip
self.attn_s = attn_s
if self.attn_s:
assert dim_in % dim_head == 0, 'dim should be divisible by num_heads'
self.dim_head = dim_head
self.window_size = window_size
self.num_head = dim_in // dim_head
self.scale = self.dim_head ** -0.5
self.attn_pre = attn_pre
self.qk = ConvNormAct(dim_in, int(dim_in * 2), kernel_size=1, bias=qkv_bias, norm_layer='none',
act_layer='none')
self.v = ConvNormAct(dim_in, dim_mid, kernel_size=1, groups=self.num_head if v_group else 1, bias=qkv_bias,
norm_layer='none', act_layer=act_layer, inplace=inplace)
self.attn_drop = nn.Dropout(attn_drop)
else:
if v_proj:
self.v = ConvNormAct(dim_in, dim_mid, kernel_size=1, bias=qkv_bias, norm_layer='none',
act_layer=act_layer, inplace=inplace)
else:
self.v = nn.Identity()
self.conv_local = ConvNormAct(dim_mid, dim_mid, kernel_size=dw_ks, stride=stride, dilation=dilation,
groups=dim_mid, norm_layer='bn_2d', act_layer='silu', inplace=inplace)
self.se = SqueezeExcite(dim_mid, rd_ratio=se_ratio, act_layer=get_act(act_layer)) if se_ratio > 0.0 else nn.Identity()
self.proj_drop = nn.Dropout(drop)
self.proj = ConvNormAct(dim_mid, dim_out, kernel_size=1, norm_layer='none', act_layer='none', inplace=inplace)
self.drop_path = DropPath(drop_path) if drop_path else nn.Identity()
def forward(self, x):
shortcut = x
x = self.norm(x)
B, C, H, W = x.shape
if self.attn_s:
# padding
if self.window_size <= 0:
window_size_W, window_size_H = W, H
else:
window_size_W, window_size_H = self.window_size, self.window_size
pad_l, pad_t = 0, 0
pad_r = (window_size_W - W % window_size_W) % window_size_W
pad_b = (window_size_H - H % window_size_H) % window_size_H
x = F.pad(x, (pad_l, pad_r, pad_t, pad_b, 0, 0,))
n1, n2 = (H + pad_b) // window_size_H, (W + pad_r) // window_size_W
x = rearrange(x, 'b c (h1 n1) (w1 n2) -> (b n1 n2) c h1 w1', n1=n1, n2=n2).contiguous()
# attention
b, c, h, w = x.shape
qk = self.qk(x)
qk = rearrange(qk, 'b (qk heads dim_head) h w -> qk b heads (h w) dim_head', qk=2, heads=self.num_head,
dim_head=self.dim_head).contiguous()
q, k = qk[0], qk[1]
attn_spa = (q @ k.transpose(-2, -1)) * self.scale
attn_spa = attn_spa.softmax(dim=-1)
attn_spa = self.attn_drop(attn_spa)
if self.attn_pre:
x = rearrange(x, 'b (heads dim_head) h w -> b heads (h w) dim_head', heads=self.num_head).contiguous()
x_spa = attn_spa @ x
x_spa = rearrange(x_spa, 'b heads (h w) dim_head -> b (heads dim_head) h w', heads=self.num_head, h=h,
w=w).contiguous()
x_spa = self.v(x_spa)
else:
v = self.v(x)
v = rearrange(v, 'b (heads dim_head) h w -> b heads (h w) dim_head', heads=self.num_head).contiguous()
x_spa = attn_spa @ v
x_spa = rearrange(x_spa, 'b heads (h w) dim_head -> b (heads dim_head) h w', heads=self.num_head, h=h,
w=w).contiguous()
# unpadding
x = rearrange(x_spa, '(b n1 n2) c h1 w1 -> b c (h1 n1) (w1 n2)', n1=n1, n2=n2).contiguous()
if pad_r > 0 or pad_b > 0:
x = x[:, :, :H, :W].contiguous()
else:
x = self.v(x)
x = x + self.se(self.conv_local(x)) if self.has_skip else self.se(self.conv_local(x))
x = self.proj_drop(x)
x = self.proj(x)
x = (shortcut + self.drop_path(x)) if self.has_skip else x
return x
class PixelAttention_CGA(nn.Module):
def __init__(self, dim):
super(PixelAttention_CGA, self).__init__()
self.pa2 = nn.Conv2d(2 * dim, dim, 7, padding=3, padding_mode='reflect' ,groups=dim, bias=True)
self.sigmoid = nn.Sigmoid()
def forward(self, x, pattn1):
B, C, H, W = x.shape
x = x.unsqueeze(dim=2) # B, C, 1, H, W
pattn1 = pattn1.unsqueeze(dim=2) # B, C, 1, H, W
x2 = torch.cat([x, pattn1], dim=2) # B, C, 2, H, W
x2 = rearrange(x2, 'b c t h w -> b (c t) h w')
pattn2 = self.pa2(x2)
pattn2 = self.sigmoid(pattn2)
return pattn2
class iRMBFusion(nn.Module):
def __init__(self, dim):
super(iRMBFusion, self).__init__()
self.cfam = iRMB(dim)
self.pa = PixelAttention_CGA(dim)
self.conv = nn.Conv2d(dim, dim, 1, bias=True)
self.sigmoid = nn.Sigmoid()
def forward(self, data):
x, y = data
initial = x + y
pattn1 = self.cfam(initial)
pattn2 = self.sigmoid(self.pa(initial, pattn1))
result = initial + pattn2 * x + (1 - pattn2) * y
result = self.conv(result)
return result
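One detail of `iRMB.forward` worth isolating is the window padding: before the feature map is rearranged into windows, it is padded on the right and bottom so that H and W become multiples of the window size. A standalone sketch of that arithmetic (the function name is mine):

```python
def window_pad(size: int, window: int) -> int:
    """Right/bottom padding that makes `size` a multiple of `window`."""
    return (window - size % window) % window

# Example: a 20x21 feature map with window size 7 needs 1 row of bottom
# padding and no right padding, giving a 3x3 grid of windows.
H, W, ws = 20, 21, 7
pad_b, pad_r = window_pad(H, ws), window_pad(W, ws)
n1, n2 = (H + pad_b) // ws, (W + pad_r) // ws  # windows per axis
```

After attention, the padded rows/columns are simply cropped away (`x[:, :, :H, :W]` in the code above), so the block works for any input resolution.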
四、Integration Steps
4.1 Step 1
① Create an AddModules folder under the ultralytics/nn/ directory to hold the module code.
② Create iRMBFusion.py inside the AddModules folder and paste in the code from Section 三.
4.2 Step 2
Create __init__.py in the AddModules folder (skip this if it already exists) and import the module inside it:
from .iRMBFusion import *
4.3 Step 3
In the ultralytics/nn/modules/tasks.py file, the module class name must be added in two places.
First, import the module.
Then, register the iRMBFusion module in the parse_model function:
elif m in {iRMBFusion}:
c2 = ch[f[0]]
args = [c2]
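As a hedged illustration of what this branch does (the helper below is my own; only the `c2 = ch[f[0]]` logic comes from the snippet): both inputs to iRMBFusion carry the same channel count, so the output channel count is read from the first `from` index and passed as the module's single constructor argument.

```python
def fusion_args(ch, f):
    """Mimics the parse_model branch: c2 = ch[f[0]]; args = [c2]."""
    c2 = ch[f[0]]       # output channels = channels of the first input layer
    return c2, [c2]     # args becomes the single `dim` argument of iRMBFusion

# e.g. fusing layers 7 and 16, which both carry 64 channels at scale 'n'
ch = {7: 64, 16: 64}
c2, args = fusion_args(ch, [7, 16])
```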
Finally, in the get_flops function in ultralytics/utils/torch_utils.py, set the stride to 640.
五、YAML Model Files
5.1 Mid Fusion ⭐
📌 This model replaces the Concat fusion of the original mid-fusion design with iRMBFusion, fusing the multimodal information from the backbone.
# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLOv10 object detection model. For Usage examples see https://docs.ultralytics.com/tasks/detect
# Parameters
ch: 6
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov10n.yaml' will call yolov10.yaml with scale 'n'
# [depth, width, max_channels]
n: [0.33, 0.25, 1024]
backbone:
# [from, repeats, module, args]
- [-1, 1, IN, []] # 0
- [-1, 1, Multiin, [1]] # 1
- [-2, 1, Multiin, [2]] # 2
- [1, 1, Conv, [64, 3, 2]] # 3-P1/2
- [-1, 1, Conv, [128, 3, 2]] # 4-P2/4
- [-1, 3, C2f, [128, True]]
- [-1, 1, Conv, [256, 3, 2]] # 6-P3/8
- [-1, 6, C2f, [256, True]]
- [-1, 1, SCDown, [512, 3, 2]] # 8-P4/16
- [-1, 6, C2f, [512, True]]
- [-1, 1, SCDown, [1024, 3, 2]] # 10-P5/32
- [-1, 3, C2f, [1024, True]]
- [2, 1, Conv, [64, 3, 2]] # 12-P1/2
- [-1, 1, Conv, [128, 3, 2]] # 13-P2/4
- [-1, 3, C2f, [128, True]]
- [-1, 1, Conv, [256, 3, 2]] # 15-P3/8
- [-1, 6, C2f, [256, True]]
- [-1, 1, SCDown, [512, 3, 2]] # 17-P4/16
- [-1, 6, C2f, [512, True]]
- [-1, 1, SCDown, [1024, 3, 2]] # 19-P5/32
- [-1, 3, C2f, [1024, True]]
- [[7, 16], 1, iRMBFusion, []] # 21 cat backbone P3
- [[9, 18], 1, iRMBFusion, []] # 22 cat backbone P4
- [[11, 20], 1, iRMBFusion, []] # 23 cat backbone P5
- [-1, 1, SPPF, [1024, 5]] # 24
- [-1, 1, PSA, [1024]] # 25
# YOLOv10.0n head
head:
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 22], 1, Concat, [1]] # cat backbone P4
- [-1, 3, C2f, [512]] # 28
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 21], 1, Concat, [1]] # cat backbone P3
- [-1, 3, C2f, [256]] # 31 (P3/8-small)
- [-1, 1, Conv, [256, 3, 2]]
- [[-1, 28], 1, Concat, [1]] # cat head P4
- [-1, 3, C2f, [512]] # 34 (P4/16-medium)
- [-1, 1, SCDown, [512, 3, 2]]
- [[-1, 25], 1, Concat, [1]] # cat head P5
- [-1, 3, C2fCIB, [1024, True, True]] # 37 (P5/32-large)
- [[31, 34, 37], 1, v10Detect, [nc]] # Detect(P3, P4, P5)
5.2 Mid-to-Late Fusion ⭐
📌 This model replaces the Concat fusion of the original mid-to-late fusion design with iRMBFusion, fusing the multimodal information in the FPN.
# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLOv10 object detection model. For Usage examples see https://docs.ultralytics.com/tasks/detect
# Parameters
ch: 6
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov10n.yaml' will call yolov10.yaml with scale 'n'
# [depth, width, max_channels]
n: [0.33, 0.25, 1024]
backbone:
# [from, repeats, module, args]
- [-1, 1, IN, []] # 0
- [-1, 1, Multiin, [1]] # 1
- [-2, 1, Multiin, [2]] # 2
- [1, 1, Conv, [64, 3, 2]] # 3-P1/2
- [-1, 1, Conv, [128, 3, 2]] # 4-P2/4
- [-1, 3, C2f, [128, True]]
- [-1, 1, Conv, [256, 3, 2]] # 6-P3/8
- [-1, 6, C2f, [256, True]]
- [-1, 1, SCDown, [512, 3, 2]] # 8-P4/16
- [-1, 6, C2f, [512, True]]
- [-1, 1, SCDown, [1024, 3, 2]] # 10-P5/32
- [-1, 3, C2f, [1024, True]]
- [-1, 1, SPPF, [1024, 5]] # 12
- [-1, 1, PSA, [1024]] # 13
- [2, 1, Conv, [64, 3, 2]] # 14-P1/2
- [-1, 1, Conv, [128, 3, 2]] # 15-P2/4
- [-1, 3, C2f, [128, True]]
- [-1, 1, Conv, [256, 3, 2]] # 17-P3/8
- [-1, 6, C2f, [256, True]]
- [-1, 1, SCDown, [512, 3, 2]] # 19-P4/16
- [-1, 6, C2f, [512, True]]
- [-1, 1, SCDown, [1024, 3, 2]] # 21-P5/32
- [-1, 3, C2f, [1024, True]]
- [-1, 1, SPPF, [1024, 5]] # 23
- [-1, 1, PSA, [1024]] # 24
# YOLOv10.0n head
head:
- [13, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 9], 1, Concat, [1]] # cat backbone P4
- [-1, 3, C2f, [512]] # 27
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 7], 1, Concat, [1]] # cat backbone P3
- [-1, 3, C2f, [256]] # 30 (P3/8-small)
- [24, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 20], 1, Concat, [1]] # cat backbone P4
- [-1, 3, C2f, [512]] # 33
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 18], 1, Concat, [1]] # cat backbone P3
- [-1, 3, C2f, [256]] # 36 (P3/8-small)
- [[13, 24], 1, iRMBFusion, []] # 37 cat backbone P3
- [[27, 33], 1, iRMBFusion, []] # 38 cat backbone P4
- [[30, 36], 1, iRMBFusion, []] # 39 cat backbone P5
- [-1, 1, Conv, [256, 3, 2]]
- [[-1, 38], 1, Concat, [1]] # cat head P4
- [-1, 3, C2f, [512]] # 42 (P4/16-medium)
- [-1, 1, SCDown, [512, 3, 2]]
- [[-1, 37], 1, Concat, [1]] # cat head P5
- [-1, 3, C2fCIB, [1024, True, True]] # 45 (P5/32-large)
- [[39, 42, 45], 1, v10Detect, [nc]] # Detect(P3, P4, P5)
5.3 Late Fusion ⭐
📌 This model replaces the Concat fusion of the original late-fusion design with iRMBFusion, fusing the multimodal information in the neck.
# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLOv10 object detection model. For Usage examples see https://docs.ultralytics.com/tasks/detect
# Parameters
ch: 6
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov10n.yaml' will call yolov10.yaml with scale 'n'
# [depth, width, max_channels]
n: [0.33, 0.25, 1024]
backbone:
# [from, repeats, module, args]
- [-1, 1, IN, []] # 0
- [-1, 1, Multiin, [1]] # 1
- [-2, 1, Multiin, [2]] # 2
- [1, 1, Conv, [64, 3, 2]] # 3-P1/2
- [-1, 1, Conv, [128, 3, 2]] # 4-P2/4
- [-1, 3, C2f, [128, True]]
- [-1, 1, Conv, [256, 3, 2]] # 6-P3/8
- [-1, 6, C2f, [256, True]]
- [-1, 1, SCDown, [512, 3, 2]] # 8-P4/16
- [-1, 6, C2f, [512, True]]
- [-1, 1, SCDown, [1024, 3, 2]] # 10-P5/32
- [-1, 3, C2f, [1024, True]]
- [-1, 1, SPPF, [1024, 5]] # 12
- [-1, 1, PSA, [1024]] # 13
- [2, 1, Conv, [64, 3, 2]] # 14-P1/2
- [-1, 1, Conv, [128, 3, 2]] # 15-P2/4
- [-1, 3, C2f, [128, True]]
- [-1, 1, Conv, [256, 3, 2]] # 17-P3/8
- [-1, 6, C2f, [256, True]]
- [-1, 1, SCDown, [512, 3, 2]] # 19-P4/16
- [-1, 6, C2f, [512, True]]
- [-1, 1, SCDown, [1024, 3, 2]] # 21-P5/32
- [-1, 3, C2f, [1024, True]]
- [-1, 1, SPPF, [1024, 5]] # 23
- [-1, 1, PSA, [1024]] # 24
# YOLOv10.0n head
head:
- [13, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 9], 1, Concat, [1]] # cat backbone P4
- [-1, 3, C2f, [512]] # 27
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 7], 1, Concat, [1]] # cat backbone P3
- [-1, 3, C2f, [256]] # 30 (P3/8-small)
- [-1, 1, Conv, [256, 3, 2]]
- [[-1, 27], 1, Concat, [1]] # cat head P4
- [-1, 3, C2f, [512]] # 33 (P4/16-medium)
- [-1, 1, SCDown, [512, 3, 2]]
- [[-1, 13], 1, Concat, [1]] # cat head P5
- [-1, 3, C2fCIB, [1024, True, True]] # 36 (P5/32-large)
- [24, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 20], 1, Concat, [1]] # cat backbone P4
- [-1, 3, C2f, [512]] # 39
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [[-1, 18], 1, Concat, [1]] # cat backbone P3
- [-1, 3, C2f, [256]] # 42 (P3/8-small)
- [-1, 1, Conv, [256, 3, 2]]
- [[-1, 39], 1, Concat, [1]] # cat head P4
- [-1, 3, C2f, [512]] # 45 (P4/16-medium)
- [-1, 1, SCDown, [512, 3, 2]]
- [[-1, 24], 1, Concat, [1]] # cat head P5
- [-1, 3, C2fCIB, [1024, True, True]] # 48 (P5/32-large)
- [[30, 42], 1, iRMBFusion, []] # 49 cat backbone P3
- [[33, 45], 1, iRMBFusion, []] # 50 cat backbone P4
- [[36, 48], 1, iRMBFusion, []] # 51 cat backbone P5
- [[49, 50, 51], 1, v10Detect, [nc]] # Detect(P3, P4, P5)
六、Successful Run
Printing the network shows that the new fusion layers have been added to the model, which is now ready for training.
YOLOv10n-mid-iRMBFusion:
from n params module arguments
0 -1 1 0 ultralytics.nn.AddModules.multimodal.IN []
1 -1 1 0 ultralytics.nn.AddModules.multimodal.Multiin [1]
2 -2 1 0 ultralytics.nn.AddModules.multimodal.Multiin [2]
3 1 1 464 ultralytics.nn.modules.conv.Conv [3, 16, 3, 2]
4 -1 1 4672 ultralytics.nn.modules.conv.Conv [16, 32, 3, 2]
5 -1 1 7360 ultralytics.nn.modules.block.C2f [32, 32, 1, True]
6 -1 1 18560 ultralytics.nn.modules.conv.Conv [32, 64, 3, 2]
7 -1 2 49664 ultralytics.nn.modules.block.C2f [64, 64, 2, True]
8 -1 1 9856 ultralytics.nn.modules.block.SCDown [64, 128, 3, 2]
9 -1 2 197632 ultralytics.nn.modules.block.C2f [128, 128, 2, True]
10 -1 1 36096 ultralytics.nn.modules.block.SCDown [128, 256, 3, 2]
11 -1 1 460288 ultralytics.nn.modules.block.C2f [256, 256, 1, True]
12 2 1 464 ultralytics.nn.modules.conv.Conv [3, 16, 3, 2]
13 -1 1 4672 ultralytics.nn.modules.conv.Conv [16, 32, 3, 2]
14 -1 1 7360 ultralytics.nn.modules.block.C2f [32, 32, 1, True]
15 -1 1 18560 ultralytics.nn.modules.conv.Conv [32, 64, 3, 2]
16 -1 2 49664 ultralytics.nn.modules.block.C2f [64, 64, 2, True]
17 -1 1 9856 ultralytics.nn.modules.block.SCDown [64, 128, 3, 2]
18 -1 2 197632 ultralytics.nn.modules.block.C2f [128, 128, 2, True]
19 -1 1 36096 ultralytics.nn.modules.block.SCDown [128, 256, 3, 2]
20 -1 1 460288 ultralytics.nn.modules.block.C2f [256, 256, 1, True]
21 [7, 16] 1 27712 ultralytics.nn.AddModules.iRMBFusion.iRMBFusion[64]
22 [9, 18] 1 96384 ultralytics.nn.AddModules.iRMBFusion.iRMBFusion[128]
23 [11, 20] 1 356608 ultralytics.nn.AddModules.iRMBFusion.iRMBFusion[256]
24 -1 1 164608 ultralytics.nn.modules.block.SPPF [256, 256, 5]
25 -1 1 249728 ultralytics.nn.modules.block.PSA [256, 256]
26 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
27 [-1, 22] 1 0 ultralytics.nn.modules.conv.Concat [1]
28 -1 1 148224 ultralytics.nn.modules.block.C2f [384, 128, 1]
29 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
30 [-1, 21] 1 0 ultralytics.nn.modules.conv.Concat [1]
31 -1 1 37248 ultralytics.nn.modules.block.C2f [192, 64, 1]
32 -1 1 36992 ultralytics.nn.modules.conv.Conv [64, 64, 3, 2]
33 [-1, 28] 1 0 ultralytics.nn.modules.conv.Concat [1]
34 -1 1 123648 ultralytics.nn.modules.block.C2f [192, 128, 1]
35 -1 1 18048 ultralytics.nn.modules.block.SCDown [128, 128, 3, 2]
36 [-1, 25] 1 0 ultralytics.nn.modules.conv.Concat [1]
37 -1 1 282624 ultralytics.nn.modules.block.C2fCIB [384, 256, 1, True, True]
38 [31, 34, 37] 1 861718 ultralytics.nn.modules.head.v10Detect [1, [64, 128, 256]]
YOLOv10n-mid-iRMBFusion summary: 583 layers, 3,972,726 parameters, 3,972,710 gradients, 12.2 GFLOPs
YOLOv10n-mid-to-late-iRMBFusion:
YOLOv10n-mid-to-late-iRMBFusion summary: 653 layers, 4,572,534 parameters, 4,572,518 gradients, 13.5 GFLOPs
from n params module arguments
0 -1 1 0 ultralytics.nn.AddModules.multimodal.IN []
1 -1 1 0 ultralytics.nn.AddModules.multimodal.Multiin [1]
2 -2 1 0 ultralytics.nn.AddModules.multimodal.Multiin [2]
3 1 1 464 ultralytics.nn.modules.conv.Conv [3, 16, 3, 2]
4 -1 1 4672 ultralytics.nn.modules.conv.Conv [16, 32, 3, 2]
5 -1 1 7360 ultralytics.nn.modules.block.C2f [32, 32, 1, True]
6 -1 1 18560 ultralytics.nn.modules.conv.Conv [32, 64, 3, 2]
7 -1 2 49664 ultralytics.nn.modules.block.C2f [64, 64, 2, True]
8 -1 1 9856 ultralytics.nn.modules.block.SCDown [64, 128, 3, 2]
9 -1 2 197632 ultralytics.nn.modules.block.C2f [128, 128, 2, True]
10 -1 1 36096 ultralytics.nn.modules.block.SCDown [128, 256, 3, 2]
11 -1 1 460288 ultralytics.nn.modules.block.C2f [256, 256, 1, True]
12 -1 1 164608 ultralytics.nn.modules.block.SPPF [256, 256, 5]
13 -1 1 249728 ultralytics.nn.modules.block.PSA [256, 256]
14 2 1 464 ultralytics.nn.modules.conv.Conv [3, 16, 3, 2]
15 -1 1 4672 ultralytics.nn.modules.conv.Conv [16, 32, 3, 2]
16 -1 1 7360 ultralytics.nn.modules.block.C2f [32, 32, 1, True]
17 -1 1 18560 ultralytics.nn.modules.conv.Conv [32, 64, 3, 2]
18 -1 2 49664 ultralytics.nn.modules.block.C2f [64, 64, 2, True]
19 -1 1 9856 ultralytics.nn.modules.block.SCDown [64, 128, 3, 2]
20 -1 2 197632 ultralytics.nn.modules.block.C2f [128, 128, 2, True]
21 -1 1 36096 ultralytics.nn.modules.block.SCDown [128, 256, 3, 2]
22 -1 1 460288 ultralytics.nn.modules.block.C2f [256, 256, 1, True]
23 -1 1 164608 ultralytics.nn.modules.block.SPPF [256, 256, 5]
24 -1 1 249728 ultralytics.nn.modules.block.PSA [256, 256]
25 13 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
26 [-1, 9] 1 0 ultralytics.nn.modules.conv.Concat [1]
27 -1 1 148224 ultralytics.nn.modules.block.C2f [384, 128, 1]
28 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
29 [-1, 7] 1 0 ultralytics.nn.modules.conv.Concat [1]
30 -1 1 37248 ultralytics.nn.modules.block.C2f [192, 64, 1]
31 24 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
32 [-1, 20] 1 0 ultralytics.nn.modules.conv.Concat [1]
33 -1 1 148224 ultralytics.nn.modules.block.C2f [384, 128, 1]
34 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
35 [-1, 18] 1 0 ultralytics.nn.modules.conv.Concat [1]
36 -1 1 37248 ultralytics.nn.modules.block.C2f [192, 64, 1]
37 [13, 24] 1 356608 ultralytics.nn.AddModules.iRMBFusion.iRMBFusion[256]
38 [27, 33] 1 96384 ultralytics.nn.AddModules.iRMBFusion.iRMBFusion[128]
39 [30, 36] 1 27712 ultralytics.nn.AddModules.iRMBFusion.iRMBFusion[64]
40 -1 1 36992 ultralytics.nn.modules.conv.Conv [64, 64, 3, 2]
41 [-1, 38] 1 0 ultralytics.nn.modules.conv.Concat [1]
42 -1 1 123648 ultralytics.nn.modules.block.C2f [192, 128, 1]
43 -1 1 18048 ultralytics.nn.modules.block.SCDown [128, 128, 3, 2]
44 [-1, 37] 1 0 ultralytics.nn.modules.conv.Concat [1]
45 -1 1 282624 ultralytics.nn.modules.block.C2fCIB [384, 256, 1, True, True]
46 [39, 42, 45] 1 861718 ultralytics.nn.modules.head.v10Detect [1, [64, 128, 256]]
YOLOv10n-late-iRMBFusion:
YOLOv10n-late-iRMBFusion summary: 713 layers, 5,033,846 parameters, 5,033,830 gradients, 14.3 GFLOPs
from n params module arguments
0 -1 1 0 ultralytics.nn.AddModules.multimodal.IN []
1 -1 1 0 ultralytics.nn.AddModules.multimodal.Multiin [1]
2 -2 1 0 ultralytics.nn.AddModules.multimodal.Multiin [2]
3 1 1 464 ultralytics.nn.modules.conv.Conv [3, 16, 3, 2]
4 -1 1 4672 ultralytics.nn.modules.conv.Conv [16, 32, 3, 2]
5 -1 1 7360 ultralytics.nn.modules.block.C2f [32, 32, 1, True]
6 -1 1 18560 ultralytics.nn.modules.conv.Conv [32, 64, 3, 2]
7 -1 2 49664 ultralytics.nn.modules.block.C2f [64, 64, 2, True]
8 -1 1 9856 ultralytics.nn.modules.block.SCDown [64, 128, 3, 2]
9 -1 2 197632 ultralytics.nn.modules.block.C2f [128, 128, 2, True]
10 -1 1 36096 ultralytics.nn.modules.block.SCDown [128, 256, 3, 2]
11 -1 1 460288 ultralytics.nn.modules.block.C2f [256, 256, 1, True]
12 -1 1 164608 ultralytics.nn.modules.block.SPPF [256, 256, 5]
13 -1 1 249728 ultralytics.nn.modules.block.PSA [256, 256]
14 2 1 464 ultralytics.nn.modules.conv.Conv [3, 16, 3, 2]
15 -1 1 4672 ultralytics.nn.modules.conv.Conv [16, 32, 3, 2]
16 -1 1 7360 ultralytics.nn.modules.block.C2f [32, 32, 1, True]
17 -1 1 18560 ultralytics.nn.modules.conv.Conv [32, 64, 3, 2]
18 -1 2 49664 ultralytics.nn.modules.block.C2f [64, 64, 2, True]
19 -1 1 9856 ultralytics.nn.modules.block.SCDown [64, 128, 3, 2]
20 -1 2 197632 ultralytics.nn.modules.block.C2f [128, 128, 2, True]
21 -1 1 36096 ultralytics.nn.modules.block.SCDown [128, 256, 3, 2]
22 -1 1 460288 ultralytics.nn.modules.block.C2f [256, 256, 1, True]
23 -1 1 164608 ultralytics.nn.modules.block.SPPF [256, 256, 5]
24 -1 1 249728 ultralytics.nn.modules.block.PSA [256, 256]
25 13 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
26 [-1, 9] 1 0 ultralytics.nn.modules.conv.Concat [1]
27 -1 1 148224 ultralytics.nn.modules.block.C2f [384, 128, 1]
28 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
29 [-1, 7] 1 0 ultralytics.nn.modules.conv.Concat [1]
30 -1 1 37248 ultralytics.nn.modules.block.C2f [192, 64, 1]
31 -1 1 36992 ultralytics.nn.modules.conv.Conv [64, 64, 3, 2]
32 [-1, 27] 1 0 ultralytics.nn.modules.conv.Concat [1]
33 -1 1 123648 ultralytics.nn.modules.block.C2f [192, 128, 1]
34 -1 1 18048 ultralytics.nn.modules.block.SCDown [128, 128, 3, 2]
35 [-1, 13] 1 0 ultralytics.nn.modules.conv.Concat [1]
36 -1 1 282624 ultralytics.nn.modules.block.C2fCIB [384, 256, 1, True, True]
37 24 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
38 [-1, 20] 1 0 ultralytics.nn.modules.conv.Concat [1]
39 -1 1 148224 ultralytics.nn.modules.block.C2f [384, 128, 1]
40 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
41 [-1, 18] 1 0 ultralytics.nn.modules.conv.Concat [1]
42 -1 1 37248 ultralytics.nn.modules.block.C2f [192, 64, 1]
43 -1 1 36992 ultralytics.nn.modules.conv.Conv [64, 64, 3, 2]
44 [-1, 39] 1 0 ultralytics.nn.modules.conv.Concat [1]
45 -1 1 123648 ultralytics.nn.modules.block.C2f [192, 128, 1]
46 -1 1 18048 ultralytics.nn.modules.block.SCDown [128, 128, 3, 2]
47 [-1, 24] 1 0 ultralytics.nn.modules.conv.Concat [1]
48 -1 1 282624 ultralytics.nn.modules.block.C2fCIB [384, 256, 1, True, True]
49 [30, 42] 1 27712 ultralytics.nn.AddModules.iRMBFusion.iRMBFusion[64]
50 [33, 45] 1 96384 ultralytics.nn.AddModules.iRMBFusion.iRMBFusion[128]
51 [36, 48] 1 356608 ultralytics.nn.AddModules.iRMBFusion.iRMBFusion[256]
52 [49, 50, 51] 1 861718 ultralytics.nn.modules.head.v10Detect [1, [64, 128, 256]]