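RT-DETR Improvement Strategies [Exclusive Fusion Improvement] | Mamba-YOLO + SDI: Strengthening Long-Range Dependencies and Focusing on Target Features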
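1. Introduction
This article documents how to use Mamba-YOLO to optimize the RT-DETR object detection network. Mamba-YOLO is an object detector built on state space models (SSMs). It targets the limitations of conventional detectors in complex scenes and in modeling long-range dependencies, and it is currently a popular topic for publications. On top of it, this article adds the SDI module. The fused model is designed to focus more precisely on the target objects in an image, suppress background and other interference, and highlight the targets' key features and location information.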
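2. Mamba YOLO Model Overview
Mamba YOLO: SSMs-Based YOLO For Object Detection
The Mamba-YOLO architecture consists of the ODMamba backbone and a neck, which work together to perform object detection:
- ODMamba backbone: built from a Simple Stem and Downsample Blocks. The Simple Stem uses two convolutions with stride 2 and kernel size 3 instead of the kernel-size-4, stride-4 convolution that conventional ViTs use to split an image into patches, which balances accuracy and efficiency. Each Downsample Block contains an ODSSBlock and a Vision Clue Merge module: features are first learned by the ODSSBlock and then downsampled by Vision Clue Merge. Vision Clue Merge removes the normalization, splits the feature map spatially, appends the extra feature maps along the channel dimension, and applies a 4x-compression pointwise convolution for downsampling. This preserves the feature maps selected by SS2D and helps training.
- Neck: follows the PAFPN design and replaces C2f with the ODSSBlock module to capture a richer gradient flow. Conv layers handle downsampling, and fusing features from different backbone levels yields more representative feature maps for detection.
- Core module ODSSBlock: composed of a ConvModule, SS2D, and an RG Block. The input first passes through the ConvModule to learn deeper and richer feature representations. SS2D propagates global spatial information through its Scan Expansion, S6 Block, and Scan Merge operations. The RG Block models the channel dimension with a multi-branch structure. Using depthwise separable convolution, residual connections, and nonlinear activations, it captures local dependencies, improves robustness, and compensates for the limitations of SSMs in sequence modeling.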
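The downsampling claim for the Simple Stem can be checked with plain arithmetic. The sketch below is only an illustration: the helper names are hypothetical, and a padding of 1 for the 3x3 convolutions is an assumption taken from the `autopad` behavior in the implementation section. It compares the output size of two 3x3, stride-2 convolutions with that of a single 4x4, stride-4 patchify convolution.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def simple_stem_size(size: int) -> int:
    """Two 3x3, stride-2 convolutions, each with padding 1."""
    return conv_out(conv_out(size, 3, 2, 1), 3, 2, 1)


def patchify_size(size: int) -> int:
    """A single 4x4, stride-4 convolution without padding (ViT-style patches)."""
    return conv_out(size, 4, 4, 0)


for side in (224, 640):
    # Both stems reduce the spatial size by a factor of 4.
    print(side, simple_stem_size(side), patchify_size(side))
```

Both variants end at a quarter of the input resolution. The difference is that the two-step stem uses overlapping 3x3 windows rather than disjoint 4x4 patches.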
Paper: https://arxiv.org/pdf/2406.05835
Code: https://github.com/HZAI-ZJNU/Mamba-YOLO
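To make the S6 Block mentioned above more concrete, here is a minimal pure-Python sketch of a selective state-space recurrence for a single channel. It is a slow sequential reference written for illustration only, not the CUDA kernel the repository uses. The function name and the list-based argument layout are assumptions, and the softplus on the step size mirrors the `delta_softplus=True` setting seen in the implementation section.

```python
import math


def softplus(x: float) -> float:
    return math.log1p(math.exp(x))


def selective_scan_1d(u, delta, A, B, C, D=0.0, delta_bias=0.0):
    """Sequential reference for one channel (hypothetical helper).

    u, delta: lists of length L (input and raw step size per position)
    A:        list of length N (negative decay rates, one per state)
    B, C:     lists of length L, each entry a list of length N
    """
    h = [0.0] * len(A)
    ys = []
    for t, u_t in enumerate(u):
        dt = softplus(delta[t] + delta_bias)
        # State update: h_n <- exp(dt * A_n) * h_n + dt * B_{t,n} * u_t
        h = [math.exp(dt * a) * h_n + dt * b * u_t
             for a, h_n, b in zip(A, h, B[t])]
        # Readout: y_t = <C_t, h> + D * u_t
        ys.append(sum(c * h_n for c, h_n in zip(C[t], h)) + D * u_t)
    return ys


# One state, an impulse at t = 0, and a raw step size of 0 (softplus(0) = ln 2).
ys = selective_scan_1d(u=[1.0, 0.0], delta=[0.0, 0.0], A=[-1.0],
                       B=[[1.0], [1.0]], C=[[1.0], [1.0]])
print(ys)
```

In this toy run the response to the impulse decays by a factor of exp(-ln 2) = 1/2 from the first position to the second. Because the step size, B, and C vary per position, the model can choose how strongly each position is written into and read from the state.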
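3. SDI Module Overview
U-NET V2: RETHINKING THE SKIP CONNECTIONS OF U-NET FOR MEDICAL IMAGE SEGMENTATION
The SDI module in U-Net V2 plays a key role in the overall architecture. It is designed to address the shortcomings of conventional models in feature fusion: its structure fuses semantic information and fine detail more effectively, which improves medical image segmentation performance.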
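3.1 Motivation
In the conventional U-Net and in U-Net-based models, the skip connections may not integrate low-level and high-level features effectively.
- The low-level features extracted by the encoder usually retain more detail but lack sufficient semantic information and may contain noise. High-level features carry more semantic information but lose precise detail because of their reduced resolution.
- Simple concatenation relies on the network's learning capacity, which is a challenge when medical imaging data is limited, and it also increases GPU memory consumption and computation.
A more effective way to fuse features from different levels is therefore needed, and this is the motivation behind the SDI module.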
3.2 SDI Structure and Principle
- First, spatial and channel attention, $\varphi_{i}^{s}$ and $\phi_{i}^{c}$, are applied to the level-$i$ feature $f_{i}^{0}$ produced by the encoder: $f_{i}^{1}=\phi_{i}^{c}\left(\varphi_{i}^{s}\left(f_{i}^{0}\right)\right)$, so that the feature integrates local spatial information and global channel information. A $1\times1$ convolution then reduces the number of channels of $f_{i}^{1}$ to $c$ (a hyperparameter), giving $f_{i}^{2}$.
- At each decoder level $i$, $f_{i}^{2}$ serves as the target reference, and the feature maps of the other levels are resized to match its resolution. For $j < i$: $f_{ij}^{3}=D\left(f_{j}^{2},\left(H_{i}, W_{i}\right)\right)$, where $D$ is adaptive average pooling. For $j = i$: $f_{ij}^{3}=I\left(f_{j}^{2}\right)$, where $I$ is the identity mapping. For $j > i$: $f_{ij}^{3}=U\left(f_{j}^{2},\left(H_{i}, W_{i}\right)\right)$, where $U$ is bilinear interpolation. A $3\times3$ convolution then smooths each resized feature map: $f_{ij}^{4}=\theta_{ij}\left(f_{ij}^{3}\right)$. Finally, all feature maps resized to the resolution of level $i$ are fused by the element-wise Hadamard product $H$: $f_{i}^{5}=H\left(\left[f_{i1}^{4}, f_{i2}^{4},\cdots, f_{iM}^{4}\right]\right)$, and $f_{i}^{5}$ is sent to the level-$i$ decoder for further processing.
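The resize-then-multiply step can be traced on a toy example. The sketch below is a simplified illustration under stated assumptions: single-channel maps stored as nested lists, the attention and the 3x3 smoothing convolutions omitted, and upsampling done by repetition as a stand-in for bilinear interpolation. All function names are hypothetical.

```python
def avg_pool(fmap, out_h, out_w):
    """Average-pool a 2D list to (out_h, out_w); sizes must divide evenly."""
    kh, kw = len(fmap) // out_h, len(fmap[0]) // out_w
    return [[sum(fmap[r * kh + dr][c * kw + dc]
                 for dr in range(kh) for dc in range(kw)) / (kh * kw)
             for c in range(out_w)]
            for r in range(out_h)]


def upsample_repeat(fmap, out_h, out_w):
    """Enlarge a 2D list by repetition (stand-in for bilinear interpolation)."""
    sh, sw = out_h // len(fmap), out_w // len(fmap[0])
    return [[fmap[r // sh][c // sw] for c in range(out_w)] for r in range(out_h)]


def sdi_fuse(levels, i):
    """Resize every level to level i's resolution and multiply element-wise."""
    h, w = len(levels[i]), len(levels[i][0])
    fused = [[1.0] * w for _ in range(h)]
    for fmap in levels:
        if len(fmap) > h:        # higher resolution (j < i): pool down
            fmap = avg_pool(fmap, h, w)
        elif len(fmap) < h:      # lower resolution (j > i): upsample
            fmap = upsample_repeat(fmap, h, w)
        fused = [[a * b for a, b in zip(row_a, row_b)]
                 for row_a, row_b in zip(fused, fmap)]
    return fused


levels = [
    [[1.0, 3.0, 2.0, 2.0],
     [1.0, 3.0, 2.0, 2.0],
     [0.0, 4.0, 1.0, 1.0],
     [0.0, 4.0, 1.0, 1.0]],   # level 0: 4x4 (high resolution)
    [[1.0, 2.0],
     [3.0, 4.0]],             # level 1: 2x2 (target level)
    [[0.5]],                  # level 2: 1x1 (low resolution)
]
print(sdi_fuse(levels, 1))    # [[1.0, 2.0], [3.0, 2.0]]
```

Because the fusion is a product rather than a concatenation, the output keeps the channel count of a single level, which is one reason the approach stays cheap in memory.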
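3.3 Advantages
- In the reported experiments, U-Net V2 achieves better segmentation results than other state-of-the-art methods on skin lesion and polyp segmentation datasets.
- U-Net V2 also compares favorably in computational complexity, GPU memory usage, and inference time. Compared with UNet++, it introduces fewer parameters, uses less GPU memory, and has better FLOPs and FPS, which indicates that the SDI module improves performance without adding much computational or storage overhead.
Paper: https://arxiv.org/pdf/2311.17791
Code: https://github.com/yaoppeng/U-Net_v2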
4. Implementation Code for the Mamba-YOLO and SDI Modules
The Mamba-YOLO implementation is as follows:
import torch
import math
from functools import partial
from typing import Callable, Any
import torch.nn as nn
from einops import rearrange, repeat
from timm.layers import DropPath

DropPath.__repr__ = lambda self: f"timm.DropPath({self.drop_prob})"

# The compiled selective-scan CUDA extensions may be missing at import time,
# but SelectiveScanCore below needs selective_scan_cuda_core at run time.
try:
    import selective_scan_cuda_core
    import selective_scan_cuda_oflex
    import selective_scan_cuda_ndstate
    # import selective_scan_cuda_nrow
    import selective_scan_cuda
except ImportError:
    pass

__all__ = ("VSSBlock_YOLO", "SimpleStem", "VisionClueMerge", "XSSBlock")


class LayerNorm2d(nn.Module):
    def __init__(self, normalized_shape, eps=1e-6, elementwise_affine=True):
        super().__init__()
        self.norm = nn.LayerNorm(normalized_shape, eps, elementwise_affine)

    def forward(self, x):
        # LayerNorm normalizes the last dimension, so move channels last and back.
        x = rearrange(x, 'b c h w -> b h w c').contiguous()
        x = self.norm(x)
        x = rearrange(x, 'b h w c -> b c h w').contiguous()
        return x


def autopad(k, p=None, d=1):  # kernel, padding, dilation
    """Pad to 'same' shape outputs."""
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p


# Cross Scan
class CrossScan(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor):
        B, C, H, W = x.shape
        ctx.shape = (B, C, H, W)
        xs = x.new_empty((B, 4, C, H * W))
        xs[:, 0] = x.flatten(2, 3)  # row-major scan
        xs[:, 1] = x.transpose(dim0=2, dim1=3).flatten(2, 3)  # column-major scan
        xs[:, 2:4] = torch.flip(xs[:, 0:2], dims=[-1])  # both scans reversed
        return xs

    @staticmethod
    def backward(ctx, ys: torch.Tensor):
        # out: (b, k, d, l)
        B, C, H, W = ctx.shape
        L = H * W
        ys = ys[:, 0:2] + ys[:, 2:4].flip(dims=[-1]).view(B, 2, -1, L)
        y = ys[:, 0] + ys[:, 1].view(B, -1, W, H).transpose(dim0=2, dim1=3).contiguous().view(B, -1, L)
        return y.view(B, -1, H, W)


class CrossMerge(torch.autograd.Function):
    @staticmethod
    def forward(ctx, ys: torch.Tensor):
        B, K, D, H, W = ys.shape
        ctx.shape = (H, W)
        ys = ys.view(B, K, D, -1)
        ys = ys[:, 0:2] + ys[:, 2:4].flip(dims=[-1]).view(B, 2, D, -1)
        y = ys[:, 0] + ys[:, 1].view(B, -1, W, H).transpose(dim0=2, dim1=3).contiguous().view(B, D, -1)
        return y

    @staticmethod
    def backward(ctx, x: torch.Tensor):
        # B, D, L = x.shape
        # out: (b, k, d, l)
        H, W = ctx.shape
        B, C, L = x.shape
        xs = x.new_empty((B, 4, C, L))
        xs[:, 0] = x
        xs[:, 1] = x.view(B, C, H, W).transpose(dim0=2, dim1=3).flatten(2, 3)
        xs[:, 2:4] = torch.flip(xs[:, 0:2], dims=[-1])
        xs = xs.view(B, 4, C, H, W)
        return xs, None, None


class SelectiveScanCore(torch.autograd.Function):
    @staticmethod
    @torch.cuda.amp.custom_fwd
    def forward(ctx, u, delta, A, B, C, D=None, delta_bias=None, delta_softplus=False, nrows=1, backnrows=1,
                oflex=True):
        # all in float
        if u.stride(-1) != 1:
            u = u.contiguous()
        if delta.stride(-1) != 1:
            delta = delta.contiguous()
        if D is not None and D.stride(-1) != 1:
            D = D.contiguous()
        if B.stride(-1) != 1:
            B = B.contiguous()
        if C.stride(-1) != 1:
            C = C.contiguous()
        if B.dim() == 3:
            B = B.unsqueeze(dim=1)
            ctx.squeeze_B = True
        if C.dim() == 3:
            C = C.unsqueeze(dim=1)
            ctx.squeeze_C = True
        ctx.delta_softplus = delta_softplus
        ctx.backnrows = backnrows
        out, x, *rest = selective_scan_cuda_core.fwd(u, delta, A, B, C, D, delta_bias, delta_softplus, 1)
        ctx.save_for_backward(u, delta, A, B, C, D, delta_bias, x)
        return out

    @staticmethod
    @torch.cuda.amp.custom_bwd
    def backward(ctx, dout, *args):
        u, delta, A, B, C, D, delta_bias, x = ctx.saved_tensors
        if dout.stride(-1) != 1:
            dout = dout.contiguous()
        du, ddelta, dA, dB, dC, dD, ddelta_bias, *rest = selective_scan_cuda_core.bwd(
            u, delta, A, B, C, D, delta_bias, dout, x, ctx.delta_softplus, 1
        )
        return (du, ddelta, dA, dB, dC, dD, ddelta_bias, None, None, None, None)


def cross_selective_scan(
        x: torch.Tensor = None,
        x_proj_weight: torch.Tensor = None,
        x_proj_bias: torch.Tensor = None,
        dt_projs_weight: torch.Tensor = None,
        dt_projs_bias: torch.Tensor = None,
        A_logs: torch.Tensor = None,
        Ds: torch.Tensor = None,
        out_norm: torch.nn.Module = None,
        out_norm_shape="v0",
        nrows=-1,
        backnrows=-1,
        delta_softplus=True,
        to_dtype=True,
        force_fp32=False,
        ssoflex=True,
        SelectiveScan=None,
        scan_mode_type='default'
):
    B, D, H, W = x.shape
    D, N = A_logs.shape
    K, D, R = dt_projs_weight.shape
    L = H * W

    def selective_scan(u, delta, A, B, C, D=None, delta_bias=None, delta_softplus=True):
        return SelectiveScan.apply(u, delta, A, B, C, D, delta_bias, delta_softplus, nrows, backnrows, ssoflex)

    # Expand the feature map into four 1D scan sequences.
    xs = CrossScan.apply(x)

    # Project each sequence to the step size (dts) and the input-dependent B and C.
    x_dbl = torch.einsum("b k d l, k c d -> b k c l", xs, x_proj_weight)
    if x_proj_bias is not None:
        x_dbl = x_dbl + x_proj_bias.view(1, K, -1, 1)
    dts, Bs, Cs = torch.split(x_dbl, [R, N, N], dim=2)
    dts = torch.einsum("b k r l, k d r -> b k d l", dts, dt_projs_weight)
    xs = xs.view(B, -1, L)
    dts = dts.contiguous().view(B, -1, L)
    As = -torch.exp(A_logs.to(torch.float))
    Bs = Bs.contiguous()
    Cs = Cs.contiguous()
    Ds = Ds.to(torch.float)
    delta_bias = dt_projs_bias.view(-1).to(torch.float)

    if force_fp32:
        xs = xs.to(torch.float)
        dts = dts.to(torch.float)
        Bs = Bs.to(torch.float)
        Cs = Cs.to(torch.float)

    ys: torch.Tensor = selective_scan(
        xs, dts, As, Bs, Cs, Ds, delta_bias, delta_softplus
    ).view(B, K, -1, H, W)

    # Merge the four scan outputs back into one sequence.
    y: torch.Tensor = CrossMerge.apply(ys)

    if out_norm_shape in ["v1"]:
        y = out_norm(y.view(B, -1, H, W)).permute(0, 2, 3, 1)
    else:
        y = y.transpose(dim0=1, dim1=2).contiguous()
        y = out_norm(y).view(B, H, W, -1)

    return (y.to(x.dtype) if to_dtype else y)


class SS2D(nn.Module):
    def __init__(
            self,
            d_model=96,
            d_state=16,
            ssm_ratio=2.0,
            ssm_rank_ratio=2.0,
            dt_rank="auto",
            act_layer=nn.SiLU,
            d_conv=3,
            conv_bias=True,
            dropout=0.0,
            bias=False,
            forward_type="v2",
            **kwargs,
    ):
        """
        ssm_rank_ratio would be used in the future...
        """
        factory_kwargs = {"device": None, "dtype": None}
        super().__init__()
        d_expand = int(ssm_ratio * d_model)
        d_inner = int(min(ssm_rank_ratio, ssm_ratio) * d_model) if ssm_rank_ratio > 0 else d_expand
        self.dt_rank = math.ceil(d_model / 16) if dt_rank == "auto" else dt_rank
        self.d_state = math.ceil(d_model / 6) if d_state == "auto" else d_state
        self.d_conv = d_conv
        self.K = 4  # number of scan directions

        def checkpostfix(tag, value):
            ret = value[-len(tag):] == tag
            if ret:
                value = value[:-len(tag)]
            return ret, value

        self.disable_force32, forward_type = checkpostfix("no32", forward_type)
        self.disable_z, forward_type = checkpostfix("noz", forward_type)
        self.disable_z_act, forward_type = checkpostfix("nozact", forward_type)

        self.out_norm = nn.LayerNorm(d_inner)

        FORWARD_TYPES = dict(
            v2=partial(self.forward_corev2, force_fp32=None, SelectiveScan=SelectiveScanCore),
        )
        self.forward_core = FORWARD_TYPES.get(forward_type, FORWARD_TYPES.get("v2", None))

        # in proj
        d_proj = d_expand if self.disable_z else (d_expand * 2)
        self.in_proj = nn.Conv2d(d_model, d_proj, kernel_size=1, stride=1, groups=1, bias=bias, **factory_kwargs)
        self.act: nn.Module = nn.GELU()

        # depthwise conv
        if self.d_conv > 1:
            self.conv2d = nn.Conv2d(
                in_channels=d_expand,
                out_channels=d_expand,
                groups=d_expand,
                bias=conv_bias,
                kernel_size=d_conv,
                padding=(d_conv - 1) // 2,
                **factory_kwargs,
            )

        # low-rank projection
        self.ssm_low_rank = False
        if d_inner < d_expand:
            self.ssm_low_rank = True
            self.in_rank = nn.Conv2d(d_expand, d_inner, kernel_size=1, bias=False, **factory_kwargs)
            self.out_rank = nn.Linear(d_inner, d_expand, bias=False, **factory_kwargs)

        # x proj
        self.x_proj = [
            nn.Linear(d_inner, (self.dt_rank + self.d_state * 2), bias=False,
                      **factory_kwargs)
            for _ in range(self.K)
        ]
        self.x_proj_weight = nn.Parameter(torch.stack([t.weight for t in self.x_proj], dim=0))  # (K, N, inner)
        del self.x_proj

        # out proj
        self.out_proj = nn.Conv2d(d_expand, d_model, kernel_size=1, stride=1, bias=bias, **factory_kwargs)
        self.dropout = nn.Dropout(dropout) if dropout > 0. else nn.Identity()

        # SSM parameters
        self.Ds = nn.Parameter(torch.ones((self.K * d_inner)))
        self.A_logs = nn.Parameter(
            torch.zeros((self.K * d_inner, self.d_state)))
        self.dt_projs_weight = nn.Parameter(torch.randn((self.K, d_inner, self.dt_rank)))
        self.dt_projs_bias = nn.Parameter(torch.randn((self.K, d_inner)))

    @staticmethod
    def dt_init(dt_rank, d_inner, dt_scale=1.0, dt_init="random", dt_min=0.001, dt_max=0.1, dt_init_floor=1e-4,
                **factory_kwargs):
        dt_proj = nn.Linear(dt_rank, d_inner, bias=True, **factory_kwargs)

        dt_init_std = dt_rank ** -0.5 * dt_scale
        if dt_init == "constant":
            nn.init.constant_(dt_proj.weight, dt_init_std)
        elif dt_init == "random":
            nn.init.uniform_(dt_proj.weight, -dt_init_std, dt_init_std)
        else:
            raise NotImplementedError

        dt = torch.exp(
            torch.rand(d_inner, **factory_kwargs) * (math.log(dt_max) - math.log(dt_min))
            + math.log(dt_min)
        ).clamp(min=dt_init_floor)
        # Inverse of softplus, so that softplus(bias) falls in [dt_min, dt_max].
        inv_dt = dt + torch.log(-torch.expm1(-dt))
        with torch.no_grad():
            dt_proj.bias.copy_(inv_dt)

        return dt_proj

    @staticmethod
    def A_log_init(d_state, d_inner, copies=-1, device=None, merge=True):
        A = repeat(
            torch.arange(1, d_state + 1, dtype=torch.float32, device=device),
            "n -> d n",
            d=d_inner,
        ).contiguous()
        A_log = torch.log(A)
        if copies > 0:
            A_log = repeat(A_log, "d n -> r d n", r=copies)
            if merge:
                A_log = A_log.flatten(0, 1)
        A_log = nn.Parameter(A_log)
        A_log._no_weight_decay = True
        return A_log

    @staticmethod
    def D_init(d_inner, copies=-1, device=None, merge=True):
        D = torch.ones(d_inner, device=device)
        if copies > 0:
            D = repeat(D, "n1 -> r n1", r=copies)
            if merge:
                D = D.flatten(0, 1)
        D = nn.Parameter(D)
        D._no_weight_decay = True
        return D

    def forward_corev2(self, x: torch.Tensor, channel_first=False, SelectiveScan=SelectiveScanCore,
                       cross_selective_scan=cross_selective_scan, force_fp32=None):
        force_fp32 = (self.training and (not self.disable_force32)) if force_fp32 is None else force_fp32
        if not channel_first:
            x = x.permute(0, 3, 1, 2).contiguous()
        if self.ssm_low_rank:
            x = self.in_rank(x)
        x = cross_selective_scan(
            x, self.x_proj_weight, None, self.dt_projs_weight, self.dt_projs_bias,
            self.A_logs, self.Ds,
            out_norm=getattr(self, "out_norm", None),
            out_norm_shape=getattr(self, "out_norm_shape", "v0"),
            delta_softplus=True, force_fp32=force_fp32,
            SelectiveScan=SelectiveScan, ssoflex=self.training,  # output fp32
        )
        if self.ssm_low_rank:
            x = self.out_rank(x)
        return x

    def forward(self, x: torch.Tensor, **kwargs):
        x = self.in_proj(x)
        if not self.disable_z:
            x, z = x.chunk(2, dim=1)
            # Gate branch; always define z1 so the "nozact" option cannot leave it unset.
            z1 = z if self.disable_z_act else self.act(z)
        # self.conv2d only exists when d_conv > 1 (see __init__).
        if self.d_conv > 1:
            x = self.conv2d(x)
        x = self.act(x)
        y = self.forward_core(x, channel_first=(self.d_conv > 1))
        y = y.permute(0, 3, 1, 2).contiguous()
        if not self.disable_z:
            y = y * z1
        out = self.dropout(self.out_proj(y))
        return out


class RGBlock(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.,
                 channels_first=False):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        hidden_features = int(2 * hidden_features / 3)
        self.fc1 = nn.Conv2d(in_features, hidden_features * 2, kernel_size=1)
        self.dwconv = nn.Conv2d(hidden_features, hidden_features, kernel_size=3, stride=1, padding=1, bias=True,
                                groups=hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Conv2d(hidden_features, out_features, kernel_size=1)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        # Split into a gated branch x and a value branch v.
        x, v = self.fc1(x).chunk(2, dim=1)
        x = self.act(self.dwconv(x) + x) * v
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x


class LSBlock(nn.Module):
    def __init__(self, in_features, hidden_features=None, act_layer=nn.GELU, drop=0):
        super().__init__()
        self.fc1 = nn.Conv2d(in_features, hidden_features, kernel_size=3, padding=3 // 2, groups=hidden_features)
        self.norm = nn.BatchNorm2d(hidden_features)
        self.fc2 = nn.Conv2d(hidden_features, hidden_features, kernel_size=1, padding=0)
        self.act = act_layer()
        self.fc3 = nn.Conv2d(hidden_features, in_features, kernel_size=1, padding=0)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        input = x
        x = self.fc1(x)
        x = self.norm(x)
        x = self.fc2(x)
        x = self.act(x)
        x = self.fc3(x)
        x = input + self.drop(x)
        return x


class XSSBlock(nn.Module):
    def __init__(
            self,
            in_channels: int = 0,
            hidden_dim: int = 0,
            n: int = 1,
            mlp_ratio=4.0,
            drop_path: float = 0,
            norm_layer: Callable[..., torch.nn.Module] = partial(LayerNorm2d, eps=1e-6),
            ssm_d_state: int = 16,
            ssm_ratio=2.0,
            ssm_rank_ratio=2.0,
            ssm_dt_rank: Any = "auto",
            ssm_act_layer=nn.SiLU,
            ssm_conv: int = 3,
            ssm_conv_bias=True,
            ssm_drop_rate: float = 0,
            ssm_init="v0",
            forward_type="v2",
            mlp_act_layer=nn.GELU,
            mlp_drop_rate: float = 0.0,
            use_checkpoint: bool = False,
            post_norm: bool = False,
            **kwargs,
    ):
        super().__init__()

        self.in_proj = nn.Sequential(
            nn.Conv2d(in_channels, hidden_dim, kernel_size=1, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(hidden_dim),
            nn.SiLU()
        ) if in_channels != hidden_dim else nn.Identity()
        self.hidden_dim = hidden_dim

        self.norm = norm_layer(hidden_dim)
        # n stacked SS2D layers
        self.ss2d = nn.Sequential(*(SS2D(d_model=self.hidden_dim,
                                         d_state=ssm_d_state,
                                         ssm_ratio=ssm_ratio,
                                         ssm_rank_ratio=ssm_rank_ratio,
                                         dt_rank=ssm_dt_rank,
                                         act_layer=ssm_act_layer,
                                         d_conv=ssm_conv,
                                         conv_bias=ssm_conv_bias,
                                         dropout=ssm_drop_rate, ) for _ in range(n)))
        self.drop_path = DropPath(drop_path)
        self.lsblock = LSBlock(hidden_dim, hidden_dim)
        self.mlp_branch = mlp_ratio > 0
        if self.mlp_branch:
            self.norm2 = norm_layer(hidden_dim)
            mlp_hidden_dim = int(hidden_dim * mlp_ratio)
            self.mlp = RGBlock(in_features=hidden_dim, hidden_features=mlp_hidden_dim, act_layer=mlp_act_layer,
                               drop=mlp_drop_rate)

    def forward(self, input):
        input = self.in_proj(input)
        X1 = self.lsblock(input)
        input = input + self.drop_path(self.ss2d(self.norm(X1)))
        if self.mlp_branch:
            input = input + self.drop_path(self.mlp(self.norm2(input)))
        return input


class VSSBlock_YOLO(nn.Module):
    def __init__(
            self,
            in_channels: int = 0,
            hidden_dim: int = 0,
            drop_path: float = 0,
            norm_layer: Callable[..., torch.nn.Module] = partial(LayerNorm2d, eps=1e-6),
            ssm_d_state: int = 16,
            ssm_ratio=2.0,
            ssm_rank_ratio=2.0,
            ssm_dt_rank: Any = "auto",
            ssm_act_layer=nn.SiLU,
            ssm_conv: int = 3,
            ssm_conv_bias=True,
            ssm_drop_rate: float = 0,
            ssm_init="v0",
            forward_type="v2",
            mlp_ratio=4.0,
            mlp_act_layer=nn.GELU,
            mlp_drop_rate: float = 0.0,
            use_checkpoint: bool = False,
            post_norm: bool = False,
            **kwargs,
    ):
        super().__init__()
        self.ssm_branch = ssm_ratio > 0
        self.mlp_branch = mlp_ratio > 0
        self.use_checkpoint = use_checkpoint
        self.post_norm = post_norm

        # proj
        self.proj_conv = nn.Sequential(
            nn.Conv2d(in_channels, hidden_dim, kernel_size=1, stride=1, padding=0, bias=True),
            nn.BatchNorm2d(hidden_dim),
            nn.SiLU()
        )

        if self.ssm_branch:
            self.norm = norm_layer(hidden_dim)
            self.op = SS2D(
                d_model=hidden_dim,
                d_state=ssm_d_state,
                ssm_ratio=ssm_ratio,
                ssm_rank_ratio=ssm_rank_ratio,
                dt_rank=ssm_dt_rank,
                act_layer=ssm_act_layer,
                d_conv=ssm_conv,
                conv_bias=ssm_conv_bias,
                dropout=ssm_drop_rate,
                # bias=False,
                # dt_min=0.001,
                # dt_max=0.1,
                # dt_init="random",
                # dt_scale="random",
                # dt_init_floor=1e-4,
                initialize=ssm_init,
                forward_type=forward_type,
            )

        self.drop_path = DropPath(drop_path)
        self.lsblock = LSBlock(hidden_dim, hidden_dim)
        if self.mlp_branch:
            self.norm2 = norm_layer(hidden_dim)
            mlp_hidden_dim = int(hidden_dim * mlp_ratio)
            self.mlp = RGBlock(in_features=hidden_dim, hidden_features=mlp_hidden_dim, act_layer=mlp_act_layer,
                               drop=mlp_drop_rate, channels_first=False)

    def forward(self, input: torch.Tensor):
        input = self.proj_conv(input)
        X1 = self.lsblock(input)
        x = input + self.drop_path(self.op(self.norm(X1)))
        if self.mlp_branch:
            x = x + self.drop_path(self.mlp(self.norm2(x)))  # FFN
        return x


class SimpleStem(nn.Module):
    def __init__(self, inp, embed_dim, ks=3):
        super().__init__()
        self.hidden_dims = embed_dim // 2
        # Two stride-2 convolutions: overall 4x spatial downsampling.
        self.conv = nn.Sequential(
            nn.Conv2d(inp, self.hidden_dims, kernel_size=ks, stride=2, padding=autopad(ks, d=1), bias=False),
            nn.BatchNorm2d(self.hidden_dims),
            nn.GELU(),
            nn.Conv2d(self.hidden_dims, embed_dim, kernel_size=ks, stride=2, padding=autopad(ks, d=1), bias=False),
            nn.BatchNorm2d(embed_dim),
            nn.SiLU(),
        )

    def forward(self, x):
        return self.conv(x)


class VisionClueMerge(nn.Module):
    def __init__(self, dim, out_dim):
        super().__init__()
        self.hidden = int(dim * 4)

        self.pw_linear = nn.Sequential(
            nn.Conv2d(self.hidden, out_dim, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(out_dim),
            nn.SiLU()
        )

    def forward(self, x):
        # Take the four 2x2 sub-grids and stack them on the channel axis
        # (halves H and W, quadruples C), then compress with a pointwise conv.
        y = torch.cat([
            x[..., ::2, ::2],
            x[..., 1::2, ::2],
            x[..., ::2, 1::2],
            x[..., 1::2, 1::2]
        ], dim=1)
        return self.pw_linear(y)
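The four scan directions produced by `CrossScan.forward` are easier to see on a tiny grid. The following pure-Python sketch lists the pixel order of each direction. The helper name is hypothetical, and the code only mirrors the indexing logic (row-major flatten, transposed flatten, and their reversals) rather than the tensor implementation.

```python
def cross_scan_orders(h: int, w: int):
    """Return the four traversal orders as lists of (row, col) positions."""
    row_major = [(r, c) for r in range(h) for c in range(w)]  # x.flatten(2, 3)
    col_major = [(r, c) for c in range(w) for r in range(h)]  # transpose, then flatten
    # The last two sequences are the first two read backwards.
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


orders = cross_scan_orders(2, 3)
for name, order in zip(("row-major", "column-major",
                        "row-major reversed", "column-major reversed"), orders):
    print(f"{name:>22}: {order}")
```

Each position appears once in every sequence, so a pixel receives context from all four directions after the sequences are merged again.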
The SDI module implementation is as follows:
import torch
from torch import nn
import torch.nn.functional as F


def autopad(k, p=None, d=1):
    """
    Pads kernel to 'same' output shape, adjusting for optional dilation; returns padding size.
    `k`: kernel, `p`: padding, `d`: dilation.
    """
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p


class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    default_act = nn.SiLU()  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        """Initializes a standard convolution layer with optional batch normalization and activation."""
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        """Applies a convolution followed by batch normalization and an activation function to the input tensor `x`."""
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        """Applies a fused convolution and activation function to the input tensor `x`."""
        return self.act(self.conv(x))


class GSConv(nn.Module):
    # GSConv https://github.com/AlanLi1997/slim-neck-by-gsconv
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        c_ = c2 // 2
        self.cv1 = Conv(c1, c_, k, s, p, g, d, Conv.default_act)
        self.cv2 = Conv(c_, c_, 5, 1, p, c_, d, Conv.default_act)

    def forward(self, x):
        x1 = self.cv1(x)
        x2 = torch.cat((x1, self.cv2(x1)), 1)
        # shuffle
        # y = x2.reshape(x2.shape[0], 2, x2.shape[1] // 2, x2.shape[2], x2.shape[3])
        # y = y.permute(0, 2, 1, 3, 4)
        # return y.reshape(y.shape[0], -1, y.shape[3], y.shape[4])

        b, n, h, w = x2.size()
        b_n = b * n // 2
        y = x2.reshape(b_n, 2, h * w)
        y = y.permute(1, 0, 2)
        y = y.reshape(2, -1, n // 2, h, w)

        return torch.cat((y[0], y[1]), 1)


class SDI(nn.Module):
    def __init__(self, channels):
        super().__init__()

        # self.convs = nn.ModuleList([nn.Conv2d(channel, channels[0], kernel_size=3, stride=1, padding=1) for channel in channels])
        self.convs = nn.ModuleList([GSConv(channel, channels[0]) for channel in channels])

    def forward(self, xs):
        # xs[0] defines the target resolution and channel count.
        ans = torch.ones_like(xs[0])
        target_size = xs[0].shape[2:]
        for i, x in enumerate(xs):
            if x.shape[-1] > target_size[-1]:
                x = F.adaptive_avg_pool2d(x, (target_size[0], target_size[1]))
            elif x.shape[-1] < target_size[-1]:
                x = F.interpolate(x, size=(target_size[0], target_size[1]),
                                  mode='bilinear', align_corners=True)
            # Hadamard (element-wise) product across all inputs.
            ans = ans * self.convs[i](x)
        return ans
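The reshape/permute sequence at the end of `GSConv.forward` is a channel shuffle. As a rough illustration, the pure-Python sketch below reproduces the resulting channel order for a single sample. The function name is hypothetical, and the list operations only imitate the tensor indexing.

```python
def gsconv_shuffle_order(n: int) -> list:
    """Channel order after the GSConv shuffle, for n (even) input channels."""
    channels = list(range(n))
    # reshape(n // 2, 2, ...): consecutive channels are grouped in pairs.
    pairs = [channels[k:k + 2] for k in range(0, n, 2)]
    # permute(1, 0, ...): first element of every pair, then every second element.
    first, second = zip(*pairs)
    # torch.cat((y[0], y[1]), 1)
    return list(first) + list(second)


print(gsconv_shuffle_order(8))  # [0, 2, 4, 6, 1, 3, 5, 7]
```

Since the first half of the concatenated input comes from the pointwise branch (`cv1`) and the second half from the depthwise branch (`cv2`), moving even-indexed channels to the front places outputs of both branches in each half of the result.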
5. Steps to Add the Modules
For the steps to add Mamba-YOLO, see:
For the steps to add SDI, see:
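6. YAML Model File
6.1 Improved Model Version
📌 Create a new model file named rtdetr-mamba-SDI.yaml and configure it with the structure below. The model size is selected through the scales T, B, and L (only the T scale is listed here):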
nc: 1 # number of classes
scales: # [depth, width, max_channels]
  T: [0.33, 0.25, 1024] # Mamba-YOLOv8-T summary: 6.1M parameters, 14.3GFLOPs

# Mamba-YOLO backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, SimpleStem, [128, 3]] # 0-P2/4
  - [-1, 2, VSSBlock_YOLO, [128]] # 1
  - [-1, 1, VisionClueMerge, [256]] # 2 p3/8
  - [-1, 2, VSSBlock_YOLO, [256]] # 3
  - [-1, 1, VisionClueMerge, [512]] # 4 p4/16
  - [-1, 2, VSSBlock_YOLO, [512]] # 5
  - [-1, 1, VisionClueMerge, [1024]] # 6 p5/32
  - [-1, 2, VSSBlock_YOLO, [1024]] # 7
  - [-1, 1, SPPF, [1024, 5]] # 8
  - [-1, 2, C2PSA, [1024]] # 9

# Mamba-YOLO PAFPN
head:
  - [-1, 1, nn.Upsample, [None, 2, 'nearest']]
  - [[-1, 5], 1, SDI, []] # cat backbone P4
  - [-1, 2, XSSBlock, [512]] # 12

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']]
  - [[-1, 3], 1, SDI, []] # cat backbone P3
  - [-1, 2, XSSBlock, [256]] # 15 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 12], 1, SDI, []] # cat head P4
  - [-1, 2, XSSBlock, [512]] # 18 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 9], 1, SDI, []] # cat head P5
  - [-1, 2, XSSBlock, [1024]] # 21 (P5/32-large)

  - [[15, 18, 21], 1, RTDETRDecoder, [nc, 256, 300, 4, 8, 3]] # Detect(P3, P4, P5)
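The `scales` entry determines how the channel counts and repeat counts written in the YAML are shrunk for a given model size. The sketch below is a hedged approximation of that scaling and not the framework's parser. The helper names are hypothetical, and it assumes the custom modules are registered so that they go through the same width and depth scaling as the built-in ones.

```python
import math


def make_divisible(x: float, divisor: int = 8) -> int:
    """Round x up to the nearest multiple of divisor."""
    return math.ceil(x / divisor) * divisor


def scale_layer(channels: int, repeats: int, depth: float, width: float, max_channels: int):
    """Apply width scaling to channels and depth scaling to repeats (assumed rule)."""
    scaled_c = make_divisible(min(channels, max_channels) * width)
    scaled_n = max(round(repeats * depth), 1) if repeats > 1 else repeats
    return scaled_c, scaled_n


depth, width, max_channels = 0.33, 0.25, 1024  # scale T from the YAML above
for c in (128, 256, 512, 1024):
    print(c, scale_layer(c, 2, depth, width, max_channels))
```

Under these assumptions the YAML widths 128, 256, 512, and 1024 become 32, 64, 128, and 256, and a repeat count of 2 becomes 1. This is consistent with the argument lists in the printed model summary, such as `SimpleStem[3, 32, 3]`.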
7. Successful Run Results
Printing the network shows that Mamba-YOLO and SDI have been added to the model, which is now ready for training.
rtdetr-mamba-SDI:
             from  n    params  module                                                arguments
  0            -1  1      5136  ultralytics.nn.AddModules.mamba_yolo.SimpleStem       [3, 32, 3]
  1            -1  1     33692  ultralytics.nn.AddModules.mamba_yolo.VSSBlock_YOLO    [32, 32]
  2            -1  1      8384  ultralytics.nn.AddModules.mamba_yolo.VisionClueMerge  [32, 64]
  3            -1  1    104184  ultralytics.nn.AddModules.mamba_yolo.VSSBlock_YOLO    [64, 64]
  4            -1  1     33152  ultralytics.nn.AddModules.mamba_yolo.VisionClueMerge  [64, 128]
  5            -1  1    355964  ultralytics.nn.AddModules.mamba_yolo.VSSBlock_YOLO    [128, 128]
  6            -1  1    131840  ultralytics.nn.AddModules.mamba_yolo.VisionClueMerge  [128, 256]
  7            -1  1   1301496  ultralytics.nn.AddModules.mamba_yolo.VSSBlock_YOLO    [256, 256]
  8            -1  1    164608  ultralytics.nn.modules.block.SPPF                     [256, 256, 5]
  9            -1  1    249728  ultralytics.nn.modules.block.C2PSA                    [256, 256, 1]
 10            -1  1         0  torch.nn.modules.upsampling.Upsample                  [None, 2, 'nearest']
 11       [-1, 5]  1     56576  ultralytics.nn.AddModules.SDI.SDI                     [[256, 128]]
 12            -1  1    372220  ultralytics.nn.AddModules.mamba_yolo.XSSBlock         [256, 128]
 13            -1  1         0  torch.nn.modules.upsampling.Upsample                  [None, 2, 'nearest']
 14       [-1, 3]  1     16000  ultralytics.nn.AddModules.SDI.SDI                     [[128, 64]]
 15            -1  1    108216  ultralytics.nn.AddModules.mamba_yolo.XSSBlock         [128, 64]
 16            -1  1     36992  ultralytics.nn.modules.conv.Conv                      [64, 64, 3, 2]
 17      [-1, 12]  1      8000  ultralytics.nn.AddModules.SDI.SDI                     [[64, 128]]
 18            -1  1    347644  ultralytics.nn.AddModules.mamba_yolo.XSSBlock         [64, 128]
 19            -1  1    147712  ultralytics.nn.modules.conv.Conv                      [128, 128, 3, 2]
 20       [-1, 9]  1     28288  ultralytics.nn.AddModules.SDI.SDI                     [[128, 256]]
 21            -1  1   1268472  ultralytics.nn.AddModules.mamba_yolo.XSSBlock         [128, 256]
 22  [15, 18, 21]  1   3835764  ultralytics.nn.modules.head.RTDETRDecoder             [1, [64, 128, 256], 256, 300, 4, 8, 3]
rtdetr-mamba-SDI summary: 487 layers, 8,614,068 parameters, 8,614,068 gradients, 15.8 GFLOPs
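As a sanity check on the printed table, the parameter counts of the four SDI layers can be reproduced by hand from the SDI code above: each input passes through a GSConv made of a bias-free 1x1 convolution and a bias-free 5x5 depthwise convolution, each followed by BatchNorm. The helper names in this sketch are hypothetical, and only the trainable weights are counted.

```python
def conv_bn_params(c_in: int, c_out: int, k: int, groups: int = 1) -> int:
    """Bias-free k x k convolution plus BatchNorm weight and bias."""
    return (c_in // groups) * c_out * k * k + 2 * c_out


def gsconv_params(c_in: int, c_out: int) -> int:
    half = c_out // 2
    pointwise = conv_bn_params(c_in, half, 1)               # cv1: 1x1 conv
    depthwise = conv_bn_params(half, half, 5, groups=half)  # cv2: 5x5 depthwise conv
    return pointwise + depthwise


def sdi_params(channels) -> int:
    """One GSConv per input, each projecting to channels[0]."""
    return sum(gsconv_params(c, channels[0]) for c in channels)


for channels in ([256, 128], [128, 64], [64, 128], [128, 256]):
    print(channels, sdi_params(channels))
```

The four results are 56576, 16000, 8000, and 28288, which match the `params` column for layers 11, 14, 17, and 20 in the summary.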