RT-DETR Improvement Strategy [Exclusive Fusion Improvement] | AssemFormer + HS-FPN: Reducing the Impact of Object Scale Variation and Strengthening Multi-Scale Learning
I. Introduction
This article documents how AssemFormer is used to improve the RT-DETR object detection network. In this fusion improvement, AssemFormer strengthens multi-scale learning during feature propagation and fusion and, combined with HS-FPN, reduces detection errors caused by scale variation and insufficient features, noticeably improving the accuracy and stability of RT-DETR across detection tasks.
II. AssemFormer Overview
Exploiting Scale-Variant Attention for Segmenting Small Medical Objects
2.1 Design Motivation
- Addressing the limitations of traditional methods: Conventional deep-learning algorithms face many challenges when handling small objects in medical images. For example, convolution and pooling cause information loss and compression artifacts, and these problems become more pronounced for small medical objects as the network deepens. Traditional attention mechanisms usually produce fixed-dimension attention maps, which are often insufficient for analyzing medical images because they focus mainly on central features and ignore the rich contextual information in the background that is critical for clinical interpretation.
- Combining the strengths of convolution and Transformers: Convolution focuses on learning local and general features of medical objects, such as corners, edges, angles, and color, while Transformer blocks use multi-head self-attention to extract global information, including morphology, depth, and color distribution, and also learn positional relationships between medical objects. The AssemFormer module is designed to combine the advantages of both.
2.2 Principle
2.2.1 Structure
The AssemFormer module contains a 3×3 convolution and a 1×1 convolution, followed by two Transformer blocks and two convolution operations. It connects the convolutional and Transformer operations by stacking and splitting feature maps, as sketched below.
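A conceptual PyTorch sketch of this assemble pattern (placeholder layers stand in for the real Transformer blocks; the module's actual implementation is given in Section IV):
import torch
from torch import nn

class AssembleSketch(nn.Module):
    """Toy illustration: a conv branch and a Transformer-style branch are
    computed on the same feature map, stacked along channels, projected
    back, and added to the input."""
    def __init__(self, c: int):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.SiLU(),  # 3x3 conv: local/general features
            nn.Conv2d(c, c // 2, 1),                   # 1x1 conv: reduce channels
        )
        self.global_rep = nn.Conv2d(c // 2, c // 2, 1)  # stand-in for the two Transformer blocks
        self.proj = nn.Conv2d(c, c, 1)                  # project the stacked branches back to c channels

    def forward(self, x):
        local = self.local(x)
        global_ = self.global_rep(local)                       # global context in the real module
        fused = self.proj(torch.cat((global_, local), dim=1))  # "stack" local + global feature maps
        return x + fused                                       # residual connection

print(AssembleSketch(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])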
2.2.2 Attention Mechanism
AssemFormer adopts multi-head self-attention (MHSA), formulated as

$$\mathcal{A}_{ViT}(q,k,v)=\mathrm{softmax}\left(\frac{qk^{T}}{\sqrt{D_{h}}}\right)v$$

where $q$, $k$, and $v$ are the query, key, and value vectors of the input sequence $z\in\mathbb{R}^{N\times D}$, $N$ is the number of patches, $D$ is the patch size, and $D_{h}=D/m$ for $m$ self-attention operations. This mechanism facilitates interaction between patches and enriches contextual information.
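As a reference, a minimal PyTorch sketch of this attention computation for a single head (the shapes and the function name vit_attention are illustrative only):
import torch
import torch.nn.functional as F

def vit_attention(q, k, v):
    """softmax(q @ k^T / sqrt(D_h)) @ v for tensors of shape [N, D_h]."""
    d_h = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_h ** 0.5  # [N, N] patch-to-patch similarities
    return F.softmax(scores, dim=-1) @ v           # weighted sum of values, [N, D_h]

# toy example: 16 patches, head dimension D_h = 32, self-attention with q = k = v = z
z = torch.randn(16, 32)
print(vit_attention(z, z, z).shape)  # torch.Size([16, 32])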
2.3 Characteristics
- Multi-scale feature fusion: By stacking and splitting feature maps, the module learns local and global representations of the input medical image simultaneously, capturing features at different scales and providing more comprehensive information for accurately segmenting small medical objects.
- Improved segmentation performance: In the ablation studies, AssemFormer clearly improves SvANet's segmentation performance; across different datasets, SvANet with AssemFormer achieves strong results on all evaluation metrics, demonstrating its effectiveness for small-medical-object segmentation.
- Enhanced feature representation: The evolution of the feature maps shows that AssemFormer progressively highlights smaller regions that align more accurately with the ground truth. Across different medical-object segmentation scenarios it focuses better on the target regions, improving the visibility and precise localization of small medical objects and bringing out their morphological details and exact positions.
Paper: https://arxiv.org/abs/2407.07720
Code: https://github.com/anthonyweidai/SvANet
III. HS-FPN Overview
Accurate Leukocyte Detection Based on Deformable-DETR and Multi-Level Feature Fusion for Aiding Diagnosis of Blood Diseases
The HS-FPN structure consists of a feature selection module and a feature fusion module.
- In the feature selection module, the CA (channel attention) module first processes the input feature maps, determining per-channel weights through pooling and an activation function to filter the feature maps; the DM (dimension matching) module then reduces the channel dimension of feature maps at different scales.
- In the feature fusion module, the SFF (selective feature fusion) mechanism uses high-level features as weights to screen the semantic information of low-level features before fusing them, improving the model's detection ability.
3.1 Motivation
In leukocyte datasets, leukocyte recognition faces a multi-scale problem: different types of leukocytes typically differ in diameter, and the same leukocyte can appear at different sizes under different microscopes. This makes it difficult for the model to recognize leukocytes accurately, so HS-FPN is designed to perform multi-scale feature fusion and help the model capture more comprehensive leukocyte feature information.
3.2 Structure and Principle
- Feature selection module: composed of a CA module and a DM module. For an input feature map $f_{in}\in\mathbb{R}^{C\times H\times W}$, the CA module first applies global average pooling and global max pooling, combines the results, and passes them through a Sigmoid activation to obtain per-channel weights $f_{CA}\in\mathbb{R}^{C\times 1\times 1}$; multiplying these weights with the feature map at the corresponding scale yields the filtered feature map. Because feature maps at different scales have different channel counts, the DM module uses 1×1 convolutions to reduce every scale to 256 channels.
- Feature fusion module: among the multi-scale feature maps produced by the backbone, high-level features carry rich semantics but localize objects coarsely, while low-level features localize precisely but carry limited semantics. Direct pixel-wise summation for fusion is deficient, so the SFF module uses high-level features as weights to select the key semantic information in the low-level features. Given a high-level feature $f_{high}\in\mathbb{R}^{C\times H\times W}$ and a low-level feature $f_{low}\in\mathbb{R}^{C\times H_{1}\times W_{1}}$, the high-level feature is first expanded by a transposed convolution with stride 2 and a 3×3 kernel, then aligned by bilinear interpolation to obtain $f_{att}\in\mathbb{R}^{C\times H_{1}\times W_{1}}$; the CA module converts the high-level feature into attention weights that filter the low-level feature, and the results are fused into $f_{out}\in\mathbb{R}^{C\times H_{1}\times W_{1}}$, as sketched below. The fusion process is:

$$f_{att}=BL\left(T\text{-}Conv\left(f_{high}\right)\right)$$
$$f_{out}=f_{low}\ast CA\left(f_{att}\right)+f_{att}$$
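A minimal PyTorch sketch of this fusion step (the channel attention is reduced to a simple sigmoid-gated pooling; SFFSketch and its hyper-parameters are illustrative, not the paper's exact code):
import torch
from torch import nn
import torch.nn.functional as F

class SFFSketch(nn.Module):
    """f_att = BL(T-Conv(f_high));  f_out = f_low * CA(f_att) + f_att."""
    def __init__(self, c: int = 256):
        super().__init__()
        self.t_conv = nn.ConvTranspose2d(c, c, kernel_size=3, stride=2, padding=1, output_padding=1)
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1), nn.Sigmoid())  # simplified CA

    def forward(self, f_high, f_low):
        f_att = self.t_conv(f_high)  # expand the high-level feature (stride-2 transposed conv)
        f_att = F.interpolate(f_att, size=f_low.shape[-2:], mode="bilinear", align_corners=False)
        return f_low * self.ca(f_att) + f_att  # filter low-level features with CA weights, then fuse

f_high = torch.randn(1, 256, 20, 20)
f_low = torch.randn(1, 256, 40, 40)
print(SFFSketch()(f_high, f_low).shape)  # torch.Size([1, 256, 40, 40])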
3.3 Role
HS-FPN uses the channel attention module to filter low-level features with high-level semantic features as weights, then adds the filtered features to the high-level features point-wise, achieving multi-scale feature fusion. This strengthens the model's feature representation, helps it detect subtle features, and enhances its detection capability.
Paper: https://arxiv.org/pdf/2212.11677
Code: https://github.com/Barrett-python/DuAT
IV. Implementation Code for AssemFormer and HS-FPN
The implementation of the AssemFormer module is as follows:
import numpy as np
from typing import Union, Sequence, Tuple, Optional
import torch
from torch import nn, Tensor
import torch.nn.functional as F
from typing import Any, Callable
from torchvision.ops import StochasticDepth as StochasticDepthTorch
from ultralytics.nn.modules.conv import LightConv
from ultralytics.utils.torch_utils import fuse_conv_and_bn
class Dropout(nn.Dropout):
def __init__(self, p: float=0.5, inplace: bool=False):
super(Dropout, self).__init__(p=p, inplace=inplace)
class StochasticDepth(StochasticDepthTorch):
def __init__(self, p: float, Mode: str="row") -> None:
super().__init__(p, Mode)
def pair(Val):
return Val if isinstance(Val, (tuple, list)) else (Val, Val)
def makeDivisible(v: float, divisor: int, min_value: Optional[int] = None) -> int:
"""
This function is taken from the original tf repo.
It ensures that all layers have a channel number that is divisible by 8
It can be seen here:
https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.Py
"""
if min_value is None:
min_value = divisor
new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
# Make sure that round down does not go down by more than 10%.
if new_v < 0.9 * v:
new_v += divisor
return new_v
class LinearSelfAttention(nn.Module):
"""
This layer applies a self-attention with linear complexity, as described in `MobileViTv2 <https://arxiv.org/abs/2206.02680>`_ paper.
This layer can be used for self- as well as cross-attention.
Args:
opts: command line arguments
DimEmbed (int): :math:`C` from an expected input of size :math:`(N, C, H, W)`
AttnDropRate (Optional[float]): Dropout value for context scores. Default: 0.0
bias (Optional[bool]): Use bias in learnable layers. Default: True
Shape:
- Input: :math:`(N, C, P, N)` where :math:`N` is the batch size, :math:`C` is the input channels,
:math:`P` is the number of pixels in the patch, and :math:`N` is the number of patches
- Output: same as the input
.. note::
For MobileViTv2, we unfold the feature map [B, C, H, W] into [B, C, P, N] where P is the number of pixels
in a patch and N is the number of patches. Because channel is the first dimension in this unfolded tensor,
we use point-wise convolution (instead of a linear layer). This avoids a transpose operation (which may be
expensive on resource-constrained devices) that may be required to convert the unfolded tensor from
channel-first to channel-last format in case of a linear layer.
"""
def __init__(
self,
DimEmbed: int,
AttnDropRate: Optional[float]=0.0,
Bias: Optional[bool]=True,
) -> None:
super().__init__()
self.qkv_proj = BaseConv2d(DimEmbed, 1 + (2 * DimEmbed), 1, bias=Bias)
self.AttnDropRate = Dropout(p=AttnDropRate)
self.out_proj = BaseConv2d(DimEmbed, DimEmbed, 1, bias=Bias)
self.DimEmbed = DimEmbed
def forward(self, x: Tensor) -> Tensor:
# [B, C, P, N] --> [B, h + 2d, P, N]
qkv = self.qkv_proj(x)
# Project x into query, key and value
# Query --> [B, 1, P, N]
# value, key --> [B, d, P, N]
query, key, value = torch.split(
qkv, split_size_or_sections=[1, self.DimEmbed, self.DimEmbed], dim=1
)
# apply softmax along N dimension
context_scores = F.softmax(query, dim=-1)
# Uncomment below line to visualize context scores
# self.visualize_context_scores(context_scores=context_scores)
context_scores = self.AttnDropRate(context_scores)
# Compute context vector
# [B, d, P, N] x [B, 1, P, N] -> [B, d, P, N]
context_vector = key * context_scores
# [B, d, P, N] --> [B, d, P, 1]
context_vector = torch.sum(context_vector, dim=-1, keepdim=True)
# combine context vector with values
# [B, d, P, N] * [B, d, P, 1] --> [B, d, P, N]
out = F.relu(value) * context_vector.expand_as(value)
out = self.out_proj(out)
return out
class LinearAttnFFN(nn.Module):
def __init__(
self,
DimEmbed: int,
DimFfnLatent: int,
AttnDropRate: Optional[float] = 0.0,
DropRate: Optional[float] = 0.1,
FfnDropRate: Optional[float] = 0.0,
) -> None:
super().__init__()
AttnUnit = LinearSelfAttention(DimEmbed, AttnDropRate, Bias=True)
self.PreNormAttn = nn.Sequential(
nn.BatchNorm2d(DimEmbed),
AttnUnit,
Dropout(DropRate),
)
self.PreNormFfn = nn.Sequential(
nn.BatchNorm2d(DimEmbed),
BaseConv2d(DimEmbed, DimFfnLatent, 1, 1, ActLayer=nn.SiLU),
Dropout(FfnDropRate),
BaseConv2d(DimFfnLatent, DimEmbed, 1, 1),
Dropout(DropRate),
)
self.DimEmbed = DimEmbed
def forward(self, x: Tensor) -> Tensor:
# self-attention
x = x + self.PreNormAttn(x)
# Feed forward network
x = x + self.PreNormFfn(x)
return x
class BaseConv2d(nn.Module):
def __init__(
self,
in_channels: int,
out_channels: int,
kernel_size: int,
stride: Optional[int] = 1,
padding: Optional[int] = None,
groups: Optional[int] = 1,
bias: Optional[bool] = None,
BNorm: bool = False,
# norm_layer: Optional[Callable[..., nn.Module]]=nn.BatchNorm2d,
ActLayer: Optional[Callable[..., nn.Module]] = None,
dilation: int = 1,
Momentum: Optional[float] = 0.1,
**kwargs: Any
) -> None:
super(BaseConv2d, self).__init__()
if padding is None:
padding = int((kernel_size - 1) // 2 * dilation)
if bias is None:
bias = not BNorm
self.in_channels = in_channels
self.out_channels = out_channels
self.kernel_size = kernel_size
self.stride = stride
self.padding = padding
self.groups = groups
self.bias = bias
self.Conv = nn.Conv2d(in_channels, out_channels,
kernel_size, stride, padding, dilation, groups, bias, **kwargs)
self.Bn = nn.BatchNorm2d(out_channels, eps=0.001, momentum=Momentum) if BNorm else nn.Identity()
if ActLayer is not None:
if isinstance(list(ActLayer().named_modules())[0][1], nn.Sigmoid):
self.Act = ActLayer()
else:
self.Act = ActLayer(inplace=True)
else:
self.Act = ActLayer
def forward(self, x: Tensor) -> Tensor:
x = self.Conv(x)
x = self.Bn(x)
if self.Act is not None:
x = self.Act(x)
return x
class BaseFormer(nn.Module):
def __init__(
self,
InChannels: int,
FfnMultiplier: Optional[Union[Sequence[Union[int, float]], int, float]] = 2.0,
NumAttnBlocks: Optional[int] = 2,
AttnDropRate: Optional[float] = 0.0,
DropRate: Optional[float] = 0.0,
FfnDropRate: Optional[float] = 0.0,
PatchRes: Optional[int] = 2,
Dilation: Optional[int] = 1,
ViTSELayer: Optional[nn.Module] = None,
**kwargs: Any,
) -> None:
DimAttnUnit = InChannels // 2
DimCNNOut = DimAttnUnit
Conv3x3In = BaseConv2d(
InChannels, InChannels, 3, 1, dilation=Dilation,
BNorm=True, ActLayer=nn.SiLU,
) # depth-wise separable convolution
ViTSELayer = ViTSELayer(InChannels, **kwargs) if ViTSELayer is not None else nn.Identity()
Conv1x1In = BaseConv2d(InChannels, DimCNNOut, 1, 1, bias=False)
super(BaseFormer, self).__init__()
self.LocalRep = nn.Sequential(Conv3x3In, ViTSELayer, Conv1x1In)
self.GlobalRep, DimAttnUnit = self.buildAttnLayer(
DimAttnUnit, FfnMultiplier, NumAttnBlocks, AttnDropRate, DropRate, FfnDropRate,
)
self.ConvProj = BaseConv2d(DimCNNOut, InChannels, 1, 1, BNorm=True)
self.DimCNNOut = DimCNNOut
self.HPatch, self.WPatch = pair(PatchRes)
self.PatchArea = self.WPatch * self.HPatch
def buildAttnLayer(
self,
DimModel: int,
FfnMult: Union[Sequence, int, float],
NumAttnBlocks: int,
AttnDropRate: float,
DropRate: float,
FfnDropRate: float,
) -> Tuple[nn.Module, int]:
if isinstance(FfnMult, Sequence) and len(FfnMult) == 2:
DimFfn = (
np.linspace(FfnMult[0], FfnMult[1], NumAttnBlocks, dtype=float) * DimModel
)
elif isinstance(FfnMult, Sequence) and len(FfnMult) == 1:
DimFfn = [FfnMult[0] * DimModel] * NumAttnBlocks
elif isinstance(FfnMult, (int, float)):
DimFfn = [FfnMult * DimModel] * NumAttnBlocks
else:
raise NotImplementedError
# ensure that dims are multiple of 16
DimFfn = [makeDivisible(d, 16) for d in DimFfn]
GlobalRep = [
LinearAttnFFN(DimModel, DimFfn[block_idx], AttnDropRate, DropRate, FfnDropRate)
for block_idx in range(NumAttnBlocks)
]
GlobalRep.append(nn.BatchNorm2d(DimModel))
return nn.Sequential(*GlobalRep), DimModel
def unfolding(self, FeatureMap: Tensor) -> Tuple[Tensor, Tuple[int, int]]:
B, C, H, W = FeatureMap.shape
# [B, C, H, W] --> [B, C, P, N]
Patches = F.unfold(
FeatureMap,
kernel_size=(self.HPatch, self.WPatch),
stride=(self.HPatch, self.WPatch),
)
Patches = Patches.reshape(
B, C, self.HPatch * self.WPatch, -1
)
return Patches, (H, W)
def folding(self, Patches: Tensor, OutputSize: Tuple[int, int]) -> Tensor:
B, C, P, N = Patches.shape # BatchSize, DimIn, PatchSize, NumPatches
# [B, C, P, N]
Patches = Patches.reshape(B, C * P, N)
FeatureMap = F.fold(
Patches,
output_size=OutputSize,
kernel_size=(self.HPatch, self.WPatch),
stride=(self.HPatch, self.WPatch),
)
return FeatureMap
def forward(self, x: Tensor, *args, **kwargs) -> Tensor:
Fm = self.LocalRep(x)
# convert feature map to patches
Patches, OutputSize = self.unfolding(Fm)
# learn global representations on all patches
Patches = self.GlobalRep(Patches)
# [B x Patch x Patches x C] --> [B x C x Patches x Patch]
Fm = self.folding(Patches, OutputSize)
Fm = self.ConvProj(Fm)
return Fm
#AssemFormer, a method that combines convolution with a vision transformer by assembling tensors.
class AssemFormer(BaseFormer):
"""
Inspired by MobileViTv3.
Adapted from https://github.com/micronDLA/MobileViTv3/blob/main/MobileViTv3-v2/cvnets/modules/mobilevit_block.py
"""
def __init__(
self,
InChannels: int,
FfnMultiplier: Optional[Union[Sequence[Union[int, float]], int, float]] = 2.0,
NumAttnBlocks: Optional[int] = 2,
AttnDropRate: Optional[float] = 0.0,
DropRate: Optional[float] = 0.0,
FfnDropRate: Optional[float] = 0.0,
PatchRes: Optional[int] = 2,
Dilation: Optional[int] = 1,
SDProb: Optional[float] = 0.0,
ViTSELayer: Optional[nn.Module] = None,
**kwargs: Any,
) -> None:
super().__init__(InChannels, FfnMultiplier, NumAttnBlocks, AttnDropRate,
DropRate, FfnDropRate, PatchRes, Dilation, ViTSELayer, **kwargs)
# AssembleFormer: input changed from just global to local + global
self.ConvProj = BaseConv2d(2 * self.DimCNNOut, InChannels, 1, 1, BNorm=True)
self.Dropout = StochasticDepth(SDProb)
def forward(self, x: Tensor) -> Tensor:
FmConv = self.LocalRep(x)
# convert feature map to patches
Patches, OutputSize = self.unfolding(FmConv)
# learn global representations on all patches
Patches = self.GlobalRep(Patches)
# [B x Patch x Patches x C] --> [B x C x Patches x Patch]
Fm = self.folding(Patches, OutputSize)
# AssembleFormer: local + global instead of only global
Fm = self.ConvProj(torch.cat((Fm, FmConv), dim=1))
# AssembleFormer: skip connection
return x + self.Dropout(Fm)
def autopad(k, p=None, d=1): # kernel, padding, dilation
"""Pad to 'same' shape outputs."""
if d > 1:
k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k] # actual kernel-size
if p is None:
p = k // 2 if isinstance(k, int) else [x // 2 for x in k] # auto-pad
return p
class Conv(nn.Module):
"""Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)."""
default_act = nn.SiLU() # default activation
def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
"""Initialize Conv layer with given arguments including activation."""
super().__init__()
self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
self.bn = nn.BatchNorm2d(c2)
self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()
def forward(self, x):
"""Apply convolution, batch normalization and activation to input tensor."""
return self.act(self.bn(self.conv(x)))
def forward_fuse(self, x):
"""Perform transposed convolution of 2D data."""
return self.act(self.conv(x))
class HGBlock_AssemFormer(nn.Module):
"""
HG_Block of PPHGNetV2 with 2 convolutions and LightConv.
https://github.com/PaddlePaddle/PaddleDetection/blob/develop/ppdet/modeling/backbones/hgnet_v2.py
"""
def __init__(self, c1, cm, c2, k=3, n=6, lightconv=False, shortcut=False, act=nn.ReLU()):
"""Initializes a CSP Bottleneck with 1 convolution using specified input and output channels."""
super().__init__()
block = LightConv if lightconv else Conv
self.m = nn.ModuleList(block(c1 if i == 0 else cm, cm, k=k, act=act) for i in range(n))
self.sc = Conv(c1 + n * cm, c2 // 2, 1, 1, act=act) # squeeze conv
self.ec = Conv(c2 // 2, c2, 1, 1, act=act) # excitation conv
self.add = shortcut and c1 == c2
self.cv = AssemFormer(c2)
def forward(self, x):
"""Forward pass of a PPHGNetV2 backbone layer."""
y = [x]
y.extend(m(y[-1]) for m in self.m)
y = self.cv(self.ec(self.sc(torch.cat(y, 1))))
return y + x if self.add else y
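A quick shape check for the block defined above (the input size and channel settings are arbitrary and must leave the spatial size divisible by the 2×2 patch used inside AssemFormer):
x = torch.randn(1, 512, 40, 40)
block = HGBlock_AssemFormer(c1=512, cm=192, c2=1024, k=5, n=6, lightconv=True, shortcut=False)
print(block(x).shape)  # expected: torch.Size([1, 1024, 40, 40])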
The implementation of the HS-FPN modules is as follows:
import torch
import torch.nn as nn
class ChannelAttention_HSFPN(nn.Module):
def __init__(self, in_planes, ratio=4, flag=True):
super(ChannelAttention_HSFPN, self).__init__()
self.avg_pool = nn.AdaptiveAvgPool2d(1)
self.max_pool = nn.AdaptiveMaxPool2d(1)
self.conv1 = nn.Conv2d(in_planes, in_planes // ratio, 1, bias=False)
self.relu = nn.ReLU()
self.conv2 = nn.Conv2d(in_planes // ratio, in_planes, 1, bias=False)
self.flag = flag
self.sigmoid = nn.Sigmoid()
nn.init.xavier_uniform_(self.conv1.weight)
nn.init.xavier_uniform_(self.conv2.weight)
def forward(self, x):
avg_out = self.conv2(self.relu(self.conv1(self.avg_pool(x))))
max_out = self.conv2(self.relu(self.conv1(self.max_pool(x))))
out = avg_out + max_out
return self.sigmoid(out) * x if self.flag else self.sigmoid(out)
class Multiply(nn.Module):
def __init__(self) -> None:
super().__init__()
def forward(self, x):
return x[0] * x[1]
class Add(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
return torch.sum(torch.stack(x, dim=0), dim=0)
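A quick check of how these modules are combined in the head below: with flag=False, ChannelAttention_HSFPN returns only the attention weights, which Multiply applies to the low-level feature before Add fuses in the high-level feature (tensor sizes are arbitrary):
x_low = torch.randn(1, 256, 40, 40)   # low-level feature, already reduced to 256 channels
x_att = torch.randn(1, 256, 40, 40)   # upsampled high-level feature (f_att)
weights = ChannelAttention_HSFPN(256, ratio=4, flag=False)(x_att)  # [1, 256, 1, 1] attention weights
fused = Add()([Multiply()([weights, x_low]), x_att])               # f_out = f_low * CA(f_att) + f_att
print(fused.shape)  # torch.Size([1, 256, 40, 40])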
V. Integration Steps
For the AssemFormer integration steps, refer to:
For the HS-FPN integration steps, refer to:
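For reference, registering the new modules generally means importing them in ultralytics/nn/tasks.py and extending parse_model so it can infer channel sizes for them. The fragments below are only a hedged sketch: the module paths under ultralytics/nn/AddModules/ and the exact branch layout are assumptions and differ between ultralytics versions.
# ultralytics/nn/tasks.py (fragments to splice into the existing code; adapt to your version)
from ultralytics.nn.AddModules.AssemFormer import HGBlock_AssemFormer       # assumed module path
from ultralytics.nn.AddModules.HSFPN import ChannelAttention_HSFPN, Multiply, Add

# inside parse_model(), extend the existing HGStem/HGBlock branch:
elif m in (HGStem, HGBlock, HGBlock_AssemFormer):
    cm, c2 = args[0], args[1]
    args = [ch[f], cm, c2, *args[2:]]
    if m in (HGBlock, HGBlock_AssemFormer):
        args.insert(4, n)  # number of repeats
        n = 1
# assumed new branches for the HS-FPN modules (output channels follow the input):
elif m is ChannelAttention_HSFPN:
    c2 = ch[f]
    args = [c2, *args]
elif m in (Multiply, Add):
    c2 = ch[f[-1]]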
VI. YAML Model File
6.1 Model Modification ⭐
After the code is configured, set up the model's YAML file. Taking ultralytics/cfg/models/rt-detr/rtdetr-l.yaml as an example, create a model file for training on your own dataset, rtdetr-l-AssemFormer-HSFPN.yaml, in the same directory. Copy the contents of rtdetr-l.yaml into rtdetr-l-AssemFormer-HSFPN.yaml and change nc to the number of object classes in your dataset.
# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-l object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/models/rtdetr
# Parameters
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
# [depth, width, max_channels]
l: [1.00, 1.00, 1024]
backbone:
# [from, repeats, module, args]
- [-1, 1, HGStem, [32, 48]] # 0-P2/4
- [-1, 6, HGBlock, [48, 128, 3]] # stage 1
- [-1, 1, DWConv, [128, 3, 2, 1, False]] # 2-P3/8
- [-1, 6, HGBlock, [96, 512, 3]] # stage 2
- [-1, 1, DWConv, [512, 3, 2, 1, False]] # 4-P4/16
- [-1, 6, HGBlock_AssemFormer, [192, 1024, 5, True, False]] # cm, c2, k, light, shortcut
- [-1, 6, HGBlock_AssemFormer, [192, 1024, 5, True, True]]
- [-1, 6, HGBlock_AssemFormer, [192, 1024, 5, True, True]] # stage 3
- [-1, 1, DWConv, [1024, 3, 2, 1, False]] # 8-P5/32
- [-1, 6, HGBlock, [384, 2048, 5, True, False]] # stage 4
head:
- [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 10 input_proj.2
- [-1, 1, AIFI, [1024, 8]] # 11
- [-1, 1, Conv, [256, 1, 1]] # 12, Y5, lateral_convs.0
- [-1, 1, ChannelAttention_HSFPN, []] # 13
- [-1, 1, nn.Conv2d, [256, 1]] # 14
- [-1, 1, nn.ConvTranspose2d, [256, 3, 2, 1, 1]] # 15
- [7, 1, ChannelAttention_HSFPN, []] # 16
- [-1, 1, nn.Conv2d, [256, 1]] # 17
- [15, 1, ChannelAttention_HSFPN, [4, False]] # 18
- [[-1, -2], 1, Multiply, []] # 19
- [[-1, 15], 1, Add, []] # 20
- [-1, 3, RepC3, [256]] # 21 P4/16
- [15, 1, nn.ConvTranspose2d, [256, 3, 2, 1, 1, 16]] # 22
- [3, 1, ChannelAttention_HSFPN, []] # 23
- [-1, 1, nn.Conv2d, [256, 1]] # 24
- [22, 1, ChannelAttention_HSFPN, [4, False]] # 25
- [[-1, -2], 1, Multiply, []] # 26
- [[-1, 22], 1, Add, []] # 27
- [-1, 3, RepC3, [256]] # 28 P3/8
- [[28, 21, 14], 1, RTDETRDecoder, [nc]] # Detect(P3, P4, P5)
VII. Successful Run Results
Printing the network model shows that the AssemFormer module and the HS-FPN structure have been added to the model, which can now be trained.
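To reproduce the printout below, the model can be built from the new YAML and inspected; a minimal sketch (the YAML path and data.yaml are placeholders for your own setup):
from ultralytics import RTDETR

model = RTDETR("ultralytics/cfg/models/rt-detr/rtdetr-l-AssemFormer-HSFPN.yaml")  # assumed path from Section VI
model.info()  # prints the layer table and parameter summary shown below
# model.train(data="data.yaml", epochs=100, imgsz=640, batch=4)  # start training on your own dataset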
rtdetr-AssemFormer-HSFPN :
from n params module arguments
0 -1 1 25248 ultralytics.nn.modules.block.HGStem [3, 32, 48]
1 -1 6 155072 ultralytics.nn.modules.block.HGBlock [48, 48, 128, 3, 6]
2 -1 1 1408 ultralytics.nn.modules.conv.DWConv [128, 128, 3, 2, 1, False]
3 -1 6 839296 ultralytics.nn.modules.block.HGBlock [128, 96, 512, 3, 6]
4 -1 1 5632 ultralytics.nn.modules.conv.DWConv [512, 512, 3, 2, 1, False]
5 -1 6 16391810 ultralytics.nn.AddModules.AssemFormer.HGBlock_AssemFormer[512, 192, 1024, 5, 6, True, False]
6 -1 6 16752258 ultralytics.nn.AddModules.AssemFormer.HGBlock_AssemFormer[1024, 192, 1024, 5, 6, True, True]
7 -1 6 16752258 ultralytics.nn.AddModules.AssemFormer.HGBlock_AssemFormer[1024, 192, 1024, 5, 6, True, True]
8 -1 1 11264 ultralytics.nn.modules.conv.DWConv [1024, 1024, 3, 2, 1, False]
9 -1 6 6708480 ultralytics.nn.modules.block.HGBlock [1024, 384, 2048, 5, 6, True, False]
10 -1 1 524800 ultralytics.nn.modules.conv.Conv [2048, 256, 1, 1, None, 1, 1, False]
11 -1 1 789760 ultralytics.nn.modules.transformer.AIFI [256, 1024, 8]
12 -1 1 66048 ultralytics.nn.modules.conv.Conv [256, 256, 1, 1]
13 -1 1 32768 ultralytics.nn.AddModules.HSFPN.ChannelAttention_HSFPN[256]
14 -1 1 65792 torch.nn.modules.conv.Conv2d [256, 256, 1]
15 -1 1 590080 torch.nn.modules.conv.ConvTranspose2d [256, 256, 3, 2, 1, 1]
16 7 1 524288 ultralytics.nn.AddModules.HSFPN.ChannelAttention_HSFPN[1024]
17 -1 1 262400 torch.nn.modules.conv.Conv2d [1024, 256, 1]
18 15 1 32768 ultralytics.nn.AddModules.HSFPN.ChannelAttention_HSFPN[256, 4, False]
19 [-1, -2] 1 0 ultralytics.nn.AddModules.HSFPN.Multiply []
20 [-1, 15] 1 0 ultralytics.nn.AddModules.HSFPN.Add []
21 -1 3 2101248 ultralytics.nn.modules.block.RepC3 [256, 256, 3]
22 15 1 37120 torch.nn.modules.conv.ConvTranspose2d [256, 256, 3, 2, 1, 1, 16]
23 3 1 131072 ultralytics.nn.AddModules.HSFPN.ChannelAttention_HSFPN[512]
24 -1 1 131328 torch.nn.modules.conv.Conv2d [512, 256, 1]
25 22 1 32768 ultralytics.nn.AddModules.HSFPN.ChannelAttention_HSFPN[256, 4, False]
26 [-1, -2] 1 0 ultralytics.nn.AddModules.HSFPN.Multiply []
27 [-1, 22] 1 0 ultralytics.nn.AddModules.HSFPN.Add []
28 -1 3 2101248 ultralytics.nn.modules.block.RepC3 [256, 256, 3]
29 [28, 21, 14] 1 7303907 ultralytics.nn.modules.head.RTDETRDecoder [1, [256, 256, 256]]
rtdetr-AssemFormer-HSFPN summary: 817 layers, 72,370,121 parameters, 72,370,121 gradients, 237.9 GFLOPs