RT-DETR Improvement Strategy [Conv & Transformer] | 2024 AssemFormer: Combining the Strengths of Convolution and Transformers to Remedy the Shortcomings of Traditional Methods
1. Introduction
This post documents how the RT-DETR object detection network is improved with AssemFormer. Traditional convolution and pooling operations cause information loss and compression defects, and conventional attention mechanisms usually produce fixed-dimension attention maps that ignore the rich contextual information in the background. Here, AssemFormer is used to improve RT-DETR, adding multi-scale learning capability to the feature propagation and fusion process.
2. AssemFormer Overview
Exploiting Scale-Variant Attention for Segmenting Small Medical Objects
2.1 Design Motivation
- Addressing the shortcomings of traditional methods: traditional deep learning algorithms face many challenges when handling small objects in medical images. For example, convolution and pooling operations cause information loss and compression defects, and these problems become more pronounced for small medical objects as the network deepens. Conventional attention mechanisms usually produce fixed-dimension attention maps, which are often insufficient for analyzing medical images because they focus mainly on central features and ignore the rich contextual information in the background, information that is crucial for clinical interpretation.
- Combining the strengths of convolution and Transformers: convolution focuses on learning local, generic features of medical objects such as corners, edges, angles and colors, whereas the Transformer block uses multi-head self-attention to extract global information about medical objects, including morphology, depth and color distribution, and also learns their positional relationships. The AssemFormer module was designed to combine the strengths of both.
2.2 Principle
2.2.1 Structure
The AssemFormer module contains a 3×3 convolution and a 1×1 convolution, followed by two Transformer blocks and two convolution operations. It connects the convolution and Transformer operations by stacking and splitting feature maps.
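For intuition, here is a minimal, runnable sketch of this data flow (my own illustration, not code from the paper), mirroring the forward pass of the implementation in Section 3; the layers are simplified placeholders, and the real module additionally uses normalization, activations, an optional SE layer and stochastic depth.

import torch
import torch.nn.functional as F
from torch import nn

B, C_in, H, W, patch = 1, 64, 32, 32, 2
conv3x3 = nn.Conv2d(C_in, C_in, 3, padding=1)      # local feature extraction
conv1x1 = nn.Conv2d(C_in, C_in // 2, 1)            # channel reduction before the Transformer
transformer = nn.Identity()                        # stand-in for the two Transformer blocks
conv_proj = nn.Conv2d(C_in, C_in, 1)               # projection after stacking local + global

x = torch.randn(B, C_in, H, W)
fm_conv = conv1x1(conv3x3(x))                      # [B, C_in/2, H, W]
# unfold into non-overlapping 2x2 patches: [B, C_in/2, P, N]
patches = F.unfold(fm_conv, patch, stride=patch).reshape(B, C_in // 2, patch * patch, -1)
patches = transformer(patches)                     # global representation over patches
# fold the patches back into a feature map: [B, C_in/2, H, W]
fm = F.fold(patches.reshape(B, C_in // 2 * patch * patch, -1), (H, W), patch, stride=patch)
# "assemble": stack local and global features, project back to C_in, add the skip connection
out = x + conv_proj(torch.cat((fm, fm_conv), dim=1))
print(out.shape)  # torch.Size([1, 64, 32, 32])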
2.2.2 Attention Mechanism
AssemFormer uses multi-head self-attention (MHSA), given by

$$\mathcal{A}_{ViT}(q,k,v)=\mathrm{softmax}\left(\frac{qk^{T}}{\sqrt{D_{h}}}\right)v$$

where $q$, $k$ and $v$ are the query, key and value vectors of the input sequence $z\in\mathbb{R}^{N\times D}$, $N$ is the number of patches, $D$ is the dimension of each patch embedding, and $D_{h}=D/m$ when $m$ self-attention heads are used. This mechanism facilitates interaction between patches and enriches contextual information.
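As a point of reference, the following is a minimal single-head implementation of this scaled dot-product attention (my own sketch, not code from the paper; note that the SvANet code in Section 3 actually uses the linear-complexity attention from MobileViTv2 rather than this quadratic form).

import torch
import torch.nn.functional as F

N, D_h = 16, 32                      # number of patches, per-head dimension D_h = D / m
q = torch.randn(N, D_h)              # query
k = torch.randn(N, D_h)              # key
v = torch.randn(N, D_h)              # value

scores = q @ k.T / D_h ** 0.5        # [N, N] pairwise patch similarities, scaled by sqrt(D_h)
attn = F.softmax(scores, dim=-1)     # attention weights over patches
out = attn @ v                       # [N, D_h] context-enriched patch features
print(out.shape)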
2.3 Characteristics
- Multi-scale feature fusion: by stacking and splitting feature maps, the module learns local and global representations of the input medical image simultaneously, capturing features at different scales and providing more comprehensive information for accurately segmenting small medical objects.
- Improved segmentation performance: in the ablation studies, AssemFormer significantly improves the segmentation performance of SvANet. Across different datasets, SvANet with AssemFormer achieves good scores on all evaluation metrics, demonstrating its effectiveness for small medical object segmentation.
- Enhanced feature representation: the evolution of the feature maps shows that AssemFormer progressively highlights smaller regions that align more accurately with the ground truth. In different medical object segmentation scenarios it focuses better on the target region, improving the visibility and precise localization of small medical objects and highlighting their morphological details and exact positions.
Paper: https://arxiv.org/abs/2407.07720
Source code: https://github.com/anthonyweidai/SvANet
3. AssemFormer Implementation Code
The implementation of the AssemFormer module is as follows:
import numpy as np
from typing import Union, Sequence, Tuple, Optional
import torch
from torch import nn, Tensor
import torch.nn.functional as F
from typing import Any, Callable
from torchvision.ops import StochasticDepth as StochasticDepthTorch
from ultralytics.nn.modules.conv import LightConv
from ultralytics.utils.torch_utils import fuse_conv_and_bn
class Dropout(nn.Dropout):
def __init__(self, p: float=0.5, inplace: bool=False):
super(Dropout, self).__init__(p=p, inplace=inplace)
class StochasticDepth(StochasticDepthTorch):
def __init__(self, p: float, Mode: str="row") -> None:
super().__init__(p, Mode)
def pair(Val):
return Val if isinstance(Val, (tuple, list)) else (Val, Val)
def makeDivisible(v: float, divisor: int, min_value: Optional[int] = None) -> int:
"""
This function is taken from the original tf repo.
It ensures that all layers have a channel number that is divisible by 8
It can be seen here:
https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.Py
"""
if min_value is None:
min_value = divisor
new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
# Make sure that round down does not go down by more than 10%.
if new_v < 0.9 * v:
new_v += divisor
return new_v
class LinearSelfAttention(nn.Module):
"""
This layer applies a self-attention with linear complexity, as described in `MobileViTv2 <https://arxiv.org/abs/2206.02680>`_ paper.
This layer can be used for self- as well as cross-attention.
Args:
opts: command line arguments
DimEmbed (int): :math:`C` from an expected input of size :math:`(N, C, H, W)`
AttnDropRate (Optional[float]): Dropout value for context scores. Default: 0.0
bias (Optional[bool]): Use bias in learnable layers. Default: True
Shape:
- Input: :math:`(N, C, P, N)` where :math:`N` is the batch size, :math:`C` is the input channels,
:math:`P` is the number of pixels in the patch, and :math:`N` is the number of patches
- Output: same as the input
.. note::
For MobileViTv2, we unfold the feature map [B, C, H, W] into [B, C, P, N] where P is the number of pixels
in a patch and N is the number of patches. Because channel is the first dimension in this unfolded tensor,
we use point-wise convolution (instead of a linear layer). This avoids a transpose operation (which may be
expensive on resource-constrained devices) that may be required to convert the unfolded tensor from
channel-first to channel-last format in case of a linear layer.
"""
def __init__(
self,
DimEmbed: int,
AttnDropRate: Optional[float]=0.0,
Bias: Optional[bool]=True,
) -> None:
super().__init__()
self.qkv_proj = BaseConv2d(DimEmbed, 1 + (2 * DimEmbed), 1, bias=Bias)
self.AttnDropRate = Dropout(p=AttnDropRate)
self.out_proj = BaseConv2d(DimEmbed, DimEmbed, 1, bias=Bias)
self.DimEmbed = DimEmbed
def forward(self, x: Tensor) -> Tensor:
# [B, C, P, N] --> [B, h + 2d, P, N]
qkv = self.qkv_proj(x)
# Project x into query, key and value
# Query --> [B, 1, P, N]
# value, key --> [B, d, P, N]
query, key, value = torch.split(
qkv, split_size_or_sections=[1, self.DimEmbed, self.DimEmbed], dim=1
)
# apply softmax along N dimension
context_scores = F.softmax(query, dim=-1)
# Uncomment below line to visualize context scores
# self.visualize_context_scores(context_scores=context_scores)
context_scores = self.AttnDropRate(context_scores)
# Compute context vector
# [B, d, P, N] x [B, 1, P, N] -> [B, d, P, N]
context_vector = key * context_scores
# [B, d, P, N] --> [B, d, P, 1]
context_vector = torch.sum(context_vector, dim=-1, keepdim=True)
# combine context vector with values
# [B, d, P, N] * [B, d, P, 1] --> [B, d, P, N]
out = F.relu(value) * context_vector.expand_as(value)
out = self.out_proj(out)
return out
class LinearAttnFFN(nn.Module):
def __init__(
self,
DimEmbed: int,
DimFfnLatent: int,
AttnDropRate: Optional[float] = 0.0,
DropRate: Optional[float] = 0.1,
FfnDropRate: Optional[float] = 0.0,
) -> None:
super().__init__()
AttnUnit = LinearSelfAttention(DimEmbed, AttnDropRate, Bias=True)
self.PreNormAttn = nn.Sequential(
nn.BatchNorm2d(DimEmbed),
AttnUnit,
Dropout(DropRate),
)
self.PreNormFfn = nn.Sequential(
nn.BatchNorm2d(DimEmbed),
BaseConv2d(DimEmbed, DimFfnLatent, 1, 1, ActLayer=nn.SiLU),
Dropout(FfnDropRate),
BaseConv2d(DimFfnLatent, DimEmbed, 1, 1),
Dropout(DropRate),
)
self.DimEmbed = DimEmbed
def forward(self, x: Tensor) -> Tensor:
# self-attention
x = x + self.PreNormAttn(x)
# Feed forward network
x = x + self.PreNormFfn(x)
return x
class BaseConv2d(nn.Module):
def __init__(
self,
in_channels: int,
out_channels: int,
kernel_size: int,
stride: Optional[int] = 1,
padding: Optional[int] = None,
groups: Optional[int] = 1,
bias: Optional[bool] = None,
BNorm: bool = False,
# norm_layer: Optional[Callable[..., nn.Module]]=nn.BatchNorm2d,
ActLayer: Optional[Callable[..., nn.Module]] = None,
dilation: int = 1,
Momentum: Optional[float] = 0.1,
**kwargs: Any
) -> None:
super(BaseConv2d, self).__init__()
if padding is None:
padding = int((kernel_size - 1) // 2 * dilation)
if bias is None:
bias = not BNorm
self.in_channels = in_channels
self.out_channels = out_channels
self.kernel_size = kernel_size
self.stride = stride
self.padding = padding
self.groups = groups
self.bias = bias
self.Conv = nn.Conv2d(in_channels, out_channels,
kernel_size, stride, padding, dilation, groups, bias, **kwargs)
self.Bn = nn.BatchNorm2d(out_channels, eps=0.001, momentum=Momentum) if BNorm else nn.Identity()
if ActLayer is not None:
if isinstance(list(ActLayer().named_modules())[0][1], nn.Sigmoid):
self.Act = ActLayer()
else:
self.Act = ActLayer(inplace=True)
else:
self.Act = ActLayer
def forward(self, x: Tensor) -> Tensor:
x = self.Conv(x)
x = self.Bn(x)
if self.Act is not None:
x = self.Act(x)
return x
class BaseFormer(nn.Module):
def __init__(
self,
InChannels: int,
FfnMultiplier: Optional[Union[Sequence[Union[int, float]], int, float]] = 2.0,
NumAttnBlocks: Optional[int] = 2,
AttnDropRate: Optional[float] = 0.0,
DropRate: Optional[float] = 0.0,
FfnDropRate: Optional[float] = 0.0,
PatchRes: Optional[int] = 2,
Dilation: Optional[int] = 1,
ViTSELayer: Optional[nn.Module] = None,
**kwargs: Any,
) -> None:
DimAttnUnit = InChannels // 2
DimCNNOut = DimAttnUnit
Conv3x3In = BaseConv2d(
InChannels, InChannels, 3, 1, dilation=Dilation,
BNorm=True, ActLayer=nn.SiLU,
) # depth-wise separable convolution
ViTSELayer = ViTSELayer(InChannels, **kwargs) if ViTSELayer is not None else nn.Identity()
Conv1x1In = BaseConv2d(InChannels, DimCNNOut, 1, 1, bias=False)
super(BaseFormer, self).__init__()
self.LocalRep = nn.Sequential(Conv3x3In, ViTSELayer, Conv1x1In)
self.GlobalRep, DimAttnUnit = self.buildAttnLayer(
DimAttnUnit, FfnMultiplier, NumAttnBlocks, AttnDropRate, DropRate, FfnDropRate,
)
self.ConvProj = BaseConv2d(DimCNNOut, InChannels, 1, 1, BNorm=True)
self.DimCNNOut = DimCNNOut
self.HPatch, self.WPatch = pair(PatchRes)
self.PatchArea = self.WPatch * self.HPatch
def buildAttnLayer(
self,
DimModel: int,
FfnMult: Union[Sequence, int, float],
NumAttnBlocks: int,
AttnDropRate: float,
DropRate: float,
FfnDropRate: float,
) -> Tuple[nn.Module, int]:
if isinstance(FfnMult, Sequence) and len(FfnMult) == 2:
DimFfn = (
np.linspace(FfnMult[0], FfnMult[1], NumAttnBlocks, dtype=float) * DimModel
)
elif isinstance(FfnMult, Sequence) and len(FfnMult) == 1:
DimFfn = [FfnMult[0] * DimModel] * NumAttnBlocks
elif isinstance(FfnMult, (int, float)):
DimFfn = [FfnMult * DimModel] * NumAttnBlocks
else:
raise NotImplementedError
# ensure that dims are multiple of 16
DimFfn = [makeDivisible(d, 16) for d in DimFfn]
GlobalRep = [
LinearAttnFFN(DimModel, DimFfn[block_idx], AttnDropRate, DropRate, FfnDropRate)
for block_idx in range(NumAttnBlocks)
]
GlobalRep.append(nn.BatchNorm2d(DimModel))
return nn.Sequential(*GlobalRep), DimModel
def unfolding(self, FeatureMap: Tensor) -> Tuple[Tensor, Tuple[int, int]]:
B, C, H, W = FeatureMap.shape
# [B, C, H, W] --> [B, C, P, N]
Patches = F.unfold(
FeatureMap,
kernel_size=(self.HPatch, self.WPatch),
stride=(self.HPatch, self.WPatch),
)
Patches = Patches.reshape(
B, C, self.HPatch * self.WPatch, -1
)
return Patches, (H, W)
def folding(self, Patches: Tensor, OutputSize: Tuple[int, int]) -> Tensor:
B, C, P, N = Patches.shape # BatchSize, DimIn, PatchSize, NumPatches
# [B, C, P, N]
Patches = Patches.reshape(B, C * P, N)
FeatureMap = F.fold(
Patches,
output_size=OutputSize,
kernel_size=(self.HPatch, self.WPatch),
stride=(self.HPatch, self.WPatch),
)
return FeatureMap
def forward(self, x: Tensor, *args, **kwargs) -> Tensor:
Fm = self.LocalRep(x)
# convert feature map to patches
Patches, OutputSize = self.unfolding(Fm)
# learn global representations on all patches
Patches = self.GlobalRep(Patches)
# [B x Patch x Patches x C] --> [B x C x Patches x Patch]
Fm = self.folding(Patches, OutputSize)
Fm = self.ConvProj(Fm)
return Fm
#AssemFormer, a method that combines convolution with a vision transformer by assembling tensors.
class AssemFormer(BaseFormer):
"""
Inspired by MobileViTv3.
Adapted from https://github.com/micronDLA/MobileViTv3/blob/main/MobileViTv3-v2/cvnets/modules/mobilevit_block.py
"""
def __init__(
self,
InChannels: int,
FfnMultiplier: Optional[Union[Sequence[Union[int, float]], int, float]] = 2.0,
NumAttnBlocks: Optional[int] = 2,
AttnDropRate: Optional[float] = 0.0,
DropRate: Optional[float] = 0.0,
FfnDropRate: Optional[float] = 0.0,
PatchRes: Optional[int] = 2,
Dilation: Optional[int] = 1,
SDProb: Optional[float] = 0.0,
ViTSELayer: Optional[nn.Module] = None,
**kwargs: Any,
) -> None:
super().__init__(InChannels, FfnMultiplier, NumAttnBlocks, AttnDropRate,
DropRate, FfnDropRate, PatchRes, Dilation, ViTSELayer, **kwargs)
# AssembleFormer: input changed from just global to local + global
self.ConvProj = BaseConv2d(2 * self.DimCNNOut, InChannels, 1, 1, BNorm=True)
self.Dropout = StochasticDepth(SDProb)
def forward(self, x: Tensor) -> Tensor:
FmConv = self.LocalRep(x)
# convert feature map to patches
Patches, OutputSize = self.unfolding(FmConv)
# learn global representations on all patches
Patches = self.GlobalRep(Patches)
# [B x Patch x Patches x C] --> [B x C x Patches x Patch]
Fm = self.folding(Patches, OutputSize)
# AssembleFormer: local + global instead of only global
Fm = self.ConvProj(torch.cat((Fm, FmConv), dim=1))
# AssembleFormer: skip connection
return x + self.Dropout(Fm)
def autopad(k, p=None, d=1): # kernel, padding, dilation
"""Pad to 'same' shape outputs."""
if d > 1:
k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k] # actual kernel-size
if p is None:
p = k // 2 if isinstance(k, int) else [x // 2 for x in k] # auto-pad
return p
class Conv(nn.Module):
"""Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)."""
default_act = nn.SiLU() # default activation
def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
"""Initialize Conv layer with given arguments including activation."""
super().__init__()
self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
self.bn = nn.BatchNorm2d(c2)
self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()
def forward(self, x):
"""Apply convolution, batch normalization and activation to input tensor."""
return self.act(self.bn(self.conv(x)))
def forward_fuse(self, x):
"""Perform transposed convolution of 2D data."""
return self.act(self.conv(x))
class HGBlock_AssemFormer(nn.Module):
"""
    HGBlock of PPHGNetV2 with 2 convolutions and LightConv, followed by an AssemFormer block.
https://github.com/PaddlePaddle/PaddleDetection/blob/develop/ppdet/modeling/backbones/hgnet_v2.py
"""
def __init__(self, c1, cm, c2, k=3, n=6, lightconv=False, shortcut=False, act=nn.ReLU()):
"""Initializes a CSP Bottleneck with 1 convolution using specified input and output channels."""
super().__init__()
block = LightConv if lightconv else Conv
self.m = nn.ModuleList(block(c1 if i == 0 else cm, cm, k=k, act=act) for i in range(n))
self.sc = Conv(c1 + n * cm, c2 // 2, 1, 1, act=act) # squeeze conv
self.ec = Conv(c2 // 2, c2, 1, 1, act=act) # excitation conv
self.add = shortcut and c1 == c2
self.cv = AssemFormer(c2)
def forward(self, x):
"""Forward pass of a PPHGNetV2 backbone layer."""
y = [x]
y.extend(m(y[-1]) for m in self.m)
y = self.cv(self.ec(self.sc(torch.cat(y, 1))))
return y + x if self.add else y
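
# --- Optional quick sanity check (my own addition, not part of the original SvANet code) ---
# Verifies that both modules preserve the input shape; the channel/spatial sizes below are
# arbitrary, as long as height and width are divisible by the 2x2 patch resolution.
if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    print(AssemFormer(64)(x).shape)                                           # torch.Size([2, 64, 32, 32])
    print(HGBlock_AssemFormer(64, 32, 64, k=3, n=6, shortcut=True)(x).shape)  # torch.Size([2, 64, 32, 32])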
4. Module Improvements
4.1 Improvement Points ⭐
Module improvement method:
1️⃣ Add the AssemFormer module (the implementation from Section 3).
2️⃣ Add an HGBlock based on the AssemFormer module. Improving the HGBlock module with AssemFormer lets the model better capture long-range dependencies between pixels.
The modified code is as follows: it extends the HGBlock module with an AssemFormer block and renames it HGBlock_AssemFormer.
class HGBlock_AssemFormer(nn.Module):
"""
    HGBlock of PPHGNetV2 with 2 convolutions and LightConv, followed by an AssemFormer block.
https://github.com/PaddlePaddle/PaddleDetection/blob/develop/ppdet/modeling/backbones/hgnet_v2.py
"""
def __init__(self, c1, cm, c2, k=3, n=6, lightconv=False, shortcut=False, act=nn.ReLU()):
"""Initializes a CSP Bottleneck with 1 convolution using specified input and output channels."""
super().__init__()
block = LightConv if lightconv else Conv
self.m = nn.ModuleList(block(c1 if i == 0 else cm, cm, k=k, act=act) for i in range(n))
self.sc = Conv(c1 + n * cm, c2 // 2, 1, 1, act=act) # squeeze conv
self.ec = Conv(c2 // 2, c2, 1, 1, act=act) # excitation conv
self.add = shortcut and c1 == c2
self.cv = AssemFormer(c2)
def forward(self, x):
"""Forward pass of a PPHGNetV2 backbone layer."""
y = [x]
y.extend(m(y[-1]) for m in self.m)
y = self.cv(self.ec(self.sc(torch.cat(y, 1))))
return y + x if self.add else y
Note ❗: the module name to declare in Sections 5.2 and 5.3 is HGBlock_AssemFormer.
5. Integration Steps
5.1 Step 1
① Create an AddModules folder under the ultralytics/nn/ directory to hold the module code.
② Create AssemFormer.py inside the AddModules folder and paste the code from Section 3 into it.
5.2 Step 2
Create __init__.py in the AddModules folder (skip this if it already exists) and import the module inside it:
from .AssemFormer import *
5.3 Step 3
In the ultralytics/nn/tasks.py file, the module class name must be added in two places.
First, import the module; then register HGBlock_AssemFormer in the parse_model function.
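The exact lines differ between Ultralytics versions; the following is a minimal sketch of the two changes, assuming parse_model handles HGBlock through the usual [c1, cm, c2, k, n, lightconv, shortcut] branch.

# In ultralytics/nn/tasks.py -- sketch only, adapt to your Ultralytics version.

# First: import the new module near the other block imports.
from ultralytics.nn.AddModules import HGBlock_AssemFormer

# Second: inside parse_model, let HGBlock_AssemFormer reuse the existing HGBlock branch so its
# YAML arguments are expanded to [c1, cm, c2, k, n, lightconv, shortcut]:
#     elif m in (HGStem, HGBlock, HGBlock_AssemFormer):
#         c1, cm, c2 = ch[f], args[0], args[1]
#         args = [c1, cm, c2, *args[2:]]
#         if m in (HGBlock, HGBlock_AssemFormer):
#             args.insert(4, n)  # number of repeats
#             n = 1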
6. YAML Model File
6.1 Modified Model Version
Taking ultralytics/cfg/models/rt-detr/rtdetr-l.yaml as an example, create a model file for training on your own dataset in the same directory and name it rtdetr-l-HGBlock_AssemFormer.yaml.
Copy the contents of rtdetr-l.yaml into rtdetr-l-HGBlock_AssemFormer.yaml and set nc to the number of object classes in your dataset.
📌 The modification replaces the stage-3 HGBlock modules in the backbone with HGBlock_AssemFormer.
# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-l object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/models/rtdetr
# Parameters
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
# [depth, width, max_channels]
l: [1.00, 1.00, 1024]
backbone:
# [from, repeats, module, args]
- [-1, 1, HGStem, [32, 48]] # 0-P2/4
- [-1, 6, HGBlock, [48, 128, 3]] # stage 1
- [-1, 1, DWConv, [128, 3, 2, 1, False]] # 2-P3/8
- [-1, 6, HGBlock, [96, 512, 3]] # stage 2
- [-1, 1, DWConv, [512, 3, 2, 1, False]] # 4-P4/16
- [-1, 6, HGBlock_AssemFormer, [192, 1024, 5, True, False]] # cm, c2, k, light, shortcut
- [-1, 6, HGBlock_AssemFormer, [192, 1024, 5, True, True]]
- [-1, 6, HGBlock_AssemFormer, [192, 1024, 5, True, True]] # stage 3
- [-1, 1, DWConv, [1024, 3, 2, 1, False]] # 8-P5/32
- [-1, 6, HGBlock, [384, 2048, 5, True, False]] # stage 4
head:
- [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 10 input_proj.2
- [-1, 1, AIFI, [1024, 8]]
- [-1, 1, Conv, [256, 1, 1]] # 12, Y5, lateral_convs.0
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [7, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 14 input_proj.1
- [[-2, -1], 1, Concat, [1]]
- [-1, 3, RepC3, [256]] # 16, fpn_blocks.0
- [-1, 1, Conv, [256, 1, 1]] # 17, Y4, lateral_convs.1
- [-1, 1, nn.Upsample, [None, 2, "nearest"]]
- [3, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 19 input_proj.0
- [[-2, -1], 1, Concat, [1]] # cat backbone P4
- [-1, 3, RepC3, [256]] # X3 (21), fpn_blocks.1
- [-1, 1, Conv, [256, 3, 2]] # 22, downsample_convs.0
- [[-1, 17], 1, Concat, [1]] # cat Y4
- [-1, 3, RepC3, [256]] # F4 (24), pan_blocks.0
- [-1, 1, Conv, [256, 3, 2]] # 25, downsample_convs.1
- [[-1, 12], 1, Concat, [1]] # cat Y5
- [-1, 3, RepC3, [256]] # F5 (27), pan_blocks.1
- [[21, 24, 27], 1, RTDETRDecoder, [nc]] # Detect(P3, P4, P5)
7. Successful Run Results
Printing the network model shows that HGBlock_AssemFormer has been added, and the model is ready for training.
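One minimal way to build the model from the new config and start training is shown below (the config and dataset paths are placeholders; adjust them to your own setup).

from ultralytics import RTDETR

# Build the modified model from the new YAML config and print a parameter/GFLOPs summary.
model = RTDETR("ultralytics/cfg/models/rt-detr/rtdetr-l-HGBlock_AssemFormer.yaml")
model.info()

# Train on your own dataset (the data YAML path is a placeholder).
model.train(data="path/to/your_dataset.yaml", epochs=100, imgsz=640, batch=4)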
rtdetr-l-HGBlock_AssemFormer :
from n params module arguments
0 -1 1 25248 ultralytics.nn.modules.block.HGStem [3, 32, 48]
1 -1 6 155072 ultralytics.nn.modules.block.HGBlock [48, 48, 128, 3, 6]
2 -1 1 1408 ultralytics.nn.modules.conv.DWConv [128, 128, 3, 2, 1, False]
3 -1 6 839296 ultralytics.nn.modules.block.HGBlock [128, 96, 512, 3, 6]
4 -1 1 5632 ultralytics.nn.modules.conv.DWConv [512, 512, 3, 2, 1, False]
5 -1 6 16391810 ultralytics.nn.AddModules.AssemFormer.HGBlock_AssemFormer[512, 192, 1024, 5, 6, True, False]
6 -1 6 16752258 ultralytics.nn.AddModules.AssemFormer.HGBlock_AssemFormer[1024, 192, 1024, 5, 6, True, True]
7 -1 6 16752258 ultralytics.nn.AddModules.AssemFormer.HGBlock_AssemFormer[1024, 192, 1024, 5, 6, True, True]
8 -1 1 11264 ultralytics.nn.modules.conv.DWConv [1024, 1024, 3, 2, 1, False]
9 -1 6 6708480 ultralytics.nn.modules.block.HGBlock [1024, 384, 2048, 5, 6, True, False]
10 -1 1 524800 ultralytics.nn.modules.conv.Conv [2048, 256, 1, 1, None, 1, 1, False]
11 -1 1 789760 ultralytics.nn.modules.transformer.AIFI [256, 1024, 8]
12 -1 1 66048 ultralytics.nn.modules.conv.Conv [256, 256, 1, 1]
13 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
14 7 1 262656 ultralytics.nn.modules.conv.Conv [1024, 256, 1, 1, None, 1, 1, False]
15 [-2, -1] 1 0 ultralytics.nn.modules.conv.Concat [1]
16 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
17 -1 1 66048 ultralytics.nn.modules.conv.Conv [256, 256, 1, 1]
18 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
19 3 1 131584 ultralytics.nn.modules.conv.Conv [512, 256, 1, 1, None, 1, 1, False]
20 [-2, -1] 1 0 ultralytics.nn.modules.conv.Concat [1]
21 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
22 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
23 [-1, 17] 1 0 ultralytics.nn.modules.conv.Concat [1]
24 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
25 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
26 [-1, 12] 1 0 ultralytics.nn.modules.conv.Concat [1]
27 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
28 [21, 24, 27] 1 7303907 ultralytics.nn.modules.head.RTDETRDecoder [1, [256, 256, 256]]
rtdetr-l-HGBlock_AssemFormer summary: 868 layers, 76,897,481 parameters, 76,897,481 gradients, 249.1 GFLOPs