RT-DETR Improvement Strategies [Attention Mechanisms] | ICCV 2023 Focused Linear Attention Module: Improving Both Focus Ability and Feature Diversity, with a Secondary Innovation
1. Introduction
This article documents how the RT-DETR object detection network is optimized with the Focused Linear Attention module. The Focused Linear Attention module simultaneously addresses the weak focus ability and the limited feature diversity of linear attention, overcoming the performance degradation or extra computational overhead found in common linear attention methods. Here the module is integrated into RT-DETR to further exploit its performance.
2. The Focused Linear Attention Module
FLatten Transformer: Vision Transformer using Focused Linear Attention
2.1 Motivation
- Addressing the performance drop of linear attention: existing linear attention methods either degrade performance or introduce extra computational overhead. The authors analyze and resolve these limitations from two angles: focus ability and feature diversity.
2.2 Principle
2.2.1 Focus ability
- Strength of Softmax attention: Softmax attention provides a non-linear re-weighting mechanism that concentrates on important features; its attention maps are sharply peaked in certain regions (e.g. foreground objects).
- Weakness of linear attention: the attention maps of linear attention are comparatively smooth and lack this focusing ability.
- Proposed solution: a focused function $f_p$ adjusts the directions of query and key features so that similar query-key pairs are pulled closer together while dissimilar pairs are pushed apart. The mapping is $Sim(Q_i, K_j)=\phi_p(Q_i)\phi_p(K_j)^T$, where $\phi_p(x)=f_p(\mathrm{ReLU}(x))$ and $f_p(x)=\frac{\|x\|}{\|x^{**p}\|}x^{**p}$ (with $x^{**p}$ denoting the element-wise power $p$). The function only changes feature directions; under mild assumptions it widens the gap between similar and dissimilar query-key pairs, recovering a sharp, Softmax-like attention distribution (see the sketch below).
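As an illustration, here is a minimal sketch of the focused mapping $\phi_p$ (the tensor shapes, the value of $p$ and the helper name `focused_map` are assumptions for demonstration; the full module implementation appears in Section 3):

import torch
import torch.nn.functional as F

def focused_map(x, p=3.0, eps=1e-6):
    """phi_p(x) = f_p(ReLU(x)): element-wise power p, then restore the original norm."""
    x = F.relu(x) + eps                               # non-negative features, as linear attention requires
    norm = x.norm(dim=-1, keepdim=True)               # ||x||
    xp = x ** p                                       # x**p sharpens the feature direction
    return xp / xp.norm(dim=-1, keepdim=True) * norm  # f_p(x) = ||x|| / ||x**p|| * x**p

q = torch.randn(2, 196, 64)                           # (batch, tokens, dim), illustrative sizes
k = torch.randn(2, 196, 64)
sim = focused_map(q) @ focused_map(k).transpose(-2, -1)   # Sim(Q, K) = phi_p(Q) phi_p(K)^T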
2.2.2 Feature diversity
- Role of the attention matrix rank: the rank of the attention matrix governs feature diversity. In some Transformer layers, e.g. in DeiT-Tiny, the Softmax attention matrix can be full rank, reflecting diverse feature aggregation; in linear attention the rank is bounded by the token number $N$ and the channel dimension $d$ (usually $d < N$), so many rows of the attention map become homogeneous and the aggregated features are similar.
- Proposed solution: a depthwise convolution (DWC) is added to the attention output, giving $O=\phi(Q)\phi(K)^T V + DWC(V)$. From the rank perspective, $O = (\phi(Q)\phi(K)^T + M_{DWC})V = M_{eq}V$, where $M_{DWC}$ is the sparse matrix corresponding to the DWC and may be full rank, which raises the upper bound on the rank of the equivalent attention matrix and improves linear attention at almost no extra cost (a minimal sketch follows).
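A rough sketch of this rank-restoring trick (shapes are illustrative, the row normalization of the attention is omitted for brevity, and the stand-alone `dwc` layer is only a stand-in for the module's depthwise convolution):

import torch
import torch.nn as nn

B, N, d, H, W = 1, 196, 64, 14, 14
phi_q = torch.rand(B, N, d)                       # phi(Q), non-negative features
phi_k = torch.rand(B, N, d)                       # phi(K)
v = torch.randn(B, N, d)
dwc = nn.Conv2d(d, d, kernel_size=3, padding=1, groups=d)   # depthwise conv on V as a 2-D map

linear_out = phi_q @ (phi_k.transpose(-2, -1) @ v)          # rank(phi(Q)phi(K)^T) <= d < N
v_map = v.transpose(1, 2).reshape(B, d, H, W)               # tokens back onto an H x W grid
dwc_out = dwc(v_map).flatten(2).transpose(1, 2)             # DWC(V): sparse, potentially full-rank mixing
out = linear_out + dwc_out                                   # O = phi(Q)phi(K)^T V + DWC(V)
print(out.shape)                                             # torch.Size([1, 196, 64])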
2.3 Structure
Based on the analysis above, the Focused Linear Attention module is computed as:
$$O = Sim(Q, K)V = \phi_p(Q)\phi_p(K)^T V + DWC(V)$$
2.4 Advantages
- Low computational complexity
  - As with other linear attention schemes, changing the order of the self-attention computation reduces the complexity from $O(N^2 d)$ to $O(N d^2)$; in common Vision Transformer designs the overall computation actually decreases (a sketch of this reordering follows the list).
  - Compared with earlier linear attention modules that design complicated kernel functions, the proposed focused function $f_p$ uses only simple operations and achieves the approximation with minimal overhead.
- Strong expressive power
  - Analyzed from the angles of focus ability and feature diversity, previous kernel-based linear attention designs fall short of Softmax attention, but with the focused function $f_p$ and the depthwise convolution this module can even outperform Softmax attention.
- Strong adaptability
  - The module adapts to larger receptive fields and different model architectures. Modern Transformer models built on Softmax attention must restrict the token number because of the quadratic complexity, whereas the linear complexity of this module allows the receptive field to be enlarged at constant cost, and it can be used as a plug-in module in a variety of modern vision Transformer architectures.
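A sketch of the reordering mentioned in the list above (illustrative shapes, normalization omitted): aggregating $K^T V$ first yields a $d \times d$ matrix and costs $O(Nd^2)$, instead of the $O(N^2 d)$ needed to form the $N \times N$ attention map.

import torch

N, d = 1024, 64
q, k, v = (torch.rand(N, d) for _ in range(3))

# Softmax-style order: build the N x N map first -> O(N^2 d) time and O(N^2) memory
out_quadratic = (q @ k.T) @ v

# Linear attention order: aggregate k.T @ v (d x d) first -> O(N d^2)
out_linear = q @ (k.T @ v)

print(torch.allclose(out_quadratic, out_linear, rtol=1e-3, atol=1e-3))  # same result, different cost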
Paper: https://arxiv.org/pdf/2308.00442
Code: https://github.com/LeapLabTHU/FLatten-Transformer
3. Implementation Code for FLatten Transformer
The implementation of FLatten Transformer and the improved block derived from it is as follows:
import torch.nn as nn
import torch
from einops import rearrange
from ultralytics.nn.modules.conv import LightConv
# https://arxiv.org/pdf/2308.00442
# https://github.com/LeapLabTHU/FLatten-Transformer
class FocusedLinearAttention(nn.Module):
    def __init__(self, dim, num_patches=64, num_heads=8, qkv_bias=True, qk_scale=None, attn_drop=0.0, proj_drop=0.0, sr_ratio=1,
                 focusing_factor=3.0, kernel_size=5):
        super().__init__()
        assert dim % num_heads == 0, f"dim {dim} should be divided by num_heads {num_heads}."
        self.dim = dim
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.q = nn.Linear(dim, dim, bias=qkv_bias)
        self.kv = nn.Linear(dim, dim * 2, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)
        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)
        self.focusing_factor = focusing_factor
        self.dwc = nn.Conv2d(in_channels=head_dim, out_channels=head_dim, kernel_size=kernel_size,
                             groups=head_dim, padding=kernel_size // 2)
        self.scale = nn.Parameter(torch.zeros(size=(1, 1, dim)))
        # self.positional_encoding = nn.Parameter(torch.zeros(size=(1, num_patches // (sr_ratio * sr_ratio), dim)))

    def forward(self, x):
        B, C, H, W = x.shape
        dtype, device = x.dtype, x.device
        x = rearrange(x, 'b c h w -> b (h w) c')
        q = self.q(x)
        if self.sr_ratio > 1:
            x_ = x.permute(0, 2, 1).reshape(B, C, H, W)
            x_ = self.sr(x_).reshape(B, C, -1).permute(0, 2, 1)
            x_ = self.norm(x_)
            kv = self.kv(x_).reshape(B, -1, 2, C).permute(2, 0, 1, 3)
        else:
            kv = self.kv(x).reshape(B, -1, 2, C).permute(2, 0, 1, 3)
        k, v = kv[0], kv[1]
        N = H * W
        # zero positional encoding built on the fly to match the current token count
        positional_encoding = nn.Parameter(torch.zeros(size=(1, N, self.dim), device=device))
        k = k + positional_encoding
        focusing_factor = self.focusing_factor
        kernel_function = nn.ReLU()
        scale = nn.Softplus()(self.scale)
        q = kernel_function(q) + 1e-6
        k = kernel_function(k) + 1e-6
        q = q / scale
        k = k / scale
        q_norm = q.norm(dim=-1, keepdim=True)
        k_norm = k.norm(dim=-1, keepdim=True)
        # focused function f_p: element-wise power, then restore the original norm
        q = q ** focusing_factor
        k = k ** focusing_factor
        q = (q / q.norm(dim=-1, keepdim=True)) * q_norm
        k = (k / k.norm(dim=-1, keepdim=True)) * k_norm
        # run the attention arithmetic in fp32 if the input is fp16
        fp16 = False
        if dtype == torch.float16:
            q = q.float()
            k = k.float()
            v = v.float()
            fp16 = True
        q, k, v = (rearrange(t, "b n (h c) -> (b h) n c", h=self.num_heads) for t in [q, k, v])
        i, j, c, d = q.shape[-2], k.shape[-2], k.shape[-1], v.shape[-1]
        z = 1 / (torch.einsum("b i c, b c -> b i", q, k.sum(dim=1)) + 1e-6)
        # choose the cheaper multiplication order: q @ (k^T v) vs (q @ k^T) @ v
        if i * j * (c + d) > c * d * (i + j):
            kv = torch.einsum("b j c, b j d -> b c d", k, v)
            x = torch.einsum("b i c, b c d, b i -> b i d", q, kv, z)
        else:
            qk = torch.einsum("b i c, b j c -> b i j", q, k)
            x = torch.einsum("b i j, b j d, b i -> b i d", qk, v, z)
        if self.sr_ratio > 1:
            v = nn.functional.interpolate(v.permute(0, 2, 1), size=x.shape[1], mode='linear').permute(0, 2, 1)
        if fp16:
            v = v.to(torch.float16)
            x = x.to(torch.float16)
        # depthwise convolution branch on V (assumes a square feature map)
        num = int(v.shape[1] ** 0.5)
        feature_map = rearrange(v, "b (w h) c -> b c w h", w=num, h=num)
        feature_map = rearrange(self.dwc(feature_map), "b c w h -> b (w h) c")
        x = x + feature_map
        x = rearrange(x, "(b h) n c -> b n (h c)", h=self.num_heads)
        x = self.proj(x)
        x = self.proj_drop(x)
        x = rearrange(x, 'b (h w) c -> b c h w', h=H, w=W)
        return x
def autopad(k, p=None, d=1):  # kernel, padding, dilation
    """Pad to 'same' shape outputs."""
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p
class Conv(nn.Module):
    """Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)."""

    default_act = nn.SiLU()  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        """Initialize Conv layer with given arguments including activation."""
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        """Apply convolution, batch normalization and activation to input tensor."""
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        """Apply convolution and activation without batch normalization (fused path)."""
        return self.act(self.conv(x))
class HGBlock_FLA(nn.Module):
    """
    HG_Block of PPHGNetV2 with 2 convolutions and LightConv, followed by Focused Linear Attention.
    https://github.com/PaddlePaddle/PaddleDetection/blob/develop/ppdet/modeling/backbones/hgnet_v2.py
    """

    def __init__(self, c1, cm, c2, k=3, n=6, lightconv=False, shortcut=False, act=nn.ReLU()):
        """Initialize the HG block with n stacked convs, squeeze/excitation convs and a FocusedLinearAttention layer."""
        super().__init__()
        block = LightConv if lightconv else Conv
        self.m = nn.ModuleList(block(c1 if i == 0 else cm, cm, k=k, act=act) for i in range(n))
        self.sc = Conv(c1 + n * cm, c2 // 2, 1, 1, act=act)  # squeeze conv
        self.ec = Conv(c2 // 2, c2, 1, 1, act=act)  # excitation conv
        self.add = shortcut and c1 == c2
        self.cv = FocusedLinearAttention(c2)

    def forward(self, x):
        """Forward pass of a PPHGNetV2 backbone layer."""
        y = [x]
        y.extend(m(y[-1]) for m in self.m)
        y = self.cv(self.ec(self.sc(torch.cat(y, 1))))
        return y + x if self.add else y
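A quick shape check (an illustrative snippet, not part of the original code; the 20x20 and 40x40 maps stand in for P5/P4 features) confirms that both modules keep the spatial resolution unchanged:

if __name__ == '__main__':
    x = torch.randn(1, 1024, 20, 20)                      # B, C, H, W (a square map is assumed by the DWC branch)
    print(FocusedLinearAttention(1024)(x).shape)          # -> torch.Size([1, 1024, 20, 20])
    blk = HGBlock_FLA(512, 192, 1024, k=5, n=6, lightconv=True, shortcut=False)
    print(blk(torch.randn(1, 512, 40, 40)).shape)         # -> torch.Size([1, 1024, 40, 40])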
4. Innovation Modules
4.1 Improvement 1⭐
Module improvement: directly insert the FLatten Transformer module into the network (the integration steps are explained in Section 5). The backbone with the FLatten Transformer module added is given in the yaml of Section 6.1.
4.2 Improvement 2⭐
Module improvement: an HGBlock built on the FLatten Transformer module (the integration steps are explained in Section 5).
The second improvement modifies the HGBlock module in RT-DETR by inserting the FLatten Transformer into it. The modified code is as follows:
The FLatten Transformer module is added inside HGBlock, and the block is renamed HGBlock_FLA:
class HGBlock_FLA(nn.Module):
    """
    HG_Block of PPHGNetV2 with 2 convolutions and LightConv, followed by Focused Linear Attention.
    https://github.com/PaddlePaddle/PaddleDetection/blob/develop/ppdet/modeling/backbones/hgnet_v2.py
    """

    def __init__(self, c1, cm, c2, k=3, n=6, lightconv=False, shortcut=False, act=nn.ReLU()):
        """Initialize the HG block with n stacked convs, squeeze/excitation convs and a FocusedLinearAttention layer."""
        super().__init__()
        block = LightConv if lightconv else Conv
        self.m = nn.ModuleList(block(c1 if i == 0 else cm, cm, k=k, act=act) for i in range(n))
        self.sc = Conv(c1 + n * cm, c2 // 2, 1, 1, act=act)  # squeeze conv
        self.ec = Conv(c2 // 2, c2, 1, 1, act=act)  # excitation conv
        self.add = shortcut and c1 == c2
        self.cv = FocusedLinearAttention(c2)

    def forward(self, x):
        """Forward pass of a PPHGNetV2 backbone layer."""
        y = [x]
        y.extend(m(y[-1]) for m in self.m)
        y = self.cv(self.ec(self.sc(torch.cat(y, 1))))
        return y + x if self.add else y
Note❗: the module name that must be declared in Section 5 is HGBlock_FLA.
5. Integration Steps
5.1 Modification 1
① Create an AddModules folder under the ultralytics/nn/ directory to hold the module code.
② Create FLA.py inside the AddModules folder and paste the code from Section 3 into it.
5.2 Modification 2
Create __init__.py in the AddModules folder (skip this if it already exists) and import the module in it:
from .FLA import *
5.3 Modification 3
In the ultralytics/nn/tasks.py file, the module class names must be added in two places.
First: import the modules.
Then: declare the HGBlock_FLA module in the parse_model function and add the following code:
elif m in {FocusedLinearAttention}:
    args = [ch[f], *args]
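For reference, a hedged sketch of both edits to tasks.py (the surrounding parse_model code varies across ultralytics versions, so treat this as an outline rather than a verbatim patch; HGBlock_FLA is assumed to take the same (c1, cm, c2, ...) argument layout as the stock HGBlock):

# 1) near the top of ultralytics/nn/tasks.py
from ultralytics.nn.AddModules import FocusedLinearAttention, HGBlock_FLA

# 2) inside parse_model(), alongside the existing branches
        elif m in {HGStem, HGBlock, HGBlock_FLA}:
            cm, c2 = args[0], args[1]
            args = [c1, cm, c2, *args[2:]]
            if m in {HGBlock, HGBlock_FLA}:
                args.insert(4, n)  # number of repeats
                n = 1
        elif m in {FocusedLinearAttention}:
            args = [ch[f], *args]  # the attention layer only needs the input channel count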
6. YAML Model Files
6.1 Improved model version 1
Taking ultralytics/cfg/models/rt-detr/rtdetr-l.yaml as an example, create a model file named rtdetr-l-FLA.yaml in the same directory for training on your own dataset.
Copy the contents of rtdetr-l.yaml into rtdetr-l-FLA.yaml and set nc to the number of classes in your dataset.
📌 The modification inserts the FocusedLinearAttention module into the backbone, after the P5/32 downsampling layer.
# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-l object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/models/rtdetr

# Parameters
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
  # [depth, width, max_channels]
  l: [1.00, 1.00, 1024]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, HGStem, [32, 48]] # 0-P2/4
  - [-1, 6, HGBlock, [48, 128, 3]] # stage 1

  - [-1, 1, DWConv, [128, 3, 2, 1, False]] # 2-P3/8
  - [-1, 6, HGBlock, [96, 512, 3]] # stage 2

  - [-1, 1, DWConv, [512, 3, 2, 1, False]] # 4-P4/16
  - [-1, 6, HGBlock, [192, 1024, 5, True, False]] # cm, c2, k, light, shortcut
  - [-1, 6, HGBlock, [192, 1024, 5, True, True]]
  - [-1, 6, HGBlock, [192, 1024, 5, True, True]] # stage 3

  - [-1, 1, DWConv, [1024, 3, 2, 1, False]] # 8-P5/32
  - [-1, 1, FocusedLinearAttention, [1024]] # 9, Focused Linear Attention
  - [-1, 6, HGBlock, [384, 2048, 5, True, False]] # stage 4

head:
  - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 11 input_proj.2
  - [-1, 1, AIFI, [1024, 8]]
  - [-1, 1, Conv, [256, 1, 1]] # 13, Y5, lateral_convs.0

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [7, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 15 input_proj.1
  - [[-2, -1], 1, Concat, [1]]
  - [-1, 3, RepC3, [256]] # 17, fpn_blocks.0
  - [-1, 1, Conv, [256, 1, 1]] # 18, Y4, lateral_convs.1

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [3, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 20 input_proj.0
  - [[-2, -1], 1, Concat, [1]] # cat backbone P4
  - [-1, 3, RepC3, [256]] # X3 (22), fpn_blocks.1

  - [-1, 1, Conv, [256, 3, 2]] # 23, downsample_convs.0
  - [[-1, 18], 1, Concat, [1]] # cat Y4
  - [-1, 3, RepC3, [256]] # F4 (25), pan_blocks.0

  - [-1, 1, Conv, [256, 3, 2]] # 26, downsample_convs.1
  - [[-1, 13], 1, Concat, [1]] # cat Y5
  - [-1, 3, RepC3, [256]] # F5 (28), pan_blocks.1

  - [[22, 25, 28], 1, RTDETRDecoder, [nc]] # Detect(P3, P4, P5)
6.2 Improved model version 2⭐
Taking ultralytics/cfg/models/rt-detr/rtdetr-l.yaml as an example, create a model file named rtdetr-l-HGBlock_FLA.yaml in the same directory for training on your own dataset.
Copy the contents of rtdetr-l.yaml into rtdetr-l-HGBlock_FLA.yaml and set nc to the number of classes in your dataset.
📌 The modification replaces the stage-3 HGBlock modules in the backbone with HGBlock_FLA modules.
# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-l object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/models/rtdetr

# Parameters
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
  # [depth, width, max_channels]
  l: [1.00, 1.00, 1024]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, HGStem, [32, 48]] # 0-P2/4
  - [-1, 6, HGBlock, [48, 128, 3]] # stage 1

  - [-1, 1, DWConv, [128, 3, 2, 1, False]] # 2-P3/8
  - [-1, 6, HGBlock, [96, 512, 3]] # stage 2

  - [-1, 1, DWConv, [512, 3, 2, 1, False]] # 4-P4/16
  - [-1, 6, HGBlock_FLA, [192, 1024, 5, True, False]] # cm, c2, k, light, shortcut
  - [-1, 6, HGBlock_FLA, [192, 1024, 5, True, True]]
  - [-1, 6, HGBlock_FLA, [192, 1024, 5, True, True]] # stage 3

  - [-1, 1, DWConv, [1024, 3, 2, 1, False]] # 8-P5/32
  - [-1, 6, HGBlock, [384, 2048, 5, True, False]] # stage 4

head:
  - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 10 input_proj.2
  - [-1, 1, AIFI, [1024, 8]]
  - [-1, 1, Conv, [256, 1, 1]] # 12, Y5, lateral_convs.0

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [7, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 14 input_proj.1
  - [[-2, -1], 1, Concat, [1]]
  - [-1, 3, RepC3, [256]] # 16, fpn_blocks.0
  - [-1, 1, Conv, [256, 1, 1]] # 17, Y4, lateral_convs.1

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [3, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 19 input_proj.0
  - [[-2, -1], 1, Concat, [1]] # cat backbone P4
  - [-1, 3, RepC3, [256]] # X3 (21), fpn_blocks.1

  - [-1, 1, Conv, [256, 3, 2]] # 22, downsample_convs.0
  - [[-1, 17], 1, Concat, [1]] # cat Y4
  - [-1, 3, RepC3, [256]] # F4 (24), pan_blocks.0

  - [-1, 1, Conv, [256, 3, 2]] # 25, downsample_convs.1
  - [[-1, 12], 1, Concat, [1]] # cat Y5
  - [-1, 3, RepC3, [256]] # F5 (27), pan_blocks.1

  - [[21, 24, 27], 1, RTDETRDecoder, [nc]] # Detect(P3, P4, P5)
7. Verifying a Successful Run
Printing the network structure shows that FocusedLinearAttention and HGBlock_FLA have been added to the model, which can now be trained.
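For example, a minimal script to build and inspect the modified models (assuming the yaml files created in Section 6 and the registration from Section 5; the dataset yaml in the commented line is a placeholder for your own data):

from ultralytics import RTDETR

model = RTDETR('ultralytics/cfg/models/rt-detr/rtdetr-l-FLA.yaml')  # or rtdetr-l-HGBlock_FLA.yaml
model.info()  # prints the layer table and parameter/GFLOPs summary shown below
# model.train(data='my_dataset.yaml', epochs=100, imgsz=640, batch=4)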
rtdetr-l-FLA :
rtdetr-l-FLA summary: 688 layers, 37,010,883 parameters, 37,010,883 gradients, 111.4 GFLOPs
from n params module arguments
0 -1 1 25248 ultralytics.nn.modules.block.HGStem [3, 32, 48]
1 -1 6 155072 ultralytics.nn.modules.block.HGBlock [48, 48, 128, 3, 6]
2 -1 1 1408 ultralytics.nn.modules.conv.DWConv [128, 128, 3, 2, 1, False]
3 -1 6 839296 ultralytics.nn.modules.block.HGBlock [128, 96, 512, 3, 6]
4 -1 1 5632 ultralytics.nn.modules.conv.DWConv [512, 512, 3, 2, 1, False]
5 -1 6 1695360 ultralytics.nn.modules.block.HGBlock [512, 192, 1024, 5, 6, True, False]
6 -1 6 2055808 ultralytics.nn.modules.block.HGBlock [1024, 192, 1024, 5, 6, True, True]
7 -1 6 2055808 ultralytics.nn.modules.block.HGBlock [1024, 192, 1024, 5, 6, True, True]
8 -1 1 11264 ultralytics.nn.modules.conv.DWConv [1024, 1024, 3, 2, 1, False]
9 -1 1 4202752 ultralytics.nn.AddModules.FLA.FocusedLinearAttention[1024, 1024]
10 -1 6 6708480 ultralytics.nn.modules.block.HGBlock [1024, 384, 2048, 5, 6, True, False]
11 -1 1 524800 ultralytics.nn.modules.conv.Conv [2048, 256, 1, 1, None, 1, 1, False]
12 -1 1 789760 ultralytics.nn.modules.transformer.AIFI [256, 1024, 8]
13 -1 1 66048 ultralytics.nn.modules.conv.Conv [256, 256, 1, 1]
14 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
15 7 1 262656 ultralytics.nn.modules.conv.Conv [1024, 256, 1, 1, None, 1, 1, False]
16 [-2, -1] 1 0 ultralytics.nn.modules.conv.Concat [1]
17 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
18 -1 1 66048 ultralytics.nn.modules.conv.Conv [256, 256, 1, 1]
19 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
20 3 1 131584 ultralytics.nn.modules.conv.Conv [512, 256, 1, 1, None, 1, 1, False]
21 [-2, -1] 1 0 ultralytics.nn.modules.conv.Concat [1]
22 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
23 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
24 [-1, 18] 1 0 ultralytics.nn.modules.conv.Concat [1]
25 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
26 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
27 [-1, 13] 1 0 ultralytics.nn.modules.conv.Concat [1]
28 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
29 [22, 25, 28] 1 7303907 ultralytics.nn.modules.head.RTDETRDecoder [1, [256, 256, 256]]
rtdetr-l-FLA summary: 688 layers, 37,010,883 parameters, 37,010,883 gradients, 111.4 GFLOPs
rtdetr-l-HGBlock_FLA :
rtdetr-l-HGBlock_FLA summary: 703 layers, 45,416,387 parameters, 45,416,387 gradients, 148.5 GFLOPs
from n params module arguments
0 -1 1 25248 ultralytics.nn.modules.block.HGStem [3, 32, 48]
1 -1 6 155072 ultralytics.nn.modules.block.HGBlock [48, 48, 128, 3, 6]
2 -1 1 1408 ultralytics.nn.modules.conv.DWConv [128, 128, 3, 2, 1, False]
3 -1 6 839296 ultralytics.nn.modules.block.HGBlock [128, 96, 512, 3, 6]
4 -1 1 5632 ultralytics.nn.modules.conv.DWConv [512, 512, 3, 2, 1, False]
5 -1 6 5898112 ultralytics.nn.AddModules.FLA.HGBlock_FLA [512, 192, 1024, 5, 6, True, False]
6 -1 6 6258560 ultralytics.nn.AddModules.FLA.HGBlock_FLA [1024, 192, 1024, 5, 6, True, True]
7 -1 6 6258560 ultralytics.nn.AddModules.FLA.HGBlock_FLA [1024, 192, 1024, 5, 6, True, True]
8 -1 1 11264 ultralytics.nn.modules.conv.DWConv [1024, 1024, 3, 2, 1, False]
9 -1 6 6708480 ultralytics.nn.modules.block.HGBlock [1024, 384, 2048, 5, 6, True, False]
10 -1 1 524800 ultralytics.nn.modules.conv.Conv [2048, 256, 1, 1, None, 1, 1, False]
11 -1 1 789760 ultralytics.nn.modules.transformer.AIFI [256, 1024, 8]
12 -1 1 66048 ultralytics.nn.modules.conv.Conv [256, 256, 1, 1]
13 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
14 7 1 262656 ultralytics.nn.modules.conv.Conv [1024, 256, 1, 1, None, 1, 1, False]
15 [-2, -1] 1 0 ultralytics.nn.modules.conv.Concat [1]
16 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
17 -1 1 66048 ultralytics.nn.modules.conv.Conv [256, 256, 1, 1]
18 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
19 3 1 131584 ultralytics.nn.modules.conv.Conv [512, 256, 1, 1, None, 1, 1, False]
20 [-2, -1] 1 0 ultralytics.nn.modules.conv.Concat [1]
21 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
22 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
23 [-1, 17] 1 0 ultralytics.nn.modules.conv.Concat [1]
24 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
25 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
26 [-1, 12] 1 0 ultralytics.nn.modules.conv.Concat [1]
27 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
28 [21, 24, 27] 1 7303907 ultralytics.nn.modules.head.RTDETRDecoder [1, [256, 256, 256]]
rtdetr-l-HGBlock_FLA summary: 703 layers, 45,416,387 parameters, 45,416,387 gradients, 148.5 GFLOPs