RT-DETR Improvement Strategy [Backbone] | Replacing the Backbone with CVPR 2024 PKINet to Capture Multi-Scale Texture Features for Targets with Large Scale Variation
1. Introduction
This post documents an RT-DETR object-detection study that uses PKINet. In remote sensing images, target scales vary widely, so PKINet is introduced to capture multi-scale texture features. On top of RT-DETR, the three variants from the original paper, PKINET_T, PKINET_S, and PKINET_B, are configured to suit different needs.
2. How the PKI Module Works
Poly Kernel Inception Network for Remote Sensing Detection
The PKI Module is a key module of the Poly Kernel Inception Network (PKINet). Its design principle, structure, and advantages are as follows:
2.1 Principle
The PKI Module is an Inception-style module that extracts features at multiple scales by combining convolution kernels of different sizes. It first applies a small-kernel convolution to capture local information, then a set of parallel depth-wise convolutions to capture multi-scale contextual information. Features are thus extracted over several receptive fields, and local and contextual features are fused into a richer representation. This sidesteps the problems of single-scale kernels and dilated convolutions: a small kernel can miss long-range context, while a large kernel can introduce background noise or produce an overly sparse feature representation.
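As a rough, self-contained sketch of this idea (not the official implementation, which appears in Section 3; MiniPKIModule is a made-up name for illustration):
import torch
import torch.nn as nn

class MiniPKIModule(nn.Module):
    """Minimal sketch: small-kernel local conv + parallel depth-wise convs + 1x1 fusion."""
    def __init__(self, channels: int, dw_kernels=(5, 7, 9, 11)):
        super().__init__()
        self.local_conv = nn.Conv2d(channels, channels, 3, padding=1)  # local texture
        self.dw_convs = nn.ModuleList([  # parallel depth-wise convs: multi-scale context without dilation
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in dw_kernels
        ])
        self.fuse = nn.Conv2d(channels, channels, 1)  # 1x1 conv fuses across receptive fields

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = self.local_conv(x)
        context = sum(dw(local) for dw in self.dw_convs)
        return self.fuse(local + context)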
2.2 Structure
2.2.1 Local feature extraction
- For the PKI Module in the $n$-th PKI Block of stage $l$, a $k_s \times k_s$ convolution (with $k_s = 3$ in the experiments) first extracts local features from the input $X_{l-1,n}^{(2)}$, giving $L_{l-1,n} \in \mathbb{R}^{\frac{1}{2}C_l \times H_l \times W_l}$.
2.2.2 Multi-scale context feature extraction
- A set of parallel depth-wise convolutions $\mathrm{DWConv}_{k^{(m)} \times k^{(m)}}$ (for $m = 1, \cdots, 4$, with $k^{(m)} = (m + 1) \times 2 + 1$) is then applied to $L_{l-1,n}$, producing context features at different scales $Z_{l-1,n}^{(m)} \in \mathbb{R}^{\frac{1}{2}C_l \times H_l \times W_l}$.
2.2.3 Feature fusion
- The local feature $L_{l-1,n}$ and the summed multi-scale context features $\sum_{m=1}^{4} Z_{l-1,n}^{(m)}$ are added and fused through a $1 \times 1$ convolution, yielding the output feature $P_{l-1,n} \in \mathbb{R}^{\frac{1}{2}C_l \times H_l \times W_l}$. This $1 \times 1$ convolution performs channel fusion and integrates features from receptive fields of different sizes.
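Putting the three steps together, the PKI Module computes:

$$L_{l-1,n} = \mathrm{Conv}_{k_s \times k_s}\big(X_{l-1,n}^{(2)}\big), \qquad Z_{l-1,n}^{(m)} = \mathrm{DWConv}_{k^{(m)} \times k^{(m)}}\big(L_{l-1,n}\big), \quad m = 1, \dots, 4,$$

$$P_{l-1,n} = \mathrm{Conv}_{1 \times 1}\Big(L_{l-1,n} + \sum_{m=1}^{4} Z_{l-1,n}^{(m)}\Big).$$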
2.3 Advantages
- **Multi-scale feature extraction**: effectively captures multi-scale texture features, matching the large scale variation of targets in remote sensing images. Combining kernels of different sizes gathers local and contextual information at multiple scales and improves detection of targets of different sizes.
- **Avoids sparse features and noise**: dilated convolutions are not used, preventing overly sparse feature representations; compared with relying solely on large kernels, it avoids introducing excessive background noise, improving feature quality and detection accuracy.
- **Feature fusion**: the $1 \times 1$ convolution integrates features of different scales, preserving the integrity of local texture features while capturing broader context, which makes the extracted features more representative and discriminative.
Paper: https://arxiv.org/pdf/2403.06258
Code: https://github.com/NUST-Machine-Intelligence-Laboratory/PKINet
3. PKINet Implementation Code
The implementation of PKINet is as follows:
import math
from typing import Optional, Union, Sequence
import torch
import torch.nn as nn
# ConvModule and the weight-init helpers come from mmcv/mmengine; install both
# (e.g. pip install mmcv mmengine) before using this backbone.
try:
    from mmcv.cnn import ConvModule, build_norm_layer
    from mmengine.model import constant_init
    from mmengine.model.weight_init import trunc_normal_init, normal_init
except ImportError:
    pass
try:
    from mmengine.model import BaseModule
except ImportError:
    BaseModule = nn.Module
__all__ = ['PKINET_T', 'PKINET_S', 'PKINET_B']
def drop_path(x: torch.Tensor,
drop_prob: float = 0.,
training: bool = False) -> torch.Tensor:
"""Drop paths (Stochastic Depth) per sample (when applied in main path of
residual blocks).
We follow the implementation
https://github.com/rwightman/pytorch-image-models/blob/a2727c1bf78ba0d7b5727f5f95e37fb7f8866b1f/timm/models/layers/drop.py # noqa: E501
"""
if drop_prob == 0. or not training:
return x
keep_prob = 1 - drop_prob
# handle tensors with different dimensions, not just 4D tensors.
shape = (x.shape[0], ) + (1, ) * (x.ndim - 1)
random_tensor = keep_prob + torch.rand(
shape, dtype=x.dtype, device=x.device)
output = x.div(keep_prob) * random_tensor.floor()
return output
class DropPath(nn.Module):
"""Drop paths (Stochastic Depth) per sample (when applied in main path of
residual blocks).
We follow the implementation
https://github.com/rwightman/pytorch-image-models/blob/a2727c1bf78ba0d7b5727f5f95e37fb7f8866b1f/timm/models/layers/drop.py # noqa: E501
Args:
drop_prob (float): Probability of the path to be zeroed. Default: 0.1
"""
def __init__(self, drop_prob: float = 0.1):
super().__init__()
self.drop_prob = drop_prob
def forward(self, x: torch.Tensor) -> torch.Tensor:
return drop_path(x, self.drop_prob, self.training)
def autopad(kernel_size: int, padding: int = None, dilation: int = 1):
assert kernel_size % 2 == 1, 'if use autopad, kernel size must be odd'
if dilation > 1:
kernel_size = dilation * (kernel_size - 1) + 1
if padding is None:
padding = kernel_size // 2
return padding
def make_divisible(value, divisor, min_value=None, min_ratio=0.9):
"""Make divisible function.
This function rounds the channel number to the nearest value that can be
divisible by the divisor. It is taken from the original tf repo. It ensures
that all layers have a channel number that is divisible by divisor. It can
be seen here: https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py # noqa
Args:
value (int, float): The original channel number.
divisor (int): The divisor to fully divide the channel number.
min_value (int): The minimum value of the output channel.
Default: None, means that the minimum value equal to the divisor.
min_ratio (float): The minimum ratio of the rounded channel number to
the original channel number. Default: 0.9.
Returns:
int: The modified output channel number.
"""
if min_value is None:
min_value = divisor
new_value = max(min_value, int(value + divisor / 2) // divisor * divisor)
# Make sure that round down does not go down by more than (1-min_ratio).
if new_value < min_ratio * value:
new_value += divisor
return new_value
class BCHW2BHWC(nn.Module):
def __init__(self):
super().__init__()
@staticmethod
def forward(x):
return x.permute([0, 2, 3, 1])
class BHWC2BCHW(nn.Module):
def __init__(self):
super().__init__()
@staticmethod
def forward(x):
return x.permute([0, 3, 1, 2])
class GSiLU(nn.Module):
"""Global Sigmoid-Gated Linear Unit, reproduced from paper <SIMPLE CNN FOR VISION>"""
def __init__(self):
super().__init__()
self.adpool = nn.AdaptiveAvgPool2d(1)
def forward(self, x):
return x * torch.sigmoid(self.adpool(x))
class CAA(BaseModule):
"""Context Anchor Attention"""
def __init__(
self,
channels: int,
h_kernel_size: int = 11,
v_kernel_size: int = 11,
norm_cfg: Optional[dict] = dict(type='BN', momentum=0.03, eps=0.001),
act_cfg: Optional[dict] = dict(type='SiLU'),
init_cfg: Optional[dict] = None,
):
super().__init__(init_cfg)
self.avg_pool = nn.AvgPool2d(7, 1, 3)
self.conv1 = ConvModule(channels, channels, 1, 1, 0,
norm_cfg=norm_cfg, act_cfg=act_cfg)
self.h_conv = ConvModule(channels, channels, (1, h_kernel_size), 1,
(0, h_kernel_size // 2), groups=channels,
norm_cfg=None, act_cfg=None)
self.v_conv = ConvModule(channels, channels, (v_kernel_size, 1), 1,
(v_kernel_size // 2, 0), groups=channels,
norm_cfg=None, act_cfg=None)
self.conv2 = ConvModule(channels, channels, 1, 1, 0,
norm_cfg=norm_cfg, act_cfg=act_cfg)
self.act = nn.Sigmoid()
def forward(self, x):
attn_factor = self.act(self.conv2(self.v_conv(self.h_conv(self.conv1(self.avg_pool(x))))))
return attn_factor
class ConvFFN(BaseModule):
"""Multi-layer perceptron implemented with ConvModule"""
def __init__(
self,
in_channels: int,
out_channels: Optional[int] = None,
hidden_channels_scale: float = 4.0,
hidden_kernel_size: int = 3,
dropout_rate: float = 0.,
add_identity: bool = True,
norm_cfg: Optional[dict] = dict(type='BN', momentum=0.03, eps=0.001),
act_cfg: Optional[dict] = dict(type='SiLU'),
init_cfg: Optional[dict] = None,
):
super().__init__(init_cfg)
out_channels = out_channels or in_channels
hidden_channels = int(in_channels * hidden_channels_scale)
self.ffn_layers = nn.Sequential(
BCHW2BHWC(),
nn.LayerNorm(in_channels),
BHWC2BCHW(),
ConvModule(in_channels, hidden_channels, kernel_size=1, stride=1, padding=0,
norm_cfg=norm_cfg, act_cfg=act_cfg),
ConvModule(hidden_channels, hidden_channels, kernel_size=hidden_kernel_size, stride=1,
padding=hidden_kernel_size // 2, groups=hidden_channels,
norm_cfg=norm_cfg, act_cfg=None),
GSiLU(),
nn.Dropout(dropout_rate),
ConvModule(hidden_channels, out_channels, kernel_size=1, stride=1, padding=0,
norm_cfg=norm_cfg, act_cfg=act_cfg),
nn.Dropout(dropout_rate),
)
self.add_identity = add_identity
def forward(self, x):
x = x + self.ffn_layers(x) if self.add_identity else self.ffn_layers(x)
return x
class Stem(BaseModule):
"""Stem layer"""
def __init__(
self,
in_channels: int,
out_channels: int,
expansion: float = 1.0,
norm_cfg: Optional[dict] = dict(type='BN', momentum=0.03, eps=0.001),
act_cfg: Optional[dict] = dict(type='SiLU'),
init_cfg: Optional[dict] = None,
):
super().__init__(init_cfg)
hidden_channels = make_divisible(int(out_channels * expansion), 8)
self.down_conv = ConvModule(in_channels, hidden_channels, kernel_size=3, stride=2, padding=1,
norm_cfg=norm_cfg, act_cfg=act_cfg)
self.conv1 = ConvModule(hidden_channels, hidden_channels, kernel_size=3, stride=1, padding=1,
norm_cfg=norm_cfg, act_cfg=act_cfg)
self.conv2 = ConvModule(hidden_channels, out_channels, kernel_size=3, stride=1, padding=1,
norm_cfg=norm_cfg, act_cfg=act_cfg)
def forward(self, x):
return self.conv2(self.conv1(self.down_conv(x)))
class DownSamplingLayer(BaseModule):
"""Down sampling layer"""
def __init__(
self,
in_channels: int,
out_channels: Optional[int] = None,
norm_cfg: Optional[dict] = dict(type='BN', momentum=0.03, eps=0.001),
act_cfg: Optional[dict] = dict(type='SiLU'),
init_cfg: Optional[dict] = None,
):
super().__init__(init_cfg)
out_channels = out_channels or (in_channels * 2)
self.down_conv = ConvModule(in_channels, out_channels, kernel_size=3, stride=2, padding=1,
norm_cfg=norm_cfg, act_cfg=act_cfg)
def forward(self, x):
return self.down_conv(x)
class InceptionBottleneck(BaseModule):
"""Bottleneck with Inception module"""
def __init__(
self,
in_channels: int,
out_channels: Optional[int] = None,
kernel_sizes: Sequence[int] = (3, 5, 7, 9, 11),
dilations: Sequence[int] = (1, 1, 1, 1, 1),
expansion: float = 1.0,
add_identity: bool = True,
with_caa: bool = True,
caa_kernel_size: int = 11,
norm_cfg: Optional[dict] = dict(type='BN', momentum=0.03, eps=0.001),
act_cfg: Optional[dict] = dict(type='SiLU'),
init_cfg: Optional[dict] = None,
):
super().__init__(init_cfg)
out_channels = out_channels or in_channels
hidden_channels = make_divisible(int(out_channels * expansion), 8)
self.pre_conv = ConvModule(in_channels, hidden_channels, 1, 1, 0, 1,
norm_cfg=norm_cfg, act_cfg=act_cfg)
self.dw_conv = ConvModule(hidden_channels, hidden_channels, kernel_sizes[0], 1,
autopad(kernel_sizes[0], None, dilations[0]), dilations[0],
groups=hidden_channels, norm_cfg=None, act_cfg=None)
self.dw_conv1 = ConvModule(hidden_channels, hidden_channels, kernel_sizes[1], 1,
autopad(kernel_sizes[1], None, dilations[1]), dilations[1],
groups=hidden_channels, norm_cfg=None, act_cfg=None)
self.dw_conv2 = ConvModule(hidden_channels, hidden_channels, kernel_sizes[2], 1,
autopad(kernel_sizes[2], None, dilations[2]), dilations[2],
groups=hidden_channels, norm_cfg=None, act_cfg=None)
self.dw_conv3 = ConvModule(hidden_channels, hidden_channels, kernel_sizes[3], 1,
autopad(kernel_sizes[3], None, dilations[3]), dilations[3],
groups=hidden_channels, norm_cfg=None, act_cfg=None)
self.dw_conv4 = ConvModule(hidden_channels, hidden_channels, kernel_sizes[4], 1,
autopad(kernel_sizes[4], None, dilations[4]), dilations[4],
groups=hidden_channels, norm_cfg=None, act_cfg=None)
self.pw_conv = ConvModule(hidden_channels, hidden_channels, 1, 1, 0, 1,
norm_cfg=norm_cfg, act_cfg=act_cfg)
if with_caa:
self.caa_factor = CAA(hidden_channels, caa_kernel_size, caa_kernel_size, None, None)
else:
self.caa_factor = None
self.add_identity = add_identity and in_channels == out_channels
self.post_conv = ConvModule(hidden_channels, out_channels, 1, 1, 0, 1,
norm_cfg=norm_cfg, act_cfg=act_cfg)
def forward(self, x):
x = self.pre_conv(x)
y = x # if there is an inplace operation of x, use y = x.clone() instead of y = x
x = self.dw_conv(x)
x = x + self.dw_conv1(x) + self.dw_conv2(x) + self.dw_conv3(x) + self.dw_conv4(x)
x = self.pw_conv(x)
if self.caa_factor is not None:
y = self.caa_factor(y)
if self.add_identity:
y = x * y
x = x + y
else:
x = x * y
x = self.post_conv(x)
return x
class PKIBlock(BaseModule):
"""Poly Kernel Inception Block"""
def __init__(
self,
in_channels: int,
out_channels: Optional[int] = None,
kernel_sizes: Sequence[int] = (3, 5, 7, 9, 11),
dilations: Sequence[int] = (1, 1, 1, 1, 1),
with_caa: bool = True,
caa_kernel_size: int = 11,
expansion: float = 1.0,
ffn_scale: float = 4.0,
ffn_kernel_size: int = 3,
dropout_rate: float = 0.,
drop_path_rate: float = 0.,
layer_scale: Optional[float] = 1.0,
add_identity: bool = True,
norm_cfg: Optional[dict] = dict(type='BN', momentum=0.03, eps=0.001),
act_cfg: Optional[dict] = dict(type='SiLU'),
init_cfg: Optional[dict] = None,
):
super().__init__(init_cfg)
out_channels = out_channels or in_channels
hidden_channels = make_divisible(int(out_channels * expansion), 8)
if norm_cfg is not None:
self.norm1 = build_norm_layer(norm_cfg, in_channels)[1]
self.norm2 = build_norm_layer(norm_cfg, hidden_channels)[1]
else:
self.norm1 = nn.BatchNorm2d(in_channels)
self.norm2 = nn.BatchNorm2d(hidden_channels)
self.block = InceptionBottleneck(in_channels, hidden_channels, kernel_sizes, dilations,
expansion=1.0, add_identity=True,
with_caa=with_caa, caa_kernel_size=caa_kernel_size,
norm_cfg=norm_cfg, act_cfg=act_cfg)
self.ffn = ConvFFN(hidden_channels, out_channels, ffn_scale, ffn_kernel_size, dropout_rate, add_identity=False,
norm_cfg=None, act_cfg=None)
self.drop_path = DropPath(drop_path_rate) if drop_path_rate > 0 else nn.Identity()
self.layer_scale = layer_scale
if self.layer_scale:
self.gamma1 = nn.Parameter(layer_scale * torch.ones(hidden_channels), requires_grad=True)
self.gamma2 = nn.Parameter(layer_scale * torch.ones(out_channels), requires_grad=True)
self.add_identity = add_identity and in_channels == out_channels
def forward(self, x):
if self.layer_scale:
if self.add_identity:
x = x + self.drop_path(self.gamma1.unsqueeze(-1).unsqueeze(-1) * self.block(self.norm1(x)))
x = x + self.drop_path(self.gamma2.unsqueeze(-1).unsqueeze(-1) * self.ffn(self.norm2(x)))
else:
x = self.drop_path(self.gamma1.unsqueeze(-1).unsqueeze(-1) * self.block(self.norm1(x)))
x = self.drop_path(self.gamma2.unsqueeze(-1).unsqueeze(-1) * self.ffn(self.norm2(x)))
else:
if self.add_identity:
x = x + self.drop_path(self.block(self.norm1(x)))
x = x + self.drop_path(self.ffn(self.norm2(x)))
else:
x = self.drop_path(self.block(self.norm1(x)))
x = self.drop_path(self.ffn(self.norm2(x)))
return x
class PKIStage(BaseModule):
"""Poly Kernel Inception Stage"""
def __init__(
self,
in_channels: int,
out_channels: int,
num_blocks: int,
kernel_sizes: Sequence[int] = (3, 5, 7, 9, 11),
dilations: Sequence[int] = (1, 1, 1, 1, 1),
expansion: float = 0.5,
ffn_scale: float = 4.0,
ffn_kernel_size: int = 3,
dropout_rate: float = 0.,
drop_path_rate: Union[float, list] = 0.,
layer_scale: Optional[float] = 1.0,
shortcut_with_ffn: bool = True,
shortcut_ffn_scale: float = 4.0,
shortcut_ffn_kernel_size: int = 5,
add_identity: bool = True,
with_caa: bool = True,
caa_kernel_size: int = 11,
norm_cfg: Optional[dict] = dict(type='BN', momentum=0.03, eps=0.001),
act_cfg: Optional[dict] = dict(type='SiLU'),
init_cfg: Optional[dict] = None,
):
super().__init__(init_cfg)
hidden_channels = make_divisible(int(out_channels * expansion), 8)
self.downsample = DownSamplingLayer(in_channels, out_channels, norm_cfg, act_cfg)
self.conv1 = ConvModule(out_channels, 2 * hidden_channels, kernel_size=1, stride=1, padding=0, dilation=1,
norm_cfg=norm_cfg, act_cfg=act_cfg)
self.conv2 = ConvModule(2 * hidden_channels, out_channels, kernel_size=1, stride=1, padding=0, dilation=1,
norm_cfg=norm_cfg, act_cfg=act_cfg)
self.conv3 = ConvModule(out_channels, out_channels, kernel_size=1, stride=1, padding=0, dilation=1,
norm_cfg=norm_cfg, act_cfg=act_cfg)
self.ffn = ConvFFN(hidden_channels, hidden_channels, shortcut_ffn_scale, shortcut_ffn_kernel_size, 0.,
add_identity=True, norm_cfg=None, act_cfg=None) if shortcut_with_ffn else None
self.blocks = nn.ModuleList([
PKIBlock(hidden_channels, hidden_channels, kernel_sizes, dilations, with_caa,
caa_kernel_size+2*i, 1.0, ffn_scale, ffn_kernel_size, dropout_rate,
drop_path_rate[i] if isinstance(drop_path_rate, list) else drop_path_rate,
layer_scale, add_identity, norm_cfg, act_cfg) for i in range(num_blocks)
])
def forward(self, x):
x = self.downsample(x)
x, y = list(self.conv1(x).chunk(2, 1))
if self.ffn is not None:
x = self.ffn(x)
z = [x]
t = torch.zeros(y.shape, device=y.device, dtype=x.dtype)
for block in self.blocks:
t = t + block(y)
z.append(t)
z = torch.cat(z, dim=1)
z = self.conv2(z)
z = self.conv3(z)
return z
class PKINet(BaseModule):
"""Poly Kernel Inception Network"""
arch_settings = {
# from left to right: (indices)
# in_channels(0), out_channels(1), num_blocks(2), kernel_sizes(3), dilations(4), expansion(5),
# ffn_scale(6), ffn_kernel_size(7), dropout_rate(8), layer_scale(9), shortcut_with_ffn(10),
# shortcut_ffn_scale(11), shortcut_ffn_kernel_size(12), add_identity(13), with_caa(14), caa_kernel_size(15)
'T': [[16, 32, 4, (3, 5, 7, 9, 11), (1, 1, 1, 1, 1), 0.5, 4.0, 3, 0.1, 1.0, True, 8.0, 5, True, True, 11],
[32, 64, 14, (3, 5, 7, 9, 11), (1, 1, 1, 1, 1), 0.5, 4.0, 3, 0.1, 1.0, True, 8.0, 7, True, True, 11],
[64, 128, 22, (3, 5, 7, 9, 11), (1, 1, 1, 1, 1), 0.5, 4.0, 3, 0.1, 1.0, True, 4.0, 9, True, True, 11],
[128, 256, 4, (3, 5, 7, 9, 11), (1, 1, 1, 1, 1), 0.5, 4.0, 3, 0.1, 1.0, True, 4.0, 11, True, True, 11]],
'S': [[32, 64, 4, (3, 5, 7, 9, 11), (1, 1, 1, 1, 1), 0.5, 4.0, 3, 0.1, 1.0, True, 8.0, 5, True, True, 11],
[64, 128, 12, (3, 5, 7, 9, 11), (1, 1, 1, 1, 1), 0.5, 4.0, 3, 0.1, 1.0, True, 8.0, 7, True, True, 11],
[128, 256, 20, (3, 5, 7, 9, 11), (1, 1, 1, 1, 1), 0.5, 4.0, 3, 0.1, 1.0, True, 4.0, 9, True, True, 11],
[256, 512, 4, (3, 5, 7, 9, 11), (1, 1, 1, 1, 1), 0.5, 4.0, 3, 0.1, 1.0, True, 4.0, 11, True, True, 11]],
'B': [[40, 80, 6, (3, 5, 7, 9, 11), (1, 1, 1, 1, 1), 0.5, 4.0, 3, 0.1, 1.0, True, 8.0, 5, True, True, 11],
[80, 160, 16, (3, 5, 7, 9, 11), (1, 1, 1, 1, 1), 0.5, 4.0, 3, 0.1, 1.0, True, 8.0, 7, True, True, 11],
[160, 320, 24, (3, 5, 7, 9, 11), (1, 1, 1, 1, 1), 0.5, 4.0, 3, 0.1, 1.0, True, 4.0, 9, True, True, 11],
[320, 640, 6, (3, 5, 7, 9, 11), (1, 1, 1, 1, 1), 0.5, 4.0, 3, 0.1, 1.0, True, 4.0, 11, True, True, 11]],
}
def __init__(
self,
arch: str = 'S',
out_indices: Sequence[int] = (0, 1, 2, 3, 4),
drop_path_rate: float = 0.1,
frozen_stages: int = -1,
norm_eval: bool = False,
arch_setting: Optional[Sequence[list]] = None,
norm_cfg: Optional[dict] = dict(type='BN', momentum=0.03, eps=0.001),
act_cfg: Optional[dict] = dict(type='SiLU'),
init_cfg: Optional[dict] = dict(type='Kaiming',
layer='Conv2d',
a=math.sqrt(5),
distribution='uniform',
mode='fan_in',
nonlinearity='leaky_relu'),
):
super().__init__(init_cfg=init_cfg)
arch_setting = arch_setting or self.arch_settings[arch]
assert set(out_indices).issubset(i for i in range(len(arch_setting) + 1))
if frozen_stages not in range(-1, len(arch_setting) + 1):
raise ValueError(f'frozen_stages must be in range(-1, len(arch_setting) + 1). But received {frozen_stages}')
self.out_indices = out_indices
self.frozen_stages = frozen_stages
self.norm_eval = norm_eval
self.stages = nn.ModuleList()
self.stem = Stem(3, arch_setting[0][0], expansion=1.0, norm_cfg=norm_cfg, act_cfg=act_cfg)
self.stages.append(self.stem)
depths = [x[2] for x in arch_setting]
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]
for i, (in_channels, out_channels, num_blocks, kernel_sizes, dilations, expansion, ffn_scale, ffn_kernel_size,
dropout_rate, layer_scale, shortcut_with_ffn, shortcut_ffn_scale, shortcut_ffn_kernel_size,
add_identity, with_caa, caa_kernel_size) in enumerate(arch_setting):
stage = PKIStage(in_channels, out_channels, num_blocks, kernel_sizes, dilations, expansion,
ffn_scale, ffn_kernel_size, dropout_rate, dpr[sum(depths[:i]):sum(depths[:i + 1])],
layer_scale, shortcut_with_ffn, shortcut_ffn_scale, shortcut_ffn_kernel_size,
add_identity, with_caa, caa_kernel_size, norm_cfg, act_cfg)
self.stages.append(stage)
self.init_weights()
self.channel = [i.size(1) for i in self.forward(torch.randn(1, 3, 640, 640))]
def forward(self, x):
outs = []
for i, stage in enumerate(self.stages):
x = stage(x)
if i in self.out_indices:
outs.append(x)
return tuple(outs)
def init_weights(self):
if self.init_cfg is None:
for m in self.modules():
if isinstance(m, nn.Linear):
trunc_normal_init(m, std=.02, bias=0.)
elif isinstance(m, nn.LayerNorm):
constant_init(m, val=1.0, bias=0.)
elif isinstance(m, nn.Conv2d):
fan_out = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
fan_out //= m.groups
normal_init(m, mean=0, std=math.sqrt(2.0 / fan_out), bias=0)
else:
super().init_weights()
def PKINET_T():
return PKINet('T')
def PKINET_S():
return PKINet('S')
def PKINET_B():
return PKINet('B')
if __name__ == '__main__':
    model = PKINET_T()
    inputs = torch.randn((1, 3, 640, 640))
    res = model(inputs)
    for i in res:
        print(i.size())
4. Modification Steps
4.1 Step 1
① Create an AddModules folder under the ultralytics/nn/ directory to hold the module code.
② In the AddModules folder, create PKINet.py and paste in the code from Section 3.
4.2 Step 2
In the AddModules folder, create __init__.py (skip this if it already exists) and import the module inside it:
from .PKINet import *
4.3 Step 3
In the ultralytics/nn/tasks.py file, the module class names must be added in two places.
① First, import the modules at the top of the file.
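A minimal version, assuming the AddModules layout from step 4.1:
from ultralytics.nn.AddModules import PKINET_T, PKINET_S, PKINET_B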
② Second, add the following two lines near the top of the per-layer loop in the parse_model function:
is_backbone = False
t = m
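For orientation, a simplified sketch of where these lines sit (the loop header is from ultralytics' parse_model; the exact surrounding code may vary by version):
for i, (f, n, m, args) in enumerate(d['backbone'] + d['head']):  # iterate YAML layer specs
    is_backbone = False  # reset the flag for each layer entry
    t = m                # keep the class handle before m may be replaced by an instance
    ...                  # existing branches that resolve m and args follow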
③ Next, add the following branch in the same function, alongside the other module-type checks:
elif m in {PKINET_T, PKINET_S, PKINET_B}:
m = m(*args)
c2 = m.channel
backbone = True
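Here m.channel is filled in by the dummy forward pass in PKINet.__init__ (torch.randn(1, 3, 640, 640)), so c2 ends up holding the per-stage output channel list that the head layers consume.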
④ Then, replace the module-construction and bookkeeping code at the end of the loop with all of the following:
if isinstance(c2, list):
is_backbone = True
m_ = m
m_.backbone = True
else:
m_ = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args) # module
    t = str(t)[8:-2].replace('__main__.', '')  # module type (use the class saved in t, since m may now be an instance)
m.np = sum(x.numel() for x in m_.parameters()) # number params
m_.i, m_.f, m_.type = i + 4 if is_backbone else i, f, t # attach index, 'from' index, type
if verbose:
LOGGER.info(f'{i:>3}{str(f):>20}{n_:>3}{m.np:10.0f} {t:<45}{str(args):<30}') # print
save.extend(x % (i + 4 if is_backbone else i) for x in ([f] if isinstance(f, int) else f) if
x != -1) # append to savelist
layers.append(m_)
if i == 0:
ch = []
if isinstance(c2, list):
ch.extend(c2)
for _ in range(5 - len(ch)):
ch.insert(0, 0)
else:
ch.append(c2)
⑤ Finally, find _predict_once in BaseModel in this file and replace it with the following code.
def _predict_once(self, x, profile=False, visualize=False, embed=None):
y, dt, embeddings = [], [], [] # outputs
for m in self.model:
if m.f != -1: # if not from previous layer
x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f] # from earlier layers
if profile:
self._profile_one_layer(m, x, dt)
if hasattr(m, 'backbone'):
            x = list(m(x))  # the backbone returns one feature map per stage
            if len(x) != 5:  # pad to 5 entries (stage indices 0-4)
                x.insert(0, None)
for index, i in enumerate(x):
if index in self.save:
y.append(i)
else:
y.append(None)
            x = x[-1]  # pass the last output on to the next layer
else:
x = m(x) # run
y.append(x if m.i in self.save else None) # save output
if visualize:
feature_visualization(x, m.type, m.i, save_dir=visualize)
if embed and m.i in embed:
embeddings.append(nn.functional.adaptive_avg_pool2d(x, (1, 1)).squeeze(-1).squeeze(-1)) # flatten
if m.i == max(embed):
return torch.unbind(torch.cat(embeddings, 1), dim=0)
return x
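The backbone emits one feature map per stage (stem plus four stages, indices 0-4), and padding the output list to five entries keeps the head's from-indices in the YAML (e.g. layers 2 and 3) aligned with those stage outputs.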
With that, the modifications are complete and you can configure the model and start training.
5. YAML Model File
5.1 Model Improvement ⭐
Once the code changes are in place, configure the model's YAML file. Taking ultralytics/cfg/models/rt-detr/rtdetr-l.yaml as an example, create a model file for your own dataset, rtdetr-l-PKINet.yaml, in the same directory. Copy the contents of rtdetr-l.yaml into rtdetr-l-PKINet.yaml and set nc to the number of target classes in your dataset.
📌 The model is modified by replacing the backbone with PKINET_T.
# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-l object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/models/rtdetr
# Parameters
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
# [depth, width, max_channels]
l: [1.00, 1.00, 1024]
backbone:
# [from, repeats, module, args]
- [-1, 1, PKINET_T, []] # 4
head:
- [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 5 input_proj.2
- [-1, 1, AIFI, [1024, 8]] # 6
- [-1, 1, Conv, [256, 1, 1]] # 7, Y5, lateral_convs.0
- [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 8
- [3, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 9 input_proj.1
- [[-2, -1], 1, Concat, [1]] # 10
- [-1, 3, RepC3, [256]] # 11, fpn_blocks.0
- [-1, 1, Conv, [256, 1, 1]] # 12, Y4, lateral_convs.1
- [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 13
- [2, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 14 input_proj.0
- [[-2, -1], 1, Concat, [1]] # 15 cat backbone P4
- [-1, 3, RepC3, [256]] # X3 (16), fpn_blocks.1
- [-1, 1, Conv, [256, 3, 2]] # 17, downsample_convs.0
- [[-1, 12], 1, Concat, [1]] # 18 cat Y4
- [-1, 3, RepC3, [256]] # F4 (19), pan_blocks.0
- [-1, 1, Conv, [256, 3, 2]] # 20, downsample_convs.1
- [[-1, 7], 1, Concat, [1]] # 21 cat Y5
- [-1, 3, RepC3, [256]] # F5 (22), pan_blocks.1
- [[16, 19, 22], 1, RTDETRDecoder, [nc]] # Detect(P3, P4, P5)
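With the YAML in place, training can be launched through the standard ultralytics API; the dataset config path below is a placeholder:
from ultralytics import RTDETR

# Load the modified model definition and train on your own dataset config.
model = RTDETR('ultralytics/cfg/models/rt-detr/rtdetr-l-PKINet.yaml')
model.train(data='data.yaml', epochs=100, imgsz=640)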
6. Successful Run Results
Printing the network model shows that PKINET_T has been added to the model and training can proceed.
rtdetr-l-PKINet:
0 -1 1 4156896 PKINET_T []
1 -1 1 66048 ultralytics.nn.modules.conv.Conv [256, 256, 1, 1, None, 1, 1, False]
2 -1 1 789760 ultralytics.nn.modules.transformer.AIFI [256, 1024, 8]
3 -1 1 66048 ultralytics.nn.modules.conv.Conv [256, 256, 1, 1]
4 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
5 3 1 33280 ultralytics.nn.modules.conv.Conv [128, 256, 1, 1, None, 1, 1, False]
6 [-2, -1] 1 0 ultralytics.nn.modules.conv.Concat [1]
7 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
8 -1 1 66048 ultralytics.nn.modules.conv.Conv [256, 256, 1, 1]
9 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
10 2 1 16896 ultralytics.nn.modules.conv.Conv [64, 256, 1, 1, None, 1, 1, False]
11 [-2, -1] 1 0 ultralytics.nn.modules.conv.Concat [1]
12 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
13 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
14 [-1, 12] 1 0 ultralytics.nn.modules.conv.Concat [1]
15 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
16 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
17 [-1, 7] 1 0 ultralytics.nn.modules.conv.Concat [1]
18 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
19 [16, 19, 22] 1 7303907 ultralytics.nn.modules.head.RTDETRDecoder [1, [256, 256, 256]]
rtdetr-l-PKINet summary: 2,853 layers, 22,608,835 parameters, 22,608,835 gradients, 80.5 GFLOPs