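💡💡💡 What this article changes: multi-scale convolution kernels. Unlike earlier methods that rely on large kernels or dilated convolutions, PKINet uses several depthwise convolution kernels of different sizes. Advantage: it extracts multi-scale texture features across different receptive fields without dilation. The idea comes from PKINet (CVPR 2024).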
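💡💡💡 On a small-object dataset the change gains nearly two points, so it is well suited to small-object detection. Strongly recommended.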
The modified structure is shown below:
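💡💡💡 First published here as an original modification; suitable for papers!
💡💡💡 Ideas from the 2024 top computer vision conferences, applicable to YOLOv5, YOLOv7, YOLOv8 and the other YOLO variants. The articles in this column give every step together with the source code, so you can start modifying networks right away.
💡💡💡 Key point: after reading this column you will be able to design your own modifications at different positions in the network (backbone, head, detect, loss, and so on).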
1. How PKINet works

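Paper: 2403.06258.pdf (arxiv.org)
Abstract: Object detection in remote sensing images (RSIs) faces several growing challenges, including large variation in object scale and widely varying context. Earlier methods tried to address these challenges by enlarging the spatial receptive field of the backbone, through either large-kernel convolution or dilated convolution. However, the former typically introduces considerable background noise, while the latter risks producing overly sparse feature representations. In this paper, we introduce the Poly Kernel Inception Network (PKINet) to handle these challenges. PKINet uses multi-scale convolution kernels without dilation to extract object features at varying scales and capture local context. In addition, a Context Anchor Attention (CAA) module is introduced in parallel to capture long-range contextual information. The two components work together to improve the performance of PKINet on four challenging remote sensing detection benchmarks: DOTA-v1.0, DOTA-v1.5, HRSC2016, and DIOR-R.
On DOTA-v1.0 [64], our method uses fewer parameters and delivers consistent performance gains across various remote sensing detectors [10,20,59,65,71].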

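Figure below: networks with small kernels miss long-range context when detecting large objects, while networks with large kernels introduce noise when detecting small objects. The multi-scale convolution in PKINet handles this scale variation well.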
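To put numbers on the trade-off in the caption, here is a small arithmetic sketch. It only counts kernel positions; nothing is measured. The dilated 3x3 kernel with dilation 5 is a hypothetical comparison point chosen because it spans the same 11x11 window as the largest PKINet kernel, and the helper names are made up for this example.

```python
def effective_extent(kernel_size: int, dilation: int = 1) -> int:
    """Side length of the window that a (possibly dilated) square kernel spans."""
    return dilation * (kernel_size - 1) + 1


def sampling_density(kernel_size: int, dilation: int = 1) -> float:
    """Fraction of positions inside the spanned window that the kernel reads."""
    extent = effective_extent(kernel_size, dilation)
    return kernel_size ** 2 / extent ** 2


poly_kernels = (3, 5, 7, 9, 11)                    # dilation-free kernel sizes used by PKINet
poly_weights = sum(k * k for k in poly_kernels)    # depthwise conv: k*k weights per channel
large_weights = 11 * 11                            # one dense 11x11 depthwise kernel
dilated_extent = effective_extent(3, dilation=5)   # hypothetical dilated 3x3 kernel
dilated_density = sampling_density(3, dilation=5)

print(poly_weights)               # 285
print(large_weights)              # 121
print(dilated_extent)             # 11
print(round(dilated_density, 3))  # 0.074
```

The dilated kernel reaches an 11x11 window but reads only 9 of its 121 positions, which is the sparsity the paper warns about. The five dense kernels cost more weights per channel than one 11x11 kernel, but every scale from 3x3 to 11x11 is sampled densely.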

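First, PKINet does not rely on large-kernel or dilated convolutions to enlarge the receptive field. Instead, it uses dilation-free, Inception-style depthwise convolutions to extract multi-scale texture features across different receptive fields. Second, the method adds a Context Anchor Attention (CAA) mechanism to capture long-range contextual information. The two components work together to extract features adaptively from both local and global context, which improves remote sensing object detection.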
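The CAA class listed in section 2.1 builds its long-range context from a 7x7 average pool followed by a 1x11 and an 11x1 depthwise convolution. The sketch below works out, by arithmetic only, how large a window each attention value sees and how many weights the two strips need compared with a dense 11x11 depthwise kernel. The dense kernel is a hypothetical comparison and does not appear in the code; biases and the 1x1 convolutions are ignored.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field along one axis for stride-1, dilation-1 layers applied in sequence."""
    return 1 + sum(k - 1 for k in kernel_sizes)


pool_k, strip_k = 7, 11  # AvgPool2d(7, 1, 3) and the default h/v kernel size in CAA

# Along the width, the pool and the 1 x 11 conv contribute; the 11 x 1 conv has width 1.
rf_w = stacked_receptive_field([pool_k, strip_k, 1])
# Along the height, the pool and the 11 x 1 conv contribute; the 1 x 11 conv has height 1.
rf_h = stacked_receptive_field([pool_k, 1, strip_k])

strip_weights = 1 * strip_k + strip_k * 1  # depthwise weights per channel for both strips
full_weights = strip_k * strip_k           # hypothetical dense 11 x 11 depthwise kernel

print((rf_h, rf_w))   # (17, 17)
print(strip_weights)  # 22
print(full_weights)   # 121
```

With the default sizes, each attention value draws on a 17x17 neighbourhood while the two strip convolutions together hold 22 weights per channel.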

2. Adding PKIBlock to YOLOv8
2.1 Add ultralytics/nn/backbone/pkinet.py
import math
from typing import Optional, Union, Sequence

import torch
import torch.nn as nn
from torch.nn.modules.batchnorm import _BatchNorm
from mmcv.cnn import ConvModule, build_norm_layer
from mmcv.cnn.bricks import DropPath
from mmengine.model import BaseModule, constant_init
from mmengine.model.weight_init import trunc_normal_init, normal_init
from mmengine.logging import MMLogger
# autopad is defined locally below, so it is not imported from ultralytics
from ultralytics.nn.modules.conv import Conv
from ultralytics.nn.modules.block import C3, C2f


def autopad(kernel_size: int, padding: int = None, dilation: int = 1):
    """Return the padding that keeps the spatial size unchanged at stride 1."""
    assert kernel_size % 2 == 1, 'if use autopad, kernel size must be odd'
    if dilation > 1:
        # effective kernel size of a dilated convolution
        kernel_size = dilation * (kernel_size - 1) + 1
    if padding is None:
        padding = kernel_size // 2
    return padding


def make_divisible(value, divisor, min_value=None, min_ratio=0.9):
    """Make divisible function.

    This function rounds the channel number to the nearest value that can be
    divisible by the divisor. It is taken from the original tf repo. It ensures
    that all layers have a channel number that is divisible by divisor. It can
    be seen here: https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py  # noqa

    Args:
        value (int, float): The original channel number.
        divisor (int): The divisor to fully divide the channel number.
        min_value (int): The minimum value of the output channel.
            Default: None, means that the minimum value equal to the divisor.
        min_ratio (float): The minimum ratio of the rounded channel number to
            the original channel number. Default: 0.9.

    Returns:
        int: The modified output channel number.
    """
    if min_value is None:
        min_value = divisor
    new_value = max(min_value, int(value + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than (1-min_ratio).
    if new_value < min_ratio * value:
        new_value += divisor
    return new_value


class BCHW2BHWC(nn.Module):
    def __init__(self):
        super().__init__()

    @staticmethod
    def forward(x):
        return x.permute([0, 2, 3, 1])


class BHWC2BCHW(nn.Module):
    def __init__(self):
        super().__init__()

    @staticmethod
    def forward(x):
        return x.permute([0, 3, 1, 2])


class GSiLU(BaseModule):
    """Global Sigmoid-Gated Linear Unit, reproduced from paper <SIMPLE CNN FOR VISION>"""

    def __init__(self):
        super().__init__()
        self.adpool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        return x * torch.sigmoid(self.adpool(x))


class CAA(BaseModule):
    """Context Anchor Attention"""

    def __init__(
            self,
            channels: int,
            h_kernel_size: int = 11,
            v_kernel_size: int = 11,
            norm_cfg: Optional[dict] = dict(type='BN', momentum=0.03, eps=0.001),
            act_cfg: Optional[dict] = dict(type='SiLU'),
            init_cfg: Optional[dict] = None,
    ):
        super().__init__(init_cfg)
        self.avg_pool = nn.AvgPool2d(7, 1, 3)
        self.conv1 = ConvModule(channels, channels, 1, 1, 0,
                                norm_cfg=norm_cfg, act_cfg=act_cfg)
        self.h_conv = ConvModule(channels, channels, (1, h_kernel_size), 1,
                                 (0, h_kernel_size // 2), groups=channels,
                                 norm_cfg=None, act_cfg=None)
        self.v_conv = ConvModule(channels, channels, (v_kernel_size, 1), 1,
                                 (v_kernel_size // 2, 0), groups=channels,
                                 norm_cfg=None, act_cfg=None)
        self.conv2 = ConvModule(channels, channels, 1, 1, 0,
                                norm_cfg=norm_cfg, act_cfg=act_cfg)
        self.act = nn.Sigmoid()

    def forward(self, x):
        attn_factor = self.act(self.conv2(self.v_conv(self.h_conv(self.conv1(self.avg_pool(x))))))
        return attn_factor


class C2f_CAA(C2f):
    def __init__(self, c1, c2, n=1, k=7, shortcut=False, g=1, e=0.5):
        super().__init__(c1, c2, n, shortcut, g, e)
        self.m = nn.ModuleList(CAA(self.c) for _ in range(n))


class ConvFFN(BaseModule):
    """Multi-layer perceptron implemented with ConvModule"""

    def __init__(
            self,
            in_channels: int,
            out_channels: Optional[int] = None,
            hidden_channels_scale: float = 4.0,
            hidden_kernel_size: int = 3,
            dropout_rate: float = 0.,
            add_identity: bool = True,
            norm_cfg: Optional[dict] = dict(type='BN', momentum=0.03, eps=0.001),
            act_cfg: Optional[dict] = dict(type='SiLU'),
            init_cfg: Optional[dict] = None,
    ):
        super().__init__(init_cfg)
        out_channels = out_channels or in_channels
        hidden_channels = int(in_channels * hidden_channels_scale)
        self.ffn_layers = nn.Sequential(
            BCHW2BHWC(),
            nn.LayerNorm(in_channels),
            BHWC2BCHW(),
            ConvModule(in_channels, hidden_channels, kernel_size=1, stride=1, padding=0,
                       norm_cfg=norm_cfg, act_cfg=act_cfg),
            ConvModule(hidden_channels, hidden_channels, kernel_size=hidden_kernel_size, stride=1,
                       padding=hidden_kernel_size // 2, groups=hidden_channels,
                       norm_cfg=norm_cfg, act_cfg=None),
            GSiLU(),
            nn.Dropout(dropout_rate),
            ConvModule(hidden_channels, out_channels, kernel_size=1, stride=1, padding=0,
                       norm_cfg=norm_cfg, act_cfg=act_cfg),
            nn.Dropout(dropout_rate),
        )
        self.add_identity = add_identity

    def forward(self, x):
        x = x + self.ffn_layers(x) if self.add_identity else self.ffn_layers(x)
        return x


class Stem(BaseModule):
    """Stem layer"""

    def __init__(
            self,
            in_channels: int,
            out_channels: int,
            expansion: float = 1.0,
            norm_cfg: Optional[dict] = dict(type='BN', momentum=0.03, eps=0.001),
            act_cfg: Optional[dict] = dict(type='SiLU'),
            init_cfg: Optional[dict] = None,
    ):
        super().__init__(init_cfg)
        hidden_channels = make_divisible(int(out_channels * expansion), 8)
        self.down_conv = ConvModule(in_channels, hidden_channels, kernel_size=3, stride=2, padding=1,
                                    norm_cfg=norm_cfg, act_cfg=act_cfg)
        self.conv1 = ConvModule(hidden_channels, hidden_channels, kernel_size=3, stride=1, padding=1,
                                norm_cfg=norm_cfg, act_cfg=act_cfg)
        self.conv2 = ConvModule(hidden_channels, out_channels, kernel_size=3, stride=1, padding=1,
                                norm_cfg=norm_cfg, act_cfg=act_cfg)

    def forward(self, x):
        return self.conv2(self.conv1(self.down_conv(x)))


class DownSamplingLayer(BaseModule):
    """Down sampling layer"""

    def __init__(
            self,
            in_channels: int,
            out_channels: Optional[int] = None,
            norm_cfg: Optional[dict] = dict(type='BN', momentum=0.03, eps=0.001),
            act_cfg: Optional[dict] = dict(type='SiLU'),
            init_cfg: Optional[dict] = None,
    ):
        super().__init__(init_cfg)
        out_channels = out_channels or (in_channels * 2)
        self.down_conv = ConvModule(in_channels, out_channels, kernel_size=3, stride=2, padding=1,
                                    norm_cfg=norm_cfg, act_cfg=act_cfg)

    def forward(self, x):
        return self.down_conv(x)


class InceptionBottleneck(BaseModule):
    """Bottleneck with Inception module"""

    def __init__(
            self,
            in_channels: int,
            out_channels: Optional[int] = None,
            kernel_sizes: Sequence[int] = (3, 5, 7, 9, 11),
            dilations: Sequence[int] = (1, 1, 1, 1, 1),
            expansion: float = 1.0,
            add_identity: bool = True,
            with_caa: bool = True,
            caa_kernel_size: int = 11,
            norm_cfg: Optional[dict] = dict(type='BN', momentum=0.03, eps=0.001),
            act_cfg: Optional[dict] = dict(type='SiLU'),
            init_cfg: Optional[dict] = None,
    ):
        super().__init__(init_cfg)
        out_channels = out_channels or in_channels
        hidden_channels = make_divisible(int(out_channels * expansion), 8)
        self.pre_conv = ConvModule(in_channels, hidden_channels, 1, 1, 0, 1,
                                   norm_cfg=norm_cfg, act_cfg=act_cfg)
        self.dw_conv = ConvModule(hidden_channels, hidden_channels, kernel_sizes[0], 1,
                                  autopad(kernel_sizes[0], None, dilations[0]), dilations[0],
                                  groups=hidden_channels, norm_cfg=None, act_cfg=None)
        self.dw_conv1 = ConvModule(hidden_channels, hidden_channels, kernel_sizes[1], 1,
                                   autopad(kernel_sizes[1], None, dilations[1]), dilations[1],
                                   groups=hidden_channels, norm_cfg=None, act_cfg=None)
        self.dw_conv2 = ConvModule(hidden_channels, hidden_channels, kernel_sizes[2], 1,
                                   autopad(kernel_sizes[2], None, dilations[2]), dilations[2],
                                   groups=hidden_channels, norm_cfg=None, act_cfg=None)
        self.dw_conv3 = ConvModule(hidden_channels, hidden_channels, kernel_sizes[3], 1,
                                   autopad(kernel_sizes[3], None, dilations[3]), dilations[3],
                                   groups=hidden_channels, norm_cfg=None, act_cfg=None)
        self.dw_conv4 = ConvModule(hidden_channels, hidden_channels, kernel_sizes[4], 1,
                                   autopad(kernel_sizes[4], None, dilations[4]), dilations[4],
                                   groups=hidden_channels, norm_cfg=None, act_cfg=None)
        self.pw_conv = ConvModule(hidden_channels, hidden_channels, 1, 1, 0, 1,
                                  norm_cfg=norm_cfg, act_cfg=act_cfg)
        if with_caa:
            self.caa_factor = CAA(hidden_channels, caa_kernel_size, caa_kernel_size, None, None)
        else:
            self.caa_factor = None
        self.add_identity = add_identity and in_channels == out_channels
        self.post_conv = ConvModule(hidden_channels, out_channels, 1, 1, 0, 1,
                                    norm_cfg=norm_cfg, act_cfg=act_cfg)

    def forward(self, x):
        x = self.pre_conv(x)
        y = x  # if there is an inplace operation of x, use y = x.clone() instead of y = x
        x = self.dw_conv(x)
        x = x + self.dw_conv1(x) + self.dw_conv2(x) + self.dw_conv3(x) + self.dw_conv4(x)
        x = self.pw_conv(x)
        if self.caa_factor is not None:
            y = self.caa_factor(y)
        if self.add_identity:
            y = x * y
            x = x + y
        else:
            x = x * y
        x = self.post_conv(x)
        return x


class PKIBlock(BaseModule):
    """Poly Kernel Inception Block"""

    def __init__(
            self,
            in_channels: int,
            out_channels: Optional[int] = None,
            kernel_sizes: Sequence[int] = (3, 5, 7, 9, 11),
            dilations: Sequence[int] = (1, 1, 1, 1, 1),
            with_caa: bool = True,
            caa_kernel_size: int = 11,
            expansion: float = 1.0,
            ffn_scale: float = 4.0,
            ffn_kernel_size: int = 3,
            dropout_rate: float = 0.,
            drop_path_rate: float = 0.,
            layer_scale: Optional[float] = 1.0,
            add_identity: bool = True,
            norm_cfg: Optional[dict] = dict(type='BN', momentum=0.03, eps=0.001),
            act_cfg: Optional[dict] = dict(type='SiLU'),
            init_cfg: Optional[dict] = None,
    ):
        super().__init__(init_cfg)
        out_channels = out_channels or in_channels
        hidden_channels = make_divisible(int(out_channels * expansion), 8)
        if norm_cfg is not None:
            self.norm1 = build_norm_layer(norm_cfg, in_channels)[1]
            self.norm2 = build_norm_layer(norm_cfg, hidden_channels)[1]
        else:
            self.norm1 = nn.BatchNorm2d(in_channels)
            self.norm2 = nn.BatchNorm2d(hidden_channels)
        self.block = InceptionBottleneck(in_channels, hidden_channels, kernel_sizes, dilations,
                                         expansion=1.0, add_identity=True,
                                         with_caa=with_caa, caa_kernel_size=caa_kernel_size,
                                         norm_cfg=norm_cfg, act_cfg=act_cfg)
        self.ffn = ConvFFN(hidden_channels, out_channels, ffn_scale, ffn_kernel_size, dropout_rate, add_identity=False,
                           norm_cfg=None, act_cfg=None)
        self.drop_path = DropPath(drop_path_rate) if drop_path_rate > 0 else nn.Identity()
        self.layer_scale = layer_scale
        if self.layer_scale:
            self.gamma1 = nn.Parameter(layer_scale * torch.ones(hidden_channels), requires_grad=True)
            self.gamma2 = nn.Parameter(layer_scale * torch.ones(out_channels), requires_grad=True)
        self.add_identity = add_identity and in_channels == out_channels

    def forward(self, x):
        if self.layer_scale:
            if self.add_identity:
                x = x + self.drop_path(self.gamma1.unsqueeze(-1).unsqueeze(-1) * self.block(self.norm1(x)))
                x = x + self.drop_path(self.gamma2.unsqueeze(-1).unsqueeze(-1) * self.ffn(self.norm2(x)))
            else:
                x = self.drop_path(self.gamma1.unsqueeze(-1).unsqueeze(-1) * self.block(self.norm1(x)))
                x = self.drop_path(self.gamma2.unsqueeze(-1).unsqueeze(-1) * self.ffn(self.norm2(x)))
        else:
            if self.add_identity:
                x = x + self.drop_path(self.block(self.norm1(x)))
                x = x + self.drop_path(self.ffn(self.norm2(x)))
            else:
                x = self.drop_path(self.block(self.norm1(x)))
                x = self.drop_path(self.ffn(self.norm2(x)))
        return x
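InceptionBottleneck.forward adds the outputs of five depthwise branches element-wise, which only works if every branch returns a feature map of the same height and width. The sketch below checks that with the autopad rule from pkinet.py and the standard convolution output-size formula. The helper conv_out_size and the 20x20 feature map are assumptions made for this example (20 is what a 640x640 input would give at stride 32).

```python
def autopad(kernel_size: int, padding=None, dilation: int = 1) -> int:
    """Same rule as the autopad helper in pkinet.py."""
    assert kernel_size % 2 == 1, 'if use autopad, kernel size must be odd'
    if dilation > 1:
        kernel_size = dilation * (kernel_size - 1) + 1
    if padding is None:
        padding = kernel_size // 2
    return padding


def conv_out_size(size: int, kernel_size: int, padding: int,
                  stride: int = 1, dilation: int = 1) -> int:
    """Standard output-size formula for a convolution along one axis."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


feature_size = 20  # hypothetical P5 feature map side
out_sizes = {k: conv_out_size(feature_size, k, autopad(k)) for k in (3, 5, 7, 9, 11)}
print(out_sizes)  # {3: 20, 5: 20, 7: 20, 9: 20, 11: 20}

# The rule also holds if a dilation is configured, e.g. a 3x3 kernel with dilation 2.
dilated_size = conv_out_size(feature_size, 3, autopad(3, None, 2), dilation=2)
print(dilated_size)  # 20
```

Because every odd kernel gets a padding of half its effective size, all branches keep the input resolution at stride 1 and can be summed directly.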
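2.2 Modify tasks.py (ultralytics/nn/tasks.py)
This modification is based on the latest official release, which already includes newly added modules such as C2fAttn.
Download: GitHub - ultralytics/ultralytics: NEW - YOLOv8 🚀 in PyTorch > ONNX > OpenVINO > CoreML > TFLite
1) First, register the module
from ultralytics.nn.backbone.pkinet import PKIBlock
2) Modify def parse_model(d, ch, verbose=True): # model_dict, input_channels(3)
You only need to add PKIBlock to the tuple in your own source; the other modules belong to optimizations from the author's other articles.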
# inside the layer loop of parse_model
n = n_ = max(round(n * depth), 1) if n > 1 else n  # depth gain
if m in (
    Classify,
    Conv,
    ConvTranspose,
    GhostConv,
    Bottleneck,
    GhostBottleneck,
    SPP,
    SPPF,
    DWConv,
    Focus,
    BottleneckCSP,
    C1,
    C2,
    C2f,
    C2fAttn,
    C3,
    C3TR,
    C3Ghost,
    nn.ConvTranspose2d,
    DWConvTranspose2d,
    C3x,
    RepC3,
    PKIBlock,
):
    c1, c2 = ch[f], args[0]
    if c2 != nc:  # if c2 not equal to number of classes (i.e. for Classify() output)
        c2 = make_divisible(min(c2, max_channels) * width, 8)
    args = [c1, c2, *args[1:]]
    if m in (BottleneckCSP, C1, C2, C2f, C2fAttn, C3, C3TR, C3Ghost, C3x, RepC3):
        args.insert(2, n)  # number of repeats
        n = 1
2.3 yolov8-PKIBlock.yaml
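The channel numbers written in the YAML are not passed to PKIBlock verbatim: the parse_model branch from 2.2 first rescales them with c2 = make_divisible(min(c2, max_channels) * width, 8). The sketch below reproduces that arithmetic for the PKIBlock entry [1024] at each model scale. The function round_up_to_multiple is a hypothetical stand-in that assumes ceil-based rounding to a multiple of 8; for the values in this file the result is an exact multiple of 8 anyway, so the rounding mode does not change the outcome.

```python
import math

# [depth, width, max_channels] copied from the scales table of yolov8-PKIBlock.yaml
SCALES = {
    "n": (0.33, 0.25, 1024),
    "s": (0.33, 0.50, 1024),
    "m": (0.67, 0.75, 768),
    "l": (1.00, 1.00, 512),
    "x": (1.00, 1.25, 512),
}


def round_up_to_multiple(value: float, divisor: int = 8) -> int:
    """Hypothetical stand-in for make_divisible in parse_model (ceil-based, an assumption)."""
    return math.ceil(value / divisor) * divisor


def scaled_channels(c2: int, scale: str) -> int:
    """Mirror of c2 = make_divisible(min(c2, max_channels) * width, 8)."""
    _, width, max_channels = SCALES[scale]
    return round_up_to_multiple(min(c2, max_channels) * width, 8)


pki_channels = {scale: scaled_channels(1024, scale) for scale in SCALES}
print(pki_channels)  # {'n': 256, 's': 512, 'm': 576, 'l': 512, 'x': 640}
```

The SPPF layer in front of PKIBlock is also declared with 1024 channels, so after scaling c1 equals c2 and parse_model ends up calling PKIBlock(c1, c2) with matching input and output widths. That is the condition under which PKIBlock keeps its residual connection (add_identity and in_channels == out_channels).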

# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLOv8 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n.yaml' will call yolov8.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.33, 0.25, 1024] # YOLOv8n summary: 225 layers, 3157200 parameters, 3157184 gradients, 8.9 GFLOPs
  s: [0.33, 0.50, 1024] # YOLOv8s summary: 225 layers, 11166560 parameters, 11166544 gradients, 28.8 GFLOPs
  m: [0.67, 0.75, 768] # YOLOv8m summary: 295 layers, 25902640 parameters, 25902624 gradients, 79.3 GFLOPs
  l: [1.00, 1.00, 512] # YOLOv8l summary: 365 layers, 43691520 parameters, 43691504 gradients, 165.7 GFLOPs
  x: [1.00, 1.25, 512] # YOLOv8x summary: 365 layers, 68229648 parameters, 68229632 gradients, 258.5 GFLOPs

# YOLOv8.0n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 3, C2f, [128, True]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 6, C2f, [256, True]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 6, C2f, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 3, C2f, [1024, True]]
  - [-1, 1, SPPF, [1024, 5]] # 9
  - [-1, 1, PKIBlock, [1024]] # 10

# YOLOv8.0n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 3, C2f, [512]] # 13

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 3, C2f, [256]] # 16 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 13], 1, Concat, [1]] # cat head P4
  - [-1, 3, C2f, [512]] # 19 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 10], 1, Concat, [1]] # cat head P5
  - [-1, 3, C2f, [1024]] # 22 (P5/32-large)

  - [[16, 19, 22], 1, Detect, [nc]] # Detect(P3, P4, P5)
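Adding one layer to the backbone shifts every later layer index by one, so the absolute references in the head must be updated with it. As far as I know, the stock yolov8.yaml concatenates with layer 9 and detects on [15, 18, 21]; here those become 10 and [16, 19, 22]. The sketch below copies the from/module columns of the YAML above into a plain list and resolves the references, as a quick sanity check before training. The resolve helper is written for this example and is not part of ultralytics.

```python
# (from, module) pairs copied from yolov8-PKIBlock.yaml, backbone first, then head
LAYERS = [
    (-1, "Conv"), (-1, "Conv"), (-1, "C2f"), (-1, "Conv"), (-1, "C2f"),
    (-1, "Conv"), (-1, "C2f"), (-1, "Conv"), (-1, "C2f"), (-1, "SPPF"),
    (-1, "PKIBlock"),
    (-1, "nn.Upsample"), ([-1, 6], "Concat"), (-1, "C2f"),
    (-1, "nn.Upsample"), ([-1, 4], "Concat"), (-1, "C2f"),
    (-1, "Conv"), ([-1, 13], "Concat"), (-1, "C2f"),
    (-1, "Conv"), ([-1, 10], "Concat"), (-1, "C2f"),
    ([16, 19, 22], "Detect"),
]


def resolve(index: int, source) -> list:
    """Turn relative 'from' entries (negative values) into absolute layer indices."""
    sources = source if isinstance(source, list) else [source]
    return [index + s if s < 0 else s for s in sources]


resolved = {i: (name, resolve(i, src)) for i, (src, name) in enumerate(LAYERS)}

print(resolved[10])                                # ('PKIBlock', [9])
print(resolved[21])                                # ('Concat', [20, 10])
print([resolved[i][0] for i in resolved[23][1]])   # ['C2f', 'C2f', 'C2f']
```

PKIBlock reads from SPPF (layer 9), the last Concat picks up the PKIBlock output (layer 10) instead of the raw SPPF output, and Detect receives the three C2f outputs at P3, P4 and P5.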