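YOLOv11 Improvements - Backbone: Lightweight ConvNeXtV2, a Fully Convolutional Masked Autoencoder Network for Object Detection (compatible with the whole YOLOv11 series)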

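1. Introduction

This article introduces ConvNeXt V2 as an improvement for YOLOv11. ConvNeXt V2 is a convolutional neural network architecture that combines self-supervised learning with architectural changes, most notably the fully convolutional masked autoencoder (FCMAE) framework and the Global Response Normalization (GRN) layer. I use it to replace the feature extraction network (backbone) of YOLOv11 so that more useful features are extracted. In my experiments this backbone improved accuracy on large, medium, and small objects. The backbone also comes in several variants, which you can switch between in the source code. The article first outlines the main ideas behind the architecture and then shows how to add the network to the model.

(The network in this article can additionally be scaled to match the N, S, M, L, and X versions of YOLOv11, which makes it even lighter.)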

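2. ConvNeXt V2 Architecture

Paper: official paper link

Code: official code link

2.1 Basic principles of ConvNeXt V2

ConvNeXt V2 is a convolutional neural network architecture that combines self-supervised learning with architectural changes, most notably the fully convolutional masked autoencoder (FCMAE) framework and the Global Response Normalization (GRN) layer. These changes significantly improve the performance of pure ConvNets on several recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. ConvNeXt V2 is available in a range of sizes, from the efficient Atto model with 3.7M parameters to the Huge model with 650M parameters, covering needs from lightweight to high-performance applications.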

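The key points of ConvNeXt V2 are:

1. Architectural innovation: the fully convolutional masked autoencoder framework and the Global Response Normalization (GRN) layer are combined to improve the original ConvNeXt architecture.
2. Self-supervised learning: self-supervised pre-training improves the generalization ability and efficiency of the model.

The figure below compares the block designs of ConvNeXt V1 and ConvNeXt V2:

The ConvNeXt V2 block adds a Global Response Normalization (GRN) layer. With GRN in place, the LayerScale layer of V1 becomes redundant and is removed in V2. These changes are intended to improve the feature representation of the network and make learning more efficient.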

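2.2 Architectural innovations

The architectural innovations of ConvNeXt V2 are mainly the following:

1. Fully convolutional masked autoencoder (FCMAE): processes images in a fully convolutional manner and is particularly suited to masked image data.

2. Global Response Normalization (GRN) layer: adding GRN to the convolutional block increases competition between channels and improves the quality of the feature representation.

3. Removal of LayerScale: with GRN added, the original LayerScale layer becomes redundant and is removed in the V2 architecture, which simplifies the model.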
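
The GRN layer from item 2 can be written out explicitly. The following is a sketch that mirrors the `GRN` module in Section 3, assuming a channels-last feature map $X \in \mathbb{R}^{H \times W \times C}$, learnable per-channel parameters $\gamma$ and $\beta$ (both initialised to zero in that code), and $\epsilon = 10^{-6}$:

```latex
\begin{aligned}
G(X)_c &= \lVert X_{:,:,c} \rVert_2 = \sqrt{\sum_{h,w} X_{h,w,c}^{2}}, \\
N(X)_c &= \frac{G(X)_c}{\frac{1}{C}\sum_{k=1}^{C} G(X)_k + \epsilon}, \\
\mathrm{GRN}(X)_{h,w,c} &= \gamma_c \, X_{h,w,c} \, N(X)_c + \beta_c + X_{h,w,c}.
\end{aligned}
```

Since $\gamma$ and $\beta$ start at zero, the layer initially behaves as an identity mapping, and the channel-wise rescaling is learned during training.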

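The figure below shows the fully convolutional masked autoencoder (FCMAE) framework proposed in ConvNeXt V2:

In this figure, the FCMAE framework of ConvNeXt V2 uses sparse convolution at the core of its encoder, so that only the unmasked (visible) pixels of the input image are processed. The encoder is hierarchical, which helps capture features at different levels. The decoder is comparatively simple: it uses a lightweight ConvNeXt block and reconstructs the image only in the target (masked) regions. This asymmetric design lets the model focus on the key regions during pre-training, which is particularly effective for self-supervised learning on images. The loss is computed only over the masked regions, which further strengthens learning and reconstruction of the target regions.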
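
The idea of computing the loss only over masked regions can be illustrated with a small sketch. It assumes the image is split into patches, a binary mask marks hidden patches with 1, and a mean squared error is used per patch. The function name `masked_mse` and the toy values are illustrative and are not taken from the official code.

```python
def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only.

    pred, target: lists of patches, each patch a list of pixel values.
    mask: list of 0/1 flags, 1 = patch was hidden from the encoder.
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m == 1:
            # Per-patch MSE over the pixels of this patch
            total += sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
            count += 1
    # Visible patches contribute nothing to the loss
    return total / max(count, 1)


target = [[1.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
pred = [[1.0, 3.0], [5.0, 5.0], [2.0, 2.0]]
mask = [1, 0, 1]  # the middle patch is visible, so its large error is ignored
loss = masked_mse(pred, target, mask)
print(loss)
```

The visible middle patch has by far the largest error, yet it does not affect the result, which is the behaviour the paragraph above describes.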

3. Core code of ConvNeXt V2

See Section 4 for how to use it.

import torch
import torch.nn as nn
import torch.nn.functional as F
from timm.models.layers import trunc_normal_, DropPath
__all__ = ['convnextv2_atto', 'convnextv2_femto', 'convnext_pico', 'convnextv2_nano', 'convnextv2_tiny', 'convnextv2_base', 'convnextv2_large', 'convnextv2_huge']
class LayerNorm(nn.Module):
    """ LayerNorm that supports two data formats: channels_last (default) or channels_first.
    The ordering of the dimensions in the inputs. channels_last corresponds to inputs with
    shape (batch_size, height, width, channels) while channels_first corresponds to inputs
    with shape (batch_size, channels, height, width).
    """
    def __init__(self, normalized_shape, eps=1e-6, data_format="channels_last"):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(normalized_shape))
        self.bias = nn.Parameter(torch.zeros(normalized_shape))
        self.eps = eps
        self.data_format = data_format
        if self.data_format not in ["channels_last", "channels_first"]:
            raise NotImplementedError
        self.normalized_shape = (normalized_shape,)
    def forward(self, x):
        if self.data_format == "channels_last":
            return F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        elif self.data_format == "channels_first":
            u = x.mean(1, keepdim=True)
            s = (x - u).pow(2).mean(1, keepdim=True)
            x = (x - u) / torch.sqrt(s + self.eps)
            x = self.weight[:, None, None] * x + self.bias[:, None, None]
            return x
class GRN(nn.Module):
    """ GRN (Global Response Normalization) layer
    """
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
    def forward(self, x):
        Gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)
        Nx = Gx / (Gx.mean(dim=-1, keepdim=True) + 1e-6)
        return self.gamma * (x * Nx) + self.beta + x
class Block(nn.Module):
    """ ConvNeXtV2 Block.
    Args:
        dim (int): Number of input channels.
        drop_path (float): Stochastic depth rate. Default: 0.0
    """
    def __init__(self, dim, drop_path=0.):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise conv
        self.norm = LayerNorm(dim, eps=1e-6)
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # pointwise/1x1 convs, implemented with linear layers
        self.act = nn.GELU()
        self.grn = GRN(4 * dim)
        self.pwconv2 = nn.Linear(4 * dim, dim)
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
    def forward(self, x):
        input = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # (N, C, H, W) -> (N, H, W, C)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.grn(x)
        x = self.pwconv2(x)
        x = x.permute(0, 3, 1, 2)  # (N, H, W, C) -> (N, C, H, W)
        x = input + self.drop_path(x)
        return x
class ConvNeXtV2(nn.Module):
    """ ConvNeXt V2
    Args:
        factor (float): Channel scaling factor; every entry of dims is multiplied by it.
        in_chans (int): Number of input image channels. Default: 3
        num_classes (int): Number of classes for classification head. Default: 1000
        depths (tuple(int)): Number of blocks at each stage. Default: [3, 3, 9, 3]
        dims (tuple(int)): Feature dimension at each stage before scaling. Default: [96, 192, 384, 768]
        drop_path_rate (float): Stochastic depth rate. Default: 0.
        head_init_scale (float): Init scaling value for classifier weights and biases. Default: 1.
    """
    def __init__(self, factor, in_chans=3, num_classes=1000,
                 depths=[3, 3, 9, 3], dims=[96, 192, 384, 768],
                 drop_path_rate=0., head_init_scale=1.
                 ):
        super().__init__()
        dims = [int(dim * factor) for dim in dims]
        self.depths = depths
        self.downsample_layers = nn.ModuleList()  # stem and 3 intermediate downsampling conv layers
        stem = nn.Sequential(
            nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4),
            LayerNorm(dims[0], eps=1e-6, data_format="channels_first")
        )
        self.downsample_layers.append(stem)
        for i in range(3):
            downsample_layer = nn.Sequential(
                LayerNorm(dims[i], eps=1e-6, data_format="channels_first"),
                nn.Conv2d(dims[i], dims[i + 1], kernel_size=2, stride=2),
            )
            self.downsample_layers.append(downsample_layer)
        self.stages = nn.ModuleList()  # 4 feature resolution stages, each consisting of multiple residual blocks
        dp_rates = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]
        cur = 0
        for i in range(4):
            stage = nn.Sequential(
                *[Block(dim=dims[i], drop_path=dp_rates[cur + j]) for j in range(depths[i])]
            )
            self.stages.append(stage)
            cur += depths[i]
        self.norm = nn.LayerNorm(dims[-1], eps=1e-6)  # final norm layer
        self.head = nn.Linear(dims[-1], num_classes)
        self.apply(self._init_weights)
        self.head.weight.data.mul_(head_init_scale)
        self.head.bias.data.mul_(head_init_scale)
        self.width_list = [i.size(1) for i in self.forward(torch.randn(1, 3, 640, 640))]
    def _init_weights(self, m):
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            trunc_normal_(m.weight, std=.02)
            nn.init.constant_(m.bias, 0)
    def forward(self, x):
        results = []
        for i in range(4):
            x = self.downsample_layers[i](x)
            x = self.stages[i](x)
            results.append(x)
        return results  # list of four feature maps, one per stage, each of shape (N, C, H, W)
def convnextv2_atto(factor):
    model = ConvNeXtV2(factor=factor, depths=[2, 2, 6, 2], dims=[40, 80, 160, 320])
    return model
def convnextv2_femto(factor):
    model = ConvNeXtV2(factor=factor, depths=[2, 2, 6, 2], dims=[48, 96, 192, 384])
    return model
def convnext_pico(factor):
    model = ConvNeXtV2(factor=factor, depths=[2, 2, 6, 2], dims=[64, 128, 256, 512])
    return model
def convnextv2_nano(factor):
    model = ConvNeXtV2(factor=factor, depths=[2, 2, 8, 2], dims=[80, 160, 320, 640])
    return model
def convnextv2_tiny(factor):
    model = ConvNeXtV2(factor=factor, depths=[3, 3, 9, 3], dims=[96, 192, 384, 768])
    return model
def convnextv2_base(factor):
    model = ConvNeXtV2(factor=factor, depths=[3, 3, 27, 3], dims=[128, 256, 512, 1024])
    return model
def convnextv2_large(factor):
    model = ConvNeXtV2(factor=factor, depths=[3, 3, 27, 3], dims=[192, 384, 768, 1536])
    return model
def convnextv2_huge(factor):
    model = ConvNeXtV2(factor=factor, depths=[3, 3, 27, 3], dims=[352, 704, 1408, 2816])
    return model
if __name__ == "__main__":
    model = convnextv2_atto(factor=0.5)
    inputs = torch.randn((1, 3, 640, 640))
    for i in model(inputs):
        print(i.size())
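
To see what the `factor` argument does without building the model, the channel and resolution arithmetic can be reproduced in plain Python. This is a sketch: the widths are copied from the constructors above, while the helper names `scaled_dims` and `feature_shapes` are hypothetical and exist only for this illustration.

```python
# Base stage widths copied from the constructors in the code above
BASE_DIMS = {
    "convnextv2_atto": [40, 80, 160, 320],
    "convnextv2_femto": [48, 96, 192, 384],
    "convnext_pico": [64, 128, 256, 512],
    "convnextv2_nano": [80, 160, 320, 640],
    "convnextv2_tiny": [96, 192, 384, 768],
}
STRIDES = [4, 8, 16, 32]  # the stem downsamples by 4, then three stride-2 layers follow


def scaled_dims(name, factor):
    """Apply the same int(dim * factor) rule used in ConvNeXtV2.__init__."""
    return [int(d * factor) for d in BASE_DIMS[name]]


def feature_shapes(name, factor, imgsz=640):
    """Return (channels, height, width) for each of the four stages."""
    dims = scaled_dims(name, factor)
    return [(c, imgsz // s, imgsz // s) for c, s in zip(dims, STRIDES)]


shapes = feature_shapes("convnextv2_atto", 0.5)
for shape in shapes:
    print(shape)
```

The channel counts computed here are what `width_list` holds after construction, and they are the values that the parsing code in Section 4 uses to size the following layers.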

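4. Step-by-step guide to adding ConvNeXt V2

4.1 Modification 1

Step 1 is, as usual, creating the files. Under the ultralytics/nn folder, create a directory named 'Addmodules' (if you use the files from my group it already exists and you do not need to create it). Then create a new .py file inside it and paste the core code into it.

4.2 Modification 2

Step 2: in the same directory, create a new .py file named '__init__.py' (already present if you use the group files), and import our module inside it as shown in the figure below.

4.3 Modification 3

Step 3: open 'ultralytics/nn/tasks.py' and import and register our module there (if you use the group files this is already done, so skip straight to Step 4).

From now on all tutorials will follow this layout, because I assume you are modifying the files from my group.

4.4 Modification 4

Add the following two lines of code.

4.5 Modification 5

Find the spot at around line 700 and something (see the figure for the exact location) and modify it as shown: add the part inside the red box. Note that the entries are bare function names, without ().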

        elif m in {convnextv2_atto, convnextv2_femto, convnext_pico, convnextv2_nano, convnextv2_tiny, convnextv2_base, convnextv2_large, convnextv2_huge}:
            m = m(*args)
            c2 = m.width_list  # list of output channels returned by the backbone
            backbone = True

4.6 Modification 6

Both red boxes below need to be changed.

        if isinstance(c2, list):
            m_ = m
            m_.backbone = True
        else:
            m_ = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args)  # module
            t = str(m)[8:-2].replace('__main__.', '')  # module type
        m.np = sum(x.numel() for x in m_.parameters())  # number params
        m_.i, m_.f, m_.type = i + 4 if backbone else i, f, t  # attach index, 'from' index, type
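
The expression `i + 4 if backbone else i` shifts layer indices so that the single backbone entry in the YAML accounts for indices 0-4. The sketch below traces that arithmetic in isolation. It assumes that `backbone` is initialised to False once before the loop and is not reset afterwards; that initialisation is not shown in the snippets of this article, and the function `module_indices` is hypothetical.

```python
def module_indices(yaml_entries, backbone_names):
    """Map each YAML entry position to the index given by `i + 4 if backbone else i`."""
    backbone = False  # assumption: set once before the loop and never reset
    indices = []
    for i, name in enumerate(yaml_entries):
        if name in backbone_names:
            backbone = True  # from here on every layer index is shifted by 4
        indices.append((name, i + 4 if backbone else i))
    return indices


entries = ["convnextv2_atto", "SPPF", "C2PSA", "nn.Upsample", "Concat", "C3k2"]
result = module_indices(entries, {"convnextv2_atto"})
for name, idx in result:
    print(idx, name)
```

Under that assumption, SPPF and C2PSA receive indices 5 and 6 and the first C3k2 receives 9, which agrees with the index comments in the YAML file of Section 5.1.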

4.7 Modification 7

The following also needs to be changed; follow my version exactly.

The code is below. Replace the original code with it.

        if verbose:
            LOGGER.info(f'{i:>3}{str(f):>20}{n_:>3}{m.np:10.0f}  {t:<45}{str(args):<30}')  # print
        save.extend(x % (i + 4 if backbone else i) for x in ([f] if isinstance(f, int) else f) if x != -1)  # append to savelist
        layers.append(m_)
        if i == 0:
            ch = []
        if isinstance(c2, list):
            ch.extend(c2)
            if len(c2) != 5:
                ch.insert(0, 0)
        else:
            ch.append(c2)
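
The bookkeeping of the channel list `ch` in the snippet above can be traced with plain integers. This is a sketch: `update_channels` is a hypothetical helper that copies only the `ch` logic, and the widths are example values (the backbone widths correspond to convnextv2_atto with factor 0.5, the width 256 is illustrative).

```python
def update_channels(ch, c2):
    """Mirror the `ch` update from the snippet above for one parsed layer."""
    if isinstance(c2, list):
        ch.extend(c2)
        if len(c2) != 5:
            ch.insert(0, 0)  # placeholder so list positions match layer indices 0-4
    else:
        ch.append(c2)
    return ch


ch = []  # reset when i == 0, as in the snippet
update_channels(ch, [20, 40, 80, 160])  # width_list of the backbone
update_channels(ch, 256)  # a later single-output layer (illustrative width)
print(ch)
```

After the placeholder is inserted, `ch[2]` and `ch[3]` hold the widths of the stride-8 and stride-16 feature maps, which is why the head in Section 5.1 can refer to the backbone outputs as layers 2 and 3.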

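4.8 Modification 8

Modification 8 differs from the previous ones: it changes part of the forward pass, so we are now outside the parse_model function.

You can check the line numbers in the figure. We are still in the same tasks.py file. Several forward functions in this file look very similar, so make sure you pick the right one: it is the one at around line 70. I provide the code below so you can copy and paste it directly. I will make a video about this step when I have time.

Code: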

    def _predict_once(self, x, profile=False, visualize=False, embed=None):
        """
        Perform a forward pass through the network.
        Args:
            x (torch.Tensor): The input tensor to the model.
            profile (bool):  Print the computation time of each layer if True, defaults to False.
            visualize (bool): Save the feature maps of the model if True, defaults to False.
            embed (list, optional): A list of feature vectors/embeddings to return.
        Returns:
            (torch.Tensor): The last output of the model.
        """
        y, dt, embeddings = [], [], []  # outputs
        for m in self.model:
            if m.f != -1:  # if not from previous layer
                x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
            if profile:
                self._profile_one_layer(m, x, dt)
            if hasattr(m, 'backbone'):
                x = m(x)
                if len(x) != 5:  # 0 - 5
                    x.insert(0, None)
                for index, i in enumerate(x):
                    if index in self.save:
                        y.append(i)
                    else:
                        y.append(None)
                x = x[-1]  # pass the last output on to the next layer
            else:
                x = m(x)  # run
                y.append(x if m.i in self.save else None)  # save output
            if visualize:
                feature_visualization(x, m.type, m.i, save_dir=visualize)
            if embed and m.i in embed:
                embeddings.append(nn.functional.adaptive_avg_pool2d(x, (1, 1)).squeeze(-1).squeeze(-1))  # flatten
                if m.i == max(embed):
                    return torch.unbind(torch.cat(embeddings, 1), dim=0)
        return x
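
The backbone branch of `_predict_once` can be tried out with stand-in values instead of tensors. The sketch below assumes the YAML from Section 5.1, whose Concat and Detect layers reference indices 2, 3, 6, 9, 12, 15 and 18. The function `collect_backbone_outputs` is hypothetical and reproduces only that branch.

```python
def collect_backbone_outputs(features, save):
    """Mimic the backbone branch of _predict_once using plain Python objects."""
    x = list(features)
    if len(x) != 5:
        x.insert(0, None)  # pad so positions line up with layer indices 0-4
    # Keep only the outputs that a later layer references; drop the rest
    y = [feat if index in save else None for index, feat in enumerate(x)]
    return y, x[-1]  # saved outputs, and the value handed to the next layer


# Stand-in strings instead of tensors, named after the stride of each stage
save = {2, 3, 6, 9, 12, 15, 18}
y, last = collect_backbone_outputs(["s4", "s8", "s16", "s32"], save)
print(y)
print(last)
```

Only the stride-8 and stride-16 maps are kept in `y`, because the head concatenates with them later. The stride-32 map is not stored but is passed on directly to SPPF.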

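That completes the modifications. There are many details here, so take care not to replace more code than necessary, and do not skip any step. Either mistake will make the run fail, and the resulting errors are very hard to track down.

Note: one additional modification!

Regular readers know that most of my modifications follow the same pattern. This network needs one extra step, which concerns the parameter s: change the s shown below to 640 and everything runs correctly.

Fixing the FLOPs printout

Open 'ultralytics/utils/torch_utils.py' and modify it as shown in the figure below; otherwise the FLOPs count may not be printed.

Notes

If validation fails with a shape mismatch error, you can fix the image size of the validation set as follows:

Open ultralytics/models/yolo/detect/train.py. In the build_dataset function of the DetectionTrainer class, change the argument rect=mode == 'val' to rect=False.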

5. The ConvNeXtV2 YAML file

5.1 The ConvNeXtV2 YAML file

Training info: YOLO11-ConvNeXtV2 summary: 325 layers, 2,597,387 parameters, 2,597,371 gradients, 5.9 GFLOPs

Usage notes: in [-1, 1, convnextv2_atto, [0.25]] below, the 0.25 in the args position is the channel scaling factor. Use 0.25 for YOLOv11N, 0.5 for YOLOv11S, 1 for YOLOv11M, 1 for YOLOv11L, and 1.5 for YOLOv11X, according to the YOLO version you are training.
Supported variants: 'convnextv2_atto', 'convnextv2_femto', 'convnext_pico', 'convnextv2_nano', 'convnextv2_tiny', 'convnextv2_base', 'convnextv2_large', 'convnextv2_huge'

# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLO11 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect
# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
  m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
  l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
  x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs
# In [-1, 1, convnextv2_atto, [0.25]] below, the 0.25 in the args position is the channel scaling factor: 0.25 for YOLOv11N, 0.5 for YOLOv11S, 1 for YOLOv11M, 1 for YOLOv11L, 1.5 for YOLOv11X. Set it according to the YOLO version you are training.
# Supported variants: 'convnextv2_atto', 'convnextv2_femto', 'convnext_pico', 'convnextv2_nano', 'convnextv2_tiny', 'convnextv2_base', 'convnextv2_large', 'convnextv2_huge'
# YOLO11n backbone
backbone:
  # [from, repeats, module, args]
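  - [-1, 1, convnextv2_atto, [0.5]] # 0-4 P1/2; this one entry stands for four layers, so do not let the YAML layout limit your thinking. If you cannot draw the structure, see the video in the group.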
  - [-1, 1, SPPF, [1024, 5]] # 5
  - [-1, 2, C2PSA, [1024]] # 6
# YOLO11n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 3], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3k2, [512, False]] # 9
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 2], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [256, False]] # 12 (P3/8-small)
  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 9], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3k2, [512, False]] # 15 (P4/16-medium)
  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 6], 1, Concat, [1]] # cat head P5
  - [-1, 2, C3k2, [1024, True]] # 18 (P5/32-large)
  - [[12, 15, 18], 1, Detect, [nc]] # Detect(P3, P4, P5)

5.2 Training script

You can copy my run script and run it.

import warnings
warnings.filterwarnings('ignore')
from ultralytics import YOLO
if __name__ == '__main__':
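    model = YOLO('yolo11-ConvNeXtV2.yaml')  # assumed file name: save the YAML from Section 5.1 under this name, or adjust the path
    # To switch model scale, change the YAML name above, e.g. yolo11s.yaml selects the s scale.
    # For an improved config named yolo11-XXX.yaml, pass yolo11l-XXX.yaml to use the l scale (change only the name passed to YOLO above, not the config file itself).
    # model.load('yolo11n.pt')  # optionally load pretrained weights; not recommended for research, since it makes it hard to improve accuracy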
    model.train(data=r"C:\Users\Administrator\PycharmProjects\yolov5-master\yolov5-master\Construction Site Safety.v30-raw-images_latestversion.yolov8\data.yaml",
                # For other tasks, find 'task' in 'ultralytics/cfg/default.yaml' and change it to detect, segment, classify, or pose
                cache=False,
                imgsz=640,
                epochs=150,
                single_cls=False,  # whether to train as single-class detection
                batch=16,
                close_mosaic=0,
                workers=0,
                device='0',
                optimizer='SGD', # using SGD
                # resume='runs/train/exp21/weights/last.pt', # to resume training, set this to the path of last.pt
                amp=True,  # turn amp off if the training loss becomes NaN
                project='runs/train',
                name='exp',
                )

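6. Record of a successful run

Below is a screenshot of a successful run. One epoch of training has completed; the image was too large to capture the second epoch.

7. Summary

That concludes the main content of this article. I recommend my column on effective YOLOv11 improvements. The column is newly opened and currently has an average quality score of 98. I will keep reproducing papers from the latest top conferences and will also supplement some older improvement mechanisms. If this article helped you, subscribe to the column and follow along for more updates.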