RT-DETR Improvement Strategy [Exclusive Fused Improvement] | MobileNetV4 + BiFPN: Lightweight Backbone Plus Weighted Feature Fusion for Easy Parameter Reduction and Accuracy Gains

1. Introduction

This article documents a lightweight improvement to RT-DETR object detection that combines MobileNet V4 and BiFPN. MobileNet V4 lowers computational cost without degrading performance metrics, while BiFPN optimizes cross-scale connections and weighted feature fusion to fuse multi-scale features more effectively and strengthen feature representation. Together, the two changes reduce parameters while improving accuracy.



2. MobileNet V4 Design Principles

MobileNetV4: Universal Models for the Mobile Ecosystem

MobileNetV4 is a family of universal, efficient models for the mobile ecosystem. The following details the motivation, principles, structure, and advantages of its lightweight design:

2.1 Design Motivation

  • Balancing accuracy and efficiency: mobile devices have limited compute, so models must stay accurate while becoming more efficient, enabling fast, real-time, interactive experiences while avoiding sending private data over public networks.
  • Hardware universality: the models are designed to be broadly efficient across different mobile hardware platforms (CPUs, DSPs, GPUs, and various accelerators), so they run well on all kinds of devices.

2.2 Design Principles

  1. Roofline-model analysis
    • Understanding hardware bottlenecks: the Roofline model compares a layer's operational intensity, $\mathrm{MACs}_i / (\mathrm{WeightBytes}_i + \mathrm{ActivationBytes}_i)$, with the theoretical limits of a device's processor and memory system, to determine whether a model is bound by memory bandwidth or by compute on that hardware (see the sketch after this list).
    • Optimization strategy: the model structure is designed around the characteristics of each hardware class (on low ridge-point (RP) hardware, reduce MACs to gain speed; on high-RP hardware, exploit the small data-movement bottleneck to add model capacity), so that MobileNetV4 achieves near-Pareto-optimal performance across the whole RP range from 0 to 500 MACs/byte.
  2. Attention-mechanism optimization
    • Accounting for operational intensity: since accelerator compute has grown far faster than memory bandwidth, the attention design explicitly considers operational intensity, i.e. the ratio of arithmetic operations to memory accesses.
    • MQA mechanism: Mobile MQA shares keys and values across heads to reduce memory-bandwidth demand and raise operational intensity, and further adopts strategies such as asymmetric spatial down-sampling to improve efficiency.
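
As a concrete illustration of the Roofline analysis above, the sketch below computes a layer's operational intensity and compares it with a device's ridge point to decide whether the layer is memory- or compute-bound. All numbers are hypothetical, chosen only to illustrate the decision rule; they are not figures from the paper.

def operational_intensity(layer_macs, weight_bytes, activation_bytes):
    """Operational intensity (OI) = MACs / bytes moved (weights + activations)."""
    return layer_macs / (weight_bytes + activation_bytes)

def bottleneck(oi, peak_macs_per_s, mem_bandwidth_bytes_per_s):
    """A device's ridge point (RP) is peak compute divided by peak bandwidth.
    A layer whose OI is below the RP is memory-bound; above it, compute-bound."""
    ridge_point = peak_macs_per_s / mem_bandwidth_bytes_per_s
    return "memory-bound" if oi < ridge_point else "compute-bound"

# Hypothetical layer: 2.0e6 MACs, 0.5 MB of weights, 1.5 MB of activations.
oi = operational_intensity(2_000_000, 500_000, 1_500_000)   # 1.0 MACs/byte
# Hypothetical accelerator: 4 TMACs/s peak compute, 40 GB/s bandwidth -> RP = 100.
print(oi, bottleneck(oi, peak_macs_per_s=4e12, mem_bandwidth_bytes_per_s=40e9))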

2.3 Architecture

2.3.1 Universal Inverted Bottleneck (UIB) Block

  • Structural features: the UIB block is a unified and flexible structure that extends MobileNet's inverted bottleneck (IB) block by introducing optional depthwise convolutions (DW) before the expansion layer and between the expansion and projection layers. It unifies the Inverted Bottleneck (IB), ConvNext, Feed Forward Network (FFN), and a new Extra Depthwise (ExtraDW) variant.
  • Block instantiation: the two optional depthwise convolutions in the UIB block admit four possible instantiations, each corresponding to a different trade-off (made concrete in the sketch below). For example, ExtraDW increases network depth and receptive field, combining the strengths of the ConvNext-like and IB variants.

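The sketch below expresses each of the four variants as a configuration of the two optional depthwise kernels of the UniversalInvertedBottleneckBlock defined in Section 4 (a kernel size of 0 disables that depthwise). The mapping is a reading of the paper's description, not code from the official repository.

# The four UIB instantiations as (start_dw, middle_dw) kernel-size choices;
# 0 disables that depthwise conv in UniversalInvertedBottleneckBlock (Section 4):
#   ExtraDW       : both depthwise convs  -> deeper, larger receptive field
#   IB            : middle depthwise only -> the classic inverted bottleneck
#   ConvNext-like : start depthwise only  -> depthwise before the expansion
#   FFN           : no depthwise          -> two pointwise (1x1) convs
extra_dw            = dict(start_dw_kernel_size=3, middle_dw_kernel_size=3)
inverted_bottleneck = dict(start_dw_kernel_size=0, middle_dw_kernel_size=3)
conv_next_like      = dict(start_dw_kernel_size=3, middle_dw_kernel_size=0)
ffn                 = dict(start_dw_kernel_size=0, middle_dw_kernel_size=0)

# e.g. an ExtraDW block keeping 96 channels at stride 1, expansion ratio 4:
# UniversalInvertedBottleneckBlock(96, 96, middle_dw_downsample=True,
#                                  stride=1, expand_ratio=4, **extra_dw)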

2.3.2 Mobile MQA Block

  • Base structure: an attention-based block that simplifies the multi-head self-attention (MHSA) mechanism by sharing keys and values across heads, reducing memory-bandwidth demand (a minimal sketch of the core idea follows this list).
  • Optimized structure: it further adopts asymmetric spatial down-sampling (in the spirit of spatial-reduction attention, SRA), down-sampling the key and value resolutions in the optimized MQA block while keeping high-resolution queries, which improves model efficiency.
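
The essence of multi-query attention is that all query heads share a single key and value projection; only the queries are multi-headed. A minimal sketch of that core idea (the full version with spatial down-sampling appears in Section 4):

import torch
import torch.nn.functional as F

def multi_query_attention(q, k, v):
    """Core of MQA: q is multi-headed, [B, heads, N, d]; k and v are a single
    shared head, [B, M, d]. Sharing K/V across heads divides the K/V bytes
    moved by the number of heads, which raises operational intensity."""
    scores = torch.einsum('bhnd,bmd->bhnm', q, k) / (q.size(-1) ** 0.5)
    attn = F.softmax(scores, dim=-1)
    return torch.einsum('bhnm,bmd->bhnd', attn, v)

out = multi_query_attention(torch.randn(1, 4, 196, 64),
                            torch.randn(1, 196, 64),
                            torch.randn(1, 196, 64))   # -> [1, 4, 196, 64]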

2.4 Advantages

  1. Performance advantages
    • Pareto optimality: by combining UIB, Mobile MQA, and a refined NAS strategy, MobileNetV4 models are mostly Pareto-optimal across mobile CPUs, DSPs, GPUs, and various accelerators, i.e. no single metric can be improved without degrading another.
    • Cross-hardware consistency: the models behave consistently across hardware platforms, which the other models tested do not. For example, on ImageNet-1K classification, MNv4-Conv-M is more than 50% faster than MobileOne-S4 and FastViT-S12, with 1.5% higher Top-1 accuracy than MobileNetV2 at comparable latency.
  2. Efficiency advantages
    • Computational efficiency: the UIB block offers flexible spatial and channel mixing and an optionally enlarged receptive field, improving computational efficiency. For example, the ExtraDW variant adds network depth and receptive field without a significant increase in compute cost.
    • Inference speed: the Mobile MQA block delivers more than a 39% inference speed-up on mobile accelerators, greatly improving runtime efficiency.
  3. Model-construction advantages
    • NAS refinement: a refined neural architecture search (NAS) strategy is used, with a two-stage search (coarse-grained, then fine-grained) and an offline distillation dataset, improving search efficiency and model quality and enabling larger models than previous state-of-the-art approaches.
    • Distillation technique: a new distillation technique dynamically mixes datasets with different augmentation strategies and adds balanced in-class data, further improving accuracy and generalization. For example, the MNv4-Hybrid-Large model reaches 87% accuracy on ImageNet-1K while running in only 3.8 ms on the Pixel 8 EdgeTPU.

Paper: https://arxiv.org/pdf/2404.10518
Source code: https://github.com/tensorflow/models/blob/master/official/vision/modeling/backbones/mobilenet.py

3. BiFPN Overview

EfficientDet: Scalable and Efficient Object Detection

BiFPN (weighted bidirectional feature pyramid network) is a network structure proposed in that paper for efficient multi-scale feature fusion. Its design principles and advantages are as follows:

3.1 BiFPN Principles

  • Problem formulation: multi-scale feature fusion aims to aggregate features at different resolutions. Given a list of multi-scale features $\vec{P}^{in} = (P^{in}_{l_1}, P^{in}_{l_2}, \ldots)$, where $P^{in}_{l_i}$ denotes the feature at level $l_i$, the goal is to find a transformation $f$ such that $\vec{P}^{out} = f(\vec{P}^{in})$. The conventional FPN, for instance, aggregates multi-scale features in a top-down manner: $P^{out}_7 = \mathrm{Conv}(P^{in}_7)$, $P^{out}_6 = \mathrm{Conv}(P^{in}_6 + \mathrm{Resize}(P^{out}_7))$, and so on down the pyramid (a minimal code sketch follows).
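
In code, the top-down FPN recursion is just a loop from the coarsest level downward; a minimal sketch, assuming all levels already share the same channel count:

import torch
import torch.nn.functional as F

def fpn_top_down(feats, convs):
    """Minimal top-down FPN sketch. feats[i] is the level input P_i^in, ordered
    fine-to-coarse (e.g. P3..P7); convs[i] is the per-level conv. Implements
    P_out = Conv(P_in + Resize(next coarser output))."""
    outs = [None] * len(feats)
    outs[-1] = convs[-1](feats[-1])                 # top level: P7_out = Conv(P7_in)
    for i in range(len(feats) - 2, -1, -1):
        up = F.interpolate(outs[i + 1], size=feats[i].shape[-2:], mode='nearest')
        outs[i] = convs[i](feats[i] + up)
    return outs

# Example: three levels at 80/40/20 resolution, 256 channels each.
convs = [torch.nn.Conv2d(256, 256, 3, padding=1) for _ in range(3)]
feats = [torch.randn(1, 256, s, s) for s in (80, 40, 20)]
p3, p4, p5 = fpn_top_down(feats, convs)             # same shapes as the inputs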

3.1.1 Cross-Scale Connections

  • To overcome the one-way information flow of the conventional FPN, PANet adds an extra bottom-up path-aggregation network, and later work has explored cross-scale connections further. After studying the performance and efficiency of FPN, PANet, and NAS-FPN, the paper optimizes the cross-scale connections as follows:
  • Remove nodes that have only one input edge: such a node contributes little to a feature network whose purpose is fusing different features, and removing it yields a simplified bidirectional network.
  • Add an extra edge from the original input to the output node when they are at the same level, so that more features are fused at little additional cost.
  • Treat each bidirectional (top-down plus bottom-up) path as a single feature-network layer and repeat it multiple times for richer high-level feature fusion; hence the name bidirectional feature pyramid network (BiFPN).

3.1.2 Weighted Feature Fusion

When fusing features at different resolutions, the common approach is to resize them to the same resolution and sum them, but this treats all inputs as equally important. To address this, an extra weight is added for each input so the network can learn the importance of each input feature. Three weighted-fusion schemes are considered (compared in the code sketch after this list):
- Unbounded fusion: $O = \sum_i w_i \cdot I_i$, where $w_i$ is a learnable weight that can be a scalar (per feature), a vector (per channel), or a multi-dimensional tensor (per pixel). Weight normalization is required to bound the weight values and avoid training instability.
- Softmax-based fusion: $O = \sum_i \frac{e^{w_i}}{\sum_j e^{w_j}} \cdot I_i$. Applying softmax normalizes all weights into probabilities between 0 and 1 that represent the importance of each input, but it causes a significant slowdown on GPU hardware.
- Fast normalized fusion: $O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i$, where a ReLU applied after each $w_i$ guarantees $w_i \geq 0$, and $\epsilon = 0.0001$ is a small value that avoids numerical instability. Each normalized weight again falls between 0 and 1, but without the softmax operation this scheme is far more efficient.
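
The difference between the last two schemes is easiest to see side by side; a minimal sketch, with the weights as one learnable scalar per input:

import torch

w = torch.nn.Parameter(torch.ones(3))               # one learnable scalar per input
inputs = [torch.randn(1, 256, 32, 32) for _ in range(3)]

# Softmax-based fusion: weights become exact probabilities, but the exp/sum
# is measurably slow on GPU hardware.
sm = torch.softmax(w, dim=0)
out_softmax = sum(sm[i] * inputs[i] for i in range(3))

# Fast normalized fusion: ReLU keeps each weight non-negative and a plain sum
# normalizes them; epsilon avoids division by zero. The weights still lie in
# [0, 1], with comparable accuracy and no softmax cost.
wr = torch.relu(w)
fast = wr / (wr.sum() + 1e-4)
out_fast = sum(fast[i] * inputs[i] for i in range(3))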

The final BiFPN integrates the bidirectional cross-scale connections with fast normalized fusion. For example, the two fused features at level 6 of the BiFPN in Figure 2(d) are computed as:
- $P^{td}_6 = \mathrm{Conv}\left(\frac{w_1 \cdot P^{in}_6 + w_2 \cdot \mathrm{Resize}(P^{in}_7)}{w_1 + w_2 + \epsilon}\right)$
- $P^{out}_6 = \mathrm{Conv}\left(\frac{w'_1 \cdot P^{in}_6 + w'_2 \cdot P^{td}_6 + w'_3 \cdot \mathrm{Resize}(P^{out}_5)}{w'_1 + w'_2 + w'_3 + \epsilon}\right)$

To further improve efficiency, depthwise separable convolution is used for the feature fusion, with batch normalization and an activation added after each convolution (sketched below).
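
The per-node fusion convolution can be sketched as follows; this is a sketch rather than the EfficientDet source, with nn.SiLU standing in for the Swish activation:

import torch.nn as nn

def sep_conv_bn_act(channels, kernel_size=3):
    """Depthwise separable conv + BN + activation, the convolution applied
    after each BiFPN fusion node."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2,
                  groups=channels, bias=False),        # depthwise
        nn.Conv2d(channels, channels, 1, bias=False),  # pointwise
        nn.BatchNorm2d(channels),
        nn.SiLU(),                                     # Swish
    )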


3.2 Advantages

  • By optimizing cross-scale connections and weighted feature fusion, BiFPN fuses multi-scale features more effectively and improves the representational power of the features.
  • Compared with other feature networks, BiFPN achieves similar accuracy with fewer parameters and less computation, improving model efficiency.
  • Fast normalized fusion shows learning behavior and accuracy very close to Softmax-based fusion while running faster on GPUs.

Paper: https://arxiv.org/pdf/1911.09070.pdf
Source code: https://github.com/google/automl/tree/master/efficientdet

4. Implementation Code for MobileNetV4 and BiFPN

The implementation of the MobileNetV4 module is as follows:

from typing import Optional
import torch
import torch.nn as nn
import torch.nn.functional as F
 
__all__ = ['MobileNetV4ConvLarge', 'MobileNetV4ConvSmall', 'MobileNetV4ConvMedium', 'MobileNetV4HybridMedium', 'MobileNetV4HybridLarge']
 
MNV4ConvSmall_BLOCK_SPECS = {
    "conv0": {
        "block_name": "convbn",
        "num_blocks": 1,
        "block_specs": [
            [3, 32, 3, 2]
        ]
    },
    "layer1": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [32, 32, 3, 2],
            [32, 32, 1, 1]
        ]
    },
    "layer2": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [32, 96, 3, 2],
            [96, 64, 1, 1]
        ]
    },
    "layer3": {
        "block_name": "uib",
        "num_blocks": 6,
        "block_specs": [
            [64, 96, 5, 5, True, 2, 3],
            [96, 96, 0, 3, True, 1, 2],
            [96, 96, 0, 3, True, 1, 2],
            [96, 96, 0, 3, True, 1, 2],
            [96, 96, 0, 3, True, 1, 2],
            [96, 96, 3, 0, True, 1, 4],
        ]
    },
    "layer4": {
        "block_name": "uib",
        "num_blocks": 6,
        "block_specs": [
            [96,  128, 3, 3, True, 2, 6],
            [128, 128, 5, 5, True, 1, 4],
            [128, 128, 0, 5, True, 1, 4],
            [128, 128, 0, 5, True, 1, 3],
            [128, 128, 0, 3, True, 1, 4],
            [128, 128, 0, 3, True, 1, 4],
        ]
    },
    "layer5": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [128, 960, 1, 1],
            [960, 1280, 1, 1]
        ]
    }
}
 
MNV4ConvMedium_BLOCK_SPECS = {
    "conv0": {
        "block_name": "convbn",
        "num_blocks": 1,
        "block_specs": [
            [3, 32, 3, 2]
        ]
    },
    "layer1": {
        "block_name": "fused_ib",
        "num_blocks": 1,
        "block_specs": [
            [32, 48, 2, 4.0, True]
        ]
    },
    "layer2": {
        "block_name": "uib",
        "num_blocks": 2,
        "block_specs": [
            [48, 80, 3, 5, True, 2, 4],
            [80, 80, 3, 3, True, 1, 2]
        ]
    },
    "layer3": {
        "block_name": "uib",
        "num_blocks": 8,
        "block_specs": [
            [80,  160, 3, 5, True, 2, 6],
            [160, 160, 3, 3, True, 1, 4],
            [160, 160, 3, 3, True, 1, 4],
            [160, 160, 3, 5, True, 1, 4],
            [160, 160, 3, 3, True, 1, 4],
            [160, 160, 3, 0, True, 1, 4],
            [160, 160, 0, 0, True, 1, 2],
            [160, 160, 3, 0, True, 1, 4]
        ]
    },
    "layer4": {
        "block_name": "uib",
        "num_blocks": 11,
        "block_specs": [
            [160, 256, 5, 5, True, 2, 6],
            [256, 256, 5, 5, True, 1, 4],
            [256, 256, 3, 5, True, 1, 4],
            [256, 256, 3, 5, True, 1, 4],
            [256, 256, 0, 0, True, 1, 4],
            [256, 256, 3, 0, True, 1, 4],
            [256, 256, 3, 5, True, 1, 2],
            [256, 256, 5, 5, True, 1, 4],
            [256, 256, 0, 0, True, 1, 4],
            [256, 256, 0, 0, True, 1, 4],
            [256, 256, 5, 0, True, 1, 2]
        ]
    },
    "layer5": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [256, 960, 1, 1],
            [960, 1280, 1, 1]
        ]
    }
}
 
MNV4ConvLarge_BLOCK_SPECS = {
    "conv0": {
        "block_name": "convbn",
        "num_blocks": 1,
        "block_specs": [
            [3, 24, 3, 2]
        ]
    },
    "layer1": {
        "block_name": "fused_ib",
        "num_blocks": 1,
        "block_specs": [
            [24, 48, 2, 4.0, True]
        ]
    },
    "layer2": {
        "block_name": "uib",
        "num_blocks": 2,
        "block_specs": [
            [48, 96, 3, 5, True, 2, 4],
            [96, 96, 3, 3, True, 1, 4]
        ]
    },
    "layer3": {
        "block_name": "uib",
        "num_blocks": 11,
        "block_specs": [
            [96,  192, 3, 5, True, 2, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 5, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 3, 0, True, 1, 4]
        ]
    },
    "layer4": {
        "block_name": "uib",
        "num_blocks": 13,
        "block_specs": [
            [192, 512, 5, 5, True, 2, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 3, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 3, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4]
        ]
    },
    "layer5": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [512, 960, 1, 1],
            [960, 1280, 1, 1]
        ]
    }
}
 
def mhsa(num_heads, key_dim, value_dim, px):
    # The feature-map size determines the key/value stride used by Mobile MQA
    # (24px -> stride 2, 12px -> stride 1).
    if px == 24:
        kv_strides = 2
    elif px == 12:
        kv_strides = 1
    else:
        raise ValueError(f"unsupported feature size: {px}")
    query_h_strides = 1
    query_w_strides = 1
    use_layer_scale = True
    use_multi_query = True
    use_residual = True
    return [
        num_heads, key_dim, value_dim, query_h_strides, query_w_strides, kv_strides,
        use_layer_scale, use_multi_query, use_residual
    ]
 
MNV4HybridConvMedium_BLOCK_SPECS = {
    "conv0": {
        "block_name": "convbn",
        "num_blocks": 1,
        "block_specs": [
            [3, 32, 3, 2]
        ]
    },
    "layer1": {
        "block_name": "fused_ib",
        "num_blocks": 1,
        "block_specs": [
            [32, 48, 2, 4.0, True]
        ]
    },
    "layer2": {
        "block_name": "uib",
        "num_blocks": 2,
        "block_specs": [
            [48, 80, 3, 5, True, 2, 4],
            [80, 80, 3, 3, True, 1, 2]
        ]
    },
    "layer3": {
        "block_name": "uib",
        "num_blocks": 8,
        "block_specs": [
            [80,  160, 3, 5, True, 2, 6],
            [160, 160, 0, 0, True, 1, 2],
            [160, 160, 3, 3, True, 1, 4],
            [160, 160, 3, 5, True, 1, 4, mhsa(4, 64, 64, 24)],
            [160, 160, 3, 3, True, 1, 4, mhsa(4, 64, 64, 24)],
            [160, 160, 3, 0, True, 1, 4, mhsa(4, 64, 64, 24)],
            [160, 160, 3, 3, True, 1, 4, mhsa(4, 64, 64, 24)],
            [160, 160, 3, 0, True, 1, 4]
        ]
    },
    "layer4": {
        "block_name": "uib",
        "num_blocks": 12,
        "block_specs": [
            [160, 256, 5, 5, True, 2, 6],
            [256, 256, 5, 5, True, 1, 4],
            [256, 256, 3, 5, True, 1, 4],
            [256, 256, 3, 5, True, 1, 4],
            [256, 256, 0, 0, True, 1, 2],
            [256, 256, 3, 5, True, 1, 2],
            [256, 256, 0, 0, True, 1, 2],
            [256, 256, 0, 0, True, 1, 4, mhsa(4, 64, 64, 12)],
            [256, 256, 3, 0, True, 1, 4, mhsa(4, 64, 64, 12)],
            [256, 256, 5, 5, True, 1, 4, mhsa(4, 64, 64, 12)],
            [256, 256, 5, 0, True, 1, 4, mhsa(4, 64, 64, 12)],
            [256, 256, 5, 0, True, 1, 4]
        ]
    },
    "layer5": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [256, 960, 1, 1],
            [960, 1280, 1, 1]
        ]
    }
}
 
MNV4HybridConvLarge_BLOCK_SPECS = {
    "conv0": {
        "block_name": "convbn",
        "num_blocks": 1,
        "block_specs": [
            [3, 24, 3, 2]
        ]
    },
    "layer1": {
        "block_name": "fused_ib",
        "num_blocks": 1,
        "block_specs": [
            [24, 48, 2, 4.0, True]
        ]
    },
    "layer2": {
        "block_name": "uib",
        "num_blocks": 2,
        "block_specs": [
            [48, 96, 3, 5, True, 2, 4],
            [96, 96, 3, 3, True, 1, 4]
        ]
    },
    "layer3": {
        "block_name": "uib",
        "num_blocks": 11,
        "block_specs": [
            [96,  192, 3, 5, True, 2, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 5, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4, mhsa(8, 48, 48, 24)],
            [192, 192, 5, 3, True, 1, 4, mhsa(8, 48, 48, 24)],
            [192, 192, 5, 3, True, 1, 4, mhsa(8, 48, 48, 24)],
            [192, 192, 5, 3, True, 1, 4, mhsa(8, 48, 48, 24)],
            [192, 192, 3, 0, True, 1, 4]
        ]
    },
    "layer4": {
        "block_name": "uib",
        "num_blocks": 14,
        "block_specs": [
            [192, 512, 5, 5, True, 2, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 3, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 3, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4, mhsa(8, 64, 64, 12)],
            [512, 512, 5, 0, True, 1, 4, mhsa(8, 64, 64, 12)],
            [512, 512, 5, 0, True, 1, 4, mhsa(8, 64, 64, 12)],
            [512, 512, 5, 0, True, 1, 4, mhsa(8, 64, 64, 12)],
            [512, 512, 5, 0, True, 1, 4]
        ]
    },
    "layer5": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [512, 960, 1, 1],
            [960, 1280, 1, 1]
        ]
    }
}
 
MODEL_SPECS = {
    "MobileNetV4ConvSmall": MNV4ConvSmall_BLOCK_SPECS,
    "MobileNetV4ConvMedium": MNV4ConvMedium_BLOCK_SPECS,
    "MobileNetV4ConvLarge": MNV4ConvLarge_BLOCK_SPECS,
    "MobileNetV4HybridMedium": MNV4HybridConvMedium_BLOCK_SPECS,
    "MobileNetV4HybridLarge": MNV4HybridConvLarge_BLOCK_SPECS
}

def make_divisible(
        value: float,
        divisor: int,
        min_value: Optional[float] = None,
        round_down_protect: bool = True,
) -> int:
    """
    This function is copied from here
    "https://github.com/tensorflow/models/blob/master/official/vision/modeling/layers/nn_layers.py"
    This is to ensure that all layers have channels that are divisible by 8.
    Args:
        value: A `float` of original value.
        divisor: An `int` of the divisor that need to be checked upon.
        min_value: A `float` of  minimum value threshold.
        round_down_protect: A `bool` indicating whether round down more than 10%
        will be allowed.
    Returns:
        The adjusted value in `int` that is divisible against divisor.
    """
    if min_value is None:
        min_value = divisor
    new_value = max(min_value, int(value + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if round_down_protect and new_value < 0.9 * value:
        new_value += divisor
    return int(new_value)

def conv_2d(inp, oup, kernel_size=3, stride=1, groups=1, bias=False, norm=True, act=True):
    conv = nn.Sequential()
    padding = (kernel_size - 1) // 2
    conv.add_module('conv', nn.Conv2d(inp, oup, kernel_size, stride, padding, bias=bias, groups=groups))
    if norm:
        conv.add_module('BatchNorm2d', nn.BatchNorm2d(oup))
    if act:
        conv.add_module('Activation', nn.ReLU6())
    return conv

class InvertedResidual(nn.Module):
    def __init__(self, inp, oup, stride, expand_ratio, act=False, squeeze_excitation=False):
        super(InvertedResidual, self).__init__()
        self.stride = stride
        assert stride in [1, 2]
        hidden_dim = int(round(inp * expand_ratio))
        self.block = nn.Sequential()
        if expand_ratio != 1:
            self.block.add_module('exp_1x1', conv_2d(inp, hidden_dim, kernel_size=3, stride=stride))
        if squeeze_excitation:
            self.block.add_module('conv_3x3',
                                  conv_2d(hidden_dim, hidden_dim, kernel_size=3, stride=stride, groups=hidden_dim))
        self.block.add_module('red_1x1', conv_2d(hidden_dim, oup, kernel_size=1, stride=1, act=act))
        self.use_res_connect = self.stride == 1 and inp == oup
 
    def forward(self, x):
        if self.use_res_connect:
            return x + self.block(x)
        else:
            return self.block(x)

class UniversalInvertedBottleneckBlock(nn.Module):
    def __init__(self,
                 inp,
                 oup,
                 start_dw_kernel_size,
                 middle_dw_kernel_size,
                 middle_dw_downsample,
                 stride,
                 expand_ratio
                 ):
        """An inverted bottleneck block with optional depthwises.
        Referenced from here https://github.com/tensorflow/models/blob/master/official/vision/modeling/layers/nn_blocks.py
        """
        super().__init__()
        # Starting depthwise conv.
        self.start_dw_kernel_size = start_dw_kernel_size
        if self.start_dw_kernel_size:
            stride_ = stride if not middle_dw_downsample else 1
            self._start_dw_ = conv_2d(inp, inp, kernel_size=start_dw_kernel_size, stride=stride_, groups=inp, act=False)
        # Expansion with 1x1 convs.
        expand_filters = make_divisible(inp * expand_ratio, 8)
        self._expand_conv = conv_2d(inp, expand_filters, kernel_size=1)
        # Middle depthwise conv.
        self.middle_dw_kernel_size = middle_dw_kernel_size
        if self.middle_dw_kernel_size:
            stride_ = stride if middle_dw_downsample else 1
            self._middle_dw = conv_2d(expand_filters, expand_filters, kernel_size=middle_dw_kernel_size, stride=stride_,
                                      groups=expand_filters)
        # Projection with 1x1 convs.
        self._proj_conv = conv_2d(expand_filters, oup, kernel_size=1, stride=1, act=False)
 
        # Ending depthwise conv: present in the reference implementation but
        # unused by these model specs, so it is omitted here.
 
    def forward(self, x):
        if self.start_dw_kernel_size:
            x = self._start_dw_(x)
            # print("_start_dw_", x.shape)
        x = self._expand_conv(x)
        # print("_expand_conv", x.shape)
        if self.middle_dw_kernel_size:
            x = self._middle_dw(x)
            # print("_middle_dw", x.shape)
        x = self._proj_conv(x)
        # print("_proj_conv", x.shape)
        return x

class MultiQueryAttentionLayerWithDownSampling(nn.Module):
    def __init__(self, inp, num_heads, key_dim, value_dim, query_h_strides, query_w_strides, kv_strides,
                 dw_kernel_size=3, dropout=0.0):
        """Multi Query Attention with spatial downsampling.
        Referenced from here https://github.com/tensorflow/models/blob/master/official/vision/modeling/layers/nn_blocks.py
        3 parameters are introduced for the spatial downsampling:
        1. kv_strides: downsampling factor on Key and Values only.
        2. query_h_strides: vertical strides on Query only.
        3. query_w_strides: horizontal strides on Query only.
        This is an optimized version:
        1. Projections in attention are explicitly written out as 1x1 Conv2d.
        2. Additional reshapes are introduced to bring up to a 3x speed-up.
        """
        super().__init__()
        self.num_heads = num_heads
        self.key_dim = key_dim
        self.value_dim = value_dim
        self.query_h_strides = query_h_strides
        self.query_w_strides = query_w_strides
        self.kv_strides = kv_strides
        self.dw_kernel_size = dw_kernel_size
        self.dropout = dropout
 
        self.head_dim = key_dim // num_heads
 
        if self.query_h_strides > 1 or self.query_w_strides > 1:
            self._query_downsampling_norm = nn.BatchNorm2d(inp)
        self._query_proj = conv_2d(inp, num_heads * key_dim, 1, 1, norm=False, act=False)
 
        if self.kv_strides > 1:
            self._key_dw_conv = conv_2d(inp, inp, dw_kernel_size, kv_strides, groups=inp, norm=True, act=False)
            self._value_dw_conv = conv_2d(inp, inp, dw_kernel_size, kv_strides, groups=inp, norm=True, act=False)
        self._key_proj = conv_2d(inp, key_dim, 1, 1, norm=False, act=False)
        self._value_proj = conv_2d(inp, key_dim, 1, 1, norm=False, act=False)
 
        self._output_proj = conv_2d(num_heads * key_dim, inp, 1, 1, norm=False, act=False)
        self.dropout = nn.Dropout(p=dropout)
 
    def forward(self, x):
        batch_size, _, _, _ = x.size()  # x: [B, C, H, W]
        if self.query_h_strides > 1 or self.query_w_strides > 1:
            # Downsample the query spatially before projecting it.
            q = F.avg_pool2d(x, (self.query_h_strides, self.query_w_strides))
            q = self._query_downsampling_norm(q)
            q = self._query_proj(q)
        else:
            q = self._query_proj(x)
        px = q.size(2)
        q = q.view(batch_size, self.num_heads, -1, self.key_dim)  # [batch_size, num_heads, seq_length, key_dim]
 
        if self.kv_strides > 1:
            k = self._key_dw_conv(x)
            k = self._key_proj(k)
            v = self._value_dw_conv(x)
            v = self._value_proj(v)
        else:
            k = self._key_proj(x)
            v = self._value_proj(x)
        k = k.view(batch_size, self.key_dim, -1)  # [batch_size, key_dim, seq_length]
        v = v.view(batch_size, -1, self.key_dim)  # [batch_size, seq_length, key_dim]
 
        # Scaled dot-product attention: softmax over keys, then dropout.
        attn_score = torch.matmul(q, k) / (self.head_dim ** 0.5)
        attn_score = F.softmax(attn_score, dim=-1)
        attn_score = self.dropout(attn_score)
 
        context = torch.matmul(attn_score, v)
        context = context.view(batch_size, self.num_heads * self.key_dim, px, px)
        output = self._output_proj(context)
        return output

class MNV4LayerScale(nn.Module):
    def __init__(self, init_value):
        """LayerScale as introduced in CaiT: https://arxiv.org/abs/2103.17239
        Referenced from here https://github.com/tensorflow/models/blob/master/official/vision/modeling/layers/nn_blocks.py
        As used in MobileNetV4.
        Attributes:
            init_value (float): value to initialize the diagonal matrix of LayerScale.
        """
        super().__init__()
        self.init_value = init_value
 
    def forward(self, x):
        # Note: in this port gamma is rebuilt every forward pass rather than
        # stored as a learnable nn.Parameter, so it reduces to a fixed scale.
        gamma = self.init_value * torch.ones(x.size(-1), dtype=x.dtype, device=x.device)
        return x * gamma

class MultiHeadSelfAttentionBlock(nn.Module):
    def __init__(
            self,
            inp,
            num_heads,
            key_dim,
            value_dim,
            query_h_strides,
            query_w_strides,
            kv_strides,
            use_layer_scale,
            use_multi_query,
            use_residual=True
    ):
        super().__init__()
        self.query_h_strides = query_h_strides
        self.query_w_strides = query_w_strides
        self.kv_strides = kv_strides
        self.use_layer_scale = use_layer_scale
        self.use_multi_query = use_multi_query
        self.use_residual = use_residual
 
        self._input_norm = nn.BatchNorm2d(inp)
        if self.use_multi_query:
            self.multi_query_attention = MultiQueryAttentionLayerWithDownSampling(
                inp, num_heads, key_dim, value_dim, query_h_strides, query_w_strides, kv_strides
            )
        else:
            self.multi_head_attention = nn.MultiheadAttention(inp, num_heads, kdim=key_dim)
 
        if self.use_layer_scale:
            self.layer_scale_init_value = 1e-5
            self.layer_scale = MNV4LayerScale(self.layer_scale_init_value)
 
    def forward(self, x):
        # Not using CPE, skipped
        # input norm
        shortcut = x
        x = self._input_norm(x)
        # multi query
        if self.use_multi_query:
            x = self.multi_query_attention(x)
        else:
            # Fallback path (unused by the provided specs): nn.MultiheadAttention
            # expects (query, key, value) and returns (output, attn_weights).
            x, _ = self.multi_head_attention(x, x, x)
        # layer scale
        if self.use_layer_scale:
            x = self.layer_scale(x)
        # use residual
        if self.use_residual:
            x = x + shortcut
        return x

def build_blocks(layer_spec):
    if not layer_spec.get('block_name'):
        return nn.Sequential()
    block_names = layer_spec['block_name']
    layers = nn.Sequential()
    if block_names == "convbn":
        schema_ = ['inp', 'oup', 'kernel_size', 'stride']
        for i in range(layer_spec['num_blocks']):
            args = dict(zip(schema_, layer_spec['block_specs'][i]))
            layers.add_module(f"convbn_{i}", conv_2d(**args))
    elif block_names == "uib":
        schema_ = ['inp', 'oup', 'start_dw_kernel_size', 'middle_dw_kernel_size', 'middle_dw_downsample', 'stride',
                   'expand_ratio', 'msha']
        for i in range(layer_spec['num_blocks']):
            args = dict(zip(schema_, layer_spec['block_specs'][i]))
            msha = args.pop("msha") if "msha" in args else 0
            layers.add_module(f"uib_{i}", UniversalInvertedBottleneckBlock(**args))
            if msha:
                msha_schema_ = [
                    "inp", "num_heads", "key_dim", "value_dim", "query_h_strides", "query_w_strides", "kv_strides",
                    "use_layer_scale", "use_multi_query", "use_residual"
                ]
                args = dict(zip(msha_schema_, [args['oup']] + msha))
                layers.add_module(f"msha_{i}", MultiHeadSelfAttentionBlock(**args))
    elif block_names == "fused_ib":
        schema_ = ['inp', 'oup', 'stride', 'expand_ratio', 'act']
        for i in range(layer_spec['num_blocks']):
            args = dict(zip(schema_, layer_spec['block_specs'][i]))
            layers.add_module(f"fused_ib_{i}", InvertedResidual(**args))
    else:
        raise NotImplementedError
    return layers

class MobileNetV4(nn.Module):
    def __init__(self, model):
        # MobileNetV4ConvSmall  MobileNetV4ConvMedium  MobileNetV4ConvLarge
        # MobileNetV4HybridMedium  MobileNetV4HybridLarge
        """Params to initiate MobilenNetV4
        Args:
            model : support 5 types of models as indicated in
            "https://github.com/tensorflow/models/blob/master/official/vision/modeling/backbones/mobilenet.py"
        """
        super().__init__()
        assert model in MODEL_SPECS.keys()
        self.model = model
        self.spec = MODEL_SPECS[self.model]
 
        # conv0
        self.conv0 = build_blocks(self.spec['conv0'])
        # layer1
        self.layer1 = build_blocks(self.spec['layer1'])
        # layer2
        self.layer2 = build_blocks(self.spec['layer2'])
        # layer3
        self.layer3 = build_blocks(self.spec['layer3'])
        # layer4
        self.layer4 = build_blocks(self.spec['layer4'])
        # layer5
        self.layer5 = build_blocks(self.spec['layer5'])
        self.width_list = [i.size(1) for i in self.forward(torch.randn(1, 3, 640, 640))]
    def forward(self, x):
        x0 = self.conv0(x)
        x1 = self.layer1(x0)
        x2 = self.layer2(x1)
        x3 = self.layer3(x2)
        x4 = self.layer4(x3)
        # x5 = self.layer5(x4)
        # x5 = nn.functional.adaptive_avg_pool2d(x5, 1)
        return [x1, x2, x3, x4]

def MobileNetV4ConvSmall():
    model = MobileNetV4('MobileNetV4ConvSmall')
    return model
 
def MobileNetV4ConvMedium():
    model = MobileNetV4('MobileNetV4ConvMedium')
    return model
 
def MobileNetV4ConvLarge():
    model = MobileNetV4('MobileNetV4ConvLarge')
    return model
 
def MobileNetV4HybridMedium():
    model = MobileNetV4('MobileNetV4HybridMedium')
    return model
 
def MobileNetV4HybridLarge():
    model = MobileNetV4('MobileNetV4HybridLarge')
    return model
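
A quick sanity check, added here for convenience rather than taken from the original source, confirms that the backbone produces the four feature maps the RT-DETR neck consumes:

if __name__ == '__main__':
    # The backbone should return four maps at strides 4/8/16/32 for 640x640 input.
    model = MobileNetV4ConvSmall()
    for f in model(torch.randn(1, 3, 640, 640)):
        print(f.shape)
    # Expected: [1, 32, 160, 160], [1, 64, 80, 80], [1, 96, 40, 40], [1, 128, 20, 20]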


The implementation of the BiFPN module is as follows:

import torch.nn as nn
import torch

class swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)

class BiFPN(nn.Module):
    def __init__(self, length):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(length, dtype=torch.float32), requires_grad=True)
        self.swish = swish()
        self.epsilon = 0.0001

    def forward(self, x):
        # Learnable per-input scalar weights, normalized by the sum of their
        # swish activations (a variant of the paper's ReLU-based fast
        # normalized fusion); epsilon guards against division by zero.
        weights = self.weight / (torch.sum(self.swish(self.weight), dim=0) + self.epsilon)
        weighted_feature_maps = [weights[i] * x[i] for i in range(len(x))]
        stacked_feature_maps = torch.stack(weighted_feature_maps, dim=0)
        result = torch.sum(stacked_feature_maps, dim=0)
        return result
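
Note that this module only performs the weighted fusion itself; the cross-scale wiring comes from the YAML graph in Section 6. A quick usage sketch:

fuse = BiFPN(length=2)
p4 = torch.randn(1, 256, 40, 40)
p5_up = torch.randn(1, 256, 40, 40)      # P5 already upsampled to P4's size
out = fuse([p4, p5_up])                  # learnable weighted sum of the inputs
print(out.shape)                         # torch.Size([1, 256, 40, 40])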

5. Modification Steps

For the MobileNetV4 modification steps, refer to:

For the BiFPN modification steps, refer to:


6. YAML Model File

6.1 Model Modification ⭐

Once the code is in place, configure the model's YAML file.

Taking ultralytics/cfg/models/rt-detr/rtdetr-l.yaml as an example, create a model file for training on your own dataset, rtdetr-MobileNetv4-BiFPN.yaml, in the same directory.

rtdetr-l.yaml 中的内容复制到 rtdetr-MobileNetv4-BiFPN.yaml 文件下,修改 nc 数量等于自己数据中目标的数量。

📌 The model configuration is as follows:

# Ultralytics YOLO 🚀, AGPL-3.0 license
# RT-DETR-l object detection model with P3-P5 outputs. For details see https://docs.ultralytics.com/models/rtdetr

# Parameters
nc: 1  # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
  # [depth, width, max_channels]
  l: [1.00, 1.00, 1024]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, MobileNetV4ConvSmall, []]  # 4

head:
  - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 5 input_proj.2
  - [-1, 1, AIFI, [1024, 8]]
  - [-1, 1, Conv, [256, 1, 1]] # 7, Y5, lateral_convs.0

  - [2, 1, Conv, [256]]  # 8-P3/8
  - [3, 1, Conv, [256]]  # 9-P4/16
  - [7, 1, Conv, [256]]  # 10-P5/32

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 11 P5->P4
  - [[-1, 9], 1, BiFPN, []] # 12
  - [-1, 3, RepC3, [256]] # 13-P4/16
  
  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 14 P4->P3
  - [[-1, 8], 1, BiFPN, []] # 15
  - [-1, 3, RepC3, [256]] # 16-P3/8

  - [1, 1, Conv, [256, 3, 2]] # 17 P2->P3
  - [[-1, 8, 16], 1, BiFPN, []] # 18
  - [-1, 3, RepC3, [256]] # 19-P3/8

  - [-1, 1, Conv, [256, 3, 2]] # 20 P3->P4
  - [[-1, 9, 13], 1, BiFPN, []] # 21
  - [-1, 3, RepC3, [256]] # 22-P4/16

  - [-1, 1, Conv, [256, 3, 2]] # 23 P4->P5
  - [[-1, 10], 1, BiFPN, []] # 24
  - [-1, 3, RepC3, [256]] # 25-P5/32

  - [[19, 22, 25], 1, RTDETRDecoder, [nc, 256, 300, 4, 8, 3]]  # Detect(P3, P4, P5)
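
With the modules registered in your local ultralytics code base (Section 5), training uses the usual API. The sketch below assumes the YAML above was saved next to rtdetr-l.yaml; the dataset file and hyperparameters are placeholders to adjust for your setup.

from ultralytics import RTDETR

# Load the modified model definition (adjust the path to your layout).
model = RTDETR('ultralytics/cfg/models/rt-detr/rtdetr-MobileNetv4-BiFPN.yaml')

# Train on your dataset; data.yaml, epochs, etc. are placeholders.
model.train(data='data.yaml', epochs=300, imgsz=640, batch=16)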


7. Verifying the Run

Printing the network confirms that the MobileNetV4 and BiFPN modules have been added to the model, which can now be trained.

rtdetr-MobileNetV4-BiFPN


                   from  n    params  module                                       arguments                     
  0                  -1  1   2493024  MobileNetV4ConvSmall                         []                            
  1                  -1  1    328192  ultralytics.nn.modules.conv.Conv             [1280, 256, 1, 1, None, 1, 1, False]
  2                  -1  1    789760  ultralytics.nn.modules.transformer.AIFI      [256, 1024, 8]                
  3                  -1  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256, 1, 1]              
  4                   2  1     16896  ultralytics.nn.modules.conv.Conv             [64, 256]                     
  5                   3  1     25088  ultralytics.nn.modules.conv.Conv             [96, 256]                     
  6                   7  1     66048  ultralytics.nn.modules.conv.Conv             [256, 256]                    
  7                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
  8             [-1, 9]  1         2  ultralytics.nn.AddModules.BiFPN.BiFPN        [2]                           
  9                  -1  3   2101248  ultralytics.nn.modules.block.RepC3           [256, 256, 3]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 8]  1         2  ultralytics.nn.AddModules.BiFPN.BiFPN        [2]                           
 12                  -1  3   2101248  ultralytics.nn.modules.block.RepC3           [256, 256, 3]                 
 13                   1  1     74240  ultralytics.nn.modules.conv.Conv             [32, 256, 3, 2]               
 14         [-1, 8, 16]  1         3  ultralytics.nn.AddModules.BiFPN.BiFPN        [3]                           
 15                  -1  3   2101248  ultralytics.nn.modules.block.RepC3           [256, 256, 3]                 
 16                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 17         [-1, 9, 13]  1         3  ultralytics.nn.AddModules.BiFPN.BiFPN        [3]                           
 18                  -1  3   2101248  ultralytics.nn.modules.block.RepC3           [256, 256, 3]                 
 19                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 20            [-1, 10]  1         2  ultralytics.nn.AddModules.BiFPN.BiFPN        [2]                           
 21                  -1  3   2101248  ultralytics.nn.modules.block.RepC3           [256, 256, 3]                 
 22        [19, 22, 25]  1   3917684  ultralytics.nn.modules.head.RTDETRDecoder    [1, [256, 256, 256], 256, 300, 4, 8, 3]
rtdetr-MobileNetv4-BiFPN summary: 526 layers, 19,463,904 parameters, 19,463,904 gradients, 97.2 GFLOPs