
YOLOv11 Improvements - Backbone Series - Improving YOLOv11 with the 2024 Object Detection Network MobileNetV4 (Supports Free Channel Scaling for the YOLOv11 n/s/m/l/x Variants)

1. Introduction

The improvement presented in this article is MobileNetV4, released in May 2024. MobileNetV4 is a highly optimized neural-network architecture designed specifically for mobile devices. Its two headline changes are the Universal Inverted Bottleneck (UIB) and the Mobile MQA attention module (a brand-new attention mechanism) optimized for mobile accelerators. These innovations significantly improve inference speed and computational efficiency without sacrificing accuracy. As a mobile-oriented network, MobileNetV4 is paired mainly with distillation in its paper, so you can combine it with the distillation posts in this column for further gains. The code in this article lets you scale the channels according to the width factors of the five YOLOv11 variants (n/s/m/l/x), making MobileNetV4 even more lightweight, and gives at least 25 possible combinations. The content of this article is my own work; plagiarism will be pursued.

You are welcome to subscribe to my column and learn YOLO together!



2. How MobileNetV4 Works

Official paper: click here to open the paper

Official code: click here to open the code repository


MobileNetV4 is the latest member of the MobileNet family, designed for mobile devices and built from several novel, efficient architectural components. The most important one is the Universal Inverted Bottleneck (UIB), which combines the inverted bottleneck of earlier models such as MobileNetV2 with newer elements such as the ConvNext block and the feed-forward network (FFN) used in Vision Transformers (ViT). This structure lets the model be adapted and scaled efficiently to a wide range of platforms without over-complicating the architecture-search process.

In addition, MobileNetV4 includes a new attention mechanism named Mobile MQA, which significantly increases inference speed on mobile accelerators by optimizing the ratio of arithmetic operations to memory accesses, a key factor for mobile performance. The architecture is further refined with a careful neural architecture search (NAS) and a novel distillation technique, allowing MobileNetV4 to reach near-optimal performance on many hardware platforms, including mobile CPUs, DSPs, GPUs, and dedicated accelerators such as Apple's Neural Engine and Google's Pixel EdgeTPU.

MobileNetV4 also introduces an improved NAS strategy that combines coarse-grained and fine-grained search, which markedly improves search efficiency and model quality. With this approach MobileNetV4 achieves mostly Pareto-optimal performance, i.e. the best trade-off between efficiency and accuracy across different devices.

Finally, with a new distillation technique MobileNetV4 pushes accuracy even further: its hybrid large model reaches 87% top-1 accuracy on ImageNet-1K while running in only 3.8 ms on the Pixel 8 EdgeTPU. These properties make MobileNetV4 an ideal choice for efficient vision tasks in mobile environments.

Key ideas, extracted and summarized:

1. Universal Inverted Bottleneck (UIB):

MobileNetV4 introduces a new architectural component called the Universal Inverted Bottleneck (UIB). The UIB is a flexible building block that unifies the inverted bottleneck (IB), ConvNext, the feed-forward network (FFN), and a novel Extra Depthwise (ExtraDW) variant (see the short sketch after this list).

2. Mobile MQA attention mechanism:

To get the most out of mobile accelerators, MobileNetV4 uses a dedicated attention module called Mobile MQA. It is tailored to the compute and memory constraints of mobile devices and delivers up to a 39% inference speed-up.

3. An optimized Neural Architecture Search (NAS) recipe:

With an improved NAS recipe, MobileNetV4 can search for and optimize network architectures more efficiently, which helps discover the best model configuration for a given piece of hardware.

4. Model distillation:

A new distillation technique is introduced to raise accuracy further. With it, the MNv4-Hybrid-Large model reaches 87% top-1 accuracy on ImageNet-1K while running in only 3.8 ms on the Pixel 8 EdgeTPU.
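
To make the UIB variants concrete: in the core code of Section 3, each UIB spec carries two optional depthwise kernel sizes, and a size of 0 disables that depthwise conv. Below is a minimal sketch of how those two flags map onto the four variants named above; the helper uib_variant is mine, for illustration only.

# Illustrative helper (not part of the MobileNetV4 code): which UIB variant a spec describes.
# In the block_specs of Section 3, a kernel size of 0 disables the corresponding depthwise conv.
def uib_variant(start_dw_kernel_size: int, middle_dw_kernel_size: int) -> str:
    if start_dw_kernel_size and middle_dw_kernel_size:
        return "ExtraDW"   # depthwise conv both before and after the 1x1 expansion
    if middle_dw_kernel_size:
        return "IB"        # classic inverted bottleneck, as in MobileNetV2
    if start_dw_kernel_size:
        return "ConvNext"  # depthwise conv ahead of the expansion only
    return "FFN"           # two 1x1 convs, no depthwise conv at all

print(uib_variant(5, 5))  # ExtraDW -> e.g. the spec [64, 96, 5, 5, True, 2, 3]
print(uib_variant(0, 3))  # IB      -> e.g. the spec [96, 96, 0, 3, True, 1, 2]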

Personal summary: MobileNetV4 is an efficient deep-learning model designed specifically for mobile devices. By integrating several advanced techniques, such as the Universal Inverted Bottleneck (UIB), a mobile-optimized attention mechanism (Mobile MQA), and an advanced architecture-search method (NAS), it runs efficiently on many kinds of hardware. The combination of these techniques not only greatly increases the model's speed but also noticeably improves its accuracy. In particular, one of its variants reaches 87% accuracy on the standard image-recognition benchmark while running extremely fast.


3. Core Code

See Section 4 for how to use the core code!

from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

__all__ = ['MobileNetV4ConvLarge', 'MobileNetV4ConvSmall', 'MobileNetV4ConvMedium', 'MobileNetV4HybridMedium', 'MobileNetV4HybridLarge']

# Schema of the block_specs entries (see build_blocks below):
#   convbn:   [inp, oup, kernel_size, stride]
#   fused_ib: [inp, oup, stride, expand_ratio, act]
#   uib:      [inp, oup, start_dw_kernel_size, middle_dw_kernel_size, middle_dw_downsample, stride, expand_ratio, (optional) mhsa spec]

MNV4ConvSmall_BLOCK_SPECS = {
    "conv0": {
        "block_name": "convbn",
        "num_blocks": 1,
        "block_specs": [
            [3, 32, 3, 2]
        ]
    },
    "layer1": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [32, 32, 3, 2],
            [32, 32, 1, 1]
        ]
    },
    "layer2": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [32, 96, 3, 2],
            [96, 64, 1, 1]
        ]
    },
    "layer3": {
        "block_name": "uib",
        "num_blocks": 6,
        "block_specs": [
            [64, 96, 5, 5, True, 2, 3],
            [96, 96, 0, 3, True, 1, 2],
            [96, 96, 0, 3, True, 1, 2],
            [96, 96, 0, 3, True, 1, 2],
            [96, 96, 0, 3, True, 1, 2],
            [96, 96, 3, 0, True, 1, 4],
        ]
    },
    "layer4": {
        "block_name": "uib",
        "num_blocks": 6,
        "block_specs": [
            [96, 128, 3, 3, True, 2, 6],
            [128, 128, 5, 5, True, 1, 4],
            [128, 128, 0, 5, True, 1, 4],
            [128, 128, 0, 5, True, 1, 3],
            [128, 128, 0, 3, True, 1, 4],
            [128, 128, 0, 3, True, 1, 4],
        ]
    },
    "layer5": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [128, 960, 1, 1],
            [960, 1280, 1, 1]
        ]
    }
}

MNV4ConvMedium_BLOCK_SPECS = {
    "conv0": {
        "block_name": "convbn",
        "num_blocks": 1,
        "block_specs": [
            [3, 32, 3, 2]
        ]
    },
    "layer1": {
        "block_name": "fused_ib",
        "num_blocks": 1,
        "block_specs": [
            [32, 48, 2, 4.0, True]
        ]
    },
    "layer2": {
        "block_name": "uib",
        "num_blocks": 2,
        "block_specs": [
            [48, 80, 3, 5, True, 2, 4],
            [80, 80, 3, 3, True, 1, 2]
        ]
    },
    "layer3": {
        "block_name": "uib",
        "num_blocks": 8,
        "block_specs": [
            [80, 160, 3, 5, True, 2, 6],
            [160, 160, 3, 3, True, 1, 4],
            [160, 160, 3, 3, True, 1, 4],
            [160, 160, 3, 5, True, 1, 4],
            [160, 160, 3, 3, True, 1, 4],
            [160, 160, 3, 0, True, 1, 4],
            [160, 160, 0, 0, True, 1, 2],
            [160, 160, 3, 0, True, 1, 4]
        ]
    },
    "layer4": {
        "block_name": "uib",
        "num_blocks": 11,
        "block_specs": [
            [160, 256, 5, 5, True, 2, 6],
            [256, 256, 5, 5, True, 1, 4],
            [256, 256, 3, 5, True, 1, 4],
            [256, 256, 3, 5, True, 1, 4],
            [256, 256, 0, 0, True, 1, 4],
            [256, 256, 3, 0, True, 1, 4],
            [256, 256, 3, 5, True, 1, 2],
            [256, 256, 5, 5, True, 1, 4],
            [256, 256, 0, 0, True, 1, 4],
            [256, 256, 0, 0, True, 1, 4],
            [256, 256, 5, 0, True, 1, 2]
        ]
    },
    "layer5": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [256, 960, 1, 1],
            [960, 1280, 1, 1]
        ]
    }
}

MNV4ConvLarge_BLOCK_SPECS = {
    "conv0": {
        "block_name": "convbn",
        "num_blocks": 1,
        "block_specs": [
            [3, 24, 3, 2]
        ]
    },
    "layer1": {
        "block_name": "fused_ib",
        "num_blocks": 1,
        "block_specs": [
            [24, 48, 2, 4.0, True]
        ]
    },
    "layer2": {
        "block_name": "uib",
        "num_blocks": 2,
        "block_specs": [
            [48, 96, 3, 5, True, 2, 4],
            [96, 96, 3, 3, True, 1, 4]
        ]
    },
    "layer3": {
        "block_name": "uib",
        "num_blocks": 11,
        "block_specs": [
            [96, 192, 3, 5, True, 2, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 5, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 3, 0, True, 1, 4]
        ]
    },
    "layer4": {
        "block_name": "uib",
        "num_blocks": 13,
        "block_specs": [
            [192, 512, 5, 5, True, 2, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 3, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 3, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4]
        ]
    },
    "layer5": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [512, 960, 1, 1],
            [960, 1280, 1, 1]
        ]
    }
}

def mhsa(num_heads, key_dim, value_dim, px):
    if px == 24:
        kv_strides = 2
    elif px == 12:
        kv_strides = 1
    query_h_strides = 1
    query_w_strides = 1
    use_layer_scale = True
    use_multi_query = True
    use_residual = True
    return [
        num_heads, key_dim, value_dim, query_h_strides, query_w_strides, kv_strides,
        use_layer_scale, use_multi_query, use_residual
    ]

MNV4HybridConvMedium_BLOCK_SPECS = {
    "conv0": {
        "block_name": "convbn",
        "num_blocks": 1,
        "block_specs": [
            [3, 32, 3, 2]
        ]
    },
    "layer1": {
        "block_name": "fused_ib",
        "num_blocks": 1,
        "block_specs": [
            [32, 48, 2, 4.0, True]
        ]
    },
    "layer2": {
        "block_name": "uib",
        "num_blocks": 2,
        "block_specs": [
            [48, 80, 3, 5, True, 2, 4],
            [80, 80, 3, 3, True, 1, 2]
        ]
    },
    "layer3": {
        "block_name": "uib",
        "num_blocks": 8,
        "block_specs": [
            [80, 160, 3, 5, True, 2, 6],
            [160, 160, 0, 0, True, 1, 2],
            [160, 160, 3, 3, True, 1, 4],
            [160, 160, 3, 5, True, 1, 4, mhsa(4, 64, 64, 24)],
            [160, 160, 3, 3, True, 1, 4, mhsa(4, 64, 64, 24)],
            [160, 160, 3, 0, True, 1, 4, mhsa(4, 64, 64, 24)],
            [160, 160, 3, 3, True, 1, 4, mhsa(4, 64, 64, 24)],
            [160, 160, 3, 0, True, 1, 4]
        ]
    },
    "layer4": {
        "block_name": "uib",
        "num_blocks": 12,
        "block_specs": [
            [160, 256, 5, 5, True, 2, 6],
            [256, 256, 5, 5, True, 1, 4],
            [256, 256, 3, 5, True, 1, 4],
            [256, 256, 3, 5, True, 1, 4],
            [256, 256, 0, 0, True, 1, 2],
            [256, 256, 3, 5, True, 1, 2],
            [256, 256, 0, 0, True, 1, 2],
            [256, 256, 0, 0, True, 1, 4, mhsa(4, 64, 64, 12)],
            [256, 256, 3, 0, True, 1, 4, mhsa(4, 64, 64, 12)],
            [256, 256, 5, 5, True, 1, 4, mhsa(4, 64, 64, 12)],
            [256, 256, 5, 0, True, 1, 4, mhsa(4, 64, 64, 12)],
            [256, 256, 5, 0, True, 1, 4]
        ]
    },
    "layer5": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [256, 960, 1, 1],
            [960, 1280, 1, 1]
        ]
    }
}

MNV4HybridConvLarge_BLOCK_SPECS = {
    "conv0": {
        "block_name": "convbn",
        "num_blocks": 1,
        "block_specs": [
            [3, 24, 3, 2]
        ]
    },
    "layer1": {
        "block_name": "fused_ib",
        "num_blocks": 1,
        "block_specs": [
            [24, 48, 2, 4.0, True]
        ]
    },
    "layer2": {
        "block_name": "uib",
        "num_blocks": 2,
        "block_specs": [
            [48, 96, 3, 5, True, 2, 4],
            [96, 96, 3, 3, True, 1, 4]
        ]
    },
    "layer3": {
        "block_name": "uib",
        "num_blocks": 11,
        "block_specs": [
            [96, 192, 3, 5, True, 2, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 3, True, 1, 4],
            [192, 192, 3, 5, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4],
            [192, 192, 5, 3, True, 1, 4, mhsa(8, 48, 48, 24)],
            [192, 192, 5, 3, True, 1, 4, mhsa(8, 48, 48, 24)],
            [192, 192, 5, 3, True, 1, 4, mhsa(8, 48, 48, 24)],
            [192, 192, 5, 3, True, 1, 4, mhsa(8, 48, 48, 24)],
            [192, 192, 3, 0, True, 1, 4]
        ]
    },
    "layer4": {
        "block_name": "uib",
        "num_blocks": 14,
        "block_specs": [
            [192, 512, 5, 5, True, 2, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 3, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 0, True, 1, 4],
            [512, 512, 5, 3, True, 1, 4],
            [512, 512, 5, 5, True, 1, 4, mhsa(8, 64, 64, 12)],
            [512, 512, 5, 0, True, 1, 4, mhsa(8, 64, 64, 12)],
            [512, 512, 5, 0, True, 1, 4, mhsa(8, 64, 64, 12)],
            [512, 512, 5, 0, True, 1, 4, mhsa(8, 64, 64, 12)],
            [512, 512, 5, 0, True, 1, 4]
        ]
    },
    "layer5": {
        "block_name": "convbn",
        "num_blocks": 2,
        "block_specs": [
            [512, 960, 1, 1],
            [960, 1280, 1, 1]
        ]
    }
}

MODEL_SPECS = {
    "MobileNetV4ConvSmall": MNV4ConvSmall_BLOCK_SPECS,
    "MobileNetV4ConvMedium": MNV4ConvMedium_BLOCK_SPECS,
    "MobileNetV4ConvLarge": MNV4ConvLarge_BLOCK_SPECS,
    "MobileNetV4HybridMedium": MNV4HybridConvMedium_BLOCK_SPECS,
    "MobileNetV4HybridLarge": MNV4HybridConvLarge_BLOCK_SPECS
}

def make_divisible(
        value: float,
        divisor: int,
        min_value: Optional[float] = None,
        round_down_protect: bool = True,
) -> int:
    """
    This function is copied from here
    "https://github.com/tensorflow/models/blob/master/official/vision/modeling/layers/nn_layers.py"

    This is to ensure that all layers have channels that are divisible by 8.

    Args:
        value: A `float` of original value.
        divisor: An `int` of the divisor that need to be checked upon.
        min_value: A `float` of minimum value threshold.
        round_down_protect: A `bool` indicating whether round down more than 10%
            will be allowed.

    Returns:
        The adjusted value in `int` that is divisible against divisor.
    """
    if min_value is None:
        min_value = divisor
    new_value = max(min_value, int(value + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if round_down_protect and new_value < 0.9 * value:
        new_value += divisor
    return int(new_value)


def conv_2d(inp, oup, kernel_size=3, stride=1, groups=1, bias=False, norm=True, act=True):
    conv = nn.Sequential()
    padding = (kernel_size - 1) // 2
    conv.add_module('conv', nn.Conv2d(inp, oup, kernel_size, stride, padding, bias=bias, groups=groups))
    if norm:
        conv.add_module('BatchNorm2d', nn.BatchNorm2d(oup))
    if act:
        conv.add_module('Activation', nn.ReLU6())
    return conv

class InvertedResidual(nn.Module):
    def __init__(self, inp, oup, stride, expand_ratio, act=False, squeeze_excitation=False):
        super(InvertedResidual, self).__init__()
        self.stride = stride
        assert stride in [1, 2]
        hidden_dim = int(round(inp * expand_ratio))
        self.block = nn.Sequential()
        if expand_ratio != 1:
            self.block.add_module('exp_1x1', conv_2d(inp, hidden_dim, kernel_size=3, stride=stride))
        if squeeze_excitation:
            self.block.add_module('conv_3x3',
                                  conv_2d(hidden_dim, hidden_dim, kernel_size=3, stride=stride, groups=hidden_dim))
        self.block.add_module('red_1x1', conv_2d(hidden_dim, oup, kernel_size=1, stride=1, act=act))
        self.use_res_connect = self.stride == 1 and inp == oup

    def forward(self, x):
        if self.use_res_connect:
            return x + self.block(x)
        else:
            return self.block(x)

class UniversalInvertedBottleneckBlock(nn.Module):
    def __init__(self,
                 inp,
                 oup,
                 start_dw_kernel_size,
                 middle_dw_kernel_size,
                 middle_dw_downsample,
                 stride,
                 expand_ratio
                 ):
        """An inverted bottleneck block with optional depthwises.
        Referenced from here https://github.com/tensorflow/models/blob/master/official/vision/modeling/layers/nn_blocks.py
        """
        super().__init__()
        # Starting depthwise conv.
        self.start_dw_kernel_size = start_dw_kernel_size
        if self.start_dw_kernel_size:
            stride_ = stride if not middle_dw_downsample else 1
            self._start_dw_ = conv_2d(inp, inp, kernel_size=start_dw_kernel_size, stride=stride_, groups=inp, act=False)
        # Expansion with 1x1 convs.
        expand_filters = make_divisible(inp * expand_ratio, 8)
        self._expand_conv = conv_2d(inp, expand_filters, kernel_size=1)
        # Middle depthwise conv.
        self.middle_dw_kernel_size = middle_dw_kernel_size
        if self.middle_dw_kernel_size:
            stride_ = stride if middle_dw_downsample else 1
            self._middle_dw = conv_2d(expand_filters, expand_filters, kernel_size=middle_dw_kernel_size, stride=stride_,
                                      groups=expand_filters)
        # Projection with 1x1 convs.
        self._proj_conv = conv_2d(expand_filters, oup, kernel_size=1, stride=1, act=False)
        # Ending depthwise conv.
        # this not used
        # _end_dw_kernel_size = 0
        # self._end_dw = conv_2d(oup, oup, kernel_size=_end_dw_kernel_size, stride=stride, groups=inp, act=False)

    def forward(self, x):
        if self.start_dw_kernel_size:
            x = self._start_dw_(x)
            # print("_start_dw_", x.shape)
        x = self._expand_conv(x)
        # print("_expand_conv", x.shape)
        if self.middle_dw_kernel_size:
            x = self._middle_dw(x)
            # print("_middle_dw", x.shape)
        x = self._proj_conv(x)
        # print("_proj_conv", x.shape)
        return x

class MultiQueryAttentionLayerWithDownSampling(nn.Module):
    def __init__(self, inp, num_heads, key_dim, value_dim, query_h_strides, query_w_strides, kv_strides,
                 dw_kernel_size=3, dropout=0.0):
        """Multi Query Attention with spatial downsampling.
        Referenced from here https://github.com/tensorflow/models/blob/master/official/vision/modeling/layers/nn_blocks.py

        3 parameters are introduced for the spatial downsampling:
        1. kv_strides: downsampling factor on Key and Values only.
        2. query_h_strides: vertical strides on Query only.
        3. query_w_strides: horizontal strides on Query only.

        This is an optimized version.
        1. Projections in Attention are explicitly written out as 1x1 Conv2D.
        2. Additional reshapes are introduced to bring up to a 3x speed up.
        """
        super().__init__()
        self.num_heads = num_heads
        self.key_dim = key_dim
        self.value_dim = value_dim
        self.query_h_strides = query_h_strides
        self.query_w_strides = query_w_strides
        self.kv_strides = kv_strides
        self.dw_kernel_size = dw_kernel_size
        self.dropout = dropout
        self.head_dim = key_dim // num_heads
        if self.query_h_strides > 1 or self.query_w_strides > 1:
            self._query_downsampling_norm = nn.BatchNorm2d(inp)
        self._query_proj = conv_2d(inp, num_heads * key_dim, 1, 1, norm=False, act=False)
        if self.kv_strides > 1:
            self._key_dw_conv = conv_2d(inp, inp, dw_kernel_size, kv_strides, groups=inp, norm=True, act=False)
            self._value_dw_conv = conv_2d(inp, inp, dw_kernel_size, kv_strides, groups=inp, norm=True, act=False)
        self._key_proj = conv_2d(inp, key_dim, 1, 1, norm=False, act=False)
        self._value_proj = conv_2d(inp, key_dim, 1, 1, norm=False, act=False)
        self._output_proj = conv_2d(num_heads * key_dim, inp, 1, 1, norm=False, act=False)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        batch_size, seq_length, _, _ = x.size()
        if self.query_h_strides > 1 or self.query_w_strides > 1:
            # downsample the query spatially with average pooling before the 1x1 projection
            q = F.avg_pool2d(x, (self.query_h_strides, self.query_w_strides))
            q = self._query_downsampling_norm(q)
            q = self._query_proj(q)
        else:
            q = self._query_proj(x)
        px = q.size(2)
        q = q.view(batch_size, self.num_heads, -1, self.key_dim)  # [batch_size, num_heads, seq_length, key_dim]
        if self.kv_strides > 1:
            k = self._key_dw_conv(x)
            k = self._key_proj(k)
            v = self._value_dw_conv(x)
            v = self._value_proj(v)
        else:
            k = self._key_proj(x)
            v = self._value_proj(x)
        k = k.view(batch_size, self.key_dim, -1)  # [batch_size, key_dim, seq_length]
        v = v.view(batch_size, -1, self.key_dim)  # [batch_size, seq_length, key_dim]
        # calculate attention score
        attn_score = torch.matmul(q, k) / (self.head_dim ** 0.5)
        attn_score = self.dropout(attn_score)
        attn_score = F.softmax(attn_score, dim=-1)
        context = torch.matmul(attn_score, v)
        context = context.view(batch_size, self.num_heads * self.key_dim, px, px)
        output = self._output_proj(context)
        return output

class MNV4LayerScale(nn.Module):
    def __init__(self, init_value):
        """LayerScale as introduced in CaiT: https://arxiv.org/abs/2103.17239
        Referenced from here https://github.com/tensorflow/models/blob/master/official/vision/modeling/layers/nn_blocks.py
        As used in MobileNetV4.

        Attributes:
            init_value (float): value to initialize the diagonal matrix of LayerScale.
        """
        super().__init__()
        self.init_value = init_value

    def forward(self, x):
        gamma = self.init_value * torch.ones(x.size(-1), dtype=x.dtype, device=x.device)
        return x * gamma


class MultiHeadSelfAttentionBlock(nn.Module):
    def __init__(
            self,
            inp,
            num_heads,
            key_dim,
            value_dim,
            query_h_strides,
            query_w_strides,
            kv_strides,
            use_layer_scale,
            use_multi_query,
            use_residual=True
    ):
        super().__init__()
        self.query_h_strides = query_h_strides
        self.query_w_strides = query_w_strides
        self.kv_strides = kv_strides
        self.use_layer_scale = use_layer_scale
        self.use_multi_query = use_multi_query
        self.use_residual = use_residual
        self._input_norm = nn.BatchNorm2d(inp)
        if self.use_multi_query:
            self.multi_query_attention = MultiQueryAttentionLayerWithDownSampling(
                inp, num_heads, key_dim, value_dim, query_h_strides, query_w_strides, kv_strides
            )
        else:
            self.multi_head_attention = nn.MultiheadAttention(inp, num_heads, kdim=key_dim)
        if self.use_layer_scale:
            self.layer_scale_init_value = 1e-5
            self.layer_scale = MNV4LayerScale(self.layer_scale_init_value)

    def forward(self, x):
        # Not using CPE, skipped
        # input norm
        shortcut = x
        x = self._input_norm(x)
        # multi query
        if self.use_multi_query:
            x = self.multi_query_attention(x)
        else:
            x, _ = self.multi_head_attention(x, x, x)
        # layer scale
        if self.use_layer_scale:
            x = self.layer_scale(x)
        # use residual
        if self.use_residual:
            x = x + shortcut
        return x

def build_blocks(layer_spec, factor=0.25):
    if not layer_spec.get('block_name'):
        return nn.Sequential()
    block_names = layer_spec['block_name']
    layers = nn.Sequential()
    if block_names == "convbn":
        schema_ = ['inp', 'oup', 'kernel_size', 'stride']
        for i in range(layer_spec['num_blocks']):
            args = dict(zip(schema_, layer_spec['block_specs'][i]))
            if args['inp'] != 3:
                args['inp'] = int(args['inp'] * factor)
            args['oup'] = int(args['oup'] * factor)
            layers.add_module(f"convbn_{i}", conv_2d(**args))
    elif block_names == "uib":
        schema_ = ['inp', 'oup', 'start_dw_kernel_size', 'middle_dw_kernel_size', 'middle_dw_downsample', 'stride',
                   'expand_ratio', 'msha']
        for i in range(layer_spec['num_blocks']):
            args = dict(zip(schema_, layer_spec['block_specs'][i]))
            args['inp'] = int(args['inp'] * factor)
            args['oup'] = int(args['oup'] * factor)
            msha = args.pop("msha") if "msha" in args else 0
            layers.add_module(f"uib_{i}", UniversalInvertedBottleneckBlock(**args))
            if msha:
                msha_schema_ = [
                    "inp", "num_heads", "key_dim", "value_dim", "query_h_strides", "query_w_strides", "kv_strides",
                    "use_layer_scale", "use_multi_query", "use_residual"
                ]
                args = dict(zip(msha_schema_, [args['oup']] + msha))
                layers.add_module(f"msha_{i}", MultiHeadSelfAttentionBlock(**args))
    elif block_names == "fused_ib":
        schema_ = ['inp', 'oup', 'stride', 'expand_ratio', 'act']
        for i in range(layer_spec['num_blocks']):
            args = dict(zip(schema_, layer_spec['block_specs'][i]))
            args['inp'] = int(args['inp'] * factor)
            args['oup'] = int(args['oup'] * factor)
            layers.add_module(f"fused_ib_{i}", InvertedResidual(**args))
    else:
        raise NotImplementedError
    return layers

class MobileNetV4(nn.Module):
    def __init__(self, model, factor=0.25):
        # MobileNetV4ConvSmall  MobileNetV4ConvMedium  MobileNetV4ConvLarge
        # MobileNetV4HybridMedium  MobileNetV4HybridLarge
        """Params to initiate MobileNetV4
        Args:
            model : support 5 types of models as indicated in
            "https://github.com/tensorflow/models/blob/master/official/vision/modeling/backbones/mobilenet.py"
        """
        super().__init__()
        assert model in MODEL_SPECS.keys()
        self.model = model
        self.spec = MODEL_SPECS[self.model]
        # conv0
        self.conv0 = build_blocks(self.spec['conv0'], factor=factor)
        # layer1
        self.layer1 = build_blocks(self.spec['layer1'], factor=factor)
        # layer2
        self.layer2 = build_blocks(self.spec['layer2'], factor=factor)
        # layer3
        self.layer3 = build_blocks(self.spec['layer3'], factor=factor)
        # layer4
        self.layer4 = build_blocks(self.spec['layer4'], factor=factor)
        self.width_list = [i.size(1) for i in self.forward(torch.randn(1, 3, 640, 640))]

    def forward(self, x):
        x0 = self.conv0(x)
        x1 = self.layer1(x0)
        x2 = self.layer2(x1)
        x3 = self.layer3(x2)
        x4 = self.layer4(x3)
        return [x1, x2, x3, x4]


def MobileNetV4ConvSmall(factor=0.5):
    model = MobileNetV4('MobileNetV4ConvSmall', factor=factor)
    return model


def MobileNetV4ConvMedium(factor=0.5):
    model = MobileNetV4('MobileNetV4ConvMedium', factor=factor)
    return model


def MobileNetV4ConvLarge(factor=0.5):
    model = MobileNetV4('MobileNetV4ConvLarge', factor=factor)
    return model


def MobileNetV4HybridMedium(factor=0.5):
    model = MobileNetV4('MobileNetV4HybridMedium', factor=factor)
    return model


def MobileNetV4HybridLarge(factor=0.5):
    model = MobileNetV4('MobileNetV4HybridLarge', factor=factor)
    return model


if __name__ == "__main__":
    # Generating Sample image
    image_size = (1, 3, 640, 640)
    image = torch.rand(*image_size)
    # Model
    model = MobileNetV4HybridMedium()
    out = model(image)
    for i in range(len(out)):
        print(out[i].shape)
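
Before wiring the backbone into YOLOv11, you can sanity-check it on its own. A small sketch (run it in the same file, or after importing the classes above; the 0.75 factor matches the yaml example in Section 5):

import torch

backbone = MobileNetV4ConvSmall(factor=0.75)        # same width factor as the yaml in Section 5
print(backbone.width_list)                          # channel counts of the four feature maps
for feat in backbone(torch.randn(1, 3, 640, 640)):  # four outputs at strides 4/8/16/32
    print(feat.shape)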


4. Step-by-Step Guide to Adding MobileNetV4

4.1 Modification 1

Step 1, as usual, is to create the files. Go to the ultralytics/nn folder and create a directory named 'Addmodules' (if you are using the files from my group, it already exists and does not need to be created)! Then create a new .py file inside it and copy-paste the core code above into it.


4.2 Modification 2

Step 2: in that directory create a new py file named '__init__.py' (already present if you are using the group files), and import our new module inside it, as shown in the figure below.
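
A minimal sketch of what that import can look like, assuming the file created in step 4.1 was named mobilenetv4.py (the file name is your choice; only the import has to match it):

# ultralytics/nn/Addmodules/__init__.py
from .mobilenetv4 import *  # exposes the five backbone classes listed in __all__ of the core code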


4.3 Modification 3

Step 3: open the file 'ultralytics/nn/tasks.py' and import and register our module there (if you are using the group files it is already imported, so go straight to step 4).

From today on, all tutorials will follow this format, because I assume by default that everyone is modifying the files from my group!!
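
As a hedged example (assuming the Addmodules package from steps 4.1/4.2), the import near the top of tasks.py can look like this:

# near the other imports at the top of ultralytics/nn/tasks.py
from ultralytics.nn.Addmodules import (MobileNetV4ConvSmall, MobileNetV4ConvMedium, MobileNetV4ConvLarge,
                                       MobileNetV4HybridMedium, MobileNetV4HybridLarge)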


4.4 Modification 4

Add the following two lines of code (shown in the figure)!!!


4.5 Modification 5

Find the spot at roughly line seven hundred and something (see the picture for the exact location) and modify it as shown, adding the part inside the red box. Note that there are no parentheses, only the function names.

elif m in {add the corresponding model names here; the ones below are handled the same way}:
    m = m(*args)
    c2 = m.width_list  # return the channel list
    backbone = True
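
For the module in this article, the braces can simply list the five MobileNetV4 classes exported by the core code; a filled-in sketch:

elif m in {MobileNetV4ConvSmall, MobileNetV4ConvMedium, MobileNetV4ConvLarge,
           MobileNetV4HybridMedium, MobileNetV4HybridLarge}:
    m = m(*args)
    c2 = m.width_list  # return the channel list
    backbone = True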


4.6 Modification 6

Both parts inside the red boxes below need to be changed.

if isinstance(c2, list):
    m_ = m
    m_.backbone = True
else:
    m_ = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args)  # module
t = str(m)[8:-2].replace('__main__.', '')  # module type
m.np = sum(x.numel() for x in m_.parameters())  # number params
m_.i, m_.f, m_.type = i + 4 if backbone else i, f, t  # attach index, 'from' index, type


4.7 Modification 7

The following also needs to be changed; follow mine exactly.

Replace the original code with the code below.

if verbose:
    LOGGER.info(f'{i:>3}{str(f):>20}{n_:>3}{m.np:10.0f} {t:<45}{str(args):<30}')  # print
save.extend(x % (i + 4 if backbone else i) for x in ([f] if isinstance(f, int) else f) if x != -1)  # append to savelist
layers.append(m_)
if i == 0:
    ch = []
if isinstance(c2, list):
    ch.extend(c2)
    if len(c2) != 5:
        ch.insert(0, 0)
else:
    ch.append(c2)


4.8 Modification 8

Modification 8 is a bit different from the previous ones: it changes part of the forward pass, so we have now left the parse_model method.

You can check the code line numbers in the picture; we have not left tasks.py, it is all the same file. There are several very similar forward functions in this part, so make sure you pick the right one: it is the one at around line 70 or so!!! I also provide the code below, so you can simply copy and paste it; when I have time I will record a video for this part.

The code is as follows ->

def _predict_once(self, x, profile=False, visualize=False, embed=None):
    """
    Perform a forward pass through the network.

    Args:
        x (torch.Tensor): The input tensor to the model.
        profile (bool): Print the computation time of each layer if True, defaults to False.
        visualize (bool): Save the feature maps of the model if True, defaults to False.
        embed (list, optional): A list of feature vectors/embeddings to return.

    Returns:
        (torch.Tensor): The last output of the model.
    """
    y, dt, embeddings = [], [], []  # outputs
    for m in self.model:
        if m.f != -1:  # if not from previous layer
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
        if profile:
            self._profile_one_layer(m, x, dt)
        if hasattr(m, 'backbone'):
            x = m(x)
            if len(x) != 5:  # 0 - 5
                x.insert(0, None)
            for index, i in enumerate(x):
                if index in self.save:
                    y.append(i)
                else:
                    y.append(None)
            x = x[-1]  # pass the last output on to the next layer
        else:
            x = m(x)  # run
            y.append(x if m.i in self.save else None)  # save output
        if visualize:
            feature_visualization(x, m.type, m.i, save_dir=visualize)
        if embed and m.i in embed:
            embeddings.append(nn.functional.adaptive_avg_pool2d(x, (1, 1)).squeeze(-1).squeeze(-1))  # flatten
            if m.i == max(embed):
                return torch.unbind(torch.cat(embeddings, 1), dim=0)
    return x

That completes the modifications, but there are a lot of details here. Be very careful not to replace more code than necessary (that will cause errors), and do not skip any step; either mistake will make the run fail, and the resulting errors are very hard to trace!!!


Attention!!! An extra modification!

Those of you who follow me know that most of my modifications are identical across articles; this network needs one extra step, which concerns the parameter s. Change the s shown below to 640!!! and it will run perfectly!!
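
For reference, in the stock ultralytics tasks.py this s is the dummy-input size used inside DetectionModel.__init__ to probe the strides; if your copy matches, the change looks like the sketch below (the exact line varies between versions, so follow the picture if in doubt):

# inside DetectionModel.__init__ in ultralytics/nn/tasks.py
# s = 256  # 2x min stride   <- stock value
s = 640  # as instructed above, so the dummy forward pass uses a 640x640 input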


Solving the FLOPs printout problem

Find the file 'ultralytics/utils/torch_utils.py' and modify it as shown in the picture below, otherwise the computational cost (GFLOPs) may not be printed.


Important note!!!

If you get a shape-mismatch error during validation, you can fix the size of the validation images as follows ->

Find the file ultralytics/models/yolo/detect/train.py. In the DetectionTrainer class, inside the build_dataset function, change the parameter rect=mode == 'val' to rect=False.
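
A hedged sketch of that change, assuming your build_dataset still looks like the stock DetectionTrainer implementation:

# ultralytics/models/yolo/detect/train.py, inside class DetectionTrainer
def build_dataset(self, img_path, mode="train", batch=None):
    gs = max(int(de_parallel(self.model).stride.max() if self.model else 0), 32)
    # the stock code passes rect=mode == 'val'; forcing it to False fixes the validation shape mismatch
    return build_yolo_dataset(self.args, img_path, batch, self.data, mode=mode, rect=False, stride=gs)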


5. The MobileNetV4 yaml File

5.1 MobileNetV4 yaml file

Training info for this configuration: YOLO11-MobileNetV4 summary: 386 layers, 2,173,383 parameters, 2,173,367 gradients, 5.3 GFLOPs

The supported versions are __all__ = ['MobileNetV4ConvLarge', 'MobileNetV4ConvSmall', 'MobileNetV4ConvMedium', 'MobileNetV4HybridMedium', 'MobileNetV4HybridLarge'] (five in total), used in the backbone as - [-1, 1, MobileNetV4ConvSmall, [0.75]]. The 0.75 argument can be scaled freely with the YOLOv11 width factors 0.25 / 0.50 / 0.75 / 1.00 / 1.5 (I have only tried these values; unusual numbers may raise errors), so this article offers at least 5 x 5 = 25 combinations. The experiments in this article use MobileNetV4ConvSmall with [0.75].

# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLO11 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 319 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 319 layers, 9458752 parameters, 9458736 gradients, 21.7 GFLOPs
  m: [0.50, 1.00, 512] # summary: 409 layers, 20114688 parameters, 20114672 gradients, 68.5 GFLOPs
  l: [1.00, 1.00, 512] # summary: 631 layers, 25372160 parameters, 25372144 gradients, 87.6 GFLOPs
  x: [1.00, 1.50, 512] # summary: 631 layers, 56966176 parameters, 56966160 gradients, 196.0 GFLOPs

# Supported versions: __all__ = ['MobileNetV4ConvLarge', 'MobileNetV4ConvSmall', 'MobileNetV4ConvMedium', 'MobileNetV4HybridMedium', 'MobileNetV4HybridLarge'] (5 in total)

# YOLO11n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, MobileNetV4ConvSmall, [0.75]] # 0-4 P1/2  this single entry expands into four layers; don't let the yaml format limit your thinking (join the group and watch the video if the diagram is unclear)
  # The 0.75 above can be scaled freely with the v11 factors 0.25 0.50 0.75 1.00 1.5; I have only tried these values, unusual numbers may raise errors.
  # So this article offers at least 5 x 5 = 25 combinations.
  - [-1, 1, SPPF, [1024, 5]] # 5
  - [-1, 2, C2PSA, [1024]] # 6

# YOLO11n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 3], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3k2, [512, False]] # 9
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 2], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [256, False]] # 12 (P3/8-small)
  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 9], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3k2, [512, False]] # 15 (P4/16-medium)
  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 6], 1, Concat, [1]] # cat head P5
  - [-1, 2, C3k2, [1024, True]] # 18 (P5/32-large)
  - [[12, 15, 18], 1, Detect, [nc]] # Detect(P3, P4, P5)

5.2 The training script

You can copy my training script below and run it.

import warnings
warnings.filterwarnings('ignore')
from ultralytics import YOLO

if __name__ == '__main__':
    model = YOLO('yolov8-MLLA.yaml')  # replace this with the path of the MobileNetV4 yaml you saved in Section 5.1
    # How to switch model versions: the yaml name above can be changed, e.g. yolov8s.yaml uses the v8s scale.
    # Likewise, for an improved yaml named yolov8-XXX.yaml, to use another scale just change the name to yolov8l-XXX.yaml
    # (change the name passed to YOLO above, not the config file itself)!
    # model.load('yolov8n.pt')  # whether to load pretrained weights; for research I do not recommend it, otherwise it is hard to show a gain in accuracy
    model.train(data=r"C:\Users\Administrator\PycharmProjects\yolov5-master\yolov5-master\Construction Site Safety.v30-raw-images_latestversion.yolov8\data.yaml",
                # If your task is something else, open 'ultralytics/cfg/default.yaml' and change task to detect, segment, classify or pose
                cache=False,
                imgsz=640,
                epochs=150,
                single_cls=False,  # whether this is single-class detection
                batch=16,
                close_mosaic=0,
                workers=0,
                device='0',
                optimizer='SGD',  # using SGD
                # resume='runs/train/exp21/weights/last.pt',  # to resume training, set this to the path of your last.pt
                amp=True,  # if the training loss becomes NaN you can turn amp off
                project='runs/train',
                name='exp',
                )


6. Record of a Successful Run

Below is a screenshot of a successful run.


7. Summary

This concludes the official content of this article. Here I would like to recommend my YOLOv11 effective-improvement column: it is newly opened with an average quality score of 98, and I will keep reproducing papers from the latest top conferences as well as adding more classic improvement mechanisms. If this article helped you, please subscribe to the column and follow me for future updates~