
YOLOv11 Improvements - Model Knowledge Distillation: Using Distillation to Improve YOLOv11 with Lossless Gains - CWDLoss (Online Plus Offline Distillation)

1. Introduction

This article brings the latest improvement: model distillation, where a teacher model guides a student model to raise its accuracy. The content is useful not only for papers; in the vast majority of current work, model distillation is a very important technique, so it is worth studying this article carefully. Using the YOLOv11 project files as the example, it gives a detailed, step-by-step modification tutorial. The article contains the complete written walkthrough, I have recorded a video tutorial for beginners, and if you still cannot complete the changes I also provide the already-modified files that you can run directly, so there is no need to worry about not being able to use it! Model distillation gives genuinely lossless gains; for distillation, this one article is all you need!

Welcome to subscribe to my column and learn YOLO together!



2. Distillation Tutorial

The main approaches to knowledge distillation fall into three categories: response-based knowledge distillation (imitating the teacher model's outputs or final predictions), feature-based knowledge distillation (using the feature representations of the teacher model's intermediate layers), and relation-based knowledge distillation (exploiting relationships between different layers or different data points inside the model). Each approach aims to extract useful information from the large model and instill it into the student model through a specific loss function.

First, by the kind of knowledge that is transferred, distillation comes in three types:

  • 1. Response-based distillation: uses the information from the teacher model's final output layer (e.g., class probabilities) to train the student model.
  • 2. Feature-based distillation: uses the teacher model's intermediate-layer features to guide the student model.
  • 3. Relation-based distillation: focuses on the relationships between different features inside the teacher model, such as the interactions between feature maps.

By how the distillation process is carried out:

  • 1. Online distillation: the teacher and student models are trained at the same time, and the student learns the teacher's knowledge in real time.
  • 2. Offline distillation: the teacher model is trained first and then used to train the student model; the student does not influence the teacher.
  • 3. Self-distillation: a model uses its own predictions as soft labels to improve its own performance.

Knowledge distillation is a diverse field covering many different ways of optimizing the performance and size of deep learning models. From the basic response-, feature-, and relation-based distillation, to online, offline, and self-distillation procedures, to specialized techniques such as adversarial or quantization distillation, each method addresses different problems and needs.
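To make the response-based variant concrete, here is a minimal sketch of a temperature-softened distillation loss (the classic KL divergence between teacher and student class scores). The function name, the temperature value of 4.0, and the random tensors are illustrative assumptions of mine, not part of the YOLOv11 modifications below:

import torch
import torch.nn.functional as F

def response_distillation_loss(student_logits, teacher_logits, tau=4.0):
    """Soft-label KD: KL divergence between temperature-softened class distributions.

    student_logits, teacher_logits: (N, num_classes) raw class scores.
    tau: temperature; larger values soften both distributions.
    """
    p_teacher = F.softmax(teacher_logits / tau, dim=1)          # teacher soft targets
    log_p_student = F.log_softmax(student_logits / tau, dim=1)  # student log-probabilities
    # Multiply by tau**2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean') * (tau ** 2)

# Illustrative usage with random scores for 8 samples and 80 classes.
loss = response_distillation_loss(torch.randn(8, 80), torch.randn(8, 80))
print(loss)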


PS: Before we start, a note: the changes in this article touch quite a few places (the tutorial below is written in detail), and I will also release a video walking through the whole modification from start to finish. If you still cannot get it right, I provide the fully modified code (just train and run it). Of course, you will probably want to apply the changes to your own project; for errors during modification I have already provided a very complete tutorial (with several options to choose from), so I can only reply to error reports as time allows (small missteps in this article's changes will easily cause errors). Also, this project applies only to the YOLOv11 project; errors unrelated to this project will not be answered.


👑 Official Modification Tutorial 👑


2.1 Modification 1

Below is a block of code. Create a new .py file in the directory 'ultralytics/utils' and put this code into it; name the file AddLoss.py. The created file should look like the screenshot below ->


import torch
import torch.nn as nn
import torch.nn.functional as F


def is_parallel(model):
    """Returns True if model is of type DP or DDP."""
    return isinstance(model, (nn.parallel.DataParallel, nn.parallel.DistributedDataParallel))


def de_parallel(model):
    """De-parallelize a model: returns single-GPU model if model is of type DP or DDP."""
    return model.module if is_parallel(model) else model


class MimicLoss(nn.Module):
    def __init__(self, channels_s, channels_t):
        super(MimicLoss, self).__init__()
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.mse = nn.MSELoss()

    def forward(self, y_s, y_t):
        """Forward computation.

        Args:
            y_s (list): The student model prediction with
                shape (N, C, H, W) in list.
            y_t (list): The teacher model prediction with
                shape (N, C, H, W) in list.

        Return:
            torch.Tensor: The calculated loss value of all stages.
        """
        assert len(y_s) == len(y_t)
        losses = []
        for idx, (s, t) in enumerate(zip(y_s, y_t)):
            assert s.shape == t.shape
            losses.append(self.mse(s, t))
        loss = sum(losses)
        return loss


class CWDLoss(nn.Module):
    """PyTorch version of `Channel-wise Distillation for Semantic Segmentation.
    <https://arxiv.org/abs/2011.13256>`_.
    """

    def __init__(self, channels_s, channels_t, tau=1.0):
        super(CWDLoss, self).__init__()
        self.tau = tau

    def forward(self, y_s, y_t):
        """Forward computation.

        Args:
            y_s (list): The student model prediction with
                shape (N, C, H, W) in list.
            y_t (list): The teacher model prediction with
                shape (N, C, H, W) in list.

        Return:
            torch.Tensor: The calculated loss value of all stages.
        """
        assert len(y_s) == len(y_t)
        losses = []
        for idx, (s, t) in enumerate(zip(y_s, y_t)):
            assert s.shape == t.shape
            N, C, H, W = s.shape
            # normalize in channel dimension
            softmax_pred_T = F.softmax(t.view(-1, W * H) / self.tau, dim=1)  # [N*C, H*W]
            logsoftmax = torch.nn.LogSoftmax(dim=1)
            cost = torch.sum(
                softmax_pred_T * logsoftmax(t.view(-1, W * H) / self.tau) -
                softmax_pred_T * logsoftmax(s.view(-1, W * H) / self.tau)) * (self.tau ** 2)
            losses.append(cost / (C * N))
        loss = sum(losses)
        return loss


class MGDLoss(nn.Module):
    def __init__(self, channels_s, channels_t, alpha_mgd=0.00002, lambda_mgd=0.65):
        super(MGDLoss, self).__init__()
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.alpha_mgd = alpha_mgd
        self.lambda_mgd = lambda_mgd
        self.generation = [
            nn.Sequential(
                nn.Conv2d(channel, channel, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channel, channel, kernel_size=3, padding=1)).to(device) for channel in channels_t
        ]

    def forward(self, y_s, y_t):
        """Forward computation.

        Args:
            y_s (list): The student model prediction with
                shape (N, C, H, W) in list.
            y_t (list): The teacher model prediction with
                shape (N, C, H, W) in list.

        Return:
            torch.Tensor: The calculated loss value of all stages.
        """
        assert len(y_s) == len(y_t)
        losses = []
        for idx, (s, t) in enumerate(zip(y_s, y_t)):
            assert s.shape == t.shape
            losses.append(self.get_dis_loss(s, t, idx) * self.alpha_mgd)
        loss = sum(losses)
        return loss

    def get_dis_loss(self, preds_S, preds_T, idx):
        loss_mse = nn.MSELoss(reduction='sum')
        N, C, H, W = preds_T.shape
        device = preds_S.device
        mat = torch.rand((N, 1, H, W)).to(device)
        mat = torch.where(mat > 1 - self.lambda_mgd, 0, 1).to(device)
        masked_fea = torch.mul(preds_S, mat)
        new_fea = self.generation[idx](masked_fea)
        dis_loss = loss_mse(new_fea, preds_T) / N
        return dis_loss


class Distill_LogitLoss:
    def __init__(self, p, t_p, alpha=0.25):
        t_ft = torch.cuda.FloatTensor if t_p[0].is_cuda else torch.Tensor
        self.p = p
        self.t_p = t_p
        self.logit_loss = t_ft([0])
        self.DLogitLoss = nn.MSELoss(reduction="none")
        self.bs = p[0].shape[0]
        self.alpha = alpha

    def __call__(self):
        # per output
        assert len(self.p) == len(self.t_p)
        for i, (pi, t_pi) in enumerate(zip(self.p, self.t_p)):  # layer index, layer predictions
            assert pi.shape == t_pi.shape
            self.logit_loss += torch.mean(self.DLogitLoss(pi, t_pi))
        return self.logit_loss[0] * self.alpha


def get_fpn_features(x, model, fpn_layers=[15, 18, 21]):
    y, fpn_feats = [], []
    with torch.no_grad():
        model = de_parallel(model)
        module_list = model.model[:-1] if hasattr(model, "model") else model[:-1]
        for m in module_list:
            # if not from previous layer
            if m.f != -1:
                x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
            x = m(x)
            y.append(x if m.i in model.save else None)  # save output
            if m.i in fpn_layers:
                fpn_feats.append(x)
    return fpn_feats


def get_channels(model, fpn_layers=[15, 18, 21]):
    y, out_channels = [], []
    p = next(model.parameters())
    x = torch.zeros((1, 3, 64, 64), device=p.device)
    with torch.no_grad():
        model = de_parallel(model)
        module_list = model.model[:-1] if hasattr(model, "model") else model[:-1]
        for m in module_list:
            # if not from previous layer
            if m.f != -1:
                x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
            x = m(x)
            y.append(x if m.i in model.save else None)  # save output
            if m.i in fpn_layers:
                out_channels.append(x.shape[1])
    return out_channels


class FeatureLoss(nn.Module):
    def __init__(self, channels_s, channels_t, distiller='cwd'):
        super(FeatureLoss, self).__init__()
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.align_module = nn.ModuleList([
            nn.Conv2d(channel, tea_channel, kernel_size=1, stride=1, padding=0).to(device)
            for channel, tea_channel in zip(channels_s, channels_t)
        ])
        self.norm = [
            nn.BatchNorm2d(tea_channel, affine=False).to(device)
            for tea_channel in channels_t
        ]
        if distiller == 'mimic':
            self.feature_loss = MimicLoss(channels_s, channels_t)
        elif distiller == 'mgd':
            self.feature_loss = MGDLoss(channels_s, channels_t)
        elif distiller == 'cwd':
            self.feature_loss = CWDLoss(channels_s, channels_t)
        else:
            raise NotImplementedError

    def forward(self, y_s, y_t):
        assert len(y_s) == len(y_t)
        tea_feats = []
        stu_feats = []
        for idx, (s, t) in enumerate(zip(y_s, y_t)):
            s = self.align_module[idx](s)
            s = self.norm[idx](s)
            t = self.norm[idx](t)
            tea_feats.append(t)
            stu_feats.append(s)
        loss = self.feature_loss(stu_feats, tea_feats)
        return loss
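Before wiring AddLoss.py into the trainer, you can sanity-check it in isolation. The snippet below is a minimal sketch of my own (the channel widths, batch size, and feature-map resolutions are made-up example values, not something the tutorial requires); it feeds random student/teacher feature maps through FeatureLoss and prints the scalar loss:

import torch
from ultralytics.utils.AddLoss import FeatureLoss

device = 'cuda' if torch.cuda.is_available() else 'cpu'  # FeatureLoss builds its layers on this device

# Pretend the student/teacher FPN levels have these channel widths (illustrative values only).
channels_s = [64, 128, 256]
channels_t = [128, 256, 512]

# Random feature maps: batch 2, three pyramid levels at 80/40/20 resolution.
feats_s = [torch.randn(2, c, s, s, device=device) for c, s in zip(channels_s, (80, 40, 20))]
feats_t = [torch.randn(2, c, s, s, device=device) for c, s in zip(channels_t, (80, 40, 20))]

distill = FeatureLoss(channels_s, channels_t, distiller='cwd')  # 'cwd', 'mgd' or 'mimic'
print(distill(feats_s, feats_t))  # one scalar tensor that would be added to the training loss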


2.2 Modification 2 (pay attention here, this differs from previous articles)

Find the file 'ultralytics/engine/trainer.py' and copy-paste the code below into the positions shown in my screenshots. Modify it according to the figures below ->

At the top of this file, first add the following two import lines:

from ultralytics.utils import IterableSimpleNamespace
from ultralytics.utils.AddLoss import get_fpn_features, Distill_LogitLoss, de_parallel, get_channels, FeatureLoss


Note that this is the key spot: setting self.distill_feat_type = "cwd" below selects CWDLoss, and if self.distillonline is False, offline distillation is used (set it to True for online distillation). Apart from this, this article is no different from the previous ones!!

# ------------------------------Add-Param-Start---------------
self.featureloss = 0
self.logitloss = 0
self.teacherloss = 0
self.distillloss = None
self.model_t = overrides.get("model_t", None)
self.distill_feat_type = "cwd"  # "cwd", "mgd", "mimic"
self.distillonline = False  # False or True
self.logit_loss = False  # False or True
self.distill_layers = [2, 4, 6, 8, 12, 15, 18, 21]
# ------------------------------Add-Param-End-----------------


2.3 Modification 3

Modify according to the screenshot.


if self.model_t is not None:
    for k, v in self.model_t.model.named_parameters():
        v.requires_grad = True
    self.model_t = self.model_t.to(self.device)


2.4 Modification 4

Modify according to the screenshot.


if self.model_t is not None:
    self.model_t = nn.parallel.DistributedDataParallel(self.model_t, device_ids=[RANK])


2.5 Modification 5

This modification differs slightly from the ones above: the code above is all added, while the code here replaces the existing code.

self.optimizer = self.build_optimizer(model=self.model,
                                      model_t=self.model_t,
                                      distillloss=self.distillloss,
                                      distillonline=self.distillonline,
                                      name=self.args.optimizer,
                                      lr=self.args.lr0,
                                      momentum=self.args.momentum,
                                      decay=weight_decay,
                                      iterations=iterations)



2.6 Modification 6

See the screenshot for this modification!

self.model = de_parallel(self.model)
if self.model_t is not None:
    self.model_t = de_parallel(self.model_t)
    self.channels_s = get_channels(self.model, self.distill_layers)
    self.channels_t = get_channels(self.model_t, self.distill_layers)
    self.distillloss = FeatureLoss(channels_s=self.channels_s, channels_t=self.channels_t, distiller=self.distill_feat_type)



2.7 Modification 7

See the screenshot for this modification!

if self.model_t is not None:
    self.model_t.eval()



2.8 Modification 8

See the screenshot for this modification!

pred_s = self.model(batch['img'])
stu_features = get_fpn_features(batch['img'], self.model, fpn_layers=self.distill_layers)


2.9 Modification 9


if self.model_t is not None:
    distill_weight = ((1 - math.cos(i * math.pi / len(self.train_loader))) / 2) * (0.1 - 1) + 1
    with torch.no_grad():
        pred_t_offline = self.model_t(batch['img'])
        tea_features = get_fpn_features(batch['img'], self.model_t,
                                        fpn_layers=self.distill_layers)  # forward
    self.featureloss = self.distillloss(stu_features, tea_features) * distill_weight
    self.loss += self.featureloss

    if self.distillonline:
        self.model_t.train()
        pred_t_online = self.model_t(batch['img'])
        for p in pred_t_online:
            p = p.detach()
        if i == 0 and epoch == 0:
            self.model_t.args["box"] = self.model.args.box
            self.model_t.args["cls"] = self.model.args.cls
            self.model_t.args["dfl"] = self.model.args.dfl
            self.model_t.args = IterableSimpleNamespace(**self.model_t.args)
        self.teacherloss, _ = self.model_t(batch, pred_t_online)
        if RANK != -1:
            self.teacherloss *= world_size
        self.loss += self.teacherloss

    if self.logit_loss:
        if not self.distillonline:
            distill_logit = Distill_LogitLoss(pred_s, pred_t_offline)
        else:
            distill_logit = Distill_LogitLoss(pred_s, pred_t_online)
        self.logitloss = distill_logit()
        self.loss += self.logitloss
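A side note on the distill_weight line above: it is a cosine schedule over the batches of an epoch, decaying the feature-distillation weight from 1.0 at the first batch towards 0.1 at the last one. The standalone sketch below (my own illustration; the batch count of 100 is an arbitrary example value standing in for len(self.train_loader)) simply evaluates the same formula so you can see the shape of the schedule:

import math

num_batches = 100  # stands in for len(self.train_loader)
for i in (0, 25, 50, 75, 99):
    w = ((1 - math.cos(i * math.pi / num_batches)) / 2) * (0.1 - 1) + 1
    print(f"batch {i:3d}: distill_weight = {w:.3f}")
# batch 0 gives 1.000, the middle of the epoch ~0.550, and the last batch ~0.100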


2.10 Modification 10

See the screenshot for this modification!


mem = f"{torch.cuda.memory_reserved() / 1E9 if torch.cuda.is_available() else 0:.3g}G"  # (GB)
loss_len = self.tloss.shape[0] if len(self.tloss.shape) else 1
losses = self.tloss if loss_len > 1 else torch.unsqueeze(self.tloss, 0)
if RANK in {-1, 0}:
    loss_length = self.tloss.shape[0] if len(self.tloss.shape) else 1
    pbar.set_description(
        ('%12s' * 2 + '%12.4g' * (5 + loss_length)) %
        (f'{epoch + 1}/{self.epochs}', mem, *losses, self.featureloss, self.teacherloss, self.logitloss,
         batch['cls'].shape[0], batch['img'].shape[-1]))
    self.run_callbacks("on_batch_end")
    if self.args.plots and ni in self.plot_idx:
        self.plot_training_samples(batch, ni)


2.11 Modification 11

See the screenshot for this modification! Add the following parameters to the build_optimizer method's signature:

, model_t, distillloss, distillonline=False,



2.12 Modification 12

See the screenshot for this modification!

if model_t is not None and distillonline:
    for v in model_t.modules():
        # print(v)
        if hasattr(v, 'bias') and isinstance(v.bias, nn.Parameter):  # bias (no decay)
            g[2].append(v.bias)
        if isinstance(v, bn):  # weight (no decay)
            g[1].append(v.weight)
        elif hasattr(v, 'weight') and isinstance(v.weight, nn.Parameter):  # weight (with decay)
            g[0].append(v.weight)

if model_t is not None and distillloss is not None:
    for k, v in distillloss.named_modules():
        # print(v)
        if hasattr(v, 'bias') and isinstance(v.bias, nn.Parameter):  # bias (no decay)
            g[2].append(v.bias)
        if isinstance(v, bn) or 'bn' in k:  # weight (no decay)
            g[1].append(v.weight)
        elif hasattr(v, 'weight') and isinstance(v.weight, nn.Parameter):  # weight (with decay)
            g[0].append(v.weight)
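For context, and going by the comments in the snippet itself: g[0], g[1], and g[2] are the parameter groups that build_optimizer already keeps for the student (weights with weight decay, normalization weights without decay, and biases without decay), and this block registers the online teacher's and the FeatureLoss modules' parameters into those same groups so the optimizer can also update them during distillation. The sketch below is a generic illustration of that grouping pattern, not the exact Ultralytics implementation:

import torch
import torch.nn as nn

def collect_param_groups(module, lr=0.01, decay=5e-4):
    """Split a module's parameters into the usual decay / no-decay / bias groups."""
    g = [[], [], []]  # g[0]: weights with decay, g[1]: norm weights without decay, g[2]: biases without decay
    bn = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d, nn.GroupNorm, nn.LayerNorm)
    for v in module.modules():
        if hasattr(v, 'bias') and isinstance(v.bias, nn.Parameter):  # bias (no decay)
            g[2].append(v.bias)
        if isinstance(v, bn):  # normalization weight (no decay)
            g[1].append(v.weight)
        elif hasattr(v, 'weight') and isinstance(v.weight, nn.Parameter):  # weight (with decay)
            g[0].append(v.weight)
    return [{'params': g[0], 'lr': lr, 'weight_decay': decay},
            {'params': g[1], 'lr': lr, 'weight_decay': 0.0},
            {'params': g[2], 'lr': lr, 'weight_decay': 0.0}]

# Example: build an SGD optimizer over the grouped parameters of a small model.
groups = collect_param_groups(nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8)))
optimizer = torch.optim.SGD(groups, lr=0.01, momentum=0.9)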



2.13 Modification 13

PS: Note that we are switching to a different file here!!

Find the file 'ultralytics/cfg/__init__.py' and modify it according to my screenshot!



2.14 Modification 14

PS: Note that we are switching to a different file here!!

Find the file 'ultralytics/models/yolo/detect/train.py' and modify it according to the screenshot!

return ('\n' + '%12s' *
        (7 + len(self.loss_names))) % (
    'Epoch', 'GPU_mem', *self.loss_names, 'dfeaLoss', 'dlineLoss', 'dlogitLoss', 'Instances',
    'Size')



2.15 Modification 15

PS: Note that we are switching to a different file here!!

Find the file 'ultralytics/engine/model.py' and modify it according to the screenshot!


At this point all the modifications are done; what remains is how to use them to distill a model.


3. Usage Tutorial

Model distillation means using an already-trained model (note: it must be a trained model) to teach another model. The teacher model must be a trained weight file; the student model can be either a yaml file or a weight-file path.

Before starting we need a teacher model. Here we use YOLOv11l as the example; its weight file can be downloaded from the official repository.

3.1 Model Distillation Code

PS: Note that the student model and the teacher model must use the same model configuration. That is, if your student model uses BiFPN, then your teacher model must also have been trained with BiFPN; otherwise you will get an error!

import warnings
warnings.filterwarnings('ignore')
from ultralytics import YOLO

if __name__ == '__main__':
    model_t = YOLO(r'yolo11l.pt')  # path to the teacher model's weight file
    model_t.model.model[-1].set_Distillation = True  # leave this as-is; it enables distillation on the teacher head
    model_s = YOLO(r'yolo11.yaml')  # the student model's yaml file, or a weight-file path
    model_s.train(data=r'C:\Users\Administrator\Desktop\YOLOv11-Dis\ultralytics-main\NEU-DET\data.yaml',
                  # replace the value of data with the path to your own dataset
                  cache=False,
                  imgsz=640,
                  epochs=100,
                  single_cls=False,  # whether this is single-class detection
                  batch=1,
                  close_mosaic=10,
                  workers=0,
                  device='0',
                  optimizer='SGD',  # using SGD
                  amp=True,  # if the training loss becomes NaN, try turning amp off
                  project='runs/train',
                  name='exp',
                  model_t=model_t.model
                  )


3.2 Start Distilling

Copy and paste the distillation code into a .py file, as shown in the figure below!


Run the distillation .py file and the model will start training and distilling. The image below shows the model training with distillation; you can see that the distillation losses 'dfeaLoss' and 'dlineLoss' are enabled.



4. Complete Files and Video Explanation

Baidu Netdisk links expire easily; if the link below has expired you can remind me, or join the group, where I will also upload the files!

Link: https://pan.baidu.com/s/1_gwwemBDPF6YEoyO2BkTlA?pwd=j8uq
Extraction code: j8uq


5. Summary

This concludes the main content of this article. Here I recommend my YOLOv11 effective-improvement column; it is newly opened with an average quality score of 98. Going forward I will reproduce papers from the latest top conferences and also add more of the older improvement mechanisms. If this article helped you, please subscribe to the column and follow the upcoming updates~