
YOLOv11 Improvements - Model Knowledge Distillation: Using Distillation to Improve YOLOv11 with Lossless Gains - CWDLoss (Online Plus Offline Distillation)

1. Introduction

This article brings the latest improvement: model distillation, where a teacher model guides a student model to raise its accuracy. The content is useful not only for papers; in the vast majority of current work, model distillation is a very important technique, so it is worth studying this article carefully. Using the YOLOv11 project files as the example, it gives a detailed, step-by-step modification tutorial. The article contains the complete written walkthrough, I have recorded a video tutorial for beginners, and if you still cannot complete the changes I also provide the already-modified files that you can run directly, so there is no need to worry about not being able to use it! Model distillation gives genuinely lossless gains; for distillation, this one article is all you need!

Welcome to subscribe to my column and learn YOLO together!



2. Distillation Tutorial

The main approaches to knowledge distillation fall into three categories: response-based knowledge distillation (imitating the teacher model's outputs or final predictions), feature-based knowledge distillation (using the feature representations of the teacher model's intermediate layers), and relation-based knowledge distillation (exploiting relationships between different layers or different data points inside the model). Each approach aims to extract useful information from the large model and instill it into the student model through a specific loss function.

First, by the kind of knowledge that is transferred, distillation comes in three types:

  • 1. Response-based distillation: uses the information from the teacher model's final output layer (e.g., class probabilities) to train the student model.
  • 2. Feature-based distillation: uses the teacher model's intermediate-layer features to guide the student model.
  • 3. Relation-based distillation: focuses on the relationships between different features inside the teacher model, such as the interactions between feature maps.

By how the distillation process is carried out:

  • 1. Online distillation: the teacher and student models are trained at the same time, and the student learns the teacher's knowledge in real time.
  • 2. Offline distillation: the teacher model is trained first and then used to train the student model; the student does not influence the teacher.
  • 3. Self-distillation: a model uses its own predictions as soft labels to improve its own performance.

Knowledge distillation is a diverse field covering many different ways of optimizing the performance and size of deep learning models. From the basic response-, feature-, and relation-based distillation, to online, offline, and self-distillation procedures, to specialized techniques such as adversarial or quantization distillation, each method addresses different problems and needs.
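To make the response-based variant concrete, here is a minimal sketch of a temperature-softened distillation loss (the classic KL divergence between teacher and student class scores). The function name, the temperature value of 4.0, and the random tensors are illustrative assumptions of mine, not part of the YOLOv11 modifications below:

import torch
import torch.nn.functional as F

def response_distillation_loss(student_logits, teacher_logits, tau=4.0):
    """Soft-label KD: KL divergence between temperature-softened class distributions.

    student_logits, teacher_logits: (N, num_classes) raw class scores.
    tau: temperature; larger values soften both distributions.
    """
    p_teacher = F.softmax(teacher_logits / tau, dim=1)          # teacher soft targets
    log_p_student = F.log_softmax(student_logits / tau, dim=1)  # student log-probabilities
    # Multiply by tau**2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean') * (tau ** 2)

# Illustrative usage with random scores for 8 samples and 80 classes.
loss = response_distillation_loss(torch.randn(8, 80), torch.randn(8, 80))
print(loss)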


PS: Before we start, a note: the changes in this article touch quite a few places (the tutorial below is written in detail), and I will also release a video walking through the whole modification from start to finish. If you still cannot get it right, I provide the fully modified code (just train and run it). Of course, you will probably want to apply the changes to your own project; for errors during modification I have already provided a very complete tutorial (with several options to choose from), so I can only reply to error reports as time allows (small missteps in this article's changes will easily cause errors). Also, this project applies only to the YOLOv11 project; errors unrelated to this project will not be answered.


👑 Official Modification Tutorial 👑


2.1 Modification 1

Below is a block of code. Create a new .py file in the directory 'ultralytics/utils' and put this code into it; name the file AddLoss.py. The created file should look like the screenshot below ->


import torch
import torch.nn as nn
import torch.nn.functional as F


def is_parallel(model):
    """Returns True if model is of type DP or DDP."""
    return isinstance(model, (nn.parallel.DataParallel, nn.parallel.DistributedDataParallel))


def de_parallel(model):
    """De-parallelize a model: returns single-GPU model if model is of type DP or DDP."""
    return model.module if is_parallel(model) else model


class MimicLoss(nn.Module):
    def __init__(self, channels_s, channels_t):
        super(MimicLoss, self).__init__()
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.mse = nn.MSELoss()

    def forward(self, y_s, y_t):
        """Forward computation.

        Args:
            y_s (list): The student model prediction with
                shape (N, C, H, W) in list.
            y_t (list): The teacher model prediction with
                shape (N, C, H, W) in list.

        Return:
            torch.Tensor: The calculated loss value of all stages.
        """
        assert len(y_s) == len(y_t)
        losses = []
        for idx, (s, t) in enumerate(zip(y_s, y_t)):
            assert s.shape == t.shape
            losses.append(self.mse(s, t))
        loss = sum(losses)
        return loss


class CWDLoss(nn.Module):
    """PyTorch version of `Channel-wise Distillation for Semantic Segmentation.
    <https://arxiv.org/abs/2011.13256>`_.
    """

    def __init__(self, channels_s, channels_t, tau=1.0):
        super(CWDLoss, self).__init__()
        self.tau = tau

    def forward(self, y_s, y_t):
        """Forward computation.

        Args:
            y_s (list): The student model prediction with
                shape (N, C, H, W) in list.
            y_t (list): The teacher model prediction with
                shape (N, C, H, W) in list.

        Return:
            torch.Tensor: The calculated loss value of all stages.
        """
        assert len(y_s) == len(y_t)
        losses = []
        for idx, (s, t) in enumerate(zip(y_s, y_t)):
            assert s.shape == t.shape
            N, C, H, W = s.shape
            # normalize in channel dimension
            softmax_pred_T = F.softmax(t.view(-1, W * H) / self.tau, dim=1)  # [N*C, H*W]
            logsoftmax = torch.nn.LogSoftmax(dim=1)
            cost = torch.sum(
                softmax_pred_T * logsoftmax(t.view(-1, W * H) / self.tau) -
                softmax_pred_T * logsoftmax(s.view(-1, W * H) / self.tau)) * (self.tau ** 2)
            losses.append(cost / (C * N))
        loss = sum(losses)
        return loss


class MGDLoss(nn.Module):
    def __init__(self, channels_s, channels_t, alpha_mgd=0.00002, lambda_mgd=0.65):
        super(MGDLoss, self).__init__()
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.alpha_mgd = alpha_mgd
        self.lambda_mgd = lambda_mgd
        self.generation = [
            nn.Sequential(
                nn.Conv2d(channel, channel, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channel, channel, kernel_size=3, padding=1)).to(device) for channel in channels_t
        ]

    def forward(self, y_s, y_t):
        """Forward computation.

        Args:
            y_s (list): The student model prediction with
                shape (N, C, H, W) in list.
            y_t (list): The teacher model prediction with
                shape (N, C, H, W) in list.

        Return:
            torch.Tensor: The calculated loss value of all stages.
        """
        assert len(y_s) == len(y_t)
        losses = []
        for idx, (s, t) in enumerate(zip(y_s, y_t)):
            assert s.shape == t.shape
            losses.append(self.get_dis_loss(s, t, idx) * self.alpha_mgd)
        loss = sum(losses)
        return loss

    def get_dis_loss(self, preds_S, preds_T, idx):
        loss_mse = nn.MSELoss(reduction='sum')
        N, C, H, W = preds_T.shape
        device = preds_S.device
        mat = torch.rand((N, 1, H, W)).to(device)
        mat = torch.where(mat > 1 - self.lambda_mgd, 0, 1).to(device)
        masked_fea = torch.mul(preds_S, mat)
        new_fea = self.generation[idx](masked_fea)
        dis_loss = loss_mse(new_fea, preds_T) / N
        return dis_loss


class Distill_LogitLoss:
    def __init__(self, p, t_p, alpha=0.25):
        t_ft = torch.cuda.FloatTensor if t_p[0].is_cuda else torch.Tensor
        self.p = p
        self.t_p = t_p
        self.logit_loss = t_ft([0])
        self.DLogitLoss = nn.MSELoss(reduction="none")
        self.bs = p[0].shape[0]
        self.alpha = alpha

    def __call__(self):
        # per output
        assert len(self.p) == len(self.t_p)
        for i, (pi, t_pi) in enumerate(zip(self.p, self.t_p)):  # layer index, layer predictions
            assert pi.shape == t_pi.shape
            self.logit_loss += torch.mean(self.DLogitLoss(pi, t_pi))
        return self.logit_loss[0] * self.alpha


def get_fpn_features(x, model, fpn_layers=[15, 18, 21]):
    y, fpn_feats = [], []
    with torch.no_grad():
        model = de_parallel(model)
        module_list = model.model[:-1] if hasattr(model, "model") else model[:-1]
        for m in module_list:
            # if not from previous layer
            if m.f != -1:
                x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
            x = m(x)
            y.append(x if m.i in model.save else None)  # save output
            if m.i in fpn_layers:
                fpn_feats.append(x)
    return fpn_feats


def get_channels(model, fpn_layers=[15, 18, 21]):
    y, out_channels = [], []
    p = next(model.parameters())
    x = torch.zeros((1, 3, 64, 64), device=p.device)
    with torch.no_grad():
        model = de_parallel(model)
        module_list = model.model[:-1] if hasattr(model, "model") else model[:-1]
        for m in module_list:
            # if not from previous layer
            if m.f != -1:
                x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
            x = m(x)
            y.append(x if m.i in model.save else None)  # save output
            if m.i in fpn_layers:
                out_channels.append(x.shape[1])
    return out_channels


class FeatureLoss(nn.Module):
    def __init__(self, channels_s, channels_t, distiller='cwd'):
        super(FeatureLoss, self).__init__()
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.align_module = nn.ModuleList([
            nn.Conv2d(channel, tea_channel, kernel_size=1, stride=1, padding=0).to(device)
            for channel, tea_channel in zip(channels_s, channels_t)
        ])
        self.norm = [
            nn.BatchNorm2d(tea_channel, affine=False).to(device)
            for tea_channel in channels_t
        ]
        if distiller == 'mimic':
            self.feature_loss = MimicLoss(channels_s, channels_t)
        elif distiller == 'mgd':
            self.feature_loss = MGDLoss(channels_s, channels_t)
        elif distiller == 'cwd':
            self.feature_loss = CWDLoss(channels_s, channels_t)
        else:
            raise NotImplementedError

    def forward(self, y_s, y_t):
        assert len(y_s) == len(y_t)
        tea_feats = []
        stu_feats = []
        for idx, (s, t) in enumerate(zip(y_s, y_t)):
            s = self.align_module[idx](s)
            s = self.norm[idx](s)
            t = self.norm[idx](t)
            tea_feats.append(t)
            stu_feats.append(s)
        loss = self.feature_loss(stu_feats, tea_feats)
        return loss
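Before wiring AddLoss.py into the trainer, you can sanity-check it in isolation. The snippet below is a minimal sketch of my own (the channel widths, batch size, and feature-map resolutions are made-up example values, not something the tutorial requires); it feeds random student/teacher feature maps through FeatureLoss and prints the scalar loss:

import torch
from ultralytics.utils.AddLoss import FeatureLoss

device = 'cuda' if torch.cuda.is_available() else 'cpu'  # FeatureLoss builds its layers on this device

# Pretend the student/teacher FPN levels have these channel widths (illustrative values only).
channels_s = [64, 128, 256]
channels_t = [128, 256, 512]

# Random feature maps: batch 2, three pyramid levels at 80/40/20 resolution.
feats_s = [torch.randn(2, c, s, s, device=device) for c, s in zip(channels_s, (80, 40, 20))]
feats_t = [torch.randn(2, c, s, s, device=device) for c, s in zip(channels_t, (80, 40, 20))]

distill = FeatureLoss(channels_s, channels_t, distiller='cwd')  # 'cwd', 'mgd' or 'mimic'
print(distill(feats_s, feats_t))  # one scalar tensor that would be added to the training loss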


2.2 Modification 2 (pay attention here, this differs from previous articles)

Find the file 'ultralytics/engine/trainer.py' and copy-paste the code below into the positions shown in my screenshots. Modify it according to the figures below ->

At the top of this file, first add the following two import lines:

from ultralytics.utils import IterableSimpleNamespace
from ultralytics.utils.AddLoss import get_fpn_features, Distill_LogitLoss, de_parallel, get_channels, FeatureLoss


Note that this is the key spot: setting self.distill_feat_type = "cwd" below selects CWDLoss, and if self.distillonline is False, offline distillation is used (set it to True for online distillation). Apart from this, this article is no different from the previous ones!!

# ------------------------------Add-Param-Start---------------
self.featureloss = 0
self.logitloss = 0
self.teacherloss = 0
self.distillloss = None
self.model_t = overrides.get("model_t", None)
self.distill_feat_type = "cwd"  # "cwd", "mgd", "mimic"
self.distillonline = False  # False or True
self.logit_loss = False  # False or True
self.distill_layers = [2, 4, 6, 8, 12, 15, 18, 21]
# ------------------------------Add-Param-End-----------------


2.3 Modification 3

Modify according to the screenshot.


if self.model_t is not None:
    for k, v in self.model_t.model.named_parameters():
        v.requires_grad = True
    self.model_t = self.model_t.to(self.device)


2.4 Modification 4

Modify according to the screenshot.


if self.model_t is not None:
    self.model_t = nn.parallel.DistributedDataParallel(self.model_t, device_ids=[RANK])


2.5 Modification 5

This modification differs slightly from the ones above: the code above is all added, while the code here replaces the existing code.

self.optimizer = self.build_optimizer(model=self.model,
                                      model_t=self.model_t,
                                      distillloss=self.distillloss,
                                      distillonline=self.distillonline,
                                      name=self.args.optimizer,
                                      lr=self.args.lr0,
                                      momentum=self.args.momentum,
                                      decay=weight_decay,
                                      iterations=iterations)



2.6 Modification 6

See the screenshot for this modification!

self.model = de_parallel(self.model)
if self.model_t is not None:
    self.model_t = de_parallel(self.model_t)
    self.channels_s = get_channels(self.model, self.distill_layers)
    self.channels_t = get_channels(self.model_t, self.distill_layers)
    self.distillloss = FeatureLoss(channels_s=self.channels_s, channels_t=self.channels_t, distiller=self.distill_feat_type)



2.7 Modification 7

See the screenshot for this modification!

if self.model_t is not None:
    self.model_t.eval()



2.8 Modification 8

See the screenshot for this modification!

pred_s = self.model(batch['img'])
stu_features = get_fpn_features(batch['img'], self.model, fpn_layers=self.distill_layers)


2.9 Modification 9


if self.model_t is not None:
    distill_weight = ((1 - math.cos(i * math.pi / len(self.train_loader))) / 2) * (0.1 - 1) + 1
    with torch.no_grad():
        pred_t_offline = self.model_t(batch['img'])
        tea_features = get_fpn_features(batch['img'], self.model_t,
                                        fpn_layers=self.distill_layers)  # forward
    self.featureloss = self.distillloss(stu_features, tea_features) * distill_weight
    self.loss += self.featureloss

    if self.distillonline:
        self.model_t.train()
        pred_t_online = self.model_t(batch['img'])
        for p in pred_t_online:
            p = p.detach()
        if i == 0 and epoch == 0:
            self.model_t.args["box"] = self.model.args.box
            self.model_t.args["cls"] = self.model.args.cls
            self.model_t.args["dfl"] = self.model.args.dfl
            self.model_t.args = IterableSimpleNamespace(**self.model_t.args)
        self.teacherloss, _ = self.model_t(batch, pred_t_online)
        if RANK != -1:
            self.teacherloss *= world_size
        self.loss += self.teacherloss

    if self.logit_loss:
        if not self.distillonline:
            distill_logit = Distill_LogitLoss(pred_s, pred_t_offline)
        else:
            distill_logit = Distill_LogitLoss(pred_s, pred_t_online)
        self.logitloss = distill_logit()
        self.loss += self.logitloss
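A side note on the distill_weight line above: it is a cosine schedule over the batches of an epoch, decaying the feature-distillation weight from 1.0 at the first batch towards 0.1 at the last one. The standalone sketch below (my own illustration; the batch count of 100 is an arbitrary example value standing in for len(self.train_loader)) simply evaluates the same formula so you can see the shape of the schedule:

import math

num_batches = 100  # stands in for len(self.train_loader)
for i in (0, 25, 50, 75, 99):
    w = ((1 - math.cos(i * math.pi / num_batches)) / 2) * (0.1 - 1) + 1
    print(f"batch {i:3d}: distill_weight = {w:.3f}")
# batch 0 gives 1.000, the middle of the epoch ~0.550, and the last batch ~0.100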


2.10 Modification 10

See the screenshot for this modification!


mem = f"{torch.cuda.memory_reserved() / 1E9 if torch.cuda.is_available() else 0:.3g}G"  # (GB)
loss_len = self.tloss.shape[0] if len(self.tloss.shape) else 1
losses = self.tloss if loss_len > 1 else torch.unsqueeze(self.tloss, 0)
if RANK in {-1, 0}:
    loss_length = self.tloss.shape[0] if len(self.tloss.shape) else 1
    pbar.set_description(
        ('%12s' * 2 + '%12.4g' * (5 + loss_length)) %
        (f'{epoch + 1}/{self.epochs}', mem, *losses, self.featureloss, self.teacherloss, self.logitloss,
         batch['cls'].shape[0], batch['img'].shape[-1]))
    self.run_callbacks("on_batch_end")
    if self.args.plots and ni in self.plot_idx:
        self.plot_training_samples(batch, ni)


2.11 Modification 11

See the screenshot for this modification! Add the following parameters to the build_optimizer method's signature:

, model_t, distillloss, distillonline=False,



2.12 Modification 12

See the screenshot for this modification!

if model_t is not None and distillonline:
    for v in model_t.modules():
        # print(v)
        if hasattr(v, 'bias') and isinstance(v.bias, nn.Parameter):  # bias (no decay)
            g[2].append(v.bias)
        if isinstance(v, bn):  # weight (no decay)
            g[1].append(v.weight)
        elif hasattr(v, 'weight') and isinstance(v.weight, nn.Parameter):  # weight (with decay)
            g[0].append(v.weight)

if model_t is not None and distillloss is not None:
    for k, v in distillloss.named_modules():
        # print(v)
        if hasattr(v, 'bias') and isinstance(v.bias, nn.Parameter):  # bias (no decay)
            g[2].append(v.bias)
        if isinstance(v, bn) or 'bn' in k:  # weight (no decay)
            g[1].append(v.weight)
        elif hasattr(v, 'weight') and isinstance(v.weight, nn.Parameter):  # weight (with decay)
            g[0].append(v.weight)
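For context, and going by the comments in the snippet itself: g[0], g[1], and g[2] are the parameter groups that build_optimizer already keeps for the student (weights with weight decay, normalization weights without decay, and biases without decay), and this block registers the online teacher's and the FeatureLoss modules' parameters into those same groups so the optimizer can also update them during distillation. The sketch below is a generic illustration of that grouping pattern, not the exact Ultralytics implementation:

import torch
import torch.nn as nn

def collect_param_groups(module, lr=0.01, decay=5e-4):
    """Split a module's parameters into the usual decay / no-decay / bias groups."""
    g = [[], [], []]  # g[0]: weights with decay, g[1]: norm weights without decay, g[2]: biases without decay
    bn = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d, nn.GroupNorm, nn.LayerNorm)
    for v in module.modules():
        if hasattr(v, 'bias') and isinstance(v.bias, nn.Parameter):  # bias (no decay)
            g[2].append(v.bias)
        if isinstance(v, bn):  # normalization weight (no decay)
            g[1].append(v.weight)
        elif hasattr(v, 'weight') and isinstance(v.weight, nn.Parameter):  # weight (with decay)
            g[0].append(v.weight)
    return [{'params': g[0], 'lr': lr, 'weight_decay': decay},
            {'params': g[1], 'lr': lr, 'weight_decay': 0.0},
            {'params': g[2], 'lr': lr, 'weight_decay': 0.0}]

# Example: build an SGD optimizer over the grouped parameters of a small model.
groups = collect_param_groups(nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8)))
optimizer = torch.optim.SGD(groups, lr=0.01, momentum=0.9)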



2.13 Modification 13

PS: Note that we are switching to a different file here!!

Find the file 'ultralytics/cfg/__init__.py' and modify it according to my screenshot!



2.14 Modification 14

PS: Note that we are switching to a different file here!!

Find the file 'ultralytics/models/yolo/detect/train.py' and modify it according to the screenshot!

return ('\n' + '%12s' *
        (7 + len(self.loss_names))) % (
    'Epoch', 'GPU_mem', *self.loss_names, 'dfeaLoss', 'dlineLoss', 'dlogitLoss', 'Instances',
    'Size')



2.15 Modification 15

PS: Note that we are switching to a different file here!!

Find the file 'ultralytics/engine/model.py' and modify it according to the screenshot!


At this point all the modifications are done; what remains is how to use them to distill a model.


3. Usage Tutorial

Model distillation means using an already-trained model (note: it must be a trained model) to teach another model. The teacher model must be a trained weight file; the student model can be either a yaml file or a weight-file path.

Before starting we need a teacher model. Here we use YOLOv11l as the example; its weight file can be downloaded from the official repository.

3.1 Model Distillation Code

PS: Note that the student model and the teacher model must use the same model configuration. That is, if your student model uses BiFPN, then your teacher model must also have been trained with BiFPN; otherwise you will get an error!

import warnings
warnings.filterwarnings('ignore')
from ultralytics import YOLO

if __name__ == '__main__':
    model_t = YOLO(r'yolo11l.pt')  # path to the teacher model's weight file
    model_t.model.model[-1].set_Distillation = True  # leave this as-is; it enables distillation on the teacher head
    model_s = YOLO(r'yolo11.yaml')  # the student model's yaml file, or a weight-file path
    model_s.train(data=r'C:\Users\Administrator\Desktop\YOLOv11-Dis\ultralytics-main\NEU-DET\data.yaml',
                  # replace the value of data with the path to your own dataset
                  cache=False,
                  imgsz=640,
                  epochs=100,
                  single_cls=False,  # whether this is single-class detection
                  batch=1,
                  close_mosaic=10,
                  workers=0,
                  device='0',
                  optimizer='SGD',  # using SGD
                  amp=True,  # if the training loss becomes NaN, try turning amp off
                  project='runs/train',
                  name='exp',
                  model_t=model_t.model
                  )


3.2 Start Distilling

Copy and paste the distillation code into a .py file, as shown in the figure below!


Run the distillation .py file and the model will start training and distilling. The image below shows the model training with distillation; you can see that the distillation losses 'dfeaLoss' and 'dlineLoss' are enabled.



4. Complete Files and Video Explanation

Baidu Netdisk links expire easily; if the link below has expired you can remind me, or join the group, where I will also upload the files!

Link: https://pan.baidu.com/s/1_gwwemBDPF6YEoyO2BkTlA?pwd=j8uq
Extraction code: j8uq


5. Summary

This concludes the main content of this article. Here I recommend my YOLOv11 effective-improvement column; it is newly opened with an average quality score of 98. Going forward I will reproduce papers from the latest top conferences and also add more of the older improvement mechanisms. If this article helped you, please subscribe to the column and follow the upcoming updates~