A must-have for papers: print the per-layer latency and GFLOPs of an RT-DETR model to compare module improvements in depth
Introduction
In comparison experiments, and especially when making a model lightweight, how can we check more precisely whether a modified module is actually effective and brings an improvement?
This is where printing the latency and GFLOPs of every layer of the modified model comes in: it shows exactly how much each module costs, so we can improve the model in a targeted way. Showing the effect of a change at a specific layer directly in a paper is also more intuitive and enriches its content.
As the sample output below shows, the script prints each layer's latency (time), computation (GFLOPs), parameter count (params), and so on. It works on any modified model, and the batch-size is adjustable.
Core code
Create print.py in the project root directory and fill it with the following code:
import argparse
import inspect
import os
import platform
import sys
from datetime import datetime
from pathlib import Path
from typing import Optional

import torch

from ultralytics.nn.tasks import DetectionModel
from ultralytics.utils import LOGGER, colorstr
from ultralytics.utils.checks import check_yaml, git_describe
from ultralytics.utils.torch_utils import profile

Model = DetectionModel  # retain YOLO 'Model' class for backwards compatibility

FILE = Path(__file__).resolve()
ROOT = FILE.parents[0]  # YOLO root directory (print.py sits in the repo root, so parents[0], not parents[1])
if str(ROOT) not in sys.path:
    sys.path.append(str(ROOT))  # add ROOT to PATH
if platform.system() != 'Windows':
    ROOT = Path(os.path.relpath(ROOT, Path.cwd()))  # relative


def print_args(args: Optional[dict] = None, show_file=True, show_func=False):
    # Print function arguments (optional args dict)
    x = inspect.currentframe().f_back  # previous frame
    file, _, func, _, _ = inspect.getframeinfo(x)
    if args is None:  # get args automatically
        args, _, _, frm = inspect.getargvalues(x)
        args = {k: v for k, v in frm.items() if k in args}
    try:
        file = Path(file).resolve().relative_to(ROOT).with_suffix('')
    except ValueError:
        file = Path(file).stem
    s = (f'{file}: ' if show_file else '') + (f'{func}: ' if show_func else '')
    LOGGER.info(colorstr(s) + ', '.join(f'{k}={v}' for k, v in args.items()))


def file_date(path=__file__):
    # Return human-readable file modification date, i.e. '2021-3-26'
    t = datetime.fromtimestamp(Path(path).stat().st_mtime)
    return f'{t.year}-{t.month}-{t.day}'


def select_device(device='', batch_size=0, newline=True):
    # device = None or 'cpu' or 0 or '0' or '0,1,2,3'
    s = f'YOLO 🚀 {git_describe() or file_date()} Python-{platform.python_version()} torch-{torch.__version__} '
    device = str(device).strip().lower().replace('cuda:', '').replace('none', '')  # to string, 'cuda:0' to '0'
    cpu = device == 'cpu'
    mps = device == 'mps'  # Apple Metal Performance Shaders (MPS)
    if cpu or mps:
        os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # force torch.cuda.is_available() = False
    elif device:  # non-cpu device requested
        os.environ['CUDA_VISIBLE_DEVICES'] = device  # must be set before torch.cuda.is_available() is called
        assert torch.cuda.is_available() and torch.cuda.device_count() >= len(device.replace(',', '')), \
            f"Invalid CUDA '--device {device}' requested, use '--device cpu' or pass valid CUDA device(s)"

    if not cpu and not mps and torch.cuda.is_available():  # prefer GPU if available
        devices = device.split(',') if device else '0'  # i.e. '0,1,6,7'
        n = len(devices)  # device count
        if n > 1 and batch_size > 0:  # check that batch_size is divisible by device_count
            assert batch_size % n == 0, f'batch-size {batch_size} not multiple of GPU count {n}'
        space = ' ' * (len(s) + 1)
        for i, d in enumerate(devices):
            p = torch.cuda.get_device_properties(i)
            s += f"{'' if i == 0 else space}CUDA:{d} ({p.name}, {p.total_memory / (1 << 20):.0f}MiB)\n"  # bytes to MiB
        arg = 'cuda:0'
    elif mps and getattr(torch, 'has_mps', False) and torch.backends.mps.is_available():  # prefer MPS if available
        s += 'MPS\n'
        arg = 'mps'
    else:  # revert to CPU
        s += 'CPU\n'
        arg = 'cpu'

    if not newline:
        s = s.rstrip()
    LOGGER.info(s)
    return torch.device(arg)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--cfg', type=str, default='ultralytics/cfg/models/rt-detr/rtdetr-resnet50.yaml', help='model.yaml')
    parser.add_argument('--batch-size', type=int, default=1, help='total batch size for all GPUs')
    parser.add_argument('--device', default='0', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
    parser.add_argument('--profile', action='store_true', default=True, help='profile model speed')
    parser.add_argument('--line-profile', action='store_true', default=True, help='profile model speed layer by layer')
    parser.add_argument('--test', action='store_true', help='test all yolo*.yaml')
    opt = parser.parse_args()
    opt.cfg = check_yaml(opt.cfg)  # check YAML
    print_args(vars(opt))
    device = select_device(opt.device)

    # Create the model and a dummy 640x640 input batch
    im = torch.rand(opt.batch_size, 3, 640, 640).to(device)
    model = Model(opt.cfg).to(device)
    model.eval()

    # Options
    if opt.line_profile:  # profile layer by layer
        model(im, profile=True)
    elif opt.profile:  # profile forward-backward
        results = profile(input=im, ops=[model], n=3)
    elif opt.test:  # test all models
        for cfg in Path(ROOT / 'models').rglob('yolo*.yaml'):
            try:
                _ = Model(cfg)
            except Exception as e:
                print(f'Error in {cfg}: {e}')
    else:  # report fused model summary
        model.fuse()
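For reference, the layer-by-layer numbers come from ultralytics' built-in profiling path: model(im, profile=True) ends up in BaseModel._profile_one_layer, which times each module and measures its FLOPs with thop. The helper below is only a minimal sketch of that idea, not the library's exact code; the name profile_layer and the 10-pass timing loop are illustrative assumptions:

import time

import thop  # pip install thop; also what ultralytics uses for FLOPs
import torch


def profile_layer(layer: torch.nn.Module, x: torch.Tensor, n: int = 10):
    # thop returns MACs; x2 converts MACs to FLOPs, /1e9 to GFLOPs
    gflops = thop.profile(layer, inputs=(x,), verbose=False)[0] / 1e9 * 2
    t0 = time.time()
    with torch.no_grad():
        for _ in range(n):  # average the latency over n forward passes
            _ = layer(x)
    dt_ms = (time.time() - t0) * 1000 / n  # ms per forward pass
    params = sum(p.numel() for p in layer.parameters())  # parameter count
    return dt_ms, gflops, params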
Parameter explanation
The parameters you may need to adjust all live in the if __name__ == '__main__': block:
- cfg: points to the model configuration file, i.e. the model to inspect. I set it to rtdetr-resnet50.yaml; at the moment this only works for the ResNet series and its modified variants, and HGBlock-based configs still raise an error. (An example of pointing it at your own config follows this list.)
- batch-size: the batch size of the dummy input. I set it to 1; a larger batch runs faster overall, but speeds should only be compared at the same batch size.
- device: the device to use, a GPU index such as 0, 1, 2, ... or cpu.
- profile: whether to print the metrics; default=True means print. Note that because action='store_true' is combined with default=True, this switch is toggled by editing the default in the file rather than on the command line.
- line-profile: whether to profile layer by layer; set default=True to enable it.
- test: whether to build every configuration file; leave it off, since we only need to inspect the model we specified.
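To profile your own improved model, point --cfg at your modified configuration file instead. A minimal sketch; my-rtdetr-improved.yaml is a hypothetical filename standing in for your own config:

parser.add_argument('--cfg', type=str, default='ultralytics/cfg/models/rt-detr/my-rtdetr-improved.yaml', help='model.yaml')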
After configuration, the parameters look like this:
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--cfg', type=str, default='ultralytics/cfg/models/rt-detr/rtdetr-resnet50.yaml', help='model.yaml')
    parser.add_argument('--batch-size', type=int, default=1, help='total batch size for all GPUs')
    parser.add_argument('--device', default='0', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
    parser.add_argument('--profile', action='store_true', default=True, help='profile model speed')
    parser.add_argument('--line-profile', action='store_true', default=True, help='profile model speed layer by layer')
    parser.add_argument('--test', action='store_true', help='test all yolo*.yaml')
    opt = parser.parse_args()
How to run and the results
Once everything is configured, run the script. From the project root this is simply:
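python print.py

Passing the flags explicitly is equivalent and just restates the defaults configured above:

python print.py --cfg ultralytics/cfg/models/rt-detr/rtdetr-resnet50.yaml --batch-size 1 --device 0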
The script first prints the model's structure summary, and right after that the per-layer latency, GFLOPs, and parameter count:
time (ms) GFLOPs params module
0.32 1.98 9536 ultralytics.nn.modules.block.ResNetLayer
1.35 11.19 215808 ultralytics.nn.modules.block.ResNetLayer
1.78 16.98 1219584 ultralytics.nn.modules.block.ResNetLayer
2.55 24.05 7098368 ultralytics.nn.modules.block.ResNetLayer
1.33 13.25 14964736 ultralytics.nn.modules.block.ResNetLayer
0.08 0.42 524800 ultralytics.nn.modules.conv.Conv
2.65 0.42 789760 ultralytics.nn.modules.transformer.AIFI
0.12 0.05 66048 ultralytics.nn.modules.conv.Conv
0.02 0.00 0 torch.nn.modules.upsampling.Upsample
0.08 0.84 262656 ultralytics.nn.modules.conv.Conv
0.02 0.00 0 ultralytics.nn.modules.conv.Concat
0.99 7.16 2232320 ultralytics.nn.modules.block.RepC3
0.12 0.21 66048 ultralytics.nn.modules.conv.Conv
0.02 0.00 0 torch.nn.modules.upsampling.Upsample
0.09 1.69 131584 ultralytics.nn.modules.conv.Conv
0.02 0.00 0 ultralytics.nn.modules.conv.Concat
1.00 28.63 2232320 ultralytics.nn.modules.block.RepC3
0.13 1.89 590336 ultralytics.nn.modules.conv.Conv
0.02 0.00 0 ultralytics.nn.modules.conv.Concat
1.03 7.16 2232320 ultralytics.nn.modules.block.RepC3
0.14 0.47 590336 ultralytics.nn.modules.conv.Conv
0.02 0.00 0 ultralytics.nn.modules.conv.Concat
1.10 1.79 2232320 ultralytics.nn.modules.block.RepC3
31.87 12.63 7466252 ultralytics.nn.modules.head.RTDETRDecoder
46.83 - - Total
With that, you can inspect the latency and GFLOPs of every layer.
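For a paper, two of these printouts can be turned into a layer-by-layer comparison automatically, since the table is easy to parse. A minimal sketch, assuming you have redirected the baseline and improved outputs into baseline.txt and improved.txt (hypothetical filenames) and that both models have the same number of layers, as with a drop-in module swap:

def parse_table(path):
    # Keep only data rows of the form 'time GFLOPs params module';
    # the header, the 'Total' row, and any other log lines are skipped.
    rows = []
    for line in open(path, encoding='utf-8'):
        parts = line.split()
        if len(parts) == 4 and parts[1] != '-':
            try:
                rows.append((float(parts[0]), float(parts[1]), int(parts[2]), parts[3]))
            except ValueError:
                continue  # not a data row
    return rows


base, new = parse_table('baseline.txt'), parse_table('improved.txt')
for (t0, g0, p0, m0), (t1, g1, p1, m1) in zip(base, new):
    if (t0, g0, p0, m0) != (t1, g1, p1, m1):  # report only the changed layers
        print(f'{m0} -> {m1}: time {t0:.2f}->{t1:.2f} ms, '
              f'GFLOPs {g0:.2f}->{g1:.2f}, params {p0}->{p1}')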