Object detection¶

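Object detection is the computer vision task of detecting instances (such as humans, buildings, or cars) in an image. Object detection models receive an image as input and output the bounding box coordinates and associated labels of the detected objects. An image can contain multiple objects, each with its own bounding box and label (for example, it can contain both a car and a building), and objects of the same class can appear at several positions in the image (for example, the image can contain several cars). This task is commonly used in autonomous driving to detect pedestrians, road signs, and traffic lights. Other applications include counting objects in images and image search.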

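In this guide, you will learn how to:

Fine-tune DETR, a model that combines a convolutional backbone with an encoder-decoder Transformer, on the CPPE-5 dataset.
Use your fine-tuned model for inference.

To see all architectures and checkpoints compatible with this task, check the task page.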

Before you begin, make sure you have all the necessary libraries installed:

In [ ]:

pip install -q datasets transformers accelerate timm
pip install -q -U "albumentations>=1.4.5" torchmetrics pycocotools

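You'll use 🤗 Datasets to load a dataset from the Hugging Face Hub, 🤗 Transformers to train your model, and albumentations to augment the data.

We encourage you to share your model with the community. Log in to your Hugging Face account to upload it to the Hub. When prompted, enter your token to log in: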

In [ ]:

from huggingface_hub import notebook_login

notebook_login()

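To get started, define the global constants, namely the model name and the image size. This tutorial uses the Conditional DETR model because it converges faster. You can pick any object detection model available in the transformers library.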

In [ ]:

MODEL_NAME = "microsoft/conditional-detr-resnet-50"  # or "facebook/detr-resnet-50"
IMAGE_SIZE = 480

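Load the CPPE-5 dataset¶

The CPPE-5 dataset contains images with annotations identifying medical personal protective equipment (PPE) in the context of the COVID-19 pandemic.

Start by loading the dataset and creating a validation split from the training set: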

In [ ]:

from datasets import load_dataset

cppe5 = load_dataset("cppe-5")

if "validation" not in cppe5:
    split = cppe5["train"].train_test_split(0.15, seed=1337)
    cppe5["train"] = split["train"]
    cppe5["validation"] = split["test"]

cppe5

You'll see that this dataset has 1000 images for the train and validation sets combined, and a test set with 29 images.
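
As a rough illustration of what a seeded 85/15 split does, the sketch below shuffles indices with a fixed seed and sets aside 15% of them. It uses only the standard library and is not how 🤗 Datasets implements train_test_split; the helper name split_indices is hypothetical.

```python
import math
import random


def split_indices(num_examples, test_size=0.15, seed=1337):
    """Hypothetical helper: shuffle indices reproducibly and split them."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle
    num_test = math.ceil(num_examples * test_size)
    return indices[num_test:], indices[:num_test]


train_idx, val_idx = split_indices(1000)
print(len(train_idx), len(val_idx))  # 850 150
```

Fixing the seed keeps the split reproducible, so metrics from different runs are computed on the same validation images.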

To get familiar with the data, explore what the examples look like.

In [ ]:

cppe5["train"][0]

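The examples in the dataset have the following fields:

image_id: the example image ID
image: a PIL.Image.Image object containing the image
width: width of the image
height: height of the image
objects: a dictionary containing bounding box metadata for the objects in the image:
- id: the annotation ID
- area: the area of the bounding box
- bbox: the object's bounding box (in COCO format)
- category: the object's category, with possible values Coverall (0), Face_Shield (1), Gloves (2), Goggles (3), and Mask (4)

You may notice that the bbox field follows the COCO format, which is the format the DETR model expects. However, the grouping of the fields inside objects differs from the annotation format DETR requires. You will need to apply some preprocessing transformations before using this data for training.

To get a better understanding of the data, visualize an example in the dataset.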

In [ ]:

import numpy as np
from PIL import Image, ImageDraw

image = cppe5["train"][2]["image"]
annotations = cppe5["train"][2]["objects"]
width, height = image.size  # PIL returns (width, height)
draw = ImageDraw.Draw(image)

categories = cppe5["train"].features["objects"].feature["category"].names

id2label = {index: x for index, x in enumerate(categories, start=0)}
label2id = {v: k for k, v in id2label.items()}

for i in range(len(annotations["id"])):
    box = annotations["bbox"][i]
    class_idx = annotations["category"][i]
    x, y, w, h = tuple(box)
    # Check whether the coordinates are normalized
    if max(box) > 1.0:
        # Coordinates are absolute, no need to rescale them
        x1, y1 = int(x), int(y)
        x2, y2 = int(x + w), int(y + h)
    else:
        # Coordinates are normalized, rescale them to pixels
        x1 = int(x * width)
        y1 = int(y * height)
        x2 = int((x + w) * width)
        y2 = int((y + h) * height)
    # Draw with the pixel coordinates computed above
    draw.rectangle((x1, y1, x2, y2), outline="red", width=1)
    draw.text((x1, y1), id2label[class_idx], fill="white")

image

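To visualize the bounding boxes with associated labels, you can get the labels from the dataset's metadata, specifically the category field. You'll also want to create dictionaries that map a label ID to a label class (id2label) and the other way around (label2id). You can use them later when setting up the model. Including these maps makes your model easier for others to reuse if you share it on the Hugging Face Hub. Note that the drawing code above assumes the boxes are in COCO format (x_min, y_min, width, height). It has to be adjusted to work with other formats such as (x_min, y_min, x_max, y_max).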
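
If your annotations are stored as (x_min, y_min, x_max, y_max) instead, one option is to convert them before drawing. A minimal sketch of both directions follows; the helper names are hypothetical and not part of any library used in this guide.

```python
def coco_to_pascal_voc(box):
    """(x_min, y_min, width, height) -> (x_min, y_min, x_max, y_max)."""
    x_min, y_min, width, height = box
    return (x_min, y_min, x_min + width, y_min + height)


def pascal_voc_to_coco(box):
    """(x_min, y_min, x_max, y_max) -> (x_min, y_min, width, height)."""
    x_min, y_min, x_max, y_max = box
    return (x_min, y_min, x_max - x_min, y_max - y_min)


print(coco_to_pascal_voc((10, 20, 30, 40)))  # (10, 20, 40, 60)
print(pascal_voc_to_coco((10, 20, 40, 60)))  # (10, 20, 30, 40)
```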

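As a final step of getting familiar with the data, explore it for potential issues. One common problem with object detection datasets is bounding boxes that "stretch" beyond the edge of the image. Such "runaway" bounding boxes can raise errors during training and should be addressed. There are a few examples with this issue in this dataset. To keep things simple in this guide, we will set clip=True in the transformations below.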
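
To make the problem concrete, here is a small sketch that detects a COCO box leaving the image and clips it back to the image bounds. It only illustrates the idea and is not Albumentations' implementation of clip=True; both helper names are hypothetical.

```python
def is_out_of_bounds(box, image_width, image_height):
    """Return True if a COCO box (x_min, y_min, width, height) leaves the image."""
    x_min, y_min, width, height = box
    return (
        x_min < 0
        or y_min < 0
        or x_min + width > image_width
        or y_min + height > image_height
    )


def clip_box(box, image_width, image_height):
    """Clip a COCO box to the image and return it in COCO format."""
    x_min, y_min, width, height = box
    x1 = min(max(x_min, 0), image_width)
    y1 = min(max(y_min, 0), image_height)
    x2 = min(max(x_min + width, 0), image_width)
    y2 = min(max(y_min + height, 0), image_height)
    return (x1, y1, x2 - x1, y2 - y1)


box = (600, 300, 100, 50)  # extends 60 px past the right edge of a 640x480 image
print(is_out_of_bounds(box, 640, 480))  # True
print(clip_box(box, 640, 480))  # (600, 300, 40, 50)
```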

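Preprocess the data¶

To fine-tune a model, you must preprocess the data you plan to use so that it precisely matches the approach used for the pre-trained model. AutoImageProcessor takes care of processing image data to create the pixel_values, pixel_mask, and labels that a DETR model can train with. The image processor has some attributes that you won't have to worry about:

image_mean = [0.485, 0.456, 0.406]
image_std = [0.229, 0.224, 0.225]

These are the mean and standard deviation used to normalize images during model pre-training. These values are crucial to replicate when doing inference or fine-tuning a pre-trained image model.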
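
To make the role of these values concrete, here is a sketch of per-channel normalization applied to a single RGB pixel. It assumes the pixel is first rescaled from [0, 255] to [0, 1] before the mean and standard deviation are applied; the helper name normalize_pixel is hypothetical, and the image processor does the equivalent work on whole images.

```python
IMAGE_MEAN = [0.485, 0.456, 0.406]
IMAGE_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb, mean=IMAGE_MEAN, std=IMAGE_STD):
    """Rescale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return [(value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std)]


# A pure white pixel after normalization
print([round(v, 3) for v in normalize_pixel((255, 255, 255))])  # [2.249, 2.429, 2.64]
```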

Instantiate the image processor from the same checkpoint as the model you want to fine-tune.

In [ ]:

from transformers import AutoImageProcessor

MAX_SIZE = IMAGE_SIZE

image_processor = AutoImageProcessor.from_pretrained(
    MODEL_NAME,
    do_resize=True,
    size={"max_height": MAX_SIZE, "max_width": MAX_SIZE},
    do_pad=True,
    pad_size={"height": MAX_SIZE, "width": MAX_SIZE},
)

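Before passing the images to the image_processor, apply two preprocessing transformations to the dataset:

Augmenting images
Reformatting annotations to meet DETR expectations

First, to make sure the model does not overfit on the training data, you can apply image augmentation with any data augmentation library. Here we use Albumentations. This library ensures that transformations affect the image and update the bounding boxes accordingly. The 🤗 Datasets documentation has a detailed guide on how to augment images for object detection, and it uses the same dataset as an example. Apply some geometric and color transformations to the image. For additional augmentation options, explore the Albumentations Demo Space.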

In [ ]:

import albumentations as A

train_augment_and_transform = A.Compose(
    [
        A.Perspective(p=0.1),
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.5),
        A.HueSaturationValue(p=0.1),
    ],
    bbox_params=A.BboxParams(format="coco", label_fields=["category"], clip=True, min_area=25),
)

validation_transform = A.Compose(
    [A.NoOp()],
    bbox_params=A.BboxParams(format="coco", label_fields=["category"], clip=True),
)
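
To see what updating the bounding boxes along with the image means, consider a horizontal flip: the image is mirrored, so each box has to be mirrored too. The sketch below shows that arithmetic for a single COCO box. It illustrates the idea only and is not Albumentations' implementation; hflip_coco_box is a hypothetical name.

```python
def hflip_coco_box(box, image_width):
    """Mirror a COCO box (x_min, y_min, width, height) around the vertical axis."""
    x_min, y_min, width, height = box
    new_x_min = image_width - (x_min + width)  # old right edge becomes the new left edge
    return (new_x_min, y_min, width, height)


print(hflip_coco_box((10, 20, 30, 40), image_width=100))  # (60, 20, 30, 40)
```

Flipping twice returns the original box, which is a quick sanity check for this kind of transform.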

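The image_processor expects the annotations to be in the following format: {'image_id': int, 'annotations': List[Dict]}, where each dictionary is a COCO object annotation. Let's add a function to reformat annotations for a single example: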

In [ ]:

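def format_image_annotations_as_coco(image_id, categories, areas, bboxes):
    """Format one set of image annotations to the COCO format

    Args:
        image_id (int): image id, e.g. 1
        categories (List[int]): list of categories/class labels corresponding to provided bounding boxes
        areas (List[float]): list of areas corresponding to provided bounding boxes
        bboxes (List[Tuple[float]]): list of bounding boxes provided in COCO format
            ([x_min, y_min, width, height] in absolute coordinates)

    Returns:
        dict: {
            "image_id": image id,
            "annotations": list of formatted annotations
        }
    """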
    annotations = []
    for category, area, bbox in zip(categories, areas, bboxes):
        formatted_annotation = {
            "image_id": image_id,
            "category_id": category,
            "iscrowd": 0,
            "area": area,
            "bbox": list(bbox),
        }
        annotations.append(formatted_annotation)

    return {
        "image_id": image_id,
        "annotations": annotations,
    }

Now you can combine the image and annotation transformations to use on a batch of examples:

In [ ]:

def augment_and_transform_batch(examples, transform, image_processor, return_pixel_mask=False):
    """Apply augmentations and format annotations in COCO format for the object detection task"""

    images = []
    annotations = []
    for image_id, image, objects in zip(examples["image_id"], examples["image"], examples["objects"]):
        image = np.array(image.convert("RGB"))

        # Apply augmentations
        output = transform(image=image, bboxes=objects["bbox"], category=objects["category"])
        images.append(output["image"])

        # Augmentation can move, clip, or drop boxes, so recompute the areas
        # from the transformed boxes instead of reusing objects["area"]
        areas = [float(bbox[2] * bbox[3]) for bbox in output["bboxes"]]

        # Format annotations in COCO format
        formatted_annotations = format_image_annotations_as_coco(
            image_id, output["category"], areas, output["bboxes"]
        )
        annotations.append(formatted_annotations)

    # Apply the image processor transformations: resizing, rescaling, normalization
    result = image_processor(images=images, annotations=annotations, return_tensors="pt")

    if not return_pixel_mask:
        result.pop("pixel_mask", None)

    return result

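Apply this preprocessing function to the entire dataset using the 🤗 Datasets with_transform method. This method applies transformations on the fly when you load an element of the dataset.

At this point, you can check what an example from the dataset looks like after the transformations. You should see a tensor with pixel_values and the labels. Because return_pixel_mask defaults to False in augment_and_transform_batch, the pixel_mask tensor is dropped from the output.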

In [ ]:

from functools import partial

# Make transform functions for batches and apply them to the dataset splits
train_transform_batch = partial(
    augment_and_transform_batch, transform=train_augment_and_transform, image_processor=image_processor
)
validation_transform_batch = partial(
    augment_and_transform_batch, transform=validation_transform, image_processor=image_processor
)

cppe5["train"] = cppe5["train"].with_transform(train_transform_batch)
cppe5["validation"] = cppe5["validation"].with_transform(validation_transform_batch)
cppe5["test"] = cppe5["test"].with_transform(validation_transform_batch)

cppe5["train"][15]

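You have successfully augmented the individual images and prepared their annotations. However, preprocessing isn't complete yet. In the final step, create a custom collate_fn to batch images together. The image processor has already padded every image to the same MAX_SIZE by MAX_SIZE shape, so collate_fn only needs to stack the pixel_values, keep the labels as a list, and stack the pixel_mask when it is present. The pixel_mask indicates which pixels are real (1) and which are padding (0).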

In [ ]:

import torch

def collate_fn(batch):
    data = {}
    data["pixel_values"] = torch.stack([x["pixel_values"] for x in batch])
    data["labels"] = [x["labels"] for x in batch]
    if "pixel_mask" in batch[0]:
        data["pixel_mask"] = torch.stack([x["pixel_mask"] for x in batch])
    return data
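
The collate_fn above can simply stack tensors because the images already share one shape. When they don't, padding each image to the largest one in the batch and building the matching pixel_mask might look like the following sketch. It uses nested lists and single-channel images to stay dependency-free; real code would operate on tensors, and pad_batch is a hypothetical name.

```python
def pad_batch(images, pad_value=0):
    """Pad 2D images (lists of rows) to the largest height and width in the batch.

    Returns the padded images and masks with 1 for real pixels and 0 for padding.
    """
    max_height = max(len(image) for image in images)
    max_width = max(len(image[0]) for image in images)

    padded_images, masks = [], []
    for image in images:
        height, width = len(image), len(image[0])
        padded, mask = [], []
        for row_idx in range(max_height):
            if row_idx < height:
                # Real row: extend it on the right with padding
                padded.append(list(image[row_idx]) + [pad_value] * (max_width - width))
                mask.append([1] * width + [0] * (max_width - width))
            else:
                # Row below the image: padding only
                padded.append([pad_value] * max_width)
                mask.append([0] * max_width)
        padded_images.append(padded)
        masks.append(mask)
    return padded_images, masks


small = [[5, 5]]                 # 1 x 2 image
large = [[7, 7, 7], [7, 7, 7]]   # 2 x 3 image
padded, masks = pad_batch([small, large])
print(padded[0])  # [[5, 5, 0], [0, 0, 0]]
print(masks[0])   # [[1, 1, 0], [0, 0, 0]]
```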

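Prepare a function to compute mAP¶

Object detection models are commonly evaluated with a set of COCO-style metrics. We are going to use torchmetrics to compute the mAP (mean average precision) and mAR (mean average recall) metrics, and wrap them in a compute_metrics function so that Trainer can use them for evaluation.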
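
COCO-style mAP matches predictions to targets by intersection over union (IoU). For intuition, here is a minimal sketch of IoU for two boxes in (x_min, y_min, x_max, y_max) format. torchmetrics handles this internally; the function name iou_xyxy is hypothetical.

```python
def iou_xyxy(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 10x10 boxes overlapping in a 5x5 region: 25 / 175
print(round(iou_xyxy((0, 0, 10, 10), (5, 5, 15, 15)), 4))  # 0.1429
```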

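The intermediate format of the boxes used for training is YOLO (normalized), but we will compute the metrics for boxes in Pascal VOC (absolute) format in order to handle box areas correctly. Let's define a function that converts bounding boxes to Pascal VOC format: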

In [ ]:

from transformers.image_transforms import center_to_corners_format

def convert_bbox_yolo_to_pascal(boxes, image_size):
    """
    Convert bounding boxes from YOLO format (x_center, y_center, width, height), normalized to [0, 1],
    to Pascal VOC format (x_min, y_min, x_max, y_max) in absolute coordinates.

    Args:
        boxes (torch.Tensor): Bounding boxes in YOLO format
        image_size (Tuple[int, int]): Image size in format (height, width)

    Returns:
        torch.Tensor: Bounding boxes in Pascal VOC format (x_min, y_min, x_max, y_max)
    """
    # Convert center format to corners format
    boxes = center_to_corners_format(boxes)

    # Convert to absolute coordinates
    height, width = image_size
    boxes = boxes * torch.tensor([[width, height, width, height]])

    return boxes
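
As a worked example of the same arithmetic, the sketch below converts a single box in plain Python, without tensors. The function name yolo_to_pascal is hypothetical and exists only to show the numbers.

```python
def yolo_to_pascal(box, image_size):
    """Plain-Python version of the conversion for a single box.

    box: normalized (x_center, y_center, width, height)
    image_size: (height, width) in pixels
    """
    x_center, y_center, box_w, box_h = box
    height, width = image_size
    x_min = (x_center - box_w / 2) * width
    y_min = (y_center - box_h / 2) * height
    x_max = (x_center + box_w / 2) * width
    y_max = (y_center + box_h / 2) * height
    return (x_min, y_min, x_max, y_max)


# A box centered in a 480x640 (height x width) image, half as wide and half as tall
print(yolo_to_pascal((0.5, 0.5, 0.5, 0.5), image_size=(480, 640)))
# (160.0, 120.0, 480.0, 360.0)
```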

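Then, in the compute_metrics function, collect the predicted and target bounding boxes, scores, and labels from the evaluation loop results and pass them to the scoring function.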

In [ ]:

import numpy as np
from dataclasses import dataclass
from torchmetrics.detection.mean_ap import MeanAveragePrecision

@dataclass
class ModelOutput:
    logits: torch.Tensor
    pred_boxes: torch.Tensor

@torch.no_grad()
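def compute_metrics(evaluation_results, image_processor, threshold=0.0, id2label=None):
    """
    Compute mAP, mAR, and their variants for the object detection task.

    Args:
        evaluation_results (EvalPrediction): Predictions and targets from evaluation.
        image_processor: Image processor used to post-process the predictions.
        threshold (float, optional): Threshold to filter predicted boxes by confidence. Defaults to 0.0.
        id2label (Optional[dict], optional): Mapping from class id to class name. Defaults to None.

    Returns:
        Mapping[str, float]: Metrics in the form of a dictionary {<metric_name>: <metric_value>}
    """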

    predictions, targets = evaluation_results.predictions, evaluation_results.label_ids

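    # For metric computation we need to provide:
    #  - targets as a list of dictionaries with keys "boxes" and "labels"
    #  - predictions as a list of dictionaries with keys "boxes", "scores", and "labels"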

    image_sizes = []
    post_processed_targets = []
    post_processed_predictions = []

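    # Collect targets in the required format for metric computation
    for batch in targets:
        # Collect image sizes, we will need them to post-process the predictions
        batch_image_sizes = torch.tensor(np.array([x["orig_size"] for x in batch]))
        image_sizes.append(batch_image_sizes)
        # Collect targets in the required format for metric computation.
        # Boxes were converted to YOLO format for model training,
        # here we convert them to Pascal VOC format (x_min, y_min, x_max, y_max)
        for image_target in batch:
            boxes = torch.tensor(image_target["boxes"])
            boxes = convert_bbox_yolo_to_pascal(boxes, image_target["orig_size"])
            labels = torch.tensor(image_target["class_labels"])
            post_processed_targets.append({"boxes": boxes, "labels": labels})

    # Collect predictions in the required format for metric computation.
    # The model produces boxes in YOLO format, then the image processor converts them to Pascal VOC format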
    for batch, target_sizes in zip(predictions, image_sizes):
        batch_logits, batch_boxes = batch[1], batch[2]
        output = ModelOutput(logits=torch.tensor(batch_logits), pred_boxes=torch.tensor(batch_boxes))
        post_processed_output = image_processor.post_process_object_detection(
            output, threshold=threshold, target_sizes=target_sizes
        )
        post_processed_predictions.extend(post_processed_output)

    # Compute metrics
    metric = MeanAveragePrecision(box_format="xyxy", class_metrics=True)
    metric.update(post_processed_predictions, post_processed_targets)
    metrics = metric.compute()

    # Replace the lists of per-class metrics with a separate metric for each class
    classes = metrics.pop("classes")
    map_per_class = metrics.pop("map_per_class")
    mar_100_per_class = metrics.pop("mar_100_per_class")
    for class_id, class_map, class_mar in zip(classes, map_per_class, mar_100_per_class):
        class_name = id2label[class_id.item()] if id2label is not None else class_id.item()
        metrics[f"map_{class_name}"] = class_map
        metrics[f"mar_100_{class_name}"] = class_mar

    metrics = {k: round(v.item(), 4) for k, v in metrics.items()}

    return metrics

eval_compute_metrics_fn = partial(
    compute_metrics, image_processor=image_processor, id2label=id2label, threshold=0.0
)

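Train a detection model¶

You have done most of the heavy lifting in the previous sections, so now you are ready to train your model! The images in this dataset are still quite large, even after resizing. This means that fine-tuning this model requires at least one GPU.

Training involves the following steps:

Load the model with AutoModelForObjectDetection using the same checkpoint as in the preprocessing.
Define your training hyperparameters in TrainingArguments.
Pass the training arguments to Trainer along with the model, dataset, image processor, and data collator.
Call train() to fine-tune your model.

When loading the model from the same checkpoint that you used for the preprocessing, remember to pass the label2id and id2label maps that you created earlier from the dataset's metadata. Additionally, we specify ignore_mismatched_sizes=True to replace the existing classification head with a new one.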

In [ ]:

from transformers import AutoModelForObjectDetection

model = AutoModelForObjectDetection.from_pretrained(
    MODEL_NAME,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
)

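In TrainingArguments, use output_dir to specify where to save your model, then configure the hyperparameters as you see fit. With num_train_epochs=30, training takes about 35 minutes on a Google Colab T4 GPU. Increase the number of epochs to get better results.

Important notes:

Do not remove unused columns, because this would drop the image column. Without the image column, you can't create pixel_values. For this reason, set remove_unused_columns to False.
Set eval_do_concat_batches=False to get proper evaluation results. Images have different numbers of target boxes, and if the batches are concatenated we can no longer determine which boxes belong to which image.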
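
The second note can be illustrated with plain lists. In this sketch (made-up data, not the Trainer's internals), two images have different numbers of boxes; once the per-image lists are concatenated, the boundaries between images are gone.

```python
# Hypothetical per-image box lists: image 0 has two boxes, image 1 has one
boxes_per_image = [
    [(10, 10, 50, 50), (60, 60, 90, 90)],
    [(20, 20, 40, 40)],
]

# Kept separate, each box can still be attributed to its image
boxes_count_per_image = [len(boxes) for boxes in boxes_per_image]
print(boxes_count_per_image)  # [2, 1]

# Concatenated, only a flat list of three boxes remains
flat_boxes = [box for boxes in boxes_per_image for box in boxes]
print(len(flat_boxes))  # 3
```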

If you wish to share your model by pushing it to the Hub, set push_to_hub to True (you must be signed in to Hugging Face to upload your model).

In [ ]:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="detr_finetuned_cppe5",
    num_train_epochs=30,
    fp16=False,
    per_device_train_batch_size=8,
    dataloader_num_workers=4,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    weight_decay=1e-4,
    max_grad_norm=0.01,
    metric_for_best_model="eval_map",
    greater_is_better=True,
    load_best_model_at_end=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    remove_unused_columns=False,
    eval_do_concat_batches=False,
    push_to_hub=True,
)
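
For intuition about lr_scheduler_type="cosine", the sketch below computes a cosine decay from the initial learning rate down to zero. It assumes no warmup steps and a single half-cosine cycle; the function name cosine_lr and the total step count are illustrative, and this is not the Trainer's own scheduler code.

```python
import math


def cosine_lr(step, total_steps, base_lr=5e-5):
    """Cosine decay from base_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    progress = min(step / total_steps, 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # illustrative value, not taken from the training run
for step in (0, 500, 1000):
    print(step, cosine_lr(step, total_steps))
```

The rate starts at learning_rate, reaches half of it at the midpoint, and approaches zero at the end of training.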

Finally, bring everything together, and call train():

In [ ]:

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=cppe5["train"],
    eval_dataset=cppe5["validation"],
    processing_class=image_processor,
    data_collator=collate_fn,
    compute_metrics=eval_compute_metrics_fn,
)

trainer.train()

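If you set push_to_hub to True in training_args, the training checkpoints are pushed to the Hugging Face Hub. Upon training completion, push the final model to the Hub as well by calling the push_to_hub() method.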

In [ ]:

trainer.push_to_hub()

Evaluate¶

In [ ]:

from pprint import pprint

metrics = trainer.evaluate(eval_dataset=cppe5["test"], metric_key_prefix="test")
pprint(metrics)

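These results can be further improved by adjusting the hyperparameters in TrainingArguments. Give it a go!

Inference¶

Now that you have fine-tuned a model, evaluated it, and uploaded it to the Hugging Face Hub, you can use it for inference.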

In [ ]:

import torch
import requests

from PIL import Image, ImageDraw
from transformers import AutoImageProcessor, AutoModelForObjectDetection

url = "https://images.pexels.com/photos/8413299/pexels-photo-8413299.jpeg?auto=compress&cs=tinysrgb&w=630&h=375&dpr=2"
image = Image.open(requests.get(url, stream=True).raw)

Load the model and image processor from the Hugging Face Hub (skip this step to use the model already trained in this session):

In [ ]:

device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available
model_repo = "qubvel-hf/detr_finetuned_cppe5"

image_processor = AutoImageProcessor.from_pretrained(model_repo)
model = AutoModelForObjectDetection.from_pretrained(model_repo)
model = model.to(device)

Detect bounding boxes:

In [ ]:

with torch.no_grad():
    inputs = image_processor(images=[image], return_tensors="pt")
    outputs = model(**inputs.to(device))
    target_sizes = torch.tensor([[image.size[1], image.size[0]]])
    results = image_processor.post_process_object_detection(outputs, threshold=0.3, target_sizes=target_sizes)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
    )
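
The threshold argument keeps only detections whose confidence exceeds the given value. In plain Python, with made-up scores and labels, that filtering step looks roughly like the sketch below. It is an illustration rather than the image processor's code, and filter_detections is a hypothetical name.

```python
def filter_detections(scores, labels, boxes, threshold=0.3):
    """Keep detections whose score is strictly above the threshold."""
    keep = [i for i, score in enumerate(scores) if score > threshold]
    return {
        "scores": [scores[i] for i in keep],
        "labels": [labels[i] for i in keep],
        "boxes": [boxes[i] for i in keep],
    }


# Hypothetical raw detections
scores = [0.92, 0.25, 0.41]
labels = ["Mask", "Gloves", "Coverall"]
boxes = [(10, 10, 50, 50), (5, 5, 20, 20), (30, 40, 200, 300)]

kept = filter_detections(scores, labels, boxes)
print(kept["labels"])  # ['Mask', 'Coverall']
```

Raising the threshold trades recall for precision: fewer boxes are drawn, but the ones that remain are more likely to be correct.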

Let's plot the result:

In [ ]:

draw = ImageDraw.Draw(image)

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    x, y, x2, y2 = tuple(box)
    draw.rectangle((x, y, x2, y2), outline="red", width=1)
    draw.text((x, y), model.config.id2label[label.item()], fill="white")

image
