YOLO Dataset Creation: Annotation Tools and Format Conversion

May 11, 2026 · AI Tools · YOLO, datasets, data annotation, data augmentation · AI Engineering Practice series · 4,921 characters · 10 min read

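Using Data Annotation Tools

LabelImg: Installation and Usage

bash
# Install
pip install labelImg

# Launch
labelImg

Annotation workflow:

Open Dir → select the image folder
Change Save Dir → select the folder where annotations are saved
Switch the save format to YOLO
Create RectBox → draw a box around the object → enter the class name
Save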

LabelMe: Installation and Usage

bash
pip install labelme
labelme

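CVAT: Self-Hosted Annotation Platform

CVAT (Computer Vision Annotation Tool) is an open-source annotation platform originally developed by Intel. It can be self-hosted with Docker and suits team collaboration and large annotation projects.

bash
# Deploy with Docker
git clone https://github.com/opencv/cvat
cd cvat
docker compose up -d

Annotation workflow:

Create a project → define the label list (person, car, dog…)
Upload images or video frame sequences
Create a task → assign annotators
Annotate with rectangles, polygons, keypoints, and other tools
Review → export in YOLO format

Exporting to YOLO: CVAT can export in the YOLO 1.1 format, which packages the label files together with obj.names, obj.data, and train.txt. A data.yaml file comes with CVAT's Ultralytics YOLO export formats in recent versions; otherwise, write it by hand following the data.yaml section of this article.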

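Roboflow: Online Annotation Platform

Roboflow is a cloud-based platform for managing the whole dataset lifecycle. It needs no local deployment and covers the toolchain from annotation to model deployment.

Core features:

Online annotation (bounding boxes, polygons, segmentation masks, keypoints, and more)
Dataset versioning (each change generates a new version)
Built-in preprocessing (automatic resizing, normalization, and so on)
Built-in augmentation (rotation, flipping, noise, mosaic, and so on)
One-click export to YOLO, COCO, Pascal VOC, and other formats

Steps to export in YOLO format:

Create a project → upload images
Annotate online or import existing annotations (COCO, VOC, and YOLO formats are supported)
Click Generate → choose preprocessing and augmentation options
Export → choose the YOLO v5/v8 PyTorch format
Download the ZIP archive and use it directly for training

python
# Download a dataset directly to local disk through the Roboflow API
# pip install roboflow
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("workspace-name").project("project-name")
dataset = project.version(1).download("yolov8")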

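Label Studio: Multi-Task Annotation

Label Studio is an open-source annotation platform that supports images, text, audio, time series, and other data types.

bash
# Install
pip install label-studio

# Launch
label-studio

# Docker deployment (recommended for production)
docker run -it -p 8080:8080 -v $(pwd)/data:/label-studio/data heartexlabs/label-studio:latest

Configuring a YOLO annotation project:

Create a project → choose the Object Detection with Bounding Boxes template
Define the label list (Labeling Setup → add label names)
Import images → start annotating
When annotation is finished, export → choose the YOLO format

Advanced features:

Multi-user collaboration and annotation-consistency checks
ML-assisted labeling for automatic pre-annotation
Custom labeling interfaces and templates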

Standard Dataset Format

YOLO Directory Layout

Plain
my_dataset/
├── images/
│   ├── train/          # training images
│   ├── val/            # validation images
│   └── test/           # test images (optional)
├── labels/
│   ├── train/          # training labels
│   ├── val/            # validation labels
│   └── test/           # test labels
└── data.yaml           # dataset configuration file

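YOLO Label Format

Each .txt label file holds one line per object:

Plain
<class_id> <x_center> <y_center> <width> <height>

All coordinates are normalized to the range 0 to 1.
x_center, y_center: the box center, divided by the image width and height respectively.
width, height: the box width and height, divided by the image width and height respectively.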
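As a worked example of the normalization described above, the following minimal sketch converts a pixel-space corner box into a YOLO label line and back. The function names are illustrative and not part of any library.

```python
def pixel_box_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a pixel-space corner box to normalized YOLO (xc, yc, w, h)."""
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height


def yolo_to_pixel_box(x_center, y_center, width, height, img_w, img_h):
    """Inverse conversion: normalized YOLO box back to pixel corners."""
    xmin = (x_center - width / 2.0) * img_w
    ymin = (y_center - height / 2.0) * img_h
    xmax = (x_center + width / 2.0) * img_w
    ymax = (y_center + height / 2.0) * img_h
    return xmin, ymin, xmax, ymax


def format_yolo_line(class_id, box):
    """Format one label line with six decimals per coordinate."""
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in box)


# A 200x100 px box with its top-left corner at (100, 50) in a 640x480 image
box = pixel_box_to_yolo(100, 50, 300, 150, img_w=640, img_h=480)
print(format_yolo_line(1, box))  # 1 0.312500 0.208333 0.312500 0.208333
```

The inverse function is handy when drawing labels back onto an image to check them visually.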

The data.yaml Configuration File

yaml
# Dataset root path (absolute or relative)
path: ../datasets/my_dataset

# Train/val/test image paths (relative to path)
train: images/train
val: images/val
test: images/test  # optional

# Number of classes
nc: 3

# Class names
names:
  0: person
  1: car
  2: dog

Format Conversion Tools

VOC → YOLO Conversion

python
import xml.etree.ElementTree as ET

def voc_to_yolo(xml_path, img_w, img_h, class_names):
    """Convert one Pascal VOC XML file into a list of YOLO label lines.

    class_names is the ordered class list; a name's index becomes its class_id.
    """
    tree = ET.parse(xml_path)
    root = tree.getroot()

    yolo_lines = []
    for obj in root.iter('object'):
        cls = obj.find('name').text
        if cls not in class_names:
            continue  # skip classes that are not in the label list
        cls_id = class_names.index(cls)

        xmlbox = obj.find('bndbox')
        xmin = float(xmlbox.find('xmin').text)
        ymin = float(xmlbox.find('ymin').text)
        xmax = float(xmlbox.find('xmax').text)
        ymax = float(xmlbox.find('ymax').text)

        # Convert pixel corners to normalized center/size coordinates
        x_center = (xmin + xmax) / 2.0 / img_w
        y_center = (ymin + ymax) / 2.0 / img_h
        width = (xmax - xmin) / img_w
        height = (ymax - ymin) / img_h

        yolo_lines.append(f"{cls_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}")

    return yolo_lines
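Pascal VOC XML files normally record the image size in a `<size>` element, so the width and height do not have to be passed in by hand. The sketch below is one possible way to convert a whole directory under that assumption; `voc_file_to_yolo_lines`, `convert_voc_dir`, and the `class_names` list are illustrative names, not part of any library.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def voc_file_to_yolo_lines(xml_path, class_names):
    """Convert one VOC XML file, reading the image size from its <size> element."""
    root = ET.parse(xml_path).getroot()
    size = root.find("size")
    img_w = float(size.find("width").text)
    img_h = float(size.find("height").text)

    lines = []
    for obj in root.iter("object"):
        name = obj.find("name").text.strip()
        if name not in class_names:
            continue  # skip classes that are not in the target label list
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (float(box.find(tag).text)
                                  for tag in ("xmin", "ymin", "xmax", "ymax"))
        xc = (xmin + xmax) / 2.0 / img_w
        yc = (ymin + ymax) / 2.0 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        lines.append(f"{class_names.index(name)} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    return lines


def convert_voc_dir(xml_dir, output_dir, class_names):
    """Write one YOLO .txt per VOC .xml file; returns the number of files written."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for xml_path in sorted(Path(xml_dir).glob("*.xml")):
        lines = voc_file_to_yolo_lines(xml_path, class_names)
        text = "\n".join(lines) + ("\n" if lines else "")
        (out / f"{xml_path.stem}.txt").write_text(text)
        count += 1
    return count
```

An XML file without any matching object yields an empty .txt file, which YOLO-style trainers generally treat as a background image.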

COCO → YOLO Conversion

COCO stores all annotations in a single JSON file, and each image can hold multiple object instances and classes. Converting to YOLO means splitting that JSON into one .txt file per image.

python
import json
import os

def coco_to_yolo(coco_json_path, output_dir, images_dir):
    """
    Convert COCO JSON annotations to YOLO .txt label files.

    Args:
        coco_json_path: path to the COCO annotation JSON file
        output_dir: output directory for the YOLO labels
        images_dir: image directory (used to check that each image exists)
    """
    with open(coco_json_path, 'r', encoding='utf-8') as f:
        coco = json.load(f)

    os.makedirs(output_dir, exist_ok=True)

    # Map COCO category IDs to contiguous zero-based indices.
    # Original COCO category IDs are not contiguous (e.g. 1: person, 3: car...)
    categories = {cat['id']: idx for idx, cat in enumerate(coco['categories'])}
    print(f"Category mapping: {categories}")

    # Index image metadata by image ID
    images_info = {}
    for img in coco['images']:
        images_info[img['id']] = {
            'file_name': img['file_name'],
            'width': img['width'],
            'height': img['height']
        }

    # Group annotations by image_id
    annotations_by_image = {}
    for ann in coco['annotations']:
        img_id = ann['image_id']
        if img_id not in annotations_by_image:
            annotations_by_image[img_id] = []
        annotations_by_image[img_id].append(ann)

    # Convert image by image
    converted = 0
    for img_id, anns in annotations_by_image.items():
        img_info = images_info[img_id]
        img_w, img_h = img_info['width'], img_info['height']

        # Skip annotations whose image file is missing from images_dir
        if not os.path.exists(os.path.join(images_dir, img_info['file_name'])):
            print(f"Warning: image not found, skipped: {img_info['file_name']}")
            continue

        base_name = os.path.splitext(os.path.basename(img_info['file_name']))[0]
        txt_path = os.path.join(output_dir, f"{base_name}.txt")

        with open(txt_path, 'w') as f:
            for ann in anns:
                cls_id = categories.get(ann['category_id'], -1)
                if cls_id == -1:
                    continue  # skip categories without a mapping

                # COCO bbox: [x, y, width, height] (top-left corner plus size)
                bbox = ann['bbox']
                x, y, w, h = bbox

                # Convert to normalized YOLO center coordinates
                x_center = (x + w / 2) / img_w
                y_center = (y + h / 2) / img_h
                width = w / img_w
                height = h / img_h

                f.write(f"{cls_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}\n")
        converted += 1

    print(f"Conversion finished: {converted} images processed")


# Usage example
coco_to_yolo(
    coco_json_path="annotations/instances_train2017.json",
    output_dir="labels/train",
    images_dir="images/train"
)

Dataset Splitting Tool

Randomly split images and label files into training, validation, and test sets by ratio.

python
import os
import shutil
from sklearn.model_selection import train_test_split

def split_dataset(image_dir, label_dir, output_dir,
                  train_ratio=0.7, val_ratio=0.15,
                  test_ratio=0.15, random_seed=42):
    """
    Split a dataset by ratio.

    Args:
        image_dir: source image directory
        label_dir: source label directory
        output_dir: output root directory
        train_ratio: share of the training set
        val_ratio: share of the validation set
        test_ratio: share of the test set
        random_seed: random seed (for reproducibility)
    """
    # Collect all image files; sorting makes the split independent of
    # the arbitrary order returned by os.listdir
    images = sorted(f for f in os.listdir(image_dir)
                    if f.lower().endswith(('.jpg', '.jpeg', '.png')))

    if test_ratio > 0:
        # Split off the test set first, then take the validation set from the rest
        train_val, test = train_test_split(
            images, test_size=test_ratio, random_state=random_seed)
        val_ratio_adj = val_ratio / (train_ratio + val_ratio)
        train, val = train_test_split(
            train_val, test_size=val_ratio_adj, random_state=random_seed)
    else:
        train, val = train_test_split(
            images, test_size=val_ratio/(train_ratio+val_ratio),
            random_state=random_seed)
        test = []

    splits = {'train': train, 'val': val, 'test': test}

    # Copy files into the matching directories
    for split_name, split_images in splits.items():
        os.makedirs(f"{output_dir}/images/{split_name}", exist_ok=True)
        os.makedirs(f"{output_dir}/labels/{split_name}", exist_ok=True)

        for img_file in split_images:
            # Copy the image
            shutil.copy2(
                f"{image_dir}/{img_file}",
                f"{output_dir}/images/{split_name}/{img_file}")
            # Copy the matching label file, if there is one
            base = os.path.splitext(img_file)[0]
            label_file = f"{base}.txt"
            if os.path.exists(f"{label_dir}/{label_file}"):
                shutil.copy2(
                    f"{label_dir}/{label_file}",
                    f"{output_dir}/labels/{split_name}/{label_file}")

    print("Dataset split finished!")
    print(f"   Train: {len(train)} images")
    print(f"   Val:   {len(val)} images")
    print(f"   Test:  {len(test)} images")


# Usage example
split_dataset(
    image_dir="raw_images",
    label_dir="raw_labels",
    output_dir="datasets/my_dataset",
    train_ratio=0.7,
    val_ratio=0.15,
    test_ratio=0.15,
    random_seed=42
)

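Stratified split: when the class distribution is heavily imbalanced, a plain random split can leave a subset without any samples of some class. Stratified sampling splits in proportion to the classes:

python
import os
import shutil

from sklearn.model_selection import StratifiedShuffleSplit

def stratified_split(image_dir, label_dir, output_dir,
                     test_size=0.2, random_seed=42):
    """Split into train/val while preserving class proportions."""
    images, labels = [], []

    for f in sorted(os.listdir(image_dir)):
        if not f.lower().endswith(('.jpg', '.jpeg', '.png')):
            continue
        base = os.path.splitext(f)[0]
        txt_path = f"{label_dir}/{base}.txt"
        if not os.path.exists(txt_path):
            continue

        # Read the classes present in this image
        with open(txt_path) as fh:
            classes = [int(line.split()[0]) for line in fh if line.strip()]

        if classes:
            images.append(f)
            labels.append(classes[0])  # stratify on the first class in the file

    # Note: StratifiedShuffleSplit needs at least two images per stratum
    sss = StratifiedShuffleSplit(
        n_splits=1, test_size=test_size, random_state=random_seed)

    train_idx, val_idx = next(sss.split(images, labels))
    train = [images[i] for i in train_idx]
    val = [images[i] for i in val_idx]

    # Copy images and labels (same logic as split_dataset above)
    for split_name, split_images in {'train': train, 'val': val}.items():
        os.makedirs(f"{output_dir}/images/{split_name}", exist_ok=True)
        os.makedirs(f"{output_dir}/labels/{split_name}", exist_ok=True)
        for img_file in split_images:
            base = os.path.splitext(img_file)[0]
            shutil.copy2(f"{image_dir}/{img_file}",
                         f"{output_dir}/images/{split_name}/{img_file}")
            shutil.copy2(f"{label_dir}/{base}.txt",
                         f"{output_dir}/labels/{split_name}/{base}.txt")

    print(f"Stratified split finished: {len(train)} train images, {len(val)} val images")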

Class Balance Analysis

Analyze how samples are distributed across classes, detect class imbalance, and take corrective measures.

python
import os
from collections import Counter

import matplotlib.pyplot as plt

def analyze_class_balance(label_dir, class_names=None):
    """
    Analyze the class distribution of a dataset.

    Args:
        label_dir: directory of label files
        class_names: list of class names (optional)

    Returns:
        class_counts: instance count per class (a Counter)
    """
    class_counts = Counter()
    label_files = [f for f in os.listdir(label_dir) if f.endswith('.txt')]

    for f in label_files:
        with open(os.path.join(label_dir, f), 'r') as fh:
            for line in fh:
                if line.strip():
                    cls_id = int(line.split()[0])
                    class_counts[cls_id] += 1

    if not class_counts:
        print("No annotations found.")
        return class_counts

    # Default names cover every class id up to the largest one seen,
    # so non-contiguous ids do not raise an IndexError
    if class_names is None:
        class_names = [f"class_{i}" for i in range(max(class_counts) + 1)]

    ranked = class_counts.most_common()
    names = [class_names[cid] if cid < len(class_names) else f"class_{cid}"
             for cid, _ in ranked]
    counts = [cnt for _, cnt in ranked]

    # Plot the distribution
    plt.figure(figsize=(10, 5))
    bars = plt.bar(range(len(names)), counts, color='steelblue')
    plt.xticks(range(len(names)), names, rotation=45)
    plt.ylabel('Instance Count')
    plt.title('Dataset Class Distribution')

    # Print the exact count above each bar
    for bar, count in zip(bars, counts):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(counts)*0.01,
                 str(count), ha='center', va='bottom')

    plt.tight_layout()
    plt.savefig('class_distribution.png', dpi=150)
    plt.show()

    return class_counts


# Usage example
class_names = ['person', 'car', 'dog']
counts = analyze_class_balance('labels/train', class_names)
print(f"Class distribution: {dict(counts)}")
total_instances = sum(counts.values())
print(f"Total instances: {total_instances}")

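Ways to handle class imbalance:

Method	Description	When to use
Oversampling	Duplicate minority-class samples or apply light augmentation to them	Majority classes already have enough samples
Undersampling	Randomly drop majority-class samples	The dataset is large and the majority classes are redundant
Class weights	Give minority classes a higher weight in the loss function	Set through the `cls_pw` hyperparameter in YOLO
Data augmentation	Apply more augmentation transforms to minority classes	General-purpose; recommended as the first thing to try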

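python
# Class weighting during YOLO training (model is an already loaded YOLO model).
# Caution: in YOLOv5, cls_pw is a single float in the hyperparameter YAML
# (the positive weight of the classification BCE loss), not a per-class list.
# Check the configuration reference of your framework version before passing it.
model.train(
    data="data.yaml",
    cls_pw=[1.0, 2.0, 5.0],  # intended per-class loss weights (highest for the third class)
)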
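The oversampling row of the table above can be sketched without any training framework. This is a minimal illustration, assuming the class ids of every image have already been read from its label file; `oversample_image_list` and `max_repeat` are hypothetical names, and the repeat rule (largest class count divided by the count of the rarest class in the image) is one simple heuristic among many.

```python
from collections import Counter


def oversample_image_list(image_classes, max_repeat=5):
    """
    image_classes: dict mapping image name -> list of class ids present in it.
    Returns a training list in which images holding rare classes are repeated.
    """
    # Count instances per class over the whole set
    counts = Counter(c for classes in image_classes.values() for c in classes)
    if not counts:
        return list(image_classes)
    largest = max(counts.values())

    oversampled = []
    for name, classes in image_classes.items():
        if not classes:
            oversampled.append(name)  # background image: keep once
            continue
        # The repeat factor is driven by the rarest class in the image
        rarest = min(counts[c] for c in classes)
        repeat = min(max_repeat, max(1, round(largest / rarest)))
        oversampled.extend([name] * repeat)
    return oversampled


image_classes = {
    "a.jpg": [0, 0, 0],
    "b.jpg": [0, 0, 0],
    "c.jpg": [0, 1],
    "d.jpg": [2],
}
print(Counter(oversample_image_list(image_classes)))
```

Repeating an image also repeats any majority-class objects it contains, so it is worth rerunning the class balance analysis on the oversampled list.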

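Data Augmentation Strategies

Built-in Augmentation (Ultralytics)

python
# model is an already loaded Ultralytics YOLO model
model.train(
    data="data.yaml",
    # Basic augmentation
    hsv_h=0.015,      # hue jitter
    hsv_s=0.7,        # saturation jitter
    hsv_v=0.4,        # value (brightness) jitter
    degrees=0.0,      # rotation range in degrees
    translate=0.1,    # translation
    scale=0.5,        # scaling
    shear=0.0,        # shear
    perspective=0.0,  # perspective transform
    flipud=0.0,       # probability of a vertical flip
    fliplr=0.5,       # probability of a horizontal flip
    # Advanced augmentation
    mosaic=1.0,       # Mosaic
    mixup=0.0,        # Mixup
    copy_paste=0.0,   # Copy-Paste
)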

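Custom Augmentation (Albumentations)

python
import albumentations as A

# bbox_params makes the boxes follow geometric transforms such as the flip;
# format='yolo' expects normalized (x_center, y_center, width, height) boxes
transform = A.Compose([
    A.RandomBrightnessContrast(p=0.5),
    A.GaussianBlur(p=0.3),
    A.GaussNoise(p=0.3),
    A.HorizontalFlip(p=0.5),
], bbox_params=A.BboxParams(format='yolo', label_fields=['class_labels']))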

Mosaic Augmentation in Detail

Mosaic augmentation, introduced with YOLOv4, stitches four images into one training image. Objects appear at smaller scales and in more varied contexts, which tends to help small-object detection.

python
import cv2
import numpy as np

def mosaic_augmentation(image1, image2, image3, image4, img_size=640):
    """Sketch of the core stitching logic behind Mosaic augmentation."""
    h, w = img_size, img_size
    mid_x = w // 2
    mid_y = h // 2

    images = [image1, image2, image3, image4]

    # Create the output canvas
    canvas = np.zeros((h, w, 3), dtype=np.uint8)

    # Resize each image to fill its quadrant; cv2.resize takes (width, height)
    canvas[:mid_y, :mid_x] = cv2.resize(images[0], (mid_x, mid_y))          # top-left
    canvas[:mid_y, mid_x:] = cv2.resize(images[1], (w-mid_x, mid_y))        # top-right
    canvas[mid_y:, :mid_x] = cv2.resize(images[2], (mid_x, h-mid_y))        # bottom-left
    canvas[mid_y:, mid_x:] = cv2.resize(images[3], (w-mid_x, h-mid_y))      # bottom-right

    # Real implementations pick the stitching point at random for every sample
    return canvas

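How it works:

Four images are picked at random for each training sample
A stitching center is chosen at random (rather than fixed at the image center)
The four images are stitched around that center into one large image
All bounding-box coordinates are recomputed relative to the stitched image
Boxes that extend past the border of the stitched image are clipped or dropped

Effect: every training image carries the content of four images, so batch-normalization statistics are computed over more varied content and small objects appear in more diverse contexts.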
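The coordinate recomputation step listed above can be sketched in pure Python. This version assumes the simplified layout used in this article's stitching example, where each image is resized to fill its whole quadrant, so boxes are only scaled and shifted and never cropped; production implementations usually crop instead. `remap_boxes_for_mosaic` is an illustrative name.

```python
def remap_boxes_for_mosaic(boxes_per_image, center=(0.5, 0.5)):
    """
    boxes_per_image: four lists (top-left, top-right, bottom-left, bottom-right),
    each holding (class_id, xc, yc, w, h) tuples normalized to its own image.
    center: normalized (cx, cy) of the stitching point on the canvas.
    Assumes each image is resized to fill its whole quadrant.
    """
    cx, cy = center
    # (x offset, y offset, width fraction, height fraction) of each quadrant
    quadrants = [
        (0.0, 0.0, cx, cy),            # top-left
        (cx, 0.0, 1.0 - cx, cy),       # top-right
        (0.0, cy, cx, 1.0 - cy),       # bottom-left
        (cx, cy, 1.0 - cx, 1.0 - cy),  # bottom-right
    ]
    merged = []
    for boxes, (ox, oy, sx, sy) in zip(boxes_per_image, quadrants):
        for class_id, xc, yc, w, h in boxes:
            # Scale into the quadrant, then shift by the quadrant offset
            merged.append((class_id, ox + xc * sx, oy + yc * sy, w * sx, h * sy))
    return merged


boxes = [[(0, 0.5, 0.5, 0.4, 0.4)], [], [], [(1, 0.5, 0.5, 1.0, 1.0)]]
for box in remap_boxes_for_mosaic(boxes):
    print(box)
```

With the default center, a box covering the whole bottom-right image ends up centered at (0.75, 0.75) with half the canvas width and height.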

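Mixup Augmentation in Detail

Mixup comes from image classification and was later adopted for object detection. The core idea is to blend two images in a given proportion.

python
import numpy as np

def mixup_augmentation(image1, image2, alpha=0.5):
    """Sketch of the core blending logic behind Mixup; both images must have the same shape."""
    # Sample the mixing coefficient from a Beta distribution
    lam = np.random.beta(alpha, alpha)

    # Blend the two images pixel by pixel (computed in float, then cast back)
    mixed = lam * image1.astype(np.float32) + (1 - lam) * image2.astype(np.float32)

    # Bounding boxes: the boxes of both images are kept,
    # contributing to the loss with weights lam and (1 - lam) respectively
    return mixed.astype(np.uint8)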

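How it works:

Two images and their labels are picked at random from the current batch
The mixing coefficient lam is sampled from Beta(alpha, alpha) (typically alpha = 0.5 to 1.0)
Pixel-level blending: mixed_img = lam * img1 + (1-lam) * img2
The boxes of both images are kept and enter the loss with weights lam and 1-lam
Label-smoothing effect: the model sees blended images, which acts as implicit regularization against overfitting

Note: Mixup slows training slightly (twice as many boxes to process) but is effective at reducing overfitting.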
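The label side of Mixup, as described in the steps above, can be sketched separately from the pixel blending. This is a minimal illustration in which each box is tagged with its loss weight; `mixup_labels` and the `rng` parameter are illustrative names, and implementations differ on whether such weights are actually applied in the loss or the boxes are simply concatenated.

```python
import random


def mixup_labels(boxes1, boxes2, alpha=0.5, rng=random):
    """
    Merge the labels of two images for Mixup.
    boxes1, boxes2: lists of (class_id, xc, yc, w, h) tuples.
    Returns (lam, merged), where every merged box carries its loss weight
    as a sixth element: lam for the first image, 1 - lam for the second.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient from Beta(alpha, alpha)
    merged = [(*box, lam) for box in boxes1]
    merged += [(*box, 1.0 - lam) for box in boxes2]
    return lam, merged


lam, merged = mixup_labels(
    [(0, 0.5, 0.5, 0.2, 0.2)],
    [(1, 0.3, 0.3, 0.1, 0.1), (2, 0.7, 0.7, 0.1, 0.1)],
    rng=random.Random(0),  # seeded generator for a repeatable example
)
print(f"lam={lam:.3f}, boxes={len(merged)}")
```

The same lam must be used for blending the pixels, so in practice the coefficient is sampled once and shared by both steps.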

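Copy-Paste Augmentation in Detail

Copy-Paste augmentation copies an object from one image and pastes it onto another, which increases both the number of object instances and the variety of backgrounds they appear on.

python
import numpy as np

def copy_paste_augmentation(source_img, target_img, source_bbox):
    """Sketch of the core Copy-Paste step: crop a box from the source and paste it onto the target."""
    x1, y1, x2, y2 = map(int, source_bbox)
    obj = source_img[y1:y2, x1:x2].copy()

    h, w = target_img.shape[:2]
    obj_h, obj_w = obj.shape[:2]
    if obj_h > h or obj_w > w:
        raise ValueError("object is larger than the target image")

    # Pick a random paste position; the +1 keeps randint valid when the
    # object is exactly as wide or as tall as the target image
    paste_x = np.random.randint(0, w - obj_w + 1)
    paste_y = np.random.randint(0, h - obj_h + 1)

    # Paste onto a copy so the caller's target image is left untouched
    result = target_img.copy()
    result[paste_y:paste_y+obj_h, paste_x:paste_x+obj_w] = obj

    # New bounding box in pixel coordinates (x1, y1, x2, y2)
    new_bbox = [paste_x, paste_y, paste_x+obj_w, paste_y+obj_h]
    return result, new_bbox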

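How it works:

Two images are picked at random from the dataset: a source image (which provides the object) and a target image
One object instance is picked at random from the source image
If a segmentation mask is available, it is used to extract the exact outline; otherwise the bounding box is cropped
The object is slightly scaled and rotated
The object is pasted onto the target image, optionally after a random flip
If the paste position overlaps other boxes, the paste can be skipped or blended with transparency

When to use: small-object detection and augmentation of rare classes. Avoid overusing it, because pasted objects cover background content.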
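The overlap handling in the last step above can be sketched as a retry loop around an IoU test. This is a minimal illustration under simple assumptions: boxes are pixel-space (x1, y1, x2, y2) tuples, and the paste is skipped after a fixed number of failed attempts. `find_paste_position`, `max_iou`, and `max_tries` are illustrative names and values.

```python
import random


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) pixel boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def find_paste_position(obj_w, obj_h, img_w, img_h, existing_boxes,
                        max_iou=0.1, max_tries=20, rng=random):
    """
    Try random top-left corners until the pasted box overlaps every existing
    box by at most max_iou. Returns (x1, y1, x2, y2), or None to skip the paste.
    """
    if obj_w > img_w or obj_h > img_h:
        return None  # the object does not fit into the target image
    for _ in range(max_tries):
        x1 = rng.randint(0, img_w - obj_w)  # randint is inclusive on both ends
        y1 = rng.randint(0, img_h - obj_h)
        candidate = (x1, y1, x1 + obj_w, y1 + obj_h)
        if all(iou(candidate, box) <= max_iou for box in existing_boxes):
            return candidate
    return None


print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / 150, about 0.333
print(find_paste_position(100, 100, 100, 100, [(0, 0, 100, 100)]))  # None
```

Returning None instead of forcing a position keeps crowded target images from being covered by pasted objects.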

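Reference augmentation settings:

python
# model is an already loaded Ultralytics YOLO model
model.train(
    # Mosaic
    mosaic=1.0,           # 1.0 = apply Mosaic to every training image

    # Mixup
    mixup=0.1,            # probability of applying Mixup

    # Copy-Paste
    copy_paste=0.1,       # probability of applying Copy-Paste

    # The three settings below are not documented Ultralytics training arguments
    # (paste_in comes from YOLOv7 hyperparameter files), and unknown arguments
    # are rejected, so verify them for your framework before enabling them.
    # mosaic_center=0.5,  # random range of the stitching center
    # mixup_alpha=0.5,    # alpha parameter of the Beta distribution
    # paste_in=0.15,      # probability of pasting objects onto a background
)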

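Data Quality Checklist

Check dataset quality carefully before training to avoid wasting training time.

1. Annotation consistency

Are box sizes and aspect ratios consistent for objects of the same class?
Is every object annotated (nothing missed)?
Are class names spelled and capitalized consistently?

2. Annotation bounds

Are all coordinates within 0 to 1 (the normalized YOLO format)?
Are there invalid boxes with zero width or height?
Do boxes extend past the image border? A few slight overflows are acceptable; many of them point to annotation errors.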
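A box can have its center inside the image and still extend past the border, so the bounds checks above are best done on the box edges. The following minimal sketch measures the overflow per side; the function names and the `tolerance` default are illustrative.

```python
def box_edge_overflow(xc, yc, w, h):
    """
    Return how far a normalized YOLO box extends past the image on each side
    as (left, top, right, bottom); 0.0 means that edge is inside the image.
    """
    left = max(0.0, -(xc - w / 2.0))
    top = max(0.0, -(yc - h / 2.0))
    right = max(0.0, (xc + w / 2.0) - 1.0)
    bottom = max(0.0, (yc + h / 2.0) - 1.0)
    return left, top, right, bottom


def is_valid_box(xc, yc, w, h, tolerance=1e-3):
    """Valid means positive size and no edge overflowing by more than tolerance."""
    if w <= 0 or h <= 0:
        return False
    return max(box_edge_overflow(xc, yc, w, h)) <= tolerance


print(is_valid_box(0.5, 0.5, 1.0, 1.0))    # True: the box exactly fills the image
print(is_valid_box(0.95, 0.5, 0.2, 0.2))   # False: the right edge is at 1.05
```

The small tolerance absorbs rounding from six-decimal label files; larger overflows are worth inspecting by hand.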

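3. Image quality

Are there corrupted image files or files that cannot be opened?
Do all images have the same number of channels (all RGB)?
Do image resolutions vary too widely? A long side of at most 1920 px is recommended.

4. Class checks

Does the class count nc in data.yaml match the actual label files?
Is every class_id in the .txt files within 0 to nc-1?
Are there empty label files (background images)? Is that intended?

5. Dataset balance

Are instance counts heavily imbalanced across classes?
Is the class distribution consistent across the train, validation, and test sets?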
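One way to answer the last question is to compare the share of each class between two splits. The sketch below reads YOLO label directories and reports the largest difference; `class_fractions` and `max_split_drift` are illustrative names, and what counts as too much drift is a judgment call for the dataset at hand.

```python
import os
from collections import Counter


def class_fractions(label_dir):
    """Return {class_id: share of all instances} for the YOLO .txt files in label_dir."""
    counts = Counter()
    for name in os.listdir(label_dir):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(label_dir, name)) as fh:
            for line in fh:
                if line.strip():
                    counts[int(line.split()[0])] += 1
    total = sum(counts.values())
    return {cid: n / total for cid, n in counts.items()} if total else {}


def max_split_drift(train_dir, val_dir):
    """Largest absolute difference in class share between two splits."""
    train, val = class_fractions(train_dir), class_fractions(val_dir)
    classes = set(train) | set(val)
    return max((abs(train.get(c, 0.0) - val.get(c, 0.0)) for c in classes),
               default=0.0)
```

A class that is present in one split and missing from the other shows up as a drift equal to its full share.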

6. Automated check script

python
import os

def validate_dataset(image_dir, label_dir, num_classes):
    """Automated dataset quality check."""
    issues = []

    for f in os.listdir(label_dir):
        if not f.endswith('.txt'):
            continue

        filepath = os.path.join(label_dir, f)
        with open(filepath, 'r') as fh:
            for line_num, line in enumerate(fh, 1):
                parts = line.strip().split()
                if not parts:
                    continue  # ignore blank lines
                if len(parts) != 5:
                    issues.append(f"{f}:{line_num} format error: expected 5 fields, got {len(parts)}")
                    continue

                try:
                    cls_id = int(parts[0])
                    xc, yc, w, h = map(float, parts[1:])
                except ValueError:
                    issues.append(f"{f}:{line_num} format error: non-numeric field")
                    continue

                if cls_id < 0 or cls_id >= num_classes:
                    issues.append(f"{f}:{line_num} class_id {cls_id} outside range 0~{num_classes-1}")
                if w <= 0 or h <= 0:
                    issues.append(f"{f}:{line_num} invalid box size: w={w}, h={h}")
                if xc < 0 or xc > 1 or yc < 0 or yc > 1:
                    issues.append(f"{f}:{line_num} center out of bounds: xc={xc}, yc={yc}")

        # Check that the matching image exists
        base = os.path.splitext(f)[0]
        img_exists = any(os.path.exists(f"{image_dir}/{base}{ext}")
                        for ext in ['.jpg', '.jpeg', '.png'])
        if not img_exists:
            issues.append(f"{f} has no matching image")

    if issues:
        print("Found the following issues:")
        for issue in issues:
            print(f"  - {issue}")
    else:
        print("Dataset check passed!")

    return len(issues)


# Usage example
validate_dataset(
    image_dir='datasets/my_dataset/images/train',
    label_dir='datasets/my_dataset/labels/train',
    num_classes=3
)
