YOLO Dataset Preparation: Annotation Tools and Format Conversion

Data Annotation Tools Usage

LabelImg Installation and Usage

bash
# Installation
pip install labelImg

# Launch
labelImg

Annotation Process:

  1. Open Dir → Select image folder
  2. Change Save Dir → Select annotation save folder
  3. Select YOLO format
  4. Create RectBox → Draw bounding box → Enter class name
  5. Save
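
In YOLO mode, LabelImg also writes a classes.txt file into the save folder, with one class name per line; the line index is the class_id used in the label files. A minimal sketch for reading that mapping back, assuming this layout (the helper name `load_class_names` is hypothetical):

```python
from pathlib import Path

def load_class_names(save_dir):
    """Return {class_id: name} read from classes.txt in the annotation folder."""
    text = Path(save_dir, "classes.txt").read_text(encoding="utf-8")
    # Line order defines the class_id, so keep the order exactly as written
    names = [line.strip() for line in text.splitlines() if line.strip()]
    return dict(enumerate(names))
```

The resulting mapping can be copied into the names section of data.yaml so that training uses the same IDs as the annotations.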

LabelMe Installation and Usage

bash
pip install labelme
labelme
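
LabelMe saves one JSON file per image rather than YOLO .txt files, so a conversion step is usually needed. The sketch below assumes the standard LabelMe JSON keys (`imageWidth`, `imageHeight`, `shapes`, `label`, `points`, `shape_type`) and only handles rectangle and polygon shapes; the function name and the `class_names` argument are illustrative, not part of LabelMe.

```python
import json

def labelme_to_yolo(json_path, class_names):
    """Convert one LabelMe JSON file to YOLO label lines (bounding boxes only)."""
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    img_w, img_h = data["imageWidth"], data["imageHeight"]

    lines = []
    for shape in data["shapes"]:
        if shape.get("shape_type", "polygon") not in ("rectangle", "polygon"):
            continue  # Circles, lines, and points need different handling
        if shape["label"] not in class_names:
            continue  # Skip labels that are not in the class list
        xs = [p[0] for p in shape["points"]]
        ys = [p[1] for p in shape["points"]]
        # Enclosing box of all points: exact for rectangles, a hull for polygons
        xmin, xmax, ymin, ymax = min(xs), max(xs), min(ys), max(ys)
        cls_id = class_names.index(shape["label"])
        lines.append(
            f"{cls_id} {(xmin + xmax) / 2 / img_w:.6f} {(ymin + ymax) / 2 / img_h:.6f} "
            f"{(xmax - xmin) / img_w:.6f} {(ymax - ymin) / img_h:.6f}")
    return lines
```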

CVAT Self-Hosted Annotation Platform

CVAT (Computer Vision Annotation Tool) is an open-source annotation platform originally developed by Intel. It can be self-hosted with Docker and is suited to team collaboration and large-scale annotation projects.

bash
# Docker deployment
git clone https://github.com/opencv/cvat
cd cvat
docker compose up -d

Annotation Workflow:

  1. Create a project → Define label list (person, car, dog…)
  2. Upload images or video frame sequences
  3. Create tasks → Assign annotators
  4. Annotate using rectangles, polygons, keypoints, etc.
  5. Review → Export in YOLO format

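Export to YOLO: CVAT's YOLO 1.1 export follows the Darknet layout (obj.data, obj.names, train.txt, and a folder of .txt label files) and does not include a data.yaml. Recent CVAT releases also offer Ultralytics YOLO export formats that do include one; otherwise, write data.yaml by hand as described in the data.yaml section.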

Roboflow Online Annotation Platform

Roboflow is a cloud-based full-pipeline dataset management platform that requires no local deployment, providing a complete toolchain from annotation to model deployment.

Core Features:

  • Online annotation (bounding boxes, polygons, segmentation masks, keypoints)
  • Dataset version management (new version on each modification)
  • Built-in preprocessing (auto-resize, normalization, etc.)
  • Built-in data augmentation (rotation, flip, noise, mosaic, etc.)
  • One-click export to YOLO, COCO, Pascal VOC formats

Export to YOLO Steps:

  1. Create a project → Upload images
  2. Annotate online or import existing annotations (supports COCO/VOC/YOLO format import)
  3. Click Generate → Select preprocessing and augmentation
  4. Export → Select YOLO v5/v8 PyTorch format
  5. Download ZIP, ready for training
python
# Download dataset directly via Roboflow API
# pip install roboflow
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("workspace-name").project("project-name")
dataset = project.version(1).download("yolov8")

Label Studio Multi-Task Annotation

Label Studio is an open-source multi-task annotation platform supporting images, text, audio, time series, and more.

bash
# Installation
pip install label-studio

# Launch
label-studio

# Docker deployment (recommended for production)
docker run -it -p 8080:8080 -v $(pwd)/data:/label-studio/data heartexlabs/label-studio:latest

YOLO Annotation Configuration:

  1. Create a project → Select Object Detection with Bounding Boxes template
  2. Configure labels (Labeling Setup → Add label names)
  3. Import images → Start annotating
  4. Export after completion → Select YOLO format

Advanced Features:

  • Multi-user collaboration and annotation consistency checks
  • ML-assisted labeling for auto pre-annotation
  • Customizable annotation interface and templates

Dataset Standard Format

YOLO Format Directory Structure

Plain
my_dataset/
├── images/
│   ├── train/          # Training set images
│   ├── val/            # Validation set images
│   └── test/           # Test set images (optional)
├── labels/
│   ├── train/          # Training set annotations
│   ├── val/            # Validation set annotations
│   └── test/           # Test set annotations
└── data.yaml           # Dataset configuration file

YOLO Annotation Format Specification

Each .txt annotation file format:

Plain
<class_id> <x_center> <y_center> <width> <height>
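
  • One line per object; class_id is a zero-based integer
  • All four coordinates are normalized to the 0~1 range
  • x_center, y_center: bounding box center, divided by image width and image height respectively
  • width, height: bounding box size, divided by image width and image height respectively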
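
To make the normalization concrete, here is a small sketch that converts a pixel-coordinate corner box to a YOLO line and back. The function names are illustrative and not part of any library.

```python
def pixel_box_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Pixel corner box -> normalized (x_center, y_center, width, height)."""
    return ((xmin + xmax) / 2 / img_w,
            (ymin + ymax) / 2 / img_h,
            (xmax - xmin) / img_w,
            (ymax - ymin) / img_h)

def yolo_to_pixel_box(xc, yc, w, h, img_w, img_h):
    """Normalized YOLO box -> pixel corners (xmin, ymin, xmax, ymax)."""
    return ((xc - w / 2) * img_w, (yc - h / 2) * img_h,
            (xc + w / 2) * img_w, (yc + h / 2) * img_h)

# A 200x100 px box with top-left corner (100, 50) in a 640x480 image
box = pixel_box_to_yolo(100, 50, 300, 150, img_w=640, img_h=480)
print("0 " + " ".join(f"{v:.6f}" for v in box))
# 0 0.312500 0.208333 0.312500 0.208333
```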

data.yaml Configuration File

yaml
# Dataset root path (absolute or relative)
path: ../datasets/my_dataset

# Train/validation/test set paths (relative to path)
train: images/train
val: images/val
test: images/test  # Optional

# Number of classes
nc: 3

# Class names
names:
  0: person
  1: car
  2: dog
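
Because nc and names must stay in sync, it can help to generate the file from a single class list. A minimal sketch using only string formatting (the helper name `build_data_yaml` is hypothetical, and the directory layout is assumed to match the structure shown earlier):

```python
def build_data_yaml(root, class_names, with_test=False):
    """Return data.yaml text for a dataset laid out as images/{train,val,test}."""
    lines = [f"path: {root}", "train: images/train", "val: images/val"]
    if with_test:
        lines.append("test: images/test")
    lines.append(f"nc: {len(class_names)}")  # Derived from the list, so it cannot drift
    lines.append("names:")
    lines += [f"  {i}: {name}" for i, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"

print(build_data_yaml("../datasets/my_dataset", ["person", "car", "dog"]))
```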

Format Conversion Tools

VOC → YOLO Conversion

python
import xml.etree.ElementTree as ET

def voc_to_yolo(xml_path, img_w, img_h, class_names):
    """Convert one Pascal VOC XML file into a list of YOLO label lines.

    class_names: list of class names; the list index becomes the class_id.
    """
    tree = ET.parse(xml_path)
    root = tree.getroot()

    yolo_lines = []
    for obj in root.iter('object'):
        cls = obj.find('name').text.strip()
        if cls not in class_names:
            continue  # Skip classes that are not in the label list
        cls_id = class_names.index(cls)

        xmlbox = obj.find('bndbox')
        xmin = float(xmlbox.find('xmin').text)
        ymin = float(xmlbox.find('ymin').text)
        xmax = float(xmlbox.find('xmax').text)
        ymax = float(xmlbox.find('ymax').text)

        # Convert to normalized center coordinates
        x_center = (xmin + xmax) / 2.0 / img_w
        y_center = (ymin + ymax) / 2.0 / img_h
        width = (xmax - xmin) / img_w
        height = (ymax - ymin) / img_h

        yolo_lines.append(f"{cls_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}")

    return yolo_lines
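
Pascal VOC files normally record the image dimensions in a `<size>` element, so img_w and img_h do not have to come from the image itself. A small sketch, assuming that element is present (the helper name `read_voc_size` is hypothetical):

```python
import xml.etree.ElementTree as ET

def read_voc_size(xml_path):
    """Return (width, height) from the <size> element of a VOC XML file."""
    size = ET.parse(xml_path).getroot().find("size")
    if size is None:
        raise ValueError(f"{xml_path} has no <size> element")
    return int(size.find("width").text), int(size.find("height").text)
```

The returned values can be passed as img_w and img_h to voc_to_yolo.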

COCO → YOLO Conversion

COCO format stores all annotations in a single JSON file, where each image can have multiple object instances and categories. Converting to YOLO format requires splitting the JSON into individual .txt files per image.

python
import json
import os

def coco_to_yolo(coco_json_path, output_dir, images_dir):
    """
    Convert COCO JSON annotations to YOLO txt format

    Args:
        coco_json_path: Path to COCO annotation JSON
        output_dir: Output directory for YOLO labels
        images_dir: Image directory (not used by the conversion itself;
            kept so callers can add their own file checks)
    """
    os.makedirs(output_dir, exist_ok=True)  # Create the label directory if missing

    with open(coco_json_path, 'r') as f:
        coco = json.load(f)

    # Build category ID to sequential index mapping
    # COCO category IDs are not sequential (e.g., 1: person, 3: car...)
    categories = {cat['id']: idx for idx, cat in enumerate(coco['categories'])}
    print(f"Category mapping: {categories}")

    # Build image info dictionary
    images_info = {}
    for img in coco['images']:
        images_info[img['id']] = {
            'file_name': img['file_name'],
            'width': img['width'],
            'height': img['height']
        }

    # Group annotations by image_id
    annotations_by_image = {}
    for ann in coco['annotations']:
        img_id = ann['image_id']
        if img_id not in annotations_by_image:
            annotations_by_image[img_id] = []
        annotations_by_image[img_id].append(ann)

    # Convert per-image to YOLO format
    for img_id, anns in annotations_by_image.items():
        img_info = images_info[img_id]
        img_w, img_h = img_info['width'], img_info['height']

        base_name = os.path.splitext(img_info['file_name'])[0]
        txt_path = os.path.join(output_dir, f"{base_name}.txt")

        with open(txt_path, 'w') as f:
            for ann in anns:
                cls_id = categories.get(ann['category_id'], -1)
                if cls_id == -1:
                    continue  # Skip unmapped categories

                # COCO format: [x, y, width, height] (top-left + dimensions)
                bbox = ann['bbox']
                x, y, w, h = bbox

                # Convert to YOLO normalized center coordinates
                x_center = (x + w / 2) / img_w
                y_center = (y + h / 2) / img_h
                width = w / img_w
                height = h / img_h

                f.write(f"{cls_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}\n")

    print(f"Conversion complete! Processed {len(annotations_by_image)} images")


# Usage example
coco_to_yolo(
    coco_json_path="annotations/instances_train2017.json",
    output_dir="labels/train",
    images_dir="images/train"
)
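
Since the converter renumbers COCO category IDs, the names section of data.yaml has to follow the same order. A minimal sketch that derives it with the same enumerate() order as the category mapping (the helper name is hypothetical):

```python
def coco_index_to_name(coco):
    """Map the sequential YOLO index to the COCO category name.

    Uses enumerate() over coco['categories'], i.e. the same order as the
    category-ID mapping, so indices line up with the written label files.
    """
    return {idx: cat["name"] for idx, cat in enumerate(coco["categories"])}

# Non-sequential COCO IDs (1 and 3) become YOLO indices 0 and 1
coco = {"categories": [{"id": 1, "name": "person"}, {"id": 3, "name": "car"}]}
print(coco_index_to_name(coco))  # {0: 'person', 1: 'car'}
```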

Dataset Splitting Tool

Split images and annotation files into training, validation, and test sets by ratio.

python
import os
import random
import shutil
from sklearn.model_selection import train_test_split

def split_dataset(image_dir, label_dir, output_dir,
                  train_ratio=0.7, val_ratio=0.15,
                  test_ratio=0.15, random_seed=42):
    """
    Split dataset by ratio

    Args:
        image_dir: Source image directory
        label_dir: Source label directory
        output_dir: Output root directory
        train_ratio: Training set ratio
        val_ratio: Validation set ratio
        test_ratio: Test set ratio
        random_seed: Random seed (for reproducibility)
    """
    random.seed(random_seed)

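    # Get all image files (sorted, because os.listdir order is arbitrary
    # and would otherwise make the seeded split non-reproducible)
    images = sorted(f for f in os.listdir(image_dir)
                    if f.lower().endswith(('.jpg', '.jpeg', '.png')))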

    if test_ratio > 0:
        # Split test set first, then split validation from remainder
        train_val, test = train_test_split(
            images, test_size=test_ratio, random_state=random_seed)
        val_ratio_adj = val_ratio / (train_ratio + val_ratio)
        train, val = train_test_split(
            train_val, test_size=val_ratio_adj, random_state=random_seed)
    else:
        train, val = train_test_split(
            images, test_size=val_ratio/(train_ratio+val_ratio),
            random_state=random_seed)
        test = []

    splits = {'train': train, 'val': val, 'test': test}

    # Copy files to corresponding directories
    for split_name, split_images in splits.items():
        os.makedirs(f"{output_dir}/images/{split_name}", exist_ok=True)
        os.makedirs(f"{output_dir}/labels/{split_name}", exist_ok=True)

        for img_file in split_images:
            # Copy image
            shutil.copy2(
                f"{image_dir}/{img_file}",
                f"{output_dir}/images/{split_name}/{img_file}")
            # Copy corresponding label file
            base = os.path.splitext(img_file)[0]
            label_file = f"{base}.txt"
            if os.path.exists(f"{label_dir}/{label_file}"):
                shutil.copy2(
                    f"{label_dir}/{label_file}",
                    f"{output_dir}/labels/{split_name}/{label_file}")

    print(f"Dataset split complete!")
    print(f"   Training set: {len(train)} images")
    print(f"   Validation set: {len(val)} images")
    print(f"   Test set: {len(test)} images")


# Usage example
split_dataset(
    image_dir="raw_images",
    label_dir="raw_labels",
    output_dir="datasets/my_dataset",
    train_ratio=0.7,
    val_ratio=0.15,
    test_ratio=0.15,
    random_seed=42
)

Stratified Split: When class distribution is severely imbalanced, simple random splitting may cause a subset to lack certain classes. Use stratified splitting to maintain class proportions:

python
import os
from sklearn.model_selection import StratifiedShuffleSplit

def stratified_split(image_dir, label_dir, output_dir,
                     test_size=0.2, random_seed=42):
    """Split dataset by class proportion"""
    images, labels = [], []

    for f in os.listdir(image_dir):
        if not f.lower().endswith(('.jpg', '.jpeg', '.png')):
            continue
        base = os.path.splitext(f)[0]
        txt_path = f"{label_dir}/{base}.txt"
        if not os.path.exists(txt_path):
            continue

        # Read classes present in this image
        with open(txt_path) as fh:
            classes = [int(line.split()[0]) for line in fh if line.strip()]

        if classes:
            images.append(f)
            labels.append(classes[0])  # Use first class for stratification

    sss = StratifiedShuffleSplit(
        n_splits=1, test_size=test_size, random_state=random_seed)

    train_idx, val_idx = next(sss.split(images, labels))
    train = [images[i] for i in train_idx]
    val = [images[i] for i in val_idx]

    # File copying logic is the same as in split_dataset (omitted here)
    print(f"Stratified split complete: {len(train)} training, {len(val)} validation")
    return train, val
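
The function above stratifies on the first class listed in each label file, which can hide rare classes that appear later in the file. One possible alternative, shown here only as a sketch, is to use each image's globally rarest class as the stratification key (the helper name is hypothetical):

```python
from collections import Counter

def rarest_class_keys(classes_per_image):
    """Pick, for each image, its globally rarest class as the stratification key.

    classes_per_image: list of non-empty lists of class IDs, one list per image.
    """
    # Count in how many images each class appears
    freq = Counter(c for classes in classes_per_image for c in set(classes))
    # Ties are broken by the smaller class ID so the result is deterministic
    return [min(set(classes), key=lambda c: (freq[c], c))
            for classes in classes_per_image]

print(rarest_class_keys([[0, 1], [0], [0, 2], [0, 2]]))  # [1, 0, 2, 2]
```

Note that StratifiedShuffleSplit needs at least two images per key, so a key that occurs only once (class 1 in the example) has to be merged with another key or handled separately.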

Class Balance Analysis

Analyze the distribution of each class in the dataset, detect class imbalance issues, and take corrective measures.

python
import os
import matplotlib.pyplot as plt
from collections import Counter

def analyze_class_balance(label_dir, class_names=None):
    """
    Analyze dataset class distribution

    Args:
        label_dir: Label file directory
        class_names: List of class names (optional)

    Returns:
        class_counts: Counter object with instance counts per class
    """
    class_counts = Counter()
    label_files = [f for f in os.listdir(label_dir) if f.endswith('.txt')]

    for f in label_files:
        with open(f"{label_dir}/{f}", 'r') as fh:
            for line in fh:
                if line.strip():
                    cls_id = int(line.split()[0])
                    class_counts[cls_id] += 1

    # Visualize distribution
    if class_names is None:
        # Key by the class IDs actually present; IDs may be non-contiguous
        class_names = {cid: f"class_{cid}" for cid in class_counts}

    names = [class_names[cid] for cid, _ in class_counts.most_common()]
    counts = [cnt for _, cnt in class_counts.most_common()]

    plt.figure(figsize=(10, 5))
    bars = plt.bar(range(len(names)), counts, color='steelblue')
    plt.xticks(range(len(names)), names, rotation=45)
    plt.ylabel('Instance Count')
    plt.title('Dataset Class Distribution')

    # Annotate bar chart with values
    for bar, count in zip(bars, counts):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(counts)*0.01,
                 str(count), ha='center', va='bottom')

    plt.tight_layout()
    plt.savefig('class_distribution.png', dpi=150)
    plt.show()

    return class_counts


# Usage example
class_names = ['person', 'car', 'dog']
counts = analyze_class_balance('labels/train', class_names)
print(f"Class distribution: {dict(counts)}")
total_instances = sum(counts.values())
print(f"Total instances: {total_instances}")

Handling Class Imbalance:

| Method | Description | Use Case |
|---|---|---|
| Oversampling | Duplicate minority class samples or apply mild augmentation | Majority class has enough samples |
| Undersampling | Randomly discard majority class samples | Large dataset with redundant majority class |
| Class Weight | Assign higher weight to minority classes in the loss function | For example, the cls_pw hyperparameter in YOLOv5 |
| Data Augmentation | Apply more augmentations to minority classes | General approach, recommended first |

python
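# Class weighting note: in YOLOv5, cls_pw is a scalar hyperparameter
# (positive weight of the classification BCE loss) defined in the hyp YAML,
# not a per-class list, for example:
#   cls_pw: 2.0
# Ultralytics model.train() rejects arguments it does not recognize, so check
# your version's configuration reference before passing a weighting argument.
# True per-class weights generally require a custom loss function.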
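
Whatever mechanism is used to apply them, per-class weights are often derived from the instance counts. The sketch below uses a common inverse-frequency heuristic, total / (num_classes * count); it is one reasonable choice rather than a YOLO-specific rule, and how the weights are fed into training depends on the framework.

```python
from collections import Counter

def inverse_frequency_weights(class_counts):
    """Return {class_id: weight}; a class with the average count gets 1.0."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {cid: total / (n_classes * cnt)
            for cid, cnt in class_counts.items() if cnt > 0}

# Example counts such as those returned by analyze_class_balance
counts = Counter({0: 800, 1: 150, 2: 50})
weights = inverse_frequency_weights(counts)
print({cid: round(w, 2) for cid, w in weights.items()})
```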

Data Augmentation Strategies

Built-in Augmentation (Ultralytics)

python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # Any pretrained detection checkpoint
model.train(
    data="data.yaml",
    # Basic augmentation
    hsv_h=0.015,      # Hue augmentation
    hsv_s=0.7,        # Saturation augmentation
    hsv_v=0.4,        # Brightness augmentation
    degrees=0.0,      # Rotation angle
    translate=0.1,    # Translation
    scale=0.5,        # Scale
    shear=0.0,        # Shear
    perspective=0.0,  # Perspective transformation
    flipud=0.0,       # Vertical flip probability
    fliplr=0.5,       # Horizontal flip probability
    # Advanced augmentation
    mosaic=1.0,       # Mosaic augmentation
    mixup=0.0,        # Mixup augmentation
    copy_paste=0.0,   # Copy-Paste augmentation
)

Custom Augmentation (Albumentations)

python
import albumentations as A

transform = A.Compose([
    A.RandomBrightnessContrast(p=0.5),
    A.GaussianBlur(p=0.3),
    A.GaussNoise(p=0.3),
    A.HorizontalFlip(p=0.5),
], bbox_params=A.BboxParams(format='yolo', label_fields=['class_labels']))

Mosaic Augmentation Explained

Mosaic augmentation, introduced in YOLOv4, is a key technique that stitches 4 images into one large composite, significantly improving small object detection.

python
import cv2
import numpy as np

def mosaic_augmentation(image1, image2, image3, image4, img_size=640):
    """Simulate the core stitching logic of Mosaic augmentation"""
    h, w = img_size, img_size
    mid_x = w // 2
    mid_y = h // 2

    # Resize all four images to the same dimensions
    images = [image1, image2, image3, image4]
    resized = [cv2.resize(img, (w, h)) for img in images]

    # Create canvas
    canvas = np.zeros((h, w, 3), dtype=np.uint8)

    # Place images at top-left, top-right, bottom-left, bottom-right
    canvas[:mid_y, :mid_x] = cv2.resize(resized[0], (mid_x, mid_y))    # top-left
    canvas[:mid_y, mid_x:] = cv2.resize(resized[1], (w-mid_x, mid_y))  # top-right
    canvas[mid_y:, :mid_x] = cv2.resize(resized[2], (mid_x, h-mid_y))  # bottom-left
    canvas[mid_y:, mid_x:] = cv2.resize(resized[3], (w-mid_x, h-mid_y))  # bottom-right

    # In practice, the stitch point is chosen randomly for each composite
    # rather than being fixed at the canvas center
    return canvas

Internal Mechanics:

  1. Randomly select 4 images per training iteration
  2. Randomly choose a stitch center point (not fixed image center)
  3. Stitch the 4 images around the center point into one composite
  4. Recalculate all bounding box coordinates relative to the composite
  5. Boxes that fall outside the composite are cropped or discarded

Effect: Each training image contains content from 4 images, enriching BN layer statistics and improving small object context diversity.
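
Step 4, recalculating the boxes, can be sketched for the simplified fixed-center layout used in mosaic_augmentation above: each tile is shrunk to half the canvas, so normalized coordinates are halved and shifted by the quadrant offset. This is an illustration under that assumption, not the random-center logic of a real implementation.

```python
# (x_offset, y_offset) of each quadrant in normalized composite coordinates:
# top-left, top-right, bottom-left, bottom-right
QUADRANT_OFFSETS = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]

def remap_boxes_to_mosaic(boxes_per_image):
    """boxes_per_image: four lists of (class_id, xc, yc, w, h), one per tile."""
    merged = []
    for (off_x, off_y), boxes in zip(QUADRANT_OFFSETS, boxes_per_image):
        for cls_id, xc, yc, w, h in boxes:
            # Each tile covers half the canvas in both directions
            merged.append((cls_id, off_x + xc / 2, off_y + yc / 2, w / 2, h / 2))
    return merged

tiles = [[(0, 0.5, 0.5, 1.0, 1.0)], [], [], [(1, 0.5, 0.5, 0.2, 0.2)]]
print(remap_boxes_to_mosaic(tiles))
# [(0, 0.25, 0.25, 0.5, 0.5), (1, 0.75, 0.75, 0.1, 0.1)]
```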

Mixup Augmentation Explained

Mixup originates from image classification and was later adopted in object detection. The core idea is to blend two images proportionally.

python
import cv2
import numpy as np

def mixup_augmentation(image1, image2, alpha=0.5):
    """Simulate the core blending logic of Mixup augmentation"""
    # Mixup coefficient sampled from Beta distribution
    lam = np.random.beta(alpha, alpha)

    # Blend two images proportionally
    mixed = lam * image1 + (1 - lam) * image2

    # Bounding boxes from both images are retained. The original formulation
    # weights their loss by lam and (1-lam); many YOLO implementations simply
    # concatenate the two label sets without weighting
    return mixed.astype(np.uint8)

Internal Mechanics:

  1. Randomly select two images and their annotations from the current batch
  2. Sample blend coefficient lam from Beta(alpha, alpha) (typically alpha=0.5~1.0)
  3. Pixel-level blending: mixed_img = lam * img1 + (1-lam) * img2
  4. Bounding boxes from both images are retained; in the original formulation they contribute to the loss with weights lam and 1-lam, while many YOLO implementations simply concatenate the labels
  5. Label smoothing effect: the model sees blended images, providing implicit regularization

Note: Mixup slightly reduces training speed (double the boxes to process) but effectively reduces overfitting.

Copy-Paste Augmentation Explained

Copy-Paste augmentation copies objects from one image and pastes them onto another, significantly increasing instance count and background diversity.

python
import cv2
import numpy as np

def copy_paste_augmentation(source_img, target_img, source_bbox):
    """Simulate the core pasting logic of Copy-Paste augmentation"""
    x1, y1, x2, y2 = map(int, source_bbox)
    obj = source_img[y1:y2, x1:x2].copy()

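    # Randomly select paste location (the object must fit inside the target)
    h, w = target_img.shape[:2]
    obj_h, obj_w = obj.shape[:2]
    if obj_h > h or obj_w > w:
        raise ValueError("Object is larger than the target image")
    paste_x = np.random.randint(0, w - obj_w + 1)  # Upper bound is exclusive
    paste_y = np.random.randint(0, h - obj_h + 1)

    # Overlay onto a copy so the caller's target image is not modified
    target_img = target_img.copy()
    target_img[paste_y:paste_y+obj_h, paste_x:paste_x+obj_w] = obj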

    # New bounding box
    new_bbox = [paste_x, paste_y, paste_x+obj_w, paste_y+obj_h]
    return target_img, new_bbox

Internal Mechanics:

  1. Randomly select two images from the dataset: source (providing the object) and target
  2. Randomly choose one object instance from the source image
  3. Use segmentation mask for precise contour extraction if available; otherwise crop via bounding box
  4. Optionally apply slight scaling, rotation, or flipping to the object
  5. Paste the transformed object onto the target image at a random location
  6. If the paste location overlaps with other boxes, skip or blend with transparency

Use Cases: Small object detection, rare class augmentation. Avoid overuse to prevent losing background information.
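
The overlap check in step 6 can be sketched with a plain IoU test against the boxes already present in the target image. The threshold of 0.3 is an arbitrary assumption for illustration, and both helper names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) in pixels."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def paste_is_allowed(new_box, existing_boxes, max_iou=0.3):
    """Reject a paste location that overlaps any existing box too much."""
    return all(box_iou(new_box, box) <= max_iou for box in existing_boxes)
```

A caller could draw a new random location a few times and skip the paste if none passes the check.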

Augmentation Parameter Reference:

python
model.train(
    data="data.yaml",

    # Mosaic: application probability (1.0 = every training image)
    mosaic=1.0,

    # Mixup: application probability
    mixup=0.1,

    # Copy-Paste: application probability
    copy_paste=0.1,
)
# The stitch-center range and the Beta parameter of Mixup are fixed inside the
# Ultralytics augmentation code and are not train() arguments; paste_in is a
# YOLOv7 hyperparameter, not an Ultralytics one.

Data Quality Checklist

Carefully inspect dataset quality before training to avoid wasting training time.

1. Annotation Consistency

  • Are bounding box sizes and aspect ratios consistent for the same class?
  • Are all objects annotated (no missing labels)?
  • Are class names spelled consistently (no typos or capitalization issues)?

2. Annotation Boundary Check

  • Are coordinates within the 0~1 range (YOLO normalized format)?
  • Are there any invalid boxes with width or height of 0?
  • Do boxes extend beyond image boundaries? A slight overshoot can be clipped; a large one usually indicates an annotation error
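
A box can have its center inside the image while its edges still stick out, so the boundary check is best done on the corners. A minimal sketch that clips a normalized box to the image and reports boxes that disappear entirely (the helper name is hypothetical):

```python
def clip_yolo_box(xc, yc, w, h):
    """Clip a normalized YOLO box to the image; return None if nothing is left."""
    x1, y1 = max(0.0, xc - w / 2), max(0.0, yc - h / 2)
    x2, y2 = min(1.0, xc + w / 2), min(1.0, yc + h / 2)
    if x2 <= x1 or y2 <= y1:
        return None  # Box lies completely outside the image or has zero area
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

# Right edge at 1.05 is clipped to 1.0, which shifts the center and shrinks the width
print(clip_yolo_box(0.95, 0.5, 0.2, 0.2))
```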

3. Image Quality

  • Are there any corrupted or unopenable image files?
  • Are image channel counts consistent (all RGB)?
  • Are image resolution differences too large? Recommend keeping the longest side under 1920

4. Class Check

  • Does nc (number of classes) in data.yaml match the actual annotation files?
  • Are all class_ids in .txt files within the 0 ~ nc-1 range?
  • Are there empty annotation files (background images)? Is that expected?

5. Dataset Balance

  • Are instance counts severely imbalanced across classes?
  • Is the class distribution consistent across training/validation/test sets?
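
Consistency across splits can be checked by comparing each class's share of instances in two label folders. The sketch below is one way to do that; the helper names are hypothetical, and what counts as an acceptable gap is left to the reader.

```python
import os
from collections import Counter

def class_fractions(label_dir):
    """Return {class_id: share of all instances} for one split."""
    counts = Counter()
    for name in os.listdir(label_dir):
        if not name.endswith(".txt") or name == "classes.txt":
            continue  # classes.txt holds names, not annotations
        with open(os.path.join(label_dir, name)) as fh:
            counts.update(int(line.split()[0]) for line in fh if line.strip())
    total = sum(counts.values())
    return {cid: n / total for cid, n in counts.items()} if total else {}

def max_fraction_gap(dir_a, dir_b):
    """Largest absolute difference in class share between two splits."""
    fa, fb = class_fractions(dir_a), class_fractions(dir_b)
    return max((abs(fa.get(c, 0) - fb.get(c, 0)) for c in set(fa) | set(fb)),
               default=0.0)
```

For example, max_fraction_gap('labels/train', 'labels/val') returns 0.0 when both splits have identical class proportions.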

6. Automated Validation Script

python
import os

def validate_dataset(image_dir, label_dir, num_classes):
    """Automated dataset quality check"""
    issues = []

    for f in os.listdir(label_dir):
        if not f.endswith('.txt'):
            continue

        filepath = os.path.join(label_dir, f)
        with open(filepath, 'r') as fh:
            for line_num, line in enumerate(fh, 1):
                parts = line.strip().split()
                if not parts:
                    continue  # Ignore blank lines
                if len(parts) != 5:
                    issues.append(f"{f}:{line_num} format error: expected 5 fields, got {len(parts)}")
                    continue

                try:
                    cls_id = int(parts[0])
                    xc, yc, w, h = map(float, parts[1:])
                except ValueError:
                    issues.append(f"{f}:{line_num} non-numeric value in line")
                    continue

                if cls_id < 0 or cls_id >= num_classes:
                    issues.append(f"{f}:{line_num} class_id {cls_id} out of range 0~{num_classes-1}")
                if w <= 0 or h <= 0:
                    issues.append(f"{f}:{line_num} invalid box size: w={w}, h={h}")
                if xc < 0 or xc > 1 or yc < 0 or yc > 1:
                    issues.append(f"{f}:{line_num} center coordinates out of bounds: xc={xc}, yc={yc}")

        # Check if corresponding image exists
        base = os.path.splitext(f)[0]
        img_exists = any(os.path.exists(f"{image_dir}/{base}{ext}")
                        for ext in ['.jpg', '.jpeg', '.png'])
        if not img_exists:
            issues.append(f"{f}: corresponding image not found")

    if issues:
        print("Issues found:")
        for issue in issues:
            print(f"  - {issue}")
    else:
        print("Dataset validation passed!")

    return len(issues)


# Usage example
validate_dataset(
    image_dir='datasets/my_dataset/images/train',
    label_dir='datasets/my_dataset/labels/train',
    num_classes=3
)
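
The script above walks the label files, so it cannot notice images that have no label file at all. A complementary sketch for that direction (the helper name is hypothetical); whether an unlabeled image is an error or an intentional background image is for the reader to decide.

```python
import os

IMAGE_EXTS = ('.jpg', '.jpeg', '.png')

def find_unlabeled_images(image_dir, label_dir):
    """Return image files that have no matching .txt label file."""
    missing = []
    for name in sorted(os.listdir(image_dir)):
        base, ext = os.path.splitext(name)
        if ext.lower() not in IMAGE_EXTS:
            continue  # Not an image file
        if not os.path.exists(os.path.join(label_dir, base + ".txt")):
            missing.append(name)
    return missing
```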