YOLO Model Training: Complete Custom Dataset Workflow

Complete Custom Dataset Training Process

Ultralytics Unified Training Code

python
from ultralytics import YOLO

# Load model
# model = YOLO("yolov8n.yaml")  # Train from scratch
# model = YOLO("yolo11n.pt")    # Based on pre-trained weights
model = YOLO("yolo26n.pt")      # 2026 recommended, edge deployment first choice

# Start training
results = model.train(
    # Basic configuration
    data="data.yaml",        # Dataset configuration
    epochs=100,              # Training epochs
    imgsz=640,               # Input size
    batch=16,                # Batch size
    workers=8,               # Dataloader worker processes

    # Optimizer configuration
    optimizer="auto",        # Per this guide, resolves to MuSGD for YOLO26; in "auto" mode Ultralytics may override lr0 and momentum
    lr0=0.01,                # Initial learning rate
    lrf=0.01,                # Final learning rate factor (final LR = lr0 * lrf)
    momentum=0.937,          # SGD momentum
    weight_decay=0.0005,     # Weight decay
    
    # Data augmentation
    mosaic=1.0,
    mixup=0.1,
    copy_paste=0.1,
    
    # Other configuration
    device=0,                # GPU device, "cpu" for CPU
    project="runs/train",    # Save path
    name="yolo26_exp1",      # Experiment name
    exist_ok=False,          # Whether to overwrite
    pretrained=True,         # Use pre-trained
    verbose=True,            # Detailed logs
    seed=42,                 # Random seed
)

# Validate model
metrics = model.val()
print(f"mAP50: {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")

Training Parameter Differences Across Versions

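| Parameter | YOLOv8 | YOLO11 | YOLO26 |
|---|---|---|---|
| Default Optimizer | SGD | SGD | MuSGD |
| DFL Loss | ✅ Used | ✅ Used | ❌ Removed |
| NMS Post-processing | ✅ Required | ✅ Required | ❌ Not needed (natively NMS-free) |
| Small Object Optimization | Average | Better | Best (STAL) |
| CPU Inference Speed | Baseline | +25% | +43% |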

Loss Function Breakdown

YOLO’s loss function consists of three components, each targeting a different learning objective:

yaml
# Loss gains (top-level keys in ultralytics/cfg/default.yaml)
box: 7.5    # Box regression loss gain
cls: 0.5    # Classification loss gain
dfl: 1.5    # DFL loss gain (YOLOv8/v11 only)

Box Regression Loss — CIoU

YOLO uses CIoU (Complete IoU) as the box regression loss. CIoU adds center distance and aspect ratio consistency penalties on top of standard IoU:

$$ L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b_{gt})}{c^2} + \alpha v $$

  • $IoU$: Intersection over Union between predicted and ground truth boxes
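  • $\rho^2(b, b_{gt})$: Squared Euclidean distance between the two box centers
  • $c^2$: Squared diagonal length of the smallest box enclosing both boxes
  • $\alpha v$: Aspect ratio consistency penalty, where $v$ measures the aspect ratio mismatch and $\alpha$ is its trade-off weight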

The key advantage of CIoU over plain IoU: gradients still exist even when the two boxes don’t overlap, enabling continuous learning.
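
To make the role of each term concrete, here is a minimal pure-Python sketch of the CIoU formula above for axis-aligned boxes in (x1, y1, x2, y2) form. The function name `ciou_loss`, the `eps` stabilizer, and the use of the standard CIoU definitions of $v$ and $\alpha$ are assumptions added for illustration; this is not the Ultralytics implementation.

```python
import math


def ciou_loss(box, box_gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2). Illustrative sketch."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = box_gt

    # Intersection over Union
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1
    iou = inter / (w * h + gw * gh - inter + eps)

    # rho^2: squared distance between box centers
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2

    # c^2: squared diagonal of the smallest enclosing box
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # v: aspect-ratio mismatch, alpha: its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


# Two non-overlapping pairs: IoU is 0 for both, but the loss still differs.
near = ciou_loss((0, 0, 10, 10), (20, 20, 30, 30))
far = ciou_loss((0, 0, 10, 10), (40, 40, 50, 50))
print(near, far)
```

Because the center-distance term keeps changing while IoU stays at zero, the farther pair receives the larger loss, which is the "gradients still exist" property described above.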

Classification Loss — BCE Loss

Classification uses Binary Cross-Entropy loss:

$$ L_{cls} = -\sum_{i=1}^{C} [y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i)] $$

YOLO adopts multi-label classification instead of Softmax because a single anchor may contain multiple objects. BCE loss is computed independently per class and summed.
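
As a small illustration of the per-class independence described above, the following sketch evaluates the BCE formula for one prediction. The function name `bce_loss` and the `eps` clamp are assumptions for this example; real implementations work on logits and batches.

```python
import math


def bce_loss(targets, probs, eps=1e-7):
    """Sum of per-class binary cross-entropy terms for one prediction."""
    total = 0.0
    for y, p in zip(targets, probs):
        p = min(max(p, eps), 1 - eps)  # Clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total


# Multi-label target: classes 0 and 1 are both present, class 2 is absent.
good = bce_loss([1, 1, 0], [0.9, 0.8, 0.1])
bad = bce_loss([1, 1, 0], [0.9, 0.1, 0.1])  # Second class is missed
print(good, bad)
```

Unlike Softmax, each class probability is scored on its own, so two classes can be "on" at once without competing for probability mass.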

DFL Loss (Distribution Focal Loss)

In YOLOv8/YOLO11, the four box edges are modeled as discrete probability distributions:

$$ DFL(\mathcal{S}_i, \mathcal{S}_{gt}) = -\sum_{k=0}^{15} \text{Cat}(k; y_{gt}) \cdot \log(\mathcal{S}_i[k]) $$

Each edge is represented by 16 discrete values, allowing the model to express localization uncertainty. The DFL loss typically starts around 3~5 and gradually drops below 1 as training progresses.
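
The sketch below shows one way the 16-bin representation can be decoded and scored. It assumes the two-bin weighting used in the Generalized Focal Loss formulation (the target mass is split between the two bins that bracket the continuous target); the helper names `softmax`, `decode_distance`, and `dfl_loss` are illustrative and not taken from the Ultralytics code.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_distance(logits):
    """Expected bin index, i.e. the predicted edge distance in bin units."""
    return sum(k * p for k, p in enumerate(softmax(logits)))


def dfl_loss(logits, target):
    """Cross-entropy against the two bins that bracket a continuous target in [0, 15]."""
    probs = softmax(logits)
    left = min(int(target), len(logits) - 2)
    right = left + 1
    w_left, w_right = right - target, target - left
    return -(w_left * math.log(probs[left]) + w_right * math.log(probs[right]))


uniform = [0.0] * 16          # No localization knowledge yet
peaked = [0.0] * 16
peaked[4] = 10.0              # Confident that the edge sits at bin 4

print(decode_distance(uniform), dfl_loss(uniform, 7.3))
print(decode_distance(peaked), dfl_loss(peaked, 4.0))
```

A flat distribution expresses high uncertainty and a peaked one expresses confidence, which is how the 16 values encode localization uncertainty.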

YOLO26’s ProgLoss

YOLO26 removes DFL and introduces ProgLoss (Progressive Loss Balancing):

python
# YOLO26 loss function features
# ❌ DFL removed — simpler training, fewer hyperparameters
# ✅ ProgLoss: dynamic box/cls loss weight balancing
#     - Early stage: focus on classification (cls weight increased)
#     - Later stage: focus on box refinement (box weight increased)
# ✅ Smoother loss curves, more intuitive tuning

ProgLoss employs a Curriculum Learning strategy: the model first learns “what” (classification), then “where” (localization), preventing early localization loss from overwhelming classification learning.

ProgLoss weight schedule (conceptual):
Epoch  0-10:  cls=1.0, box=5.0  → Focus on classification
Epoch 10-30:  cls=1.0, box=7.5  → Balanced learning
Epoch 30+:    cls=0.5, box=7.5  → Focus on localization
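
The conceptual schedule above can be written as a small lookup. This is only a sketch of the staged weighting idea: the function names are hypothetical, the boundary handling (epoch 10 and epoch 30 belong to the later stage) is an assumption, and the real ProgLoss schedule inside YOLO26 is not reproduced here.

```python
def conceptual_weights(epoch):
    """Return the illustrative cls/box weights for a given epoch."""
    if epoch < 10:
        return {"cls": 1.0, "box": 5.0}   # Focus on classification
    if epoch < 30:
        return {"cls": 1.0, "box": 7.5}   # Balanced learning
    return {"cls": 0.5, "box": 7.5}       # Focus on localization


def total_loss(epoch, box_loss, cls_loss):
    """Weighted sum of the raw box and cls losses under the staged schedule."""
    w = conceptual_weights(epoch)
    return w["box"] * box_loss + w["cls"] * cls_loss


for epoch in (0, 15, 40):
    print(epoch, conceptual_weights(epoch), total_loss(epoch, box_loss=1.0, cls_loss=2.0))
```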

Loss Weight Tuning Recommendations

| Scenario | box weight | cls weight | Description |
|---|---|---|---|
| General detection | 7.5 | 0.5 | Ultralytics default |
| Small objects | 10.0 | 0.3 | Improve box precision |
| Classification priority | 5.0 | 1.0 | Reduce false positives |
| High IoU requirement | 12.0 | 0.3 | Improve localization |

Training Tuning Best Practices

Learning Rate Tuning

python
# Small dataset: reduce learning rate
model.train(
    lr0=0.001,    # Reduce from 0.01 to 0.001
    epochs=200,   # Increase epochs
    cos_lr=True,  # Use cosine annealing
)

# Large dataset: standard configuration
model.train(
    lr0=0.01,
    epochs=100,
    warmup_epochs=3,
)
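
To see what `cos_lr=True` changes, the sketch below compares a cosine decay with a linear decay between `lr0` and `lr0 * lrf`. The formulas follow the commonly used one-cycle cosine and linear schedules; treat them as an approximation of the scheduler behavior rather than the exact Ultralytics code, and note that warmup is ignored here.

```python
import math


def cosine_lr(epoch, epochs, lr0=0.01, lrf=0.01):
    """Cosine decay from lr0 (epoch 0) to lr0 * lrf (last epoch)."""
    factor = (1 - math.cos(math.pi * epoch / epochs)) / 2  # Goes from 0 to 1
    return lr0 * (1 + factor * (lrf - 1))


def linear_lr(epoch, epochs, lr0=0.01, lrf=0.01):
    """Linear decay between the same two endpoints."""
    return lr0 * ((1 - epoch / epochs) * (1 - lrf) + lrf)


# Small-dataset settings from the example above: lr0=0.001, 200 epochs
for epoch in (0, 50, 100, 200):
    print(epoch, cosine_lr(epoch, 200, lr0=0.001), linear_lr(epoch, 200, lr0=0.001))
```

Both schedules share the same start and end values; the cosine variant holds the learning rate higher early on and decays more gently near the end.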

Batch Size Selection

| GPU VRAM | Recommended batch | Model size |
|---|---|---|
| 4 GB | 8 | n/s |
| 8 GB | 16 | n/s/m |
| 12 GB | 32 | n/s/m/l |
| 24 GB | 64 | All |

Multi-GPU Training

python
# Specify multiple GPUs
model.train(
    device=[0, 1, 2, 3],   # Use 4 GPUs
    batch=64,               # Total batch = per-GPU batch × GPU count
    workers=8,              # Data loading threads
)

DDP (Distributed Data Parallel)

Ultralytics uses PyTorch’s DistributedDataParallel for multi-GPU training. DDP replicates the model on each GPU, computes gradients independently per process, then synchronizes via AllReduce:

python
# DDP auto-enabled (no manual configuration needed)
model.train(
    device=[0, 1, 2, 3],
    batch=64,
)

Batch Size Scaling Rules

| GPUs | Per-GPU batch | Total batch | Learning rate |
|---|---|---|---|
| 1 | 16 | 16 | lr0=0.01 (baseline) |
| 2 | 16 | 32 | lr0=0.02 |
| 4 | 16 | 64 | lr0=0.04 |
| 8 | 16 | 128 | lr0=0.08 |

Linear Scaling Rule: When training on multiple GPUs, scale the learning rate linearly with the total batch size: target LR = baseline lr0 × (total batch / baseline batch), where the baseline is the single-GPU batch of 16 in the table above. Use warmup_epochs to ramp up to the target LR over the first few epochs, which prevents early gradient instability.
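
The rule and the warmup ramp can be checked with a few lines of arithmetic. The helper names `scaled_lr` and `warmup_lr` are hypothetical, and the warmup here is a plain linear ramp from zero, which is a simplifying assumption rather than the exact Ultralytics warmup.

```python
def scaled_lr(base_lr, base_batch, total_batch):
    """Linear scaling rule: LR grows in proportion to the total batch size."""
    return base_lr * total_batch / base_batch


def warmup_lr(step, warmup_steps, target_lr, start_lr=0.0):
    """Linear ramp from start_lr to target_lr over warmup_steps, then constant."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


# Reproduce the table: per-GPU batch 16, baseline lr0 = 0.01
for gpus in (1, 2, 4, 8):
    print(gpus, scaled_lr(0.01, base_batch=16, total_batch=16 * gpus))

# Ramp toward the 4-GPU target over 100 warmup steps
target = scaled_lr(0.01, 16, 64)
print([round(warmup_lr(s, 100, target), 4) for s in (0, 25, 50, 100)])
```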

Important Notes

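  • Batch Normalization: By default, DDP does not pool batch statistics across GPUs; each process normalizes with its own per-GPU batch. Cross-GPU statistics require PyTorch's SyncBatchNorm, so keep the per-GPU batch reasonably large
  • Gradient Accumulation: When VRAM is limited, Ultralytics accumulates gradients automatically toward the nominal batch size nbs (default 64), so a smaller batch still approximates a larger effective batch
  • Memory Balance: Keep batch size and image size consistent across all GPUs to avoid stragglers slowing down training
  • Reproducibility: Set a fixed seed (e.g., seed=42) for repeatable runs; DDP itself broadcasts rank 0's initial weights to all processes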

Early Stopping and Resume Training

python
# Enable early stopping
model.train(
    patience=50,    # Stop after 50 epochs without improvement
    # ...other training arguments
)

# Resume interrupted training
model = YOLO("runs/train/exp/weights/last.pt")
model.train(resume=True)
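
The patience logic itself is simple enough to sketch. The class below is an illustrative stand-in, not the Ultralytics early-stopping class: `EarlyStopper`, its attributes, and the strict "greater than" improvement test are assumptions for this example.

```python
class EarlyStopper:
    """Stop when the fitness metric has not improved for `patience` epochs."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best_fitness = float("-inf")
        self.best_epoch = 0

    def step(self, epoch, fitness):
        """Record this epoch's fitness and return True if training should stop."""
        if fitness > self.best_fitness:
            self.best_fitness = fitness
            self.best_epoch = epoch
        return epoch - self.best_epoch >= self.patience


# Validation fitness peaks at epoch 2 and then declines.
history = [0.10, 0.20, 0.30, 0.29, 0.28, 0.27, 0.26]
stopper = EarlyStopper(patience=3)
stopped_at = next(e for e, f in enumerate(history) if stopper.step(e, f))
print(stopped_at, stopper.best_epoch)
```

The best epoch is remembered separately from the stopping epoch, which is why best.pt and last.pt can differ after an early stop.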

Training Visualization Analysis

Using Ultralytics Auto Tuning

python
from ultralytics import YOLO

model = YOLO("yolo26n.pt")

# Start hyperparameter search
result = model.tune(
    data="data.yaml",
    epochs=50,
    iterations=100,         # Number of search iterations
    optimizer="AdamW",      # AdamW recommended for more stable search
    device=0,
    batch=16,
    plots=True,             # Plot tuning process
    save=True,              # Save each trial result
)

Search Space Configuration

Ultralytics’ built-in default search space covers 10+ hyperparameters:

yaml
# Illustrative search space (the built-in defaults are defined in the Tuner class in ultralytics/engine/tuner.py and vary by version)
lr0: [0.0001, 0.01]         # Initial learning rate
lrf: [0.0001, 0.1]          # Final learning rate factor
momentum: [0.7, 0.99]       # SGD momentum
weight_decay: [0.0, 0.002]  # Weight decay
warmup_epochs: [0, 5]       # Warmup epochs
warmup_momentum: [0.0, 0.95]
mosaic: [0.0, 1.0]          # Mosaic augmentation probability
mixup: [0.0, 0.5]           # MixUp augmentation probability
copy_paste: [0.0, 0.5]      # Copy-Paste augmentation probability
hsv_h: [0.0, 0.1]           # HSV hue augmentation
hsv_s: [0.0, 0.9]           # HSV saturation augmentation
hsv_v: [0.0, 0.9]           # HSV value augmentation
flipud: [0.0, 1.0]          # Vertical flip
fliplr: [0.0, 1.0]          # Horizontal flip

Integration with Ray Tune

For larger-scale hyperparameter searches, Ultralytics supports Ray Tune integration:

bash
pip install "ray[tune]"

python
from ray import tune
from ultralytics import YOLO

# Ray Tune search space
search_space = {
    "lr0": tune.loguniform(1e-4, 1e-2),
    "lrf": tune.loguniform(1e-4, 1e-1),
    "momentum": tune.uniform(0.7, 0.99),
    "weight_decay": tune.uniform(0.0, 0.002),
    "mosaic": tune.uniform(0.0, 1.0),
}

# Start Ray Tune search
model = YOLO("yolo26n.pt")
result = model.tune(
    data="data.yaml",
    epochs=50,
    iterations=100,              # Number of trials to sample
    use_ray=True,                # Enable Ray Tune
    space=search_space,          # Pass the custom search space defined above
    gpu_per_trial=1,             # GPUs per trial; trials run in parallel across the available GPUs
)
# The Ultralytics Ray Tune integration schedules trials with ASHA
# (Async Successive Halving), so no separate search-algorithm argument is passed.

Interpreting Search Results

The runs/tune/ directory is generated with:

runs/tune/
├── exp1/                      # Full training results per trial
├── exp2/
├── ...
├── best_hyperparameters.yaml  # Best hyperparameter set found
├── tune_results.csv           # Summary of all hyperparameters and metrics
└── tune_scatter_plots.png     # Hyperparameter correlation scatter plots

Interpretation tips:

  • Learning rate vs mAP: Dense clusters in the upper-left region of scatter plots indicate optimal LR ranges
  • Augmentation vs Overfitting: If higher mosaic/mixup values correlate with higher val mAP, data augmentation is effective
  • Weight decay sensitivity: Small datasets are typically more sensitive to weight decay; optimal values range around 0.0003~0.0007

Overfitting Diagnosis and Regularization

Spotting Overfitting from Loss Curves

When training exhibits the following signals, the model may be overfitting:

Loss curves:
📈 Train Loss:  Continuously decreasing
📉 Val Loss:    Decreasing then rising (inflection = overfitting starts)
📊 mAP:         Stagnates or decreases on validation set

Key indicator: When training loss continues to decrease but validation loss starts rising, you have a Generalization Gap — the clearest sign of overfitting.
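
The check described above can be automated on the per-epoch losses (for example, the columns logged in results.csv). The function below is a rough heuristic sketch: the name `overfitting_onset` and the `min_rise_epochs` parameter are hypothetical, and noisy curves would need smoothing first.

```python
def overfitting_onset(train_loss, val_loss, min_rise_epochs=3):
    """Return the epoch where val loss bottomed out if it stayed higher for at
    least `min_rise_epochs` later epochs while train loss kept falling; else None."""
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    later = range(best + 1, len(val_loss))
    if len(later) < min_rise_epochs:
        return None
    val_rising = all(val_loss[i] > val_loss[best] for i in later)
    train_falling = train_loss[-1] < train_loss[best]
    return best if val_rising and train_falling else None


train = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2]
val_overfit = [1.1, 0.9, 0.8, 0.85, 0.9, 0.95, 1.0]
val_healthy = [1.1, 0.9, 0.8, 0.7, 0.6, 0.55, 0.5]

print(overfitting_onset(train, val_overfit))
print(overfitting_onset(train, val_healthy))
```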

Regularization Techniques

python
# Apply multiple regularization strategies in training
model.train(
    # Explicit regularization
    weight_decay=0.001,       # L2 regularization (weight decay)
    dropout=0.1,              # Dropout probability (Ultralytics documents this for classification training only)

    # Implicit regularization (data augmentation)
    mosaic=1.0,
    mixup=0.2,                # Mix two images
    copy_paste=0.2,           # Copy-paste augmentation

    # Training control
    patience=30,              # Early stopping
)

  • L2 Regularization (Weight Decay): Penalizes large weights, preventing the model from relying too heavily on a few features
  • Dropout: Randomly drops neurons, forcing the model to learn redundant features; in Ultralytics the dropout argument is documented for classification training, so expect little effect on detection runs
  • Data Augmentation: Mosaic, MixUp, and Copy-Paste implicitly expand the training distribution
  • Label Smoothing: Controlled via label_smoothing; prevents the model from becoming overconfident

Data Augmentation Tuning

Different datasets benefit from different augmentation strategies:

| Dataset Type | Recommended Augmentation | Description |
|---|---|---|
| Small (<1000 images) | mosaic=1.0, mixup=0.3 | Strong augmentation for diversity |
| Medium dataset | mosaic=1.0, mixup=0.1 | Moderate augmentation |
| Large (>10K images) | mosaic=0.5, mixup=0.0 | Weak augmentation, preserve distribution |
| Dense scenes | copy_paste=0.3 | Improve occlusion detection |
| Aerial/satellite | mosaic=1.0, flipud=0.5 | Leverage rotation invariance |

Early Stopping Tuning

python
model.train(
    patience=50,          # Epochs without improvement before stopping (recent Ultralytics releases default to 100); small datasets: 20~30
    save_period=10,       # Save every N epochs for rollback
    cos_lr=True,          # Cosine annealing with early stopping
    lrf=0.001,            # Final LR close to zero
    warmup_epochs=3,      # Prevent premature early stopping
)

When overfitting is severe, the primary remedy is increasing data quantity or reducing model complexity (choosing a smaller model scale, e.g., s→n). Regularization is a secondary aid.

Result Files Description

After training completes, runs/train/exp/ directory contains:

weights/
├── best.pt          # Best weights
└── last.pt          # Last epoch weights
results.csv          # Training log CSV
results.png          # Loss curves
confusion_matrix.png # Confusion matrix
PR_curve.png         # PR curve
F1_curve.png         # F1 curve

Key Metrics Interpretation

  • Box Loss: Detection box regression loss → Should continuously decrease

  • Cls Loss: Classification loss → Should continuously decrease

  • mAP50: Mean average precision at IoU=0.5 → Should continuously increase

  • mAP50-95: Mean average precision averaged over IoU thresholds 0.5~0.95 → Core metric

Validation Metrics Deep Dive

Precision and Recall

YOLO’s training metrics are based on the following definitions:

| Metric | Formula | Meaning |
|---|---|---|
| Precision | TP / (TP + FP) | Of all positive predictions, how many are correct |
| Recall | TP / (TP + FN) | Of all actual positives, how many were found |
| F1-Score | 2 × P × R / (P + R) | Harmonic mean of Precision and Recall |

These metrics vary with the Confidence threshold — higher thresholds boost Precision but reduce Recall, and vice versa.
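
A short worked example shows the threshold trade-off numerically. The detections below are made-up (confidence, is-correct) pairs, and the helper names are hypothetical; matching predictions to ground truth by IoU is assumed to have happened already.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall and F1 from raw counts, guarding against division by zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def counts_at(detections, num_gt, conf):
    """Count TP/FP/FN when only detections with confidence >= conf are kept."""
    tp = sum(1 for c, ok in detections if c >= conf and ok)
    fp = sum(1 for c, ok in detections if c >= conf and not ok)
    return tp, fp, num_gt - tp


# Hypothetical detections for 5 ground-truth objects
detections = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.4, False), (0.3, True)]

low = precision_recall_f1(*counts_at(detections, num_gt=5, conf=0.25))
high = precision_recall_f1(*counts_at(detections, num_gt=5, conf=0.75))
print("conf=0.25:", low)
print("conf=0.75:", high)
```

Raising the threshold from 0.25 to 0.75 removes both false positives but also drops two correct detections, so Precision rises while Recall falls.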

Confusion Matrix

confusion_matrix.png is the essential tool for diagnosing model behavior:

                  True class
               ┌──────┬────────────┐
   Predicted   │  TP  │     FP     │
   class       ├──────┼────────────┤
               │  FN  │ background │
               └──────┴────────────┘
  • Diagonal: Correctly classified samples (more is better)
  • Off-diagonal: Class confusion (e.g., predicting “dog” as “cat”) → needs more discriminative features
  • Background FP: Model falsely detects background as object → add negative samples or raise confidence threshold
  • Missed Detection (FN): Target not detected → lower confidence threshold or add small object detection heads

mAP Calculation Walkthrough

mAP (Mean Average Precision) is YOLO’s core evaluation metric:

  1. Sort all detections by Confidence score
  2. Compute Precision and Recall at each Confidence threshold
  3. Plot the PR curve (Precision-Recall Curve); the area under the curve is AP
  4. mAP50: Average AP across all classes at IoU threshold 0.5
  5. mAP50-95: Average AP across 10 IoU thresholds from 0.5 to 0.95 (step 0.05)
mAP50     = Lenient evaluation: rough location is sufficient
mAP50-95  = Strict evaluation: boxes must tightly fit objects
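
The walkthrough above can be sketched for a single class at a single IoU threshold. This version integrates the exact step-wise PR curve after applying a precision envelope; Ultralytics and COCO-style evaluators use interpolated variants, so treat the numbers as illustrative. All names here are hypothetical.

```python
def average_precision(detections, num_gt):
    """AP for one class at one IoU threshold.

    detections: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth objects of this class.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)  # Step 1
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:                                        # Step 2
        tp += is_tp
        fp += not is_tp
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make precision non-increasing as recall grows
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Step 3: area under the step-wise PR curve
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# The 10 IoU thresholds averaged by mAP50-95
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mean_over(values):
    """Plain average, used across classes (mAP) or across IoU thresholds (mAP50-95)."""
    return sum(values) / len(values)


ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
print(ap, IOU_THRESHOLDS)
```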

Diagnosing Model Issues with Metrics

| Observation | Possible Cause | Solution |
|---|---|---|
| High Precision, Low Recall | Many missed detections | Lower confidence threshold, add small object data |
| High Recall, Low Precision | Many false positives | Raise confidence threshold, add negative samples |
| High mAP50, Low mAP50-95 | Imprecise boxes | Increase box loss weight, refine annotations |
| Class A AP much lower | Class imbalance | Add class A samples or use Class Weights |
| PR curve sudden drop | Hard example cluster | Hard Negative Mining, data augmentation |

F1-Confidence Curve

F1_curve.png shows the F1 score across different Confidence thresholds. The peak of this curve gives the optimal threshold for the model:

python
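# Find the confidence threshold that maximizes the class-averaged F1
# (assumes metrics.box exposes the f1_curve and px arrays, as in recent Ultralytics releases)
metrics = model.val()
mean_f1 = metrics.box.f1_curve.mean(0)    # Average F1 over classes at each confidence value
best_conf = float(metrics.box.px[mean_f1.argmax()])
print(f"Optimal confidence threshold: {best_conf:.3f}")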
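
Independently of any library, the peak of the F1-confidence curve can be found with a simple sweep. The sketch below uses made-up (confidence, is-correct) pairs and a hypothetical helper `best_confidence`; it evaluates F1 = 2TP / (2TP + FP + FN) on an evenly spaced grid of thresholds and keeps the first maximum.

```python
def best_confidence(detections, num_gt, steps=101):
    """Sweep confidence thresholds in [0, 1] and return (threshold, F1) at the peak."""
    best_conf, best_f1 = 0.0, 0.0
    for i in range(steps):
        conf = i / (steps - 1)
        tp = sum(1 for c, ok in detections if c >= conf and ok)
        fp = sum(1 for c, ok in detections if c >= conf and not ok)
        fn = num_gt - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # Strict comparison keeps the lowest threshold at the peak
            best_conf, best_f1 = conf, f1
    return best_conf, best_f1


# Hypothetical detections for 4 ground-truth objects
detections = [(0.9, True), (0.8, True), (0.7, True), (0.6, False),
              (0.5, True), (0.3, False), (0.2, False)]
conf, f1 = best_confidence(detections, num_gt=4)
print(conf, f1)
```

Here the peak sits just above the two low-confidence false positives and at or below the weakest true positive, which is the operating point the F1 curve is meant to reveal.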

YOLO26 Specific Training Recipe

MuSGD Optimizer

YOLO26 defaults to MuSGD, which Ultralytics describes as a hybrid of SGD and the Muon optimizer, positioned as an improvement over standard SGD. Characteristics reported for it:

  • Adaptive momentum scheduling: Momentum adjusts dynamically based on gradient variance during training
  • Time-aware learning rate: Integrates a cosine annealing variant, so there is no need to set cos_lr=True separately
  • Faster convergence: MuSGD typically reaches a given accuracy about 10~15% sooner than standard SGD under the same epoch budget
python
# YOLO26 recommended training configuration
model.train(
    optimizer="auto",         # Auto-selects MuSGD
    lr0=0.01,                 # MuSGD is LR-insensitive, 0.005~0.02 works
    momentum=0.937,
    weight_decay=0.0005,
    cos_lr=False,             # MuSGD has built-in scheduling, no extra cosine annealing needed
)

Progressive Loss Balancing (ProgLoss)

YOLO26’s ProgLoss automatically manages box/cls loss weight scheduling without manual tuning:

Training stage          Focus                    Loss behavior
──────────────────────────────────────────────────────────────
Epoch 1-10  (warmup)   Classification           cls loss drops fast
Epoch 10-30 (explore)  Classification + box     box loss starts decreasing significantly
Epoch 30+   (refine)   Box refinement           box loss converges slowly

If cls loss stays flat for a long time in the training log, the default ProgLoss schedule may not suit your dataset. In that case, try setting the loss gains manually:

python
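# Set the loss gains manually with the standard box/cls training arguments.
# Whether ProgLoss itself can be switched off depends on the installed version;
# check the documented training arguments before relying on a dedicated flag.
model.train(
    data="data.yaml",
    box=7.5,    # Box regression loss gain
    cls=0.5,    # Classification loss gain
)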

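STAL (Small-Target-Aware Label Assignment)

YOLO26's STAL is designed to improve small-target detection. Ultralytics expands the acronym as Small-Target-Aware Label Assignment, so it acts on how training targets are assigned rather than as a separate network layer: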

python
# STAL is part of YOLO26 training by default. No dedicated STAL arguments appear
# among the documented training settings, so none are passed here.
model.train(
    data="data.yaml",
    imgsz=640,    # Larger input sizes generally help small targets
)

How STAL works (a conceptual description; the exact mechanism is version-specific):

  1. Detection phase: Identifies targets smaller than 32×32 pixels in the image
  2. Augmentation phase: Upsamples, copies, and pastes these small targets to other image regions
  3. Decay strategy: STAL intensity gradually decreases during training, allowing the model to adapt back to the original distribution

YOLO26 Training Checklist

python
# ===== YOLO26 Recommended Complete Training Configuration =====
model.train(
    # Data
    data="data.yaml",
    epochs=200,
    imgsz=640,
    batch=16,

    # Optimizer (MuSGD)
    optimizer="auto",
    lr0=0.01,
    momentum=0.937,
    weight_decay=0.0005,

    # Augmentation (STAL auto-enabled)
    mosaic=1.0,
    mixup=0.1,
    copy_paste=0.1,

    # Regularization
    dropout=0.1,              # Documented by Ultralytics for classification training; little effect expected on detection
    label_smoothing=0.0,

    # Other
    device=0,
    patience=50,
    seed=42,
    verbose=True,
)

Tip: When deploying YOLO26 on edge devices (Jetson, Raspberry Pi, etc.), use imgsz=320 or imgsz=480: inference speed improves 2~3× while mAP drops only 1~3%.