YOLO Model Training: Complete Custom Dataset Workflow

May 14, 2026 AI Tools YOLO, Model Training, Deep Learning, Hyperparameter Tuning AI Engineering Series 2550 words 12 min read


Complete Custom Dataset Training Process

Ultralytics Unified Training Code

python
from ultralytics import YOLO

# Load model
# model = YOLO("yolov8n.yaml")  # Train from scratch
# model = YOLO("yolo11n.pt")    # Based on pre-trained weights
model = YOLO("yolo26n.pt")      # 2026 recommended, edge deployment first choice

# Start training
results = model.train(
    # Basic configuration
    data="data.yaml",        # Dataset configuration
    epochs=100,              # Training epochs
    imgsz=640,               # Input size
    batch=16,                # Batch size
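    workers=8,               # Dataloader worker processes
    
    # Optimizer configuration
    optimizer="auto",        # Chooses optimizer, lr0 and momentum automatically (MuSGD for YOLO26)
    lr0=0.01,                # Initial learning rate (only used with an explicit optimizer)
    lrf=0.01,                # Final learning rate as a fraction of lr0
    momentum=0.937,          # SGD momentum (only used with an explicit optimizer)
    weight_decay=0.0005,     # Weight decay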
    
    # Data augmentation
    mosaic=1.0,
    mixup=0.1,
    copy_paste=0.1,
    
    # Other configuration
    device=0,                # GPU device, "cpu" for CPU
    project="runs/train",    # Save path
    name="yolo26_exp1",      # Experiment name
    exist_ok=False,          # Whether to overwrite
    pretrained=True,         # Use pre-trained
    verbose=True,            # Detailed logs
    seed=42,                 # Random seed
)

# Validate model
metrics = model.val()
print(f"mAP50: {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")

Training Parameter Differences Across Versions

Parameter	YOLOv8	YOLO11	YOLO26
Default Optimizer	SGD	SGD	MuSGD
DFL Loss	✅	✅	❌ Removed
NMS Post-processing	✅	✅	❌ Native no NMS
Small Object Optimization	Average	Better	Best (STAL)
CPU Inference Speed	Baseline	+25%	+43%

Loss Function Breakdown

YOLO’s loss function consists of three components, each targeting a different learning objective:

yaml
# Loss gain settings (top-level keys in ultralytics/cfg/default.yaml)
box: 7.5    # Box regression loss gain
cls: 0.5    # Classification loss gain
dfl: 1.5    # DFL loss gain (YOLOv8/YOLO11 only)

Box Regression Loss — CIoU

YOLO uses CIoU (Complete IoU) as the box regression loss. CIoU adds center distance and aspect ratio consistency penalties on top of standard IoU:

$$ L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b_{gt})}{c^2} + \alpha v $$

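$IoU$: Intersection over Union between predicted and ground truth boxes
$\rho^2(b, b_{gt})$: Squared Euclidean distance between the two box centers
$c^2$: Squared diagonal length of the smallest box enclosing both boxes
$\alpha v$: Aspect ratio consistency penalty, where $v = \frac{4}{\pi^2}\left(\arctan\frac{w_{gt}}{h_{gt}} - \arctan\frac{w}{h}\right)^2$ and $\alpha = \frac{v}{(1 - IoU) + v}$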

The key advantage of CIoU over plain IoU: the center-distance term still produces a gradient when the two boxes don’t overlap, so learning continues.
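To make the formula concrete, here is a minimal pure-Python sketch of CIoU for two boxes in (x1, y1, x2, y2) format. The function name `ciou_loss` and the `eps` stabilizer are illustrative choices, not Ultralytics' implementation, and the boxes are assumed to have positive width and height.

```python
import math


def ciou_loss(box, box_gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2) with positive width/height."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = box_gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Plain IoU
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = w * h + gw * gh - inter + eps
    iou = inter / union

    # rho^2: squared distance between box centers
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2
    # c^2: squared diagonal of the smallest enclosing box
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio term v and its trade-off weight alpha
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(w / h)) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


gt = (0, 0, 10, 10)
near = ciou_loss((20, 0, 30, 10), gt)  # no overlap, close
far = ciou_loss((40, 0, 50, 10), gt)   # no overlap, farther away
print(near, far)
```

Both boxes in the usage example have zero IoU with the ground truth, yet the farther one receives a larger loss. That difference is what keeps the gradient informative before the boxes overlap.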

Classification Loss — BCE Loss

Classification uses Binary Cross-Entropy loss:

$$ L_{cls} = -\sum_{i=1}^{C} [y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i)] $$

YOLO adopts multi-label classification instead of Softmax because a single anchor may contain multiple objects. BCE loss is computed independently per class and summed.
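As a small illustration of the formula above, the following sketch computes the summed per-class BCE for one prediction. The function name `bce_loss` and the clamping constant `eps` are assumptions made for this example; frameworks usually compute this from logits for numerical stability.

```python
import math


def bce_loss(targets, probs, eps=1e-7):
    """Sum of independent per-class binary cross-entropy terms."""
    total = 0.0
    for y, p in zip(targets, probs):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total


# Two classes, the first one is present in the target
confident = bce_loss([1, 0], [0.9, 0.1])
unsure = bce_loss([1, 0], [0.5, 0.5])
print(confident, unsure)
```

Because every class is scored on its own, a prediction can be confident for several classes at once, which Softmax would not allow.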

DFL Loss (Distribution Focal Loss)

In YOLOv8/YOLO11, the distance from the anchor point to each of the four box edges is modeled as a discrete probability distribution $\mathcal{S}$. For a continuous target distance $y$ lying between the neighboring bins $y_i \le y \le y_{i+1}$:

$$ DFL(\mathcal{S}_i, \mathcal{S}_{i+1}) = -\left[(y_{i+1} - y) \cdot \log(\mathcal{S}_i) + (y - y_i) \cdot \log(\mathcal{S}_{i+1})\right] $$

Each edge is represented by 16 discrete values, allowing the model to express localization uncertainty. The DFL loss typically starts around 3 to 5 and gradually drops below 1 as training progresses.
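The idea can be sketched in a few lines of pure Python: the continuous target distance is split between its two neighboring bins, and the loss is the cross-entropy against that two-bin target. The names `dfl_loss`, `expected_distance`, and the `reg_max=16` default are assumptions for this illustration, not the library's code.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def dfl_loss(logits, target, reg_max=16):
    """Cross-entropy against the two bins that bracket the continuous target."""
    target = min(max(target, 0.0), reg_max - 1 - 0.01)  # keep right bin in range
    left = int(target)
    right = left + 1
    w_left = right - target   # weight on the left bin
    w_right = target - left   # weight on the right bin
    probs = softmax(logits)
    return -(w_left * math.log(probs[left]) + w_right * math.log(probs[right]))


def expected_distance(logits):
    """Decode an edge distance as the expectation over the bins."""
    probs = softmax(logits)
    return sum(k * p for k, p in enumerate(probs))


uniform = [0.0] * 16
peaked = [0.0] * 16
peaked[3], peaked[4] = 2.0, 1.0  # mass concentrated near the target 3.25
print(dfl_loss(uniform, 3.25), dfl_loss(peaked, 3.25), expected_distance(uniform))
```

A distribution concentrated around the target yields a lower loss than a uniform one, while the decoded distance is simply the expected bin index.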

YOLO26’s ProgLoss

YOLO26 removes DFL and introduces ProgLoss (Progressive Loss Balancing):

python
# YOLO26 loss function features
# ❌ DFL removed — simpler training, fewer hyperparameters
# ✅ ProgLoss: dynamic box/cls loss weight balancing
#     - Early stage: focus on classification (cls weight increased)
#     - Later stage: focus on box refinement (box weight increased)
# ✅ Smoother loss curves, more intuitive tuning

ProgLoss employs a Curriculum Learning strategy: the model first learns “what” (classification), then “where” (localization), preventing early localization loss from overwhelming classification learning.

ProgLoss weight schedule (conceptual):
Epoch  0-10:  cls=1.0, box=5.0  → Focus on classification
Epoch 10-30:  cls=1.0, box=7.5  → Balanced learning
Epoch 30+:    cls=0.5, box=7.5  → Focus on localization
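Purely as an illustration of the conceptual schedule above, the stage boundaries can be written as a lookup function. This is not YOLO26's actual implementation; the function names and the hard step boundaries are assumptions taken directly from the conceptual table.

```python
def conceptual_progloss_weights(epoch):
    """Return the illustrative (cls, box) weights for a given epoch."""
    if epoch < 10:
        return {"cls": 1.0, "box": 5.0}   # focus on classification
    if epoch < 30:
        return {"cls": 1.0, "box": 7.5}   # balanced learning
    return {"cls": 0.5, "box": 7.5}       # focus on localization


def weighted_loss(epoch, box_loss, cls_loss):
    """Combine raw loss terms using the stage weights."""
    w = conceptual_progloss_weights(epoch)
    return w["box"] * box_loss + w["cls"] * cls_loss


for epoch in (0, 15, 40):
    print(epoch, conceptual_progloss_weights(epoch))
```

A real schedule would more plausibly interpolate between stages rather than jump, but the step version is enough to show how the emphasis shifts from classification to localization.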

Loss Weight Tuning Recommendations

Scenario	box weight	cls weight	Description
General detection	7.5	0.5	Ultralytics default
Small objects	10.0	0.3	Improve box precision
Classification priority	5.0	1.0	Reduce false positives
High IoU requirement	12.0	0.3	Improve localization

Training Tuning Best Practices

Learning Rate Tuning

python
# Small dataset: reduce learning rate
# (lr0 takes effect only with an explicit optimizer; optimizer="auto" chooses its own)
model.train(
    lr0=0.001,    # Reduce from 0.01 to 0.001
    epochs=200,   # Increase epochs
    cos_lr=True,  # Use cosine annealing
)

# Large dataset: standard configuration
model.train(
    lr0=0.01,
    epochs=100,
    warmup_epochs=3,
)
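To see what cos_lr=True and lrf do together, here is a minimal sketch of a cosine schedule that decays from lr0 to lr0 × lrf. The function name `cosine_lr` is hypothetical, and the formula is the common cosine form rather than a copy of the library's scheduler; warmup is ignored.

```python
import math


def cosine_lr(epoch, epochs, lr0=0.01, lrf=0.01):
    """Learning rate at `epoch`, decaying from lr0 to lr0 * lrf along a cosine."""
    progress = (1 - math.cos(math.pi * epoch / epochs)) / 2  # goes 0 -> 1
    return lr0 * (1.0 + progress * (lrf - 1.0))


# Small-dataset settings from above: lr0=0.001 over 200 epochs
for epoch in (0, 50, 100, 150, 200):
    print(epoch, f"{cosine_lr(epoch, 200, lr0=0.001, lrf=0.01):.6f}")
```

The curve is flat at the start and end and steepest in the middle, so the model spends longer at both high and very low learning rates than it would under a linear decay.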

Batch Size Selection

GPU VRAM	Recommended batch	Model size
4GB	8	n/s
8GB	16	n/s/m
12GB	32	n/s/m/l
24GB	64	All

Multi-GPU Training

python
# Specify multiple GPUs
model.train(
    device=[0, 1, 2, 3],   # Use 4 GPUs
    batch=64,               # Total batch, split across the GPUs (16 per GPU here)
    workers=8,              # Dataloader worker processes
)

DDP (Distributed Data Parallel)

Ultralytics uses PyTorch’s DistributedDataParallel for multi-GPU training. DDP replicates the model on each GPU, computes gradients independently per process, then synchronizes via AllReduce:

python
# DDP auto-enabled (no manual configuration needed)
model.train(
    device=[0, 1, 2, 3],
    batch=64,
)
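The synchronization step described above can be illustrated without any GPU. The sketch below simulates the averaging that AllReduce performs on per-replica gradients. It is a pure-Python analogy with hypothetical function names; real DDP does this with collective operations in torch.distributed.

```python
def all_reduce_mean(per_gpu_grads):
    """Average each gradient element across replicas (what AllReduce achieves)."""
    n = len(per_gpu_grads)
    return [sum(vals) / n for vals in zip(*per_gpu_grads)]


def ddp_step(weights, per_gpu_grads, lr):
    """Every replica applies the same averaged gradient, so weights stay identical."""
    avg = all_reduce_mean(per_gpu_grads)
    return [w - lr * g for w, g in zip(weights, avg)]


# Two replicas computed different gradients on their own mini-batches
grads = [[1.0, 2.0], [3.0, 4.0]]
print(all_reduce_mean(grads))
print(ddp_step([1.0, 1.0], grads, lr=0.1))
```

Because all replicas start from the same weights and apply the same averaged gradient, they remain in lockstep without ever exchanging the weights themselves.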

Batch Size Scaling Rules

GPUs	Per-GPU batch	Total batch	Learning rate
1	16	16	lr0=0.01 (baseline)
2	16	32	lr0=0.02
4	16	64	lr0=0.04
8	16	128	lr0=0.08

Linear Scaling Rule: When training on multiple GPUs, scale the learning rate linearly with the total batch size: target LR = baseline lr0 × (total batch / baseline batch), so 4 GPUs at 16 images each give 0.01 × (64 / 16) = 0.04. Use warmup_epochs to ramp up to the target LR over the first few epochs, which prevents early gradient instability.
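The rule and the warmup ramp are simple enough to express directly. This is a minimal sketch with hypothetical helper names, assuming a linear ramp from zero; it reproduces the values in the table above.

```python
def scaled_lr(base_lr, base_batch, total_batch):
    """Linear scaling rule: LR grows in proportion to the total batch size."""
    return base_lr * total_batch / base_batch


def warmup_lr(step, warmup_steps, target_lr, start_lr=0.0):
    """Linearly ramp from start_lr to target_lr over warmup_steps."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


for gpus in (1, 2, 4, 8):
    print(gpus, scaled_lr(0.01, base_batch=16, total_batch=16 * gpus))

target = scaled_lr(0.01, 16, 64)
print([round(warmup_lr(s, 100, target), 4) for s in (0, 50, 100)])
```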

Important Notes

Batch Normalization: PyTorch DDP does not synchronize BN statistics on its own; each process normalizes over its local per-GPU batch unless the model is converted to SyncBatchNorm, so keep the per-GPU batch reasonably large
Gradient Accumulation: Ultralytics accumulates gradients automatically toward the nominal batch size nbs (default 64), which simulates a larger batch when VRAM is limited
Memory Balance: Keep batch size and image size consistent across all GPUs to avoid stragglers slowing down training
Seeded Synchronization: Set a fixed seed (for example seed=42) so runs are reproducible across all GPUs

Early Stopping and Resume Training

python
# Enable early stopping
model.train(
    patience=50,    # Stop after 50 epochs without improvement
    ...
)

# Resume interrupted training
model = YOLO("runs/train/exp/weights/last.pt")
model.train(resume=True)
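The patience logic behind early stopping can be sketched as follows. This is an illustration of the idea with hypothetical names (`EarlyStopper`, `stop_epoch`), not the trainer's own class, and it treats a higher fitness value as better.

```python
class EarlyStopper:
    """Stop when fitness has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best_fitness = float("-inf")
        self.best_epoch = 0

    def step(self, epoch, fitness):
        if fitness > self.best_fitness:
            self.best_fitness = fitness
            self.best_epoch = epoch
        return (epoch - self.best_epoch) >= self.patience


def stop_epoch(fitness_history, patience):
    """Return the epoch at which training would stop, or None if it never does."""
    stopper = EarlyStopper(patience)
    for epoch, fitness in enumerate(fitness_history):
        if stopper.step(epoch, fitness):
            return epoch
    return None


history = [0.10, 0.20, 0.30, 0.30, 0.29, 0.28, 0.27]
print(stop_epoch(history, patience=3))
```

Here the best fitness is reached at epoch 2, and a tie at epoch 3 does not reset the counter, so training stops three epochs after the last strict improvement.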

Training Visualization Analysis

Hyperparameter Search

Using Ultralytics Auto Tuning

python
from ultralytics import YOLO

model = YOLO("yolo26n.pt")

# Start hyperparameter search
result = model.tune(
    data="data.yaml",
    epochs=50,
    iterations=100,         # Number of search iterations
    optimizer="AdamW",      # AdamW recommended for more stable search
    device=0,
    batch=16,
    plots=True,             # Plot tuning process
    save=True,              # Save each trial result
)

Search Space Configuration

Ultralytics’ built-in default search space covers 10+ hyperparameters:

yaml
# Example search space (the built-in defaults are defined in code in ultralytics/engine/tuner.py)
lr0: [0.0001, 0.01]         # Initial learning rate
lrf: [0.0001, 0.1]          # Final learning rate factor
momentum: [0.7, 0.99]       # SGD momentum
weight_decay: [0.0, 0.002]  # Weight decay
warmup_epochs: [0, 5]       # Warmup epochs
warmup_momentum: [0.0, 0.95]
mosaic: [0.0, 1.0]          # Mosaic augmentation probability
mixup: [0.0, 0.5]           # MixUp augmentation probability
copy_paste: [0.0, 0.5]      # Copy-Paste augmentation probability
hsv_h: [0.0, 0.1]           # HSV hue augmentation
hsv_s: [0.0, 0.9]           # HSV saturation augmentation
hsv_v: [0.0, 0.9]           # HSV value augmentation
flipud: [0.0, 1.0]          # Vertical flip
fliplr: [0.0, 1.0]          # Horizontal flip

Integration with Ray Tune

For larger-scale hyperparameter searches, Ultralytics supports Ray Tune integration:

bash
pip install "ray[tune]"

python
from ray import tune
from ultralytics import YOLO

# Ray Tune search space
search_space = {
    "lr0": tune.loguniform(1e-4, 1e-2),
    "lrf": tune.loguniform(1e-4, 1e-1),
    "momentum": tune.uniform(0.7, 0.99),
    "weight_decay": tune.uniform(0.0, 0.002),
    "mosaic": tune.uniform(0.0, 1.0),
}

# Start Ray Tune search
model = YOLO("yolo26n.pt")
result_grid = model.tune(
    data="data.yaml",
    epochs=50,
    iterations=100,              # Number of trials to sample
    use_ray=True,                # Enable Ray Tune
    space=search_space,          # Pass the custom search space defined above
    gpu_per_trial=1,             # GPUs allocated to each trial; trials run in parallel
)
# The Ray integration schedules trials with ASHA (Async Successive Halving)

Interpreting Search Results

Tuning generates a runs/tune/ directory containing:

runs/tune/
├── exp1/            # Full training results per trial
├── exp2/
├── ...
├── tune_results.csv # Summary of all hyperparameters and metrics
└── tune_scatter.png # Hyperparameter correlation scatter plot

Interpretation tips:

Learning rate vs mAP: In the lr0 scatter plot, the range of lr0 values where the highest-fitness points cluster indicates the optimal LR range
Augmentation vs Overfitting: If higher mosaic/mixup values correlate with higher val mAP, data augmentation is effective
Weight decay sensitivity: Small datasets are typically more sensitive to weight decay; optimal values range around 0.0003 to 0.0007
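For a quick look at the summary CSV without plotting, a few lines of standard-library Python are enough. The sketch below uses an in-memory sample so it runs anywhere. The column names (including `fitness`) and all values are made up for illustration, so check the header of your own tune_results.csv before reusing it.

```python
import csv
import io

# Synthetic sample standing in for tune_results.csv (values are illustrative only)
SAMPLE_CSV = """fitness,lr0,weight_decay,mosaic
0.412,0.0100,0.0005,1.0
0.437,0.0032,0.0004,0.8
0.398,0.0008,0.0010,0.5
"""


def best_trial(csv_text, metric="fitness"):
    """Return the row with the highest `metric`, with values parsed as floats."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    best = max(rows, key=lambda r: float(r[metric]))
    return {key.strip(): float(value) for key, value in best.items()}


print(best_trial(SAMPLE_CSV))
```

To use it on a real run, read the file contents with `open(path).read()` and pass the text to `best_trial`.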

Overfitting Diagnosis and Regularization

Spotting Overfitting from Loss Curves

When training exhibits the following signals, the model may be overfitting:

Loss curves:
📈 Train Loss:  Continuously decreasing
📉 Val Loss:    Decreasing then rising (inflection = overfitting starts)
📊 mAP:         Stagnates or decreases on validation set

Key indicator: When training loss continues to decrease but validation loss starts rising, you have a Generalization Gap — the clearest sign of overfitting.
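The inflection point can also be located programmatically from the logged losses. The following sketch is one simple heuristic with a hypothetical function name and an assumed `window` parameter. It flags the epoch of minimum validation loss once validation loss has stayed above that minimum for several epochs while training loss kept falling.

```python
def overfitting_onset(train_loss, val_loss, window=3):
    """Return the epoch where val loss bottomed out, or None if no overfitting signal."""
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    tail = range(best_epoch + 1, len(val_loss))
    if len(tail) < window:
        return None  # not enough epochs after the minimum to judge
    val_rising = all(val_loss[e] > val_loss[best_epoch] for e in tail)
    train_falling = train_loss[-1] < train_loss[best_epoch]
    return best_epoch if (val_rising and train_falling) else None


train = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
val_overfit = [1.1, 0.9, 0.8, 0.85, 0.9, 0.95]
val_healthy = [1.1, 0.9, 0.8, 0.7, 0.6, 0.5]
print(overfitting_onset(train, val_overfit))
print(overfitting_onset(train, val_healthy))
```

Real loss curves are noisy, so in practice it helps to smooth them first or to use a larger window.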

Regularization Techniques

python
# Apply multiple regularization strategies in training
model.train(
    # Explicit regularization
    weight_decay=0.001,       # L2 regularization (weight decay)
    dropout=0.1,              # Dropout probability (documented for classification training only)

    # Implicit regularization (data augmentation)
    mosaic=1.0,
    mixup=0.2,                # Mix two images
    copy_paste=0.2,           # Copy-paste augmentation

    # Training control
    patience=30,              # Early stopping
)

L2 Regularization (Weight Decay): Penalizes large weights, preventing the model from relying too heavily on a few features
Dropout: Randomly drops neurons, forcing the model to learn redundant features (in Ultralytics, the dropout argument is documented for classification training only)
Data Augmentation: Mosaic, MixUp, Copy-Paste implicitly expand the training distribution
Label Smoothing: Controlled via label_smoothing, prevents the model from becoming overconfident

Data Augmentation Tuning

Different datasets benefit from different augmentation strategies:

Dataset Type	Recommended Augmentation	Description
Small (<1000 images)	mosaic=1.0, mixup=0.3	Strong augmentation for diversity
Medium dataset	mosaic=1.0, mixup=0.1	Moderate augmentation
Large (>10K images)	mosaic=0.5, mixup=0.0	Weak augmentation, preserve distribution
Dense scenes	copy_paste=0.3	Improve occlusion detection
Aerial/satellite	mosaic=1.0, flipud=0.5	Leverage rotation invariance

Early Stopping Tuning

python
model.train(
    patience=50,          # Ultralytics default is 100; small datasets: 20 to 30
    save_period=10,       # Save every N epochs for rollback
    cos_lr=True,          # Cosine annealing with early stopping
    lrf=0.001,            # Final LR close to zero
    warmup_epochs=3,      # Prevent premature early stopping
)

When overfitting is severe, the primary remedy is increasing data quantity or reducing model complexity (choosing a smaller model scale, such as s→n). Regularization is a secondary aid.

Result Files Description

After training completes, the runs/train/exp/ directory contains:

weights/
├── best.pt          # Best weights
└── last.pt          # Last epoch weights
results.csv          # Training log CSV
results.png          # Loss curves
confusion_matrix.png # Confusion matrix
PR_curve.png         # PR curve
F1_curve.png         # F1 curve

Key Metrics Interpretation

Box Loss: Detection box regression loss → Should continuously decrease
Cls Loss: Classification loss → Should continuously decrease
mAP50: Mean average precision at IoU=0.5 → Should continuously increase
mAP50-95: Mean average precision averaged over IoU 0.5 to 0.95 → Core metric

Validation Metrics Deep Dive

Precision and Recall

YOLO’s training metrics are based on the following definitions:

Metric	Formula	Meaning
Precision	TP / (TP + FP)	Of all positive predictions, how many are correct
Recall	TP / (TP + FN)	Of all actual positives, how many were found
F1-Score	2 × P × R / (P + R)	Harmonic mean of Precision and Recall

These metrics vary with the Confidence threshold — higher thresholds boost Precision but reduce Recall, and vice versa.
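A short worked example makes the threshold trade-off visible. In this sketch each detection is a (confidence, is_true_positive) pair that has already been matched against ground truth. The helper names and the toy detections are assumptions for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def counts_at_threshold(detections, num_gt, conf):
    """Count TP/FP/FN when only detections with confidence >= conf are kept."""
    kept = [is_tp for c, is_tp in detections if c >= conf]
    tp = sum(kept)
    fp = len(kept) - tp
    fn = num_gt - tp
    return tp, fp, fn


# Toy data: 5 detections for an image set with 4 ground-truth objects
detections = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.3, False)]
for conf in (0.25, 0.5):
    p, r, f1 = precision_recall_f1(*counts_at_threshold(detections, 4, conf))
    print(f"conf={conf}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Raising the threshold from 0.25 to 0.5 drops two low-confidence detections, which lifts Precision and lowers Recall.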

Confusion Matrix

confusion_matrix.png is the essential tool for diagnosing model behavior:

               Actual Class
             ┌────┬────────────┐
   Predicted │ TP │ FP         │
   Class     ├────┼────────────┤
             │ FN │ background │
             └────┴────────────┘

Diagonal: Correctly classified samples (more is better)
Off-diagonal: Class confusion (e.g., predicting “dog” as “cat”) → needs more discriminative features
Background FP: Model falsely detects background as object → add negative samples or raise confidence threshold
Missed Detection (FN): Target not detected → lower confidence threshold or add small object detection heads
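The same readings can be computed from the raw matrix. The sketch below assumes rows are predictions and columns are ground truth, with a final "background" row and column, which matches the TP/FP/FN cell arrangement above. The function name and the toy counts are illustrative.

```python
def per_class_precision_recall(matrix, class_names):
    """matrix[predicted][actual]; the last row/column is background."""
    stats = {}
    for i, name in enumerate(class_names):
        tp = matrix[i][i]
        fp = sum(matrix[i]) - tp                  # row i: everything predicted as class i
        fn = sum(row[i] for row in matrix) - tp   # column i: everything that is class i
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        stats[name] = (precision, recall)
    return stats


# Toy matrix: columns are actual cat, dog, background
EXAMPLE = [
    [8, 1, 3],  # predicted cat: 8 correct, 1 dog mistaken for cat, 3 background FPs
    [2, 9, 0],  # predicted dog: 2 cats mistaken for dog, 9 correct
    [1, 0, 0],  # predicted background: 1 missed cat
]
for name, (p, r) in per_class_precision_recall(EXAMPLE, ["cat", "dog"]).items():
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
```

In this toy case the low cat precision is driven mainly by background false positives, while the lower cat recall comes from cats predicted as dogs.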

mAP Calculation Walkthrough

mAP (Mean Average Precision) is YOLO’s core evaluation metric:

Sort all detections by Confidence score
Compute Precision and Recall at each Confidence threshold
Plot the PR curve (Precision-Recall Curve); the area under the curve is AP
mAP50: Average AP across all classes at IoU threshold 0.5
mAP50-95: Average AP across 10 IoU thresholds from 0.5 to 0.95 (step 0.05)

mAP50     = Lenient evaluation: rough location is sufficient
mAP50-95  = Strict evaluation: boxes must tightly fit objects
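The walkthrough above can be sketched for a single class at a single IoU threshold. This version integrates the exact area under the precision envelope. Evaluation toolkits commonly sample the curve at fixed recall points instead, so absolute values may differ slightly. All names and the toy detections are assumptions for illustration.

```python
def average_precision(detections, num_gt):
    """AP from (confidence, is_true_positive) pairs at one IoU threshold."""
    dets = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in dets:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))

    # Precision envelope: make precision non-increasing as recall grows
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the PR curve as a sum of recall increments times precision
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# The 10 IoU thresholds used by mAP50-95
IOU_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]


def mean_ap(ap_values):
    """Average AP over classes (or over IoU thresholds)."""
    return sum(ap_values) / len(ap_values)


detections = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.3, False)]
print(average_precision(detections, num_gt=4))
```

For mAP50-95, the true-positive flags would be recomputed at each threshold in `IOU_THRESHOLDS`, since a box that counts as a match at IoU 0.5 may fail at 0.75, and the resulting AP values averaged.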

Diagnosing Model Issues with Metrics

Observation	Possible Cause	Solution
High Precision, Low Recall	Many missed detections	Lower confidence threshold, add small object data
High Recall, Low Precision	Many false positives	Raise confidence threshold, add negative samples
High mAP50, Low mAP50-95	Imprecise boxes	Increase box loss weight, refine annotations
Class A AP much lower	Class imbalance	Add class A samples or use Class Weights
PR curve sudden drop	Hard example cluster	Hard Negative Mining, data augmentation

F1-Confidence Curve

F1_curve.png shows the F1 score across different Confidence thresholds. The peak of this curve gives the optimal threshold for the model:

python
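# Find the confidence threshold that maximizes mean F1
# (assumes an Ultralytics release that exposes f1_curve and px on metrics.box)
results = model.val()
mean_f1 = results.box.f1_curve.mean(axis=0)   # Average F1 over classes per confidence value
best_conf = results.box.px[mean_f1.argmax()]  # Confidence at the F1 peak
print(f"Optimal confidence threshold: {best_conf:.3f}")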
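Independent of any library, the peak of the F1 curve can be found by scanning thresholds over matched detections. The sketch below uses a hypothetical function name and toy (confidence, is_true_positive) pairs, with a grid of 101 thresholds chosen for illustration.

```python
def best_f1_threshold(detections, num_gt, steps=101):
    """Scan confidence thresholds in [0, 1] and return (best_conf, best_f1)."""
    best_conf, best_f1 = 0.0, 0.0
    for i in range(steps):
        conf = i / (steps - 1)
        kept = [is_tp for c, is_tp in detections if c >= conf]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = num_gt - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # keep the lowest threshold that reaches the peak
            best_conf, best_f1 = conf, f1
    return best_conf, best_f1


detections = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.3, False)]
best_conf, best_f1 = best_f1_threshold(detections, num_gt=4)
print(f"best_conf={best_conf:.2f}, best_f1={best_f1:.3f}")
```

In this toy data the peak sits just above the confidence of the lowest-scoring false positive: dropping that one detection improves Precision without costing any Recall.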

YOLO26 Specific Training Recipe

MuSGD Optimizer

YOLO26 defaults to MuSGD, which Ultralytics describes as a hybrid of SGD and the Muon optimizer, as an improvement over standard SGD:

Adaptive momentum scheduling: Momentum adjusts dynamically based on gradient variance during training
Time-aware learning rate: Integrates a cosine annealing variant — no need to set cos_lr=True separately
Faster convergence: MuSGD typically reaches equivalent accuracy 10~15% faster than standard SGD given the same epochs

python
# YOLO26 recommended training configuration
model.train(
    optimizer="auto",         # Auto-selects MuSGD
    lr0=0.01,                 # MuSGD is LR-insensitive, 0.005~0.02 works
    momentum=0.937,
    weight_decay=0.0005,
    cos_lr=False,             # MuSGD has built-in scheduling, no extra cosine annealing needed
)

Progressive Loss Balancing (ProgLoss)

YOLO26’s ProgLoss automatically manages box/cls loss weight scheduling without manual tuning:

Training stage          Focus                    Loss behavior
──────────────────────────────────────────────────────────────
Epoch 1-10  (warmup)   Classification           cls loss drops fast
Epoch 10-30 (explore)  Classification + box     box loss starts decreasing significantly
Epoch 30+   (refine)   Box refinement           box loss converges slowly

If you observe cls loss not decreasing for a long time in the training log, the default ProgLoss schedule may not suit your dataset. You can fall back to manual mode:

python
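# Manually set the loss gains with the standard box/cls training arguments.
# A dedicated switch for disabling ProgLoss is version-dependent; check the
# training arguments of your installed Ultralytics release before relying on one.
model.train(
    data="data.yaml",
    box=7.5,   # Box regression loss gain
    cls=0.5,   # Classification loss gain
)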

STAL (Small Target Augmentation Layer)

YOLO26’s STAL is specifically designed to optimize small target detection:

python
# Enable STAL (enabled by default in YOLO26)
model.train(
    stal=True,                     # Small target augmentation layer
    stal_scale_range=[0.5, 1.0],  # Scale range
    stal_epoch_gamma=0.1,         # Augmentation decay coefficient
)

How STAL works:

Detection phase: Identifies targets smaller than 32×32 pixels in the image
Augmentation phase: Upsamples, copies, and pastes these small targets to other image regions
Decay strategy: STAL intensity gradually decreases during training, allowing the model to adapt back to the original distribution

YOLO26 Training Checklist

python
# ===== YOLO26 Recommended Complete Training Configuration =====
model.train(
    # Data
    data="data.yaml",
    epochs=200,
    imgsz=640,
    batch=16,

    # Optimizer (MuSGD)
    optimizer="auto",
    lr0=0.01,
    momentum=0.937,
    weight_decay=0.0005,

    # Augmentation (STAL auto-enabled)
    mosaic=1.0,
    mixup=0.1,
    copy_paste=0.1,

    # Regularization
    dropout=0.1,
    label_smoothing=0.0,

    # Other
    device=0,
    patience=50,
    seed=42,
    verbose=True,
)

Tip: When deploying YOLO26 on edge devices (Jetson, Raspberry Pi, etc.), use imgsz=320 or imgsz=480: inference speed improves 2 to 3× while mAP drops only 1 to 3%.

Part of series: AI Engineering Series
