YOLO Model Training: Complete Custom Dataset Workflow
Complete Custom Dataset Training Process
Ultralytics Unified Training Code
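A minimal sketch of a unified training entry point, assuming the ultralytics package is installed and that data.yaml (a placeholder name) describes the custom dataset. The helper names build_train_args and train are illustrative and not part of Ultralytics; only YOLO(...) and model.train(...) are library calls.

```python
def build_train_args(data_yaml, epochs=100, imgsz=640, batch=16):
    """Collect the keyword arguments that will be passed to model.train()."""
    return {"data": data_yaml, "epochs": epochs, "imgsz": imgsz, "batch": batch}


def train(weights, data_yaml, **overrides):
    """Load a checkpoint and train it; the same call shape works across versions."""
    from ultralytics import YOLO  # imported lazily so the helper above has no dependencies

    model = YOLO(weights)  # e.g. "yolov8n.pt" or "yolo11n.pt"
    return model.train(**build_train_args(data_yaml, **overrides))
```

A call such as train("yolo11n.pt", "data.yaml", epochs=50) would then start a run; switching model versions only changes the weights file name.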
Training Parameter Differences Across Versions
| Parameter | YOLOv8 | YOLO11 | YOLO26 |
|---|---|---|---|
| Default Optimizer | SGD | SGD | MuSGD |
| DFL Loss | ✅ | ✅ | ❌ Removed |
| NMS Post-processing | ✅ | ✅ | ❌ Not needed (natively NMS-free) |
| Small Object Optimization | Average | Better | Best (STAL) |
| CPU Inference Speed | Baseline | +25% | +43% |
Loss Function Breakdown
YOLO’s loss function consists of three components, each targeting a different learning objective: box regression loss (CIoU), classification loss (BCE), and Distribution Focal Loss (DFL).
Box Regression Loss — CIoU
YOLO uses CIoU (Complete IoU) as the box regression loss. CIoU adds center distance and aspect ratio consistency penalties on top of standard IoU:
$$ L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b_{gt})}{c^2} + \alpha v $$
- $IoU$: Intersection over Union between predicted and ground truth boxes
- $\rho^2(b, b_{gt})$: Squared Euclidean distance between the two box centers
- $c^2$: Squared diagonal length of the smallest enclosing box
- $\alpha v$: Aspect ratio consistency penalty
The key advantage of CIoU over plain IoU: gradients still exist even when the two boxes don’t overlap, enabling continuous learning.
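The CIoU terms can be checked numerically. The following is a minimal sketch in plain Python, assuming boxes in (x1, y1, x2, y2) format and the standard CIoU definitions of $v$ and $\alpha$, which the text above does not spell out. The function name ciou_loss is illustrative.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2)."""
    # Intersection and union areas
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    iou = inter / (w * h + wg * hg - inter + eps)

    # rho^2: squared distance between the two box centers
    dx = (box[0] + box[2]) / 2 - (gt[0] + gt[2]) / 2
    dy = (box[1] + box[3]) / 2 - (gt[1] + gt[3]) / 2
    rho2 = dx ** 2 + dy ** 2

    # c^2: squared diagonal of the smallest enclosing box
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # v measures aspect-ratio mismatch; alpha is its trade-off weight (assumed standard form)
    v = (4 / math.pi ** 2) * (math.atan(wg / (hg + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```

For two disjoint boxes the IoU term is stuck at zero, but the center-distance term still shrinks as the predicted box moves toward the target, which is where the non-zero gradient comes from.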
Classification Loss — BCE Loss
Classification uses Binary Cross-Entropy loss:
$$ L_{cls} = -\sum_{i=1}^{C} [y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i)] $$
YOLO adopts multi-label classification instead of Softmax because a single anchor may contain multiple objects. BCE loss is computed independently per class and summed.
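As a small illustration of the formula above, here is a sketch in plain Python that sums the per-class BCE terms. The clamping constant eps is an assumption added only to keep the logarithm finite.

```python
import math


def bce_loss(targets, probs, eps=1e-12):
    """Sum of independent per-class binary cross-entropy terms."""
    total = 0.0
    for y, p in zip(targets, probs):
        p = min(max(p, eps), 1 - eps)  # clamp to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total
```

Because each class is scored independently, a target such as [1, 1, 0] (two classes present at once) is handled without any change to the function.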
DFL Loss (Distribution Focal Loss)
In YOLOv8/YOLO11, the four box edges are modeled as discrete probability distributions:
$$ DFL(\mathcal{S}_i, \mathcal{S}_{gt}) = -\sum_{k=0}^{15} \text{Cat}(k; y_{gt}) \cdot \log(\mathcal{S}_i[k]) $$
Each edge is represented by 16 discrete values, allowing the model to express localization uncertainty. The DFL loss typically starts around 3~5 and gradually drops below 1 as training progresses.
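A minimal sketch of the DFL computation for a single box edge, assuming the target distribution $\text{Cat}(k; y_{gt})$ places linear-interpolation weights on the two bins adjacent to the target distance (the usual Distribution Focal Loss formulation). The function names are illustrative.

```python
import math


def dfl_loss(probs, target):
    """DFL for one box edge: cross-entropy on the two bins around the target."""
    left = min(int(target), len(probs) - 2)  # lower neighboring bin
    right = left + 1
    w_left, w_right = right - target, target - left  # linear interpolation weights
    return -(w_left * math.log(probs[left]) + w_right * math.log(probs[right]))


def expected_distance(probs):
    """Decode an edge distance as the expectation over the discrete bins."""
    return sum(k * p for k, p in enumerate(probs))
```

A sharply peaked distribution yields a confident edge estimate, while a flat one signals localization uncertainty and produces a larger loss.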
YOLO26’s ProgLoss
YOLO26 removes DFL and introduces ProgLoss (Progressive Loss Balancing).
ProgLoss employs a Curriculum Learning strategy: the model first learns “what” (classification), then “where” (localization), preventing early localization loss from overwhelming classification learning.
Loss Weight Tuning Recommendations
| Scenario | box weight | cls weight | Description |
|---|---|---|---|
| General detection | 7.5 | 0.5 | Ultralytics default |
| Small objects | 10.0 | 0.3 | Improve box precision |
| Classification priority | 5.0 | 1.0 | Reduce false positives |
| High IoU requirement | 12.0 | 0.3 | Improve localization |
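To make the effect of these weights concrete, here is a small sketch that combines raw box and cls losses using the presets from the table. The dictionary keys and the function name are illustrative; in Ultralytics the two weights correspond to the box and cls training arguments.

```python
# Scenario presets copied from the table: (box weight, cls weight)
LOSS_PRESETS = {
    "general": (7.5, 0.5),
    "small_objects": (10.0, 0.3),
    "classification_priority": (5.0, 1.0),
    "high_iou": (12.0, 0.3),
}


def weighted_loss(box_loss, cls_loss, scenario="general"):
    """Combine raw box and cls losses using the preset weights for a scenario."""
    box_w, cls_w = LOSS_PRESETS[scenario]
    return box_w * box_loss + cls_w * cls_loss
```

With identical raw losses, the high_iou preset makes the box term dominate the total far more than the general preset does, which is why it pushes the model toward tighter localization.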
Training Tuning Best Practices
Learning Rate Tuning
Batch Size Selection
| GPU VRAM | Recommended batch | Model size |
|---|---|---|
| 4GB | 8 | n/s |
| 8GB | 16 | n/s/m |
| 12GB | 32 | n/s/m/l |
| 24GB | 64 | All |
Multi-GPU Training
DDP (Distributed Data Parallel)
Ultralytics uses PyTorch’s DistributedDataParallel for multi-GPU training. DDP replicates the model on each GPU, computes gradients independently per process, then synchronizes them via AllReduce so that every replica applies the same update.
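The synchronization step can be pictured with a toy simulation in plain Python. This is only an illustration of what an averaging AllReduce computes, not the torch.distributed implementation, and the function name is hypothetical.

```python
def all_reduce_mean(per_gpu_grads):
    """Average gradients element-wise and hand the same result to every replica."""
    n = len(per_gpu_grads)
    mean = [sum(vals) / n for vals in zip(*per_gpu_grads)]
    return [list(mean) for _ in range(n)]
```

For example, two replicas holding gradients [1.0, 2.0] and [3.0, 4.0] both end up with [2.0, 3.0], so their weights stay identical after the optimizer step.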
Batch Size Scaling Rules
| GPUs | Per-GPU batch | Total batch | Learning rate |
|---|---|---|---|
| 1 | 16 | 16 | lr0=0.01 (baseline) |
| 2 | 16 | 32 | lr0=0.02 |
| 4 | 16 | 64 | lr0=0.04 |
| 8 | 16 | 128 | lr0=0.08 |
Linear Scaling Rule: When training on multiple GPUs, the learning rate should increase linearly with the total batch size. Starting LR = baseline lr0 × (total batch / single-GPU batch). Use warmup_epochs to gradually ramp up to the target LR over the first few epochs, preventing early gradient instability.
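The rule reduces to one line of arithmetic. The sketch below assumes a simplified warmup that ramps linearly from zero per epoch; the actual Ultralytics warmup schedule may differ in detail, and both function names are illustrative.

```python
def scaled_lr(base_lr, base_batch, total_batch):
    """Linear scaling rule: LR grows in proportion to the total batch size."""
    return base_lr * total_batch / base_batch


def warmup_lr(target_lr, epoch, warmup_epochs):
    """Linearly ramp from 0 to target_lr over the first warmup_epochs epochs."""
    if warmup_epochs <= 0 or epoch >= warmup_epochs:
        return target_lr
    return target_lr * epoch / warmup_epochs
```

With the baseline from the table (lr0=0.01 at batch 16), scaled_lr(0.01, 16, 64) gives the 4-GPU value of 0.04.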
Important Notes
- Batch Normalization: Plain DDP computes BN statistics per process on each GPU's local batch; synchronizing them across GPUs requires converting the layers to torch.nn.SyncBatchNorm, so keep the per-GPU batch large enough for stable estimates
- Gradient Accumulation: Use gradient accumulation (the accumulate setting, which Ultralytics derives from the nominal batch size nbs) to simulate a larger batch when VRAM is limited
- Memory Balance: Keep batch size and image size consistent across all GPUs to avoid stragglers slowing down training
- Seeded Synchronization: Set seed=42 to ensure consistent initialization across all GPUs
Early Stopping and Resume Training
Training Visualization Analysis
Hyperparameter Search
Using Ultralytics Auto Tuning
Search Space Configuration
Ultralytics’ built-in default search space covers more than 10 hyperparameters.
Integration with Ray Tune
For larger-scale hyperparameter searches, Ultralytics supports Ray Tune integration.
Interpreting Search Results
The search writes its results to the runs/tune/ directory.
Interpretation tips:
- Learning rate vs mAP: Dense clusters in the upper-left region of scatter plots indicate optimal LR ranges
- Augmentation vs Overfitting: If higher mosaic/mixup values correlate with higher val mAP, data augmentation is effective
- Weight decay sensitivity: Small datasets are typically more sensitive to weight decay; optimal values range around 0.0003~0.0007
Overfitting Diagnosis and Regularization
Spotting Overfitting from Loss Curves
Loss curves are the first place to look for overfitting: compare the training and validation losses epoch by epoch.
Key indicator: When training loss continues to decrease but validation loss starts rising, you have a Generalization Gap — the clearest sign of overfitting.
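The key indicator can be checked programmatically from logged loss values. The following is a rough sketch, assuming per-epoch lists of training and validation loss; the window size of 3 and the function name are arbitrary choices for illustration.

```python
def find_overfit_epoch(train_loss, val_loss, window=3):
    """Return the first epoch where val loss rose for `window` epochs in a row
    while train loss kept falling, or None if that never happens."""
    for end in range(window, len(val_loss)):
        steps = range(end - window + 1, end + 1)
        val_rising = all(val_loss[i] > val_loss[i - 1] for i in steps)
        train_falling = all(train_loss[i] < train_loss[i - 1] for i in steps)
        if val_rising and train_falling:
            return end - window + 1
    return None
```

Requiring several consecutive epochs avoids flagging a single noisy validation step as overfitting.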
Regularization Techniques
- L2 Regularization (Weight Decay): Penalizes large weights, preventing the model from relying too heavily on a few features
- Dropout: Randomly drops neurons, forcing the model to learn redundant features
- Data Augmentation: Mosaic, MixUp, Copy-Paste implicitly expand the training distribution
- Label Smoothing: Controlled via label_smoothing, prevents the model from becoming overconfident
Data Augmentation Tuning
Different datasets benefit from different augmentation strategies:
| Dataset Type | Recommended Augmentation | Description |
|---|---|---|
| Small (<1000 images) | mosaic=1.0, mixup=0.3 | Strong augmentation for diversity |
| Medium dataset | mosaic=1.0, mixup=0.1 | Moderate augmentation |
| Large (>10K images) | mosaic=0.5, mixup=0.0 | Weak augmentation, preserve distribution |
| Dense scenes | copy_paste=0.3 | Improve occlusion detection |
| Aerial/satellite | mosaic=1.0, flipud=0.5 | Leverage rotation invariance |
Early Stopping Tuning
When overfitting is severe, the primary remedy is increasing data quantity or reducing model complexity (choosing a smaller model scale, e.g., s→n). Regularization is a secondary aid.
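Patience-based early stopping, the mechanism tuned in this section, can be sketched in a few lines. This is an illustrative stand-alone class, not the Ultralytics implementation; the default patience of 50 is an arbitrary value for the sketch, and fitness is assumed to be a higher-is-better validation metric such as mAP.

```python
class EarlyStopper:
    """Stop when the monitored metric has not improved for `patience` epochs."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best = float("-inf")
        self.best_epoch = 0

    def step(self, epoch, fitness):
        """Record this epoch's fitness (higher is better); return True to stop."""
        if fitness > self.best:
            self.best, self.best_epoch = fitness, epoch
        return epoch - self.best_epoch >= self.patience
```

A smaller patience stops overfitting runs sooner but risks cutting off a model that is still improving slowly, so it is usually set relative to the total number of epochs.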
Result Files Description
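After training completes, the runs/train/exp/ directory contains the training outputs, including the diagnostic plots discussed in the following sections (such as confusion_matrix.png and F1_curve.png).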
Key Metrics Interpretation
- Box Loss: Detection box regression loss → Should continuously decrease
- Cls Loss: Classification loss → Should continuously decrease
- mAP50: Average precision at IoU=0.5 → Should continuously increase
- mAP50-95: Average precision at IoU 0.5~0.95 → Core metric
Validation Metrics Deep Dive
Precision and Recall
YOLO’s training metrics are based on the following definitions:
| Metric | Formula | Meaning |
|---|---|---|
| Precision | TP / (TP + FP) | Of all positive predictions, how many are correct |
| Recall | TP / (TP + FN) | Of all actual positives, how many were found |
| F1-Score | 2 × P × R / (P + R) | Harmonic mean of Precision and Recall |
These metrics vary with the Confidence threshold — higher thresholds boost Precision but reduce Recall, and vice versa.
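The three formulas in the table translate directly into code. This sketch takes raw TP/FP/FN counts at one confidence threshold; returning 0.0 for empty denominators is a convention chosen here, not something the text specifies.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall, and F1 from raw detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Raising the confidence threshold typically removes false positives faster than true positives, which increases precision while recall falls.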
Confusion Matrix
confusion_matrix.png is the essential tool for diagnosing model behavior:
- Diagonal: Correctly classified samples (more is better)
- Off-diagonal: Class confusion (e.g., predicting “dog” as “cat”) → needs more discriminative features
- Background FP: Model falsely detects background as object → add negative samples or raise confidence threshold
- Missed Detection (FN): Target not detected → lower confidence threshold or add small object detection heads
mAP Calculation Walkthrough
mAP (Mean Average Precision) is YOLO’s core evaluation metric:
- Sort all detections by Confidence score
- Compute Precision and Recall at each Confidence threshold
- Plot the PR curve (Precision-Recall Curve); the area under the curve is AP
- mAP50: Average AP across all classes at IoU threshold 0.5
- mAP50-95: Average AP across 10 IoU thresholds from 0.5 to 0.95 (step 0.05)
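The steps above can be sketched for a single class at a single IoU threshold. This version assumes each detection has already been matched and labeled as a true or false positive, and it uses all-point interpolation of the PR curve; real evaluators may sample the curve differently, so values can differ slightly.

```python
def average_precision(detections, num_gt):
    """AP from (confidence, is_true_positive) pairs using all-point interpolation."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in detections:
        tp += is_tp
        fp += not is_tp
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))

    # Add sentinels, then make precision monotonically non-increasing
    mrec = [0.0] + recalls + [1.0]
    mpre = [0.0] + precisions + [0.0]
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])

    # Area under the PR curve: sum rectangle areas where recall changes
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))


# The 10 IoU thresholds averaged by mAP50-95
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]
```

mAP50 is the mean of this value over classes at IoU 0.5, and mAP50-95 additionally averages over the ten thresholds in IOU_THRESHOLDS.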
Diagnosing Model Issues with Metrics
| Observation | Possible Cause | Solution |
|---|---|---|
| High Precision, Low Recall | Many missed detections | Lower confidence threshold, add small object data |
| High Recall, Low Precision | Many false positives | Raise confidence threshold, add negative samples |
| High mAP50, Low mAP50-95 | Imprecise boxes | Increase box loss weight, refine annotations |
| Class A AP much lower | Class imbalance | Add class A samples or use Class Weights |
| PR curve sudden drop | Hard example cluster | Hard Negative Mining, data augmentation |
F1-Confidence Curve
F1_curve.png shows the F1 score across different Confidence thresholds. The peak of this curve gives the optimal threshold for the model.
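Reading the peak off the curve amounts to an argmax. The sketch below assumes the curve is available as two parallel lists of confidence thresholds and F1 scores; the function name is illustrative.

```python
def best_f1_threshold(confidences, f1_scores):
    """Return (threshold, f1) at the peak of an F1-confidence curve."""
    best = max(range(len(f1_scores)), key=lambda i: f1_scores[i])
    return confidences[best], f1_scores[best]
```

The resulting threshold can then be used as the conf value at prediction time, trading off precision and recall at the point where their harmonic mean is highest.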
YOLO26 Specific Training Recipe
MuSGD Optimizer
YOLO26 defaults to MuSGD, described by Ultralytics as a hybrid of SGD and the Muon optimizer, intended as an improvement over standard SGD:
- Adaptive momentum scheduling: Momentum adjusts dynamically based on gradient variance during training
- Time-aware learning rate: Integrates a cosine annealing variant, so there is no need to set cos_lr=True separately
- Faster convergence: MuSGD typically reaches the same accuracy as standard SGD in 10~15% fewer epochs
Progressive Loss Balancing (ProgLoss)
YOLO26’s ProgLoss automatically manages box/cls loss weight scheduling without manual tuning.
If you observe cls loss not decreasing for a long time in the training log, the default ProgLoss schedule may not suit your dataset. In that case, you can fall back to manual mode with explicitly set box/cls weights.
STAL (Small Target Augmentation Layer)
YOLO26’s STAL is specifically designed to optimize small target detection.
How STAL works:
- Detection phase: Identifies targets smaller than 32×32 pixels in the image
- Augmentation phase: Upsamples, copies, and pastes these small targets to other image regions
- Decay strategy: STAL intensity gradually decreases during training, allowing the model to adapt back to the original distribution
YOLO26 Training Checklist
Tip: When deploying YOLO26 on edge devices (Jetson, Raspberry Pi, etc.), use imgsz=320 or imgsz=480: inference speed improves 2~3× while mAP drops only 1~3%.