YOLO FAQ: Common Problems and Solutions

May 23, 2026 | AI Tools | Tags: YOLO, FAQ, Troubleshooting, Best Practices | Series: AI Engineering Series

Environment Installation Issues

Q1: CUDA not available, only using CPU?

First confirm your NVIDIA driver version supports the required CUDA version. A driver that is too old will make CUDA unavailable:

bash
# Check driver version (Driver Version must be >= minimum for target CUDA)
nvidia-smi
# Check CUDA toolkit version
nvcc --version
# Reinstall PyTorch with matching CUDA version
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu121

If nvidia-smi shows a CUDA version (the highest version the driver supports) but PyTorch still uses the CPU, you most likely installed a CPU-only PyTorch build. Uninstall it and reinstall with the --index-url flag for the correct CUDA version; for CUDA 11.8, replace cu121 with cu118 in the URL. Use a conda or venv virtual environment to isolate PyTorch versions and avoid system-level conflicts.
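To tell a CPU-only build apart from a driver problem, it can help to ask PyTorch directly. The following is a minimal sketch; the helper name cuda_report is hypothetical, and it guards the import so it also runs where PyTorch is not installed.

```python
import importlib.util


def cuda_report():
    """Summarize whether the installed PyTorch build can use CUDA."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None on CPU-only builds; a version string such as "12.1" on CUDA builds
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(cuda_report())
```

If built_with_cuda is None, the wheel itself is CPU-only and reinstalling is the fix; if it is set but cuda_available is False, the driver or environment is the more likely culprit.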

Q2: ultralytics installation failed?

Common causes: outdated pip, dependency conflicts (numpy/opencv version incompatibility), or network timeouts:

bash
# Upgrade pip
pip install --upgrade pip
# Use a PyPI mirror for faster downloads (the Tsinghua mirror is useful from mainland China)
pip install ultralytics -i https://pypi.tuna.tsinghua.edu.cn/simple
# If dependency conflicts persist, manually install core dependencies first
pip install numpy==1.26.0 opencv-python==4.9.0.80
pip install ultralytics

ultralytics supports Python 3.8–3.12, with Python 3.10 recommended. Use conda for isolation: conda create -n yolo python=3.10 -y && conda activate yolo. On ARM macOS, if wheel conflicts arise, try pip install ultralytics --no-deps and install dependencies individually.

Q3: Out of Memory (OOM) error during training?

OOM is the most common YOLO training issue. Combine these strategies:

python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # any detection checkpoint
model.train(
    data="data.yaml",   # dataset config
    batch=-1,           # AutoBatch: picks a batch size that targets ~60% of GPU memory
    amp=True,           # Mixed precision, reduces memory ~40%
    imgsz=640,          # Lower input resolution, 640→480 saves ~30% memory
    nbs=64,             # Nominal batch size: gradients accumulate until ~64 samples are seen
)

If still OOM, specify device=0 for a specific GPU or limit memory fraction via torch.cuda.set_per_process_memory_fraction(0.8). Monitor peak memory with nvidia-smi -l 1 to see if the bottleneck is data loading or backpropagation. For GPUs with less than 8GB VRAM, use lightweight models like YOLOv8n or YOLO11n with imgsz=416.
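Gradient accumulation in Ultralytics is driven by the nominal batch size setting nbs (default 64) rather than by an explicit step count. The arithmetic below is a sketch that assumes the trainer derives the count as max(round(nbs / batch), 1); the function name is hypothetical, and the trainer source for your version is the authority.

```python
def accumulation_steps(batch_size, nominal_batch_size=64):
    """Number of batches to accumulate before one optimizer step."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return max(round(nominal_batch_size / batch_size), 1)


for batch in (8, 16, 32, 64):
    steps = accumulation_steps(batch)
    print(f"batch={batch:>2} -> accumulate {steps} step(s), effective batch {batch * steps}")
```

Under this assumption, dropping the batch size from 16 to 8 to escape OOM keeps the effective batch at 64, so the learning rate does not need to be retuned.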

Q4: Loss doesn’t converge, mAP is very low?

Systematically diagnose from data and training dimensions:

Dataset validation: Verify annotations before training. At startup Ultralytics scans the label files and logs how many images, background images, and corrupt entries it found, and it writes train_batch*.jpg mosaics to the run directory. Make sure images that contain objects are not paired with empty label files, and review the augmented batch samples (in the run directory, TensorBoard, or W&B) to confirm augmentation strength is appropriate
Learning rate tuning: The default lr0=0.01 may be too high for some datasets. Try lr0=0.001 or enable cosine annealing with cos_lr=True. Loss should decrease steadily after the warmup phase (default 3 epochs)
Training epochs: Small datasets (<1000 images) need 300–500 epochs, large datasets 100–300 epochs. Use patience=50 for early stopping
Class balance: When sample counts differ by more than 10x across classes, weight the classification loss toward rare classes (for example cls_pw=1.5, if your Ultralytics version exposes it) or switch to Focal Loss. Monitor recall for rare classes
Augmentation intensity: Over-augmentation can prevent convergence. Lower the augmentation hyperparameters such as hsv_h and degrees (the defaults are hsv_h=0.015 and degrees=0.0), especially for small datasets
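The 10x class-balance threshold above is easy to check before training. This is a minimal sketch that counts boxes per class in a directory of YOLO label files; the function names are hypothetical, and it assumes detection labels whose first field is an integer class ID.

```python
from collections import Counter
from pathlib import Path


def count_instances(label_dir):
    """Count annotated boxes per class ID across all YOLO .txt label files."""
    counts = Counter()
    for label_file in Path(label_dir).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            fields = line.split()
            if fields:  # skip blank lines
                counts[int(fields[0])] += 1
    return counts


def imbalance_ratio(counts):
    """Ratio between the most and least frequent class (1.0 means balanced)."""
    if not counts:
        return 0.0
    return max(counts.values()) / min(counts.values())
```

A ratio above 10 suggests the rebalancing measures listed under "Class balance" are worth trying. Classes with zero instances do not appear in the counter at all, so compare its keys against the names in data.yaml as well.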

Q5: What’s different between YOLO26 and v8 training?

YOLO26 introduces significant simplifications to the training workflow:

Optimizer: Uses MuSGD by default (a hybrid of SGD and the Muon optimizer), so there is no need to set the optimizer parameter. Converges ~15% faster than AdamW
Loss function: No DFL (Distribution Focal Loss). Only the classification and regression branches remain, making training more stable with fewer hyperparameters to tune
Small object detection: STAL (Small-Target-Aware Label Assignment) is applied during label assignment in training and works together with ProgLoss (progressive loss balancing), improving small object AP by 3–5%
Default hyperparameters: lr0=0.005 (lower than v8), momentum=0.937, weight_decay=0.0005, converging in 300 epochs on COCO
Training speed: ~20% faster than v8 at the same batch size with lower memory usage

Migration requires only upgrading the ultralytics package — no code changes needed.

Training Failure Diagnosis

Q14: NaN loss during training?

Common causes and fixes for NaN (Not a Number) loss:

Learning rate too high: Loss jumps to NaN at a specific step. Reduce lr0 to 0.0005 or lower and restart training
Invalid annotations: Check for negative coordinates, boxes extending beyond image boundaries, or zero-width/height boxes
Gradient explosion: The Ultralytics trainer already clips the gradient norm on each optimizer step, so if gradients still explode, lower lr0 or lengthen warmup_epochs
Numerical overflow: Retrain with amp=False to rule out half-precision overflow; if the NaN disappears, AMP was the cause
Corrupted data: Confirm there are no empty, broken, or single-pixel images. Validate each file with PIL.Image.open() followed by verify()

After fixing, resume from the latest checkpoint by loading last.pt and calling model.train(resume=True).
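The annotation checks listed above (negative coordinates, boxes crossing the image boundary, zero-size boxes) can be automated per label line. The sketch below is one way to do it for detection labels with five fields; the function name and the small tolerance EPS are assumptions made for illustration.

```python
EPS = 1e-6  # tolerance for floating-point rounding at the image border


def check_label_line(line):
    """Return a list of problems found in one YOLO label line (empty if valid)."""
    fields = line.split()
    if len(fields) != 5:
        return [f"expected 5 fields, got {len(fields)}"]
    try:
        class_id = int(fields[0])
        x, y, w, h = (float(v) for v in fields[1:])
    except ValueError:
        return ["non-numeric value"]

    problems = []
    if class_id < 0:
        problems.append("negative class id")
    if not all(0.0 <= v <= 1.0 for v in (x, y, w, h)):
        problems.append("value outside [0, 1]")
    if w <= 0.0 or h <= 0.0:
        problems.append("zero or negative width/height")
    # Box edges (center +/- half size) must stay inside the image
    if (x - w / 2 < -EPS or x + w / 2 > 1.0 + EPS
            or y - h / 2 < -EPS or y + h / 2 > 1.0 + EPS):
        problems.append("box extends beyond image boundary")
    return problems
```

Run it over every non-blank line of every label file and print the file name alongside any non-empty result; pixel coordinates that were never normalized show up immediately as "value outside [0, 1]".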

Q15: Dataset issues causing training to stall?

Insufficient samples: Each class needs at least 100 annotated images. Below 50, the model struggles to generalize. Enable mosaic=1.0 and mixup=0.5 for augmentation
Extreme class imbalance: When one class accounts for 90%+ of instances, the model becomes biased toward it. Oversample the rare classes, collect more data for them, or use a class-weighted or focal loss
Hard example contamination: Blurry or heavily occluded annotations slow convergence. Either filter them out manually or add more examples of the same kind so the model can learn them
Validation distribution shift: Ensure train/val sets come from the same distribution. Use stratified sampling (scikit-learn's train_test_split(stratify=y)) for class-proportional splits
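For readers who want to see what a class-proportional split does, here is a plain-Python stand-in for scikit-learn's stratified train_test_split. It is a sketch under one simplifying assumption: each image is given a single label (for example its most frequent class), even though detection images can contain several classes. The function name is hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, val_fraction=0.2, seed=0):
    """Split items into (train, val) while keeping each label's share roughly equal."""
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

With 90 images of one class and 10 of another, a 20% split yields 18 and 2 validation images respectively, so the rare class is neither missing from validation nor over-represented in it.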

Inference and Deployment Issues

Q6: Inference results incorrect after ONNX export?

Key configuration considerations for ONNX export:

python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # or the path to your trained best.pt
model.export(
    format="onnx",
    simplify=True,       # Simplify computation graph, remove redundant ops
    opset=17,            # Recommended 17 or 18, compatible with ONNX Runtime 1.14+
    dynamic=True,        # Enable dynamic batch and input size
    imgsz=640,           # Must match training size
)

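Common causes: 1) an opset that is too low (<15) and lacks operator support; 2) simplify=False, which leaves redundant ops in the graph; 3) an input size or preprocessing that differs from training, which causes interpolation errors. Validate with ONNX Runtime: python -c "import onnxruntime as ort; sess=ort.InferenceSession('model.onnx'); print(sess.get_inputs()[0].shape)". Note that the output layout depends on the model: end-to-end (NMS-free) models such as YOLOv10 and YOLO26, or exports made with nms=True, emit (batch, num_dets, 6) rows of [x1,y1,x2,y2,conf,cls], while a default YOLOv8/YOLO11 export emits raw predictions of shape (batch, 4 + num_classes, num_anchors) that still need confidence filtering and NMS.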
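For exports that produce the (batch, num_dets, 6) layout described above, post-processing reduces to a confidence filter. The sketch below assumes the rows for one image have already been converted to a Python list (for example with output[0].tolist()); the function name is hypothetical and the 0.25 threshold is only an illustrative default.

```python
def filter_detections(rows, conf_threshold=0.25):
    """Keep rows of [x1, y1, x2, y2, conf, cls] whose confidence meets the threshold."""
    kept = []
    for x1, y1, x2, y2, conf, cls in rows:
        if conf >= conf_threshold:
            kept.append({"box": (x1, y1, x2, y2), "conf": conf, "cls": int(cls)})
    return kept


rows = [
    [10.0, 20.0, 110.0, 220.0, 0.90, 0.0],  # confident detection of class 0
    [0.0, 0.0, 5.0, 5.0, 0.10, 2.0],        # low-confidence padding row
]
print(filter_detections(rows))
```

If the unpacking fails because a row has more than six values, the model is producing the raw layout instead, and decoding plus NMS is still required.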

Q7: TensorRT export failed?

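TensorRT export failures are typically caused by a version mismatch between TensorRT, CUDA, and PyTorch. Treat the table below as a rough guide and confirm exact combinations against NVIDIA's TensorRT support matrix: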

TensorRT Version	Minimum CUDA	Recommended PyTorch
8.6	CUDA 11.6+	2.0 / 2.1
10.0	CUDA 12.0+	2.3+
10.7	CUDA 12.4+	2.5+

YOLO26 has the simplest export due to no NMS: model.export(format="engine", half=True). INT8 quantization requires a calibration dataset: model.export(format="engine", int8=True, data="data.yaml"). TensorRT engines are hardware-locked — different GPU architectures (A100 vs RTX 4090) need separate exports.

Q8: Why is YOLO26 faster for inference?

Optimization at three levels:

Architecture: Removed DFL module reduces head computation by ~50%; native no-NMS eliminates post-processing time (typically 5–15% of inference latency)
Operators: Extensive use of RepVGG-style reparameterized convolutions, equivalent to plain 3×3 convolutions at inference, more efficient than v8’s multi-branch structure
Instruction sets: Core operators deeply optimized with AVX2/AVX512 and NEON, achieving 43% CPU speedup. On RTX 4090: YOLO26n (FP16) reaches 2.5ms, YOLO26s ~4.0ms

Recommendations: edge devices → YOLO26n; high-throughput API → YOLO26s (INT8); maximum accuracy → YOLO26m/l.

ONNX Export Troubleshooting

Q16: ONNX export failed: Unsupported op?

Incorrect opset version selection is the most common cause:

opset=16: Supports most CNN operators, works with DFL-based YOLO versions
opset=17: Recommended — covers all YOLO26 operators, compatible with ONNX Runtime 1.14+
opset=18+: Latest operator set, but some inference frameworks (older OpenVINO, Triton) may not support it
Safe default: model.export(format="onnx", opset=17)
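If it still fails, upgrade torch and onnx, since operator coverage depends on the exporter version. Note that dynamic_axes is an argument of torch.onnx.export, not of model.export(); in Ultralytics, dynamic shapes are requested with dynamic=True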

Q17: Dynamic axes configuration issues?

python
# Dynamic batch + dynamic height/width (exported axes show up as -1 or symbolic names)
model.export(format="onnx", dynamic=True, imgsz=[640, 480])  # imgsz is (height, width)
# Fixed size (better inference framework compatibility)
model.export(format="onnx", dynamic=False, imgsz=640)

dynamic=True produces inputs with -1 (symbolic) dimensions that some frameworks (older OpenVINO) cannot handle; use a fixed-size export for those. It is dynamic=True that makes the axes resizable; passing imgsz as a list such as [640, 480] only sets a non-square (height, width) example input.

Q18: Exported ONNX model validation failed?

Standard validation workflow:

bash
# 1. Model structure check
python -c "import onnx; m=onnx.load('model.onnx'); onnx.checker.check_model(m); print('ONNX check OK')"
# 2. ONNX Runtime inference test
python -c "
import onnxruntime as ort, numpy as np
sess = ort.InferenceSession('model.onnx')
inp = np.random.randn(1,3,640,640).astype(np.float32)
out = sess.run(None, {sess.get_inputs()[0].name: inp})
print(f'Output shape: {out[0].shape}, mean: {out[0].mean():.4f}')
"
# 3. Compare with PyTorch output (numerical error should be < 1e-3)

If shape inference fails (output dimensions are 0 or -1), retry with model.export(format="onnx", simplify=False).
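Step 3 of the workflow above, comparing against the PyTorch output, is left as a comment. A minimal sketch of that comparison follows, assuming both outputs were produced from the same input and flattened into Python lists (for example with tensor.flatten().tolist() and array.ravel().tolist()); the helper names are hypothetical.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length sequences."""
    reference, candidate = list(reference), list(candidate)
    if len(reference) != len(candidate):
        raise ValueError(f"length mismatch: {len(reference)} vs {len(candidate)}")
    return max((abs(a - b) for a, b in zip(reference, candidate)), default=0.0)


def outputs_match(reference, candidate, tolerance=1e-3):
    """True when the two outputs agree within the given absolute tolerance."""
    return max_abs_diff(reference, candidate) < tolerance
```

A length mismatch usually means the two outputs use different layouts rather than different numerics, so it is raised as an error instead of being reported as a large difference.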

Deployment Environment Specific Issues

Q19: Rust ort crate CUDA compilation error?

toml
# Cargo.toml
[dependencies]
ort = { version = "2.0", features = ["cuda"] }

Ensure CUDA_PATH points to the correct CUDA installation and cudart shared libraries are in PATH (Windows) or LD_LIBRARY_PATH (Linux). The 'ort-sys' build failed error typically means only the NVIDIA driver is installed without the CUDA Toolkit. Explicitly configure the CUDA execution provider in Rust:

rust
let session = ort::SessionBuilder::new()?
    .with_execution_providers([CUDAExecutionProvider::default().build()])?
    .commit_from_file("model.onnx")?;

Q20: Go CGO cross-compilation for ARM?

bash
CGO_ENABLED=1 \
GOOS=linux GOARCH=arm64 \
CC=aarch64-linux-gnu-gcc \
CGO_LDFLAGS="-L/path/to/arm64/onnxruntime/lib -lonnxruntime" \
go build -o app main.go

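The "invalid flag in #cgo LDFLAGS" error occurs because Go's security policy only accepts an allowlist of linker flags in #cgo directives. Set CGO_LDFLAGS_ALLOW to a regular expression matching the flags you need (".*" allows everything, so prefer a narrower pattern where possible). For ARM platforms, use a prebuilt aarch64 ONNX Runtime release (v1.17+) to avoid compiling from source.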

Q21: ONNX Runtime version and CUDA compatibility?

ONNX Runtime	Minimum CUDA	cuDNN	Notes
1.16.x	11.8	8.6	Stable, general purpose
1.17.x	11.8 / 12.x	8.7	CUDA 12 support
1.18.x	12.2	8.9	CUDA 11 dropped
1.19.x	12.4	9.0+	Latest features

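Treat the table as a rough guide and confirm exact pairings in the ONNX Runtime CUDA execution provider requirements. Mismatched versions cause session creation failures (for example, Error in Bind) or a silent fallback to CPU. To verify, list the providers your installed build exposes with python -c "import onnxruntime as ort; print(ort.__version__, ort.get_available_providers())" and check that CUDAExecutionProvider appears.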

Q22: GPU not available in Docker container?

dockerfile
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04
# nvidia-container-toolkit belongs on the host, not in the image;
# the image only needs the CUDA runtime libraries from the base image.

The host needs NVIDIA drivers + nvidia-container-toolkit. Run with --gpus all flag. If nvidia-smi is invisible inside the container, check /etc/docker/daemon.json for nvidia runtime configuration. Docker Compose setup:

yaml
services:
  yolo:
    image: yolo-service
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Q23: Windows vs Linux path issues?

python
# Windows ❌
data_path = "data\\images\\train"
# Windows ✅ — forward slashes work cross-platform
data_path = "data/images/train"
# Cross-platform solution
import pathlib
data_path = pathlib.Path("data") / "images" / "train"

Always use POSIX forward slashes in YOLO configuration files for cross-platform compatibility. The same applies to model weight file paths. On Windows, paths longer than 260 characters require enabling long path support in the registry or group policy.
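When paths arrive from a Windows tool with backslashes already in them, they can be normalized before being written to a YOLO config. This is a small sketch using only the standard library; the helper name to_posix is hypothetical.

```python
from pathlib import PureWindowsPath


def to_posix(path_str):
    """Convert a Windows- or POSIX-style path string to forward slashes."""
    # PureWindowsPath accepts both "\\" and "/" as separators on any OS
    return PureWindowsPath(path_str).as_posix()


print(to_posix("data\\images\\train"))  # data/images/train
print(to_posix("data/images/train"))    # data/images/train
```

Because PureWindowsPath is a pure path class, this works on Linux and macOS too and never touches the filesystem.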

Q9: Annotation file format errors?

YOLO annotation format is strict. Common mistakes:

Coordinate normalization: Each line is <class_id> <x_center> <y_center> <width> <height>, and all four box values must be in the 0–1 range (pixel values divided by image width/height). These are normalized coordinates, not pixel coordinates
File structure: One .txt file per image, one object per line. For images with no objects, keep an empty .txt file to avoid warnings
Format validation: Check the dataset scan summary that Ultralytics logs at the start of training (images, backgrounds, corrupt) to verify integrity
Coordinate type: YOLO uses normalized xywh (center + dimensions). COCO stores [x_min, y_min, width, height] in pixels, and Pascal VOC stores xyxy (top-left + bottom-right) in pixels. Confirm the conversion when migrating formats
Empty annotations: Images with an empty .txt (or no label file) are treated as background negatives; no extra setting in data.yaml is needed

Use LabelImg or Label Studio for YOLO format export to avoid manual editing errors.
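When converting by hand is unavoidable, the normalization arithmetic is short. The sketch below converts one COCO-style box, which is stored as [x_min, y_min, width, height] in pixels, into a normalized YOLO line; the function name and the sample box are made up for illustration.

```python
def coco_to_yolo(bbox, image_width, image_height):
    """Convert a COCO [x_min, y_min, width, height] pixel box to normalized YOLO xywh."""
    x_min, y_min, box_w, box_h = bbox
    x_center = (x_min + box_w / 2) / image_width
    y_center = (y_min + box_h / 2) / image_height
    return x_center, y_center, box_w / image_width, box_h / image_height


xc, yc, w, h = coco_to_yolo([100, 50, 200, 100], image_width=640, image_height=480)
print(f"0 {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
# prints: 0 0.312500 0.208333 0.312500 0.208333
```

The two steps that are easiest to get wrong are shifting the corner to the center (adding half the width and height) and dividing x values by the image width but y values by the image height.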

Q10: Class IDs not consecutive?

YOLO requires class IDs to start from 0 and increment consecutively without gaps:

Remapping script: Original IDs [0, 1, 4, 7] → map to [0, 1, 2, 3] using a dictionary {0:0, 1:1, 4:2, 7:3} for batch replacement
Background class: YOLO has no explicit background class (unlike Faster R-CNN’s class 0). All annotation IDs are foreground. If your dataset has background ID 0, offset all IDs: new_id = old_id - 1
Verification: python -c "import yaml; d=yaml.safe_load(open('data.yaml')); print(len(d['names']))" should equal max ID + 1
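The remapping step above can be sketched as two small functions: one builds the dictionary from whatever IDs occur, and one rewrites the first field of each label line. The names are hypothetical, and the sketch works on the text of a label file so it can be tested without touching disk.

```python
def build_id_map(original_ids):
    """Map arbitrary class IDs to consecutive IDs starting at 0, preserving order."""
    return {old: new for new, old in enumerate(sorted(set(original_ids)))}


def remap_label_text(text, id_map):
    """Rewrite the class ID (first field) of every line in a YOLO label file."""
    out_lines = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # drop blank lines
        fields[0] = str(id_map[int(fields[0])])
        out_lines.append(" ".join(fields))
    return "\n".join(out_lines) + ("\n" if out_lines else "")


id_map = build_id_map([0, 1, 4, 7])
print(id_map)  # {0: 0, 1: 1, 4: 2, 7: 3}
```

Apply the same mapping to the names list in data.yaml so that class names and IDs stay aligned, and keep a backup of the original labels because the rewrite is not reversible without the map.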

Version Selection Issues

Q11: Which version should beginners start with?

Recommend YOLOv8: Most tutorials (docs, videos, community), most complete ecosystem (detection, segmentation, pose, classification), best for systematic learning
After learning v8, switching to YOLO11/26 is zero-cost — API is 100% compatible, only architecture differs internally
Not recommended to start with YOLOv5: v5’s API differs from the Ultralytics framework, requiring a migration later
Learning path: YOLOv8 official notebook → custom dataset training → ONNX/TensorRT deployment → switch to YOLO26 for performance

Q12: Which version for industrial deployment?

Scenario	Recommended	Reason
Edge devices (Jetson/phone)	YOLO26n	43% faster CPU, no NMS, simplest deployment
Server GPU	YOLO26s/m	Best accuracy-speed balance
CPU-only	YOLO26n INT8	INT8 accuracy loss < 1% mAP
Segmentation/pose needed	YOLOv8-seg/pose	v8 covers all task types
Legacy maintenance	YOLOv8	Most mature ecosystem, fewest issues

Decision tree: Edge → YOLO26n; GPU server → YOLO26s/m; CPU-only → YOLO26n INT8; Multi-task → YOLOv8

Q13: Which version for research exploration?

Different versions represent different technical directions:

YOLOv9: PGI (Programmable Gradient Information) + GELAN architecture, first choice for high-precision tasks
YOLOv10: No-NMS end-to-end detection, consistent dual assignment strategy, clean architecture for academic analysis
YOLOv12: CNN + attention hybrid (Area Attention), exploring CNN-Transformer fusion
YOLO26: Latest SOTA, ideal as comparison baseline
Run multiple versions simultaneously for research; Ultralytics’ unified API dramatically reduces experiment code complexity

Version Migration Guide

Q24: Migrating from YOLOv5 to Ultralytics (v8/v11/26)?

YOLOv5 uses a separate repository with a different API:

python
# YOLOv5 legacy code
import torch
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
results = model(img)
results.pandas().xyxy[0]  # Different result parsing

# Ultralytics new code (v8/v11/26)
from ultralytics import YOLO
model = YOLO("yolo11n.pt")
results = model(img)
results[0].boxes.data  # Unified result interface: rows of [x1, y1, x2, y2, conf, cls]

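Key changes: 1) Result parsing changes from .pandas().xyxy[0] to .boxes.data; 2) Training options move from train.py command-line flags to keyword arguments of model.train() (or key=value pairs on the yolo CLI); 3) The data.yaml dataset configuration format is compatible and reusable. Weights trained in the YOLOv5 repository cannot be loaded by the ultralytics package, so retrain on the same dataset rather than converting checkpoints.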

Q25: Migrating from Detectron2 to YOLO?

python
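# Detectron2: extensive configuration
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

config_name = "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(config_name))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(config_name)  # pretrained weights
predictor = DefaultPredictor(cfg)
outputs = predictor(img)  # img: BGR numpy array

# YOLO: load, predict, read boxes
from ultralytics import YOLO
model = YOLO("yolo11n.pt")
results = model(img)
boxes = results[0].boxes.xyxy.cpu().numpy()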

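Key differences: 1) Detectron2's XYXY coordinates map directly to YOLO's boxes.xyxy; 2) Both frameworks use contiguous class IDs starting at 0 with no explicit background class in the labels (Detectron2 reserves the index after the last class for background internally), but COCO JSON category IDs often start at 1 and must be remapped during conversion; 3) Evaluation metrics are the same (COCO mAP); 4) Dataset conversion from COCO JSON to YOLO TXT uses the convert_coco function in ultralytics.data.converter.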

Q26: Migrating from MMDetection to YOLO?

Dataset: MMDetection uses COCO JSON → convert to YOLO TXT with the convert_coco function from ultralytics.data.converter
Evaluation: MMDetection outputs COCO mAP; YOLO's val command reports the same COCO-style metrics (mAP50, mAP50-95), though values can differ slightly from pycocotools
Inference output: MMDetection returns DetDataSample objects; YOLO returns concise Results objects
Migration strategy: First get training working on MMDetection with COCO format, then convert the dataset with convert_coco, and finally train on YOLO with an equivalent config for an accuracy comparison

Q27: How is API backward compatibility?

Ultralytics follows semantic versioning: v8.x → v8.y (same major version) is 100% API compatible, and across v8 → v11 → v26 the core APIs are stable (the YOLO(), model.train(), model.predict(), and model.export() interfaces are unchanged). New versions only add parameters rather than modifying existing signatures. For third-party library compatibility, refer to the ONNX Runtime table in Q21 and the TensorRT table in Q7.

Error Code Reference Table

Error Message	Cause	Solution
`CUDA out of memory. Tried to allocate ...`	Insufficient VRAM	Reduce batch size, enable AMP, lower imgsz, use gradient accumulation
`No module named 'ultralytics'`	ultralytics not installed	`pip install ultralytics`, verify Python 3.8–3.12
`ONNX export failed: Couldn't export operator ...`	Operator unsupported in current opset	Try opset=17/18, or upgrade PyTorch
`invalid flag in #cgo LDFLAGS: ...`	Go security policy blocks custom linker flags	Set `CGO_LDFLAGS_ALLOW` to a regex matching the needed flags (for example `".*"`)
`ImportError: libcudart.so... cannot open shared object file`	CUDA runtime library missing or version mismatch	Install matching CUDA Toolkit, check LD_LIBRARY_PATH
`InvalidArgument: ... ORT ...`	ONNX Runtime version incompatible with export	Check Q21 compatibility table
`'NoneType' object has no attribute 'shape'`	Model weights not loaded correctly	Verify `.pt` path or re-download
`No such file or directory: 'data.yaml'`	Dataset config file path incorrect	Use absolute path or check working directory
`UserWarning: Class ... has only ... images`	Severe class sample shortage	Collect more data or use augmentation
Training cannot resume after OOM	Memory leak	Restart process, call `torch.cuda.empty_cache()`, check DataLoader num_workers

📖 References and Official Documentation

Ultralytics Official Documentation: https://docs.ultralytics.com/
YOLO26 Release Blog: https://www.ultralytics.com/blog/yolo26
YOLOv9 Paper: https://arxiv.org/abs/2402.13616
YOLOv10 Paper: https://arxiv.org/abs/2405.14458
YOLOv12 Paper: https://arxiv.org/abs/2502.12524
GitHub Repository: https://github.com/ultralytics/ultralytics

🎯 Summary and Next Steps

Key Conclusions:

YOLO26 is the latest version for 2026, optimized for edge computing, first choice for industrial deployment
YOLOv8 recommended for beginners, complete ecosystem, 100% API compatibility with new versions
Ultralytics unified framework is the biggest advantage, learn one version to master all
Version differences mainly in architecture, user-level APIs remain consistent, migration cost is minimal

Next Learning Steps:

Set up development environment according to the steps in this article
First run image/video detection examples with YOLOv8
Try training a simple custom dataset
Learn ONNX/TensorRT model deployment
Gradually explore YOLO26’s edge deployment advantages

Wish you success in your YOLO learning journey! 🚀
