YOLO Deployment: Model Export and Multi-Platform Deployment
Model Export (17 Format Support)
Ultralytics Unified Export API
| |
Version Export Compatibility
| Format | YOLOv8 | YOLO11 | YOLO26 |
|---|---|---|---|
| ONNX | ✅ | ✅ | ✅ Best |
| TensorRT | ✅ | ✅ | ✅ No NMS, Simpler |
| OpenVINO | ✅ | ✅ | ✅ |
| TFLite | ✅ | ✅ | ✅ |
| NCNN | ✅ | ✅ | ✅ |
Python Deployment Practice
ONNX Runtime Deployment
| |
TensorRT Python Deployment
| |
OpenVINO Deployment with Benchmarking
| |
NCNN Mobile Deployment
NCNN is Tencent’s open-source mobile inference framework supporting ARM NEON and Vulkan GPU acceleration.
Model Optimization (ncnnoptimize):
| |
Python Binding Inference:
| |
Android JNI Integration:
| |
Performance Reference (YOLO26n, 640×640):
| Device | Backend | Latency | Power |
|---|---|---|---|
| Snapdragon 8 Gen 3 | NCNN Vulkan | ~8ms | 3.5W |
| Snapdragon 8 Gen 3 | NCNN CPU | ~18ms | 2.1W |
| Apple A17 Pro | CoreML/ANE | ~6ms | 2.8W |
| MediaTek Dimensity 9300 | NCNN GPU | ~10ms | 3.2W |
| Raspberry Pi 5 | NCNN CPU | ~120ms | 5W |
C++ Deployment Practice
OpenCV DNN Deployment (Simplest)
| |
Docker Multi-Stage Deployment
Dockerfile Build
| |
docker-compose Orchestration
| |
Health Check & Inference Service
| |
Resource Limit Reference
| Resource | Minimum | Recommended |
|---|---|---|
| CPU | 2 cores | 4+ cores |
| Memory | 2 GB | 4 GB |
| GPU (Optional) | NVIDIA T4 | A10G+ |
Edge Device Deployment
Rockchip RKNN (RK3588 / RK3566)
Rockchip NPU uses the RKNN format — requires converting ONNX models:
| |
Key Parameters:
do_quantization=True: INT8 quantization, 3-5x throughput improvementtarget_platform: RK3588 / RK3566 / RK3399Pro- NPU inference latency (RK3588): typically <10ms
NVIDIA Jetson (JetPack + TensorRT)
| |
Jetson Model Performance Estimate (YOLO26n, FP16, 640×640):
| Device | Inference Latency | Tier |
|---|---|---|
| Jetson Orin NX 16GB | ~6ms | Edge Flagship |
| Jetson Orin Nano 8GB | ~12ms | Best Value |
| Jetson Xavier NX | ~15ms | Previous Gen Flagship |
| Jetson Nano | ~45ms | Entry Level |
Google Coral Edge TPU (TFLite)
| |
| |
Mobile Deployment
Android Deployment (TFLite)
| |
Web Deployment
ONNX Runtime Web
| |
Benchmark Methodology
Standardized Benchmark Script
| |
Report Format
| |
Testing Considerations
| Consideration | Description |
|---|---|
| CPU Frequency Locking | Disable scaling: cpupower frequency-set -g performance |
| GPU Warmup | TensorRT first inference includes kernel compilation — warm up 10+ iterations |
| Batch Size Impact | Larger batches increase throughput but add latency — choose based on SLA |
| Memory Bandwidth | CPU inference is often memory-bandwidth bound — watch DDR frequency |
| Temperature Control | Edge device latency increases with temperature — monitor SoC temp |
| Reproducibility | Fix random seed, lock CPU affinity, disable hyperthreading |
Export Troubleshooting
Common Errors & Fixes
| Platform | Common Error | Solution |
|---|---|---|
| ONNX | Opset version mismatch | Specify opset=15 or opset=17 |
| ONNX | Dynamic shape unsupported | Set dynamic=False, imgsz=640 |
| TensorRT | Plugin not found | Upgrade TensorRT or set workspace=8 |
| TensorRT | Half precision not supported | Use half=False or upgrade GPU |
| OpenVINO | FP16 not supported on device | Device doesn’t support FP16, set half=False |
| TFLite | Quantization not supported | Use int8=False or provide calibration dataset |
| NCNN | Vulkan not initialized | Check GPU driver, compile with -DNCNN_VULKAN=ON |
| RKNN | Platform mismatch | Check target_platform parameter |
| CoreML | Min deployment target | Use mlmodel format instead of mlpackage |
Debug Tips
- Verbose export logging:
1model.export(format="onnx", verbose=True) - ONNX integrity check:
1python -m onnxruntime.tools.check_model yolo26n.onnx - TensorRT verbose logging:
1trt.Logger(trt.Logger.VERBOSE) # Replace WARNING - Graph visualization:
1 2import netron netron.start("yolo26n.onnx") # View model structure in browser
Version Deployment Differences Summary
| Deployment Aspect | YOLOv8 | YOLO11 | YOLO26 |
|---|---|---|---|
| NMS Post-processing | Required | Required | Not Required |
| DFL Parsing | Required | Required | Not Required |
| Deployment Code Size | Baseline | Baseline | Reduced by 50% |
| CPU Inference Speed | Baseline | +25% | +43% |
| Hardware Compatibility | Good | Good | Best |
| Edge Device Adaptation | Average | Better | Designed for Edge |