Model Lightweighting Strategies
Model Size Selection
| Model | Parameters (M) | mAP | CPU Inference | Use Cases |
|---|---|---|---|---|
| YOLO26n | 2.8 | 38.9 | Fastest | Edge devices, Embedded |
| YOLO26s | 9.4 | 48.2 | Very fast | Mobile, Web |
| YOLO26m | 21.8 | 53.1 | Medium | Server, High performance |
| YOLO11n | 2.6 | 39.6 | Fast | Lightweight deployment |
| YOLOv8n | 3.2 | 37.3 | Baseline | General purpose |
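As a rough illustration of how the table can drive model selection, the sketch below picks the highest-mAP model that fits a parameter budget. The figures are copied from the table; the `pick_model` helper and the budget values are hypothetical and only meant to show the idea.

```python
# Figures copied from the table above: (name, parameters in millions, mAP)
MODELS = [
    ("YOLO26n", 2.8, 38.9),
    ("YOLO26s", 9.4, 48.2),
    ("YOLO26m", 21.8, 53.1),
    ("YOLO11n", 2.6, 39.6),
    ("YOLOv8n", 3.2, 37.3),
]


def pick_model(max_params_m):
    """Return the highest-mAP model whose parameter count fits the budget, or None."""
    candidates = [m for m in MODELS if m[1] <= max_params_m]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m[2])[0]


print(pick_model(3.0))   # hypothetical budget for a small edge device
print(pick_model(10.0))  # hypothetical budget for a mobile-class target
```

In practice the budget would also account for latency and memory on the target device, which parameter count only approximates.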
Knowledge Distillation
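```python
from ultralytics import YOLO

# Large model as teacher, small model as student
teacher = YOLO("yolo26x.pt")
student = YOLO("yolo26n.yaml")

# Distillation training.
# Caution: `distill` and `distill_ratio` do not appear among the documented
# Ultralytics training arguments, and the stock trainer rejects unknown keys.
# Confirm that your version or fork accepts them; otherwise distillation
# needs a custom trainer and loss.
student.train(
    data="data.yaml",
    distill="yolo26x.pt",  # Teacher model
    distill_ratio=0.5,     # Weight of the distillation loss
)
```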
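To make the role of the distillation ratio concrete, here is a minimal sketch of a classic logit-distillation loss (temperature-softened KL divergence blended with the supervised loss). It is a generic formulation, not the loss any particular YOLO trainer uses; the function names, the temperature, and the sample logits are all assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    return kl * temperature ** 2


def total_loss(task_loss, student_logits, teacher_logits, distill_ratio=0.5):
    """Blend the supervised loss with the distillation term."""
    soft = kd_loss(student_logits, teacher_logits)
    return (1.0 - distill_ratio) * task_loss + distill_ratio * soft


teacher_logits = [4.0, 1.0, 0.5]  # illustrative values
student_logits = [2.0, 1.5, 0.5]
print(round(kd_loss(student_logits, teacher_logits), 4))
print(round(total_loss(1.2, student_logits, teacher_logits), 4))
```

With `distill_ratio=0.5` the student weighs the teacher's soft targets and the ground-truth loss equally; detection-specific distillation usually adds feature or box terms on top of this.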
Model Pruning
Structured vs Unstructured Pruning
| Type | Method | Sparsity Pattern | Hardware Acceleration | Compression Ratio |
|---|---|---|---|---|
| Unstructured | Weight pruning | Random sparse | Difficult (special HW needed) | High |
| Structured | Channel pruning | Regular sparse | Native acceleration | Medium |
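The difference between the two rows can be seen on a toy weight matrix. The sketch below is a plain-Python illustration with made-up values: unstructured pruning zeroes individual small weights and keeps the shape, while channel pruning removes whole rows (output channels) and yields a genuinely smaller dense matrix.

```python
def unstructured_prune(weights, amount):
    """Zero the `amount` fraction of entries with the smallest magnitude."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * amount)
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def channel_prune(weights, amount):
    """Drop whole rows (output channels) with the smallest L1 norm."""
    n_drop = int(len(weights) * amount)
    order = sorted(range(len(weights)), key=lambda i: sum(abs(w) for w in weights[i]))
    dropped = set(order[:n_drop])
    return [row[:] for i, row in enumerate(weights) if i not in dropped]


# 4 output channels x 3 inputs; values are arbitrary illustration data
W = [
    [0.9, -0.1, 0.4],
    [0.05, 0.02, -0.03],
    [-0.7, 0.6, 0.2],
    [0.3, -0.8, 0.01],
]

sparse = unstructured_prune(W, 0.5)
smaller = channel_prune(W, 0.25)
print(len(sparse), len(sparse[0]))    # shape unchanged; zeros are scattered
print(len(smaller), len(smaller[0]))  # one output channel removed
```

The scattered zeros only save time on hardware or kernels with sparse support, whereas the smaller dense matrix is faster everywhere, which is why the table lists structured pruning as natively accelerated.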
Torch Prune Channel Pruning Example
```python
import torch
import torch.nn.utils.prune as prune
from ultralytics import YOLO

# L1 unstructured pruning on conv layers
model = YOLO("yolo26n.pt")
for name, module in model.model.named_modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # Make pruning permanent

# Channel pruning with the torch-pruning library
# pip install torch-pruning
import torch_pruning as tp

model = YOLO("yolo26n.pt").model
for p in model.parameters():
    p.requires_grad_(True)  # dependency tracing relies on autograd

DG = tp.DependencyGraph()
DG.build_dependency(model, example_inputs=torch.randn(1, 3, 640, 640))

# Remove every 5th output channel (about 20%) of one conv layer.
# `idxs` lists the channels to remove, not the ones to keep.
# Uses the torch-pruning >= 1.0 API; 0.2.x used get_pruning_plan() and plan.exec().
conv = model.model[0].conv  # assumes layer 0 is an Ultralytics Conv block; pick any nn.Conv2d
group = DG.get_pruning_group(
    conv, tp.prune_conv_out_channels,
    idxs=list(range(0, conv.out_channels, 5)),
)
if DG.check_pruning_group(group):  # skip groups that would remove every channel
    group.prune()
```
Pruning Ratio Guidelines
| Model | Safe Ratio | Aggressive Ratio | mAP Drop (safe / aggressive) |
|---|---|---|---|
| YOLO26n | ≤20% | 20-40% | <1% / 2-5% |
| YOLO26s | ≤30% | 30-50% | <1% / 3-6% |
| YOLO26m | ≤40% | 40-60% | <1% / 3-8% |
| YOLOv8n | ≤20% | 20-35% | <1% / 2-4% |
Model Quantization
Export Time Quantization
```python
from ultralytics import YOLO

model = YOLO("yolo26n.pt")

# INT8 quantization (requires calibration data)
model.export(
    format="engine",   # TensorRT
    int8=True,
    data="data.yaml",  # Calibration dataset
    batch=8,
)

# ONNX export with dynamic input shapes.
# Note: `dynamic=True` controls input shapes, not quantization; to quantize the
# exported file, use ONNX Runtime's quantization tools.
model.export(
    format="onnx",
    dynamic=True,
    simplify=True,
)
```
TensorRT INT8 Calibration Step-by-Step
Calibration Dataset Preparation
INT8 quantization requires representative calibration data to determine activation value ranges:
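```python
import os

import tensorrt as trt
import torch


class YOLOCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data_loader, batch_size=8, cache_file="calibration.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(data_loader)  # yields float tensors shaped (N, 3, H, W)
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.device_batch = None  # keeps the current GPU buffer alive

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None  # signals the end of calibration
        # TensorRT expects device (GPU) addresses, one per network input
        self.device_batch = batch.to("cuda", dtype=torch.float32).contiguous()
        return [int(self.device_batch.data_ptr())]

    def read_calibration_cache(self):
        # Reuse an earlier calibration run if its cache file exists
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```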
INT8 vs FP16 vs FP32 Comparison
| Precision | Storage Size | Inference Speed (GPU) | mAP Loss | Use Cases |
|---|---|---|---|---|
| FP32 | 100% (baseline) | 1× | 0% | Training, accuracy-first |
| FP16 | 50% | 1.5-2× | <0.5% | General inference |
| INT8 | 25% | 2-4× | 1-3% | Edge deployment, real-time |
Calibration Algorithm Selection
TensorRT ships several INT8 calibrators; the two used most often are:
- Entropy (IInt8EntropyCalibrator2): Based on KL divergence, minimizes information loss before and after quantization. Recommended for most vision models including YOLO.
- Min-Max (IInt8MinMaxCalibrator): Based on absolute value range, simple and fast but sensitive to outliers. Suitable for models with symmetric weight distribution.
Best practice: Use Entropy calibrator + 500-1000 calibration images, batch size=8-16, covering various scenarios.
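The outlier sensitivity of a min-max range can be shown with a small experiment. The sketch below uses plain Python and synthetic activations; `percentile_clip` is only a simple stand-in for a range that ignores outliers and is not TensorRT's entropy algorithm. All names and values are illustrative assumptions.

```python
def quantize_error(values, clip):
    """Mean absolute error after symmetric INT8 quantization with range [-clip, clip]."""
    scale = clip / 127.0
    total = 0.0
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        total += abs(v - q * scale)
    return total / len(values)


def minmax_clip(values):
    """Min-max style range: the largest absolute value seen."""
    return max(abs(v) for v in values)


def percentile_clip(values, pct=99.0):
    """Range that ignores the largest (100 - pct)% of magnitudes."""
    mags = sorted(abs(v) for v in values)
    idx = min(len(mags) - 1, int(len(mags) * pct / 100.0))
    return mags[idx]


# 999 activations spread evenly over [-1, 1] plus a single outlier at 50
activations = [-1.0 + 2.0 * i / 998 for i in range(999)] + [50.0]
typical = activations[:-1]

err_minmax = quantize_error(typical, minmax_clip(activations))
err_clipped = quantize_error(typical, percentile_clip(activations))
print(f"min-max range error:  {err_minmax:.4f}")
print(f"clipped range error:  {err_clipped:.4f}")
```

A single outlier stretches the min-max range so far that the typical values share only a handful of INT8 levels, while a range that discards the outlier keeps most of the resolution for the bulk of the distribution.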
Latency vs Accuracy Tradeoff
| Calibration Set Size | INT8 mAP | FP16 mAP | Latency (ms) | Speedup |
|---|---|---|---|---|
| 100 images | 51.2 | 52.8 | 3.2 | 3.1× |
| 500 images | 52.1 | 52.8 | 3.2 | 3.1× |
| 1000 images | 52.5 | 52.8 | 3.2 | 3.1× |
| 2000 images | 52.7 | 52.8 | 3.2 | 3.1× |
ONNX Quantization Detailed Guide
Dynamic vs Static Quantization
| Method | Calibration Data | Weight Precision | Activation Precision | Speedup | Use Cases |
|---|---|---|---|---|---|
| Dynamic | Not required | INT8 | FP32 (dynamic compute) | 1.5-2× | CPU, NLP models |
| Static | Required | INT8 | INT8 | 2-3× | CPU, Vision models |
QDQ (Quantize-Dequantize) Nodes
ONNX static quantization inserts QDQ nodes in the computation graph to simulate quantization effects:
```text
Input → QuantizeLinear → DequantizeLinear → Conv → QuantizeLinear → DequantizeLinear → ... → Output
        (scale, zp)      (scale, zp)               (scale, zp)      (scale, zp)
```
ONNX Runtime automatically fuses QDQ nodes into efficient INT8 kernels during inference.
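The arithmetic behind a QuantizeLinear/DequantizeLinear pair is small enough to sketch directly. The functions below follow the standard affine quantization formulas for signed INT8; the scale and zero-point values are illustrative, since real ones come from calibration.

```python
def quantize_linear(x, scale, zero_point, qmin=-128, qmax=127):
    """QuantizeLinear: map a float to an integer code, saturating at the INT8 range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize_linear(q, scale, zero_point):
    """DequantizeLinear: map an integer code back to an approximate float."""
    return (q - zero_point) * scale


scale, zero_point = 0.05, 0  # illustrative values; real ones come from calibration
for x in (0.12, -1.0, 10.0):
    q = quantize_linear(x, scale, zero_point)
    print(x, "->", q, "->", dequantize_linear(q, scale, zero_point))
```

Values inside the representable range come back with a small rounding error, while values beyond `qmax * scale` saturate, which is why the calibration range matters.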
```python
from pathlib import Path

import cv2
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader,
    CalibrationMethod,
    QuantFormat,
    QuantType,
    quantize_dynamic,
    quantize_static,
)

# Dynamic quantization (no calibration data needed)
quantize_dynamic(
    "yolo26n.onnx",
    "yolo26n_dynamic.onnx",
    weight_type=QuantType.QInt8,
)


# Static quantization (requires calibration data)
class YOLOCalibDataReader(CalibrationDataReader):
    def __init__(self, image_paths, batch_size=8):
        self.images = [self.preprocess(img) for img in image_paths]
        # The ONNX model must accept this batch size (dynamic or matching export)
        self.batch_size = batch_size
        self.index = 0

    @staticmethod
    def preprocess(path, size=640):
        # Simplified preprocessing: plain resize (no letterbox), RGB, CHW, [0, 1]
        img = cv2.cvtColor(cv2.imread(str(path)), cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (size, size))
        return img.transpose(2, 0, 1).astype(np.float32) / 255.0

    def get_next(self):
        if self.index >= len(self.images):
            return None  # tells the quantizer that calibration data is exhausted
        batch = self.images[self.index:self.index + self.batch_size]
        self.index += self.batch_size
        return {"images": np.stack(batch)}


# Example folder name; point this at your own calibration images
calibration_images = sorted(Path("calibration_images").glob("*.jpg"))

quantize_static(
    "yolo26n.onnx",
    "yolo26n_static.onnx",
    calibration_data_reader=YOLOCalibDataReader(calibration_images),
    quant_format=QuantFormat.QDQ,  # QDQ format, supports more optimizations
    per_channel=True,              # Per-channel gives higher accuracy
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    calibrate_method=CalibrationMethod.Entropy,
)
```
Per-Tensor vs Per-Channel Quantization
| Method | Granularity | Accuracy | Compute Overhead | Recommended For |
|---|---|---|---|---|
| Per-Tensor | One scale per weight tensor | Lower | None | Small models, quick deployment |
| Per-Channel | One scale per output channel | Higher | Slight | Accuracy-sensitive, YOLO detection models |
For YOLO models, per-channel quantization is strongly recommended due to high channel variability in the detection head.
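A toy comparison shows why granularity matters when channels have very different ranges. The sketch below uses symmetric INT8 quantization in plain Python on made-up weights; it illustrates the principle only and does not reproduce any particular runtime's quantizer.

```python
def quant_dequant(values, scale):
    """Symmetric INT8 round trip for a list of floats."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def mean_abs_error(original, restored):
    """Average absolute difference over all entries of two nested lists."""
    pairs = [(a, b) for ro, rr in zip(original, restored) for a, b in zip(ro, rr)]
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Two output channels with very different ranges (arbitrary illustration data)
weights = [
    [0.010, -0.020, 0.015, -0.005],  # small-magnitude channel
    [2.000, -1.500, 0.750, -2.500],  # large-magnitude channel
]

# Per-tensor: one scale derived from the global maximum
tensor_scale = max(abs(w) for row in weights for w in row) / 127.0
per_tensor = [quant_dequant(row, tensor_scale) for row in weights]

# Per-channel: one scale per row (output channel)
per_channel = [quant_dequant(row, max(abs(w) for w in row) / 127.0) for row in weights]

err_tensor = mean_abs_error(weights, per_tensor)
err_channel = mean_abs_error(weights, per_channel)
print(f"per-tensor error:  {err_tensor:.6f}")
print(f"per-channel error: {err_channel:.6f}")
```

With a single scale, the small-magnitude channel collapses onto one or two integer levels; giving it its own scale restores its resolution at the cost of storing one extra scale per channel.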
NCNN Mobile Optimization
What is NCNN
NCNN is Tencent's open-source neural network inference framework for mobile devices, optimized for ARM CPUs and Vulkan-capable GPUs. TensorRT runs only on NVIDIA GPUs and ONNX Runtime is a general-purpose cross-platform runtime; NCNN instead targets ARM CPUs and mobile GPUs such as Adreno, where it typically holds a performance advantage.
Vulkan GPU Acceleration
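```bash
# Convert ONNX to NCNN format
onnx2ncnn yolo26n.onnx yolo26n.param yolo26n.bin

# Optimize the graph (operator fusion) and choose the weight storage type
ncnnoptimize yolo26n.param yolo26n.bin yolo26n_opt.param yolo26n_opt.bin 1
# Last parameter: 0 = FP32 storage, 1 = FP16 storage
# Vulkan is enabled at runtime, not here: set net.opt.use_vulkan_compute = true before loading the model
```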
FP16 Storage
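NCNN can store weights in FP16, roughly halving model size relative to FP32. INT8 quantization compresses the model further; the commands below build a calibration table and then convert the model:

```bash
# INT8 post-training quantization
# Step 1: build the calibration table from representative images.
# The option syntax of ncnn2table has changed between ncnn releases; check the usage text of your build.
# norm values are multipliers, so scaling pixels to [0, 1] is written as 1/255 ≈ 0.0039216.
ncnn2table --param=yolo26n.param --bin=yolo26n.bin \
  --input=calibration_images --output=table.bfp32 \
  --mean=0,0,0 --norm=0.0039216,0.0039216,0.0039216 --size=640,640

# Step 2: convert the model to INT8 using the table
ncnn2int8 yolo26n.param yolo26n.bin yolo26n_int8.param \
  yolo26n_int8.bin table.bfp32
```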
Operator Fusion in NCNN
NCNN automatically performs the following operator fusion optimizations:
- Conv + BN + ReLU → ConvReLU
- Conv + BN → Conv
- Conv + ReLU → ConvReLU
- Adjacent 1×1 Conv merging
No manual intervention is needed: ncnnoptimize applies these fusions automatically.
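The Conv + BN fusion listed above works because batch normalization at inference time is an affine transform that can be folded into the convolution's weight and bias. The sketch below checks the folding formula on a scalar stand-in for one channel; the parameter values are arbitrary, and this shows the underlying math rather than NCNN's implementation.

```python
import math


def fold_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (w*x + b - mean) / sqrt(var + eps) + beta into y = w'*x + b'."""
    factor = gamma / math.sqrt(var + eps)
    return weight * factor, (bias - mean) * factor + beta


def conv_then_bn(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Reference path: a scalar 'convolution' followed by batch norm."""
    y = weight * x + bias
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


# Arbitrary illustration values for one channel
params = dict(weight=0.8, bias=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=0.04)
w_folded, b_folded = fold_bn(**params)

for x in (-1.0, 0.0, 2.5):
    reference = conv_then_bn(x, **params)
    fused = w_folded * x + b_folded
    print(x, reference, fused)
```

The fused layer produces the same output with one multiply-add instead of two operators, which removes a memory round trip per layer at inference time.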
ARM CPU Optimization (Assembly Kernels)
NCNN provides deeply optimized assembly kernels for ARM CPUs:
| CPU Architecture | Instruction Set | Speedup | Target Devices |
|---|---|---|---|
| ARMv7 | NEON | 2-3× | Older phones |
| ARMv8 | NEON | 3-4× | Mainstream Android |
| ARMv8.2 | NEON + FP16 / dot product | 4-6× | Flagship phones, Apple M series |
| ARMv9 | SVE2 | 5-8× | Latest flagships |
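```cpp
// C++ integration example
#include "net.h"

// image_data: BGR pixels (e.g. cv::Mat::data); w, h: source image size
int detect(const unsigned char* image_data, int w, int h, ncnn::Mat& out)
{
    ncnn::Net yolo;
    if (yolo.load_param("yolo26n_opt.param") != 0 || yolo.load_model("yolo26n_opt.bin") != 0)
        return -1;  // both loaders return 0 on success

    // Resize to 640x640 and convert BGR to the RGB order the model expects
    ncnn::Mat in = ncnn::Mat::from_pixels_resize(
        image_data, ncnn::Mat::PIXEL_BGR2RGB,
        w, h, 640, 640
    );
    // Scale pixel values from [0, 255] to [0, 1]
    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};
    in.substract_mean_normalize(0, norm_vals);

    ncnn::Extractor ex = yolo.create_extractor();
    ex.input("images", in);  // blob names depend on how the model was exported
    return ex.extract("output", out);
}
```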
Accuracy Improvement Techniques
Multi-scale Training and Testing
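```python
from ultralytics import YOLO

model = YOLO("yolo26n.pt")

# Multi-scale during training
model.train(
    data="data.yaml",
    imgsz=640,
    multi_scale=True,  # Randomly varies the training image size around imgsz (about ±50%)
)

# Test-time augmentation (TTA) during inference
results = model(
    "test.jpg",
    imgsz=640,     # predict takes one size (int or [h, w]), not a list of scales
    augment=True,  # TTA: merges predictions from scaled and flipped copies, where the model supports it
)
```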
Class Imbalance Handling
Method 1: Loss Weight Adjustment
```python
model.train(
    data="data.yaml",
    box=7.5,  # Box regression loss gain
    cls=0.5,  # Classification loss gain (global, not per class); raise it to penalize classification errors more
    dfl=1.5,  # Distribution focal loss gain
)
```
Method 2: Focal Loss Integration
YOLO uses BCE loss by default. For extreme class imbalance, replace it with Focal Loss:
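```python
import torch
import torch.nn.functional as F


class FocalLoss(torch.nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, pred, target):
        # pred: raw logits; target: 0/1 (or soft) labels of the same shape
        ce_loss = F.binary_cross_entropy_with_logits(
            pred, target, reduction="none"
        )
        pt = torch.exp(-ce_loss)  # probability assigned to the true label
        # alpha weights positives, (1 - alpha) weights negatives
        alpha_t = self.alpha * target + (1 - self.alpha) * (1 - target)
        focal_loss = alpha_t * (1 - pt) ** self.gamma * ce_loss
        return focal_loss.mean()


# To use it in training, subclass the detection loss in ultralytics.utils.loss
# (v8DetectionLoss in recent releases; verify for your version) and replace its
# BCE classification term. That loss applies its own reduction, so return the
# unreduced tensor there instead of the mean.
```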
Method 3: Automated Class Weight Computation
```python
from pathlib import Path

import numpy as np


def compute_class_weights(labels_path, num_classes=80):
    """Compute class weights based on annotation frequency"""
    class_counts = np.zeros(num_classes)
    # Count bounding boxes per class
    for label_file in Path(labels_path).glob("*.txt"):
        # Rows are [class_id, x, y, w, h]; ndmin=2 keeps single-box files 2-D
        boxes = np.loadtxt(label_file, ndmin=2)
        if boxes.size == 0:
            continue  # background image with an empty label file
        classes = boxes[:, 0].astype(int)
        np.add.at(class_counts, classes, 1)

    # Median Frequency Balancing
    median_freq = np.median(class_counts[class_counts > 0])
    class_weights = median_freq / (class_counts + 1e-6)
    class_weights = np.clip(class_weights, 0.1, 10.0)  # Clamp range
    return class_weights


# weights = compute_class_weights("datasets/coco/labels/train/")
# Pass weights into loss function
```
Method 4: Resampling Strategy
```python
from collections import Counter
from pathlib import Path

import numpy as np
from torch.utils.data import WeightedRandomSampler


# Resampling: draw images that contain rare classes more often,
# using torch's WeightedRandomSampler
def create_balanced_sampler(labels_path, num_classes=80):
    # Sorted so the weight order is deterministic; it must match the dataset's image order
    label_files = sorted(Path(labels_path).glob("*.txt"))
    per_file_classes = []
    class_counts = Counter()
    for label_file in label_files:
        boxes = np.loadtxt(label_file, ndmin=2)
        classes = boxes[:, 0].astype(int) if boxes.size else np.array([], dtype=int)
        per_file_classes.append(classes)
        class_counts.update(classes.tolist())

    # Class weight = 1 / class frequency (0 for classes that never occur)
    weights = [
        1.0 / class_counts[c] if class_counts[c] else 0.0
        for c in range(num_classes)
    ]
    # Images without labels get the smallest class weight so they are not oversampled
    background_weight = min((w for w in weights if w > 0), default=1.0)

    # Per-image weight = average weight of all its annotation classes
    sample_weights = []
    for classes in per_file_classes:
        if classes.size == 0:
            sample_weights.append(background_weight)
        else:
            sample_weights.append(float(np.mean([weights[c] for c in classes])))

    return WeightedRandomSampler(
        sample_weights, len(sample_weights), replacement=True
    )
```
Method Comparison
| Method | Implementation Difficulty | Training Stability | Effect | Recommended Scenario |
|---|---|---|---|---|
| Loss Weight Adjustment | ★☆☆☆☆ | Stable | Fair | Mild imbalance |
| Focal Loss | ★★★☆☆ | Needs tuning | Excellent | Extreme imbalance |
| Class Weighting | ★★☆☆☆ | Relatively stable | Good | Long-tail distribution |
| Resampling | ★★☆☆☆ | Risk of overfitting | Good | Abundant data |
Ablation Study Methodology
What is an Ablation Study
An ablation study systematically removes a component or feature from the model to observe its impact on final performance. In YOLO optimization, ablation studies help answer:
- Is a particular optimization module truly effective?
- Do different optimizations have positive/negative interactions?
- What is the marginal gain of each module?
Ablation Study Design Principles
- Single-variable principle: Remove only one optimization at a time, keep everything else constant
- Reproducible baseline: All experiments start from the same baseline model
- Controlled variables: Keep training hyperparameters, data augmentation, and seeds consistent
- Quantified metrics: Report at least mAP, latency, and model size
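The single-variable principle translates naturally into a leave-one-out experiment plan. The sketch below generates one run with every optimization enabled plus one run per removed component, all sharing the same fixed settings. The component names and the shared hyperparameters are hypothetical placeholders.

```python
COMPONENTS = ["kd", "int8", "multi_scale", "tta", "pruning"]  # illustrative names
SHARED = {"seed": 0, "epochs": 100, "imgsz": 640}  # hypothetical fixed settings


def leave_one_out(components):
    """Yield (run_name, enabled_components): the full stack plus one run per removed component."""
    yield "full", list(components)
    for removed in components:
        yield f"no_{removed}", [c for c in components if c != removed]


runs = [
    {"name": name, "enabled": enabled, **SHARED}
    for name, enabled in leave_one_out(COMPONENTS)
]
for run in runs:
    print(run["name"], run["enabled"])
```

Because every run copies the same `SHARED` settings, any difference in the reported metrics can be attributed to the one component that was removed.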
YOLO Optimization Ablation Results
| Optimization Combo | mAP@50 | mAP@50:95 | Params (M) | Latency (ms) | Model Size (MB) |
|---|---|---|---|---|---|
| Baseline | 52.8 | 38.9 | 2.8 | 4.2 | 5.6 |
| + KD | 54.1 (+1.3) | 40.5 (+1.6) | 2.8 | 4.2 | 5.6 |
| + INT8 | 52.5 (-0.3) | 38.2 (-0.7) | 2.8 | 1.5 (2.8× faster) | 1.6 |
| + Multi-scale | 53.4 (+0.6) | 39.8 (+0.9) | 2.8 | 4.3 | 5.6 |
| + TTA | 54.6 (+1.8) | 41.2 (+2.3) | 2.8 | 12.6 (3× slower) | 5.6 |
| + Pruning | 52.2 (-0.6) | 38.0 (-0.9) | 1.9 | 3.4 (1.2× faster) | 4.0 |
| All Combined | 56.3 (+3.5) | 42.8 (+3.9) | 1.9 | 1.5 (2.8× faster) | 1.6 |
How to Interpret Ablation Results
- Positive-gain components: Knowledge distillation (+1.3 mAP@50) and multi-scale training (+0.6 mAP@50) improve accuracy with little or no latency cost
- Acceleration components: INT8 quantization costs 0.3 mAP@50 but cuts latency 2.8× (4.2 ms → 1.5 ms), making it essential for deployment
- Tradeoff components: TTA gives the largest single gain (+1.8 mAP@50) but triples latency (4.2 ms → 12.6 ms), so it suits accuracy-first scenarios
- Combination effects: All combined gives +3.5 mAP@50, more than the +2.8 sum of the individual deltas, which suggests a positive interaction between KD, quantization, and multi-scale
- Diminishing returns: After stacking four or more optimizations, each additional component typically contributes less
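The comparison between individual and combined gains can be reproduced directly from the ablation table. The short script below copies the mAP@50 column and computes each delta, their sum, and the residual attributed to interaction; the variable names are illustrative.

```python
# mAP@50 values copied from the ablation table above
BASELINE = 52.8
SINGLE = {
    "KD": 54.1,
    "INT8": 52.5,
    "Multi-scale": 53.4,
    "TTA": 54.6,
    "Pruning": 52.2,
}
ALL_COMBINED = 56.3

deltas = {name: round(score - BASELINE, 1) for name, score in SINGLE.items()}
sum_of_deltas = round(sum(deltas.values()), 1)
combined_delta = round(ALL_COMBINED - BASELINE, 1)
interaction = round(combined_delta - sum_of_deltas, 1)

for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {delta:+.1f}")
print(f"sum of individual deltas: {sum_of_deltas:+.1f}")
print(f"all combined:             {combined_delta:+.1f}")
print(f"interaction term:         {interaction:+.1f}")
```

A positive interaction term means the components reinforce each other; a negative one would indicate that they partly overlap or conflict.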
Profiling and Benchmarking
| Metric | Unit | Description | Measurement Method |
|---|---|---|---|
| Latency | ms | Single image inference time | Warmup + average 100 runs |
| Throughput | FPS | Images processed per second | Batch inference |
| Memory Usage | MB | GPU/CPU peak memory | nvidia-smi / psutil |
| Compute | GFLOPs | Floating point operations | thop / ptflops |
| Model Size | MB | Disk storage usage | os.path.getsize |
Latency Measurement: Warmup + Multiple Runs
```python
import time

import numpy as np
import torch


@torch.inference_mode()  # no autograd bookkeeping during timing
def benchmark_latency(model, input_tensor, num_warmup=50, num_runs=200):
    # Warmup
    for _ in range(num_warmup):
        _ = model(input_tensor)

    # Formal measurement
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make sure warmup kernels have finished

    timings = []
    for _ in range(num_runs):
        start = time.perf_counter()
        _ = model(input_tensor)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # wait for the GPU before stopping the clock
        timings.append((time.perf_counter() - start) * 1000)  # ms

    return {
        "mean": np.mean(timings),
        "std": np.std(timings),
        "p50": np.percentile(timings, 50),
        "p95": np.percentile(timings, 95),
        "p99": np.percentile(timings, 99),
    }
```
Throughput Measurement
```python
import time

import torch


@torch.inference_mode()
def benchmark_throughput(
    model, batch_size=32, input_size=(3, 640, 640), num_batches=100
):
    # Requires a CUDA device
    dummy_input = torch.randn(batch_size, *input_size).cuda()

    # Warmup
    for _ in range(10):
        _ = model(dummy_input)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(num_batches):
        _ = model(dummy_input)
    torch.cuda.synchronize()
    total_time = time.perf_counter() - start

    total_images = batch_size * num_batches
    throughput = total_images / total_time
    return {
        "throughput": throughput,
        "total_time": total_time,
        "batch_size": batch_size,
    }
```
Using Torch Profiler
```python
from torch.profiler import profile, record_function, ProfilerActivity

# `model` and `input_tensor` are the objects used in the benchmarks above
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    with record_function("model_inference"):
        output = model(input_tensor)

# Print time ranking
print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=20
))

# Export for Chrome Trace visualization
prof.export_chrome_trace("trace.json")
# View in chrome://tracing/
```
ONNX Runtime Profiling
```python
import numpy as np
import onnxruntime as ort

# Enable session profiling
sess_options = ort.SessionOptions()
sess_options.enable_profiling = True
sess = ort.InferenceSession("yolo26n.onnx", sess_options)

# Run inference on a dummy input (replace with a preprocessed image)
np_input = np.random.rand(1, 3, 640, 640).astype(np.float32)
outputs = sess.run(None, {"images": np_input})

# Stop profiling; returns the path of the JSON trace that was written
prof_file = sess.end_profiling()
print(prof_file)  # open this file in chrome://tracing/
```
NVIDIA Nsight Systems Analysis
```bash
# Install: https://developer.nvidia.com/nsight-systems
# Command-line profiling
nsys profile -o yolo_profile -t cuda,nvtx python inference.py

# Visualize results
# Open the generated report (yolo_profile.nsys-rep, or yolo_profile.qdrep in older releases) in the Nsight Systems GUI
```
Deployment Benchmark Comparison
| Solution | Latency (ms) | Throughput (FPS) | Memory (MB) | Model Size (MB) | Setup Difficulty |
|---|---|---|---|---|---|
| PyTorch (FP32) | 4.2 | 238 | 850 | 5.6 | Easiest |
| ONNX (FP32) | 3.1 | 322 | 620 | 5.5 | Easy |
| ONNX (INT8) | 2.0 | 500 | 480 | 1.5 | Medium |
| TensorRT (FP16) | 2.5 | 400 | 420 | 2.8 | Medium |
| TensorRT (INT8) | 1.5 | 666 | 380 | 1.6 | Harder |
| NCNN (FP16) | 8.5 (CPU) | 117 | 320 | 2.8 | Medium |
| NCNN (INT8) | 6.0 (CPU) | 166 | 280 | 1.0 | Harder |
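The throughput column appears to follow from the latency column at batch size 1, using FPS = 1000 / latency in milliseconds. The sketch below reproduces a few rows under that assumption; the helper name is illustrative, and batched inference on real hardware will generally not scale this linearly.

```python
def fps_from_latency(latency_ms, batch_size=1):
    """Images per second when one batch takes `latency_ms` milliseconds."""
    return batch_size * 1000.0 / latency_ms


# Latency values copied from the table above
for name, latency in [
    ("PyTorch (FP32)", 4.2),
    ("TensorRT (INT8)", 1.5),
    ("NCNN (INT8)", 6.0),
]:
    print(f"{name:16s} {fps_from_latency(latency):6.1f} FPS")
```

When comparing deployments, it helps to state whether throughput was measured directly with batching or derived from single-image latency, since the two can differ substantially.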
Version Optimization Strategy Comparison
| Optimization Direction | YOLOv8 | YOLO11 | YOLO26 |
|---|---|---|---|
| Architecture Optimization | C2f | C2f optimized | Brand new simplified architecture |
| Training Optimization | SGD | SGD | MuSGD |
| Loss Function | DFL+CIoU | DFL+CIoU | ProgLoss+STAL |
| Deployment Friendliness | Good | Good | Best (no NMS) |
| CPU Optimization | Baseline | +25% | +43% |