YOLO Advanced Optimization: Lightweight, Quantization and Accuracy

Model Lightweighting Strategies

Model Size Selection

| Model | Parameters (M) | mAP | CPU Inference | Use Cases |
|---|---|---|---|---|
| YOLO26n | 2.8 | 38.9 | Fastest | Edge devices, Embedded |
| YOLO26s | 9.4 | 48.2 | Very fast | Mobile, Web |
| YOLO26m | 21.8 | 53.1 | Medium | Server, High performance |
| YOLO11n | 2.6 | 39.6 | Fast | Lightweight deployment |
| YOLOv8n | 3.2 | 37.3 | Baseline | General purpose |

Knowledge Distillation

python
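from ultralytics import YOLO

# Large model as teacher, small model as student
teacher = YOLO("yolo26x.pt")
student = YOLO("yolo26n.yaml")

# Distillation training.
# NOTE: `distill` and `distill_ratio` are not documented train arguments in
# every Ultralytics release; confirm that your installed version (or fork)
# accepts them, because train() rejects unknown arguments.
student.train(
    data="data.yaml",
    distill="yolo26x.pt",  # Teacher model weights
    distill_ratio=0.5,     # Weight of the distillation loss term
)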
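The call above leaves the distillation term to the trainer. As a rough illustration of what such a term usually computes, here is a minimal NumPy sketch of the classic temperature-softened KL loss on classification logits. It is an assumption made for illustration, not the loss Ultralytics applies; the names `softmax`, `distillation_loss`, and `temperature` are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on softened distributions, scaled by T^2."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return float(kl.mean() * temperature ** 2)

teacher = np.array([[4.0, 1.0, 0.5]])
student_far = np.array([[0.5, 1.0, 4.0]])  # ranks the classes in the opposite order

loss_same = distillation_loss(teacher, teacher)      # identical outputs
loss_far = distillation_loss(student_far, teacher)   # disagreeing outputs
```

The loss vanishes when the student reproduces the teacher and grows as the two distributions diverge. A ratio such as 0.5 would then blend this term with the ordinary detection loss.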

Model Pruning

Structured vs Unstructured Pruning

| Type | Method | Sparsity Pattern | Hardware Acceleration | Compression Ratio |
|---|---|---|---|---|
| Unstructured | Weight pruning | Random sparse | Difficult (special HW needed) | High |
| Structured | Channel pruning | Regular sparse | Native acceleration | Medium |

Torch Prune Channel Pruning Example

python
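import torch
import torch.nn.utils.prune as prune
from ultralytics import YOLO

# L1 unstructured pruning on conv layers: zeroes the 30% smallest weights.
# Tensors stay dense, so file size and latency do not shrink by themselves.
model = YOLO("yolo26n.pt")
for name, module in model.model.named_modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # Make pruning permanent

# Channel pruning with the torch-pruning library (v1.x API)
# pip install torch-pruning
import torch_pruning as tp

model = YOLO("yolo26n.pt").model
for p in model.parameters():
    p.requires_grad_(True)  # Dependency tracing relies on autograd
example_inputs = torch.randn(1, 3, 640, 640)
DG = tp.DependencyGraph().build_dependency(model, example_inputs=example_inputs)

# Pick one Conv2d to prune. The layer index is illustrative and assumes
# model.model[3] is a Conv block that wraps a Conv2d in its `.conv` attribute.
layer = model.model[3].conv
# idxs lists the output channels to REMOVE: every 5th channel, about 20%
idxs = list(range(0, layer.out_channels, 5))
group = DG.get_pruning_group(layer, tp.prune_conv_out_channels, idxs=idxs)
if DG.check_pruning_group(group):  # Skip plans that would empty a layer
    group.prune()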
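The channel indices in the example above follow a fixed stride. A magnitude-based selection, which is what "by L1 norm" normally means, could be sketched as follows. The helper name `lowest_l1_channels` is hypothetical, and the weight tensor is random data standing in for a real convolution.

```python
import numpy as np

def lowest_l1_channels(weight, ratio=0.2):
    """Return sorted indices of the output channels with the smallest L1 norm.

    weight: conv weight of shape (out_channels, in_channels, kH, kW).
    ratio:  fraction of output channels to mark for removal.
    """
    out_channels = weight.shape[0]
    l1 = np.abs(weight).reshape(out_channels, -1).sum(axis=1)
    n_prune = int(out_channels * ratio)
    return sorted(np.argsort(l1)[:n_prune].tolist())

rng = np.random.default_rng(0)
w = rng.normal(size=(10, 4, 3, 3))
w[[2, 7]] *= 0.01  # make two filters clearly weaker than the rest

idxs = lowest_l1_channels(w, ratio=0.2)  # indices to hand to the pruning call
```

The returned list plays the role of `idxs` in the channel-pruning call, so the weakest filters are removed instead of an arbitrary every-fifth pattern.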

Pruning Ratio Guidelines

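| Model | Safe Ratio | Aggressive Ratio | mAP Drop (safe / aggressive) |
|---|---|---|---|
| YOLO26n | ≤20% | 20-40% | <1% / 2-5% |
| YOLO26s | ≤30% | 30-50% | <1% / 3-6% |
| YOLO26m | ≤40% | 40-60% | <1% / 3-8% |
| YOLOv8n | ≤20% | 20-35% | <1% / 2-4% |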

Model Quantization

Export Time Quantization

python
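from ultralytics import YOLO

model = YOLO("yolo26n.pt")

# INT8 quantization (requires calibration data)
model.export(
    format="engine",      # TensorRT
    int8=True,
    data="data.yaml",     # Calibration dataset
    batch=8,
)

# ONNX export with dynamic input shapes.
# Note: dynamic=True enables dynamic axes, not quantization. The exported
# model stays in floating point; quantize it afterwards with ONNX Runtime
# (see the ONNX quantization guide later in this article).
model.export(
    format="onnx",
    dynamic=True,
    simplify=True,
)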

TensorRT INT8 Calibration Step-by-Step

Calibration Dataset Preparation

INT8 quantization requires representative calibration data to determine activation value ranges:

python
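import os

import tensorrt as trt
import torch

class YOLOCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data_loader, cache_file="calibration.cache", batch_size=8):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.data_iter = iter(data_loader)  # Yields float image tensors (N, 3, H, W)
        self.cache_file = cache_file
        self.batch_size = batch_size
        self.device_batch = None  # Keeps the current GPU buffer alive

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.data_iter)
        except StopIteration:
            return None  # Signals the end of the calibration data
        # TensorRT expects device (GPU) pointers, one per network input
        self.device_batch = batch.to("cuda", dtype=torch.float32).contiguous()
        return [int(self.device_batch.data_ptr())]

    def read_calibration_cache(self):
        # Reuse a previous calibration run if a cache file exists
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)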

INT8 vs FP16 vs FP32 Comparison

| Precision | Storage Size | Inference Speed (GPU) | mAP Loss | Use Cases |
|---|---|---|---|---|
| FP32 | 100% | 1× (baseline) | 0% | Training, accuracy-first |
| FP16 | 50% | 1.5-2× | <0.5% | General inference |
| INT8 | 25% | 2-4× | 1-3% | Edge deployment, real-time |

Calibration Algorithm Selection

TensorRT ships several calibrators. The two most commonly used are:

  • Entropy (IInt8EntropyCalibrator2): Chooses each activation range by minimizing the KL divergence between the original and quantized distributions, which limits information loss. Recommended for most vision models, including YOLO.
  • Min-Max (IInt8MinMaxCalibrator): Uses the full observed minimum and maximum. Simple and fast, but a single outlier stretches the range and wastes resolution, so it suits activations without heavy tails.

Best practice: use the Entropy calibrator with 500-1000 calibration images at batch size 8-16, covering the scenarios expected in deployment.
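To make the outlier sensitivity concrete, here is a small NumPy sketch comparing a min-max range with a clipped range on synthetic activations. The 99.9th-percentile clip is only a stand-in for a range-clipping calibrator; it is not TensorRT's entropy algorithm, and all names and data here are hypothetical.

```python
import numpy as np

def quantize_dequantize(x, max_abs):
    """Symmetric INT8 fake-quantization over the range [-max_abs, max_abs]."""
    scale = max_abs / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=10_000)
acts[0] = 80.0  # a single outlier activation

minmax_range = np.abs(acts).max()                  # min-max: must cover the outlier
clipped_range = np.percentile(np.abs(acts), 99.9)  # clipped: ignores the extreme tail

# Measure the error on the bulk of the activations (everything but the outlier)
bulk = acts[1:]
mse_minmax = float(np.mean((bulk - quantize_dequantize(bulk, minmax_range)) ** 2))
mse_clipped = float(np.mean((bulk - quantize_dequantize(bulk, clipped_range)) ** 2))
```

The clipped range saturates the outlier itself, which is the price paid for much finer resolution on the values that make up almost all of the distribution.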

Latency vs Accuracy Tradeoff

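| Calibration Set Size | INT8 mAP | FP16 mAP | Latency (ms) | Speedup |
|---|---|---|---|---|
| 100 images | 51.2 | 52.8 | 3.2 | 3.1× |
| 500 images | 52.1 | 52.8 | 3.2 | 3.1× |
| 1000 images | 52.5 | 52.8 | 3.2 | 3.1× |
| 2000 images | 52.7 | 52.8 | 3.2 | 3.1× |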

ONNX Quantization Detailed Guide

Dynamic vs Static Quantization

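| Method | Calibration Data | Weight Precision | Activation Precision | Speedup | Use Cases |
|---|---|---|---|---|---|
| Dynamic | Not required | INT8 | Quantized on the fly (scale computed at runtime) | 1.5-2× | CPU, NLP models |
| Static | Required | INT8 | INT8 (scale fixed by calibration) | 2-3× | CPU, Vision models |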

QDQ (Quantize-Dequantize) Nodes

ONNX static quantization inserts QDQ nodes in the computation graph to simulate quantization effects:

Input → QuantizeLinear → DequantizeLinear → Conv → QuantizeLinear → DequantizeLinear → ... → Output
          (scale, zp)       (scale, zp)                (scale, zp)       (scale, zp)

ONNX Runtime automatically fuses QDQ nodes into efficient INT8 kernels during inference.
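What a QuantizeLinear/DequantizeLinear pair computes can be sketched in NumPy as below. This follows the int8 formulas from the ONNX operator definitions, but the function names and the sample values are illustrative assumptions, not ONNX Runtime code.

```python
import numpy as np

def quantize_linear(x, scale, zero_point=0):
    """int8 QuantizeLinear: saturate(round(x / scale) + zero_point)."""
    q = np.round(x / scale) + zero_point  # np.round rounds half to even
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_linear(q, scale, zero_point=0):
    """DequantizeLinear: (q - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, -0.26, 0.0, 0.4, 0.99], dtype=np.float32)
scale = np.float32(1.0 / 127.0)  # maps [-1, 1] onto the int8 range

q = quantize_linear(x, scale)
x_hat = dequantize_linear(q, scale)
max_err = float(np.abs(x - x_hat).max())  # round-trip error
```

For values inside the representable range, the round-trip error stays within half a quantization step, which is the effect the QDQ nodes simulate in the graph.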

ONNX Runtime Quantization Tools

python
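from pathlib import Path

import cv2
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, CalibrationMethod, QuantFormat, QuantType,
    quantize_dynamic, quantize_static,
)

# Dynamic quantization (no calibration data needed)
quantize_dynamic(
    "yolo26n.onnx",
    "yolo26n_dynamic.onnx",
    weight_type=QuantType.QInt8,
)

# Static quantization (requires calibration data)
class YOLOCalibDataReader(CalibrationDataReader):
    def __init__(self, image_paths, input_name="images", batch_size=1):
        # batch_size must match the exported model unless its batch axis is dynamic
        self.image_paths = list(image_paths)
        self.input_name = input_name
        self.batch_size = batch_size
        self.index = 0

    @staticmethod
    def preprocess(path, size=640):
        # Plain resize, BGR to RGB, HWC to CHW, scale to [0, 1].
        # Use letterboxing instead if your inference pipeline does.
        img = cv2.resize(cv2.imread(str(path)), (size, size))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        return img.transpose(2, 0, 1).astype(np.float32) / 255.0

    def get_next(self):
        if self.index >= len(self.image_paths):
            return None  # Tells the quantizer that calibration data is exhausted
        paths = self.image_paths[self.index:self.index + self.batch_size]
        self.index += self.batch_size
        return {self.input_name: np.stack([self.preprocess(p) for p in paths])}

# Representative images; the directory name is an example
calibration_images = sorted(Path("calibration_images").glob("*.jpg"))

quantize_static(
    "yolo26n.onnx",
    "yolo26n_static.onnx",
    calibration_data_reader=YOLOCalibDataReader(calibration_images),
    quant_format=QuantFormat.QDQ,  # QDQ format, supports more optimizations
    per_channel=True,              # Per-channel gives higher accuracy
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    calibrate_method=CalibrationMethod.Entropy,
)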

Per-Tensor vs Per-Channel Quantization

| Method | Granularity | Accuracy | Compute Overhead | Recommended For |
|---|---|---|---|---|
| Per-Tensor | One scale per weight tensor | Lower | None | Small models, quick deployment |
| Per-Channel | One scale per output channel | Higher | Slight | Accuracy-sensitive, YOLO detection models |

For YOLO models, per-channel quantization is strongly recommended due to high channel variability in the detection head.
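The effect of channel variability can be seen in a toy NumPy example with two output channels of very different magnitude. The weights are synthetic and the helper name `fake_quant` is hypothetical; this is a sketch of the idea, not a measurement on a YOLO head.

```python
import numpy as np

def fake_quant(w, scale):
    """Symmetric INT8 quantize-dequantize with the given scale(s)."""
    return np.clip(np.round(w / scale), -127, 127) * scale

rng = np.random.default_rng(0)
# Row 0: large weights, row 1: weights about 100x smaller
w = np.stack([rng.normal(0, 1.0, 64), rng.normal(0, 0.01, 64)])

per_tensor_scale = np.abs(w).max() / 127.0                        # one scale for everything
per_channel_scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per row

# Mean absolute error per channel under each scheme
err_tensor = np.abs(w - fake_quant(w, per_tensor_scale)).mean(axis=1)
err_channel = np.abs(w - fake_quant(w, per_channel_scale)).mean(axis=1)
```

With a single scale, the large channel dictates the step size and the small channel collapses onto a handful of levels. Per-channel scales leave the large channel unchanged and recover resolution for the small one.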

NCNN Mobile Optimization

What is NCNN

NCNN is Tencent’s open-source neural network inference framework, optimized for mobile CPUs and Vulkan-capable GPUs. TensorRT runs only on NVIDIA GPUs and ONNX Runtime is a general cross-platform runtime; NCNN instead targets ARM CPUs and mobile GPUs such as Adreno, where its hand-tuned kernels typically give it an edge.

Vulkan GPU Acceleration

bash
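# Convert ONNX to NCNN format
onnx2ncnn yolo26n.onnx yolo26n.param yolo26n.bin

# Optimize the graph (operator fusion) and choose the weight storage type
ncnnoptimize yolo26n.param yolo26n.bin yolo26n_opt.param yolo26n_opt.bin 1
# Last parameter: 0=FP32 storage, 1 (or 65536)=FP16 storage
# Vulkan is switched on at runtime (net.opt.use_vulkan_compute = true), not by this tool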

FP16 Storage

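NCNN supports FP16 weight storage, which halves model size relative to FP32. INT8 post-training quantization shrinks the model further and needs a calibration table:

bash
# INT8 quantization: build a calibration table, then convert the model.
# Argument syntax differs between ncnn releases; check the tool's usage output.
# norm values are multipliers, so 1/255 scales pixels to [0, 1].
ncnn2table --param=yolo26n.param --bin=yolo26n.bin \
    --input=calibration_images --output=table.bfp32 \
    --mean=0,0,0 --norm=0.00392,0.00392,0.00392 --size=640,640

ncnn2int8 yolo26n.param yolo26n.bin yolo26n_int8.param \
    yolo26n_int8.bin table.bfp32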

Operator Fusion in NCNN

NCNN automatically performs the following operator fusion optimizations:

  • Conv + BN + ReLU → ConvReLU
  • Conv + BN → Conv
  • Conv + ReLU → ConvReLU
  • Adjacent 1×1 Conv merging

No manual intervention needed — ncnnoptimize handles it automatically.
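The Conv + BN → Conv fusion listed above amounts to folding the BatchNorm statistics into the convolution's weights and bias. A minimal NumPy sketch follows, using a 1×1 convolution written as a matrix product. The function name `fold_bn` and all values are hypothetical; this illustrates the arithmetic, not NCNN's implementation.

```python
import numpy as np

def fold_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-mode BatchNorm into the preceding conv.

    weight: (out_channels, in_channels) for a 1x1 conv.
    bias, gamma, beta, mean, var: (out_channels,).
    """
    factor = gamma / np.sqrt(var + eps)
    return weight * factor[:, None], (bias - mean) * factor + beta

rng = np.random.default_rng(0)
c_in, c_out = 4, 3
weight, bias = rng.normal(size=(c_out, c_in)), rng.normal(size=c_out)
gamma, beta = rng.normal(size=c_out), rng.normal(size=c_out)
mean, var = rng.normal(size=c_out), rng.uniform(0.5, 2.0, size=c_out)
x = rng.normal(size=(c_in, 10))  # 10 pixels, each with c_in channels

# Reference: conv followed by BatchNorm
y = weight @ x + bias[:, None]
reference = gamma[:, None] * (y - mean[:, None]) / np.sqrt(var[:, None] + 1e-5) + beta[:, None]

# Fused: a single conv with folded parameters
w_f, b_f = fold_bn(weight, bias, gamma, beta, mean, var)
fused = w_f @ x + b_f[:, None]
```

Both paths produce the same output, so the BatchNorm layer can be dropped at inference time at no accuracy cost.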

ARM CPU Optimization (Assembly Kernels)

NCNN provides deeply optimized assembly kernels for ARM CPUs:

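| CPU Architecture | Instruction Set | Speedup | Target Devices |
|---|---|---|---|
| ARMv7 | NEON | 2-3× | Older phones |
| ARMv8 | NEON | 3-4× | Mainstream Android |
| ARMv8.2 | NEON + FP16 / dot product | 4-6× | Flagship phones, Apple M series |
| ARMv9 | SVE2 | 5-8× | Latest flagships |

cpp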
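// C++ integration example
#include "net.h"

ncnn::Net yolo;
yolo.opt.use_vulkan_compute = true;  // Optional: Vulkan GPU, must be set before loading
yolo.load_param("yolo26n_opt.param");
yolo.load_model("yolo26n_opt.bin");

// image_data: BGR pixel buffer of size w x h (for example cv::Mat::data).
// Convert to RGB and resize to the network input size.
ncnn::Mat in = ncnn::Mat::from_pixels_resize(
    image_data, ncnn::Mat::PIXEL_BGR2RGB,
    w, h, 640, 640
);

// Scale pixels to [0, 1] to match the training preprocessing
const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};
in.substract_mean_normalize(nullptr, norm_vals);

ncnn::Extractor ex = yolo.create_extractor();
ex.input("images", in);  // Blob names depend on the export; check the .param file
ncnn::Mat out;
ex.extract("output", out);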

Accuracy Improvement Techniques

Multi-scale Training and Testing

python
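from ultralytics import YOLO

model = YOLO("yolo26n.pt")

# Multi-scale during training
model.train(
    data="data.yaml",
    imgsz=640,
    multi_scale=True,  # Vary the input size by up to ±50% of imgsz
)

# Test-time augmentation (TTA) during inference
results = model(
    "test.jpg",
    imgsz=640,     # One size only; imgsz takes an int or (height, width)
    augment=True,  # TTA merges predictions from scaled and flipped inputs,
                   # for model heads that support it
)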

Class Imbalance Handling

Method 1: Loss Weight Adjustment

python
model.train(
    data="data.yaml",
    box=7.5,    # Box regression loss gain
    cls=0.5,    # Classification loss gain (global, not per class); raise it to emphasize classification
    dfl=1.5,    # Distribution focal loss gain
)

Method 2: Focal Loss Integration

YOLO uses BCE loss by default. For extreme class imbalance, replace it with Focal Loss:

python
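import torch
import torch.nn.functional as F

class FocalLoss(torch.nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0, reduction="mean"):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduction = reduction

    def forward(self, pred, target):
        # pred: raw logits; target: labels in [0, 1] with the same shape
        ce_loss = F.binary_cross_entropy_with_logits(
            pred, target, reduction="none"
        )
        pt = torch.exp(-ce_loss)  # Probability assigned to the true label
        # alpha weights positives, (1 - alpha) weights negatives
        alpha_t = self.alpha * target + (1 - self.alpha) * (1 - target)
        focal_loss = alpha_t * (1 - pt) ** self.gamma * ce_loss
        return focal_loss.mean() if self.reduction == "mean" else focal_loss

# To use it in Ultralytics, subclass the detection loss and swap its BCE
# classification term for FocalLoss(reduction="none"). In 8.x releases the
# class is v8DetectionLoss and the attribute is `bce`; names may differ
# in other versions.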
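How strongly the modulating factor (1 - p_t)^γ shifts attention toward hard examples can be checked with plain arithmetic. The sketch below compares one easy and one hard prediction; the probabilities 0.9 and 0.1 are made-up example values.

```python
import math

def bce_term(p_t):
    """Plain cross-entropy for a prediction with probability p_t on the true label."""
    return -math.log(p_t)

def focal_term(p_t, alpha_t=0.25, gamma=2.0):
    """Focal loss for the same prediction."""
    return alpha_t * (1.0 - p_t) ** gamma * bce_term(p_t)

easy, hard = 0.9, 0.1  # probability assigned to the correct label

ratio_bce = bce_term(hard) / bce_term(easy)        # hard vs easy under BCE
ratio_focal = focal_term(hard) / focal_term(easy)  # hard vs easy under focal loss
```

With γ = 2, the hard example's share of the loss relative to the easy one grows by a further factor of (0.9 / 0.1)² = 81 compared with plain BCE, which is why abundant easy negatives stop dominating the gradient.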

Method 3: Automated Class Weight Computation

python
from pathlib import Path

import numpy as np

def compute_class_weights(labels_path, num_classes=80):
    """Compute class weights based on annotation frequency"""
    class_counts = np.zeros(num_classes)
    # Count bounding boxes per class
    for label_file in Path(labels_path).glob("*.txt"):
        if label_file.stat().st_size == 0:
            continue  # Background image without annotations
        # Rows of [class_id, x, y, w, h]; ndmin=2 keeps single-box files 2-D
        boxes = np.loadtxt(label_file, ndmin=2)
        classes = boxes[:, 0].astype(int)
        class_counts += np.bincount(classes, minlength=num_classes)[:num_classes]

    # Median Frequency Balancing
    median_freq = np.median(class_counts[class_counts > 0])
    class_weights = median_freq / (class_counts + 1e-6)
    class_weights = np.clip(class_weights, 0.1, 10.0)  # Clamp range
    return class_weights

# weights = compute_class_weights("datasets/coco/labels/train/")
# Pass weights into loss function
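A worked example makes the median-frequency formula easier to follow. The box counts below are hypothetical, chosen so that the numbers are easy to verify by hand.

```python
import numpy as np

class_counts = np.array([1000.0, 100.0, 10.0, 0.0])  # hypothetical boxes per class

median_freq = np.median(class_counts[class_counts > 0])  # median over classes that occur
raw_weights = median_freq / (class_counts + 1e-6)
class_weights = np.clip(raw_weights, 0.1, 10.0)          # same clamp as the function above
```

The median count is 100, so the frequent class is down-weighted to 0.1, the median class keeps weight 1, and the rare class rises to 10. A class that never occurs also lands on the upper clamp of 10, which is worth remembering when the dataset has unused class ids.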

Method 4: Resampling Strategy

python
# Resampling with torch's WeightedRandomSampler.
# The sampler yields dataset indices, so label files must be visited in the
# same order as the dataset's images (sorted here).

from collections import Counter
from pathlib import Path

import numpy as np
from torch.utils.data import WeightedRandomSampler

def load_classes(label_file):
    """Return the class ids in one YOLO label file (empty array if none)."""
    if label_file.stat().st_size == 0:
        return np.array([], dtype=int)
    return np.loadtxt(label_file, ndmin=2)[:, 0].astype(int)

def create_balanced_sampler(labels_path, num_classes=80):
    label_files = sorted(Path(labels_path).glob("*.txt"))
    per_file = [load_classes(f) for f in label_files]

    class_counts = Counter()
    for classes in per_file:
        class_counts.update(classes.tolist())

    # Class weight = 1 / class frequency (0 for classes that never occur)
    weights = [
        1.0 / class_counts[c] if class_counts[c] else 0.0
        for c in range(num_classes)
    ]

    # Per-image weight = average weight of all its annotation classes
    sample_weights = [
        float(np.mean([weights[c] for c in classes])) if len(classes) else None
        for classes in per_file
    ]
    # Background images get the average weight of the labeled images
    labeled = [w for w in sample_weights if w is not None]
    fallback = float(np.mean(labeled)) if labeled else 1.0
    sample_weights = [fallback if w is None else w for w in sample_weights]

    return WeightedRandomSampler(
        sample_weights, len(sample_weights), replacement=True
    )

Method Comparison

| Method | Implementation Difficulty | Training Stability | Effect | Recommended Scenario |
|---|---|---|---|---|
| Loss Weight Adjustment | ★☆☆☆☆ | Stable | Fair | Mild imbalance |
| Focal Loss | ★★★☆☆ | Needs tuning | Excellent | Extreme imbalance |
| Class Weighting | ★★☆☆☆ | Relatively stable | Good | Long-tail distribution |
| Resampling | ★★☆☆☆ | Risk of overfitting | Good | Abundant data |

Ablation Study Methodology

What is an Ablation Study

An ablation study systematically removes a component or feature from the model to observe its impact on final performance. In YOLO optimization, ablation studies help answer:

  • Is a particular optimization module truly effective?
  • Do different optimizations have positive/negative interactions?
  • What is the marginal gain of each module?

Ablation Study Design Principles

  1. Single-variable principle: Add or remove only one optimization at a time and keep everything else constant
  2. Reproducible baseline: All experiments start from the same baseline model
  3. Controlled variables: Keep training hyperparameters, data augmentation, and seeds consistent
  4. Quantified metrics: Report at least mAP, latency, and model size
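These principles translate directly into how the experiment list is generated. The sketch below builds one run per toggled optimization from a shared baseline. The flag names and the fixed settings are hypothetical placeholders, not Ultralytics arguments.

```python
# Optimizations under study, all off in the baseline (hypothetical flag names)
BASELINE = {"kd": False, "int8": False, "multi_scale": False, "tta": False, "pruning": False}
# Settings held constant across every run (hypothetical values)
FIXED = {"seed": 0, "epochs": 100, "imgsz": 640}

def ablation_plan(baseline, fixed):
    """Yield (name, config) pairs: the baseline, then one flag flipped per run."""
    yield "baseline", {**fixed, **baseline}
    for key in baseline:
        yield f"+{key}", {**fixed, **baseline, key: not baseline[key]}

plan = list(ablation_plan(BASELINE, FIXED))
names = [name for name, _ in plan]
```

Each generated config differs from the baseline in exactly one flag while the seed and training settings stay fixed, so any metric change can be attributed to that flag.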

YOLO Optimization Ablation Results

| Optimization Combo | mAP@50 | mAP@50:95 | Params (M) | Latency (ms) | Model Size (MB) |
|---|---|---|---|---|---|
| Baseline | 52.8 | 38.9 | 2.8 | 4.2 | 5.6 |
| + KD | 54.1 (+1.3) | 40.5 (+1.6) | 2.8 | 4.2 | 5.6 |
| + INT8 | 52.5 (-0.3) | 38.2 (-0.7) | 2.8 | 1.5 (2.8×) | 1.6 |
| + Multi-scale | 53.4 (+0.6) | 39.8 (+0.9) | 2.8 | 4.3 | 5.6 |
| + TTA | 54.6 (+1.8) | 41.2 (+2.3) | 2.8 | 12.6 (3×) | 5.6 |
| + Pruning | 52.2 (-0.6) | 38.0 (-0.9) | 1.9 | 3.4 (1.2×) | 4.0 |
| All Combined | 56.3 (+3.5) | 42.8 (+3.9) | 1.9 | 1.5 (2.8×) | 1.6 |

How to Interpret Ablation Results

  1. Positive-gain components: Knowledge distillation (+1.3 mAP@50) and multi-scale training (+0.6 mAP@50) are clear accuracy boosters
  2. Acceleration components: INT8 quantization costs a little accuracy (-0.3 mAP@50) but cuts latency by 2.8×, making it essential for deployment
  3. Tradeoff components: TTA gives the biggest single improvement (+1.8 mAP@50) but triples latency, so it suits accuracy-first scenarios
  4. Combination effects: All combined gives +3.5 mAP@50, more than the sum of the individual deltas, which suggests positive interaction between KD, quantization, and multi-scale
  5. Diminishing returns: After stacking 4+ optimizations, each new component’s marginal gain decreases
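The combination claim in point 4 can be checked with a few lines of arithmetic on the mAP@50 deltas from the ablation table. The variable names are illustrative.

```python
# mAP@50 deltas relative to the baseline, copied from the ablation table
individual = {"KD": 1.3, "INT8": -0.3, "Multi-scale": 0.6, "TTA": 1.8, "Pruning": -0.6}
combined = 3.5

# What the combined gain would be if the effects were purely additive
additive_estimate = round(sum(individual.values()), 1)
# Remainder attributable to interaction between the components
interaction = round(combined - additive_estimate, 1)
```

The additive estimate comes to +2.8, so about +0.7 mAP@50 of the combined gain is not explained by the individual components alone.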

Profiling and Benchmarking

Key Performance Metrics

| Metric | Unit | Description | Measurement Method |
|---|---|---|---|
| Latency | ms | Single image inference time | Warmup + average of 100 runs |
| Throughput | FPS | Images processed per second | Batch inference |
| Memory Usage | MB | GPU/CPU peak memory | nvidia-smi / psutil |
| Compute | GFLOPs | Floating point operations | thop / ptflops |
| Model Size | MB | Disk storage usage | os.path.getsize |

Latency Measurement: Warmup + Multiple Runs

python
import time
import numpy as np
import torch

def benchmark_latency(model, input_tensor, num_warmup=50, num_runs=200):
    # Warmup
    for _ in range(num_warmup):
        _ = model(input_tensor)

    # Formal measurement
    if torch.cuda.is_available():
        torch.cuda.synchronize()

    timings = []
    for _ in range(num_runs):
        start = time.perf_counter()
        _ = model(input_tensor)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        timings.append((time.perf_counter() - start) * 1000)  # ms

    return {
        "mean": np.mean(timings),
        "std": np.std(timings),
        "p50": np.percentile(timings, 50),
        "p95": np.percentile(timings, 95),
        "p99": np.percentile(timings, 99),
    }

Throughput Measurement

python
def benchmark_throughput(
    model, batch_size=32, input_size=(3, 640, 640), num_batches=100
):
    dummy_input = torch.randn(batch_size, *input_size).cuda()

    # Warmup
    for _ in range(10):
        _ = model(dummy_input)

    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(num_batches):
        _ = model(dummy_input)
    torch.cuda.synchronize()
    total_time = time.perf_counter() - start

    total_images = batch_size * num_batches
    throughput = total_images / total_time
    return {
        "throughput": throughput,
        "total_time": total_time,
        "batch_size": batch_size,
    }

Using Torch Profiler

python
from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    with record_function("model_inference"):
        output = model(input_tensor)

# Print time ranking
print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=20
))

# Export for Chrome Trace visualization
prof.export_chrome_trace("trace.json")
# View in chrome://tracing/

ONNX Runtime Profiling

python
import onnxruntime as ort

# Enable session profiling
sess_options = ort.SessionOptions()
sess_options.enable_profiling = True
sess = ort.InferenceSession("yolo26n.onnx", sess_options)

# Run inference
outputs = sess.run(None, {"images": np_input})

# Get profiling data
prof_file = sess.end_profiling()
# Outputs profiling_*.json file, viewable in chrome://tracing/

NVIDIA Nsight Systems Analysis

bash
# Install: https://developer.nvidia.com/nsight-systems
# Command-line profiling
nsys profile -o yolo_profile -t cuda,nvtx python inference.py

# Visualize results
# Open yolo_profile.nsys-rep (yolo_profile.qdrep in older releases) in the Nsight Systems GUI

Deployment Benchmark Comparison

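| Solution | Latency (ms) | Throughput (FPS) | Memory (MB) | Model Size (MB) | Setup Difficulty |
|---|---|---|---|---|---|
| PyTorch (FP32) | 4.2 | 238 | 850 | 5.6 | Easiest |
| ONNX (FP32) | 3.1 | 322 | 620 | 5.5 | Easy |
| ONNX (INT8) | 2.0 | 500 | 480 | 1.5 | Medium |
| TensorRT (FP16) | 2.5 | 400 | 420 | 2.8 | Medium |
| TensorRT (INT8) | 1.5 | 666 | 380 | 1.6 | Harder |
| NCNN (FP16) | 8.5 (CPU) | 117 | 320 | 2.8 | Medium |
| NCNN (INT8) | 6.0 (CPU) | 166 | 280 | 1.0 | Harder |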
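The throughput column in this table appears to follow directly from the latency column as 1000 / latency, truncated to an integer. A short check, with the values copied from the table:

```python
# Latency in milliseconds, copied from the deployment table
latency_ms = {
    "PyTorch (FP32)": 4.2,
    "ONNX (FP32)": 3.1,
    "ONNX (INT8)": 2.0,
    "TensorRT (FP16)": 2.5,
    "TensorRT (INT8)": 1.5,
    "NCNN (FP16)": 8.5,
    "NCNN (INT8)": 6.0,
}

# Single-stream throughput: one image per latency interval, truncated to an integer
fps = {name: int(1000 / ms) for name, ms in latency_ms.items()}
```

These are therefore batch-1 figures. Batched inference, as measured by the throughput benchmark earlier, can deliver higher FPS than 1000 / latency on hardware that parallelizes well.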

Version Optimization Strategy Comparison

| Optimization Direction | YOLOv8 | YOLO11 | YOLO26 |
|---|---|---|---|
| Architecture Optimization | C2f | C2f optimized | Brand new simplified architecture |
| Training Optimization | SGD | SGD | MuSGD |
| Loss Function | DFL+CIoU | DFL+CIoU | ProgLoss+STAL |
| Deployment Friendliness | Good | Good | Best (no NMS) |
| CPU Optimization | Baseline | +25% | +43% |