VictoriaMetrics Stream Aggregation: Three-Year Review and Current Status (2026)

Introduction

It’s been exactly three years since the previous article Applying VictoriaMetrics Stream Aggregation for Metrics was published in March 2023. In these three years, the VictoriaMetrics ecosystem has undergone tremendous changes—let’s revisit the issues raised in that blog post, see what the official project has resolved, and where our stream-metrics-route project stands today.


I. Problems We Encountered Three Years Ago

Let’s quickly recap the core issue list from the 2023 blog post:

| # | Problem | 2023 Status |
|---|---------|-------------|
| P1 | Collection gap issue | Network jitter or performance issues cause time gaps, which inflate the differences computed by stream aggregation |
| P2 | Single-point compute limits for massive data | Stream aggregation keeps no historical state and performs very well, but a single instance is still a bottleneck |
| P3 | Distributed task allocation | Which compute node should data be assigned to? |
| P4 | Out-of-order discarding for same-dimension metrics | When several nodes compute the same-dimension metric with different time windows, later values are discarded |
| P5 | Resource balancing | Resource balancing in distributed computing |
| P6 | Task ID dimension explosion | Stream aggregation inserts node IDs into each aggregated time series, so dimension labels grow with horizontal scaling |

To address these issues, we developed stream-metrics-route, a Go-based distributed stream aggregation gateway.


II. Three Years Later, How Has the Official Project Done?

I reviewed VictoriaMetrics changelogs from v1.86 to v1.138.0 and the official documentation. Let’s take stock of the official project’s efforts over these three years:

2.1 Perfectly Resolved ✅

Issues P3, P5: Distributed Task Allocation & Resource Balancing

Official solution: vmagent now natively supports -remoteWrite.shardByURL with consistent hashing sharding!

Native support for shardByURL was introduced starting from v1.86. v1.138.0 (2026-03) went further and upgraded the data distribution algorithm from round-robin to consistent hashing, which significantly reduces the share of data that is redistributed when nodes are added or removed (see the Changelog).
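
To make the "less redistribution on node changes" point concrete, here is a minimal Go sketch that compares plain hash-modulo placement with a small consistent-hash ring when a cluster grows from 3 to 4 shards. It is not vmagent's implementation: the ring with virtual nodes, the FNV-based `hashKey`, and all names (`modShard`, `newRing`, `movedFraction`) are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashKey returns a 64-bit hash of s: FNV-1a plus a mixing step so that
// similar strings spread evenly over the ring (illustrative choice).
func hashKey(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	x := h.Sum64()
	x ^= x >> 33
	x *= 0xff51afd7ed558ccd
	x ^= x >> 33
	x *= 0xc4ceb9fe1a85ec53
	x ^= x >> 33
	return x
}

// modShard places a key with plain hash-modulo.
func modShard(key string, n int) int {
	return int(hashKey(key) % uint64(n))
}

// ring is a minimal consistent-hash ring with virtual nodes.
type ring struct {
	points []uint64       // sorted virtual-node positions
	owner  map[uint64]int // position -> shard index
}

func newRing(shards, vnodes int) *ring {
	r := &ring{owner: make(map[uint64]int)}
	for s := 0; s < shards; s++ {
		for v := 0; v < vnodes; v++ {
			p := hashKey(fmt.Sprintf("shard-%d#%d", s, v))
			r.points = append(r.points, p)
			r.owner[p] = s
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// shard returns the owner of the first virtual node clockwise from the key.
func (r *ring) shard(key string) int {
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

// movedFraction reports the share of n synthetic keys whose shard differs
// between the two placement functions.
func movedFraction(n int, before, after func(string) int) float64 {
	moved := 0
	for i := 0; i < n; i++ {
		k := fmt.Sprintf("series-%d", i)
		if before(k) != after(k) {
			moved++
		}
	}
	return float64(moved) / float64(n)
}

func main() {
	const keys = 10000
	r3, r4 := newRing(3, 100), newRing(4, 100)
	mod := movedFraction(keys,
		func(k string) int { return modShard(k, 3) },
		func(k string) int { return modShard(k, 4) })
	ch := movedFraction(keys, r3.shard, r4.shard)
	fmt.Printf("hash-mod moved: %.2f, ring moved: %.2f\n", mod, ch)
}
```

With hash-modulo, a key stays put only when `h%3 == h%4`, which holds for about a quarter of all hashes, so roughly three quarters of the series change shard. On the ring, only the keys captured by the new shard's virtual nodes move, which is about one quarter.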

vmagent’s hash sharding architecture evolution:

mermaid
flowchart LR
    subgraph Collect["Collection Layer"]
        A1@{ shape: doc, label: "Prometheus Agent 1" }
        A2@{ shape: doc, label: "Prometheus Agent 2" }
    end

    subgraph VMAgent["vmagent Cluster"]
        direction TB
        B1(vmagent-0)
        B2(vmagent-1)
        B3(vmagent-2)
    end

    subgraph Shard["Sharding Logic"]
        C@{ shape: diam, label: "Consistent Hashing" }
    end

    subgraph Storage["Storage Layer"]
        D1@{ shape: cyl, label: "vmstorage-0" }
        D2@{ shape: cyl, label: "vmstorage-1" }
        D3@{ shape: cyl, label: "vmstorage-2" }
    end

    A1 -->|remote write| B1
    A2 -->|remote write| B2
    B1 --> C
    B2 --> C
    B3 --> C
    C -->|shard 0| D1
    C -->|shard 1| D2
    C -->|shard 2| D3

    classDef storage fill:#e8f5e9,stroke:#4caf50
    class D1,D2,D3 storage

The VictoriaMetrics blog provides specific algorithm implementation and sharding deployment recommendations. Combined with VictoriaMetrics Operator, it also supports managing shards via shardCount.

Issue P2: Single-Node Compute Scaling

vmagent now supports horizontal scaling (sharding) with replicas + shardCount, with HA support. See Issue #5573 discussion.

Out-of-Order / Delayed Data Accuracy (P1 Partial Mitigation)

v1.112.0 (2025-02) was a key release: it added Aggregation Windows! This provides dual-window buffering for histogram and rate calculations. Flushes are no longer immediate but delayed by the samples_lag time, which significantly improves accuracy for delayed data, at the cost of doubled memory (two aggregation windows are maintained simultaneously).

How Aggregation Windows Work:

mermaid
sequenceDiagram
    autonumber
    participant C as Collector
    participant V as vmagent
    participant S as VictoriaMetrics

    rect rgba(76,175,80,0.1)
        Note over C,V: Data collection phase
        C->>V: sample1 @T0
        V->>V: Write to window A (current)
    end
    rect rgba(255,152,0,0.1)
        Note over C,V: Delayed data arrives
        C->>V: sample2 @T1 (delayed)
        V->>V: Write to window B (previous)
    end
    Note over V: Dual-window parallel buffering
    rect rgba(33,150,243,0.1)
        Note over V,S: Aggregation output
        V->>S: Aggregation result A @T2
        V->>S: Aggregation result B @T3
    end

Official docs: Streaming aggregation - Aggregation windows
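
The dual-window idea can be sketched in a few lines of Go. This is a simplified model under stated assumptions, not vmagent's code: samples are assigned to a window by their own timestamp, and a window is flushed only once its end plus a fixed `lag` has passed. The `aggregator` type, its fields, and the fixed `lag` value are hypothetical stand-ins for the behaviour described above.

```go
package main

import "fmt"

// result is one flushed window.
type result struct {
	start int64
	sum   float64
}

// aggregator sums samples per window and delays each flush by lag seconds.
// With lag < interval, at most two windows are live at any time.
type aggregator struct {
	interval int64             // window length in seconds
	lag      int64             // extra wait before a window is flushed
	flushed  int64             // end timestamp of the newest flushed window
	sums     map[int64]float64 // window start -> running sum
}

func newAggregator(interval, lag int64) *aggregator {
	return &aggregator{interval: interval, lag: lag, sums: make(map[int64]float64)}
}

// add places the sample in the window its timestamp belongs to.
// It returns false when that window has already been flushed.
func (a *aggregator) add(ts int64, v float64) bool {
	start := ts - ts%a.interval
	if start < a.flushed {
		return false
	}
	a.sums[start] += v
	return true
}

// flush emits every window whose end plus lag is not later than now.
func (a *aggregator) flush(now int64) []result {
	var out []result
	for start := a.flushed; start+a.interval+a.lag <= now; start += a.interval {
		out = append(out, result{start: start, sum: a.sums[start]})
		delete(a.sums, start)
		a.flushed = start + a.interval
	}
	return out
}

func main() {
	a := newAggregator(60, 10)
	a.add(10, 1)
	a.add(50, 2)
	a.add(65, 5)                     // already belongs to the next window
	fmt.Println(a.add(55, 3))        // late sample, its window is still open
	fmt.Println(len(a.flush(69)))    // nothing is flushed before end+lag
	fmt.Println(a.flush(70))         // window [0,60) now includes the late sample
	fmt.Println(a.add(58, 1))        // too late: the window has been flushed
}
```

The memory cost mentioned above is visible here: between a window's end and its delayed flush, both it and the current window hold state.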

2.2 Still Unresolved ❌

True Distributed Stream Aggregation Coordination

vmagent’s stream aggregation is single-instance aggregation. There is no coordination mechanism between instances—if the same metric is aggregated by two vmagent instances, duplicate or conflicting data is produced. The official recommendation is to use without/by labels to divide instance responsibilities, rather than providing cross-instance distributed coordination.
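
A toy Go example shows why uncoordinated instances conflict. The `sample` type and the `aggregate` function are hypothetical; `aggregate` mimics a sum that drops the instance label, similar in spirit to `without: [instance]`.

```go
package main

import "fmt"

// sample is a simplified input point with two labels.
type sample struct {
	job, instance string
	value         float64
}

// aggregate sums values per job and drops the instance label.
func aggregate(in []sample) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range in {
		out[s.job] += s.value
	}
	return out
}

func main() {
	all := []sample{
		{"api", "a", 1},
		{"api", "b", 2},
		{"api", "c", 4},
	}

	// Uncoordinated: the inputs of one output series are split across two instances.
	inst0 := aggregate(all[:2])
	inst1 := aggregate(all[2:])
	fmt.Println("instance 0:", inst0["api"], "instance 1:", inst1["api"])

	// Partitioned by the output key (job): all inputs meet on one instance.
	fmt.Println("single owner:", aggregate(all)["api"])
}
```

Both instances emit a value for the same output series, and neither value is the true sum. Dividing responsibilities so that every output series has exactly one owner avoids this, which is what the by/without split and the routing layer described later aim for.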

Task ID Dimension Explosion (P6)

Official vmagent still inserts internal labels (such as _aggr related labels) into aggregated time series, but lacks a stream_task_id pre-marking + dimension control design.


III. stream-metrics-route: Current Status and Value

stream-metrics-route Core Code Review:

| File | Role |
|------|------|
| router.go | Routing core, filters metrics based on relabel rules |
| remotecluster.go | Dual hashmod scheduling core! |
| remotewrite.go | remote write HTTP client |
| kafka.go | Kafka producer |

Core Algorithm (remotecluster.go):

go
// Dual hashmod scheduling
hash := sortLabelsHashKey(ts.Labels)
dime := hashMod(r.dimension, hash)  // First hashmod → task partition ID
ts.Labels = append(ts.Labels, prompb.Label{
    Name:  "stream_task_id",
    Value: strconv.Itoa(dime),           // Insert stream_task_id label
})
hashnode = sortLabelsHashKey(filterLabels)  // Second hashmod → node selection
tmpch := hashMod(r.uplen, hashnode)     // Which backend writer to send to
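
For readers who want to run the idea, here is a self-contained sketch of the same two-step scheme. It is a reconstruction under assumptions, not the project's source: the `Label` type stands in for `prompb.Label`, `labelsHash` and `dropLabels` are hypothetical helpers, and the choice to build the node-selection key by dropping labels such as `instance` is a guess at what `filterLabels` contains.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
)

// Label is a stand-in for prompb.Label.
type Label struct{ Name, Value string }

// labelsHash hashes labels in name order, so input ordering does not matter.
func labelsHash(labels []Label) uint64 {
	sorted := append([]Label(nil), labels...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	h := fnv.New64a()
	for _, l := range sorted {
		h.Write([]byte(l.Name))
		h.Write([]byte{0})
		h.Write([]byte(l.Value))
		h.Write([]byte{0})
	}
	return h.Sum64()
}

// dropLabels returns labels without the named ones.
func dropLabels(labels []Label, names ...string) []Label {
	skip := make(map[string]bool, len(names))
	for _, n := range names {
		skip[n] = true
	}
	var kept []Label
	for _, l := range labels {
		if !skip[l.Name] {
			kept = append(kept, l)
		}
	}
	return kept
}

// route returns a copy of labels with stream_task_id appended, plus the index
// of the backend writer. dimension and backends must be positive.
func route(labels []Label, dimension, backends int, drop ...string) ([]Label, int) {
	// First hashmod: task partition ID, bounded by dimension rather than node count.
	taskID := int(labelsHash(labels) % uint64(dimension))
	out := append(append([]Label(nil), labels...),
		Label{Name: "stream_task_id", Value: strconv.Itoa(taskID)})

	// Second hashmod: backend selection on the filtered labels (assumed filter).
	node := int(labelsHash(dropLabels(labels, drop...)) % uint64(backends))
	return out, node
}

func main() {
	a := []Label{{"__name__", "http_requests_total"}, {"job", "api"}, {"instance", "10.0.0.1"}}
	b := []Label{{"instance", "10.0.0.2"}, {"job", "api"}, {"__name__", "http_requests_total"}}

	la, na := route(a, 8, 3, "instance")
	lb, nb := route(b, 8, 3, "instance")
	fmt.Println(la[len(la)-1], "-> backend", na)
	fmt.Println(lb[len(lb)-1], "-> backend", nb)
}
```

In this sketch the two series differ only in `instance`, so they land on the same backend, while the number of distinct `stream_task_id` values stays capped at `dimension` no matter how many backends are added.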

stream-metrics-route Irreplaceability Analysis

Conclusion: stream-metrics-route is still needed in 2026! But its positioning should shift from “full stream aggregation gateway” to “metric distribution routing gateway + Kafka integration layer.” Core differentiated value:

  1. Dual hashmod scheduling + stream_task_id pre-injection: Tags metrics with stream_task_id at the gateway layer; all subsequent nodes route consistently by this ID—this solves dimension control earlier at the data entry point than the official approach
  2. Multi-backend async distribution: Supports async distribution to Kafka and remote write, solving the “synchronous forwarding blocking the time window” issue mentioned in the blog
  3. Native Prometheus relabeling integration
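
Point 2 above can be illustrated with a small Go sketch of non-blocking fan-out. It assumes one bounded queue and one worker goroutine per backend and a drop-and-count policy when a queue is full; the `backend` type, `offer`, and `fanOut` are hypothetical names, and the real project may buffer, retry, or apply backpressure differently.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// backend is a hypothetical sink with its own bounded queue and worker.
type backend struct {
	name    string
	queue   chan []byte
	dropped int64
	wg      sync.WaitGroup
}

// newBackend starts a worker that calls send for every queued payload.
func newBackend(name string, size int, send func([]byte)) *backend {
	b := &backend{name: name, queue: make(chan []byte, size)}
	b.wg.Add(1)
	go func() {
		defer b.wg.Done()
		for msg := range b.queue {
			send(msg)
		}
	}()
	return b
}

// offer never blocks: when the queue is full the payload is dropped and counted.
func (b *backend) offer(msg []byte) bool {
	select {
	case b.queue <- msg:
		return true
	default:
		atomic.AddInt64(&b.dropped, 1)
		return false
	}
}

// droppedCount reports how many payloads were rejected so far.
func (b *backend) droppedCount() int64 { return atomic.LoadInt64(&b.dropped) }

// close stops accepting payloads and waits for the worker to drain the queue.
func (b *backend) close() {
	close(b.queue)
	b.wg.Wait()
}

// fanOut hands one payload to every backend without waiting on any of them.
func fanOut(backends []*backend, msg []byte) {
	for _, b := range backends {
		b.offer(msg)
	}
}

func main() {
	gate := make(chan struct{}) // keeps the slow backend stuck until closed
	slow := newBackend("kafka", 2, func([]byte) { <-gate })
	fast := newBackend("remote-write", 16, func([]byte) {})

	for i := 0; i < 10; i++ {
		fanOut([]*backend{slow, fast}, []byte("payload"))
	}
	fmt.Println("fast dropped:", fast.droppedCount())
	fmt.Println("slow dropped at least 7:", slow.droppedCount() >= 7)

	close(gate)
	slow.close()
	fast.close()
}
```

The loop finishes even though the slow backend is stalled, so a slow Kafka producer cannot hold up the remote write path within the aggregation time window. The trade-off in this sketch is data loss on overflow, which is why a drop counter is worth exporting.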

Recommended architecture:

mermaid
flowchart LR
    subgraph Collect["Collection Layer"]
        Prometheus@{ shape: doc, label: "Prometheus Agent Cluster" }
        KafkaProducer@{ shape: doc, label: "Business System Metrics" }
    end

    subgraph Route["Routing Layer"]
        SMR@{ shape: hex, label: "stream-metrics-route" }
    end

    subgraph Aggregate["Aggregation Layer"]
        vmagent0(vmagent-0)
        vmagent1(vmagent-1)
        vmagent2(vmagent-2)
    end

    subgraph Storage["Storage Layer"]
        Victoria@{ shape: cyl, label: "VictoriaMetrics" }
    end

    subgraph Consume["Consumption Layer"]
        vmalert(vmalert)
        Grafana(Grafana)
    end

    Prometheus --> SMR
    KafkaProducer --> SMR
    SMR -->|task_id=0| vmagent0
    SMR -->|task_id=1| vmagent1
    SMR -->|task_id=2| vmagent2
    SMR --> Kafka@{ shape: cyl, label: "Kafka Topic" }
    vmagent0 --> Victoria
    vmagent1 --> Victoria
    vmagent2 --> Victoria
    Victoria --> vmalert
    Victoria --> Grafana

    classDef route fill:#fff3e0,stroke:#ff9800
    classDef storage fill:#e8f5e9,stroke:#4caf50
    classDef consume fill:#e3f2fd,stroke:#2196f3
    class SMR route
    class Victoria,Kafka storage
    class vmalert,Grafana consume

Key Configuration Recommendations

vmagent version requirement: >= v1.112.0, enable aggregation windows:

yaml
# stream aggregation config
- match: 'http_request_duration_seconds_bucket'
  interval: 5m
  without: [instance]
  enable_windows: true   # Critical! Enable aggregation windows
  outputs: [rate_sum]

Deployment: See deploy.yaml example


V. Evolution Recommendations

5.1 Short-term Recommendations

| Action | Description |
|--------|-------------|
| Upgrade vmagent to >= v1.112.0 | Enable enable_windows: true to improve histogram aggregation accuracy |
| Evaluate whether stream-metrics-route is still needed | If there’s no Kafka requirement or high-cardinality stream_task_id control requirement, consider migrating away |

5.2 Medium-term Recommendations

| Action | Description |
|--------|-------------|
| stream-metrics-route as front-end routing layer only | Retain hashmod task allocation + Kafka distribution |
| Disable raw metric persistence | Only write stream-aggregated results to storage, reducing storage volume |
| Add metadata management module | The ruler-handle-process mentioned in the blog (dynamic Record Rule by dimension) is worth self-developing or contributing |

5.3 Long-term Recommendations

| Action | Description |
|--------|-------------|
| Contribute stream_task_id dimension control mechanism upstream | If this design is proven in production |
| Improve monitoring metrics | Add stream-aggregation-related business metrics (queue depth per routing rule, distribution latency) |

Summary

| Dimension | Conclusion |
|-----------|------------|
| Blog problem resolution rate | ~50% (2/4 core problems resolved through official upgrades, 2/4 still need self-developed solutions or maintaining status quo) |
| Is stream-metrics-route still needed? | Still needed, positioning adjusted to “metric distribution routing gateway + Kafka integration layer” |
| Recommended architecture | Prometheus → stream-metrics-route → vmagent v1.112.0+ → VictoriaMetrics Storage |

Three years have passed, and the VictoriaMetrics ecosystem has matured significantly, but we still need to keep moving forward!