VictoriaMetrics Stream Aggregation: Three-Year Review and Current Status (2026)

May 12, 2026 Observability VictoriaMetrics, Prometheus, Metric Aggregation, Vmagent, Stream-Metrics-Route

Introduction

It has been about three years since the previous article, Applying VictoriaMetrics Stream Aggregation for Metrics, was published in March 2023. The VictoriaMetrics ecosystem has changed considerably since then. Let’s revisit the issues raised in that post, see which ones the official project has resolved, and check where our stream-metrics-route project stands today.

I. Problems We Encountered Three Years Ago

Let’s quickly recap the core issue list from the 2023 blog post:

#	Problem	2023 Status
P1	Collection gaps	Network jitter or performance problems cause time gaps, which inflate the deltas computed by stream aggregation
P2	Single-node compute limits for massive data	Stream aggregation keeps no historical state, so it performs very well, but a single instance is still a bottleneck
P3	Distributed task allocation	Which compute node should a given series be assigned to?
P4	Out-of-order discarding for same-dimension metrics	When several nodes with different time windows compute the same output series, the later values are discarded
P5	Resource balancing	How to balance load across nodes in distributed computing
P6	Task ID dimension explosion	Stream aggregation inserts a node ID into each aggregated time series, so label cardinality grows with horizontal scaling

To address these issues, we developed stream-metrics-route, a Go-based distributed stream aggregation gateway.

II. Three Years Later, How Has the Official Project Done?

I reviewed VictoriaMetrics changelogs from v1.86 to v1.138.0 and the official documentation. Let’s take stock of the official project’s efforts over these three years:

2.1 Perfectly Resolved ✅

Issues P3, P5: Distributed Task Allocation & Resource Balancing

Official solution: vmagent now natively supports sharding across remote write targets via -remoteWrite.shardByURL, backed by consistent hashing.

Native support for shardByURL has been available since v1.86. v1.138.0 (2026-03) went further and changed the data distribution algorithm from round-robin to consistent hashing, which significantly reduces how much data is redistributed when nodes are added or removed (see the Changelog).

vmagent’s hash sharding architecture evolution:

mermaid
flowchart TD
    A@{ shape: doc, label: "Prometheus<br/>Agents" }
    B(vmagent Cluster)
    C@{ shape: diam, label: "Consistent<br/>Hashing" }
    D@{ shape: cyl, label: "vmstorage<br/>Cluster" }

    A -->|remote write| B
    B --> C
    C -->|shard| D

    classDef src fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    classDef proc fill:#fff3e0,stroke:#FF9800,color:#BF360C
    classDef spec fill:#f3e5f5,stroke:#9C27B0,color:#4A148C
    classDef store fill:#c8e6c9,stroke:#4CAF50,color:#1B5E20
    class A src
    class B proc
    class C spec
    class D store

The VictoriaMetrics blog provides specific algorithm implementation and sharding deployment recommendations. Combined with VictoriaMetrics Operator, it also supports managing shards via shardCount.
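To see why consistent hashing reduces redistribution when the node count changes, here is a minimal Go sketch. It is not vmagent's implementation: it uses rendezvous (highest-random-weight) hashing as one simple member of the consistent hashing family, and compares it with plain hash-mod assignment. The helper names (hash64, modShard, rendezvousShard, movedKeys) and the key format are made up for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hash64 returns a well-mixed 64-bit hash of s (FNV-1a plus a finalizer step).
func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	x := h.Sum64()
	x ^= x >> 33
	x *= 0xff51afd7ed558ccd
	x ^= x >> 33
	return x
}

// modShard assigns a key with plain hash-mod. Changing n remaps most keys.
func modShard(key string, n int) int {
	return int(hash64(key) % uint64(n))
}

// rendezvousShard assigns a key to the shard whose (shard, key) score is highest.
// Adding a shard only moves the keys that the new shard wins.
func rendezvousShard(key string, n int) int {
	best, bestScore := 0, uint64(0)
	for i := 0; i < n; i++ {
		score := hash64(fmt.Sprintf("shard-%d/%s", i, key))
		if i == 0 || score > bestScore {
			best, bestScore = i, score
		}
	}
	return best
}

// movedKeys counts how many of `keys` synthetic series change shard
// when the cluster grows from `from` to `to` shards.
func movedKeys(pick func(string, int) int, keys, from, to int) int {
	moved := 0
	for k := 0; k < keys; k++ {
		key := fmt.Sprintf("series-%d", k)
		if pick(key, from) != pick(key, to) {
			moved++
		}
	}
	return moved
}

func main() {
	const keys = 10000
	fmt.Println("hash-mod moved:  ", movedKeys(modShard, keys, 4, 5))
	fmt.Println("rendezvous moved:", movedKeys(rendezvousShard, keys, 4, 5))
}
```

Growing from 4 to 5 shards, hash-mod is expected to remap roughly four keys out of five, while rendezvous hashing remaps roughly one in five, and every remapped key lands on the new shard.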

Issue P2: Single-Node Compute Scaling

vmagent now supports horizontal scaling (sharding) with replicas + shardCount, with HA support. See Issue #5573 discussion.

Out-of-Order / Delayed Data Accuracy (P1 Partial Mitigation)

v1.112.0 (2025-02) was a key release: it added aggregation windows. With this option, vmagent keeps two aggregation windows (current and previous) for histogram and rate calculations, and a flush is delayed by a samples_lag time instead of happening immediately. This significantly improves accuracy for delayed data, at the cost of roughly doubled memory, because two windows are maintained at the same time.

How Aggregation Windows Work:

mermaid
sequenceDiagram
    autonumber
    participant C as Collector
    participant V as vmagent
    participant S as VictoriaMetrics

    rect rgba(76,175,80,0.1)
        Note over C,V: Data collection phase
        C->>V: sample1 @T0
        V->>V: Write to window A (current)
    end
    rect rgba(255,152,0,0.1)
        Note over C,V: Delayed data arrives
        C->>V: sample2 @T1 (delayed)
        V->>V: Write to window B (previous)
    end
    Note over V: Dual-window parallel buffering
    rect rgba(33,150,243,0.1)
        Note over V,S: Aggregation output
        V->>S: Aggregation result A @T2
        V->>S: Aggregation result B @T3
    end

Official docs: Streaming aggregation - Aggregation windows
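The dual-window idea can be illustrated with a small Go model. This is a simplified sketch, not vmagent's code: the windowedSum type, its fields, and the fixed lag value are all assumptions made for the example (the real lag is derived by vmagent itself). A sample is placed into the window that owns its timestamp, and a window is only flushed once its end plus the lag has passed, so a slightly late sample still reaches the right window.

```go
package main

import "fmt"

// windowedSum is a toy sum aggregator that keeps one state per time window.
type windowedSum struct {
	interval int64             // window length in seconds
	lag      int64             // how long a finished window waits for late samples
	sums     map[int64]float64 // window start -> running sum
	closed   int64             // windows starting before this were already flushed
}

func newWindowedSum(interval, lag int64) *windowedSum {
	return &windowedSum{interval: interval, lag: lag, sums: map[int64]float64{}}
}

// add routes a sample to the window that owns its timestamp.
// It returns false when that window was already flushed.
func (w *windowedSum) add(ts int64, v float64) bool {
	start := ts - ts%w.interval
	if start < w.closed {
		return false
	}
	w.sums[start] += v
	return true
}

// flush emits and forgets every window whose end plus lag is not after now.
func (w *windowedSum) flush(now int64) map[int64]float64 {
	out := map[int64]float64{}
	for start, sum := range w.sums {
		if end := start + w.interval; end+w.lag <= now {
			out[start] = sum
			delete(w.sums, start)
			if end > w.closed {
				w.closed = end
			}
		}
	}
	return out
}

func main() {
	w := newWindowedSum(60, 10)
	w.add(30, 1) // on time, belongs to window [0, 60)
	w.add(65, 5) // opens window [60, 120)
	w.add(55, 2) // delayed sample, still accepted into window [0, 60)

	fmt.Println(w.flush(65))  // lag has not elapsed yet, nothing is emitted
	fmt.Println(w.flush(70))  // window [0, 60) is emitted with both of its samples
	fmt.Println(w.add(58, 4)) // arrives after the flush and is rejected
}
```

As long as the lag is shorter than the interval and flush is called regularly, at most two windows are alive at once, which matches the doubled memory cost mentioned above.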

2.2 Still Unresolved ❌

True Distributed Stream Aggregation Coordination

vmagent’s stream aggregation runs per instance, with no coordination mechanism between instances. If the same metric is aggregated by two vmagent instances, the result is duplicate or conflicting data. The official recommendation is to divide responsibilities among instances with by/without labels rather than to rely on cross-instance distributed coordination.
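The conflict is easy to reproduce in a toy model. The Go sketch below is an illustration only (the sample type and the split, sumByJob, and conflicts helpers are hypothetical): it sums values per job on each instance, as an aggregation with `without: [instance]` would, and counts how many output series are emitted by more than one instance under two distribution strategies.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type sample struct {
	job, instance string
	value         float64
}

// sumByJob mimics a per-instance sum aggregation that drops the instance label.
func sumByJob(in []sample) map[string]float64 {
	out := map[string]float64{}
	for _, s := range in {
		out[s.job] += s.value
	}
	return out
}

// split distributes samples over n aggregating instances using pick.
func split(in []sample, n int, pick func(int, sample) int) [][]sample {
	parts := make([][]sample, n)
	for i, s := range in {
		p := pick(i, s) % n
		parts[p] = append(parts[p], s)
	}
	return parts
}

// conflicts counts output series that more than one instance would write.
func conflicts(parts [][]sample) int {
	emitters := map[string]int{}
	for _, p := range parts {
		for job := range sumByJob(p) {
			emitters[job]++
		}
	}
	n := 0
	for _, c := range emitters {
		if c > 1 {
			n++
		}
	}
	return n
}

// roundRobin ignores labels, so one output group is spread over instances.
func roundRobin(i int, _ sample) int { return i }

// byJob hashes only the label that survives aggregation.
func byJob(_ int, s sample) int {
	h := fnv.New32a()
	h.Write([]byte(s.job))
	return int(h.Sum32() % 1024)
}

func main() {
	data := []sample{
		{"api", "a", 1}, {"api", "b", 2},
		{"db", "a", 3}, {"db", "b", 4},
	}
	fmt.Println("round-robin conflicts:", conflicts(split(data, 2, roundRobin)))
	fmt.Println("hash-by-job conflicts:", conflicts(split(data, 2, byJob)))
}
```

Routing on the labels that remain after aggregation guarantees that each output series has exactly one writer, which is the property a front-end router has to provide when the aggregators themselves do not coordinate.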

Task ID Dimension Explosion (P6)

Official vmagent tells aggregated series apart mainly through the suffix it appends to the output metric name (interval, by/without labels, and output type). It has no equivalent of pre-marking series with a stream_task_id and controlling that dimension.

III. stream-metrics-route: Current Status and Value

stream-metrics-route Core Code Review:

File	Role
router.go	Routing core, filters metrics based on relabel rules
remotecluster.go	Dual hashmod scheduling core!
remotewrite.go	remote write HTTP client
kafka.go	Kafka producer

Core Algorithm (remotecluster.go):

go
// Dual hashmod scheduling (excerpt: r, ts, filterLabels, and hashnode come from the surrounding method)
hash := sortLabelsHashKey(ts.Labels)
dime := hashMod(r.dimension, hash) // First hashmod → task partition ID
ts.Labels = append(ts.Labels, prompb.Label{
    Name:  "stream_task_id",
    Value: strconv.Itoa(dime), // Insert stream_task_id label
})
hashnode = sortLabelsHashKey(filterLabels) // Hash of the filtered label set
tmpch := hashMod(r.uplen, hashnode)        // Second hashmod → which backend writer to send to
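For readers who want to run the idea end to end, here is a self-contained Go sketch of the same two-step scheme. It is a reconstruction under assumptions, not the project's code: the Label type stands in for prompb.Label, the bodies of sortLabelsHashKey and hashMod are guesses (FNV-1a over name-sorted labels, then modulo), and the filtered label set is assumed to be the tagged labels minus the labels that aggregation drops.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
)

// Label mirrors the name/value pair of a remote-write label.
type Label struct{ Name, Value string }

// sortLabelsHashKey hashes labels in name order, so input order does not matter.
func sortLabelsHashKey(labels []Label) uint64 {
	sorted := append([]Label(nil), labels...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	h := fnv.New64a()
	for _, l := range sorted {
		h.Write([]byte(l.Name + "\x00" + l.Value + "\x00"))
	}
	return h.Sum64()
}

// hashMod maps a hash onto n buckets.
func hashMod(n int, hash uint64) int { return int(hash % uint64(n)) }

// route tags a series with stream_task_id (first hashmod) and then picks a
// backend from the labels that survive aggregation (second hashmod).
func route(labels []Label, drop map[string]bool, dimension, backends int) ([]Label, int) {
	taskID := hashMod(dimension, sortLabelsHashKey(labels))
	tagged := append(append([]Label(nil), labels...),
		Label{Name: "stream_task_id", Value: strconv.Itoa(taskID)})

	var filtered []Label
	for _, l := range tagged {
		if !drop[l.Name] {
			filtered = append(filtered, l)
		}
	}
	return tagged, hashMod(backends, sortLabelsHashKey(filtered))
}

func main() {
	series := []Label{
		{"__name__", "http_requests_total"},
		{"job", "api"},
		{"instance", "10.0.0.1:9100"},
	}
	// 8 task partitions, 3 backend writers, aggregation drops "instance".
	tagged, backend := route(series, map[string]bool{"instance": true}, 8, 3)
	fmt.Println(tagged[len(tagged)-1], "-> backend", backend)
}
```

Under these assumptions the extra label can take at most `dimension` distinct values no matter how many backends are added, and every series that collapses into the same output series (same surviving labels, same task ID) is sent to the same backend.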

stream-metrics-route Irreplaceability Analysis

Conclusion: stream-metrics-route is still needed in 2026, but its positioning should shift from “full stream aggregation gateway” to “metric distribution routing gateway + Kafka integration layer.” Its core differentiated value:

Dual hashmod scheduling + stream_task_id pre-injection: The gateway tags each metric with stream_task_id, and all downstream nodes route consistently by this ID. This settles dimension control at the data entry point, earlier than the official approach does
Multi-backend async distribution: Supports async distribution to Kafka and remote write, solving the “synchronous forwarding blocking the time window” issue mentioned in the blog
Native Prometheus relabeling integration
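The async distribution point can be sketched with bounded queues. The Go example below is a minimal illustration, assuming one in-memory queue per backend and a drop-on-full policy. The backend type, the queue sizes, and the drop policy are choices made for this example and are not taken from stream-metrics-route.

```go
package main

import "fmt"

// backend is a bounded queue in front of one sink
// (for example a Kafka producer or a remote write client).
type backend struct {
	name    string
	queue   chan []byte
	dropped int
}

func newBackend(name string, size int) *backend {
	return &backend{name: name, queue: make(chan []byte, size)}
}

// offer enqueues a batch without blocking. When the queue is full,
// the batch is dropped and counted instead of stalling the caller.
func (b *backend) offer(batch []byte) bool {
	select {
	case b.queue <- batch:
		return true
	default:
		b.dropped++
		return false
	}
}

// fanOut hands the same batch to every backend independently.
func fanOut(backends []*backend, batch []byte) {
	for _, b := range backends {
		b.offer(batch)
	}
}

func main() {
	kafka := newBackend("kafka", 1)     // simulates a stalled sink: nothing drains it
	rw := newBackend("remote-write", 8) // healthy sink with spare capacity

	for i := 0; i < 3; i++ {
		fanOut([]*backend{kafka, rw}, []byte(fmt.Sprintf("batch-%d", i)))
	}
	fmt.Println(kafka.name, "queued", len(kafka.queue), "dropped", kafka.dropped)
	fmt.Println(rw.name, "queued", len(rw.queue), "dropped", rw.dropped)
}
```

In a real gateway each queue would be drained by its own goroutine, and a full queue might spill to disk rather than drop. The point of the sketch is only that a slow backend cannot hold up the receive path or the other backends.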

IV. Recommended Hybrid Architecture 2026

Recommended architecture:

mermaid
flowchart TD
    A@{ shape: doc, label: "Prometheus Cluster<br/>Business Metrics" }
    B@{ shape: hex, label: "stream-<br/>metrics-route" }
    C(vmagent Cluster<br/>v1.112.0+)
    D@{ shape: cyl, label: "Kafka<br/>Topic" }
    E@{ shape: cyl, label: "VictoriaMetrics" }
    F(Grafana / vmalert)

    A --> B
    B -->|task_id sharding| C
    B --> D
    C --> E
    E --> F

    classDef src fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    classDef route fill:#fff3e0,stroke:#FF9800,color:#BF360C
    classDef proc fill:#f3e5f5,stroke:#9C27B0,color:#4A148C
    classDef store fill:#c8e6c9,stroke:#4CAF50,color:#1B5E20
    classDef out fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    class A src
    class B route
    class C proc
    class D,E store
    class F out

Key Configuration Recommendations

vmagent version requirement: >= v1.112.0, enable aggregation windows:

yaml
# stream aggregation config
- match: 'http_request_duration_seconds_bucket'
  interval: 5m
  without: [instance]
  enable_windows: true   # Critical! Enable aggregation windows
  outputs: [rate_sum]
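When wiring dashboards or alerts to this rule, it helps to know the name of the series it produces. The Go helper below is a hypothetical convenience function (outputName is not part of any VictoriaMetrics package) that follows the output naming pattern as I understand it from the stream aggregation docs: metric name, a colon, the interval, optional by/without label lists, then the output type. Treat the exact pattern as an assumption and verify it against the documentation for your version.

```go
package main

import (
	"fmt"
	"strings"
)

// outputName builds the expected name of an aggregated series from the
// fields of a stream aggregation rule. The pattern is an assumption
// based on the documented naming scheme, not code taken from vmagent.
func outputName(metric, interval string, by, without []string, output string) string {
	var sb strings.Builder
	sb.WriteString(metric + ":" + interval)
	if len(by) > 0 {
		sb.WriteString("_by_" + strings.Join(by, "_"))
	}
	if len(without) > 0 {
		sb.WriteString("_without_" + strings.Join(without, "_"))
	}
	sb.WriteString("_" + output)
	return sb.String()
}

func main() {
	// Fields copied from the rule above.
	fmt.Println(outputName("http_request_duration_seconds_bucket", "5m",
		nil, []string{"instance"}, "rate_sum"))
}
```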

Deployment: See deploy.yaml example

V. Evolution Recommendations

5.1 Short-term Recommendations

Action	Description
Upgrade vmagent to >= v1.112.0	Enable `enable_windows: true` to improve histogram aggregation accuracy
Evaluate whether stream-metrics-route is still needed	If there’s no Kafka requirement or high-cardinality `stream_task_id` control requirement, consider migrating away

5.2 Medium-term Recommendations

Action	Description
stream-metrics-route as front-end routing layer only	Retain hashmod task allocation + Kafka distribution
Disable raw metric persistence	Only write stream-aggregated results to storage, reducing storage volume
Add metadata management module	The `ruler-handle-process` mentioned in the blog (dynamic Record Rule by dimension) is worth building in-house or contributing upstream

5.3 Long-term Recommendations

Action	Description
Contribute stream_task_id dimension control mechanism upstream	If this design is proven in production
Improve monitoring metrics	Add stream-aggregation-related business metrics (queue depth per routing rule, distribution latency)

Summary

Dimension	Conclusion
Blog problem resolution rate	~50% (2/4 core problems resolved through official upgrades, 2/4 still need self-developed solutions or maintaining status quo)
Is stream-metrics-route still needed?	Still needed, positioning adjusted to “metric distribution routing gateway + Kafka integration layer”
Recommended architecture	Prometheus → stream-metrics-route → vmagent v1.112.0+ → VictoriaMetrics Storage

Three years have passed, and the VictoriaMetrics ecosystem has matured significantly, but we still need to keep moving forward!