The Origin: Compliance Check Hassles
Anyone in operations knows there is one hurdle that servers hosted in China cannot escape: Cybersecurity Level Protection (GB/T 22239-2019, commonly known as “Level Protection 2.0”). Whether you are rated Level 3 or Level 2, auditors come asking questions like these:
Is SSH root login disabled? Are password policies compliant? Is the firewall on? Is SELinux enforcing? Are there expired accounts? What’s the password validity period? Which ports are open? Are there high-risk services running? Are audit logs enabled? How long are they retained?
There are plenty of compliance check tools available; a GitHub search turns up Golin, EvaluationTools, Linux-Security-Compliance-Check, and others. But they all share one limitation: run once, get a report, done. You pass the check today; tomorrow someone edits sshd_config, turns off the firewall, or installs a backdoor service, and you’d never know.
From Static to Real-Time
The previous article introduced security-collector-exporter v0.1.0, which turns Linux security configuration state into Prometheus metrics. But v0.1.0 is essentially “snapshot-based”: it periodically reads /etc and /proc, capturing the static configuration at a single point in time.
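To make the snapshot idea concrete, here is a minimal Go sketch that reads one setting from sshd_config text and renders it in the Prometheus text exposition format. It is not the exporter's actual code: the metric name `ssh_root_login_disabled` and both function names are hypothetical, and the parser assumes whitespace-separated keywords and ignores `Match` blocks and `Include` directives.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// permitRootLogin returns the first PermitRootLogin value found in the given
// sshd_config text, lowercased, or "" if the keyword is not set explicitly.
// sshd uses the first value it obtains for a keyword, so later duplicates
// are ignored here as well. Match blocks and Include are NOT handled.
func permitRootLogin(config string) string {
	sc := bufio.NewScanner(strings.NewReader(config))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and comments
		}
		fields := strings.Fields(line)
		if len(fields) >= 2 && strings.EqualFold(fields[0], "PermitRootLogin") {
			return strings.ToLower(fields[1])
		}
	}
	return ""
}

// renderMetric turns the snapshot into a gauge: 1 when root login is
// explicitly disabled, 0 otherwise. The metric name is hypothetical.
func renderMetric(config string) string {
	v := 0
	if permitRootLogin(config) == "no" {
		v = 1
	}
	return fmt.Sprintf("# TYPE ssh_root_login_disabled gauge\nssh_root_login_disabled %d\n", v)
}

func main() {
	// A real exporter would read /etc/ssh/sshd_config on every scrape;
	// an inline sample keeps this sketch self-contained.
	sample := "# hardened config\nPermitRootLogin no\nPasswordAuthentication no\n"
	fmt.Print(renderMetric(sample))
}
```

Because the value is recomputed from the file on each scrape, an edit to sshd_config shows up at the next collection, but anything that happens and ends between two scrapes leaves no trace.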
There’s an area of security operations that snapshots can’t cover: real-time security events. Someone running a reverse shell, a process escalating privileges, an abnormal network connection, someone loading a kernel module — these events happen and pass; you’d never see them at your next scrape.
Introduction
In the previous article, we reviewed the three-year evolution of stream-metrics-route and mentioned that “dual hashmod scheduling” is the core scheduling mechanism of the entire gateway. However, during continuous production operation, one fatal flaw of hashmod became increasingly obvious: every scaling operation triggers full data redistribution.
This article documents the complete decision process of migrating from hash % N (hashmod) to Jump Consistent Hash: which candidate algorithms were evaluated, why Jump Hash was ultimately chosen, and the specific impact before and after migration.
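The difference between the two schemes is easy to see in a small experiment. The Go sketch below implements Jump Consistent Hash as published by Lamping and Veach, alongside plain `hash % N`, and measures how many keys change bucket when the bucket count grows from 10 to 11. The helper names (`jumpHash`, `hashMod`, `movedFraction`) and the use of sequential integers as keys are assumptions made for illustration; this is not the stream-metrics-route code.

```go
package main

import "fmt"

// jumpHash maps key to a bucket in [0, numBuckets) using the Jump Consistent
// Hash algorithm (Lamping and Veach). It returns -1 if numBuckets <= 0.
func jumpHash(key uint64, numBuckets int) int {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		// 64-bit linear congruential step; overflow wraps by design.
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int(b)
}

// hashMod is the classic hash % N assignment.
func hashMod(key uint64, numBuckets int) int {
	return int(key % uint64(numBuckets))
}

// movedFraction reports the share of keys 0..keys-1 whose bucket changes
// when the bucket count goes from `from` to `to`.
func movedFraction(assign func(uint64, int) int, keys, from, to int) float64 {
	moved := 0
	for k := 0; k < keys; k++ {
		if assign(uint64(k), from) != assign(uint64(k), to) {
			moved++
		}
	}
	return float64(moved) / float64(keys)
}

func main() {
	const keys = 100000
	fmt.Printf("hashmod moved: %.3f\n", movedFraction(hashMod, keys, 10, 11))
	fmt.Printf("jump    moved: %.3f\n", movedFraction(jumpHash, keys, 10, 11))
}
```

With hashmod, roughly N/(N+1) of the keys land in a different bucket after adding one bucket. With Jump Hash, only about 1/(N+1) of the keys move, and every key that moves goes to the newly added bucket. The trade-off is that buckets are numbered 0..N-1, so capacity can only be added or removed at the end of the range.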
Why This Was Built
Anyone managing servers has probably had this experience: a compliance audit comes up, and you SSH into machines one by one to check whether the SSH config is correct, SELinux is enabled, the firewall is running, any accounts have expired, and password policies are compliant. A few machines are fine; dozens or hundreds become purely manual grunt work.
And the more painful part: none of this has continuous monitoring. You check compliance today, someone changes a config tomorrow, and you’d never know.
Introduction
It’s been exactly three years since the previous article, “Applying VictoriaMetrics Stream Aggregation for Metrics”, was published in March 2023. In these three years, the VictoriaMetrics ecosystem has undergone tremendous changes. Let’s revisit the issues raised in that blog post, see what the official project has resolved, and where our stream-metrics-route project stands today.
I. Problems We Encountered Three Years Ago
Let’s quickly recap the core issue list from the 2023 blog post:
Overview: How to Analyze a Protocol (MongoDB) Protocol Document Analysis Approach MongoDB Protocol OpCode Reference Table Analyzing the Most Common OpCode OP_MSG Extending Protocol Parsing in DeepFlow Agent DeepFlow Agent Development Document Overview Code Guide Define a Protocol with a Constant Identifier Prepare Parsing Logic for the New Protocol Define the Struct Implement L7ProtocolParserInterface Extending DeepFlow Protocol Collection Using Wasm Plugins Kafka Protocol Analysis Kafka Header and Data Overview Kafka Fetch API Kafka Produce API Kafka Protocol DeepFlow Agent Native Decoding DeepFlow Agent Wasm Plugin Wasm Go SDK Framework Plugin Code Guide Conclusion Native Rust Extension Wasm Plugin Extension Appendix
Overview
MongoDB is widely used today, but it lacks effective observability capabilities. DeepFlow is an excellent observability solution, but it lacks support for the MongoDB protocol. This article extends DeepFlow with MongoDB protocol parsing, enhancing observability in the MongoDB ecosystem. It briefly describes the process from analyzing the protocol documentation to implementing the parsing code within DeepFlow.
Community VM Stream Aggregation Capability Analysis and Issues
VictoriaMetrics Open-Source Project Native Capabilities
Stream aggregation in the VictoriaMetrics project was integrated into vmagent starting from version 1.86. For details, refer to: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3460
From the source code analysis, the stream aggregation capability looks like this:
The core computation code is described in the pushSample function:
```go
func (as *totalAggrState) pushSample(inputKey, outputKey string, value float64) {
	currentTime := fasttime.UnixTimestamp()
	deleteDeadline := currentTime + as.intervalSecs + (as.intervalSecs >> 1)

again:
	v, ok := as.m.Load(outputKey)
	if !ok {
		v = &totalStateValue{
			lastValues: make(map[string]*lastValueState),
		}
		vNew, loaded := as.m.LoadOrStore(outputKey, v)
		if loaded {
			v = vNew
		}
	}
	sv := v.(*totalStateValue)
	sv.mu.Lock()
	deleted := sv.deleted
	if !deleted {
		lv, ok := sv.lastValues[inputKey]
		if !ok {
			lv = &lastValueState{}
			sv.lastValues[inputKey] = lv
		}
		d := value
		if ok && lv.value <= value {
			d = value - lv.value
		}
		if ok || currentTime > as.ignoreInputDeadline {
			sv.total += d
		}
		lv.value = value
		lv.deleteDeadline = deleteDeadline
		sv.deleteDeadline = deleteDeadline
	}
	sv.mu.Unlock()
	if deleted {
		goto again
	}
}
```
General Application Analysis of Stream Aggregation
First, let’s look at the time series chart after stream aggregation:
For the deployment process and instructions, see: pixie install
Pixie Platform Main Components
Pixie Edge Module (PEM): Pixie’s agent, installed per node. PEMs use eBPF to collect data, which is stored locally on the node.
Vizier: Pixie’s collector, installed per cluster. Responsible for query execution and managing PEMs.
Pixie Cloud: Used for user management, authentication, and data proxying. Can be hosted or self-hosted.
Pixie CLI: Used to deploy Pixie. Can also be used to run queries and manage resources like API keys.
Time-Sharing Systems and Linux
First, let’s review time-sharing systems. Time sharing is a very important operating system concept: it maximizes computer utilization and is a crucial means of running multiple programs concurrently.
The Linux kernel we use daily also adopts the time-sharing system philosophy, mainly reflected in the following aspects:
Time Slice: Linux uses a time slice mechanism to divide CPU time. Each process can only execute for one time slice before yielding the CPU to other processes. This achieves CPU time sharing and fair allocation.
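The time slice idea can be illustrated with a toy round-robin simulation in Go. This is a deliberately simplified model for intuition, not how the Linux scheduler is implemented; the `Process` type, the `roundRobin` function, and the abstract time units are all made up for this sketch.

```go
package main

import "fmt"

// Process is a hypothetical task that needs Remaining units of CPU time.
type Process struct {
	Name      string
	Remaining int
}

// roundRobin runs each process for at most `slice` units before moving it
// to the back of the queue, and returns the execution order as "name:units".
// It returns nil if slice is not positive.
func roundRobin(procs []Process, slice int) []string {
	if slice <= 0 {
		return nil
	}
	queue := append([]Process(nil), procs...) // copy so the input is untouched
	var order []string
	for len(queue) > 0 {
		p := queue[0]
		queue = queue[1:]
		run := slice
		if p.Remaining < run {
			run = p.Remaining // process finishes before its slice is used up
		}
		p.Remaining -= run
		order = append(order, fmt.Sprintf("%s:%d", p.Name, run))
		if p.Remaining > 0 {
			queue = append(queue, p) // slice exhausted: yield and wait again
		}
	}
	return order
}

func main() {
	procs := []Process{{"A", 5}, {"B", 2}, {"C", 3}}
	fmt.Println(roundRobin(procs, 2))
	// prints [A:2 B:2 C:2 A:2 C:1 A:1]
}
```

No process runs for more than one slice at a time, so a long task such as A cannot starve the shorter ones, which is the fairness property the time slice mechanism is meant to provide.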
In the context of multi-cloud deployment, global networking, and exponential growth in service scale within internet businesses, monitoring platforms have long surpassed the basic “metric collection + alert notification” positioning, becoming the core infrastructure for ensuring end-to-end stability. This article, based on the real evolution of a large-scale internet enterprise monitoring platform, dissects the complete planning and implementation approach for upgrading the monitoring platform from large-scale coverage to productization, usability, and intelligence in 2022.