Observability

Advanced eBPF Memory Observability: Container Tracing and Rust Aya

June 12, 2026

The first two articles covered eBPF fundamentals and OOM Killer event tracing. This article goes deeper: container-level OOM pinpointing, real-time memory allocation rate tracking, and implementing the same functionality with the Rust Aya framework.

Container-Level OOM Pinpointing

In Kubernetes, “a Pod OOM’d” is a vague statement. A Pod can contain multiple containers, each in its own cgroup. eBPF can see through the Pod abstraction and identify exactly which container and which process caused the OOM.
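
As a rough illustration of the container-to-cgroup mapping described above, here is a minimal Go sketch that pulls a container ID out of a cgroup path. It assumes the common layout in which the container's leaf cgroup name embeds a 64-character hex ID; actual paths vary by container runtime and cgroup driver. The function name and the sample paths are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// containerIDPattern matches a run of 64 lowercase hex characters.
// Assumption: the container runtime embeds the container ID in this form
// in the leaf cgroup name; layouts differ between runtimes and drivers.
var containerIDPattern = regexp.MustCompile(`[0-9a-f]{64}`)

// ContainerIDFromCgroup returns the last 64-hex-character token found in a
// cgroup path, or "" when there is none (for example, a host process).
func ContainerIDFromCgroup(cgroupPath string) string {
	matches := containerIDPattern.FindAllString(cgroupPath, -1)
	if len(matches) == 0 {
		return ""
	}
	return matches[len(matches)-1]
}

func main() {
	// Hypothetical sample paths, shaped like systemd-driver cgroup names.
	paths := []string{
		"/kubepods.slice/kubepods-burstable.slice/" +
			"kubepods-burstable-pod6f4bd0a2_3c1e_4f7a_9b1d_2e5c8a7d9f10.slice/" +
			"cri-containerd-0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef.scope",
		"/system.slice/sshd.service",
	}
	for _, p := range paths {
		fmt.Printf("container=%q path=%s\n", ContainerIDFromCgroup(p), p)
	}
}
```

A tracer could apply a lookup like this to the cgroup of the process reported in an OOM event, so that the event is attributed to one container and not to the whole Pod.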

Continue reading →

Building an OOM Killer Event Tracer with eBPF + Go

June 11, 2026

bpftrace is great for quick probing and ad-hoc debugging. For production-grade monitoring tools, you need full eBPF programs. The architecture splits into two layers:

Kernel side: eBPF program written in C, attached to hook points, collecting event data
User side: loader written in Go (or Rust / libbpf C), loading the eBPF program and reading events

Architecture

```mermaid
flowchart TD
    classDef kern fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    classDef user fill:#fff3e0,stroke:#FF9800,color:#BF360C
    classDef data fill:#c8e6c9,stroke:#4CAF50,color:#1B5E20
    hook@{ shape: rounded, label: "oom_kill_process (kprobe)" }
    ebpf@{ shape: proc, label: "eBPF Program\nEvent Collection" }
    ring@{ shape: cyl, label: "Ring Buffer" }
    loader@{ shape: notch-rect, label: "bpf2go Loader" }
    reader@{ shape: proc, label: "RingBuf Reader\nEvent Parsing" }
    hook --> ebpf --> ring
    ring --> reader
    loader -.-> ebpf
    class hook,ebpf kern
    class ring data
    class loader,reader user
```

eBPF Kernel Program (C)

Name the C file oom_kprobe.bpf.c. The bpf suffix is a cilium/ebpf convention for bpf2go code generation:

Continue reading →

eBPF Observability: Getting Started with OOM Killer Monitoring

June 10, 2026

eBPF (Extended Berkeley Packet Filter) started as a network packet filtering tool, but over nearly a decade it has evolved into a mainstream observability framework in the Linux kernel. It allows you to safely inject and execute custom programs without modifying kernel source code or loading kernel modules. This article kicks off the series, using OOM (Out-of-Memory) monitoring as a concrete entry point to learn the core eBPF concepts and toolchain.

Continue reading →

From Compliance to Real-Time Defense: The Evolution of security-collector-exporter

May 21, 2026

The Origin: Compliance Check Hassles

Anyone in operations knows there is one hurdle that servers in China cannot escape: Cybersecurity Level Protection (GB/T 22239-2019, commonly known as “Level Protection 2.0”). Whether you’re Level 3 or Level 2, auditors come asking the same questions: Is SSH root login disabled? Are password policies compliant? Is the firewall on? Is SELinux enforcing? Are there expired accounts? What’s the password validity period? Which ports are open? Are there high-risk services running? Are audit logs enabled? How long are they retained?

There are plenty of compliance check tools on the market. Search GitHub and you’ll find a bunch: Golin, EvaluationTools, Linux-Security-Compliance-Check, etc. But they all share one limitation: they run once, produce a report, and stop. You check compliance today, and tomorrow someone changes sshd_config, turns off the firewall, or installs a backdoor service, and you’d never know.

Continue reading →

security-collector-exporter v0.3.0: Real-Time Security Monitoring with eBPF

May 19, 2026

From Static to Real-Time

The previous article introduced security-collector-exporter v0.1.0, which turns Linux security configuration state into Prometheus metrics. But v0.1.0 is essentially “snapshot-based”: it periodically reads /etc and /proc, capturing the static configuration at a single point in time.

There’s an area of security operations that snapshots can’t cover: real-time security events. Someone running a reverse shell, a process escalating privileges, an abnormal network connection, someone loading a kernel module: these events happen and pass, and you’d never see them at your next scrape.
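
To make the “snapshot-based” model concrete, here is a minimal Go sketch of one such check: it scans sshd_config text for PermitRootLogin and renders the result as a line in the Prometheus text format. This is an illustration only, not the exporter's actual code. The metric name and function names are hypothetical, and the parser is deliberately simplified (it ignores Match blocks and Include directives).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// RootLoginDisabled reports whether the first PermitRootLogin directive in
// the given sshd_config text is set to "no". Keywords are matched
// case-insensitively. Simplification: Match blocks and Include directives
// are ignored, and a missing directive is treated as "not disabled".
func RootLoginDisabled(config string) bool {
	scanner := bufio.NewScanner(strings.NewReader(config))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		// Skip blank lines, comments, and keywords without a value.
		if len(fields) < 2 || strings.HasPrefix(fields[0], "#") {
			continue
		}
		if strings.EqualFold(fields[0], "PermitRootLogin") {
			return fields[1] == "no"
		}
	}
	return false
}

// MetricLine renders the check as a Prometheus text-format sample.
// The metric name is hypothetical.
func MetricLine(config string) string {
	value := 0
	if RootLoginDisabled(config) {
		value = 1
	}
	return fmt.Sprintf("ssh_root_login_disabled %d", value)
}

func main() {
	// In a real collector this text would be read from /etc/ssh/sshd_config
	// on every scrape; an inline sample keeps the sketch self-contained.
	sample := "# example config\nPort 22\nPermitRootLogin no\n"
	fmt.Println(MetricLine(sample))
}
```

A check like this only sees the file as it is at scrape time, which is exactly the limitation described above: anything that happens and ends between two scrapes leaves no trace in the metric.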

Continue reading →

From Hashmod to Jump Consistent Hash — stream-metrics-route Hash Algorithm Upgrade

May 16, 2026

Introduction

In the previous article, we reviewed the three-year evolution of stream-metrics-route and mentioned that “dual hashmod scheduling” is the core scheduling mechanism of the entire gateway. However, during continuous production operation, one fatal flaw of hashmod became increasingly obvious: every scaling operation changes N, so almost every key maps to a different node and nearly all data is redistributed. This article documents the decision process of migrating from hash % N (hashmod) to Jump Consistent Hash, including the candidate algorithms evaluated, why Jump Hash was ultimately chosen, and the specific impact before and after migration.
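
The difference between the two schemes can be sketched in a few lines of Go. The JumpHash function below follows the published Jump Consistent Hash algorithm (Lamping and Veach); it is a generic illustration, not the code used in stream-metrics-route. HashMod, MovedKeys, and the use of sequential integers as keys are assumptions made for the demo; a real gateway would first hash each series to a 64-bit value.

```go
package main

import "fmt"

// JumpHash maps a 64-bit key to a bucket in [0, numBuckets) using the
// Jump Consistent Hash algorithm. It returns -1 if numBuckets <= 0.
func JumpHash(key uint64, numBuckets int) int {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		// Advance a linear congruential generator seeded by the key.
		key = key*2862933555777941757 + 1
		// Jump forward to the next bucket count at which this key would move.
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int(b)
}

// HashMod is the plain hash % N scheme.
func HashMod(key uint64, numBuckets int) int {
	return int(key % uint64(numBuckets))
}

// MovedKeys counts how many of the keys 0..n-1 land in a different bucket
// when the bucket count changes from `from` to `to`.
func MovedKeys(assign func(uint64, int) int, n uint64, from, to int) int {
	moved := 0
	for k := uint64(0); k < n; k++ {
		if assign(k, from) != assign(k, to) {
			moved++
		}
	}
	return moved
}

func main() {
	const keys = 100000
	fmt.Println("hashmod, 10 -> 11 buckets, keys moved:", MovedKeys(HashMod, keys, 10, 11))
	fmt.Println("jump,    10 -> 11 buckets, keys moved:", MovedKeys(JumpHash, keys, 10, 11))
}
```

When a bucket is added, Jump Consistent Hash either leaves a key where it was or moves it to the new bucket, so on average about 1/(N+1) of the keys move. With hash % N, a key stays put only if its remainder happens to be the same under both moduli, which is rare. One trade-off to keep in mind is that Jump Hash only supports adding or removing the highest-numbered bucket.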

Continue reading →

security-collector-exporter: Monitoring Linux Security Auditing

May 14, 2026

Why This Was Built

Anyone managing servers has probably had this experience: a compliance audit arrives, and you SSH into machines one by one to check everything. Is the SSH config correct? Is SELinux enabled? Is the firewall running? Are there expired accounts? Are password policies compliant? A few machines are fine; dozens or hundreds become pure manual grunt work. The more painful part is that none of this is continuously monitored. You check compliance today, someone changes a config tomorrow, and you’d never know.

Continue reading →

VictoriaMetrics Stream Aggregation: Three-Year Review and Current Status (2026)

May 12, 2026

Introduction

It has been just over three years since the previous article, Applying VictoriaMetrics Stream Aggregation for Metrics, was published in March 2023. In that time the VictoriaMetrics ecosystem has changed tremendously. Let’s revisit the issues raised in that blog post, see what the official project has resolved, and where our stream-metrics-route project stands today.

I. Problems We Encountered Three Years Ago

Let’s quickly recap the core issue list from the 2023 blog post:

Continue reading →

eBPF Series: DeepFlow Extended Protocol Parsing Practice (MongoDB Protocol & Kafka Protocol)

November 25, 2023

Overview: How to Analyze a Protocol (MongoDB) Protocol Document Analysis Approach MongoDB Protocol OpCode Reference Table Analyzing the Most Common OpCode OP_MSG Extending Protocol Parsing in DeepFlow Agent DeepFlow Agent Development Document Overview Code Guide Define a Protocol with a Constant Identifier Prepare Parsing Logic for the New Protocol Define the Struct Implement L7ProtocolParserInterface Extending DeepFlow Protocol Collection Using Wasm Plugins Kafka Protocol Analysis Kafka Header and Data Overview Kafka Fetch API Kafka Produce API Kafka Protocol DeepFlow Agent Native Decoding DeepFlow Agent Wasm Plugin Wasm Go SDK Framework Plugin Code Guide Conclusion Native Rust Extension Wasm Plugin Extension Appendix

Overview

MongoDB is widely used today, but lacks effective observability capabilities. DeepFlow is an excellent solution for observability, but it lacks support for the MongoDB protocol. This article extends DeepFlow with MongoDB protocol parsing, enhancing observability in the MongoDB ecosystem. It briefly describes the process from protocol document analysis to implementing code parsing within DeepFlow.

Continue reading →

Applying VictoriaMetrics Stream Aggregation for Metrics

March 27, 2023

Community VM Stream Aggregation Capability Analysis and Issues

VictoriaMetrics Open-Source Project Native Capabilities

Stream aggregation in the VictoriaMetrics project was integrated into vmagent starting from version 1.86. For details, refer to: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3460

From the source code analysis, the stream aggregation capability looks like this:

The core computation code is described in the pushSample function:

```go
func (as *totalAggrState) pushSample(inputKey, outputKey string, value float64) {
	currentTime := fasttime.UnixTimestamp()
	// Entries expire 1.5 aggregation intervals after their last update.
	deleteDeadline := currentTime + as.intervalSecs + (as.intervalSecs >> 1)

again:
	// Load the per-output-series state, creating it on first use.
	v, ok := as.m.Load(outputKey)
	if !ok {
		v = &totalStateValue{
			lastValues: make(map[string]*lastValueState),
		}
		vNew, loaded := as.m.LoadOrStore(outputKey, v)
		if loaded {
			// Another goroutine stored the entry first; use that one.
			v = vNew
		}
	}
	sv := v.(*totalStateValue)
	sv.mu.Lock()
	deleted := sv.deleted
	if !deleted {
		lv, ok := sv.lastValues[inputKey]
		if !ok {
			lv = &lastValueState{}
			sv.lastValues[inputKey] = lv
		}
		// Add the increase since the previous sample; for a new series or
		// a decreased value (counter reset), add the raw value instead.
		d := value
		if ok && lv.value <= value {
			d = value - lv.value
		}
		// The first sample of a new series is ignored until ignoreInputDeadline passes.
		if ok || currentTime > as.ignoreInputDeadline {
			sv.total += d
		}
		lv.value = value
		lv.deleteDeadline = deleteDeadline
		sv.deleteDeadline = deleteDeadline
	}
	sv.mu.Unlock()
	if deleted {
		// The entry was marked deleted concurrently; retry with a fresh one.
		goto again
	}
}
```

General Application Analysis of Stream Aggregation

First, let’s look at the time series chart after stream aggregation:

Continue reading →