Observability

eBPF Series: DeepFlow Extended Protocol Parsing Practice (MongoDB Protocol & Kafka Protocol)

Contents: Overview; How to Analyze a Protocol (MongoDB); Protocol Document Analysis Approach; MongoDB Protocol OpCode Reference Table; Analyzing the Most Common OpCode OP_MSG; Extending Protocol Parsing in DeepFlow Agent; DeepFlow Agent Development Document Overview; Code Guide; Define a Protocol with a Constant Identifier; Prepare Parsing Logic for the New Protocol; Define the Struct; Implement L7ProtocolParserInterface; Extending DeepFlow Protocol Collection Using Wasm Plugins; Kafka Protocol Analysis; Kafka Header and Data Overview; Kafka Fetch API; Kafka Produce API; Kafka Protocol DeepFlow Agent Native Decoding; DeepFlow Agent Wasm Plugin; Wasm Go SDK Framework; Plugin Code Guide; Conclusion; Native Rust Extension; Wasm Plugin Extension; Appendix

Overview

MongoDB is widely used today, but lacks effective observability capabilities. DeepFlow is an excellent solution for observability, but it lacks support for the MongoDB protocol. This article extends DeepFlow with MongoDB protocol parsing, enhancing observability in the MongoDB ecosystem. It briefly describes the process from protocol document analysis to implementing code parsing within DeepFlow.
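To give a feel for what the first parsing step can look like, here is a minimal Go sketch that decodes the standard 16-byte MongoDB wire-protocol message header (four little-endian int32 fields) and checks for the OP_MSG opcode (2013). The names `msgHeader`, `parseHeader`, and `opMsg` are hypothetical and chosen for illustration only; this is not DeepFlow's parser, which the article builds against the agent's own interfaces.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// opMsg is the OP_MSG opcode value defined by the MongoDB wire protocol.
const opMsg = 2013

// msgHeader mirrors the 16-byte standard message header.
// (Hypothetical type name, for illustration only.)
type msgHeader struct {
	MessageLength int32 // total message size, including this header
	RequestID     int32 // identifier for this message
	ResponseTo    int32 // requestID of the message being answered
	OpCode        int32 // message type, e.g. 2013 for OP_MSG
}

// parseHeader decodes the little-endian header at the start of payload.
func parseHeader(payload []byte) (msgHeader, error) {
	if len(payload) < 16 {
		return msgHeader{}, errors.New("payload shorter than 16-byte header")
	}
	return msgHeader{
		MessageLength: int32(binary.LittleEndian.Uint32(payload[0:4])),
		RequestID:     int32(binary.LittleEndian.Uint32(payload[4:8])),
		ResponseTo:    int32(binary.LittleEndian.Uint32(payload[8:12])),
		OpCode:        int32(binary.LittleEndian.Uint32(payload[12:16])),
	}, nil
}

func main() {
	// Hand-built header: length 16, requestID 7, responseTo 0, opCode 2013.
	raw := []byte{16, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0xDD, 0x07, 0, 0}
	h, err := parseHeader(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("opCode=%d isOpMsg=%t\n", h.OpCode, h.OpCode == opMsg) // opCode=2013 isOpMsg=true
}
```

A real parser would go on to read the OP_MSG flag bits and sections that follow the header; the sketch stops at the header because that is the part every opcode shares.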

Continue reading →

Applying VictoriaMetrics Stream Aggregation for Metrics

Community VM Stream Aggregation Capability Analysis and Issues

VictoriaMetrics Open-Source Project Native Capabilities

Stream aggregation in the VictoriaMetrics project was integrated into vmagent starting from version 1.86. For details, refer to: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3460

From the source code analysis, the stream aggregation capability looks like this:

The core computation code is described in the pushSample function:

```go
func (as *totalAggrState) pushSample(inputKey, outputKey string, value float64) {
	currentTime := fasttime.UnixTimestamp()
	deleteDeadline := currentTime + as.intervalSecs + (as.intervalSecs >> 1)

again:
	v, ok := as.m.Load(outputKey)
	if !ok {
		v = &totalStateValue{
			lastValues: make(map[string]*lastValueState),
		}
		vNew, loaded := as.m.LoadOrStore(outputKey, v)
		if loaded {
			v = vNew
		}
	}
	sv := v.(*totalStateValue)
	sv.mu.Lock()
	deleted := sv.deleted
	if !deleted {
		lv, ok := sv.lastValues[inputKey]
		if !ok {
			lv = &lastValueState{}
			sv.lastValues[inputKey] = lv
		}
		d := value
		if ok && lv.value <= value {
			d = value - lv.value
		}
		if ok || currentTime > as.ignoreInputDeadline {
			sv.total += d
		}
		lv.value = value
		lv.deleteDeadline = deleteDeadline
		sv.deleteDeadline = deleteDeadline
	}
	sv.mu.Unlock()
	if deleted {
		goto again
	}
}
```

General Application Analysis of Stream Aggregation

First, let’s look at the time series chart after stream aggregation:

Continue reading →

eBPF Series: A Brief Analysis of Pixie

Deployment process and instructions reference: pixie install

Pixie Platform Main Components

Pixie Edge Module (PEM): Pixie’s agent, installed per node. PEMs use eBPF to collect data, which is stored locally on the node.
Vizier: Pixie’s collector, installed per cluster. Responsible for query execution and managing PEMs.
Pixie Cloud: Used for user management, authentication, and data proxying. Can be hosted or self-hosted.
Pixie CLI: Used to deploy Pixie. Can also be used to run queries and manage resources like API keys.

Continue reading →

A Casual Talk About CPU Timing and Modern Operating Systems

Time-Sharing Systems and Linux

First, let’s review time-sharing systems. The time-sharing system is a very important operating system concept that maximizes computer utilization and is a crucial means of implementing multi-program concurrency. The Linux kernel we use daily also adopts the time-sharing system philosophy, mainly reflected in the following aspects:

Time Slice: Linux uses a time slice mechanism to divide CPU time. Each process can only execute for one time slice before yielding the CPU to other processes. This achieves CPU time sharing and fair allocation.
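The time slice idea can be illustrated with a small Go simulation of round-robin scheduling with a fixed quantum. This is a teaching sketch under simplifying assumptions, not how the Linux scheduler is implemented: the `proc` type, the `roundRobin` function, and the abstract "units" of CPU time are all hypothetical.

```go
package main

import "fmt"

// proc is a hypothetical process that still needs `remaining` units of CPU time.
type proc struct {
	name      string
	remaining int
}

// roundRobin gives each process at most `quantum` units per turn, then moves it
// to the back of the queue. It returns the order in which processes finish.
func roundRobin(procs []proc, quantum int) []string {
	if quantum <= 0 {
		return nil // a non-positive quantum would never make progress
	}
	queue := append([]proc(nil), procs...) // copy so the caller's slice is untouched
	var finished []string
	for len(queue) > 0 {
		p := queue[0]
		queue = queue[1:]
		if p.remaining > quantum {
			// Time slice used up: yield the CPU and wait for the next turn.
			p.remaining -= quantum
			queue = append(queue, p)
		} else {
			// The process completes within this slice.
			finished = append(finished, p.name)
		}
	}
	return finished
}

func main() {
	procs := []proc{{"A", 5}, {"B", 2}, {"C", 4}}
	fmt.Println(roundRobin(procs, 2)) // [B C A]
}
```

With a quantum of 2, the short job B finishes in its first turn while A, the longest job, finishes last even though it was queued first, which is the fairness property the time slice mechanism is meant to provide.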

Continue reading →

Eyes On You: The 2022 Productization Journey of a Multi-Cloud Heterogeneous Monitoring Platform

In the context of multi-cloud deployment, global networking, and exponential growth in service scale within internet businesses, monitoring platforms have long surpassed the basic “metric collection + alert notification” positioning, becoming the core infrastructure for ensuring end-to-end stability. This article, based on the real evolution of a large-scale internet enterprise monitoring platform, dissects the complete planning and implementation approach for upgrading the monitoring platform from large-scale coverage to productization, usability, and intelligence in 2022.

Continue reading →

Eyes On You: From SRE Principles to Prometheus Monitoring System Implementation

In the context of distributed internet services, high concurrency, and multi-cloud deployment, SRE (Site Reliability Engineering) has become a core role in ensuring service availability, and the monitoring system serves as SRE’s “eyes.” Starting from core SRE principles, this article examines the pain points of modern monitoring systems, technology stack selection, Prometheus core principles, and alerting best practices, and presents a practical methodology for building an enterprise-grade monitoring system.

SRE Core Principles: Stability is the #1 Metric

The core of SRE is ensuring continuous service stability through engineering practices, focusing on capacity planning, cluster maintenance, fault tolerance, load balancing, and monitoring system construction. There are only 3 core measurement metrics:

Continue reading →

Monitoring Collection Notes

MySQL Monitoring

MySQL Privilege Best Practices

Privilege control exists primarily for security, so follow these best practices:

Grant only the minimum privileges needed, so that users cannot do harm. For example, if a user only needs to query, grant just SELECT, not UPDATE, INSERT, or DELETE.
Restrict the login host when creating users, typically to a specific IP or an internal network IP range.
Delete users without passwords after initializing the database; the installation creates some password-less users by default.
Set a password that meets complexity requirements for each user.
Periodically clean up unnecessary users by revoking their privileges or deleting them.

Example:

Continue reading →