Eyes On You: From SRE Principles to Prometheus Monitoring System Implementation
In an era of distributed internet services, high concurrency, and multi-cloud deployment, SRE (Site Reliability Engineering) has become a core discipline for ensuring service availability, and the monitoring system serves as SRE’s “eyes.” Starting from core SRE principles, this article walks through the pain points of modern monitoring systems, technology stack selection, Prometheus fundamentals, and alerting best practices, and presents a practical methodology for building an enterprise-grade monitoring system.
SRE Core Principles: Stability is the #1 Metric
SRE’s core is ensuring continuous service stability through engineering practices, focusing on capacity planning, cluster maintenance, fault tolerance, load balancing, and monitoring system construction. Three metrics measure that stability:
- MTBF: Mean Time Between Failures (longer = more stable)
- MTTR: Mean Time To Repair (shorter = more efficient)
- Availability: Availability = MTBF / (MTBF + MTTR)
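The formula above can be checked with a few lines of Python. This is a minimal sketch: the function names and the sample figures (a failure every 999 hours, one hour to repair) are illustrative assumptions, not values from a real service.

```python
def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR), returned as a fraction of 1."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_budget_minutes(target, period_days=30):
    """Minutes of downtime that an availability target leaves in a period."""
    return (1.0 - target) * period_days * 24 * 60


# Hypothetical service: fails once every 999 hours, takes 1 hour to repair.
print(f"availability: {availability(999, 1):.3%}")
print(f"30-day downtime budget at 99.9%: {downtime_budget_minutes(0.999):.1f} min")
```

Halving MTTR improves availability about as much as doubling MTBF, which is why fast detection through monitoring matters as much as failure prevention.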
Four Stages of Operations System Evolution
Enterprise operations capabilities evolve through a complete transformation from manual to platform as business grows:
graph LR
Stage1@{ shape: rounded, label: "Manual Stage" } --> Stage2@{ shape: rounded, label: "Tool Stage" }
Stage2 --> Stage3@{ shape: rounded, label: "Automation Stage" }
Stage3 --> Stage4@{ shape: rounded, label: "Platform Stage" }
Stage1_Desc@{ shape: doc, label: "Fully manual operations" }
Stage2_Desc@{ shape: doc, label: "Shell/Python/Email" }
Stage3_Desc@{ shape: doc, label: "Nagios/Zabbix/Ansible/ELK" }
Stage4_Desc@{ shape: doc, label: "K8s/Cloud Native/Integrated Monitoring Platform" }
classDef stage fill:#e3f2fd,stroke:#1976d2
class Stage1,Stage2,Stage3,Stage4 stage
Four Core Pain Points of Modern Monitoring Systems
Monitoring isn’t just about “watching metrics”—it’s a comprehensive system for fault localization, capacity management, and full-chain tracing. Enterprises face four major challenges:
- Massive data processing: Real-time and non-real-time, structured and unstructured data are mixed together, putting heavy pressure on storage and computation
- Multi-dimensional data source fragmentation: Metrics, logs, traces, and synthetic monitoring data sit in separate silos, so correlation analysis is not possible
- Information overload: Alert bombing and excessive invalid metrics keep operations staff from focusing on core issues
- Complex business fault localization: Under a distributed architecture, fault root causes cannot be traced quickly
Multi-dimensional Standards for Monitoring Data
A complete monitoring event must include full-dimensional labels: Time (when it happened) + Location (data center/node) + Component (service/interface) + Business line + Error code + Status value
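One way to enforce these dimensions is to model the event as a typed record and reject incomplete ones at ingestion time. The sketch below is an assumption about how such a record might look; the class name, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone


@dataclass(frozen=True)
class MonitoringEvent:
    timestamp: datetime      # Time: when it happened
    location: str            # Location: data center / node
    component: str           # Component: service / interface
    business_line: str       # Business line
    error_code: str          # Error code
    status: str              # Status value

    def missing_dimensions(self):
        """Return the names of dimensions that are empty or unset."""
        return [f.name for f in fields(self) if getattr(self, f.name) in ("", None)]


event = MonitoringEvent(
    timestamp=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    location="dc1/node-07",
    component="order-service:/api/pay",
    business_line="payments",
    error_code="",           # deliberately left empty to show the check
    status="500",
)
print(event.missing_dimensions())
```

An event that reports any missing dimension can be dropped or routed to a data-quality queue before it reaches storage.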
Enterprise Monitoring Technology Stack: Panorama and Selection
Faced with a wide array of monitoring components, enterprises need layered selection across collection, storage, computation, visualization, and alerting to form a complete closed loop:
graph TB
subgraph Collection["Collection Layer"]
Exporter@{ shape: doc, label: "Various Exporters" }
Beats@{ shape: doc, label: "Beats" }
Flume@{ shape: doc, label: "Flume" }
Logstash@{ shape: doc, label: "Logstash" }
end
subgraph Storage["Storage Layer"]
Prometheus@{ shape: cyl, label: "Prometheus TSDB" }
Influxdb@{ shape: cyl, label: "InfluxDB" }
ES@{ shape: cyl, label: "Elasticsearch" }
Kafka@{ shape: cyl, label: "Kafka" }
end
subgraph Discovery["Service Discovery"]
Consul@{ shape: hex, label: "Consul" }
Etcd@{ shape: hex, label: "Etcd" }
end
subgraph Visualize["Visualization & Alerting"]
Grafana@{ shape: hex, label: "Grafana" }
Kibana@{ shape: hex, label: "Kibana" }
Alertmanager@{ shape: hex, label: "Alertmanager" }
end
Collection --> Storage
Discovery --> Collection
Storage --> Visualize
classDef monitor fill:#e3f2fd,stroke:#1976d2
classDef storage fill:#e8f5e9,stroke:#4caf50
classDef alert fill:#fce4ec,stroke:#e53935
class Exporter,Beats,Flume,Logstash,Consul,Etcd,Grafana,Kibana monitor
class Prometheus,Influxdb,ES,Kafka storage
class Alertmanager alert
Core selection strategy: Prometheus as metrics core, ELK as logging core, Grafana as unified entry, balancing real-time capability and scalability.
Prometheus Foundation: Deep Dive into Four Core Metric Types
Prometheus’s core capability stems from precise metric type design. Different metrics correspond to different business scenarios and form the foundation of monitoring accuracy:
Four Core Data Types
- Counter: Only increases, used for cumulative totals (e.g., total requests, CPU seconds)
- Gauge: Can increase or decrease, used for current state (e.g., free memory, connection count)
- Histogram: Client-side bucketing, server-side aggregation, supports quantile queries (e.g., P95/P99 latency)
- Summary: Client-side pre-computed quantiles; they cannot be aggregated on the server across instances, but their accuracy does not depend on bucket boundaries
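The semantics of the first three types can be modeled in a few lines. This is a simplified sketch for illustration only, not the API of any Prometheus client library; the class and attribute names are assumptions.

```python
import bisect


class Counter:
    """Monotonic total: can only go up."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Current state: can go up or down."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Counts observations into cumulative 'less than or equal' buckets."""
    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.cumulative_counts = [0] * len(self.upper_bounds)
        self.total = 0.0

    def observe(self, value):
        # First bucket whose upper bound is >= value, then every larger bucket.
        first = bisect.bisect_left(self.upper_bounds, value)
        for i in range(first, len(self.cumulative_counts)):
            self.cumulative_counts[i] += 1
        self.total += value


latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.5, 2.0):
    latency.observe(seconds)
print(list(zip(latency.upper_bounds, latency.cumulative_counts)))
```

Because each bucket is itself a counter, buckets from many instances can be summed on the server before a quantile is estimated.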
Histogram vs Summary Selection Guide
graph TD
D1@{ shape: diam, label: "Need aggregation?" } -->|"Yes"| R1@{ shape: doc, label: "Choose Histogram" }
D1 -->|"No"| D2@{ shape: diam, label: "Need precise quantiles?" }
D2@{ shape: diam, label: "Need precise quantiles?" } -->|"Yes"| R2@{ shape: doc, label: "Choose Summary" }
D2 -->|"No"| R1
classDef process fill:#f3e5f5,stroke:#9c27b0
classDef decision fill:#fff3e0,stroke:#ff9800
class D1,D2 decision
class R1,R2 process
- Need multi-dimensional aggregation: Choose Histogram
- Need precise quantiles, no aggregation needed: Choose Summary
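The reason aggregation drives this choice is that quantiles computed per instance cannot be averaged into a meaningful global quantile. The sketch below uses made-up latency data for two hypothetical instances and a simple nearest-rank percentile to show the gap.

```python
def nearest_rank_percentile(values, percent):
    """Nearest-rank percentile; percent is an integer such as 95."""
    ordered = sorted(values)
    rank = max(1, (percent * len(ordered) + 99) // 100)  # ceiling division
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds.
instance_a = [10] * 90 + [200] * 10   # low traffic, 10% slow requests
instance_b = [10] * 900               # high traffic, no slow requests

p95_a = nearest_rank_percentile(instance_a, 95)
p95_b = nearest_rank_percentile(instance_b, 95)
averaged_p95 = (p95_a + p95_b) / 2
true_p95 = nearest_rank_percentile(instance_a + instance_b, 95)

print("average of per-instance P95:", averaged_p95)
print("P95 over all requests:     ", true_p95)
```

Averaging the two pre-computed values overstates the fleet-wide P95 by an order of magnitude here, because it ignores how many requests each instance served. Histogram bucket counts do not have this problem, since they can be summed before the quantile is estimated.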
PromQL in Practice: 4 Typical Monitoring Query Examples
PromQL is Prometheus’s query language and the core of monitoring visualization. Here are the most commonly used enterprise examples:
Example 1: Memory Usage Monitoring (Gauge Type)
- Scenario: Monitor process memory usage
- Type: Gauge (reflects current memory value in real-time)
Example 2: CPU Usage Monitoring (Counter Type)
- Scenario: Monitor instantaneous CPU usage
- Technique: Use irate to calculate counter growth rate
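A query of this kind typically has the shape `100 * (1 - irate(node_cpu_seconds_total{mode="idle"}[5m]))`, assuming node_exporter is the metric source. The Python sketch below approximates what irate does with the last two samples in the range, including the usual handling of a counter reset; it is a simplified model, not Prometheus's implementation, and the sample values are invented.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, value) samples."""
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        return None
    # If the counter dropped, assume it was reset and restarted from zero.
    increase = v_last if v_last < v_prev else v_last - v_prev
    return increase / (t_last - t_prev)


def cpu_usage_percent(idle_samples):
    """CPU usage derived from the idle-time counter of a single core."""
    idle_rate = irate(idle_samples)
    return None if idle_rate is None else 100.0 * (1.0 - idle_rate)


# Hypothetical idle-seconds counter scraped every 15 seconds.
idle_seconds = [(0, 1000.0), (15, 1012.0), (30, 1024.0)]
print(f"{cpu_usage_percent(idle_seconds):.1f}% busy")
```

Because only the last two samples are used, irate reacts quickly to spikes, which suits dashboards; rate over a longer window is usually the steadier choice for alert rules.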
Example 3: P95/P99 Latency Monitoring (Histogram Type)
- Scenario: API response latency percentile statistics
- Value: Quantify the tail latency experienced by the slowest 5% or 1% of requests
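In PromQL this is usually written along the lines of `histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`, where the metric name is a hypothetical example. The sketch below shows the idea behind the estimate: find the bucket that contains the target rank and interpolate linearly inside it. It is a simplified model under those assumptions, and the bucket counts are invented.

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with float('inf').
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if len(buckets) < 2 or buckets[-1][0] != float("inf"):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if upper == float("inf"):
        # No upper limit to interpolate towards: fall back to the last finite bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0
    if count == below:
        return lower
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical request-latency buckets in seconds.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (float("inf"), 100)]
print("P95 ~", histogram_quantile(0.95, latency_buckets))
print("P99 ~", histogram_quantile(0.99, latency_buckets))
```

The P99 result shows the main limitation: once the rank falls into the open-ended bucket, the estimate is capped at the highest finite boundary, so bucket boundaries should bracket the latencies that matter.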
Example 4: Multi-Label Combined Query
- Scenario: CPU idle rate statistics by node
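A per-node query of this kind might look like `avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]))`, again assuming node_exporter metrics. What the label filter and the `avg by` aggregation do can be sketched in Python as follows; the series data and the helper name are invented for illustration.

```python
from collections import defaultdict


def avg_by(series, label):
    """Average sample values grouped by one label, like PromQL's avg by (...)."""
    groups = defaultdict(list)
    for labels, value in series:
        groups[labels[label]].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Hypothetical per-core rates: (labels, per-second rate of CPU time in that mode).
series = [
    ({"instance": "node-a", "cpu": "0", "mode": "idle"}, 0.75),
    ({"instance": "node-a", "cpu": "1", "mode": "idle"}, 0.50),
    ({"instance": "node-a", "cpu": "0", "mode": "user"}, 0.20),
    ({"instance": "node-b", "cpu": "0", "mode": "idle"}, 0.25),
]

# Label matcher first (mode="idle"), aggregation second (by instance).
idle_only = [(labels, value) for labels, value in series if labels["mode"] == "idle"]
print(avg_by(idle_only, "instance"))
```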
Alert Full Flow: From Trigger to Convergence
Alerting isn’t just “sending notifications”—it’s a precise, noise-reduced, suppressible closed-loop system, divided into two major stages: Prometheus rule evaluation and Alertmanager distribution.
Alert State Transition
stateDiagram-v2
[*] --> inactive : Threshold not triggered
inactive --> pending : Threshold triggered
pending --> firing : Duration satisfied
firing --> resolved : Threshold recovered
resolved --> inactive : State reset
classDef normal fill:#e8f5e9,stroke:#4caf50
classDef warning fill:#fff3e0,stroke:#ff9800
classDef critical fill:#ffebee,stroke:#f44336
class inactive normal
class pending warning
class firing critical
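class resolved normal
- inactive: Normal state
- pending: Threshold triggered, waiting for duration verification; if the condition clears before the duration elapses, the alert returns to inactive
- firing: Duration satisfied, formal alert triggered
- resolved: Fault recovered; a resolved notification can be sent through Alertmanager before the alert returns to inactive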
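The transitions above can be expressed as a small state machine driven by periodic rule evaluations. This is a sketch under simplifying assumptions: the class name is hypothetical, time is passed in explicitly as seconds, and one object tracks a single alert instance.

```python
class AlertStateMachine:
    """Tracks one alert through inactive -> pending -> firing -> resolved."""

    def __init__(self, for_seconds):
        self.for_seconds = for_seconds   # how long the condition must hold
        self.state = "inactive"
        self.active_since = None

    def evaluate(self, now, condition_true):
        if condition_true:
            if self.state in ("inactive", "resolved"):
                self.state, self.active_since = "pending", now
            if self.state == "pending" and now - self.active_since >= self.for_seconds:
                self.state = "firing"
        else:
            # A firing alert resolves; anything else simply goes back to inactive.
            self.state = "resolved" if self.state == "firing" else "inactive"
            self.active_since = None
        return self.state


alert = AlertStateMachine(for_seconds=60)
timeline = [(0, True), (30, True), (60, True), (90, False), (120, False)]
states = [alert.evaluate(now, breached) for now, breached in timeline]
print(states)
```

The pending stage is what filters out short spikes: a threshold breach that lasts less than the configured duration never produces a notification.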
Alertmanager Core Strategies
- Grouping: Merge and send similar alerts together, avoid bombing
- Inhibition: Higher-level alerts suppress lower-level alerts (e.g., node down inhibits process alerts)
- Silence: Temporarily mute specific alerts, suitable for release/maintenance scenarios
- Retry: Auto-retry on alert notification failure
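Grouping and inhibition are the two strategies that most directly cut notification volume. The sketch below models both on plain label dictionaries. It mirrors the idea behind Alertmanager's `group_by` and `inhibit_rules` settings, but the functions and the sample alerts are illustrative assumptions, not Alertmanager code.

```python
from collections import defaultdict


def matches(alert, matchers):
    """True if the alert carries every label value in matchers."""
    return all(alert.get(name) == value for name, value in matchers.items())


def inhibit(alerts, source, target, equal):
    """Drop target alerts while a source alert with the same 'equal' labels exists."""
    sources = [a for a in alerts if matches(a, source)]
    kept = []
    for alert in alerts:
        suppressed = matches(alert, target) and any(
            src is not alert and all(src.get(k) == alert.get(k) for k in equal)
            for src in sources
        )
        if not suppressed:
            kept.append(alert)
    return kept


def group(alerts, group_by):
    """Bundle alerts that share the same values for the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert.get(k, "") for k in group_by)].append(alert)
    return dict(groups)


# Hypothetical firing alerts.
alerts = [
    {"alertname": "NodeDown", "instance": "node-a", "severity": "critical"},
    {"alertname": "ProcessDown", "instance": "node-a", "job": "api"},
    {"alertname": "ProcessDown", "instance": "node-b", "job": "api"},
    {"alertname": "ProcessDown", "instance": "node-b", "job": "worker"},
]

remaining = inhibit(
    alerts,
    source={"alertname": "NodeDown"},
    target={"alertname": "ProcessDown"},
    equal=["instance"],
)
notifications = group(remaining, group_by=["alertname", "instance"])
for key, members in notifications.items():
    print(key, len(members))
```

Four firing alerts become two notifications: the process alert on the dead node is suppressed, and the two process alerts on node-b are sent together.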
Alert Rule Configuration Example
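As an illustration of what such a rule contains, the sketch below builds one rule in Python and renders it in the YAML layout that Prometheus rule files use (`groups`, `rules`, `alert`, `expr`, `for`, `labels`, `annotations`). The rule itself, the group name, and the small renderer are assumptions made for this example.

```python
import json

# Hypothetical rule: fire when a scrape target has been unreachable for 5 minutes.
instance_down = {
    "alert": "InstanceDown",
    "expr": "up == 0",
    "for": "5m",
    "labels": {"severity": "critical"},
    "annotations": {"summary": "Instance {{ $labels.instance }} is down"},
}


def render_rule_file(group_name, rules):
    """Render rules as YAML text; string values are double-quoted via json.dumps."""
    lines = ["groups:", f"  - name: {group_name}", "    rules:"]
    for rule in rules:
        lines.append(f"      - alert: {rule['alert']}")
        lines.append(f"        expr: {json.dumps(rule['expr'])}")
        lines.append(f"        for: {rule['for']}")
        for section in ("labels", "annotations"):
            lines.append(f"        {section}:")
            for key, value in rule[section].items():
                lines.append(f"          {key}: {json.dumps(value)}")
    return "\n".join(lines)


print(render_rule_file("node-alerts", [instance_down]))
```

The `for` field is the duration an alert spends in the pending state, and the `severity` label is what Alertmanager routing and inhibition typically match on.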
Enterprise Monitoring System Construction Approach
From “full monitoring coverage” to “precise and effective monitoring,” enterprises need to follow a more → less → more construction path:
- Standardization: Unify log format, metric naming, SLA conventions
- Normalization: Unified collection, unified labels, unified storage for multiple data sources
- Noise Reduction & Aggregation: Alert grouping, inhibition, abstraction, reduce ineffective notifications
- Full-Chain Tracing: Use TraceID/RequestID to connect the full process, quickly locate root causes
- Quantitative Operations: Measure system health with metrics like slow request ratio, response time, success rate
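The quantitative metrics in the last item can be derived directly from request records. This is a minimal sketch: the 500 ms slow-request threshold, the rule that status codes below 500 count as success, and the sample data are all assumptions to be replaced by the team's own SLA conventions.

```python
def health_summary(requests, slow_threshold_ms=500):
    """Summarize (duration_ms, http_status) records into health indicators."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests to summarize")
    slow = sum(1 for duration, _ in requests if duration > slow_threshold_ms)
    succeeded = sum(1 for _, status in requests if status < 500)
    return {
        "success_rate": succeeded / total,
        "slow_request_ratio": slow / total,
        "avg_response_ms": sum(duration for duration, _ in requests) / total,
    }


# Hypothetical access-log sample.
sample = [(100, 200), (600, 200), (200, 500), (300, 200)]
print(health_summary(sample))
```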
Summary
The monitoring system is SRE’s core infrastructure. The cloud-native monitoring stack centered on Prometheus addresses the core needs of massive metrics, multi-dimensional collection, and precise alerting. From SRE principles to monitoring system construction, the essence is using data to drive stability, transforming monitoring from “seeing something” to “seeing accurately and managing effectively,” ultimately achieving continuous improvement in service availability.