Eyes On You: From SRE Principles to Prometheus Monitoring System Implementation

In distributed internet services with high concurrency and multi-cloud deployment, SRE (Site Reliability Engineering) has become a core role in ensuring service availability, and the monitoring system serves as SRE’s “eyes.” This article starts from SRE core principles, then covers the pain points of modern monitoring systems, technology stack selection, Prometheus core principles, and alerting best practices, and presents a practical methodology for building an enterprise-grade monitoring system.

SRE Core Principles: Stability is the #1 Metric

SRE’s core is ensuring continuous service stability through engineering practices, focusing on capacity planning, cluster maintenance, fault tolerance, load balancing, and monitoring system construction. Stability is measured by three core metrics:

  1. MTBF: Mean Time Between Failures (longer = more stable)
  2. MTTR: Mean Time To Repair (shorter = more efficient)
  3. Availability: Availability = MTBF / (MTBF + MTTR)
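The availability formula above can be worked through numerically. The following is a minimal sketch; the function names and the sample figures (999 hours between failures, 1 hour to repair) are illustrative assumptions, not values from a real service.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), returned as a ratio in [0, 1]."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_minutes(ratio: float) -> float:
    """Downtime per 365-day year implied by an availability ratio."""
    return (1.0 - ratio) * 365 * 24 * 60


# Hypothetical service: fails every 999 hours on average, takes 1 hour to repair.
ratio = availability(mtbf_hours=999, mttr_hours=1)
print(f"availability = {ratio:.3%}")
print(f"downtime per year = {yearly_downtime_minutes(ratio):.1f} minutes")
```

The same formula shows why both levers matter: halving MTTR improves availability about as much as doubling MTBF.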

Four Stages of Operations System Evolution

Enterprise operations capabilities evolve through a complete transformation from manual to platform as business grows:

mermaid
graph LR
    Stage1@{ shape: rounded, label: "Manual Stage" } --> Stage2@{ shape: rounded, label: "Tool Stage" }
    Stage2 --> Stage3@{ shape: rounded, label: "Automation Stage" }
    Stage3 --> Stage4@{ shape: rounded, label: "Platform Stage" }
    Stage1_Desc@{ shape: doc, label: "Fully manual operations" }
    Stage2_Desc@{ shape: doc, label: "Shell/Python/Email" }
    Stage3_Desc@{ shape: doc, label: "Nagios/Zabbix/Ansible/ELK" }
    Stage4_Desc@{ shape: doc, label: "K8s/Cloud Native/Integrated Monitoring Platform" }
    classDef stage fill:#e3f2fd,stroke:#1976d2
    class Stage1,Stage2,Stage3,Stage4 stage

Four Core Pain Points of Modern Monitoring Systems

Monitoring is more than “watching metrics”: it is a comprehensive system for fault localization, capacity management, and full-chain tracing. Enterprises face four major challenges:

  1. Massive data processing: Real-time and non-real-time, structured and unstructured data are mixed together, putting heavy pressure on storage and computation
  2. Multi-dimensional data source fragmentation: Metrics, logs, traces, and synthetic monitoring data sit in separate silos, so correlation analysis is not possible
  3. Information overload: Alert bombing and excessive invalid metrics keep operations staff from focusing on core issues
  4. Complex business fault localization: Under a distributed architecture, fault root causes cannot be traced quickly

Multi-dimensional Standards for Monitoring Data

A complete monitoring event must include full-dimensional labels: Time (when it happened) + Location (data center/node) + Component (service/interface) + Business line + Error code + Status value
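One way to enforce these dimensions is to model the event as a typed record that rejects missing labels. This is a sketch only: the class name, field names, and sample values are hypothetical and do not come from any particular monitoring product.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MonitoringEvent:
    timestamp: datetime   # Time: when it happened
    datacenter: str       # Location: data center
    node: str             # Location: node
    component: str        # Component: service or interface
    business_line: str    # Business line
    error_code: str       # Error code
    status_value: float   # Status value

    def labels(self) -> dict[str, str]:
        """Return the dimensions as string labels, without time and value."""
        fields = asdict(self)
        fields.pop("timestamp")
        fields.pop("status_value")
        return {name: str(value) for name, value in fields.items()}

    def __post_init__(self) -> None:
        # Reject events that leave any dimension empty.
        for name, value in self.labels().items():
            if not value:
                raise ValueError(f"missing dimension: {name}")


event = MonitoringEvent(
    timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
    datacenter="dc1",
    node="node-01",
    component="checkout-api",
    business_line="payments",
    error_code="E5001",
    status_value=1.0,
)
print(event.labels())
```

Validating at construction time means an incomplete event fails where it is produced, instead of surfacing later as an alert that cannot be routed.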

Enterprise Monitoring Technology Stack Panorama Selection

Faced with a wide array of monitoring components, enterprises need layered selection across collection, storage, computation, visualization, and alerting to form a complete closed loop:

mermaid
graph TB
    subgraph Collection["Collection Layer"]
        Exporter@{ shape: doc, label: "Various Exporters" }
        Beats@{ shape: doc, label: "Beats" }
        Flume@{ shape: doc, label: "Flume" }
        Logstash@{ shape: doc, label: "Logstash" }
    end
    
    subgraph Storage["Storage Layer"]
        Prometheus@{ shape: cyl, label: "Prometheus TSDB" }
        Influxdb@{ shape: cyl, label: "InfluxDB" }
        ES@{ shape: cyl, label: "Elasticsearch" }
        Kafka@{ shape: cyl, label: "Kafka" }
    end
    
    subgraph Discovery["Service Discovery"]
        Consul@{ shape: hex, label: "Consul" }
        Etcd@{ shape: hex, label: "Etcd" }
    end
    
    subgraph Visualize["Visualization & Alerting"]
        Grafana@{ shape: hex, label: "Grafana" }
        Kibana@{ shape: hex, label: "Kibana" }
        Alertmanager@{ shape: hex, label: "Alertmanager" }
    end
    
    Collection --> Storage
    Discovery --> Collection
    Storage --> Visualize
    classDef monitor fill:#e3f2fd,stroke:#1976d2
    classDef storage fill:#e8f5e9,stroke:#4caf50
    classDef alert fill:#fce4ec,stroke:#e53935
    class Exporter,Beats,Flume,Logstash,Consul,Etcd,Grafana,Kibana monitor
    class Prometheus,Influxdb,ES,Kafka storage
    class Alertmanager alert

Core selection strategy: Prometheus as metrics core, ELK as logging core, Grafana as unified entry, balancing real-time capability and scalability.

Prometheus Foundation: Deep Dive into Four Core Metric Types

Prometheus’s core capability stems from precise metric type design. Different metrics correspond to different business scenarios and form the foundation of monitoring accuracy:

Four Core Data Types

  • Counter: Only increases, used for cumulative totals (e.g., total requests, CPU seconds)
  • Gauge: Can increase or decrease, used for current state (e.g., free memory, connection count)
  • Histogram: Client-side bucketing, server-side aggregation; quantiles (e.g., P95/P99 latency) are estimated from the buckets at query time
  • Summary: Quantiles are pre-computed on the client, so they are more precise for a single instance but cannot be aggregated across instances on the server
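The behavior of the first three types can be illustrated with toy classes. This is a simplified sketch written for this article, not the API of a real Prometheus client library; the class names and bucket bounds are assumptions.

```python
import bisect


class Counter:
    """Cumulative total that can only go up."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


class Gauge:
    """Current state that can go up or down."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Cumulative buckets: counts[i] is the number of observations <= bounds[i]."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds) + [float("inf")]
        self.counts = [0] * len(self.bounds)
        self.sum = 0.0

    def observe(self, value: float) -> None:
        # Increment every bucket whose upper bound is >= the observed value.
        first = bisect.bisect_left(self.bounds, value)
        for i in range(first, len(self.counts)):
            self.counts[i] += 1
        self.sum += value


latency = Histogram(bounds=[0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.5, 2.0):
    latency.observe(seconds)
print(list(zip(latency.bounds, latency.counts)))
```

Because each bucket is just a count, buckets with the same bound from different instances can be added together, which is what makes server-side aggregation possible for histograms.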

Histogram vs Summary Selection Guide

mermaid
graph TD
    D1@{ shape: diam, label: "Need aggregation?" } -->|"Yes"| R1@{ shape: doc, label: "Choose Histogram" }
    D1 -->|"No"| D2@{ shape: diam, label: "Need precise quantiles?" }
    D2 -->|"Yes"| R2@{ shape: doc, label: "Choose Summary" }
    D2 -->|"No"| R1
    classDef process fill:#f3e5f5,stroke:#9c27b0
    classDef decision fill:#fff3e0,stroke:#ff9800
    class D1,D2 decision
    class R1,R2 process
  • Need multi-dimensional aggregation: Choose Histogram
  • Need precise quantiles, no aggregation needed: Choose Summary

PromQL in Practice: 4 Typical Monitoring Query Examples

PromQL is Prometheus’s query language and the core of monitoring visualization. Here are the most commonly used enterprise examples:

Example 1: Memory Usage Monitoring (Gauge Type)

promql
process_resident_memory_bytes{job="thanos-sidecar"}
  • Scenario: Monitor process memory usage
  • Type: Gauge (reflects current memory value in real-time)

Example 2: CPU Usage Monitoring (Counter Type)

promql
irate(process_cpu_seconds_total[2m])
  • Scenario: Monitor instantaneous CPU usage
  • Technique: Use irate to calculate the counter’s per-second growth rate from the last two samples in the 2m window
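What irate computes can be sketched in a few lines. This is a simplified model for illustration, not Prometheus source code: the function name and sample data are assumptions, and a drop in the counter is treated as a restart from zero.

```python
from typing import Optional

Sample = tuple[float, float]  # (unix timestamp in seconds, counter value)


def instant_rate(samples: list[Sample]) -> Optional[float]:
    """Per-second rate of increase based on the last two samples in the window."""
    if len(samples) < 2:
        return None  # not enough data points in the window
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        return None
    # A drop means the counter restarted from zero, so only the new value counts.
    delta = v_last if v_last < v_prev else v_last - v_prev
    return delta / (t_last - t_prev)


# Hypothetical CPU-seconds counter scraped every 15 seconds.
window = [(0.0, 100.0), (15.0, 103.0), (30.0, 110.5)]
print(instant_rate(window))  # CPU seconds consumed per second, i.e. cores in use
```

Because only the last two samples are used, the result reacts quickly to spikes but is noisy, which is why it suits dashboards better than alert conditions that need a smoothed value.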

Example 3: P95/P99 Latency Monitoring (Histogram Type)

promql
histogram_quantile(0.95, sum(irate(t_request_duration_seconds_bucket[2m])) by (le))
  • Scenario: API response latency percentile statistics
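  • Value: Shows the latency that 95% of requests stay under, exposing the slow tail that averages hide; the result is an estimate interpolated from the bucket boundaries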
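The quantile is estimated by linear interpolation inside the bucket that contains the target rank. The sketch below follows that idea in simplified form; the function name and bucket counts are illustrative assumptions, and edge cases that Prometheus handles (such as negative bucket bounds) are left out.

```python
def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper bound, cumulative count) buckets.

    The bucket list must include the +Inf bucket, as Prometheus histograms do.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if upper == float("inf"):
        # The rank falls beyond the last finite bound, so report that bound.
        return buckets[-2][0] if len(buckets) > 1 else float("nan")
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    if count == below:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical latency buckets in seconds, already summed across instances.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (float("inf"), 100)]
print(estimate_quantile(0.95, latency_buckets))
print(estimate_quantile(0.99, latency_buckets))
```

The second call shows a practical limit: when the rank lands in the +Inf bucket, the estimate is capped at the largest finite bound, so bucket boundaries should extend past the latencies that matter.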

Example 4: Multi-Label Combined Query

promql
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m]))
  • Scenario: CPU idle rate statistics by node

Alert Full Flow: From Trigger to Convergence

Alerting is more than “sending notifications”: it is a closed-loop system that is precise, noise-reduced, and suppressible. It has two major stages: rule evaluation in Prometheus and distribution through Alertmanager.

Alert State Transition

mermaid
stateDiagram-v2
    [*] --> inactive : Threshold not triggered
    inactive --> pending : Threshold triggered
    pending --> firing : Duration satisfied
    pending --> inactive : Recovered before duration satisfied
    firing --> resolved : Threshold recovered
    resolved --> inactive : State reset
    classDef normal fill:#e8f5e9,stroke:#4caf50
    classDef warning fill:#fff3e0,stroke:#ff9800
    classDef critical fill:#ffebee,stroke:#f44336
    class inactive normal
    class pending warning
    class firing critical
    class resolved normal
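  • inactive: Normal state, the alert condition is not met
  • pending: Threshold triggered, waiting for the duration set by for to elapse; returns to inactive if the condition clears first
  • firing: Duration satisfied, the alert is sent to Alertmanager
  • resolved: Fault recovered; a recovery notification is sent and the alert returns to inactive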
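The transitions above can be sketched as a small state machine. This is a simplified model with hypothetical class names, not Prometheus code: the condition is fixed to "value below threshold," and recovery is modeled as a direct return to inactive, leaving the resolved notification to the alert distribution stage.

```python
from enum import Enum
from typing import Optional


class AlertState(Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    FIRING = "firing"


class AlertRule:
    def __init__(self, threshold: float, for_seconds: float) -> None:
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.state = AlertState.INACTIVE
        self.active_since: Optional[float] = None

    def evaluate(self, now: float, value: float) -> AlertState:
        """Run one rule evaluation; the condition is `value < threshold`."""
        if value >= self.threshold:
            # Condition cleared: pending and firing both fall back to inactive.
            self.state = AlertState.INACTIVE
            self.active_since = None
        else:
            if self.active_since is None:
                self.active_since = now
            held_for = now - self.active_since
            if held_for >= self.for_seconds:
                self.state = AlertState.FIRING
            else:
                self.state = AlertState.PENDING
        return self.state


# Hypothetical rule: CPU idle ratio below 0.05 for 2 minutes.
rule = AlertRule(threshold=0.05, for_seconds=120)
for now, idle_ratio in [(0, 0.03), (60, 0.02), (120, 0.01), (180, 0.50)]:
    print(now, rule.evaluate(now, idle_ratio).value)
```

The waiting period is what filters out short spikes: a single bad evaluation only reaches pending, and nothing is sent unless the condition holds for the full duration.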

Alertmanager Core Strategies

  • Grouping: Merge and send similar alerts together, avoid bombing
  • Inhibition: Higher-level alerts suppress lower-level alerts (e.g., node down inhibits process alerts)
  • Silence: Temporarily mute specific alerts, suitable for release/maintenance scenarios
  • Retry: Auto-retry on alert notification failure
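Grouping and inhibition are both label-matching operations, which the following sketch illustrates. It is a simplified model, not Alertmanager's implementation: the function names, alert names, and labels are assumptions, and matching is limited to exact label equality.

```python
from collections import defaultdict

Alert = dict[str, str]  # label name -> label value


def _matches(alert: Alert, matcher: dict[str, str]) -> bool:
    return all(alert.get(name) == value for name, value in matcher.items())


def inhibit(alerts: list[Alert], source: dict[str, str],
            target: dict[str, str], equal: list[str]) -> list[Alert]:
    """Drop target alerts when a source alert shares the same `equal` labels."""
    sources = [a for a in alerts if _matches(a, source)]
    kept = []
    for alert in alerts:
        is_target = _matches(alert, target) and not _matches(alert, source)
        muted = is_target and any(
            all(alert.get(label) == s.get(label) for label in equal)
            for s in sources
        )
        if not muted:
            kept.append(alert)
    return kept


def group(alerts: list[Alert], group_by: list[str]) -> dict[tuple, list[Alert]]:
    """Bundle alerts that share the same values for the group_by labels."""
    groups: dict[tuple, list[Alert]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "NodeDown", "instance": "node-1", "severity": "critical"},
    {"alertname": "ProcessDown", "instance": "node-1", "severity": "warning"},
    {"alertname": "ProcessDown", "instance": "node-2", "severity": "warning"},
    {"alertname": "DiskFull", "instance": "node-2", "severity": "warning"},
]
kept = inhibit(alerts, source={"alertname": "NodeDown"},
               target={"severity": "warning"}, equal=["instance"])
for key, members in group(kept, group_by=["instance"]).items():
    print(key, [a["alertname"] for a in members])
```

In this example the node-1 process alert is suppressed because the node itself is down, and the two node-2 alerts are delivered as one notification instead of two.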

Alert Rule Configuration Example

yaml
groups:
- name: nodehealth
  rules:
  - alert: CPUIdleTooLow
    expr: avg by (hostname)(rate(node_cpu_seconds_total{mode="idle"}[2m])) < 0.05
    for: 2m
    labels:
      level: High
    annotations:
      description: "{{ $labels.hostname }} CPU idle rate is only {{ $value | humanizePercentage }}"

Enterprise Monitoring System Construction Approach

From “full monitoring coverage” to “precise and effective monitoring,” enterprises need to follow a more → less → more construction path:

  1. Standardization: Unify log format, metric naming, SLA conventions
  2. Normalization: Unified collection, unified labels, unified storage for multiple data sources
  3. Noise Reduction & Aggregation: Alert grouping, inhibition, abstraction, reduce ineffective notifications
  4. Full-Chain Tracing: Use TraceID/RequestID to connect the full process, quickly locate root causes
  5. Quantitative Operations: Measure system health with metrics like slow request ratio, response time, success rate
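The health metrics named in step 5 can be computed from raw request records. This is a minimal sketch under stated assumptions: the record type, the 1-second slow threshold, and the rule that only 5xx status codes count as failures are all illustrative choices, not conventions from the source.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_seconds: float
    status_code: int


def health_summary(requests: list[Request],
                   slow_threshold_seconds: float = 1.0) -> dict[str, float]:
    """Summarize success rate, slow request ratio, and mean response time."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests to summarize")
    succeeded = sum(1 for r in requests if r.status_code < 500)
    slow = sum(1 for r in requests if r.duration_seconds > slow_threshold_seconds)
    return {
        "success_rate": succeeded / total,
        "slow_request_ratio": slow / total,
        "avg_response_seconds": sum(r.duration_seconds for r in requests) / total,
    }


sample = [
    Request(0.2, 200),
    Request(0.4, 200),
    Request(1.5, 200),
    Request(0.3, 500),
]
print(health_summary(sample))
```

In a Prometheus setup these ratios would normally be derived from counters and histograms at query time; the sketch only shows what each number means.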

Summary

The monitoring system is SRE’s core infrastructure. The cloud-native monitoring stack centered on Prometheus addresses the core needs of massive metrics, multi-dimensional collection, and precise alerting. From SRE principles to monitoring system construction, the essence is using data to drive stability: monitoring moves from “seeing something” to “seeing accurately and managing effectively,” and service availability improves continuously as a result.