Eyes On You: From SRE Principles to Prometheus Monitoring System Implementation

June 20, 2020 · Observability · Prometheus, SRE, Monitoring System, Architecture · Observability Series


SRE (Site Reliability Engineering) is a core role in keeping distributed services available, and the monitoring system is the foundation of SRE work. Starting from SRE core principles, this article walks through the pain points of modern monitoring systems, technology stack selection, Prometheus core principles, and alerting best practices, and offers a practical methodology for building enterprise-grade monitoring.

SRE Core Principles: Stability is the #1 Metric

SRE’s core is ensuring continuous service stability through engineering practices, focusing on capacity planning, cluster maintenance, fault tolerance, load balancing, and monitoring system construction. There are only 3 core measurement metrics:

MTBF: Mean Time Between Failures (longer = more stable)
MTTR: Mean Time To Repair (shorter = more efficient)
Availability: Availability = MTBF / (MTBF + MTTR)
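
To make the availability formula concrete, here is a minimal Python sketch. The function name and the sample numbers (a failure every 999 hours, one hour to repair) are illustrative assumptions, not figures from a real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    total = mtbf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTBF + MTTR must be positive")
    return mtbf_hours / total


# Hypothetical service: fails on average every 999 hours, takes 1 hour to repair.
example = availability(999.0, 1.0)
print(f"{example:.3%}")
```

With these numbers the service is available 99.9% of the time. Halving MTTR raises availability without touching MTBF, which is why fast fault localization matters as much as preventing failures.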

Four Stages of Operations System Evolution

Enterprise operations capabilities evolve through a complete transformation from manual to platform as business grows:

mermaid
flowchart TD
    S1["Manual Stage<br/>fully manual ops"] --> S2["Tool Stage<br/>Shell/Python/Email"]
    S2 --> S3["Automation Stage<br/>Nagios/Zabbix/ELK"]
    S3 --> S4["Platform Stage<br/>K8s/Cloud Native"]
    classDef stage fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    class S1,S2,S3,S4 stage

Four Core Pain Points of Modern Monitoring Systems

Monitoring isn’t just about “watching metrics”—it’s a comprehensive system for fault localization, capacity management, and full-chain tracing. Enterprises face four major challenges:

Massive data processing: Mixed real-time/non-real-time, structured/unstructured data, heavy storage and computation pressure
Multi-dimensional data source fragmentation: Metrics, logs, traces, and synthetic monitoring data sit in separate silos, so they cannot be correlated
Information overload: Alert bombing, excessive invalid metrics, operations staff can’t focus on core issues
Complex business fault localization: Under distributed architecture, unable to quickly trace fault root causes

Multi-dimensional Standards for Monitoring Data

A complete monitoring event must include full-dimensional labels: Time (when it happened) + Location (data center/node) + Component (service/interface) + Business line + Error code + Status value
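
One way to enforce these dimensions is to give every event a fixed shape. The sketch below is an assumption about how that could look in Python; the class name, field names, and sample values are hypothetical and only mirror the dimensions listed above.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class MonitoringEvent:
    timestamp: datetime   # Time: when it happened
    location: str         # Location: data center / node
    component: str        # Component: service / interface
    business_line: str    # Business line that owns the component
    error_code: str       # Error code reported by the component
    status_value: float   # Status value observed at that moment

    def labels(self) -> dict:
        """Return the descriptive dimensions (everything except time and value)."""
        fields = asdict(self)
        fields.pop("timestamp")
        fields.pop("status_value")
        return {name: str(value) for name, value in fields.items()}


# Hypothetical event, for illustration only.
event = MonitoringEvent(
    timestamp=datetime(2020, 6, 20, 12, 0, tzinfo=timezone.utc),
    location="dc1/node-07",
    component="order-service:/api/v1/orders",
    business_line="checkout",
    error_code="E5001",
    status_value=503.0,
)
print(event.labels())
```

Because every field is required, an event that omits a dimension fails at construction time instead of producing a record that cannot be filtered later.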

Enterprise Monitoring Technology Stack Panorama Selection

Faced with a wide array of monitoring components, enterprises need layered selection across collection, storage, computation, visualization, and alerting to form a complete closed loop:

mermaid
flowchart TD
    Coll["Collection<br/>Exporter/Beats/Flume/Logstash"]
    Disc@{ shape: hex, label: "Service Discovery<br/>Consul/Etcd" }
    Stor@{ shape: cyl, label: "Storage<br/>Prometheus/InfluxDB/ES/Kafka" }
    Vis["Visualization<br/>Grafana/Kibana"]
    Alert@{ shape: hex, label: "Alerting<br/>Alertmanager" }
    Disc --> Coll
    Coll --> Stor
    Stor --> Vis
    Stor --> Alert
    classDef monitor fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    classDef storage fill:#c8e6c9,stroke:#4CAF50,color:#1B5E20
    classDef alert fill:#ffcdd2,stroke:#f44336,color:#B71C1C
    class Coll,Disc,Vis monitor
    class Stor storage
    class Alert alert

Core selection strategy: Prometheus as metrics core, ELK as logging core, Grafana as unified entry, balancing real-time capability and scalability.

Prometheus Foundation: Deep Dive into Four Core Metric Types

Prometheus’s core capability stems from precise metric type design. Different metrics correspond to different business scenarios and form the foundation of monitoring accuracy:

Four Core Data Types

Counter: Only increases, used for cumulative totals (e.g., total requests, CPU seconds)
Gauge: Can increase or decrease, used for current state (e.g., free memory, connection count)
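Histogram: Client-side bucketing, server-side aggregation, supports quantile queries estimated from bucket counts (e.g., P95/P99 latency)
Summary: Quantiles are pre-computed on the client and cannot be meaningfully aggregated across instances on the server, but they are more precise for a single instance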

Histogram vs Summary Selection Guide

mermaid
graph TD
    D1@{ shape: diam, label: "Need aggregation?" } -->|"Yes"| R1@{ shape: doc, label: "Choose Histogram" }
    D1 -->|"No"| D2@{ shape: diam, label: "Need precise quantiles?" }
    D2@{ shape: diam, label: "Need precise quantiles?" } -->|"Yes"| R2@{ shape: doc, label: "Choose Summary" }
    D2 -->|"No"| R1
    classDef process fill:#f3e5f5,stroke:#9c27b0
    classDef decision fill:#fff3e0,stroke:#ff9800
    class D1,D2 decision
    class R1,R2 process

Need multi-dimensional aggregation: Choose Histogram
Need precise quantiles, no aggregation needed: Choose Summary
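
The reason aggregation pushes the choice toward Histogram can be shown with plain numbers. The Python sketch below is a simplified illustration with made-up latencies from two hypothetical instances: cumulative bucket counts are ordinary counters and can be added, while per-instance quantiles cannot be averaged into a correct overall quantile.

```python
import math


def exact_quantile(values, q):
    """Nearest-rank quantile of raw observations."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def bucket_counts(values, bounds):
    """Cumulative counts per upper bound, in the style of `le` buckets."""
    return [sum(1 for v in values if v <= b) for b in bounds]


# Hypothetical latencies (seconds) from two instances of the same service.
fast = [0.1] * 100
slow = [0.1] * 50 + [2.0] * 50
bounds = [0.1, 0.5, 1.0, 2.5, math.inf]

# Histogram path: bucket counts from both instances simply add up.
merged = [a + b for a, b in zip(bucket_counts(fast, bounds),
                                bucket_counts(slow, bounds))]

# Summary path: each instance only exposes its own pre-computed quantile.
avg_of_p95 = (exact_quantile(fast, 0.95) + exact_quantile(slow, 0.95)) / 2
true_p95 = exact_quantile(fast + slow, 0.95)

print(merged)
print(avg_of_p95, true_p95)
```

Here the true P95 over all requests is 2.0 s, while averaging the two per-instance P95 values gives about 1.05 s, which hides the slow instance. The merged buckets still describe the combined distribution.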

PromQL in Practice: 4 Typical Monitoring Query Examples

PromQL is Prometheus’s query language and the core of monitoring visualization. Here are the most commonly used enterprise examples:

Example 1: Memory Usage Monitoring (Gauge Type)

promql
process_resident_memory_bytes{job="thanos-sidecar"}

Scenario: Monitor process memory usage
Type: Gauge (reflects current memory value in real-time)

Example 2: CPU Usage Monitoring (Counter Type)

promql
irate(process_cpu_seconds_total[2m])

Scenario: Monitor instantaneous CPU usage
Technique: Use irate to calculate the per-second growth rate of the counter from the last two samples in the window
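
To show what an instantaneous rate means for a counter, here is a simplified Python sketch. It is not Prometheus code: the helper names and the sample series are assumptions, and the averaged variant ignores the extrapolation that the real `rate` function applies at window boundaries.

```python
def instant_rate(samples):
    """Per-second increase between the last two (timestamp, value) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        raise ValueError("timestamps must increase")
    # A drop in value is treated as a counter reset: count from zero again.
    increase = v_last - v_prev if v_last >= v_prev else v_last
    return increase / (t_last - t_prev)


def average_rate(samples):
    """Per-second increase averaged over the whole window (no extrapolation)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])


# Hypothetical CPU-seconds counter scraped every 15 seconds.
cpu_seconds = [(0, 100.0), (15, 101.5), (30, 103.0), (45, 110.5)]
print(instant_rate(cpu_seconds))   # reacts to the most recent spike
print(average_rate(cpu_seconds))   # smooths the spike over the window
```

For a CPU-seconds counter, a value of 0.5 means the process used half a core during that interval. The instantaneous figure suits dashboards that should show spikes, while the averaged figure is usually steadier for alerting.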

Example 3: P95/P99 Latency Monitoring (Histogram Type)

promql
histogram_quantile(0.95, sum(irate(t_request_duration_seconds_bucket[2m])) by (le))

Scenario: API response latency percentile statistics
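Value: Shows the latency that 95% of requests stay under, exposing slow requests that an average would hide; the result is an estimate interpolated from bucket boundaries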
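
The quantile in Example 3 is estimated from bucket counts rather than measured directly. The Python sketch below illustrates the linear-interpolation idea under simplifying assumptions; the function is a stand-in written for this article, and the bucket layout is hypothetical.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least two buckets, the last one +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper = buckets[idx][0]
    lower = buckets[idx - 1][0] if idx > 0 else 0.0
    below = buckets[idx - 1][1] if idx > 0 else 0
    in_bucket = buckets[idx][1] - below
    if in_bucket == 0:
        return upper
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Hypothetical cumulative bucket counts for request latency in seconds.
latency_buckets = [(0.1, 150), (0.5, 150), (1.0, 150), (2.5, 200), (math.inf, 200)]
p95 = estimate_quantile(0.95, latency_buckets)
print(p95)
```

With these buckets the estimate is about 2.2 s, somewhere inside the 1.0 to 2.5 s bucket. The accuracy depends on how closely bucket boundaries bracket the latencies that matter, so boundaries are worth choosing around the SLA threshold.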

Example 4: Multi-Label Combined Query

promql
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m]))

Scenario: CPU idle rate statistics by node

Alert Full Flow: From Trigger to Convergence

Alerting isn’t just “sending notifications”—it’s a precise, noise-reduced, suppressible closed-loop system, divided into two major stages: Prometheus rule evaluation and Alertmanager distribution.

Alert State Transition

mermaid
stateDiagram-v2
    [*] --> inactive : Threshold not triggered
    inactive --> pending : Threshold triggered
    pending --> firing : Duration satisfied
    firing --> resolved : Threshold recovered
    resolved --> inactive : State reset
    classDef normal fill:#e8f5e9,stroke:#4caf50
    classDef warning fill:#fff3e0,stroke:#ff9800
    classDef critical fill:#ffebee,stroke:#f44336
    class inactive normal
    class pending warning
    class firing critical
    class resolved normal

inactive: Normal state
pending: Threshold triggered, waiting for duration verification
firing: Duration satisfied, formal alert triggered
resolved: Threshold recovered; a recovery notification can be sent before the state resets to inactive
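
The transitions above can be sketched as a small state machine. This Python version is an illustration only: the function name and the evaluation timeline are assumptions, and it adds one transition the diagram does not draw, namely that a pending alert drops back to inactive if the condition clears before the duration is met.

```python
from enum import Enum


class AlertState(Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    FIRING = "firing"
    RESOLVED = "resolved"


def next_state(state, pending_since, condition_met, now, for_seconds):
    """Return (new_state, pending_since) after one rule evaluation."""
    if condition_met:
        if state in (AlertState.INACTIVE, AlertState.RESOLVED):
            return AlertState.PENDING, now
        if state is AlertState.PENDING and now - pending_since >= for_seconds:
            return AlertState.FIRING, pending_since
        return state, pending_since
    if state is AlertState.FIRING:
        return AlertState.RESOLVED, None
    # Pending alerts whose condition clears go straight back to inactive.
    return AlertState.INACTIVE, None


# Hypothetical timeline: one evaluation every 60 seconds, `for` of 120 seconds.
state, since = AlertState.INACTIVE, None
history = []
for tick, met in enumerate([False, True, True, True, False, False]):
    state, since = next_state(state, since, met, now=tick * 60, for_seconds=120)
    history.append(state.value)
print(history)
```

The alert only fires on the third consecutive evaluation where the threshold is exceeded, which is how the duration check filters out short spikes.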

Alertmanager Core Strategies

Grouping: Merge and send similar alerts together, avoid bombing
Inhibition: Higher-level alerts suppress lower-level alerts (e.g., node down inhibits process alerts)
Silence: Temporarily mute specific alerts, suitable for release/maintenance scenarios
Retry: Auto-retry on alert notification failure

Alert Rule Configuration Example

yaml
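groups:
- name: nodehealth
  rules:
  - alert: CPUIdleTooLow
    # Fraction of CPU time spent idle over the last 2 minutes, averaged per host.
    # Assumes a `hostname` label is attached to the series (e.g., via relabeling).
    expr: avg by (hostname)(rate(node_cpu_seconds_total{mode="idle"}[2m])) < 0.05
    # The condition must hold for 2 minutes before pending becomes firing.
    for: 2m
    labels:
      level: High
    annotations:
      # $value is a ratio between 0 and 1, so format it as a percentage.
      description: "{{ $labels.hostname }} CPU idle rate is only {{ $value | humanizePercentage }}"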

Enterprise Monitoring System Construction Approach

From “full monitoring coverage” to “precise and effective monitoring,” enterprises need to follow a more → less → more construction path:

Standardization: Unify log format, metric naming, SLA conventions
Normalization: Unified collection, unified labels, unified storage for multiple data sources
Noise Reduction & Aggregation: Alert grouping, inhibition, abstraction, reduce ineffective notifications
Full-Chain Tracing: Use TraceID/RequestID to connect the full process, quickly locate root causes
Quantitative Operations: Measure system health with metrics like slow request ratio, response time, success rate
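
As a rough illustration of quantitative operations, the Python sketch below derives the three health figures named above from raw request records. The function name, the 1-second slow threshold, and the rule that any status below 500 counts as success are all assumptions for the example.

```python
def health_metrics(requests, slow_threshold_s=1.0):
    """Summarize (duration_seconds, http_status) records into health ratios."""
    requests = list(requests)
    if not requests:
        raise ValueError("no requests to summarize")
    total = len(requests)
    slow = sum(1 for duration, _ in requests if duration > slow_threshold_s)
    succeeded = sum(1 for _, status in requests if status < 500)
    return {
        "slow_request_ratio": slow / total,
        "success_rate": succeeded / total,
        "avg_response_time_s": sum(duration for duration, _ in requests) / total,
    }


# Hypothetical request log: (duration in seconds, HTTP status).
sample = [(0.2, 200), (0.4, 200), (1.6, 200), (0.2, 500)]
report = health_metrics(sample)
print(report)
```

In a Prometheus setup these ratios would normally come from counters and histograms rather than raw logs, but the definitions stay the same, and agreeing on them is part of the standardization step.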

Summary

The monitoring system is the foundation of SRE work. The cloud-native monitoring stack centered on Prometheus covers the core needs of massive metric collection, multi-dimensional aggregation, and precise alerting. From SRE principles to monitoring system construction, the essence is using data to drive stability, making monitoring genuinely support continuous improvement in service availability.

Part of series: Observability Series
