Monitoring System Enterprise Architecture Evolution — First Steps with Prometheus

December 12, 2019 | Architecture | Prometheus, Monitoring, Architecture Evolution | Observability Series


Prometheus is an open-source monitoring and time series database system that has gained widespread adoption in recent years. The official architecture diagram is shown below:

Prometheus Official Architecture

This series of articles follows the evolution of a Prometheus deployment architecture in an enterprise, introducing each component and concept in more depth along the way. First, the diagram below gives a quick overview of that evolution:

Single Node Architecture

When first getting started with Prometheus, you only need to deploy the Prometheus binary on a server and use file-based service discovery (file_sd_configs) to scrape metrics from node_exporter for basic host monitoring. Then add the Prometheus URL as a data source in Grafana to start viewing monitoring data.
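
As a minimal sketch of file-based service discovery, the helpers below build and write a target file in the JSON layout that file_sd_configs reads (a list of groups, each with `targets` and optional `labels`). The host names, the `env` label, and the function names are assumptions for illustration; 9100 is node_exporter's default port.

```python
import json
from pathlib import Path


def build_file_sd_targets(hosts, port=9100, labels=None):
    """Return a file_sd target-group list for the given hosts."""
    return [
        {
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": dict(labels or {}),
        }
    ]


def write_targets(path, groups):
    """Write to a temporary file first, then rename it into place,
    so Prometheus never reads a half-written file."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(groups, indent=2))
    tmp.replace(path)
```

Calling `write_targets("node_targets.json", build_file_sd_targets(["web01", "web02"], labels={"env": "prod"}))` produces a file that a scrape job can reference from the `files` list of its file_sd_configs entry; Prometheus re-reads the file when it changes, so no restart is needed.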

mermaid
graph TB
    NE@{ shape: rounded, label: "node_exporter" } -->|metrics| P@{ shape: rounded, label: "Prometheus" }
    P -->|data| G@{ shape: rounded, label: "Grafana" }

    classDef src fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    classDef proc fill:#fff3e0,stroke:#FF9800,color:#BF360C
    classDef out fill:#c8e6c9,stroke:#4CAF50,color:#1B5E20
    class NE src
    class P proc
    class G out

Prometheus’s data model consists mainly of Metric, Label, and Sample.

Metric:

Represents a time series metric, corresponding to a monitoring metric name. E.g., cpu_usage, free_memory, etc.
Metric only contains the time series data name, with no predefined structure or type. This gives Prometheus high flexibility.
A single Prometheus instance can contain any number of Metrics.

Label:

Used to describe and distinguish data with the same Metric. Similar to Tags in other time series databases.
Labels typically represent dimensions or attributes of data, such as instance, job, region, etc.
A time series is uniquely identified by its Metric name together with its set of Labels, so every sample in a series carries the same Labels. Labels are used for fast querying and aggregation along specific dimensions.
Label values are always strings. Data can be filtered and grouped by Labels.

Sample:

Represents a single time series data point, containing Timestamp, Value, and Label set.
Timestamp indicates the time of the sample, with millisecond precision. It is used for sorting and querying data within a given time range.
Value represents the metric value and is always a 64-bit float.
Label set identifies the attributes and dimensions of the sample data. Samples with the same Metric and Labels are successive data points of the same time series.

A Prometheus Sample contains:

Metric - Time series metric name
Timestamp - Timestamp, millisecond precision
Value - Metric value
Label - Data attribute set

For example:

###################################################
Metric: cpu_usage
Timestamp: 1577836800000
Value: 0.6
Label: instance="web01", job="webapp"
###################################################
Metric: free_memory
Timestamp: 1577836800000
Value: 20971520
Label: instance="web01", job="webapp"
###################################################
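
To make the data model concrete, here is a small sketch of a sample and of how samples group into time series. The class and function names are hypothetical and this is not how Prometheus stores data internally; it only illustrates that the metric name plus the label set identifies a series.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    metric: str
    labels: tuple      # sorted (name, value) pairs; label values are strings
    timestamp_ms: int  # milliseconds since the Unix epoch
    value: float       # sample values are 64-bit floats

    def series_key(self):
        """Metric name plus label set identifies the time series."""
        return (self.metric, self.labels)


def make_sample(metric, labels, timestamp_ms, value):
    """Normalize the label order so equal label sets compare equal."""
    return Sample(metric, tuple(sorted(labels.items())), timestamp_ms, float(value))


def group_by_series(samples):
    """Collect (timestamp, value) points under their series key."""
    series = {}
    for sample in samples:
        series.setdefault(sample.series_key(), []).append(
            (sample.timestamp_ms, sample.value)
        )
    return series
```

Two `cpu_usage` samples for `instance="web01", job="webapp"` taken at different times land in the same series, while the `free_memory` sample forms a separate one.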

Below is a sample of scraped metrics in the Prometheus text exposition format:

# HELP thanos_grpc_req_panics_recovered_total Total number of gRPC requests recovered from internal panic.
# TYPE thanos_grpc_req_panics_recovered_total counter
thanos_grpc_req_panics_recovered_total 0

# HELP thanos_objstore_bucket_last_successful_upload_time Second timestamp of the last successful upload to the bucket.
# TYPE thanos_objstore_bucket_last_successful_upload_time gauge
thanos_objstore_bucket_last_successful_upload_time{bucket="thanos"} 1.591146025002795e+09

# HELP thanos_objstore_bucket_operation_duration_seconds Duration of successful operations against the bucket
# TYPE thanos_objstore_bucket_operation_duration_seconds histogram
thanos_objstore_bucket_operation_duration_seconds_bucket{bucket="thanos",operation="delete",le="0.001"} 0
thanos_objstore_bucket_operation_duration_seconds_bucket{bucket="thanos",operation="delete",le="0.01"} 0
thanos_objstore_bucket_operation_duration_seconds_bucket{bucket="thanos",operation="delete",le="0.1"} 0
thanos_objstore_bucket_operation_duration_seconds_bucket{bucket="thanos",operation="delete",le="0.3"} 0
thanos_objstore_bucket_operation_duration_seconds_bucket{bucket="thanos",operation="delete",le="0.6"} 0
thanos_objstore_bucket_operation_duration_seconds_bucket{bucket="thanos",operation="delete",le="1"} 0
thanos_objstore_bucket_operation_duration_seconds_bucket{bucket="thanos",operation="delete",le="3"} 0
thanos_objstore_bucket_operation_duration_seconds_bucket{bucket="thanos",operation="delete",le="6"} 0
thanos_objstore_bucket_operation_duration_seconds_bucket{bucket="thanos",operation="delete",le="9"} 0
thanos_objstore_bucket_operation_duration_seconds_bucket{bucket="thanos",operation="delete",le="20"} 0
thanos_objstore_bucket_operation_duration_seconds_bucket{bucket="thanos",operation="delete",le="30"} 0
thanos_objstore_bucket_operation_duration_seconds_bucket{bucket="thanos",operation="delete",le="60"} 0
thanos_objstore_bucket_operation_duration_seconds_bucket{bucket="thanos",operation="delete",le="90"} 0
thanos_objstore_bucket_operation_duration_seconds_bucket{bucket="thanos",operation="delete",le="120"} 0
thanos_objstore_bucket_operation_duration_seconds_bucket{bucket="thanos",operation="delete",le="+Inf"} 0
thanos_objstore_bucket_operation_duration_seconds_sum{bucket="thanos",operation="delete"} 0
thanos_objstore_bucket_operation_duration_seconds_count{bucket="thanos",operation="delete"} 0
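
The sample lines above follow the pattern `name{labels} value`. As a rough sketch of how such text maps back to the data model, the parser below handles that simple case. It is an illustration only: the function name is hypothetical, and it does not handle every corner of the real format (for instance a `}` inside a label value).

```python
import re

# name, optional {labels}, value, optional integer timestamp
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+(-?\d+))?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Parse sample lines into a list of (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            raise ValueError(f"unparseable line: {line!r}")
        name, raw_labels, value, _timestamp = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Note that a histogram is exposed as several plain series (`_bucket` with an `le` label, `_sum`, `_count`), each of which parses like any other sample.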

Prometheus defines four main metric types:

Gauge - Suitable for metrics that can arbitrarily change, such as temperature, pressure.
Counter - A monotonically increasing metric, suitable for recording request counts, task completions, error counts, etc.
Histogram - Typically used for recording request duration, response size, etc. It records value distribution by configuring buckets. Can be used to calculate percentiles, averages, etc.
Summary - Similar to Histogram, but quantiles are computed on the client side and exposed directly. This makes them cheap to query, but they cannot be meaningfully aggregated across instances.

These four metric types can represent different kinds of metrics:

Gauge: Instantaneous value
Counter: Cumulative total
Histogram/Summary: Statistical distribution
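
To show how percentiles can be derived from histogram buckets, here is a simplified sketch of the linear-interpolation idea behind PromQL's histogram_quantile(). It is not the Prometheus implementation; the function below assumes cumulative buckets sorted by upper bound, ending with a +Inf bucket, and 0 < q <= 1.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf: fall back to the highest
                # finite bucket boundary.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

Because the result is interpolated inside one bucket, its accuracy depends on how well the configured bucket boundaries match the real value distribution.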

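Choose the type that matches what is being measured. The type is declared in the client library and in the # TYPE line of the exposition format; the Prometheus server itself stores every series as plain samples. The type mainly determines which PromQL functions make sense for a metric, for example rate() for a Counter and histogram_quantile() for a Histogram.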

Alert Configuration

When alerting is needed, you must deploy the Alertmanager component and configure Alertmanager information in Prometheus. Alert rules need to be configured in Prometheus for periodic evaluation, sending alerts that meet thresholds to the Alertmanager component for processing.

Here’s a brief explanation of Prometheus’s alert mechanism and its relationship with Alertmanager:

Prometheus alert rules: The Prometheus server defines alert rules, each based on a PromQL expression. Every time series that the expression returns at evaluation time becomes an active alert.
Prometheus alert processing: Prometheus tracks the state of these alerts and shows them on its Alerts page. It does not deliver notifications itself; it pushes firing alerts to Alertmanager over HTTP.
Alertmanager overview: Alertmanager is an independent alert management component. It supports alert deduplication, grouping, routing, and sending.
Alertmanager alert processing: After receiving alerts, Alertmanager groups them according to the configured grouping rules, then sends them to the matching receivers (such as email or Slack) based on routing rules. It also handles alert deduplication.
Prometheus and Alertmanager integration: Configure the Alertmanager address in Prometheus so that alerts generated by Prometheus are sent to Alertmanager automatically.

In summary, Prometheus is responsible for alert detection and generation, while Alertmanager focuses on subsequent alert distribution, processing, and notification. Together they form a complete monitoring-to-alerting pipeline.

mermaid
graph TB
    NE@{ shape: rounded, label: "node_exporter" } -->|metrics| P@{ shape: rounded, label: "Prometheus" }
    P -->|data| G@{ shape: rounded, label: "Grafana" }
    P -->|alerts| AM@{ shape: rounded, label: "Alertmanager" }

    classDef src fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    classDef proc fill:#fff3e0,stroke:#FF9800,color:#BF360C
    classDef out fill:#c8e6c9,stroke:#4CAF50,color:#1B5E20
    classDef spec fill:#f3e5f5,stroke:#9C27B0,color:#4A148C
    class NE src
    class P proc
    class G out
    class AM spec

An alert is always in one of three states:

inactive: Threshold not triggered
pending: Threshold triggered but alert duration not yet met
firing: Threshold triggered and alert duration satisfied

Let’s walk through a simple example. Below is a MySQL alert configuration. Metrics are scraped every 5s and the rule is evaluated every 10s. The alert condition is met when uptime is below 30s.

yaml
groups:
- name: example
  rules:
  # Condition: MySQL has been up for less than 30 seconds
  - alert: mysql_uptime
    expr: mysql:server_status:uptime < 30
    # The condition must hold for 10s before the alert fires
    for: 10s
    labels:
      level: "CRITICAL"
    annotations:
      detail: Database uptime

As shown in the diagram:

When mysql:server_status:uptime >= 30, the alert status is inactive
When mysql:server_status:uptime < 30 and the condition has held for less than 10s, the alert status is pending
When mysql:server_status:uptime < 30 and the condition has held for at least 10s, the alert status is firing

⚠ Note: The for keyword in the configuration is used to set the alert duration; if for is not set or set to 0, the pending state will be skipped directly.
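
The state transitions above can be sketched as a small state machine. This is a simplified model rather than Prometheus's actual rule evaluator; the function name and the input shape (a list of evaluation timestamps in seconds with a flag saying whether the expression matched) are assumptions for illustration.

```python
def alert_states(evaluations, for_seconds):
    """Return (timestamp, state) for each (timestamp, condition_met) pair."""
    active_since = None  # time of the first evaluation where the condition held
    states = []
    for ts, condition_met in evaluations:
        if not condition_met:
            active_since = None
            state = "inactive"
        else:
            if active_since is None:
                active_since = ts
            # Fire only once the condition has held for the full duration.
            held_for = ts - active_since
            state = "firing" if held_for >= for_seconds else "pending"
        states.append((ts, state))
    return states
```

With a 10s evaluation interval and `for_seconds=10`, the first matching evaluation yields pending and the next one yields firing. With `for_seconds=0`, the first matching evaluation goes straight to firing, which matches the note above.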

Internal Alert Logic

The internal mechanism can be summarized in the following steps:

Alert reception: Alertmanager receives alerts from Prometheus and other clients and keeps them for processing.
Alert grouping: Similar or related alerts are aggregated into alert groups, as defined by the grouping rules in the configuration file.
Alert deduplication: Duplicate alerts are deduplicated to prevent repeated sending.
Alert routing: Based on alert matching rules, Alertmanager determines which receiver each alert group goes to, for example email, Slack, or a Webhook.
Silencing and inhibition: Alerts that match an active silence, or that are suppressed by an inhibition rule while another alert is firing, do not produce notifications.
Alert sending: Finally, Alertmanager sends the processed alerts to recipients through the configured receivers, such as email or a Webhook that bridges to SMS or phone systems.
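
The grouping and deduplication steps can be illustrated with a short sketch. It is a toy model, not Alertmanager's code: alerts are plain dictionaries with a `labels` mapping, the function names are hypothetical, and `group_by` plays the role of the grouping labels from the configuration.

```python
from collections import defaultdict


def fingerprint(alert):
    """Identify an alert by its full label set."""
    return tuple(sorted(alert["labels"].items()))


def group_alerts(alerts, group_by):
    """Bucket alerts by their group_by label values, keeping one alert
    per fingerprint inside each group."""
    groups = defaultdict(dict)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        # A repeated fingerprint overwrites the earlier copy (deduplication).
        groups[key][fingerprint(alert)] = alert
    return {key: list(members.values()) for key, members in groups.items()}
```

Grouping by `alertname`, for example, turns many per-instance alerts of the same kind into one group, so a receiver gets a single notification listing the affected instances.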

Federation Deployment

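Once the environment reaches a certain scale, multiple Prometheus instances are needed to share the collection and computation load. Federation can be used to extend the architecture: a federation node scrapes selected series from the per-region Prometheus instances through their /federate endpoint.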

mermaid
graph TB
    FP@{ shape: rounded, label: "Prometheus federation node" }
    P1@{ shape: rounded, label: "Region A<br/>Prometheus + node_exporter" }
    P2@{ shape: rounded, label: "Region B<br/>Prometheus + node_exporter" }
    FP -->|federation pull| P1
    FP -->|federation pull| P2

    classDef src fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    classDef proc fill:#fff3e0,stroke:#FF9800,color:#BF360C
    class FP proc
    class P1,P2 src
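
A federation pull is an ordinary scrape of the /federate endpoint, with one `match[]` parameter per series selector. The sketch below only builds such a URL; the host name and selectors are assumptions for illustration.

```python
from urllib.parse import urlencode


def federate_url(base_url, matchers):
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical regional instance and selectors.
url = federate_url("http://prom-region-a:9090", ['{job="node"}', '{__name__=~"job:.*"}'])
```

Selecting only aggregated recording-rule series (such as the `job:.*` pattern here) rather than every raw series keeps the load on the federation node manageable.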

Distributed Architecture Prototype

As the business expands, the monitoring system grows with it. For both manageability and stability, it becomes necessary to move from single-node instances to a clustered architecture. Two common open-source options are:

Cortex, backed by Grafana Labs and the CNCF community
Thanos, an open-source community project

Here, the Thanos distributed solution was chosen, while also introducing HashiCorp’s Consul to replace file-based service discovery.

Data Processing Logical Architecture

Consul holds the list of exporter targets that need monitoring. Operations teams can use scripts to periodically synchronize infrastructure and business component information from CMDB and CI/CD systems into Consul, so that Prometheus discovers and scrapes new targets dynamically. In addition, the Thanos Sidecar component uploads TSDB blocks to object storage for backup.
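
As a rough sketch of the synchronization script, the function below maps one CMDB-style record to a Consul service definition. The record fields (`exporter`, `host`, `port`, `tags`, `region`) are hypothetical; the resulting dictionary uses the field names of Consul's agent service registration API (`PUT /v1/agent/service/register`), and Prometheus can then find the service through consul_sd_configs.

```python
def consul_service_payload(record):
    """Map a CMDB-style record to a Consul agent service definition."""
    return {
        # The ID must be unique per agent; name plus host is an assumption.
        "ID": f"{record['exporter']}-{record['host']}",
        "Name": record["exporter"],
        "Address": record["host"],
        "Port": record["port"],
        "Tags": sorted(record.get("tags", [])),
        # Meta values are strings; they can be turned into labels
        # through relabeling on the Prometheus side.
        "Meta": {"region": record.get("region", "")},
    }
```

A periodic job would send each payload as JSON to the local Consul agent and deregister services that no longer appear in the CMDB.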

Data Query Logical Architecture

Frontend queries go through the Thanos Query component, which provides a unified query entry point for the entire distributed cluster and has built-in deduplication and merging of results from multiple sources. The Thanos Compact component compacts historical TSDB blocks in object storage and downsamples them to speed up long-range queries. Thanos Store exposes the data in object storage through the Store API, which reduces the query load on Prometheus.

mermaid
graph TB
    G@{ shape: rounded, label: "Grafana" } -->|query| TQ@{ shape: rounded, label: "Thanos Query<br/>Unified Query Entry" }
    TQ -->|gRPC| PS@{ shape: rounded, label: "Prometheus + Thanos Sidecar" }
    TQ -->|gRPC| TS@{ shape: rounded, label: "Thanos Store" }
    TS -->|read| O@{shape: cyl, label: "S3 Object Storage<br/>One set per region" }
    PS -->|sync TSDB BLOCK| O
    TC@{ shape: rounded, label: "Thanos Compact<br/>Data Aggregation and Downsampling" } -->|compact| O

    classDef src fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    classDef proc fill:#fff3e0,stroke:#FF9800,color:#BF360C
    classDef store fill:#c8e6c9,stroke:#4CAF50,color:#1B5E20
    class G src
    class TQ,PS,TS,TC proc
    class O store
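
To give an intuition for query-time deduplication, the sketch below merges series that differ only in a replica label, as happens when two Prometheus replicas scrape the same targets. It is a naive union and not Thanos's actual algorithm; the label name `replica` is an assumption (in Thanos Query it is configured with the `--query.replica-label` flag).

```python
def deduplicate(series_list, replica_label="replica"):
    """Merge series that are identical apart from the replica label.

    series_list holds (labels, samples) pairs, where samples is a list
    of (timestamp, value) tuples.
    """
    merged = {}
    for labels, samples in series_list:
        key = tuple(sorted(
            (name, value) for name, value in labels.items()
            if name != replica_label
        ))
        points = merged.setdefault(key, {})
        for ts, value in samples:
            # The first replica seen wins for a given timestamp.
            points.setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}
```

The benefit is that a gap in one replica (for example during a restart) is filled by the other, so dashboards see one continuous series.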

Part of series: Observability Series
