Monitoring System Enterprise Architecture Evolution — Cross-Region Hybrid Cloud

Recap

In “Monitoring System Enterprise Architecture Evolution — First Steps with Prometheus”, the monitoring system was upgraded from a single-node architecture to a distributed architecture within a single IDC. That article applies to both VM-based and container-based deployments. Prometheus is a product of the cloud-native era and is commonly used alongside Kubernetes, but it can also replace traditional monitoring solutions such as Zabbix in non-Kubernetes environments. In this article, we move the deployment onto Kubernetes and upgrade the overall monitoring architecture so that it adapts more flexibly to cross-region hybrid cloud business scenarios.

Architecture Design

Three-Layer Cross-Region Structure Design

The design uses a three-layer regional structure (Region, Zone, and Cluster/VPC) and standardizes regional naming through labels, so that the geographic location of a service can be identified quickly. In the third layer, Cluster and VPC sit at the same level; each represents services isolated within a cluster or within a specific network segment.

mermaid
graph TB
    subgraph Region Layer
        TQR@{ shape: rounded, label: "Thanos Query Region" }
    end
    subgraph Zone Layer
        TQZ1@{ shape: rounded, label: "Thanos Query Zone A" }
        TQZ2@{ shape: rounded, label: "Thanos Query Zone B" }
    end
    subgraph Cluster / VPC
        P1@{ shape: rounded, label: "Prometheus" }
        P2@{ shape: rounded, label: "Prometheus" }
        P3@{ shape: rounded, label: "Prometheus" }
        P4@{ shape: rounded, label: "Prometheus" }
    end
    TQR --> TQZ1
    TQR --> TQZ2
    TQZ1 --> P1
    TQZ1 --> P2
    TQZ2 --> P3
    TQZ2 --> P4
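
The label standardization described above can be sketched as a small helper that builds the location labels for one Prometheus instance. This is only an illustration: the label keys `region`, `zone`, `cluster`, and `vpc`, the function name `build_external_labels`, and the rule that exactly one of cluster or VPC is set are assumptions, not names taken from the article.

```python
from typing import Dict, Optional


def build_external_labels(region: str, zone: str,
                          cluster: Optional[str] = None,
                          vpc: Optional[str] = None) -> Dict[str, str]:
    """Return the location labels for one Prometheus instance.

    Cluster and VPC share the third layer, so this sketch assumes that
    exactly one of them is set (an assumption, not stated in the article).
    """
    if (cluster is None) == (vpc is None):
        raise ValueError("set exactly one of cluster or vpc")
    labels = {"region": region, "zone": zone}
    if cluster is not None:
        labels["cluster"] = cluster
    else:
        labels["vpc"] = vpc
    return labels


# Example values are hypothetical.
k8s_labels = build_external_labels("cn-east", "cn-east-a", cluster="k8s-prod-01")
vm_labels = build_external_labels("cn-east", "cn-east-b", vpc="vpc-legacy")
```

Labels built this way would typically be attached to each Prometheus instance as external labels, so that every series carries its region, zone, and cluster or VPC when it is queried from a higher layer.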

Using Thanos Query for the Initial Layered Architecture

Thanos components communicate over gRPC, and Thanos Query can aggregate the results of other queriers. Data is therefore aggregated layer by layer up to the top-level Thanos Query component, which merges and computes the final time series result set for display in the frontend.
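
The layer-by-layer aggregation can be illustrated with a toy model. The sketch below is not Thanos code and does not use gRPC: `Leaf` stands in for a Prometheus instance, `Querier` stands in for a Thanos Query node, and both class names, the `select` method, and the sample data are hypothetical. It only shows the fan-out and merge idea, with exact-duplicate samples dropped during the merge.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

Labels = Tuple[Tuple[str, str], ...]   # sorted label pairs identify one series
Samples = List[Tuple[int, float]]      # (timestamp, value) pairs


def series(**labels: str) -> Labels:
    """Build a hashable series identity from keyword labels."""
    return tuple(sorted(labels.items()))


@dataclass
class Leaf:
    """Stands in for one Prometheus instance and the samples it holds."""
    data: Dict[Labels, Samples]

    def select(self) -> Dict[Labels, Samples]:
        return self.data


@dataclass
class Querier:
    """Stands in for a query node that fans out to its children."""
    children: List[Union["Querier", Leaf]] = field(default_factory=list)

    def select(self) -> Dict[Labels, Samples]:
        merged: Dict[Labels, Samples] = {}
        for child in self.children:
            for labels, samples in child.select().items():
                merged.setdefault(labels, []).extend(samples)
        # Keep each series time-ordered and drop exact duplicate samples.
        return {labels: sorted(set(samples)) for labels, samples in merged.items()}


p1 = Leaf({series(job="api", cluster="c1"): [(0, 1.0)]})
p2 = Leaf({series(job="api", cluster="c2"): [(0, 2.0)]})
p3 = Leaf({series(job="api", cluster="c3"): [(0, 3.0)]})
p4 = Leaf({series(job="api", cluster="c4"): [(0, 4.0)]})

zone_a = Querier([p1, p2])
zone_b = Querier([p3, p4])
region = Querier([zone_a, zone_b])

result = region.select()
total = sum(value for samples in result.values() for _, value in samples)
```

Querying `region` returns the four series from all clusters, so a caller at the top layer sees one merged result set without knowing how many zones or Prometheus instances sit underneath.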

Introducing Thanos Query Frontend for a Unified Frontend Query Entry

The Thanos Query Frontend component has the following configurable capabilities for query optimization, which should be adjusted based on actual conditions:

  • Time-range splitting. Take a query over 15 days of data: because the sample volume is large, reading all the raw data into memory at once can cause OOM. Splitting the 15-day aggregation query into 6-hour chunks turns it into 4 × 15 = 60 sub-queries that the Thanos Query component can execute concurrently. The result sets for the individual time windows are then merged, and memory is released as soon as each sub-query completes, which makes better use of resources.

  • Query result caching. Result sets are cached in memory or in Redis under a key hashed from the query statement and time range, so that repeated queries reuse them and upstream pressure drops.
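
The two optimizations above can be sketched together in a few lines. This is a simplified illustration under stated assumptions, not how Thanos Query Frontend is implemented: the function names `split_range`, `cache_key`, and `run_split` are hypothetical, the cache is a plain dictionary rather than Redis, the key format is made up, and the sub-queries run sequentially rather than concurrently.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Dict, List, Tuple

Window = Tuple[datetime, datetime]


def split_range(start: datetime, end: datetime, step: timedelta) -> List[Window]:
    """Cut [start, end) into consecutive windows of at most `step`."""
    windows = []
    cursor = start
    while cursor < end:
        nxt = min(cursor + step, end)
        windows.append((cursor, nxt))
        cursor = nxt
    return windows


def cache_key(query: str, window: Window) -> str:
    """Hash the query text and time window into a cache key (assumed format)."""
    raw = f"{query}|{window[0].isoformat()}|{window[1].isoformat()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def run_split(query: str, start: datetime, end: datetime, step: timedelta,
              execute: Callable[[str, Window], Any],
              cache: Dict[str, Any]) -> List[Any]:
    """Run one sub-query per window, reusing cached results where present."""
    results = []
    for window in split_range(start, end, step):
        key = cache_key(query, window)
        if key not in cache:
            cache[key] = execute(query, window)
        results.append(cache[key])
    return results


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = start + timedelta(days=15)
windows = split_range(start, end, timedelta(hours=6))
```

With a 15-day range and a 6-hour step, `windows` holds 60 entries, matching the 4 × 15 figure above. Calling `run_split` a second time with the same query and range executes nothing upstream, because every window is already in the cache.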

Leveraging Kubernetes for More Elastic Redundancy

Self-Developed Architecture Components

Built on native open-source projects, the architecture now largely provides cross-region hybrid cloud capabilities. Day-to-day enterprise management needs more, though: a management layer and frontend capabilities are required before it can be considered an enterprise-grade service.

Basic Design Logic

To make the entire architecture flexible and versatile, several components were designed:

  • Self-research service discovery: interfaces with third-party systems such as CMDB and CI/CD to collect business system and asset information, computes the relationships between business systems and infrastructure, and dispatches resource information to the appropriate P-sidcar component based on geographic information.
  • P-sidcar: manages Prometheus at the edge, receives information about nearby collectors from Self-research service discovery, and exposes an http_sd endpoint through which Prometheus discovers exporters, with fine-grained label injection.
  • msg route agent: integrates with Feishu, DingTalk, and other messaging services, and synchronizes alert owners from Conf/Rule Sync so that notifications reach the right people in the last mile.
  • A-sidcar: manages the Alertmanager cluster configuration and synchronizes suppression policies in near real time for more precise alert management.
  • Conf/Rule Sync: connects to the edge components and synchronizes status information and backend management policies in near real time.
mermaid
graph TB
    subgraph Monitoring Platform
        subgraph Management Layer
            SD@{ shape: rounded, label: "Self-research Service Discovery" }
            CR@{ shape: doc, label: "Conf/Rule Sync" }
            MR@{ shape: rounded, label: "Message Route Agent" }
        end
        subgraph Edge Layer
            PS@{ shape: rounded, label: "P-sidcar Edge Collection Management" }
            AS@{ shape: rounded, label: "A-sidcar Alert Management" }
        end
    end
    CMDB@{ shape: hex, label: "CMDB/CICD External Systems" } -->|Resource Sync| SD
    SD -->|Resource Info| PS
    CR -->|Collection Policy| PS
    CR -->|Alert Policy| AS
    CR -->|Responsible Person Info| MR
    SD -->|Alert Responsible Person| MR
    PS -->|http_sd| P@{ shape: rounded, label: "Prometheus" }
    AS -->|Config Management| AM@{ shape: rounded, label: "Alertmanager" }
    MR -->|Push| IM@{ shape: double-circle, label: "Feishu/DingTalk etc." }
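
To make the http_sd step concrete, the sketch below builds the JSON body that Prometheus HTTP service discovery expects: a list of target groups, each with a `targets` list of `host:port` strings and a `labels` map of string values. The function name `to_http_sd`, the shape of the `collectors` input, and the example label values are assumptions for illustration; the article does not describe how P-sidcar is implemented.

```python
import json
from typing import Dict, List


def to_http_sd(collectors: List[Dict], injected: Dict[str, str]) -> str:
    """Render collectors as an HTTP SD response body with injected labels.

    `collectors` is a hypothetical input format: each item has a host, a
    port, and optional per-collector labels that override injected ones.
    """
    groups = []
    for collector in collectors:
        labels = {key: str(value) for key, value in injected.items()}
        for key, value in collector.get("labels", {}).items():
            labels[key] = str(value)  # HTTP SD label values must be strings
        groups.append({
            "targets": [f"{collector['host']}:{collector['port']}"],
            "labels": labels,
        })
    return json.dumps(groups, sort_keys=True)


body = to_http_sd(
    collectors=[
        {"host": "10.0.1.12", "port": 9100, "labels": {"service": "order-api"}},
        {"host": "10.0.1.13", "port": 9100},
    ],
    injected={"region": "cn-east", "zone": "cn-east-a", "cluster": "k8s-prod-01"},
)
```

A sidecar would serve this body over HTTP with content type `application/json`, and Prometheus would poll it through an `http_sd_configs` entry. Injecting the location labels at this point means every discovered target carries them before any sample is scraped.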

Advanced Extensions

The base design aims to stay lean without losing flexibility. On top of it, enterprise capabilities are added step by step through self-developed frontend services, middleware, and edge components.

  • The service discovery component focuses on integrating with third-party systems: not only CMDB and CI/CD systems, but also work order (ticketing) systems and job systems.
  • The alert component gradually evolves into a unified alert platform that works with the service discovery component to support more advanced, dynamic alert dispatching.
  • The configuration sync component and Grafana gradually converge into a single frontend system that integrates management and display.

The user-side logical view of the entire system is shown below:

At this stage, the platform has become fairly complex internally, so the view presented to users has to stay simple.

Unified Alert System

Once a monitoring platform reaches a certain stage, alert storms inevitably start to trouble the technical support departments across the enterprise, and alert convergence becomes a priority. At this point, the alert component evolves from simply pushing alerts to the right people into a system with progressively richer surrounding capabilities.

mermaid
graph LR
    subgraph Alert Configuration and Management
        AC@{ shape: rounded, label: "Alert Configuration" } -->|Rules| AG@{ shape: diam, label: "Alert Generation" }
        SC@{ shape: rounded, label: "Alert Silencing" }
        IN@{ shape: rounded, label: "Alert Inhibition" }
        PM@{ shape: rounded, label: "Responsible Person Management" }
        DM@{ shape: rounded, label: "On-duty Management" }
    end
    AG --> AR@{ shape: diam, label: "Alert Reception" }
    AR --> GP@{ shape: diam, label: "Alert Grouping" }
    GP --> DQ@{ shape: diam, label: "Alert Deduplication" }
    DQ --> RT@{ shape: diam, label: "Alert Routing" }
    SC --> RT
    IN --> RT
    PM --> RT
    DM --> RT
    RT --> NT@{ shape: diam, label: "Alert Notification" }
    NT --> EML@{ shape: doc, label: "Email" }
    NT --> FS@{ shape: rounded, label: "Feishu" }
    NT --> DT@{ shape: rounded, label: "DingTalk" }
    NT --> SMS@{ shape: doc, label: "SMS" }
    NT --> PHN@{ shape: doc, label: "Phone" }
    NT --> WH@{ shape: doc, label: "Webhook" }
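
The middle of this pipeline (grouping, deduplication, then routing with silences and owner lookup) can be sketched as follows. This is a minimal illustration, not the platform's implementation and not Alertmanager code: all function names, the label keys, the `on-duty` fallback, and the sample data are assumptions, and inhibition is left out for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Alert = Dict[str, str]
GroupKey = Tuple[str, ...]


def group_alerts(alerts: List[Alert], by: Sequence[str]) -> Dict[GroupKey, List[Alert]]:
    """Group alerts that share the same values for the `by` labels."""
    groups: Dict[GroupKey, List[Alert]] = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert.get(key, "") for key in by)].append(alert)
    return dict(groups)


def deduplicate(alerts: List[Alert]) -> List[Alert]:
    """Drop alerts whose full label set has already been seen."""
    seen = set()
    unique = []
    for alert in alerts:
        fingerprint = tuple(sorted(alert.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(alert)
    return unique


def is_silenced(alert: Alert, silences: List[Dict[str, str]]) -> bool:
    """A silence matches when all of its label pairs appear on the alert."""
    return any(all(alert.get(k) == v for k, v in s.items()) for s in silences)


def route(alerts: List[Alert], silences: List[Dict[str, str]],
          owners: Dict[str, str],
          by: Sequence[str] = ("alertname", "cluster")) -> Dict[str, List[Tuple[GroupKey, int]]]:
    """Return, per recipient, the alert groups to notify and their sizes."""
    notifications: Dict[str, List[Tuple[GroupKey, int]]] = defaultdict(list)
    for key, members in group_alerts(alerts, by).items():
        active = [a for a in deduplicate(members) if not is_silenced(a, silences)]
        if not active:
            continue
        # Fall back to the on-duty rotation when a service has no owner.
        owner = owners.get(active[0].get("service", ""), "on-duty")
        notifications[owner].append((key, len(active)))
    return dict(notifications)


alerts = [
    {"alertname": "HighCPU", "cluster": "c1", "service": "api", "instance": "a"},
    {"alertname": "HighCPU", "cluster": "c1", "service": "api", "instance": "a"},
    {"alertname": "HighCPU", "cluster": "c1", "service": "api", "instance": "b"},
    {"alertname": "DiskFull", "cluster": "c2", "service": "db", "instance": "d"},
    {"alertname": "HighCPU", "cluster": "c3", "service": "batch", "instance": "x"},
]
routed = route(alerts, silences=[{"cluster": "c3"}], owners={"api": "alice"})
```

In this example five raw alerts collapse into two notifications: one grouped message for the service owner and one for the on-duty fallback, while the silenced cluster produces nothing. That reduction is the kind of convergence the unified alert system is meant to provide.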