Monitoring System Enterprise Architecture Evolution — Cross-Region Hybrid Cloud
Recap
In “Monitoring System Enterprise Architecture Evolution — First Steps with Prometheus”, the monitoring system had already been upgraded from a single-node architecture to a single IDC distributed architecture.
The content of the previous article applies to both VM-based and container-based deployments. Prometheus is a product of the cloud-native era and is commonly used alongside Kubernetes, but Prometheus itself can also replace traditional monitoring solutions like Zabbix in non-Kubernetes environments.
In this article, we move the deployment onto Kubernetes and upgrade the overall monitoring architecture so that it adapts more flexibly to cross-region hybrid cloud business scenarios.

Architecture Design
Three-Layer Cross-Region Structure Design
The design uses a three-layer regional structure (region, zone, and cluster/VPC) and standardizes region naming through labels, so that the geographic location of a service can be identified quickly. In the third layer, Cluster and VPC sit at the same level: each represents services isolated within one cluster or one specific network segment.
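As a minimal sketch of the label convention described above, the snippet below models each Prometheus instance's position in the three layers and derives a label set from it. The label names `region`, `zone`, and `cluster`, and the example values, are assumptions for illustration; the article does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Position of one Prometheus instance in the three-layer structure."""
    region: str
    zone: str
    scope: str  # third layer: a cluster name or a VPC name

    def external_labels(self) -> dict:
        # Hypothetical label names; the article does not specify them.
        return {"region": self.region, "zone": self.zone, "cluster": self.scope}


def select(locations, **wanted):
    """Return the locations whose labels match every requested key/value."""
    return [
        loc for loc in locations
        if all(loc.external_labels().get(k) == v for k, v in wanted.items())
    ]


# Example fleet: two scopes in zone-a (one cluster, one VPC), one in zone-b.
fleet = [
    Location("cn-east", "zone-a", "k8s-prod-1"),
    Location("cn-east", "zone-a", "vpc-legacy"),
    Location("cn-east", "zone-b", "k8s-prod-2"),
]
zone_a = select(fleet, zone="zone-a")
```

Because a cluster and a VPC share the same third-layer label in this sketch, a query can filter on one label regardless of which kind of isolation a service uses.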
graph TB
subgraph Region Layer
TQR@{ shape: rounded, label: "Thanos Query Region" }
end
subgraph Zone Layer
TQZ1@{ shape: rounded, label: "Thanos Query Zone A" }
TQZ2@{ shape: rounded, label: "Thanos Query Zone B" }
end
subgraph Cluster / VPC
P1@{ shape: rounded, label: "Prometheus" }
P2@{ shape: rounded, label: "Prometheus" }
P3@{ shape: rounded, label: "Prometheus" }
P4@{ shape: rounded, label: "Prometheus" }
end
TQR --> TQZ1
TQR --> TQZ2
TQZ1 --> P1
TQZ1 --> P2
TQZ2 --> P3
TQZ2 --> P4
Using Thanos Query for the Initial Layered Architecture
Thanos components communicate over gRPC, and Thanos Query can aggregate the results of other queriers. Using both, data is aggregated layer by layer up to the top-level Thanos Query component, which then merges and computes the final time series result set for frontend display.
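The layer-by-layer aggregation can be illustrated with a small in-memory model. This is only a sketch: real Thanos components talk over gRPC and deduplicate by replica labels, whereas here `Leaf` and `QueryNode` are hypothetical classes and a series is simply a label set mapped to one value.

```python
from dataclasses import dataclass, field


def labels(**kv):
    """Build a hashable label set, e.g. labels(zone="a", cluster="c1")."""
    return frozenset(kv.items())


@dataclass
class Leaf:
    """Stands in for one Prometheus instance exposing its series."""
    series: dict

    def query(self) -> dict:
        return dict(self.series)


@dataclass
class QueryNode:
    """Stands in for a Thanos Query instance that fans out to its children."""
    children: list = field(default_factory=list)

    def query(self) -> dict:
        merged = {}
        for child in self.children:
            for label_set, value in child.query().items():
                # A series with an identical label set is kept only once.
                merged.setdefault(label_set, value)
        return merged


zone_a = QueryNode([
    Leaf({labels(zone="a", cluster="c1"): 1.0}),
    Leaf({labels(zone="a", cluster="c2"): 2.0}),
])
zone_b = QueryNode([
    Leaf({labels(zone="b", cluster="c3"): 3.0}),
    Leaf({labels(zone="b", cluster="c4"): 4.0}),
])
region = QueryNode([zone_a, zone_b])

result = region.query()      # series from all four leaves
total = sum(result.values())  # a region-wide aggregate
```

The region node never contacts a Prometheus instance directly; it only sees what each zone node returns, which mirrors the Region, Zone, and Cluster/VPC layering in the diagram.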
Introducing Thanos Query Frontend for a Unified Frontend Query Entry
The Thanos Query Frontend component has the following configurable capabilities for query optimization, which should be adjusted based on actual conditions:
- Time series vertical splitting: For example, when querying 15 days of data, the sample volume is large and reading the raw data into memory can cause OOM issues. By splitting vertically, such as breaking a 15-day aggregation query into 6-hour chunks, the Thanos Query component initiates 4 * 15 concurrent queries to complete the sample queries, aggregates the result sets for the different time periods, and releases memory promptly after each sub-query completes, efficiently optimizing resource utilization.
- Query result caching: Cache result sets in memory or Redis using a HASH KEY based on the query statement and time range for reuse, reducing upstream pressure.
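The arithmetic behind the two optimizations above can be sketched as follows. The helper names `split_range` and `cache_key`, the example PromQL expression, and the use of SHA-256 are illustrative assumptions, not the actual Thanos implementation; in Thanos itself the chunk size corresponds to the Query Frontend's `--query-range.split-interval` setting.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def split_range(start, end, step):
    """Cut [start, end) into consecutive sub-ranges no longer than step."""
    chunks = []
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + step, end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end
    return chunks


def cache_key(query, start, end):
    """Derive a cache key from the query statement and its time range."""
    raw = f"{query}|{int(start.timestamp())}|{int(end.timestamp())}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


end = datetime(2024, 1, 16, tzinfo=timezone.utc)
start = end - timedelta(days=15)

# 15 days in 6-hour chunks: 4 chunks per day * 15 days = 60 sub-queries.
chunks = split_range(start, end, timedelta(hours=6))

# One cache entry per sub-range, so a later overlapping query can reuse them.
keys = {cache_key("sum(rate(http_requests_total[5m]))", s, e) for s, e in chunks}
```

Keying the cache per sub-range rather than per full query means a dashboard that shifts its window slightly still hits most of the cached chunks.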
Leveraging Kubernetes for More Elastic Redundancy
Self-Developed Architecture Components
Built on native open-source projects, the architecture now largely provides cross-region hybrid cloud capabilities. Day-to-day enterprise management needs more than that, however: a management architecture and frontend capabilities are required before it can be considered an enterprise-grade service.
Basic Design Logic
To make the entire architecture flexible and versatile, several components were designed:
- Self-research service discovery: Interfaces with third-party systems such as CMDB and CICD to collect business system and asset information, calculates relationships between business systems and infrastructure, and schedules resource information to the P-sidcar component based on geographic information.
- P-sidcar: Manages Prometheus at the edge, receives nearby collector information from Self-research service discovery, and provides http_sd to Prometheus for discovering exporter collectors while achieving fine-grained label injection.
- msg route agent: Interfaces with Feishu, DingTalk and other communication services, and synchronizes alert responsible persons from Conf/Rule Sync for efficient last-mile targeted information push.
- A-sidcar: Manages Alertmanager cluster configuration and synchronizes suppression policies in near real-time for more precise alert management.
- Conf/Rule Sync: Interfaces with various edge components, synchronizing status information and backend management policies in near real-time.
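To make the http_sd step more concrete, here is a sketch of how a component like P-sidcar might turn asset records into the JSON document that Prometheus HTTP service discovery expects: a list of objects, each with `targets` and `labels`. The asset fields (`ip`, `port`, `service`, `owner`, `zone`) and the function name `build_http_sd` are assumptions; the article does not describe the real data model.

```python
import json
from collections import defaultdict


def build_http_sd(assets, local_zone):
    """Group nearby exporters into Prometheus HTTP SD target groups."""
    groups = defaultdict(list)
    for asset in assets:
        if asset["zone"] != local_zone:
            continue  # only serve collectors near this edge Prometheus
        key = (asset["service"], asset["owner"])
        groups[key].append(f'{asset["ip"]}:{asset["port"]}')
    return [
        {
            "targets": sorted(targets),
            # Labels injected here are attached to every scraped series.
            "labels": {"service": service, "owner": owner, "zone": local_zone},
        }
        for (service, owner), targets in sorted(groups.items())
    ]


# Hypothetical asset records, as they might arrive from service discovery.
assets = [
    {"ip": "10.0.0.2", "port": 9100, "service": "pay", "owner": "alice", "zone": "zone-a"},
    {"ip": "10.0.0.1", "port": 9100, "service": "pay", "owner": "alice", "zone": "zone-a"},
    {"ip": "10.1.0.1", "port": 9100, "service": "pay", "owner": "alice", "zone": "zone-b"},
]
payload = build_http_sd(assets, "zone-a")
body = json.dumps(payload)  # what an HTTP endpoint would return to Prometheus
```

Carrying an owner label on the targets is one way the alert side could later resolve the responsible person without another lookup, though whether the real system does this is not stated in the article.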
graph TB
subgraph Monitoring Platform
subgraph Management Layer
SD@{ shape: rounded, label: "Self-research Service Discovery" }
CR@{ shape: doc, label: "Conf/Rule Sync" }
MR@{ shape: rounded, label: "Message Route Agent" }
end
subgraph Edge Layer
PS@{ shape: rounded, label: "P-sidcar Edge Collection Management" }
AS@{ shape: rounded, label: "A-sidcar Alert Management" }
end
end
CMDB@{ shape: hex, label: "CMDB/CICD External Systems" } -->|Resource Sync| SD
SD -->|Resource Info| PS
CR -->|Collection Policy| PS
CR -->|Alert Policy| AS
CR -->|Responsible Person Info| MR
SD -->|Alert Responsible Person| MR
PS -->|http_sd| P@{ shape: rounded, label: "Prometheus" }
AS -->|Config Management| AM@{ shape: rounded, label: "Alertmanager" }
MR -->|Push| IM@{ shape: double-circle, label: "Feishu/DingTalk etc." }
Advanced Extensions
The base design aims to be lean and streamlined without losing flexibility. On top of this, enterprise capabilities are gradually enriched through self-developed frontend services, middleware, and edge components.
- The service discovery component focuses on interfacing with various third-party systems: not only CMDB/CICD systems, but also work order systems and job systems.
- The alert component gradually evolves into a unified alert system platform, working with the service discovery component for more advanced dynamic alert scheduling.
- The configuration sync component and Grafana gradually converge into a frontend system, integrating management and display.
The user-side logical view of the entire system is shown below:
By this stage the platform has become fairly complex internally, so the view presented to users has to stay simple.
Unified Alert System
Once a monitoring platform reaches a certain scale, alert storms inevitably start to trouble the technical support departments across the enterprise, and alert convergence governance becomes a priority. At this point, the alert component evolves from simply providing targeted alert push into a system with a gradually richer set of surrounding capabilities.
graph LR
subgraph Alert Configuration and Management
AC@{ shape: rounded, label: "Alert Configuration" } -->|Rules| AG@{ shape: diam, label: "Alert Generation" }
SC@{ shape: rounded, label: "Alert Silencing" }
IN@{ shape: rounded, label: "Alert Inhibition" }
PM@{ shape: rounded, label: "Responsible Person Management" }
DM@{ shape: rounded, label: "On-duty Management" }
end
AG --> AR@{ shape: diam, label: "Alert Reception" }
AR --> GP@{ shape: diam, label: "Alert Grouping" }
GP --> DQ@{ shape: diam, label: "Alert Deduplication" }
DQ --> RT@{ shape: diam, label: "Alert Routing" }
SC --> RT
IN --> RT
PM --> RT
DM --> RT
RT --> NT@{ shape: diam, label: "Alert Notification" }
NT --> EML@{ shape: doc, label: "Email" }
NT --> FS@{ shape: rounded, label: "Feishu" }
NT --> DT@{ shape: rounded, label: "DingTalk" }
NT --> SMS@{ shape: doc, label: "SMS" }
NT --> PHN@{ shape: doc, label: "Phone" }
NT --> WH@{ shape: doc, label: "Webhook" }
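The flow in the diagram above (reception, grouping, deduplication, then routing with silencing, inhibition, responsible persons, and on-duty fallback) can be sketched in a few functions. This is a simplified model under stated assumptions: alerts are plain label dictionaries, matchers are exact-match only, and the rule format loosely imitates Alertmanager's source/target/equal inhibition idea rather than reproducing its behavior.

```python
from collections import defaultdict


def matches(matchers, alert_labels):
    """True when every matcher label equals the alert's label value."""
    return all(alert_labels.get(k) == v for k, v in matchers.items())


def is_inhibited(alert, firing, rules):
    """A target alert is muted while a matching source alert is firing."""
    for rule in rules:
        if not matches(rule["target"], alert):
            continue
        for source in firing:
            if source is alert or not matches(rule["source"], source):
                continue
            if all(source.get(k) == alert.get(k) for k in rule["equal"]):
                return True
    return False


def route(alerts, silences, rules, owners, on_duty,
          group_by=("alertname", "service")):
    # Grouping, with deduplication by full label set inside each group.
    groups = defaultdict(dict)
    for alert in alerts:
        key = tuple(alert.get(name) for name in group_by)
        groups[key][tuple(sorted(alert.items()))] = alert
    notifications = []
    for key, members in groups.items():
        kept = [a for a in members.values()
                if not any(matches(s, a) for s in silences)
                and not is_inhibited(a, alerts, rules)]
        if kept:
            # Responsible person first, on-duty rota as the fallback.
            receiver = owners.get(kept[0].get("service"), on_duty)
            notifications.append(
                {"group": key, "receiver": receiver, "alerts": len(kept)})
    return notifications


alerts = [
    {"alertname": "NodeDown", "service": "pay", "instance": "10.0.0.1", "severity": "critical"},
    {"alertname": "NodeDown", "service": "pay", "instance": "10.0.0.1", "severity": "critical"},
    {"alertname": "HighLatency", "service": "pay", "instance": "10.0.0.1", "severity": "warning"},
    {"alertname": "DiskFull", "service": "search", "instance": "10.0.0.2", "severity": "warning"},
    {"alertname": "CertExpiry", "service": "web", "instance": "10.0.0.3", "severity": "warning"},
]
silences = [{"alertname": "CertExpiry"}]
rules = [{"source": {"severity": "critical"},
          "target": {"severity": "warning"},
          "equal": ["instance"]}]
result = route(alerts, silences, rules, owners={"pay": "alice"}, on_duty="on-duty-rota")
```

In this example five incoming alerts collapse into two notifications: the duplicate NodeDown is merged, HighLatency is inhibited by the critical alert on the same instance, and CertExpiry is silenced.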