Hybrid Cloud Cross-Region Monitoring System Governance: Autonomous + Unified Dual-Core Architecture Practice

January 10, 2022 | Architecture | Hybrid Cloud, Monitoring, Architecture | Observability Series

As businesses expand globally and hybrid cloud deployments grow, governing monitoring across IDCs, borders, and heterogeneous clouds has become a core challenge for stability assurance. Traditional monitoring solutions either depend on expensive dedicated-line upgrades that intrude on the business architecture, or fail to balance node autonomy with global unification. At the same time, monitoring is infrastructure that generates no revenue, so it must keep resource usage under strict control without degrading its capabilities.

This article breaks down a practical cross-region monitoring system governance solution from a real internet company, explaining how to achieve elastic scaling, cross-border coverage, node autonomy, and data unification for the monitoring system without modifying business architecture or incurring business cross-domain costs.

Governance Background and Core Pain Points

As businesses deploy globally across multiple locations, monitoring systems face four critical problems:

Cross-domain management difficulty: Hybrid cloud and transnational nodes have no unified monitoring entry point, which leads to severe multi-cloud fragmentation and data silos
High solution cost: Mainstream industry solutions rely on VPN dedicated-line upgrades, which are expensive and intrude on the business stability architecture
Strong resource constraints: Monitoring systems must strictly limit network I/O and compute usage without losing monitoring capability
Public network risks: Public network transmission suffers from jitter and security issues, and distributed nodes lack unified management

Governance Core Objectives

Elastic scaling that adapts to cross-IDC and cross-border deployment
Node autonomy plus global unification, so that a single point of failure does not affect the entire domain
Zero intrusion into the business architecture and no consumption of business cross-domain connectivity
Strict resource control with no degradation of the monitoring service, plus security compliance

Core Technical Solution Selection

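To address these pain points, the solution combines three elements in one design: a public network Mesh, a zero-trust network, and plug-and-play components, balancing security, performance, and scalability. Public network performance tuning and unified data governance complete the selection: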

Public network Mesh capability: Based on Istio Envoy + Mosn, build a public network service mesh to replace dedicated lines for cross-domain management
Zero-trust network architecture: Consul manages ACL, Token, encryption policies uniformly, ensuring public network transmission security
Plug-and-play scaling: Modular plug-and-play expansion, adaptable to fast integration in heterogeneous environments
Public network performance optimization: TCP BBR algorithm reduces network jitter, Mesh layer implements circuit breaking/degradation
Unified data governance: Thanos cluster enables cross-node data aggregation, storage, and query

Layered Architecture Details (with Mermaid Diagrams)

Public Network Mesh Layer (Cross-Domain Connectivity Core)

All regional nodes communicate securely over ports 59080 and 59443. Envoy handles network proxying, Consul manages policies, and Mosn manages transmission rules, so there is no dependency on business VPN dedicated lines.

```mermaid
graph TD
    N@{ shape: rounded, label: "Domestic / International cloud nodes" } -->|Port 59080| M@{ shape: hex, label: "Envoy+Mosn Mesh network" }
    M --> F@{ shape: hex, label: "Consul policy center" }
    F -->|ACL/Token/Route sync| M
    M --> OPT@{ shape: doc, label: "TCP BBR + circuit breaking" }
    style N fill:#E3F2FD,stroke:#1976d2,color:#0d47a1
    style M fill:#FFF3E0,stroke:#ff9800,color:#e65100
    style F fill:#F3E5F5,stroke:#7b1fa2,color:#4a148c
    style OPT fill:#E8F5E9,stroke:#4caf50,color:#1b5e20
```
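
The circuit breaking that the Mesh layer applies to cross-region traffic can be illustrated with a minimal sketch. This is not Envoy or Mosn code: the `CircuitBreaker` class, its `max_failures` and `reset_after` parameters, and the injected `clock` are hypothetical names chosen for the example. It only shows the general closed, open, and half-open state machine.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Minimal three-state breaker: closed -> open -> half-open -> closed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures   # hypothetical failure threshold
        self.reset_after = reset_after     # hypothetical cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after:
            return "half-open"
        return "open"

    def call(self, send: Callable[[], object],
             fallback: Callable[[], object]) -> object:
        if self.state == "open":
            return fallback()              # degrade: skip the cross-region call
        try:
            result = send()
        except OSError:                    # covers ConnectionError and TimeoutError
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.max_failures:
                self.opened_at = self.clock()   # trip (or re-trip) the breaker
            return fallback()
        self.failures = 0
        self.opened_at = None              # a success closes the breaker
        return result
```

While the breaker is open, callers get the fallback immediately, so a flapping public link cannot stall the monitoring pipeline. After the cool-down, a single trial call decides whether traffic resumes.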

Regional Autonomous Node Architecture

Each regional node is an independent autonomous unit. Even when disconnected from the primary node, it continues to collect, alert, and store data normally, which prevents domain-wide failures.

```mermaid
graph TD
    BE@{ shape: rounded, label: "Business Exporter" } --> P@{ shape: rounded, label: "Prometheus collection" }
    P --> ST@{shape: cyl, label: "Thanos Sidecar + TSDB<br/>local storage / S3"}
    P --> AM@{ shape: rounded, label: "Alertmanager alerts<br/>→ notifications" }
    P --> TQ@{ shape: rounded, label: "Thanos Query local query" }
    Mosn@{ shape: rounded, label: "Mosn public mesh entry" } --> P
    style BE fill:#E3F2FD,stroke:#1976d2,color:#0d47a1
    style P fill:#FFF3E0,stroke:#ff9800,color:#e65100
    style ST fill:#E8F5E9,stroke:#4caf50,color:#1b5e20
    style AM fill:#FCE4EC,stroke:#e53935,color:#b71c1c
    style TQ fill:#F3E5F5,stroke:#7b1fa2,color:#4a148c
    style Mosn fill:#E3F2FD,stroke:#1976d2,color:#0d47a1
```

Primary IDC Aggregation Architecture

The primary node has global data aggregation, unified alerting, and global reporting capabilities. Any autonomous node can be quickly promoted to primary node, supporting flexible traffic migration and decommissioning.

```mermaid
graph TD
    AN@{ shape: rounded, label: "Regional autonomous nodes" } -->|Thanos Receive| MC@{ shape: rounded, label: "Primary IDC Thanos cluster" }
    MC --> S3@{shape: cyl, label: "Store / Compact<br/>S3 object storage"}
    MC --> TR@{ shape: rounded, label: "Rule global alerts<br/>→ Alertmanager" }
    MC --> GF@{ shape: rounded, label: "Grafana unified visualization" }
    style AN fill:#E3F2FD,stroke:#1976d2,color:#0d47a1
    style MC fill:#FFF3E0,stroke:#ff9800,color:#e65100
    style S3 fill:#E8F5E9,stroke:#4caf50,color:#1b5e20
    style TR fill:#FCE4EC,stroke:#e53935,color:#b71c1c
    style GF fill:#F3E5F5,stroke:#7b1fa2,color:#4a148c
```
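
To make the idea of global aggregation concrete, the sketch below merges series reported by several regional nodes and collapses copies that differ only by a replica label. It is a deliberately simplified illustration, not Thanos code: the `merge_series` function and the `replica` and `region` label names are assumptions, and real query-time deduplication is more involved than "first sample wins".

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]                 # (timestamp in ms, value)
Labels = Tuple[Tuple[str, str], ...]       # sorted label pairs, usable as a dict key


def merge_series(series: List[Tuple[Dict[str, str], List[Sample]]],
                 replica_label: str = "replica") -> Dict[Labels, List[Sample]]:
    """Merge series from many nodes, collapsing copies that differ only by replica."""
    merged: Dict[Labels, Dict[int, float]] = {}
    for labels, samples in series:
        # Drop the replica label so that HA copies map to the same key.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        bucket = merged.setdefault(key, {})
        for ts, value in samples:
            bucket.setdefault(ts, value)   # first replica to report a timestamp wins
    return {key: sorted(points.items()) for key, points in merged.items()}
```

Series from different regions stay separate because their `region` label differs, while two replicas of the same regional Prometheus fold into one continuous series in the global view.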

Overall Cross-Region Monitoring Governance Architecture

```mermaid
graph TB
    AN@{ shape: rounded, label: "Regional autonomous node<br/>Prometheus / Sidecar / Alertmanager / Mosn" }
    AN -->|Mosn/Envoy| C@{ shape: hex, label: "Public mesh control<br/>Consul + BBR + circuit breaking" }
    AN -->|Sidecar upload| D@{ shape: cyl, label: "Primary IDC aggregation<br/>Query/Receive → Store/Compact<br/>→ global Alertmanager / Grafana" }
    style AN fill:#E3F2FD,stroke:#1976d2,color:#0d47a1
    style C fill:#FFF3E0,stroke:#ff9800,color:#e65100
    style D fill:#E8F5E9,stroke:#4caf50,color:#1b5e20
```

Key Technical Capability Implementation

Zero-Trust Network Security

Consul centrally manages ACL policies, token authentication, and service routing
Each node uses its own certificate, and all cross-node transmission is encrypted end to end
Mosn controls port access and strictly limits data read/write permissions
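
The default-deny principle behind these controls can be sketched in a few lines. This does not use the Consul API: the `POLICIES` table, the token strings, and the service names are all hypothetical, and the example only shows that a request passes when a known token explicitly grants the requested action.

```python
import hmac
from typing import Dict, Set, Tuple

# Hypothetical policy table: token -> allowed (service, action) pairs.
POLICIES: Dict[str, Set[Tuple[str, str]]] = {
    "token-region-eu": {("thanos-receive", "write")},
    "token-primary": {("thanos-receive", "write"), ("thanos-query", "read")},
}


def is_allowed(token: str, service: str, action: str,
               policies: Dict[str, Set[Tuple[str, str]]] = POLICIES) -> bool:
    """Default deny: allow only when a known token explicitly grants the pair."""
    for known, grants in policies.items():
        # Constant-time comparison avoids leaking token contents through timing.
        if hmac.compare_digest(known.encode(), token.encode()):
            return (service, action) in grants
    return False                           # unknown token: deny
```

In this sketch a regional token can write to the aggregation endpoint but cannot read global data, which mirrors the strict separation of read and write permissions described above.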

Public Network Performance Stability

Nodes enable the TCP BBR congestion control algorithm at the OS level to reduce the impact of public network jitter
Envoy and Mosn implement circuit breaking, degradation, and rate limiting, so that public network anomalies cannot cripple monitoring
Data blocks are merged and uploaded periodically (every 2 hours), reducing network I/O usage
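
For the first item, a node can verify that BBR is active by reading two standard Linux sysctl files. The sketch below is one possible check, assuming a Linux host; the function names `bbr_status` and `read_bbr_status` are made up for the example.

```python
from pathlib import Path

PROC = Path("/proc/sys/net/ipv4")          # standard Linux sysctl location


def bbr_status(current: str, available: str) -> str:
    """Classify BBR state from the contents of the two sysctl files."""
    if current.strip() == "bbr":
        return "enabled"
    if "bbr" in available.split():
        return "available"                 # supported by the kernel but not selected
    return "missing"                       # module not loaded or not supported


def read_bbr_status(proc: Path = PROC) -> str:
    """Read the live values; only works on a Linux host."""
    current = (proc / "tcp_congestion_control").read_text()
    available = (proc / "tcp_available_congestion_control").read_text()
    return bbr_status(current, available)
```

When the result is `available`, BBR is typically switched on with `sysctl -w net.ipv4.tcp_congestion_control=bbr`, often together with `net.core.default_qdisc=fq`.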

Plug-and-Play Elastic Scaling

Monitoring components are modular and plug-and-play for fast onboarding of new regional nodes
Heterogeneous environments (different cloud providers, different architectures) require no modification
Nodes can independently upgrade, decommission, or switch without affecting global monitoring
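
One way to picture plug-and-play onboarding is a small generator that produces the per-node collection settings from a region name. The sketch below builds a dictionary shaped like a Prometheus configuration. The `node_config` function, the `region` and `replica` label names, and the `business-exporter` job name are assumptions for illustration, not the configuration the article's system actually uses.

```python
from typing import Dict, List


def node_config(region: str, replica: str, targets: List[str]) -> Dict:
    """Build a Prometheus-style config dict for one new regional node."""
    if not region or not replica:
        raise ValueError("region and replica must be non-empty")
    return {
        "global": {
            # Thanos relies on external labels to tell Prometheus instances apart.
            "external_labels": {"region": region, "replica": replica},
        },
        "scrape_configs": [
            {
                "job_name": "business-exporter",   # hypothetical job name
                "static_configs": [{"targets": sorted(targets)}],
            },
        ],
    }
```

Calling `node_config("eu-west", "0", ["10.0.0.5:9100"])` yields the settings for a new node. Because only the labels and targets vary, the same template applies across cloud providers.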

Autonomous + Unified Dual Mode

Node autonomy: Collection, alerting, and storage all run locally, so the node remains usable when disconnected
Global unification: The primary node aggregates data and provides a unified view, global alerting, and centralized reporting
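
The interplay of the two modes can be sketched as a node that evaluates alerts locally and queues data until the primary node is reachable again. This is a conceptual model only: the `AutonomousNode` class, its `threshold`, and the `upload` callback are hypothetical stand-ins for Prometheus, Alertmanager, and the Thanos upload path.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class AutonomousNode:
    """Evaluates alerts locally and queues data until the primary is reachable."""

    def __init__(self, threshold: float, max_pending: int = 100) -> None:
        self.threshold = threshold         # hypothetical alert threshold
        # Bounded queue: once full, the oldest entry is dropped to cap memory use.
        self.pending: Deque[Dict[str, float]] = deque(maxlen=max_pending)
        self.alerts: List[str] = []

    def collect(self, metrics: Dict[str, float]) -> None:
        # Local alerting never waits on the cross-region link.
        for name, value in metrics.items():
            if value > self.threshold:
                self.alerts.append(name)
        self.pending.append(metrics)       # stand-in for local storage

    def flush(self, upload: Callable[[Dict[str, float]], None]) -> int:
        """Try to ship queued data; stop at the first failure and keep the rest."""
        sent = 0
        while self.pending:
            try:
                upload(self.pending[0])
            except OSError:
                break                      # primary unreachable; retry later
            self.pending.popleft()
            sent += 1
        return sent
```

Alerts fire in `collect` regardless of link state, and `flush` resumes from where it stopped once the connection returns, so a disconnection delays the global view without losing local alerting.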

Solution Core Value

Zero business intrusion: No changes to the business architecture and no consumption of business dedicated-line capacity, so modification risk is minimal
Low-cost implementation: The public network Mesh replaces expensive dedicated lines, so investment is only about one third of that of traditional solutions
High availability guarantee: Node autonomy eliminates single points of failure, and global monitoring stability improved by 90%
Elastic expansion: Plug-and-play components support rapid onboarding of nodes worldwide as the business continues to expand
Security compliance: The zero-trust network and full-link encryption meet cross-border monitoring security requirements

Summary

This cross-region monitoring governance solution is a practical reference for monitoring architecture in hybrid cloud and globally deployed environments. It breaks away from the traditional approach of "modifying the business and investing in dedicated lines" and centers on a public network Mesh, a zero-trust network, and Thanos-based data unification, balancing the four core requirements of scalability, security, cost, and availability. It achieves both global control of cross-region monitoring and independent autonomy of individual nodes, providing a replicable implementation template for internet companies' global monitoring infrastructure.
