Eyes On You: The 2022 Productization Journey of a Multi-Cloud Heterogeneous Monitoring Platform

Published June 20, 2022. Category: Observability. Tags: Prometheus, Monitoring Platform, Productization, Architecture. Series: Observability Series.


This article reviews how a large-scale internet monitoring platform was productized in 2022: the planning and implementation behind the move from large-scale coverage to productization, usability, and intelligence. The platform handles multi-cloud, multi-region collection and alerting over a massive metric volume, and what follows is a record of how it actually evolved.

Platform Status: Large-Scale Challenges of Multi-Cloud Heterogeneity

As the business expands globally, the monitoring platform has entered a phase of ultra-large-scale, highly distributed, heterogeneous operation. The current status is as follows:

Resource scale: Services span four major cloud providers and multiple domestic and international regions, including 57 Kubernetes clusters, 2420 non-container hosts, and hundreds of middleware instances (MySQL, MongoDB, Kafka, Redis, TiDB).
Collection scale: Metric collection has grown explosively over three years. In 2022 the platform collects a daily average of over 324.9 billion metrics, with a peak collection rate of 3.76 million metrics/second, handled by 122 distributed Prometheus nodes across regions.
Alert scale: Alerts from the four cloud providers' monitoring services and from the self-built platform are ingested through one unified entry point, which pushes over 220,000 alert events per week to 50+ collaboration groups.
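The scale figures above can be cross-checked with a little arithmetic. Dividing the daily total by 86,400 seconds gives about 3.76 million metrics per second, which matches the quoted rate. The per-node and per-group numbers below assume a perfectly even spread, which is an illustrative simplification and not something the article states.

```python
# Figures quoted in the article.
DAILY_METRICS = 324.9e9      # metrics collected per day (daily average)
QUOTED_RATE = 3.76e6         # metrics per second, as quoted
PROMETHEUS_NODES = 122
WEEKLY_ALERTS = 220_000
GROUPS = 50                  # the text says "50+", so 50 is a lower bound

SECONDS_PER_DAY = 24 * 60 * 60

# Sustained rate implied by the daily total.
avg_rate = DAILY_METRICS / SECONDS_PER_DAY

# Assumption: load is spread evenly across nodes and groups.
per_node_rate = avg_rate / PROMETHEUS_NODES
alerts_per_day = WEEKLY_ALERTS / 7
alerts_per_group_week = WEEKLY_ALERTS / GROUPS  # upper bound with 50+ groups

print(f"implied average rate   : {avg_rate:,.0f} metrics/s")
print(f"quoted rate            : {QUOTED_RATE:,.0f} metrics/s")
print(f"average per node       : {per_node_rate:,.0f} metrics/s")
print(f"alerts per day         : {alerts_per_day:,.0f}")
print(f"alerts per group, week : {alerts_per_group_week:,.0f} at most")
```

Even under the even-spread assumption, each collaboration group would receive thousands of events per week, which is the background for the noise problems described in the next section.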

The current platform exhibits four core characteristics: multi-cloud differentiation, multi-environment isolation, cross-region dispersion, and real-time massive data processing. The industry has no fully standardized best practice for this combination, so the platform has to keep iterating as the business grows.

Core Pain Points: The Critical Gap from “Usable” to “Easy to Use”

After the initial large-scale build-out, the monitoring platform had basic coverage in place but still fell short on productization, usability, and effectiveness. The core issues fall into two areas:

Metric Collection & Processing Platform

Significant infrastructure differences across cloud providers, insufficient automatic resource discovery and monitoring coverage
Inconsistent metric naming, scattered collection nodes, R&D teams cannot self-manage their own service monitoring strategies
Custom monitoring reporting process is cumbersome and has a high barrier to entry, which slows the rollout of monitoring coverage

Unified Alert Platform

Multiple alert sources, inconsistent data structures, lack of global control capabilities
Alert flooding, excessive noise, greatly reducing fault perception and handling efficiency
Low alert event transparency, R&D/operations cannot autonomously manage and trace the alert lifecycle

2021 Foundation Building: Establishing Platform Management Capabilities

In 2021 the focus was on treating monitoring as a platform to be managed, completing the core transformation from “scattered tools” to a “unified platform” and building four core capabilities:

Core System Construction

Service Discovery & Registration System: Cross-domain distributed resource discovery and registration, using Sidecar pattern to interface with CMDB and CICD, automatically managing all monitorable resources
Unified Alert Platform: Full lifecycle management of alert ingestion, normalization, routing, notification, and archiving
Basic Control Capabilities: Alert silencing/inhibition, statistical analysis, strategy management, route distribution system
Visualization Frontend: Operations-oriented monitoring management and alert processing portal
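As a rough illustration of the normalization step in the unified alert platform, the sketch below maps two differently shaped payloads onto one event record. The Alertmanager fields follow its webhook payload; the cloud-vendor payload, the `AlertEvent` schema, and every field name on them are hypothetical stand-ins, not the platform's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AlertEvent:
    """Hypothetical unified schema shared by all alert sources."""
    source: str
    name: str
    severity: str
    resource: str
    status: str            # "firing" or "resolved"
    started_at: datetime


def from_alertmanager(payload: dict) -> list:
    """Normalize an Alertmanager webhook payload (one event per alert)."""
    events = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        # Assumes an RFC 3339 timestamp without sub-microsecond digits.
        started = datetime.fromisoformat(alert["startsAt"].replace("Z", "+00:00"))
        events.append(AlertEvent(
            source="prometheus",
            name=labels.get("alertname", "unknown"),
            severity=labels.get("severity", "warning"),
            resource=labels.get("instance", "unknown"),
            status=alert.get("status", "firing"),
            started_at=started,
        ))
    return events


def from_cloud_vendor(payload: dict) -> AlertEvent:
    """Normalize a made-up cloud-vendor alarm; the field names are assumptions."""
    return AlertEvent(
        source=payload["vendor"],
        name=payload["alarmName"],
        severity=payload["level"].lower(),
        resource=payload["instanceId"],
        status="resolved" if payload["state"] == "OK" else "firing",
        started_at=datetime.fromtimestamp(payload["timestamp"], tz=timezone.utc),
    )


am_payload = {"alerts": [{
    "status": "firing",
    "labels": {"alertname": "HighCPU", "severity": "critical", "instance": "10.0.0.8:9100"},
    "startsAt": "2022-06-20T08:00:00Z",
}]}
vendor_payload = {"vendor": "cloud-a", "alarmName": "DiskFull", "level": "WARNING",
                  "instanceId": "i-123", "state": "ALARM", "timestamp": 1655712000}

unified = from_alertmanager(am_payload) + [from_cloud_vendor(vendor_payload)]
for event in unified:
    print(event)
```

Once every source lands in one shape, routing, silencing, statistics, and archiving only need to be written once.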

2021 Monitoring Platform Core Architecture

```mermaid
flowchart TD
    A["Hosts/Containers/<br/>Middleware"]
    B["CMDB/CICD<br/>Third-party"]
    C["Service Discovery<br/>Registry"]
    D@{ shape: cyl, label: "Distributed<br/>Prometheus Cluster" }
    E["Exporter/Sidecar<br/>Collectors"]
    L["Grafana<br/>Visualization"]
    A --> E --> D
    B --> C --> D
    D --> L
    classDef input fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    classDef proc fill:#fff3e0,stroke:#FF9800,color:#BF360C
    classDef output fill:#c8e6c9,stroke:#4CAF50,color:#1B5E20
    class A,B input
    class C,E proc
    class D,L output
```

Resources are discovered and registered, then collected by the distributed Prometheus cluster and visualized via Grafana.
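One common way to hand discovered resources to Prometheus is file-based service discovery, where a registry writes JSON target groups that Prometheus reloads. The sketch below assumes that mechanism and a hypothetical CMDB record shape (`ip`, `port`, `cloud`, `region`, `env`, `service`); the article does not say which discovery mechanism the platform actually uses.

```python
import json
from collections import defaultdict


def build_file_sd(records):
    """Group CMDB records into Prometheus file_sd target groups.

    Each output item has the file_sd shape:
    {"targets": ["host:port", ...], "labels": {...}}.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["cloud"], rec["region"], rec["env"], rec["service"])
        groups[key].append(f'{rec["ip"]}:{rec["port"]}')

    return [
        {
            "targets": sorted(targets),
            "labels": {"cloud": cloud, "region": region, "env": env, "service": service},
        }
        for (cloud, region, env, service), targets in sorted(groups.items())
    ]


# Hypothetical CMDB export; values are made up for illustration.
cmdb_records = [
    {"ip": "10.0.0.8", "port": 9100, "cloud": "cloud-a", "region": "cn-north", "env": "prod", "service": "order"},
    {"ip": "10.0.0.9", "port": 9100, "cloud": "cloud-a", "region": "cn-north", "env": "prod", "service": "order"},
    {"ip": "172.16.1.4", "port": 9104, "cloud": "cloud-b", "region": "us-east", "env": "prod", "service": "mysql"},
]

target_groups = build_file_sd(cmdb_records)
print(json.dumps(target_groups, indent=2))
```

Attaching cloud, region, and environment as labels at discovery time is what later lets a query filter by them without knowing which node scraped the target.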

```mermaid
flowchart TD
    F@{ shape: hex, label: "Alertmanager" }
    G["Alert Routing<br/>System"]
    H["Alert Strategy/<br/>Silence/Stats"]
    I["Enterprise<br/>IM/SMS"]
    J["Personal/Group<br/>Subscriptions"]
    K["Monitoring<br/>WEB UI"]
    F --> G --> H
    H --> I
    H --> J
    H --> K
    classDef special fill:#f3e5f5,stroke:#9C27B0,color:#4A148C
    classDef proc fill:#fff3e0,stroke:#FF9800,color:#BF360C
    classDef output fill:#c8e6c9,stroke:#4CAF50,color:#1B5E20
    class F special
    class G,H proc
    class I,J,K output
```

Alerts route from Alertmanager through the strategy hub for silencing and statistics, then fan out to IM, subscriptions, and the ops portal.
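The silence-then-fan-out behaviour of the strategy hub can be sketched in a few lines. The class names, the exact-match label semantics, and the channel identifiers below are assumptions made for illustration; the real platform's matching rules are not described in the article.

```python
from dataclasses import dataclass, field


@dataclass
class Silence:
    """A silence applies when every matcher equals the alert's label value."""
    matchers: dict

    def matches(self, labels: dict) -> bool:
        # An empty matcher set is treated as matching nothing, to avoid
        # silencing every alert by accident.
        return bool(self.matchers) and all(
            labels.get(key) == value for key, value in self.matchers.items()
        )


@dataclass
class StrategyHub:
    silences: list = field(default_factory=list)
    stats: dict = field(default_factory=lambda: {"silenced": 0, "delivered": 0})

    # Hypothetical channel names mirroring the three outputs in the diagram.
    CHANNELS = ("im_sms", "subscriptions", "web_ui")

    def dispatch(self, labels: dict) -> list:
        """Return the channels an alert fans out to (empty if silenced)."""
        if any(silence.matches(labels) for silence in self.silences):
            self.stats["silenced"] += 1
            return []
        self.stats["delivered"] += 1
        return list(self.CHANNELS)


hub = StrategyHub(silences=[Silence({"env": "staging", "alertname": "HighCPU"})])

delivered = hub.dispatch({"alertname": "HighCPU", "env": "prod"})
suppressed = hub.dispatch({"alertname": "HighCPU", "env": "staging"})
print(delivered, suppressed, hub.stats)
```

Counting silenced and delivered events in the same place as the decision keeps the statistics consistent with what users actually received.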

H1 2022 Planning: R&D-Facing One-Stop Monitoring Platform

The core goal for H1 2022: Create a one-stop, frictionless, easy-to-use, and efficient monitoring experience for R&D teams, shielding the complexity of multi-cloud, cross-region, and isolated environments. Core focus areas:

Core Design Principles

OneIn: Unified entry point, no need to care about environment, data center, or cloud provider differences
Coverage: CICD+CMDB full-chain automated coverage, frictionless monitoring integration
Self-service: Full-process self-service operations, realizing “Your Data, Your Rules”
Observability: Metrics+Tracing+Logging convergence

Core Feature Delivery

One-Stop Monitoring Portal: Unified integration of hosts, containers, databases, and middleware monitoring across all scenarios, supporting quick resource search by IP, business info, cloud provider, and region.
Minute-Level Instance Monitoring Setup: New resources get monitoring deployed within minutes, with no manual configuration needed and with automated collection and alert binding.
Intuitive Metric Visualization: Multi-dimensional real-time charts for CPU, memory, disk, network, performance, connection count, etc., with custom time ranges and refresh frequencies.
Quick Owner Association: Automatic binding of resources to R&D/operations owners, so the responsible person can be identified quickly during an incident and troubleshooting starts sooner.
One-Stop Alert Configuration: Simple/Advanced dual-mode alert configuration frontend, supporting metric registration, rule group creation, and policy self-management.
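To make the "Simple" configuration mode concrete, the sketch below turns a handful of form fields into a Prometheus-style alerting rule (`alert`, `expr`, `for`, `labels`, `annotations`) inside a rule group. The function names, the form fields, and the metric `cpu_usage_percent` are hypothetical; the article does not describe how the frontend generates rules.

```python
import json

ALLOWED_OPS = {">", ">=", "<", "<=", "==", "!="}


def compile_simple_rule(name, metric, op, threshold, for_minutes, severity, owner):
    """Build one alerting rule from simple-mode form fields."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op!r}")
    return {
        "alert": name,
        "expr": f"{metric} {op} {threshold:g}",
        "for": f"{for_minutes}m",
        # The owner label lets the routing layer find the responsible team.
        "labels": {"severity": severity, "owner": owner},
        # Doubled braces produce the literal {{ ... }} template syntax.
        "annotations": {"summary": f"{name} on {{{{ $labels.instance }}}}"},
    }


def rule_group(group_name, rules):
    """Wrap rules in the top-level 'groups' structure of a rule file."""
    return {"groups": [{"name": group_name, "rules": list(rules)}]}


rule = compile_simple_rule(
    name="HighCPU",
    metric="cpu_usage_percent",   # hypothetical metric name
    op=">",
    threshold=90,
    for_minutes=5,
    severity="critical",
    owner="team-payments",        # hypothetical owner
)
print(json.dumps(rule_group("payments-basic", [rule]), indent=2))
```

An "Advanced" mode could accept a raw expression in place of the metric, operator, and threshold fields while reusing the same group wrapper.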

H2 2022 Planning: Full-Link Observability and Intelligent Monitoring

H2 focuses on completing full-chain capabilities and alert intelligence, expanding monitoring coverage:

Full-Scenario Observability Capabilities

Connect the full chain of reporting services, queue services, consumer services, distributed storage, and metadata management
Integrate business monitoring, APM application monitoring, client-side monitoring, and synthetic monitoring with deep call chain linkage
Cover WEB + mobile endpoints, enabling monitoring viewing and alert handling anytime, anywhere

Intelligent Alert Evolution

Solve alert coverage challenges for complex middleware like Kafka, optimize multi-scenario alert strategies
Explore threshold-free intelligent alerting: Automatically infer anomaly thresholds based on seasonal data patterns
Reuse intelligent models, extend to more business scenarios, build generalized intelligent alerting capabilities
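As a minimal sketch of what threshold-free alerting based on seasonal patterns could look like, the example below learns a per-slot band (mean plus or minus k standard deviations) from history and flags values outside it. This is a deliberately simple stand-in technique chosen for illustration, not the model the platform uses; the function names and the synthetic data are assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def seasonal_bounds(history, period, k=3.0):
    """Learn (low, high) bounds for each slot of a repeating season.

    history: values sampled at a fixed interval, oldest first.
    period:  number of samples in one season (24 for hourly data, daily season).
    """
    buckets = defaultdict(list)
    for index, value in enumerate(history):
        buckets[index % period].append(value)

    bounds = {}
    for slot, values in buckets.items():
        centre = mean(values)
        spread = stdev(values) if len(values) > 1 else 0.0
        bounds[slot] = (centre - k * spread, centre + k * spread)
    return bounds


def is_anomalous(value, index, bounds, period):
    """Check a new sample against the band learned for its slot."""
    low, high = bounds[index % period]
    return value < low or value > high


# Synthetic week of hourly data: busy from 09:00 to 17:59, quiet otherwise,
# with a small day-to-day wobble of -1, 0, or +1.
history = [
    (100 if 9 <= hour < 18 else 20) + (day % 3 - 1)
    for day in range(7)
    for hour in range(24)
]
bounds = seasonal_bounds(history, period=24)

# The same value is normal at 10:00 but anomalous at 03:00.
print(is_anomalous(100, index=10, bounds=bounds, period=24))
print(is_anomalous(100, index=3, bounds=bounds, period=24))
```

The point of the example is that no one typed a threshold: the band for each hour comes from the data, so a level that is routine during the day is flagged at night.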

Full-Stack Capability Extension

Add capacity planning, task scheduling, network monitoring, and other capabilities, so that the observability chain covers collection, storage, analysis, alerting, self-healing, and planning.

Solving Core Questions: Let Monitoring Return to Business Itself

Platform iteration always revolves around four core questions, shielding users from underlying complexity:

Where are the resources and services? The service discovery system manages heterogeneous resources across 64 isolated environments, automatically sensing the location of all services.
Where are the collected metrics? Distributed edge autonomy architecture shields differences across 122 Prometheus nodes; users don’t need to care about collection node assignments.
How are alert events accurately delivered? Alert lifecycle management plus noise reduction and rate limiting, so that the useful information in over 220,000 weekly alert events reaches the right people.
How can I quickly observe my services? A one-stop view that quickly locates your own service data among 324.9 billion daily metrics.
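The noise reduction and rate limiting mentioned for alert delivery can be illustrated with two simple checks: suppress repeats of the same alert within a deduplication window, and cap how many notifications a group receives per time window. The class, its parameters, and the default values are hypothetical and only sketch the general idea.

```python
from collections import defaultdict, deque


class NoiseReducer:
    """Deduplicate repeated alerts and rate-limit notifications per group."""

    def __init__(self, dedup_window=300.0, max_per_window=20, rate_window=60.0):
        self.dedup_window = dedup_window      # seconds between repeats of one alert
        self.max_per_window = max_per_window  # notifications allowed per group
        self.rate_window = rate_window        # length of the rate window in seconds
        self._last_sent = {}                  # fingerprint -> last delivery time
        self._recent = defaultdict(deque)     # group -> recent delivery times

    def allow(self, fingerprint, group, now):
        """Return True if the alert should be delivered at time `now`."""
        last = self._last_sent.get(fingerprint)
        if last is not None and now - last < self.dedup_window:
            return False  # same alert was delivered recently

        recent = self._recent[group]
        while recent and now - recent[0] >= self.rate_window:
            recent.popleft()  # forget deliveries outside the rate window
        if len(recent) >= self.max_per_window:
            return False  # group has reached its notification budget

        recent.append(now)
        self._last_sent[fingerprint] = now
        return True


reducer = NoiseReducer(dedup_window=300.0, max_per_window=2, rate_window=60.0)
decisions = [
    reducer.allow("cpu@10.0.0.8", "team-a", now=0.0),    # first delivery
    reducer.allow("cpu@10.0.0.8", "team-a", now=10.0),   # duplicate
    reducer.allow("disk@10.0.0.8", "team-a", now=10.0),  # second delivery
    reducer.allow("mem@10.0.0.8", "team-a", now=20.0),   # over the group budget
    reducer.allow("mem@10.0.0.8", "team-a", now=61.0),   # window has moved on
]
print(decisions)
```

Timestamps are passed in explicitly so the behaviour is easy to test; a production version would also need to record suppressed events so they remain traceable.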

Summary

2022 is the year the monitoring platform moves from an operations tool to a productized platform available to everyone. Centered on usability, effectiveness, transparency, and universality, the platform is pushing forward automated monitoring onboarding, self-service use by R&D, alert noise reduction, and cross-region observability under massive multi-cloud heterogeneous conditions.

Part of series: Observability Series
