Eyes On You: The 2022 Productization Journey of a Multi-Cloud Heterogeneous Monitoring Platform
With multi-cloud deployment, global networking, and exponential growth in service scale, monitoring platforms at internet companies have long outgrown the basic “metric collection + alert notification” role and become core infrastructure for end-to-end stability. Drawing on the real evolution of the monitoring platform at a large internet company, this article walks through the planning and implementation behind the 2022 upgrade from large-scale coverage to productization, usability, and intelligence.
Platform Status: Large-Scale Challenges of Multi-Cloud Heterogeneity
As the business expands globally, the monitoring platform now operates at very large scale across highly distributed, heterogeneous infrastructure. The current status is as follows:
- Resource scale: Services span four major cloud providers and multiple domestic and international regions, comprising 57 Kubernetes clusters, 2,420 non-container hosts, and hundreds of middleware and database instances (MySQL, MongoDB, Kafka, Redis, TiDB).
- Collection scale: Metric collection has grown explosively over three years. In 2022 the platform collects over 324.9 billion metrics per day on average, with a peak collection rate of 3.76 million metrics/second, handled by 122 distributed Prometheus nodes across regions.
- Alert scale: Alerts from the four cloud providers' monitoring services and from the self-built platform are ingested through one unified pipeline, which pushes over 220,000 alert events per week to 50+ collaboration groups.
The platform is therefore defined by four characteristics: differences between cloud providers, isolation between environments, dispersion across regions, and real-time processing of massive data volumes. The industry has no fully standardized best practice for this combination, so the platform must keep iterating as the business grows.
Core Pain Points: The Critical Gap from “Usable” to “Easy to Use”
After initial large-scale construction, the monitoring platform completed basic coverage, but still had significant shortcomings in productization, usability, and effectiveness. Core issues are divided into two dimensions:
Metric Collection & Processing Platform
- Significant infrastructure differences across cloud providers, insufficient automatic resource discovery and monitoring coverage
- Inconsistent metric naming and scattered collection nodes; R&D teams cannot self-manage the monitoring strategies of their own services
- Custom metric reporting is cumbersome and has a high barrier to entry, which slows the rollout of monitoring coverage
Unified Alert Platform
- Multiple alert sources, inconsistent data structures, lack of global control capabilities
- Alert flooding, excessive noise, greatly reducing fault perception and handling efficiency
- Low alert event transparency; R&D and operations staff cannot manage or trace the alert lifecycle on their own
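To make the "inconsistent data structures" pain point concrete, the sketch below maps two differently shaped alert payloads onto one schema. Everything in it is an assumption for illustration: the `UnifiedAlert` fields, the two payload layouts, and the severity vocabulary are invented and are not the platform's actual formats.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class UnifiedAlert:
    """Hypothetical common schema; field names are illustrative only."""
    source: str
    name: str
    severity: str      # normalized to "critical", "warning" or "info"
    resource: str
    fired_at: datetime


# Assumed severity vocabularies of two made-up alert sources.
_SEVERITY_MAP = {
    "P1": "critical", "P2": "warning", "P3": "info",
    "page": "critical", "ticket": "warning",
}


def from_cloud_a(payload: dict) -> UnifiedAlert:
    """Map an invented 'cloud A' webhook payload onto the common schema."""
    return UnifiedAlert(
        source="cloud-a",
        name=payload["alarmName"],
        severity=_SEVERITY_MAP.get(payload["level"], "info"),
        resource=payload["instanceId"],
        # This source is assumed to report epoch milliseconds.
        fired_at=datetime.fromtimestamp(payload["timestampMs"] / 1000, tz=timezone.utc),
    )


def from_self_built(payload: dict) -> UnifiedAlert:
    """Map an invented label-based payload from the self-built platform."""
    labels = payload["labels"]
    return UnifiedAlert(
        source="self-built",
        name=labels["alertname"],
        severity=_SEVERITY_MAP.get(labels.get("severity", ""), "info"),
        resource=labels.get("instance", "unknown"),
        # This source is assumed to report ISO 8601 with an explicit offset.
        fired_at=datetime.fromisoformat(payload["startsAt"]),
    )
```

Once both sources produce the same record type, global controls such as routing, silencing, and statistics only need to be written once against that type.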
2021 Foundation Building: Establishing Platform Management Capabilities
The 2021 work focused on running monitoring as a platform: it completed the shift from “scattered tools” to a “unified platform” and built four core capabilities:
Core System Construction
- Service Discovery & Registration System: Cross-domain distributed resource discovery and registration, using a Sidecar pattern to integrate with CMDB and CICD and automatically manage all monitorable resources
- Unified Alert Platform: Full lifecycle management of alert ingestion, normalization, routing, notification, and archiving
- Basic Control Capabilities: Alert silencing/inhibition, statistical analysis, strategy management, route distribution system
- Visualization Frontend: Operations-oriented monitoring management and alert processing portal
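One common way to hand discovered resources to Prometheus is its file-based service discovery, which reads JSON lists of target groups, each with `targets` and `labels`. The sketch below assumes a hypothetical CMDB record layout (`ip`, `port`, `service`, `owner`, `cloud`, `region`, `monitored`) and is not the platform's actual sidecar; it only illustrates how registered resources might become scrape targets that already carry owner and location labels.

```python
from __future__ import annotations

import json


def cmdb_to_file_sd(records: list[dict], default_port: int = 9100) -> list[dict]:
    """Group CMDB-style records into Prometheus file_sd target groups."""
    groups: dict[tuple, list[str]] = {}
    for rec in records:
        # Hypothetical flag: resources can opt out of monitoring in the CMDB.
        if not rec.get("monitored", True):
            continue
        key = (rec["service"], rec["owner"], rec["cloud"], rec["region"])
        address = f'{rec["ip"]}:{rec.get("port", default_port)}'
        groups.setdefault(key, []).append(address)
    # One target group per (service, owner, cloud, region) combination.
    return [
        {
            "targets": sorted(targets),
            "labels": {"service": s, "owner": o, "cloud": c, "region": r},
        }
        for (s, o, c, r), targets in sorted(groups.items())
    ]


if __name__ == "__main__":
    sample = [
        {"ip": "10.0.0.2", "service": "checkout", "owner": "team-pay",
         "cloud": "cloud-a", "region": "eu-1"},
        {"ip": "10.1.0.9", "port": 9104, "service": "orders-db", "owner": "team-db",
         "cloud": "cloud-b", "region": "ap-1"},
    ]
    print(json.dumps(cmdb_to_file_sd(sample), indent=2))
```

Attaching `owner`, `cloud`, and `region` as labels at discovery time is a design choice of this sketch: it lets later stages such as alert routing read the responsible team directly from the alert's labels.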
2021 Monitoring Platform Core Architecture
graph TB
subgraph Resources["Resources & Ecosystem Layer"]
A@{ shape: doc, label: "Hosts/Containers/Middleware/Databases" }
B@{ shape: doc, label: "CMDB/CICD Third-party Systems" }
end
subgraph Collection["Collection & Discovery Layer"]
C@{ shape: doc, label: "Service Discovery & Registration System" }
D@{ shape: cyl, label: "Distributed Prometheus Cluster" }
E@{ shape: doc, label: "Exporter/Sidecar Collection Components" }
end
subgraph Alert["Alert Central Layer"]
F@{ shape: hex, label: "Alertmanager" }
G@{ shape: doc, label: "Alert Routing System" }
H@{ shape: doc, label: "Alert Strategy/Silence/Statistics" }
end
subgraph Notification["Notification & Display Layer"]
I@{ shape: doc, label: "Enterprise IM/SMS" }
J@{ shape: doc, label: "Personal/Group Subscriptions" }
K@{ shape: doc, label: "Monitoring Management WEB UI" }
L@{ shape: doc, label: "Grafana Visualization" }
end
A --> E
B --> C
C --> D
D --> F
F --> G
G --> H
H --> I
H --> J
D --> L
H --> K
classDef monitor fill:#e3f2fd,stroke:#1976d2
classDef storage fill:#e8f5e9,stroke:#4caf50
classDef alert fill:#fce4ec,stroke:#e53935
class A,B,C,E,G,H,I,J,K,L monitor
class D storage
class F alert
H1 2022 Planning: R&D-Facing One-Stop Monitoring Platform
The core goal for H1 2022: Create a one-stop, frictionless, easy-to-use, and efficient monitoring experience for R&D teams, shielding the complexity of multi-cloud, cross-region, and isolated environments. Core focus areas:
Core Design Principles
- OneIn: Unified entry point, no need to care about environment, data center, or cloud provider differences
- Coverage: CICD+CMDB full-chain automated coverage, frictionless monitoring integration
- Self-service: Full-process self-service operations, realizing “Your Data, Your Rules”
- Observability: Metrics+Tracing+Logging convergence
Core Feature Delivery
- One-Stop Monitoring Portal: Unified monitoring of hosts, containers, databases, and middleware across all scenarios, with quick resource search by IP, business info, cloud provider, and region.
- Minute-Level Instance Monitoring Setup: New resources are monitored within minutes of creation, with collection and alert binding set up automatically and no manual configuration.
- Intuitive Metric Visualization: Multi-dimensional real-time charts for CPU, memory, disk, network, performance, connection count, and more, with custom time ranges and refresh frequencies.
- Quick Owner Association: Resources are bound automatically to their R&D/operations owners, so the responsible person can be identified quickly during an incident.
- One-Stop Alert Configuration: A frontend with Simple and Advanced modes for alert configuration, supporting metric registration, rule group creation, and self-managed policies.
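As an illustration of what a Simple mode could do behind the form, the sketch below turns a few form fields into one Prometheus alerting rule. The form field names and the metric name `node_cpu_usage_percent` are hypothetical; only the output keys (`alert`, `expr`, `for`, `labels`, `annotations`) follow the Prometheus alerting rule format. How the real frontend generates rules is not described in this article.

```python
from __future__ import annotations

# Simple mode is assumed to offer only these two comparisons.
_OPERATORS = {"above": ">", "below": "<"}


def simple_rule_to_prometheus(form: dict) -> dict:
    """Translate a simple-mode form into one Prometheus alerting rule."""
    op = _OPERATORS[form["condition"]]
    # Build the label selector, e.g. service="checkout".
    selector = ",".join(
        f'{key}="{value}"' for key, value in sorted(form.get("filters", {}).items())
    )
    expr = f'{form["metric"]}{{{selector}}} {op} {form["threshold"]}'
    return {
        "alert": form["name"],
        "expr": expr,
        "for": f'{form["duration_minutes"]}m',
        "labels": {"severity": form["severity"], "owner": form["owner"]},
        "annotations": {
            # Quadruple braces emit a literal {{ ... }} template for Prometheus.
            "summary": f'{form["name"]} on {{{{ $labels.instance }}}}',
        },
    }


if __name__ == "__main__":
    rule = simple_rule_to_prometheus({
        "name": "HighCpuUsage",
        "metric": "node_cpu_usage_percent",   # hypothetical metric name
        "filters": {"service": "checkout"},
        "condition": "above",
        "threshold": 90,
        "duration_minutes": 5,
        "severity": "warning",
        "owner": "team-pay",
    })
    print(rule["expr"])
```

An Advanced mode would presumably let users write the expression themselves; the value of Simple mode in this sketch is that users never touch the query language.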
H2 2022 Planning: Full-Link Observability and Intelligent Monitoring Blueprint
H2 focuses on full-chain capability completion and intelligent upgrade, breaking through traditional monitoring boundaries to build the next-generation observability platform:
Full-Scenario Observability Capabilities
- Connect the full chain of reporting services, queue services, consumer services, distributed storage, and metadata management
- Integrate business monitoring, APM application monitoring, client-side monitoring, and synthetic monitoring with deep call chain linkage
- Cover WEB + mobile endpoints, enabling monitoring viewing and alert handling anytime, anywhere
Intelligent Alert Evolution
- Solve alert coverage challenges for complex middleware like Kafka, optimize multi-scenario alert strategies
- Explore threshold-free intelligent alerting: Automatically infer anomaly thresholds based on seasonal data patterns
- Reuse intelligent models, extend to more business scenarios, build generalized intelligent alerting capabilities
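The idea of inferring thresholds from seasonal patterns can be sketched with a very simple baseline: group historical values by their position in the cycle (here, hour of day), then flag a new value that deviates from that slot's mean by more than `k` standard deviations. This is one basic technique chosen for illustration, not the model the platform uses; the function names, the 24-hour period, and `k = 3` are all assumptions.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import mean, pstdev


def build_seasonal_baseline(history: list[tuple[int, float]], period: int = 24) -> dict:
    """history holds (hour_index, value) pairs; returns {slot: (mean, std)}."""
    buckets = defaultdict(list)
    for hour_index, value in history:
        buckets[hour_index % period].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in buckets.items()}


def is_anomalous(baseline: dict, hour_index: int, value: float,
                 k: float = 3.0, period: int = 24, min_std: float = 1e-9) -> bool:
    """Flag values more than k standard deviations from the slot's mean."""
    mu, sigma = baseline[hour_index % period]
    # min_std avoids a zero-width band when a slot's history is constant.
    return abs(value - mu) > k * max(sigma, min_std)


if __name__ == "__main__":
    # Synthetic history: busy during hours 9-17, quiet otherwise, for 14 days.
    history = [
        (day * 24 + hour, (100.0 if 9 <= hour < 18 else 10.0) + day % 3)
        for day in range(14) for hour in range(24)
    ]
    baseline = build_seasonal_baseline(history)
    print(is_anomalous(baseline, 14 * 24 + 12, 100.0))  # noon, usual load
    print(is_anomalous(baseline, 14 * 24 + 3, 100.0))   # 3 a.m., same value
```

The same reading of 100 is normal at noon and anomalous at 3 a.m., which is exactly what a single static threshold cannot express.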
Full-Stack Capability Extension
Add capacity planning, task scheduling, network monitoring and other capabilities, forming a complete observability closed loop of collection-storage-analysis-alert-self-healing-planning.
Solving Core Questions: Let Monitoring Return to Business Itself
Platform iteration always revolves around four core questions, shielding users from underlying complexity:
- Where are the resources and services? The service discovery system manages heterogeneous resources across 64 isolated environments, automatically sensing the location of all services.
- Where are the collected metrics? Distributed edge autonomy architecture shields differences across 122 Prometheus nodes; users don’t need to care about collection node assignments.
- How are alert events accurately delivered? Alert lifecycle management combined with noise reduction and rate limiting delivers only the actionable information out of more than 220,000 weekly alert events.
- How can I quickly observe my services? One-stop view, quickly locating your own service data from 324.9 billion daily metrics.
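The noise reduction and rate limiting mentioned for alert delivery can be sketched as two checks before a notification goes out: drop repeats of the same alert within a deduplication window, and cap how many notifications one collaboration group receives per time window. The class name, parameters, and default values below are assumptions for illustration; the article does not describe the platform's actual policy.

```python
from __future__ import annotations


class AlertThrottle:
    """Suppress duplicate alerts and cap notifications per group."""

    def __init__(self, dedup_window_s: float = 300.0,
                 max_per_group: int = 5, rate_window_s: float = 60.0):
        self.dedup_window_s = dedup_window_s
        self.max_per_group = max_per_group
        self.rate_window_s = rate_window_s
        self._last_sent = {}     # fingerprint -> time of last notification
        self._group_sends = {}   # group -> recent notification times

    def should_send(self, group: str, alert_name: str, resource: str, now: float) -> bool:
        fingerprint = (group, alert_name, resource)
        last = self._last_sent.get(fingerprint)
        if last is not None and now - last < self.dedup_window_s:
            return False  # same alert was already sent recently
        # Keep only sends that still fall inside the rate window.
        recent = [t for t in self._group_sends.get(group, [])
                  if now - t < self.rate_window_s]
        if len(recent) >= self.max_per_group:
            self._group_sends[group] = recent
            return False  # group has reached its notification budget
        recent.append(now)
        self._group_sends[group] = recent
        self._last_sent[fingerprint] = now
        return True
```

Timestamps are passed in as `now` rather than read from the clock, which keeps the sketch deterministic and easy to test. A production system would also need to record suppressed events so that they remain traceable in the alert lifecycle.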
Summary: From Tool to Platform, From Operations to Everyone
2022 is the key year for the monitoring platform to transform from an operations tool to a productized platform available to everyone. With usability, effectiveness, transparency, and universality at its core, the platform achieves automated monitoring coverage, self-service usage, intelligent alerting, and full-chain observability in the complex scenario of massive multi-cloud heterogeneity.