From Bottleneck Breakthrough to Platform Governance — The Full Evolution of an Internet Company's Monitoring Platform Architecture
Amid rapid business expansion, multi-cloud deployment, and exponential asset growth, the monitoring platform is critical infrastructure for service stability. This article reviews how a major internet company's monitoring platform evolved from 2019 to 2021: breaking through the performance bottlenecks of legacy monitoring, implementing cross-cloud distributed monitoring, and moving to cloud-native platform governance. Together these stages trace the path from a 0-to-1 build, through large-scale expansion, to platform governance.
Evolution Overview: Three Major Steps in Three Years, Anchoring Core Goals
The three-year evolution of the monitoring platform was driven by four core requirements (business growth, multi-cloud heterogeneity, fault self-healing, and usability and efficiency) and was completed in three phases:
- 2019 (Breakthrough): Replaced Zabbix+MySQL legacy architecture, completing monitoring platform 0-1 implementation
- 2020 (Expansion): Cross-cloud integration, full-link monitoring, self-built probing, filling the user-side monitoring gap
- 2021 (Governance): Cloud-native transformation, platform closed-loop, usability upgrades, achieving full monitoring lifecycle management
2019: The Breakthrough Year — Monitoring Platform 0-1, Solving Core Bottlenecks
Core Pain Points
- Business monitoring reports reached the millions; the sharded Zabbix+MySQL setup hit its performance limits, and monitoring alerting was close to failing
- Cloud vendor database monitoring APIs lost data and could not be adapted to Prometheus's real-time pull model
- Integrating the monitoring SDK carried a high learning cost for business teams, and the surge in assets had no unified classification standard
- The legacy architecture could not monitor K8s clusters and so could not keep up with the containerization trend
Core Solutions
- Technology replacement: Introduced Prometheus to replace Zabbix, self-developed business Exporter for monitoring reporting
- Data compatibility: Used InfluxDB+Grafana for cloud vendor DB monitoring data, fixed API data loss issues
- Log support: Introduced ELK stack for business statistics and reporting needs
- Multi-read scaling: Integrated Thanos to serve scenarios where monitoring data is read by many consumers, leaving Prometheus to focus on collection and alerting
- Service discovery: Used Consul for resource registration and information retrieval
2019 Monitoring Platform Core Architecture
graph TD
%% Collection phase
A[Cloud Vendor API] --> B[Cloud API Collector] --> C@{shape: cyl, label: "InfluxDB"}
D[Business/Host Exporter] --> E[Prometheus]
F[Consul] -->|"Resource Registration/Service Discovery"| E
%% Data processing
E --> G[Thanos Sidecar] --> H@{shape: cyl, label: "S3 Object Storage"}
C --> I[Grafana]
E --> I
%% Event generation
E --> J[Alert Push] --> K[Enterprise IM]
I --> J
classDef primary fill:#e3f2fd,stroke:#1976d2
classDef storage fill:#e8f5e9,stroke:#4caf50
classDef alert fill:#fce4ec,stroke:#e53935
classDef process fill:#f3e5f5,stroke:#7b1fa2
class C,H storage
class E,G,J,K alert
class B,D,I process
class A,F primary
Phase Remaining Issues
- Migrating cloud vendor DB monitoring alerts back to Prometheus took a long time
- The architecture supported only 400+ assets; once the asset count exploded, weekly report generation performed extremely poorly
- Multiple technology stacks running in parallel, high maintenance costs
2020: The Expansion Year — Cross-Cloud Distributed + Full-Stack Capability Completion
Core Pain Points
- Root cause analysis of business faults was difficult, and troubleshooting with the legacy log system was extremely inefficient
- Resources across multiple cloud vendors had no dedicated intranet lines between them, so unified monitoring was not possible
- User-side last-mile monitoring was missing, and third-party probing services were too expensive
- After the alert channel was switched, legacy templates could not be reused and alert delivery was inefficient
Core Solutions
- Full-link monitoring: Introduced distributed tracing system and lightweight logging system for fault localization
- Cross-cloud integration: Built cross-region distributed monitoring over the public network on a Mesh technology stack, with Ansible for unified node management
- Monitoring completion: Self-built black-box probing system, replacing expensive third-party services, covering URL/certificate/network quality monitoring
- Self-developed alerting: Built alert hub system, integrating with CMDB for targeted push, supporting enterprise IM/SMS
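The certificate check in the self-built black-box probing system could look roughly like the sketch below. It is an illustration and not the company's implementation: the 30-day warning window, the function names, and the three status strings are assumptions. Only `fetch_not_after` touches the network; the date arithmetic is kept separate so it can be checked offline.

```python
import socket
import ssl


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter string.

    Requires network access; the certificate is verified against system CAs.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a notAfter timestamp string."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / 86400.0


def classify(days_left: float, warn_days: float = 30.0) -> str:
    """Map remaining days to a probe status (thresholds are assumptions)."""
    if days_left <= 0:
        return "expired"
    if days_left <= warn_days:
        return "warning"
    return "ok"
```

A probe node would call `fetch_not_after` for each monitored host on a schedule, pass the result with `time.time()` to `days_until_expiry`, and report the value as a metric so alert thresholds stay in the alerting layer.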
2020 Monitoring Platform Core Architecture
graph TD
%% Collection upgrade
A[Cloud Vendor API] --> B[Cloud API Collector] --> C@{shape: cyl, label: "InfluxDB"}
D[Cloud Exporter/Kafka Exporter] --> E[Prometheus]
F[Consul] --> E
%% Data processing
E --> G[Thanos Sidecar] --> H@{shape: cyl, label: "S3 Object Storage"}
E --> I[Thanos Query/Rule] --> J[Cluster Statistics Events]
%% Alert hub
E --> K[Self-developed Alert System] --> L[Enterprise IM/SMS]
C --> M[Grafana] --> K
classDef primary fill:#e3f2fd,stroke:#1976d2
classDef storage fill:#e8f5e9,stroke:#4caf50
classDef alert fill:#fce4ec,stroke:#e53935
classDef process fill:#f3e5f5,stroke:#7b1fa2
class C,H storage
class E,G,J,K,L alert
class B,D,M process
class A,F primary
Cross-Region Mesh Distributed Architecture
graph LR
%% Multi-region probing/monitoring nodes
N1[Cloud Beijing Node] --> P[59080 Mesh Port]
N2[Cloud Guangzhou Node] --> P
N3[Cloud Singapore Node] --> P
N4[Cloud Shanghai Node] --> P
P --> Q[Prometheus Cluster]
Q --> R[Thanos Global Aggregation]
classDef primary fill:#e3f2fd,stroke:#1976d2
classDef network fill:#fff3e0,stroke:#ff9800
classDef alert fill:#fce4ec,stroke:#e53935
class N1,N2,N3,N4 primary
class P network
class Q,R alert
Phase Remaining Issues
- Distributed tracing cost more than its business value justified
- Without a distributed management system, the Mesh architecture grew more complex
- A bastion host upgrade broke Ansible-based unified management
- Multi-cloud vendor adaptation consumed significant manpower
2021: The Governance Year — Cloud-Native Platformization, Achieving Monitoring Closed Loop
Core Objectives
Shift from fragmented operations to platform-based management, complete the cloud-native transformation, and improve monitoring coverage, alert governance efficiency, and usability.
Core Construction Content
- Alert closed loop: Developed alert silencing, statistical analysis, policy management, and routing distribution systems
- Service discovery: Integrated with CMDB/CICD, automatic resource registration, host monitoring coverage reached 90.8%
- Cloud-native transformation: Full migration to K8s cluster, elastic scaling based on HPA
- Performance optimization: Added caching with an LRU policy and dimension-indexed queries to Thanos Store, improving query performance
- Usability upgrade: Built a web-based visual management backend, supporting alert suppression from mobile devices and adjustable thresholds
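To make the caching idea in the performance optimization concrete, here is a small sketch of an LRU cache whose entries also expire after a fixed window. It is not Thanos Store's internal cache; the class name, the entry limit, and the injectable clock are assumptions chosen so the behavior is easy to follow and test.

```python
import time
from collections import OrderedDict


class LRUTTLCache:
    """LRU cache whose entries also expire after `ttl_seconds`."""

    def __init__(self, max_entries: int, ttl_seconds: float,
                 clock=time.monotonic):
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so tests can fake time
        self._data = OrderedDict()     # key -> (expires_at, value)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._data[key]
            return None
        # Mark as most recently used.
        self._data.move_to_end(key)
        return value

    def put(self, key, value) -> None:
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        # Evict least recently used entries beyond the size limit.
        while len(self._data) > self._max:
            self._data.popitem(last=False)
```

With `ttl_seconds=3600`, a repeated query inside the hour is answered from memory, while older results fall out either by age or by LRU eviction once the entry limit is reached.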
2021 Platform Monitoring Architecture
graph TD
%% Collection layer
A[Exporter/Sidecar] --> B[Prometheus]
C[CMDB System] --> D[Service Discovery Registration] --> B
%% Data layer
B --> E[Thanos Cluster] --> F@{shape: cyl, label: "S3 Object Storage + Cache"}
E --> G[Grafana/WEB UI]
%% Alert hub (Post Office)
B --> H[Alertmanager] --> I[Self-developed Alert Platform]
I --> J[Alert Suppression/Routing/Policy]
J --> K[Enterprise IM/SMS/Personal Subscription]
classDef primary fill:#e3f2fd,stroke:#1976d2
classDef storage fill:#e8f5e9,stroke:#4caf50
classDef alert fill:#fce4ec,stroke:#e53935
classDef process fill:#f3e5f5,stroke:#7b1fa2
class F storage
class B,E,H,I,J,K alert
class A,C,D,G process
Cloud-Native K8s Cluster Architecture
graph TD
A[K8s Cluster] --> B[Prometheus]
B --> C[Thanos Sidecar] --> D@{shape: cyl, label: "S3 Object Storage"}
B --> E[Thanos Query]
E --> F[Thanos Store] --> D
E --> G[Thanos Compact]
F --> H[Grafana Frontend]
%% Elastic capabilities
B --> I[HPA Auto-scaling]
F --> J[LRU Cache/60-minute Data Cache]
classDef primary fill:#e3f2fd,stroke:#1976d2
classDef storage fill:#e8f5e9,stroke:#4caf50
classDef alert fill:#fce4ec,stroke:#e53935
classDef process fill:#f3e5f5,stroke:#7b1fa2
class D storage
class B,E,F,G,H alert
class A,I,J process
Phase Core Achievements
- Resource monitoring coverage approached 100%, with auto-discovery and no manual configuration
- Alert processing efficiency improved significantly, with support for silencing from mobile devices and visual threshold adjustment
- The cloud-native architecture supports thousands of assets, resolving the earlier performance bottlenecks
- Full monitoring process platformized, reducing SRE daily operations costs
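The alert closed loop from this phase (silencing, then routing through CMDB ownership) can be pictured with a short sketch. The data shapes here are assumptions for illustration: the `Silence` class, the `service` label, the CMDB owner mapping, and the `sre-oncall` fallback receiver are all hypothetical and not taken from the platform itself.

```python
from dataclasses import dataclass


@dataclass
class Silence:
    """A time-bounded silence that applies when all matchers equal the labels."""
    matchers: dict      # label name -> required value
    starts_at: float    # epoch seconds, inclusive
    ends_at: float      # epoch seconds, exclusive

    def covers(self, labels: dict, now: float) -> bool:
        if not (self.starts_at <= now < self.ends_at):
            return False
        return all(labels.get(k) == v for k, v in self.matchers.items())


def route(alert_labels: dict, silences: list, cmdb_owners: dict,
          now: float, default_receiver: str = "sre-oncall"):
    """Return the receiver for an alert, or None if an active silence matches."""
    if any(s.covers(alert_labels, now) for s in silences):
        return None
    # Look up the owning team by service; fall back to a default receiver.
    return cmdb_owners.get(alert_labels.get("service"), default_receiver)
```

Keeping silencing ahead of routing means a suppressed alert never reaches the delivery channels, and an alert for a service missing from the CMDB still lands with a default receiver.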
Core Technology Evolution Summary
| Dimension | 2019 (Initial) | 2020 (Expansion) | 2021 (Platform) |
|---|---|---|---|
| Monitoring Engine | Zabbix+MySQL | Prometheus+Thanos | Prometheus+Thanos+K8s |
| Deployment Architecture | Single datacenter single machine | Cross-cloud Mesh distributed | Cloud-native containerized |
| Data Storage | Single-node MySQL sharding | TSDB+S3 Object Storage | Cache + dimension-index + object storage |
| Alert System | Fragmented push | Self-developed alert hub | Alert closed loop + policy governance + visualization |
| Monitoring Capability | Basic metric monitoring | Full-link + probing + logs | Full-scenario + automated + platformized |
Evolution Value and Summary
- Break performance bottleneck: Completely solved the core problem of legacy monitoring architecture unable to support business growth
- Fill monitoring gaps: Self-built probing completed user-side last-mile monitoring, shifting from passive fault reporting to proactive discovery
- Reduce cost and improve efficiency: Replaced expensive third-party services, self-developed system adapted to business customization needs
- Cloud-native upgrade: K8s + platformization achieved monitoring system scalability, maintainability, and usability
- Full-link closed loop: Metrics + logs + tracing + probing integration, forming a complete stability assurance system
The evolution of this monitoring platform is a typical case of business driving technology and technology supporting business at an internet company. It moved from emergency fixes for single-point problems to comprehensive, platform-based stability infrastructure, and offers a reference for implementing monitoring in large-scale, multi-cloud, containerized environments.