From Bottleneck Breakthrough to Platform Governance — The Full Evolution of an Internet Company's Monitoring Platform Architecture

January 10, 2022 Architecture Monitoring Platform, Architecture Evolution, Architecture Observability Series

As internet businesses expand rapidly, deploy across multiple clouds, and see their asset counts grow exponentially, the monitoring platform becomes critical infrastructure for service stability. This article reviews how a major internet company's monitoring platform evolved from 2019 to 2021: first resolving the performance bottlenecks of the legacy monitoring stack, then implementing cross-cloud distributed monitoring, and finally moving to cloud-native platform governance. Together these phases cover the full path from the 0-to-1 build, through large-scale expansion, to platform governance.

Evolution Overview: Three Major Steps in Three Years, Anchoring Core Goals

The three-year evolution of the monitoring platform was driven by four core requirements (business growth, multi-cloud heterogeneity, fault self-healing, and ease of use and efficiency) and was completed in three phases:

2019 (Breakthrough): Replaced the legacy Zabbix+MySQL architecture, completing the platform's 0-to-1 build
2020 (Expansion): Added cross-cloud integration, full-link monitoring, and self-built probing, closing the user-side monitoring gap
2021 (Governance): Completed the cloud-native transformation, a closed-loop platform, and usability upgrades, achieving full-lifecycle monitoring management

2019: The Breakthrough Year — Monitoring Platform 0-1, Solving Core Bottlenecks

Core Pain Points

Business monitoring reports reached the million level, the sharded Zabbix+MySQL setup hit its performance limits, and monitoring alerts were close to failing
Cloud vendor database monitoring APIs lost data and could not fit Prometheus's real-time pull model
Business teams faced a high learning cost to integrate the monitoring SDK, and assets surged without a unified classification standard
The legacy architecture could not monitor K8s clusters and so could not keep up with the containerization trend

Core Solutions

Technology replacement: Introduced Prometheus to replace Zabbix and developed an in-house business Exporter for metric reporting
Data compatibility: Stored cloud vendor DB monitoring data in InfluxDB with Grafana for display, which fixed the API data loss issue
Log support: Introduced the ELK stack for business statistics and reporting needs
Multi-read scaling: Integrated Thanos to serve read-heavy query scenarios, leaving Prometheus focused on collection and alerting
Service discovery: Used Consul for resource registration and lookup
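
The article does not describe the self-developed business Exporter in detail. As a rough sketch only, assuming it exposes counters over HTTP in the Prometheus text exposition format, something like the following would be enough for Prometheus to scrape. The metric name `business_orders_total`, the `region` label, and port 8000 are hypothetical and not taken from the original platform.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counters keyed by (metric name, label pairs).
# A real exporter would update these from application code.
METRICS = {
    ("business_orders_total", (("region", "beijing"),)): 42.0,
}


def render_metrics(metrics):
    """Render samples in the Prometheus text exposition format."""
    lines = []
    seen = set()
    for (name, labels), value in sorted(metrics.items()):
        if name not in seen:
            # Emit one TYPE line per metric family before its samples.
            lines.append(f"# TYPE {name} counter")
            seen.add(name)
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block and serve /metrics on the given (assumed) port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; Prometheus would then pull `/metrics` on its normal scrape interval, which matches the pull model the platform moved to.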

2019 Monitoring Platform Core Architecture

mermaid
graph TD
    A["Cloud API / Exporter / Consul"] --> B["Prometheus<br/>+ Cloud API collector"]
    B --> C@{shape: cyl, label: "InfluxDB / S3<br/>Thanos Sidecar"}
    B --> D["Grafana display"]
    B --> E["Alert push → Enterprise IM"]
    C --> D
    classDef primary fill:#e3f2fd,stroke:#1976d2,color:#0d47a1
    classDef storage fill:#e8f5e9,stroke:#4caf50,color:#1b5e20
    classDef alert fill:#fce4ec,stroke:#e53935,color:#b71c1c
    classDef process fill:#f3e5f5,stroke:#7b1fa2,color:#4a148c
    class A primary
    class C storage
    class B,D process
    class E alert
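
The Consul node in the diagram above handles resource registration. As an illustration, assuming each exporter registers itself through Consul's agent service registration endpoint (`PUT /v1/agent/service/register`), the payload could be built as below. The service name, tags, and metadata keys are hypothetical, as is the choice of a `/metrics` health check.

```python
import json
import urllib.request


def build_registration(name, address, port, tags, meta):
    """Build a Consul service registration payload for one exporter."""
    return {
        "ID": f"{name}-{address}-{port}",  # unique per instance
        "Name": name,
        "Address": address,
        "Port": port,
        "Tags": list(tags),
        "Meta": dict(meta),
        # Assumed health check: Consul polls the exporter's own endpoint.
        "Check": {
            "HTTP": f"http://{address}:{port}/metrics",
            "Interval": "15s",
            "Timeout": "5s",
        },
    }


def register(consul_url, payload):
    """Send the payload to a Consul agent; returns the HTTP status code."""
    req = urllib.request.Request(
        f"{consul_url}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

Tags and metadata registered this way are the natural place to carry an asset classification, which is one way to address the missing classification standard listed among the pain points.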

Remaining Issues After This Phase

Migrating cloud vendor DB monitoring alerts back to Prometheus took a long time
The architecture only scaled to 400+ assets, and weekly report performance became extremely poor once asset counts exploded
Multiple technology stacks ran in parallel, driving up maintenance costs

2020: The Expansion Year — Cross-Cloud Distributed + Full-Stack Capability Completion

Core Pain Points

Root cause analysis of business faults was difficult, and troubleshooting with the legacy log system was extremely inefficient
Resources across cloud vendors had no dedicated intranet lines between them, so unified monitoring was not possible
User-side last-mile monitoring was missing, and third-party probing services were too expensive
After alert channels were switched, legacy templates could not be reused and alert delivery was inefficient

Core Solutions

Full-link monitoring: Introduced a distributed tracing system and a lightweight logging system for fault localization
Cross-cloud integration: Built cross-region distributed monitoring over the public network on a Mesh technology stack, with Ansible for unified node management
Monitoring completion: Built an in-house black-box probing system to replace expensive third-party services, covering URL, certificate, and network quality monitoring
Self-developed alerting: Built an alert hub that integrates with the CMDB for targeted push and supports enterprise IM and SMS
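
To make the certificate part of black-box probing concrete, here is a minimal sketch of an expiry check using only the Python standard library. It is not the platform's actual prober; the 30-day warning threshold and the function names are assumptions for illustration.

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout=5.0):
    """Open a TLS connection and return the certificate's notAfter string."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Days remaining, given a notAfter string such as 'Jan 11 00:00:00 2030 GMT'."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires_at - now) / 86400


def classify(days_left, warn_days=30):
    """Map remaining days to a probe status (threshold is an assumption)."""
    if days_left < 0:
        return "expired"
    if days_left < warn_days:
        return "expiring"
    return "ok"
```

A probe loop would call `fetch_not_after` for each monitored host, then export `days_until_expiry` as a gauge so that alerting rules, rather than the prober itself, decide when to notify.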

2020 Monitoring Platform Core Architecture

mermaid
graph TD
    A["Cloud API / Cloud Exporter / Consul"] --> B["Prometheus"]
    B --> C@{shape: cyl, label: "S3 + InfluxDB<br/>Thanos Query/Rule"}
    B --> D["Self-developed alert system<br/>→ Enterprise IM/SMS"]
    C --> E["Grafana display"]
    C --> D
    classDef primary fill:#e3f2fd,stroke:#1976d2,color:#0d47a1
    classDef storage fill:#e8f5e9,stroke:#4caf50,color:#1b5e20
    classDef alert fill:#fce4ec,stroke:#e53935,color:#b71c1c
    classDef process fill:#f3e5f5,stroke:#7b1fa2,color:#4a148c
    class A primary
    class C storage
    class B,E process
    class D alert
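
The self-developed alert system in the diagram above pushes alerts to specific owners based on CMDB data. The sketch below shows one way such targeted routing could work, assuming the CMDB can be queried by the alert's `instance` label. The `Owner` fields, the sample CMDB entries, and the rule that only critical alerts go to SMS are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Owner:
    team: str
    im_group: str
    phone: str


# Hypothetical CMDB snapshot keyed by instance; a real system would query the CMDB.
CMDB = {
    "10.0.0.8:8000": Owner("trade", "im-trade-oncall", "phone-of-trade-oncall"),
}
# Alerts for unknown assets fall back to the SRE group (assumed policy).
FALLBACK = Owner("sre", "im-sre-oncall", "phone-of-sre-oncall")


def route_alert(alert, cmdb=CMDB, fallback=FALLBACK):
    """Return a list of (channel, target) pairs for one alert."""
    labels = alert["labels"]
    owner = cmdb.get(labels.get("instance"), fallback)
    targets = [("im", owner.im_group)]
    # Assumed escalation rule: only critical alerts also go out by SMS.
    if labels.get("severity") == "critical":
        targets.append(("sms", owner.phone))
    return targets
```

Keeping the routing decision as a pure function of alert labels and CMDB data makes it straightforward to swap delivery channels without rewriting templates, which was one of the pain points of this phase.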

Cross-Region Mesh Distributed Architecture

mermaid
graph TD
    N["Multi-region nodes<br/>Beijing / Guangzhou / Singapore / Shanghai"] --> P["59080 Mesh port"]
    P --> Q["Prometheus cluster"]
    Q --> R["Thanos global aggregation"]
    classDef primary fill:#e3f2fd,stroke:#1976d2,color:#0d47a1
    classDef network fill:#fff3e0,stroke:#ff9800,color:#e65100
    classDef alert fill:#fce4ec,stroke:#e53935,color:#b71c1c
    class N primary
    class P network
    class Q,R alert

Remaining Issues After This Phase

The cost of distributed tracing was out of proportion to its business value
There was no distributed management system, and the Mesh architecture grew more complex
A bastion host upgrade broke Ansible-based unified management
Adapting to multiple cloud vendors consumed significant manpower

2021: The Governance Year — Cloud-Native Platformization, Achieving Monitoring Closed Loop

Core Objectives

Shift from fragmented operations to platform-based management, complete the cloud-native transformation, and improve monitoring coverage, alert governance efficiency, and usability.

Core Construction Content

Alert closed loop: Developed alert silencing, statistical analysis, policy management, and routing and distribution systems
Service discovery: Integrated with CMDB and CI/CD for automatic resource registration; host monitoring coverage reached 90.8%
Cloud-native transformation: Fully migrated to K8s clusters with HPA-based elastic scaling
Performance optimization: Added caching with an LRU policy and dimension-indexed queries to Thanos Store, improving query performance
Usability upgrade: Built a web-based management console that supports alert suppression from mobile devices and adjustable thresholds
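
The LRU policy mentioned for the query cache can be illustrated with a short sketch. This is a generic least-recently-used cache, not Thanos's own implementation; keying entries by metric name plus label pairs is an assumption meant to echo the dimension-indexed queries described above.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        # A hit makes the entry the most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            # Drop the oldest entry, which sits at the front.
            self._items.popitem(last=False)


# Hypothetical usage: cache query results by (metric, sorted label pairs).
cache = LRUCache(capacity=2)
cache.put(("http_requests_total", (("region", "beijing"),)), [1.0, 2.0])
```

Dashboards tend to repeat the same queries, so even a small cache of this shape can absorb a large share of reads before they reach object storage.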

2021 Platform Monitoring Architecture

mermaid
graph TD
    A["Exporter/Sidecar<br/>+ CMDB service discovery"] --> B["Prometheus"]
    B --> C@{shape: cyl, label: "Thanos cluster<br/>S3 + cache"}
    C --> D["Grafana / WEB UI"]
    B --> E["Alertmanager → self-developed alert platform<br/>suppression/routing/policy → IM/SMS"]
    classDef primary fill:#e3f2fd,stroke:#1976d2,color:#0d47a1
    classDef storage fill:#e8f5e9,stroke:#4caf50,color:#1b5e20
    classDef alert fill:#fce4ec,stroke:#e53935,color:#b71c1c
    classDef process fill:#f3e5f5,stroke:#7b1fa2,color:#4a148c
    class A primary
    class C storage
    class B,D process
    class E alert
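
The suppression step in the alert platform above can be sketched as label matching inside a time window. The `Silence` structure and the exact-match semantics below are assumptions for illustration; the original platform's policy model is not described in the article.

```python
from dataclasses import dataclass


@dataclass
class Silence:
    matchers: dict    # label name -> required value (exact match assumed)
    starts_at: float  # Unix timestamp, inclusive
    ends_at: float    # Unix timestamp, exclusive


def is_silenced(labels, silences, now):
    """Return True if any active silence matches all of its labels."""
    for silence in silences:
        active = silence.starts_at <= now < silence.ends_at
        if active and all(labels.get(k) == v for k, v in silence.matchers.items()):
            return True
    return False
```

A mobile silencing action would then only need to create one `Silence` record; the alert pipeline checks `is_silenced` before routing, so no alert rule has to be edited.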

Cloud-Native K8s Cluster Architecture

mermaid
graph TD
    A["K8s cluster<br/>HPA auto-scaling"] --> B["Prometheus"]
    B --> C@{shape: cyl, label: "Thanos components<br/>Sidecar/Query/Store/Compact<br/>+ S3 + LRU cache"}
    C --> D["Grafana Frontend"]
    classDef primary fill:#e3f2fd,stroke:#1976d2,color:#0d47a1
    classDef storage fill:#e8f5e9,stroke:#4caf50,color:#1b5e20
    classDef alert fill:#fce4ec,stroke:#e53935,color:#b71c1c
    classDef process fill:#f3e5f5,stroke:#7b1fa2,color:#4a148c
    class A primary
    class C storage
    class B,D process
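
For the HPA auto-scaling shown in the diagram above, the Kubernetes documentation gives the core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change when the ratio is within a tolerance (0.1 by default). The sketch below applies that rule with min and max bounds; the sample metric values and replica limits are illustrative, not figures from the platform.

```python
import math


def desired_replicas(current, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Apply the HPA scaling rule and clamp to the configured bounds."""
    ratio = current_metric / target_metric
    # Within tolerance of the target, the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

For example, 4 replicas averaging 90% CPU against a 60% target give a ratio of 1.5, so the function asks for 6 replicas, while a reading of 63% stays within tolerance and leaves the count at 4.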

Core Achievements of This Phase

Resource monitoring coverage approached 100%, with auto-discovery replacing manual configuration
Alert handling became significantly more efficient, with mobile silencing and visual threshold adjustment
The cloud-native architecture supports thousands of assets, removing the earlier performance bottlenecks
The full monitoring workflow moved onto the platform, reducing SRE day-to-day operations costs

Core Technology Evolution Summary

Dimension	2019 (Initial)	2020 (Expansion)	2021 (Platform)
Monitoring Engine	Zabbix+MySQL	Prometheus+Thanos	Prometheus+Thanos+K8s
Deployment Architecture	Single datacenter single machine	Cross-cloud Mesh distributed	Cloud-native containerized
Data Storage	Single-node MySQL sharding	TSDB+S3 Object Storage	Cache + dimension-index + object storage
Alert System	Fragmented push	Self-developed alert hub	Alert closed loop + policy governance + visualization
Monitoring Capability	Basic metric monitoring	Full-link + probing + logs	Full-scenario + automated + platformized

Evolution Value and Summary

Break the performance bottleneck: Resolved the core problem that the legacy monitoring architecture could not support business growth
Fill monitoring gaps: Self-built probing closed the user-side last-mile gap, shifting from passive fault reports to proactive discovery
Reduce cost and improve efficiency: Replaced expensive third-party services with in-house systems tailored to business needs
Cloud-native upgrade: K8s and platformization made the monitoring system scalable, maintainable, and usable
Full-link closed loop: Integrated metrics, logs, tracing, and probing into a complete stability assurance system

The evolution of this monitoring platform is a typical case of business driving technology and technology supporting business at an internet company. It moved from emergency fixes for single-point problems to comprehensive, platform-based stability infrastructure, and offers a reference for implementing monitoring in large-scale, multi-cloud, containerized environments.

Part of series: Observability Series
