From Bottleneck Breakthrough to Platform Governance — The Full Evolution of an Internet Company's Monitoring Platform Architecture

As internet businesses expand rapidly, deploy across multiple clouds, and see their asset counts grow exponentially, the monitoring platform becomes critical infrastructure for service stability. This article reviews how a major internet company's monitoring platform evolved from 2019 to 2021: first resolving the performance bottlenecks of legacy monitoring, then implementing cross-cloud distributed monitoring, and finally moving to cloud-native platform governance. Together these steps trace the full path from a 0-to-1 build, through large-scale expansion, to platform governance.

Evolution Overview: Three Major Steps in Three Years, Anchoring Core Goals

The three-year evolution of the monitoring platform was driven by four core requirements: business growth, multi-cloud heterogeneity, fault self-healing, and usability and efficiency. It was completed in three phases:

  1. 2019 (Breakthrough): Replaced the legacy Zabbix+MySQL architecture, taking the monitoring platform from 0 to 1
  2. 2020 (Expansion): Added cross-cloud integration, full-link monitoring, and self-built probing, filling the user-side monitoring gap
  3. 2021 (Governance): Completed the cloud-native transformation, a closed-loop platform, and usability upgrades, achieving full-lifecycle monitoring management

2019: The Breakthrough Year — Monitoring Platform 0-1, Solving Core Bottlenecks

Core Pain Points

  • Business monitoring reports reached the millions; the sharded Zabbix+MySQL setup hit its performance limit, and monitoring alerting was close to failing
  • Cloud vendor database monitoring APIs lost data and could not fit Prometheus's real-time pull model
  • Integrating the monitoring SDK had a steep learning curve for business teams, and assets surged without a unified classification standard
  • The legacy architecture could not monitor K8s clusters and so could not keep up with containerization

Core Solutions

  1. Technology replacement: Introduced Prometheus to replace Zabbix, with a self-developed business Exporter for metric reporting
  2. Data compatibility: Stored cloud vendor DB monitoring data in InfluxDB and visualized it in Grafana, fixing the API data loss issue
  3. Log support: Introduced the ELK stack to meet business statistics and reporting needs
  4. Multi-read scaling: Integrated Thanos to serve scenarios where monitoring data is read by many consumers, leaving Prometheus focused on collection and alerting
  5. Service discovery: Used Consul for resource registration and information retrieval
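
The self-developed business Exporter is not shown in the original, so the following is only a minimal sketch of the idea, assuming a Python service and using nothing beyond the standard library. It renders counters in the Prometheus text exposition format and can serve them on a `/metrics` path for Prometheus to pull. The metric name `business_requests_total`, the `service` and `result` labels, and the port are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real service would update these as it handles requests.
REQUEST_COUNTS = {("checkout", "ok"): 120, ("checkout", "error"): 3}


def render_metrics(counts):
    """Render {(service, result): value} counters in the Prometheus text exposition format.

    Label values are assumed to contain no quotes, backslashes, or newlines;
    a production exporter would have to escape them.
    """
    lines = [
        "# HELP business_requests_total Business requests by service and result.",
        "# TYPE business_requests_total counter",
    ]
    for (service, result), value in sorted(counts.items()):
        lines.append(
            f'business_requests_total{{service="{service}",result="{result}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counters on /metrics so that Prometheus can scrape them."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9200):
    """Block forever, exposing /metrics on the given (arbitrarily chosen) port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` from the service's startup code would expose the endpoint; the scrape target could then be registered in Consul so that Prometheus discovers it. In practice an official client library would usually replace the hand-written rendering.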

2019 Monitoring Platform Core Architecture

mermaid
graph TD
    %% Collection phase
    A[Cloud Vendor API] --> B[Cloud API Collector] --> C@{shape: cyl, label: "InfluxDB"}
    D[Business/Host Exporter] --> E[Prometheus]
    F[Consul] -->|Resource Registration/Service Discovery| E

    %% Data processing
    E --> G[Thanos Sidecar] --> H@{shape: cyl, label: "S3 Object Storage"}
    C --> I[Grafana]
    E --> I

    %% Event generation
    E --> J[Alert Push] --> K[Enterprise IM]
    I --> J

    classDef primary fill:#e3f2fd,stroke:#1976d2
    classDef storage fill:#e8f5e9,stroke:#4caf50
    classDef alert fill:#fce4ec,stroke:#e53935
    classDef process fill:#f3e5f5,stroke:#7b1fa2
    class C,H storage
    class E,G,J,K alert
    class B,D,I process
    class A,F primary

Phase Remaining Issues

  • Migrating cloud vendor DB monitoring alerts back to Prometheus had a long cycle
  • The architecture supported only 400+ assets, and weekly report performance became extremely poor once the asset count exploded
  • Multiple technology stacks ran in parallel, driving up maintenance costs

2020: The Expansion Year — Cross-Cloud Distributed + Full-Stack Capability Completion

Core Pain Points

  • Root cause analysis of business faults was difficult, and troubleshooting with the legacy log system was extremely inefficient
  • Resources across cloud vendors had no dedicated intranet lines between them, so unified monitoring was not possible
  • User-side last-mile monitoring was missing, and third-party probing services were too expensive
  • After the alert channel was switched, legacy templates could not be reused and alerts reached recipients inefficiently

Core Solutions

  1. Full-link monitoring: Introduced a distributed tracing system and a lightweight logging system for fault localization
  2. Cross-cloud integration: Built cross-region distributed monitoring over the public network on a Mesh technology stack, with Ansible for unified node management
  3. Monitoring completion: Built an in-house black-box probing system to replace expensive third-party services, covering URL, certificate, and network quality monitoring
  4. Self-developed alerting: Built an alert hub integrated with the CMDB for targeted push over enterprise IM and SMS
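
The article does not describe how the self-built probing system is implemented. As one illustration of the certificate part of black-box probing, the sketch below uses only Python's standard `ssl` and `socket` modules to read a server certificate and compute the days left before it expires. The function names and the 30-day warning threshold are assumptions made for this example.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days remaining before a certificate's notAfter timestamp.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  1 00:00:00 2030 GMT". `now` is a Unix timestamp.
    """
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def classify(days_left, warn_days=30):
    """Map remaining days to a probe status; the 30-day default is an assumed threshold."""
    if days_left <= 0:
        return "expired"
    if days_left < warn_days:
        return "warning"
    return "ok"


def probe_certificate(host, port=443, timeout=5.0):
    """Open a TLS connection to host:port and return (days_left, status)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    days_left = days_until_expiry(cert["notAfter"])
    return days_left, classify(days_left)
```

A probing node in each region could run `probe_certificate` on a schedule and expose the result as a metric, which is one way regional results could feed the same alerting path as other monitoring data.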

2020 Monitoring Platform Core Architecture

mermaid
graph TD
    %% Collection upgrade
    A[Cloud Vendor API] --> B[Cloud API Collector] --> C@{shape: cyl, label: "InfluxDB"}
    D[Cloud Exporter/Kafka Exporter] --> E[Prometheus]
    F[Consul] --> E

    %% Data processing
    E --> G[Thanos Sidecar] --> H@{shape: cyl, label: "S3 Object Storage"}
    E --> I[Thanos Query/Rule] --> J[Cluster Statistics Events]

    %% Alert hub
    E --> K[Self-developed Alert System] --> L[Enterprise IM/SMS]
    C --> M[Grafana] --> K

    classDef primary fill:#e3f2fd,stroke:#1976d2
    classDef storage fill:#e8f5e9,stroke:#4caf50
    classDef alert fill:#fce4ec,stroke:#e53935
    classDef process fill:#f3e5f5,stroke:#7b1fa2
    class C,H storage
    class E,G,J,K,L alert
    class B,D,M process
    class A,F primary

Cross-Region Mesh Distributed Architecture

mermaid
graph LR
    %% Multi-region probing/monitoring nodes
    N1[Cloud Beijing Node] --> P[59080 Mesh Port]
    N2[Cloud Guangzhou Node] --> P
    N3[Cloud Singapore Node] --> P
    N4[Cloud Shanghai Node] --> P

    P --> Q[Prometheus Cluster]
    Q --> R[Thanos Global Aggregation]

    classDef primary fill:#e3f2fd,stroke:#1976d2
    classDef network fill:#fff3e0,stroke:#ff9800
    classDef alert fill:#fce4ec,stroke:#e53935
    class N1,N2,N3,N4 primary
    class P network
    class Q,R alert

Phase Remaining Issues

  • The cost of distributed tracing did not match its business value; the return on investment was unbalanced
  • With no distributed management system, the complexity of the Mesh architecture kept increasing
  • A bastion host upgrade broke Ansible-based unified management
  • Adapting to multiple cloud vendors consumed significant manpower

2021: The Governance Year — Cloud-Native Platformization, Achieving Monitoring Closed Loop

Core Objectives

Shift from fragmented operations to platform-based management, complete the cloud-native transformation, and improve monitoring coverage, alert governance efficiency, and usability.

Core Construction Content

  1. Alert closed loop: Developed alert silencing, statistical analysis, policy management, and routing/distribution systems
  2. Service discovery: Integrated with CMDB and CI/CD for automatic resource registration; host monitoring coverage reached 90.8%
  3. Cloud-native transformation: Migrated fully to K8s clusters, with elastic scaling based on HPA
  4. Performance optimization: Added caching with an LRU policy and dimension-indexed queries to Thanos Store, improving query performance
  5. Usability upgrade: Built a web-based visual management console, with support for suppressing alerts from mobile devices and adjusting thresholds
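
The alert platform's internals are not given in the article. The sketch below shows one way the silencing and routing steps of such a closed loop could fit together: an alert is dropped if an active silence matches its labels, and otherwise it is routed to the owners that a CMDB lookup returns. The `Silence` and `AlertRouter` classes, the label names, and the rule that only critical alerts also go out by SMS are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Silence:
    matchers: dict   # label name -> value that the alert must carry
    ends_at: float   # Unix timestamp after which the silence no longer applies

    def matches(self, labels, now):
        """True while the silence is active and every matcher equals the alert's label."""
        return now < self.ends_at and all(
            labels.get(name) == value for name, value in self.matchers.items()
        )


@dataclass
class AlertRouter:
    cmdb_owners: dict        # hypothetical CMDB export: service name -> list of owners
    default_receivers: list  # fallback when the CMDB has no entry for the service
    silences: list = field(default_factory=list)

    def route(self, labels, now):
        """Return (receiver, channel) pairs for an alert, or [] if it is silenced."""
        if any(silence.matches(labels, now) for silence in self.silences):
            return []
        owners = self.cmdb_owners.get(labels.get("service"), self.default_receivers)
        # Assumed policy: critical alerts go to IM and SMS, everything else to IM only.
        channels = ["im", "sms"] if labels.get("severity") == "critical" else ["im"]
        return [(owner, channel) for owner in owners for channel in channels]
```

For example, with `cmdb_owners={"pay": ["alice"]}`, a critical alert for the `pay` service is routed to `alice` over both channels unless a matching silence is still active. Keeping silences as data rather than code is what would let a web or mobile UI create and expire them.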

2021 Platform Monitoring Architecture

mermaid
graph TD
    %% Collection layer
    A[Exporter/Sidecar] --> B[Prometheus]
    C[CMDB System] --> D[Service Discovery Registration] --> B

    %% Data layer
    B --> E[Thanos Cluster] --> F@{shape: cyl, label: "S3 Object Storage + Cache"}
    E --> G[Grafana/WEB UI]

    %% Alert hub (Post Office)
    B --> H[Alertmanager] --> I[Self-developed Alert Platform]
    I --> J[Alert Suppression/Routing/Policy]
    J --> K[Enterprise IM/SMS/Personal Subscription]

    classDef primary fill:#e3f2fd,stroke:#1976d2
    classDef storage fill:#e8f5e9,stroke:#4caf50
    classDef alert fill:#fce4ec,stroke:#e53935
    classDef process fill:#f3e5f5,stroke:#7b1fa2
    class F storage
    class B,E,H,I,J,K alert
    class A,C,D,G process

Cloud-Native K8s Cluster Architecture

mermaid
graph TD
    A[K8s Cluster] --> B[Prometheus]
    B --> C[Thanos Sidecar] --> D@{shape: cyl, label: "S3 Object Storage"}
    B --> E[Thanos Query]
    E --> F[Thanos Store] --> D
    G[Thanos Compact] --> D
    E --> H[Grafana Frontend]
    %% Elastic capabilities
    B --> I[HPA Auto-scaling]
    F --> J[LRU Cache/60-minute Data Cache]

    classDef primary fill:#e3f2fd,stroke:#1976d2
    classDef storage fill:#e8f5e9,stroke:#4caf50
    classDef alert fill:#fce4ec,stroke:#e53935
    classDef process fill:#f3e5f5,stroke:#7b1fa2
    class D storage
    class B,E,F,G,H alert
    class A,I,J process

Phase Core Achievements

  • Resource monitoring coverage approached 100%, with auto-discovery requiring no manual configuration
  • Alert handling became significantly more efficient, with silencing from mobile devices and visual threshold adjustment
  • The cloud-native architecture supports thousands of assets, resolving the earlier performance bottlenecks
  • The full monitoring workflow was moved onto the platform, reducing SRE day-to-day operations costs

Core Technology Evolution Summary

| Dimension | 2019 (Initial) | 2020 (Expansion) | 2021 (Platform) |
| --- | --- | --- | --- |
| Monitoring Engine | Zabbix+MySQL | Prometheus+Thanos | Prometheus+Thanos+K8s |
| Deployment Architecture | Single datacenter, single machine | Cross-cloud Mesh distributed | Cloud-native containerized |
| Data Storage | Single-node MySQL sharding | TSDB + S3 object storage | Cache + dimension index + object storage |
| Alert System | Fragmented push | Self-developed alert hub | Alert closed loop + policy governance + visualization |
| Monitoring Capability | Basic metric monitoring | Full-link + probing + logs | Full-scenario + automated + platformized |

Evolution Value and Summary

  1. Breaking the performance bottleneck: Solved the core problem that the legacy monitoring architecture could not support business growth
  2. Filling monitoring gaps: Self-built probing completed user-side last-mile monitoring, shifting from passive fault reports to proactive discovery
  3. Reducing cost and improving efficiency: Replaced expensive third-party services with self-developed systems adapted to business-specific needs
  4. Cloud-native upgrade: K8s plus platformization made the monitoring system scalable, maintainable, and usable
  5. Full-link closed loop: Integrated metrics, logs, tracing, and probing into a complete stability assurance system

The evolution of this monitoring platform is a typical case of business driving technology and technology supporting business in an internet company. It moved from emergency fixes for single-point problems to comprehensive, platform-based stability infrastructure, and it offers a reference for implementing monitoring in large-scale, multi-cloud, containerized businesses.