Eyes On You: The 2022 Productization Journey of a Multi-Cloud Heterogeneous Monitoring Platform

With multi-cloud deployment, global networking, and exponential growth in service scale across internet businesses, monitoring platforms have long outgrown the basic “metric collection + alert notification” role and become core infrastructure for end-to-end stability. Drawing on the real evolution of a monitoring platform at a large internet enterprise, this article walks through the complete planning and implementation approach used in 2022 to take the platform from large-scale coverage to productization, usability, and intelligence.

Platform Status: Large-Scale Challenges of Multi-Cloud Heterogeneity

As business expands globally, the monitoring platform has entered a phase of ultra-large scale, highly distributed, and heterogeneous operation. The current status is as follows:

  • Resource scale: Services span four major cloud providers and multiple domestic and international regions, comprising 57 Kubernetes clusters, 2,420 non-container hosts, and hundreds of MySQL, MongoDB, Kafka, Redis, and TiDB instances.
  • Collection scale: Metric collection has grown explosively over three years. In 2022 the platform collects over 324.9 billion metrics per day on average, with a peak collection rate of 3.76 million metrics/second, processed by 122 distributed Prometheus nodes across regions.
  • Alert scale: Alerts from the four cloud providers' monitoring services and from the self-built platform are ingested through one unified entry point, which pushes over 220,000 alert events per week to 50+ collaboration groups.
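
A quick back-of-the-envelope check helps put these figures in proportion. The sketch below uses only the numbers quoted above. The even split across nodes and groups is an assumption made for illustration, since real load is unlikely to be uniform, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_WEEK = 7 * 24 * 60


def scale_summary(daily_metrics, peak_rate, nodes, weekly_alerts, groups):
    """Derive per-second, per-node, and per-group figures from headline totals.

    Assumes load is spread evenly, which is a simplification.
    """
    return {
        "avg_rate_per_s": daily_metrics / SECONDS_PER_DAY,
        "peak_rate_per_node": peak_rate / nodes,
        "alerts_per_minute": weekly_alerts / MINUTES_PER_WEEK,
        "alerts_per_group_per_day": weekly_alerts / 7 / groups,
    }


# Figures taken from the status list above.
summary = scale_summary(324.9e9, 3.76e6, 122, 220_000, 50)
for name, value in summary.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions the daily total works out to roughly 3.76 million metrics per second on average (the same magnitude as the quoted peak rate), about 30,800 per second per Prometheus node, and on the order of 600 alert events per collaboration group per day.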

The current platform exhibits four core characteristics: multi-cloud differentiation, multi-environment isolation, cross-region dispersion, and real-time massive data processing. There is no fully standardized optimal solution in the industry; the platform needs continuous iteration to adapt to business growth.

Core Pain Points: The Critical Gap from “Usable” to “Easy to Use”

After initial large-scale construction, the monitoring platform completed basic coverage, but still had significant shortcomings in productization, usability, and effectiveness. Core issues are divided into two dimensions:

Metric Collection & Processing Platform

  1. Infrastructure differs significantly across cloud providers, and automatic resource discovery and monitoring coverage are insufficient
  2. Metric naming is inconsistent and collection nodes are scattered, so R&D teams cannot manage monitoring strategies for their own services
  3. Custom metric reporting is cumbersome and has a high barrier to entry, which slows the rollout of monitoring coverage

Unified Alert Platform

  1. Alerts come from multiple sources with inconsistent data structures, and there is no global control capability
  2. Alert flooding and excessive noise greatly reduce fault detection and handling efficiency
  3. Alert events have low transparency, so R&D and operations cannot independently manage or trace the alert lifecycle

2021 Foundation Building: Establishing Platform Management Capabilities

In 2021 the focus was on running monitoring as a true platform: completing the core transformation from “scattered tools” to a “unified platform” and building four core capabilities:

Core System Construction

  • Service Discovery & Registration System: Cross-domain distributed resource discovery and registration, using the Sidecar pattern to integrate with CMDB and CICD and automatically manage all monitorable resources
  • Unified Alert Platform: Full lifecycle management of alert ingestion, normalization, routing, notification, and archiving
  • Basic Control Capabilities: Alert silencing/inhibition, statistical analysis, strategy management, route distribution system
  • Visualization Frontend: Operations-oriented monitoring management and alert processing portal
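
To make the ingestion, normalization, and routing steps of the unified alert platform more concrete, here is a minimal sketch, not the platform's actual implementation. The `alerts`, `labels`, and `status` fields follow the Alertmanager webhook payload. The cloud vendor event shape, the `Alert` schema, and the group names are hypothetical and stand in for whatever each provider really sends.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    """Hypothetical common schema that every alert source is mapped into."""
    source: str
    name: str
    severity: str
    resource: str
    status: str


def from_alertmanager(payload: dict) -> list:
    """Normalize an Alertmanager webhook payload (one payload, many alerts)."""
    return [
        Alert(
            source="prometheus",
            name=a.get("labels", {}).get("alertname", "unknown"),
            severity=a.get("labels", {}).get("severity", "warning"),
            resource=a.get("labels", {}).get("instance", ""),
            status=a.get("status", "firing"),
        )
        for a in payload.get("alerts", [])
    ]


def from_cloud_vendor(event: dict) -> Alert:
    """Normalize a cloud provider event; this event shape is an assumption."""
    level_map = {"CRITICAL": "critical", "WARN": "warning", "INFO": "info"}
    return Alert(
        source=event["vendor"],
        name=event["rule"],
        severity=level_map.get(event["level"], "warning"),
        resource=event["resource_id"],
        status="resolved" if event.get("recovered") else "firing",
    )


def route(alert: Alert, routes: dict, silenced: set,
          default: str = "ops-default") -> Optional[str]:
    """Return the target group, or None when the alert is silenced."""
    if alert.name in silenced:
        return None
    return routes.get(alert.severity, default)
```

Once every source is mapped into one schema, silencing, statistics, and routing only need to be written once, which is the main argument for normalizing before anything else.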

2021 Monitoring Platform Core Architecture

```mermaid
graph TB
    subgraph Resources["Resources & Ecosystem Layer"]
        A@{ shape: doc, label: "Hosts/Containers/Middleware/Databases" }
        B@{ shape: doc, label: "CMDB/CICD Third-party Systems" }
    end
    subgraph Collection["Collection & Discovery Layer"]
        C@{ shape: doc, label: "Service Discovery & Registration System" }
        D@{ shape: cyl, label: "Distributed Prometheus Cluster" }
        E@{ shape: doc, label: "Exporter/Sidecar Collection Components" }
    end
    subgraph Alert["Alert Central Layer"]
        F@{ shape: hex, label: "Alertmanager" }
        G@{ shape: doc, label: "Alert Routing System" }
        H@{ shape: doc, label: "Alert Strategy/Silence/Statistics" }
    end
    subgraph Notification["Notification & Display Layer"]
        I@{ shape: doc, label: "Enterprise IM/SMS" }
        J@{ shape: doc, label: "Personal/Group Subscriptions" }
        K@{ shape: doc, label: "Monitoring Management WEB UI" }
        L@{ shape: doc, label: "Grafana Visualization" }
    end
    A --> E
    B --> C
    C --> D
    D --> F
    F --> G
    G --> H
    H --> I
    H --> J
    D --> L
    H --> K
    classDef monitor fill:#e3f2fd,stroke:#1976d2
    classDef storage fill:#e8f5e9,stroke:#4caf50
    classDef alert fill:#fce4ec,stroke:#e53935
    class A,B,C,E,G,H,I,J,K,L monitor
    class D storage
    class F alert
```

H1 2022 Planning: R&D-Facing One-Stop Monitoring Platform

The core goal for H1 2022 is to create a one-stop, frictionless, easy-to-use, and efficient monitoring experience for R&D teams that shields them from the complexity of multi-cloud, cross-region, and isolated environments. Core focus areas:

Core Design Principles

  • OneIn: Unified entry point, no need to care about environment, data center, or cloud provider differences
  • Coverage: CICD+CMDB full-chain automated coverage, frictionless monitoring integration
  • Self-service: Full-process self-service operations, realizing “Your Data, Your Rules”
  • Observability: Metrics+Tracing+Logging convergence
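
The Coverage principle can be illustrated with a small sketch of how CMDB records might be turned into Prometheus scrape targets. The output follows the Prometheus file-based service discovery format (a list of objects with `targets` and `labels`). The CMDB record fields (`ip`, `cloud`, `region`, `service`, `owner`) and the function name are assumptions for illustration, not the platform's real schema. Port 9100 is the node_exporter default.

```python
import json


def build_file_sd(resources, default_port=9100):
    """Group CMDB records into file_sd target groups sharing the same labels."""
    groups = {}
    for r in resources:
        # Labels are what let users search by cloud, region, service, or owner.
        labels = (
            ("cloud", r["cloud"]),
            ("owner", r["owner"]),
            ("region", r["region"]),
            ("service", r["service"]),
        )
        target = f'{r["ip"]}:{r.get("port", default_port)}'
        groups.setdefault(labels, []).append(target)
    return [
        {"targets": sorted(targets), "labels": dict(labels)}
        for labels, targets in sorted(groups.items())
    ]


# Hypothetical CMDB records, as a CICD hook might deliver them.
cmdb_records = [
    {"ip": "10.0.0.2", "cloud": "cloud-a", "region": "eu-1",
     "service": "checkout", "owner": "team-pay"},
    {"ip": "10.0.0.1", "cloud": "cloud-a", "region": "eu-1",
     "service": "checkout", "owner": "team-pay"},
    {"ip": "10.1.0.9", "port": 9104, "cloud": "cloud-b", "region": "ap-2",
     "service": "orders-db", "owner": "team-db"},
]
print(json.dumps(build_file_sd(cmdb_records), indent=2))
```

Because Prometheus re-reads file_sd files when they change, regenerating such a file from the CMDB on each deployment is one plausible way to get monitoring attached without manual configuration.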

Core Feature Delivery

  1. One-Stop Monitoring Portal: Unified integration of hosts, containers, databases, and middleware monitoring across all scenarios, supporting quick resource search by IP, business info, cloud provider, and region.
  2. Minute-Level Instance Monitoring Setup: New resources get monitoring deployed within minutes, with no manual configuration needed, through automated collection and alert binding.
  3. Intuitive Metric Visualization: Multi-dimensional real-time charts for CPU, memory, disk, network, performance, connection count, etc., with custom time ranges and refresh frequencies.
  4. Quick Owner Association: Automatic binding of resources to R&D/operations owners, enabling rapid identification of responsible persons during incidents for faster troubleshooting.
  5. One-Stop Alert Configuration: Simple/Advanced dual-mode alert configuration frontend, supporting metric registration, rule group creation, and policy self-management.
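
As an illustration of what the Simple mode of alert configuration could do behind the scenes, the sketch below turns a small form into a Prometheus alerting rule. The rule keys (`alert`, `expr`, `for`, `labels`, `annotations`) are the standard Prometheus rule fields. The form fields, operator names, and metric name are hypothetical.

```python
import json

# Hypothetical operator names offered by a simple-mode form.
OPERATORS = {"gt": ">", "ge": ">=", "lt": "<", "le": "<="}


def simple_rule_to_prometheus(form):
    """Render a simple-mode form as one Prometheus alerting rule (as a dict)."""
    op = OPERATORS[form["op"]]
    selector = ",".join(f'{k}="{v}"' for k, v in sorted(form["match"].items()))
    expr = f'{form["metric"]}{{{selector}}} {op} {form["threshold"]}'
    return {
        "alert": form["name"],
        "expr": expr,
        "for": form["duration"],
        "labels": {"severity": form["severity"], "owner": form["owner"]},
        # {{ $labels.instance }} is expanded by Prometheus, not by Python.
        "annotations": {"summary": form["name"] + " on {{ $labels.instance }}"},
    }


form = {
    "name": "CheckoutHighCPU",
    "metric": "cpu_usage_percent",  # assumed metric name
    "match": {"service": "checkout"},
    "op": "gt",
    "threshold": 90,
    "duration": "5m",
    "severity": "critical",
    "owner": "team-pay",
}
rule_group = {"groups": [{"name": "checkout", "rules": [simple_rule_to_prometheus(form)]}]}
print(json.dumps(rule_group, indent=2))
```

The Advanced mode would presumably let users write the expression directly. Keeping both modes on the same rule structure means a rule created in Simple mode can later be opened and refined in Advanced mode.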

The second half of 2022 (H2) focuses on completing full-chain capabilities and upgrading intelligence, pushing beyond traditional monitoring boundaries toward a next-generation observability platform:

Full-Scenario Observability Capabilities

  • Connect the full chain of reporting services, queue services, consumer services, distributed storage, and metadata management
  • Integrate business monitoring, APM application monitoring, client-side monitoring, and synthetic monitoring with deep call chain linkage
  • Cover WEB + mobile endpoints, enabling monitoring viewing and alert handling anytime, anywhere

Intelligent Alert Evolution

  • Solve alert coverage challenges for complex middleware like Kafka, optimize multi-scenario alert strategies
  • Explore threshold-free intelligent alerting: Automatically infer anomaly thresholds based on seasonal data patterns
  • Reuse intelligent models, extend to more business scenarios, build generalized intelligent alerting capabilities
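
The idea of threshold-free alerting based on seasonal patterns can be sketched with a very simple stand-in model: learn a per-hour-of-day baseline (median and median absolute deviation) from history, then flag values that fall outside a band around it. This is an illustrative technique chosen for brevity, not the model the platform uses. The function names and the parameters `k` and `min_band` are assumptions.

```python
from collections import defaultdict
from statistics import median


def fit_baseline(history, period=24):
    """history: iterable of (hour_index, value). Returns {slot: (median, mad)}."""
    buckets = defaultdict(list)
    for t, value in history:
        buckets[t % period].append(value)
    baseline = {}
    for slot, values in buckets.items():
        m = median(values)
        mad = median([abs(v - m) for v in values])
        baseline[slot] = (m, mad)
    return baseline


def is_anomalous(baseline, t, value, period=24, k=4.0, min_band=1.0):
    """Flag a value that deviates from its seasonal slot by more than the band."""
    m, mad = baseline[t % period]
    # 1.4826 scales MAD to a standard deviation under a normal assumption.
    band = max(k * 1.4826 * mad, min_band)
    return abs(value - m) > band


# Synthetic week: busy (about 100) from 09:00 to 17:59, quiet (about 10) otherwise.
history = [
    (day * 24 + hour, (100 if 9 <= hour < 18 else 10) + (day % 3) - 1)
    for day in range(7)
    for hour in range(24)
]
baseline = fit_baseline(history)
next_day = 7 * 24
print(is_anomalous(baseline, next_day + 12, 100))  # midday traffic at midday
print(is_anomalous(baseline, next_day + 3, 100))   # midday traffic at 03:00
```

The same value is normal at noon and suspicious at 03:00, which a single static threshold cannot express. A production model would also need to handle weekly cycles, trends, and holidays.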

Full-Stack Capability Extension

Add capacity planning, task scheduling, network monitoring and other capabilities, forming a complete observability closed loop of collection-storage-analysis-alert-self-healing-planning.

Solving Core Questions: Let Monitoring Return to Business Itself

Platform iteration always revolves around four core questions, shielding users from underlying complexity:

  1. Where are the resources and services? The service discovery system manages heterogeneous resources across 64 isolated environments, automatically sensing the location of all services.
  2. Where are the collected metrics? Distributed edge autonomy architecture shields differences across 122 Prometheus nodes; users don’t need to care about collection node assignments.
  3. How are alert events accurately delivered? Alert lifecycle management combined with noise reduction and rate limiting ensures that, out of more than 220,000 alert events per week, only actionable information reaches the right people.
  4. How can I quickly observe my services? One-stop view, quickly locating your own service data from 324.9 billion daily metrics.
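
For the third question, noise reduction and rate limiting can be sketched as two checks applied before a notification is sent: suppress repeats of the same alert within a deduplication window, and cap how many notifications one group receives per minute. This is a minimal sketch under assumed parameters. The class name, window sizes, and return values are hypothetical.

```python
from collections import defaultdict, deque


class NoiseReducer:
    """Decide whether an alert event should be sent, deduplicated, or limited."""

    def __init__(self, dedup_window=300.0, max_per_minute=20):
        self.dedup_window = dedup_window
        self.max_per_minute = max_per_minute
        self._last_sent = {}                # fingerprint -> last send time
        self._recent = defaultdict(deque)   # group -> send times in last 60 s

    def decide(self, fingerprint, group, now):
        # 1. Deduplicate: same alert already sent within the window.
        last = self._last_sent.get(fingerprint)
        if last is not None and now - last < self.dedup_window:
            return "duplicate"
        # 2. Rate limit: sliding 60-second window per collaboration group.
        sent = self._recent[group]
        while sent and now - sent[0] >= 60.0:
            sent.popleft()
        if len(sent) >= self.max_per_minute:
            return "rate_limited"
        sent.append(now)
        self._last_sent[fingerprint] = now
        return "send"


reducer = NoiseReducer(dedup_window=300.0, max_per_minute=2)
decisions = [
    reducer.decide("cpu@10.0.0.1", "team-pay", now=0.0),
    reducer.decide("cpu@10.0.0.1", "team-pay", now=10.0),
    reducer.decide("disk@10.0.0.1", "team-pay", now=20.0),
    reducer.decide("mem@10.0.0.1", "team-pay", now=30.0),
]
print(decisions)
```

Passing `now` explicitly keeps the logic deterministic and easy to test. Events that are suppressed would still be archived so that the alert lifecycle remains traceable.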

Summary: From Tool to Platform, From Operations to Everyone

2022 is the key year for the monitoring platform to transform from an operations tool to a productized platform available to everyone. With usability, effectiveness, transparency, and universality at its core, the platform achieves automated monitoring coverage, self-service usage, intelligent alerting, and full-chain observability in the complex scenario of massive multi-cloud heterogeneity.