Black-box Probing Monitoring System Architecture Design and Practice for Internet Companies

August 31, 2021 Architecture Probing, Black-Box Monitoring, Architecture Observability Series

In the full-link monitoring system of internet services, white-box monitoring focuses on proactively uncovering potential issues and predicting risks, while black-box monitoring is fault-oriented, rapidly detecting problems that have already occurred online. The two work together to form a complete monitoring closed loop. Most internet companies have long had a monitoring blind spot for public network services and the user-side last mile. User-side faults often only trigger investigation after users report issues. The black-box probing monitoring system was designed precisely to solve this industry pain point.

I. Monitoring System Positioning and Core Pain Points

1. Core Differences Between White-box and Black-box Monitoring

White-box monitoring: From the internal system perspective, based on metrics, logs, and tracing, proactively discovers/predicts potential issues.
Black-box monitoring: Simulates real user access behavior to check the availability and access performance of externally facing services, so that faults are detected quickly and can be located and responded to promptly.

2. Common Industry Monitoring Pain Points

There is a gap in public network and user-side last-mile monitoring; user-side faults cannot be proactively discovered.
Third-party probing services are expensive, have low business scenario coverage, and struggle to adapt to enterprise customization needs.
Distributed probing node management is difficult; data collection, alert convergence, and security auditing are hard to balance.

II. Self-built vs. Third-party Monitoring Solution Comparison

Rather than pay for expensive third-party services, an enterprise can build its own black-box probing monitoring system. The two approaches compare as follows:

Dimension	Self-built Probing Monitoring	Industry Third-party Probing Services
Resource Coverage	Deployed on self-owned IDC nodes, can cover intranet scenarios	More abundant public nodes, finer geographical granularity
Cost Control	Unlimited probing quantity and frequency, only basic server resources required	Billed by URL/city/ISP/node/frequency, cost grows with scale
Core Advantages	Supports intranet probing, customizable monitoring frequency, certificate monitoring, TCP probing, self-developed CMDB integration, natively compatible with existing Prometheus+Grafana stack	Supports CDN/origin MD5 verification, rich HTTP monitoring, ad-hoc probing, endpoint packet capture

III. Core Business Scenarios Supported

The system covers mainstream service monitoring needs of internet companies, fully adaptable to various business scenarios:

Frontend service monitoring: Certificate chain validity, DNS latency, TLS handshake latency, connection establishment latency, page load latency.
Network quality monitoring: ICMP probing, cross-domain internal/external network and dedicated line quality monitoring.
Protocol and scenario coverage: Supports DNS, TLS, TCP, SMTP and other protocols, adapting to CDN, proxy services, enterprise email and other scenarios.
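The latency dimensions listed above (DNS, connection establishment, TLS handshake) can be measured as separate phases of one probe. The following is a minimal Python sketch of that idea using only the standard library. It is not how blackbox_exporter is implemented, and the names `time_phases` and `probe_tls` are hypothetical helpers introduced for illustration.

```python
import socket
import ssl
import time


def time_phases(phases, clock=time.perf_counter):
    """Run (name, callable) pairs in order and record how long each one takes."""
    timings = {}
    for name, step in phases:
        start = clock()
        step()
        timings[name] = clock() - start
    return timings


def probe_tls(host, port=443, timeout=5.0):
    """Time DNS resolution, TCP connect, and TLS handshake for one target."""
    state = {}

    def resolve():
        # Resolve first so DNS latency is not hidden inside the connect time.
        state["addr"] = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)[0]

    def connect():
        family, socktype, proto, _canon, sockaddr = state["addr"]
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        state["sock"] = sock
        sock.connect(sockaddr)

    def handshake():
        # The default context verifies the certificate chain and hostname.
        context = ssl.create_default_context()
        state["sock"] = context.wrap_socket(state["sock"], server_hostname=host)

    try:
        return time_phases(
            [("dns", resolve), ("connect", connect), ("tls", handshake)]
        )
    finally:
        if "sock" in state:
            state["sock"].close()
```

Calling `probe_tls("example.com")` would return a dictionary with `dns`, `connect`, and `tls` durations in seconds, and raise an exception if any phase fails. The injectable `clock` parameter exists only to make the timing logic easy to test.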

IV. Core System Architecture Design

1. Probing Node Internal Architecture

Each probing node is an autonomous island unit, with independent probing, alerting, and status monitoring capabilities, secured through mesh proxy management.

```mermaid
flowchart TD
    BE@{ shape: rounded, label: "blackbox<br/>_exporter" } -->|Execute Probe| TS@{ shape: rounded, label: "Target Service" }
    P@{ shape: rounded, label: "Prometheus<br/>+ node_exporter" } -->|Pull Probe/Status| BE
    P -->|Push Alert| AM@{ shape: rounded, label: "Alertmanager<br/>+ Alert Bot" }
    M@{ shape: rounded, label: "Mosn Mesh Proxy" } -->|Proxy/Rate Limit| P
    M -->|Auth| IN@{ shape: double-circle, label: "Internet" }

    classDef src fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    classDef proc fill:#fff3e0,stroke:#FF9800,color:#BF360C
    classDef out fill:#c8e6c9,stroke:#4CAF50,color:#1B5E20
    classDef warn fill:#ffcdd2,stroke:#f44336,color:#B71C1C
    class BE,TS src
    class P,M proc
    class IN out
    class AM warn
```
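The "Pull Probe/Status" edge in the node architecture corresponds to an HTTP request against blackbox_exporter's `/probe` endpoint, which takes `module` and `target` query parameters and answers with metrics in the Prometheus text format. The sketch below shows how such a request URL is built and how the simplest sample lines can be read. The exporter address, the module name `http_2xx` (taken from blackbox_exporter's example configuration), and the helper names are assumptions for illustration, and the parser deliberately ignores optional timestamps.

```python
from urllib.parse import urlencode


def probe_url(exporter, target, module):
    """Build the URL a scraper would request to probe one target."""
    query = urlencode({"module": module, "target": target})
    return f"{exporter.rstrip('/')}/probe?{query}"


def parse_metrics(text):
    """Parse 'series value' lines of the text format into a dict.

    Simplification: assumes no trailing timestamp on sample lines.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


SAMPLE = """\
# TYPE probe_success gauge
probe_success 1
probe_duration_seconds 0.25
probe_http_duration_seconds{phase="connect"} 0.0625
"""

if __name__ == "__main__":
    print(probe_url("http://127.0.0.1:9115", "https://hub.xxx.com", "http_2xx"))
    print(parse_metrics(SAMPLE))
```

In the deployed system Prometheus performs this scrape itself; the sketch only makes the request shape and the returned series visible.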

2. Public Network Overall Architecture

Adopts a distributed probing nodes + centralized data aggregation architecture, strictly separating data flows to ensure security and audit compliance.

```mermaid
flowchart TD
    N@{ shape: rounded, label: "Distributed Probe<br/>Points (geohash)" } -->|report| Z@{ shape: rounded, label: "Prometheus<br/>Aggregation" }
    Z -->|Storage| O@{shape: cyl, label: "OSS" }
    Z -->|Query| T@{ shape: rounded, label: "Thanos Query" }
    T --> G@{ shape: rounded, label: "Grafana<br/>Visualization" }
    Z -->|Alerts| A@{ shape: rounded, label: "Alertmanager" }

    classDef src fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    classDef proc fill:#fff3e0,stroke:#FF9800,color:#BF360C
    classDef spec fill:#f3e5f5,stroke:#9C27B0,color:#4A148C
    classDef store fill:#c8e6c9,stroke:#4CAF50,color:#1B5E20
    classDef out fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    class N src
    class Z proc
    class T spec
    class O store
    class G out
    class A out
```

3. Data Visualization Architecture

Implements geographic data visualization based on geohash + OpenStreetMap, using a map view in place of traditional time series charts to show network quality across nodes nationwide at a glance.

```mermaid
graph TD
    A@{ shape: rounded, label: "Distributed Probe Points" } --> B@{ shape: rounded, label: "Geohash Geo-encoding" }
    B --> C@{ shape: rounded, label: "Thanos Data Aggregation" }
    C --> D@{ shape: rounded, label: "OpenStreetMap Regional Display" }
    C --> E@{ shape: rounded, label: "Grafana Metric Charts" }
    D & E --> F@{ shape: rounded, label: "Operations/Monitoring Console" }
    classDef primary fill:#e3f2fd,stroke:#1976d2
    classDef storage fill:#e8f5e9,stroke:#4caf50
    classDef network fill:#fff3e0,stroke:#ff9800
    classDef alert fill:#fce4ec,stroke:#e53935
    classDef process fill:#f3e5f5,stroke:#7b1fa2
    class A,B,D,E,F primary
    class C alert
```

V. Key Technical Implementation

1. Probe Engine Selection

Adopts Prometheus + blackbox_exporter as the core probe engine, supporting HTTP/TCP/ICMP/DNS/SMTP multi-protocol probing, accurately simulating user access behavior.

2. Distributed Node Management

Introduces the Mosn service mesh as a unified proxy layer for port control, rate limiting/circuit breaking, and encrypted authentication, reducing the management cost of nationwide distributed nodes and enabling a zero-trust network architecture.
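To make the rate-limiting role of the proxy concrete, here is a generic token-bucket limiter in Python. This is a textbook illustration of the technique, not Mosn's implementation or configuration; the class name, parameters, and the choice of a token bucket are all assumptions.

```python
class TokenBucket:
    """Allow at most `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self.tokens = capacity      # start full so an initial burst is allowed
        self.updated = now          # time of the last refill

    def allow(self, now, cost=1.0):
        """Return True and consume tokens if the request fits, else False."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=1.0, capacity=2.0)
    # Three requests arrive at t=0, then one more a second later.
    print([bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)])
```

Passing the current time in explicitly keeps the limiter deterministic; a production proxy would read a monotonic clock instead.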

3. Geo-visualization Capability

Binds probing data with geographical locations via geohash + OpenStreetMap, intuitively displaying service quality across different regions and ISPs, enabling rapid localization of regional faults.
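Geohash encodes a latitude/longitude pair as a short base-32 string in which nearby points share a prefix, which is what makes regional grouping cheap. The sketch below implements the standard encoding and averages latency per prefix. The function names and the sample layout `(lat, lon, latency_seconds)` are assumptions for illustration, not the system's actual data model.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a coordinate as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, value, bits = [], 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        value = (value << 1) | int(bit)
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[value])
            value, bits = 0, 0
    return "".join(chars)


def mean_latency_by_region(samples, prefix_len=3):
    """Average latency per geohash prefix; samples are (lat, lon, seconds)."""
    buckets = defaultdict(list)
    for lat, lon, seconds in samples:
        buckets[geohash_encode(lat, lon, prefix_len)].append(seconds)
    return {region: sum(v) / len(v) for region, v in buckets.items()}


if __name__ == "__main__":
    print(geohash_encode(42.6, -5.6, 5))
```

A shorter prefix covers a larger area, so `prefix_len` acts as the zoom level when deciding whether a fault is local to one region.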

4. Security and Auditing

Dual network control via cloud provider security groups + Mosn.
Strictly controls PULL/PUSH data flows, physically isolated from existing monitoring systems.
Full-link operations auditable, meeting enterprise security compliance requirements.
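One common building block for this kind of network control is a source-IP allowlist, which the article later mentions for sensitive services. The Python sketch below shows the check using the standard `ipaddress` module. The CIDR ranges are documentation-only example addresses and the function name is hypothetical; in the described system this enforcement would sit in security groups and the proxy rather than in application code.

```python
import ipaddress


def is_allowed(source_ip, allowlist):
    """Return True if source_ip falls inside any CIDR range in allowlist."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowlist)


# Hypothetical allowlist: one probe-node subnet and one single host.
PROBE_NODES = ["203.0.113.0/24", "198.51.100.7/32"]

if __name__ == "__main__":
    for ip in ("203.0.113.42", "198.51.100.8"):
        print(ip, is_allowed(ip, PROBE_NODES))
```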

VI. Alert Mechanism and Practical Results

The system supports alert aggregation and multi-channel push. Alert information includes core dimensions such as node, target, latency, and anomaly status. Example:

【Alert】Connectivity probing takes too long
Trigger Node: Cloud Computing Beijing Node
Target Address: https://hub.xxx.com
Latency Details: Total 7.9s, DNS 3.7ms, Connection 3.2s, TLS 3.4s
Status: FIRING
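Alert aggregation means that when several nodes report the same problem for the same target, one message goes out instead of one per node. The sketch below groups firing alerts by alert name and target, similar in spirit to grouping in Alertmanager. The dictionary fields, the second node name, and the message layout are assumptions for illustration, not the system's real alert schema.

```python
from collections import defaultdict


def aggregate_alerts(alerts):
    """Collapse firing alerts that share (name, target) into one message each."""
    groups = defaultdict(list)
    for alert in alerts:
        if alert["status"] == "FIRING":
            groups[(alert["name"], alert["target"])].append(alert)

    messages = []
    for (name, target), members in sorted(groups.items()):
        nodes = ", ".join(sorted(a["node"] for a in members))
        worst = max(a["total_seconds"] for a in members)
        messages.append(
            f"[Alert] {name} | Target: {target} | Nodes: {nodes} | Worst total: {worst:.1f}s"
        )
    return messages


# Hypothetical input: two nodes firing for one target, one alert already resolved.
ALERTS = [
    {"name": "SlowProbe", "target": "https://hub.xxx.com", "node": "Beijing",
     "total_seconds": 7.9, "status": "FIRING"},
    {"name": "SlowProbe", "target": "https://hub.xxx.com", "node": "Shanghai",
     "total_seconds": 6.1, "status": "FIRING"},
    {"name": "SlowProbe", "target": "https://hub.xxx.com", "node": "Guangzhou",
     "total_seconds": 0.4, "status": "RESOLVED"},
]

if __name__ == "__main__":
    for message in aggregate_alerts(ALERTS):
        print(message)
```

Listing the affected nodes in one message also hints at scope: a single node suggests a regional issue, while many nodes point at the target itself.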

Through this system, enterprises can proactively detect user-side faults without waiting for user reports, improving fault discovery efficiency by over 90%, fully closing the last-mile monitoring gap.

VII. Core Architecture Advantages

Full scenario coverage: Public network/intranet, multi-protocol, multi-business scenario one-stop monitoring.
Controllable cost: Self-built model has no probing frequency limits, elastically adapting to business scale.
Deep integration: Seamless integration with existing monitoring systems, supporting custom access for self-developed systems.
Secure and controllable: Zero-trust network architecture, sensitive services support IP whitelist monitoring.
Intuitive visualization: Geo-display + precise alerting, fast fault localization.

Summary

Working from the user's perspective, this black-box probing monitoring system fills the last gap in internet companies' monitoring systems and, together with white-box monitoring, forms a complete closed loop of proactive prediction + rapid discovery. The architecture balances distributed management, security compliance, and visualization, at a cost far lower than third-party services. It is one of the best practices for large internet companies building full-link monitoring systems.

Part of series: Observability Series
