Black-box Probing Monitoring System Architecture Design and Practice for Internet Companies

In the full-link monitoring system of internet services, white-box monitoring focuses on proactively uncovering potential issues and predicting risks, while black-box monitoring is fault-oriented and rapidly detects problems that have already occurred online. Together they form a complete monitoring closed loop. Most internet companies have long had a monitoring blind spot for public network services and the user-side last mile: user-side faults are often investigated only after users report them. The black-box probing monitoring system was designed precisely to solve this industry pain point.

I. Monitoring System Positioning and Core Pain Points

1. Core Differences Between White-box and Black-box Monitoring

  • White-box monitoring: Takes the internal system perspective and uses metrics, logs, and tracing to proactively discover and predict potential issues.
  • Black-box monitoring: Simulates real user access behavior to measure the availability and access performance of externally facing services, so that faults are detected, located, and responded to quickly when they occur.

2. Common Industry Monitoring Pain Points

  • There is a gap in public network and user-side last-mile monitoring; user-side faults cannot be proactively discovered.
  • Third-party probing services are expensive, have low business scenario coverage, and struggle to adapt to enterprise customization needs.
  • Distributed probing node management is difficult; data collection, alert convergence, and security auditing are hard to balance.

II. Self-built vs. Third-party Monitoring Solution Comparison

An enterprise can forgo expensive third-party services and build its own black-box probing monitoring system. The two approaches compare as follows:

| Dimension | Self-built Probing Monitoring | Industry Third-party Probing Services |
| --- | --- | --- |
| Resource Coverage | Deployed on self-owned IDC nodes, can cover intranet scenarios | More abundant public nodes, finer geographical granularity |
| Cost Control | Unlimited probing quantity and frequency, only basic server resources required | Billed by URL/city/ISP/node/frequency, cost grows with scale |
| Core Advantages | Supports intranet probing, customizable monitoring frequency, certificate monitoring, TCP probing, self-developed CMDB integration, natively compatible with existing Prometheus+Grafana stack | Supports CDN/origin MD5 verification, rich HTTP monitoring, ad-hoc probing, endpoint packet capture |

III. Core Business Scenarios Supported

The system covers mainstream service monitoring needs of internet companies, fully adaptable to various business scenarios:

  1. Frontend service monitoring: Certificate chain validity, DNS latency, TLS handshake latency, connection establishment latency, page load latency.
  2. Network quality monitoring: ICMP probing, cross-domain internal/external network and dedicated line quality monitoring.
  3. Protocol and scenario coverage: Supports DNS, TLS, TCP, SMTP and other protocols, adapting to CDN, proxy services, enterprise email and other scenarios.
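
To make the frontend metrics listed above concrete, here is a minimal Python sketch of what a single HTTPS probe measures: DNS resolution time, TCP connection time, TLS handshake time, and the days left before the leaf certificate expires. It is an illustration only, not the system's probe engine; the function names `probe_https` and `days_until_expiry` and the returned dictionary keys are assumptions made for this example.

```python
import socket
import ssl
import time


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a certificate notAfter string."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def probe_https(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Time the DNS, TCP connect, and TLS handshake phases for one target."""
    t0 = time.perf_counter()
    # Resolve first so DNS time is measured separately from connect time.
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    t1 = time.perf_counter()

    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)
        t2 = time.perf_counter()
        # The default context verifies the certificate chain and hostname;
        # wrapping an already connected socket performs the handshake.
        ctx = ssl.create_default_context()
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            t3 = time.perf_counter()
            not_after = tls.getpeercert()["notAfter"]
    finally:
        sock.close()

    return {
        "dns_seconds": t1 - t0,
        "connect_seconds": t2 - t1,
        "tls_seconds": t3 - t2,
        "cert_days_left": days_until_expiry(not_after, time.time()),
    }


# Example call (needs network access, so it is left commented out):
# print(probe_https("example.com"))
```

A failed chain or hostname verification raises `ssl.SSLError` (or a subclass) from `wrap_socket`, which a caller could record as a failed certificate check.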

IV. Core System Architecture Design

1. Probing Node Internal Architecture

Each probing node is a self-contained, autonomous unit with its own probing, alerting, and status monitoring capabilities. Access to the node is secured and managed through a mesh proxy.

mermaid
graph TB
    BE@{ shape: rounded, label: "blackbox_exporter" } -->|Execute Probe| TS@{ shape: rounded, label: "Target Service" }
    P@{ shape: rounded, label: "Prometheus" } -->|Pull Probe Tasks| BE
    P -->|Pull Local Status| NE@{ shape: rounded, label: "node_exporter" }
    P -->|Push Alerts| AM@{ shape: rounded, label: "Alertmanager" }
    AM -->|Webhook| AR@{ shape: double-circle, label: "Alert Bot" }
    M@{ shape: rounded, label: "Mosn" } -->|Proxy 59080/59443 Ports| P
    M -->|Rate Limiting/Circuit Breaking/Auth| IN@{ shape: double-circle, label: "Internet" }

2. Public Network Overall Architecture

Adopts a distributed probing nodes + centralized data aggregation architecture, strictly separating data flows to ensure security and audit compliance.

mermaid
graph LR
    N1@{ shape: rounded, label: "Distributed Probe Point 1" } -->|geohash Location| Z@{ shape: rounded, label: "Prometheus Aggregation" }
    N2@{ shape: rounded, label: "Distributed Probe Point 2" } -->|geohash Location| Z
    N3@{ shape: rounded, label: "Distributed Probe Point 3" } -->|geohash Location| Z
    Z -->|Storage| O@{shape: cyl, label: "OSS" }
    Z -->|Query| T@{ shape: rounded, label: "Thanos Query" }
    T --> G@{ shape: rounded, label: "Grafana Visualization" }
    Z -->|Alerts| A@{ shape: rounded, label: "Alertmanager" }
    classDef primary fill:#e3f2fd,stroke:#1976d2
    classDef storage fill:#e8f5e9,stroke:#4caf50
    classDef network fill:#fff3e0,stroke:#ff9800
    classDef alert fill:#fce4ec,stroke:#e53935
    classDef process fill:#f3e5f5,stroke:#7b1fa2
    class N1,N2,N3,Z,T,G,A primary
    class O storage

3. Data Visualization Architecture

Implements geographical data visualization based on geohash + OpenStreetMap. A map view takes the place of traditional time series charts as the primary display, intuitively showing network quality across nodes nationwide.

mermaid
graph TD
    A@{ shape: rounded, label: "Distributed Probe Points" } --> B@{ shape: rounded, label: "Geohash Geo-encoding" }
    B --> C@{ shape: rounded, label: "Thanos Data Aggregation" }
    C --> D@{ shape: rounded, label: "OpenStreetMap Regional Display" }
    C --> E@{ shape: rounded, label: "Grafana Metric Charts" }
    D & E --> F@{ shape: rounded, label: "Operations/Monitoring Console" }
    classDef primary fill:#e3f2fd,stroke:#1976d2
    classDef storage fill:#e8f5e9,stroke:#4caf50
    classDef network fill:#fff3e0,stroke:#ff9800
    classDef alert fill:#fce4ec,stroke:#e53935
    classDef process fill:#f3e5f5,stroke:#7b1fa2
    class A,B,D,E,F primary
    class C alert

V. Key Technical Implementation

1. Probe Engine Selection

Adopts Prometheus + blackbox_exporter as the core probe engine, supporting HTTP/TCP/ICMP/DNS/SMTP multi-protocol probing, accurately simulating user access behavior.
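
As a rough sketch of how Prometheus drives blackbox_exporter, the Python snippet below generates one scrape job using the relabeling pattern from the blackbox_exporter documentation: the target is passed as the `target` URL parameter, and the scrape address is rewritten to the exporter itself. The helper name `blackbox_scrape_job`, the job naming scheme, the 15s interval, and the exporter address `127.0.0.1:9115` (the exporter's default port) are assumptions for illustration; the article does not publish its actual configuration. The output is JSON, which is also valid YAML and can be pasted under `scrape_configs`.

```python
import json


def blackbox_scrape_job(module, targets, exporter_addr="127.0.0.1:9115",
                        interval="15s"):
    """Build one Prometheus scrape job that probes `targets` via blackbox_exporter."""
    return {
        "job_name": f"blackbox_{module}",
        "metrics_path": "/probe",
        "scrape_interval": interval,
        # Selects a module defined in the exporter's blackbox.yml.
        "params": {"module": [module]},
        "static_configs": [{"targets": list(targets)}],
        "relabel_configs": [
            # 1. Pass the original target to the exporter as ?target=...
            {"source_labels": ["__address__"], "target_label": "__param_target"},
            # 2. Keep the probed target as the `instance` label.
            {"source_labels": ["__param_target"], "target_label": "instance"},
            # 3. Scrape the exporter instead of the target itself.
            {"target_label": "__address__", "replacement": exporter_addr},
        ],
    }


job = blackbox_scrape_job("http_2xx", ["https://hub.xxx.com"])
print(json.dumps({"scrape_configs": [job]}, indent=2))
```

Because frequency is just the `scrape_interval` of a job, a self-built setup can tune it per target without any per-probe billing.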

2. Distributed Node Management

Introduces the Mosn service mesh as a unified proxy that handles port control, rate limiting/circuit breaking, and encrypted authentication. This reduces the management cost of nationwide distributed nodes and enables a zero-trust network architecture.
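
To illustrate what rate limiting at the proxy layer means in practice, here is a generic token-bucket sketch in Python. It is not Mosn's implementation or configuration (Mosn is configured separately and the article does not show its settings); the class name `TokenBucket` and its parameters are hypothetical and only demonstrate the idea of capping how often a node's exposed ports can be pulled.

```python
class TokenBucket:
    """Generic token-bucket limiter: `rate` tokens per second, burst up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True if a request arriving at time `now` may pass."""
        elapsed = max(0.0, now - self.last)
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One request per second on average, with bursts of up to two requests.
bucket = TokenBucket(rate=1.0, capacity=2.0)
decisions = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0),
             bucket.allow(1.0), bucket.allow(1.0)]
print(decisions)
```

The clock is passed in explicitly so the behavior is deterministic; a real proxy would read a monotonic clock instead.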

3. Geo-visualization Capability

Binds probing data with geographical locations via geohash + OpenStreetMap, intuitively displaying service quality across different regions and ISPs, enabling rapid localization of regional faults.
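
For readers unfamiliar with geohash, the following Python sketch shows the standard encoding: longitude and latitude ranges are halved alternately, and every five resulting bits become one base-32 character. The function name `geohash_encode` is chosen for this example; how the system actually attaches the geohash to each probe node (for instance as a label) is not described in the article.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    value = 0      # bits collected for the current character
    bits = 0       # how many bits are in `value`
    use_lon = True  # bits alternate, starting with longitude

    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # five bits make one base-32 character
            chars.append(_BASE32[value])
            value = 0
            bits = 0
    return "".join(chars)


print(geohash_encode(42.605, -5.603, 5))
```

A useful property for regional views is that truncating a geohash yields the enclosing, coarser cell, so nodes can be grouped by region simply by comparing hash prefixes.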

4. Security and Auditing

  • Dual network control via cloud provider security groups + Mosn.
  • Strictly controls PULL/PUSH data flows, physically isolated from existing monitoring systems.
  • Full-link operations auditable, meeting enterprise security compliance requirements.

VI. Alert Mechanism and Practical Results

The system supports alert aggregation and multi-channel push. Alert information includes core dimensions such as node, target, latency, and anomaly status. Example:

【Alert】Connectivity probing takes too long
Trigger Node: Cloud Computing Beijing Node
Target Address: https://hub.xxx.com
Latency Details: Total 7.9s, DNS 3.7ms, Connection 3.2s, TLS 3.4s
Status: FIRING
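
A message like the one above could be produced by a small webhook handler that receives Alertmanager's JSON payload and renders each alert as text. The Python sketch below relies only on the documented top-level shape of that payload (an `alerts` list whose items carry `status`, `labels`, and `annotations`). The label and annotation names `node`, `instance`, `summary`, and `latency` are assumptions for this example, since the article does not list the system's actual label set.

```python
def format_alerts(payload: dict) -> list:
    """Render each alert in an Alertmanager webhook payload as a text message."""
    messages = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        notes = alert.get("annotations", {})
        messages.append(
            "[Alert] {summary}\n"
            "Trigger Node: {node}\n"
            "Target Address: {target}\n"
            "Latency Details: {latency}\n"
            "Status: {status}".format(
                # Fall back to the alert name when no summary is annotated.
                summary=notes.get("summary", labels.get("alertname", "unknown")),
                node=labels.get("node", "unknown"),
                target=labels.get("instance", "unknown"),
                latency=notes.get("latency", "n/a"),
                status=alert.get("status", "unknown").upper(),
            )
        )
    return messages


# Hypothetical payload mirroring the example alert in this article.
sample = {
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "ProbeSlow",
                   "node": "Cloud Computing Beijing Node",
                   "instance": "https://hub.xxx.com"},
        "annotations": {"summary": "Connectivity probing takes too long",
                        "latency": "Total 7.9s, DNS 3.7ms, Connection 3.2s, TLS 3.4s"},
    }],
}
print(format_alerts(sample)[0])
```

In this example the connection and TLS phases account for most of the 7.9s total, which points to the network path or the TLS endpoint rather than DNS.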

Through this system, enterprises can proactively detect user-side faults without waiting for user reports, improving fault discovery efficiency by over 90%, fully closing the last-mile monitoring gap.

VII. Core Architecture Advantages

  1. Full scenario coverage: Public network/intranet, multi-protocol, multi-business scenario one-stop monitoring.
  2. Controllable cost: Self-built model has no probing frequency limits, elastically adapting to business scale.
  3. Deep integration: Seamless integration with existing monitoring systems, supporting custom access for self-developed systems.
  4. Secure and controllable: Zero-trust network architecture, sensitive services support IP whitelist monitoring.
  5. Intuitive visualization: Geo-display plus precise alerting for rapid fault localization.

Summary

This black-box probing monitoring system takes the user's perspective to fill the last gap in internet companies' monitoring systems and, together with white-box monitoring, forms a complete closed loop of proactive prediction + rapid discovery. The architecture balances distributed management, security compliance, and visualization, at a cost far lower than third-party services. It is one of the best practices for large internet companies building full-link monitoring systems.