P2P Production Best Practices

Moving a P2P system from prototype to production involves engineering challenges across connection management, security, observability, and deployment. This article covers six critical areas with reusable code snippets and actionable guidance.

Connection Management and Resource Limits

P2P nodes must maintain a large number of simultaneous connections. Without resource caps, a node can suffer OOM crashes or file descriptor exhaustion. Production environments require strict control over three dimensions: connection counts, concurrent streams, and idle timeouts.

Three-Layer Resource Control Model

mermaid
flowchart LR
    subgraph Transport Layer
        A1["Max Inbound<br/>Connections: 1024"] --> B1["Connection Queue"]
        A2["Max Outbound<br/>Connections: 512"] --> B1
    end

    subgraph Stream Layer
        C1["Inbound Concurrent<br/>Streams: 128"] --> D1["Stream Scheduler"]
        C2["Outbound Concurrent<br/>Streams: 256"] --> D1
    end

    subgraph Idle Management
        E1["Connection Idle<br/>Timeout: 30s"] --> F1{"Health Check"}
        E2["Stream Idle<br/>Timeout: 60s"] --> F1
        F1 -->|"Pass"| G1["Update Active Timestamp"]
        F1 -->|"Fail"| G2["Close Stream/Conn<br/>Release FD & Memory"]
    end

    B1 --> C1
    B1 --> C2
    D1 --> E1
    D1 --> E2

    style A1 fill:#4CAF50,color:#fff
    style A2 fill:#4CAF50,color:#fff
    style C1 fill:#2196F3,color:#fff
    style C2 fill:#2196F3,color:#fff
    style E1 fill:#FF9800,color:#fff
    style E2 fill:#FF9800,color:#fff

Rust Connection Management

The SwarmBuilder in rust-libp2p provides a fluent API for connection configuration:

rust
use std::num::NonZeroUsize;
use std::time::Duration;
use libp2p::swarm::SwarmBuilder;

// Method names follow libp2p 0.52 and differ in other releases.
// `transport`, `behaviour`, and `peer_id` are assumed to be built earlier.
let swarm = SwarmBuilder::with_tokio_executor(transport, behaviour, peer_id)
    // Maximum concurrent inbound streams being negotiated
    .max_negotiating_inbound_streams(128)
    // Close connection after 30 seconds of inactivity
    .idle_connection_timeout(Duration::from_secs(30))
    // Handler buffer for backpressure (must be non-zero)
    .notify_handler_buffer_size(NonZeroUsize::new(64).expect("buffer size is non-zero"))
    // Per-connection event buffer
    .per_connection_event_buffer_size(64)
    .build();

Idle timeout explained: the connection idle timeout (named idle_connection_timeout in rust-libp2p 0.52 and later) is critical for preventing zombie connections. The timer starts when no active streams exist and no protocol negotiation is pending. It resets on every new stream creation or protocol completion. Recommended production value: 30-60 seconds. A shorter value causes frequent reconnects; a longer one wastes resources on dead connections.
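
The timer semantics described above can be sketched without any libp2p types. This is a minimal illustration, not the library's implementation: IdleTracker and its method names are hypothetical, and streams and pending negotiations are lumped into a single in-flight counter.

```rust
use std::time::{Duration, Instant};

/// Hypothetical idle tracker for a single connection (not a libp2p type).
struct IdleTracker {
    timeout: Duration,
    /// Active streams plus pending protocol negotiations.
    in_flight: usize,
    /// When the connection last became idle; None while work is in flight.
    idle_since: Option<Instant>,
}

impl IdleTracker {
    /// A fresh connection has no streams yet, so the timer starts immediately.
    fn new(timeout: Duration, now: Instant) -> Self {
        Self { timeout, in_flight: 0, idle_since: Some(now) }
    }

    /// A new stream or negotiation cancels the idle timer.
    fn activity_started(&mut self) {
        self.in_flight += 1;
        self.idle_since = None;
    }

    /// When the last stream or negotiation finishes, the timer restarts.
    fn activity_finished(&mut self, now: Instant) {
        self.in_flight = self.in_flight.saturating_sub(1);
        if self.in_flight == 0 {
            self.idle_since = Some(now);
        }
    }

    /// True once the connection has been idle for at least `timeout`.
    fn should_close(&self, now: Instant) -> bool {
        self.idle_since
            .map_or(false, |since| now.duration_since(since) >= self.timeout)
    }
}
```

Passing the current time in as an argument keeps the logic deterministic, which makes the 30-60 second window easy to exercise in unit tests.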

Go Resource Manager Configuration

The go-libp2p resource manager (the rcmgr package) offers granular memory and connection control:

go
import (
    "github.com/libp2p/go-libp2p"
    rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

// Uses the limit-config API found in recent go-libp2p releases.
func setupResourceManager() (libp2p.Option, error) {
    // Start from the library defaults, scaled to this machine
    scaling := rcmgr.DefaultLimits
    libp2p.SetDefaultServiceLimits(&scaling)
    defaults := scaling.AutoScale()

    // Override the system-wide (global) limits
    cfg := rcmgr.PartialLimitConfig{
        System: rcmgr.ResourceLimits{
            Streams:         1024,
            StreamsInbound:  256,
            StreamsOutbound: 768,
            Conns:           400,
            ConnsInbound:    200,
            ConnsOutbound:   200,
            Memory:          256 << 20, // 256 MB
            FD:              512,
        },
    }
    limiter := rcmgr.NewFixedLimiter(cfg.Build(defaults))

    rm, err := rcmgr.NewResourceManager(limiter)
    if err != nil {
        return nil, err
    }

    return libp2p.ResourceManager(rm), nil
}

func main() {
    rmOpt, err := setupResourceManager()
    if err != nil {
        panic(err)
    }
    h, err := libp2p.New(
        libp2p.ListenAddrStrings("/ip4/0.0.0.0/tcp/0"),
        rmOpt,
        // NewCustomGater is assumed to be defined elsewhere
        libp2p.ConnectionGater(NewCustomGater()),
    )
    if err != nil {
        panic(err)
    }
    defer h.Close()
}

Implement a ConnectionGater with allow/deny logic to intercept malicious addresses before connections are established.

Message Serialization and Compression

Message size directly consumes bandwidth, and serialization efficiency affects node throughput. Compared with JSON, a compact binary scheme can reduce network transfer by roughly 60%-80%, depending on the payload.

Serialization Comparison

| Scheme | Size | Parse Speed | Schema | Version Compat | Use Case |
|---|---|---|---|---|---|
| Protobuf | Small (~30% of JSON) | Fast | Strict .proto | Backward-compatible | General RPC, structured data |
| FlatBuffers | Medium (~40% of JSON) | Extremely fast (zero-copy) | Strict .fbs | Yes | High-frequency low-latency |
| Cap’n Proto | Small (~35% of JSON) | Extremely fast (zero-copy) | Strict | Yes | Embedded, high performance |
| MessagePack | Medium (~50% of JSON) | Medium | None | None | Lightweight JSON alternative |
| JSON | Large (baseline) | Slow | Weak (manual) | Natural | Debugging, logging, simple APIs |

Protobuf Example

protobuf
syntax = "proto3";
package p2p.message;

message Block {
    bytes  hash       = 1;   // SHA-256 content hash
    bytes  data       = 2;   // Raw data payload
    uint64 timestamp  = 3;   // Unix timestamp in ms
    uint32 ttl        = 4;   // Time-to-live in seconds
    uint32 version    = 5;   // Message format version
    bytes  signature  = 6;   // Sender signature
}

message PeerInfo {
    bytes       peer_id    = 1;
    repeated    bytes      multiaddrs = 2;
    map<string, string>   metadata   = 3;
}

Compression Strategies

| Algorithm | Ratio | Encode Speed | Decode Speed | Recommended Use |
|---|---|---|---|---|
| Snappy | 1.5-2.5x | Extremely fast (~400 MB/s) | Extremely fast (~800 MB/s) | Latency-sensitive real-time messages |
| Gzip (Level 3) | 3-5x | Medium (~50 MB/s) | Medium (~100 MB/s) | Batch sync, large data transfer |
| Zstandard (Level 1) | 2.5-4x | Fast (~300 MB/s) | Fast (~500 MB/s) | General purpose, balanced ratio/speed |

Rule of thumb: Messages < 1 KB should not be compressed—the overhead outweighs the benefit. Always use Snappy for blocks > 10 KB. Use Gzip or Zstd for batch transfers.
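
The rule of thumb can be written as a small selection function. This is only a sketch: Codec and choose_codec are hypothetical names, Zstd is picked arbitrarily from the "Gzip or Zstd" option for batches, and leaving non-batch messages between 1 KB and 10 KB uncompressed is an assumption, since the rule does not cover that band.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Codec {
    None,
    Snappy,
    Zstd,
}

const MIN_COMPRESS_BYTES: usize = 1024; // below 1 KB: never compress
const SNAPPY_BLOCK_BYTES: usize = 10 * 1024; // above 10 KB: Snappy

/// Picks a codec from the payload size and whether it is part of a batch transfer.
fn choose_codec(len: usize, is_batch: bool) -> Codec {
    if len < MIN_COMPRESS_BYTES {
        return Codec::None; // overhead outweighs the benefit
    }
    if is_batch {
        return Codec::Zstd; // batch transfers favor ratio over latency
    }
    if len > SNAPPY_BLOCK_BYTES {
        return Codec::Snappy; // large real-time blocks
    }
    // Assumption: the 1-10 KB band is not covered by the rule, so this
    // sketch leaves such messages uncompressed.
    Codec::None
}
```

Whichever policy is chosen for the middle band, the sender needs to tag each message with the codec used so the receiver knows how to decode it.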

rust
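// Snappy compression example (Rust)
use snap::raw::{Decoder, Encoder};

fn compress_block(data: &[u8]) -> Vec<u8> {
    let mut encoder = Encoder::new();
    // compress_vec allocates and returns the compressed buffer
    encoder.compress_vec(data).expect("snappy compress failed")
}

fn decompress_block(compressed: &[u8]) -> Vec<u8> {
    let mut decoder = Decoder::new();
    // decompress_vec fails on corrupt input; for data from untrusted peers,
    // return the error instead of panicking
    decoder.decompress_vec(compressed).expect("snappy decompress failed")
}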

Security Considerations

P2P networks lack trust boundaries—every node could be an attack vector. Security must be built into the protocol layer, not bolted on afterwards.

mermaid
mindmap
  root((P2P Security))
    Transport Security
      Noise Protocol
      TLS 1.3
      Perfect Forward Secrecy
    Message Verification
      Signature Verification
      Replay Attack Prevention
      Monotonic SeqNo
    Resource Protection
      Rate Limiting
      Connection Caps
      Memory Quotas
    Data Integrity
      SHA-256 Hashing
      Incremental Verification
      Merkle Tree
    Peer Management
      Blacklist
      Reputation Scoring
      Behavioral Analysis

Transport Encryption

Always enable Noise or TLS 1.3 so that all data in transit is encrypted with Perfect Forward Secrecy (PFS). The libp2p Noise integration uses the XX handshake pattern:

rust
use libp2p::identity;

// Generate the node identity keypair; the Noise handshake is authenticated with it
let id_keys = identity::Keypair::generate_ed25519();

// tokio_development_transport (libp2p 0.52 and earlier) returns a transport
// with Noise encryption already configured. As the name suggests, it targets
// development; production nodes usually assemble their transport explicitly.
let transport = libp2p::tokio_development_transport(id_keys.clone())?;

Token Bucket Rate Limiting

The Token Bucket algorithm is the standard for controlling message frequency—simple to implement and supports burst traffic:

rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

pub struct TokenBucket {
    capacity: u64,       // Maximum burst size
    tokens: AtomicU64,   // Current token count
    fill_rate: u64,      // Tokens added per second
    last_refill: Mutex<Instant>,
}

impl TokenBucket {
    pub fn new(capacity: u64, fill_rate: u64) -> Self {
        Self {
            capacity,
            tokens: AtomicU64::new(capacity),
            fill_rate,
            last_refill: Mutex::new(Instant::now()),
        }
    }

    pub fn allow(&self) -> bool {
        // 1. Refill in whole seconds while holding the timestamp lock
        if let Ok(mut last) = self.last_refill.lock() {
            let elapsed = last.elapsed().as_secs();
            if elapsed > 0 {
                let added = elapsed.saturating_mul(self.fill_rate);
                // fetch_update retries on contention, so a concurrent consume
                // in step 2 is never overwritten by the refill
                let _ = self.tokens.fetch_update(Ordering::AcqRel, Ordering::Acquire, |t| {
                    Some(t.saturating_add(added).min(self.capacity))
                });
                // Advance by the seconds actually credited so the fractional
                // remainder carries over to the next refill
                *last += Duration::from_secs(elapsed);
            }
        }

        // 2. CAS to consume one token
        loop {
            let current = self.tokens.load(Ordering::Acquire);
            if current == 0 {
                return false; // Tokens exhausted, reject
            }
            if self.tokens.compare_exchange_weak(
                current, current - 1,
                Ordering::AcqRel, Ordering::Relaxed,
            ).is_ok() {
                return true;
            }
        }
    }
}

Strategy: maintain an independent TokenBucket per peer, and dynamically adjust capacity based on the peer’s reputation score. Trusted peers get larger fill_rate values.
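
One way to derive per-peer bucket parameters from a reputation score is sketched below. Everything here is an assumption made for illustration: the names BucketParams, params_for, and PeerLimits, the base values, the 0.0 to 1.0 reputation range, and the linear scaling from 0.5x to 2x. Peers are keyed by string to keep the example self-contained.

```rust
use std::collections::HashMap;

/// Hypothetical per-peer token bucket parameters.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BucketParams {
    capacity: u64,
    fill_rate: u64,
}

/// Assumed baseline for a peer of average standing.
const BASE: BucketParams = BucketParams { capacity: 100, fill_rate: 10 };

/// Maps a reputation in 0.0..=1.0 to bucket parameters, scaling the
/// baseline linearly from 0.5x (untrusted) to 2x (fully trusted).
fn params_for(reputation: f64) -> BucketParams {
    // Treat NaN as the lowest reputation instead of letting it propagate.
    let r = if reputation.is_nan() { 0.0 } else { reputation.clamp(0.0, 1.0) };
    let factor = 0.5 + 1.5 * r;
    BucketParams {
        capacity: (BASE.capacity as f64 * factor).round() as u64,
        fill_rate: (BASE.fill_rate as f64 * factor).round() as u64,
    }
}

/// Remembers the current parameters for each peer.
struct PeerLimits {
    by_peer: HashMap<String, BucketParams>,
}

impl PeerLimits {
    fn new() -> Self {
        Self { by_peer: HashMap::new() }
    }

    /// Recomputes and stores a peer's parameters after a reputation change.
    fn set_reputation(&mut self, peer: &str, reputation: f64) -> BucketParams {
        let params = params_for(reputation);
        self.by_peer.insert(peer.to_string(), params);
        params
    }

    /// Unknown peers start at the lowest tier.
    fn params(&self, peer: &str) -> BucketParams {
        self.by_peer.get(peer).copied().unwrap_or_else(|| params_for(0.0))
    }
}
```

Starting unknown peers at the lowest tier means a flood of fresh identities gains no extra bandwidth until reputation has been earned.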

Blacklist Implementation

rust
use std::collections::HashSet;
use std::sync::RwLock;
use libp2p::PeerId;
use std::time::{Duration, Instant};

struct BlacklistEntry {
    reason: String,
    expires_at: Instant,
}

pub struct PeerBlacklist {
    permanent: RwLock<HashSet<PeerId>>,
    temporary: RwLock<std::collections::HashMap<PeerId, BlacklistEntry>>,
}

impl PeerBlacklist {
    pub fn new() -> Self {
        Self {
            permanent: RwLock::new(HashSet::new()),
            temporary: RwLock::new(std::collections::HashMap::new()),
        }
    }

    pub fn is_blocked(&self, peer: &PeerId) -> bool {
        if self.permanent.read().unwrap().contains(peer) {
            return true;
        }
        if let Some(entry) = self.temporary.read().unwrap().get(peer) {
            if Instant::now() < entry.expires_at {
                return true;
            }
            // Expired entries fall through here; this snippet never removes
            // them from the map, so a periodic purge is still needed
        }
        false
    }

    pub fn ban_permanent(&self, peer: PeerId, reason: &str) {
        self.permanent.write().unwrap().insert(peer);
    }

    pub fn ban_temporary(&self, peer: PeerId, duration: Duration, reason: &str) {
        let mut map = self.temporary.write().unwrap();
        map.insert(peer, BlacklistEntry {
            reason: reason.to_string(),
            expires_at: Instant::now() + duration,
        });
    }
}
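
Expired temporary bans still occupy memory until something removes them. A periodic purge could look like the sketch below; purge_expired is a hypothetical helper, and the map is keyed by string and stores only the expiry time to keep the example self-contained.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Removes temporary bans whose expiry has passed and returns how many
/// entries were dropped. A ban stays active while `now < expires_at`.
fn purge_expired(temporary: &mut HashMap<String, Instant>, now: Instant) -> usize {
    let before = temporary.len();
    temporary.retain(|_, expires_at| now < *expires_at);
    before - temporary.len()
}
```

Running this from a timer task every minute or so bounds the map size without taking the write lock on every lookup.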

Message Signing and Integrity Verification

rust
use libp2p::identity::{Keypair, PublicKey};
use sha2::{Digest, Sha256};

#[derive(prost::Message)]
struct SignedMessage {
    #[prost(bytes = "vec", tag = "1")]
    payload: Vec<u8>,
    #[prost(bytes = "vec", tag = "2")]
    signature: Vec<u8>,
    #[prost(uint64, tag = "3")]
    seq_no: u64,  // Monotonically increasing, prevents replay
}

fn sign_message(keypair: &Keypair, payload: &[u8], seq_no: u64) -> SignedMessage {
    // Sign payload || seq_no so the sequence number cannot be altered
    let mut data = payload.to_vec();
    data.extend_from_slice(&seq_no.to_be_bytes());
    let signature = keypair.sign(&data).expect("sign failed");
    SignedMessage { payload: payload.to_vec(), signature, seq_no }
}

// Checks the signature only; the receiver must separately reject any
// seq_no that is not greater than the last one accepted from this sender
fn verify_message(public_key: &PublicKey, msg: &SignedMessage) -> bool {
    let mut data = msg.payload.clone();
    data.extend_from_slice(&msg.seq_no.to_be_bytes());
    // PublicKey::verify returns a bool
    public_key.verify(&data, &msg.signature)
}

fn hash_block(data: &[u8]) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(data);
    hasher.finalize().into()
}
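
A valid signature alone does not stop replays: the receiver also has to remember the highest sequence number seen from each sender. A minimal sketch of that bookkeeping follows, assuming strictly increasing sequence numbers per sender. ReplayGuard is a hypothetical name, and senders are keyed by string to keep the example self-contained.

```rust
use std::collections::HashMap;

/// Tracks the highest sequence number accepted from each sender.
struct ReplayGuard {
    last_seen: HashMap<String, u64>,
}

impl ReplayGuard {
    fn new() -> Self {
        Self { last_seen: HashMap::new() }
    }

    /// Accepts `seq_no` only if it is strictly greater than the last value
    /// accepted from `sender`, and records it. Call this only after the
    /// signature check has passed, so forged messages cannot advance the counter.
    fn accept(&mut self, sender: &str, seq_no: u64) -> bool {
        match self.last_seen.get(sender) {
            Some(&last) if seq_no <= last => false,
            _ => {
                self.last_seen.insert(sender.to_string(), seq_no);
                true
            }
        }
    }
}
```

A strict comparison also drops messages that merely arrive out of order. Protocols that tolerate reordering typically use a sliding window of recent sequence numbers instead.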

Monitoring and Observability

The distributed nature of P2P makes monitoring more challenging than traditional client-server architectures. A comprehensive metrics framework is essential for rapid diagnosis.

Prometheus Metrics Collection

rust
use libp2p::metrics::Metrics;
use prometheus_client::encoding::text::encode;
use prometheus_client::metrics::{counter::Counter, family::Family};
use prometheus_client::registry::Registry;

// Label set for the custom counter: (name, value) pairs such as
// ("direction", "inbound") and ("protocol", "gossipsub")
type Labels = Vec<(String, String)>;

// libp2p's built-in metrics are recorded with the prometheus-client crate,
// so custom metrics are registered in the same Registry type
fn setup_metrics(registry: &mut Registry) -> (Metrics, Family<Labels, Counter>) {
    let metrics = Metrics::new(registry);

    // Register custom metrics; prometheus-client appends "_total" to
    // counter names when encoding, giving p2p_messages_total
    let msg_throughput = Family::<Labels, Counter>::default();
    registry.register(
        "p2p_messages",
        "Total P2P messages processed",
        msg_throughput.clone(),
    );

    (metrics, msg_throughput)
}

fn expose_metrics(registry: &Registry) -> String {
    let mut buffer = String::new();
    // On an encoding error, return an empty body rather than partial output
    if encode(&mut buffer, registry).is_err() {
        buffer.clear();
    }
    buffer
}

// Expose via HTTP endpoint (route "/metrics"), for example a warp route
// whose handler returns expose_metrics(&registry)

Key Metrics and Alert Thresholds

| Metric | Type | Description | Alert Threshold | Severity |
|---|---|---|---|---|
| libp2p_connections_opened_total | Counter | Cumulative connections opened | Sudden drop >50% in rate | Warning |
| libp2p_connections_active | Gauge | Current active connections | > 80% of configured limit | Warning |
| libp2p_dht_query_duration_seconds | Histogram | DHT lookup latency | P99 > 5s | Critical |
| libp2p_gossipsub_messages_received_total | Counter | Total Gossipsub messages received | Rate spike >300% | Warning |
| libp2p_gossipsub_peers | Gauge | Peers in Gossipsub mesh | < 4 or > 20 | Warning |
| libp2p_relay_connections_active | Gauge | Active relay connections | > 80% of limit | Warning |
| process_resident_memory_bytes | Gauge | Node RSS memory | > 80% of configured limit | Critical |
| p2p_message_processing_seconds | Histogram | Message processing latency | P99 > 1s | Warning |
| p2p_peer_dial_failures_total | Counter | Cumulative dial failures | Rate > 10/min | Warning |
| p2p_blacklist_total | Counter | Blacklisted peers count | > 20 in one minute | Info |
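
The P99 thresholds above apply to histogram metrics, which Prometheus exposes as cumulative buckets rather than ready-made quantiles. The sketch below shows how a quantile can be estimated from such buckets by linear interpolation, similar in spirit to PromQL's histogram_quantile. The function name estimate_quantile is hypothetical, and the sketch assumes finite upper bounds sorted in ascending order with the first bucket starting at zero.

```rust
/// Estimates quantile `q` (0.0..=1.0) from cumulative histogram buckets.
/// Each bucket is (upper_bound, cumulative_count), sorted by upper bound.
/// Returns None for empty input, zero observations, or an invalid `q`.
fn estimate_quantile(q: f64, buckets: &[(f64, u64)]) -> Option<f64> {
    let total = buckets.last()?.1;
    if total == 0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    // Rank of the observation we are looking for.
    let rank = q * total as f64;
    let mut lower_bound = 0.0;
    let mut lower_count = 0u64;
    for &(upper_bound, cumulative) in buckets {
        if cumulative as f64 >= rank {
            let in_bucket = (cumulative - lower_count) as f64;
            if in_bucket == 0.0 {
                return Some(upper_bound);
            }
            // Assume observations are spread evenly inside the bucket.
            let fraction = (rank - lower_count as f64) / in_bucket;
            return Some(lower_bound + (upper_bound - lower_bound) * fraction);
        }
        lower_bound = upper_bound;
        lower_count = cumulative;
    }
    None
}
```

Because the estimate is interpolated inside one bucket, its accuracy depends on the bucket boundaries. Placing a boundary at each alert threshold (for example 1s and 5s) keeps the comparison meaningful.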

Monitoring Dashboard Architecture

mermaid
flowchart TD
    subgraph Node Layer
        P1["P2P Node 1<br/>:9100/metrics"]
        P2["P2P Node 2<br/>:9100/metrics"]
        P3["P2P Node N<br/>:9100/metrics"]
    end

    subgraph Collection Layer
        PM["Prometheus Server<br/>Pull mode scraping"]
        SH["Service Discovery<br/>DNS / File / Consul"]
    end

    subgraph Storage Layer
        VC["VictoriaMetrics<br/>Long-term storage"]
        AL["AlertManager<br/>Alert routing"]
    end

    subgraph Visualization Layer
        GF["Grafana<br/>Unified dashboard"]
        NT["Notification Channels<br/>Slack / PagerDuty / Email"]
    end

    P1 --> PM
    P2 --> PM
    P3 --> PM
    SH --> PM
    PM --> VC
    VC --> GF
    PM --> AL
    AL --> NT

    style P1 fill:#4CAF50,color:#fff
    style P2 fill:#4CAF50,color:#fff
    style P3 fill:#4CAF50,color:#fff
    style PM fill:#2196F3,color:#fff
    style VC fill:#FF9800,color:#fff
    style AL fill:#f44336,color:#fff
    style GF fill:#9C27B0,color:#fff

Structured Logging

Use structured logging for compatibility with log aggregation systems (Loki / ELK):

rust
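use libp2p::PeerId;
use serde::Serialize;
use tracing::info;

#[derive(Serialize)]
struct P2PEvent {
    peer_id: String,
    event_type: String,
    duration_ms: u64,
    protocol: String,
    error: Option<String>,
}

fn log_connection_event(peer: &PeerId) {
    // Fields are recorded as key-value pairs; a JSON-formatting subscriber
    // (for example tracing_subscriber::fmt().json()) emits them as
    // structured log lines
    info!(
        target: "p2p_events",
        peer_id = %peer,
        event_type = "connection_opened",
        protocol = "libp2p",
        "connection opened"
    );
}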

Deployment Considerations

Containerizing P2P nodes differs from traditional web services—special attention is required for network configuration and state management.

Docker Containerization

dockerfile
# Multi-stage build
FROM rust:1.78-slim AS builder
WORKDIR /app
COPY Cargo.toml Cargo.lock ./
COPY src ./src
RUN cargo build --release

# Runtime stage
FROM gcr.io/distroless/cc-debian12
WORKDIR /app
COPY --from=builder /app/target/release/p2p-node /app/p2p-node

# P2P ports (Dockerfile comments must sit on their own line)
# libp2p TCP transport
EXPOSE 4001/tcp
# QUIC transport
EXPOSE 4001/udp
# Prometheus metrics
EXPOSE 9100/tcp

# Health check (exec form, because distroless images contain no shell)
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
    CMD ["/app/p2p-node", "health"]

ENTRYPOINT ["/app/p2p-node"]

Use distroless base images to reduce attack surface. P2P ports must be mapped at runtime:

yaml
# docker-compose.yml
version: "3.8"
services:
  p2p-bootstrap:
    image: p2p-node:latest
    ports:
      - "4001:4001/tcp"
      - "4001:4001/udp"   # QUIC
      - "9100:9100"
    environment:
      - P2P_BOOTSTRAP=true
      - P2P_LISTEN_ADDRS=/ip4/0.0.0.0/tcp/4001,/ip4/0.0.0.0/udp/4001/quic-v1
      - P2P_EXTERNAL_ADDRS=/ip4/${PUBLIC_IP}/tcp/4001
      - P2P_MAX_CONNECTIONS=500
      - RUST_LOG=info
      - RUST_BACKTRACE=1
    restart: always
    volumes:
      - p2p-data:/app/data
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"

volumes:
  p2p-data:

Environment Variable Configuration

All runtime parameters are injected through environment variables to keep images environment-agnostic:

| Variable | Default | Description |
|---|---|---|
| P2P_LISTEN_ADDRS | /ip4/0.0.0.0/tcp/4001 | Listen addresses (comma-separated) |
| P2P_EXTERNAL_ADDRS | - | Externally reachable addresses (required for NAT) |
| P2P_BOOTSTRAP_PEERS | - | Bootstrap peer multiaddr list |
| P2P_BOOTSTRAP | false | Whether to act as a bootstrap node |
| P2P_MAX_CONNECTIONS | 400 | Maximum concurrent connections |
| P2P_METRICS_PORT | 9100 | Prometheus metrics port |
| P2P_DATA_DIR | /app/data | Persistent data directory |
| RUST_LOG | info | Log level |
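
Reading these variables with their defaults might look like the following sketch. NodeConfig, load_config, and load_config_from_env are hypothetical names. The lookup is passed in as a closure so the parsing can be tested without touching the process environment, and treating P2P_BOOTSTRAP_PEERS as comma-separated is an assumption carried over from P2P_LISTEN_ADDRS.

```rust
use std::env;

#[derive(Debug, PartialEq)]
struct NodeConfig {
    listen_addrs: Vec<String>,
    external_addrs: Vec<String>,
    bootstrap_peers: Vec<String>,
    bootstrap: bool,
    max_connections: usize,
    metrics_port: u16,
    data_dir: String,
}

/// Builds the config from any key lookup, applying the documented defaults.
fn load_config(get: impl Fn(&str) -> Option<String>) -> Result<NodeConfig, String> {
    // Splits a comma-separated variable into trimmed, non-empty entries.
    let list = |key: &str, default: &str| -> Vec<String> {
        get(key)
            .unwrap_or_else(|| default.to_string())
            .split(',')
            .map(str::trim)
            .filter(|s| !s.is_empty())
            .map(String::from)
            .collect()
    };
    let max_connections = match get("P2P_MAX_CONNECTIONS") {
        Some(v) => v.parse::<usize>().map_err(|e| format!("P2P_MAX_CONNECTIONS: {e}"))?,
        None => 400,
    };
    let metrics_port = match get("P2P_METRICS_PORT") {
        Some(v) => v.parse::<u16>().map_err(|e| format!("P2P_METRICS_PORT: {e}"))?,
        None => 9100,
    };
    let bootstrap = match get("P2P_BOOTSTRAP") {
        Some(v) => v.parse::<bool>().map_err(|e| format!("P2P_BOOTSTRAP: {e}"))?,
        None => false,
    };
    Ok(NodeConfig {
        listen_addrs: list("P2P_LISTEN_ADDRS", "/ip4/0.0.0.0/tcp/4001"),
        external_addrs: list("P2P_EXTERNAL_ADDRS", ""),
        bootstrap_peers: list("P2P_BOOTSTRAP_PEERS", ""),
        bootstrap,
        max_connections,
        metrics_port,
        data_dir: get("P2P_DATA_DIR").unwrap_or_else(|| "/app/data".to_string()),
    })
}

/// Production entry point: reads from the process environment.
fn load_config_from_env() -> Result<NodeConfig, String> {
    load_config(|key| env::var(key).ok())
}
```

Failing fast on an unparsable value at startup is usually preferable to silently falling back to a default, because a typo in P2P_MAX_CONNECTIONS would otherwise go unnoticed.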

Graceful Shutdown

On shutdown, the node must notify peers, drain in-progress DHT operations, and persist the routing table:

rust
use std::time::Duration;

use futures::StreamExt;
use libp2p::core::transport::ListenerId;
use libp2p::swarm::Swarm;
use tokio::signal::unix::{signal, SignalKind};
use tracing::{error, info};

// MyBehaviour is assumed to be the application's NetworkBehaviour with a
// `kademlia` field; `listeners` holds the IDs returned by swarm.listen_on()
async fn graceful_shutdown(swarm: &mut Swarm<MyBehaviour>, listeners: Vec<ListenerId>) {
    // Wait for SIGINT (Ctrl+C) or SIGTERM (sent by `docker stop`)
    let mut sigterm = signal(SignalKind::terminate()).expect("failed to listen for SIGTERM");
    tokio::select! {
        res = tokio::signal::ctrl_c() => res.expect("failed to listen for SIGINT"),
        _ = sigterm.recv() => {}
    }
    info!("Shutdown signal received, starting graceful shutdown");

    // 1. Stop accepting new connections
    for id in listeners {
        swarm.remove_listener(id);
    }

    // 2. Drain in-progress events (max 10s); stop early after 1s of silence
    tokio::time::timeout(Duration::from_secs(10), async {
        loop {
            tokio::select! {
                event = swarm.select_next_some() => {
                    info!("Draining event: {:?}", std::mem::discriminant(&event));
                }
                _ = tokio::time::sleep(Duration::from_secs(1)) => break,
            }
        }
    }).await.ok();

    // 3. Persist routing table to disk
    if let Err(e) = save_routing_table(swarm).await {
        error!("Failed to persist routing table: {e}");
    }

    info!("Graceful shutdown complete");
}

async fn save_routing_table(swarm: &mut Swarm<MyBehaviour>) -> anyhow::Result<()> {
    // kbuckets() needs mutable access; peer IDs are stored as strings
    let mut peers = Vec::new();
    for bucket in swarm.behaviour_mut().kademlia.kbuckets() {
        for entry in bucket.iter() {
            peers.push(entry.node.key.preimage().to_string());
        }
    }
    let data = serde_json::to_string(&peers)?;
    tokio::fs::write("/app/data/routing_table.json", data).await?;
    info!("Persisted {} peers to routing table", peers.len());
    Ok(())
}

Health Check Endpoints

rust
use serde::Serialize;
use warp::Filter;

fn health_routes() -> impl Filter<Extract = impl warp::Reply, Error = warp::Rejection> + Clone {
    // Liveness: the process is up and serving HTTP
    let liveness = warp::path!("health" / "live")
        .map(|| warp::reply::json(&serde_json::json!({"status": "ok"})));

    // Readiness: reports the current node status
    let readiness = warp::path!("health" / "ready")
        .and(warp::any().map(move || get_node_status()))
        .map(|status| warp::reply::json(&status));

    liveness.or(readiness)
}

#[derive(Serialize)]
struct NodeStatus {
    status: String,
    connected_peers: usize,
    routing_table_size: usize,
    uptime_seconds: u64,
}

// count_connected_peers, get_routing_table_size, and get_uptime are
// assumed to be provided by the application
fn get_node_status() -> NodeStatus {
    NodeStatus {
        status: "ok".to_string(),
        connected_peers: count_connected_peers(),
        routing_table_size: get_routing_table_size(),
        uptime_seconds: get_uptime(),
    }
}

Troubleshooting

| Issue | Possible Cause | Investigation Steps | Solution |
|---|---|---|---|
| Node unreachable | Symmetric NAT | Check libp2p_nat_traversal metric | Enable Relay + DCUtR hole punching |
| Slow DHT queries | Bootstrap nodes unreachable or too few | curl localhost:9100/metrics and check libp2p_dht_query_duration_seconds | Add 3-5 stable bootstrap peers, configure a fallback list |
| Message loss | Unstable Gossipsub mesh | Verify libp2p_gossipsub_peers < 6 | Adjust D parameter to 6-8, increase heartbeat_interval |
| Memory leak | Connections not properly closed | Capture heap dump with pprof | Set connection_idle_timeout, implement Drop cleanup |
| High CPU | Heartbeat too frequent | Profile with perf top or flamegraph | Increase heartbeat interval to 1s+, reduce unnecessary protocols |
| Latency spikes | Network congestion or peer overload | Check p2p_message_processing_seconds P99 | Implement message priority queue and backpressure |
| Disk usage explosion | Unbounded logging or DHT records | du -sh /app/data | Configure log rotation, cap DHT storage size |
| File descriptor exhaustion | Exceeding connection limits | `lsof -p <pid> \| wc -l` | |

Quick Diagnostic Script

bash
#!/bin/bash
# P2P Node Health Check Script

NODE=${1:-"http://localhost:9100"}

echo "=== P2P Node Diagnostics ==="
echo "Node: $NODE"
echo ""

# Basic connectivity
echo "1. Connectivity:"
if curl -sf "$NODE/metrics" > /dev/null 2>&1; then
    echo "   ✅ Metrics endpoint reachable"
else
    echo "   ❌ Cannot reach metrics endpoint"
fi

# Active connections
CONN=$(curl -sf "$NODE/metrics" 2>/dev/null | grep "^libp2p_connections_active" | awk '{print $2}')
echo "2. Active Connections: ${CONN:-N/A}"

# DHT P99 latency
DHT_P99=$(curl -sf "$NODE/metrics" 2>/dev/null | grep "libp2p_dht_query_duration_seconds" | grep "quantile=\"0.99\"" | awk '{print $2}')
echo "3. DHT Query P99: ${DHT_P99:-N/A}s"

# Memory usage
MEM=$(curl -sf "$NODE/metrics" 2>/dev/null | grep "^process_resident_memory_bytes" | awk '{print $2}')
if [ -n "$MEM" ]; then
    MEM_MB=$((MEM / 1024 / 1024))
    echo "4. Memory: ${MEM_MB}MB"
else
    echo "4. Memory: N/A"
fi

# Dial failure count
ERR=$(curl -sf "$NODE/metrics" 2>/dev/null | grep "p2p_peer_dial_failures_total" | awk '{print $2}')
echo "5. Dial Failures Total: ${ERR:-N/A}"

Production Readiness Checklist

  • Connection caps configured (inbound + outbound)
  • Idle timeout set (30-60 seconds)
  • Protobuf / FlatBuffers selected over JSON
  • Snappy compression enabled for large blocks
  • Transport encryption enabled (Noise / TLS)
  • Message signing with SeqNo anti-replay implemented
  • Per-peer Token Bucket rate limiting active
  • Blacklist mechanism deployed
  • Prometheus metrics endpoint exposed
  • Key alert rules configured
  • Docker image uses distroless base
  • Health check endpoints implemented (liveness + readiness)
  • Graceful shutdown logic tested
  • ulimit tuned (≥ 65536)
