P2P Production Best Practices

December 1, 2024 · Network · P2P, Performance, Security, Monitoring, Operations, Serialization · Network Development Practice

Moving a P2P system from prototype to production involves engineering challenges across connection management, security, observability, and deployment. This article covers six critical areas with reusable code snippets and actionable guidance.

Connection Management and Resource Limits

P2P nodes must maintain a large number of simultaneous connections. Without resource caps, a node can run out of memory or exhaust its file descriptors. Production environments need strict control over three dimensions: connection counts, stream counts, and idle timeouts.

Three-Layer Resource Control Model

mermaid
flowchart TD
    TL["Transport Limits<br/>Inbound 1024 / Outbound 512"] --> Q["Connection Queue"]
    Q --> SL["Stream Limits<br/>Inbound 128 / Outbound 256"]
    SL --> I{"Health Check<br/>Idle 30s/60s"}
    I -->|"Pass"| OK["Update Active Timestamp"]
    I -->|"Fail"| CL["Close Stream/Conn<br/>Release FD & Memory"]

    classDef src fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    classDef proc fill:#fff3e0,stroke:#FF9800,color:#BF360C
    classDef decision fill:#f3e5f5,stroke:#9C27B0,color:#4A148C
    classDef ok fill:#c8e6c9,stroke:#4CAF50,color:#1B5E20
    classDef bad fill:#ffcdd2,stroke:#f44336,color:#B71C1C
    class TL src
    class Q,SL proc
    class I decision
    class OK ok
    class CL bad

Rust Connection Management

rust-libp2p's SwarmBuilder provides a fluent API for connection configuration:

rust
use std::num::NonZeroUsize;
use std::time::Duration;
use libp2p::swarm::SwarmBuilder;

// Builder API as shipped in libp2p 0.52; later releases move these settings
// to libp2p::swarm::Config (with_idle_connection_timeout and friends).
let swarm = SwarmBuilder::with_tokio_executor(transport, behaviour, peer_id)
    // Maximum concurrent inbound streams being negotiated
    .max_negotiating_inbound_streams(128)
    // Close connection after 30 seconds of inactivity
    .idle_connection_timeout(Duration::from_secs(30))
    // Handler buffer for backpressure (the API takes a NonZeroUsize)
    .notify_handler_buffer_size(NonZeroUsize::new(64).expect("buffer size must be non-zero"))
    // Connection event buffer
    .per_connection_event_buffer_size(64)
    .build();

Idle timeout explained: idle_connection_timeout is critical for preventing zombie connections. The timer starts when no active streams exist and no protocol negotiation is pending. It resets on every new stream creation or protocol completion. Recommended production value: 30-60 seconds. A shorter value causes frequent reconnects; a longer one wastes file descriptors and memory on dead peers.
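
The timer semantics described above can be modelled in a few lines. The following is a standalone sketch, not rust-libp2p's internal implementation: `IdleTracker`, `work_started`, and `work_finished` are hypothetical names, and "work" stands for either an open stream or a pending protocol negotiation.

```rust
use std::time::{Duration, Instant};

/// Hypothetical model of a per-connection idle timer.
pub struct IdleTracker {
    timeout: Duration,
    in_flight: usize,            // open streams + pending negotiations
    idle_since: Option<Instant>, // Some(t) while the timer is running
}

impl IdleTracker {
    /// A fresh connection has nothing in flight, so the timer starts at `now`.
    pub fn new(timeout: Duration, now: Instant) -> Self {
        Self { timeout, in_flight: 0, idle_since: Some(now) }
    }

    /// A stream was opened or a negotiation started: cancel the timer.
    pub fn work_started(&mut self) {
        self.in_flight += 1;
        self.idle_since = None;
    }

    /// A stream closed or a negotiation finished: restart the timer
    /// once nothing is left in flight.
    pub fn work_finished(&mut self, now: Instant) {
        self.in_flight = self.in_flight.saturating_sub(1);
        if self.in_flight == 0 {
            self.idle_since = Some(now);
        }
    }

    /// True when the connection has been idle for at least `timeout`.
    pub fn should_close(&self, now: Instant) -> bool {
        match self.idle_since {
            Some(start) => now.saturating_duration_since(start) >= self.timeout,
            None => false,
        }
    }
}
```

Passing `now` explicitly instead of calling `Instant::now()` inside the tracker keeps the model deterministic, which makes the timeout logic easy to unit-test.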

Go Resource Manager Configuration

go-libp2p's resource manager (the rcmgr package) offers granular memory and connection control:

go
package main

import (
    "github.com/libp2p/go-libp2p"
    rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

func setupResourceManager() (libp2p.Option, error) {
    // Start from the library defaults, scaled to this machine's memory and FDs
    scalingLimits := rcmgr.DefaultLimits
    libp2p.SetDefaultServiceLimits(&scalingLimits)
    scaled := scalingLimits.AutoScale()

    // Override the system-wide (global) scope; all other scopes keep the scaled defaults
    cfg := rcmgr.PartialLimitConfig{
        System: rcmgr.ResourceLimits{
            Streams:         1024,
            StreamsInbound:  256,
            StreamsOutbound: 768,
            Conns:           400,
            ConnsInbound:    200,
            ConnsOutbound:   200,
            Memory:          256 << 20, // 256 MB
            FD:              512,
        },
    }

    limiter := rcmgr.NewFixedLimiter(cfg.Build(scaled))
    rm, err := rcmgr.NewResourceManager(limiter)
    if err != nil {
        return nil, err
    }

    return libp2p.ResourceManager(rm), nil
}

func main() {
    rmOpt, err := setupResourceManager()
    if err != nil {
        panic(err)
    }
    h, err := libp2p.New(
        libp2p.ListenAddrStrings("/ip4/0.0.0.0/tcp/0"),
        rmOpt,
        // NewCustomGater is application code returning a connmgr.ConnectionGater
        libp2p.ConnectionGater(NewCustomGater()),
    )
    if err != nil {
        panic(err)
    }
    defer h.Close()
}

Implement a ConnectionGater with allow/deny logic to intercept malicious addresses before connections are established.
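
The allow/deny decision itself is independent of the libp2p binding. As a rough illustration in Rust, the sketch below checks a remote IP against an exact-match deny set and a list of denied IPv4 prefixes before a dial or accept proceeds. `AddrFilter` and its methods are hypothetical names, not part of the go-libp2p `ConnectionGater` interface; a real gater would call logic like this from its intercept callbacks.

```rust
use std::collections::HashSet;
use std::net::{IpAddr, Ipv4Addr};

/// Hypothetical deny-list filter consulted before a connection is established.
pub struct AddrFilter {
    denied_ips: HashSet<IpAddr>,
    denied_v4_nets: Vec<(Ipv4Addr, u8)>, // (network address, prefix length)
}

impl AddrFilter {
    pub fn new() -> Self {
        Self { denied_ips: HashSet::new(), denied_v4_nets: Vec::new() }
    }

    /// Deny a single address.
    pub fn deny_ip(&mut self, ip: IpAddr) {
        self.denied_ips.insert(ip);
    }

    /// Deny a whole IPv4 prefix, e.g. 198.51.100.0/24.
    pub fn deny_v4_net(&mut self, net: Ipv4Addr, prefix_len: u8) {
        self.denied_v4_nets.push((net, prefix_len.min(32)));
    }

    /// Returns false if the address matches any deny rule.
    pub fn allow_dial(&self, ip: IpAddr) -> bool {
        if self.denied_ips.contains(&ip) {
            return false;
        }
        if let IpAddr::V4(v4) = ip {
            for (net, len) in &self.denied_v4_nets {
                // A /0 prefix matches everything; otherwise keep the top `len` bits.
                let mask = if *len == 0 { 0 } else { u32::MAX << (32 - u32::from(*len)) };
                if u32::from(v4) & mask == u32::from(*net) & mask {
                    return false;
                }
            }
        }
        true
    }
}
```

Checking the address before the handshake is what makes gating cheap: a denied peer never consumes a stream, a handshake, or a rate-limit token.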

Message Serialization and Compression

Message size directly consumes bandwidth, and serialization efficiency affects node throughput. Compared with a JSON baseline, a schema-based binary format can cut bytes on the wire by roughly 60%-70%, according to the size estimates in the comparison below.

Serialization Comparison

Scheme	Size	Parse Speed	Schema	Version Compat	Use Case
Protobuf	Small (~30% of JSON)	Fast	Strict `.proto`	Backward-compatible	General RPC, structured data
FlatBuffers	Medium (~40% of JSON)	Extremely fast (zero-copy)	Strict `.fbs`	Yes	High-frequency low-latency
Cap’n Proto	Small (~35% of JSON)	Extremely fast (zero-copy)	Strict	Yes	Embedded, high performance
MessagePack	Medium (~50% of JSON)	Medium	None	None	Lightweight JSON alternative
JSON	Large (baseline)	Slow	Weak (manual)	Natural	Debugging, logging, simple APIs

Protobuf Example

protobuf
syntax = "proto3";
package p2p.message;

message Block {
    bytes  hash       = 1;   // SHA-256 content hash
    bytes  data       = 2;   // Raw data payload
    uint64 timestamp  = 3;   // Unix timestamp in ms
    uint32 ttl        = 4;   // Time-to-live in seconds
    uint32 version    = 5;   // Message format version
    bytes  signature  = 6;   // Sender signature
}

message PeerInfo {
    bytes       peer_id    = 1;
    repeated    bytes      multiaddrs = 2;
    map<string, string>   metadata   = 3;
}

Compression Strategies

Algorithm	Ratio	Encode Speed	Decode Speed	Recommended Use
Snappy	1.5-2.5x	Extremely fast (~400MB/s)	Extremely fast (~800MB/s)	Latency-sensitive real-time messages
Gzip (Level 3)	3-5x	Medium (~50MB/s)	Medium (~100MB/s)	Batch sync, large data transfer
Zstandard (Level 1)	2.5-4x	Fast (~300MB/s)	Fast (~500MB/s)	General purpose, balanced ratio/speed

Rule of thumb: do not compress messages smaller than 1 KB, because the framing and CPU overhead outweighs the savings. Prefer Snappy for latency-sensitive blocks larger than 10 KB, and Gzip or Zstd for batch transfers where compression ratio matters more than speed.

rust
// Snappy compression example (Rust, `snap` crate)
use snap::raw::{Decoder, Encoder};

fn compress_block(data: &[u8]) -> Vec<u8> {
    let mut encoder = Encoder::new();
    // compress_vec allocates and returns the compressed buffer
    encoder.compress_vec(data).expect("snappy compress failed")
}

fn decompress_block(compressed: &[u8]) -> Vec<u8> {
    let mut decoder = Decoder::new();
    // Fails on corrupt input; production code should return the error instead of panicking
    decoder.decompress_vec(compressed).expect("snappy decompress failed")
}
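
The size-based rule of thumb from this section can be written as a small selection function. This is only a sketch: `Codec` and `choose_codec` are hypothetical names, and the treatment of non-batch messages between 1 KB and 10 KB (Snappy here) is an assumption, since the rule of thumb does not cover that range.

```rust
/// Hypothetical codec tag that a sender could place in the frame header.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Codec {
    None,
    Snappy,
    Zstd,
}

/// Pick a codec from the payload size and the kind of transfer.
pub fn choose_codec(payload_len: usize, is_batch_transfer: bool) -> Codec {
    // Below 1 KB the compression overhead outweighs the benefit.
    const MIN_COMPRESS_LEN: usize = 1024;

    if payload_len < MIN_COMPRESS_LEN {
        Codec::None
    } else if is_batch_transfer {
        // Batch sync favours ratio over latency.
        Codec::Zstd
    } else {
        // Assumption: everything else at or above 1 KB is latency-sensitive.
        Codec::Snappy
    }
}
```

Whatever thresholds are chosen, the receiver needs to know which codec was applied, so the choice should travel with the message rather than being inferred from its size.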

Security Considerations

P2P networks lack trust boundaries—every node could be an attack vector. Security must be built into the protocol layer, not bolted on afterwards.

mermaid
mindmap
  root((P2P Security))
    Transport Security
      Noise Protocol
      TLS 1.3
      Perfect Forward Secrecy
    Message Verification
      Signature Verification
      Replay Attack Prevention
      Monotonic SeqNo
    Resource Protection
      Rate Limiting
      Connection Caps
      Memory Quotas
    Data Integrity
      SHA-256 Hashing
      Incremental Verification
      Merkle Tree
    Peer Management
      Blacklist
      Reputation Scoring
      Behavioral Analysis

Transport Encryption

Always enable Noise or TLS 1.3 so that all data in transit is encrypted with Perfect Forward Secrecy (PFS). libp2p's Noise transport uses the XX handshake pattern (Noise_XX_25519_ChaChaPoly_SHA256):

rust
use libp2p::{identity, noise};

// Generate the node's identity keypair
let id_keys = identity::Keypair::generate_ed25519();

// Standalone Noise upgrade for hand-built transports (libp2p 0.52);
// the static DH key is generated and signed with the identity key internally
let noise_config = noise::Config::new(&id_keys)?;

// tokio_development_transport (libp2p 0.52 and earlier) already bundles
// TCP + Noise + Yamux, so it needs no separate Noise setup
let transport = libp2p::tokio_development_transport(id_keys.clone())?;

Token Bucket Rate Limiting

The Token Bucket algorithm is the standard for controlling message frequency—simple to implement and supports burst traffic:

rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

pub struct TokenBucket {
    capacity: u64,       // Maximum burst size
    tokens: AtomicU64,   // Current token count
    fill_rate: u64,      // Tokens added per second
    last_refill: Mutex<Instant>,
}

impl TokenBucket {
    pub fn new(capacity: u64, fill_rate: u64) -> Self {
        Self {
            capacity,
            tokens: AtomicU64::new(capacity),
            fill_rate,
            last_refill: Mutex::new(Instant::now()),
        }
    }

    pub fn allow(&self) -> bool {
        // 1. Refill under the lock, crediting whole elapsed seconds only
        if let Ok(mut last) = self.last_refill.lock() {
            let elapsed = last.elapsed().as_secs();
            if elapsed > 0 {
                let refill = elapsed.saturating_mul(self.fill_rate);
                // fetch_update retries on contention, so a concurrent
                // consume in step 2 is never overwritten
                let _ = self.tokens.fetch_update(Ordering::AcqRel, Ordering::Acquire, |t| {
                    Some(t.saturating_add(refill).min(self.capacity))
                });
                // Advance by the credited seconds so the fractional remainder is kept
                *last += Duration::from_secs(elapsed);
            }
        }

        // 2. CAS to consume one token
        loop {
            let current = self.tokens.load(Ordering::Acquire);
            if current == 0 {
                return false; // Token exhausted, reject
            }
            if self.tokens.compare_exchange_weak(
                current, current - 1,
                Ordering::AcqRel, Ordering::Relaxed,
            ).is_ok() {
                return true;
            }
        }
    }
}

Strategy: maintain an independent TokenBucket per peer and adjust its parameters based on the peer's reputation score. Trusted peers get a larger capacity and fill_rate, while low-reputation peers are throttled harder.
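
One way to derive per-peer parameters from a reputation score is sketched below. Everything here is an assumption for illustration: the 0-100 reputation scale, the linear 0.5x to 2x scaling, and the names `BucketParams`, `params_for_reputation`, and `PeerLimits` do not come from libp2p. Peers are keyed by a string form of their ID to keep the example free of external crates.

```rust
use std::collections::HashMap;

/// Token bucket parameters for one peer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BucketParams {
    pub capacity: u64,  // maximum burst size
    pub fill_rate: u64, // tokens added per second
}

/// Scale the base parameters linearly: 0.5x at reputation 0, 2x at reputation 100.
pub fn params_for_reputation(base: BucketParams, reputation: u8) -> BucketParams {
    let rep = u64::from(reputation.min(100));
    let percent = 50 + rep * 150 / 100; // 50..=200
    BucketParams {
        capacity: (base.capacity.saturating_mul(percent) / 100).max(1),
        fill_rate: (base.fill_rate.saturating_mul(percent) / 100).max(1),
    }
}

/// Hypothetical registry of per-peer parameters.
pub struct PeerLimits {
    base: BucketParams,
    per_peer: HashMap<String, BucketParams>,
}

impl PeerLimits {
    pub fn new(base: BucketParams) -> Self {
        Self { base, per_peer: HashMap::new() }
    }

    /// Recompute and store the parameters after a reputation change.
    pub fn update(&mut self, peer_id: &str, reputation: u8) -> BucketParams {
        let params = params_for_reputation(self.base, reputation);
        self.per_peer.insert(peer_id.to_string(), params);
        params
    }

    /// Peers without a recorded reputation get the base parameters.
    pub fn get(&self, peer_id: &str) -> BucketParams {
        self.per_peer.get(peer_id).copied().unwrap_or(self.base)
    }
}
```

Clamping both values to at least 1 keeps a low-reputation peer throttled rather than silently blocked; outright blocking is the blacklist's job.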

Blacklist Implementation

rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::time::{Duration, Instant};
use libp2p::PeerId;

struct BlacklistEntry {
    reason: String, // Kept for audit logs
    expires_at: Instant,
}

pub struct PeerBlacklist {
    // Peer -> ban reason
    permanent: RwLock<HashMap<PeerId, String>>,
    temporary: RwLock<HashMap<PeerId, BlacklistEntry>>,
}

impl PeerBlacklist {
    pub fn new() -> Self {
        Self {
            permanent: RwLock::new(HashMap::new()),
            temporary: RwLock::new(HashMap::new()),
        }
    }

    pub fn is_blocked(&self, peer: &PeerId) -> bool {
        if self.permanent.read().unwrap().contains_key(peer) {
            return true;
        }
        // An expired entry no longer blocks; purge_expired() removes it later
        let temporary = self.temporary.read().unwrap();
        temporary
            .get(peer)
            .map_or(false, |entry| Instant::now() < entry.expires_at)
    }

    pub fn ban_permanent(&self, peer: PeerId, reason: &str) {
        self.permanent.write().unwrap().insert(peer, reason.to_string());
    }

    pub fn ban_temporary(&self, peer: PeerId, duration: Duration, reason: &str) {
        let mut map = self.temporary.write().unwrap();
        map.insert(peer, BlacklistEntry {
            reason: reason.to_string(),
            expires_at: Instant::now() + duration,
        });
    }

    // Call periodically so expired temporary bans do not accumulate
    pub fn purge_expired(&self) {
        let now = Instant::now();
        self.temporary.write().unwrap().retain(|_, entry| now < entry.expires_at);
    }
}

Message Signing and Integrity Verification

rust
use libp2p::identity::{Keypair, PublicKey};
use sha2::{Digest, Sha256};

// prost's derive also requires Clone and PartialEq, and tags are string literals
#[derive(Clone, PartialEq, prost::Message)]
struct SignedMessage {
    #[prost(bytes = "vec", tag = "1")]
    payload: Vec<u8>,
    #[prost(bytes = "vec", tag = "2")]
    signature: Vec<u8>,
    #[prost(uint64, tag = "3")]
    seq_no: u64,  // Monotonically increasing; the receiver must reject values it has already seen
}

fn sign_message(keypair: &Keypair, payload: &[u8], seq_no: u64) -> SignedMessage {
    // The signature covers the payload followed by the big-endian seq_no
    let mut data = payload.to_vec();
    data.extend_from_slice(&seq_no.to_be_bytes());
    let signature = keypair.sign(&data).expect("sign failed");
    SignedMessage { payload: payload.to_vec(), signature, seq_no }
}

fn verify_message(public_key: &PublicKey, msg: &SignedMessage) -> bool {
    let mut data = msg.payload.clone();
    data.extend_from_slice(&msg.seq_no.to_be_bytes());
    // PublicKey::verify returns a bool, not a Result
    public_key.verify(&data, &msg.signature)
}

fn hash_block(data: &[u8]) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(data);
    hasher.finalize().into()
}
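
A valid signature alone does not stop a replay: the receiver also has to remember the highest sequence number it has accepted from each sender. A minimal sketch of that bookkeeping follows. `ReplayGuard` is a hypothetical name, peers are keyed by their raw ID bytes to avoid external crates, and the strictly-increasing policy is an assumption that rejects out-of-order delivery along with replays.

```rust
use std::collections::HashMap;

/// Hypothetical per-sender replay filter. Call `accept` only after the
/// signature over (payload, seq_no) has been verified.
pub struct ReplayGuard {
    last_seen: HashMap<Vec<u8>, u64>, // peer ID bytes -> highest accepted seq_no
}

impl ReplayGuard {
    pub fn new() -> Self {
        Self { last_seen: HashMap::new() }
    }

    /// Returns true and records `seq_no` if it is strictly greater than the
    /// last accepted value for this peer; returns false for replays.
    pub fn accept(&mut self, peer_id: &[u8], seq_no: u64) -> bool {
        match self.last_seen.get(peer_id) {
            Some(&last) if seq_no <= last => false,
            _ => {
                self.last_seen.insert(peer_id.to_vec(), seq_no);
                true
            }
        }
    }
}
```

If messages can legitimately arrive out of order, a sliding window of recently seen sequence numbers is the usual alternative to a single high-water mark.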

Monitoring and Observability

The distributed nature of P2P makes monitoring more challenging than traditional client-server architectures. A comprehensive metrics framework is essential for rapid diagnosis.

Prometheus Metrics Collection

rust
// libp2p's metrics are built on the prometheus_client crate (not the
// prometheus crate), so the registry and custom metrics use its types.
use libp2p::metrics::{Metrics, Registry};
use prometheus_client::encoding::text::encode;
use prometheus_client::metrics::{counter::Counter, family::Family};

// Label set for the custom counter, e.g. [("direction", "inbound"), ("protocol", "gossipsub")]
type MsgLabels = Vec<(String, String)>;

fn setup_metrics(registry: &mut Registry) -> (Metrics, Family<MsgLabels, Counter>) {
    // Registers the built-in libp2p metrics
    let metrics = Metrics::new(registry);

    // Register custom metrics; counters are exported with a "_total" suffix,
    // so this one appears as p2p_messages_total
    let msg_throughput = Family::<MsgLabels, Counter>::default();
    registry.register(
        "p2p_messages",
        "Total P2P messages processed",
        msg_throughput.clone(),
    );

    (metrics, msg_throughput)
}

fn expose_metrics(registry: &Registry) -> String {
    let mut buffer = String::new();
    encode(&mut buffer, registry).expect("writing metrics to a String failed");
    buffer
}

// Expose via HTTP endpoint (route "/metrics"), e.g. with the registry in an Arc<Mutex<Registry>>:
// warp::path!("metrics").map(move || expose_metrics(&registry.lock().unwrap()))

Key Metrics and Alert Thresholds

Metric	Type	Description	Alert Threshold	Severity
`libp2p_connections_opened_total`	Counter	Cumulative connections opened	Sudden drop >50% in rate	Warning
`libp2p_connections_active`	Gauge	Current active connections	> 80% of configured limit	Warning
`libp2p_dht_query_duration_seconds`	Histogram	DHT lookup latency	P99 > 5s	Critical
`libp2p_gossipsub_messages_received_total`	Counter	Total Gossipsub messages received	Rate spike >300%	Warning
`libp2p_gossipsub_peers`	Gauge	Peers in Gossipsub mesh	< 4 or > 20	Warning
`libp2p_relay_connections_active`	Gauge	Active relay connections	> 80% of limit	Warning
`process_resident_memory_bytes`	Gauge	Node RSS memory	> 80% of configured limit	Critical
`p2p_message_processing_seconds`	Histogram	Message processing latency	P99 > 1s	Warning
`p2p_peer_dial_failures_total`	Counter	Cumulative dial failures	Rate > 10/min	Warning
`p2p_blacklist_total`	Counter	Blacklisted peers count	> 20 in one minute	Info
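
In practice these thresholds would normally live in Prometheus alerting rules, but the arithmetic behind three of them is easy to make concrete. The sketch below evaluates the connection, Gossipsub mesh, and memory thresholds from the table against a point-in-time snapshot. `Snapshot`, `Alert`, and `evaluate` are hypothetical names introduced for illustration.

```rust
/// Alerts corresponding to three rows of the table above.
#[derive(Debug, PartialEq, Eq)]
pub enum Alert {
    ConnectionsNearLimit, // libp2p_connections_active > 80% of limit
    MeshSizeOutOfRange,   // libp2p_gossipsub_peers < 4 or > 20
    MemoryNearLimit,      // process_resident_memory_bytes > 80% of limit
}

/// Hypothetical point-in-time reading of the relevant gauges.
pub struct Snapshot {
    pub connections_active: u64,
    pub connection_limit: u64,
    pub gossipsub_peers: u64,
    pub rss_bytes: u64,
    pub memory_limit_bytes: u64,
}

/// value / limit > 0.8, rewritten as value * 5 > limit * 4 to stay in integers.
fn above_80_percent(value: u64, limit: u64) -> bool {
    u128::from(value) * 5 > u128::from(limit) * 4
}

pub fn evaluate(s: &Snapshot) -> Vec<Alert> {
    let mut alerts = Vec::new();
    if above_80_percent(s.connections_active, s.connection_limit) {
        alerts.push(Alert::ConnectionsNearLimit);
    }
    if s.gossipsub_peers < 4 || s.gossipsub_peers > 20 {
        alerts.push(Alert::MeshSizeOutOfRange);
    }
    if above_80_percent(s.rss_bytes, s.memory_limit_bytes) {
        alerts.push(Alert::MemoryNearLimit);
    }
    alerts
}
```

Rate-based thresholds such as "sudden drop >50%" need a time series rather than a snapshot, which is why they are better left to the monitoring system.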

Monitoring Dashboard Architecture

mermaid
flowchart TD
    NODES["P2P Node Cluster<br/>/metrics endpoints"] --> PM["Prometheus<br/>Pull-mode scraping"]
    SH["Service Discovery<br/>DNS / File / Consul"] --> PM
    PM --> STORE["VictoriaMetrics Storage<br/>AlertManager Routing"]
    STORE --> VIEW["Grafana Dashboard<br/>Slack / PagerDuty / Email"]

    classDef src fill:#bbdefb,stroke:#2196F3,color:#1B5E20
    classDef proc fill:#fff3e0,stroke:#FF9800,color:#BF360C
    classDef store fill:#c8e6c9,stroke:#4CAF50,color:#1B5E20
    classDef view fill:#f3e5f5,stroke:#9C27B0,color:#4A148C
    class NODES src
    class PM,SH proc
    class STORE store
    class VIEW view

Structured Logging

Use structured logging for compatibility with log aggregation systems (Loki / ELK):

rust
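use libp2p::PeerId;
use serde::Serialize;
use tracing::info;

// Shape of a serialized event, for sinks that accept JSON directly
#[derive(Serialize)]
struct P2PEvent {
    peer_id: String,
    event_type: String,
    duration_ms: u64,
    protocol: String,
    error: Option<String>,
}

fn log_connection_event(peer: &PeerId) {
    // Fields are emitted as key-value pairs; pair this with a JSON formatter
    // such as tracing_subscriber::fmt().json() for Loki / ELK ingestion
    info!(
        target: "p2p_events",
        peer_id = %peer,
        event_type = "connection_opened",
        protocol = "libp2p",
        "connection opened"
    );
}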

Deployment Considerations

Containerizing P2P nodes differs from traditional web services—special attention is required for network configuration and state management.

Docker Containerization

dockerfile
# Multi-stage build
FROM rust:1.78-slim AS builder
WORKDIR /app
COPY Cargo.toml Cargo.lock ./
COPY src ./src
RUN cargo build --release

# Runtime stage
FROM gcr.io/distroless/cc-debian12
WORKDIR /app
COPY --from=builder /app/target/release/p2p-node /app/p2p-node

# P2P ports (Dockerfile comments must be on their own line):
# 4001/tcp libp2p transport, 4001/udp QUIC transport, 9100/tcp Prometheus metrics
EXPOSE 4001/tcp
EXPOSE 4001/udp
EXPOSE 9100/tcp

# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
    CMD ["/app/p2p-node", "health"]

ENTRYPOINT ["/app/p2p-node"]

Use distroless base images to reduce attack surface. P2P ports must be mapped at runtime:

yaml
# docker-compose.yml
version: "3.8"
services:
  p2p-bootstrap:
    image: p2p-node:latest
    ports:
      - "4001:4001/tcp"
      - "4001:4001/udp"   # QUIC
      - "9100:9100"
    environment:
      - P2P_BOOTSTRAP=true
      - P2P_LISTEN_ADDRS=/ip4/0.0.0.0/tcp/4001,/ip4/0.0.0.0/udp/4001/quic-v1
      - P2P_EXTERNAL_ADDRS=/ip4/${PUBLIC_IP}/tcp/4001
      - P2P_MAX_CONNECTIONS=500
      - RUST_LOG=info
      - RUST_BACKTRACE=1
    restart: always
    volumes:
      - p2p-data:/app/data
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"

volumes:
  p2p-data:

Environment Variable Configuration

All runtime parameters are injected through environment variables to keep images environment-agnostic:

Variable	Default	Description
`P2P_LISTEN_ADDRS`	`/ip4/0.0.0.0/tcp/4001`	Listen addresses (comma-separated)
`P2P_EXTERNAL_ADDRS`	-	Externally reachable addresses (required for NAT)
`P2P_BOOTSTRAP_PEERS`	-	Bootstrap peer multiaddr list
`P2P_BOOTSTRAP`	`false`	Whether to act as a bootstrap node
`P2P_MAX_CONNECTIONS`	`400`	Maximum concurrent connections
`P2P_METRICS_PORT`	`9100`	Prometheus metrics port
`P2P_DATA_DIR`	`/app/data`	Persistent data directory
`RUST_LOG`	`info`	Log level
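
A loader for these variables might look like the sketch below. It is an illustration under stated assumptions: `NodeConfig` and `load_config` are hypothetical names, `P2P_BOOTSTRAP_PEERS` is assumed to be comma-separated like the listen addresses, and the lookup is passed in as a closure so the parsing can be tested without touching the process environment.

```rust
/// Hypothetical runtime configuration mirroring the table above.
#[derive(Debug, PartialEq)]
pub struct NodeConfig {
    pub listen_addrs: Vec<String>,
    pub external_addrs: Vec<String>,
    pub bootstrap_peers: Vec<String>,
    pub bootstrap: bool,
    pub max_connections: usize,
    pub metrics_port: u16,
    pub data_dir: String,
    pub log_level: String,
}

/// Split a comma-separated list, dropping empty items.
fn split_list(raw: &str) -> Vec<String> {
    raw.split(',').map(str::trim).filter(|s| !s.is_empty()).map(String::from).collect()
}

/// Build the config from a lookup function, applying the documented defaults.
pub fn load_config<F>(get: F) -> Result<NodeConfig, String>
where
    F: Fn(&str) -> Option<String>,
{
    let list = |key: &str, default: &str| {
        split_list(&get(key).unwrap_or_else(|| default.to_string()))
    };
    let bootstrap = match get("P2P_BOOTSTRAP").as_deref().map(str::trim) {
        None | Some("false") | Some("0") => false,
        Some("true") | Some("1") => true,
        Some(other) => return Err(format!("P2P_BOOTSTRAP: invalid boolean {other:?}")),
    };
    let max_connections = match get("P2P_MAX_CONNECTIONS") {
        Some(v) => v.trim().parse::<usize>().map_err(|e| format!("P2P_MAX_CONNECTIONS: {e}"))?,
        None => 400,
    };
    let metrics_port = match get("P2P_METRICS_PORT") {
        Some(v) => v.trim().parse::<u16>().map_err(|e| format!("P2P_METRICS_PORT: {e}"))?,
        None => 9100,
    };
    Ok(NodeConfig {
        listen_addrs: list("P2P_LISTEN_ADDRS", "/ip4/0.0.0.0/tcp/4001"),
        external_addrs: list("P2P_EXTERNAL_ADDRS", ""),
        bootstrap_peers: list("P2P_BOOTSTRAP_PEERS", ""),
        bootstrap,
        max_connections,
        metrics_port,
        data_dir: get("P2P_DATA_DIR").unwrap_or_else(|| "/app/data".to_string()),
        log_level: get("RUST_LOG").unwrap_or_else(|| "info".to_string()),
    })
}
```

At startup the node would call `load_config(|key| std::env::var(key).ok())` and exit with the returned error message if a value fails to parse, so a misconfigured container fails fast instead of running with silent defaults.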

Graceful Shutdown

On shutdown, the node must notify peers, drain in-progress DHT operations, and persist the routing table:

rust
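use std::time::Duration;

use futures::StreamExt;
use libp2p::core::transport::ListenerId;
use libp2p::swarm::Swarm;
use tokio::signal::unix::{signal, SignalKind};
use tracing::{error, info};

// MyBehaviour is the application's NetworkBehaviour, assumed to have a `kademlia` field.
// `listeners` holds the ListenerIds returned by swarm.listen_on().
async fn graceful_shutdown(swarm: &mut Swarm<MyBehaviour>, listeners: Vec<ListenerId>) {
    // Wait for SIGINT or SIGTERM (Unix only; `docker stop` sends SIGTERM)
    let mut sigint = signal(SignalKind::interrupt()).expect("failed to listen for SIGINT");
    let mut sigterm = signal(SignalKind::terminate()).expect("failed to listen for SIGTERM");
    tokio::select! {
        _ = sigint.recv() => {}
        _ = sigterm.recv() => {}
    }
    info!("Shutdown signal received, starting graceful shutdown");

    // 1. Stop accepting new connections
    for id in listeners {
        swarm.remove_listener(id);
    }

    // 2. Drain events until the swarm has been quiet for 1s (max 10s overall)
    tokio::time::timeout(Duration::from_secs(10), async {
        loop {
            tokio::select! {
                event = swarm.select_next_some() => {
                    info!("Draining event: {:?}", std::mem::discriminant(&event));
                }
                _ = tokio::time::sleep(Duration::from_secs(1)) => break,
            }
        }
    }).await.ok();

    // 3. Persist routing table to disk
    if let Err(e) = save_routing_table(swarm).await {
        error!("Failed to persist routing table: {e}");
    }

    info!("Graceful shutdown complete");
}

async fn save_routing_table(swarm: &mut Swarm<MyBehaviour>) -> anyhow::Result<()> {
    // kbuckets() needs mutable access; peer IDs are stored in their string form
    let peers = swarm.behaviour_mut().kademlia.kbuckets()
        .flat_map(|bucket| {
            bucket.iter()
                .map(|entry| entry.node.key.preimage().to_string())
                .collect::<Vec<_>>()
        })
        .collect::<Vec<_>>();
    let data = serde_json::to_string(&peers)?;
    tokio::fs::write("/app/data/routing_table.json", data).await?;
    info!("Persisted {} peers to routing table", peers.len());
    Ok(())
}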

Health Check Endpoints

rust
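use serde::Serialize;
use warp::Filter;

// count_connected_peers(), get_routing_table_size(), and get_uptime() are
// application-specific helpers assumed to be defined elsewhere.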

fn health_routes() -> impl Filter<Extract = impl warp::Reply, Error = warp::Rejection> + Clone {
    let liveness = warp::path!("health" / "live")
        .map(|| warp::reply::json(&serde_json::json!({"status": "ok"})));

    let readiness = warp::path!("health" / "ready")
        .and(warp::any().map(move || get_node_status()))
        .map(|status| warp::reply::json(&status));

    liveness.or(readiness)
}

#[derive(Serialize)]
struct NodeStatus {
    status: String,
    connected_peers: usize,
    routing_table_size: usize,
    uptime_seconds: u64,
}

fn get_node_status() -> NodeStatus {
    NodeStatus {
        status: "ok".to_string(),
        connected_peers: count_connected_peers(),
        routing_table_size: get_routing_table_size(),
        uptime_seconds: get_uptime(),
    }
}

Troubleshooting

Issue	Possible Cause	Investigation Steps	Solution
Node unreachable	Symmetric NAT	Check `libp2p_nat_traversal` metric	Enable Relay + DCUtR hole punching
Slow DHT queries	Bootstrap nodes unreachable or too few	`curl localhost:9100/metrics` and check `libp2p_dht_query_duration_seconds`	Add 3-5 stable bootstrap peers, configure a fallback list
Message loss	Unstable Gossipsub mesh	Verify `libp2p_gossipsub_peers` < 6	Adjust D parameter to 6-8, increase `heartbeat_interval`
Memory leak	Connections not properly closed	Capture heap dump with `pprof`	Set `idle_connection_timeout`, implement `Drop` cleanup
High CPU	Heartbeat too frequent	Profile with `perf top` or flamegraph	Increase heartbeat interval to 1s+, reduce unnecessary protocols
Latency spikes	Network congestion or peer overload	Check `p2p_message_processing_seconds` P99	Implement message priority queue and backpressure
Disk usage explosion	Unbounded logging or DHT records	`du -sh /app/data`	Configure log rotation, cap DHT storage size
File descriptor exhaustion	Exceeding connection limits	`lsof -p <pid> | wc -l`	Raise `ulimit -n` (the checklist below recommends ≥ 65536)

Quick Diagnostic Script

bash
#!/bin/bash
# P2P Node Health Check Script

NODE=${1:-"http://localhost:9100"}

echo "=== P2P Node Diagnostics ==="
echo "Node: $NODE"
echo ""

# Basic connectivity
echo "1. Connectivity:"
if curl -sf "$NODE/metrics" > /dev/null 2>&1; then
    echo "   ✅ Metrics endpoint reachable"
else
    echo "   ❌ Cannot reach metrics endpoint"
fi

# Active connections
CONN=$(curl -sf "$NODE/metrics" 2>/dev/null | grep "^libp2p_connections_active" | awk '{print $2}')
echo "2. Active Connections: ${CONN:-N/A}"

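# DHT P99 latency. A quantile series exists only if the metric is exported as a summary;
# for a histogram, compute P99 in Prometheus with histogram_quantile() instead.
DHT_P99=$(curl -sf "$NODE/metrics" 2>/dev/null | grep "^libp2p_dht_query_duration_seconds" | grep "quantile=\"0.99\"" | awk '{print $2}')
echo "3. DHT Query P99: ${DHT_P99:-N/A}s"

# Memory usage (awk also handles values printed in scientific notation)
MEM=$(curl -sf "$NODE/metrics" 2>/dev/null | grep "^process_resident_memory_bytes" | awk '{print $2}')
if [ -n "$MEM" ]; then
    MEM_MB=$(awk -v m="$MEM" 'BEGIN {printf "%d", m / 1024 / 1024}')
    echo "4. Memory: ${MEM_MB}MB"
else
    echo "4. Memory: N/A"
fi

# Dial failure count, summed across label sets ("^" skips the # HELP / # TYPE lines)
ERR=$(curl -sf "$NODE/metrics" 2>/dev/null | grep "^p2p_peer_dial_failures_total" | awk '{sum += $NF} END {if (NR > 0) print sum}')
echo "5. Dial Failures Total: ${ERR:-N/A}"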

Production Readiness Checklist

Connection caps configured (inbound + outbound)
Idle timeout set (30-60 seconds)
Protobuf / FlatBuffers selected over JSON
Snappy compression enabled for large blocks
Transport encryption enabled (Noise / TLS)
Message signing with SeqNo anti-replay implemented
Per-peer Token Bucket rate limiting active
Blacklist mechanism deployed
Prometheus metrics endpoint exposed
Key alert rules configured
Docker image uses distroless base
Health check endpoints implemented (liveness + readiness)
Graceful shutdown logic tested
ulimit tuned (≥ 65536)

Part of series: Network Development Practice
