Advanced eBPF Memory Observability: Container Tracing and Rust Aya

The first two articles covered eBPF fundamentals and OOM Killer event tracing. This article goes deeper: container-level OOM pinpointing, real-time memory allocation rate tracking, and implementing the same functionality with the Rust Aya framework.

Container-Level OOM Pinpointing

In Kubernetes, “a Pod OOM’d” is a vague statement. A Pod can contain several containers, each in its own cgroup. eBPF can drill through this layer and identify exactly which container and which process the OOM killer hit.

The correlation chain:

text
oom_kill_process triggered
    ↓ capture task_struct → read /proc/PID/cgroup
        ↓ parse container ID → map to Pod
            ↓ map to Namespace
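
The second step of the chain can be sketched in Go as follows. This is a minimal sketch, assuming the contents of /proc/PID/cgroup have already been read into a string; the function name memoryCgroupPath is hypothetical and not part of the original tooling.

```go
package main

import (
	"bufio"
	"strings"
)

// memoryCgroupPath extracts the cgroup path relevant to memory accounting
// from the contents of /proc/<pid>/cgroup. Each line has the form
// "hierarchy-ID:controller-list:path".
func memoryCgroupPath(procCgroup string) string {
	var unified string
	sc := bufio.NewScanner(strings.NewReader(procCgroup))
	for sc.Scan() {
		fields := strings.SplitN(sc.Text(), ":", 3)
		if len(fields) != 3 {
			continue // skip malformed lines
		}
		// cgroup v1: prefer the hierarchy that carries the memory controller.
		for _, ctrl := range strings.Split(fields[1], ",") {
			if ctrl == "memory" {
				return fields[2]
			}
		}
		// cgroup v2: a single entry with hierarchy ID 0 and no controller list.
		if fields[0] == "0" && fields[1] == "" {
			unified = fields[2]
		}
	}
	return unified
}
```

One caveat: the victim process may already be gone by the time user space reads /proc, which is one reason to also capture cgroup information inside the eBPF program.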

In the eBPF program, the oom_control parameter from oom_kill_process carries a memcg pointer, giving access to cgroup-level information:

c
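struct mem_cgroup *memcg = BPF_CORE_READ(oc, memcg);
if (memcg) {  // NULL for a system-wide (non-cgroup) OOM
    char cgroup_name[256];
    // BPF_CORE_READ takes each step of the pointer chain as a separate argument.
    const char *name = BPF_CORE_READ(memcg, css.cgroup, kn, name);
    // kn->name is only the last path component (the cgroup's own directory);
    // walk up the kernfs parent chain to rebuild the full cgroup path.
    bpf_probe_read_kernel_str(cgroup_name, sizeof(cgroup_name), name);
}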

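User-space programs can parse the cgroup path to extract the Pod UID and container ID, which can then be resolved to Pod and container names through the Kubernetes API or the container runtime: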

go
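import "strings"

// cgroup path example (cgroupfs driver):
// /kubepods/burstable/pod<UID>/<containerID>

// parseContainerID returns the last path component, which is the container
// ID in the layout above. It returns "" for paths too short to contain one.
func parseContainerID(cgroupPath string) string {
    parts := strings.Split(cgroupPath, "/")
    if len(parts) < 3 {
        return ""
    }
    return parts[len(parts)-1]
}

// resolvePod returns the Pod UID taken from the "pod<UID>" component,
// or "" if the path does not belong to a Pod.
func resolvePod(cgroupPath string) string {
    for _, part := range strings.Split(cgroupPath, "/") {
        if strings.HasPrefix(part, "pod") {
            return strings.TrimPrefix(part, "pod")
        }
    }
    return ""
}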
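
The path layout shown above is the one produced by the cgroupfs driver. With the systemd cgroup driver, paths look like /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/cri-containerd-<containerID>.scope, with the dashes in the UID replaced by underscores. The following is a minimal sketch of a parser that accepts both layouts; the function name podAndContainer is hypothetical, and the runtime prefix (cri-containerd-, docker-, crio-) is assumed to end with a dash.

```go
package main

import "strings"

// podAndContainer extracts the Pod UID and container ID from a cgroup path
// in either the cgroupfs or the systemd layout. Either result may be "".
func podAndContainer(cgroupPath string) (podUID, containerID string) {
	parts := strings.Split(strings.Trim(cgroupPath, "/"), "/")
	podIdx := -1
	for i, part := range parts {
		name := strings.TrimSuffix(part, ".slice")
		if j := strings.LastIndex(name, "-pod"); strings.HasPrefix(name, "kubepods") && j >= 0 {
			// systemd: kubepods-<qos>-pod<uid_with_underscores>
			podUID = strings.ReplaceAll(name[j+len("-pod"):], "_", "-")
			podIdx = i
		} else if strings.HasPrefix(name, "pod") {
			// cgroupfs: pod<uid>
			podUID = strings.TrimPrefix(name, "pod")
			podIdx = i
		}
	}
	// No Pod component, or the path names the Pod-level cgroup itself.
	if podIdx < 0 || podIdx == len(parts)-1 {
		return podUID, ""
	}
	last := strings.TrimSuffix(parts[len(parts)-1], ".scope")
	// systemd scopes carry a runtime prefix such as "cri-containerd-".
	if k := strings.LastIndex(last, "-"); k >= 0 {
		last = last[k+1:]
	}
	return podUID, last
}
```

Which layout appears depends on how the kubelet and container runtime are configured, so it is worth checking a real path on the target cluster before relying on either branch.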

Memory Allocation Rate Tracking

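OOM is only the final outcome; the real value lies in the trend leading up to it. By tracking kmalloc events (and the matching kfree events, if net usage is needed), abnormal growth can be detected before an OOM occurs. Note that kmalloc covers kernel allocations made in the context of a process, not user-space heap growth. The program below tracks allocations only: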

c
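#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 65536);
    __type(key, u32);    // TGID (the process ID as seen from user space)
    __type(value, u64);  // Sampled bytes accumulated on this CPU
} alloc_stats SEC(".maps");

// Read-only after load; user space can set it before loading the object.
volatile const u64 sample_rate = 100;  // roughly 1 in 100 events sampled

// Depending on the kernel version, the context type may instead be named
// struct trace_event_raw_kmalloc; check the generated vmlinux.h.
SEC("tracepoint/kmem/kmalloc")
int trace_kmalloc(struct trace_event_raw_kmem_alloc *ctx)
{
    // Sample first so that skipped events cost as little as possible.
    if (bpf_get_prandom_u32() % sample_rate != 0)
        return 0;

    u32 pid = bpf_get_current_pid_tgid() >> 32;
    // Tracepoint context fields can be read directly.
    u64 size = ctx->bytes_alloc;

    u64 *val = bpf_map_lookup_elem(&alloc_stats, &pid);
    if (val) {
        // The value slot is private to this CPU, so a plain add is enough.
        *val += size;
    } else {
        bpf_map_update_elem(&alloc_stats, &pid, &size, BPF_ANY);
    }
    return 0;
}

char LICENSE[] SEC("license") = "GPL";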

Key Design Points

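  • BPF_MAP_TYPE_PERCPU_HASH: Each CPU gets its own value slot for every key, so updates need neither locks nor atomic operations. User space reads one value per CPU and sums them
  • Sampling: kmalloc can fire millions of times per second on a busy system, so recording every event is too expensive. Proportional sampling keeps overhead manageable; multiplying the sampled totals by the sample rate gives an estimate of the real volume
  • Tracepoint first: tracepoint/kmem/kmalloc is a kernel-maintained interface that changes far less often than internal function signatures, which makes it safer and more discoverable than a kprobe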
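
The first two points have a direct consequence on the reading side: the per-CPU values must be summed, and the sum must be scaled back up by the sample rate. A minimal sketch, assuming the per-CPU values have already been read into a slice; the function name estimateAllocated is hypothetical.

```go
package main

// estimateAllocated turns the per-CPU sampled byte counters of one map entry
// into an estimate of the real allocation volume.
//
// perCPU holds one counter per CPU; sampleRate is the N in "1 in N events".
func estimateAllocated(perCPU []uint64, sampleRate uint64) uint64 {
	var sampled uint64
	for _, v := range perCPU {
		sampled += v // each CPU only ever updated its own slot
	}
	// Each recorded event stands in for roughly sampleRate real events.
	return sampled * sampleRate
}
```

The result is a statistical estimate: it is reasonable for processes with many allocations and noisy for processes with only a handful of sampled events.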

The user-space program periodically reads the alloc_stats map to calculate allocation rates:

go
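import (
    "log"
    "time"
)

type AllocStat struct {
    PID        uint32
    TotalAlloc uint64  // sampled bytes, summed over all CPUs
    Rate       float64 // bytes/sec
}

func (m *Monitor) pollAllocStats() {
    const interval = 10 * time.Second
    prev := make(map[uint32]uint64) // previous total per PID

    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        var pid uint32
        var perCPU []uint64 // per-CPU maps yield one value per possible CPU

        iter := m.objs.AllocStats.Iterate()
        for iter.Next(&pid, &perCPU) {
            var total uint64
            for _, v := range perCPU {
                total += v
            }
            delta := total - prev[pid]
            if total < prev[pid] {
                delta = total // the entry was recreated since the last poll
            }
            prev[pid] = total
            stat := AllocStat{PID: pid, TotalAlloc: total, Rate: float64(delta) / interval.Seconds()}
            _ = stat // ... update rate metrics
        }
        if err := iter.Err(); err != nil {
            log.Printf("iterating alloc_stats: %v", err)
        }
    }
}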
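
Once a rate is available per PID, "abnormal growth" still needs a definition. One simple option, which is not taken from the original monitor, is to compare each new rate with an exponentially weighted moving average (EWMA) of earlier rates. The sketch below assumes that approach; the type growthDetector and its fields are hypothetical names.

```go
package main

// growthDetector flags a rate sample as abnormal when it exceeds a multiple
// of the exponentially weighted moving average of earlier samples.
type growthDetector struct {
	alpha    float64 // smoothing factor in (0, 1]; higher reacts faster
	factor   float64 // alert when rate > factor * baseline
	baseline float64 // EWMA of past rates, in bytes/sec
	primed   bool    // false until the first sample has been seen
}

// observe feeds one rate sample and reports whether it looks abnormal.
func (d *growthDetector) observe(rate float64) bool {
	if !d.primed {
		d.baseline = rate
		d.primed = true
		return false
	}
	abnormal := d.baseline > 0 && rate > d.factor*d.baseline
	// Update the baseline after the comparison so a spike cannot hide itself.
	d.baseline = d.alpha*rate + (1-d.alpha)*d.baseline
	return abnormal
}
```

A monitor would keep one detector per PID and call observe with each newly computed Rate. The values of alpha and factor are tuning knobs that depend on the workload.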

Rust Aya Implementation

Aya is a pure Rust eBPF framework with no libbpf dependency. Here’s the same OOM monitor implemented in Rust:

eBPF Kernel Side (Rust)

rust
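#![no_std]
#![no_main]

use core::ptr::addr_of;

use aya_ebpf::{
    cty::c_long,
    helpers::{
        bpf_get_current_comm, bpf_get_current_pid_tgid, bpf_ktime_get_ns,
        bpf_probe_read_kernel,
    },
    macros::{kprobe, map},
    maps::RingBuf,
    programs::ProbeContext,
};

// Assumption: kernel type bindings generated for the target kernel, e.g. with
// `aya-tool generate oom_control task_struct > src/vmlinux.rs`.
#[allow(non_camel_case_types, non_upper_case_globals, non_snake_case, dead_code)]
mod vmlinux;
use vmlinux::{oom_control, task_struct};

const TASK_COMM_LEN: usize = 16;

#[repr(C)]
pub struct OomEvent {
    pub pid: u32,  // victim thread ID
    pub tgid: u32, // victim process ID
    pub fpid: u32, // process whose allocation triggered the OOM
    pub pages: u64,
    pub comm: [u8; TASK_COMM_LEN],
    pub fcomm: [u8; TASK_COMM_LEN],
    pub timestamp: u64,
}

#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(1024 * 1024, 0);

#[kprobe(function = "oom_kill_process")]
pub fn oom_kill_process(ctx: ProbeContext) -> u32 {
    match try_oom_kill_process(ctx) {
        Ok(ret) => ret,
        Err(_) => 1,
    }
}

fn try_oom_kill_process(ctx: ProbeContext) -> Result<u32, c_long> {
    // On current kernels the signature is
    // oom_kill_process(struct oom_control *oc, const char *message),
    // and the victim task is oc->chosen.
    let oc: *const oom_control = ctx.arg(0).ok_or(-1)?;
    let victim: *mut task_struct =
        unsafe { bpf_probe_read_kernel(addr_of!((*oc).chosen))? };

    // Read everything before reserving, so no error path can leave a
    // reserved record unsubmitted.
    let pid = unsafe { bpf_probe_read_kernel(addr_of!((*victim).pid))? } as u32;
    let tgid = unsafe { bpf_probe_read_kernel(addr_of!((*victim).tgid))? } as u32;
    // Assumption: `pages` mirrors oc->totalpages, as in BCC's oomkill tool.
    let pages = unsafe { bpf_probe_read_kernel(addr_of!((*oc).totalpages))? } as u64;
    let comm = unsafe {
        bpf_probe_read_kernel(addr_of!((*victim).comm) as *const [u8; TASK_COMM_LEN])?
    };
    let fcomm = bpf_get_current_comm()?;
    let fpid = (bpf_get_current_pid_tgid() >> 32) as u32;

    // If the ring buffer is full, reserve() returns None and the event is dropped.
    if let Some(mut entry) = EVENTS.reserve::<OomEvent>(0) {
        // Write fields straight into the reserved record; no intermediate copy.
        let ev = entry.as_mut_ptr();
        unsafe {
            (*ev).pid = pid;
            (*ev).tgid = tgid;
            (*ev).fpid = fpid;
            (*ev).pages = pages;
            (*ev).comm = comm;
            (*ev).fcomm = fcomm;
            (*ev).timestamp = bpf_ktime_get_ns();
        }
        entry.submit(0);
    }

    Ok(0)
}

#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    unsafe { core::hint::unreachable_unchecked() }
}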

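Note: The code above reads kernel memory through bpf_probe_read_kernel rather than dereferencing raw pointers, so an invalid address yields an error instead of a fault. Field positions are not hardcoded by hand; in Aya they typically come from Rust bindings generated from the kernel's BTF (for example with aya-tool). Unlike C programs built with libbpf's CO-RE macros, kernel-side Rust code does not, at the time of writing, have its field offsets relocated at load time, so the bindings should be generated for a kernel whose struct layout matches the target.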

User Side (Rust)

rust
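use aya::{
    include_bytes_aligned,
    maps::ring_buf::RingBuf,
    programs::KProbe,
    Ebpf, // named `Bpf` in older Aya releases
};
// Assumption: OomEvent lives in a crate shared by both sides (the -common
// crate of the project template) and matches the kernel-side #[repr(C)] layout.
use ebpf_oom_common::OomEvent;

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    // The path must match the profile the eBPF crate was built with.
    let mut bpf = Ebpf::load(include_bytes_aligned!(
        "../../target/bpfel-unknown-none/release/ebpf-oom"
    ))?;

    let program: &mut KProbe = bpf
        .program_mut("oom_kill_process")
        .ok_or_else(|| anyhow::anyhow!("program oom_kill_process not found"))?
        .try_into()?;
    program.load()?;
    program.attach("oom_kill_process", 0)?;

    let events_map = bpf
        .map_mut("EVENTS")
        .ok_or_else(|| anyhow::anyhow!("map EVENTS not found"))?;
    let mut events = RingBuf::try_from(events_map)?;

    // Simple polling loop: drain the ring buffer, then sleep briefly.
    loop {
        while let Some(item) = events.next() {
            if item.len() < std::mem::size_of::<OomEvent>() {
                continue; // ignore truncated records
            }
            let event: OomEvent =
                unsafe { std::ptr::read_unaligned(item.as_ptr() as *const OomEvent) };
            println!("OOM: pid={} pages={}", event.pid, event.pages);
        }
        tokio::time::sleep(tokio::time::Duration::from_millis(100)).await;
    }
}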
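
Because OomEvent is #[repr(C)], its byte layout is fixed and any user-space language can decode the ring buffer records, not only Rust. The following Go sketch decodes one raw record; it assumes a little-endian target (bpfel) and the field order shown in the kernel-side struct, and the names oomEvent, decodeOomEvent, and commString are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
)

// oomEvent mirrors the #[repr(C)] OomEvent struct: 64 bytes in total.
type oomEvent struct {
	PID       uint32
	TGID      uint32
	FPID      uint32
	_         uint32 // padding that aligns Pages to 8 bytes
	Pages     uint64
	Comm      [16]byte
	FComm     [16]byte
	Timestamp uint64
}

// decodeOomEvent parses one raw ring buffer record.
// It returns an error if raw is shorter than the struct.
func decodeOomEvent(raw []byte) (oomEvent, error) {
	var ev oomEvent
	err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &ev)
	return ev, err
}

// commString converts a NUL-padded comm field to a Go string.
func commString(c [16]byte) string {
	if i := bytes.IndexByte(c[:], 0); i >= 0 {
		return string(c[:i])
	}
	return string(c[:])
}
```

The explicit padding field is the detail that is easiest to miss: three u32 fields are followed by a u64, so the C layout inserts four bytes before pages.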

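Build and Run with aya-template

bash
# Install the nightly toolchain, the BPF linker, and the project generator
rustup toolchain install nightly --component rust-src
cargo install bpf-linker
cargo install cargo-generate

# Create a new project from the Aya template (prompts for a name and program type)
cargo generate https://github.com/aya-rs/aya-template

# Build the eBPF program (kernel side)
cargo xtask build-ebpf --release

# Build the user-space program
cargo build --release

# Run (requires root)
sudo ./target/release/ebpf-oom

The compiled binary is self-contained: the eBPF bytecode is embedded via the include_bytes_aligned! macro.

About the template: the commands above assume the xtask-based version of aya-template, which matches the target/bpfel-unknown-none/ path used in the user-space snippet. Newer template versions build the eBPF crate from a build script, so a plain cargo build is enough there. For a project named ebpf-oom, the template generates three packages: ebpf-oom-ebpf/ (kernel eBPF code), ebpf-oom/ (user-space code), and ebpf-oom-common/ (types shared by both sides, such as OomEvent).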

Aya vs cilium/ebpf

| Dimension | cilium/ebpf (Go) | Aya (Rust) |
|-----------|------------------|------------|
| Kernel language | C (compiled by Clang) | Rust (bpfel-unknown-none target) |
| User language | Go | Rust |
| Toolchain | Requires Clang/LLVM | Rust nightly plus bpf-linker; no Clang or libbpf |
| Type safety | C offers few compile-time guarantees | Rust compile-time checks |
| Learning curve | Must know both C and Go | Unified Rust |
| Maturity | More mature, more production use | Rapidly developing |
| BTF/CO-RE | Full support | Loader applies BTF relocations; kernel-side Rust relies on generated bindings |
| Dev experience | bpf2go code generation | aya-template scaffold with Cargo-driven builds |

Choosing Go or Rust depends on your team background and project needs. Go + cilium/ebpf is more mature with a richer ecosystem; Rust + Aya offers stronger type safety and a single-language workflow.

Best Practices

Tracepoint over Kprobe

| Property | tracepoint | kprobe |
|----------|------------|--------|
| ABI stability | Stable, maintained by kernel devs | No guarantee |
| Discoverability | bpftrace -l lists them | Need to read source |
| Performance | Slightly better | Slightly higher overhead |
| When to use | Officially supported scenarios | No tracepoint available for target function |
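
The "tracepoint if available, kprobe otherwise" decision can also be made programmatically. The kernel lists its tracepoints as subsystem:event lines in the tracefs file available_events (usually under /sys/kernel/tracing). A minimal sketch, assuming the file contents have already been read into a string; the function name hasTracepoint is hypothetical.

```go
package main

import (
	"bufio"
	"strings"
)

// hasTracepoint reports whether "subsystem:event" is listed in the contents
// of the tracefs available_events file.
func hasTracepoint(availableEvents, subsystem, event string) bool {
	want := subsystem + ":" + event
	sc := bufio.NewScanner(strings.NewReader(availableEvents))
	for sc.Scan() {
		if strings.TrimSpace(sc.Text()) == want {
			return true
		}
	}
	return false
}
```

A loader could call hasTracepoint(contents, "kmem", "kmalloc") at startup and fall back to a kprobe only when the result is false.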

Sampling Strategy

High-frequency events (like kmalloc, page_fault) must be sampled:

  • Proportional sampling: record 1 in N events via bpf_get_prandom_u32() % N == 0
  • Adaptive sampling: dynamically adjust sample rate based on current event rate
  • Key-based sampling: only track specific PIDs or cgroups
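
Adaptive sampling is the least obvious of the three, so here is a minimal sketch of the control step in user space. It assumes the monitor can estimate the total event rate and wants to keep the number of recorded events per second near a target; the function name nextSampleRate and its parameters are hypothetical.

```go
package main

// nextSampleRate picks N (record 1 in N events) so that the number of
// recorded events per second stays close to targetPerSec.
//
// eventsPerSec is the estimated total event rate, i.e. recorded events per
// second multiplied by the sample rate that was in effect.
func nextSampleRate(eventsPerSec, targetPerSec, minN, maxN uint64) uint64 {
	if targetPerSec == 0 {
		return maxN // nothing should be recorded; sample as rarely as allowed
	}
	// Ceiling division: the smallest N that keeps us at or below the target.
	n := (eventsPerSec + targetPerSec - 1) / targetPerSec
	if n < minN {
		n = minN
	}
	if n > maxN {
		n = maxN
	}
	return n
}
```

To apply the new value at runtime, the eBPF program has to read N from a map (for example a single-element array map) rather than from a volatile const global, because read-only globals are fixed once the program is loaded.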

Ring Buffer vs Perf Buffer

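Ring Buffer (BPF_MAP_TYPE_RINGBUF, available since Linux 5.8) is the recommended event transport mechanism:

  • Higher performance and better memory use: one buffer is shared by all CPUs, events keep their global order, and consumer wakeups can be batched
  • Explicit handling of a full buffer: bpf_ringbuf_reserve returns NULL when no space is left, so the program can count dropped events (bpf_ringbuf_discard releases a reserved record that should not be sent)
  • Supports reserve/commit two-phase write, avoiding extra copies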

Perf Buffer (BPF_MAP_TYPE_PERF_EVENT_ARRAY) is the legacy approach. New projects should prefer Ring Buffer.

Summary

This article covered:

  • Container-level OOM pinpointing via cgroup paths mapped to Kubernetes Pods
  • Memory allocation rate tracking using per-CPU maps and sampling
  • Rust Aya implementation showing an alternative development paradigm

The next (and final) article in this series covers the BPF OOM kernel patches — a new feature being discussed in the community that allows eBPF programs to fully take over the kernel’s OOM policy.