Advanced eBPF Memory Observability: Container Tracing and Rust Aya in Practice

June 12, 2026 · Observability · Tags: eBPF, Rust, Aya, memory management, OOM, cgroup, Linux · Observability series


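The previous two articles covered eBPF fundamentals and tracing OOM Killer events. This one goes deeper: attributing OOM kills at container level, tracking memory allocation rates in real time, and implementing the same functionality with the Rust Aya framework.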

Container-level OOM attribution

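In a Kubernetes environment, "a Pod got OOM-killed" is a vague description. A Pod consists of several containers, and each container usually runs in its own cgroup. eBPF can see through this layer and pinpoint which process in which container triggered the OOM kill.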

The correlation chain:

text
oom_kill_process fires
    ↓ capture task_struct → read /proc/PID/cgroup
        ↓ parse container ID → map to Pod
            ↓ map to Namespace
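
The "read /proc/PID/cgroup" step of this chain happens in user space. A minimal sketch of it, assuming the standard `hierarchy-ID:controller-list:path` line format of that file; the function name `memoryCgroupPath` is hypothetical and only for illustration. Keep in mind that the entry for a killed process can disappear quickly, so a real tool would have to tolerate a failed read.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// memoryCgroupPath extracts the cgroup path relevant for memory accounting
// from the contents of /proc/<pid>/cgroup.
// cgroup v1 lines look like "5:cpu,memory:/kubepods/...";
// the cgroup v2 (unified) line looks like "0::/kubepods.slice/...".
// A v1 memory controller line wins; otherwise the unified line is used.
func memoryCgroupPath(procCgroup string) string {
	unified := ""
	for _, line := range strings.Split(procCgroup, "\n") {
		fields := strings.SplitN(strings.TrimSpace(line), ":", 3)
		if len(fields) != 3 {
			continue // blank or malformed line
		}
		if fields[1] == "" {
			unified = fields[2] // cgroup v2 entry has an empty controller list
			continue
		}
		for _, controller := range strings.Split(fields[1], ",") {
			if controller == "memory" {
				return fields[2]
			}
		}
	}
	return unified
}

func main() {
	// Demonstration on the current process; a tracer would use the victim's PID.
	data, err := os.ReadFile("/proc/self/cgroup")
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Println(memoryCgroupPath(string(data)))
}
```

Preferring the v1 memory line over the unified line is a design choice for hybrid hosts, where both hierarchies are mounted and memory is still managed by v1.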

Inside the eBPF program, the oom_control argument of oom_kill_process carries a memcg pointer, which gives access to cgroup-level information:

c
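#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

// On current kernels the signature is
//   oom_kill_process(struct oom_control *oc, const char *message)
// and the victim task is stored in oc->chosen. Older kernels passed the
// task as a separate argument, so check the source of your target kernel.
SEC("kprobe/oom_kill_process")
int BPF_KPROBE(oom_kill_process, struct oom_control *oc, const char *message)
{
    struct mem_cgroup *memcg;
    char cgroup_name[128] = {};

    // Get the memcg from oom_control (NULL for a system-wide OOM)
    memcg = BPF_CORE_READ(oc, memcg);
    if (memcg) {
        // kn->name is only the leaf directory name of the cgroup
        // (typically the container ID), not the full
        // /sys/fs/cgroup/.../kubepods/... path
        const char *name = BPF_CORE_READ(memcg, css.cgroup, kn, name);
        bpf_probe_read_kernel_str(cgroup_name, sizeof(cgroup_name), name);
    }

    // PID of the task being killed
    struct task_struct *p = BPF_CORE_READ(oc, chosen);
    u32 pid = BPF_CORE_READ(p, pid);
    // Store the PID -> cgroup mapping here so user space can correlate them
    return 0;
}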

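Once the user-space program has the cgroup path, it can extract the Pod UID and the container ID from it, and map those to the Pod and container names: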

go
import "strings"

// Example cgroup path (cgroupfs driver):
// /kubepods/burstable/pod<UID>/<containerID>

// parseContainerID returns the last path component, which is the container ID.
func parseContainerID(cgroupPath string) string {
    parts := strings.Split(cgroupPath, "/")
    if len(parts) < 3 {
        return ""
    }
    return parts[len(parts)-1]
}

// resolvePod returns the Pod UID taken from the "pod<UID>" path component.
func resolvePod(cgroupPath string) string {
    for _, part := range strings.Split(cgroupPath, "/") {
        if strings.HasPrefix(part, "pod") {
            return strings.TrimPrefix(part, "pod")
        }
    }
    return ""
}
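
The helpers above assume the cgroupfs driver layout. As a hedged extension, the sketch below also accepts the layout commonly produced by the systemd cgroup driver (for example `kubepods-burstable-pod<UID>.slice/cri-containerd-<id>.scope`). The exact directory names depend on the container runtime and kubelet configuration, so both regular expressions and the names `CgroupRef` and `ParseCgroupPath` are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// podRe matches "pod<UID>" where the UID uses either '-' (cgroupfs driver)
// or '_' (systemd driver, assumed layout) as separator.
var podRe = regexp.MustCompile(`pod([0-9a-f]{8}(?:[-_][0-9a-f]{4}){3}[-_][0-9a-f]{12})`)

// containerRe matches a 64-hex-digit container ID at the end of the path,
// optionally followed by ".scope" as used by systemd units.
var containerRe = regexp.MustCompile(`([0-9a-f]{64})(\.scope)?$`)

// CgroupRef holds the identifiers recovered from a cgroup path.
type CgroupRef struct {
	PodUID      string
	ContainerID string
}

// ParseCgroupPath returns the Pod UID and container ID found in path.
// The boolean is false when either identifier is missing.
func ParseCgroupPath(path string) (CgroupRef, bool) {
	var ref CgroupRef
	if m := podRe.FindStringSubmatch(path); m != nil {
		// Normalize the systemd-style UID back to the canonical form.
		ref.PodUID = strings.ReplaceAll(m[1], "_", "-")
	}
	if m := containerRe.FindStringSubmatch(path); m != nil {
		ref.ContainerID = m[1]
	}
	return ref, ref.PodUID != "" && ref.ContainerID != ""
}

func main() {
	id := strings.Repeat("0123456789abcdef", 4) // made-up 64-character ID
	paths := []string{
		"/kubepods/burstable/pod0a1b2c3d-1111-2222-3333-444455556666/" + id,
		"/kubepods.slice/kubepods-burstable.slice/" +
			"kubepods-burstable-pod0a1b2c3d_1111_2222_3333_444455556666.slice/" +
			"cri-containerd-" + id + ".scope",
	}
	for _, p := range paths {
		ref, ok := ParseCgroupPath(p)
		fmt.Println(ok, ref.PodUID, ref.ContainerID)
	}
}
```

Matching on the shape of the identifiers rather than on fixed path positions keeps the parser tolerant of extra QoS levels or runtime prefixes in the path.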

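Tracking memory allocation rates

An OOM kill is the final outcome; the real value lies in the trend leading up to it. By tracing kmalloc and kfree events, abnormal growth can be detected before the OOM happens.

The eBPF program attaches to a tracepoint (more stable than a kprobe) to trace kernel memory allocations: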

c
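// Per-CPU statistics map: each CPU updates its own copy of a value,
// so there is no cross-CPU contention
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 65536);
    __type(key, u32);    // PID as seen from user space (the kernel tgid)
    __type(value, u64);  // cumulative sampled bytes allocated
} alloc_stats SEC(".maps");

// Sampling rate: kmalloc is a hot path, so recording every call is too expensive
volatile const u64 sample_rate = 100;  // record roughly 1 in 100 calls

// Note: on newer kernels the generated context struct for this tracepoint
// may have a different name (e.g. trace_event_raw_kmalloc); check vmlinux.h.
SEC("tracepoint/kmem/kmalloc")
int trace_kmalloc(struct trace_event_raw_kmem_alloc *ctx)
{
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u32 pid = pid_tgid >> 32;

    // Sample: do not record every kmalloc
    if (bpf_get_prandom_u32() % sample_rate != 0)
        return 0;

    u64 size = BPF_CORE_READ(ctx, bytes_alloc);

    // The stored value covers sampled calls only; user space multiplies
    // by sample_rate to estimate the real volume
    u64 *val = bpf_map_lookup_elem(&alloc_stats, &pid);
    if (!val) {
        u64 init = size;
        bpf_map_update_elem(&alloc_stats, &pid, &init, BPF_ANY);
    } else {
        __sync_fetch_and_add(val, size);
    }
    return 0;
}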

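A few design points:

BPF_MAP_TYPE_PERCPU_HASH: each CPU gets its own copy of every value, so updates need no locking. On multi-core systems this keeps the hot path cheap
Sampling: kmalloc is an extremely high-frequency kernel event (potentially millions of calls per second), so recording every call is not feasible. Ratio sampling keeps the overhead within an acceptable range
Tracepoint first: tracepoint/kmem/kmalloc is a maintained interface and far less likely to change between kernel versions than a kprobe target

The user-space program reads the alloc_stats map periodically and computes the delta between reads to obtain the allocation rate: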

go
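type AllocStat struct {
    PID        uint32
    TotalAlloc uint64
    Rate       float64 // bytes per second
}

func (m *Monitor) pollAllocStats() {
    const interval = 10 * time.Second
    prev := make(map[uint32]uint64) // PID -> total seen at the previous poll

    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        var (
            pid    uint32
            perCPU []uint64 // a per-CPU map yields one value per possible CPU
        )
        // Walk every entry of the hash map
        it := m.objs.AllocStats.Iterate()
        for it.Next(&pid, &perCPU) {
            var total uint64
            for _, v := range perCPU {
                total += v
            }
            stat := AllocStat{PID: pid, TotalAlloc: total}
            // Compute the delta against the previous poll
            if last, ok := prev[pid]; ok && total >= last {
                stat.Rate = float64(total-last) / interval.Seconds()
            }
            prev[pid] = total
            _ = stat // ... update the rate metrics
        }
        if err := it.Err(); err != nil {
            log.Printf("iterating alloc_stats: %v", err)
        }
    }
}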
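
The arithmetic behind the rate can be isolated from the map access. The following is a minimal sketch, assuming the kernel side stores sampled bytes only, so the estimate is scaled back up by the sampling rate; `sumPerCPU` and `allocRate` are hypothetical helper names, not part of any library.

```go
package main

import "fmt"

// sumPerCPU adds up the per-CPU slots of one map entry.
func sumPerCPU(perCPU []uint64) uint64 {
	var total uint64
	for _, v := range perCPU {
		total += v
	}
	return total
}

// allocRate estimates the allocation rate in bytes per second.
// prev and curr are cumulative sampled byte counts from two polls,
// sampleRate is the "1 in N" factor used by the eBPF program, and
// intervalSec is the time between the polls in seconds.
func allocRate(prev, curr, sampleRate uint64, intervalSec float64) float64 {
	if curr < prev || intervalSec <= 0 {
		// Counter went backwards (entry recreated, PID reused): no estimate.
		return 0
	}
	return float64(curr-prev) * float64(sampleRate) / intervalSec
}

func main() {
	prev := sumPerCPU([]uint64{400, 600})       // previous poll
	curr := sumPerCPU([]uint64{2400, 3600})     // current poll
	fmt.Println(allocRate(prev, curr, 100, 10)) // estimated bytes per second
}
```

Because sampling is random, the scaled value is an estimate whose error shrinks as more events fall into one polling interval.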

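Implementing OOM monitoring with Rust Aya

Aya is the leading framework in the Rust eBPF ecosystem: it is written in pure Rust, does not depend on libbpf, is type-safe, and offers a pleasant development experience. Below, the OOM event monitor is rewritten with Aya.

eBPF kernel side (Rust)

Aya eBPF programs are written in Rust, with attribute macros defining maps and hook points: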

rust
#![no_std]
#![no_main]

use aya_ebpf::{
    helpers::{bpf_ktime_get_ns, bpf_probe_read_kernel},
    macros::{kprobe, map},
    maps::{PerCpuArray, RingBuf},
    programs::ProbeContext,
};

// Kernel type bindings, assumed to have been generated beforehand, e.g. with
// `aya-tool generate task_struct oom_control > src/vmlinux.rs`
#[allow(non_camel_case_types, non_upper_case_globals, non_snake_case, dead_code)]
mod vmlinux;
use vmlinux::{oom_control, task_struct};

const TASK_COMM_LEN: usize = 16;

#[repr(C)]
pub struct OomEvent {
    pub pid: u32,
    pub tgid: u32,
    pub fpid: u32,
    pub pages: u64,
    pub comm: [u8; TASK_COMM_LEN],
    pub fcomm: [u8; TASK_COMM_LEN],
    pub timestamp: u64,
}

#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(1024 * 1024, 0);

#[map]
static BUF: PerCpuArray<OomEvent> = PerCpuArray::with_max_entries(1, 0);

#[kprobe(function = "oom_kill_process")]
pub fn oom_kill_process(ctx: ProbeContext) -> u32 {
    match try_oom_kill_process(ctx) {
        Ok(ret) => ret,
        Err(_) => 1,
    }
}

fn try_oom_kill_process(ctx: ProbeContext) -> Result<u32, i64> {
    // Per-CPU scratch buffer; index 0 resolves to the current CPU's slot
    let event = unsafe { &mut *BUF.get_ptr_mut(0).ok_or(-1i64)? };

    // Argument 0 is `struct oom_control *`; on current kernels the victim
    // task is stored in oc->chosen
    let oc: *const oom_control = ctx.arg(0).ok_or(-1i64)?;
    let p: *mut task_struct =
        unsafe { bpf_probe_read_kernel(core::ptr::addr_of!((*oc).chosen))? };

    // Kernel memory must be read through bpf_probe_read_kernel
    event.pid = unsafe { bpf_probe_read_kernel(core::ptr::addr_of!((*p).pid))? } as u32;
    event.tgid = unsafe { bpf_probe_read_kernel(core::ptr::addr_of!((*p).tgid))? } as u32;
    event.timestamp = unsafe { bpf_ktime_get_ns() };
    // fpid, pages, comm and fcomm are left unset in this excerpt

    // Copy the record into the ring buffer
    if let Some(mut entry) = EVENTS.reserve::<OomEvent>(0) {
        unsafe {
            core::ptr::copy_nonoverlapping(event as *const OomEvent, entry.as_mut_ptr(), 1)
        };
        entry.submit(0);
    }

    Ok(0)
}

#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    unsafe { core::hint::unreachable_unchecked() }
}

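Note: the code above reads kernel memory through bpf_probe_read_kernel rather than hard-coded offsets. The field offsets come from Rust bindings generated from the kernel's BTF information. Whether those accesses are relocated at load time, the way CO-RE does for C programs, depends on the Aya and bpf-linker versions in use, so verify this before deploying one binary across different kernel versions.

User space (Rust)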

rust
use aya::{
    include_bytes_aligned,
    maps::ring_buf::RingBuf,
    programs::KProbe,
    Ebpf, // named `Bpf` in older Aya releases
};

// Must match the layout of the kernel-side OomEvent
#[repr(C)]
#[derive(Debug, Clone, Copy)]
struct OomEvent {
    pid: u32,
    tgid: u32,
    fpid: u32,
    pages: u64,
    comm: [u8; 16],
    fcomm: [u8; 16],
    timestamp: u64,
}

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    // Load the eBPF program (ELF bytecode); the path matches a --release build
    let mut bpf = Ebpf::load(include_bytes_aligned!(
        "../../target/bpfel-unknown-none/release/ebpf-oom"
    ))?;

    // Attach the kprobe
    let program: &mut KProbe = bpf.program_mut("oom_kill_process")
        .unwrap().try_into()?;
    program.load()?;
    program.attach("oom_kill_process", 0)?;

    // Read from the ring buffer
    let mut events = RingBuf::try_from(bpf.map_mut("EVENTS").unwrap())?;

    loop {
        while let Some(item) = events.next() {
            // Skip records that are too short to hold an OomEvent
            if item.len() < std::mem::size_of::<OomEvent>() {
                continue;
            }
            let event: OomEvent = unsafe {
                std::ptr::read_unaligned(item.as_ptr() as *const OomEvent)
            };
            println!("OOM: pid={} tgid={}", event.pid, event.tgid);
        }
        // Simple polling; an epoll-based wait would avoid the fixed delay
        tokio::time::sleep(tokio::time::Duration::from_millis(100)).await;
    }
}

Building and running with cargo-aya

bash
# Install cargo-aya
cargo install cargo-aya

# Create a new project
cargo aya new ebpf-oom

# Build the eBPF program (kernel side)
cargo build --package ebpf-oom-ebpf --release

# Build the user-space program
cargo build --package ebpf-oom --release

# Run (requires root)
sudo ./target/release/ebpf-oom

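The compiled binary is self-contained: the eBPF bytecode is embedded into the Rust program through the include_bytes_aligned! macro.

About cargo-aya: cargo aya new automatically creates a two-package project layout (the -ebpf kernel-side package and the user-space package). The kernel-side eBPF code lives in ebpf-oom-ebpf/ and the user-space code in ebpf-oom/, matching the paths in the snippets above. The upstream Aya project template (aya-template, used with cargo-generate) produces a similar layout.

Aya vs cilium/ebpf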

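Dimension	cilium/ebpf (Go)	Aya (Rust)
Kernel-side language	C (compiled with Clang)	Rust (dedicated BPF target)
User-space language	Go	Rust
Dependencies	Requires the Clang/LLVM toolchain	Rust toolchain plus bpf-linker
Type safety	C eBPF code has only weak type guarantees	Checked by the Rust compiler
Learning curve	Requires both C and Go	Rust throughout
Community maturity	More mature, more production use cases	Growing quickly
BTF/CO-RE	Fully supported	Supported by the loader; verify for programs written in Rust
Developer experience	`bpf2go` code generation	`cargo-aya` all-in-one toolchain

Whether to choose Go or Rust depends on the team's background and the project's requirements. Go + cilium/ebpf is more mature and has a richer ecosystem; Rust + Aya has the edge in type safety and development experience.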

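Best practices

Prefer tracepoints over kprobes

Property	tracepoint	kprobe
ABI stability	Stable, maintained by kernel developers	No guarantee; function signatures may change between versions
Discoverability	Listed by `bpftrace -l`	Requires reading the kernel source
Performance	Slightly better	Slightly higher overhead
Use case	Officially supported monitoring points	The function of interest has no corresponding tracepoint

Use tracepoints first, and fall back to kprobes only where no tracepoint covers the case.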
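
The "tracepoint first, kprobe as fallback" rule can be applied at load time. The sketch below checks whether a tracepoint exists by looking for its directory under tracefs. It assumes tracefs is mounted at /sys/kernel/tracing and that the process may read it (usually root only); `tracepointExists`, `chooseHook` and the fallback symbol `__kmalloc` are illustrative assumptions.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// tracepointExists reports whether events/<category>/<name>/id exists under
// the given tracefs root. Any error, including a permission error, is
// treated as "not available".
func tracepointExists(tracefsRoot, category, name string) bool {
	_, err := os.Stat(filepath.Join(tracefsRoot, "events", category, name, "id"))
	return err == nil
}

// chooseHook returns an ELF section style hook name: the tracepoint when it
// exists, otherwise a kprobe on the given kernel function.
func chooseHook(tracefsRoot, category, name, kprobeFn string) string {
	if tracepointExists(tracefsRoot, category, name) {
		return "tracepoint/" + category + "/" + name
	}
	return "kprobe/" + kprobeFn
}

func main() {
	fmt.Println(chooseHook("/sys/kernel/tracing", "kmem", "kmalloc", "__kmalloc"))
}
```

A loader could use the returned name to decide which of two compiled programs to attach, keeping the kprobe variant only for kernels that lack the tracepoint.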

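Sampling strategies

High-frequency events (such as kmalloc and page_fault) must be sampled. Common strategies:

Ratio sampling: record 1 in every N events, implemented with bpf_get_prandom_u32() % N == 0
Adaptive sampling: adjust the sampling rate dynamically based on the current event rate
Per-key sampling: track only specific PIDs or cgroups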
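
For the adaptive strategy, one possible approach is a user-space controller that recomputes N from the observed event rate. The sketch below assumes the eBPF program reads N from a writable BPF array map instead of a read-only constant, so that user space can change it at runtime. The function name `nextSampleRate` and all numbers are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// nextSampleRate returns the new "1 in N" factor so that roughly
// targetPerSec events per second are recorded.
// sampledPerSec is the rate of recorded events observed under currentN.
func nextSampleRate(sampledPerSec float64, currentN, targetPerSec, maxN uint64) uint64 {
	if currentN == 0 {
		currentN = 1
	}
	if targetPerSec == 0 {
		return maxN // no budget at all: sample as rarely as allowed
	}
	// Estimate the true event rate, then divide it by the budget.
	total := sampledPerSec * float64(currentN)
	n := uint64(math.Ceil(total / float64(targetPerSec)))
	if n < 1 {
		n = 1 // low traffic: record every event
	}
	if n > maxN {
		n = maxN
	}
	return n
}

func main() {
	// 50,000 recorded events/s at N=100 means about 5 million events/s in total.
	n := nextSampleRate(50000, 100, 10000, 100000)
	fmt.Println("new N:", n)
	// The controller would now write n into the BPF map read by the program.
}
```

Clamping to maxN bounds the estimation error during bursts, at the cost of briefly exceeding the event budget.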

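Ring Buffer vs Perf Buffer

Ring Buffer (BPF_MAP_TYPE_RINGBUF) is the recommended event transport:

Better performance and memory efficiency (one buffer shared by all CPUs, with events kept in order)
Drops are visible to the BPF program: bpf_ringbuf_reserve returns NULL when the buffer is full (bpf_ringbuf_discard only cancels a record that was already reserved)
Supports two-phase reserve/submit writes, which avoid an extra copy

Perf Buffer (BPF_MAP_TYPE_PERF_EVENT_ARRAY) is the older mechanism; new projects should prefer Ring Buffer.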
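
With either transport, user space receives raw bytes laid out like the kernel-side struct. As a sketch of the consumer side, the code below decodes one OomEvent record in Go. It assumes a little-endian host and the natural C layout of the struct shown earlier, which places 4 bytes of padding before the 8-byte `pages` field (64 bytes in total). `decodeOomEvent` and `commString` are hypothetical helper names.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// OomEvent mirrors the #[repr(C)] kernel-side struct, padding included.
type OomEvent struct {
	PID       uint32
	TGID      uint32
	FPID      uint32
	_         uint32 // padding so that Pages is 8-byte aligned
	Pages     uint64
	Comm      [16]byte
	FComm     [16]byte
	Timestamp uint64
}

// decodeOomEvent parses one raw record; it fails if raw is too short.
func decodeOomEvent(raw []byte) (OomEvent, error) {
	var ev OomEvent
	err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &ev)
	return ev, err
}

// commString converts a NUL-padded task name into a Go string.
func commString(c [16]byte) string {
	if i := bytes.IndexByte(c[:], 0); i >= 0 {
		return string(c[:i])
	}
	return string(c[:])
}

func main() {
	// Hand-built sample record standing in for ring buffer data.
	raw := make([]byte, 64)
	binary.LittleEndian.PutUint32(raw[0:], 4242) // pid
	binary.LittleEndian.PutUint64(raw[16:], 512) // pages
	copy(raw[24:], "java")                       // comm

	ev, err := decodeOomEvent(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(ev.PID, ev.Pages, commString(ev.Comm))
}
```

Spelling out the padding field keeps the Go struct in step with the C layout; without it every field after `FPID` would be read 4 bytes too early.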

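Summary

This article covered three advanced topics:

Container-level OOM attribution: mapping a cgroup path to the Kubernetes Pod and container
Allocation-rate tracking: using per-CPU maps and sampling to follow allocation trends with low overhead
Rust Aya implementation: a different development model, with a comparison of Go and Rust for eBPF work

The next article covers the BPF OOM kernel patches, a new feature under discussion in the community that lets an eBPF program take over the kernel's OOM policy entirely.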

Series: Observability series
