BPF OOM Kernel Patches Deep Dive: Custom OOM Policies with eBPF

June 13, 2026 · Observability Series · Tags: eBPF, Linux, OOM, Kernel, Kfunc, PSI, Roman Gushchin


The previous articles showed how to use eBPF to observe OOM events, but we could only watch, not intervene. The kernel’s OOM Killer picks its victim with the oom_badness() heuristic, and user control is limited to coarse knobs such as oom_score_adj.

In 2025, Google engineer Roman Gushchin proposed the BPF OOM kernel patch series, aiming to let eBPF programs fully take over OOM handling policy. This is the biggest change to Linux memory management’s OOM subsystem in nearly two decades.

Timeline note: As of June 2026, the BPF OOM patches are still under RFC/review and have not been merged into any mainline kernel release. The content below is based on the patch series’ technical proposals — some interfaces and designs may change before final inclusion.

Why the Traditional OOM Killer Falls Short

The Linux OOM Killer has been around since roughly 2001, with its core logic largely unchanged:

Memory exhausted → __alloc_pages_slowpath fails
out_of_memory() called → iterate all processes
Calculate each process’s “badness” via oom_badness() (based on rss, swap, oom_score_adj)
Kill the “worst” process
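The scoring and selection steps can be modeled in user space. The sketch below follows the general shape of the heuristic (resident pages plus swap plus page-table pages, shifted by oom_score_adj scaled to total memory). The struct and function names are illustrative, not kernel definitions, and the real oom_badness() performs additional eligibility checks that are left out here.

```c
#include <limits.h>
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN (-1000)   /* tasks at this value are never selected */

/* Illustrative stand-in for the per-task data the kernel consults. */
struct proc_model {
    const char *name;
    long rss_pages;
    long swap_pages;
    long pgtable_pages;
    int  oom_score_adj;             /* -1000 .. 1000 */
};

/* Higher result = more likely victim; LONG_MIN = exempt. */
static long badness(const struct proc_model *p, long totalpages)
{
    if (p->oom_score_adj == OOM_SCORE_ADJ_MIN)
        return LONG_MIN;

    long points = p->rss_pages + p->swap_pages + p->pgtable_pages;
    /* Each oom_score_adj unit is worth 1/1000 of total memory. */
    long adj = (long)p->oom_score_adj * (totalpages / 1000);
    return points + adj;
}

/* Return the task with the highest score, or NULL if all are exempt. */
static const struct proc_model *select_victim(const struct proc_model *procs,
                                              size_t n, long totalpages)
{
    const struct proc_model *victim = NULL;
    long best = LONG_MIN;

    for (size_t i = 0; i < n; i++) {
        long pts = badness(&procs[i], totalpages);
        if (pts > best) {
            best = pts;
            victim = &procs[i];
        }
    }
    return victim;
}
```

With 1,000,000 total pages, a database holding 501,000 pages at oom_score_adj = -500 scores only 1,000, so a 200,500-page query process at adj 0 is selected instead. That is as far as the built-in heuristic can be steered: one number per process, with no notion of workload.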

The problem is that this single “worst” heuristic can’t be optimized for different workloads:

Database nodes: protect the DB process, kill query processes holding buffer pool memory
AI training clusters: graceful checkpoint instead of SIGKILL
Kubernetes nodes: cgroup-aware — kill the over-limit container, not kubelet
Latency-sensitive services: proactive handling via PSI signals before OOM
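To make the Kubernetes case above concrete, here is a minimal user-space sketch of a cgroup-aware choice: pick the unprotected group that exceeds its budget by the largest amount. Everything in it (memcg_model, budget_bytes, is_protected, pick_overlimit) is a hypothetical name invented for illustration; it is not how the patch series or Kubernetes represents this data.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of one container's memory cgroup. */
struct memcg_model {
    const char *name;
    long usage_bytes;
    long budget_bytes;    /* assumed per-container budget */
    bool is_protected;    /* e.g. kubelet, or the database process itself */
};

/* Pick the unprotected group that exceeds its budget by the most.
 * Returns NULL when nothing is over budget, so the caller can fall
 * back to another policy. */
static const struct memcg_model *pick_overlimit(const struct memcg_model *g,
                                                size_t n)
{
    const struct memcg_model *victim = NULL;
    long worst = 0;

    for (size_t i = 0; i < n; i++) {
        long over = g[i].usage_bytes - g[i].budget_bytes;
        if (g[i].is_protected || over <= worst)
            continue;
        worst = over;
        victim = &g[i];
    }
    return victim;
}
```

The point of the sketch is the shape of the decision, not the numbers: the policy reasons about groups and protection flags, which a single per-process badness score cannot express.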

Patch Series Overview

Roman Gushchin’s approach redesigns the OOM flow:

mermaid
flowchart TD
    classDef new fill:#FFECB3,stroke:#E65100,color:#BF360C
    classDef existing fill:#E3F2FD,stroke:#1565C0,color:#1565C0
    classDef result fill:#E8F5E9,stroke:#2E7D32,color:#1B5E20

    trig@{ shape: rounded, label: "Triggers<br/>PSI pressure / `__alloc_pages` fails" }

    hook@{ shape: diamond, label: "BPF OOM hook<br/>new hook point" }

    bpf@{ shape: proc, label: "BPF Program<br/>custom policy" }
    default@{ shape: proc, label: "Default OOM Killer<br/>`oom_badness`" }

    kill@{ shape: stadium, label: "`bpf_oom_kill_process()`<br/>user-defined" }
    ck@{ shape: stadium, label: "System recovers" }

    trig --> hook
    hook -->|"BPF attached"| bpf
    hook -->|"No BPF"| default

    bpf -->|"policy decision"| kill
    kill --> ck
    default --> ck

    class trig,default existing
    class hook,bpf,kill new
    class ck result

Patch Timeline

Version	Date	Patches	Notes
RFC v1	2025-04-28	12	Initial proposal, covered by LWN
v2	2025-10-27	23	Major revision, added memcg kfuncs and selftests
v3	2026-01-26	17	Further adjustments based on review feedback

LWN analysis: Custom out-of-memory killers in BPF
Patchset overview: mm: BPF OOM
Phoronix (v3 coverage): Updated Linux Patches
Kernel Recipes 2025 talk: BPFOOM

kfunc Interfaces

The patchset introduces several new kfuncs (BPF-callable kernel functions):

bpf_oom_kill_process()

c
int bpf_oom_kill_process(struct oom_control *oc, struct task_struct *p);

Called from a BPF OOM program to kill a specified process in exactly the same way as the kernel OOM Killer. Declared sleepable because killing a process may need to wait on locks or I/O.

bpf_out_of_memory()

c
int bpf_out_of_memory(struct mem_cgroup *memcg, int order);

Explicitly declares an out-of-memory condition for the given memory cgroup and triggers OOM handling. It can be called after a PSI event is detected: the BPF program decides that memory pressure is too high and invokes the OOM path itself.

Memcg kfuncs

c
struct task_struct *bpf_get_memcg_tasks(struct mem_cgroup *memcg);
long bpf_memcg_usage(struct mem_cgroup *memcg);
struct mem_cgroup *bpf_get_root_mem_cgroup(void);

About the Attachment Mechanism

How BPF programs bind to the kernel OOM hook is still an open design question. RFC v1 used fmod_ret (BPF_MODIFY_RETURN) attachment: the BPF program runs at the hook function and, by returning a non-zero value, takes over the decision so that the default path is skipped. The community is also discussing alternative approaches (such as a new BPF program type).
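The "take over or decline" contract is easy to model in plain C. The sketch below is a user-space analogy, not kernel code: oom_ctx, handle_oom, and the two sample policies are hypothetical names, and the convention that a non-zero return means "handled" is an assumption borrowed from how modify-return programs generally behave.

```c
#include <stddef.h>

/* Illustrative stand-in for the kernel's struct oom_control. */
struct oom_ctx {
    int victim_pid;          /* chosen victim, 0 if none yet */
    const char *handled_by;  /* which path made the decision */
};

/* A policy returns non-zero if it handled the OOM, 0 to decline. */
typedef int (*oom_policy_fn)(struct oom_ctx *ctx);

static int default_oom_killer(struct oom_ctx *ctx)
{
    ctx->victim_pid = 4242;  /* placeholder for the oom_badness() choice */
    ctx->handled_by = "default";
    return 1;
}

/* Dispatch: run the attached policy first; fall back to the default
 * killer when no policy is attached or the policy declines. */
static int handle_oom(struct oom_ctx *ctx, oom_policy_fn policy)
{
    if (policy) {
        int ret = policy(ctx);
        if (ret != 0)
            return ret;      /* policy took over; default path skipped */
    }
    return default_oom_killer(ctx);
}

/* Sample policy that always picks a (made-up) batch job. */
static int policy_kill_batch(struct oom_ctx *ctx)
{
    ctx->victim_pid = 777;
    ctx->handled_by = "bpf";
    return 1;
}

/* Sample policy that never takes over. */
static int policy_decline(struct oom_ctx *ctx)
{
    (void)ctx;
    return 0;
}
```

Calling handle_oom() with policy_kill_batch leaves the decision with the policy, while passing NULL or policy_decline ends in default_oom_killer(). Keeping the default path reachable whenever the policy declines is also the simplest form of the fallback behavior any attachment design has to provide.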

PSI-Driven Proactive OOM

The traditional OOM Killer is reactive: it acts only after reclaim has failed and an allocation can no longer make progress, by which point the system may already be stalled. The PSI-driven approach can trigger BPF programs proactively, as soon as memory pressure exceeds a threshold:

c
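#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
/* Proposed kfunc from the patch series; the signature may still change. */
extern int bpf_out_of_memory(struct mem_cgroup *memcg, int order) __ksym;

SEC("kprobe/psi_group_poll") /* illustrative attach point */
int BPF_KPROBE(psi_handler, struct psi_group *group, u32 reason)
{
    /* avg[state][1] is the 60 s window; values are fixed-point percentages (FSHIFT = 11) */
    u64 avg60 = BPF_CORE_READ(group, avg[PSI_MEM_FULL][1]);
    u64 full_pct = avg60 >> 11; /* integer percent, 0..100 */
    if (full_pct > 50) /* example threshold: 50% "full" memory stall */
        bpf_out_of_memory(NULL, 0); /* NULL memcg: not scoped to a cgroup */
    return 0;
}
char LICENSE[] SEC("license") = "GPL";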
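The kernel keeps PSI running averages as fixed-point percentages, using the same convention as the load average (an 11-bit fractional part). The user-space sketch below shows how such a raw value maps to the familiar "50.50" style percentage and how a threshold comparison works on it. The macros mirror the kernel's load-average helpers; psi_over_threshold and psi_format are hypothetical helper names added for illustration.

```c
#include <stdio.h>

#define FSHIFT       11                      /* bits of fractional precision */
#define FIXED_1      (1UL << FSHIFT)         /* 1.0 in fixed-point */
#define LOAD_INT(x)  ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1 - 1)) * 100)

/* True if the integer part of the average is at or above threshold_pct. */
static int psi_over_threshold(unsigned long raw_avg, unsigned long threshold_pct)
{
    return LOAD_INT(raw_avg) >= threshold_pct;
}

/* Render a raw average as "NN.NN", the format used by the pressure files. */
static void psi_format(unsigned long raw_avg, char *buf, size_t len)
{
    snprintf(buf, len, "%lu.%02lu", LOAD_INT(raw_avg), LOAD_FRAC(raw_avg));
}
```

For example, a raw value of 50 * 2048 + 1024 decodes to "50.50": the integer part is the value shifted right by 11, and the fraction is the low 11 bits scaled to hundredths. Comparing a raw value against a plain percentage without shifting first is an easy mistake to make.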

User-Space OOM Daemon vs BPF OOM

Dimension	User-space daemon (oomd / systemd-oomd)	BPF OOM
Latency	Seconds (PSI signal → user-space → kill)	Milliseconds (in-kernel)
Reliability	User-space can itself be OOM-killed	Runs in kernel, immune to OOM
Deployment	Separate daemon to configure and maintain	Load BPF program
Data access	Limited to exported interfaces (/proc, cgroupfs, PSI files)	Full kernel data structures
Policy flexibility	Fully flexible	BPF verifier limits
Observability	Logs + metrics	Logs + metrics + full kernel context

These approaches are not mutually exclusive. Ideally, BPF OOM handles standard scenarios while user-space daemons handle complex policies.
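For comparison, the user-space path starts by reading PSI from a text file such as /proc/pressure/memory (or a cgroup's memory.pressure). A minimal sketch of that first step is below; psi_line and parse_psi_line are hypothetical names, and a real daemon would add polling, hysteresis, and victim selection on top.

```c
#include <stdio.h>

/* One parsed line of a PSI pressure file. */
struct psi_line {
    char kind[8];                 /* "some" or "full" */
    double avg10, avg60, avg300;  /* percent of time stalled */
    unsigned long long total_us;  /* cumulative stall time in microseconds */
};

/* Parse a line such as:
 *   full avg10=1.50 avg60=12.25 avg300=3.00 total=123456
 * Returns 0 on success, -1 if the line does not match. */
static int parse_psi_line(const char *line, struct psi_line *out)
{
    int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   out->kind, &out->avg10, &out->avg60, &out->avg300,
                   &out->total_us);
    return n == 5 ? 0 : -1;
}
```

Every decision in this model costs a file read, a parse, and a wakeup of a process that is itself subject to memory pressure, which is where the latency and reliability rows in the table come from.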

Current Status and Outlook

As of June 2026:

RFC v1 (Apr 2025) → v2 (Oct 2025) → v3 (Jan 2026): ongoing iteration
Patches are posted against bpf-next and remain under community review
Roman Gushchin presented at Kernel Recipes 2025
No confirmed merge timeline

Key challenges:

Safety: allowing BPF programs to kill processes is high-risk; the verifier must ensure predictable behavior
Lock dependencies: BPF programs may hold oom_lock while calling operations that may trigger I/O
Hierarchical policy: how should OOM policy inheritance work in a cgroup hierarchy?
Fallback: ensuring the system doesn’t deadlock if a BPF program fails
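One possible answer to the hierarchy question is "nearest ancestor wins": walk from the cgroup that hit OOM up to the root and use the first policy found, with the kernel default as the last resort. The sketch below models only that idea in user space. cg_node, effective_policy, and the policy names are hypothetical, and nothing here claims to be what the patch series settles on.

```c
#include <stddef.h>

/* Hypothetical cgroup node with an optional attached policy. */
struct cg_node {
    const char *name;
    const struct cg_node *parent;  /* NULL for the root */
    const char *policy;            /* attached policy name, or NULL */
};

/* Walk toward the root and return the first policy found.
 * Falls back to the built-in killer when no ancestor has one. */
static const char *effective_policy(const struct cg_node *cg)
{
    for (; cg != NULL; cg = cg->parent) {
        if (cg->policy != NULL)
            return cg->policy;
    }
    return "kernel-default";
}
```

Under this rule a pod without its own policy inherits the one attached to its parent slice, while cgroups outside that subtree still get the default OOM Killer, which doubles as a fallback.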

Summary

BPF OOM is one of the most significant proposed changes to Linux memory management in recent years. It aims to solve the twenty-year-old “one-size-fits-all” problem by letting users tailor OOM policy to their workloads. Though the patches haven’t been merged yet, the design and technical approach are already highly instructive.

That wraps up the OOM observability topic for now — from eBPF fundamentals, to OOM event tracing, container-level pinpointing, and finally the BPF OOM kernel patches. Hopefully it provides a complete learning path from observing to controlling memory behavior with eBPF.

But the eBPF journey is far from over. The series will continue to explore more directions — from hands-on practice with eBPF observability tools like DeepFlow and Pixie, to comparisons of eBPF language ecosystems like Rust and Zig.
