BPF OOM Kernel Patches Deep Dive: Custom OOM Policies with eBPF
The previous articles showed how to use eBPF to observe OOM events, but we could only watch, not intervene. The kernel’s OOM Killer picks its victim with the oom_badness() heuristic, offering users little control beyond coarse knobs such as oom_score_adj.
In 2025, Google engineer Roman Gushchin proposed the BPF OOM kernel patch series, which aims to let eBPF programs take over OOM handling policy. If merged, it would be the biggest change to the OOM subsystem of Linux memory management in nearly two decades.
Timeline note: As of June 2026, the BPF OOM patches are still under RFC/review and have not been merged into any mainline kernel release. The content below is based on the patch series’ technical proposals — some interfaces and designs may change before final inclusion.
Why the Traditional OOM Killer Falls Short
The Linux OOM Killer has been around since roughly 2001, with its core logic largely unchanged:
- Memory exhausted → __alloc_pages_slowpath fails
- out_of_memory() called → iterate all processes
- Calculate each process’s “badness” via oom_badness() (based on rss, swap, oom_score_adj)
- Kill the “worst” process
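The scoring step above can be approximated in user space. The following is a simplified model, not kernel code: struct task_model, badness_model() and select_victim_model() are hypothetical names, and the real oom_badness() handles additional cases (tasks without an mm, tasks already being reaped, and so on) that are omitted here. It only shows the core idea: memory footprint in pages plus oom_score_adj scaled by total memory, with an adjustment of -1000 exempting a task entirely.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN (-1000)

/* Hypothetical stand-in for the per-task data the kernel reads. */
struct task_model {
    int pid;
    long rss_pages;     /* resident pages */
    long swap_pages;    /* swapped-out pages */
    long pgtable_pages; /* pages used by page tables */
    int oom_score_adj;  /* -1000..1000, as in /proc/<pid>/oom_score_adj */
};

/* Simplified badness: memory footprint plus a scaled adjustment. */
static long badness_model(const struct task_model *t, long total_pages)
{
    if (t->oom_score_adj == OOM_SCORE_ADJ_MIN)
        return LONG_MIN; /* exempt from OOM killing */

    long points = t->rss_pages + t->swap_pages + t->pgtable_pages;
    points += (long)t->oom_score_adj * (total_pages / 1000);
    return points;
}

/* Returns the index of the highest-scoring task, or -1 if none is eligible. */
static int select_victim_model(const struct task_model *tasks, size_t n,
                               long total_pages)
{
    assert(tasks != NULL || n == 0);

    int victim = -1;
    long worst = LONG_MIN;
    for (size_t i = 0; i < n; i++) {
        long points = badness_model(&tasks[i], total_pages);
        if (points > worst) {
            worst = points;
            victim = (int)i;
        }
    }
    return victim;
}
```

Note how a small process with a high oom_score_adj can outrank a much larger one: the adjustment is expressed in thousandths of total memory, so it easily dominates the raw footprint.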
The problem is that this single “worst” heuristic can’t be optimized for different workloads:
- Database nodes: protect the DB process, kill query processes holding buffer pool memory
- AI training clusters: graceful checkpoint instead of SIGKILL
- Kubernetes nodes: cgroup-aware — kill the over-limit container, not kubelet
- Latency-sensitive services: proactive handling via PSI signals before OOM
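To make the Kubernetes case concrete, here is a minimal user-space sketch of a cgroup-aware selection rule. Everything in it is an assumption for illustration: struct cgroup_model, its fields and pick_over_limit_cgroup() are hypothetical and do not correspond to kernel or patchset interfaces. The rule is simply "among unprotected groups, pick the one furthest over its limit".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a container cgroup; not a kernel structure. */
struct cgroup_model {
    const char *name;
    long long usage_bytes;
    long long limit_bytes;
    bool is_protected; /* e.g. kubelet or other node-critical services */
};

/*
 * Returns the index of the unprotected group with the largest overage
 * (usage minus limit), or -1 if no unprotected group is over its limit.
 */
static int pick_over_limit_cgroup(const struct cgroup_model *groups, size_t n)
{
    assert(groups != NULL || n == 0);

    int victim = -1;
    long long worst_overage = 0;
    for (size_t i = 0; i < n; i++) {
        if (groups[i].is_protected)
            continue;
        long long overage = groups[i].usage_bytes - groups[i].limit_bytes;
        if (overage > worst_overage) {
            worst_overage = overage;
            victim = (int)i;
        }
    }
    return victim;
}
```

A global badness score cannot express this rule, because it ranks processes by absolute footprint rather than by how far a group has strayed from its own budget.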
Patch Series Overview
Roman Gushchin’s approach redesigns the OOM flow:
flowchart TD
classDef new fill:#FFECB3,stroke:#E65100,color:#BF360C
classDef existing fill:#E3F2FD,stroke:#1565C0,color:#1565C0
classDef result fill:#E8F5E9,stroke:#2E7D32,color:#1B5E20
ps@{ shape: rounded, label: "PSI memory pressure (early trigger)" }
alloc@{ shape: rounded, label: "__alloc_pages fails (direct reclaim)" }
hook@{ shape: diamond, label: "BPF OOM hook (new hook point)" }
bpf@{ shape: proc, label: "BPF Program (custom policy)" }
default@{ shape: proc, label: "Default OOM Killer (oom_badness)" }
kill@{ shape: stadium, label: "bpf_oom_kill_process() (user-defined)" }
ck@{ shape: stadium, label: "System recovers" }
ps --> hook
alloc --> hook
hook -->|"BPF attached"| bpf
hook -->|"No BPF"| default
bpf --> kill
kill --> ck
default --> ck
class ps,alloc existing
class hook,bpf,kill new
class default existing
class ck result
Patch Timeline
| Version | Date | Patches | Notes |
|---|---|---|---|
| RFC v1 | 2025-04-28 | 12 | Initial proposal, covered by LWN |
| v2 | 2025-10-27 | 23 | Major revision, added memcg kfuncs and selftests |
| v3 | 2026-01-26 | 17 | Further adjustments based on review feedback |
- LWN analysis: Custom out-of-memory killers in BPF
- Patchset overview: mm: BPF OOM
- Phoronix (v3 coverage): Updated Linux Patches
- Kernel Recipes 2025 talk: BPFOOM
kfunc Interfaces
The patchset introduces several new kfuncs (BPF-callable kernel functions):
bpf_oom_kill_process()
Called from a BPF OOM program to kill a specified process in exactly the same way as the kernel OOM Killer. Declared sleepable because killing a process may need to wait on locks or I/O.
bpf_out_of_memory()
Explicitly declare an OOM state and trigger OOM handling. Can be used after PSI event detection — the BPF program decides “memory pressure is too high, need to trigger OOM.”
Memcg kfuncs
About the Attachment Mechanism
How BPF programs bind to the kernel OOM hook is still an open design question. RFC v1 used fmodret (function modifier return) — the BPF program executes before the OOM hook returns, and can decide whether to take over. The community is also discussing alternative approaches (such as a new BPF program type).
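The "decide whether to take over" behaviour can be modelled in plain C. This is a sketch of the control flow only, under stated assumptions: handle_oom_model(), the oom_policy_fn type and the three sample policies are hypothetical, and the rule that a policy must actually free memory to count as having handled the event is an assumption made here for illustration, not a description of the patchset.

```c
#include <assert.h>
#include <stddef.h>

enum oom_outcome { OOM_BY_CUSTOM_POLICY, OOM_BY_DEFAULT_KILLER };

/* A policy returns nonzero when it claims to have handled the event. */
typedef int (*oom_policy_fn)(int *freed_pages);

/* Sample policy that declines, leaving the event to the default killer. */
static int policy_declines(int *freed_pages)
{
    (void)freed_pages;
    return 0;
}

/* Sample policy that takes over and reports freed memory. */
static int policy_frees_memory(int *freed_pages)
{
    *freed_pages = 256;
    return 1;
}

/* Misbehaving policy: claims success but frees nothing. */
static int policy_claims_without_freeing(int *freed_pages)
{
    (void)freed_pages;
    return 1;
}

/* Runs the attached policy first and falls back to the default path. */
static enum oom_outcome handle_oom_model(oom_policy_fn policy, int *freed_pages)
{
    assert(freed_pages != NULL);
    *freed_pages = 0;

    if (policy != NULL && policy(freed_pages) != 0 && *freed_pages > 0)
        return OOM_BY_CUSTOM_POLICY;

    *freed_pages = 0; /* ignore anything a failed policy reported */
    return OOM_BY_DEFAULT_KILLER;
}
```

Checking the reported progress rather than trusting the return value alone is one way to keep a buggy policy from stalling recovery: the default killer still runs whenever the custom policy made no difference.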
PSI-Driven Proactive OOM
The traditional OOM Killer is reactive: it acts only after reclaim has failed and an allocation can no longer be satisfied. The PSI-driven approach can trigger BPF programs proactively, as soon as memory pressure exceeds a threshold.
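The threshold decision itself is easy to sketch in user space. The example below parses one line in the format used by /proc/pressure/memory (for instance "some avg10=0.00 avg60=0.00 avg300=0.00 total=0") and compares the 10-second average against a limit. It illustrates the decision only, not the in-kernel BPF hook: struct psi_line, parse_psi_line() and pressure_exceeds() are hypothetical names, and using avg10 as the trigger is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of a PSI pressure file. */
struct psi_line {
    char kind[8];                /* "some" or "full" */
    double avg10, avg60, avg300; /* percent of time stalled */
    unsigned long long total_us; /* cumulative stall time in microseconds */
};

/* Parses a single PSI line; returns false if the format does not match. */
static bool parse_psi_line(const char *line, struct psi_line *out)
{
    assert(line != NULL && out != NULL);

    int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   out->kind, &out->avg10, &out->avg60, &out->avg300,
                   &out->total_us);
    if (n != 5)
        return false;
    return strcmp(out->kind, "some") == 0 || strcmp(out->kind, "full") == 0;
}

/* True when the short-term average reaches the given percentage. */
static bool pressure_exceeds(const struct psi_line *p, double threshold_pct)
{
    return p->avg10 >= threshold_pct;
}
```

A proactive policy would run this kind of check continuously and act while the system is still responsive, instead of waiting for an allocation to fail.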
User-Space OOM Daemon vs BPF OOM
| Dimension | User-space daemon (oomd / systemd-oomd) | BPF OOM |
|---|---|---|
| Latency | Seconds (PSI signal → user-space → kill) | Milliseconds (in-kernel) |
| Reliability | User-space can itself be OOM-killed | Runs in kernel, immune to OOM |
| Deployment | Separate daemon to configure and maintain | Load BPF program |
| Data access | Limited to exported interfaces (/proc, cgroupfs) | Full kernel data structures |
| Policy flexibility | Fully flexible | BPF verifier limits |
| Observability | Logs + metrics | Logs + metrics + full kernel context |
These approaches are not mutually exclusive. Ideally, BPF OOM handles standard scenarios while user-space daemons handle complex policies.
Current Status and Outlook
As of June 2026:
- RFC v1 (Apr 2025) → v2 (Oct 2025) → v3 (Jan 2026): ongoing iteration
- Patches target bpf-next and remain under community review
- Roman Gushchin presented at Kernel Recipes 2025
- No confirmed merge timeline
Key challenges:
- Safety: allowing BPF programs to kill processes is high-risk; the verifier must ensure predictable behavior
- Lock dependencies: BPF programs may hold oom_lock while calling operations that may trigger I/O
- Hierarchical policy: how should OOM policy inheritance work in a cgroup hierarchy?
- Fallback: ensuring the system doesn’t deadlock if a BPF program fails
Summary
BPF OOM is one of the most significant proposed changes to Linux memory management in recent years. It aims to solve the twenty-year-old “one-size-fits-all” problem by letting users tailor OOM policy to their workloads. Though the patches haven’t been merged yet, the design and technical approach are already highly instructive.
This concludes the eBPF Observability series. From fundamentals to OOM event tracing, container-level pinpointing, memory allocation analysis, and finally the BPF OOM kernel patches — hopefully this provides a complete learning path from observing to controlling OOM behavior with eBPF.