BPF OOM Kernel Patches Deep Dive: Custom OOM Policies with eBPF

The previous articles showed how to use eBPF to observe OOM events. But we could only watch, not intervene. The kernel’s OOM Killer decides who lives and dies based on the oom_badness() algorithm, and user control is limited to static knobs such as oom_score_adj.

In 2025, Google engineer Roman Gushchin proposed the BPF OOM kernel patch series, aiming to let eBPF programs fully take over OOM handling policy. If merged, it would arguably be the biggest change to Linux memory management’s OOM subsystem in nearly two decades.

Timeline note: As of June 2026, the BPF OOM patches are still under RFC/review and have not been merged into any mainline kernel release. The content below is based on the patch series’ technical proposals — some interfaces and designs may change before final inclusion.

Why the Traditional OOM Killer Falls Short

The Linux OOM Killer has been around since roughly 2001, with its core logic largely unchanged:

  1. Memory exhausted → __alloc_pages_slowpath fails
  2. out_of_memory() called → iterate all processes
  3. Calculate each process’s “badness” via oom_badness() (based on rss, swap, oom_score_adj)
  4. Kill the “worst” process
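
The scoring step can be made concrete with a small user-space model. This is a simplified sketch rather than kernel code: `struct task_mem`, `model_badness()` and `model_select_victim()` are hypothetical names, and the formula only mirrors the inputs listed above (resident pages, swap, page tables, and an oom_score_adj adjustment scaled by total memory).

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical user-space stand-in for the per-task counters the kernel reads. */
struct task_mem {
    long rss_pages;       /* resident pages */
    long swap_pages;      /* swapped-out pages */
    long pgtable_pages;   /* pages used by page tables */
    int  oom_score_adj;   /* -1000 .. 1000, from /proc/<pid>/oom_score_adj */
};

/* Returns a badness score in pages; LONG_MIN means "never kill". */
long model_badness(const struct task_mem *t, long total_pages)
{
    assert(t->oom_score_adj >= -1000 && t->oom_score_adj <= 1000);
    if (t->oom_score_adj == -1000)
        return LONG_MIN;

    long points = t->rss_pages + t->swap_pages + t->pgtable_pages;
    /* Each unit of oom_score_adj shifts the score by 0.1% of total memory. */
    points += (long)t->oom_score_adj * (total_pages / 1000);
    return points;
}

/* Picks the index of the highest-scoring task, or -1 if none is killable. */
int model_select_victim(const struct task_mem *tasks, int n, long total_pages)
{
    int victim = -1;
    long worst = LONG_MIN;
    for (int i = 0; i < n; i++) {
        long score = model_badness(&tasks[i], total_pages);
        if (score > worst) {
            worst = score;
            victim = i;
        }
    }
    return victim;
}
```

Note how a small task with a high oom_score_adj can outrank a much larger one: the adjustment is proportional to total memory, not to the task's own footprint.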

The problem is that this single “worst” heuristic can’t be optimized for different workloads:

  • Database nodes: protect the DB process, kill query processes holding buffer pool memory
  • AI training clusters: graceful checkpoint instead of SIGKILL
  • Kubernetes nodes: cgroup-aware — kill the over-limit container, not kubelet
  • Latency-sensitive services: proactive handling via PSI signals before OOM
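
To illustrate what a workload-specific policy might look like, here is a user-space sketch of the cgroup-aware case: skip protected components and pick the container that exceeds its memory budget by the largest margin. Everything in it (`struct container`, `budget_bytes`, `pick_victim()`) is a hypothetical illustration, not an interface from the patch series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of one container's cgroup; not a kernel structure. */
struct container {
    const char *name;
    long usage_bytes;    /* current memory usage */
    long budget_bytes;   /* memory the container is expected to stay within */
    bool is_protected;   /* e.g. kubelet or the database server */
};

/* Returns the unprotected container furthest above its budget, or NULL. */
const struct container *pick_victim(const struct container *c, size_t n)
{
    const struct container *victim = NULL;
    long worst_excess = 0;
    for (size_t i = 0; i < n; i++) {
        assert(c[i].usage_bytes >= 0 && c[i].budget_bytes >= 0);
        if (c[i].is_protected)
            continue;   /* never select protected components */
        long excess = c[i].usage_bytes - c[i].budget_bytes;
        if (excess > worst_excess) {
            worst_excess = excess;
            victim = &c[i];
        }
    }
    return victim;
}
```

A NULL result means no container is over budget, in which case a real policy would presumably defer to the default OOM Killer.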

Patch Series Overview

Roman Gushchin’s approach redesigns the OOM flow:

```mermaid
flowchart TD
    classDef new fill:#FFECB3,stroke:#E65100,color:#BF360C
    classDef existing fill:#E3F2FD,stroke:#1565C0,color:#1565C0
    classDef result fill:#E8F5E9,stroke:#2E7D32,color:#1B5E20

    ps@{ shape: rounded, label: "PSI memory pressure (early trigger)" }
    alloc@{ shape: rounded, label: "__alloc_pages fails (direct reclaim)" }

    hook@{ shape: diamond, label: "BPF OOM hook (new hook point)" }

    bpf@{ shape: proc, label: "BPF Program (custom policy)" }
    default@{ shape: proc, label: "Default OOM Killer (oom_badness)" }

    kill@{ shape: stadium, label: "bpf_oom_kill_process() (user-defined)" }
    ck@{ shape: stadium, label: "System recovers" }

    ps --> hook
    alloc --> hook
    hook -->|"BPF attached"| bpf
    hook -->|"No BPF"| default

    bpf --> kill
    kill --> ck
    default --> ck

    class ps,alloc existing
    class hook,bpf,kill new
    class default existing
    class ck result
```

Patch Timeline

| Version | Date | Patches | Notes |
|---|---|---|---|
| RFC v1 | 2025-04-28 | 12 | Initial proposal, covered by LWN |
| v2 | 2025-10-27 | 23 | Major revision, added memcg kfuncs and selftests |
| v3 | 2026-01-26 | 17 | Further adjustments based on review feedback |

kfunc Interfaces

The patchset introduces several new kfuncs (BPF-callable kernel functions):

bpf_oom_kill_process()

```c
int bpf_oom_kill_process(struct oom_control *oc, struct task_struct *p);
```

Called from a BPF OOM program to kill a specified process in exactly the same way as the kernel OOM Killer. Declared sleepable because killing a process may need to wait on locks or I/O.

bpf_out_of_memory()

```c
int bpf_out_of_memory(struct mem_cgroup *memcg, int order);
```

Explicitly declares an OOM condition and triggers OOM handling. A BPF program can call it after detecting a PSI event, once it decides that memory pressure is too high and an OOM pass is needed.

Memcg kfuncs

```c
struct task_struct *bpf_get_memcg_tasks(struct mem_cgroup *memcg);
long bpf_memcg_usage(struct mem_cgroup *memcg);
struct mem_cgroup *bpf_get_root_mem_cgroup(void);
```

About the Attachment Mechanism

How BPF programs bind to the kernel OOM hook is still an open design question. RFC v1 used fmod_ret (BPF_MODIFY_RETURN) programs: the BPF program attaches to the OOM hook function and can override its return value, which is how it signals whether it has taken over. The community is also discussing alternative approaches (such as a new BPF program type).
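
The take-over-or-fall-back control flow can be modeled in plain user-space C. This is only a sketch of the idea under stated assumptions: `handle_oom()`, `oom_policy_fn` and the toy policies are hypothetical names, and the convention "nonzero return means the policy handled the event" is an assumption made for illustration rather than the interface the patches define.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a policy returns nonzero when it handled the OOM event. */
typedef int (*oom_policy_fn)(int order);

int default_kills;   /* counts how often the default killer ran */

void default_oom_killer(int order)
{
    (void)order;
    default_kills++;
}

/* Run the attached policy first; fall back to the default killer otherwise. */
bool handle_oom(oom_policy_fn attached, int order)
{
    assert(order >= 0);
    if (attached && attached(order) != 0)
        return true;            /* policy took over, default path is skipped */
    default_oom_killer(order);  /* no policy attached, or policy declined */
    return false;
}

/* Two toy policies for demonstration. */
int policy_handles_small(int order) { return order == 0; }
int policy_declines(int order)      { (void)order; return 0; }
```

Whatever attachment mechanism is finally chosen, the property worth preserving is the one this model makes explicit: a missing or declining policy must always leave the default OOM Killer in charge.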

PSI-Driven Proactive OOM

The traditional OOM Killer is reactive: it acts only after reclaim has failed and an allocation can no longer make progress. The PSI-driven approach can trigger BPF programs proactively when memory pressure exceeds a threshold:

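```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

/* kfunc from the patch series; signature as listed above, may change */
extern int bpf_out_of_memory(struct mem_cgroup *memcg, int order) __ksym;

/* Illustrative attach point only; the real hook depends on the patch version */
SEC("kprobe/psi_group_poll")
int BPF_KPROBE(psi_handler, struct psi_group *group, u32 reason)
{
    /* avg[state][0..2] = 10s/60s/300s averages, fixed-point, 11 fractional bits */
    u64 avg60 = BPF_CORE_READ(group, avg[PSI_MEM_FULL][1]);
    u64 full_pct = avg60 >> 11;   /* integer percent of time fully stalled */

    if (full_pct > 50)            /* example threshold, tune per workload */
        bpf_out_of_memory(NULL, 0);
    return 0;
}
```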
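
The kernel stores PSI averages as fixed-point percentages with 11 fractional bits (`FSHIFT` in include/linux/sched/loadavg.h), so a raw value has to be converted before it is compared with a human-readable threshold. A minimal user-space sketch of that arithmetic follows; `psi_pct_int()`, `psi_pct_frac()` and `psi_above()` are hypothetical helper names.

```c
#include <assert.h>

/* Values from include/linux/sched/loadavg.h */
#define FSHIFT   11
#define FIXED_1  (1UL << FSHIFT)

/* Integer part of a fixed-point PSI average, in percent. */
unsigned long psi_pct_int(unsigned long raw)
{
    return raw >> FSHIFT;
}

/* Two-digit fractional part, as printed in /proc/pressure/memory. */
unsigned long psi_pct_frac(unsigned long raw)
{
    return ((raw & (FIXED_1 - 1)) * 100) >> FSHIFT;
}

/* Hypothetical helper: nonzero when the average exceeds threshold_pct percent. */
int psi_above(unsigned long raw, unsigned long threshold_pct)
{
    assert(threshold_pct <= 100);
    return raw > threshold_pct * FIXED_1;
}
```

Comparing against `threshold_pct * FIXED_1` instead of shifting the raw value keeps the fractional bits, so 50.5% correctly counts as above a 50% threshold.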

User-Space OOM Daemon vs BPF OOM

| Dimension | User-space daemon (oomd / systemd-oomd) | BPF OOM |
|---|---|---|
| Latency | Seconds (PSI signal → user-space → kill) | Milliseconds (in-kernel) |
| Reliability | User-space can itself be OOM-killed | Runs in kernel, immune to OOM |
| Deployment | Separate daemon to configure and maintain | Load BPF program |
| Data access | Limited to /proc | Full kernel data structures |
| Policy flexibility | Fully flexible | BPF verifier limits |
| Observability | Logs + metrics | Logs + metrics + full kernel context |

These approaches are not mutually exclusive. Ideally, BPF OOM handles standard scenarios while user-space daemons handle complex policies.
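
For comparison, the user-space side of this split usually starts by reading PSI from /proc/pressure/memory, where each line has the form `full avg10=0.00 avg60=0.00 avg300=0.00 total=0`. The sketch below parses one such line; `struct psi_line` and `parse_psi_line()` are hypothetical names, and a real daemon would add file I/O and polling around it.

```c
#include <assert.h>
#include <stdio.h>

struct psi_line {
    char kind[8];                 /* "some" or "full" */
    double avg10, avg60, avg300;  /* percent of time stalled */
    unsigned long long total_us;  /* cumulative stall time in microseconds */
};

/* Parses one line of /proc/pressure/memory; returns 0 on success, -1 otherwise. */
int parse_psi_line(const char *line, struct psi_line *out)
{
    assert(line && out);
    int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   out->kind, &out->avg10, &out->avg60, &out->avg300,
                   &out->total_us);
    return n == 5 ? 0 : -1;
}
```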

Current Status and Outlook

As of June 2026:

  • RFC v1 (Apr 2025) → v2 (Oct 2025) → v3 (Jan 2026): ongoing iteration
  • Patches still on bpf-next, under community review
  • Roman Gushchin presented at Kernel Recipes 2025
  • No confirmed merge timeline

Key challenges:

  1. Safety: allowing BPF programs to kill processes is high-risk; the verifier must ensure predictable behavior
  2. Lock dependencies: BPF programs may hold oom_lock while calling operations that may trigger I/O
  3. Hierarchical policy: how should OOM policy inheritance work in a cgroup hierarchy?
  4. Fallback: ensuring the system doesn’t deadlock if a BPF program fails
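
The hierarchical-policy question can be made concrete with one possible answer: "the nearest ancestor with a policy wins". The user-space sketch below models only that rule. It is an assumption for illustration, not what the patches implement, and `struct cg_node` and `effective_policy()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cgroup node; policy_id 0 means "no policy attached here". */
struct cg_node {
    const struct cg_node *parent;
    int policy_id;
};

/* Walks toward the root and returns the first policy found,
 * or 0 to indicate the default OOM Killer. */
int effective_policy(const struct cg_node *cg)
{
    assert(cg != NULL);
    for (const struct cg_node *n = cg; n != NULL; n = n->parent) {
        if (n->policy_id != 0)
            return n->policy_id;
    }
    return 0;
}
```

Other rules are equally plausible, for example letting a parent's policy always override its children, which is part of why this remains an open design point.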

Summary

BPF OOM is one of the most significant changes proposed for Linux memory management in recent years. It aims to solve the twenty-year-old “one-size-fits-all” problem by letting users tailor OOM policy to their workloads. Though the patches haven’t been merged yet, the design and technical approach are already highly instructive.

This concludes the eBPF Observability series. From fundamentals to OOM event tracing, container-level pinpointing, memory allocation analysis, and finally the BPF OOM kernel patches — hopefully this provides a complete learning path from observing to controlling OOM behavior with eBPF.