userfaultfd: Lazy Memory for Instant VM Restore

Ajay Kumar · 9 min read

userfaultfd is a Linux kernel feature that delivers page-fault events to user space. Normally, when a process touches a page of memory that isn't physically present, the kernel resolves the fault itself: it zero-fills the page, reads it back from swap, or pages it in from a mapped file. With userfaultfd, the kernel instead hands that fault to a handler you wrote, and waits. Your handler decides what the page should contain, installs it, and the faulting thread resumes as if nothing happened. That single inversion, the kernel asking your code to populate memory on demand, is the basis for post-copy live migration, lazy CRIU checkpoint/restore, and lazy snapshot restore. PandaStack uses it to stream a microVM's RAM from object storage so a guest can resume before its whole memory image has been downloaded. This post explains how userfaultfd works, then walks through the exact fault path we use so that restore does not wait on a multi-gigabyte download.

What is userfaultfd?

Memory in a Linux process is virtual. A region can be mapped — the address is valid — without any physical page behind it yet. The first time the process reads or writes such an address, the CPU raises a page fault and traps into the kernel, which figures out what belongs there and wires up a physical page. This is demand paging, and it's how mmap, the page cache, and swap all work. The key point is that the resolution policy normally lives entirely inside the kernel.

userfaultfd moves that policy into user space for a chosen range of memory. You open a userfaultfd (a file descriptor, via the userfaultfd(2) syscall), register an address range with it, and from then on faults in that range don't get resolved by the kernel. Instead the kernel posts an event — "a thread faulted at address X" — onto the userfaultfd, and parks the faulting thread. A handler thread reads that event, fetches or computes the page's contents, and installs it with a dedicated ioctl. Only then is the parked thread released to continue.

Two ioctls do the installing. UFFDIO_COPY copies a page of real content you provide into the faulting address atomically. UFFDIO_ZEROPAGE installs an all-zero page without you supplying any data. Both wake the waiting thread on completion. The handler is the whole point: you, in user space, get to answer "what is at this address?" lazily, the moment — and only the moment — the program actually reaches for it.

The mental model: a page fault is normally a question the kernel answers itself. userfaultfd reroutes that question to your handler. Nothing is copied until the program touches the address — and untouched pages are never fetched at all.
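To make the park, resolve, wake sequence concrete, here is a toy model in plain Python. It does not call the real userfaultfd(2) API: a queue stands in for the descriptor, a threading.Event stands in for the kernel parking a thread, and every name (LazyMemory, install, handler_loop) is invented for this sketch.

```python
import queue
import threading

PAGE_SIZE = 4096


class LazyMemory:
    """Toy stand-in for a userfaultfd-registered range (not real kernel paging)."""

    def __init__(self):
        self.pages = {}              # page number -> bytes, the "installed" pages
        self.events = queue.Queue()  # plays the role of the userfaultfd descriptor
        self.lock = threading.Lock()
        self.waiters = {}            # page number -> Event that parked readers wait on

    def read(self, addr):
        """Read one byte; park until the handler installs the page if it is missing."""
        page = addr // PAGE_SIZE
        with self.lock:
            if page in self.pages:
                return self.pages[page][addr % PAGE_SIZE]
            waiter = self.waiters.get(page)
            if waiter is None:
                waiter = self.waiters[page] = threading.Event()
                self.events.put(page)   # "a thread faulted at page N"
        waiter.wait()                   # parked until the page is installed
        return self.pages[page][addr % PAGE_SIZE]

    def install(self, page, data):
        """Rough analogue of UFFDIO_COPY: place the bytes, then wake the waiters."""
        with self.lock:
            self.pages[page] = data
            waiter = self.waiters.pop(page, None)
        if waiter is not None:
            waiter.set()


def handler_loop(mem, backing, served):
    """Serve fault events from `backing` until a None sentinel arrives."""
    while True:
        page = mem.events.get()
        if page is None:
            return
        served.append(page)
        mem.install(page, backing(page))
```

Only pages that some reader actually touches ever reach the handler, which is the property the rest of this post builds on.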

What userfaultfd is actually for

userfaultfd exists because several important problems are really the same problem: bring a process's memory back to life without copying all of it up front. The canonical users:

  • Live migration — moving a running VM or process to another host. With post-copy migration, you start the workload on the destination immediately and pull pages from the source over the network as the guest faults on them, instead of blocking until the entire RAM image has transferred.
  • CRIU (Checkpoint/Restore In Userspace) — freezing a process tree to disk and restoring it later. Lazy restore registers the restored memory with userfaultfd so pages stream back from the image only as the process touches them, cutting restore-to-first-instruction time.
  • Lazy snapshot restore for VMs — the case this post is about. A VM snapshot includes the guest's full physical RAM. Restoring it eagerly means reading the whole memory file before the guest runs. userfaultfd lets the guest start and pulls pages on demand.

The common thread is that working sets are smaller than total memory. A guest may have 2 GiB of RAM but touch only a few hundred megabytes to get to a ready state. If you can serve exactly the pages it touches, exactly when it touches them, you skip the rest entirely — and you can start before any of it has arrived.

The problem: snapshot restore without the download

Some background on how PandaStack boots is useful here. Every sandbox is a Firecracker microVM — its own guest kernel, hardware-virtualization isolation via KVM, not a shared-kernel container. There is no warm pool of idle VMs. Every create restores a baked Firecracker snapshot of a ready guest, which is what gets a fresh sandbox to a usable state with a p50 of about 179ms and a p99 around 203ms. (The how-Firecracker-boots-fast post covers why restore beats cold boot.)

A full snapshot is two things: the rootfs disk and the memory image, vm.mem, which is the guest's entire physical RAM byte for byte. A 2 GiB guest produces a 2 GiB memory file. When that snapshot already lives on the local host, restore maps vm.mem with MAP_PRIVATE, so the kernel pages it in lazily and guest writes are copy-on-write: fast, with no network involved. The deep mechanics live in the snapshot-restore internals docs.
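The local case can be illustrated with Python's mmap module, whose ACCESS_COPY mode corresponds to a MAP_PRIVATE mapping on Linux. This is only a small demonstration of the copy-on-write behavior, using a temporary file as a stand-in for vm.mem; it is not how Firecracker itself maps guest memory.

```python
import mmap
import os
import tempfile


def map_private(path):
    """Map a file copy-on-write (ACCESS_COPY is a MAP_PRIVATE mapping on Linux)."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are read in on first touch.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)


def demo():
    """Write through a private mapping and show the file on disk is unchanged."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"A" * mmap.PAGESIZE * 4)   # stand-in for a tiny vm.mem
        path = f.name
    try:
        mem = map_private(path)
        mem[0:5] = b"guest"                 # lands in a private copy of page 0
        in_mapping = bytes(mem[0:5])
        mem.close()
        with open(path, "rb") as f:
            on_disk = f.read(5)
        return in_mapping, on_disk
    finally:
        os.unlink(path)
```

The write is visible through the mapping but never reaches the file, so many guests can restore from one memory image without disturbing it.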

The friction appears when the memory image isn't local yet. In a multi-host fleet, snapshots are published to object storage so any agent can restore any template. The naive path is: download the whole multi-gigabyte vm.mem to local disk, then restore. That download is pure dead time — the guest can't run until it finishes, and most of those bytes are pages the guest will never touch this boot. userfaultfd is how we delete that wait.

userfaultfd streams memory, not disk. The rootfs still has to be a local file, because copy-on-write cloning (XFS reflink or dm-snapshot) needs a local block device. Streaming removes the big vm.mem download specifically — the agent still syncs rootfs and seed artifacts locally.

The fault flow: kernel fault to UFFDIO_COPY

Firecracker supports a UFFD restore mode: instead of pointing the VMM at a local memory file, you hand it a userfaultfd and let an external process back the guest's memory. PandaStack's agent is that external process. The setup and the steady-state loop:

  1. Before loading the snapshot, the agent starts its userfaultfd handler listening on a Unix socket, then starts Firecracker in UFFD mode pointed at that socket.
  2. Firecracker connects and sends, over the socket, the userfaultfd file descriptor (via SCM_RIGHTS) plus the layout of the guest memory regions: which address ranges of guest memory, as mapped in Firecracker's address space, correspond to which file offsets in vm.mem.
  3. The guest resumes. It touches a page of RAM that hasn't been populated yet.
  4. The CPU raises a page fault; the kernel sees the address is registered with userfaultfd and, instead of resolving it, posts a fault event on the descriptor and parks the faulting vCPU thread.
  5. The agent's handler reads the event, translates the faulting guest address into an offset in vm.mem, and fetches the surrounding 4 MiB chunk from object storage with an HTTP Range GET.
  6. The handler installs the page (or the whole chunk's worth of pages) into the guest's address space with UFFDIO_COPY, which atomically places the bytes and wakes the parked vCPU.
  7. The vCPU resumes exactly where it faulted, now seeing valid memory. It never knew the page came over the network.
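Step 2 relies on SCM_RIGHTS, the Unix-socket mechanism for handing an open file descriptor to another process. A minimal sketch using Python's socket.send_fds and socket.recv_fds (Python 3.9+) follows. A pipe stands in for the userfaultfd, and the region field names (base_addr, size, file_offset) and the JSON encoding are assumptions made for illustration, not a documented wire format.

```python
import json
import os
import socket


def send_handoff(sock, fd, regions):
    """Send one descriptor plus a JSON region layout in a single message."""
    payload = json.dumps(regions).encode()
    socket.send_fds(sock, [payload], [fd])   # fd travels as SCM_RIGHTS ancillary data


def recv_handoff(sock):
    """Receive the layout and a new local descriptor for the same open file."""
    data, fds, _flags, _addr = socket.recv_fds(sock, 65536, 1)
    return json.loads(data), fds[0]


def demo():
    """Pass the read end of a pipe across a socketpair, standing in for the uffd."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    try:
        # Hypothetical layout: one 1 GiB region starting at offset 0 of vm.mem.
        layout = [{"base_addr": 0x7F0000000000, "size": 1 << 30, "file_offset": 0}]
        send_handoff(left, read_end, layout)
        got_layout, got_fd = recv_handoff(right)
        os.write(write_end, b"ping")
        echoed = os.read(got_fd, 4)          # the received fd reads the same pipe
        os.close(got_fd)
        return got_layout, echoed
    finally:
        for fd in (read_end, write_end):
            os.close(fd)
        left.close()
        right.close()
```

The receiver gets its own descriptor number referring to the same kernel object, which is why the handler can service faults for memory it did not register itself.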

That loop is the entire idea. The guest is running the whole time; faults are just brief stalls while a chunk arrives. Pages the guest never touches are never fetched, so a restore that needs a few hundred megabytes of working set pays for a few hundred megabytes, not the full image.
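The address translation in step 5 is simple arithmetic once the region layout is known. A sketch, assuming each region is described by a page-aligned base address, a size, and a starting offset in vm.mem (the Region type and its field names are hypothetical):

```python
from dataclasses import dataclass
from typing import Optional, Sequence

PAGE_SIZE = 4096


@dataclass(frozen=True)
class Region:
    base_addr: int     # first address of the region as seen in fault events
    size: int          # region length in bytes
    file_offset: int   # where the region starts inside vm.mem


def fault_to_file_offset(addr: int, regions: Sequence[Region]) -> Optional[int]:
    """Return the page-aligned vm.mem offset backing `addr`, or None if unmapped."""
    for region in regions:
        if region.base_addr <= addr < region.base_addr + region.size:
            page_addr = addr - (addr % PAGE_SIZE)   # faults are resolved per page
            return region.file_offset + (page_addr - region.base_addr)
    return None
```

A fault at 0x10001234 in a region based at 0x10000000 with file offset 0 resolves to offset 0x1000, the start of the page that contains it.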

Making on-demand paging fast enough to matter

A page-at-a-time round trip to object storage on every fault would be too slow — network latency would show up as guest jitter. Four things keep streamed restore close to local-disk speed.

4 MiB chunks, not 4 KiB pages

Guest memory faults arrive a 4 KiB page at a time, but we fetch in 4 MiB chunks. Memory access has strong spatial locality, so when the guest faults on one page it usually wants its neighbors next. One Range GET amortizes the network round trip across 1,024 pages (4 MiB / 4 KiB), turning what would be up to a thousand fetches into one, so subsequent faults on nearby pages are served locally.
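The chunk arithmetic can be sketched as follows. The helper names are made up for this example; the one detail worth getting right is that the end of an HTTP Range is inclusive, and that the last chunk of a file may be shorter than 4 MiB.

```python
PAGE_SIZE = 4 * 1024
CHUNK_SIZE = 4 * 1024 * 1024
PAGES_PER_CHUNK = CHUNK_SIZE // PAGE_SIZE   # 1024


def chunk_index(file_offset: int) -> int:
    """Which 4 MiB chunk of vm.mem contains this byte offset."""
    return file_offset // CHUNK_SIZE


def chunk_range_header(index: int, file_size: int) -> str:
    """HTTP Range header value for one chunk; the last chunk may be short."""
    start = index * CHUNK_SIZE
    if start >= file_size:
        raise ValueError("chunk index past end of file")
    end = min(start + CHUNK_SIZE, file_size) - 1   # Range end is inclusive
    return f"bytes={start}-{end}"
```

For a 2 GiB image, chunk 1 maps to the header value bytes=4194304-8388607.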

Zero-page elision

A surprising fraction of a guest's RAM is zero — freshly zeroed pages the OS hasn't used. There's no reason to ship zeros over the network. At bake time we record a header marking which chunks are non-zero. On restore, a fault that lands in an all-zero region is served with UFFDIO_ZEROPAGE — no fetch at all — and only chunks with real content are pulled from storage.
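One way to represent "which chunks are non-zero" is a bitmap with one bit per chunk. The sketch below assumes that layout purely for illustration; the actual header format is not described here, and the function names are hypothetical.

```python
CHUNK_SIZE = 4 * 1024 * 1024


def build_nonzero_bitmap(image: bytes, chunk_size: int = CHUNK_SIZE) -> bytearray:
    """Bake-time pass: set bit i when chunk i holds any non-zero byte."""
    n_chunks = (len(image) + chunk_size - 1) // chunk_size
    bitmap = bytearray((n_chunks + 7) // 8)
    for i in range(n_chunks):
        chunk = image[i * chunk_size:(i + 1) * chunk_size]
        if any(chunk):
            bitmap[i // 8] |= 1 << (i % 8)
    return bitmap


def resolve_action(file_offset: int, bitmap: bytearray,
                   chunk_size: int = CHUNK_SIZE) -> str:
    """Restore-time decision: 'zeropage' needs no fetch, 'fetch' needs the chunk."""
    i = file_offset // chunk_size
    nonzero = bitmap[i // 8] & (1 << (i % 8))
    return "fetch" if nonzero else "zeropage"
```

At one bit per 4 MiB chunk, the bitmap for a 2 GiB image is 64 bytes, so consulting it on every fault costs essentially nothing.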

Prefetch trace

A given template touches roughly the same hot set of pages every restore, because the path from resume to ready is largely deterministic. So at bake time we record that hot chunk set as a prefetch trace and replay it in the background as soon as restore begins. The streamer races ahead of the guest, pulling the chunks it is about to need, so that by the time a vCPU faults the chunk is frequently already in the local cache: a hit instead of a network round trip.
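A rough sketch of record and replay, assuming the trace is simply the list of chunk indexes in first-touch order and that `fetch` is some callable that returns a chunk's bytes. Both are assumptions, and the in-memory dict stands in for the real chunk cache.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def record_trace(faulted_chunks):
    """Bake-time: keep each chunk once, in first-touch order."""
    return list(dict.fromkeys(faulted_chunks))


class Prefetcher:
    def __init__(self, fetch, workers=4):
        self.fetch = fetch          # hypothetical callable: chunk index -> bytes
        self.cache = {}             # chunk index -> bytes (stand-in for the cache)
        self.lock = threading.Lock()
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def _ensure(self, index):
        with self.lock:
            if index in self.cache:
                return self.cache[index]
        data = self.fetch(index)    # fetch outside the lock so workers overlap
        with self.lock:
            return self.cache.setdefault(index, data)

    def replay(self, trace):
        """Start background fetches for the trace, in recorded order."""
        return [self.pool.submit(self._ensure, i) for i in trace]

    def on_fault(self, index):
        """Fault path: a hit if prefetch got there first, otherwise fetch now."""
        return self._ensure(index)
```

This simplified version can fetch the same chunk twice if a fault races an in-flight prefetch. A production streamer would more likely track in-flight requests and have the fault wait on the existing one.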

Shared per-host chunk cache

Chunks fetched from object storage land in a persistent, per-host cache keyed to the snapshot generation. The first restore of a template on a host pays the object-storage latency for its working set once; every later restore of the same template on that host serves those chunks from local disk. Across a fleet handling many creates of the same template, the network cost amortizes toward zero. The cache is crash-safe — a chunk is only marked present after its data is durably synced — and LRU-evicted under a size budget. A re-bake produces a new generation, so stale chunks self-invalidate rather than serving wrong memory.
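The three cache properties (generation keying, durable-before-visible writes, LRU eviction under a budget) can be sketched in a few dozen lines. This is an illustrative model, not PandaStack's implementation: the on-disk layout, class name, and in-memory index are all assumptions.

```python
import os
from collections import OrderedDict


class ChunkCache:
    """Sketch of a per-host chunk cache: generation-keyed, fsynced, LRU-evicted."""

    def __init__(self, root, budget_bytes):
        self.root = root
        self.budget = budget_bytes
        self.used = 0
        self.lru = OrderedDict()   # (generation, index) -> size, oldest first

    def _path(self, generation, index):
        return os.path.join(self.root, generation, f"{index:08d}.chunk")

    def put(self, generation, index, data):
        path = self._path(generation, index)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # data is durable before the chunk becomes visible
        os.replace(tmp, path)      # atomic rename marks the chunk present
        key = (generation, index)
        self.used += len(data) - self.lru.pop(key, 0)
        self.lru[key] = len(data)
        self._evict()

    def get(self, generation, index):
        key = (generation, index)
        if key not in self.lru:
            return None            # miss, or a chunk from another generation
        self.lru.move_to_end(key)  # mark as most recently used
        with open(self._path(*key), "rb") as f:
            return f.read()

    def _evict(self):
        # Drop least recently used chunks, but never the one just inserted.
        while self.used > self.budget and len(self.lru) > 1:
            (generation, index), size = self.lru.popitem(last=False)
            os.remove(self._path(generation, index))
            self.used -= size
```

Because the generation is part of the key and the path, a re-baked snapshot can never be served a chunk from the old image. The sketch leaves out two things a real cache needs: fsyncing the parent directory after the rename, and rebuilding the index from disk after a restart.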

There's also an optional 2 MiB hugepage path: backing guest RAM with hugepages means one fault covers 2 MiB instead of 4 KiB, cutting fault count by up to 512x on restore. Whether guest memory is hugepage-backed is a property of the snapshot, and Firecracker restores hugepage-backed snapshots only through the userfaultfd backend, which is one more reason the UFFD path is central rather than a side feature.
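The "up to 512x" figure is the ratio of the two page sizes, and it is only reached when the touched memory is dense. A short calculation, assuming the touched bytes form one contiguous run (function names are illustrative):

```python
BASE_PAGE = 4 * 1024
HUGE_PAGE = 2 * 1024 * 1024


def max_faults(touched_bytes: int, page_size: int) -> int:
    """Upper bound on faults: one per page needed to cover the touched bytes."""
    return -(-touched_bytes // page_size)   # ceiling division


def fault_reduction(touched_bytes: int) -> float:
    """How many times fewer faults hugepages need for the same contiguous run."""
    return max_faults(touched_bytes, BASE_PAGE) / max_faults(touched_bytes, HUGE_PAGE)
```

Touching a full 2 GiB takes 524,288 faults with 4 KiB pages and 1,024 with 2 MiB pages, a 512x reduction. Touching a single 4 KiB page takes one fault either way, so sparse access patterns see much less benefit.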

Streaming is a real trade, not free magic. The first restore of a template on a cold host still depends on object-storage latency for its working set, and a fault that misses cache is a network round trip rather than a memory read. Streaming wins clearly for fleets that restore the same templates repeatedly; for a single restore of a one-off image on a cold host, a plain local restore can be simpler. Measure your own working sets and cache hit rates before assuming streaming will win.

Why userfaultfd is the right primitive here

You could imagine other ways to avoid the full download — a custom FUSE filesystem behind the memory file, a network block device, an eager background copy you race against. They all work to a degree, but userfaultfd fits the shape of the problem exactly: it operates at the page-fault level, where demand paging already happens; it requires no extra kernel module; and it gives the handler the precise information it needs — which address faulted, right now — to fetch exactly that and nothing more. The kernel does the hard part (trapping the fault, parking the thread, atomically installing the page); your handler only supplies the bytes.

For PandaStack the payoff is concrete: an agent can restore a template it has never held locally without first downloading a multi-gigabyte memory image, because the guest pulls only the pages it actually touches, prefetch hides most of the latency, and the per-host cache makes every repeat restore local-disk fast. It's the same copy-on-write, lazy-by-default philosophy as reflinked rootfs and MAP_PRIVATE memory — applied across the network. The full mechanics, headers, and cache design are in the streaming-restore internals docs. PandaStack is open source under Apache-2.0, so you can read the userfaultfd handler and run the whole streaming path on your own Linux KVM hosts.

Frequently asked questions

What is userfaultfd lazy memory loading?

userfaultfd is a Linux kernel feature that delivers page-fault events to user space. When a program touches a memory page that isn't present, instead of the kernel resolving the fault, it hands the fault to a handler you wrote and parks the faulting thread. Your handler fetches or computes the page's contents and installs it with the UFFDIO_COPY ioctl, then the thread resumes. This makes memory loading lazy — pages are populated on demand, the moment the program reaches for them, and untouched pages are never loaded at all. It underpins live migration, CRIU restore, and lazy VM snapshot restore.

How does PandaStack use userfaultfd to restore a microVM?

A Firecracker snapshot includes vm.mem, the guest's entire physical RAM. Rather than downloading that whole multi-gigabyte file before the guest can run, PandaStack starts Firecracker in UFFD mode and hands it a userfaultfd. When the guest faults on a page, the kernel posts the fault to PandaStack's handler, which translates the address to an offset in vm.mem, fetches the surrounding 4 MiB chunk from object storage via an HTTP Range GET, and installs it with UFFDIO_COPY. The guest resumes having paid only for the pages it actually touched.

Does userfaultfd stream the disk too, or only memory?

Only memory. userfaultfd operates on a process's address space, so it streams the guest's RAM image (vm.mem) on demand. The rootfs disk is handled separately and must stay a local file, because copy-on-write disk cloning — XFS reflink or dm-snapshot — needs a local block device. Streaming removes the large vm.mem download specifically; agents still sync rootfs and seed artifacts to local disk.

What is the exact fault flow with userfaultfd?

The guest touches a page that hasn't been populated. The CPU raises a page fault and traps into the kernel. The kernel sees the address is registered with userfaultfd, so instead of resolving it the kernel posts a fault event on the descriptor and parks the faulting thread. The user-space handler reads the event, fetches the page's contents (for PandaStack, a 4 MiB chunk from object storage over an HTTP Range GET), and installs it with UFFDIO_COPY, which atomically places the bytes and wakes the parked thread. The thread then resumes exactly where it faulted.

Written by Ajay Kumar, Founder, PandaStack.