Copy-on-Write Rootfs: Why MicroVM Create Is O(metadata)
Every PandaStack sandbox gets its own writable disk: a rootfs the guest can scribble on without touching the template it came from. The naive way to give each sandbox that disk is to copy the template's rootfs.ext4 file. But that image is gigabytes. Copying gigabytes off disk for every create would put seconds of I/O on the critical path and defeat the whole point of a 179ms create. So PandaStack doesn't copy the data. It makes a copy-on-write clone with an XFS reflink — a clone that shares the underlying data blocks with the template until something writes to them. The clone is an O(metadata) operation: it writes a little filesystem bookkeeping and returns, in a few milliseconds, regardless of how big the image is. This post is about how that works, why ext4 can't do it, and how the same idea — applied to memory as well as disk — is what makes a fork land in ~400ms.
The problem: every sandbox needs its own writable disk
A Firecracker microVM boots off a block device: for PandaStack, an ext4 rootfs image holding Ubuntu 24.04 plus whatever the template baked in. That image is shared and read-only by intent: the template is the same for every sandbox spawned from it. But a running guest writes constantly (temp files, package installs, logs, the user's actual work). Two sandboxes from the same template must not see each other's writes, and neither may corrupt the pristine template.
The requirement, then, is: give each sandbox a private, writable view of a multi-gigabyte image, instantly, without duplicating the image. That is precisely the problem copy-on-write was invented for. Instead of copying the data eagerly, you hand out a clone that shares the original's blocks and defers the copy to the moment of first write — and only for the blocks that actually get written.
What a reflink actually does
On a filesystem that supports it, a reflink (the cp --reflink operation, backed by the FICLONE ioctl) creates a new file whose contents are identical to the source, but without copying any data. Both files point at the same underlying extents — the filesystem's runs of data blocks. The filesystem records, in metadata, that those extents now have two owners.
Concretely, here is the difference between a real copy and a reflink clone of a 2 GiB image:
# Full copy: reads and writes 2 GiB. Seconds of I/O, 2 GiB of new space.
$ time cp rootfs.ext4 clone.ext4
real 0m4.812s
# Reflink clone on XFS: writes a little metadata, shares the extents.
# Constant time, ~no new space until a write happens.
$ time cp --reflink=always rootfs.ext4 clone.ext4
real 0m0.003s
The reflink returns in milliseconds because it is doing O(metadata) work: bookkeeping proportional to the number of extents, not the number of bytes. The clone occupies almost no additional disk space at creation time, because there is no additional data; there is one physical copy of the blocks with two filenames pointing at it.
The "copy-on-write" part is what happens next. When the guest writes to a block that is still shared, the filesystem intercepts the write, allocates a fresh block, copies the original contents into it (or, if the write covers the whole block, just writes the new data), and re-points only the clone's metadata at the new block. The template's view is untouched. Crucially, this copy is per-block: a one-byte change touches one block, not the whole file. The two images diverge gradually, exactly as fast as the guest actually modifies things, and no faster.
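The clone step can be sketched in a few lines of Python. This is a minimal illustration, not PandaStack's actual create path: the helper name clone_or_copy is hypothetical, and the hard-coded ioctl number is the FICLONE value on x86-64 and arm64, used only when the running Python does not expose fcntl.FICLONE.

```python
import fcntl
import shutil

# FICLONE is _IOW(0x94, 9, int); 0x40049409 is its value on x86-64 and arm64.
# Newer Python versions expose fcntl.FICLONE, so prefer that when present.
FICLONE = getattr(fcntl, "FICLONE", 0x40049409)


def clone_or_copy(src: str, dst: str) -> str:
    """Clone src to dst with a reflink if possible, else do a full copy.

    Hypothetical helper for illustration. Returns "reflink" or "copy"
    so the caller can see which path actually ran.
    """
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            # Ask the filesystem to make dst share src's extents.
            # No file data is read or written on this path.
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return "reflink"
        except OSError:
            # Usually EOPNOTSUPP, EINVAL or EXDEV: the filesystem has no
            # reflink support, or src and dst are on different filesystems.
            pass
    # Fallback: an ordinary byte-for-byte copy, O(bytes) instead of O(metadata).
    shutil.copyfile(src, dst)
    return "copy"
```

On an XFS or Btrfs directory the call should report "reflink"; on ext4 or tmpfs it should report "copy". Either way the destination has the same contents, and a later write to it leaves the source untouched.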
Why XFS, and why not ext4
Copy-on-write at the file level is a filesystem capability, not something the application can fake efficiently. The filesystem has to track block ownership with reference counts so it knows when a shared block needs to be copied on write and when it can be freed. Not every Linux filesystem implements that.
- XFS supports reflinks via shared/reference-counted extents (the reflink feature, default on modern mkfs.xfs). This is what PandaStack uses for the rootfs host filesystem — cp --reflink and the FICLONE ioctl both work.
- Btrfs has the same capability — it is copy-on-write by design and supports reflink clones natively. Different implementation, same O(metadata) clone semantics. If you self-host on Btrfs instead of XFS, the rootfs clone path works the same way.
- ext4 does not support reflinks. It has no reference-counted shared extents, so there is no way to clone a file without copying its data. On ext4, cp --reflink=always fails outright, and cp --reflink=auto silently falls back to a full byte copy, which is exactly the multi-second I/O we are trying to avoid on the create path.
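To make the reference-counting requirement concrete, here is a toy model of shared blocks. It is a sketch only: BlockStore and File are hypothetical names, and real filesystems track refcounts per extent in on-disk structures rather than per block in a dictionary. It does show the two decisions the refcount enables: copy on write when a block is shared, and free a block when its last owner goes away.

```python
from dataclasses import dataclass, field


@dataclass
class BlockStore:
    """Toy physical storage: block id -> payload, plus a refcount per block."""
    data: dict = field(default_factory=dict)
    refs: dict = field(default_factory=dict)
    next_id: int = 0

    def alloc(self, payload: bytes) -> int:
        bid = self.next_id
        self.next_id += 1
        self.data[bid] = payload
        self.refs[bid] = 1
        return bid

    def unref(self, bid: int) -> None:
        self.refs[bid] -= 1
        if self.refs[bid] == 0:
            # Last owner is gone, so the block can be freed.
            del self.refs[bid], self.data[bid]


class File:
    """Toy file: a list mapping logical block index -> physical block id."""

    def __init__(self, store: BlockStore, blocks: list):
        self.store = store
        self.blocks = list(blocks)

    @classmethod
    def create(cls, store: BlockStore, payloads: list) -> "File":
        return cls(store, [store.alloc(p) for p in payloads])

    def reflink(self) -> "File":
        # O(metadata): bump refcounts and copy the mapping, never the payloads.
        for bid in self.blocks:
            self.store.refs[bid] += 1
        return File(self.store, self.blocks)

    def read(self, i: int) -> bytes:
        return self.store.data[self.blocks[i]]

    def write(self, i: int, payload: bytes) -> None:
        old = self.blocks[i]
        if self.store.refs[old] > 1:
            # Shared block: give this file a private block, drop our old ref.
            self.blocks[i] = self.store.alloc(payload)
            self.store.unref(old)
        else:
            # Sole owner: overwrite in place, nothing to copy.
            self.store.data[old] = payload

    def delete(self) -> None:
        for bid in self.blocks:
            self.store.unref(bid)
        self.blocks = []
```

In this model a clone of a four-block template adds no physical blocks, one write to the clone adds exactly one, and deleting the template afterwards frees only the block the clone no longer shares.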
This is a common source of confusion: the rootfs image itself is formatted ext4 (it is the guest's filesystem, and ext4 is a fine, boring choice for what the guest sees). But the host filesystem that the rootfs.ext4 file lives on must be reflink-capable — XFS or Btrfs — for the clone to be cheap. The CoW happens at the host layer, beneath the guest's own filesystem. The guest never knows its disk is a shared clone.
dm-snapshot: the block-layer alternative
Reflink is a filesystem-level mechanism, but it is not the only way to get a copy-on-write rootfs. PandaStack also supports dm-snapshot, the device-mapper copy-on-write target, which does the same thing one layer lower — at the block device rather than the file.
A dm-snapshot device presents a writable view of a read-only base (the "origin") plus a separate "COW" store for changed blocks. Reads of unchanged blocks pass through to the shared origin; the first write to a block copies it into the COW store and redirects future access there. Same copy-on-write contract, same O(metadata)-ish create cost, but it does not require the host filesystem to support reflinks — it works on top of a plain block device. It is the fallback path when reflink isn't available, and it is selectable independently of the filesystem choice. The semantics the guest sees are identical: a private, writable disk that shares its unwritten blocks with the template.
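For orientation, a dm-snapshot device is described to the kernel by a single table line of the form "start length snapshot origin cow persistent chunksize", with sizes in 512-byte sectors. The sketch below only builds that line. The function name, the device paths in the test, the 8-sector (4 KiB) default chunk size and the non-persistent default are illustrative assumptions, not PandaStack's configuration.

```python
SECTOR_BYTES = 512  # device-mapper tables are expressed in 512-byte sectors


def snapshot_table(origin_dev: str, cow_dev: str, origin_bytes: int,
                   chunk_sectors: int = 8, persistent: bool = False) -> str:
    """Build a device-mapper table line for a 'snapshot' target.

    Hypothetical helper. origin_dev is the read-only base, cow_dev is the
    store that receives changed chunks. Defaults are assumptions.
    """
    if origin_bytes <= 0 or origin_bytes % SECTOR_BYTES:
        raise ValueError("origin size must be a positive multiple of 512")
    if chunk_sectors <= 0 or chunk_sectors & (chunk_sectors - 1):
        raise ValueError("chunk size must be a power of two (in sectors)")
    length = origin_bytes // SECTOR_BYTES
    # "P" keeps the exception table across reboots; "N" is a transient snapshot.
    flag = "P" if persistent else "N"
    return f"0 {length} snapshot {origin_dev} {cow_dev} {flag} {chunk_sectors}"
```

A line like this is what dmsetup create takes through its --table option. Note that only device names and sizes appear in it: setting up the snapshot moves no rootfs data.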
Either way, the property we care about holds: create does not move rootfs data. It clones a reference and lets divergence happen lazily, block by block, only as the sandbox writes.
Where the rootfs clone sits in a 179ms create
The rootfs clone is one stage of the create pipeline, and a small one. PandaStack restores a baked Firecracker snapshot on every create — there is no warm pool of idle VMs — and the median create lands at 179ms (p99 ~203ms). Within that budget, the rootfs reflink is roughly 4ms. It is small precisely because it is O(metadata): it does not grow with image size, so a 2 GiB rootfs and a 20 GiB rootfs clone in about the same time.
The dominant stages of a create are elsewhere — memory-mapping the snapshot's memory file (~80ms) and probing that the guest is reachable (~40ms). The rootfs clone is engineered to disappear into the noise, which is the whole point: a writable disk should be a few milliseconds of bookkeeping, not a copy you wait on. The full step-by-step breakdown is in /blog/how-firecracker-boots-fast, and the engineering reference is at /docs/internals/snapshot-restore.
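A quick back-of-the-envelope calculation puts those numbers side by side. It uses only figures quoted in this post (the stage timings are approximate, and the 4.8 s full copy is the 2 GiB example from earlier); the "other" bucket is simply whatever the three named stages leave over, not a measured stage.

```python
CREATE_P50_MS = 179   # median create, from this post
FULL_COPY_MS = 4812   # the 2 GiB `cp` example earlier in this post

# Approximate stage timings quoted in this post.
stages_ms = {
    "mmap snapshot memory": 80,
    "guest reachability probe": 40,
    "rootfs reflink": 4,
}


def share(ms: float, total: float = CREATE_P50_MS) -> float:
    """Percentage of the median create taken by a stage, to one decimal."""
    return round(100 * ms / total, 1)


# Time not itemized in this post (derived, not measured).
other_ms = CREATE_P50_MS - sum(stages_ms.values())

# How many median creates fit inside one full copy of the image.
copy_vs_create = round(FULL_COPY_MS / CREATE_P50_MS, 1)

if __name__ == "__main__":
    for name, ms in stages_ms.items():
        print(f"{name:26s} {ms:4d} ms  {share(ms):5.1f}%")
    print(f"{'other (not itemized)':26s} {other_ms:4d} ms  {share(other_ms):5.1f}%")
    print(f"full copy is about {copy_vs_create}x the whole create")
```

On these figures the reflink is about 2% of the create, while a full copy of the same image would cost roughly 27 times the entire create budget.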
The same idea, applied to memory: how fork hits ~400ms
Copy-on-write on the rootfs is half the story. The other half is memory, and it is the same trick applied to RAM instead of disk — which is what makes a fork so cheap.
When PandaStack restores a snapshot, it maps the guest's memory file with MAP_PRIVATE. That mapping is copy-on-write at the page level: the guest's RAM is backed by the shared snapshot file, nothing is eagerly copied, and the kernel faults pages in only as the guest touches them. A read fault maps the page straight from the shared file; a write fault copies just that one 4 KiB page private to this guest, leaving the original intact for everyone else. It is the reflink idea, but for memory pages instead of disk blocks.
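The page-level behaviour is easy to observe with an ordinary file. The sketch below uses a small scratch file as a stand-in for a snapshot memory file and maps it twice with MAP_PRIVATE, as if two guests were restored from the same snapshot. The names private_view and demo are hypothetical, and this is plain mmap, not Firecracker's restore code.

```python
import mmap
import os
import tempfile


def private_view(path: str) -> mmap.mmap:
    """Map a file copy-on-write. Writes stay private to this mapping."""
    # A read-only descriptor is enough: MAP_PRIVATE never writes to the file.
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, flags=mmap.MAP_PRIVATE,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mem.snap")  # stand-in for a memory file
        with open(path, "wb") as f:
            f.write(b"A" * mmap.PAGESIZE * 4)

        guest_a = private_view(path)
        guest_b = private_view(path)
        try:
            # The write fault copies this one page private to guest_a.
            guest_a[0:5] = b"hello"
            seen_by_a = guest_a[0:5]
            seen_by_b = guest_b[0:5]  # still backed by the shared file
        finally:
            guest_a.close()
            guest_b.close()

        with open(path, "rb") as f:
            on_disk = f.read(5)  # the backing file is never modified
        return seen_by_a, seen_by_b, on_disk


if __name__ == "__main__":
    print(demo())
```

Only the first mapping sees its own write; the second mapping and the file on disk still hold the original bytes, which is the same "private until written" contract the reflink gives the rootfs.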
A fork is therefore two copy-on-write clones happening together: a MAP_PRIVATE map of the parent's memory and a reflink (or dm-snapshot) clone of the parent's rootfs. Neither moves real data up front. That is why a same-host fork runs in roughly 400ms — the parent's memory is already resident on the machine and the rootfs reflinks locally, so the fork is a memory map, a metadata clone, and a resume. A cross-host fork is slower (1.2–3.5s) because reflink only works within a single filesystem and the parent's resident RAM isn't on the destination, so the artifacts have to move over the network first.
A reflink clones the disk in metadata; MAP_PRIVATE clones the RAM in metadata. A fork is both at once — which is why branching a live machine costs about as much as a create, not a reboot.
The consequence for the things built on top — forking an agent mid-task into parallel branches, scale-to-zero app hosting, snapshot-and-resume — is that they all inherit "cheap until you diverge" from the underlying copy-on-write. A child fork starts out sharing essentially all of its disk and memory with the parent and only pays for what it changes. Two branches that diverge slowly stay cheap for a long time. The full fork mechanics are in /blog/snapshot-and-fork-explained and /docs/internals/fork-cow.
The honest caveats
Copy-on-write is not free in every dimension, and it is worth being precise about where the trade-offs are:
- Divergence has a cost. The clone is cheap at creation, but a write-heavy sandbox that rewrites most of its rootfs eventually copies most of those blocks. The O(metadata) win is on the create, not the lifetime — a long-lived, write-heavy VM ends up roughly as large as a full copy. The savings are largest for short-lived sandboxes and slowly-diverging forks.
- It is a host-layer optimization, not isolation. The CoW clone gives each sandbox a private disk; the hardware-virtualization boundary (KVM, a separate guest kernel per microVM) is what isolates one sandbox from another. Copy-on-write makes create fast; it is not the security model. Each sandbox is a full Firecracker microVM with its own kernel for that reason.
- The filesystem must actually support it. As above — XFS or Btrfs for reflink, or dm-snapshot at the block layer. On plain ext4 with no dm-snapshot, the system still works but the create falls back to a real copy.
- Streaming is for memory, not the rootfs. PandaStack can page a snapshot's memory file in on demand from object storage, but the rootfs always has to be a local file, because copy-on-write cloning needs its source on the host: a reflink only works between files on the same filesystem, and dm-snapshot needs a local origin block device. Memory streams; the rootfs is synced and reflinked locally. The streaming path is covered in /docs/internals/streaming-restore.
The mental model that holds up: a rootfs clone is a reference, not a copy. The data is shared with the template until the sandbox writes, at which point only the written blocks diverge. That is what turns "copy a multi-gigabyte disk" into a few milliseconds of metadata, and it is the same copy-on-write principle — applied to disk blocks via reflink and to memory pages via MAP_PRIVATE — that lets every create restore a baked snapshot in 179ms and every same-host fork branch a live machine in ~400ms.
PandaStack's core is open source under Apache-2.0, so you can run the control-plane API and per-host agent on your own Linux KVM hosts and watch the reflink-vs-copy difference yourself — point the rootfs directory at XFS and time a create, then point it at ext4 and time the fallback. For the conceptual grounding, /blog/what-is-a-microvm; for the broader copy-on-write story, /blog/snapshot-and-fork-explained and /docs/concepts/snapshots-and-forks.
Frequently asked questions
Why is cloning a multi-GB rootfs an O(metadata) operation?
Because a copy-on-write clone doesn't copy the data. On a reflink-capable filesystem like XFS, cp --reflink creates a new file that points at the same underlying data extents as the source and records, in metadata, that those extents now have two owners. The work is proportional to the number of extents, not the number of bytes, so a 2 GiB image and a 20 GiB image clone in about the same few milliseconds. Data is only copied later, one block at a time, when the clone is written to.
Can ext4 do a copy-on-write rootfs clone like XFS?
No. ext4 has no reference-counted shared extents, so it can't reflink: on ext4, cp --reflink=always fails, and cp --reflink=auto falls back to a full byte-for-byte copy. XFS and Btrfs both support reflinks and give you the O(metadata) clone. Note that the rootfs image can be formatted ext4 (that's the guest's filesystem); what matters is that the host filesystem the image file lives on is XFS or Btrfs. PandaStack also supports dm-snapshot, which provides copy-on-write at the block-device layer and works even when the host filesystem isn't reflink-capable.
What's the difference between reflink and dm-snapshot for the rootfs?
Both give a copy-on-write rootfs; they operate at different layers. Reflink is a filesystem feature (XFS/Btrfs) that clones a file by sharing its data extents until a write copies a block. dm-snapshot is a device-mapper target that does the same one layer lower — it presents a writable view of a read-only origin block device plus a separate store for changed blocks. dm-snapshot doesn't require a reflink-capable filesystem, so it's the fallback path. The guest sees an identical result either way: a private, writable disk that shares unwritten blocks with the template.
How does copy-on-write make a fork land in ~400ms?
A fork is two copy-on-write clones at once: the parent's rootfs is reflinked (O(metadata), shares blocks until a write) and the parent's memory is mapped MAP_PRIVATE (copy-on-write at the 4 KiB page level, pages fault in lazily). Neither moves real data up front, so a same-host fork is just a metadata clone, a memory map, and a resume — roughly 400ms. Cross-host is slower (1.2–3.5s) because reflink only works within one filesystem and the parent's resident RAM isn't on the destination host, so the artifacts must move over the network first.