Why Docker Isn't a Sandbox

Ajay Kumar · 8 min read

Docker is one of the best tools we have for packaging software and isolating cooperating workloads from each other. It is not a sandbox for hostile code. Those two statements aren't in tension — they describe different threat models. The first assumes the code in the container is on your side and you're isolating it from accidents and noisy neighbors. The second assumes the code is actively trying to break out. Docker is built for the first and only partially defends against the second, and the reason is structural, not a matter of configuration.

This post is the escape-mechanics deep dive: what actually isolates a container, why the shared kernel is the load-bearing weakness, the concrete categories of escape, and why hardening narrows the gap without closing it. If you want the layer-by-layer feature comparison instead, that lives in /blog/firecracker-vs-docker — this post deliberately doesn't repeat the table.

What actually isolates a container

A container is not a distinct kernel object. It's a regular Linux process that the kernel has been asked to treat specially, using a handful of independent kernel features stacked together:

  • Namespaces (pid, mount, net, uts, ipc, user, cgroup, time) — control what the process can see. A pid namespace hides other processes; a mount namespace gives a private filesystem view; a network namespace gives its own interfaces.
  • cgroups (v1/v2) — control what the process can consume: CPU, memory, I/O, the number of pids. This is the resource-limit half, not the visibility half.
  • Capabilities — slice the historically monolithic root into individual privileges (CAP_SYS_ADMIN, CAP_NET_ADMIN, and dozens more), so a container can be denied most of what root could do.
  • seccomp-bpf — filters which syscalls the process is even allowed to invoke, and with which argument values.
  • LSMs (AppArmor or SELinux) — mandatory-access-control profiles layered on top, distinct from capabilities and seccomp.

That stack is genuinely effective at isolating cooperating software, and each layer stops real attacks. But notice what every single one of those layers has in common: it is implemented and enforced by the host kernel. The kernel is simultaneously the thing running the container and the thing you're trying to protect from the container. There is no second referee.
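
Most of these layers are visible from inside the process, because the kernel reports them through /proc. The Python sketch below reads that state. It assumes a Linux /proc, the helper names are illustrative, and the capability table is deliberately abbreviated to a few well-known bits rather than the full list.

```python
"""Sketch: read kernel-reported isolation state for the current process."""
import os

# A few capability bit numbers from <linux/capability.h>; not the full table.
CAP_BITS = {
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
}
SECCOMP_MODES = {"0": "disabled", "1": "strict", "2": "filter"}


def parse_status(text):
    """Turn /proc/<pid>/status text into a {field: value} dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def named_caps(cap_hex):
    """Return the names from CAP_BITS that are set in a hex mask such as CapEff."""
    mask = int(cap_hex, 16)
    return [name for bit, name in sorted(CAP_BITS.items()) if (mask >> bit) & 1]


def summarize(status_text):
    """Summarize capabilities, seccomp mode, and no_new_privs from status text."""
    fields = parse_status(status_text)
    return {
        "effective_caps": named_caps(fields.get("CapEff", "0")),
        "seccomp": SECCOMP_MODES.get(fields.get("Seccomp", ""), "unknown"),
        "no_new_privs": fields.get("NoNewPrivs") == "1",
    }


def namespace_ids(pid="self"):
    """Map each namespace entry to its identifier, e.g. 'pid' -> 'pid:[...]'."""
    ns_dir = f"/proc/{pid}/ns"
    return {
        name: os.readlink(os.path.join(ns_dir, name))
        for name in sorted(os.listdir(ns_dir))
    }


if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    with open("/proc/self/status") as fh:
        print(summarize(fh.read()))
    print(namespace_ids())
```

Running it on the host and again inside a container should show different namespace identifiers and, typically, a smaller capability mask and seccomp in filter mode. Every value it prints comes from the same host kernel, which is the point of the paragraph above.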

The shared kernel is the structural weakness

Every container on a host shares one kernel. There is no separate guest kernel — that absence is precisely what makes containers light, and precisely what makes them a weak security boundary against hostile code. A process inside a container still calls into the same kernel as everything else on the box. The Linux syscall ABI it talks to is enormous: well over 300 syscalls on x86-64, plus the ioctl interfaces, the filesystem layer, the networking stack, and every device driver reachable from inside. That entire surface is the attack surface, and it is shared.

The single biggest misconception is that a container is a security boundary like a VM is. It isn't. Namespaces and cgroups control what a process sees and uses; they do not, by themselves, shrink the kernel attack surface. One reachable kernel bug — in any syscall the container is still allowed to call — and the container is no longer a boundary at all. The kernel that's enforcing the isolation is the same kernel the exploit is attacking.

This is the consensus of container-security literature, not a vendor talking point — NIST's IR 8176 on Linux container security says the same thing. A container is a strong isolation mechanism and a weak security boundary, because isolation and security-against-hostile-code are different properties, and the shared kernel gives you a lot of the first and not enough of the second.

The three categories of escape

Escapes fall into three buckets. They're worth separating because they call for different defenses, and because the dominant real-world cause is the least glamorous one.

1. Kernel privilege escalation reachable via syscalls

Because the kernel is shared, a local privilege-escalation bug that's reachable through the syscall interface is reachable from inside the container. Exploit it and you're running as root on the host, alongside every other tenant. The Linux kernel ships such bugs periodically — that's not a knock on Linux, it's the reality of a codebase that large. Whether a given kernel bug is reachable from a specific container depends on that container's seccomp, capability, and namespace configuration, so it's a category risk rather than a guarantee in either direction. The honest framing: a kernel LPE reachable via syscalls is a container escape, and the shared kernel is what makes it one.

2. Container-runtime bugs

The runtime itself (runc, containerd, the Docker daemon) runs with host privileges and is part of the attack surface. The canonical example is CVE-2019-5736 in runc (CVSS 8.6), disclosed in February 2019 and fixed in Docker 18.09.2. In NVD's wording, it "allows attackers to overwrite the host runc binary … because of file-descriptor mishandling, related to /proc/self/exe." The mechanism is worth stating precisely: a malicious container that can run a process as root inside itself obtains a file descriptor to the host's runc binary via /proc/self/exe while runc is executing in the container, then uses it to overwrite the on-disk runc binary. It does not patch live process memory. The next time runc is invoked, the attacker's code runs as root on the host.

It's been patched for years; treat it as a historical illustration of the class, not a live threat. The point is that the runtime is host-privileged code sitting directly between hostile guests and the host, so a single memory-safety or logic bug there is a host compromise — independent of how well-configured your namespaces are.
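
For readers unfamiliar with /proc/self/exe, the sketch below shows only the benign primitive the runc bug was built on: the kernel exposes a link to the binary of whatever process is currently running. It is not the exploit, the function name is illustrative, and it assumes a Linux /proc (it returns None elsewhere).

```python
"""Sketch: /proc/self/exe is a kernel-provided link to the running executable."""
import os

PROC_EXE = "/proc/self/exe"


def running_executable():
    """Return (resolved path, first four bytes) of this process's own binary.

    Returns None where /proc/self/exe is unavailable, e.g. on non-Linux hosts.
    """
    if not os.path.exists(PROC_EXE):
        return None
    path = os.readlink(PROC_EXE)
    # Opened read-only here. In CVE-2019-5736 the problem was that code in the
    # container could end up with a usable handle to the host's runc binary
    # through this kind of link, because runc itself was the running process.
    with open(PROC_EXE, "rb") as fh:
        magic = fh.read(4)
    return path, magic


if __name__ == "__main__":
    print(running_executable())
```

Run under a normal Python interpreter on Linux, the path is the interpreter binary itself. When the running process is a host-privileged runtime rather than your own interpreter, the same link points at a file the host depends on.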

3. Dangerous misconfigurations

In practice this is where most real escapes come from, and none of them are bugs — they're by-design behaviors used carelessly:

  • --privileged — disables most of the isolation at once: full capabilities, unmasked /proc and /sys, broad device access. It's close to running the workload as host root.
  • Mounting /var/run/docker.sock into the container hands it control of the Docker daemon, which runs as root. From there, starting a new privileged container that bind-mounts the host filesystem yields host root with no exploit required.
  • Host bind mounts — mounting host paths (worst case, /) straight into the container hands over whatever is mounted.
  • CAP_SYS_ADMIN — so broad it's commonly called "the new root." Granting it reopens a wide range of escape paths that the default capability set deliberately closes.

These are configuration mistakes, not Docker vulnerabilities — but for a multi-tenant or untrusted-code setting, the distinction is cold comfort. The boundary is only as strong as the single most permissive flag anyone set, and the failure mode is silent.
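
Because these failures are silent, it can help to check for them mechanically. The sketch below flags the four items above in the JSON that `docker inspect` prints. The function names and the wording of the findings are illustrative, and the check is deliberately minimal: it treats every host bind mount as worth a look and makes no attempt to judge which paths are safe.

```python
"""Sketch: flag the misconfigurations above in `docker inspect` output."""
import json

DOCKER_SOCKETS = {"/var/run/docker.sock", "/run/docker.sock"}
RISKY_CAPS = {"SYS_ADMIN", "ALL"}


def audit_container(info):
    """Return a list of findings for one container object from `docker inspect`."""
    findings = []
    host_config = info.get("HostConfig") or {}

    if host_config.get("Privileged"):
        findings.append("--privileged is set")

    # CapAdd is null when nothing was added; names may or may not carry CAP_.
    for cap in host_config.get("CapAdd") or []:
        name = cap.upper()
        if name.startswith("CAP_"):
            name = name[4:]
        if name in RISKY_CAPS:
            findings.append(f"capability added: {name}")

    for mount in info.get("Mounts") or []:
        if mount.get("Type") != "bind":
            continue
        source = mount.get("Source", "")
        if source in DOCKER_SOCKETS:
            findings.append(f"Docker socket mounted from {source}")
        elif source == "/":
            findings.append("host root filesystem bind-mounted")
        else:
            findings.append(f"host path bind-mounted: {source}")
    return findings


def audit_inspect_json(text):
    """Audit every container in the JSON array printed by `docker inspect`."""
    return {c.get("Name", "?"): audit_container(c) for c in json.loads(text)}
```

One way to use it is to pipe the output of `docker inspect` for the running containers into `audit_inspect_json` and fail a deployment check when any container returns findings.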

Why hardening helps but doesn't close it

The standard advice is correct and worth following: keep the default seccomp profile (it blocks several dozen of the most dangerous syscalls — kexec_load, reboot, restricted ptrace, mount, keyctl, and others), drop every capability you don't need, run rootless, and enable user namespaces so container-root maps to an unprivileged host uid. All of this is real defense-in-depth and stops real attacks.
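
As a rough illustration of the per-container part of that advice, the sketch below assembles a `docker run` command line that starts from zero capabilities. The function name and the default values (uid 65534, a 128-pid limit, no network) are assumptions chosen for the example, not recommendations. Rootless mode and user-namespace remapping are configured on the daemon, so they are out of scope here.

```python
"""Sketch: assemble a `docker run` command line with hardening flags."""
import shlex


def hardened_run_argv(image, command, keep_caps=(), uid=65534, pids_limit=128):
    """Build argv for a container that starts from zero capabilities.

    The default seccomp profile stays on simply because no seccomp
    override is passed.
    """
    argv = [
        "docker", "run", "--rm",
        "--cap-drop=ALL",                    # drop everything, add back by name
        "--security-opt=no-new-privileges",  # block setuid-style privilege gain
        "--read-only",                       # read-only root filesystem
        f"--pids-limit={pids_limit}",        # cgroup limit on process count
        f"--user={uid}:{uid}",               # do not run as container root
        "--network=none",                    # no network unless the job needs it
    ]
    argv += [f"--cap-add={cap}" for cap in keep_caps]
    return argv + [image, *command]


if __name__ == "__main__":
    # The image name is only an example.
    print(shlex.join(hardened_run_argv("alpine:3.20", ["id"])))
```

Building the argument list in code rather than by hand makes it harder for a one-off `--privileged` to slip in, since there is a single place to review.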

But look at what hardening does and doesn't change. seccomp shrinks the reachable syscall surface, but the default profile still leaves 300-plus syscalls allowed, because a container has to be able to do useful work. Fewer reachable syscalls means fewer paths to a kernel bug, not zero paths, and a vulnerability in any still-allowed syscall remains exploitable. User namespaces mitigate a lot, but they also let unprivileged processes reach kernel code that was previously root-only, and that code has itself been a source of privilege-escalation bugs. Every one of these controls is enforced by the shared kernel, so every one of them shares the kernel's fate. Hardening moves you along a spectrum of reduced risk. It does not give you the categorically different boundary that not sharing the kernel gives you.
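
To make "smaller, not zero" concrete, the sketch below counts what a seccomp profile leaves reachable. It assumes the JSON shape Docker's profiles use (a `defaultAction` plus a `syscalls` list of rules with `names`, `action`, and optional `args`). The function name and the miniature profile are hypothetical, and the profile's `includes`/`excludes` conditions are ignored, so treat the result as an upper bound.

```python
"""Sketch: measure how much syscall surface a seccomp profile leaves open."""

ALLOW = "SCMP_ACT_ALLOW"


def allowed_syscalls(profile):
    """Split allowed syscalls into (unconditional, argument-constrained) sets."""
    unconditional, conditional = set(), set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") != ALLOW:
            continue
        # A rule with argument filters only allows some calls to that syscall.
        target = conditional if rule.get("args") else unconditional
        target.update(rule.get("names", []))
    # Anything allowed without an argument filter is simply allowed.
    return unconditional, conditional - unconditional


# Hypothetical miniature profile in the same shape as a Docker profile.
TOY_PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "openat", "close"], "action": ALLOW},
        {
            "names": ["personality"],
            "action": ALLOW,
            "args": [{"index": 0, "value": 0, "op": "SCMP_CMP_EQ"}],
        },
    ],
}

if __name__ == "__main__":
    always, sometimes = allowed_syscalls(TOY_PROFILE)
    print(len(always), "always allowed;", len(sometimes), "argument-constrained")
```

Loading a real profile with `json.load` and passing it to `allowed_syscalls` gives a direct count of the kernel entry points a container can still reach. Each one is a place where a kernel bug would be exploitable from inside.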

That's the whole argument in one line: you can make the shared-kernel surface smaller, but you cannot make it not-shared. To get a boundary that doesn't depend on the integrity of the host kernel, you have to stop sharing the host kernel.

What an actual sandbox looks like

There's a ladder of stronger boundaries, and we treat it in full in /blog/code-isolation-hierarchy. The two rungs above a plain container are worth naming here because they're the answer to "then what should I use for hostile code."

gVisor (runsc) narrows the gap. Its Sentry is a user-space reimplementation of the Linux syscall interface, written in Go, that intercepts the guest's syscalls and services them itself instead of passing them straight to the host kernel. The host kernel surface the sandbox can ultimately reach is dramatically smaller. It's not free: interposing on syscalls adds overhead, heaviest on syscall- and I/O-bound workloads, and the Sentry has its own attack surface. One nuance: even gVisor's KVM platform mode is not a hardware VM like Firecracker. It keeps a process model and borrows CPU virtualization extensions for address-space isolation; it does not become a VM. gVisor reduces the shared-kernel surface; it doesn't eliminate it.

A microVM removes the shared kernel entirely. Firecracker (and Cloud Hypervisor) give each guest its own kernel, isolated by hardware virtualization: Intel VT-x / AMD-V, exposed through Linux KVM. The host no longer presents the full Linux syscall ABI to the guest. What it exposes instead is the VMM plus the KVM ioctl interface plus a deliberately minimal virtio device model. Firecracker emulates essentially virtio-net, virtio-block, and virtio-vsock (with a trivial serial console and a one-button keyboard controller) and is written in Rust. It runs behind a jailer that drops privileges and confines the process with chroot, cgroups, and namespaces, and the Firecracker process applies a tight per-thread seccomp filter to itself: on the order of tens of allowed syscalls, several argument-constrained, versus the hundreds a container shares. (Kata Containers gets you the same microVM-class boundary with a container-like UX, by running each pod inside a lightweight VM on a VMM such as QEMU, Cloud Hypervisor, or Firecracker.)
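
The small device model shows up in how little a guest needs to be described. The sketch below builds a configuration in the shape of Firecracker's JSON config file: a kernel, one virtio-block root drive, and a machine size. The function name, paths, and sizes are placeholders, and this is not PandaStack's actual configuration.

```python
"""Sketch: the small amount of configuration a Firecracker guest needs."""
import json


def microvm_config(kernel_path, rootfs_path, vcpus=1, mem_mib=256):
    """Return a config dict in the shape of Firecracker's JSON config file.

    All paths and sizes are placeholders supplied by the caller.
    """
    return {
        "boot-source": {
            "kernel_image_path": kernel_path,  # the guest's own kernel
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        },
        "drives": [
            {
                "drive_id": "rootfs",
                "path_on_host": rootfs_path,   # exposed as a virtio-block device
                "is_root_device": True,
                "is_read_only": False,
            }
        ],
        "machine-config": {"vcpu_count": vcpus, "mem_size_mib": mem_mib},
    }


if __name__ == "__main__":
    cfg = microvm_config("/hypothetical/vmlinux", "/hypothetical/rootfs.ext4")
    print(json.dumps(cfg, indent=2))
```

Note what is absent: no capability list, no syscall allowlist for the guest, no namespace setup. The guest talks to its own kernel, and the host-facing surface is the handful of devices declared here plus KVM.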

Smaller and more-audited is the honest claim, not "unbreakable." KVM has had guest-to-host escape CVEs; Google's kvmCTF pays up to $250,000 for one, which tells you both that they're real and that the surface is small enough to be a focused target. The virtio device model is still attackable, Rust rules out most memory-safety bugs (outside unsafe code) but not logic bugs, and microarchitectural side channels cross the VM boundary in principle. A microVM is a meaningfully stronger boundary than a shared kernel, but it is not an absolute one.

The honest conclusion

Use Docker for what it's superb at: packaging, and isolating software that cooperates with you. The moment the code is untrusted (AI-generated commands, per-user playgrounds, arbitrary repos in CI, multi-tenant execution), you want a boundary that doesn't share the host kernel, because the shared kernel is the one thing hardening can't fix. That's the whole reason PandaStack runs every sandbox, database, and hosted app as its own Firecracker microVM, each with its own guest kernel (5.10, Ubuntu 24.04), isolated by KVM, with per-sandbox network namespaces for egress isolation. The historical objection to VM isolation was startup cost; PandaStack closes that gap by restoring a baked snapshot on every create: ~179ms p50, no shared kernel in the path. For the broader map of the problem, start at /blog/how-to-sandbox-untrusted-code; for the model itself, see /blog/what-is-a-microvm and /blog/firecracker-vs-docker.

Frequently asked questions

Is Docker a security sandbox?

Not against hostile code. Docker is excellent at packaging and at isolating cooperating workloads, but its isolation (namespaces, cgroups, capabilities, seccomp, LSMs) is all enforced by the host kernel that every container shares. A kernel bug reachable from inside a container can defeat that isolation and compromise the host. For untrusted code, use a boundary that doesn't share the kernel, such as a Firecracker microVM.

What is a container escape?

Breaking out of the container's isolation to gain access to the host or other containers. There are three categories: a kernel privilege-escalation bug reachable via syscalls (since the kernel is shared); a bug in the host-privileged runtime (runc/containerd/Docker daemon) — the canonical example is CVE-2019-5736 in runc, patched in 2019; and dangerous misconfigurations like --privileged, a mounted docker.sock, host bind mounts, or granting CAP_SYS_ADMIN. Misconfiguration is the dominant real-world cause.

Does seccomp or dropping capabilities make Docker safe for untrusted code?

It helps but doesn't close the gap. The default seccomp profile blocks several dozen dangerous syscalls and dropping capabilities removes privileges, which reduces the reachable attack surface and stops many attacks. But hundreds of syscalls remain allowed so the container can do useful work, every control is still enforced by the shared host kernel, and a bug in any allowed syscall is still exploitable. Hardening reduces risk; it does not give you the categorically different boundary of not sharing the kernel.

What should I use instead of Docker to run untrusted code?

Move up the isolation ladder. gVisor (runsc) interposes a user-space kernel and shrinks the host kernel surface, but it's not a hardware VM and retains some shared-kernel exposure. A microVM (Firecracker, Cloud Hypervisor) gives each guest its own kernel isolated by hardware virtualization via KVM, exposing only a minimal VMM plus virtio device model rather than the full Linux syscall ABI. Kata Containers offers the same microVM-class isolation with a container-like UX. PandaStack runs every workload as a Firecracker microVM and restores a snapshot per create in ~179ms so VM-grade isolation costs almost no startup latency.

Written by Ajay Kumar, Founder, PandaStack.