
An Open-Source OpenAI Code Interpreter Alternative

Ajay Kumar · 10 min read

If you've used OpenAI's Code Interpreter — the tool that runs model-generated Python and hands back tables and charts, available in the legacy Assistants API and now as the python tool in the Responses API — you already know its shape: it's hosted, it's Python-only, and it's a black box. The code runs in OpenAI's sandbox, you can't run that sandbox yourself, and the only things you control are the inputs and outputs. For a lot of products that's completely fine. For others — multi-tenant platforms, anything with a data-residency rule, teams that want code execution on their own infrastructure, or workloads that aren't Python — the hosted-only model is exactly the constraint they're trying to escape. This post is for that second group: an honest look at a self-hostable, Firecracker-isolated alternative you control end to end, and a fair read on when OpenAI's built-in is the better choice.

I'm the founder of PandaStack, so treat this as a vendor's pitch with the bias that implies. I keep it honest the only way that works: I cite specific numbers (latency, license, fork times) only for PandaStack, I describe OpenAI's tool in general terms rather than inventing internals, and I include a real section on when to just use OpenAI's built-in. Anything load-bearing to your decision — OpenAI's current limits, pricing, model behavior — verify against OpenAI's own docs, because capabilities in this space change monthly.

What OpenAI's Code Interpreter actually is

Two things get conflated here, and telling them apart is what keeps this comparison honest. The first is the classic Code Interpreter / python tool: it runs Python in an OpenAI-hosted, ephemeral sandbox container that expires after a short window of inactivity. It is Python-only, the environment is managed entirely by OpenAI, and network access is best treated as not available. The second, newer thing is OpenAI's provider-agnostic agent-sandbox concept: a richer Unix-like environment with a shell and resumable state, which explicitly supports pluggable third-party compute backends. Those are different layers. When people say 'OpenAI Code Interpreter,' they almost always mean the first: the hosted Python tool. That's the thing this post offers an alternative to.

The strengths of the hosted tool are real. It's zero-infrastructure — you make an API call and code runs, with nothing to operate. It's tightly integrated with OpenAI's models, so the loop from 'the model writes code' to 'the code runs and the result comes back' is about as short as it gets. And the sandboxing is OpenAI's problem, not yours. If you're building entirely on OpenAI and that fits, the rest of this post may not be for you — and I'll say so plainly later.

The constraints are equally real. Python-only rules out a code interpreter that needs to run Node, a shell pipeline, or a quick Go program. Hosted-only means the code physically executes on OpenAI's machines — a non-starter under some data-residency and compliance regimes, and a hard stop if your requirement is literally 'run this on our own hardware.' And the black-box nature means you can't bake your own image, pin the dependency set, control the network egress policy, or inspect what the runtime actually does. None of these are bugs; they're the deliberate shape of a managed feature. They're also exactly the reasons people go looking for an alternative.

The alternative: a code interpreter you own

The wedge is straightforward: keep the developer experience of OpenAI's Code Interpreter — generate code, run it, get rich results back — but move the sandbox to infrastructure you control, drop the Python-only restriction, and make the whole thing self-hostable. That's what PandaStack is built for, and the isolation model is the reason it's a credible place to run code you didn't write.

Every PandaStack sandbox is a Firecracker microVM with its own guest kernel (5.10, Ubuntu 24.04), isolated by hardware virtualization through KVM — not a shared-kernel container. Firecracker is the same VMM AWS uses under Lambda and Fargate; it's written in Rust, runs under a jailer that drops privileges, and exposes a minimal virtio device model (net, block, vsock). The practical upshot for a code interpreter: arbitrary model-generated code — an infinite loop, a 40GB allocation, an `import os; os.system('rm -rf /')` — is contained to a disposable VM with its own kernel, memory, filesystem, and network namespace, not to a process sharing your host kernel. That's a stronger boundary than `exec()` in your own process or a hardened container, which is the right bar for untrusted LLM output. See /blog/firecracker-vs-docker and /blog/what-is-a-microvm for the deeper isolation story, and /blog/why-docker-is-not-a-sandbox for why containers fall short of it.

The historical objection to VMs for per-request work was boot time. PandaStack's answer is specific: there is no warm pool of idle VMs. Every create restores a baked Firecracker snapshot on demand. The snapshot already holds a booted kernel, a running guest agent, and an open network stack, so 'starting' a sandbox is really 'restore memory pages and resume.' That lands at 179ms p50 (p99 ~203ms). The only slow path is the first-ever spawn of a brand-new template, which does a real cold boot (~3s) and bakes the snapshot; every create after that is on the fast restore path. A median create under 200ms is what makes a fresh, fully isolated VM per untrusted run practical. See /docs/internals/snapshot-restore for how the restore path works.

Rich results — charts come back as objects

OpenAI's Code Interpreter returns charts and tables as part of the model's output. A self-hosted alternative is only useful if it gives you the same ergonomics, and this is where the black-box-vs-yours distinction pays off: with PandaStack you get a code-context session — a persistent kernel where state carries across runs — and each run returns a typed Execution whose results are structured objects, not text you scrape from stdout. A matplotlib figure comes back as a base64 PNG you can pull directly off the result; a DataFrame repr comes back as HTML. Here's the loop in Python:

import base64

from pandastack import Sandbox

# One Firecracker microVM, one persistent kernel; state survives across run_code calls.
with Sandbox.create(template="code-interpreter", ttl_seconds=300) as sbx:
    ctx = sbx.create_code_context()  # language="python" by default

    # Cell 1: load data into the live kernel.
    ctx.run_code("import pandas as pd; df = pd.read_csv('/workspace/sales.csv')")

    # Cell 2: model-generated code that produces a chart.
    ex = ctx.run_code(
        "import matplotlib; matplotlib.use('Agg')\n"
        "import matplotlib.pyplot as plt\n"
        "df.groupby('region')['revenue'].sum().plot.bar()\n"
        "plt.tight_layout()",
        timeout_seconds=30,
    )

    # Rich output as an object: a base64 PNG, not scraped stdout.
    if ex.png:
        with open("chart.png", "wb") as f:
            f.write(base64.b64decode(ex.png))
    print(ex.text)            # text repr / stdout
    print(len(ex.results))    # every result the cell emitted (charts, tables, reprs)

The TypeScript SDK mirrors it exactly, which is the second thing OpenAI's Python-only tool can't give you — a first-class code interpreter from a Node or Bun codebase:

import { PandaStack } from "@pandastack/sdk";

const ps = new PandaStack(); // reads PANDASTACK_API_KEY (prefix pds_)
const sbx = await ps.sandboxes.create({ template: "code-interpreter", ttlSeconds: 300 });

try {
  const ctx = await sbx.createCodeContext();
  await ctx.runCode("import pandas as pd; df = pd.read_csv('/workspace/sales.csv')");

  const ex = await ctx.runCode(
    "df.groupby('region')['revenue'].sum().plot.bar()",
    { timeoutSeconds: 30 },
  );

  if (ex.png) {
    // base64 PNG straight off the Execution: render it or attach to a chat message.
    // Bun.write is Bun-specific; on Node, use writeFile from fs/promises instead.
    await Bun.write("chart.png", Buffer.from(ex.png, "base64"));
  }
  console.log(ex.text, ex.results.length);
} finally {
  // Delete the sandbox even if a run throws, mirroring the Python `with` block.
  await sbx.delete();
}

Because the runtime isn't a black box, the code interpreter isn't limited to Python either. The same sandbox runs a shell, Node 22, Go, or anything baked into the template — `run_code(..., language="shell")` or just `exec("node script.js")`. And because you control the image, you can bake your own dependency set into the snapshot so there's zero per-run pip install — the libraries are already importable when the VM resumes. For the full walkthrough, see /docs/guides/code-interpreter and the agent recipe at /docs/cookbook/data-analyst-agent. The build-it-from-scratch version is at /blog/code-interpreter-how-to.
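Since the chart arrives as base64 text, it's worth checking the payload before attaching it to a chat message or embedding it in a page. The helper below is a minimal sketch of that check using only the Python standard library. `png_data_uri` is a hypothetical name, not part of the PandaStack SDK; the only assumption carried over from the examples above is that `ex.png` is a base64-encoded PNG string.

```python
import base64
import binascii

# Every valid PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_data_uri(b64_png: str) -> str:
    """Validate a base64 PNG payload and return it as a data: URI.

    Hypothetical helper for illustration; not an SDK function.
    """
    # Tolerate line-wrapped base64 by removing all whitespace first.
    compact = "".join(b64_png.split())
    try:
        raw = base64.b64decode(compact, validate=True)
    except binascii.Error as exc:
        raise ValueError("payload is not valid base64") from exc
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("payload does not start with the PNG signature")
    return "data:image/png;base64," + compact
```

With a result from the example above, `png_data_uri(ex.png)` would give a string you can drop into an `<img src=...>` attribute, and a malformed payload fails loudly instead of rendering as a broken image.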

Forking: the trick OpenAI's hosted tool can't do

There's a capability the microVM model unlocks that a hosted Python container fundamentally can't expose: copy-on-write forking. Warm one sandbox to a known state — dependencies installed, a 2GB dataset loaded into the kernel, the REPL hot — then fork it N times to run branches in parallel, each starting from the exact same memory without re-running setup. A snapshot captures the full machine (memory plus rootfs); a fork shares guest memory through MAP_PRIVATE so the kernel only copies pages on write, and clones the rootfs with an XFS reflink so disk data is shared until something writes. A same-host fork completes in about 400ms; a cross-host fork (download plus restore) runs 1.2–3.5s.

For an agent that does tree-search, runs multiple candidate fixes, or explores several analyses off one loaded dataset, this is the difference between re-doing expensive setup per branch and forking from a hot baseline. It's a primitive OpenAI's hosted interpreter doesn't surface, because forking a VM is something you can only offer when you control the VM. See /blog/snapshot-and-fork-explained and /docs/concepts/snapshots-and-forks for how it works.
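The copy-on-write behavior behind forking is an ordinary OS primitive, and you can see it in a few lines. The sketch below is not PandaStack's implementation: it uses Python's `mmap` with `ACCESS_COPY` (a private, copy-on-write mapping, which is `MAP_PRIVATE` on Unix) over a small temporary file that stands in for a snapshot's memory image. The function name `cow_demo` and the byte strings are made up for illustration.

```python
import mmap
import os
import tempfile


def cow_demo() -> tuple[bytes, bytes, bytes]:
    """Map one file privately twice, write through one mapping, compare."""
    # A tiny file stands in for the snapshot's memory image.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"baseline-state--")
        path = f.name
    try:
        with open(path, "r+b") as fh:
            # ACCESS_COPY: writes land in private pages, never in the file.
            fork_a = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_COPY)
            fork_b = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_COPY)
            fork_a[0:8] = b"branch-A"  # only fork_a's copy of the page changes
            seen_by_a = bytes(fork_a[:])
            seen_by_b = bytes(fork_b[:])
            fork_a.close()
            fork_b.close()
        with open(path, "rb") as fh:
            on_disk = fh.read()
        return seen_by_a, seen_by_b, on_disk
    finally:
        os.unlink(path)


a, b, disk = cow_demo()
print(a)     # b'branch-A-state--'
print(b)     # b'baseline-state--'
print(disk)  # b'baseline-state--'
```

Each "fork" sees the shared baseline until it writes, and a write in one branch is invisible to the other and to the backing file. That is the property that lets many branches start from one warmed state without copying it up front.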

Self-host, multi-tenancy, and egress control

The structural difference from OpenAI is where the code runs. PandaStack's core is open-source under Apache-2.0 and is designed to run on your own Linux KVM hosts (anything with /dev/kvm). You run a control-plane API and a per-host agent; sandboxes execute entirely on your infrastructure. There's a hosted offering too, but self-host is first-class — same binaries, same agent, base URL configurable so identical SDK code points at either. If your reason for leaving OpenAI's hosted tool is 'the code must run on our hardware / in our VPC / in this region,' that's the whole point.

Two more things the hosted black box can't give you. First, network policy: each sandbox runs in its own Linux network namespace (NATID — per-sandbox netns plus veth plus tap, 16,384 /30 subnets per agent), so egress is isolated per sandbox and you can restrict outbound access at the network layer rather than trusting the code not to exfiltrate. See /docs/concepts/networking-natid. Second, multi-tenancy you can reason about: the isolation boundary is the VM, so the correct pattern is one sandbox per untrusted run (or per user session), torn down after — never two users' code in the same VM. The 179ms create and ~400ms fork make per-run isolation cheap enough to actually do. For the threat model, see /blog/secure-code-execution-for-ai-agents and /blog/run-untrusted-code-safely.
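The 16,384 figure is easy to sanity-check: it is exactly the number of /30 blocks in a /16 address pool (2^14). The sketch below does that arithmetic with Python's `ipaddress` module. The `10.0.0.0/16` pool is an assumption chosen for illustration; the post only states the subnet size and count, not which range the agent actually carves up, and how the two usable addresses in each /30 are assigned is likewise not specified here.

```python
import ipaddress


def sandbox_subnets(pool: str = "10.0.0.0/16") -> list[ipaddress.IPv4Network]:
    """Split an address pool into /30 blocks, one per sandbox.

    The default pool is an illustrative assumption, not a documented value.
    """
    return list(ipaddress.ip_network(pool).subnets(new_prefix=30))


subnets = sandbox_subnets()
print(len(subnets))                          # 16384
print(subnets[0])                            # 10.0.0.0/30
print([str(h) for h in subnets[0].hosts()])  # ['10.0.0.1', '10.0.0.2']
print(subnets[-1])                           # 10.0.255.252/30
```

A /30 holds four addresses, two of them usable, which is just enough for a point-to-point link per sandbox. That keeps each sandbox's traffic on its own tiny network rather than a shared bridge.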

Self-hosting is real operational weight. You're now running KVM hosts, an agent fleet, networking, and snapshot storage. If you don't have an infra team or the appetite for one, OpenAI's hosted tool is genuinely less to operate — and that's a legitimate reason to stay. PandaStack's job is to make the self-host path tractable, not to pretend it's free.

When OpenAI's built-in is the right call

An honest comparison has to say when the thing you're comparing against wins. Stick with OpenAI's hosted Code Interpreter when:

  • You're all-in on OpenAI. If your whole stack is the Assistants/Responses API and the model is already orchestrating tool calls, the built-in interpreter is the shortest path and the tightest integration — there's no sandbox to wire up.
  • You want zero infrastructure. No KVM hosts, no agent fleet, no snapshot storage — you make an API call and code runs. For a prototype or a low-volume product, that operational simplicity is worth a lot.
  • Your workload is genuinely Python-only and short-lived. If you never need Node, a shell, Go, a custom image, or persistent forked state, the Python tool covers it and the extra capability of a self-hosted microVM is capability you won't use.
  • You have no data-residency or on-prem constraint. If it's fine for the code to run on OpenAI's machines, one of the main reasons to move doesn't apply to you.

Put plainly: if you want zero infra and you're happy inside OpenAI's ecosystem, use the built-in. The case for an alternative gets strong when one of those four flips — you need self-host or data residency, you've outgrown Python-only, you want to bake and control the runtime, or you need forking and per-tenant network isolation that a hosted black box structurally can't expose.

Where this sits in the wider field

PandaStack isn't the only sandbox you could plug in behind an agent, and OpenAI's own provider-agnostic agent-sandbox concept treats the execution layer as swappable. The honest landscape: most serious AI code-execution sandboxes converge on microVM-class isolation (E2B and Vercel Sandbox use Firecracker; others use gVisor or Kata), and they differ mainly on hosted-vs-self-host, boot path, forking semantics, and platform breadth — not on whether microVMs are the right foundation. If you're weighing options beyond OpenAI's built-in, /blog/e2b-alternatives maps the field by decision criteria, and the head-to-heads at /blog/pandastack-vs-e2b, /blog/pandastack-vs-modal, and /blog/pandastack-vs-vercel-sandbox go a level deeper on each.

Don't pick on a feature list. Benchmark the two candidates you actually care about: call create() in your own region, run your real model-generated code under realistic load, fork into the branching pattern your workload uses, and check the isolation and egress behavior yourself. Every vendor's headline number — including PandaStack's create latency, and including how you read OpenAI's limits — is a hypothesis to verify, not a settled fact. An hour of measurement on your workload beats a week of reading comparison posts.
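To make that measurement concrete, here is a minimal, SDK-agnostic timing harness. It is a sketch under stated assumptions: `time_calls` and `percentile` are hypothetical helper names, the percentile uses the simple nearest-rank method, and the stand-in workload should be swapped for whatever call you actually care about (for example, a function that creates and deletes one sandbox).

```python
import math
import time
from typing import Callable


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_calls(fn: Callable[[], object], runs: int = 50) -> list[float]:
    """Call fn `runs` times and return each wall-clock duration in ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    return timings


# Stand-in workload; replace with the operation you want to benchmark.
timings = time_calls(lambda: sum(range(10_000)), runs=50)
print(f"p50={percentile(timings, 50):.3f}ms  p99={percentile(timings, 99):.3f}ms")
```

Run it from the region and network path your production traffic will use, and with enough runs that the p99 means something; 50 samples is only a smoke test.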

The bottom line

OpenAI's Code Interpreter is a good, tightly-integrated, zero-infrastructure tool — and it's Python-only, hosted-only, and a black box. If those constraints fit, use it. If any of them is the thing blocking you, the alternative is a code interpreter you own: a Firecracker microVM per run with hardware-level isolation, sub-200ms creates, multi-language execution, rich results that come back as typed objects from both a TypeScript and a Python SDK, copy-on-write forking, per-sandbox network isolation, and an Apache-2.0 core you can self-host on your own KVM hosts. Same developer experience as the hosted tool, but on infrastructure you control and without the Python-only ceiling. Start at /docs/guides/code-interpreter, and benchmark it against your real workload before you commit.

Frequently asked questions

What is the best open-source alternative to OpenAI's Code Interpreter?

PandaStack is an open-source alternative built specifically as a sandbox for running model-generated code. Its core is Apache-2.0 licensed and designed to run on your own Linux KVM hosts (a control-plane API plus a per-host agent), so the code executes on your infrastructure rather than OpenAI's. Each sandbox is a Firecracker microVM with its own guest kernel. After a template's first cold boot, every create restores a baked snapshot (179ms p50), and a code-context session returns rich results (charts as base64 PNG objects, DataFrames as HTML) from both Python and TypeScript SDKs. Unlike OpenAI's hosted Python tool, it isn't limited to Python and you control the runtime image and network egress policy.

Can I self-host an OpenAI Code Interpreter alternative?

Yes. OpenAI's Code Interpreter is hosted-only — the code runs on OpenAI's machines and you can't run that sandbox yourself. PandaStack's core is Apache-2.0 and designed to be self-hosted: you run the control-plane API and a per-host agent on machines with /dev/kvm, and sandboxes execute entirely on your own infrastructure. The same binaries power both the hosted offering and self-hosting, and the SDK base URL is configurable, so identical code points at either. This is the usual reason teams leave a hosted-only tool — data residency, on-prem requirements, VPC isolation, or per-sandbox network control. The trade-off is operational: you're now running KVM hosts, an agent fleet, networking, and snapshot storage.

Is OpenAI's Code Interpreter limited to Python?

The classic Code Interpreter / python tool in the Assistants and Responses APIs runs Python only, in an OpenAI-hosted ephemeral sandbox. A self-hosted microVM alternative removes that ceiling: because you control the guest image, the same sandbox can run a shell, Node, Go, or anything baked into the template, and you can use run_code with language set to python or shell. If your workload is genuinely Python-only and short-lived, OpenAI's built-in covers it; if you need other runtimes or a custom dependency set, that's a concrete reason to look at an alternative.

When should I just use OpenAI's built-in Code Interpreter instead of an alternative?

Use OpenAI's built-in when you're all-in on the OpenAI ecosystem and want the tightest integration, when you want zero infrastructure to operate, when your workload is genuinely Python-only and short-lived, and when you have no data-residency or on-prem constraint. In those cases the hosted tool is the shortest path and the least to run. The case for a self-hosted alternative gets strong when one of those flips — you need the code to run on your own hardware or in a specific region, you've outgrown Python-only, you want to bake and control the runtime image, or you need capabilities a hosted black box can't expose like copy-on-write forking and per-sandbox egress isolation.

Written by Ajay Kumar, Founder, PandaStack.