Over the past two to three years, OCI containers paired with Kubernetes have become the de facto standard for the OSS Cloud-Infra stack at Meta, powering everything from ordinary applications to GPU workloads for AI research. That scale surfaced a thorny challenge over the past year: running containers inside containers, safely and without root.
This talk demystifies OCI containers by walking through the stack and the kernel primitives that make them work. In OCI terms, an image (a layered filesystem plus metadata) is executed by a runtime (runc, crun) under an engine (e.g., containerd). A container is just a process isolated by Linux features: namespaces (pid, net, mnt, ipc, uts, user), cgroups for resource accounting, overlay filesystems for copy-on-write layers, and security controls (capabilities, seccomp, and LSMs such as AppArmor or SELinux) that drop privileges and filter syscalls.
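To make “a container is just a process” concrete: on Linux, each process’s namespace memberships are exposed as symlinks under /proc/<pid>/ns/, and two processes share a namespace when those links resolve to the same identifier (per namespaces(7), the device ID and inode number match). The Python sketch below is a minimal illustration under the assumption of a Linux host with /proc mounted; the helpers parse_ns_link and namespace_ids are hypothetical names, not part of any runtime’s API.

```python
from __future__ import annotations

import os
import re

# Namespace types named in the talk, as they appear under /proc/<pid>/ns/.
NAMESPACE_TYPES = ("pid", "net", "mnt", "ipc", "uts", "user")

# A link target looks like "pid:[4026531836]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a /proc/<pid>/ns/* symlink target into (type, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match["kind"], int(match["inode"])


def namespace_ids(pid: str = "self") -> dict[str, int]:
    """Return {namespace type: inode} for a process (Linux only).

    Comparing the result for two processes shows which namespaces they
    share and which ones isolate them from each other.
    """
    ids: dict[str, int] = {}
    for kind in NAMESPACE_TYPES:
        try:
            _, inode = parse_ns_link(os.readlink(f"/proc/{pid}/ns/{kind}"))
        except OSError:
            # Missing /proc, unsupported namespace type, or no permission
            # to inspect the target process: skip this entry.
            continue
        ids[kind] = inode
    return ids
```

Calling namespace_ids() from a shell on the host and again from inside a container should show different inode numbers for every namespace the runtime unshared. Inspecting another process (for example namespace_ids("1")) may require extra privileges, which is why the sketch skips entries it cannot read instead of failing.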
In a container-in-container (“nested”) setup, you replay the image/runtime lifecycle inside an already isolated sandbox; think of it as a container matrix. It works, but the edge cases multiply: aligning user-namespace ID maps, balancing privileged and rootless modes, accounting for resources used by nested PIDs, handling mount propagation and overlay-on-overlay, stacking networks, and juggling UID mappings for root-squashed NFS. We’ll unpack a few often-overloaded terms: rootless (running without access to the real host root) and user namespaces (UID/GID remapping that lets users inside a container, including root, map to an unprivileged user on the host). None of this would be possible without recent advances in the Linux kernel and the broader open-source ecosystem: kernel features from the 5.x series onward (idmapped mounts, unprivileged OverlayFS, cgroup v2 delegation) and the latest integrations in containerd (v2.1+) and runc.
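The ID-map alignment problem can be sketched from the format of /proc/<pid>/uid_map (and gid_map): each line reads “inside-start outside-start length” and maps a contiguous range of IDs in a user namespace onto a range in its parent. Nesting stacks these maps, so an ID in the innermost container reaches the host only if every layer maps it. The Python below is a hypothetical illustration; the helper names and the example ranges are assumptions chosen for clarity, not values from the production setups described in this talk.

```python
from __future__ import annotations

from typing import NamedTuple, Optional


class IdMapEntry(NamedTuple):
    inside: int   # first ID inside the namespace
    outside: int  # first ID in the parent namespace
    length: int   # number of consecutive IDs mapped


def parse_id_map(text: str) -> list[IdMapEntry]:
    """Parse uid_map/gid_map text (whitespace-padded columns are fine)."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append(IdMapEntry(inside, outside, length))
    return entries


def to_parent(id_map: list[IdMapEntry], inner_id: int) -> Optional[int]:
    """Translate one ID into the parent namespace, or None if unmapped."""
    for entry in id_map:
        if entry.inside <= inner_id < entry.inside + entry.length:
            return entry.outside + (inner_id - entry.inside)
    return None


def to_host(maps_outermost_first: list[list[IdMapEntry]], inner_id: int) -> Optional[int]:
    """Walk a stack of nested maps from the innermost namespace out to the host."""
    current: Optional[int] = inner_id
    for id_map in reversed(maps_outermost_first):
        if current is None:
            return None
        current = to_parent(id_map, current)
    return current


# Hypothetical layout: the sandbox maps its IDs 0..65535 to host IDs starting
# at 100000, and the nested runtime maps the inner container's IDs 0..4999 to
# sandbox IDs starting at 10000.
SANDBOX_MAP = parse_id_map("0 100000 65536\n")
NESTED_MAP = parse_id_map("0 10000 5000\n")
```

With these maps, to_host([SANDBOX_MAP, NESTED_MAP], 0) resolves the nested container’s root to host UID 110000, while any inner ID outside the nested range comes back as None; the kernel reports such IDs as the overflow UID (typically 65534, “nobody”). The same reasoning explains why a nested runtime can only be handed a sub-range of IDs that its own sandbox already maps: per user_namespaces(7), every ID written into a child’s map must itself be mapped in the parent namespace.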
With this foundation, we’ll share two production case studies: (1) non-root BuildKit sandboxes that run our container builds, and (2) an interactive, containerized development environment that lets researchers safely run their own nested containers. We’ll lay out the trade-offs: performance (layering, nested cgroups, user-space networking), visibility (logs from inside the sandbox), and usability (UID/GID ownership, debugging in remapped namespaces). We plan to finish with a practical decision guide and clear do’s and don’ts for running containers inside containers safely.



