I Thought I Understood Containers. Then I Tried Building One.

v1: Namespaces, or the First Time PID 1 Lied to Me

The first version was supposed to be easy: run a process in a new PID namespace and prove it sees itself as PID 1. So I ran the command the way I thought it worked:

$ sudo unshare --pid bash
# echo $$
25184

That was not PID 1. That was just embarrassing.

The rule I had missed is simple: unshare --pid does not move the calling process. Only children created after the call land in the new PID namespace, so the process that calls unshare --pid does not magically become PID 1. You need to fork. The first child born into the new namespace becomes PID 1.

So the working version was:

$ sudo unshare --pid --fork bash
# echo $$
1

That one line changed the tone. I was inside a different process universe. The shell thought it was process 1. Signals felt different. Orphans came home to it.

Then I ran ps, and got humbled again.

# ps -o pid,ppid,comm
  PID  PPID COMMAND
25310 25304 bash
25344 25310 ps

That made no sense at first. I was PID 1, but ps was showing host-looking PIDs. The next reveal: ps does not ask the kernel some pure "what processes exist?" question. It reads files. If /proc still points at the host procfs, your tools will tell you the host story.
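To make the "it reads files" point concrete, here is a minimal sketch of a process lister built only from directory entries. The function name list_procs is hypothetical, and this is a simplified stand-in for what ps does, not its actual implementation. It takes the procfs directory as an argument, which shows that the answer depends entirely on which /proc it is pointed at.

```shell
# list_procs [DIR]: print "PID COMM" for each entry under DIR whose name
# starts with a digit (default DIR: /proc). Hypothetical helper for this
# article; the real ps reads many more files than comm.
list_procs() {
    procdir=${1:-/proc}
    for entry in "$procdir"/[0-9]*; do
        # Skip an unmatched glob and processes that exited mid-scan.
        [ -r "$entry/comm" ] || continue
        comm=$(cat "$entry/comm" 2>/dev/null) || continue
        printf '%s %s\n' "${entry##*/}" "$comm"
    done
}
```

Calling list_procs with no argument walks whatever is mounted at /proc, so inside a new PID namespace with the host procfs still mounted it reports host PIDs, just as ps did.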

So I remounted /proc from inside the namespace:

# mount -t proc proc /proc
# ps -o pid,ppid,comm
  PID  PPID COMMAND
    1     0 bash
    7     1 ps

That was when it clicked. The namespace did not become real to my eyes until /proc agreed with it. Before that, I had isolation, but my tools were reading the old filesystem view.
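One caveat about the manual mount: without a separate mount namespace, mounting procfs over /proc also changes /proc for everything else sharing that mount namespace. util-linux's unshare has a --mount-proc option that mounts a fresh procfs just before running the program and implies a new mount namespace. The wrapper below is a sketch, not what my scripts did; the name pidns_shell and the PIDNS_DRY_RUN switch are made up for illustration.

```shell
# pidns_shell [CMD...]: run CMD (default: sh) as PID 1 of a new PID
# namespace with its own /proc. Needs root for the real run.
# PIDNS_DRY_RUN=1 is a hypothetical switch for this sketch: it prints the
# command instead of executing it.
pidns_shell() {
    [ "$#" -gt 0 ] || set -- sh
    if [ "${PIDNS_DRY_RUN:-0}" = 1 ]; then
        echo unshare --pid --fork --mount-proc "$@"
        return 0
    fi
    # --mount-proc implies a new mount namespace, so the fresh procfs is
    # only visible to the child and the outer /proc is left alone.
    unshare --pid --fork --mount-proc "$@"
}
```

Run as root, pidns_shell ps -o pid,ppid,comm should show the small PID table straight away, with no manual mount step.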

The UTS namespace lesson was cleaner. I accidentally ran a science experiment. In one terminal, without a UTS namespace:

$ hostname
ba149abae9bd

Then inside a new UTS namespace:

$ sudo unshare --uts bash
# hostname toybox
# hostname
toybox

Back outside that UTS namespace:

$ hostname
ba149abae9bd

That was my control and treatment. Same machine, same kernel, different hostname view - and the "host" was already a container hostname, which made the containers-inside-containers setup visible in the output. Nothing mystical. Just one isolated kernel data structure doing exactly what the docs said, except now I had seen it with my own hands.
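A more direct way to confirm that two shells really sit in different namespaces is to compare the symlinks under /proc/<pid>/ns, whose targets look like uts:[...] with a numeric identifier. The helpers below are a small sketch with hypothetical names (ns_id, same_ns); reading another process's ns links may require root.

```shell
# ns_id TYPE [PID] [PROCROOT]: print the namespace link target for PID
# (default: self) under PROCROOT (default: /proc).
ns_id() {
    readlink "${3:-/proc}/${2:-self}/ns/$1"
}

# same_ns TYPE PID_A PID_B [PROCROOT]: succeed if both processes share the
# namespace of that type; return 2 if either link cannot be read.
same_ns() {
    a=$(ns_id "$1" "$2" "${4:-/proc}") || return 2
    b=$(ns_id "$1" "$3" "${4:-/proc}") || return 2
    [ "$a" = "$b" ]
}
```

For the experiment above, same_ns uts with the PID of the outer shell and the PID of the shell started by unshare --uts should fail, while the same check with pid instead of uts should succeed, because only the UTS namespace was unshared.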

v2: pivot_root, the Boss Fight

After namespaces, I got overconfident again. The next version was supposed to give the process its own filesystem: a tiny rootfs, a shell, maybe BusyBox. Very container-ish. My repo had bash scripts for this, not some compiled runtime from a tutorial. So the shape of the attempt was v2.sh, a rootfs, and a command to run inside it.

The parade started with the obvious error:

$ sudo ./v2.sh ./rootfs /bin/sh
exec /bin/sh: no such file or directory

Fine. There was no shell where I said there would be a shell. I fixed that with Alpine's own BusyBox and hit the more annoying version: the file existed, but the kernel still said it could not run it.

$ ./rootfs/bin/busybox sh
bash: ./rootfs/bin/busybox: cannot execute: required file not found

This is the kind of error that feels personal because you can list the file. You can see the symlink. The computer still refuses. The plot twist came from file:

$ cd rootfs
$ file bin/busybox
bin/busybox: ELF 64-bit LSB pie executable, ARM aarch64, dynamically linked, interpreter /lib/ld-musl-aarch64.so.1, stripped

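The binary was there. The interpreter was not reachable from the old world: the kernel resolves the interpreter path recorded in the ELF against the current root, and my Ubuntu root had no /lib/ld-musl-aarch64.so.1. Linux was not saying "your BusyBox file does not exist." It was saying "from here, I cannot load the interpreter this ELF needs." Same surface error, different problem.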
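The "from here" part can be checked before attempting anything. The sketch below takes the interpreter path that file printed and tests whether it exists relative to a chosen root. The name interp_visible is hypothetical, and the check is deliberately naive: it does not follow absolute symlinks the way a real root switch would.

```shell
# interp_visible ROOT INTERP: report whether the ELF interpreter path INTERP
# exists when ROOT is treated as the filesystem root.
interp_visible() {
    root=${1%/}          # "/" becomes "", so the two paths join cleanly
    interp=$2
    if [ -e "$root$interp" ] || [ -L "$root$interp" ]; then
        echo "present: $root$interp"
    else
        echo "missing: $root$interp"
        return 1
    fi
}
```

In my setup, interp_visible / /lib/ld-musl-aarch64.so.1 would report the loader missing on the Ubuntu side, while interp_visible ./rootfs /lib/ld-musl-aarch64.so.1 would find it, which is the whole bug in two lines.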

The fix was not what I first thought. Alpine's BusyBox did not need to become static. Once Alpine became /, its musl loader would be at /lib/ld-musl-aarch64.so.1, and Alpine's /bin/sh would be happy. The thing that needed help was the handoff itself: my Ubuntu slim image did not even have pivot_root.

$ pivot_root . put_old
bash: pivot_root: command not found
$ file /bin/busybox
/bin/busybox: ELF 64-bit LSB executable, ARM aarch64, statically linked, stripped
$ /bin/busybox pivot_root . put_old

That was the better plot twist: the old world could not perform the handoff without borrowing a static tool. busybox-static was not my replacement shell inside Alpine. It was the bridge that could run before and during the transition.

Then I hit the Bash hash-cache moment. Alpine was now /, but Bash still remembered a command path from before the filesystem switch. It went hunting for /usr/bin/mount in a world that had just been evicted.

/ # mount -t proc proc /proc
bash: /usr/bin/mount: No such file or directory
/ # hash -r
/ # mount -t proc proc /proc

I had fixed the filesystem and was still debugging an old decision Bash had remembered for me. That is the kind of bug that makes you take a short walk.

Then came the Mac problem. My setup was not "normal Linux laptop, local ext4 disk." It was Apple Silicon Mac → privileged Ubuntu container → repo mounted from macOS. That means virtiofs was in the story whether I wanted it there or not. The symptom showed up after the pivot, inside Alpine, which made it stranger. Applet symlinks like mount and ls could fail with Permission denied on the Mac-shared mount, while calling BusyBox directly still worked. The files were there; executing through those symlinks was the weird part.

/ # ls
sh: ls: Permission denied
/ # /bin/busybox ls
bin dev etc lib proc put_old
/ # mount -t proc proc /proc
sh: mount: Permission denied
/ # /bin/busybox mount -t proc proc /proc

That is where I stopped trying to make the shared mount behave like a normal Linux filesystem. I moved the rootfs to a container-native path and reran the same idea from there. Boring fix, correct fix.

pivot_root itself still had opinions:

pivot_root: invalid argument

This one was not glamorous either. The new root had to be a mount point. The old root needed somewhere to go, and that somewhere had to be a directory under the new root. I had to bind-mount the new root onto itself, create oldroot inside it, call pivot_root(newroot, oldroot), chdir("/"), and unmount the old root.
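Put together, the sequence looks roughly like the sketch below. It is not the literal v2.sh. It assumes root inside a private mount namespace (for example under unshare --mount), a static BusyBox at the path in BUSYBOX on the host side, and a /bin/busybox inside the new root. The function name enter_rootfs is hypothetical.

```shell
# Static helper used before the pivot; override via the environment.
BUSYBOX=${BUSYBOX:-/bin/busybox}

# enter_rootfs NEWROOT CMD...: pivot into NEWROOT and exec CMD there.
# Assumes root and a private mount namespace; returns 2 on bad usage.
enter_rootfs() {
    newroot=$1
    if [ "$#" -lt 2 ] || [ ! -d "$newroot" ]; then
        echo "usage: enter_rootfs NEWROOT CMD..." >&2
        return 2
    fi
    shift
    mount --bind "$newroot" "$newroot" || return 1  # make new root a mount point
    mkdir -p "$newroot/oldroot" || return 1         # parking spot for the old root
    cd "$newroot" || return 1
    "$BUSYBOX" pivot_root . oldroot || return 1     # swap the roots
    cd / || return 1
    hash -r                                         # forget pre-pivot command paths
    /bin/busybox umount -l /oldroot || return 1     # now the new root's BusyBox
    exec "$@"
}
```

The hash -r is there because of the cached /usr/bin/mount path from earlier, and the lazy unmount (-l) detaches the old root even though the calling shell's own files still live on it.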

When it finally worked, the reward was tiny and perfect:

/ # cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.24.1
PRETTY_NAME="Alpine Linux v3.24"

That output felt better than it should have. It was just /etc/os-release, but it meant the process was now living in the filesystem I had assembled. Not Docker magic. Just mounts, an ELF loader problem, a static pivot_root applet, a root pivot, and errors more useful than any clean diagram.

v3: cgroups, or Linux as a Filesystem API

The third version was about resource limits. This was where cgroups stopped being "that Kubernetes thing" and became a filesystem API I could write to.

On cgroup v2, everything looked like files:

$ ls /sys/fs/cgroup
cgroup.controllers  cgroup.procs  cgroup.subtree_control  memory.current  memory.max  pids.max

Beautiful, until you write the wrong value to the wrong file at the wrong level.

My first attempt was to create a child cgroup, move the process, and enable controllers casually.

$ sudo mkdir /sys/fs/cgroup/tiny
$ echo +memory | sudo tee /sys/fs/cgroup/cgroup.subtree_control
tee: /sys/fs/cgroup/cgroup.subtree_control: Device or resource busy

That error was the "no internal processes" rule. In cgroup v2, you cannot enable domain controllers in a cgroup's cgroup.subtree_control while that cgroup still has processes of its own. The parent has to become a manager; work happens in children.
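A quick pre-check would have saved me the confusing "busy" error. This is a sketch with a hypothetical helper name, has_member_procs. It reads the file's content rather than testing its size, because cgroup files report a size of 0.

```shell
# has_member_procs CGROUP_DIR: succeed if the cgroup itself lists at least
# one PID in cgroup.procs (processes in child cgroups are not listed there).
has_member_procs() {
    first=$(head -n 1 "$1/cgroup.procs" 2>/dev/null)
    [ -n "$first" ]
}
```

If has_member_procs /sys/fs/cgroup succeeds, writing +memory to that cgroup's cgroup.subtree_control is likely to fail until those processes are moved into a child.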

So I had to do the evacuation dance: create a child for the shell, move myself into it, then enable controllers on the parent, then create the actual container cgroup.

$ sudo mkdir /sys/fs/cgroup/init
$ echo $$ | sudo tee /sys/fs/cgroup/init/cgroup.procs
18420
$ echo +memory +pids | sudo tee /sys/fs/cgroup/cgroup.subtree_control
+memory +pids
$ sudo mkdir /sys/fs/cgroup/tiny
$ echo 32M | sudo tee /sys/fs/cgroup/tiny/memory.max
32M
$ echo 64 | sudo tee /sys/fs/cgroup/tiny/pids.max
64

After that, the runtime could write the child's PID to /sys/fs/cgroup/tiny/cgroup.procs before executing the workload.
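The "join first, then exec" ordering matters: the process should already be in the cgroup before the workload allocates anything. A minimal sketch of that handoff, with a hypothetical function name run_in_cgroup, could look like this:

```shell
# run_in_cgroup CGROUP_DIR CMD...: start CMD so that it is already a member
# of the cgroup when it begins executing. Returns 2 on bad usage.
run_in_cgroup() {
    cg=$1
    if [ "$#" -lt 2 ] || [ ! -d "$cg" ]; then
        echo "usage: run_in_cgroup CGROUP_DIR CMD..." >&2
        return 2
    fi
    shift
    # Inside `sh -c`, $$ is the new shell's own PID. exec keeps that PID,
    # so the workload inherits the cgroup membership written just before.
    sh -c 'echo "$$" > "$1/cgroup.procs" && shift && exec "$@"' sh "$cg" "$@"
}
```

With the setup above, something like run_in_cgroup /sys/fs/cgroup/tiny ./memory-hog (run as root, where memory-hog is a placeholder) would start the workload under the 32M and 64-PID limits from its first instruction.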

The first OOM test was almost too satisfying. I ran a small memory hog inside the limited cgroup and watched the kernel enforce the number.

$ dmesg | tail -n 3
[ 5221.143892] memory: usage 32768kB, limit 32768kB, failcnt 41
[ 5221.143901] Memory cgroup out of memory: Killed process 18432 (python3) total-vm:89344kB, anon-rss:31812kB, file-rss:0kB, shmem-rss:0kB
[ 5221.143944] oom_reaper: reaped process 18432 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

That line made cgroups real. Not a Docker setting. Not YAML. The kernel enforcing a number I wrote into a file.
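dmesg is not the only witness. On cgroup v2 the cgroup keeps its own counters in memory.events, with keys such as max, oom, and oom_kill. A small sketch for reading one of them, using a hypothetical helper name cgroup_event:

```shell
# cgroup_event CGROUP_DIR KEY: print the counter stored under KEY in the
# cgroup's memory.events file (one "key value" pair per line).
cgroup_event() {
    awk -v key="$2" '$1 == key { print $2 }' "$1/memory.events"
}
```

After an OOM kill like the one above, cgroup_event /sys/fs/cgroup/tiny oom_kill should print a non-zero count, which is easier to script against than parsing kernel logs.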

Then Kubernetes Looked Smaller

After this, I deployed to Kubernetes and the concepts no longer felt distant. A Pod was not magic anymore. It was these same pieces with an API in front: namespaces to shape what the process can see, a filesystem root to control what it can touch, and cgroups to limit what it can consume.

I would not call my tiny runtime useful, but it made the abstractions stop floating. Docker, containerd, and Kubernetes became less like brands and more like organized versions of the three scripts I had just fought into existence.
