Ctrtool is a Linux container runtime, similar to Docker and LXC. It is actually a set of 12 discrete programs combined into a single binary. Containers are created by using these programs together with a bit of shell and Python scripting. Ctrtool also includes a few utilities that are not directly related to containers but are nice to have in a container environment.
||Unix domain socket server for launching a debug shell within a container|
||Main program to start a container|
||Minimal init program that allows for things like file descriptor sharing|
||Non-bloated version of the mount(8) command (albeit with different syntax)|
||Open sockets in a particular network namespace and/or the root directory of a mount namespace.|
||Create and manage PID file descriptors. Shell wrapper for the pidfd_open(2) and pidfd_getfd(2) system calls.|
||Simple shell wrapper for the renameat2(2) system call.|
||Set the cgroups of the current process all at once.|
||Prepares the root filesystem of a container; see #Design rationale below.|
||Normalize the process's file descriptor table.|
||Simple syslog implementation that takes the UID of the logging process into account, so that different syslog messages from different containers can be differentiated.|
||Simple pseudoterminal relay. Allows terminals with job control and interactivity to be created on e.g. UNIX domain sockets, and also protects against TIOCSTI attacks and the like.|
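The idea of telling log senders apart by UID can be illustrated with the Linux `SO_PEERCRED` socket option, which reports the PID, UID, and GID of the process at the other end of a Unix domain socket. The sketch below is only an illustration under assumptions: the function names and the `uid=` prefix are hypothetical, and ctrtool's syslog program may obtain credentials differently (for example via `SCM_CREDENTIALS` on datagram sockets).

```python
import os
import socket
import struct

# Layout of struct ucred on Linux: pid_t (int), uid_t and gid_t (unsigned int).
UCRED_FORMAT = "iII"


def peer_credentials(sock):
    """Return (pid, uid, gid) of the process on the other end of a Unix socket."""
    raw = sock.getsockopt(
        socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize(UCRED_FORMAT)
    )
    return struct.unpack(UCRED_FORMAT, raw)


def tag_message(sock, message):
    """Prefix a log line with the sender's UID (hypothetical output format)."""
    _pid, uid, _gid = peer_credentials(sock)
    return f"uid={uid} {message}"


if __name__ == "__main__":
    # A socket pair stands in for a logger and a client inside a container.
    logger_end, client_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with logger_end, client_end:
        client_end.sendall(b"service started")
        line = logger_end.recv(1024).decode()
        print(tag_message(logger_end, line))
```

Because the kernel supplies the credentials, a process cannot claim another UID simply by writing a different value into its message. With containers that use distinct UID ranges, the UID alone is then enough to tell which container a message came from.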
Ctrtool was largely inspired by Lizzie Dixon's "Linux containers in 500 lines of code" blog post, which shows that it is possible to create a Linux container with only a few lines of C code (though nothing in ctrtool was derived directly from Lizzie's code). Since then, I have improved on the concept quite extensively, and ctrtool has become a more mature container runtime that is comparable in quality to Docker and LXC.
For comparison, ctrtool's container launcher has about 1,500 lines of code, and ctrtool has a total of 5,000 lines of code across all 12 programs (including common code used by multiple programs).
TODO: mention the very beginning, where I still used the unshare shell command to create privileged containers.
To do list
New programs to add
- chroot_pivot: Chroot to a specified directory, then run pivot_root. Useful to atomically swap one mount point with another.
- Full rewrite of the container launcher.
- "In / We Trust" (the issue of chroot assumptions, and why it might not be a good idea to always trust paths starting with "/", cf. CVE-2019-14271)
Design rationale
Ctrtool does not create a container by itself. Rather, it performs only the main steps needed to create one. There is no one-size-fits-all container solution, so ctrtool leaves things like the network and the root filesystem to external scripts, because the requirements for these can vary wildly between different containers:
- Containers can be IPv4-only, dual-stack, or IPv6-only.
- Containers can have a "local" route (i.e. a routed prefix directly into the container) (see Snippets:Nginx geo local server address); the container behind IPv6 Things is actually set up like this.
- Containers can have their network interface routed using a dedicated interface, bridged to a common bridge (similar to "docker0" or "lxcbr0"), or even bridged onto the host's Ethernet interface. The exact choice depends on a number of factors, including:
  - Whether a routed prefix (for IPv4 and/or IPv6) is available
  - Networking connectivity requirements for the application running in the container
  - NAT/connection tracking overhead
- In the past, a variety of root filesystem schemes were tested. These include:
  - Bind-mounting a directory from the host as the root filesystem. This is the most straightforward approach, but it may be prone to symlink attacks.
  - Using a read-only squashfs as the root filesystem. This makes the appearance more permanent, but the squashfs had to be rebuilt every time a new top-level directory needed to be added.
  - Ultimately, I settled on a tmpfs as the root filesystem, with bind mounts to link to specific host directories. These directories can be read-only or read-write. This has several benefits:
    - Flexibility as to the nature of the root filesystem; the same scheme allows for both a read-only /usr and a read-write /var.
    - Better security: building the tmpfs from scratch protects against things like symlink attacks when performing bind mounts.
    - This is accomplished using a helper (also within ctrtool) called "container-rootfs-mount". The same process could technically be carried out with shell scripts, but integrating it into a single program avoids the overhead of spawning a new process for every bind mount, symlink, directory, and file created along the way. It is also an operation common to all containers (regardless of the nature of the filesystems), so handling it in one place eliminates redundant lines in shell scripts.
    - There are many tmpfs and virtual filesystem mounts on a typical Linux system (/tmp, /run, /dev/shm, /proc, etc.). Putting all of them under a single tmpfs mount is much simpler than creating a separate tmpfs mount for each of these directories.
    - This is similar to systemd-nspawn's --volatile option, but with symlinks to _fsroot_ro and _fsroot_rw instead of direct bind mounts of the root filesystems, so that the location of e.g. /var can be easily switched.
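The tmpfs-plus-symlinks scheme described above can be sketched without any privileges by building the same directory layout inside a temporary directory. This is only an illustration under assumptions: the function name, its parameters, and the use of relative symlink targets are hypothetical, and the real container-rootfs-mount helper would additionally mount a tmpfs and bind-mount host directories onto _fsroot_ro and _fsroot_rw, which this sketch does not do.

```python
import os
import tempfile


def build_root_layout(root, read_only=("usr",), read_write=("var",),
                      scratch=("tmp", "run", "dev/shm")):
    """Populate `root` (standing in for a fresh tmpfs) with the assumed layout."""
    # Mount points where the host's read-only and read-write trees would be
    # bind-mounted; here they are just empty directories.
    os.makedirs(os.path.join(root, "_fsroot_ro"))
    os.makedirs(os.path.join(root, "_fsroot_rw"))

    # Top-level names are symlinks, so moving e.g. /var to a different backing
    # tree only means retargeting one link.
    for name in read_only:
        os.symlink(os.path.join("_fsroot_ro", name), os.path.join(root, name))
    for name in read_write:
        os.symlink(os.path.join("_fsroot_rw", name), os.path.join(root, name))

    # Scratch directories live directly on the single tmpfs instead of each
    # getting a tmpfs mount of its own.
    for name in scratch:
        os.makedirs(os.path.join(root, name))

    return sorted(os.listdir(root))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for entry in build_root_layout(root):
            path = os.path.join(root, entry)
            target = " -> " + os.readlink(path) if os.path.islink(path) else "/"
            print(entry + target)
```

Since every entry is created from scratch in an empty directory, no pre-existing symlink can redirect a later step, which is the property the bind-mounted host directory approach lacks.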