Ctrtool

<!--   from peterjin.org, the official website of Peter H. Jin -->

Ctrtool is a Linux container runtime, similar to Docker and LXC. Ctrtool is actually a set of 12 discrete programs, all combined into a single binary. Using these programs, along with a bit of shell and Python scripting, containers can be created. Ctrtool also includes a few other utilities that are not directly related to containers but are nice to have in a container environment.

Name                      Purpose
debug_shell               Unix domain socket server for launching a debug shell within a container
(container-)launcher      Main program to start a container
mini-init                 Minimal init program that allows for things like file descriptor sharing
mount_seq                 Non-bloated version of the mount(8) command (albeit with different syntax)
ns_open_file              Open sockets in a particular network namespace and/or the root directory of a mount namespace.
pidfd_ctl                 Create and manage PID file descriptors. Shell wrapper for the pidfd_open(2) and pidfd_getfd(2) system calls.
(simple-)renameat2        Simple shell wrapper for the renameat2(2) system call.
reset_cgroup              Set the cgroups of the current process all at once.
(container-)rootfs-mount  Prepares the root filesystem of a container; see #Design rationale below.
set_fds                   Normalize the process's file descriptor table.
syslogd                   Simple syslog implementation that takes the UID of the logging process into account, so that different syslog messages from different containers can be differentiated.
tty_proxy                 Simple pseudoterminal relay. Allows terminals with job control and interactivity to be created on e.g. UNIX domain sockets, and also protects against TIOCSTI attacks and the like.
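
The syslogd entry above relies on knowing which user sent a log message. How ctrtool obtains that UID is not described here, so the following is only a minimal Python sketch of one common Linux mechanism: asking the kernel for the peer credentials of a connected UNIX domain socket with SO_PEERCRED. The function names peer_credentials and tag_message are hypothetical and exist only for this illustration.

```python
import os
import socket
import struct
from typing import Tuple

# Layout of struct ucred on Linux: pid_t (signed), uid_t and gid_t (unsigned).
UCRED_FORMAT = "iII"


def peer_credentials(sock: socket.socket) -> Tuple[int, int, int]:
    """Return (pid, uid, gid) of the process at the other end of a
    connected AF_UNIX socket, as recorded by the kernel."""
    raw = sock.getsockopt(
        socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize(UCRED_FORMAT)
    )
    pid, uid, gid = struct.unpack(UCRED_FORMAT, raw)
    return pid, uid, gid


def tag_message(sock: socket.socket, message: str) -> str:
    """Prefix a log message with the sender's UID (hypothetical format)."""
    _pid, uid, _gid = peer_credentials(sock)
    return f"[uid={uid}] {message}"


if __name__ == "__main__":
    # Both ends of a socketpair belong to this process, so the reported
    # credentials are our own.
    server_end, client_end = socket.socketpair()
    print(tag_message(server_end, "hello from the other end"))
    server_end.close()
    client_end.close()
```

Because the kernel fills in the credentials, the sender cannot forge them, and containers with disjoint UID maps show up under distinct UIDs on the host. A datagram-based log socket would need a different mechanism (credentials passed as ancillary data), which this sketch does not cover.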

History

Ctrtool was largely inspired by Lizzie Dixon's "Linux containers in 500 lines of code" blog post, which shows that it is possible to create a Linux container with only a few hundred lines of C code (though nothing in ctrtool was derived directly from Lizzie's code). Since then, I have improved on the concept quite extensively, and now ctrtool has become a more mature container runtime that is comparable in quality to Docker and LXC.

For comparison, ctrtool's container launcher has about 1,500 lines of code, and ctrtool has a total of 5,000 lines of code across all 12 programs (including common code used by multiple programs).

TODO: mention the very beginning, where I still used the unshare shell command to create privileged containers.

To do list

New programs to add

  • chroot_pivot: Chroot to a specified directory, then run pivot_root. Useful to atomically swap one mount point with another.

Other

  • Full rewrite of the container launcher.
  • "In / We Trust" (the issue of chroot assumptions, and why it might not be a good idea to always trust paths starting with "/", cf. CVE-2019-14271)

Design rationale

Ctrtool prefers a tmpfs as the root directory, with files on the filesystem provided by symlinks or bind mounts to a read-only squashfs mount. This greatly simplifies setup since the squashfs mount is read-only and can be shared between all containers, whereas other things like mounts to program data and temporary file directories are dynamically generated using the startup scripts. There is no need to use overlayfs or similar, allowing for nested unprivileged containers (as overlayfs does not have FS_USERNS_MOUNT). In other words, the container root filesystem is just another volume!
To maximize efficiency, we include multiple rootfs images in one squashfs. When it comes time to actually start the container, we bind mount the appropriate subdirectory and use that as the root filesystem. In the case of the next image, we bind-mounted the /generic subdirectory from the first squashfs image. Packing multiple rootfs images in one squashfs also has the advantage that duplicate files are stored only once, which is useful because many of the Docker images that the rootfs images are made from use the same base images (and thus the files from those base images will be the same).
All files in the squashfs are owned by root:root (previous image), so as to not lock containers into using a certain predefined UID/GID map. Setuid/setgid bits and file capabilities (setcap) are also stripped, as they could otherwise lead to privilege escalation for processes running outside the container (in the case of a hostile or vulnerable container image). The entire squashfs will appear as nobody:nogroup inside the container (this image), indicating an unmapped user and group ID, but that's fine as we won't be able to write to the rootfs anyway.
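
To make the stripping step concrete, here is a minimal Python sketch of clearing setuid/setgid bits from an unpacked rootfs tree before it is packed into a squashfs. It is not the script used to build the actual images; sanitized_mode and strip_privileges are hypothetical names, and the sketch deliberately leaves out ownership changes and the removal of file capabilities (which are stored in the security.capability extended attribute), since both normally require privileges.

```python
import os
import stat

# Mode bits that let a file run with the privileges of its owner or group.
PRIVILEGE_BITS = stat.S_ISUID | stat.S_ISGID


def sanitized_mode(mode: int) -> int:
    """Return the permission bits of `mode` with setuid and setgid cleared."""
    return stat.S_IMODE(mode) & ~PRIVILEGE_BITS


def strip_privileges(tree: str) -> list:
    """Clear setuid/setgid bits on every regular file under `tree`.

    Returns the list of paths whose mode was changed. Symlinks are never
    followed, so a link pointing outside the tree is left alone.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(tree):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, device nodes, sockets, etc.
            new_mode = sanitized_mode(st.st_mode)
            if new_mode != stat.S_IMODE(st.st_mode):
                os.chmod(path, new_mode)
                changed.append(path)
    return changed
```

For example, sanitized_mode(0o4755) gives 0o755: the file stays executable, but no longer runs as its owner.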

Ctrtool does not create a container by itself. Rather, it just performs the main steps needed to create a container. The author is mindful that there is no one-size-fits-all container solution, so ctrtool allows for external scripts to set up things like the network and the root filesystem (the requirements for these can vary wildly between different containers):

  • Containers can be IPv4-only, dual-stack, or IPv6-only.
  • Containers can have a "local" route (i.e. a routed prefix directly into the container) (see Snippets:Nginx geo local server address); the container behind IPv6 Things is actually set up like this.
  • Containers can have their network interface routed using a dedicated interface, bridged to a common bridge (similar to "docker0" or "lxcbr0"), or even bridged onto the host's Ethernet interface. The exact choice depends on a number of factors, including:
      • Whether a routed prefix (for IPv4 and/or IPv6) is available
      • Networking connectivity requirements for the application running in the container
      • Simplicity
      • NAT/connection tracking overhead
  • In the past, a variety of root filesystem schemes were tested. These include:
      • Bind-mounting a directory from the host as the root filesystem. This is the most straightforward way of doing it, but it may be prone to symlink attacks.
      • Using a read-only squashfs as the root filesystem. This makes the appearance more permanent, but the squashfs had to be rebuilt every time a new directory at the root needed to be added.
      • Ultimately, I settled on a tmpfs as the root filesystem, with bind mounts to link to specific host directories. These directories can be read-only or read-write. This has several benefits:
          • Flexibility as to the nature of the root filesystem; this same scheme allows for both a read-only /usr and a read-write /var.
          • Better security -- building the tmpfs from scratch protects against things like symlink attacks when performing bind mounts.
          • This is accomplished using a helper (also within ctrtool) called "container-rootfs-mount". This process can technically also be accomplished using shell scripts, but integrating the whole thing into a single program minimizes the overhead of spawning new processes for every bind mount, symlink, directory, and file created as part of this process. Plus it just happens to be an operation that is common to all containers (regardless of the nature of the filesystems), which eliminates redundant lines in shell scripts.
          • There are many tmpfs and virtual filesystem mounts on a typical Linux system -- /tmp, /run, /dev/shm, /proc, etc. Putting all of them under a single tmpfs mount is much simpler than creating separate tmpfs mounts for each of these directories.
          • This is similar to systemd-nspawn's --volatile option, but with symlinks to _fsroot_ro and _fsroot_rw instead of directly bind mounting the root filesystems, such that the location of e.g. /var can be easily switched.
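
The symlink indirection described in the last point can be sketched in Python. This is only an illustration under stated assumptions, not what container-rootfs-mount actually does: a temporary directory stands in for the fresh tmpfs, plain directories stand in for the _fsroot_ro and _fsroot_rw mount points (so no privileges are needed), and the choice of usr, etc, and var as the linked directories is assumed. The helper names build_root and relink are hypothetical.

```python
import os
import tempfile


def build_root(root, ro_names=("usr", "etc"), rw_names=("var",),
               plain_dirs=("tmp", "run")):
    """Populate `root` (standing in for an empty tmpfs) with top-level
    entries that point into _fsroot_ro or _fsroot_rw."""
    # Stand-ins for the mount points. A real setup would bind mount the
    # read-only squashfs subdirectory and the writable data directory here.
    for mount_point in ("_fsroot_ro", "_fsroot_rw"):
        os.mkdir(os.path.join(root, mount_point))
    # Relative symlinks, so they resolve inside the container root.
    for name in ro_names:
        os.symlink(os.path.join("_fsroot_ro", name), os.path.join(root, name))
    for name in rw_names:
        os.symlink(os.path.join("_fsroot_rw", name), os.path.join(root, name))
    # Directories that live directly on the tmpfs itself.
    for name in plain_dirs:
        os.mkdir(os.path.join(root, name))


def relink(root, name, base):
    """Atomically repoint root/name at base/name by renaming a new symlink
    over the old one."""
    temporary = os.path.join(root, "." + name + ".new")
    os.symlink(os.path.join(base, name), temporary)
    os.replace(temporary, os.path.join(root, name))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        build_root(root)
        relink(root, "var", "_fsroot_ro")  # e.g. make /var read-only
        for entry in sorted(os.listdir(root)):
            path = os.path.join(root, entry)
            target = os.readlink(path) if os.path.islink(path) else "(directory)"
            print(entry, "->", target)
```

Switching where /var lives is then a single symlink update rather than a change to the set of bind mounts, which is the flexibility the comparison with --volatile points at.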

See also