Ctrtool

For end-user documentation on using ctrtool, see Help:Ctrtool

Ctrtool is a Linux container runtime, similar to Docker and LXC. It is actually a set of 14 discrete programs combined into a single binary. Using these programs, along with a bit of shell and Python scripting, containers can be created. Ctrtool also includes a few utilities that are not directly related to containers but are nice to have in a container environment.

Name                      Purpose
chroot_pivot              Perform a chroot and pivot_root at the same time (useful to swap one mount point with another)
debug_shell               Unix domain socket server for launching a debug shell within a container; also a mini inetd/tcpsvd clone.
(container-)launcher      Main program to start a container
mini-init                 Minimal init program that allows for things like file descriptor sharing
mount_seq                 Non-bloated version of the mount(8) command (albeit with different syntax)
ns_open_file              Open sockets in a particular network namespace and/or the root directory of a mount namespace.
pidfd_ctl                 Create and manage PID file descriptors. Shell wrapper for the pidfd_open(2) and pidfd_getfd(2) system calls.
ppid_check                Set various attributes of the current process. Most commonly, enable PR_SET_PDEATHSIG, then check the expected value of the "parent process ID". Intended to be used for the purpose of ensuring synchronous termination of a group of processes, but can be used for a number of other purposes.
(simple-)renameat2        Simple shell wrapper for the renameat2(2) system call.
reset_cgroup              Set the cgroups of the current process all at once.
(container-)rootfs-mount  Prepares the root filesystem of a container; see #Design rationale below.
set_fds                   Normalize the process's file descriptor table.
syslogd                   Simple syslog implementation that takes the UID of the logging process into account, so that different syslog messages from different containers can be differentiated.
tty_proxy                 Simple pseudoterminal relay. Allows terminals with job control and interactivity to be created on e.g. UNIX domain sockets, and also protects against TIOCSTI attacks and the like.
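
The ppid_check pattern described in the table (enable PR_SET_PDEATHSIG, then compare the parent process ID against an expected value) can be illustrated roughly as follows. This is only a sketch in Python using ctypes, not ctrtool's own code or command-line interface; the function name arm_parent_death_signal is hypothetical, and it assumes a Linux system with a libc that exports prctl.

```python
import ctypes
import os
import signal

PR_SET_PDEATHSIG = 1  # value from <linux/prctl.h>


def arm_parent_death_signal(expected_ppid: int, sig: int = signal.SIGTERM) -> bool:
    """Hypothetical helper: ask the kernel to send `sig` to this process when
    its parent exits, then confirm the parent is still the expected one.

    Returns False if the parent has already changed, in which case the
    caller should normally exit on its own. Passing sig=0 clears the setting.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    rc = libc.prctl(
        PR_SET_PDEATHSIG,
        ctypes.c_ulong(int(sig)),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
    )
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    # The comparison must come after prctl: if the parent exited before the
    # call, no signal will ever arrive, but getppid() will no longer match.
    return os.getppid() == expected_ppid
```

The order matters: checking the parent PID first and arming the signal second would leave a window in which the parent could exit unnoticed.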

History

Ctrtool was largely inspired by Lizzie Dixon's "Linux containers in 500 lines of code" blog post, which shows that it is possible to create a Linux container with only a few lines of C code (though nothing in ctrtool was derived directly from Lizzie's code). Since then, I have improved on the concept quite extensively, and ctrtool has become a more mature container runtime that is comparable in quality to Docker and LXC.

For comparison, ctrtool's container launcher has about 1,500 lines of code, and ctrtool has a total of 6,300 lines of code across all 14 programs (including common code used by multiple programs).

Another reason for making ctrtool was that Docker did not offer sufficient functionality to meet our requirements. For example, I had the following problems with Docker when I first used it:

  • The need to use copy-on-write systems for container images (tmpfs as root + squashfs subdirectory ultimately won out, but Docker didn't really support that)
  • Needs somewhat better support for non-x86 systems (ARM, for example)[1]
  • Inability to use different ID maps for different containers (i.e. everything had to have the same UID and GID map); this was important since it would have allowed use of MySQL's auth_socket plugin in a way that allowed the containers to be reliably identified from one another.
  • Better IPv6 support (we needed to route an entire /64 into a container, and then the container can set that entire /64 as local; without this, IPv6 Things would not exist or would have required very ugly hacks to containerize securely.)
  • Better support for routed prefixes (instead of using NAT)
  • Better support for direct access to containers through their link-local IPv6 addresses[2] from the host.
  • Better support for inter-container communication without using veth devices (Unix domain sockets and/or shared memory, for example. Also known as out-of-band communication.)
  • Needed better rootless container support.
  • Recursive (nested) unprivileged container support.
  • PID reuse issues (not necessarily an issue with Docker itself, but rather with docker inspect -f '{{.State.Pid}}').
  • docker run was difficult to script[3].
  • Security issues when sharing volumes (mostly symlink related)[4].
  • We needed the ability to run custom code hooks at arbitrary points before starting the container (e.g. to prepare specialized sockets).
  • Docker really likes to mess around with iptables rules, especially when publishing ports.
  • False sense of security when used in conjunction with a firewall.
  • Any container which shares the Docker unix socket (with e.g. -v /var/run/docker.sock:/var/run/docker.sock) is effectively root equivalent. Such a container can do strange and nasty things like deleting itself or running a privileged container (even with --userns-remap set) and mounting the entire host disk as a volume. To add insult to injury, the Docker daemon itself had to run as the host's "root" (i.e. in the initial user namespace), which meant that I couldn't just put Docker in a ctrtool container. Now that unprivileged overlayfs is in the kernel, this may one day be reconsidered.
  • Pulling and updating images from Docker Hub requires Internet access. I could set up a private registry, but it's a bit of a pain.
  • Finally, the endless number of hacks that I had to use in order to get around these issues.

TODO: mention the very beginning, where I still used the unshare shell command to create privileged containers.

To do list

New programs

  • make_ns: Simple program to make new user/mount/network/time/IPC/UTS/cgroup namespaces on file descriptors (note: doesn't support PID namespaces). Intended to be a somewhat more elegant version of implementing e.g. Snippets:Ctrtool unprivileged network namespace creation, where we don't necessarily want to run a container command, but rather just to create namespaces. Like with the container launcher, we could have hooks for each type of namespace:
      • User: Write to uid_map, gid_map, projid_map, and /proc/PID/setgroups. Also support entering an existing user namespace, rather than creating a new one.
      • Mount: Run a script to set up mount points (either in a privileged or unprivileged context).
      • Network: Run a script to set up a veth pair (likely in a privileged context). May be used in conjunction with ns_open_file.
      • Time: Write to /proc/PID/timens_offsets.
      • IPC: Pre-create shared memory segments with the given keys and sizes (though this might be better in a separate script that calls ipcmk or similar).
      • UTS: Set hostname and NIS domain name.
      • Cgroup: Have a set of directories to set the process's cgroups into.
      • Privileged and unprivileged context scripts (to manage all types).
      • Note that unlike the container launcher, it doesn't execute a container command; rather, it just opens the namespaces and inherits them into another host process.
  • pidns_run: Simple program to create new PID and mount namespaces (as well as to enter other types of namespaces), pivot_root, and run a command in the PID namespace, while also accounting for chroot assumptions. May be thought of as a lighter-weight version of the container launcher that doesn't have any of the features that make_ns would have.
  • unlock: Perform one or more pivot_root, mount, mount --move, open, or umount operations (list can be extended), then drop capabilities. Intended to be a mechanism where the container starts out with fresh, trusted binaries on a tmpfs or bind mount, performs a set of privileged (e.g. CAP_SYS_ADMIN) operations, and then loads a root filesystem into it, thus safely "tainting" the container in a way such that the binaries on the untrusted root filesystem do not gain those capabilities. This needs to be a single, atomic operation because of the window where the container has loaded its root filesystem, but the capabilities are not yet dropped (i.e. mount --move followed by setpriv --bounding-set=(caps) in a shell script would not suffice since it could use the container's glibc). All operations should work in "rootless" container modes. May reuse code from mount_seq. May use the statically linked version due to the absence of glibc. Bind-mounting the host's /lib and /lib64 into the container, and then immediately unmounting it (using unlock itself), might be a little better.
  • cmd: Read lines from a specified file, and use that as the command line arguments for any one of the other ctrtool commands.
  • env: Simple clone of the shell command of the same name, supporting -, -i, and -u flags. Intended to be used in conjunction with the cmd command in circumstances where the environment variables contain secret information that would otherwise be visible in /proc/PID/cmdline.

Other

  • Full rewrite of the container launcher (it's rather bloated because I originally intended container-launcher to be a one-size-fits-all tool, before any of the other tools [except possibly reset_cgroup] existed).
  • Support for chaining make_ns, ns_open_file, pidfd_ctl, set_fds, and unlock together.
  • mount_seq needs to bail out with an error or warning if it sees a non-option argument (currently it ignores them). In case non-option arguments ever become useful, it can be restored by adding another option argument.
  • mount_seq needs to support creating regular files with predetermined content, either literally or with escape sequences (e.g. it could write a conf file with predetermined content). It also needs to possibly interpret the non-option arguments as a command to exec into after performing all operations. So it would be useful to use it to bind mount /proc/self somewhere and then immediately exec a program, thus allowing the bind-mounted directory to act as the equivalent of a PID file.
  • Non-zero exit codes need to be a bit more consistent.
  • Long options for other commands.
  • The --unix-socketpair option in the container launcher may be abused (untested) in a way that allows access to (abstract) Unix domain sockets on the host, since it is created in the host's network namespace (none of my own containers use it). Ideally, the file descriptor receiving code would be much better in ctrtool itself rather than in the container. Or it could be integrated into the new "unlock" command.
  • Make the privilege-dropping code (that was originally in the container launcher) into a library function.
  • "In / We Trust" (the issue of chroot assumptions, and why it might not be a good idea to always trust paths starting with "/", cf. CVE-2019-14271)
  • Container-launcher needs to support creating new non-user namespaces in a given user namespace with --nsenter2=.
  • Fix musl libc build issues.
  • ns_open_file should support using PID file descriptors as namespace arguments, but it currently does not.

Design rationale

Ctrtool prefers a tmpfs as the root directory, with files on the filesystem provided by symlinks or bind mounts to a read-only squashfs mount. This greatly simplifies setup since the squashfs mount is read-only and can be shared between all containers, whereas other things like mounts to program data and temporary file directories are dynamically generated using the startup scripts. There is no need to use overlayfs or similar, allowing for nested unprivileged containers (overlayfs did not have FS_USERNS_MOUNT until Linux 5.11). In other words, the container root filesystem is just another volume!
To maximize efficiency, we include multiple rootfs images in one squashfs. When it comes time to actually start the container, we bind mount the appropriate subdirectory and use that as the root filesystem. In the case of the next image, we bind-mounted the /generic subdirectory from the first squashfs image. Packing multiple rootfs images in one squashfs also has the advantage that duplicate files are removed, which is useful because many of the Docker images that the rootfs images are made from use the same base images (and thus the files from those base images will be the same).
All files in the squashfs are owned by root:root (previous image), so as to not lock containers into using a certain predefined UID/GID map. Setuid/setgid bits and file capabilities (setcap) are also stripped, as they would otherwise lead to privilege escalation for processes running outside the container (in the case of a hostile or vulnerable container image). The entire squashfs will appear as nobody:nogroup inside the container (this image), indicating an unmapped user and group ID, but that's fine as we won't be able to write to the rootfs anyway.
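
The stripping of setuid/setgid bits described above could be done on the staging tree before it is packed into the squashfs. The following Python sketch shows one possible way; it is not the script actually used to build the images, the function names are hypothetical, and it does not handle file capabilities (the security.capability extended attribute), which would need a separate step.

```python
import os
import stat

PRIVILEGED_BITS = stat.S_ISUID | stat.S_ISGID


def strip_privileged_bits(mode: int) -> int:
    """Return the permission bits with setuid and setgid cleared."""
    return mode & ~PRIVILEGED_BITS


def sanitize_tree(root: str) -> int:
    """Clear setuid/setgid bits on regular files under root.

    Returns the number of files that were changed. Symlinks are never
    followed, so a link pointing outside the tree is left alone.
    """
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, device nodes, etc.
            perm = stat.S_IMODE(st.st_mode)
            if perm & PRIVILEGED_BITS:
                os.chmod(path, strip_privileged_bits(perm))
                changed += 1
    return changed
```

Only the two privileged bits are touched; the sticky bit and ordinary permissions are preserved, so a mode of 4755 becomes 755.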

Ctrtool follows a UNIX philosophy in the design of its subcommands: make each command do one thing and do it very well.

Ctrtool does not create a container by itself. Rather, it just performs the main steps needed to create a container. The author is mindful that there is no one-size-fits-all container solution, so ctrtool allows external scripts to set up things like the network and the root filesystem (the requirements for these can vary wildly between different containers):

  • Containers can be IPv4-only, dual-stack, or IPv6-only.
  • Containers can have a "local" route (i.e. a routed prefix directly into the container) (see Snippets:Nginx geo local server address); the container behind IPv6 Things is actually set up like this.
  • Containers can have their network interface routed using a dedicated interface, bridged to a common bridge (similar to "docker0" or "lxcbr0"), or even bridged onto the host's ethernet interface. The exact choice depends on a number of factors, including:
      • Whether a routed prefix (for IPv4 and/or IPv6) is available
      • Networking connectivity requirements for the application running in the container
      • Simplicity
      • NAT/connection tracking overhead
  • In the past, a variety of root filesystem schemes were tested. These include:
      • Bind-mounting a directory from the host as the root filesystem. This is the most straightforward way of doing it, but it may be prone to symlink attacks.
      • Using a read-only squashfs as the root filesystem. This makes the appearance more permanent, but the squashfs had to be rebuilt every time a new directory at the root needed to be added.
  • Ultimately, I settled on a tmpfs as the root filesystem, with bind mounts to link to specific host directories. These directories can be read-only or read-write. This has several benefits:
      • Flexibility as to the nature of the root filesystem; this same scheme allows for both a read-only /usr and a read-write /var.
      • Better security -- we build the tmpfs from scratch, which protects against things like symlink attacks when performing bind mounts.
      • This is accomplished using a helper (also within ctrtool) called "container-rootfs-mount". This process can technically also be accomplished using shell scripts, but integrating the whole thing into a single program minimizes the overhead of spawning new processes for every bind mount, symlink, directory, and file created as part of this process. Plus it just happens to be an operation that is common to all containers (regardless of the nature of the filesystems), which eliminates redundant lines in shell scripts.
      • There are many tmpfs and virtual filesystem mounts on a typical Linux system -- /tmp, /run, /dev/shm, /proc, etc. Putting all of them under a single tmpfs mount is much simpler than creating separate tmpfs mounts for each of these directories.
      • This is similar to systemd-nspawn's --volatile option, but with symlinks to _fsroot_ro and _fsroot_rw instead of directly bind mounting the root filesystems, so that the location of e.g. /var can be easily switched.
  • There is no absolute requirement that a tmpfs root be used. Ctrtool's container launcher is also designed to accept any filesystem (as a bind mount) as the root; we just use a tmpfs mainly for the security advantages.
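
The tmpfs root layout described in the list above (top-level entries that are symlinks into _fsroot_ro and _fsroot_rw, plus empty directories for virtual filesystems) can be sketched as follows. This is an illustration in Python, not how container-rootfs-mount is implemented; populate_root and its parameters are hypothetical, and the sketch only creates directories and symlinks. Mounting the tmpfs and bind-mounting the real filesystems onto _fsroot_ro and _fsroot_rw are assumed to happen separately.

```python
import os
from typing import Iterable


def _check_name(name: str) -> str:
    """Accept only a single path component, so nothing escapes the new root."""
    if not name or name in (".", "..") or "/" in name:
        raise ValueError(f"not a plain top-level name: {name!r}")
    return name


def populate_root(
    root: str,
    ro_names: Iterable[str],
    rw_names: Iterable[str],
    empty_dirs: Iterable[str],
) -> None:
    """Lay out a freshly created, empty root directory.

    ro_names become symlinks into _fsroot_ro, rw_names become symlinks into
    _fsroot_rw, and empty_dirs become plain directories (mount points for
    things like /proc or /tmp).
    """
    for mount_point in ("_fsroot_ro", "_fsroot_rw"):
        os.mkdir(os.path.join(root, mount_point), 0o755)
    for target_dir, names in (("_fsroot_ro", ro_names), ("_fsroot_rw", rw_names)):
        for name in names:
            _check_name(name)
            # Relative targets resolve the same way before and after the
            # directory becomes "/" inside the container.
            os.symlink(os.path.join(target_dir, name), os.path.join(root, name))
    for name in empty_dirs:
        os.mkdir(os.path.join(root, _check_name(name)), 0o755)
```

Because the root starts out empty and every entry is created by the helper itself, there is no pre-existing symlink that a bind mount could be tricked into following. Switching /var from read-only to read-write is then a matter of moving its name from ro_names to rw_names.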

See also

Notes

  1. Docker does support ARM systems, but the availability of container images is quite limited.
  2. Using link-local addresses effectively constrains the connection to only go towards the container, and not towards the default route, in case the Docker daemon suddenly disappears.
  3. By the time I knew about docker-compose, I had already moved onto my own container system.
  4. Mostly concerned with configuration volumes in unprivileged containers with the CAP_SYS_ADMIN capability, where a container can effectively undo the read-only flag in its bind mounts; try remounting with busybox mount -o remount,bind,rw /_fsroot_ro within a container with a user namespace with CAP_SYS_ADMIN. Tested in rootless ctrtool, but not rootless Docker. If this is of concern to you, you can create a level 1 user+mount namespace with the requested filesystem remounted or bind-mounted as read-only and run all containers in a level 2 user+mount namespace.