Creating a custom container script
To run all of its web apps efficiently, the servers behind peterjin.org are not monolithic servers (i.e. servers with only a single filesystem and directory tree), but rather complex systems of namespaces and unprivileged containers. Because of the limitations of the LXC tools and Docker, we use neither; instead, we use our own tools. These tools allow us to customize each container to the exact requirements of the applications running in it.
Generally, these requirements include, but are not limited to:
- User namespace
- Number of UIDs/GIDs required. Some containers require just one (the root user), others require two (one root and one unprivileged), and others require a hundred or a thousand.
- Non-overlapping UID/GID maps for even greater security
- "Relatively privileged" containers: a privileged container created inside an unprivileged container. The new container is still unprivileged with respect to the host, but is "privileged" relative to the unprivileged container that holds it. This would likely require CAP_SYS_ADMIN in the unprivileged container's (bounding) capability set.
- Bind-mounting a read-only squashfs filesystem. Advantages: easy to replace, can be shared between multiple containers, and can be compressed to save disk space. Disadvantages: the filesystem is read-only, so new programs can't be installed as easily.
- Bind-mounting the host's /usr, which is usually safe because the unprivileged container's root user usually has no write access to it. Advantages: the same as squashfs, but new programs are easy to install. Disadvantages: read-only from inside the container, and installing a new program means installing it on the host.
- Maintaining a separate filesystem tree. Advantages: programs can be installed from inside the container, and the filesystem is read-write. Disadvantages: the filesystem can be (maliciously) modified, (usually) can't be shared between containers, and software updates have to be maintained inside the container.
- Consideration of /etc, /usr, and /var: should they be a predefined tree, a persistent filesystem, or inherited from the squashfs image?
- In-place building of the host's rootfs mount
- Network options: share the host's network configuration, or create a separate network namespace.
- Routing of IPv4 and IPv6 addresses and prefixes to the container
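The UID/GID requirements above (a per-container ID count and non-overlapping maps) can be illustrated with a small allocator. This is only a minimal sketch and not the tooling used on these servers: the function names, the container names, and the starting host ID of 100000 are all assumptions chosen for illustration. The output lines follow the "inside outside count" layout that /proc/PID/uid_map and /proc/PID/gid_map expect.

```python
def allocate_id_maps(sizes, first_host_id=100000):
    """Give each container its own contiguous range of host IDs.

    sizes maps a (hypothetical) container name to the number of
    UIDs/GIDs it needs. Ranges are handed out back to back, so no
    two containers ever share a host ID.
    first_host_id is an assumed starting point, not a fixed rule.
    """
    maps = {}
    next_free = first_host_id
    for name, count in sizes.items():
        if count < 1:
            raise ValueError(f"{name}: need at least one ID, got {count}")
        # Container ID 0 (root) maps to the first host ID of the range.
        maps[name] = (0, next_free, count)
        next_free += count
    return maps


def format_id_map(entry):
    """Render one entry in the uid_map/gid_map line format."""
    inside, outside, count = entry
    return f"{inside} {outside} {count}\n"


if __name__ == "__main__":
    # Hypothetical containers needing 1, 2, and 1000 IDs.
    example = allocate_id_maps({"single": 1, "pair": 2, "many": 1000})
    for name, entry in example.items():
        print(name, format_id_map(entry), end="")
```

Because each range starts where the previous one ended, a process that escapes one container holds host IDs that own nothing in any other container. Note that writing a map with more than one ID generally requires privilege in the parent user namespace (or a helper such as newuidmap).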
Generally, a script will consist of the following parts:
- Create the user and network namespaces using the nsfd-create utility.
- Create a veth pair on the host network namespace and move one of the interfaces into the newly created network namespace.
- Actually enter the user and network namespaces.
- Close the excess file descriptors that were created by nsfd-create.
- Create the rest of the namespaces (IPC, cgroup, mount, and UTS).
- Configure the network and filesystem mount points.
- Enter the newly created filesystem using pivot_root, create the PID namespace, and execute the "init" program.
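Two of the steps above can be sketched with standard tools. The sketch below only builds command lines and does not run them. It assumes the stock iproute2 `ip` command for the veth pair and the util-linux `unshare` command for the remaining namespaces, as a stand-in for our own utilities; it does not show how nsfd-create is invoked. The interface names and the PID are hypothetical.

```python
import shlex


def veth_commands(host_if, container_if, netns_pid):
    """Commands to run in the host network namespace.

    netns_pid is the PID of a process that already lives in the
    target network namespace (an assumption of this sketch).
    """
    return [
        # Create both ends of the veth pair on the host.
        ["ip", "link", "add", host_if, "type", "veth",
         "peer", "name", container_if],
        # Move the peer end into the container's network namespace.
        ["ip", "link", "set", container_if, "netns", str(netns_pid)],
        # Bring up the end that stays on the host.
        ["ip", "link", "set", host_if, "up"],
    ]


def remaining_namespaces_command(inner):
    """unshare(1) call that creates the IPC, cgroup, mount, and UTS
    namespaces and then runs the given inner command."""
    return ["unshare", "--ipc", "--cgroup", "--mount", "--uts", "--"] + list(inner)


if __name__ == "__main__":
    # Hypothetical interface names and PID, for illustration only.
    for argv in veth_commands("veth-host0", "veth-ct0", 12345):
        print(shlex.join(argv))
    print(shlex.join(remaining_namespaces_command(["/bin/sh"])))
```

Keeping the commands as argument lists rather than shell strings avoids quoting problems if they are later passed to something like subprocess.run.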
The following template script illustrates all of this in action:
- With the exception of apps-vm8 at the time of writing; this will be changed later.