Notes about namespaces

Namespaces ([1]) on Linux are pretty cool. They let you do certain things that previously could only be done with virtual machines, and also accomplish things that are not possible even with virtual machines, especially where shared resources are involved.

Note: A lot of the tricks described here are untested. Use at your own risk. If you know whether something works or not, please contact me.

General / Miscellaneous

  • There are eight types of namespaces as of Linux 5.6 -- cgroup, IPC, mount, network, PID, time, user, and UTS (hostname).
  • Creating a namespace requires the CAP_SYS_ADMIN capability (i.e. root access), with the exception of user namespaces, which can be created by an unprivileged (non-root) user.
  • Namespaces can be created using the CLONE_NEW* flags of the clone() syscall or the unshare() syscall.
  • A user namespace provides a kind of root access for creating other namespaces. Once a process has created a user namespace, it can create the other namespace types as well; however, those namespaces are limited to its own processes.
  • Although a process can only be in one network or mount namespace at a time, it can still hold directory file descriptors, pipes, sockets, or other file descriptors received from other namespaces, whether by keeping them open across a namespace transition or by receiving them over a UNIX domain socket. This allows for fine-grained resource sharing that is unobtainable with virtual machines. The inet-relay and socketbox programs rely on this premise to effectively "teleport" a TCP connection from one namespace to another, independent of other common techniques like virtual Ethernet devices. socketbox-preload, as well as ns_open_file in ctrtool, generalizes this concept, allowing virtually any proxy-like program to be used this way.
  • Unix domain sockets identified by a pathname can be accessed by any process with filesystem and mode access to that socket, regardless of its user, mount, or network namespaces. For example, if there is a MySQL instance listening on the mysqld.sock file in the filesystem, and a different process on the system creates new user and network namespaces, then it can still access the MySQL socket as long as it has filesystem access to that socket. This basically forms the core premise of socketbox, and is simply not possible with virtual machines. In addition, in the case of MySQL, if the user namespace has a unique UID and GID range not shared with any other namespace on the system, then you can actually use the auth_socket[1] (unix_socket[2] in MariaDB) authentication plugin, as long as the user within the namespace also exists on the host.
  • To use clone() in a way that acts like fork() but with additional CLONE_* flags, use syscall(SYS_clone, [flags], 0, 0, 0, 0).
  • Since containerized processes run like any other normal process on the host system from the perspective of the kernel (i.e. they are visible in ps aux or top), they do not suffer from the same slowdown effects often found in virtual machines.
  • Unlike most other types of file descriptors, namespace file descriptors are usually not capabilities (as in a capability-based security system), as not only are the permissions on them always 444 and can't be changed, but the kernel actually checks the capabilities of the process performing the operation (see [2]) whenever an operation is performed on the namespace file descriptor; the mere fact that the process has a handle on the namespace file is not sufficient.
  • When using setns() with a PID file descriptor, PTRACE_MODE_READ_REALCREDS[3] on the target process is required.
  • In my container implementations, only the user and network namespaces require privileges to prepare. Creating the user namespace requires privileges to write user and group ID maps, and creating the network namespace requires privileges to add a virtual Ethernet device from the host into the namespace. The user and network namespaces can both be obtained from a single socket using the technique described in #Network namespaces.
  • The /proc/PID/ns/{cgroup,ipc,mnt,net,pid,time,user,uts} references (including bind mounts thereof) are the file representations of a namespace, consistent with Unix's "everything is a file" concept. These files are usually opened with the O_RDONLY flag, and thus they can be bound to file descriptors on the shell using the < redirection operator e.g. 3</proc/self/ns/net to bind the network namespace of the current process to file descriptor 3. The namespace can later be retrieved using /proc/self/fd/3 as usual.
  • The required process capabilities to perform operations in other namespaces are as follows:

Creating namespaces

User namespaces allow you to perform actions that would otherwise be restricted to root, with the understanding that root access is only needed if it affects a resource that is used by something more privileged. Here, we create a new user namespace. We are unable to remove /etc/passwd (note the changes in file ownership), but we are still able to create a new network namespace along with it and bring up the loopback interface there.
Non-root users are normally not allowed to change the host name. Creating a new user namespace, however, is still not sufficient, as the process's UTS namespace is still owned by the initial user namespace. Only when we actually unshare the UTS namespace within will we actually be able to change the hostname. Note that when we exit from the shell, the hostname reverts back to its original value.

Namespace type | Required capability
User | none[4]
cgroup, IPC, mount, network, PID, time, UTS | CAP_SYS_ADMIN in the current user namespace

By default, the new namespace will be owned by the same user namespace that the creating process (the one that called unshare() or clone()) had at the time the new namespace was created.

If CLONE_NEWUSER is passed to unshare() or clone(), the new user namespace will be a child of the user namespace that the creating process had at that time.

If both CLONE_NEWUSER and one or more non-user namespace types are passed to unshare() or clone(), the new non-user namespaces will be owned by the newly created user namespace.

For privileged operations (such as changing the hostname on a UTS namespace), capability checks will be made relative to the user namespace that is the owner of the non-user namespace.

Suppose we have the following:

  • a user namespace USERNS1
  • a user namespace USERNS2 whose parent namespace is USERNS1, and whose owner UID is 100 in USERNS1
  • a user namespace USERNS3 whose parent namespace is USERNS2, and whose owner UID is 50 in USERNS2
  • a user namespace USERNS4 whose parent namespace is USERNS1, and whose owner UID is 101 in USERNS1
  • a UTS namespace UTSNS1 whose owning user namespace is USERNS1
  • a UTS namespace UTSNS2 whose owning user namespace is USERNS2
  • a UTS namespace UTSNS3 whose owning user namespace is USERNS3
  • a UTS namespace UTSNS4 whose owning user namespace is USERNS4

Therefore,

  • A process which runs in USERNS1 and has the CAP_SYS_ADMIN capability in it is able to change the hostname in all four UTS namespaces.
  • A process which runs in USERNS2 and has the CAP_SYS_ADMIN capability in it is able to change the hostname in UTSNS2 and UTSNS3.
  • A process which runs in USERNS3 and has the CAP_SYS_ADMIN capability in it is able to change the hostname in UTSNS3 only.
  • A process which runs in USERNS4 and has the CAP_SYS_ADMIN capability in it is able to change the hostname in UTSNS4 only.
  • A process which runs in USERNS1 and has an effective user ID of 100 is able to change the hostname for UTSNS2 and UTSNS3, provided that it enters USERNS2 first; no capabilities in USERNS1 are required.
  • A process which runs in USERNS1 and has an effective user ID of 101 is able to change the hostname for UTSNS4, provided that it enters USERNS4 first; no capabilities in USERNS1 are required.
  • A process which runs in USERNS2 and has an effective user ID of 50 in USERNS2 is able to change the hostname for UTSNS3 only, provided that it enters USERNS3 first; no capabilities in USERNS2 are required.
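
The rules in the list above can be reproduced with a small model of the capability check. This is only a sketch, not kernel code: `UserNS`, `Process`, and `has_capability` are hypothetical names, owner UIDs are compared as plain integers expressed in the parent namespace (as in the list above), and USERNS1 is treated as the root of the model. A result of True for the owner-UID cases means the process holds the capability relative to that user namespace, which is what allows it to enter the namespace in the first place.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UserNS:
    name: str
    parent: Optional["UserNS"] = None
    owner_uid: Optional[int] = None  # owner's UID, expressed in the parent namespace

    @property
    def level(self) -> int:
        return 0 if self.parent is None else self.parent.level + 1


@dataclass(frozen=True)
class Process:
    userns: UserNS
    euid: int      # effective UID as seen in self.userns
    has_cap: bool  # whether the capability is in the effective set


def has_capability(proc: Process, target: UserNS) -> bool:
    """Walk from the target user namespace up towards the root."""
    ns = target
    while True:
        if ns is proc.userns:
            return proc.has_cap  # same namespace: the effective set decides
        if ns.level <= proc.userns.level:
            return False         # target is not a descendant of ours
        if ns.parent is proc.userns and ns.owner_uid == proc.euid:
            return True          # owner UID rule
        ns = ns.parent


ns1 = UserNS("USERNS1")
ns2 = UserNS("USERNS2", ns1, 100)
ns3 = UserNS("USERNS3", ns2, 50)
ns4 = UserNS("USERNS4", ns1, 101)
UTS_OWNER = {"UTSNS1": ns1, "UTSNS2": ns2, "UTSNS3": ns3, "UTSNS4": ns4}


def hostnames_changeable_by(proc: Process) -> list:
    """Names of the UTS namespaces whose owner grants proc the capability."""
    return sorted(n for n, owner in UTS_OWNER.items() if has_capability(proc, owner))
```

For example, `hostnames_changeable_by(Process(ns1, 100, False))` yields UTSNS2 and UTSNS3, matching the fifth item of the list.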

Entering other namespaces

[5]

Namespace type | Required capability in own user namespace | Required capability in target namespace
User | none | CAP_SYS_ADMIN[6]
Mount | CAP_SYS_ADMIN and CAP_SYS_CHROOT | CAP_SYS_ADMIN
cgroup, IPC, network, PID, time, UTS | CAP_SYS_ADMIN | CAP_SYS_ADMIN
netlink IFLA_NET_NS_FD (e.g. with ip link set <dev> netns <netns>) | none[7] (untested) | CAP_NET_ADMIN

A common way of entering a foreign non-user namespace owned by a user namespace without root privileges is to enter the user namespace first before entering all other namespaces:

nsenter -t TARGET-PID --user --net /bin/bash

This works because entering the user namespace first gives the process CAP_SYS_ADMIN in the "current user namespace", which is now the new (target) user namespace. It can subsequently enter the new network namespace (generally assumed to be owned by the target user namespace), because it now has the CAP_SYS_ADMIN capability in its own (target) user namespace, as well as in the user namespace that owns the new network namespace (they are technically the same user namespace).
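
The same ordering can be sketched in Python with setns() called through ctypes. This is a minimal, untested-on-your-system sketch: `ns_paths` and `enter` are hypothetical helper names, and the order of the non-user namespaces in `NS_ORDER` is an arbitrary choice of this example. The only property it is meant to show is that the user namespace is entered first and that all files are opened before any namespace is switched.

```python
import ctypes
import os

# User namespace first; the order of the remaining types is this sketch's choice.
NS_ORDER = ["user", "cgroup", "ipc", "uts", "net", "pid", "time", "mnt"]


def ns_paths(pid, types):
    """Return the /proc/PID/ns/* paths for `types`, user namespace first."""
    unknown = set(types) - set(NS_ORDER)
    if unknown:
        raise ValueError(f"unknown namespace types: {sorted(unknown)}")
    return [f"/proc/{pid}/ns/{t}" for t in NS_ORDER if t in types]


def enter(pid, types):
    """Enter the given namespaces of process `pid` (needs suitable privileges)."""
    libc = ctypes.CDLL(None, use_errno=True)
    # Open everything up front: after a switch, /proc may no longer
    # show the target process.
    fds = [os.open(p, os.O_RDONLY | os.O_CLOEXEC) for p in ns_paths(pid, types)]
    try:
        for fd in fds:
            if libc.setns(fd, 0) != 0:  # 0 = accept any namespace type
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
    finally:
        for fd in fds:
            os.close(fd)
```

Calling `enter(target_pid, ["net", "user"])` followed by an exec of a shell would roughly correspond to the nsenter command above.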

User namespaces

  • If a process that creates a user namespace needs privileged access to a file on the host, but doesn't want any access to it after the process has finished initializing, it can perform the following steps:
1. Have the privileged file have a group ID which lies outside of the user namespace's GID map.
2. Before creating the user namespace, call setgroups() to set its supplementary group IDs to include the GID of the privileged file, along with any other group IDs that the process may need access to.
3. Call clone(CLONE_NEWUSER) to create a new user namespace.
4. Write the UID and GID maps as usual. Note that you might need to use clone() and not unshare() to create the new user namespace, since you need CAP_SETGID in the original user namespace to write the gid_map file. Note that if you wrote "deny" to the setgroups file, you will not be able to perform step 6.
5. Perform any operations that might require the privileged file. Note that if you open a file, you will still be able to operate on that file even after step 6, as long as the file descriptor is kept open.
6. Perform setgroups(0, NULL) to drop group memberships. Since the GID of the file is now outside the user namespace, it will not be possible to specify it in any further setgroups() or setgid() call to gain access to it.

The above procedure is most useful in the context of a bind mount when you want to restrict access in a way that prevents unprivileged users on the host from accessing the filesystem, but you also want unprivileged users in the container to be able to access it.

  • You lose any and all capabilities in the original user namespace when you call clone() or unshare() with CLONE_NEWUSER. For this reason, you can't simply write your UID and GID map after creating the namespace, because you wouldn't have the necessary capabilities (CAP_SETUID, CAP_SETGID) in the parent user namespace to write the UID and GID map files. There are two ways around this:
1. Call clone(CLONE_NEWUSER), have the child process wait until uid_map is set, then run the containerization routines. While the child process waits, in the parent process, open up the forked process's UID and GID maps, and write the mappings from there. Note that unless the signal disposition of SIGCHLD is set to SIG_IGN, the forked process's /proc/PID directory will continue to exist even if it terminates for any reason, so the risk of PID reuse is mitigated unless wait(), waitid(), or waitpid() is called (which you shouldn't).
2. Call clone(CLONE_NEWUSER), then call pause() in the forked process. While this is happening, have the parent process write the UID and GID maps in the /proc/PID directory of the forked process. Next, open the forked process's /proc/PID/ns/user file in the parent process. Then, use kill() to terminate the forked process. Finally, use setns() on the /proc/PID/ns/user file to enter the new user namespace.
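
The first approach can be sketched in Python as follows. Some assumptions: Python has no direct clone() wrapper, so this sketch substitutes fork() followed by unshare(CLONE_NEWUSER) in the child, which leaves the parent's capabilities untouched in the same way; two pipes stand in for "wait until uid_map is set"; and `format_id_map`, `write_maps`, and `spawn_in_userns` are hypothetical names. The caller is expected to wait for the returned PID only after it is done with the child's /proc files.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000


def format_id_map(entries):
    """Render (inside_id, outside_id, count) tuples as uid_map/gid_map text."""
    return "".join(f"{inside} {outside} {count}\n" for inside, outside, count in entries)


def write_maps(pid, uid_entries, gid_entries, deny_setgroups=True):
    """Write the ID maps of `pid` from the parent user namespace."""
    if deny_setgroups:
        # An unprivileged writer must do this before gid_map; a privileged
        # one may skip it to keep setgroups() usable inside the namespace.
        with open(f"/proc/{pid}/setgroups", "w") as f:
            f.write("deny")
    with open(f"/proc/{pid}/uid_map", "w") as f:
        f.write(format_id_map(uid_entries))
    with open(f"/proc/{pid}/gid_map", "w") as f:
        f.write(format_id_map(gid_entries))


def spawn_in_userns(child_main, uid_entries, gid_entries, deny_setgroups=True):
    """Fork a child into a new user namespace; returns the child's PID."""
    ready_r, ready_w = os.pipe()  # child -> parent: namespace exists
    go_r, go_w = os.pipe()        # parent -> child: maps are written
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            os.close(ready_r)
            os.close(go_w)
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER) == 0:
                os.write(ready_w, b"x")
                if os.read(go_r, 1) == b"x":  # EOF means the parent gave up
                    child_main()
                    status = 0
        finally:
            os._exit(status)
    os.close(ready_w)
    os.close(go_r)
    try:
        if os.read(ready_r, 1) != b"x":
            raise OSError("child could not create a user namespace")
        write_maps(pid, uid_entries, gid_entries, deny_setgroups)
        os.write(go_w, b"x")
    finally:
        os.close(ready_r)
        os.close(go_w)
    return pid
```

An unprivileged caller would typically pass single-entry maps such as `[(0, os.geteuid(), 1)]` and `[(0, os.getegid(), 1)]`; larger ranges need CAP_SETUID and CAP_SETGID in the parent user namespace, as described above.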
  • If a process creates or joins a new user namespace and then calls execve(), then it will no longer have capabilities unless the process has UID 0 in that namespace. To fix that without writing the uid_map, prior to calling execve(), use capget() and capset() to copy all capabilities from the permitted set into the inheritable set, then raise all ambient capabilities in the permitted set by calling prctl[3](PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, [n]) on each bit position that is 1 in the permitted set that you originally retrieved with capget(). Privileges can be later dropped by clearing both the inheritable and permitted sets. If you use this technique and want to ensure that the process no longer gains capabilities because uid_map was written and the mapped UID of the process becomes 0, then you can set the SECBIT_NOROOT securebit ([4]) and either lock it or remove CAP_SETPCAP from the bounding set. Note that any existing locked securebits will not interfere with this operation, since all securebits are cleared and unlocked when entering a new user namespace.
  • To restrict the capability set of a root (effective UID=0 in its user namespace) process, you can:
1. Clear the capabilities from the bounding set. Note that this will only take effect once you call execve(). If you want the changes to take effect immediately, clear the capabilities from the permitted set. Note that if you clear the permitted set without clearing the bounding set, then call execve(), it will just regain those capabilities.
2. Alternatively, do the same as for a non-root user below, but before doing so, set the SECBIT_NOROOT securebit and lock it, so that UID 0 is no longer special.
  • For a non-root process that wants to have capabilities, you can perform the following steps:
1. Before switching to a non-root user, use the PR_SET_KEEPCAPS prctl() operation to make sure that the permitted set is preserved after switching the UID.
2. Change the process's UID.
3. Use capget() and capset() to add the desired capabilities to the inheritable set.
4. Use the PR_CAP_AMBIENT prctl() operation with PR_CAP_AMBIENT_RAISE to add the same capabilities to the ambient set.
5. Call execve() on the new program.
6. (Untested) If the program you are running actually checks that the UID is 0 and fails to run if it is not (even if the selected capabilities allow for proper execution of its otherwise root-only operations), then you can install a seccomp filter that makes every call to get[e]uid() return 0 via the SECCOMP_RET_ERRNO return value. It is sadly not possible to do the same with getresuid().

With capsh, steps 1 to 5 look like this:

capsh --keep=1 --user=[username] --inh=cap_xxx,cap_yyy --addamb=cap_xxx,cap_yyy -- -c '"$@"' - [program] [arguments]

Or with my ctrtool program (currently only supports numbers and not names):

ctrtool launcher -fi -S 1000 -G 1000 -I "$((1<<m|1<<n))" -a inherit [program] [arguments]
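
A numeric mask of the form 1<<m|1<<n, as used in the command above, can be built from capability names with a few lines of Python. This is a sketch: `CAP_NUMBERS` lists only a subset of the values from linux/capability.h, and `cap_mask`, `set_bits`, and `permitted_mask` are hypothetical helpers. `permitted_mask` reads the CapPrm line of /proc/PID/status text, which is a substitute for capget() chosen here for brevity; `set_bits` then gives the bit positions that would each be raised into the ambient set.

```python
# Subset of capability numbers from linux/capability.h.
CAP_NUMBERS = {
    "CAP_DAC_OVERRIDE": 1,
    "CAP_DAC_READ_SEARCH": 2,
    "CAP_KILL": 5,
    "CAP_SETGID": 6,
    "CAP_SETUID": 7,
    "CAP_SETPCAP": 8,
    "CAP_NET_BIND_SERVICE": 10,
    "CAP_NET_ADMIN": 12,
    "CAP_SYS_CHROOT": 18,
    "CAP_SYS_PTRACE": 19,
    "CAP_SYS_ADMIN": 21,
    "CAP_SYS_RESOURCE": 24,
    "CAP_SETFCAP": 31,
}


def cap_mask(names):
    """Combine capability names (case-insensitive) into one bitmask."""
    mask = 0
    for name in names:
        mask |= 1 << CAP_NUMBERS[name.upper()]
    return mask


def set_bits(mask):
    """Return the bit positions that are 1 in `mask`, lowest first."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def permitted_mask(status_text):
    """Extract the permitted set from the text of /proc/PID/status."""
    for line in status_text.splitlines():
        if line.startswith("CapPrm:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapPrm line found")
```

For instance, `cap_mask(["cap_net_bind_service", "cap_net_admin"])` is 1<<10|1<<12, and `set_bits(permitted_mask(open("/proc/self/status").read()))` lists the capabilities the current process could raise.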
  • The process's UID and GID after a user namespace switch are completely unchanged, even if they are not mapped to any UID or GID in the target user namespace.
  • On Linux, the supplementary group ID list operates completely independently of any other process GID (effective, real, saved, or filesystem GID). The init process starts out with an empty supplementary GID list. However, if root logs in or a command is executed with "sudo", then the supplementary GID list contains GID 0. Combined with the above points, this means that a process could have access to group IDs (such as the original effective group ID) outside its user namespace if its supplementary GID list is not fully cleared prior to or after entering the new user namespace.
  • (Untested) On Linux, in addition to real, effective, and saved-set UIDs, we also have the filesystem UID. It was originally used by the NFS server to access files as if it were running with a certain UID before the rules of kill() regarding UIDs were changed, and now it is regarded as obsolete. However, there is one interesting corner case that would require the filesystem UID to be used. Suppose that a process is in a certain parent user namespace and has the CAP_SETUID capability but not the CAP_SYS_ADMIN capability, and it wants to enter a descendant user namespace whose owner UID does not match the process's current effective user ID and the current effective UID is not mapped in the target user namespace. To do so, it can change its effective user ID to match the owner UID of the target user namespace, and then switch to it using setns(). However, in doing so, it cannot return to its original effective user ID, even if it matches its real or saved UID, since it is not mapped in the target user namespace. If the process changed its filesystem UID back to the original effective UID prior to calling setns(), then it can still assume that UID for accessing files in the filesystem (if its mount namespace has changed), even though it is not mapped in the target user namespace. Another way of doing this is to fork a new process with CLONE_FILES set, change its effective user ID, call setns() on the target user and mount namespaces, open the root directory as a file descriptor, and then operate on that file descriptor in the original process. But it might not work if accessing files in the target mount namespace would have required some of the capabilities in the target user namespace.
  • (Untested) The specific case of having a process be in the initial user namespace and setting its network, IPC, etc. namespace from another user namespace is quite interesting. The author uses this technique to inspect the network configuration inside a Docker container without having to include the necessary commands in the container. This requires CAP_SYS_ADMIN in the initial user namespace. However, if after doing so, the process changes its effective UID to match the owner UID (which can be non-zero) of the user namespace, then it can assume all capabilities in the network or IPC namespace owned by the descendant user namespace, even though it is not running as UID 0. Of course, if it also needs capabilities from the initial user namespace, then it can set its ambient capabilities. In lieu of ambient capabilities, if it uses PR_SET_KEEPCAPS prior to changing effective UID, then it can have certain capabilities in the permitted set prior to calling execve(), but even though its capabilities in the initial user namespace are gone after calling execve(), it still has the capabilities in the descendant user namespace.
  • According to capabilities(7), if a root process calls execve(), the new process's effective and permitted capabilities will be the union of the capability bounding set and the inheritable set. This is not true, however, if the process has the NO_NEW_PRIVS bit set, in which case it will still be limited to the process's existing effective and permitted capabilities.
  • A process in a child user namespace can be ptraced (e.g. with strace) by a process in the parent user namespace whose effective UID is the owner of the child user namespace, even if kernel.yama.ptrace_scope is set to 1 or 2, since it technically has CAP_SYS_PTRACE in the target user namespace per the owner UID rule. (Untested) This also applies to CAP_KILL.
  • Creating new user namespaces has the restriction (as documented in the man page) that only users with their UID and GID mapped in the current user namespace will be allowed to create new user namespaces. Generally, this would mean that the process originated in the initial user namespace, changed its user and group IDs to something that is not mapped in a given level 1 user namespace, entered that level 1 user namespace using setns(), and finally attempted to create a new level 2 user namespace in that level 1 user namespace; because of that restriction, the operation would fail. Alternatively, a process could run into it by calling unshare(CLONE_NEWUSER) twice in succession without writing the user and group ID maps in between. This restriction is based on whether or not the "kernel user ID" (i.e. the user ID that is in the initial user namespace) of the creating process is actually present in the current user namespace. The initial user namespace has all user and group IDs mapped (as /proc/self/uid_map and gid_map have 0 0 4294967295 in them), so it would only apply to user namespaces created in a user namespace other than the initial one (i.e. creating a user namespace of height 2 or more, as described below). Note that this is different from, and has no relation to, the user and group existing in /etc/passwd or /etc/group (those restrictions are generally enforced by userspace programs, and not by the kernel).
  • (Untested) In single-user-and-group user namespaces (e.g. the user namespace created by unshare -r), filesystem permissions and file ownership can no longer be used to enforce security policies. Instead, capabilities and child namespaces must be used. Examples of these include:
  • Utilizing the /proc/PID/fd "magic links" to access privileged directories (enforced through the PTRACE_MODE_READ_FSCREDS ptrace check, which is capability-based, usually with CAP_SYS_PTRACE). Be sure to close these file descriptors (or make them close-on-exec) if you ever intend to spawn unprivileged child processes!
  • Creating bind mounts of mount namespaces that contain mounts of privileged directories (enforced through CAP_SYS_ADMIN to change mount namespace)
  • For the two above techniques to actually be useful, the root filesystem must be mounted read-only (otherwise, unprivileged processes could put files there), and transitions from privileged to non-privileged states (e.g. in circumstances where a web (or other) server needs to run as an unprivileged user rather than as root) must be done using capabilities instead of user IDs (e.g. with setpriv --bounding-set=-all instead of sudo -u user).
  • (As a stronger alternative) Creating new child user namespaces with PID and mount namespaces in them, so that privileged processes in the original user namespace are not even visible. This is the only viable option if you intend to have multiple unprivileged user realms in your user namespace. Note that if this option is used, all unprivileged processes will need to be in child user namespaces; otherwise, those unprivileged processes technically have all capabilities in those user namespaces per the owner UID rule.
  • (Yet another alternative) Put privileged files (and/or mount points) behind a mode-000 directory, thus gating access through the CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH capabilities. However, this is only effective if the directory's filesystem is read-only (otherwise, an unprivileged process could chmod it) and new user namespace creation is disabled (otherwise, the capability could be regained, and it would apply to that directory, because the owner and group can be mapped in the new user namespace.)
  • A process which intends to map UID 0 from the original user namespace into a new user namespace must have the CAP_SETFCAP capability in the original user namespace at the time it called unshare() or clone(). Otherwise, because the root user ID of the original and new user namespaces are the same, a process in the new user namespace could create a version 3 file capability xattr record on a given executable and gain privileges in the original user namespace by running that executable in the original user namespace.
  • Sometimes I like to think of the user namespace hierarchy like this:
  • The initial user namespace has a tree height of 0. There is only one such namespace on the system.
  • Every user namespace whose parent user namespace is the initial user namespace has a tree height of 1.
  • Every user namespace whose parent user namespace has a tree height of 1, has a tree height of 2.
  • Every user namespace whose parent user namespace has a tree height of n, has a tree height of n+1.
  • Systems are designed so that the user namespaces are not nested too deeply, so as to minimize the maximum tree height. The degree of each user namespace (number of direct descendant user namespaces) is unlimited.
  • The "tree height" described here may also be referred to as the "level" in the Linux kernel source code.
Highest Privilege Level
Root in the initial user namespace
[drop capabilities; may be reversed by executing a setuid root program, or by file capabilities] [*]
Non-root in the initial user namespace
[unshare(); irreversible]
Root in a user namespace of height 1
[drop capabilities; may be reversed by executing a setuid root program, or by file capabilities] [*]
Non-root in a user namespace of height 1
[unshare(); irreversible]
Root in a user namespace of height 2
[drop capabilities; may be reversed by executing a setuid root program, or by file capabilities] [*]
Non-root in a user namespace of height 2
[unshare(); irreversible]
etc.
Lowest Privilege Level

For the purposes of the diagram above, "setuid root program" in a user namespace means an executable file with the set-uid bit set, and whose owner UID is whatever maps to UID 0 in that namespace. So for a user namespace with a UID map of "0 100000 65536", UID 100000 in the parent user namespace would be "root" in the child namespace.

[*] If the no_new_privs bit is set, then the "drop capabilities" step is irreversible.
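
The mapping example above ("0 100000 65536") can be checked with a small translation helper. This is a sketch with hypothetical function names; it assumes the map text uses the three-column format of /proc/PID/uid_map (ID inside the namespace, ID outside, range length) and that the "outside" column is expressed in the parent user namespace.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside, outside, count) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            entries.append((inside, outside, count))
    return entries


def to_child(parent_id, entries):
    """Translate an ID in the parent namespace to the child, or None if unmapped."""
    for inside, outside, count in entries:
        if outside <= parent_id < outside + count:
            return inside + (parent_id - outside)
    return None


def to_parent(child_id, entries):
    """Translate an ID in the child namespace to the parent, or None if unmapped."""
    for inside, outside, count in entries:
        if inside <= child_id < inside + count:
            return outside + (child_id - inside)
    return None


def is_setuid_root_for(file_owner_uid_in_parent, entries):
    """True if a set-uid file with this owner counts as "setuid root" in the child."""
    return to_child(file_owner_uid_in_parent, entries) == 0
```

With `entries = parse_id_map("0 100000 65536")`, `to_child(100000, entries)` is 0, so a set-uid executable owned by UID 100000 in the parent is a "setuid root program" inside the child, while UID 1000 in the parent is not mapped at all.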

There are several ways to block the creation of a new user namespace:

  • Put the process in a chroot environment.
  • Put the process in a descendant user namespace with UIDs and GIDs outside the namespace's UID and GID maps, and remove the CAP_SETUID and CAP_SETGID capabilities. This is only possible if the process started in the initial user namespace. However, in this mode, the process cannot send any credentials using a UNIX domain socket (although SO_PEERCRED on the other side may still work).
  • Use seccomp() to block the unshare() call (not recommended).
  • Set /proc/sys/kernel/unprivileged_userns_clone to 0 (a distribution-specific sysctl, found on Debian and Ubuntu kernels).
  • (Untested) Set /proc/sys/user/max_user_namespaces to 0, and remove the CAP_SYS_RESOURCE capability.
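
The two sysctl-based methods can be inspected from Python. This is a sketch with hypothetical helper names; the `root` parameter exists only so that the functions can be pointed at a directory other than /proc/sys, and a missing file (for example on kernels without unprivileged_userns_clone) is reported as None rather than as an error. It says nothing about the chroot, ID map, or seccomp methods.

```python
from pathlib import Path


def read_sysctl_int(name, root="/proc/sys"):
    """Read an integer sysctl such as "user.max_user_namespaces"; None if absent."""
    path = Path(root) / name.replace(".", "/")
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, PermissionError, ValueError):
        return None


def userns_block_reasons(root="/proc/sys"):
    """List the sysctl settings that currently forbid new user namespaces."""
    reasons = []
    if read_sysctl_int("user.max_user_namespaces", root) == 0:
        reasons.append("user.max_user_namespaces is 0")
    if read_sysctl_int("kernel.unprivileged_userns_clone", root) == 0:
        reasons.append("kernel.unprivileged_userns_clone is 0")
    return reasons
```

An empty list from `userns_block_reasons()` only means that neither sysctl is set to 0; creation can still fail for the other reasons listed above.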

Mount namespaces

  • Mount namespaces can be used for the following purposes:
  • To restrict and virtualize the views of the filesystem for certain groups of processes (preferably with pivot_root).
  • To act as a container that stores other types of namespaces.
  • To allow two different processes on the same system to have a different view of the same file path (useful if that file path is hard-coded).
  • To create a temporary directory that is automatically deleted when the processes that use it terminate; useful if it is not possible to clean up the filesystem otherwise (e.g. in a signal handler).
  • (if used with a user namespace) To override a system file for a user's own processes.
  • (Untested for security) To protect (using the CAP_SYS_ADMIN capability) access to resources in cases where filesystem permissions are not sufficient (e.g. single-UID rootless containers).
  • (if used with a user namespace) (Untested) To allow unprivileged use of ip netns or similar. Processes within the namespace will be able to access it by name (the names for those network namespaces are isolated from the set of network namespaces on the host), but when all of the processes in the mount namespace terminate, the equivalent of ip netns del is performed on all network namespaces in that mount namespace. (To allow access to the greater Internet using those namespaces, you can use a tool like slirp4netns or Universal Relay; see #Network namespaces below.)
  • Don't use chroot(). Instead, use pivot_root(). You may create a new bind mount or tmpfs mount in the new mount namespace as your new root, then call umount2([oldroot], MNT_DETACH) (or umount -l /oldroot, lowercase L) to detach the old root from the current mount namespace.
  • If you create a new mount namespace, have a removable disk attached to the system at the time of mount namespace creation, and either 1. call mount --make-rprivate / (but not mount --make-rslave /) (unshare -m does this by default) or 2. the propagation type of the previous mount namespace is set to slave or private, then when the removable disk is later unmounted, it will still be mounted in the new mount namespace. This has actually caused filesystem corruption twice for the author.
  • When calling setns() on a mount namespace, the process's root directory returns to the root of the mount namespace, even if the process is chroot'ed. (Untested) This means that if the /proc filesystem is mounted in the chroot, then any process with the CAP_SYS_CHROOT and CAP_SYS_ADMIN capabilities in the process's mount namespace can escape from the chroot by calling setns() on the /proc/self/ns/mnt file. Note that an unprivileged process cannot do the same, since it is not allowed to create a new user namespace in a chroot.
  • Not exactly namespace-related, but instead of using a PID file, the /proc/PID directory can be bind-mounted somewhere else. This has an advantage that if the process terminates and the PID is reused, then the /proc/PID bind mount will no longer have files, whereas simply keeping track of PIDs will only result in a false positive.
  • A tmpfs mount made in a mount namespace that is owned by a user namespace other than the initial namespace is restricted to the user namespace's UID and GID maps. That is, it is not possible to create a file on such a tmpfs mount where the creating process's filesystem UID/GID is not mapped in the target user namespace, and for example, the root user in the initial user namespace cannot create files in such a mount if UID 0 in the initial user namespace is not mapped into the target user namespace (doing so results in -EOVERFLOW).
  • If a filesystem is mounted on top of / (and the current / is not the topmost mount), then /.. will refer to this topmost mount[8].
  • It is also possible to make a namespace (not necessarily a mount namespace) "semi-persistent", by bind mounting that namespace into a new mount namespace, but without propagating it back to the host. In this case, the new namespace has a pathname within the mount namespace but not outside of it, so that processes within the mount namespace can still refer to it by the mount. If the mount namespace is non-persistent, then the new namespace will disappear at the same time the mount namespace disappears (i.e. upon termination of a container). This is useful for multihomed containers (not to be confused with containers on multihomed hosts) since processes in a container can switch between two network namespaces at will without requiring any other privileges; apps-vm13 is an example of this.
  • The /proc and /sys filesystems in a mount namespace owned by a non-initial user namespace can only be mounted by a process in a non-initial user namespace if there is already a proc or sysfs filesystem visible in the current mount namespace. Normally, this is only of concern if pivot_root is called on the mount namespace and the old root mount is unmounted, as the host's /proc and /sys mounts are locked (MNT_LOCKED) otherwise.
  • (Untested) On kernel 5.11 or higher, to create the equivalent of a read-only bind mount in a container that has the CAP_SYS_ADMIN capability, use an overlayfs with only a lowerdir (no upperdir or workdir).
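
Whether an unmount on the host reaches a mount in another mount namespace (the removable-disk pitfall mentioned earlier in this section) depends on that mount's propagation type, which can be read from the optional fields of /proc/self/mountinfo. The following is a sketch with a hypothetical function name; it relies on the field layout documented in proc(5), where the optional fields sit between the sixth field and a lone "-" separator, and it reports a mount with no optional fields as private.

```python
def propagation(mountinfo_line):
    """Return the propagation tags of one /proc/PID/mountinfo line.

    Possible tags are "shared", "master" (the mount is a slave),
    "propagate_from" and "unbindable"; a mount with none of them is private.
    """
    fields = mountinfo_line.split()
    # Fields 1-6 are fixed; optional fields run up to the "-" separator.
    separator = fields.index("-", 6)
    tags = sorted({field.split(":", 1)[0] for field in fields[6:separator]})
    return tags or ["private"]


def mounts_by_propagation(mountinfo_text):
    """Map each mount point to its propagation tags."""
    result = {}
    for line in mountinfo_text.splitlines():
        if line.strip():
            result[line.split()[4]] = propagation(line)
    return result
```

Running `mounts_by_propagation(open("/proc/self/mountinfo").read())` inside the new mount namespace shows which mounts are private there; per the note above, a removable disk that appears as private will stay mounted in that namespace after it is unmounted on the host.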

Network namespaces

All sockets hold a reference to the network namespace in which they were originally created. In addition, the network namespace also has a reference to its owning user namespace. With the SIOCGSKNS and NS_GET_USERNS ioctls (and with the right privileges), both the network and user namespaces of a socket can be discovered. This means that a child process can first prepare a user and network namespace (e.g. it can write user and group ID maps and/or create a veth or macvlan device) and then send just a single socket in that network namespace back to the parent process using the SCM_RIGHTS operation. The parent process can then use those two ioctls on the received socket to obtain file descriptors for the user and network namespaces.
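
The two ioctls can be sketched in Python as follows. The request numbers are taken from the kernel headers (SIOCGSKNS from linux/sockios.h, NS_GET_USERNS defined as _IO(0xb7, 0x1) in linux/nsfs.h); `namespaces_of_socket` is a hypothetical helper name, and whether the calls succeed depends on the privileges mentioned above.

```python
import fcntl
import os
import socket

SIOCGSKNS = 0x894C                  # linux/sockios.h
NSIO = 0xB7                         # linux/nsfs.h
NS_GET_USERNS = (NSIO << 8) | 0x1   # _IO(NSIO, 0x1)


def namespaces_of_socket(sock):
    """Return (netns_fd, userns_fd) for the namespaces behind `sock`.

    Both ioctls return a new file descriptor as their result; the caller
    owns the two descriptors and must close them.
    """
    netns_fd = fcntl.ioctl(sock.fileno(), SIOCGSKNS)
    try:
        userns_fd = fcntl.ioctl(netns_fd, NS_GET_USERNS)
    except OSError:
        os.close(netns_fd)
        raise
    return netns_fd, userns_fd
```

In the scenario described above, `sock` would be the socket received from the child via SCM_RIGHTS; the returned descriptors can then be passed to setns() or bind-mounted through /proc/self/fd.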
  • Network namespaces can be used for the following purposes (not exhaustive, but these are the most reasonable ones):
  • To isolate the network connectivity of a certain group of processes. (This may be relied on for security.)
  • To isolate the namespace of abstract UNIX domain socket paths (for instance, to run two instances of the same application that use the same abstract UNIX domain socket path).
  • To restrict the visibility of applications that bind to localhost, in circumstances where server-side request forgery or the ability to hijack ports >= 1024 is otherwise possible.
  • To implement a "critical" VPN (one that does not fall back to the "native" connection when the VPN connection is severed).
  • To act as a "container" for network configuration, in a way that does not disturb the configuration of the real host. (The author once relied on this to ensure that Docker would not mess up the iptables rules that were already present in the host, as the host was being used as a NAT[9] router between two physical networks. This was done by putting the entire Docker daemon into its own network namespace, bridged to the host using a veth device, such that the outbound connections from the containers had an internal IP address that was different from the rest of the host. In fact, this scenario was what prompted the need for Socketbox, as this procedure would otherwise make the containers inaccessible on localhost (of the host netns), and it was too complicated and insecure to set up routing rules to facilitate localhost connections otherwise.)
  • To establish zones for connection tracking, such that on, e.g., a multi-VLAN router, fast flows that do not rely on connection tracking (e.g. inter-VLAN routing) are not bottlenecked by the connection tracking on the WAN interface. (Connection tracking is automatically enabled in a network namespace if there is at least one iptables rule in that namespace that specifies -m conntrack, -m state, -j CT, or if the nat table is activated (e.g. with -j SNAT, -j DNAT, -j NETMAP, or -j MASQUERADE). It is what makes firewalls "stateful", with the downside that it requires much more memory.)
  • To work around the issue of VPN or multihomed connections having the same network prefix as the "native" network (a.k.a. "IP conflict").
  • (if used with a user namespace) To allow an unprivileged user to use tools like the firewall, routing table, etc. as a means of controlling packets within the user's own processes.
  • Notwithstanding the existence of Socket Enhancer combined with VRFs and/or policy-based routing, to control the network output of programs that do not have the functionality to set the IP address for outbound connections, especially when done so for security.
  • For server applications that only allow a port to be specified without a listening IP address, to allow multiple instances of that server application to run on the same machine.
  • On a system connected to a network with a dynamic SLAAC IPv6 prefix, to allow multiple IPv6 addresses to be used for multiple applications.
  • As a rudimentary implementation of a software-defined network (SDN).
  • To implement tools like Socketbox.
  • To design a network topology that resembles that of a real network, virtually within a single computer, i.e. a real-life network simulator that can be used even for production systems.
  • To facilitate the development of network applications, in cases where localhost/loopback is not sufficient (e.g. routing daemons, UDP multicast or broadcast).
  • Users without root access can still create network namespaces by wrapping them in a user namespace; however, without root privileges, it is not possible to create a veth device to the host. Fortunately, the following operations can still be performed:
  • Create and configure virtual network devices (e.g. virtual Ethernet devices, TUN/TAP devices, loopback, WireGuard), as long as you have the CAP_NET_ADMIN capability in all network namespaces that you create the devices in.
  • Use veth devices to connect between other network namespaces that are created in that user namespace.
  • Use tools like ip and iptables to create a "program" for packet flow. (eBPF programs other than socket filters might not be possible because creating eBPF programs requires privileges in the initial user namespace. It might still be possible to use a pinned program, however.)
  • Connect to pathname-based Unix domain sockets that were created in any other network namespace, including the host network namespace, as long as the socket file exists in the mount namespace and filesystem permissions allow the connection.
  • Inherit one or more listening or connected sockets from the host's network namespace and somehow make the containerized program use them (e.g. systemd socket activation for servers or ProxyUseFdPass with the OpenSSH client, the latter of which causes any forwarding options (-D, -L, -R, or (untested) -w) to be in the new network namespace, with the actual connection to the SSH server still in the host's network namespace.)
  • Use Socketbox, or more generally, send and receive sockets from other namespaces using SCM_RIGHTS.
  • Run some kind of proxy server (ideally a transparent proxy server) on the host network namespace, but which listens on a socket or TUN/TAP device in the unprivileged network namespace (e.g. slirp4netns or Universal Relay). (TODO: document netns sockets to Universal Relay to SOCKS proxy to remote SSH server.)
  • Use TCP/UDP Relay to bridge a WireGuard or OpenVPN interface created in the network namespace to a UDP socket on the host.[10] (This is possible because there are only two primary access conditions to allow a VPN client to run: 1. the ability to create the socket which communicates with the VPN server, and 2. the ability to create the low-level tunneling device that allows the raw IP/Ethernet packets to be encapsulated. The first condition is theoretically satisfied for UDP-based VPN protocols like WireGuard, OpenVPN, and UDP-based IPsec. ("Classical" IPsec, with AH/ESP headers, as well as 6-in-4 tunnels, is not supported, as that usually requires the creation of a raw socket.) The second condition is usually only satisfied if there are root privileges on the host, but since we're creating such a device in a network namespace that we control, capabilities only in the newly created network namespace are sufficient. The VPN client is merely a relay between those two objects, usually with encryption as part of it (though that is not a strict requirement).)
  • In other words, it might actually be easier to use some kind of proxy or VPN client to connect the newly created network namespace to the Internet, rather than reusing the host's network connection as-is directly.
  • If you do choose to use a VPN with multiple network namespaces, it is probably better for performance reasons to create one top-level network namespace and use it as a "router" for other network namespaces, rather than having a VPN client for every network namespace. This can be done without NAT by setting AllowedIPs on the WireGuard server (iroute for OpenVPN) to enclose all of the network prefixes that are present behind it (e.g. if you have a network with 192.168.100.0/24 underneath the client, then set AllowedIPs on the server to include 192.168.100.0/24), or it can still be done with NAT as usual. Other aspects of the firewall (e.g. blocking incoming connections) can also be done as usual.
  • More generally, if an application is capable of creating sockets or TUN/TAP devices in both the host network namespace as well as in the newly created network namespace, then that application can be used to communicate between the network namespaces. (ctrtool ns_open_file can be used to facilitate creating such applications.)
  • Moving a network device from one namespace to another via ip link set <dev> netns <ns> or similar requires CAP_NET_ADMIN in both the current network namespace and in the target network namespace, in contrast with nsenter/setns, which require CAP_SYS_ADMIN.
  • If you want to intercept network connections without creating devices (sort of have a virtual TCP/IP stack), then you can run the commands
ip route add local ::/0 dev lo table main
ip route add local 0.0.0.0/0 dev lo table main

in a newly created network namespace and then open up a wildcard-bound socket in that same namespace to intercept connections to any and all IP addresses, provided that they were made in that network namespace. You can use getsockname() to obtain the original destination IP address. This might be useful for transparent proxying.
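
The interception idea above can be sketched as a small Python listener. The function names are made up for illustration, and the sketch only covers IPv4 and TCP. Note that a wildcard-bound socket catches connections to any address but only on the port it listens on.

```python
import socket


def make_wildcard_listener(port=0):
    """Listen on all IPv4 addresses; port 0 lets the kernel pick a free port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("0.0.0.0", port))
    listener.listen()
    return listener


def accept_with_original_destination(listener):
    """Accept one connection and report the address the client actually dialed."""
    conn, peer = listener.accept()
    # On an accepted TCP socket, getsockname() returns the local end of that
    # particular connection, i.e. the destination the client connected to.
    original_dst = conn.getsockname()[:2]
    return conn, peer, original_dst


if __name__ == "__main__":
    listener = make_wildcard_listener(8080)
    while True:
        conn, peer, dst = accept_with_original_destination(listener)
        print(f"{peer[0]}:{peer[1]} wanted {dst[0]}:{dst[1]}")
        conn.close()
```

Run inside a network namespace that has the two local routes installed, a client in the same namespace connecting to an arbitrary IPv4 address on port 8080 should show up here with that address as the destination; a transparent proxy would then open the real outbound connection from a socket belonging to a different namespace.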

  • iptables-legacy may not work well in a network namespace owned by an unprivileged user namespace. Use iptables-nft instead.
  • One particularly interesting use of network namespaces is that they can allow a computer to have multiple IP addresses on a wired network in a way that is transparent to the application, without the need to create separate bindings (this is mostly useful for having multiple IPv6 addresses on an IPv6 dynamic prefix, where SLAAC and privacy extensions may make it difficult to bind to a subset of every address that exists on the system). To do this, first remove any existing IP addresses from the original Ethernet interface. (If you're accessing the system remotely over SSH, consider using netplan try to configure the bridge instead of modifying network configuration directly, so that you can revert if you screw up. Alternatively, consider using macvlan or ipvlan interfaces, which I only discovered after writing this procedure.) Then, create a bridge interface, and add the original Ethernet interface to it. Configure the bridge interface's IP address like you normally would with the Ethernet interface. Next, create a virtual Ethernet device pair, and then add one of the two newly created interfaces onto the bridge and bring the new interface up. Create the new network namespace, and then move the other virtual network interface into that namespace. Set up the IP address of the virtual Ethernet device in the new network namespace, which should be different from the IP address in the original namespace. This setup is as if you connected another computer onto the same network that the physical network card is plugged into, but with the advantage that they can share the same system environment. If you also unshare the PID namespace, you can also run an init process in it, allowing you to spawn multiple processes in it. Busybox init works well here; though if you want to have multiple inittab files, you will need to create a new mount namespace too and bind mount a separate inittab file onto /etc/inittab in each mount namespace.
The author actually uses this technique to run his web-based email interface and GitLab in separate containers, both of which have their own (private) IP address, all on one computer, and without any NAT, host firewall, proxy ARP, or routing configuration. One main disadvantage here is that since the new network namespaces are on the same layer 2 broadcast domain as the host, ARP spoofing attacks are possible if the containerized processes are untrusted, even if a new user namespace is created. This can be mitigated by removing the CAP_NET_RAW and CAP_NET_ADMIN capabilities from the bounding set prior to running the containerized application.
  • Since the loopback interface is independent in each network namespace, moving applications to a different network namespace can prevent a certain class of security vulnerabilities which exploit the ability to make arbitrary connections and bindings in a common loopback interface.
  • The network namespace is actually the property of a socket, not of a process. A process's network namespace membership is only used to determine the network namespace of new sockets that it creates (except those returned by accept(), which will be in the client's network namespace for UNIX domain sockets, or the same namespace as the listening socket for all other address families). Those sockets hold a reference to the network namespace, which can be retrieved with the SIOCGSKNS ioctl. Furthermore, if a process operates entirely on sockets from a different network namespace than its own, then its behavior is indistinguishable from that of a process in the same network namespace as the sockets. This means that, for example,
my_socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)
unshare(CLONE_NEWUSER|CLONE_NEWNET)
bind(my_socket, {AF_INET, 80, 0.0.0.0})

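would fail to bind the socket because my_socket is still in the original network namespace at the time of the bind() call: binding to a port below 1024 requires CAP_NET_BIND_SERVICE in the user namespace that owns the socket's network namespace, whereas the capabilities gained from unshare() apply only to the new user namespace.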
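
A hedged Python sketch of the same experiment follows. It runs in a forked child so that the calling process keeps its own namespaces, calls unshare() through ctypes (the CLONE_* values are copied from linux/sched.h), and also binds a second socket created after the unshare for contrast. The function names are made up for illustration, and the outcome is reported rather than assumed, since it depends on the system (for example, unprivileged user namespaces may be disabled, or net.ipv4.ip_unprivileged_port_start may have been lowered).

```python
import ctypes
import errno
import os
import socket

CLONE_NEWUSER = 0x10000000  # <linux/sched.h>
CLONE_NEWNET = 0x40000000   # <linux/sched.h>


def try_bind(sock, port):
    """Bind sock to 0.0.0.0:port and describe what happened."""
    try:
        sock.bind(("0.0.0.0", port))
        return "bind succeeded"
    except OSError as exc:
        return "bind failed with " + errno.errorcode.get(exc.errno, str(exc.errno))


def child_experiment(port):
    """Create a socket, unshare, then bind it and a freshly created socket."""
    try:
        old_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        libc = ctypes.CDLL(None, use_errno=True)
        if libc.unshare(CLONE_NEWUSER | CLONE_NEWNET) != 0:
            code = ctypes.get_errno()
            return "unshare failed with " + errno.errorcode.get(code, str(code))
        new_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        return ("socket created before unshare: " + try_bind(old_sock, port)
                + "; socket created after unshare: " + try_bind(new_sock, port))
    except Exception as exc:  # report anything unexpected instead of crashing
        return "child error: " + repr(exc)


def bind_before_and_after_unshare(port=80):
    """Run the experiment in a forked child and return its textual report."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_end)
            os.write(write_end, child_experiment(port).encode())
        finally:
            os._exit(0)  # never fall back into the caller's code in the child
    os.close(write_end)
    with os.fdopen(read_end, "rb") as pipe:
        report = pipe.read().decode()
    os.waitpid(pid, 0)
    return report


if __name__ == "__main__":
    print(bind_before_and_after_unshare())
```

If the reasoning above holds, the socket created before the unshare should be refused while the one created afterwards binds, but treat that as something to check on your own kernel rather than a guaranteed result.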

  • All sockets have an associated network namespace, including UNIX domain sockets. If the client of a UNIX domain socket stream or sequenced packet server is in a different network namespace compared to the server itself, then the server socket returned by accept() will actually be in the client's network namespace. (Untested) With the SIOCGSKNS ioctl on the server socket, it is possible to determine the network namespace of the client, just like how SO_PEERCRED can be used to determine the client's user and group IDs. (The SIOCGSKNS ioctl requires the CAP_NET_ADMIN capability on the client's network namespace; also consider the owner UID rule for namespace capabilities in rootless containers.) If that is really true, I will soon be writing a server daemon that relies on this to be able to dynamically create a "veth" pair between a container and the host, without having to create such a device beforehand (i.e. during container creation). (I have had this concept in mind before, but it would have relied on using SCM_RIGHTS to pass the network namespace to the server.)
  • When a bridged veth or macvlan device is brought up on a network with IPv6 stateless autoconfiguration, the veth/macvlan device will by default obtain an IPv6 address and default route automatically, without having to run a DHCPv6 client (useful if it's in a different network namespace). If in a different network namespace, privacy extensions are disabled by default (enable with sysctl -w net.ipv6.conf.[device].use_tempaddr=2). (Usually, leaving them disabled should not matter much, as the MAC addresses of veth and (untested) macvlan devices are randomized by default.)
  • When managing network namespaces with the "ip" command, only commands beginning with ip netns are subject to the limitation of only supporting namespaces in /var/run/netns. Network namespaces in other contexts (e.g. ip link set <dev> netns [NETNS]) are not subject to this restriction. In this case, if NETNS is a number, then it is interpreted as a /proc/PID/ns/net reference. If NETNS does not contain a slash character, then it is interpreted relative to /var/run/netns. Otherwise, it is interpreted as a regular filename relative to the root directory or the current working directory, as usual. Network namespaces can still be created without ip netns by using :> [location] && unshare -n mount --bind /proc/self/ns/net [location], and the equivalent of ip netns exec is nsenter --net=[location]. Unlike ip netns exec, nsenter does not unshare the mount namespace nor mount over /etc/resolv.conf or /sys. See https://git2.peterjin.org/ctrtool-containers for examples.
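
The NETNS interpretation rules just described can be restated as a small Python helper. This is only a sketch of the rules as written in these notes, not a port of the iproute2 source, so the exact precedence in the real tool may differ; the function names are made up for illustration.

```python
import os

NETNS_RUN_DIR = "/var/run/netns"


def resolve_netns_argument(arg):
    """Map a NETNS argument to the namespace file it refers to."""
    if arg.isascii() and arg.isdigit():
        # A bare number is treated as a PID.
        return f"/proc/{arg}/ns/net"
    if "/" not in arg:
        # A bare name lives in the ip-netns directory.
        return os.path.join(NETNS_RUN_DIR, arg)
    # Anything containing a slash is an ordinary (absolute or relative) path.
    return arg


def nsenter_command(netns_arg, argv):
    """Build the nsenter equivalent of 'ip netns exec' for a resolved namespace."""
    return ["nsenter", f"--net={resolve_netns_argument(netns_arg)}", *argv]


if __name__ == "__main__":
    for example in ("1234", "blue", "/run/user/1000/net1", "./net1"):
        print(example, "->", resolve_netns_argument(example))
    print(nsenter_command("blue", ["ip", "addr"]))
```

The paths in the usage lines are arbitrary examples; any location that holds a bind-mounted network namespace works the same way.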

IPC namespaces

  • (Untested) If a process changes IPC namespace, then it only affects new calls to shmget(). Any existing shared memory segments mapped in the process's memory will continue to exist until they have been detached with shmdt(). However, it will not be possible to remove such a shared memory segment unless IPC_RMID is used prior to changing IPC namespace.
  • IPC namespaces operate only on System V IPC objects (shared memory segments, semaphores, and message queues, i.e. shmget/semget/msgget) and POSIX message queues. They do not operate on POSIX semaphores or shared memory obtained with sem_open or shm_open. To isolate those, use a mount namespace to overmount a tmpfs instance on /dev/shm.
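
The reason a mount namespace is the right tool for POSIX shared memory can be seen from a short Python sketch: on Linux, an object created through shm_open is expected to appear as a file under /dev/shm, so its visibility follows the filesystem rather than the IPC namespace. The sketch assumes a typical Linux libc with /dev/shm mounted, and the function name is made up for illustration.

```python
import os
from multiprocessing import shared_memory


def posix_shm_backing_path(size=4096):
    """Create a POSIX shared memory object and report where it lives on disk."""
    shm = shared_memory.SharedMemory(create=True, size=size)
    try:
        # Assumption: the libc backs shm_open() with files in /dev/shm.
        path = os.path.join("/dev/shm", shm.name)
        return path, os.path.exists(path)
    finally:
        shm.close()
        shm.unlink()  # remove the object again so nothing is left behind


if __name__ == "__main__":
    path, visible = posix_shm_backing_path()
    print(path, "exists" if visible else "not found")
```

After overmounting a fresh tmpfs on /dev/shm in a new mount namespace, objects created this way would land in the new tmpfs and would no longer be shared with processes outside that mount namespace.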

PID namespaces

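  • PID namespaces are not useful to keep alive after their processes have exited: once the namespace's init process (PID 1) terminates, no new processes can be created in it, so joining it with setns() no longer accomplishes anything.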
  • By using the set_tid option of clone3(), you can have "reserved"/"fixed" PIDs in a PID namespace, thus eliminating PID files. This is accomplished by the following steps.
1. Create a new PID namespace and if using unshare(), fork a new process to join that namespace.
2. Mount the /proc filesystem.
3. Write the number 300 to /proc/sys/kernel/ns_last_pid.
4. To create new processes with "reserved" PIDs, use clone3() and set the first element of the set_tid array to the desired "reserved" PID, which can range from 2 to 299 inclusive.

This makes PID files unnecessary, since the daemon runs under a known, fixed PID. To terminate the daemon, simply kill that PID; there is no need to look it up in /proc anymore. And if the daemon is already running, then step 4 will fail since the PID is already in use.

The magic number 300 is present here because PIDs wrap around from pid_max (commonly 32768 or 4194304) to that point, so PIDs 2-299 will never be assigned as a result of PID reuse. It is not known whether the kernel will assign PIDs below 300 to new processes if all PIDs above 300 are in use; if you're paranoid, make sure that the number in /proc/sys/kernel/pid_max is at least 350 + the value in /proc/sys/kernel/threads-max, or use the PID cgroup controller.

This works best in a descendant PID namespace, since 2-299 in the initial PID namespace are already taken by kernel threads.
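
Step 3 and the pid_max sanity check above can be sketched in Python as follows. The function names are made up for illustration, the margin of 350 is simply the figure suggested in these notes, and writing ns_last_pid is expected to need privileges over the PID namespace, so that helper is shown but not exercised here. Step 4 (clone3() with set_tid) is not covered by this sketch.

```python
RESERVED_PIDS = 300   # wrap-around point described above
SAFETY_MARGIN = 350   # margin suggested in the text


def read_int(path):
    """Read a single integer from a /proc file."""
    with open(path) as f:
        return int(f.read().strip())


def reserved_pids_are_safe(pid_max, threads_max):
    """True if pid_max leaves room for every possible thread above the reserved range."""
    return pid_max >= SAFETY_MARGIN + threads_max


def prime_pid_allocator(proc_root="/proc"):
    """Step 3: make the next allocated PID land above the reserved range."""
    with open(f"{proc_root}/sys/kernel/ns_last_pid", "w") as f:
        f.write(str(RESERVED_PIDS))


if __name__ == "__main__":
    pid_max = read_int("/proc/sys/kernel/pid_max")
    threads_max = read_int("/proc/sys/kernel/threads-max")
    verdict = "ok" if reserved_pids_are_safe(pid_max, threads_max) else "too small"
    print(f"pid_max={pid_max} threads-max={threads_max}: {verdict}")
```

A container setup script could run the check on the host before relying on fixed PIDs, and call prime_pid_allocator() from inside the new PID namespace once /proc has been mounted there.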

  • PID 1 is always a stable reference in any PID namespace. In this case, the PID namespace file descriptor can sometimes serve as a PID file descriptor (as described in [5]), and entering the PID namespace followed by referencing PID 1 will always refer to that process.

See also

  1. https://dev.mysql.com/doc/mysql-secure-deployment-guide/8.0/en/secure-deployment-configure-authentication.html
  2. https://mariadb.com/kb/en/authentication-plugin-unix-socket/
  3. https://elixir.bootlin.com/linux/v5.12.3/source/kernel/nsproxy.c#L373
  4. However, the process must not be running with an unmapped UID/GID and must not be in a chroot.
  5. https://man7.org/linux/man-pages/man2/setns.2.html
  6. Typically achieved either by having CAP_SYS_ADMIN in the parent namespace OR being in the parent namespace and having the same effective user ID as the original process that created the namespace; in the latter case, no capabilities in the process's current user namespace are required.
  7. May also require CAP_NET_ADMIN if operating concurrently on current network namespace, depending on the operation.
  8. https://lwn.net/ml/linux-kernel/CALrw=nF-0E2icB85aU6hDoGmukQ0Hp_b0Un0savTco=meQV4uw@mail.gmail.com/
  9. Yes, I know, it was an old strategy. Current methods of repeating the above do not necessarily rely on network address translation; I would use static or dynamic routes instead.
  10. Currently, the only restriction with using TCP/UDP Relay is that the VPN client must support binding outbound connections to a local UDP port number, as it relies on seeing the exact 4-tuple of the UDP packet.