Socketbox

For end-user documentation on socketbox, see Help:Socketbox.
Socketbox works by having a single "master" daemon accept connections for multiple IP/port combinations, then sending connection sockets to other server processes using the SCM_RIGHTS operation. Socketbox determines which server the socket should be sent to by calling getsockname() on the connection socket, then using its internal rules to direct the connection socket to the right server.

Socketbox is a replacement for the classic "inetd" daemon. It operates similarly to inetd in that it listens for connections and passes each newly accepted connection on to a program, but instead of spawning a new program, it sends the socket to an existing daemon using a Unix domain socket and the SCM_RIGHTS control message; the exact procedure is determined by a configuration file. For example, you can specify that sockets with a server IP address of 2001:db8::1 go to one program, and sockets with a server IP address of 2001:db8::2 go to another. Essentially, the socket demultiplexing is performed in user space rather than kernel space, in a way that can span multiple network namespaces.
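
The hand-off described above can be sketched in a few lines of Python. This is only an illustration of the SCM_RIGHTS mechanism, not socketbox's actual code: the names hand_off, take_over and demo_handoff are made up for this example, both ends live in the same process for brevity, the one-byte payload is an arbitrary choice, and socket.send_fds()/socket.recv_fds() require Python 3.9 or later.

```python
import socket


def hand_off(channel: socket.socket, conn: socket.socket) -> None:
    """Dispatcher side: pass an accepted connection over a Unix domain socket."""
    # The one-byte payload is just a carrier (an assumption of this sketch);
    # the descriptor itself travels in the SCM_RIGHTS control message.
    socket.send_fds(channel, [b"\0"], [conn.fileno()])
    conn.close()  # the in-flight copy keeps the connection open


def take_over(channel: socket.socket) -> socket.socket:
    """Server side: receive one descriptor and wrap it in a socket object."""
    _, fds, _, _ = socket.recv_fds(channel, 1, 1)
    return socket.socket(fileno=fds[0])


def demo_handoff():
    # Both ends live in one process here; normally they are separate daemons.
    dispatcher_end, server_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    listener = socket.create_server(("127.0.0.1", 0))
    client = socket.create_connection(listener.getsockname())
    conn, _ = listener.accept()
    wanted = conn.getsockname()  # what a dispatcher would match its rules against

    hand_off(dispatcher_end, conn)
    served = take_over(server_end)
    same_address = served.getsockname() == wanted

    client.sendall(b"ping")
    data = served.recv(4)
    for s in (client, served, listener, dispatcher_end, server_end):
        s.close()
    return same_address, data
```

The receiving side ends up with an ordinary connected TCP socket: getsockname() and getpeername() report the original addresses, which is why no proxying is involved.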

https://git2.peterjin.org/socketbox

Recall from the Notes about namespaces page that, even though a process can only be in one network namespace, it can still hold sockets or other file descriptors obtained from other namespaces using a Unix domain socket. This allows containers to have a different addressing scheme than the server IP address endpoints. For example, on a network with both global and unique local IPv6 addresses, the containers could be addressed with global addresses, whereas the server sockets would be restricted to the unique local addresses.

Socketbox can be used to enable multiple instances of virtually any TCP-based protocol. Servers tested include HTTP/S (web) servers, IMAP, SMTP, and SSH. For IMAP and SMTP, both Implicit TLS and STARTTLS modes were tested successfully.

Socketbox is not a proxy server. Instead, it is more of an application router, which routes incoming connections to applications based on the destination IP address and port number, except that (unlike nginx, Traefik, or HAProxy) it works on the socket layer directly, which can improve performance in many cases.

In addition, Socketbox can only route requests based on IP addresses, port numbers, or any information that was already present during the TCP three-way handshake. Routing based on application-layer data (such as the Host header in HTTP, SNI for SSL, or PROXY protocol headers) and SSL termination are out of scope for Socketbox; for those purposes, nginx and/or HAProxy should be used instead. Both of those applications could still be configured to receive sockets from Socketbox if desired (see Socketbox/Compatibility).

Socketbox also has no relation to "WebSocket", "socket.io", or similar web-based "socket" technologies. The "socket" in Socketbox refers to operating system sockets, and not web-based "sockets." It is a much lower-level tool that is used by server applications themselves, not by web applications.

History

Socketbox started out because I was running "throwaway boxes" in some Docker containers on one of my machines. Each container had a globally unique IPv6 address, an SSH server on that address, and a persistent home directory; they were called "throwaway" boxes because I could easily delete and recreate them if something bad happened.

The problem with this was that my ISP's IPv6 prefix delegation is dynamic, and if the prefix were to change, then I would have to delete and recreate all the containers again, as well as update all the containers' IP addresses.

ULAs were added to mitigate this problem, but as far as I knew, Docker did not allow more than one IPv6 prefix in a network (at that time, I did not know that you could have multiple prefixes in a Docker network). But I still had the ULA static route set up, so what could I do? Well, I could use the AnyIP trick to make all of those addresses locally bound.

And so it was. Initially, I modified the SSH server to run in inetd mode, then set up an inetd-like daemon that instead of listening on a TCP socket, listened on a Unix domain socket. Then, with some experimentation, I discovered that if the "AnyIP" trick is used on a wildcard socket, then the server will accept any IP address within that subnet, and the original IP address out of that entire subnet can still be retrieved using the getsockname() syscall on the socket returned by accept() as part of a normal TCP server flow. Using this method, along with the "geo" module in nginx to match the server IP address, I set up a very interesting reverse proxy for the SSH servers.

But there were some limitations. First off, the client IP address was never recorded in the ssh logs. While this is not too important on a local network, it does, of course, have implications on the public Internet. Nginx also had to be used as a middleman, so performance was not that great.

So what could I do? Initially, I had a program that created a socket bound to a ULA in the host network namespace, and then forwarded any sockets returned by accept() to another program in a different network namespace using the SCM_RIGHTS control message. However, there were still some limitations:

  • Sockets had to be bound to individual addresses rather than the wildcard address. This meant that I would need to run N programs and open N sockets if I were to have N ssh servers. Obviously, this wouldn't scale.
  • The server program had to have an "inetd mode" or similar. Due to complexities, most server programs don't have inetd mode anymore.

And so socketbox extends this concept without the above limitations. By having a "socket dispatch" routine in userspace, we can have a single "universal" socket that accepts connections for multiple destination IP/port combinations, and then forward them throughout the system by applying rules on the destination IP address. In addition, we have a "socketbox-preload" utility that modifies virtually any TCP server program to work seamlessly with socketbox's dispatch routine, so the "inetd mode" restriction no longer exists for the most part.

The issue with the Docker network prefixes no longer exists because I've since moved on to custom container software, which offers much more flexibility as to the network configuration. However, the ideas of AnyIP (which now includes IPv6 Things) and inter-netns sockets were still strong, so socketbox still has its uses.

The "B" protocol

The original socketbox protocol (now known as the "A" protocol) was rather straightforward: instead of calling bind() on TCP sockets, listening daemons would bind() a Unix domain datagram socket, and instead of calling accept(), they would receive connections through SCM_RIGHTS datagrams arriving on that bound Unix domain datagram socket.
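
As a rough sketch of that idea (not the actual socketbox or socketbox-preload code), the daemon's replacement for accept() is a receive on a bound datagram socket. The function names, the one-byte payload and the temporary directory below are assumptions made for illustration; the real datagram format is not described here.

```python
import array
import os
import socket
import tempfile


def send_connection(path: str, conn: socket.socket) -> None:
    """Dispatcher side: deliver conn to the daemon bound at path."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as out:
        rights = array.array("i", [conn.fileno()])
        # sendmsg() is used directly so the destination address can be given.
        out.sendmsg([b"\0"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, rights)], 0, path)


def accept_from(inbox: socket.socket) -> socket.socket:
    """Daemon side: stand-in for accept() on the bound datagram socket."""
    _, fds, _, _ = socket.recv_fds(inbox, 1, 1)
    return socket.socket(fileno=fds[0])


def demo_a_protocol() -> bytes:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.socket")
        inbox = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        inbox.bind(path)  # takes the place of bind() on a TCP address

        listener = socket.create_server(("127.0.0.1", 0))
        client = socket.create_connection(listener.getsockname())
        conn, _ = listener.accept()
        send_connection(path, conn)
        conn.close()

        served = accept_from(inbox)  # takes the place of accept()
        served.sendall(b"hello")
        reply = client.recv(5)
        for s in (served, client, listener, inbox):
            s.close()
        return reply
```

Because the daemon only ever waits on one datagram socket, it fits naturally into a poll/select/epoll loop.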

This sounded good in theory, was simple to implement, and also had the advantage that bind() never blocked and accept() would never fail with a permanent error (reasons why this is important can be seen below). It also mapped cleanly onto non-blocking servers with respect to poll/select/epoll. However, it did not work out very well in practice. For example, if I were to delegate the directories to untrusted containers, the following problems would result:

  • A malicious container could, instead of creating a socket, create a symlink to some other file on the filesystem. Although this is only effective if the target file is a SOCK_DGRAM socket, it was still a security issue. This was partially mitigated by putting socketbox in a chroot, but it did not protect against attacks between different delegated directories.
  • Similarly, because the containers had write access to the delegated directory, a malicious container could also fill up the entire delegated directory with inodes, thus resulting in a denial-of-service attack for other containers.
  • These issues were sort of fixed with a new paradigm where the socket would only be created as root, just before container creation, and then only a file descriptor for that socket would be inherited into the container, without access to the original delegated directory, but that had its own issues too. Most notably, a container could abuse that socket to send datagrams to arbitrary Unix domain datagram sockets on the host's network namespace, provided that they were bound to an abstract address. Even worse, for programs that use SCM_CREDENTIALS to check the identity of the peer, they would see that the socket was created by UID 0 in the initial user namespace, thus allowing possible system daemon compromise if the daemon relied on that. In addition, if the socket file were deleted, then it could not connect back again without restarting the container.
  • Finally, there were some rather obscure issues with using POSIX ACLs and setgid bits to control access to the delegated directories. Specifically, it was not very friendly with nested containers.
  • The original socketbox daemon also had issues of its own. The number of maps and rule sets were artificially limited due to memory and data structure constraints (it was mostly fixed-size arrays), and as with all other C programs, there could be out-of-bounds memory access related vulnerabilities.
  • There were also certain concerns regarding sending sockets with the IP_TRANSPARENT socket option kept enabled. The -z flag disables this socket option before the socket is sent to the daemon; however, in further experiments the sockets did not work with the option disabled, so it had to be kept enabled. Judging from the Linux source code, this socket option appears to provide only the same powers as the IP_FREEBIND option (which does not require CAP_NET_ADMIN), plus special treatment with regard to the ip6tables TPROXY target (i.e. IP_TRANSPARENT only has a privileged effect with the TPROXY target).

As a result of all of this, the "B" protocol was introduced. It is similar to the "A" protocol in that accepted TCP sockets are still sent and received using SCM_RIGHTS, but it slightly differs from the "A" protocol in a few respects:

  • The "B" protocol does not use a series of directories on the filesystem. Instead, it is a "back-to-back" system, where daemons in containers are clients that have to "register" with a host socketbox daemon, and the host socketbox will "stream" incoming TCP connections to that container.
  • The "B" protocol does authentication within the daemon, rather than with filesystem permissions. Currently, it's done with SO_PEERCRED along with the actual identity of the listening socket (the latter is useful for some container runtimes where all containers have the same UID/GID maps, including "privileged" containers without user namespaces).
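
For readers unfamiliar with SO_PEERCRED, the following sketch shows how a daemon might read the peer's credentials from a connected Unix domain socket on Linux. It covers only the SO_PEERCRED half of the check described above; peer_credentials, is_allowed and the allow-list are hypothetical names for this example and are not taken from python-socketbox.

```python
import socket
import struct

# struct ucred on Linux: pid_t pid; uid_t uid; gid_t gid;
_UCRED = struct.Struct("iII")


def peer_credentials(conn: socket.socket):
    """Return (pid, uid, gid) recorded for the peer of a Unix socket (Linux only)."""
    raw = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, _UCRED.size)
    return _UCRED.unpack(raw)


def is_allowed(conn: socket.socket, allowed_uids) -> bool:
    """Hypothetical policy: accept a registration only from listed UIDs."""
    _, uid, _ = peer_credentials(conn)
    return uid in allowed_uids
```

The kernel fills in these credentials at connect() (or socketpair()) time, so the client cannot forge them the way it could forge a path on the filesystem.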

However, the "B" protocol has its own issues. Most notably, it would violate the contracts for bind() and accept(), had we attempted to implement this in socketbox-preload; specifically, bind() would block until it receives a response from socketbox, and accept() could fail with a permanent error if the Unix domain socket connection to socketbox were broken. In response, to ensure the least amount of surprise, the "B" protocol would be supported along with the "A" protocol; specifically, the "B" protocol would be used for host-to-container communication, and the "A" protocol would be used for trusted communication within a single container (i.e. with the old version of socketbox running locally within the container, retrofitted as a "B" protocol client). A new program, "socketbox-inetd-p", would operate in exactly the same way as the previous "socketbox-inetd" would, but with the "B" protocol instead of the "A" protocol.

The reference implementation for a "B" protocol server can be found at https://git.peterjin.org/_/python-socketbox/. It is written in Python instead of C, as it was meant to run only on the host, where a Python runtime is more likely to be found (the old "A" protocol preferred a hierarchical approach, with instances running in intermediate containers).

The reference implementation is still "pre-alpha", as it was just written less than 12 hours before this text was written. Things to do:

  • Allow IPv4 sockets and Unix domain sockets as "incoming" sockets
  • Dynamic reconfiguration
  • Allow setting of ACLs on the Unix domain sockets (to reduce the attack surface, but this could also be accomplished by setting appropriate permissions on the directory containing it instead)

The terms "A protocol" and "B protocol" are arbitrary; these terms are not actually used in the socketbox code; they are here just to label and differentiate between the old ("A") and new ("B") protocols.

Use with ip6tables TPROXY

Socketbox supports ip6tables TPROXY as a means of allowing a single socket to listen on multiple addresses, thus eliminating the need to use poll() on multiple sockets.

As I understand it, what TPROXY appears to do is the following: For the purposes of socket demultiplexing and finding a listening socket, TPROXY overrides which address a daemon has to listen on to accept the connection[1]. For example, an incoming TCP SYN packet with destination IP/port [2001:db8::1]:80 would require that there be a listening socket on [::]:80 or [2001:db8::1]:80. If this SYN packet matched a TPROXY rule with --on-ip ::1 --on-port 80, then it would look for a listening socket on [::1]:80 instead. Multiple TCP/IP flows can be reduced to one listening address, thus allowing the socket to accept connections for any set of IPs or port numbers determined by iptables rules.

Dual-stack TPROXY

ip route add local 2001:db8::/64 dev lo
ip route add local 192.0.2.0/24 dev lo
ip6tables[-nft] -t mangle -A PREROUTING -d 2001:db8::/64 -p tcp -m multiport --dports 22,80,443 -j TPROXY --on-ip ::ffff:127.100.100.1 --on-port 1
iptables[-nft] -t mangle -A PREROUTING -d 192.0.2.0/24 -p tcp -m multiport --dports 22,80,443 -j TPROXY --on-ip 127.100.100.1 --on-port 1

Socketbox must be configured to bind to [::ffff:127.100.100.1]:1 with IPV6_TRANSPARENT set in order for this to work. You must use the IPv4-mapped IPv6 notation; an IPv4 socket, even with IP_TRANSPARENT set, will not work.

This works because the ::ffff:127.100.100.1 address can be interpreted from the perspective of the kernel as both an IPv6 and an IPv4 address.
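
One practical consequence is that, on such a dual-stack socket, IPv4 destinations show up in getsockname() as IPv4-mapped IPv6 addresses. A minimal sketch of normalizing them with Python's standard ipaddress module follows; the function name is made up for this example, and whether socketbox itself normalizes addresses this way is not stated here.

```python
import ipaddress


def normalize_local_address(host: str):
    """Turn '::ffff:a.b.c.d' into an IPv4Address; leave real IPv6 addresses alone."""
    addr = ipaddress.IPv6Address(host)
    mapped = addr.ipv4_mapped  # None unless the address is IPv4-mapped
    return mapped if mapped is not None else addr
```

For example, normalize_local_address("::ffff:192.0.2.7") gives the IPv4 address 192.0.2.7, while 2001:db8::1 is returned unchanged as an IPv6 address.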

Permissions for bound Unix domain sockets

This part used to have a guide where certain subdirectories of the socketbox root could be freely delegated to other containers using ACLs on the UID/GID maps. However, this method is now deprecated for two reasons:

  • Possibility of symlink attacks. For example, if a container had access to the 00001 directory underneath the socketbox root, then it could create a symlink to ../00002/example.socket instead of the normal socket and allow access to 00002/example.socket even if the IP address and port are behind a firewall.
  • Because the socketbox root directory is a shared tmpfs, a malicious container could fill up all the inodes on that filesystem, resulting in a denial of service.

There are a few ways to mitigate these vulnerabilities:

  • For container creation, we could create the Unix domain socket before we create the container, then allow the file descriptor of the socket to be inherited into the container, where a container-local socketbox could be created; we would not bind-mount the /run/socketbox/00001 directory in this case. This container-local socketbox only has privileges to its own container, so the set of sockets that it could access would be limited entirely to its own container. Keep in mind that when a container process has the option to drop privileges, the creation and binding of the socket is often done before the process drops them.
  • We could set up a second daemon that performs the following operation (taking into account TOCTOU-related issues):
    • Create a temporary directory inside the socketbox root which is writable by socketbox but not by the containers.
    • Whenever an incoming connection occurs, attempt to hard link the supposed destination socket file into the temporary directory.
    • Check that the newly created file is satisfactory (has correct UID/GID and/or is actually a Unix domain socket and not a symlink to another socket)
    • Use the newly created file instead of the original file path to send the accepted connection into the container.
  • For setups that would have a performance penalty as a result of the above operations (e.g. when using socketbox as an alternative to authbind or similar on a system without containers), socketbox could be limited to one security realm (i.e. processes with the same user ID, user namespace, and root directory are part of the same security realm)
  • The symlink issue could also be fixed by using the MS_NOSYMFOLLOW mount option (-F8 in ctrtool mount_seq) on the socketbox root directory (for kernels 5.10 and up).
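
The hard-link approach from the second mitigation could look roughly like the sketch below. It is not an existing socketbox feature: stage_socket, its parameters and the naming of the staged file are assumptions for illustration, and it relies on os.link() accepting follow_symlinks=False, which is the case on Linux.

```python
import os
import stat


def stage_socket(candidate: str, staging_dir: str, expected_uid: int) -> str:
    """Hard-link candidate into staging_dir, then validate the link itself.

    Checking the staged link (which the container cannot modify) instead of
    the original path avoids a check-then-use race on the delegated directory.
    """
    staged = os.path.join(staging_dir, f"{os.getpid()}-{os.path.basename(candidate)}")
    # follow_symlinks=False links a symlink itself rather than its target,
    # so a planted symlink is caught by the lstat() check below.
    os.link(candidate, staged, follow_symlinks=False)
    info = os.lstat(staged)
    if not stat.S_ISSOCK(info.st_mode) or info.st_uid != expected_uid:
        os.unlink(staged)
        raise PermissionError(f"{candidate} is not an acceptable socket")
    return staged
```

The caller would then send the accepted connection to the returned path and remove the staged link afterwards. Hard links only work within one filesystem, so the staging directory has to live on the same tmpfs as the delegated directories.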

The previous text written for this section is kept below for reference.

Socketbox requires a directory of sockets created by end daemons that it can access. To ensure security of that directory, the author recommends that this procedure be followed:

Let's say that the directory is /run/socketbox, and there are two containers, one with UID/GID map 100000-100999, and another with 101000-101999. The socketbox user has a dedicated user and group ID along with another group ID (socketbox-access).

We want to create two subdirectories in that root directory, one for each container: /run/socketbox/00001 and /run/socketbox/00002. Make the /run/socketbox directory mode 2755 with owner/group as root:socketbox-access. Make each of the 00001 and 00002 directories mode 2750 with owner/group root:socketbox-access, but also set an ACL such that the 00001 directory has read-write-execute privileges for UID 100000, and 00002 has read-write-execute for UID 101000.

mkdir -pm 2755 /run/socketbox
chown root:socketbox-access /run/socketbox

# The presence of the set-group-ID bit on /run/socketbox means that these
# directories will also be set-group-ID with a group of socketbox-access
mkdir -pm 2750 /run/socketbox/00001 /run/socketbox/00002

setfacl -m u:100000:rwx /run/socketbox/00001
setfacl -m u:101000:rwx /run/socketbox/00002

This is the safest in terms of security, because:

  • The 00001 and 00002 directories are restricted to three users: root, UID 100000 (the container's root user, rwx privileges), and socketbox-access (r-x privileges). This allows sockets to be created by the container's root user without restriction. The directory is set-group-ID, so any created sockets in that directory will have a group ID of socketbox-access; this works even though this group ID is outside of the container's GID map. The daemon chmods this to 0660, so that the socket is also accessible by socketbox (which has a supplementary group ID of socketbox-access).
  • From the perspective of the containers, the sockets will have a group of "nogroup" (i.e. outside the GID map), so they appear to have a mode of 0600 instead. Similarly, the 00001 and 00002 directories appear to have a mode of 0700 (read-write-execute only for the container's root user)
  • There is no write access for the directories for the socketbox-access group; this means that socketbox is free to access the sockets, but can't create new ones or delete existing ones.
  • All directories are owned by the host's root, so changing permissions from the containers isn't possible.

Essentially, the set-group-ID bit on the directory is used here as a means for a process in a container to specify access for a GID that is not otherwise mapped in a container's user namespace, without forcing it into "other".

How we use socketbox

A service (like my Matrix homeserver) in an isolated network namespace can be made to safely accept connections directly from the LAN, by using socketbox in the main network namespace (or another namespace that's directly connected to the LAN) to send connection sockets over to the server in the other network namespace.

As a means of preventing leaks, the server network that is part of our home LAN is on a completely isolated network namespace[2] which is not connected in any way to the rest of the LAN. However, when accessing servers locally, it would theoretically need to take the full path of the Internet, which is very inefficient. Socketbox can be used to safely "bridge" these two namespaces together such that they stay isolated in terms of routing policy, but the servers in the other network namespace can also be accessed from the main network namespace without any tunneling overhead. The server IP address is different, so we use split-horizon DNS to ensure that this configuration remains fully transparent to clients in the LAN.

The config used for this looks like this:

max-maps:100
max-rules:100
rule-set:0 size:100
ip:fd01:db8:1:40::/96 port:443 jump-map:2
ip:fd01:db8:1:40::/96 port:22 jump-map:3
ip:fd01:db8:1:40:0:64::/96 jump-map:1

map-set:1 size:100 match-lport:1
port:80 jump-unix:/nat64/nat64.sock
port:443 jump-unix:/nat64/nat64.sock
default:fail

map-set:2 size:100 match-lip:128
ip:fd01:db8:1:40::1:10 jump-unix:/vm7/sb-00443
ip:fd01:db8:1:40::1:11 jump-unix:/vm7/apache-00443
default:fail

map-set:3 size:100 match-lip:128
ip:fd01:db8:1:40::1:10 jump-unix:/vm7/sshd_socket
default:fail

The effect of all this is pretty interesting, since it gives off the illusion that the addresses all correspond to their own host:

$ ssh root@fd01:db8:1:40::1:10
-bash-5.0# exit
logout
Connection to fd01:db8:1:40::1:10 closed.
$ ssh root@fd01:db8:1:40::1:11
kex_exchange_identification: Connection closed by remote host
$ ssh root@fd01:db8:1:40::1:f
kex_exchange_identification: Connection closed by remote host
$ ssh root@fd01:db8:1:40::1:10
-bash-5.0# who
root     pts/0        2020-12-18 22:19 (fd01:db8:1:21:8d64:3353:1f4c:55e)
-bash-5.0# exit
logout
Connection to fd01:db8:1:40::1:10 closed.
$ ssh root@fd01:db8:1:40::1:0
kex_exchange_identification: Connection closed by remote host
$ wget https://[fd01:db8:1:40::1:f]
--2020-12-18 18:23:37--  https://[fd01:db8:1:40::1:f]/
Connecting to [fd01:db8:1:40::1:f]:443... connected.
Unable to establish SSL connection.
$ wget https://[fd01:db8:1:40::1:10]
--2020-12-18 18:23:38--  https://[fd01:db8:1:40::1:10]/
Connecting to [fd01:db8:1:40::1:10]:443... connected.
    ERROR: certificate common name ‘www2.peterjin.org’ doesn't match requested host name ‘fd01:db8:1:40::1:10’.
To connect to fd01:db8:1:40::1:10 insecurely, use `--no-check-certificate'.
$ wget https://[fd01:db8:1:40::1:11]
--2020-12-18 18:23:39--  https://[fd01:db8:1:40::1:11]/
Connecting to [fd01:db8:1:40::1:11]:443... connected.
    ERROR: certificate common name ‘apps-vm7-www.srv.peterjin.org’ doesn't match requested host name ‘fd01:db8:1:40::1:11’.
To connect to fd01:db8:1:40::1:11 insecurely, use `--no-check-certificate'.
$ wget https://[fd01:db8:1:40::1:12]
--2020-12-18 18:23:42--  https://[fd01:db8:1:40::1:12]/
Connecting to [fd01:db8:1:40::1:12]:443... connected.
Unable to establish SSL connection.
$ wget https://[fd01:db8:1:40::1:1f]
--2020-12-18 18:23:44--  https://[fd01:db8:1:40::1:1f]/
Connecting to [fd01:db8:1:40::1:1f]:443... connected.
Unable to establish SSL connection.
$ wget https://[fd01:db8:1:40::1:11]
--2020-12-18 18:23:47--  https://[fd01:db8:1:40::1:11]/
Connecting to [fd01:db8:1:40::1:11]:443... connected.
    ERROR: certificate common name ‘apps-vm7-www.srv.peterjin.org’ doesn't match requested host name ‘fd01:db8:1:40::1:11’.
To connect to fd01:db8:1:40::1:11 insecurely, use `--no-check-certificate'.
$ wget https://[fd01:db8:1:40::1234:5678]
--2020-12-18 18:23:53--  https://[fd01:db8:1:40::1234:5678]/
Connecting to [fd01:db8:1:40::1234:5678]:443... connected.
Unable to establish SSL connection.
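
To make the behaviour in the transcript easier to follow, here is a small Python model of how I would read the config above: the rule set picks a map based on destination prefix and port, and the map then picks a Unix socket path by full local IP (match-lip:128) or by local port (match-lport:1). This is only an interpretation that happens to agree with the transcript; the data layout and the dispatch function are invented for this example and are not socketbox's parser or matching code.

```python
import ipaddress
from typing import Optional

# (destination prefix, port or None for any port, map to jump to)
RULES = [
    (ipaddress.ip_network("fd01:db8:1:40::/96"), 443, 2),
    (ipaddress.ip_network("fd01:db8:1:40::/96"), 22, 3),
    (ipaddress.ip_network("fd01:db8:1:40:0:64::/96"), None, 1),
]

# map number -> (what the map is keyed on, entries); a missing key is default:fail
MAPS = {
    1: ("port", {80: "/nat64/nat64.sock", 443: "/nat64/nat64.sock"}),
    2: ("ip", {
        ipaddress.ip_address("fd01:db8:1:40::1:10"): "/vm7/sb-00443",
        ipaddress.ip_address("fd01:db8:1:40::1:11"): "/vm7/apache-00443",
    }),
    3: ("ip", {ipaddress.ip_address("fd01:db8:1:40::1:10"): "/vm7/sshd_socket"}),
}


def dispatch(local_ip: str, local_port: int) -> Optional[str]:
    """Return the Unix socket path for a connection, or None for 'fail'."""
    addr = ipaddress.ip_address(local_ip)
    for network, port, map_id in RULES:
        if addr in network and port in (None, local_port):
            keyed_on, entries = MAPS[map_id]
            key = local_port if keyed_on == "port" else addr
            return entries.get(key)
    return None  # assumption: connections matching no rule are dropped
```

Under this reading, an SSH connection to fd01:db8:1:40::1:10 is handed to /vm7/sshd_socket, whereas the same port on fd01:db8:1:40::1:11 falls through to default:fail. That matches the "Connection closed by remote host" lines in the transcript: the TCP handshake completes on the universal socket, and the connection is closed afterwards.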

Known bugs

  • The "B" protocol server cannot run in SO_REUSEPORT mode (the "A" protocol server could, but it was a coincidence). Due to Python's limited multithreading support (cf. the global interpreter lock), this may take a while to implement.
  • It is possible for a client to cause a denial of service on the socketbox server, by not receiving file descriptors from the server. This issue is relatively minor and in most cases, can be fixed by increasing socketbox's RLIMIT_NOFILE soft limit. This issue is not otherwise remotely exploitable if the container is well behaved.
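
For the file descriptor issue, a Python-based server could raise its own soft limit at startup along the lines of the sketch below. The function name and the target value are arbitrary choices for this example, and python-socketbox may or may not do anything like this.

```python
import resource


def raise_nofile_soft_limit(target: int) -> int:
    """Raise the soft RLIMIT_NOFILE toward target, never past the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    ceiling = target if hard == resource.RLIM_INFINITY else min(target, hard)
    new_soft = max(soft, ceiling)  # never lower an existing limit
    if new_soft != soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft
```

Calling raise_nofile_soft_limit(65536) early in the daemon would leave more headroom for descriptors that are still queued for a slow or unresponsive client. It only delays exhaustion and does not remove the underlying problem.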

To do list

  • Allow the main socketbox daemon to spawn programs on socket pairs, as if it were an init daemon.
  • Spoofing of getsockname() and getsockopt() (using the preload library) to trick socket-activated daemons into thinking that the inherited AF_UNIX/SOCK_DGRAM socket is an AF_INET[6]/SOCK_STREAM socket with a specified sockname.
  • Fix possible file descriptor leaks when using socketbox-preload on a multithreaded program and the thread that calls the fake accept() is cancelled.
  • Socketbox-inetd needs a limit as to how many child processes can be spawned. RLIMIT_NPROC does not suffice for all uses because there could be more child processes and even threads that could be created by the spawned program.
  • Go bindings.
  • Built-in (or separate) web server to universally redirect http to https and handle /.well-known/acme-challenge (to allow use of Certbot's webroot authenticator), perhaps using Bottle or something similar; i.e. a built-in version of Snippets:Nginx universal port 80 redirect.
  • To allow node.js net.createServer to use socketbox, we could block accept4() using an internal seccomp filter (the libuv library that node.js uses seems to call accept4() directly using syscall(), at least on my Ubuntu 20.04 machine; this appears to be fixed in the upstream version). With host: 'fe8f::x:y:y', the socket is correctly bound (if the socket did not exist before binding; otherwise a bogus EADDRINUSE is returned), but the connection listener does not work. A complete solution may require spoofing the SO_TYPE socket option entirely, and maybe even the sa_family field of getsockname(). The connection listener and retrieval of socket addresses (local and remote) do appear to be functional on my Debian 11 machine.

See also