A new feature expected to be merged for the upcoming Linux 7.0 kernel cycle adds an OPEN_TREE_NAMESPACE flag to the open_tree() system call. The flag can provide a nice performance win, with added security benefits, for anyone running a lot of containerized workloads on Linux.
Microsoft engineer Christian Brauner developed the OPEN_TREE_NAMESPACE functionality for open_tree() to make launching containers less wasteful: today a container runtime copies many mounts that are never needed and are destroyed almost immediately. Brauner elaborated in the late December patch series:
“When creating containers the setup usually involves using CLONE_NEWNS via clone3() or unshare(). This copies the caller’s complete mount namespace. The runtime will also assemble a new rootfs and then use pivot_root() to switch the old mount tree with the new rootfs. Afterward it will recursively umount the old mount tree thereby getting rid of all mounts.
On a basic system here where the mount table isn’t particularly large this still copies about 30 mounts. Copying all of these mounts only to get rid of them later is pretty wasteful.
This is exacerbated if intermediary mount namespaces are used that only exist for a very short amount of time and are immediately destroyed again causing a ton of mounts to be copied and destroyed needlessly.
With a large mount table and a system where thousands or ten-thousands of namespaces are spawned in parallel this quickly becomes a bottleneck increasing contention on the semaphore.
Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of returning a file descriptor referring to that mount tree OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor to a new mount namespace. In that new mount namespace the copied mount tree has been mounted on top of a copy of the real rootfs.
The caller can setns() into that mount namespace and perform any additionally setup such as move_mount()ing detached mounts in there.
This allows OPEN_TREE_NAMESPACE to function as a combined unshare(CLONE_NEWNS) and pivot_root().”
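To make the contrast concrete, here is a minimal C sketch of the two sequences described in the quote. It is an illustration under stated assumptions, not code from the patch series: the function names are hypothetical, the numeric value given to OPEN_TREE_NAMESPACE is a placeholder to be replaced by the definition in the kernel's UAPI headers once the flag is merged, and the exact set of flags that may be combined with it is not covered by the quote. Both paths need CAP_SYS_ADMIN and a single-threaded caller.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* ASSUMPTION: placeholder value for illustration only. Use the definition
 * from the kernel's <linux/mount.h> once the flag has actually landed. */
#ifndef OPEN_TREE_NAMESPACE
#define OPEN_TREE_NAMESPACE (1 << 1)
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif

/* Hypothetical helper: the traditional sequence from the quote. */
int enter_rootfs_classic(const char *new_root)
{
    /* Copies the caller's complete mount namespace. */
    if (unshare(CLONE_NEWNS) < 0)
        return -1;
    /* Keep later mount changes from propagating back to the parent. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
        return -1;
    /* pivot_root() requires new_root to be a mount point. */
    if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) < 0)
        return -1;
    if (chdir(new_root) < 0)
        return -1;
    /* pivot_root(".", ".") stacks the old root on top of the new one. */
    if (syscall(SYS_pivot_root, ".", ".") < 0)
        return -1;
    /* Lazily detach the old tree, throwing away every copied mount. */
    if (umount2(".", MNT_DETACH) < 0)
        return -1;
    return chdir("/");
}

/* Hypothetical helper: the new sequence, one open_tree() plus setns(). */
int enter_rootfs_open_tree(const char *new_root)
{
    /* Per the quote, this returns an fd for a new mount namespace that
     * contains only a copy of the indicated mount tree. */
    int nsfd = (int)syscall(SYS_open_tree, AT_FDCWD, new_root,
                            OPEN_TREE_NAMESPACE | OPEN_TREE_CLOEXEC);
    if (nsfd < 0)
        return -1; /* e.g. EINVAL on kernels that lack the flag */

    int ret = setns(nsfd, CLONE_NEWNS);
    close(nsfd);
    return ret;
}
```

The first helper issues six system calls and copies the whole mount table only to discard it; the second issues two (plus a close()) and copies only the tree it will keep. Any additional mounts, such as detached mounts attached with move_mount(), would be set up after the setns() call.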
Testing of the new functionality found it to be around 40% faster:
“With the older pivot_root() based method, I can create about 73k “containers” in 60s. With the newer open_tree() method, I can create about 109k in the same time. So it seems like the new method is roughly 40% faster than the older scheme (and a lot less syscalls too).”
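Taking the quoted counts at face value, the "roughly 40%" figure sits between two ways of reading the same numbers, depending on whether one measures throughput or time per container:

```latex
\[
\frac{109\,000}{73\,000} \approx 1.49
\quad\Rightarrow\quad \text{about 49\% more containers in the same 60 s}
\]
\[
1 - \frac{73\,000}{109\,000} \approx 0.33
\quad\Rightarrow\quad \text{about 33\% less time per container}
\]
```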
Beyond making container setup less wasteful and more efficient, OPEN_TREE_NAMESPACE is also expected to bring security benefits: it should block attacks in which the container’s root gets unmounted in an attempt to reach the underlying mounts.
The OPEN_TREE_NAMESPACE patches were queued a few days ago into the vfs-7.0.namespace branch of vfs/vfs.git. With the code now there, it will presumably be sent in for the upcoming Linux 6.20~7.0 kernel merge window, barring any last-minute issues.
