<p>Personal blog of Christian Brauner: Mounting into mount namespaces (2023-02-28), https://brauner.io/2023/02/28/mounting-into-mount-namespaces</p>
<h1 id="introduction">Introduction</h1>
<p>Early on when the <code class="language-plaintext highlighter-rouge">LXD</code> project was started we were clear that we wanted to make it possible to change settings while the container is running.
One of the very first things that came to our mind was making it possible to insert new mounts into a running container.
When I was still at Canonical working on <code class="language-plaintext highlighter-rouge">LXD</code> we quickly realized that inserting mounts into a running container would require a lot of creativity given the limitations of the api.</p>
<p>Back then the only way to create mounts or change mount options was by using the <code class="language-plaintext highlighter-rouge">mount(2)</code> system call.
The mount system call multiplexes a lot of different operations.
For example, it doesn’t just allow the creation of new filesystem mounts but also handles bind mounts and mount option changes.
Mounting is overall a pretty complex operation as it doesn’t just involve path lookup but also needs to handle mount propagation and filesystem specific and generic mount options.</p>
<p>I want to take a look at our legacy solution to this problem and at a new approach that I’ve used and that has existed for a while but has never been talked about widely.</p>
<h1 id="creative-uses-of-mount2">Creative uses of <code class="language-plaintext highlighter-rouge">mount(2)</code></h1>
<p>Before <code class="language-plaintext highlighter-rouge">openat2(2)</code> came along adding mounts to a container during startup was difficult because there was always the danger of symlink attacks.
A mount source or target path could be specified containing symlinks that would allow processes in the container to escape to the host filesystem.
These attacks used to be quite common and there was no straightforward solution available; at least not before the <code class="language-plaintext highlighter-rouge">RESOLVE_*</code> flag namespace of <code class="language-plaintext highlighter-rouge">openat2(2)</code> improved things so considerably that symlink attacks on new kernels can be effectively blocked.</p>
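<p>To illustrate what the <code class="language-plaintext highlighter-rouge">RESOLVE_*</code> flags buy us, here is a minimal sketch. The helper name <code class="language-plaintext highlighter-rouge">open_in_root()</code> and the choice of flags are my assumptions for illustration, not what any particular container manager does. It calls <code class="language-plaintext highlighter-rouge">openat2()</code> through <code class="language-plaintext highlighter-rouge">syscall()</code> so it doesn't depend on a libc wrapper being available:</p>

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/openat2.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: resolve @path as if @rootfd were the root directory.
 * Absolute symlinks and ".." components cannot lead outside of @rootfd.
 * Returns an O_PATH file descriptor on success, -1 with errno set on failure.
 */
static int open_in_root(int rootfd, const char *path)
{
	struct open_how how;

	memset(&how, 0, sizeof(how));
	how.flags = O_PATH | O_CLOEXEC;
	/* Scope resolution to @rootfd and refuse /proc style magic links. */
	how.resolve = RESOLVE_IN_ROOT | RESOLVE_NO_MAGICLINKS;

	return (int)syscall(SYS_openat2, rootfd, path, &how, sizeof(how));
}
```

<p>On kernels without <code class="language-plaintext highlighter-rouge">openat2()</code> the call fails with <code class="language-plaintext highlighter-rouge">ENOSYS</code> and a caller would have to fall back to verifying the path in userspace.</p>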
<p>But before <code class="language-plaintext highlighter-rouge">openat2()</code> symlink attacks when mounting could only be prevented with very careful coding and a rather elaborate algorithm.
I won’t go into too much detail but it is roughly done by verifying each path component in userspace using <code class="language-plaintext highlighter-rouge">O_PATH</code> file descriptors making sure that the paths point into the container’s rootfs.</p>
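<p>A heavily simplified sketch of such a userspace walk could look like the following. This is not the algorithm <code class="language-plaintext highlighter-rouge">LXD</code> used. The helper name <code class="language-plaintext highlighter-rouge">walk_beneath()</code> is made up, it only handles directories, and it refuses symlinks and <code class="language-plaintext highlighter-rouge">..</code> outright instead of resolving them safely:</p>

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: walk @path one component at a time starting at
 * @rootfd. Every component must be a real directory. Symlinks and ".."
 * are rejected. Returns an O_PATH fd for the last component or -1.
 */
static int walk_beneath(int rootfd, const char *path)
{
	char buf[PATH_MAX];
	char *component, *saveptr = NULL;
	int fd, next, saved_errno;

	if (strlen(path) >= sizeof(buf)) {
		errno = ENAMETOOLONG;
		return -1;
	}
	strcpy(buf, path);

	/* Work on a duplicate so the caller keeps ownership of @rootfd. */
	fd = fcntl(rootfd, F_DUPFD_CLOEXEC, 0);
	if (fd < 0)
		return -1;

	for (component = strtok_r(buf, "/", &saveptr); component;
	     component = strtok_r(NULL, "/", &saveptr)) {
		if (strcmp(component, ".") == 0)
			continue;

		if (strcmp(component, "..") == 0) {
			close(fd);
			errno = EXDEV;
			return -1;
		}

		/* O_NOFOLLOW | O_DIRECTORY makes a symlink component fail. */
		next = openat(fd, component,
			      O_PATH | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
		saved_errno = errno;
		close(fd);
		if (next < 0) {
			errno = saved_errno;
			return -1;
		}
		fd = next;
	}

	return fd;
}
```

<p>Because each step starts from a file descriptor that is already held open, swapping a component for a symlink after it has been opened no longer affects the walk.</p>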
<p>But even if you verified that the path is sane and you hold a file descriptor to the last component you still need to solve the problem that <code class="language-plaintext highlighter-rouge">mount(2)</code> only operates on paths.
So you are still susceptible to symlink attacks as soon as you call <code class="language-plaintext highlighter-rouge">mount(source, target, ...)</code>.</p>
<p>The way we solved this problem was by realizing that <code class="language-plaintext highlighter-rouge">mount(2)</code> was perfectly happy to operate on <code class="language-plaintext highlighter-rouge">/proc/self/fd/<nr></code> paths.
(This is similar to how <code class="language-plaintext highlighter-rouge">fexecve()</code> used to work before the addition of the <code class="language-plaintext highlighter-rouge">execveat()</code> system call.)
So we could verify the whole path and then open the last component of the source and target paths at which point we could call <code class="language-plaintext highlighter-rouge">mount("/proc/self/fd/1234", "/proc/self/fd/5678", ...)</code>.</p>
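<p>As a rough sketch of that trick, assuming the caller has already verified both paths and holds file descriptors for their last components, a bind mount could be created like this. The helper names <code class="language-plaintext highlighter-rouge">proc_fd_path()</code> and <code class="language-plaintext highlighter-rouge">bind_mount_by_fd()</code> are made up for illustration:</p>

```c
#include <errno.h>
#include <stdio.h>
#include <sys/mount.h>

/* Hypothetical helper: format the /proc/self/fd/<nr> path for @fd. */
static int proc_fd_path(char *buf, size_t size, int fd)
{
	int len;

	if (fd < 0) {
		errno = EBADF;
		return -1;
	}

	len = snprintf(buf, size, "/proc/self/fd/%d", fd);
	if (len < 0 || (size_t)len >= size) {
		errno = ENAMETOOLONG;
		return -1;
	}

	return 0;
}

/*
 * Hypothetical helper: bind mount whatever @fd_source refers to onto
 * whatever @fd_target refers to. mount(2) never sees the original paths,
 * only the /proc/self/fd/<nr> links for the already opened descriptors.
 */
static int bind_mount_by_fd(int fd_source, int fd_target)
{
	char source[64], target[64];

	if (proc_fd_path(source, sizeof(source), fd_source) < 0 ||
	    proc_fd_path(target, sizeof(target), fd_target) < 0)
		return -1;

	return mount(source, target, NULL, MS_BIND, NULL);
}
```

<p>This relies on <code class="language-plaintext highlighter-rouge">/proc</code> being mounted in the caller's mount namespace.</p>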
<p>We immediately thought that if <code class="language-plaintext highlighter-rouge">mount(2)</code> allows you to do that then we could easily use this to mount into namespaces.
So if the container is running in its mount namespace we could just create a bind mount on the host, open the newly created bind mount and then change to the container’s mount namespace (and its owning user namespace) and then simply call <code class="language-plaintext highlighter-rouge">mount("/proc/self/fd/1234", "/mnt", ...)</code>.
In pseudo C code it would look roughly like this:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fd_mnt</span> <span class="o">=</span> <span class="n">openat</span><span class="p">(</span><span class="o">-</span><span class="n">EBADF</span><span class="p">,</span> <span class="s">"/opt"</span><span class="p">,</span> <span class="n">O_PATH</span><span class="p">,</span> <span class="p">...);</span>
<span class="n">setns</span><span class="p">(</span><span class="n">fd_userns</span><span class="p">,</span> <span class="n">CLONE_NEWUSER</span><span class="p">);</span>
<span class="n">setns</span><span class="p">(</span><span class="n">fd_mntns</span><span class="p">,</span> <span class="n">CLONE_NEWNS</span><span class="p">);</span>
<span class="n">mount</span><span class="p">(</span><span class="s">"/proc/self/fd/fd_mnt"</span><span class="p">,</span> <span class="s">"/mnt"</span><span class="p">,</span> <span class="p">...);</span>
</code></pre></div></div>
<p>However, this isn’t possible as the kernel will enforce that the mounts that the source and target paths refer to are located in the caller’s mount namespace.
Since the caller will be located in the container’s mount namespace after the <code class="language-plaintext highlighter-rouge">setns()</code> call but the source file descriptor refers to a mount located in the host’s mount namespace this check fails.
The semantics behind this are somewhat sane and straightforward to understand so there was no need to change them even though we were tempted.
Back then it would’ve also meant that adding mounts to containers would’ve only worked on newer kernels and we were quite eager to enable this feature for kernels that were already released.</p>
<h1 id="mount-namespace-tunnels">Mount namespace tunnels</h1>
<p>So we came up with the idea of mount namespace tunnels.
Since we spearheaded this idea it has been picked up by various projects such as <code class="language-plaintext highlighter-rouge">systemd</code> for system services and its own <code class="language-plaintext highlighter-rouge">systemd-nspawn</code> container runtime.</p>
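<p>The general idea is based on the observation that mount propagation can be used to function like a tunnel between mount namespaces. If we set up a shared mount and then create a new mount namespace in which all mounts are dependent mounts:</p>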
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mount --bind /opt /opt
mount --make-private /opt
mount --make-shared /opt
# Create new mount namespace with all mounts turned into dependent mounts.
unshare --mount --propagation=slave
</code></pre></div></div>
<p>and then create a mount on or beneath the shared <code class="language-plaintext highlighter-rouge">/opt</code> mount on the host:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkdir /opt/a
mount --bind /tmp /opt/a
</code></pre></div></div>
<p>then the new mount of <code class="language-plaintext highlighter-rouge">/tmp</code> on the dentry <code class="language-plaintext highlighter-rouge">/opt/a</code> will propagate into the mount namespace we created earlier.
Since the <code class="language-plaintext highlighter-rouge">/opt</code> mount at the <code class="language-plaintext highlighter-rouge">/opt</code> dentry in the new mount namespace is a dependent mount we can now move the mount to its final location:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mount --move /opt/a /mnt
</code></pre></div></div>
<p>As a last step we can unmount <code class="language-plaintext highlighter-rouge">/opt/a</code> in the host mount namespace.
And as long as the <code class="language-plaintext highlighter-rouge">/mnt</code> dentry doesn’t reside on a mount that is a dependent mount of <code class="language-plaintext highlighter-rouge">/opt</code>’s peer group the unmount of <code class="language-plaintext highlighter-rouge">/opt/a</code> we just performed on the host will only unmount the mount in the host mount namespace.</p>
<p>There are various problems with this solution:</p>
<ul>
<li>It’s complex.</li>
<li>The container manager needs to set up the mount tunnel when the container starts.
In other words, it needs to be part of the architecture of the container which is always unfortunate.</li>
<li>The mount at the endpoint of the tunnel in the container needs to be protected from being unmounted.
Otherwise the container payload can just unmount the mount at its end of the mount tunnel and prevent the insertion of new mounts into the container.</li>
</ul>
<h1 id="mounting-into-mount-namespaces">Mounting into mount namespaces</h1>
<p>A few years ago a new mount api made it into the kernel.
Shortly after, I also added the <code class="language-plaintext highlighter-rouge">mount_setattr(2)</code> system call.
Since then I’ve been expanding the abilities of this api and putting it to its full use.</p>
<p>Unfortunately the adoption of the new mount api has been slow.
Mostly, because people don’t know about it or because they don’t yet see the many advantages it offers over the old one.
But with the next release of the <code class="language-plaintext highlighter-rouge">mount(8)</code> binary that a lot of us use, the new mount api will be used whenever possible.</p>
<p>I won’t be covering all the features that the mount api offers.
This post just illustrates how the new mount api makes it possible to mount into mount namespaces and lets us get rid of the complex mount propagation scheme.</p>
<p>Luckily, the new mount api is designed around file descriptors.</p>
<h2 id="filesystem-mounts">Filesystem Mounts</h2>
<p>To create a new filesystem mount using the old mount api is simple:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mount("/dev/sda", "/mnt", "xfs", ...);
</code></pre></div></div>
<p>We pass the source, target, and filesystem type and potentially additional mount options.
This single system call does a lot behind the scenes.
A new superblock will be allocated for the filesystem, mount options will be set, a new mount will be created and attached to a mountpoint in the caller’s mount namespace.</p>
<p>In the new mount api the various steps are split into separate system calls.
While this makes mounting more complex it also allows for greater flexibility.
Mounting doesn’t have to be a fast operation and never has been.</p>
<p>So in the new mount api we would create a new filesystem mount with the following steps:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Create a new filesystem context. */</span>
<span class="n">fd_fs</span> <span class="o">=</span> <span class="n">fsopen</span><span class="p">(</span><span class="s">"xfs"</span><span class="p">);</span>
<span class="cm">/*
* Set the source of the filesystem mount. Whether or not this is required
* depends on the type of filesystem of course. For example, mounting a tmpfs
* filesystem would not require us to set the "source" property as it's not
* backed by a block device.
*/</span>
<span class="n">fsconfig</span><span class="p">(</span><span class="n">fd_fs</span><span class="p">,</span> <span class="n">FSCONFIG_SET_STRING</span><span class="p">,</span> <span class="s">"source"</span><span class="p">,</span> <span class="s">"/dev/sda"</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="cm">/* Actually create the superblock and prepare to allocate a mount. */</span>
<span class="n">fsconfig</span><span class="p">(</span><span class="n">fd_fs</span><span class="p">,</span> <span class="n">FSCONFIG_CMD_CREATE</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">fd_fs</code> file descriptor refers to a VFS context object that doesn’t concern us here.
Let it suffice that it is an opaque object that can only be used to configure
the superblock and the filesystem until <code class="language-plaintext highlighter-rouge">fsmount()</code> is called:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Create a new detached mount and return an O_PATH file descriptor referring to the mount. */</span>
<span class="n">fd_mnt</span> <span class="o">=</span> <span class="n">fsmount</span><span class="p">(</span><span class="n">fd_fs</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">fsmount()</code> call will turn the context file descriptor into an <code class="language-plaintext highlighter-rouge">O_PATH</code> file descriptor that refers to a detached mount.
A detached mount is a mount that isn’t attached to any mount namespace.</p>
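<p>Put together, and with error handling added, the steps above could be sketched as follows. I'm using <code class="language-plaintext highlighter-rouge">tmpfs</code> here because it needs no block device, and I'm going through <code class="language-plaintext highlighter-rouge">syscall()</code> on the assumption that the libc in use may not ship wrappers for these system calls. The helper name <code class="language-plaintext highlighter-rouge">detached_tmpfs()</code> is made up:</p>

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a detached tmpfs mount limited to @size
 * (e.g. "64m"). Returns an O_PATH fd referring to the detached mount or
 * -1 with errno set. The caller needs CAP_SYS_ADMIN in the user namespace
 * that owns its mount namespace.
 */
static int detached_tmpfs(const char *size)
{
	int fd_fs, fd_mnt = -1, saved_errno;

	/* Create a new filesystem context. */
	fd_fs = (int)syscall(SYS_fsopen, "tmpfs", FSOPEN_CLOEXEC);
	if (fd_fs < 0)
		return -1;

	/* Configure the context, create the superblock, create the mount. */
	if (syscall(SYS_fsconfig, fd_fs, FSCONFIG_SET_STRING, "size", size, 0) == 0 &&
	    syscall(SYS_fsconfig, fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0) == 0)
		fd_mnt = (int)syscall(SYS_fsmount, fd_fs, FSMOUNT_CLOEXEC, 0);

	/* The context is no longer needed once the mount exists. */
	saved_errno = errno;
	close(fd_fs);
	errno = saved_errno;

	return fd_mnt;
}
```

<p>Closing the returned file descriptor without ever attaching the mount simply makes the detached mount go away again.</p>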
<h2 id="bind-mounts">Bind Mounts</h2>
<p>The old mount api created bind mounts via:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mount</span><span class="p">(</span><span class="s">"/opt"</span><span class="p">,</span> <span class="s">"/mnt"</span><span class="p">,</span> <span class="n">MS_BIND</span><span class="p">,</span> <span class="p">...)</span>
</code></pre></div></div>
<p>and recursive bind mounts via:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mount</span><span class="p">(</span><span class="s">"/opt"</span><span class="p">,</span> <span class="s">"/mnt"</span><span class="p">,</span> <span class="n">MS_BIND</span> <span class="o">|</span> <span class="n">MS_REC</span><span class="p">,</span> <span class="p">...)</span>
</code></pre></div></div>
<p>Most people however will be more familiar with <code class="language-plaintext highlighter-rouge">mount(8)</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mount --bind /opt /mnt
mount --rbind / /mnt
</code></pre></div></div>
<p>Bind mounts play a major role in container runtimes and system services as run by <code class="language-plaintext highlighter-rouge">systemd</code>.</p>
<p>The new mount api supports bind mounts through the <code class="language-plaintext highlighter-rouge">open_tree()</code> system call.
Calling <code class="language-plaintext highlighter-rouge">open_tree()</code> on an existing mount will just return an <code class="language-plaintext highlighter-rouge">O_PATH</code> file descriptor referring to that mount.
But if <code class="language-plaintext highlighter-rouge">OPEN_TREE_CLONE</code> is specified <code class="language-plaintext highlighter-rouge">open_tree()</code> will create a detached mount and return an <code class="language-plaintext highlighter-rouge">O_PATH</code> file descriptor.
That file descriptor is indistinguishable from an <code class="language-plaintext highlighter-rouge">O_PATH</code> file descriptor returned from the earlier <code class="language-plaintext highlighter-rouge">fsmount()</code> example:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fd_mnt</span> <span class="o">=</span> <span class="n">open_tree</span><span class="p">(</span><span class="o">-</span><span class="n">EBADF</span><span class="p">,</span> <span class="s">"/opt"</span><span class="p">,</span> <span class="n">OPEN_TREE_CLONE</span><span class="p">,</span> <span class="p">...)</span>
</code></pre></div></div>
<p>creates a new detached mount of <code class="language-plaintext highlighter-rouge">/opt</code> and:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fd_mnt</span> <span class="o">=</span> <span class="n">open_tree</span><span class="p">(</span><span class="o">-</span><span class="n">EBADF</span><span class="p">,</span> <span class="s">"/"</span><span class="p">,</span> <span class="n">OPEN_TREE_CLONE</span> <span class="o">|</span> <span class="n">AT_RECURSIVE</span><span class="p">,</span> <span class="p">...)</span>
</code></pre></div></div>
<p>would create a new detached copy of the whole rootfs mount tree.</p>
<h3 id="attaching-detached-mounts">Attaching detached mounts</h3>
<p>As mentioned before the file descriptor returned from <code class="language-plaintext highlighter-rouge">fsmount()</code> and <code class="language-plaintext highlighter-rouge">open_tree(OPEN_TREE_CLONE)</code> refers to a detached mount in both cases.
The mount it refers to doesn’t appear anywhere in the filesystem hierarchy.
Consequently, the mount can’t be found by lookup operations going through the filesystem hierarchy.
The new mount api thus provides an elegant mechanism for:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mount</span><span class="p">(</span><span class="s">"/opt"</span><span class="p">,</span> <span class="s">"/mnt"</span><span class="p">,</span> <span class="n">MS_BIND</span><span class="p">,</span> <span class="p">...);</span>
<span class="n">fd_mnt</span> <span class="o">=</span> <span class="n">openat</span><span class="p">(</span><span class="o">-</span><span class="n">EBADF</span><span class="p">,</span> <span class="s">"/mnt"</span><span class="p">,</span> <span class="n">O_PATH</span> <span class="o">|</span> <span class="n">O_DIRECTORY</span> <span class="o">|</span> <span class="n">O_CLOEXEC</span><span class="p">,</span> <span class="p">...);</span>
<span class="n">umount2</span><span class="p">(</span><span class="s">"/mnt"</span><span class="p">,</span> <span class="n">MNT_DETACH</span><span class="p">);</span>
</code></pre></div></div>
<p>and with the added benefit that the mount never actually had to appear anywhere in the filesystem hierarchy and thus never had to belong to any mount namespace.
This alone is already a very powerful tool but we won’t go into depth today.</p>
<p>Most of the time a detached mount isn’t wanted however.
Usually we want to make the mount visible in the filesystem hierarchy so other users or programs can access it.
So we need to attach them to the filesystem hierarchy.</p>
<p>In order to attach a mount we can use the <code class="language-plaintext highlighter-rouge">move_mount()</code> system call.
For example, to attach the detached mount <code class="language-plaintext highlighter-rouge">fd_mnt</code> we created before we can use:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">move_mount</span><span class="p">(</span><span class="n">fd_mnt</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="o">-</span><span class="n">EBADF</span><span class="p">,</span> <span class="s">"/mnt"</span><span class="p">,</span> <span class="n">MOVE_MOUNT_F_EMPTY_PATH</span><span class="p">);</span>
</code></pre></div></div>
<p>This will attach the detached mount of <code class="language-plaintext highlighter-rouge">/opt</code> at the <code class="language-plaintext highlighter-rouge">/mnt</code> dentry on the <code class="language-plaintext highlighter-rouge">/</code> mount.
What this means is that the <code class="language-plaintext highlighter-rouge">/opt</code> mount will be inserted into the mount namespace that the caller is located in at the time of calling <code class="language-plaintext highlighter-rouge">move_mount()</code>.
(The kernel has very tight semantics here. For example, it will enforce that the caller has <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code> in the owning user namespace of its mount namespace.
It will also enforce that the mount the <code class="language-plaintext highlighter-rouge">/mnt</code> dentry is located on belongs to the same mount namespace as the caller.)</p>
<p>After <code class="language-plaintext highlighter-rouge">move_mount()</code> returns the mount is permanently attached.
Even if it is unmounted while still pinned by a file descriptor it will still belong to the mount namespace it was attached to.
In other words, <code class="language-plaintext highlighter-rouge">move_mount()</code> is an irreversible operation.</p>
<p>The main point is that before <code class="language-plaintext highlighter-rouge">move_mount()</code> is called a detached mount doesn’t belong to any mount namespace and can thus be freely moved around.</p>
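<p>The attach step can be wrapped in a small helper as well. This is only a sketch with a made-up name, again going through <code class="language-plaintext highlighter-rouge">syscall()</code> on the assumption that the libc may not provide a wrapper:</p>

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: attach the detached mount referred to by @fd_mnt
 * at @target in the caller's current mount namespace. Returns 0 on
 * success and -1 with errno set on failure.
 */
static int attach_detached_mount(int fd_mnt, const char *target)
{
	/* The empty source path means "the mount @fd_mnt itself refers to". */
	return (int)syscall(SYS_move_mount, fd_mnt, "", AT_FDCWD, target,
			    MOVE_MOUNT_F_EMPTY_PATH);
}
```

<p>Which mount namespace the mount ends up in is decided purely by where the caller is located when this helper runs, not by where the detached mount was created.</p>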
<h2 id="mounting-a-new-filesystem-into-a-mount-namespace">Mounting a new filesystem into a mount namespace</h2>
<p>To mount a filesystem into a new mount namespace we can make use of the split between configuring a filesystem context and creating a new superblock and actually attaching the mount to the filesystem hierarchy:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fd_fs</span> <span class="o">=</span> <span class="n">fsopen</span><span class="p">(</span><span class="s">"xfs"</span><span class="p">);</span>
<span class="n">fsconfig</span><span class="p">(</span><span class="n">fd_fs</span><span class="p">,</span> <span class="n">FSCONFIG_SET_STRING</span><span class="p">,</span> <span class="s">"source"</span><span class="p">,</span> <span class="s">"/dev/sda"</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">fsconfig</span><span class="p">(</span><span class="n">fd_fs</span><span class="p">,</span> <span class="n">FSCONFIG_CMD_CREATE</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">fd_mnt</span> <span class="o">=</span> <span class="n">fsmount</span><span class="p">(</span><span class="n">fd_fs</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>
<p>For filesystems that require host privileges such as <code class="language-plaintext highlighter-rouge">xfs</code>, <code class="language-plaintext highlighter-rouge">ext4</code>, or <code class="language-plaintext highlighter-rouge">btrfs</code> (and many others) these steps can be performed by a container or pod manager with sufficient privileges.
However, once we have created a detached mount we are free to attach it to whatever mount and mountpoint we have privilege over in the target mount namespace.
So we can simply attach to the user namespace and mount namespace of the container:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">setns</span><span class="p">(</span><span class="n">fd_userns</span><span class="p">,</span> <span class="n">CLONE_NEWUSER</span><span class="p">);</span>
<span class="n">setns</span><span class="p">(</span><span class="n">fd_mntns</span><span class="p">,</span> <span class="n">CLONE_NEWNS</span><span class="p">);</span>
</code></pre></div></div>
<p>and then use</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">move_mount</span><span class="p">(</span><span class="n">fd_mnt</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="o">-</span><span class="n">EBADF</span><span class="p">,</span> <span class="s">"/mnt"</span><span class="p">,</span> <span class="n">MOVE_MOUNT_F_EMPTY_PATH</span><span class="p">);</span>
</code></pre></div></div>
<p>to attach the detached mount anywhere we like in the container.</p>
<h2 id="mounting-a-new-bind-mount-into-a-mount-namespace">Mounting a new bind mount into a mount namespace</h2>
<p>A bind mount is even simpler.
If we want to share a specific host directory with the container we can just have the container manager call:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fd_mnt</span> <span class="o">=</span> <span class="n">open_tree</span><span class="p">(</span><span class="o">-</span><span class="n">EBADF</span><span class="p">,</span> <span class="s">"/opt"</span><span class="p">,</span> <span class="n">OPEN_TREE_CLOEXEC</span> <span class="o">|</span> <span class="n">OPEN_TREE_CLONE</span><span class="p">);</span>
</code></pre></div></div>
<p>to allocate a new detached copy of the mount and then attach to the user and mount namespace of the container:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">setns</span><span class="p">(</span><span class="n">fd_userns</span><span class="p">,</span> <span class="n">CLONE_NEWUSER</span><span class="p">);</span>
<span class="n">setns</span><span class="p">(</span><span class="n">fd_mntns</span><span class="p">,</span> <span class="n">CLONE_NEWNS</span><span class="p">);</span>
</code></pre></div></div>
<p>and as above we are free to attach the detached mount anywhere we like in the container.</p>
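<p>For completeness, here is a sketch of the whole sequence for a bind mount. The helper names and the use of <code class="language-plaintext highlighter-rouge">/proc/&lt;pid&gt;/ns/*</code> to find the namespaces are my assumptions for illustration. It also assumes that the container runs in its own user namespace, and since <code class="language-plaintext highlighter-rouge">setns()</code> changes the namespaces of the calling process it should be run in a forked, single-threaded helper process rather than in the manager itself:</p>

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/mount.h>
#include <sched.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open /proc/<pid>/ns/<name>. */
static int open_ns(pid_t pid, const char *name)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/%d/ns/%s", (int)pid, name);
	return open(path, O_RDONLY | O_CLOEXEC);
}

/*
 * Hypothetical helper: bind mount the host path @source at @target inside
 * the mount namespace of the container process @pid. Returns 0 on success
 * and -1 with errno set on failure.
 */
static int inject_bind_mount(pid_t pid, const char *source, const char *target)
{
	int fd_userns, fd_mntns = -1, fd_mnt = -1, ret = -1, saved_errno;

	fd_userns = open_ns(pid, "user");
	if (fd_userns < 0)
		goto out;

	fd_mntns = open_ns(pid, "mnt");
	if (fd_mntns < 0)
		goto out;

	/* Create the detached mount while still in the host's namespaces. */
	fd_mnt = (int)syscall(SYS_open_tree, AT_FDCWD, source,
			      OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	if (fd_mnt < 0)
		goto out;

	/* Attach to the container's user and mount namespace. */
	if (setns(fd_userns, CLONE_NEWUSER) < 0 ||
	    setns(fd_mntns, CLONE_NEWNS) < 0)
		goto out;

	/* @target is now resolved inside the container. */
	ret = (int)syscall(SYS_move_mount, fd_mnt, "", AT_FDCWD, target,
			   MOVE_MOUNT_F_EMPTY_PATH);

out:
	saved_errno = errno;
	if (fd_mnt >= 0)
		close(fd_mnt);
	if (fd_mntns >= 0)
		close(fd_mntns);
	if (fd_userns >= 0)
		close(fd_userns);
	errno = saved_errno;

	return ret;
}
```

<p>The filesystem case from the previous section works the same way, with the <code class="language-plaintext highlighter-rouge">open_tree()</code> call replaced by the <code class="language-plaintext highlighter-rouge">fsopen()</code>, <code class="language-plaintext highlighter-rouge">fsconfig()</code>, and <code class="language-plaintext highlighter-rouge">fsmount()</code> sequence.</p>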
<h1 id="conclusion">Conclusion</h1>
<p>This is really it and as simple as it sounds.
It is a powerful delegation mechanism making it possible to inject mounts into lesser privileged mount namespaces or unprivileged containers.
We’ve been making heavy use of this in <code class="language-plaintext highlighter-rouge">LXD</code> and it is in general the proper way to insert mounts into mount namespaces on newer kernels.</p>
<h1 id="contributing-fixes-improvements-or-corrections-to-this-post">Contributing fixes, improvements, or corrections to this post</h1>
<p>If you have fixes, improvements, or corrections feel free to email them to me or simply open a pull request against <a href="https://github.com/brauner/brauner.github.io">the repository for this blog</a>.</p>
<p>Christian Brauner: An excursion into a mount propagation bug (2023-01-05), https://brauner.io/2023/01/05/mount-propagation-bug</p>
<h1 id="introduction">Introduction</h1>
<p>At the end of 2022 we received a bug report about the following splat:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[ 115.848393] BUG: kernel NULL pointer dereference, address: 0000000000000010
[ 115.848967] #PF: supervisor read access in kernel mode
[ 115.849386] #PF: error_code(0x0000) - not-present page
[ 115.849803] PGD 0 P4D 0
[ 115.850012] Oops: 0000 [#1] PREEMPT SMP PTI
[ 115.850354] CPU: 0 PID: 15591 Comm: mount Not tainted 6.1.0-rc7 #3
[ 115.850851] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
VirtualBox 12/01/2006
[ 115.851510] RIP: 0010:propagate_one.part.0+0x7f/0x1a0
[ 115.851924] Code: 75 eb 4c 8b 05 c2 25 37 02 4c 89 ca 48 8b 4a 10
49 39 d0 74 1e 48 3b 81 e0 00 00 00 74 26 48 8b 92 e0 00 00 00 be 01
00 00 00 <48> 8b 4a 10 49 39 d0 75 e2 40 84 f6 74 38 4c 89 05 84 25 37
02 4d
[ 115.853441] RSP: 0018:ffffb8d5443d7d50 EFLAGS: 00010282
[ 115.853865] RAX: ffff8e4d87c41c80 RBX: ffff8e4d88ded780 RCX: ffff8e4da4333a00
[ 115.854458] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8e4d88ded780
[ 115.855044] RBP: ffff8e4d88ded780 R08: ffff8e4da4338000 R09: ffff8e4da43388c0
[ 115.855693] R10: 0000000000000002 R11: ffffb8d540158000 R12: ffffb8d5443d7da8
[ 115.856304] R13: ffff8e4d88ded780 R14: 0000000000000000 R15: 0000000000000000
[ 115.856859] FS: 00007f92c90c9800(0000) GS:ffff8e4dfdc00000(0000)
knlGS:0000000000000000
[ 115.857531] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 115.858006] CR2: 0000000000000010 CR3: 0000000022f4c002 CR4: 00000000000706f0
[ 115.858598] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 115.859393] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 115.860099] Call Trace:
[ 115.860358] <TASK>
[ 115.860535] propagate_mnt+0x14d/0x190
[ 115.860848] attach_recursive_mnt+0x274/0x3e0
[ 115.861212] path_mount+0x8c8/0xa60
[ 115.861503] __x64_sys_mount+0xf6/0x140
[ 115.861819] do_syscall_64+0x5b/0x80
[ 115.862117] ? do_faccessat+0x123/0x250
[ 115.862435] ? syscall_exit_to_user_mode+0x17/0x40
[ 115.862826] ? do_syscall_64+0x67/0x80
[ 115.863133] ? syscall_exit_to_user_mode+0x17/0x40
[ 115.863527] ? do_syscall_64+0x67/0x80
[ 115.863835] ? do_syscall_64+0x67/0x80
[ 115.864144] ? do_syscall_64+0x67/0x80
[ 115.864452] ? exc_page_fault+0x70/0x170
[ 115.864775] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 115.865187] RIP: 0033:0x7f92c92b0ebe
[ 115.865480] Code: 48 8b 0d 75 4f 0c 00 f7 d8 64 89 01 48 83 c8 ff
c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 42 4f 0c 00 f7 d8 64 89
01 48
[ 115.866984] RSP: 002b:00007fff000aa728 EFLAGS: 00000246 ORIG_RAX:
00000000000000a5
[ 115.867607] RAX: ffffffffffffffda RBX: 000055a77888d6b0 RCX: 00007f92c92b0ebe
[ 115.868240] RDX: 000055a77888d8e0 RSI: 000055a77888e6e0 RDI: 000055a77888e620
[ 115.868823] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[ 115.869403] R10: 0000000000001000 R11: 0000000000000246 R12: 000055a77888e620
[ 115.869994] R13: 000055a77888d8e0 R14: 00000000ffffffff R15: 00007f92c93e4076
[ 115.870581] </TASK>
[ 115.870763] Modules linked in: nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6
nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink qrtr snd_intel8x0
sunrpc snd_ac97_codec ac97_bus snd_pcm snd_timer intel_rapl_msr
intel_rapl_common snd vboxguest intel_powerclamp video rapl joydev
soundcore i2c_piix4 wmi fuse zram xfs vmwgfx crct10dif_pclmul
crc32_pclmul crc32c_intel polyval_clmulni polyval_generic
drm_ttm_helper ttm e1000 ghash_clmulni_intel serio_raw ata_generic
pata_acpi scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
[ 115.875288] CR2: 0000000000000010
[ 115.875641] ---[ end trace 0000000000000000 ]---
[ 115.876135] RIP: 0010:propagate_one.part.0+0x7f/0x1a0
[ 115.876551] Code: 75 eb 4c 8b 05 c2 25 37 02 4c 89 ca 48 8b 4a 10
49 39 d0 74 1e 48 3b 81 e0 00 00 00 74 26 48 8b 92 e0 00 00 00 be 01
00 00 00 <48> 8b 4a 10 49 39 d0 75 e2 40 84 f6 74 38 4c 89 05 84 25 37
02 4d
[ 115.878086] RSP: 0018:ffffb8d5443d7d50 EFLAGS: 00010282
[ 115.878511] RAX: ffff8e4d87c41c80 RBX: ffff8e4d88ded780 RCX: ffff8e4da4333a00
[ 115.879128] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8e4d88ded780
[ 115.879715] RBP: ffff8e4d88ded780 R08: ffff8e4da4338000 R09: ffff8e4da43388c0
[ 115.880359] R10: 0000000000000002 R11: ffffb8d540158000 R12: ffffb8d5443d7da8
[ 115.880962] R13: ffff8e4d88ded780 R14: 0000000000000000 R15: 0000000000000000
[ 115.881548] FS: 00007f92c90c9800(0000) GS:ffff8e4dfdc00000(0000)
knlGS:0000000000000000
[ 115.882234] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 115.882713] CR2: 0000000000000010 CR3: 0000000022f4c002 CR4: 00000000000706f0
[ 115.883314] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 115.883966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
</code></pre></div></div>
<p>The bug could be reproduced, albeit unreliably, by running the mount propagation test of the <a href="https://github.com/linux-test-project/ltp">LTP</a> testsuite in a loop while simultaneously creating various network namespaces with the <code class="language-plaintext highlighter-rouge">ip netns</code> command.
When we started debugging this it turned out that the interesting aspect of <code class="language-plaintext highlighter-rouge">ip netns</code> for this bug was that it persisted a network namespace by bind-mounting it in a separate mount namespace.</p>
<p>It turned out that the reliability of the reproducer could be significantly increased by using</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unshare --mount --propagation=unchanged -- mount --make-rslave /
</code></pre></div></div>
<p>while using <a href="https://github.com/linux-test-project/ltp/blob/af98698067f706feeb1729e038eef9aefc12760c/testcases/kernel/fs/fs_bind/bind/fs_bind24.sh#L4">a specific test</a> of the LTP mount propagation testsuite, modifying it slightly so that we loop around <code class="language-plaintext highlighter-rouge">mount</code> and <code class="language-plaintext highlighter-rouge">umount</code> in the script.</p>
<p>In previous years <a href="https://github.com/sforshee">Seth Forshee</a> and I had reported and fixed issues in the mount propagation code.
Each of these bugs had been hard to understand but only required trivial patches in order to be fixed.
My expectation was no different for this bug.</p>
<p>The mount propagation code uses the now obsolete “slave” and “master” concepts to express dependency relationships.
As the data structures themselves use these terms they are used here as well.</p>
<h2 id="basic-mount-propagation-concepts">Basic mount propagation concepts</h2>
<p>The <code class="language-plaintext highlighter-rouge">propagate_mnt()</code> function handles mount propagation when creating mounts.
It propagates a source mount tree <code class="language-plaintext highlighter-rouge">@source_mnt</code> to all applicable nodes of a destination propagation tree headed by the destination mount <code class="language-plaintext highlighter-rouge">@dest_mnt</code>.</p>
<p>While fixing this bug we’ve gotten confused multiple times due to unclear terminology or missing concepts.
So we’ll start this with some clarifications:</p>
<ul>
<li>
<p>The terms “master” or “peer” denote a shared mount.
A shared mount belongs to a peer group.</p>
</li>
<li>
<p>A peer group is a set of shared mounts that propagate to each other.
It is identified by a peer group id. The peer group id is available in <code class="language-plaintext highlighter-rouge">@shared_mnt->mnt_group_id</code>.
Shared mounts within the same peer group have the same peer group id.
The peers in a peer group can be reached via <code class="language-plaintext highlighter-rouge">@shared_mnt->mnt_share</code>.</p>
</li>
<li>
<p>The terms “slave mount” or “dependent mount” denote a mount that receives propagation from a peer in a peer group.
Thus, shared mounts may have slave mounts and slave mounts have shared mounts as their master.
Slave mounts of a given peer in a peer group are listed on that peer’s slave list available at <code class="language-plaintext highlighter-rouge">@shared_mnt->mnt_slave_list</code>.</p>
</li>
<li>
<p>The term “master mount” denotes a mount in a peer group.
In other words, it denotes a shared mount or a peer mount in a peer group.
The term “master mount” - or “master” for short - is mostly used when talking in the context of slave mounts that receive propagation from a master mount.
A master mount of a slave identifies the closest peer group a slave mount receives propagation from.
The master mount of a slave can be identified via <code class="language-plaintext highlighter-rouge">@slave_mount->mnt_master</code>.
Different slaves may point to different masters in the same peer group.</p>
</li>
<li>
<p>Multiple peers in a peer group can have non-empty <code class="language-plaintext highlighter-rouge">->mnt_slave_list</code>s.
Non-empty <code class="language-plaintext highlighter-rouge">->mnt_slave_list</code>s of peers don’t intersect.
Consequently, to ensure all slave mounts of a peer group are visited the <code class="language-plaintext highlighter-rouge">->mnt_slave_list</code>s of all peers in a peer group have to be walked.</p>
</li>
<li>
<p>Slave mounts point to a peer in the closest peer group they receive propagation from via <code class="language-plaintext highlighter-rouge">@slave_mnt->mnt_master</code> (see above).
Together with these peers they form a propagation group (see below).
The closest peer group can thus be identified through the peer group id <code class="language-plaintext highlighter-rouge">@slave_mnt->mnt_master->mnt_group_id</code> of the peer/master that a slave mount receives propagation from.</p>
</li>
<li>
<p>A shared-slave mount is a slave mount to a peer group <code class="language-plaintext highlighter-rouge">pg1</code> while also a peer in another peer group <code class="language-plaintext highlighter-rouge">pg2</code>.
This simply means that a peer group may receive propagation from another peer group.</p>
<p>If a peer group <code class="language-plaintext highlighter-rouge">pg2</code> is a slave to another peer group <code class="language-plaintext highlighter-rouge">pg1</code> then all peers in peer group <code class="language-plaintext highlighter-rouge">pg2</code> point to the same peer in peer group <code class="language-plaintext highlighter-rouge">pg1</code> via <code class="language-plaintext highlighter-rouge">->mnt_master</code>.
So all peers in peer group <code class="language-plaintext highlighter-rouge">pg2</code> appear on the same <code class="language-plaintext highlighter-rouge">->mnt_slave_list</code> of a peer in <code class="language-plaintext highlighter-rouge">pg1</code>.
So they cannot be slaves to different peer groups or even different peers in the same peer group.</p>
</li>
<li>
<p>A pure slave mount is a slave to a peer group but is not a peer in another peer group.</p>
</li>
<li>
<p>A propagation group denotes the set of mounts consisting of a single peer group <code class="language-plaintext highlighter-rouge">pg1</code> and all slave mounts and shared-slave mounts that point to a peer in that peer group via <code class="language-plaintext highlighter-rouge">->mnt_master</code>.
This means all slave mounts such that <code class="language-plaintext highlighter-rouge">@slave_mnt->mnt_master->mnt_group_id</code> is equal to <code class="language-plaintext highlighter-rouge">@shared_mnt->mnt_group_id</code>.</p>
<p>The concept of a propagation group makes it easier to talk about a single propagation level in a propagation tree.</p>
<p>For example, in <code class="language-plaintext highlighter-rouge">propagate_mnt()</code> the immediate peers of <code class="language-plaintext highlighter-rouge">@dest_mnt</code> and all slaves of <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group form a propagation group <code class="language-plaintext highlighter-rouge">propg1</code>.
So a shared-slave mount that is a slave in <code class="language-plaintext highlighter-rouge">propg1</code> and that is a peer in another peer group <code class="language-plaintext highlighter-rouge">pg2</code> forms another propagation group <code class="language-plaintext highlighter-rouge">propg2</code> together with all slaves that point to that shared-slave mount in their <code class="language-plaintext highlighter-rouge">->mnt_master</code>.</p>
</li>
<li>
<p>A propagation tree refers to all mounts that receive propagation starting from a specific shared mount.</p>
<p>For example, for <code class="language-plaintext highlighter-rouge">propagate_mnt()</code> the destination mount <code class="language-plaintext highlighter-rouge">@dest_mnt</code> is the start of a propagation tree.
The propagation tree encompasses all mounts that receive propagation from <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group down to the leaves.</p>
</li>
</ul>
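<p>The terminology above can be restated as a handful of predicates. What follows is a minimal userspace sketch, not kernel code: the <code class="language-plaintext highlighter-rouge">struct mount</code> here is a stripped-down stand-in that keeps only <code class="language-plaintext highlighter-rouge">mnt_master</code> and <code class="language-plaintext highlighter-rouge">mnt_group_id</code>, and every helper name other than <code class="language-plaintext highlighter-rouge">peers()</code> is made up for illustration.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct mount (assumption). */
struct mount {
	struct mount *mnt_master; /* peer we receive propagation from, or NULL */
	int mnt_group_id;         /* peer group id, 0 if the mount is not shared */
};

/* A shared mount belongs to a peer group. */
static inline bool is_shared(const struct mount *m)
{
	return m->mnt_group_id != 0;
}

/* A slave mount receives propagation from a master. */
static inline bool is_slave(const struct mount *m)
{
	return m->mnt_master != NULL;
}

/* Slave to a peer group, but not a peer in any peer group itself. */
static inline bool is_pure_slave(const struct mount *m)
{
	return is_slave(m) && !is_shared(m);
}

/* Slave to one peer group while also a peer in another peer group. */
static inline bool is_shared_slave(const struct mount *m)
{
	return is_slave(m) && is_shared(m);
}

/* Two mounts are peers if both are shared and carry the same group id. */
static inline bool peers(const struct mount *a, const struct mount *b)
{
	return a->mnt_group_id == b->mnt_group_id && a->mnt_group_id != 0;
}

/* Is @m part of the propagation group formed by @shared_mnt's peer group? */
static inline bool in_propagation_group(const struct mount *m,
					const struct mount *shared_mnt)
{
	assert(is_shared(shared_mnt));
	if (peers(m, shared_mnt))
		return true;
	return is_slave(m) &&
	       m->mnt_master->mnt_group_id == shared_mnt->mnt_group_id;
}
```

<p>With these predicates a shared-slave mount is simply a mount for which both <code class="language-plaintext highlighter-rouge">is_shared()</code> and <code class="language-plaintext highlighter-rouge">is_slave()</code> hold, which is why it shows up in two propagation groups at once.</p>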
<h2 id="the-bug-in-one-sentence">The bug in one sentence</h2>
<p>The <code class="language-plaintext highlighter-rouge">propagate_mnt()</code> function contains a bug where it fails to terminate at peers of <code class="language-plaintext highlighter-rouge">@source_mnt</code> when searching for a copy of <code class="language-plaintext highlighter-rouge">@source_mnt</code> which is a suitable master for a new copy mounted on top of a slave in the destination propagation tree, causing a NULL dereference.</p>
<h2 id="the-impact-of-the-bug">The impact of the bug</h2>
<p>Once the mechanics of the bug are understood it’s easy to trigger.
Because of unprivileged user namespaces it is available to unprivileged users.
When the bug triggers, the lock taken by <code class="language-plaintext highlighter-rouge">namespace_lock()</code> is held, a read-write semaphore that needs to be held in a host of scenarios.
The gist is that once the bug has been triggered most interactions with the filesystem become impossible.
The kernel is effectively deadlocked.</p>
<p>Since we’re not attackers we’re not sure whether this bug can be exploited in more meaningful ways.
If it is and you do manage to exploit it be sure to write a post about it.
We would be very interested.</p>
<h1 id="the-mount-propagation-algorithm">The Mount Propagation Algorithm</h1>
<p>When a new mount is attached to a destination - be it a new filesystem or a bind-mount - the <code class="language-plaintext highlighter-rouge">attach_recursive_mnt()</code> function will be called.
It is also responsible for calling into <code class="language-plaintext highlighter-rouge">propagate_mnt()</code> to handle mount propagation.</p>
<p>By the time we call into <code class="language-plaintext highlighter-rouge">propagate_mnt()</code> we know that the destination mount <code class="language-plaintext highlighter-rouge">@dest_mnt</code> is either a pure shared mount or a shared-slave mount.
This is guaranteed by a check in <code class="language-plaintext highlighter-rouge">attach_recursive_mnt()</code>.
So <code class="language-plaintext highlighter-rouge">propagate_mnt()</code> will first propagate the source mount tree to all peers in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="n">next_peer</span><span class="p">(</span><span class="n">dest_mnt</span><span class="p">);</span> <span class="n">n</span> <span class="o">!=</span> <span class="n">dest_mnt</span><span class="p">;</span> <span class="n">n</span> <span class="o">=</span> <span class="n">next_peer</span><span class="p">(</span><span class="n">n</span><span class="p">))</span> <span class="p">{</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">propagate_one</span><span class="p">(</span><span class="n">n</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ret</span><span class="p">)</span>
<span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Notice that the peer propagation loop of <code class="language-plaintext highlighter-rouge">propagate_mnt()</code> doesn’t propagate to <code class="language-plaintext highlighter-rouge">@dest_mnt</code> itself.
Instead, the source mount tree is mounted on <code class="language-plaintext highlighter-rouge">@dest_mnt</code> directly in <code class="language-plaintext highlighter-rouge">attach_recursive_mnt()</code> after we have propagated to the destination propagation tree.</p>
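<p>To see why the loop skips <code class="language-plaintext highlighter-rouge">@dest_mnt</code>, it helps to model the peer group as a circular ring. This is only a sketch under simplifying assumptions: the kernel links peers through a list head, whereas here <code class="language-plaintext highlighter-rouge">mnt_share</code> is a plain pointer to the next peer, and the <code class="language-plaintext highlighter-rouge">visited</code> field and <code class="language-plaintext highlighter-rouge">propagate_to_peers()</code> are hypothetical names.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct mount (assumption). */
struct mount {
	struct mount *mnt_share; /* next peer in a circular ring, self if alone */
	int visited;             /* set when this sketch "propagates" to the mount */
};

static inline struct mount *next_peer(struct mount *m)
{
	return m->mnt_share;
}

/*
 * Visit every peer of @dest_mnt except @dest_mnt itself and return how
 * many peers were visited. The loop starts at the peer after @dest_mnt
 * and stops once the ring leads back to @dest_mnt.
 */
static inline int propagate_to_peers(struct mount *dest_mnt)
{
	int count = 0;

	assert(dest_mnt && dest_mnt->mnt_share);
	for (struct mount *n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) {
		n->visited = 1;
		count++;
	}
	return count;
}
```

<p>For a ring of three peers the function visits two of them and leaves the starting mount untouched. For a mount that is alone in its peer group the loop body never runs.</p>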
<p>The mount that will be mounted on top of <code class="language-plaintext highlighter-rouge">@dest_mnt</code> is <code class="language-plaintext highlighter-rouge">@source_mnt</code>.
This copy was created earlier even before we entered <code class="language-plaintext highlighter-rouge">attach_recursive_mnt()</code> and doesn’t concern us a lot here.</p>
<p>It’s just important to notice that when <code class="language-plaintext highlighter-rouge">propagate_mnt()</code> is called <code class="language-plaintext highlighter-rouge">@source_mnt</code> will not yet have been mounted on top of <code class="language-plaintext highlighter-rouge">@dest_mnt</code>.
Thus, <code class="language-plaintext highlighter-rouge">@source_mnt->mnt_parent</code> will either still point to <code class="language-plaintext highlighter-rouge">@source_mnt</code> or - in the case <code class="language-plaintext highlighter-rouge">@source_mnt</code> is moved and thus already attached - still to its former parent.</p>
<p>For each peer <code class="language-plaintext highlighter-rouge">@m</code> in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group <code class="language-plaintext highlighter-rouge">propagate_one()</code> will create a new copy of the source mount tree and mount that copy <code class="language-plaintext highlighter-rouge">@child</code> on <code class="language-plaintext highlighter-rouge">@m</code> such that <code class="language-plaintext highlighter-rouge">@child->mnt_parent</code> points to <code class="language-plaintext highlighter-rouge">@m</code> after <code class="language-plaintext highlighter-rouge">propagate_one()</code> returns.</p>
<p><code class="language-plaintext highlighter-rouge">propagate_one()</code> will stash the last destination propagation node <code class="language-plaintext highlighter-rouge">@m</code> in <code class="language-plaintext highlighter-rouge">@last_dest</code> and the last copy it created for the source mount tree in <code class="language-plaintext highlighter-rouge">@last_source</code>.</p>
<p>Hence, if we call into <code class="language-plaintext highlighter-rouge">propagate_one()</code> again for the next destination propagation node <code class="language-plaintext highlighter-rouge">@m</code>, <code class="language-plaintext highlighter-rouge">@last_dest</code> will point to the previous destination propagation node and <code class="language-plaintext highlighter-rouge">@last_source</code> will point to the previous copy of the source mount tree that was mounted on <code class="language-plaintext highlighter-rouge">@last_dest</code>.</p>
<p>Each new copy of the source mount tree is created from the previous copy of the source mount tree.
This will become important later.</p>
<p>The peer loop in <code class="language-plaintext highlighter-rouge">propagate_mnt()</code> is straightforward.
We iterate through the peers copying and updating <code class="language-plaintext highlighter-rouge">@last_source</code> and <code class="language-plaintext highlighter-rouge">@last_dest</code> as we go through them and mount each copy of the source mount tree <code class="language-plaintext highlighter-rouge">@child</code> on a peer <code class="language-plaintext highlighter-rouge">@m</code> in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group.</p>
<p>After <code class="language-plaintext highlighter-rouge">propagate_mnt()</code> handles the peers in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group it will propagate the source mount down the propagation tree that <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group propagates to:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="n">m</span> <span class="o">=</span> <span class="n">next_group</span><span class="p">(</span><span class="n">dest_mnt</span><span class="p">,</span> <span class="n">dest_mnt</span><span class="p">);</span> <span class="n">m</span><span class="p">;</span>
<span class="n">m</span> <span class="o">=</span> <span class="n">next_group</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">dest_mnt</span><span class="p">))</span> <span class="p">{</span>
<span class="cm">/* everything in that slave group */</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">m</span><span class="p">;</span>
<span class="k">do</span> <span class="p">{</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">propagate_one</span><span class="p">(</span><span class="n">n</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ret</span><span class="p">)</span>
<span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">next_peer</span><span class="p">(</span><span class="n">n</span><span class="p">);</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">n</span> <span class="o">!=</span> <span class="n">m</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">next_group()</code> helper will recursively walk the destination propagation tree, descending into each propagation group of the propagation tree.</p>
<p>The important part is that it takes care to propagate the source mount tree to all peers in the peer group of a propagation group before it propagates to the slaves to those peers in the propagation group.
In other words, a mount in the source mount propagation tree which will be a master is always created before that mount’s slaves.</p>
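<p>The ordering guarantee can be illustrated with a small recursive model. This is not how <code class="language-plaintext highlighter-rouge">next_group()</code> is implemented. It is a sketch that only reproduces the property described above, namely that all peers of a group are visited before any of their slaves. The fields <code class="language-plaintext highlighter-rouge">first_slave</code>, <code class="language-plaintext highlighter-rouge">next_slave</code> and <code class="language-plaintext highlighter-rouge">order</code> as well as the function <code class="language-plaintext highlighter-rouge">walk_group()</code> are assumptions made for this example.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct mount (assumption). */
struct mount {
	struct mount *mnt_share;   /* next peer in a circular ring, self if alone */
	struct mount *first_slave; /* first entry on this peer's slave list */
	struct mount *next_slave;  /* next entry on the master's slave list */
	int order;                 /* visit order, 0 means not visited yet */
};

/*
 * Visit the peer group that @m belongs to, then descend into the slaves
 * hanging off each peer. Peers of a slave group all sit on the same
 * slave list, so groups that were already visited are skipped.
 */
static inline void walk_group(struct mount *m, int *counter)
{
	struct mount *n = m;

	assert(m && counter);

	/* Step 1: every peer in the group. */
	do {
		n->order = ++(*counter);
		n = n->mnt_share;
	} while (n != m);

	/* Step 2: the slaves of each peer, one propagation level down. */
	do {
		for (struct mount *s = n->first_slave; s; s = s->next_slave) {
			if (!s->order)
				walk_group(s, counter);
		}
		n = n->mnt_share;
	} while (n != m);
}
```

<p>Because a master always receives its visit number before the recursion reaches its slave list, any mount that will serve as a master exists before the mounts that depend on it.</p>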
<p>It is important to remember that propagating the source mount tree to each mount <code class="language-plaintext highlighter-rouge">@m</code> in the destination propagation tree simply means that we create and mount new copies <code class="language-plaintext highlighter-rouge">@child</code> of the source mount tree on <code class="language-plaintext highlighter-rouge">@m</code> such that <code class="language-plaintext highlighter-rouge">@child->mnt_parent</code> points to <code class="language-plaintext highlighter-rouge">@m</code>.</p>
<p>Since we know that each node <code class="language-plaintext highlighter-rouge">@m</code> in the destination propagation tree headed by <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group will be overmounted with a copy of the source mount, and since we know that the propagation properties of each copy of the source mount tree we create and mount at <code class="language-plaintext highlighter-rouge">@m</code> will mostly mirror the propagation properties of <code class="language-plaintext highlighter-rouge">@m</code>, we can use that information to create and mount the copies of the source mount that become masters before their slaves.</p>
<p>The easy case is always when <code class="language-plaintext highlighter-rouge">@m</code> and <code class="language-plaintext highlighter-rouge">@last_dest</code> are peers in a peer group of a given propagation group.
In that case we know that the new copy <code class="language-plaintext highlighter-rouge">@child</code> should have the same master as <code class="language-plaintext highlighter-rouge">@last_source</code>, whose master was determined in a previous call to <code class="language-plaintext highlighter-rouge">propagate_one()</code>.</p>
<p>The hard case is when we’re dealing with an <code class="language-plaintext highlighter-rouge">@m</code> which is a pure slave mount or a shared-slave mount in a new peer group, as we need to find an appropriate mount in the source mount tree to be the master of <code class="language-plaintext highlighter-rouge">@m</code>.</p>
<p>For each pure slave or peer group in the destination propagation tree we need to make sure that the master for new copies of <code class="language-plaintext highlighter-rouge">@source_mnt</code> is a mount from the source mount propagation tree whose parent is in the chain of masters of the parent for the new child mount.
This is a mouthful but as far as we can tell that’s the core of it all.</p>
<p>But if we keep track of the masters in the destination propagation tree we can use the information to find the correct master for each copy of the source mount tree we create and mount at the slaves in the destination propagation tree. Keeping track of masters in the destination propagation tree can be done by temporarily “marking” each master with the <code class="language-plaintext highlighter-rouge">MNT_MARKED</code> flag. (Note that this flag is also abused for unmounting, but that’s another topic.)</p>
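<p>Temporary marking boils down to setting and clearing a single flag bit. The sketch below shows the bookkeeping in isolation. The bit value chosen for <code class="language-plaintext highlighter-rouge">MNT_MARKED</code> is a placeholder, the lowercase helpers are stand-ins for the kernel’s macros, and <code class="language-plaintext highlighter-rouge">clear_marks()</code> is a hypothetical cleanup helper.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MNT_MARKED 0x1 /* placeholder bit value for this sketch */

/* Simplified stand-in for the kernel's struct mount (assumption). */
struct mount {
	int mnt_flags;
};

static inline bool is_mnt_marked(const struct mount *m)
{
	return (m->mnt_flags & MNT_MARKED) != 0;
}

static inline void set_mnt_mark(struct mount *m)
{
	m->mnt_flags |= MNT_MARKED;
}

static inline void clear_mnt_mark(struct mount *m)
{
	m->mnt_flags &= ~MNT_MARKED;
}

/* The marks are temporary: drop them again once propagation is done. */
static inline void clear_marks(struct mount **masters, size_t n)
{
	assert(masters || n == 0);
	for (size_t i = 0; i < n; i++)
		clear_mnt_mark(masters[i]);
}
```

<p>Other flag bits are left alone by both operations, so marking a master does not disturb its remaining mount flags.</p>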
<p>Let’s walk through the base case as that’s still fairly easy to grasp.</p>
<p>If we’re dealing with the first slave in the propagation group that <code class="language-plaintext highlighter-rouge">@dest_mnt</code> is in then we haven’t yet marked any masters in the destination propagation tree.</p>
<p>We know the master for the first slave to <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group is simply <code class="language-plaintext highlighter-rouge">@dest_mnt</code>.
So we expect this algorithm to yield a copy of the source mount tree that was mounted on a peer in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group as the master for the copy of the source mount tree we want to mount at the first slave in the destination propagation tree:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="n">m</span><span class="p">;</span> <span class="p">;</span> <span class="n">n</span> <span class="o">=</span> <span class="n">p</span><span class="p">)</span> <span class="p">{</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">n</span><span class="o">-></span><span class="n">mnt_master</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">p</span> <span class="o">==</span> <span class="n">dest_master</span> <span class="o">||</span> <span class="n">IS_MNT_MARKED</span><span class="p">(</span><span class="n">p</span><span class="p">))</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>For the first slave we walk the destination propagation tree all the way up to a peer in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group.
So, the propagation hierarchy can be walked by walking up the <code class="language-plaintext highlighter-rouge">@m->mnt_master</code> hierarchy of the destination propagation tree <code class="language-plaintext highlighter-rouge">@m</code>.
We will ultimately find a peer in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group and thus ultimately <code class="language-plaintext highlighter-rouge">@dest_mnt->mnt_master</code>.</p>
<p>By the way, here the assumption we listed at the beginning becomes important.
Namely, that peers in a peer group <code class="language-plaintext highlighter-rouge">pg1</code> that are slaves in another peer group <code class="language-plaintext highlighter-rouge">pg2</code> appear on the same <code class="language-plaintext highlighter-rouge">->mnt_slave_list</code>.
So all slaves who are peers in peer group <code class="language-plaintext highlighter-rouge">pg1</code> point to the same peer in peer group <code class="language-plaintext highlighter-rouge">pg2</code> via <code class="language-plaintext highlighter-rouge">->mnt_master</code>.
Otherwise the termination condition in the code above would be wrong and <code class="language-plaintext highlighter-rouge">next_group()</code> would be broken too.</p>
<p>So the first iteration sets:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span> <span class="o">=</span> <span class="n">m</span><span class="p">;</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">n</span><span class="o">-></span><span class="n">mnt_master</span><span class="p">;</span>
</code></pre></div></div>
<p>such that <code class="language-plaintext highlighter-rouge">@p</code> now points to a peer or <code class="language-plaintext highlighter-rouge">@dest_mnt</code> itself.
We walk up one more level since we don’t have any marked masters.
So we end up with:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span> <span class="o">=</span> <span class="n">dest_mnt</span><span class="p">;</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">dest_mnt</span><span class="o">-></span><span class="n">mnt_master</span><span class="p">;</span>
</code></pre></div></div>
<p>If <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group is not slave to another peer group then <code class="language-plaintext highlighter-rouge">@p</code> is now <code class="language-plaintext highlighter-rouge">NULL</code>.
If <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group is a slave to another peer group then <code class="language-plaintext highlighter-rouge">@p</code> now points to <code class="language-plaintext highlighter-rouge">@dest_mnt->mnt_master</code>, which is a master outside the propagation tree we’re dealing with.</p>
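<p>The walk up the master chain can be tried out in userspace. The sketch below wraps the loop from above in a function so that both outcomes of the base case, and the effect of a marked master, can be checked. The function name <code class="language-plaintext highlighter-rouge">walk_to_master()</code>, the out parameter and the <code class="language-plaintext highlighter-rouge">marked</code> field, which stands in for <code class="language-plaintext highlighter-rouge">MNT_MARKED</code>, are assumptions made for this example.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct mount (assumption). */
struct mount {
	struct mount *mnt_master; /* master we receive propagation from */
	bool marked;              /* stands in for the MNT_MARKED flag */
};

/*
 * Walk up from slave @m until the master is either @dest_master or a
 * marked mount. Returns that master as @p and stores the last node
 * visited below it in @np.
 */
static inline struct mount *walk_to_master(struct mount *m,
					   struct mount *dest_master,
					   struct mount **np)
{
	struct mount *n, *p;

	assert(m && m->mnt_master && np); /* @m has to be a slave */
	for (n = m; ; n = p) {
		p = n->mnt_master;
		if (p == dest_master || p->marked)
			break;
	}
	*np = n;
	return p;
}
```

<p>With no marked masters the walk ends one level above <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group, so the returned master is either <code class="language-plaintext highlighter-rouge">NULL</code> or a mount outside the propagation tree. Once a master inside the tree is marked the walk stops there instead.</p>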
<p>Now we need to figure out the master for the copy of the source mount tree we’re about to create and mount on the first slave of <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">do</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">mount</span> <span class="o">*</span><span class="n">parent</span> <span class="o">=</span> <span class="n">last_source</span><span class="o">-></span><span class="n">mnt_parent</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">last_source</span> <span class="o">==</span> <span class="n">first_source</span><span class="p">)</span>
<span class="k">break</span><span class="p">;</span>
<span class="n">done</span> <span class="o">=</span> <span class="n">parent</span><span class="o">-></span><span class="n">mnt_master</span> <span class="o">==</span> <span class="n">p</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">done</span> <span class="o">&&</span> <span class="n">peers</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">parent</span><span class="p">))</span>
<span class="k">break</span><span class="p">;</span>
<span class="n">last_source</span> <span class="o">=</span> <span class="n">last_source</span><span class="o">-></span><span class="n">mnt_master</span><span class="p">;</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">done</span><span class="p">);</span>
</code></pre></div></div>
<p>We know that <code class="language-plaintext highlighter-rouge">@last_source->mnt_parent</code> points to <code class="language-plaintext highlighter-rouge">@last_dest</code> and <code class="language-plaintext highlighter-rouge">@last_dest</code> is the last peer in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group we propagated to in the peer loop in <code class="language-plaintext highlighter-rouge">propagate_mnt()</code>.</p>
<p>Consequently, <code class="language-plaintext highlighter-rouge">@last_source</code> is the last copy we created and mounted on that last peer in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group.
So <code class="language-plaintext highlighter-rouge">@last_source</code> is the master we want to pick.</p>
<p>We know that <code class="language-plaintext highlighter-rouge">@last_source->mnt_parent->mnt_master</code> points to <code class="language-plaintext highlighter-rouge">@last_dest->mnt_master</code>.
We also know that <code class="language-plaintext highlighter-rouge">@last_dest->mnt_master</code> is either <code class="language-plaintext highlighter-rouge">NULL</code> or points to a master outside of the destination propagation tree and so does <code class="language-plaintext highlighter-rouge">@p</code>.
Hence:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">done</span> <span class="o">=</span> <span class="n">parent</span><span class="o">-></span><span class="n">mnt_master</span> <span class="o">==</span> <span class="n">p</span><span class="p">;</span>
</code></pre></div></div>
<p>is trivially true in the base condition.</p>
<p>We also know that for the first slave mount of <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group, <code class="language-plaintext highlighter-rouge">@last_dest</code> either points to <code class="language-plaintext highlighter-rouge">@dest_mnt</code> itself because it was initialized to:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">last_dest</span> <span class="o">=</span> <span class="n">dest_mnt</span><span class="p">;</span>
</code></pre></div></div>
<p>at the beginning of <code class="language-plaintext highlighter-rouge">propagate_mnt()</code> or it will point to a peer of <code class="language-plaintext highlighter-rouge">@dest_mnt</code> in its peer group.
In both cases it is guaranteed that on the first iteration <code class="language-plaintext highlighter-rouge">@n</code> and <code class="language-plaintext highlighter-rouge">@parent</code> are peers. Please note the check for peers here as that’s important:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">done</span> <span class="o">&&</span> <span class="n">peers</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">parent</span><span class="p">))</span>
<span class="k">break</span><span class="p">;</span>
</code></pre></div></div>
<p>So, as we expected, we select <code class="language-plaintext highlighter-rouge">@last_source</code>, which refers to the last copy of the source mount tree we mounted on the last peer in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group, as the master of the first slave in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group.
The rest is taken care of by <code class="language-plaintext highlighter-rouge">clone_mnt(last_source, ...)</code>.</p>
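<p>The base case can be replayed with a small userspace model. The sketch below wraps the selection loop exactly as quoted above in a function, so it only exercises the base case and adds no extra termination condition. The function name <code class="language-plaintext highlighter-rouge">pick_master()</code> and the reduced <code class="language-plaintext highlighter-rouge">struct mount</code> are assumptions made for this example.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct mount (assumption). */
struct mount {
	struct mount *mnt_parent; /* mount this mount is mounted on */
	struct mount *mnt_master; /* master we receive propagation from */
	int mnt_group_id;         /* peer group id, 0 if not shared */
};

static inline bool peers(const struct mount *a, const struct mount *b)
{
	return a->mnt_group_id == b->mnt_group_id && a->mnt_group_id != 0;
}

/*
 * Starting from the last copy @last_source, find the copy of the source
 * mount tree that should act as master for the next copy. @n and @p are
 * the results of the preceding walk up the destination master chain.
 */
static inline struct mount *pick_master(struct mount *last_source,
					struct mount *first_source,
					struct mount *n, struct mount *p)
{
	bool done = false;

	assert(last_source && first_source && n);
	do {
		struct mount *parent = last_source->mnt_parent;

		if (last_source == first_source)
			break;
		done = parent->mnt_master == p;
		if (done && peers(n, parent))
			break;
		last_source = last_source->mnt_master;
	} while (!done);
	return last_source;
}
```

<p>When a copy was already mounted on a peer of <code class="language-plaintext highlighter-rouge">@dest_mnt</code>, that copy is returned on the first iteration. When <code class="language-plaintext highlighter-rouge">@dest_mnt</code> has no other peers, <code class="language-plaintext highlighter-rouge">@last_source</code> still equals <code class="language-plaintext highlighter-rouge">@first_source</code> and the loop exits immediately.</p>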
<p>At the end of <code class="language-plaintext highlighter-rouge">propagate_one()</code> we now mark <code class="language-plaintext highlighter-rouge">@m->mnt_master</code> as the first master in the destination propagation tree that is distinct from <code class="language-plaintext highlighter-rouge">@dest_mnt->mnt_master</code>.
Thus, we mark <code class="language-plaintext highlighter-rouge">@dest_mnt</code> itself as a master.</p>
<p>By marking <code class="language-plaintext highlighter-rouge">@dest_mnt</code> or one of its peers we are able to easily find it again when we later look up masters for other copies of the source mount tree that we mount on slaves <code class="language-plaintext highlighter-rouge">@m</code> of <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group.
This in turn allows us to find the masters we selected for the copies of <code class="language-plaintext highlighter-rouge">@source_mnt</code>, which are always mounted on masters in the destination propagation tree.</p>
<p>The important part is to realize that the code makes use of the fact that the last copy of the source mount tree stashed in <code class="language-plaintext highlighter-rouge">@last_source</code> was mounted on top of the previous destination propagation node <code class="language-plaintext highlighter-rouge">@last_dest</code>.
What this means is that <code class="language-plaintext highlighter-rouge">@last_source</code> allows us to walk the destination propagation hierarchy the same way each destination propagation node <code class="language-plaintext highlighter-rouge">@m</code> does.</p>
<p>If we take <code class="language-plaintext highlighter-rouge">@last_source</code>, which is the copy of <code class="language-plaintext highlighter-rouge">@source_mnt</code> we have mounted on <code class="language-plaintext highlighter-rouge">@last_dest</code> in the previous iteration of <code class="language-plaintext highlighter-rouge">propagate_one()</code>, then we know <code class="language-plaintext highlighter-rouge">@last_source->mnt_parent</code> points to <code class="language-plaintext highlighter-rouge">@last_dest</code>. We also know that, as we walk through the destination propagation tree, <code class="language-plaintext highlighter-rouge">@last_source->mnt_master</code> will point to an earlier copy of the source mount tree we mounted on an earlier destination propagation node <code class="language-plaintext highlighter-rouge">@m</code>.</p>
<p>So <code class="language-plaintext highlighter-rouge">@last_source->mnt_parent</code> will be our hook into the destination propagation tree and each consecutive <code class="language-plaintext highlighter-rouge">@last_source->mnt_master</code> will lead us to an earlier propagation node <code class="language-plaintext highlighter-rouge">@m</code> via <code class="language-plaintext highlighter-rouge">@last_source->mnt_master->mnt_parent</code>.</p>
<p>Hence, by walking up <code class="language-plaintext highlighter-rouge">@last_source->mnt_master</code>, each of which is mounted on a node that is a master in the destination propagation tree, we can also walk up the destination propagation hierarchy.</p>
<p>So, for each new destination propagation node <code class="language-plaintext highlighter-rouge">@m</code> we use the previous copy of <code class="language-plaintext highlighter-rouge">@last_source</code> and the fact it’s mounted on the previous propagation node <code class="language-plaintext highlighter-rouge">@last_dest</code> via <code class="language-plaintext highlighter-rouge">@last_source->mnt_master->mnt_parent</code> to determine what the master of the new copy of <code class="language-plaintext highlighter-rouge">@last_source</code> needs to be.</p>
<p>The goal is to select a master in the <strong>closest</strong> peer group for the new copy of the source mount tree we are about to create and mount on a slave <code class="language-plaintext highlighter-rouge">@m</code> in the destination propagation tree.
This means we want to find a suitable master in the propagation group.</p>
<p>As the structure of the source mount propagation tree we create mirrors the propagation structure of the destination propagation
tree we can find <code class="language-plaintext highlighter-rouge">@m</code>’s closest master - i.e., a marked master - which is a peer in the closest peer group that <code class="language-plaintext highlighter-rouge">@m</code> receives propagation from.
We store that closest master of <code class="language-plaintext highlighter-rouge">@m</code> in <code class="language-plaintext highlighter-rouge">@p</code> as before and record the slave to that master in <code class="language-plaintext highlighter-rouge">@n</code>.</p>
<p>We then search for this master <code class="language-plaintext highlighter-rouge">@p</code> via <code class="language-plaintext highlighter-rouge">@last_source</code> by walking up the master hierarchy starting from <code class="language-plaintext highlighter-rouge">@last_source</code>.</p>
<p>We will try to find the master by walking <code class="language-plaintext highlighter-rouge">@last_source->mnt_master</code> and by comparing <code class="language-plaintext highlighter-rouge">@last_source->mnt_master->mnt_parent->mnt_master</code> to <code class="language-plaintext highlighter-rouge">@p</code>.
If we find <code class="language-plaintext highlighter-rouge">@p</code> then we can figure out what earlier copy of the source mount tree needs to be the master for the new copy of the source mount tree we’re about to create and mount at the current destination propagation node <code class="language-plaintext highlighter-rouge">@m</code>.</p>
<p>If <code class="language-plaintext highlighter-rouge">@last_source->mnt_master->mnt_parent</code> and <code class="language-plaintext highlighter-rouge">@n</code> are peers then we know that the closest master they receive propagation from is <code class="language-plaintext highlighter-rouge">@last_source->mnt_master->mnt_parent->mnt_master</code>.
If not then the closest immediate peer group that they receive propagation from must be one level higher up.</p>
<p>This builds on the earlier clarification at the beginning that all peers in a peer group which are slaves of other peer groups all point to the same <code class="language-plaintext highlighter-rouge">->mnt_master</code>, i.e., appear on the same <code class="language-plaintext highlighter-rouge">->mnt_slave_list</code>, of the closest peer group that they receive propagation from.</p>
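<p>The search for <code class="language-plaintext highlighter-rouge">@p</code> and <code class="language-plaintext highlighter-rouge">@n</code> described above can be sketched in userspace C. This is only a toy model: <code class="language-plaintext highlighter-rouge">struct toy_mount</code>, its <code class="language-plaintext highlighter-rouge">marked</code> flag and <code class="language-plaintext highlighter-rouge">closest_marked_master()</code> are hypothetical names made up for illustration, not the kernel's data structures or helpers.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a mount (hypothetical, heavily reduced). */
struct toy_mount {
	const char *name;
	struct toy_mount *mnt_master; /* mount we receive propagation from, or NULL */
	bool marked;                  /* models a "marked master" */
};

/*
 * Starting at destination node m, climb ->mnt_master until we reach either
 * dest_master (the master of @dest_mnt's peer group, possibly NULL) or a
 * marked master. On return *p is that closest master and *n is the mount
 * directly below it on the path up from m.
 *
 * Assumption: m really sits below dest_master or a marked master, so the
 * climb never runs off the top of the hierarchy.
 */
void closest_marked_master(struct toy_mount *m, struct toy_mount *dest_master,
			   struct toy_mount **n, struct toy_mount **p)
{
	assert(m != NULL && n != NULL && p != NULL);

	for (*n = m; ; *n = *p) {
		*p = (*n)->mnt_master;
		/* Compare against dest_master first so a NULL *p is not dereferenced. */
		if (*p == dest_master || (*p)->marked)
			break;
	}
}
```

<p>With a marked <code class="language-plaintext highlighter-rouge">dest</code>, a slave <code class="language-plaintext highlighter-rouge">S0</code> of <code class="language-plaintext highlighter-rouge">dest</code> and a slave <code class="language-plaintext highlighter-rouge">T</code> of <code class="language-plaintext highlighter-rouge">S0</code>, starting the search at <code class="language-plaintext highlighter-rouge">T</code> yields <code class="language-plaintext highlighter-rouge">p == dest</code> and <code class="language-plaintext highlighter-rouge">n == S0</code>.</p>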
<h1 id="failing-to-terminate-the-algorithm">Failing to terminate the algorithm</h1>
<p>However, terminating the walk has corner cases.</p>
<p>If the closest marked master for a given destination node <code class="language-plaintext highlighter-rouge">@m</code> cannot be found by walking up the master hierarchy via <code class="language-plaintext highlighter-rouge">@last_source->mnt_master</code> then we need to terminate the walk when we encounter <code class="language-plaintext highlighter-rouge">@source_mnt</code> again.</p>
<p>This isn’t an arbitrary termination.
It simply means that the new copy of the source mount tree we’re about to create has a copy of the source mount tree we created and mounted on a peer in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group as its master.
So <code class="language-plaintext highlighter-rouge">@source_mnt</code> is the peer in the closest peer group that the new copy of the source mount tree receives propagation from.</p>
<p>We absolutely have to stop at <code class="language-plaintext highlighter-rouge">@source_mnt</code> because <code class="language-plaintext highlighter-rouge">@last_source->mnt_master</code> either points outside the propagation hierarchy we’re dealing with or it is <code class="language-plaintext highlighter-rouge">NULL</code> because <code class="language-plaintext highlighter-rouge">@source_mnt</code> isn’t a shared-slave.</p>
<p>So continuing the walk past <code class="language-plaintext highlighter-rouge">@source_mnt</code> would cause a <code class="language-plaintext highlighter-rouge">NULL</code> dereference via <code class="language-plaintext highlighter-rouge">@last_source->mnt_master->mnt_parent</code>.
And so we have to stop the walk when we encounter <code class="language-plaintext highlighter-rouge">@source_mnt</code> again.</p>
<p>One scenario where this can happen is when we first handled a series of slaves of <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group and then encounter peers in a new peer group that is a slave to <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group.
We handle them and then we encounter another slave mount to <code class="language-plaintext highlighter-rouge">@dest_mnt</code> that is a pure slave to <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group.
That pure slave will have a peer in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group as its master.
Consequently, the new copy of the source mount tree will need to have <code class="language-plaintext highlighter-rouge">@source_mnt</code> as its master.
So we walk the propagation hierarchy all the way up to <code class="language-plaintext highlighter-rouge">@source_mnt</code> based on <code class="language-plaintext highlighter-rouge">@last_source->mnt_master</code>.</p>
<p>So terminate on <code class="language-plaintext highlighter-rouge">@source_mnt</code>, easy peasy.
Except that the check misses something that the rest of the algorithm already handles.</p>
<p>If <code class="language-plaintext highlighter-rouge">@dest_mnt</code> has peers in its peer group the peer loop in <code class="language-plaintext highlighter-rouge">propagate_mnt()</code>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="n">n</span> <span class="o">=</span> <span class="n">next_peer</span><span class="p">(</span><span class="n">dest_mnt</span><span class="p">);</span> <span class="n">n</span> <span class="o">!=</span> <span class="n">dest_mnt</span><span class="p">;</span> <span class="n">n</span> <span class="o">=</span> <span class="n">next_peer</span><span class="p">(</span><span class="n">n</span><span class="p">))</span> <span class="p">{</span>
    <span class="n">ret</span> <span class="o">=</span> <span class="n">propagate_one</span><span class="p">(</span><span class="n">n</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">ret</span><span class="p">)</span>
        <span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>will consecutively update <code class="language-plaintext highlighter-rouge">@last_source</code> with each previous copy of the source mount tree we created and mounted at the previous peer in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group.
So after that loop terminates <code class="language-plaintext highlighter-rouge">@last_source</code> will point to whatever copy of the source mount tree was created and mounted on the last peer in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group.</p>
<p>Furthermore, if there is even a single additional peer in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group then <code class="language-plaintext highlighter-rouge">@last_source</code> will <strong>not</strong> point to <code class="language-plaintext highlighter-rouge">@source_mnt</code> anymore.
Because, as we mentioned above, <code class="language-plaintext highlighter-rouge">@dest_mnt</code> isn’t even handled in this loop but directly in <code class="language-plaintext highlighter-rouge">attach_recursive_mnt()</code>.
So it can’t even accidentally come last in that peer loop.</p>
<p>So the first time we handle a slave mount <code class="language-plaintext highlighter-rouge">@m</code> of <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group the copy of the source mount tree we create will make the <strong>last copy of the source mount tree we created and mounted on the last peer in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group the master of the new copy of the source mount tree we create and mount on the first slave of <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group</strong>.</p>
<p>But this means that the termination condition that checks for <code class="language-plaintext highlighter-rouge">@source_mnt</code> is wrong.
The <code class="language-plaintext highlighter-rouge">@source_mnt</code> cannot be found anymore by <code class="language-plaintext highlighter-rouge">propagate_one()</code>.
Instead it will find the last copy of the source mount tree we created and mounted for the last peer of <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group again.
And that is a peer of <code class="language-plaintext highlighter-rouge">@source_mnt</code> not <code class="language-plaintext highlighter-rouge">@source_mnt</code> itself.</p>
<p>This means we fail to terminate the loop correctly and ultimately dereference <code class="language-plaintext highlighter-rouge">@last_source->mnt_master->mnt_parent</code>.
When <code class="language-plaintext highlighter-rouge">@source_mnt</code>’s peer group isn’t slave to another peer group then <code class="language-plaintext highlighter-rouge">@last_source->mnt_master</code> is <code class="language-plaintext highlighter-rouge">NULL</code> causing the splat above.</p>
<p>For example, assume <code class="language-plaintext highlighter-rouge">@dest_mnt</code> is a pure shared mount and has three peers in its peer group:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>===================================================================================
                                         mount-id   mount-parent-id   peer-group-id
===================================================================================
(@dest_mnt) mnt_master[216]              309        297               shared:216
    \
     (@source_mnt) mnt_master[218]:      609        609               shared:218

(1) mnt_master[216]:                     607        605               shared:216
    \
     (P1) mnt_master[218]:               624        607               shared:218

(2) mnt_master[216]:                     576        574               shared:216
    \
     (P2) mnt_master[218]:               625        576               shared:218

(3) mnt_master[216]:                     545        543               shared:216
    \
     (P3) mnt_master[218]:               626        545               shared:218
</code></pre></div></div>
<p>After this sequence has been processed <code class="language-plaintext highlighter-rouge">@last_source</code> will point to <code class="language-plaintext highlighter-rouge">(P3)</code>, the copy generated for the third peer in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group we handled.
So the copy of the source mount tree <code class="language-plaintext highlighter-rouge">(P4)</code> we create and mount on the first slave of <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>===================================================================================
                                         mount-id   mount-parent-id   peer-group-id
===================================================================================
    mnt_master[216]                      309        297               shared:216
   /
  /
(S0) mnt_slave                           483        481               master:216
  \
   \    (P3) mnt_master[218]             626        545               shared:218
    \  /
     \/
    (P4) mnt_slave                       627        483               master:218
</code></pre></div></div>
<p>will pick the last copy of the source mount tree <code class="language-plaintext highlighter-rouge">(P3)</code> as master, not <code class="language-plaintext highlighter-rouge">(@source_mnt)</code>.</p>
<p>When walking the propagation hierarchy via <code class="language-plaintext highlighter-rouge">@last_source</code>’s master hierarchy we encounter <code class="language-plaintext highlighter-rouge">(P3)</code> but not <code class="language-plaintext highlighter-rouge">(@source_mnt)</code>.</p>
<p>We can fix this in multiple ways:</p>
<p>(1) By setting <code class="language-plaintext highlighter-rouge">@last_source</code> to <code class="language-plaintext highlighter-rouge">@source_mnt</code> after we processed the peers in <code class="language-plaintext highlighter-rouge">@dest_mnt</code>’s peer group right after the peer loop in <code class="language-plaintext highlighter-rouge">propagate_mnt()</code>.
This guarantees that we really always find <code class="language-plaintext highlighter-rouge">@source_mnt</code> itself.</p>
<p>(2) By changing the termination condition that relies on finding exactly <code class="language-plaintext highlighter-rouge">@source_mnt</code> to finding a peer of <code class="language-plaintext highlighter-rouge">@source_mnt</code>.</p>
<p>(3) By only moving <code class="language-plaintext highlighter-rouge">@last_source</code> when we actually venture into a new peer group or some clever variant thereof.</p>
<p>The first two options are minimally invasive and what we want as a fix.
The third option is more intrusive but something we’d like to explore in the near future.</p>
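<p>The difference between the original termination check and option (2) can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel code: <code class="language-plaintext highlighter-rouge">struct toy_mount</code>, <code class="language-plaintext highlighter-rouge">walk_masters()</code> and <code class="language-plaintext highlighter-rouge">demo()</code> are hypothetical names, the loop is modeled on the description in this post, and the mount layout in <code class="language-plaintext highlighter-rouge">demo()</code> is hand-built so that the search for <code class="language-plaintext highlighter-rouge">@p</code> fails. It is not meant to reproduce a real propagation traversal. Where the kernel would dereference <code class="language-plaintext highlighter-rouge">NULL</code>, the model returns an error value instead.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a mount (hypothetical, only what the walk needs). */
struct toy_mount {
	const char *name;
	int group_id;                 /* peer group id, 0 if not shared */
	struct toy_mount *mnt_parent; /* mount this one is mounted on */
	struct toy_mount *mnt_master; /* master we receive propagation from */
};

bool peers(const struct toy_mount *a, const struct toy_mount *b)
{
	return a->group_id != 0 && a->group_id == b->group_id;
}

enum walk_result { WALK_OK, WALK_NULL_DEREF };

/*
 * Walk up *last_source's master chain looking for the copy mounted on a
 * node whose master is p. stop_at_peers selects the termination check:
 * false stops only at source_mnt itself, true stops at any peer of it.
 */
enum walk_result walk_masters(struct toy_mount *source_mnt,
			      struct toy_mount **last_source,
			      struct toy_mount *n, struct toy_mount *p,
			      bool stop_at_peers)
{
	bool done = false;

	assert(source_mnt != NULL && last_source != NULL && n != NULL);
	do {
		struct toy_mount *parent;

		if (*last_source == NULL)
			return WALK_NULL_DEREF; /* the kernel would oops here */
		parent = (*last_source)->mnt_parent;
		if (stop_at_peers ? peers(*last_source, source_mnt)
				  : *last_source == source_mnt)
			break;
		done = parent->mnt_master == p;
		if (done && peers(n, parent))
			break;
		*last_source = (*last_source)->mnt_master;
	} while (!done);

	return WALK_OK;
}

/* Hand-built layout: P4 -> P3 is the master chain, P3 is a peer of source_mnt. */
enum walk_result demo(bool stop_at_peers, const char **ended_at)
{
	struct toy_mount dest   = { "dest", 216, NULL, NULL };
	struct toy_mount peer3  = { "peer3", 216, NULL, NULL };
	struct toy_mount source = { "source_mnt", 218, &dest, NULL };
	struct toy_mount p3     = { "P3", 218, &peer3, NULL };
	struct toy_mount s0     = { "S0", 0, NULL, &dest };
	struct toy_mount p4     = { "P4", 0, &s0, &p3 };
	struct toy_mount m      = { "m", 0, NULL, &peer3 };
	struct toy_mount *last_source = &p4;
	enum walk_result res;

	/* No node on the chain has peer3 as master, so p is never found. */
	res = walk_masters(&source, &last_source, &m, &peer3, stop_at_peers);
	*ended_at = last_source ? last_source->name : "NULL";
	return res;
}
```

<p>In this model <code class="language-plaintext highlighter-rouge">demo(false, ...)</code> walks past <code class="language-plaintext highlighter-rouge">P3</code> and reports the <code class="language-plaintext highlighter-rouge">NULL</code> dereference, while <code class="language-plaintext highlighter-rouge">demo(true, ...)</code> stops at <code class="language-plaintext highlighter-rouge">P3</code> because it is a peer of <code class="language-plaintext highlighter-rouge">source_mnt</code>.</p>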
<h1 id="conclusion">Conclusion</h1>
<p>This is an example of a very clever but <strong>worryingly</strong> underdocumented algorithm.
Since there isn’t a single detailed comment to be found in the code it has been a giant pain to understand and work through this bug.
A bug like this is very difficult to fix without a detailed understanding of what’s happening.
Let’s not talk about the amount of time that was sunk into fixing this.</p>
<p>Oh, and as predicted the actual <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11933cf1d91d57da9e5c53822a540bbdc2656c16">fix</a> was trivial.</p>
<h1 id="contributing-fixes-improvements-or-corrections-to-this-post">Contributing fixes, improvements, or corrections to this post</h1>
<p>If you have fixes, improvements, or corrections feel free to email them to me or simply open a pull request against <a href="https://github.com/brauner/brauner.github.io">the repository for this blog</a>.</p>
<p>Managing a kernel patch series with b4 (Christian Brauner, 2023-01-02, https://brauner.io/2023/01/02/b4-managed-patch-series)</p>
<p>This is a (live-[?])blog about managing a patch series solely with <code class="language-plaintext highlighter-rouge">b4</code>.
It’s “live” as I’m writing this down while fumbling my way through this adventure.
No corrections other than grammar and spelling (but no guarantees for the correctness of either).</p>
<p><strong>UPDATE START</strong></p>
<p>After having written this Konstantin (the author of <code class="language-plaintext highlighter-rouge">b4</code>) pointed me to https://b4.docs.kernel.org which answers a few questions I had here.</p>
<p><strong>UPDATE END</strong></p>
<p>Ok, I have a <code class="language-plaintext highlighter-rouge">xfstests</code> patch to send.
The patch has been written and tested.
I’ve already created a branch <code class="language-plaintext highlighter-rouge">fstests.setgid.v6.2</code> based on <code class="language-plaintext highlighter-rouge">xfstests</code>’s <code class="language-plaintext highlighter-rouge">for-next</code> branch.</p>
<p>So, first step locate the subcommand that will allow me to do something with that patch/branch.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>b4 --help
</code></pre></div></div>
<p>That shows a subcommand <code class="language-plaintext highlighter-rouge">prep</code> which looks like what I would want:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> b4 prep --help
usage: b4 prep [-h] [-c | -p OUTPUT_DIR | --edit-cover | --show-revision | --force-revision N | --compare-to vN | --manual-reroll COVER_MSGID | --set-prefixes PREFIX [PREFIX ...] |
--show-info] [-n NEW_SERIES_NAME] [-f FORK_POINT] [-F MSGID] [-e ENROLL_BASE]
options:
-h, --help show this help message and exit
-c, --auto-to-cc Automatically populate cover letter trailers with To and Cc addresses
-p OUTPUT_DIR, --format-patch OUTPUT_DIR
Output prep-tracked commits as patches
--edit-cover Edit the cover letter in your defined $EDITOR (or core.editor)
--show-revision Show current series revision number
--force-revision N Force revision to be this number instead
--compare-to vN Display a range-diff to previously sent revision N
--manual-reroll COVER_MSGID
Mark current revision as sent and reroll (requires cover letter msgid)
--set-prefixes PREFIX [PREFIX ...]
Extra prefixes to add to [PATCH] (e.g.: RFC mydrv)
--show-info Show current series info in a column-parseable format
Create new branch:
Create a new branch for working on patch series
-n NEW_SERIES_NAME, --new NEW_SERIES_NAME
Create a new branch for working on a patch series
-f FORK_POINT, --fork-point FORK_POINT
When creating a new branch, use this fork point instead of HEAD
-F MSGID, --from-thread MSGID
When creating a new branch, use this thread
Enroll existing branch:
Enroll existing branch for prep work
-e ENROLL_BASE, --enroll ENROLL_BASE
Enroll current branch, using the passed tag, branch, or commit as fork base
</code></pre></div></div>
<p>Ok, so looks like I first need to enroll that branch.
It wants a base so I’m going to use <code class="language-plaintext highlighter-rouge">HEAD~1</code> as I’m basing this on the <code class="language-plaintext highlighter-rouge">for-next</code> branch of <code class="language-plaintext highlighter-rouge">xfstests</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> b4 prep -e HEAD~1
Will track 1 commits
Created the default cover letter, you can edit with --edit-cover.
</code></pre></div></div>
<p>So that worked.
First question I have is whether I can exmatriculate a branch?</p>
<p>Ah, neat it seems that the cover letter is kept as an empty commit before the actual commit:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 416776f9204e73dfc6900c5e41a7aa47e869203a
Author: Christian Brauner <brauner@kernel.org>
AuthorDate: Tue Jan 3 15:43:06 2023 +0100
Commit: Christian Brauner (Microsoft) <brauner@kernel.org>
CommitDate: Tue Jan 3 15:43:06 2023 +0100
EDITME: cover title for fstests.setgid.v6.2
# Lines starting with # will be removed from the cover letter. You can use
# them to add notes or reminders to yourself.
EDITME: describe the purpose of this series. The information you put here
will be used by the project maintainer to make a decision whether your
patches should be reviewed, and in what priority order. Please be very
detailed and link to any relevant discussions or sites that the maintainer
can review to better understand your proposed changes. If you only have a
single patch in your series, the contents of the cover letter will be
appended to the "under-the-cut" portion of the patch.
# You can add trailers to the cover letter. Any email addresses found in
# these trailers will be added to the addresses specified/generated during
# the b4 send stage. You can also run "b4 prep --auto-to-cc" to auto-populate
# the To: and Cc: trailers based on the code being modified.
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
--- b4-submit-tracking ---
# This section is used internally by b4 prep for tracking purposes.
{
"series": {
"revision": 1,
"change-id": "20230103-fstests-setgid-v6-2-4ce5852d11e2",
"base-branch": null,
"prefixes": []
}
}
</code></pre></div></div>
<p>My solution so far had been to keep it as a branch description.
I don’t have a strong opinion about this though I’m not sure I like the empty commit thing.
But the motivation most likely is that branch descriptions aren’t kept/synced across remotes whereas an empty commit is.</p>
<p>In any case, I’m not convinced that my single patch needs a cover letter so I need to figure out how to get rid of it.
Given that I don’t see a direct command to achieve this the first thing that comes to mind is to make the empty commit’s commit message empty.
Let’s try that:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> b4 prep --edit-cover
New cover letter blank, leaving current one unchanged.
</code></pre></div></div>
<p>Let’s put a comment under there?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 8c6a27fc5b793a4a23bb9d457662ed821b5431df
Author: Christian Brauner <brauner@kernel.org>
AuthorDate: Tue Jan 3 15:43:06 2023 +0100
Commit: Christian Brauner (Microsoft) <brauner@kernel.org>
CommitDate: Tue Jan 3 15:43:06 2023 +0100
# I don't need a cover letter.
--- b4-submit-tracking ---
# This section is used internally by b4 prep for tracking purposes.
{
"series": {
"revision": 1,
"change-id": "20230103-fstests-setgid-v6-2-4ce5852d11e2",
"base-branch": null,
"prefixes": []
}
}
</code></pre></div></div>
<p>It seems that for the next step I might need to use <code class="language-plaintext highlighter-rouge">b4 send</code>. It provides
<code class="language-plaintext highlighter-rouge">--dry-run</code> and <code class="language-plaintext highlighter-rouge">--reflect</code> where the former option is pretty self-explanatory
and the latter option allows me to send the patch series to myself.</p>
<p>Let’s try that:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> b4 send --reflect
Converted the branch to 1 messages
---
To: "Christian Brauner (Microsoft)" <brauner@kernel.org>
---
[PATCH] generic: update setgid tests
+Cc: Amir Goldstein <amir73il@gmail.com>
Zorro Lang <zlang@redhat.com>
---
Ready to:
- send the above messages to just Christian Brauner <brauner@kernel.org> (REFLECT MODE)
- with envelope-from: Christian Brauner <brauner@kernel.org>
- via SMTP server mail.kernel.org
REFLECT MODE:
The To: and Cc: headers will be fully populated, but the only
address given to the mail server for actual delivery will be
Christian Brauner <brauner@kernel.org>
Addresses in To: and Cc: headers will NOT receive this series.
Press Enter to proceed or Ctrl-C to abort
</code></pre></div></div>
<p>That looks good.
Excellent that it points out that it will only be sent to me and no one else.
I like this a lot!</p>
<p>Ah, that was skinny love.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Press Enter to proceed or Ctrl-C to abort
Connecting to mail.kernel.org:587
Traceback (most recent call last):
File "/home/brauner/src/git/b4/b4/command.py", line 376, in <module>
cmd()
File "/home/brauner/src/git/b4/b4/command.py", line 359, in cmd
cmdargs.func(cmdargs)
File "/home/brauner/src/git/b4/b4/command.py", line 86, in cmd_send
b4.ez.cmd_send(cmdargs)
File "/home/brauner/src/git/b4/b4/ez.py", line 1523, in cmd_send
sent = b4.send_mail(smtp, send_msgs, fromaddr=fromaddr, patatt_sign=sign,
File "/home/brauner/src/git/b4/b4/__init__.py", line 3257, in send_mail
bdata = patatt.rfc2822_sign(bdata)
AttributeError: module 'patatt' has no attribute 'rfc2822_sign'
</code></pre></div></div>
<p>Chances are that my <code class="language-plaintext highlighter-rouge">patatt</code> version is too old?
Ok, let’s try <code class="language-plaintext highlighter-rouge">pip3 install --upgrade patatt</code> and try this again:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> b4 send --reflect
Converted the branch to 1 messages
---
To: "Christian Brauner (Microsoft)" <brauner@kernel.org>
---
[PATCH] generic: update setgid tests
+Cc: Amir Goldstein <amir73il@gmail.com>
Zorro Lang <zlang@redhat.com>
---
Ready to:
- send the above messages to just Christian Brauner <brauner@kernel.org> (REFLECT MODE)
- with envelope-from: Christian Brauner <brauner@kernel.org>
- via SMTP server mail.kernel.org
REFLECT MODE:
The To: and Cc: headers will be fully populated, but the only
address given to the mail server for actual delivery will be
Christian Brauner <brauner@kernel.org>
Addresses in To: and Cc: headers will NOT receive this series.
Press Enter to proceed or Ctrl-C to abort
Connecting to mail.kernel.org:587
---
[PATCH] generic: update setgid tests
---
Reflected 1 messages
</code></pre></div></div>
<p>There we go!
Let’s inspect whether the patch looks sane.</p>
<p>I would really appreciate it if I didn’t have to carry a pointless empty commit with the comment <code class="language-plaintext highlighter-rouge"># I don't need a cover letter.</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 8c6a27fc5b793a4a23bb9d457662ed821b5431df
Author: Christian Brauner <brauner@kernel.org>
AuthorDate: Tue Jan 3 15:43:06 2023 +0100
Commit: Christian Brauner (Microsoft) <brauner@kernel.org>
CommitDate: Tue Jan 3 15:43:06 2023 +0100
# I don't need a cover letter.
--- b4-submit-tracking ---
# This section is used internally by b4 prep for tracking purposes.
{
"series": {
"revision": 1,
"change-id": "20230103-fstests-setgid-v6-2-4ce5852d11e2",
"base-branch": null,
"prefixes": []
}
}
</code></pre></div></div>
<p>Seems like that could be improved by at least allowing no content other than the <code class="language-plaintext highlighter-rouge">b4-submit-tracking</code> content in there.</p>
<p>Another thought: while <code class="language-plaintext highlighter-rouge">b4 send</code> allows specifying recipients via <code class="language-plaintext highlighter-rouge">--to</code> and <code class="language-plaintext highlighter-rouge">--cc</code>, it would be neat if one could store additional recipients alongside each patch outside of trailers.
Currently I do this by tagging each commit I generate via <code class="language-plaintext highlighter-rouge">git format-patch</code> based on <code class="language-plaintext highlighter-rouge">git notes</code> which are then moved before the <code class="language-plaintext highlighter-rouge">SUBJECT</code> line.
For example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>From 8575998dc5ac659f9d893220372cabbbb65c323d Mon Sep 17 00:00:00 2001
From: Christian Brauner <a@b>
Date: Fri, 23 Sep 2022 10:29:39 +0200
To: A <a@a>
Cc: B <b@b>
Cc: C <c@c>
Cc: D <d@d>
Cc: E <e@e>
Subject: [PATCH v5 02/30] some: commmimt
</code></pre></div></div>
<p>It would be excellent if that were possible.</p>
<p><strong>UPDATE START</strong></p>
<p>This seems possible:</p>
<blockquote>
<p>What if I only have a single patch?</p>
<p>When you only have a single patch, the contents of the cover letter will be mixed into the “under-the-cut” portion of the patch. You can just use the cover letter for extra To/Cc trailers and changelog entries as your patch goes through revisions. If you add more commits in the future version, you can fill in the cover letter content with additional information about the intent of your entire series.</p>
</blockquote>
<p>(https://b4.docs.kernel.org/en/latest/contributor/prep.html#what-if-i-only-have-a-single-patch)</p>
<p><strong>UPDATE END</strong></p>
<p>For the single patch I can do with just using the <code class="language-plaintext highlighter-rouge">--to</code> and <code class="language-plaintext highlighter-rouge">--cc</code> flags to <code class="language-plaintext highlighter-rouge">b4 send</code>.
So let’s try this but better safe than sorry and also let’s take the chance to try <code class="language-plaintext highlighter-rouge">--dry-run</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> b4 send --cc="Amir Goldstein <amir73il@gmail.com>" --cc="Zorro Lang <zlang@redhat.com>" --to="<fstests@vger.kernel.org>" --dry-run
Converted the branch to 1 messages
--- DRYRUN: message follows ---
| From: Christian Brauner <brauner@kernel.org>
| Date: Tue, 03 Jan 2023 16:26:26 +0100
| Subject: [PATCH] generic: update setgid tests
| MIME-Version: 1.0
| Content-Type: text/plain; charset="utf-8"
| Content-Transfer-Encoding: 7bit
| Message-Id: <20230103-fstests-setgid-v6-2-v1-1-cbdbeef2411a@kernel.org>
| X-B4-Tracking: v=1; b=H4sIACJJtGMC/x2NwQqDMBAFf0X23IVkraX0V0oPMXnqgqQlG6Ug/
| ntDjzMwzEGGojB6dAcV7Gr6zg38paO4hDyDNTUmcdI773qerMKqsaHOmni/sfA1YrgPkryHUCvH
| YOCxhByX1uZtXZv8FEz6/a+er/P8AY8ALQl6AAAA
| To: fstests@vger.kernel.org
| Cc: "Christian Brauner (Microsoft)" <brauner@kernel.org>, Zorro Lang <zlang@redhat.com>,
| Amir Goldstein <amir73il@gmail.com>
| X-Mailer: b4 0.12-dev-214b3
| X-Developer-Signature: v=1; a=openpgp-sha256; l=11235; i=brauner@kernel.org;
| h=from:subject:message-id; bh=Oi7CQzED3JPkJt0k+GEUAJ3KgV2hlR5H9UVIMTL7D54=;
| b=owGbwMvMwCU28Zj0gdSKO4sYT6slMSRv8VRqrb5e/2BddGniFuONH66XL6oU33feadWtOWfnTmMR
| M157sqOUhUGMi0FWTJHFod0kXG45T8Vmo0wNmDmsTCBDGLg4BWAi8+YzMhz+fd3m3CfdR3/fvFxy9t
| Xj36frBTq/Vgvy/isXDvRg7Gdh+J8ZvXPBu2W8mwpPd8ZbXZn3alVEZq9XiPk2YSt5yZVftLgB
| X-Developer-Key: i=brauner@kernel.org; a=openpgp;
| fpr=4880B8C9BD0E5106FC070F4F7B3C391EFEA93624
|
[I'm snipping the actual content as this is not of interest here.]
| ---
| base-commit: fbd489798b31e32f0eaefcd754326a06aa5b166f
| change-id: 20230103-fstests-setgid-v6-2-4ce5852d11e2
|
| Best regards,
| --
| Christian Brauner (Microsoft) <brauner@kernel.org>
--- DRYRUN: message ends ---
---
DRYRUN: Would have sent 1 messages
</code></pre></div></div>
<p>So let’s try this for real:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> b4 send --cc="Amir Goldstein <amir73il@gmail.com>" --cc="Zorro Lang <zlang@redhat.com>" --to="<fstests@vger.kernel.org>"
Converted the branch to 1 messages
---
To: fstests@vger.kernel.org
Cc: "Christian Brauner (Microsoft)" <brauner@kernel.org>
Zorro Lang <zlang@redhat.com>
---
[PATCH] generic: update setgid tests
+Cc: Amir Goldstein <amir73il@gmail.com>
---
Ready to:
- send the above messages to actual listed recipients
- with envelope-from: Christian Brauner <brauner@kernel.org>
- via SMTP server mail.kernel.org
- tag and reroll the series to the next revision
Press Enter to proceed or Ctrl-C to abort
Connecting to mail.kernel.org:587
---
[PATCH] generic: update setgid tests
---
Sent 1 messages
Tagging sent/fstests.setgid.v6.2-v1
Recording series message-id in cover letter tracking
Created new revision v2
Updating cover letter with templated changelog entries.
Invoking git-filter-repo to update the cover letter.
New history written in 0.02 seconds...
Completely finished after 0.05 seconds.
</code></pre></div></div>
<p>Ok, let’s check my own mail to see whether this looks good.
And yes, it does:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Date: Tue, 03 Jan 2023 16:28:20 +0100
From: Christian Brauner <brauner@kernel.org>
To: fstests@vger.kernel.org
Cc: "Christian Brauner (Microsoft)" <brauner@kernel.org>, Zorro Lang <zlang@redhat.com>, Amir Goldstein <amir73il@gmail.com>
Subject: [PATCH] generic: update setgid tests
X-Date: Tue, 03 Jan 2023 16:28:20 +0100
X-URI: https://lore.kernel.org/fstests/20230103-fstests-setgid-v6-2-v1-1-b8972c303ebe@kernel.org
</code></pre></div></div>
<p>This is a (live-[?])blog about managing a patch series solely with b4. It’s “live” as I’m writing this down while fumbling my way through this adventure. No corrections other than grammar and spelling (but no guarantees for the correctness of either).</p>
<h1>The Seccomp Notifier - Cranking up the crazy with bpf()</h1>
<p>Christian Brauner, 2020-08-07, https://brauner.io/2020/08/07/seccomp-notify--intercepting-the-bpf-syscall</p>
<p>In my last article I looked at the <a href="https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development">seccomp notifier</a> in detail and how it allows us to make unprivileged containers way more capable (Sorry, kernel joke.). This is the (very) crazy (but very short) sequel. (Sorry Jon, no novella this time. :))</p>
<p>Last time I mentioned two new features that we had landed:</p>
<ol>
<li><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=83fa805bcbfc53ae82eedd65132794ae324798e5">Retrieving file descriptors from another task via <code class="language-plaintext highlighter-rouge">pidfd_getfd()</code></a></li>
<li><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9ecc6ea491f0c0531ad81ef9466284df260b2227">Injecting file descriptors via the new <code class="language-plaintext highlighter-rouge">SECCOMP_IOCTL_NOTIF_ADDFD</code> ioctl on the seccomp notifier</a></li>
</ol>
<p>The second feature just landed in the merge window for <code class="language-plaintext highlighter-rouge">v5.9</code>. So what better time than now to boot a <code class="language-plaintext highlighter-rouge">v5.9</code> pre-rc1 kernel and play with the new features?</p>
<p>I said that these features make it possible to intercept syscalls that return file descriptors or that pass file descriptors to the kernel. Syscalls that come to mind are <code class="language-plaintext highlighter-rouge">open()</code>, <code class="language-plaintext highlighter-rouge">connect()</code>, <code class="language-plaintext highlighter-rouge">dup2()</code>, but also <code class="language-plaintext highlighter-rouge">bpf()</code>.
People that read the first blogpost might not have realized how crazy^serious one can get with these two new features so I thought it would be a good exercise to illustrate it. And what better victim than <code class="language-plaintext highlighter-rouge">bpf()</code>.</p>
<p>As we know, <code class="language-plaintext highlighter-rouge">bpf()</code> and unprivileged containers don’t get along too well. But that doesn’t need to be the case. For the demo you’re about to see I enabled LXD to supervise the <code class="language-plaintext highlighter-rouge">bpf()</code> syscalls for tasks running in unprivileged containers. We will intercept the <code class="language-plaintext highlighter-rouge">bpf()</code> syscalls for the <code class="language-plaintext highlighter-rouge">BPF_PROG_LOAD</code> command for <code class="language-plaintext highlighter-rouge">BPF_PROG_TYPE_CGROUP_DEVICE</code> program types and for the <code class="language-plaintext highlighter-rouge">BPF_PROG_ATTACH</code> and <code class="language-plaintext highlighter-rouge">BPF_PROG_DETACH</code> commands for the <code class="language-plaintext highlighter-rouge">BPF_CGROUP_DEVICE</code> attach type. This allows a nested unprivileged container to load its own device profile in the cgroup2 hierarchy.</p>
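<p>To make that policy a little more concrete, here is a minimal sketch of the check a supervisor could run once it has copied the caller’s <code class="language-plaintext highlighter-rouge">union bpf_attr</code> out of the container task’s memory. The helper name <code class="language-plaintext highlighter-rouge">bpf_call_is_supervised()</code> is made up for illustration and this is not the code from the LXD pull request; only the constants and field names come from the kernel’s <code class="language-plaintext highlighter-rouge">linux/bpf.h</code> header.</p>

```c
#include <linux/bpf.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical policy helper (not taken from LXD): decide whether an
 * intercepted bpf() call is one of the cases listed above. `attr` is
 * assumed to be a local copy of the caller's union bpf_attr.
 */
static bool bpf_call_is_supervised(int cmd, const union bpf_attr *attr)
{
	switch (cmd) {
	case BPF_PROG_LOAD:
		/* Only device cgroup programs are of interest. */
		return attr->prog_type == BPF_PROG_TYPE_CGROUP_DEVICE;
	case BPF_PROG_ATTACH:
	case BPF_PROG_DETACH:
		/* Only attaching to or detaching from the device hook. */
		return attr->attach_type == BPF_CGROUP_DEVICE;
	default:
		/* Every other bpf() command is left alone. */
		return false;
	}
}
```

<p>Anything this helper rejects would simply be left to the kernel’s normal permission checks.</p>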
<p>This is just a tiny glimpse into how this can be used and extended. ;) The pull request for LXD is already up <a href="https://github.com/lxc/lxd/pull/7743">here</a>. Let’s see if the rest of the team thinks I’m going crazy. :)</p>
<p><a href="https://asciinema.org/a/352191"><img src="https://asciinema.org/a/352181.svg" alt="asciicast" /></a></p>
<h1>The Seccomp Notifier - New Frontiers in Unprivileged Container Development</h1>
<p>Christian Brauner, 2020-07-23, https://brauner.io/2020/07/23/seccomp-notify</p>
<h4 id="introduction">Introduction</h4>
<p>As most people know by now we do a lot of upstream kernel development. This stretches over multiple areas and of course we also do a lot of kernel work around containers. In this article I’d like to take a closer look at the new seccomp notify feature we have been developing both in the kernel and in userspace and that is seeing more and more users. I’ve talked about this feature quite a few times at various conferences (just recently again at <a href="https://ossna2020.sched.com/event/c3WE/making-unprivileged-containers-more-useable-christian-brauner-canonical">OSS NA</a>) over the last two years but never actually sat down to write a blogpost about it. This is something I had wanted to do for quite some time. First, because it is a very exciting feature from a purely technical perspective but also because of the new possibilities it opens up for (unprivileged) containers and other use-cases.</p>
<h4 id="the-limits-of-unprivileged-containers">The Limits of Unprivileged Containers</h4>
<p>That (Linux) Containers are a userspace fiction is a well-known dictum nowadays. It simply expresses the fact that there is no container kernel object in the Linux kernel. Instead, userspace is relatively free to define what a container is. But for the most part userspace agrees that a container is somehow concerned with isolating a task or a task tree from the host system. This is achieved by combining a multitude of Linux kernel features. Namespaces are one of the better known kernel features used to build containers. The number of namespaces the kernel supports has grown over time and we are currently at eight. Before you go and look them up in <code class="language-plaintext highlighter-rouge">namespaces(7)</code> here they are:</p>
<ul>
<li>cgroup: <code class="language-plaintext highlighter-rouge">cgroup_namespaces(7)</code></li>
<li>ipc: <code class="language-plaintext highlighter-rouge">ipc_namespaces(7)</code></li>
<li>network: <code class="language-plaintext highlighter-rouge">network_namespaces(7)</code></li>
<li>mount: <code class="language-plaintext highlighter-rouge">mount_namespaces(7)</code></li>
<li>pid: <code class="language-plaintext highlighter-rouge">pid_namespaces(7)</code></li>
<li>time: <code class="language-plaintext highlighter-rouge">time_namespaces(7)</code></li>
<li>user: <code class="language-plaintext highlighter-rouge">user_namespaces(7)</code></li>
<li>uts: <code class="language-plaintext highlighter-rouge">uts_namespaces(7)</code></li>
</ul>
<p>Of these eight namespaces the user namespace is the only one concerned with isolating core privilege concepts on Linux such as user- and group ids, and capabilities.</p>
<p>Quite often we see tasks in userspace that check whether they run as root or whether they have a specific capability (e.g. <code class="language-plaintext highlighter-rouge">CAP_MKNOD</code> is required to create device nodes) and it seems that when the answer is “yes” then the task is actually a privileged task. But as usual things aren’t that simple. What the task thinks it’s checking for and what the kernel really is checking for are possibly two very different things. A naive task, i.e. a task not aware of user namespaces, might think it’s asking whether it is privileged with respect to the whole system aka the host but what the kernel really checks for is whether the task has the necessary privileges relative to the user namespace it is located in.</p>
<p>In most cases the kernel will not check whether the task is privileged with respect to the whole system. Instead, it will almost always call a function called <code class="language-plaintext highlighter-rouge">ns_capable()</code> which is the kernel’s way of checking whether the calling task has privilege in its current user namespace.</p>
<p>For example, when a new user namespace is created by setting the <code class="language-plaintext highlighter-rouge">CLONE_NEWUSER</code> flag in <code class="language-plaintext highlighter-rouge">unshare(2)</code> or in <code class="language-plaintext highlighter-rouge">clone3(2)</code> the kernel will grant a full set of capabilities to the task that called <code class="language-plaintext highlighter-rouge">unshare(2)</code> or the newly created child task via <code class="language-plaintext highlighter-rouge">clone3(2)</code> <em>within</em> the new user namespace. When this task now e.g. checks whether it has the <code class="language-plaintext highlighter-rouge">CAP_MKNOD</code> capability the kernel will report back that it indeed has that capability. The key point though is that this “yes” is not a global “yes”, i.e. the question “Am I privileged enough to perform this operation?” only applies to the current user namespace (and technically any nested user namespaces) not the host itself.</p>
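<p>This is easy to observe from userspace. The following is a minimal sketch, assuming a kernel and sandbox that allow unprivileged user namespaces (many distributions and container runtimes restrict them, in which case it reports <code class="language-plaintext highlighter-rouge">-1</code>). The helper name <code class="language-plaintext highlighter-rouge">mknod_cap_in_new_userns()</code> is made up for illustration; it unshares a user namespace in a child and checks whether <code class="language-plaintext highlighter-rouge">CAP_MKNOD</code> shows up in the <code class="language-plaintext highlighter-rouge">CapEff</code> line of <code class="language-plaintext highlighter-rouge">/proc/self/status</code>.</p>

```c
#define _GNU_SOURCE
#include <linux/capability.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if CAP_MKNOD is in the effective set right after
 * unshare(CLONE_NEWUSER), 0 if it is not, and -1 if the user namespace
 * could not be created or /proc could not be parsed.
 */
static int mknod_cap_in_new_userns(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		unsigned long long eff = 0;
		char line[256];
		int found = 0;
		FILE *f;

		if (unshare(CLONE_NEWUSER) < 0)
			_exit(2);

		f = fopen("/proc/self/status", "r");
		if (!f)
			_exit(2);
		while (fgets(line, sizeof(line), f)) {
			if (sscanf(line, "CapEff: %llx", &eff) == 1) {
				found = 1;
				break;
			}
		}
		fclose(f);
		if (!found)
			_exit(2);

		/* Exit code 0: capability present, 1: capability missing. */
		_exit(((eff >> CAP_MKNOD) & 1) ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;

	switch (WEXITSTATUS(status)) {
	case 0:
		return 1;
	case 1:
		return 0;
	default:
		return -1;
	}
}
```

<p>A result of <code class="language-plaintext highlighter-rouge">1</code> only says that the capability is held relative to the new user namespace. It says nothing about privilege on the host.</p>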
<p>This distinction is important when trying to understand why a task running as root in a new user namespace with all capabilities raised will still see <code class="language-plaintext highlighter-rouge">EPERM</code> when e.g. trying to call <code class="language-plaintext highlighter-rouge">mknod("/dev/mem", S_IFCHR | 0600, makedev(1, 1))</code> even though it seems to have all necessary privileges. The reason for this counterintuitive behavior is that the kernel isn’t always checking whether you are privileged against your current user namespace. Instead, for any operation that it thinks is dangerous to expose to unprivileged users it will check whether the task is privileged in the initial user namespace, i.e. the host’s user namespace.</p>
<p>Creating device nodes is one such example: if a task running in a user namespace were to be able to create character or block device nodes it could e.g. create <code class="language-plaintext highlighter-rouge">/dev/kmem</code> or any other critical device and use the device to take over the host. So the kernel simply blocks creating all device nodes in user namespaces by always performing the check for required privileges against the initial user namespace. This is of course technically inconsistent since capabilities are per user namespace as we observed above.</p>
<p>Another example where the kernel requires privileges in the initial user namespace is the mounting of block devices. So simply making a disk device node available to an unprivileged container will still not make it useable since the container cannot mount it. On the other hand, some filesystems like <code class="language-plaintext highlighter-rouge">cgroup</code>, <code class="language-plaintext highlighter-rouge">cgroup2</code>, <code class="language-plaintext highlighter-rouge">tmpfs</code>, <code class="language-plaintext highlighter-rouge">proc</code>, <code class="language-plaintext highlighter-rouge">sysfs</code>, and <code class="language-plaintext highlighter-rouge">fuse</code> can be mounted in a user namespace (with some caveats for <code class="language-plaintext highlighter-rouge">proc</code> and <code class="language-plaintext highlighter-rouge">sysfs</code> but we’re ignoring those details for now) because the kernel can guarantee that this is safe.</p>
<p>But of course these restrictions are annoying. Not being able to mount block devices or create device nodes means quite a few workloads are not able to run in containers even though they could be made to run safely. Quite often a container manager like <code class="language-plaintext highlighter-rouge">LXD</code> will know better than the kernel when an operation that a container tries to perform is safe.</p>
<p>Device nodes are a good example. Most containers bind-mount a set of standard devices into the container since it would not work correctly otherwise:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/dev/console
/dev/full
/dev/null
/dev/random
/dev/tty
/dev/urandom
/dev/zero
</code></pre></div></div>
<p>Allowing a container to create these devices would be safe. Of course, the container will simply bind-mount these devices during container startup into the container so this isn’t really a serious problem. But any program running inside the container that wants to create these harmless device nodes would fail.</p>
<p>The other example that was mentioned earlier is mounting of block-based filesystems. Our users often instruct LXD to make certain disk devices available to their containers because they know that it is safe. For example, they could have a dedicated disk for the container or they want to share data with or among containers. But the container could not mount any of those disks.</p>
<p>For any use-case where the administrator is aware that a device node or disk device is missing from the container LXD provides the ability to hotplug them into one or multiple containers. For example, here is how you’d hotplug <code class="language-plaintext highlighter-rouge">/dev/zero</code> into a running container:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> brauner@wittgenstein|~
> lxc exec f5 -- ls -al /my/zero
brauner@wittgenstein|~
> lxc config device add f5 zero-device unix-char source=/dev/zero path=/my/zero
Device zero-device added to f5
brauner@wittgenstein|~
> lxc exec f5 -- ls -al /my/zero
crw-rw---- 1 root root 1, 5 Jul 23 10:47 /my/zero
</code></pre></div></div>
<p>But of course, that doesn’t help at all when a random application inside the container calls <code class="language-plaintext highlighter-rouge">mknod(2)</code> itself. In these cases LXD has no way of helping the application by hotplugging the device as it’s unaware that a mknod syscall has been performed.</p>
<p>So the root of the problem seems to be:</p>
<ul>
<li>A task inside the container performs a syscall that will fail.</li>
<li>The syscall would not need to fail since the container manager knows that it is safe.</li>
<li>The container manager has no way of knowing when such a syscall is performed.</li>
<li>Even if the container manager knew when such a syscall is performed it has no way of inspecting it in detail.</li>
</ul>
<p>So a potential solution to this problem seems to be to enable the container manager or any sufficiently privileged task to take action on behalf of the container whenever it performs a syscall that would usually fail. So somehow we need to be able to interact with the syscalls of another task.</p>
<h4 id="seccomp---the-basics-of-syscall-interception">Seccomp - The Basics of Syscall Interception</h4>
<p>The obvious candidate to look at is seccomp. Short for “secure computing” it provides a way of restricting the syscalls of a task either by allowing only a subset of the syscalls the kernel supports or by denying a set of syscalls it thinks would be unsafe for the task in question. But seccomp allows even more advanced configurations through so-called “filters”. Filters are BPF programs (Not to be equated with eBPF. BPF is a predecessor of eBPF.) that can be written in userspace and loaded into the kernel. For example, a task could use a seccomp filter to only allow the <code class="language-plaintext highlighter-rouge">mount()</code> syscall and only those mount syscalls that create bind mounts. This simple syscall management mechanism has made seccomp an essential security feature for a lot of userspace programs. Nowadays it is considered good practice to restrict any critical program to only those syscalls it absolutely needs to run successfully. Browser-based sandboxes and containers are prime examples but even systemd services can be seccomp restricted.</p>
<p>At its core seccomp is nothing but a syscall interception mechanism. One way or another every operating system has something that is at least roughly comparable. The way seccomp works is that it intercepts syscalls right in the architecture specific syscall entry paths. So the seccomp invocations themselves live in the architecture specific codepaths although most of the logic around it is architecture agnostic.</p>
<p>Usually, when a syscall is performed, and no seccomp filter has been applied to the task issuing the syscall the kernel will simply look up the syscall number in the architecture specific syscall table and if it is a known syscall will perform it reporting back the result to userspace.</p>
<p>But when a seccomp filter is loaded for the task issuing the syscall instead of directly looking up the syscall number in the architecture’s syscall table the kernel will first call into seccomp and run the loaded seccomp filter.</p>
<p>Depending on whether a deny or allow approach is used for the seccomp filter any syscall that the filter is not handling specifically is either performed or denied reporting back a specified default value to the calling task. If the requested syscall is supposed to be specifically handled by the seccomp filter the kernel can e.g. be caused to report back a specific error code. This way, it is for example possible to have the kernel pretend like it doesn’t know the <code class="language-plaintext highlighter-rouge">mount(2)</code> syscall by creating a seccomp filter that reports back <code class="language-plaintext highlighter-rouge">ENOSYS</code> whenever the task tries to call <code class="language-plaintext highlighter-rouge">mount(2)</code>.</p>
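<p>As a minimal sketch of that last example, this is roughly what a classic BPF filter that answers <code class="language-plaintext highlighter-rouge">mount(2)</code> with <code class="language-plaintext highlighter-rouge">ENOSYS</code> could look like. The names <code class="language-plaintext highlighter-rouge">enosys_mount_filter</code>, <code class="language-plaintext highlighter-rouge">install_enosys_filter()</code>, and <code class="language-plaintext highlighter-rouge">NATIVE_AUDIT_ARCH</code> are made up for illustration, only x86_64 and arm64 are covered, and the x32 ABI is deliberately ignored to keep the filter short.</p>

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* constant for this architecture"
#endif

static struct sock_filter enosys_mount_filter[] = {
	/* Load the audit architecture; kill anything that is not native. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_AUDIT_ARCH, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
	/* Load the syscall number; pretend mount(2) does not exist. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mount, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
	/* Default action: every other syscall is allowed. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Loads the filter for the calling task. Returns 0 on success, -1 on error. */
static int install_enosys_filter(void)
{
	struct sock_fprog prog = {
		.len = sizeof(enosys_mount_filter) / sizeof(enosys_mount_filter[0]),
		.filter = enosys_mount_filter,
	};

	/* Needed so that an unprivileged task may load a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
		return -1;

	return (int)syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog);
}
```

<p>Once loaded, the filter cannot be removed again and is inherited by all children, which is exactly the static behavior discussed next.</p>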
<p>But the way seccomp used to work isn’t very dynamic. Specifically, once a filter is loaded the decision whether or not the syscall is successful is fixed based on the policy expressed by the filter. So there is no way to make a case-by-case decision which might come in handy in some scenarios.</p>
<p>In addition seccomp itself can’t make a syscall actually succeed other than in the trivial way of reporting back success to the caller. So seccomp will only allow the kernel to pretend that a syscall succeeded. So while it is possible to instruct the kernel to return 0 for the <code class="language-plaintext highlighter-rouge">mount(2)</code> syscall it cannot actually be instructed to make the <code class="language-plaintext highlighter-rouge">mount(2)</code> syscall succeed. So just making the seccomp filter return 0 for mounting a dedicated <code class="language-plaintext highlighter-rouge">ext4</code> disk device to <code class="language-plaintext highlighter-rouge">/mnt</code> will still not actually mount it at <code class="language-plaintext highlighter-rouge">/mnt</code>; it just pretends to the caller that it did. Of course that is in itself already a useful property for a bunch of use-cases but it doesn’t really help with the <code class="language-plaintext highlighter-rouge">mknod(2)</code> or <code class="language-plaintext highlighter-rouge">mount(2)</code> problem outlined above.</p>
<h4 id="extending-seccomp">Extending Seccomp</h4>
<p>So from the section above it should be clear that seccomp provides a few desirable properties that make it a natural candidate to look at to help solve our <code class="language-plaintext highlighter-rouge">mknod(2)</code> and <code class="language-plaintext highlighter-rouge">mount(2)</code> problem. Since seccomp intercepts syscalls early in the syscall path it already gives us a hook into the syscall path of a given task. What is missing though is a way to bring another task such as the LXD container manager into the picture. Somehow we need to modify seccomp in a way that makes it possible for a container manager to not just be informed when a task inside the container performs a syscall it wants to be informed about but also to block the task until the container manager instructs the kernel to allow it to proceed.</p>
<p>The answer to these questions is the seccomp notifier. This is as good a time as any to bring in some historical context. The exact origins of the idea for a more dynamic way to intercept syscalls is probably not recoverable and it has been thrown around in unspecific form in various discussions but nothing serious ever materialized. The first concrete details around the seccomp notifier were conceived in early 2017 in the LXD team. The first public talk around the basic idea for this feature was given by Stéphane Graber at the Linux Plumbers Conference 2017 during the Containers Microconference in Los Angeles. The details of this talk are still listed <a href="https://blog.linuxplumbersconf.org/2017/ocw/sessions/4795.html">here</a> and I’m sure Stéphane can still provide the slides we came up with. I didn’t find a video recording even though I somehow thought we did have one. If someone is really curious I can try to investigate with the Linux Plumbers committee. After this talk implementation specifics were discussed in a hallway meeting later that day. And after a long arduous journey the implementation was upstreamed by Tycho Andersen who used to be on the LXD team. The rest is history^wchangelog.</p>
<h4 id="seccomp-notify---syscall-interception-20">Seccomp Notify - Syscall Interception 2.0</h4>
<p>In its essence, the seccomp notify mechanism is simply a file descriptor (fd) for a specific seccomp filter. When a container starts it will usually load a seccomp filter to restrict its attack surface. That is even done for unprivileged containers even though it is not strictly necessary.</p>
<p>With the addition of the seccomp notifier a container wishing to have a subset of syscalls handled by another process can have its seccomp filter return the new <code class="language-plaintext highlighter-rouge">SECCOMP_RET_USER_NOTIF</code> action for those syscalls and load the filter with the <code class="language-plaintext highlighter-rouge">SECCOMP_FILTER_FLAG_NEW_LISTENER</code> flag. This flag instructs the kernel to return a file descriptor to the calling task after having loaded its filter. This file descriptor is a seccomp notify file descriptor.</p>
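<p>In terms of the raw API this is a minimal sketch of how a task could obtain such a file descriptor: the filter returns <code class="language-plaintext highlighter-rouge">SECCOMP_RET_USER_NOTIF</code> for the syscalls it wants to hand off, and passing <code class="language-plaintext highlighter-rouge">SECCOMP_FILTER_FLAG_NEW_LISTENER</code> to <code class="language-plaintext highlighter-rouge">seccomp(2)</code> is what makes the call return the notify fd. The names <code class="language-plaintext highlighter-rouge">notify_mount_filter</code> and <code class="language-plaintext highlighter-rouge">install_notify_filter()</code> are made up for illustration, and the architecture check a real filter needs is left out for brevity.</p>

```c
#define _GNU_SOURCE
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Simplification: a real filter must check seccomp_data.arch first. */
static struct sock_filter notify_mount_filter[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* mount(2) is forwarded to whoever holds the notify fd ... */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mount, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
	/* ... and everything else is allowed. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Returns the seccomp notify fd on success, -1 with errno set on error. */
static int install_notify_filter(void)
{
	struct sock_fprog prog = {
		.len = sizeof(notify_mount_filter) / sizeof(notify_mount_filter[0]),
		.filter = notify_mount_filter,
	};

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
		return -1;

	return (int)syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
			    SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

<p>The returned fd would then typically be sent to the supervising process, for example over a unix socket via <code class="language-plaintext highlighter-rouge">SCM_RIGHTS</code>.</p>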
<p>Of course, the seccomp notify fd is not very useful to the task itself. First, since it doesn’t make a lot of sense apart from very weird use-cases for a task to listen for its own syscalls. Second, because the task would likely block itself indefinitely pretty quickly without taking extreme care.</p>
<p>But what the task can do with the seccomp notify fd is to hand it to another task. Usually the task that it will hand the seccomp notify fd to will be more privileged than itself. For a container the most obvious candidate would be the container manager of course.</p>
<p>Since the seccomp notify fd is pollable it is possible to put it into an event loop such as <code class="language-plaintext highlighter-rouge">epoll(7)</code>, <code class="language-plaintext highlighter-rouge">poll(2)</code>, or <code class="language-plaintext highlighter-rouge">select(2)</code> and wait for the file descriptor to become readable, i.e. for the kernel to return <code class="language-plaintext highlighter-rouge">EPOLLIN</code> to userspace. For the seccomp notify fd to become readable means that the seccomp filter it refers to has detected that one of the tasks it has been applied to has performed a syscall that is part of the policy it implements. This is a complicated way of saying the kernel is notifying the container manager that a task in the container has performed a syscall it cares about, e.g. <code class="language-plaintext highlighter-rouge">mknod(2)</code> or <code class="language-plaintext highlighter-rouge">mount(2)</code>.</p>
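<p>Waiting for a notification is ordinary fd polling. A minimal sketch with <code class="language-plaintext highlighter-rouge">poll(2)</code>, where the helper name <code class="language-plaintext highlighter-rouge">wait_for_notification()</code> is made up for illustration:</p>

```c
#include <poll.h>

/*
 * Wait until notify_fd becomes readable. Returns 1 when a notification
 * is pending, 0 on timeout, and -1 on error or when the fd reports
 * something other than readability (for example a hangup).
 */
static int wait_for_notification(int notify_fd, int timeout_ms)
{
	struct pollfd pfd = {
		.fd = notify_fd,
		.events = POLLIN,
	};
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret <= 0)
		return ret;

	return (pfd.revents & POLLIN) ? 1 : -1;
}
```

<p>A container manager would more likely register the fd with its existing <code class="language-plaintext highlighter-rouge">epoll(7)</code> loop, but the readiness condition is the same.</p>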
<p>Put another way, this means the container manager can listen for syscall events for tasks running in the container. Now instead of simply running the filter and immediately reporting back to the calling task the kernel will send a notification to the container manager on the seccomp notify fd and block the task performing the syscall.</p>
<p>After the seccomp notify fd indicates that it is readable the container manager can use the new <code class="language-plaintext highlighter-rouge">SECCOMP_IOCTL_NOTIF_RECV</code> <code class="language-plaintext highlighter-rouge">ioctl()</code> associated with seccomp notify fds to read a <code class="language-plaintext highlighter-rouge">struct seccomp_notif</code> message for the syscall. Currently the data to be read from the seccomp notify fd includes the following pieces. But please be aware that we are in the process of discussing potentially intrusive changes for future versions:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">seccomp_notif</span> <span class="p">{</span>
<span class="n">__u64</span> <span class="n">id</span><span class="p">;</span>
<span class="n">__u32</span> <span class="n">pid</span><span class="p">;</span>
<span class="n">__u32</span> <span class="n">flags</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">seccomp_data</span> <span class="n">data</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Let’s look at this in a little more detail. The <code class="language-plaintext highlighter-rouge">pid</code> field is the <code class="language-plaintext highlighter-rouge">pid</code> of the task that performed the syscall as seen in the pid namespace of the task reading the notification. To stay within the realm of our current examples, this is simply the pid of the task in the container that e.g. called <code class="language-plaintext highlighter-rouge">mknod(2)</code> as seen in the pid namespace of the container manager. The <code class="language-plaintext highlighter-rouge">id</code> field is a unique identifier for the performed syscall. This can be used to verify that the task is still alive and the syscall request still valid to avoid any race conditions caused by pid recycling. The <code class="language-plaintext highlighter-rouge">flags</code> field is currently unused and reserved for future extensions.</p>
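<p>Receiving a notification and re-checking its <code class="language-plaintext highlighter-rouge">id</code> could be sketched like this. The helper names are made up for illustration; <code class="language-plaintext highlighter-rouge">SECCOMP_IOCTL_NOTIF_ID_VALID</code> is the ioctl the kernel offers for the validity check, and the sketch assumes the <code class="language-plaintext highlighter-rouge">struct seccomp_notif</code> layout from the installed kernel headers rather than querying the sizes at runtime.</p>

```c
#include <linux/seccomp.h>
#include <string.h>
#include <sys/ioctl.h>

/* Read one pending notification. Returns 0 on success, -1 on error. */
static int recv_notification(int notify_fd, struct seccomp_notif *req)
{
	/* The kernel expects the buffer to be zeroed before the call. */
	memset(req, 0, sizeof(*req));
	return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req);
}

/*
 * Returns 1 if the request identified by `id` is still pending, i.e. the
 * task is still blocked in the syscall, and 0 otherwise.
 */
static int notification_still_valid(int notify_fd, __u64 id)
{
	return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
}
```

<p>The validity check is worth repeating after every step that depends on <code class="language-plaintext highlighter-rouge">pid</code>, such as opening a file under <code class="language-plaintext highlighter-rouge">/proc/&lt;pid&gt;</code>, because the pid may have been recycled in the meantime.</p>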
<p>The <code class="language-plaintext highlighter-rouge">struct seccomp_data</code> argument is probably the most interesting one as it contains the really exciting bits and pieces:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">seccomp_data</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">nr</span><span class="p">;</span>
<span class="n">__u32</span> <span class="n">arch</span><span class="p">;</span>
<span class="n">__u64</span> <span class="n">instruction_pointer</span><span class="p">;</span>
<span class="n">__u64</span> <span class="n">args</span><span class="p">[</span><span class="mi">6</span><span class="p">];</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">nr</code> field is the syscall number which can only be correctly interpreted relative to the <code class="language-plaintext highlighter-rouge">arch</code> field. The <code class="language-plaintext highlighter-rouge">arch</code> field is the (audit) architecture for which this syscall was made. This field is very relevant since compatible architectures (For the <code class="language-plaintext highlighter-rouge">x86</code> architectures this encompasses at least <code class="language-plaintext highlighter-rouge">x32</code>, <code class="language-plaintext highlighter-rouge">i386</code>, and <code class="language-plaintext highlighter-rouge">x86_64</code>. The <code class="language-plaintext highlighter-rouge">arm</code>, <code class="language-plaintext highlighter-rouge">mips</code>, and <code class="language-plaintext highlighter-rouge">power</code> architectures also have compatible “sub” architectures.) are stackable and the returned syscall number might be different than the current headers imply (For example, you could be making a syscall from a 32 bit userspace on a 64 bit kernel. The intercepted syscall might have different syscall numbers on 32 bit and on 64 bit, for example syscall <code class="language-plaintext highlighter-rouge">foo()</code> might have syscall number 1 on 32 bit and 2 on 64 bit. So the task reading the seccomp data can’t simply assume that since it itself is running in a 32 bit environment the syscall number must be 1. Rather, it must check what the audit <code class="language-plaintext highlighter-rouge">arch</code> is and then check that the value of the syscall is 1 on 32 bit or 2 on 64 bit. Otherwise the container manager might end up emulating <code class="language-plaintext highlighter-rouge">mount()</code> when it should be emulating <code class="language-plaintext highlighter-rouge">mknod()</code>.).
The <code class="language-plaintext highlighter-rouge">instruction_pointer</code> is set to the address of the instruction that performed the syscall. This is of course also architecture specific. Lastly, the <code class="language-plaintext highlighter-rouge">args</code> member holds the syscall arguments that the task performed the syscall with.</p>
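<p>To illustrate why <code class="language-plaintext highlighter-rouge">nr</code> must never be looked at without <code class="language-plaintext highlighter-rouge">arch</code>, here is a minimal sketch for the x86 family only. The helper name <code class="language-plaintext highlighter-rouge">is_mknod()</code> is made up for illustration and the syscall numbers are hardcoded on purpose: <code class="language-plaintext highlighter-rouge">mknod(2)</code> is 133 on <code class="language-plaintext highlighter-rouge">x86_64</code> and 14 on <code class="language-plaintext highlighter-rouge">i386</code>.</p>

```c
#include <linux/audit.h>
#include <linux/seccomp.h>
#include <stdbool.h>

/*
 * Decide whether the intercepted syscall is mknod(2). The numbers are
 * spelled out because the supervisor's own headers only describe the
 * architecture the supervisor was compiled for.
 */
static bool is_mknod(const struct seccomp_data *data)
{
	switch (data->arch) {
	case AUDIT_ARCH_X86_64:
		return data->nr == 133;
	case AUDIT_ARCH_I386:
		return data->nr == 14;
	default:
		/* Unknown architecture: refuse to guess. */
		return false;
	}
}
```

<p>Number 14 on <code class="language-plaintext highlighter-rouge">x86_64</code> and number 133 on <code class="language-plaintext highlighter-rouge">i386</code> are entirely different syscalls, so a supervisor that skips the <code class="language-plaintext highlighter-rouge">arch</code> check would end up emulating the wrong one.</p>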
<p>The <code class="language-plaintext highlighter-rouge">args</code> need to be interpreted and treated differently depending on the syscall layout and their type. If they are non-pointer arguments (<code class="language-plaintext highlighter-rouge">unsigned int</code> etc.) they can be copied into a local variable and interpreted right away. But if they are pointer arguments they are offsets into the virtual memory of the task that performed the syscall. In the latter case the memory needs to be read and copied before it can be interpreted.</p>
<p>Let’s look at a concrete example to figure out why it is vital to know the syscall layout other than for knowing the types of the syscall arguments. Say the performed syscall was <code class="language-plaintext highlighter-rouge">mount(2)</code>. In order to interpret the <code class="language-plaintext highlighter-rouge">args</code> field correctly we look at the <em>syscall</em> layout of <code class="language-plaintext highlighter-rouge">mount()</code>. (Please note that I’m stressing that we need to look at the layout of the <em>syscall</em>, and the only reliable source for this is the kernel source code. The Linux manpages often list the wrapper provided by the system’s libc and these wrappers do not necessarily line up with the syscall itself (compare the <code class="language-plaintext highlighter-rouge">waitid()</code> wrapper and the <code class="language-plaintext highlighter-rouge">waitid()</code> syscall or the various <code class="language-plaintext highlighter-rouge">clone()</code> syscall layouts).) From the layout of <code class="language-plaintext highlighter-rouge">mount(2)</code> we see that <code class="language-plaintext highlighter-rouge">args[0]</code> is a pointer argument identifying the source path, <code class="language-plaintext highlighter-rouge">args[1]</code> is another pointer argument identifying the target path, <code class="language-plaintext highlighter-rouge">args[2]</code> is a pointer argument identifying the filesystem type, <code class="language-plaintext highlighter-rouge">args[3]</code> is a non-pointer argument holding the mount flags, and <code class="language-plaintext highlighter-rouge">args[4]</code> is another pointer argument identifying additional mount options.</p>
<p>So if we were to be interested in the source path of this <code class="language-plaintext highlighter-rouge">mount(2)</code> syscall we would need to open the <code class="language-plaintext highlighter-rouge">/proc/<pid>/mem</code> file of the task that performed this syscall and e.g. use the <code class="language-plaintext highlighter-rouge">pread(2)</code> function with <code class="language-plaintext highlighter-rouge">args[0]</code> as the offset into the task’s virtual memory and read it into a buffer at least the length of a standard path. Alternatively, we can use a single syscall like <code class="language-plaintext highlighter-rouge">process_vm_readv(2)</code> to read multiple remote pointers at different locations all in one go. Once we have done this we can interpret it.</p>
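<p>As a rough sketch of the <code class="language-plaintext highlighter-rouge">/proc/&lt;pid&gt;/mem</code> approach described above, the following helper copies a string out of another task’s memory with <code class="language-plaintext highlighter-rouge">pread(2)</code>. The name <code class="language-plaintext highlighter-rouge">read_remote_string()</code> is hypothetical, and the code assumes a 64-bit <code class="language-plaintext highlighter-rouge">off_t</code> and a caller that is allowed to open the target’s <code class="language-plaintext highlighter-rouge">mem</code> file.</p>

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the bytes found at `addr` in the address space of
 * task `pid` into `buf` and NUL-terminate them. Returns 0 on success and -1
 * on error.
 */
static int read_remote_string(pid_t pid, uint64_t addr, char *buf, size_t len)
{
	char path[64];
	ssize_t n;
	int fd;

	if (len == 0)
		return -1;

	snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	/* A short read is fine: the string may end close to an unmapped page. */
	n = pread(fd, buf, len - 1, (off_t)addr);
	close(fd);
	if (n <= 0)
		return -1;

	/* Never rely on the remote task to have terminated the string. */
	buf[n] = '\0';
	return 0;
}
```

<p>For <code class="language-plaintext highlighter-rouge">mount(2)</code> the supervisor would call this with <code class="language-plaintext highlighter-rouge">args[0]</code> as <code class="language-plaintext highlighter-rouge">addr</code> and a buffer of <code class="language-plaintext highlighter-rouge">PATH_MAX</code> bytes to obtain the source path.</p>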
<p>A piece of friendly advice: in general it is a good idea for the container manager to read all syscall arguments <em>once</em> into a local buffer and base its decisions on how to proceed on the data in this local buffer. This is not just because the container manager would otherwise be unable to interpret pointer arguments. Rereading the task’s memory is also a possible attack vector since a sufficiently privileged attacker (e.g. a thread in the same thread-group) can write to <code class="language-plaintext highlighter-rouge">/proc/<pid>/mem</code> and change the memory that e.g. <code class="language-plaintext highlighter-rouge">args[0]</code> or any other pointer argument refers to. Also note that the container manager should ensure that <code class="language-plaintext highlighter-rouge">/proc/<pid></code> still refers to the same task after opening it by checking the validity of the syscall request via the <code class="language-plaintext highlighter-rouge">id</code> field and the associated <code class="language-plaintext highlighter-rouge">SECCOMP_IOCTL_NOTIF_ID_VALID</code> <code class="language-plaintext highlighter-rouge">ioctl()</code>. This excludes the possibility of the task having exited, been reaped, and its pid having been recycled.</p>
<p>But let’s assume we have done all that. Now that the container manager has the task’s syscall arguments available in a local buffer it can interpret the syscall arguments. While it is doing so the target task remains blocked waiting for the kernel to tell it to proceed. After the container manager is done interpreting the arguments and has performed whatever action it wanted to perform it can use the <code class="language-plaintext highlighter-rouge">SECCOMP_IOCTL_NOTIF_SEND</code> <code class="language-plaintext highlighter-rouge">ioctl()</code> on the seccomp notify fd to tell the kernel what it should do with the blocked task’s syscall. The response is given in the form of a <code class="language-plaintext highlighter-rouge">struct seccomp_notif_resp</code>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">seccomp_notif_resp</span> <span class="p">{</span>
<span class="n">__u64</span> <span class="n">id</span><span class="p">;</span>
<span class="n">__s64</span> <span class="n">val</span><span class="p">;</span>
<span class="n">__s32</span> <span class="n">error</span><span class="p">;</span>
<span class="n">__u32</span> <span class="n">flags</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Let’s look at this struct in a little more detail too. The <code class="language-plaintext highlighter-rouge">id</code> field is set to the <code class="language-plaintext highlighter-rouge">id</code> of the syscall request to respond to and should correspond to the received <code class="language-plaintext highlighter-rouge">id</code> in the <code class="language-plaintext highlighter-rouge">struct seccomp_notif</code> that the container manager read via the <code class="language-plaintext highlighter-rouge">SECCOMP_IOCTL_NOTIF_RECV</code> <code class="language-plaintext highlighter-rouge">ioctl()</code> when the seccomp notify fd became readable. The <code class="language-plaintext highlighter-rouge">val</code> field is the return value of the syscall and is only set if the <code class="language-plaintext highlighter-rouge">error</code> field is set to 0. The <code class="language-plaintext highlighter-rouge">error</code> field is the error to return from the syscall and should be set to a negative <code class="language-plaintext highlighter-rouge">errno(3)</code> code if the syscall is supposed to fail (For example, to trick the caller into thinking that <code class="language-plaintext highlighter-rouge">mount(2)</code> is not supported on this kernel set <code class="language-plaintext highlighter-rouge">error</code> to <code class="language-plaintext highlighter-rouge">-ENOSYS</code>.). The <code class="language-plaintext highlighter-rouge">flags</code> value can be used to tell the kernel to continue the syscall by setting the <code class="language-plaintext highlighter-rouge">SECCOMP_USER_NOTIF_FLAG_CONTINUE</code> flag which I added to be able to intercept <code class="language-plaintext highlighter-rouge">mount(2)</code> and other syscalls that are difficult for seccomp to filter efficiently because of the restrictions around pointer arguments. More on that in a little bit.</p>
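<p>A small sketch of how the two basic kinds of responses could be filled in. The helper names <code class="language-plaintext highlighter-rouge">resp_success()</code> and <code class="language-plaintext highlighter-rouge">resp_failure()</code> are invented for this example; only the struct and its fields come from the kernel headers.</p>

```c
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>

/* Report success: the blocked syscall returns `val` to its caller. */
static struct seccomp_notif_resp resp_success(__u64 id, __s64 val)
{
	struct seccomp_notif_resp resp;

	memset(&resp, 0, sizeof(resp));
	resp.id = id;   /* must match the id of the received notification */
	resp.val = val; /* only meaningful because error stays 0 */
	return resp;
}

/* Report failure: `err` is a positive errno value such as ENOSYS. */
static struct seccomp_notif_resp resp_failure(__u64 id, int err)
{
	struct seccomp_notif_resp resp;

	memset(&resp, 0, sizeof(resp));
	resp.id = id;
	resp.error = -err; /* the kernel expects a negative errno code */
	return resp;
}
```

<p>Either struct would then be handed to the kernel with <code class="language-plaintext highlighter-rouge">ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &amp;resp)</code>, where <code class="language-plaintext highlighter-rouge">notify_fd</code> stands for the seccomp notify fd.</p>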
<p>With this machinery in place we are for now ;) done with the kernel bits.</p>
<h4 id="emulating-syscalls-in-userspace">Emulating Syscalls In Userspace</h4>
<p>So what is the container manager supposed to do after having read and interpreted the syscall information for the task running in the container, and before telling the kernel to let the task continue? Probably emulate it. Otherwise we just have a fancy and less performant seccomp userspace policy (please read my comments on why that is a <em>very</em> bad idea).</p>
<p>Emulating syscalls in userspace is not a very new thing to do. It has been done for a long time. For example, a libc can choose to emulate the <code class="language-plaintext highlighter-rouge">execveat(2)</code> syscall which allows a task to exec a program by providing a file descriptor to the binary instead of a path. On a kernel that doesn’t support the <code class="language-plaintext highlighter-rouge">execveat(2)</code> syscall the libc can emulate it by calling <code class="language-plaintext highlighter-rouge">exec(3)</code> with the path set to <code class="language-plaintext highlighter-rouge">/proc/self/fd/<nr></code>. The problem of course is that this emulation only works when the task in question actually uses the libc wrapper (<code class="language-plaintext highlighter-rouge">fexecve(3)</code> for our example). Any task using <code class="language-plaintext highlighter-rouge">syscall(__NR_execveat, [...])</code> to perform the syscall without going through the provided wrapper will be bypassing libc, so libc doesn’t know that the task wants to perform the <code class="language-plaintext highlighter-rouge">execveat(2)</code> syscall and will not be able to emulate it in case the kernel doesn’t support it.</p>
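<p>For illustration, a library-level fallback of this kind could look roughly like the sketch below. It is not taken from any particular libc; the names <code class="language-plaintext highlighter-rouge">proc_fd_path()</code> and <code class="language-plaintext highlighter-rouge">fexecve_fallback()</code> are invented, and the code assumes <code class="language-plaintext highlighter-rouge">/proc</code> is mounted.</p>

```c
#include <stdio.h>
#include <unistd.h>

/* Build "/proc/self/fd/<fd>" in `buf`. Returns 0 on success, -1 if it does not fit. */
static int proc_fd_path(int fd, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/proc/self/fd/%d", fd);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical fallback for kernels without execveat(2): exec the binary
 * behind `fd` through its /proc/self/fd/<fd> path. Only returns on error.
 */
static int fexecve_fallback(int fd, char *const argv[], char *const envp[])
{
	char path[64];

	if (proc_fd_path(fd, path, sizeof(path)) < 0)
		return -1;

	return execve(path, argv, envp);
}
```

<p>A task that issues the raw syscall never reaches a function like this, which is exactly the limitation described above.</p>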
<p>The seccomp notifier doesn’t suffer from this problem since its syscall interception abilities aren’t located in userspace at the library level but directly in the syscall path as we have seen. This greatly expands the abilities to emulate syscalls.</p>
<p>So now we have all the kernel pieces in place to solve our <code class="language-plaintext highlighter-rouge">mknod(2)</code> and <code class="language-plaintext highlighter-rouge">mount(2)</code> problem in unprivileged containers. Instead of simply letting the container fail on such harmless requests as creating the <code class="language-plaintext highlighter-rouge">/dev/zero</code> device node we can use the seccomp notifier to intercept the syscall and emulate it for the container in userspace by simply creating the device node for it. Similarly, we can intercept <code class="language-plaintext highlighter-rouge">mount(2)</code> requests requiring the user to e.g. give us a list of allowed filesystems to mount for the container and performing the mount for the container. We can even make this a lot safer by providing a user with the ability to specify a fuse binary that should be used when a task in the container tries to mount a filesystem. We actually support this feature in LXD. Since fuse is a safe way for unprivileged users to mount filesystems rewriting <code class="language-plaintext highlighter-rouge">mount(2)</code> requests is a great way to expose filesystems to containers.</p>
<p>In general, the possibilities of the seccomp notifier can’t be overstated and we are extremely happy that this work is now not just fully integrated into the Linux kernel but also into both LXD and LXC. As with many other technologies we have driven both in the upstream kernel and in userspace it directly benefits not just our users but all of userspace with the seccomp notifier seeing adoption in browsers and by other companies. A whole range of Travis workloads can now run in unprivileged LXD containers thanks to the seccomp notifier.</p>
<h4 id="seccomp-notify-in-action---lxd">Seccomp Notify in action - LXD</h4>
<p>After finishing the kernel bits we implemented support for it in LXD and the LXC shared library it uses. Instead of simply exposing the raw seccomp notify fd for the container’s seccomp filter directly to LXD, each container connects to a multi-threaded socket that the LXD container manager exposes and on which it listens for new clients. Clients here are new containers which the administrator has signed up for syscall supervision through LXD. Each container has a dedicated syscall supervisor which runs as a separate goroutine and stays around for as long as the container is running.</p>
<p>When the container performs a syscall that the filter applies to, a notification is generated on the seccomp notify fd. The container then forwards this request including some additional data on the socket it connected to during startup by sending a unix message including necessary credentials. LXD then interprets the message, checking the validity of the request, verifying the credentials, and processing the syscall arguments. If LXD can prove that the request is valid according to the policy the administrator specified for the container LXD will proceed to emulate the syscall. For <code class="language-plaintext highlighter-rouge">mknod(2)</code> it will create the device node for the container and for <code class="language-plaintext highlighter-rouge">mount(2)</code> it will mount the filesystem for the container, either by directly mounting it or by using a specified fuse binary for additional security.</p>
<p>If LXD manages to emulate the syscall successfully it will prepare a response that it will forward on the socket to the container. The container then parses the message, verifies the credentials, and uses the <code class="language-plaintext highlighter-rouge">SECCOMP_IOCTL_NOTIF_SEND</code> <code class="language-plaintext highlighter-rouge">ioctl()</code> to send a <code class="language-plaintext highlighter-rouge">struct seccomp_notif_resp</code>, causing the kernel to unblock the task performing the syscall and report back that the syscall succeeded. Conversely, if LXD fails to emulate the syscall for whatever reason or the syscall is not allowed by the policy the administrator specified, it will prepare a message that instructs the container to report back that the syscall failed and to unblock the task.</p>
<h4 id="show-me">Show Me!</h4>
<p>Ok, enough talk. Let’s intercept some syscalls. The following demo shows how LXD uses the seccomp notify fd to emulate the <code class="language-plaintext highlighter-rouge">mknod(2)</code> and <code class="language-plaintext highlighter-rouge">mount(2)</code> syscalls for an unprivileged container:</p>
<script id="asciicast-285491" src="https://asciinema.org/a/285491.js" async=""></script>
<h4 id="current-work-and-future-directions">Current Work and Future Directions</h4>
<h5 id="seccomp_user_notif_flag_continue"><code class="language-plaintext highlighter-rouge">SECCOMP_USER_NOTIF_FLAG_CONTINUE</code></h5>
<p>After the initial support for the seccomp notify fd landed we ran into limitations pretty quickly. We realized we couldn’t intercept the mount syscall. Since the mount syscall has various pointer arguments it is difficult to write highly specific seccomp filters such that we only accept syscalls that we intended to intercept. This is caused by seccomp not being able to handle pointer arguments. They are opaque for seccomp. So while it is possible to tell seccomp to only intercept <code class="language-plaintext highlighter-rouge">mount(2)</code> requests for real filesystems by only intercepting <code class="language-plaintext highlighter-rouge">mount(2)</code> syscalls where the <code class="language-plaintext highlighter-rouge">MS_BIND</code> flag is not set in the flags argument it is not possible to write a seccomp filter that only notifies the container manager about <code class="language-plaintext highlighter-rouge">mount(2)</code> syscalls for the <code class="language-plaintext highlighter-rouge">ext4</code> or <code class="language-plaintext highlighter-rouge">btrfs</code> filesystem because the filesystem argument is a pointer.</p>
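<p>To show what such a filter can and cannot express, here is a sketch of a classic BPF program that notifies the supervisor only for <code class="language-plaintext highlighter-rouge">mount(2)</code> calls without <code class="language-plaintext highlighter-rouge">MS_BIND</code>. It is written for little-endian <code class="language-plaintext highlighter-rouge">x86_64</code> only, the array and macro names are made up, and it is not the filter LXD or LXC actually generate. Note that nothing in it can look at the filesystem type, because that argument is a pointer.</p>

```c
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/mount.h> /* MS_BIND */

#define NR_MOUNT_X86_64 165 /* mount(2) in the x86_64 syscall table */

static struct sock_filter mount_notify_filter[] = {
	/* [0] Load the audit architecture. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	/* [1] Not x86_64: jump to ALLOW at [7]; this sketch ignores other arches. */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 5),
	/* [2] Load the syscall number. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* [3] Not mount(2): jump to ALLOW at [7]. */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NR_MOUNT_X86_64, 0, 3),
	/* [4] Load the low 32 bits of args[3], the mount flags (little-endian). */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[3])),
	/* [5] MS_BIND set: jump to ALLOW at [7], otherwise fall through. */
	BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, MS_BIND, 1, 0),
	/* [6] A non-bind mount: hand it to the supervisor. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
	/* [7] Everything else runs as usual. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

static struct sock_fprog mount_notify_prog = {
	.len = sizeof(mount_notify_filter) / sizeof(mount_notify_filter[0]),
	.filter = mount_notify_filter,
};
```

<p>A program like this would be loaded with the <code class="language-plaintext highlighter-rouge">seccomp(2)</code> syscall using <code class="language-plaintext highlighter-rouge">SECCOMP_SET_MODE_FILTER</code> and the <code class="language-plaintext highlighter-rouge">SECCOMP_FILTER_FLAG_NEW_LISTENER</code> flag, which returns the notify fd.</p>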
<p>But this means we will inadvertently intercept syscalls that we didn’t intend to intercept. That is a generic problem but for some syscalls it’s not really a big deal. For example, we know that <code class="language-plaintext highlighter-rouge">mknod(2)</code> fails for all character and block devices in unprivileged containers. So as long as we write a seccomp filter that intercepts only character and block device <code class="language-plaintext highlighter-rouge">mknod(2)</code> syscalls but no socket or fifo <code class="language-plaintext highlighter-rouge">mknod()</code> syscalls we don’t have a problem. For any character or block device that is not in the list of allowed devices in LXD we can simply instruct LXD to prepare a seccomp message that tells the kernel to report <code class="language-plaintext highlighter-rouge">EPERM</code> and since the syscalls would fail anyway there’s no problem.</p>
<p>But <em>any</em> system call that we intercepted as a consequence of seccomp not being able to filter on pointer arguments that would succeed in unprivileged containers would need to be emulated in userspace. But this would of course include all <code class="language-plaintext highlighter-rouge">mount(2)</code> syscalls for filesystems that can be mounted in unprivileged containers. I’ve listed a subset of them above. It includes at least <code class="language-plaintext highlighter-rouge">tmpfs</code>, <code class="language-plaintext highlighter-rouge">proc</code>, <code class="language-plaintext highlighter-rouge">sysfs</code>, <code class="language-plaintext highlighter-rouge">devpts</code>, <code class="language-plaintext highlighter-rouge">cgroup</code>, <code class="language-plaintext highlighter-rouge">cgroup2</code> and probably a few others I’m forgetting. That’s not ideal. We only want to emulate syscalls that we really have to emulate, i.e. those that would actually fail.</p>
<p>The solution to this problem was a patchset of mine that added the ability to continue an intercepted syscall. To instruct the kernel to continue the syscall the <code class="language-plaintext highlighter-rouge">SECCOMP_USER_NOTIF_FLAG_CONTINUE</code> flag can be set in <code class="language-plaintext highlighter-rouge">struct seccomp_notif_resp</code>’s flag argument when instructing the kernel to unblock the task.</p>
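<p>In code, a “continue” response is just a response with the flag set and everything else zeroed. A minimal sketch, with the helper name <code class="language-plaintext highlighter-rouge">resp_continue()</code> invented for this example and a fallback definition of the flag for older headers:</p>

```c
#include <string.h>
#include <linux/seccomp.h>

/* Older kernel headers may not define the flag yet. */
#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Hypothetical helper: tell the kernel to run the intercepted syscall itself. */
static struct seccomp_notif_resp resp_continue(__u64 id)
{
	struct seccomp_notif_resp resp;

	/* val and error must stay 0, otherwise the kernel rejects the response. */
	memset(&resp, 0, sizeof(resp));
	resp.id = id;
	resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
	return resp;
}
```

<p>The supervisor would send this for every intercepted <code class="language-plaintext highlighter-rouge">mount(2)</code> that it does not need to emulate, for example a <code class="language-plaintext highlighter-rouge">tmpfs</code> mount that the kernel allows in an unprivileged container anyway.</p>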
<p>This is of course a very exciting feature and has a few readers probably thinking “Hm, I could implement a dynamic userspace seccomp policy.” to which I want to very loudly respond “No, you can’t!”. In general, the seccomp notify fd cannot be used to implement any kind of security policy in userspace. I’m now going to mostly quote verbatim from my comment for the extension: The <code class="language-plaintext highlighter-rouge">SECCOMP_USER_NOTIF_FLAG_CONTINUE</code> flag must be used with extreme caution! If set by the task supervising the syscalls of another task the syscall will continue. This is problematic because of an inherent TOCTOU (Time of Check-Time of Use) race. An attacker can exploit the time while the supervised task is waiting on a response from the supervising task to rewrite syscall arguments which are passed as pointers of the intercepted syscall. It should be absolutely clear that this means that the seccomp notifier <em>cannot</em> be used to implement a security policy on syscalls that read from dereferenced pointers in user space! It should only ever be used in scenarios where a more privileged task supervises the syscalls of a lesser privileged task to get around kernel-enforced security restrictions when the privileged task deems this safe. In other words, in order to continue a syscall the supervising task should be sure that another security mechanism or the kernel itself will sufficiently block syscalls if arguments are rewritten to something unsafe.</p>
<p>Similar precautions should be applied when stacking <code class="language-plaintext highlighter-rouge">SECCOMP_RET_USER_NOTIF</code> or <code class="language-plaintext highlighter-rouge">SECCOMP_RET_TRACE</code>. For <code class="language-plaintext highlighter-rouge">SECCOMP_RET_USER_NOTIF</code> filters acting on the same syscall, the most recently added filter takes precedence. This means that the new <code class="language-plaintext highlighter-rouge">SECCOMP_RET_USER_NOTIF</code> filter can override any <code class="language-plaintext highlighter-rouge">SECCOMP_IOCTL_NOTIF_SEND</code> from earlier filters, essentially allowing all such filtered syscalls to be executed by sending the response <code class="language-plaintext highlighter-rouge">SECCOMP_USER_NOTIF_FLAG_CONTINUE</code>. Note that <code class="language-plaintext highlighter-rouge">SECCOMP_RET_TRACE</code> can equally be overridden by <code class="language-plaintext highlighter-rouge">SECCOMP_USER_NOTIF_FLAG_CONTINUE</code>.</p>
<h5 id="retrieving-file-descriptors-pidfd_getfd">Retrieving file descriptors <code class="language-plaintext highlighter-rouge">pidfd_getfd()</code></h5>
<p>Another extension that was added by <a href="https://twitter.com/sargun">Sargun Dhillon</a> recently building on top of my pidfd work was to make it possible to retrieve file descriptors from another task. This works even without the seccomp notifier since it is a new syscall but is of course especially useful in conjunction with it.</p>
<p>Often we would like to intercept syscalls such as <code class="language-plaintext highlighter-rouge">connect(2)</code>. For example, the container manager might want to rewrite the <code class="language-plaintext highlighter-rouge">connect(2)</code> request to something other than the task intended for security reasons or because the task lacks the necessary information about the networking layout to connect to the right endpoint. In these cases <code class="language-plaintext highlighter-rouge">pidfd_getfd(2)</code> can be used to retrieve a copy of the file descriptor of the task and perform the <code class="language-plaintext highlighter-rouge">connect(2)</code> for it. This unblocks another wide range of use-cases.</p>
<p>For example, it can be used for further introspection into file descriptors than ss or netstat would typically give you, as you can do things like run <code class="language-plaintext highlighter-rouge">getsockopt(2)</code> on the file descriptor, and you can use options like <code class="language-plaintext highlighter-rouge">TCP_INFO</code> to fetch a significant amount of information about the socket. Not only can you fetch information about the socket, but you can also set fields like <code class="language-plaintext highlighter-rouge">TCP_NODELAY</code> to tune the socket without requiring the user’s intervention. This mechanism, in conjunction with the seccomp notifier, can be used to build a rudimentary layer 4 load balancer where <code class="language-plaintext highlighter-rouge">connect(2)</code> calls are intercepted, and the destination is changed to a real server instead.</p>
<p>Early results indicate that this method can yield incredibly good latency as compared to other layer 4 load balancing techniques.</p>
<div>
<a href="https://plotly.com/~sargun/63/?share_key=TBxaZob2h9GiGD9LxVuFuE" target="_blank" title="Plot 63" style="display: block; text-align: center;"><img src="https://plotly.com/~sargun/63.png?share_key=TBxaZob2h9GiGD9LxVuFuE" alt="Plot 63" style="max-width: 100%;width: 600px;" width="600" onerror="this.onerror=null;this.src='https://plotly.com/404.png';" /></a>
</div>
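<p>The basic building block behind these use-cases is small. The sketch below wraps the two raw syscalls involved; the helper name <code class="language-plaintext highlighter-rouge">grab_remote_fd()</code> is hypothetical, and the fallback syscall numbers assume an architecture that uses the common numbering for these syscalls (such as <code class="language-plaintext highlighter-rouge">x86_64</code>). The caller needs ptrace-level access to the target task.</p>

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older libc headers. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

/*
 * Hypothetical helper: return a local duplicate of file descriptor
 * `remote_fd` of task `pid`, or -1 on error.
 */
static int grab_remote_fd(pid_t pid, int remote_fd)
{
	int pidfd, fd;

	/* A pidfd is a stable handle on the task, unlike the raw pid. */
	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;

	/* The last argument is a flags value and must currently be 0. */
	fd = (int)syscall(SYS_pidfd_getfd, pidfd, remote_fd, 0);
	close(pidfd);
	return fd;
}
```

<p>With the duplicate in hand the supervisor can, for example, call <code class="language-plaintext highlighter-rouge">getsockopt(2)</code> or <code class="language-plaintext highlighter-rouge">connect(2)</code> on the socket on behalf of the task. Both descriptors refer to the same open file description, so the task sees the effect.</p>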
<h5 id="injecting-file-descriptors-seccomp_notify_ioctl_addfd">Injecting file descriptors <code class="language-plaintext highlighter-rouge">SECCOMP_NOTIFY_IOCTL_ADDFD</code></h5>
<p>Current work for the upcoming merge window is focussed on making it possible to inject file descriptors into a task. As things stand, we are unable to intercept syscalls such as <code class="language-plaintext highlighter-rouge">open(2)</code> that cause new file descriptors to be installed in the task performing the syscall (unless we share the file descriptor table with the task, which is almost never the case for container managers and the containers they supervise).</p>
<p>The new seccomp extension effectively allows the container manager to instruct the target task to install a set of file descriptors into its own file descriptor table before instructing it to move on. This way it is possible to intercept syscalls such as <code class="language-plaintext highlighter-rouge">open(2)</code> or <code class="language-plaintext highlighter-rouge">accept(2)</code>, and install (or replace, like <code class="language-plaintext highlighter-rouge">dup2(2)</code>) the container manager’s resulting fd in the target task.</p>
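<p>A minimal sketch of the supervisor side, assuming the interface as it later landed upstream in Linux 5.9, where the ioctl is spelled <code class="language-plaintext highlighter-rouge">SECCOMP_IOCTL_NOTIF_ADDFD</code> and takes a <code class="language-plaintext highlighter-rouge">struct seccomp_notif_addfd</code>. The helper name <code class="language-plaintext highlighter-rouge">inject_fd()</code> is made up, and the code needs kernel headers that are recent enough to carry these definitions.</p>

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: install a copy of the supervisor's `local_fd` into the
 * task that is blocked on notification `id`. Returns the fd number allocated
 * in the target task, or -1 on error.
 */
static int inject_fd(int notify_fd, __u64 id, int local_fd)
{
	struct seccomp_notif_addfd addfd;

	memset(&addfd, 0, sizeof(addfd));
	addfd.id = id;                 /* notification the injection belongs to */
	addfd.srcfd = (__u32)local_fd; /* fd in the supervisor's own table */
	addfd.flags = 0;               /* let the kernel pick the target fd number */
	addfd.newfd_flags = 0;         /* O_CLOEXEC could be requested here */

	return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
}
```

<p>When emulating <code class="language-plaintext highlighter-rouge">open(2)</code>, the supervisor would open the file itself, inject the resulting fd, and then report the returned number back as the syscall’s return value in <code class="language-plaintext highlighter-rouge">struct seccomp_notif_resp</code>’s <code class="language-plaintext highlighter-rouge">val</code> field.</p>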
<p>This new technique opens the door to being able to make massive changes in userspace. For example, techniques such as enabling unprivileged access to <code class="language-plaintext highlighter-rouge">perf_event_open(2)</code> and <code class="language-plaintext highlighter-rouge">bpf(2)</code> for tracing are available via this mechanism. The manager can inspect the program and the way the perf events are being set up to prevent the user from doing ill to the system. On top of that, various network techniques, such as zero-cost IPv6 transition mechanisms, could be introduced in the future.</p>
<p>Last, I want to note that <a href="https://twitter.com/sargun">Sargun Dhillon</a> was kind enough to contribute paragraphs to the <code class="language-plaintext highlighter-rouge">pidfd_getfd(2)</code> and <code class="language-plaintext highlighter-rouge">SECCOMP_NOTIFY_IOCTL_ADDFD</code> sections. He also provided the graphic in the <code class="language-plaintext highlighter-rouge">pidfd_getfd(2)</code> section to illustrate the performance benefits of this solution.</p>
<p>Christian</p>
<p>Slides for Kernel Recipes, Paris 2019: pidfd: Process file descriptors on Linux (2019-10-01, https://brauner.io/2019/10/01/kernel-recipes-pidfds)</p>
<p><a href="https://brauner.io/_img/2019_kernel_recipes_pidfds.pdf">Slides (pdf)</a></p>
<p>Slides for Open Source Summit (OSS) North America, San Diego 2019: New Container Kernel Features (2019-08-23, https://brauner.io/2019/08/23/oss-na-new-container-kernel-features)</p>
<p><a href="https://brauner.io/_img/2019_oss_na_new_container_kernel_features.pdf">Slides</a></p>
<p>Linux Kernel VFSisms (2019-06-28, https://brauner.io/2019/06/28/vfs-wisdom)</p>
<h4 id="introduction">Introduction</h4>
<p>This is intended as a collection of helpful knowledge bits around Linux kernel
VFS internals. It mostly contains (hopefully) useful bits and pieces I picked
up while working on the Linux kernel and talking to VFS maintainers or
high-profile contributors.</p>
<h4 id="ksys_close"><code class="language-plaintext highlighter-rouge">ksys_close()</code></h4>
<p>Should never be used. One of the major reasons being that it is too easy to get
wrong.</p>
<h4 id="on-creating-and-installing-new-file-descriptors">On creating and installing new file descriptors</h4>
<p>A file descriptor should only be installed past every possible point of
failure. Specifically for a syscall the file descriptor should be installed
right before returning to userspace.
Consider the function <code class="language-plaintext highlighter-rouge">anon_inode_getfd()</code>. This function creates and installs
a new file descriptor for a task. Hence, by the rule given above it should only
ever be called when the syscall cannot fail anymore in any other way than by
failing <code class="language-plaintext highlighter-rouge">anon_inode_getfd()</code>.</p>
<p>For all other cases the rule is to <strong>reserve</strong> a file descriptor but defer the
<strong>installation</strong> of the file descriptor past the last point of failure. Note,
that installing a file descriptor itself is not an operation that can fail.</p>
<p>Back to the anonymous inode example: Instead of calling <code class="language-plaintext highlighter-rouge">anon_inode_getfd()</code>
callers who need a file descriptor before the last point of failure should
reserve a file descriptor, call <code class="language-plaintext highlighter-rouge">anon_inode_getfile()</code> and then defer the
<code class="language-plaintext highlighter-rouge">fd_install()</code> until after the last point of failure. Here is a concrete
example blessed by Al Viro:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">if</span> <span class="p">(</span><span class="n">clone_flags</span> <span class="o">&</span> <span class="n">CLONE_PIDFD</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* reserve a new file descriptor */</span>
<span class="n">retval</span> <span class="o">=</span> <span class="n">get_unused_fd_flags</span><span class="p">(</span><span class="n">O_RDWR</span> <span class="o">|</span> <span class="n">O_CLOEXEC</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">retval</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span>
<span class="k">goto</span> <span class="n">bad_fork_free_pid</span><span class="p">;</span>
<span class="n">pidfd</span> <span class="o">=</span> <span class="n">retval</span><span class="p">;</span>
<span class="cm">/* get file to associate with file descriptor */</span>
<span class="n">pidfile</span> <span class="o">=</span> <span class="n">anon_inode_getfile</span><span class="p">(</span><span class="s">"[pidfd]"</span><span class="p">,</span> <span class="o">&</span><span class="n">pidfd_fops</span><span class="p">,</span> <span class="n">pid</span><span class="p">,</span>
<span class="n">O_RDWR</span> <span class="o">|</span> <span class="n">O_CLOEXEC</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">IS_ERR</span><span class="p">(</span><span class="n">pidfile</span><span class="p">))</span> <span class="p">{</span>
<span class="n">put_unused_fd</span><span class="p">(</span><span class="n">pidfd</span><span class="p">);</span>
<span class="n">retval</span> <span class="o">=</span> <span class="n">PTR_ERR</span><span class="p">(</span><span class="n">pidfile</span><span class="p">);</span>
<span class="k">goto</span> <span class="n">bad_fork_free_pid</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">get_pid</span><span class="p">(</span><span class="n">pid</span><span class="p">);</span> <span class="cm">/* held by pidfile now */</span>
<span class="cm">/* place file descriptor in buffer accessible for userspace */</span>
<span class="n">retval</span> <span class="o">=</span> <span class="n">put_user</span><span class="p">(</span><span class="n">pidfd</span><span class="p">,</span> <span class="n">parent_tidptr</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">retval</span><span class="p">)</span>
<span class="k">goto</span> <span class="n">bad_fork_put_pidfd</span><span class="p">;</span>
<span class="p">}</span>
<span class="cm">/* a lot more code that can fail somehow */</span>
<span class="cm">/* Let kill terminate clone/fork in the middle */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">fatal_signal_pending</span><span class="p">(</span><span class="n">current</span><span class="p">))</span> <span class="p">{</span>
<span class="n">retval</span> <span class="o">=</span> <span class="o">-</span><span class="n">EINTR</span><span class="p">;</span>
<span class="k">goto</span> <span class="n">bad_fork_cancel_cgroup</span><span class="p">;</span>
<span class="p">}</span>
<span class="cm">/* past the last point of failure */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">pidfile</span><span class="p">)</span>
<span class="n">fd_install</span><span class="p">(</span><span class="n">pidfd</span><span class="p">,</span> <span class="n">pidfile</span><span class="p">);</span>
</code></pre></div></div>
<h4 id="setting-i_nlink-after-creating-a-directory-in-kernel">Setting <code class="language-plaintext highlighter-rouge">i_nlink</code> after creating a directory in-kernel</h4>
<p>When a new directory is created in a filesystem the inode needs to be
initialized. The <code class="language-plaintext highlighter-rouge">new_inode</code> for the directory needs to get a link count of <code class="language-plaintext highlighter-rouge">2</code>
(for <code class="language-plaintext highlighter-rouge">.</code> and <code class="language-plaintext highlighter-rouge">..</code>) and the link count of the <code class="language-plaintext highlighter-rouge">parent_inode</code> of the parent directory needs to
be incremented. There are a few places in the kernel where this is done like this:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">inc_nlink</span><span class="p">(</span><span class="n">new_inode</span><span class="p">);</span>
<span class="n">d_instantiate</span><span class="p">(</span><span class="n">dentry</span><span class="p">,</span> <span class="n">new_inode</span><span class="p">);</span>
<span class="n">inc_nlink</span><span class="p">(</span><span class="n">parent_inode</span><span class="p">);</span>
<span class="n">fsnotify_mkdir</span><span class="p">(</span><span class="n">parent_inode</span><span class="p">,</span> <span class="n">dentry</span><span class="p">);</span>
</code></pre></div></div>
<p>But the preferred method of doing this is:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">set_nlink</span><span class="p">(</span><span class="n">new_inode</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">d_instantiate</span><span class="p">(</span><span class="n">dentry</span><span class="p">,</span> <span class="n">new_inode</span><span class="p">);</span>
<span class="n">inc_nlink</span><span class="p">(</span><span class="n">parent_inode</span><span class="p">);</span>
<span class="n">fsnotify_mkdir</span><span class="p">(</span><span class="n">parent_inode</span><span class="p">,</span> <span class="n">dentry</span><span class="p">);</span>
</code></pre></div></div>
<p>since <code class="language-plaintext highlighter-rouge">new_inode</code> cannot be modified by someone else concurrently.</p>Christian BraunerIntroductionRuntimes And the Curse of the Privileged Container2019-02-12T00:00:00+01:002019-02-12T00:00:00+01:00https://brauner.io/2019/02/12/privileged-containers<h4 id="introduction-cve-2019-5736">Introduction (<a href="https://seclists.org/oss-sec/2019/q1/119">CVE-2019-5736</a>)</h4>
<p>Today, Monday, 2019-02-11, 14:00:00 CET <a href="https://seclists.org/oss-sec/2019/q1/119">CVE-2019-5736</a> was released:</p>
<blockquote>
<p>The vulnerability allows a malicious container to (with minimal user
interaction) overwrite the host runc binary and thus gain root-level
code execution on the host. The level of user interaction is being able
to run any command (it doesn’t matter if the command is not
attacker-controlled) as root within a container in either of these
contexts:</p>
<ul>
<li>Creating a new container using an attacker-controlled image.</li>
<li>Attaching (docker exec) into an existing container which the
attacker had previous write access to.</li>
</ul>
</blockquote>
<p>I’ve been working on a fix for this issue over the last couple of weeks
together with <a href="https://www.cyphar.com/">Aleksa</a>, a friend of mine and maintainer of runC. When he
notified me about the issue in runC we tried to come up with an exploit for
<a href="https://github.com/lxc/lxc">LXC</a> as well and, though harder, it is doable.
I was interested in the issue for technical reasons and figuring out how to
reliably fix it was quite fun (with a proper dose of pure hatred). It also
caused me to finally write down some personal thoughts I had for a long time
about how we are running containers.</p>
<h4 id="what-are-privileged-containers">What are Privileged Containers?</h4>
<p>At first glance this is a question that is probably trivial to anyone who has
a decent low-level understanding of containers. Maybe even most users by now
will know what a privileged container is. A first pass at defining it would be
to say that a privileged container is a container that is owned by root.
Looking closer this seems an insufficient definition. What about containers
using user namespaces that are started as root?
It seems we need to distinguish between what ids a container is running with.
So we could say a privileged container is a container that is running as root.
However, this is still wrong. Because “running as root” can either be seen as
meaning “running as root as seen from the outside” or “running as root from the
inside” where “outside” means “as seen from a task outside the container” and
“inside” means “as seen from a task inside the container”.</p>
<p>What we really mean by a privileged container is a container where the
semantics for id 0 are the same inside and outside of the container ceteris
paribus. I say “ceteris paribus” because using LSMs, seccomp or any other
security mechanism will not cause a change in the meaning of id 0 inside and
outside the container. For example, a breakout caused by a bug in the runtime
implementation will give you root access on the host.</p>
<p>An unprivileged container then simply is any container in which the semantics
for id 0 inside the container are different from id 0 outside the container.
For example, a breakout caused by a bug in the runtime implementation will not
give you root access on the host by default. This should only be possible if
the kernel’s user namespace implementation has a bug.</p>
<p>The reason why I like to define privileged containers this way is that it also
lets us handle edge cases. Specifically, the case where a container is using
a user namespace but a hole is punched into the idmapping at id 0 aka where id
0 is mapped through. Consider a container that uses the following idmappings:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id: 0 100000 100000
</code></pre></div></div>
<p>This instructs the kernel to set up the following mapping:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id: container_id(0) -> host_id(100000)
id: container_id(1) -> host_id(100001)
id: container_id(2) -> host_id(100002)
.
.
.
container_id(99999) -> host_id(199999)
</code></pre></div></div>
<p>With this mapping it’s evident that <code class="language-plaintext highlighter-rouge">container_id(0) != host_id(0)</code>. But now
consider the following mapping:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id: 0 0 1
id: 1 100001 99999
</code></pre></div></div>
<p>This instructs the kernel to set up the following mapping:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id: container_id(0) -> host_id(0)
id: container_id(1) -> host_id(100001)
id: container_id(2) -> host_id(100002)
.
.
.
container_id(99999) -> host_id(199999)
</code></pre></div></div>
<p>In contrast to the first example this has the consequence that <code class="language-plaintext highlighter-rouge">container_id(0)
== host_id(0)</code>.
I would argue that any container that at least punches a hole for id 0 into its
idmapping up to specifying an identity mapping is to be considered a privileged
container.</p>
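<p>To make the two mappings above concrete, here is a minimal sketch of how such a translation could be written in userspace C. The names <code class="language-plaintext highlighter-rouge">struct id_extent</code>, <code class="language-plaintext highlighter-rouge">container_to_host_id()</code> and <code class="language-plaintext highlighter-rouge">is_privileged()</code> are made up for this illustration. This is not the kernel’s implementation, only a model of the "first container id, first host id, count" triples shown above.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* One idmapping line: "<first container id> <first host id> <count>". */
struct id_extent {
	uint32_t container_first;
	uint32_t host_first;
	uint32_t count;
};

/* Hypothetical marker for "this container id has no mapping". */
#define ID_UNMAPPED UINT32_MAX

/* The two example mappings from the text. */
const struct id_extent unprivileged_map[] = {
	{ 0, 100000, 100000 },
};

const struct id_extent hole_at_zero_map[] = {
	{ 0, 0, 1 },
	{ 1, 100001, 99999 },
};

/* Translate a container id into a host id, or return ID_UNMAPPED. */
uint32_t container_to_host_id(const struct id_extent *map, size_t n, uint32_t id)
{
	for (size_t i = 0; i < n; i++) {
		if (id >= map[i].container_first &&
		    id - map[i].container_first < map[i].count)
			return map[i].host_first + (id - map[i].container_first);
	}

	return ID_UNMAPPED;
}

/* Privileged in the sense used here: container id 0 is host id 0. */
int is_privileged(const struct id_extent *map, size_t n)
{
	return container_to_host_id(map, n, 0) == 0;
}
```

<p>With these definitions <code class="language-plaintext highlighter-rouge">is_privileged(unprivileged_map, 1)</code> is false and <code class="language-plaintext highlighter-rouge">is_privileged(hole_at_zero_map, 2)</code> is true, which is exactly the distinction drawn above.</p>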
<p>As a sidenote, Docker containers run as privileged containers by default. There
is usually some confusion where people think because they do not use the
<code class="language-plaintext highlighter-rouge">--privileged</code> flag that Docker containers run unprivileged. This is wrong.
What the <code class="language-plaintext highlighter-rouge">--privileged</code> flag does is to give you even more permissions by e.g.
not dropping (specific or even any) capabilities. One could say that such
containers are almost “super-privileged”.</p>
<h4 id="the-trouble-with-privileged-containers">The Trouble with Privileged Containers</h4>
<p>The problem I see with privileged containers is essentially captured by
<a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a>’s and <a href="https://github.com/lxc/lxd">LXD</a>’s upstream security position which we have held since
at least <a href="https://github.com/lxc/linuxcontainers.org/commit/b1a45aef6abc885594aab2ce6bdeb2186c5e0973">2015</a> but probably even earlier. I’m quoting from our <a href="https://linuxcontainers.org/lxc/security/#privileged-containers">notes about
privileged containers</a>:</p>
<blockquote>
<p>Privileged containers are defined as any container where the container uid 0 is
mapped to the host’s uid 0. In such containers, protection of the host and
prevention of escape is entirely done through Mandatory Access Control
(apparmor, selinux), seccomp filters, dropping of capabilities and namespaces.</p>
<p>Those technologies combined will typically prevent any accidental damage of the
host, where damage is defined as things like reconfiguring host hardware,
reconfiguring the host kernel or accessing the host filesystem.</p>
<p>LXC upstream’s position is that those containers aren’t and cannot be
root-safe.</p>
<p>They are still valuable in an environment where you are running trusted
workloads or where no untrusted task is running as root in the container.</p>
<p>We are aware of a number of exploits which will let you escape such containers
and get full root privileges on the host. Some of those exploits can be
trivially blocked and so we do update our different policies once made aware of
them. Some others aren’t blockable as they would require blocking so many core
features that the average container would become completely unusable.</p>
</blockquote>
<p>[…]</p>
<blockquote>
<p>As privileged containers are considered unsafe, we typically will not consider
new container escape exploits to be security issues worthy of a CVE and quick
fix. We will however try to mitigate those issues so that accidental damage to
the host is prevented.</p>
</blockquote>
<p>LXC’s upstream position for a long time has been that privileged containers are
not and cannot be root safe. For something to be considered root safe it should
be safe to hand root access to third parties or tasks.</p>
<h4 id="running-untrusted-workloads-in-privileged-containers">Running Untrusted Workloads in Privileged Containers</h4>
<p>is insane. That’s about everything that this paragraph should contain. The fact
that the semantics for id 0 inside and outside the container are identical
entails that any meaningful container escape will have the attacker gain root
on the host.</p>
<h4 id="cve-2019-5736-is-a-very-very-very-bad-privilege-escalation-to-host-root"><a href="https://seclists.org/oss-sec/2019/q1/119">CVE-2019-5736</a> Is a Very Very Very Bad Privilege Escalation to Host Root</h4>
<p><a href="https://seclists.org/oss-sec/2019/q1/119">CVE-2019-5736</a> is an excellent illustration of such an attack. Think about
it: a process running <strong>inside</strong> a privileged container can rather trivially
corrupt the binary that is used to attach to the container. This allows an
attacker to create a custom ELF binary on the host. That binary could do
anything it wants:</p>
<ul>
<li>could just be a binary that calls <code class="language-plaintext highlighter-rouge">poweroff</code></li>
<li>could be a binary that spawns a root shell</li>
<li>could be a binary that kills other containers when called again to attach</li>
<li>could be <code class="language-plaintext highlighter-rouge">suid</code> <code class="language-plaintext highlighter-rouge">cat</code></li>
<li>.</li>
<li>.</li>
<li>.</li>
</ul>
<p>The attack vector is actually slightly worse for runC due to its architecture.
Since runC exits after spawning the container it can also be attacked through
a malicious container image. This is super bad given that a lot of container
workflows rely on downloading images from the web.</p>
<p><a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a> cannot be attacked through a malicious image since the monitor process
(a singleton per-container) never exits during the container’s life cycle. Since
the kernel does not allow modifications to running binaries it is not possible
for the attacker to corrupt it. When the container is shut down or killed the
attacking task will be killed before it can do any harm. Only when the last
process running inside the container has exited will the monitor itself exit.
This has the consequence that if you run privileged OCI containers via our
<code class="language-plaintext highlighter-rouge">oci</code> template with <a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a> you are not vulnerable to malicious images. Only
the vector through the attaching binary still applies.</p>
<h4 id="the-lie-that-privileged-containers-can-be-safe">The Lie that Privileged Containers can be safe</h4>
<p>Aside from mostly working on the Kernel I’m also a maintainer of <a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a> and
<a href="https://github.com/lxc/lxd">LXD</a> alongside <a href="https://stgraber.org/">Stéphane Graber</a>. We are responsible for <a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a> - the
low-level container runtime - and <a href="https://github.com/lxc/lxd">LXD</a> - the container management daemon
using <a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a>.
We have made a very conscious decision to consider privileged containers not
root safe. Two main corollaries follow from this:</p>
<ol>
<li>Privileged containers should never be used to run untrusted workloads.</li>
<li>Breakouts from privileged containers are not considered CVEs by our security
policy.</li>
</ol>
<p>It still seems a common belief that if we all just try hard enough, using
privileged containers for untrusted workloads is safe. This is not a promise
that can be made good on. A privileged container is not a security boundary.
The reason for this is simply what we looked at above: <code class="language-plaintext highlighter-rouge">container_id(0) ==
host_id(0)</code>.
It is therefore deeply troubling that this industry is happy to let users
believe that they are safe and secure using privileged containers.</p>
<h4 id="unprivileged-containers-as-default">Unprivileged Containers as Default</h4>
<p>As upstream for <a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a> and <a href="https://github.com/lxc/lxd">LXD</a> we have been advocating the use of
unprivileged containers by default for years, well before anyone else did.
Our low-level library <a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a> has supported unprivileged containers since 2013
when user namespaces were merged into the kernel. With <a href="https://github.com/lxc/lxd">LXD</a> we have taken
it one step further and made unprivileged containers the default and privileged
containers opt-in for that very reason: privileged containers aren’t safe. We
even allow you to have per-container idmappings to make sure that not just each
container is isolated from the host but also all containers from each other.</p>
<p>For years we have been advocating for unprivileged containers at conferences,
in blogposts, and whenever we have spoken to people but somehow this whole
industry has chosen to rely on privileged containers.</p>
<p>The good news is that we are seeing changes as people become more familiar with
the perils of privileged containers. Let this recent CVE be another reminder
that unprivileged containers need to be the default.</p>
<h4 id="are-lxc-and-lxd-affected">Are LXC and LXD affected?</h4>
<p>I have seen this question asked all over the place so I guess I should add
a section about this too:</p>
<ul>
<li>
<p>Unprivileged <a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a> and <a href="https://github.com/lxc/lxc">LXD</a> containers are not affected.</p>
</li>
<li>
<p>Any privileged <a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a> and <a href="https://github.com/lxc/lxc">LXD</a> container running on a read-only rootfs
is not affected.</p>
</li>
<li>
<p>Privileged <a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a> containers in the definition provided above are affected,
though the attack is more difficult than for runC. The reason for this is
that the <code class="language-plaintext highlighter-rouge">lxc-attach</code> binary does not exit before the program in the
container has finished executing. This means an attacker would need to open
an <code class="language-plaintext highlighter-rouge">O_PATH</code> file descriptor to <code class="language-plaintext highlighter-rouge">/proc/self/exe</code>, <code class="language-plaintext highlighter-rouge">fork()</code> itself into the
background and re-open the <code class="language-plaintext highlighter-rouge">O_PATH</code> file descriptor through
<code class="language-plaintext highlighter-rouge">/proc/self/fd/<O_PATH-nr></code> in a loop as <code class="language-plaintext highlighter-rouge">O_WRONLY</code> and keep trying to write
to the binary until such time as <code class="language-plaintext highlighter-rouge">lxc-attach</code> exits. Before that it will not
succeed since the kernel will not allow modification of a running binary.</p>
</li>
<li>
<p>Privileged <a href="https://github.com/lxc/lxc">LXD</a> containers are only affected if the daemon is restarted
other than for upgrade reasons. This should basically never happen.
The <a href="https://github.com/lxc/lxc">LXD</a> daemon never exits so any write will fail because the kernel
does not allow modification of a running binary.
If the <a href="https://github.com/lxc/lxc">LXD</a> daemon is restarted because of an upgrade the binary will be
swapped out and the file descriptor used for the attack will write to the old
in-memory binary and not to the new binary.</p>
</li>
</ul>
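<p>The kernel behaviour that the points above rely on, namely that a binary cannot be opened for writing while it is being executed, can be observed harmlessly from a process looking at its own binary. The following is a minimal sketch. The function name <code class="language-plaintext highlighter-rouge">try_write_open_self()</code> is made up for this illustration and the code makes a single attempt against its own executable only.</p>

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Try to re-open the currently running binary for writing through an
 * O_PATH file descriptor.
 *
 * Returns -1 if the O_PATH descriptor could not be obtained, 0 if the
 * write open unexpectedly succeeded, and the errno of the failed write
 * open otherwise.
 */
int try_write_open_self(void)
{
	char path[64];
	int pathfd, wfd, err;

	/* O_PATH only pins the file, it grants no read or write access. */
	pathfd = open("/proc/self/exe", O_PATH | O_CLOEXEC);
	if (pathfd < 0)
		return -1;

	snprintf(path, sizeof(path), "/proc/self/fd/%d", pathfd);

	/* Re-opening through /proc/self/fd/<nr> asks for real write access. */
	wfd = open(path, O_WRONLY | O_CLOEXEC);
	err = (wfd < 0) ? errno : 0;

	if (wfd >= 0)
		close(wfd);
	close(pathfd);

	return err;
}
```

<p>On a regular Linux system I would expect the return value to be <code class="language-plaintext highlighter-rouge">ETXTBSY</code> as long as the binary is running, or <code class="language-plaintext highlighter-rouge">EACCES</code> if the caller has no write permission on the file in the first place.</p>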
<h4 id="chromebooks-with-crostini-using-lxd-are-not-affected">Chromebooks with Crostini using LXD are not affected</h4>
<p>Chromebooks, which use <a href="https://github.com/lxc/lxc">LXD</a> as their default container runtime, are not affected.
First of all, all binaries reside on a read-only filesystem and second,
<a href="https://github.com/lxc/lxc">LXD</a> does not allow running privileged containers on Chromebooks through
the <code class="language-plaintext highlighter-rouge">LXD_UNPRIVILEGED_ONLY</code> flag. For more details see this <a href="https://www.reddit.com/r/Crostini/comments/apkz8t/crostini_containers_likely_vulnerable_to/">link</a>.</p>
<h4 id="fixing-cve-2019-5736">Fixing CVE-2019-5736</h4>
<p>To prevent this attack, <a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a> has been patched to create a temporary copy of
the calling binary itself when it attaches to containers (cf.
<a href="https://github.com/lxc/lxc/commit/6400238d08cdf1ca20d49bafb85f4e224348bf9d">6400238d08cdf1ca20d49bafb85f4e224348bf9d</a>). To do this <a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a> can be
instructed to create an anonymous, in-memory file using the <code class="language-plaintext highlighter-rouge">memfd_create()</code>
system call and to copy itself into the temporary in-memory file, which is then
sealed to prevent further modifications. <a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a> then executes this sealed,
in-memory file instead of the original on-disk binary. Any compromising write
operations from a privileged container to the host <a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a> binary will then
write to the temporary in-memory binary and not to the host binary on-disk,
preserving the integrity of the host <a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a> binary. Also as the temporary,
in-memory <a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a> binary is sealed, writes to this will also fail. To not
break downstream users of the shared library this is opt-in by setting
<code class="language-plaintext highlighter-rouge">LXC_MEMFD_REXEC</code> in the environment. For our <code class="language-plaintext highlighter-rouge">lxc-attach</code> binary which is the
only attack vector this is now done by default.</p>
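<p>The approach described above can be sketched roughly as follows. This is not the code from the LXC commit but a simplified illustration under my own naming: <code class="language-plaintext highlighter-rouge">sealed_memfd_copy()</code> and <code class="language-plaintext highlighter-rouge">rexec_from_memfd()</code> are hypothetical helpers. It assumes a kernel and libc that provide <code class="language-plaintext highlighter-rouge">memfd_create()</code> and file sealing.</p>

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy the file at @path into an anonymous in-memory file and seal it so
 * that it can no longer be written to, grown, or shrunk.
 * Returns the sealed file descriptor on success, -1 on error.
 */
int sealed_memfd_copy(const char *path)
{
	char buf[65536];
	ssize_t nread;
	int src, memfd;

	src = open(path, O_RDONLY | O_CLOEXEC);
	if (src < 0)
		return -1;

	memfd = memfd_create("sealed-copy", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (memfd < 0) {
		close(src);
		return -1;
	}

	/* Plain read/write loop; a read interrupted by a signal is an error here. */
	while ((nread = read(src, buf, sizeof(buf))) > 0) {
		char *pos = buf;

		while (nread > 0) {
			ssize_t nwritten = write(memfd, pos, nread);

			if (nwritten < 0) {
				if (errno == EINTR)
					continue;
				goto on_error;
			}
			pos += nwritten;
			nread -= nwritten;
		}
	}
	if (nread < 0)
		goto on_error;

	/* After this no one, including us, can modify the copy or its seals. */
	if (fcntl(memfd, F_ADD_SEALS,
		  F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0)
		goto on_error;

	close(src);
	return memfd;

on_error:
	close(src);
	close(memfd);
	return -1;
}

/* Re-execute the running program from a sealed copy. Only returns on failure. */
int rexec_from_memfd(char *const argv[], char *const envp[])
{
	int memfd = sealed_memfd_copy("/proc/self/exe");

	if (memfd < 0)
		return -1;

	fexecve(memfd, argv, envp);
	close(memfd);
	return -1;
}
```

<p>A real implementation additionally needs a check that it is not already running from a sealed in-memory copy, otherwise it would re-execute itself forever. That check is left out of this sketch.</p>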
<p>Workloads that place the <a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a> binaries on a read-only filesystem or prevent
running privileged containers can disable this feature by passing
<code class="language-plaintext highlighter-rouge">--disable-memfd-rexec</code> during the <code class="language-plaintext highlighter-rouge">configure</code> stage when compiling <a href="https://seclists.org/oss-sec/2019/q1/119">LXC</a>.</p>Christian BraunerIntroduction (CVE-2019-5736)Video and Slides for FOSDEM, Brussels 2019: A Year of Container Kernel Work2019-02-05T00:00:00+01:002019-02-05T00:00:00+01:00https://brauner.io/2019/02/05/fosdem-container-kernel-work<p><a href="https://brauner.io/_img/2019_fosdem_container_kernel_work.pdf">Slides (pdf)</a></p>
<p><a href="https://brauner.io/_img/2019_fosdem_container_kernel_work.odp">Slides (odp)</a></p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/tjg398rxkyc" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<iframe width="560" height="315" src="https://gra.mirror.cyberbits.eu/fosdem/2019/UA2.114/containers_kernel_update.webm" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>Christian BraunerSlides (pdf)