Re: [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"

From: Michael Kerrisk (man-pages) <hidden>
Date: 2021-08-19 00:23:02
Also in: linux-fsdevel, lkml

Hello Eric,

Thank you for your response.

On 8/17/21 5:51 PM, Eric W. Biederman wrote:

"Michael Kerrisk (man-pages)" [off-list ref] writes:

Hi Eric,

Thanks for your feedback!

On 8/16/21 6:03 PM, Eric W. Biederman wrote:

Michael Kerrisk [off-list ref] writes:

For a long time, this manual page has had a brief discussion of
"locked" mounts, without clearly saying what this concept is, or
why it exists. Expand the discussion with an explanation of what
locked mounts are, why mounts are locked, and some examples of the
effect of locking.

Thanks to Christian Brauner for a lot of help in understanding
these details.

Reported-by: Christian Brauner <redacted>
Signed-off-by: Michael Kerrisk <redacted>
---

Hello Eric and others,

After some quite helpful info from Christian Brauner, I've expanded
the discussion of locked mounts (a concept I didn't really have a
good grasp on) in the mount_namespaces(7) manual page. I would be
grateful to receive review comments, acks, etc., on the patch below.
Could you take a look please?

Cheers,

Michael

 man7/mount_namespaces.7 | 73 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7
index e3468bdb7..97427c9ea 100644
--- a/man7/mount_namespaces.7
+++ b/man7/mount_namespaces.7

@@ -107,6 +107,62 @@ operation brings across all of the mounts from the original
 mount namespace as a single unit,
 and recursive mounts that propagate between
 mount namespaces propagate as a single unit.)
+.IP
+In this context, "may not be separated" means that the mounts
+are locked so that they may not be individually unmounted.
+Consider the following example:
+.IP
+.RS
+.in +4n
+.EX
+$ \fBsudo mkdir /mnt/dir\fP
+$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
+$ \fBsudo mount \-\-bind \-o ro /some/path /mnt/dir\fP
+$ \fBls /mnt/dir\fP   # Former contents of directory are invisible

Do we want a more motivating example such as a /proc/sys?

It has been common to mount over /proc files and directories that can be
written to by the global root so that users in a mount namespace may not
touch them.

Seems reasonable. But I want to check one thing. Can you please
define "global root". I'm pretty sure I know what you mean, but
I'd like to know your definition.

I mean uid 0 in the initial user namespace.

(Good. That's what I thought you meant. So far, that term is not 
described in the manual pages. I just now added a definition of the 
term to user_namespaces(7).)

This uid owns most of the files in /proc.

Container systems that don't want to use user namespaces frequently
mount over files in proc to prevent using some of the root privileges
that come simply by having uid 0.

Another use is mounting over files on virtual filesystems like proc
to reduce the attack surface.

Thanks for the background. I think for the moment I will go with 
Christian's alternative suggestion (an example using /etc/shadow).

For reducing what the root user in a container can do, I think the
better approach is to use user namespaces and a uid other than 0 in
the initial user namespace.

+.EE
+.in
+.RE
+.IP
+The above steps, performed in a more privileged user namespace,
+have created a (read-only) bind mount that
+obscures the contents of the directory
+.IR /mnt/dir .
+For security reasons, it should not be possible to unmount
+that mount in a less privileged user namespace,
+since that would reveal the contents of the directory
+.IR /mnt/dir .

+.IP
+Suppose we now create a new mount namespace
+owned by a (new) subordinate user namespace.
+The new mount namespace will inherit copies of all of the mounts
+from the previous mount namespace.
+However, those mounts will be locked because the new mount namespace
+is owned by a less privileged user namespace.
+Consequently, an attempt to unmount the mount fails:
+.IP
+.RS
+.in +4n
+.EX
+$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
+               \fBstrace \-o /tmp/log \e\fP
+               \fBumount /mnt/dir\fP
+umount: /mnt/dir: not mounted.
+$ \fBgrep \(aq^umount\(aq /tmp/log\fP
+umount2("/mnt/dir", 0)     = \-1 EINVAL (Invalid argument)
+.EE
+.in
+.RE
+.IP
+The error message from
+.BR umount (8)
+is a little confusing, but the
+.BR strace (1)
+output reveals that the underlying
+.BR umount2 (2)
+system call failed with the error
+.BR EINVAL ,
+which is the error that the kernel returns to indicate that
+the mount is locked.
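
Since the command's own message hides the real error, it can be handy to pull the errno straight out of the strace log. The following is only a minimal sketch: the helper name failed_umounts is made up for illustration, and it assumes the log lines look like the umount2() line shown above. Note also that, per umount(2), EINVAL is returned when the target is not a mount point as well, so the errno alone does not prove that a mount is locked.

```sh
#!/bin/sh
# Hypothetical helper (not part of any tool): print "path errno" for
# every failed umount2() call found in an strace log on standard input.
# Successful calls (no "= -1") produce no output.
failed_umounts() {
    sed -n 's/^umount2("\([^"]*\)".*= -1 \([A-Z]*\) .*/\1 \2/p'
}

# Demo input: the log line from the example above.
printf '%s\n' 'umount2("/mnt/dir", 0)     = -1 EINVAL (Invalid argument)' |
    failed_umounts
```

With a real log, the call would be something like "failed_umounts < /tmp/log".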

Do you want to mention that you can unmount the entire subtree?  Either
with pivot_root if it is locked to "/" or with
"umount -l /path/to/propagated/directory".

Yes, I wondered about that, but hadn't got round to devising 
the scenario. How about this:

[[
       *  Following on from the previous point, note that it is possible
          to unmount an entire tree of mounts that propagated as a unit

                                 ^^^^^ subtree?

Yes, probably better, to prevent misunderstandings. Changed (and in a few
other places also).

          into a mount namespace that is owned by a less privileged user
          namespace, as illustrated in the following example.

          First, we create new user and mount namespaces using
          unshare(1).  In the new mount namespace, the propagation type
          of all mounts is set to private.  We then create a shared bind
          mount at /mnt, and a small hierarchy of mount points underneath
          that mount point.

              $ PS1='ns1# ' sudo unshare --user --map-root-user \
                                     --mount --propagation private bash
              ns1# echo $$        # We need the PID of this shell later
              778501
              ns1# mount --make-shared --bind /mnt /mnt
              ns1# mkdir /mnt/x
              ns1# mount --make-private -t tmpfs none /mnt/x
              ns1# mkdir /mnt/x/y
              ns1# mount --make-private -t tmpfs none /mnt/x/y
              ns1# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
              986 83 8:5 /mnt /mnt rw,relatime shared:344
              989 986 0:56 / /mnt/x rw,relatime
              990 989 0:57 / /mnt/x/y rw,relatime

          Continuing in the same shell session, we then create a second
          shell in a new mount namespace and a new subordinate (and thus
          less privileged) user namespace and check the state of the
          propagated mount points rooted at /mnt.

              ns1# PS1='ns2# ' unshare --user --map-root-user \
                                     --mount --propagation unchanged bash
              ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
              1239 1204 8:5 /mnt /mnt rw,relatime master:344
              1240 1239 0:56 / /mnt/x rw,relatime
              1241 1240 0:57 / /mnt/x/y rw,relatime

          Of note in the above output is that the propagation type of the
          mount point /mnt has been reduced to slave, as explained near
          the start of this subsection.  This means that submount events
          will propagate from the master /mnt in "ns1", but propagation
          will not occur in the opposite direction.

          From a separate terminal window, we then use nsenter(1) to
          enter the mount and user namespaces corresponding to "ns1".  In
          that terminal window, we then recursively bind mount /mnt/x at
          the location /mnt/ppp.

              $ PS1='ns3# ' sudo nsenter -t 778501 --user --mount
              ns3# mount --rbind --make-private /mnt/x /mnt/ppp
              ns3# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
              986 83 8:5 /mnt /mnt rw,relatime shared:344
              989 986 0:56 / /mnt/x rw,relatime
              990 989 0:57 / /mnt/x/y rw,relatime
              1242 986 0:56 / /mnt/ppp rw,relatime
              1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518

          Because the propagation type of the parent mount, /mnt, was
          shared, the recursive bind mount propagated a small tree of
          mounts under the slave mount /mnt into "ns2", as can be
          verified by executing the following command in that shell
          session:

              ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
              1239 1204 8:5 /mnt /mnt rw,relatime master:344
              1240 1239 0:56 / /mnt/x rw,relatime
              1241 1240 0:57 / /mnt/x/y rw,relatime
              1244 1239 0:56 / /mnt/ppp rw,relatime
              1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518

          While it is not possible to unmount a part of that propagated
          subtree (/mnt/ppp/y), it is possible to unmount the entire
          tree, as shown by the following commands:

              ns2# umount /mnt/ppp/y
              umount: /mnt/ppp/y: not mounted.
              ns2# umount -l /mnt/ppp      # Succeeds...
              ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
              1239 1204 8:5 /mnt /mnt rw,relatime master:344
              1240 1239 0:56 / /mnt/x rw,relatime
              1241 1240 0:57 / /mnt/x/y rw,relatime
]]

?

Yes.

It is worth noting that in ns2 it is also possible to mount on top of
/mnt/ppp/y and umount from /mnt/ppp/y.

Yes, good point. I've added some text, and an example for that case.
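
For anyone repeating the ns1/ns2 experiment quoted above, it may help to label each mount's propagation type instead of reading the shared:/master: tags by eye. What follows is a rough sketch, not something from the patch: the function name propagation_types is invented for illustration, and it assumes the field layout described in proc(5), where the optional fields start at field 7 and run up to a "-" separator (or to the end of the line when the sed filter used above has already cut the rest off).

```sh
#!/bin/sh
# Hypothetical helper: label the propagation type of each mount whose
# mount point starts with the given prefix. Input is mountinfo-format
# lines on standard input. Fields 7 and up, until an optional "-"
# separator, are the optional fields (shared:N, master:N,
# propagate_from:N, unbindable).
propagation_types() {
    awk -v prefix="$1" '
    index($5, prefix) == 1 {
        shared = 0; slave = 0; unbindable = 0
        for (i = 7; i <= NF && $i != "-"; i++) {
            if ($i ~ /^shared:/)         shared = 1
            else if ($i ~ /^master:/)    slave = 1
            else if ($i == "unbindable") unbindable = 1
        }
        if (shared && slave) type = "shared+slave"
        else if (shared)     type = "shared"
        else if (slave)      type = "slave"
        else if (unbindable) type = "unbindable"
        else                 type = "private"
        print $5, type
    }'
}

# Demo input: two lines copied from the "ns2" listing above.
propagation_types /mnt <<'EOF'
1239 1204 8:5 /mnt /mnt rw,relatime master:344
1240 1239 0:56 / /mnt/x rw,relatime
EOF
```

On a live system the input would come from /proc/self/mountinfo, for example "propagation_types /mnt < /proc/self/mountinfo". Like the grep in the listings, the prefix test is a plain string match, so /mnt would also match a mount point such as /mnt2.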

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
