Re: Apparmor move_mount mediation breaks mount tool in containers

From: John Johansen <john.johansen@canonical.com>
Date: 2023-12-06 03:16:27
Also in: regressions

On 12/5/23 18:18, Stéphane Graber wrote:

On Tue, Dec 5, 2023 at 2:55 PM John Johansen
[off-list ref] wrote:

quoted

On 12/4/23 22:57, Stéphane Graber wrote:

quoted

On Mon, Dec 4, 2023 at 9:20 AM Linux regression tracking (Thorsten
Leemhuis) [off-list ref] wrote:

quoted

On 04.12.23 14:14, John Johansen wrote:

quoted

On 12/3/23 17:34, Stéphane Graber wrote:

Side note: Stéphane, thx for CCing the regressions list.

quoted

On Sun, Dec 3, 2023 at 8:21 PM John Johansen
[off-list ref] wrote:

quoted

On 12/2/23 17:20, Stéphane Graber wrote:

quoted

Upstream commit 157a3537d6bc28ceb9a11fc8cb67f2152d860146 which just
landed in 6.6.3 stable as 96af45154a0be30485ad07f70f852b1456cb13d7 is
blocking new mounts for all LXC, LXD and Incus users (at least) on
distributions using the newer version of util-linux.

That's because for a simple mount like "mount -t tmpfs tmpfs /tmp",
the new mount command now performs:

fsconfig(3, FSCONFIG_SET_STRING, "source", "tmpfs", 0) = 0
fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0
fsmount(3, FSMOUNT_CLOEXEC, 0)          = 4
statx(4, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_MNT_ID,
{stx_mask=STATX_BASIC_STATS|STATX_MNT_ID,
stx_attributes=STATX_ATTR_MOUNT_ROOT, stx_mode=S_IFDIR|S_ISVTX|0777,
stx_size=40, ...}) = 0
move_mount(4, "", AT_FDCWD, "/tmp", MOVE_MOUNT_F_EMPTY_PATH) = 0

That last call to "move_mount" is incorrectly interpreted by AppArmor
as an attempt to move-mount "/" to "/mnt" rather than as a new mount
being created, this therefore results in:

Dec 03 01:05:03 kernel-test kernel: audit: type=1400
audit(1701565503.599:34): apparmor="DENIED" operation="mount"
class="mount" info="failed perms check" error=-13
profile="incus-a_</var/lib/incus>" name="/tmp/" pid=2190 comm="mount"
srcname="/" flags="rw, move"

Note that the flags here show "move", the fstype isn't even set and
the source path at srcname incorrectly shows "/".

This operation therefore trips any container manager which has an
apparmor security policy preventing arbitrary move-mount (as those
could be used to bypass other apparmor path based policies).


The way I see it, the current mediation support effectively breaks any
attempt at mediating mounts in a useful way in apparmor as it's now
impossible to mediate new mounts based on their fstype or even
distinguish them from a move-mount operation.

Indeed it is a far from good solution. It is a stop gap.

quoted

I don't know if this warrants pulling the mediation patch out of
stable (and out of linus' tree), obviously doing that would
reintroduce that hole in mount coverage, but at the same time, the
current coverage is broken enough that our only alternative is to
effectively allow all mounts, making the current mediation useless.

pulling it effectively means ALL applications by-pass mediation, the
alternative is to block all applications from using the move_mount
system call as part of mediation. Which might have been acceptable
as a stop gap when the syscall first landed but not now.

Pulling it from the stable branch may still make sense, you now have
folks who are updating to get actual bugfixes and end up with broken
containers, that doesn't exactly seem like a good outcome...

I will defer such a decision to the maintainers the stable trees. I
can see arguments either way.

FWIW, Greg usually does not revert a backported change if the change
causes the same problem to happen with mainline, as then it should be
fixed there as well. Which is normally the right thing to do, as Linus
wants people to be able to upgrade to new kernels without having them to
upgrade anything else (firmware, anything in userland incl. apparmor
policies).

So normally this whole issue afaics would be "157a3537d6bc28 needs to be
fixed in mainline (if nothing else helps with a revert), and that fix
then needs to be backported to the various stable trees".

It obviously is more complicated because that commit apparently fixes a
security vulnerability. But even under such circumstances Linus afaik
wants us to do everything possible to avoid breaking peoples workflows.
Which is why I wonder if there is a more elegant solution here
somewhere. I doubt that the answer is yes, but I'll ask the following
anyway: Would a kernel config option that distros could set when they
updated their Apparmor policies help here? Or something in sysfs?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

So there are a few different scenarios here:

1) Distribution shipping application specific profiles that were
developed back when the new VFS mount API wasn't around, those
profiles allow the specific mount calls made by the application. The
application has since moved on to using the VFS mount API and thanks
to the lack of apparmor coverage, things have kept on working fine.
Now when the distro pulls this bugfix, the application will now fail
to perform the mount. Apparmor returns EACCESS which doesn't cause a
fallback to the old mount API but instead a straight up error returned
to the user. ENOSYS would usually be handled properly.

I am not opposed to conditionally returning ENOSYS. I also have no doubt
that doing so would break fewer applications, it will still break some.
As I said I will look into doing that

Thanks, that would certainly be a much better stop gap measure until
such time as the mount API can properly be mediated by AppArmor.

quoted

2) All the container managers out there which use AppArmor as a safety
net to block mostly misbehaving or misconfigured software. Those
policies are not meant to be water tight to begin with (and due to the
nature of containers, they really can't be). But they do catch a lot
and while not an actual security barrier, they certainly prevent a lot
of accidents. Unfortunately with the new mediation in place, the only
real way to operate moving forward is to not mediate mounts at all as
mediation of the new VFS API leads to just about everything getting
denied and apparmor doesn't provide a way to separately mediate the
old and new mount APIs.

atm no, I am working on a conditional for detached mounts. I am not
convinced we should completely separate move_mount() mediation for
attached mounts.

But to use any of that you will need a new apparmor userspace to
support it. LXD is using the mounts as a behavioral/advisory catch, so
allowing an unmediated move_mount() is not the end of the world. Apparmor
still needs to be able to mediate mounts for applications that aren't
containers and do need a tighter barrier.

The reality with the new mount api is that atm we can only mediate
move_mount(), but the new mount api does add a twist with the whole
detached mounts. The only way to support that on older releases is
using a very generic mount rule.

New userspaces can pick up more. Leaving move_mount() unmediated even
for older releases really isn't an appealing option. New kernels
get pulled back and used on old releases all the time.

quoted

3) A system where apparmor is used to effectively prevent any mounting
whatsoever, where the policy is now being bypassed by simply using the
new VFS API. Such a system would also need to not be using Seccomp in
combination with Apparmor as blocking the new VFS API syscalls would
avoid the issue.

sure

quoted

I suspect 3) is what John is hinting at in this thread. I could
certainly see this bug be used to bypass the snapd mount restrictions
for example, though snapd also generates seccomp policies, so just
blocking the new VFS API would be a much simpler fix there.

I am concerned about more than the snapd case, yes snapd could and
should use seccomp, but apparmor should be able to stand on its own
for other cases.

quoted

I have no idea how common 1) is, but it looks like we're about to find
out soon and that has potential for some very "interesting" bug
reports as mount related errors are usually not the easiest to
understand and LSMs make them so much worse.

unfortunately that is always the case when fixing a mediation regression.
This one sat way too long that doesn't mean it shouldn't get fixed.

Overall though, I think lxd/incus is probably the major one. Most people
are running targeted policies and have left most of the stuff doing
mounts unconfined.

We did some research with Aleksa last night and that does seem to be
true for the profiles we could find.

However there is a bit of a catch that Docker, Kubernetes, ... pretty
much all the application container options do support using AppArmor
but don't provide advanced profiles, instead they put the burden on
the user to attach profiles to specific containers.

I have no idea how many people actually do that and since they are
individually crafted profiles by those users, we have no idea what may
be in there and whether they got broken by this change.

quoted

As for 2), as it stands, we're going to have to effectively turn off
any of the mount related safety nets we had in place in LXC, Incus and
likely most other container runtimes out there to handle this. The
usual issue with doing that kind of heavy hammer change is that once
it's done, it will take a long time to undo it. That is, once AppArmor
is actually capable of mediating the way it's supposed to, a lot of
the profiles will have been changed to just blanket allow all mounts
and they're only ever going to be changed once we're confident that
the vast majority of our users are running a kernel with the fixed
logic.

I am well aware of it. I fall on the side of leaving this completely
unmediated is slightly worse, but I can understand why someone
would choose to leave this unmediated.

quoted

I guess some of that could be handled better if there was a way to
detect the current broken mediation of the new mount API and then
later again, detect that the kernel now has working mediation,
allowing for those container managers to generate accurate profile
rather than have to go for the "safest" option (as far as keeping
users running) and just keep allowing everything.

John, is there any such detection mechanism currently present that
could be used by userspace to better handle this situation?

sadly not for the move_mount() patch, it is however a one line patch

diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
index 7465c39ad1bc..6cd52767cfeb 100644
--- a/security/apparmor/apparmorfs.c
+++ b/security/apparmor/apparmorfs.c

@@ -2373,6 +2373,7 @@ static struct aa_sfs_entry aa_sfs_entry_policy[] = {

   static struct aa_sfs_entry aa_sfs_entry_mount[] = {
         AA_SFS_FILE_STRING("mask", "mount umount pivot_root"),
+       AA_SFS_FILE_STRING(move_mount, "detached"),
         { }
   };

or similar. That I could send out today.

I think that'd be useful to have as something that can be used to
detect this behavior from userspace and have the affected code turn
off mount mediation if that's the case.

quoted

it would surfaced the via the regular securityfs/apparmor interface as
/sys/kernel/security/apparmor/features/mount/move_mount

quoted

I could also get you a simple conditional patch around returning ENOSYS
to start testing, so that we could potentially try sending a final version
of that patch up next week.

That'd be good. I feel that if this change had come with the ENOSYS
behavior, I wouldn't have noticed it at all.
With the new mount API being still pretty fresh, I'm not aware of any
userspace which today relies exclusively on it, so having it be
effectively disabled until such time as you can mediate it properly
shouldn't really break anyone (always hard to know, but sure would
break a lot less cases than the current change).

One catch though from my testing with seccomp, to make this work, you
need to have fsmount return ENOSYS, if you only have move_mount return
ENOSYS, it's too late and libmount just returns the failure to the
user rather than go down the old mount API code path.

ugh, that is a problem. we don't have a hook in fsmount(), the closest
we get is the dentry_open() hook, but I don't see how a call to that
could reasonably detect that its coming from fsmount.

I am not opposed to a new hook at all, and I will throw something together
but I don't see that going into 6.7

quoted

Overall, this still feels very rushed and broken to me. I understand
that being able to trivially bypass AppArmor mount rules isn't great,
but shipping code that makes the vast majority of said rules useless
doesn't really feel like such an improvement. It's effectively turning
what was a reasonably flexible policy engine around mount calls, into
a binary switch to either allow everything or block everything.

I get what you are saying. The other side of the coin is oh look apparmor
isn't mediating mount if I just do this ...

Sure, but I suspect we could have avoided a bunch of pain if we had a
chat around this change, had an opportunity to test it ahead of time.
At the very least, we'd have had the flag file introduced from the
start and hopefully some better handling like some kind of ENOSYS
option.

Sadly, yes I would have loved more testing, and I should have thought to
to poke you directly. I deliberately delayed sending this up during last
cycle when the report to kernel.org came in. I kicked a kernel with it over
to lxd (while still treating the issue as being in embargo. sigh, ...), and
then put it into the ubuntu proposed kernels and apparmor-next.

heard plenty on the user namespace mediation, but nothing of significance
around move_mount.

quoted

We at the very least need a way to check whether we're dealing with
this known broken state so that any change that's made to userspace
can be made condition on this, ensuring that once mediation actually
works as intended, those policies also go back to the way they're
supposed to be.

agreed. That really should have been part of the patch to begin with.

Stéphane

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help