Re: [PATCH v2 14/39] commoncap: handle idmapped mounts
From: Christian Brauner <hidden>
Date: 2020-11-23 07:45:14
Also in:
linux-api, linux-ext4, linux-fsdevel, linux-integrity, selinux
On Sun, Nov 22, 2020 at 04:18:55PM -0500, Paul Moore wrote:
On Sun, Nov 15, 2020 at 5:39 AM Christian Brauner [off-list ref] wrote:quoted
When interacting with user namespace and non-user namespace aware filesystem capabilities the vfs will perform various security checks to determine whether or not the filesystem capabilities can be used by the caller (e.g. during exec), or even whether they need to be removed. The main infrastructure for this resides in the capability codepaths but they are called through the LSM security infrastructure even though they are not technically an LSM or optional. This extends the existing security hooks security_inode_removexattr(), security_inode_killpriv(), security_inode_getsecurity() to pass down the mount's user namespace and makes them aware of idmapped mounts. In order to actually get filesystem capabilities from disk the capability infrastructure exposes the get_vfs_caps_from_disk() helper. For user namespace aware filesystem capabilities a root uid is stored alongside the capabilities. In order to determine whether the caller can make use of the filesystem capability or whether it needs to be ignored it is translated according to the superblock's user namespace. If it can be translated to uid 0 according to that id mapping the caller can use the filesystem capabilities stored on disk. If we are accessing the inode that holds the filesystem capabilities through an idmapped mount we need to map the root uid according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. Reading filesystem caps from disk enforces that the root uid associated with the filesystem capability must have a mapping in the superblock's user namespace and that the caller is either in the same user namespace or is a descendant of the superblock's user namespace. For filesystems that are mountable inside user namespace the container can just mount the filesystem and won't usually need to idmap it. If it does create an idmapped mount it can mark it with a user namespace it has created and which is therefore a descendant of the s_user_ns. For filesystems that are not mountable inside user namespaces the descendant rule is trivially true because the s_user_ns will be the initial user namespace. If the initial user namespace is passed all operations are a nop so non-idmapped mounts will not see a change in behavior and will also not see any performance impact. Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <redacted>...quoted
diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 8dba8f0983b5..ddb9213a3e81 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c@@ -1944,7 +1944,7 @@ static inline int audit_copy_fcaps(struct audit_names *name, if (!dentry) return 0; - rc = get_vfs_caps_from_disk(dentry, &caps); + rc = get_vfs_caps_from_disk(&init_user_ns, dentry, &caps); if (rc) return rc;@@ -2495,7 +2495,8 @@ int __audit_log_bprm_fcaps(struct linux_binprm *bprm, ax->d.next = context->aux; context->aux = (void *)ax; - get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps); + get_vfs_caps_from_disk(mnt_user_ns(bprm->file->f_path.mnt), + bprm->file->f_path.dentry, &vcaps);As audit currently records information in the context of the initial/host namespace I'm guessing we don't want the mnt_user_ns() call above; it seems like &init_user_ns would be the right choice (similar to audit_copy_fcaps()), yes?
Ok, sounds good. It also makes the patchset simpler. Note that I'm currently not on the audit mailing list so this is likely not going to show up there. (Fwiw, I responded to you in your other mail too.) Christian