Re: [RFC PATCH] lsm: add the inode_free_security_rcu() LSM implementation hook

From: Matus Jokay <hidden>
Date: 2024-07-24 10:20:38
Also in: linux-fsdevel, selinux

On 23. 7. 2024 21:48, Paul Moore wrote:

On Tue, Jul 23, 2024 at 5:27 AM Matus Jokay [off-list ref] wrote:

quoted

On 22. 7. 2024 21:46, Paul Moore wrote:

quoted

On Mon, Jul 22, 2024 at 8:30 AM Matus Jokay [off-list ref] wrote:

quoted

On 10. 7. 2024 12:40, Mickaël Salaün wrote:

quoted

On Tue, Jul 09, 2024 at 10:40:30PM -0400, Paul Moore wrote:

quoted

The LSM framework has an existing inode_free_security() hook which
is used by LSMs that manage state associated with an inode, but
due to the use of RCU to protect the inode, special care must be
taken to ensure that the LSMs do not fully release the inode state
until it is safe from a RCU perspective.

This patch implements a new inode_free_security_rcu() implementation
hook which is called when it is safe to free the LSM's internal inode
state.  Unfortunately, this new hook does not have access to the inode
itself as it may already be released, so the existing
inode_free_security() hook is retained for those LSMs which require
access to the inode.

Signed-off-by: Paul Moore <paul@paul-moore.com>

I like this new hook.  It is definitely safer than the current approach.

To make it more consistent, I think we should also rename
security_inode_free() to security_inode_put() to highlight the fact that
LSM implementations should not free potential pointers in this blob
because they could still be dereferenced in a path walk.

quoted

---
 include/linux/lsm_hook_defs.h     |  1 +
 security/integrity/ima/ima.h      |  2 +-
 security/integrity/ima/ima_iint.c | 20 ++++++++------------
 security/integrity/ima/ima_main.c |  2 +-
 security/landlock/fs.c            |  9 ++++++---
 security/security.c               | 26 +++++++++++++-------------
 6 files changed, 30 insertions(+), 30 deletions(-)

...

quoted

Sorry for the questions, but for several weeks I can't find answers to two things related to this RFC:

1) How does this patch close [1]?
   As Mickaël pointed in [2], "It looks like security_inode_free() is called two times on the same inode."
   Indeed, it does not seem from the backtrace that it is a case of race between destroy_inode and inode_permission,
   i.e. referencing the inode in a VFS path walk while destroying it...
   Please, can anyone tell me how this situation could have happened? Maybe folks from VFS... I added them to the copy.

The VFS folks can likely provide a better, or perhaps a more correct
answer, but my understanding is that during the path walk the inode is
protected by a RCU lock which allows for multiple threads to access
the inode simultaneously; this could result in some cases where one
thread is destroying the inode while another is accessing it.
Changing this would require changes to the VFS code, and I'm not sure
why you would want to change it anyway, the performance win of using
RCU here is likely significant.

quoted

2) Is there a guarantee that inode_free_by_rcu and i_callback will be called within the same RCU grace period?

I'm not an RCU expert, but I don't believe there are any guarantees
that the inode_free_by_rcu() and the inode's own free routines are
going to be called within the same RCU grace period (not really
applicable as inode_free_by_rcu() isn't called *during* a grace
period, but *after* the grace period of the associated
security_inode_free() call).  However, this patch does not rely on
synchronization between the inode and inode LSM free routine in
inode_free_by_rcu(); the inode_free_by_rcu() function and the new
inode_free_security_rcu() LSM callback does not have a pointer to the
inode, only the inode's LSM blob.  I agree that it is a bit odd, but
freeing the inode and inode's LSM blob independently of each other
should not cause a problem so long as the inode is no longer in use
(hence the RCU callbacks).

Paul, many thanks for your answer.

I will try to clarify the issue, because fsnotify was a bad example.
Here is the related code taken from v10.

void security_inode_free(struct inode *inode)
{
        call_void_hook(inode_free_security, inode);
        /*
         * The inode may still be referenced in a path walk and
         * a call to security_inode_permission() can be made
         * after inode_free_security() is called. Ideally, the VFS
         * wouldn't do this, but fixing that is a much harder
         * job. For now, simply free the i_security via RCU, and
         * leave the current inode->i_security pointer intact.
         * The inode will be freed after the RCU grace period too.
         */
        if (inode->i_security)
                call_rcu((struct rcu_head *)inode->i_security,
                         inode_free_by_rcu);
}

void __destroy_inode(struct inode *inode)
{
        BUG_ON(inode_has_buffers(inode));
        inode_detach_wb(inode);
        security_inode_free(inode);
        fsnotify_inode_delete(inode);
        locks_free_lock_context(inode);
        if (!inode->i_nlink) {
                WARN_ON(atomic_long_read(&inode->i_sb->s_remove_count) == 0);
                atomic_long_dec(&inode->i_sb->s_remove_count);
        }

#ifdef CONFIG_FS_POSIX_ACL
        if (inode->i_acl && !is_uncached_acl(inode->i_acl))
                posix_acl_release(inode->i_acl);
        if (inode->i_default_acl && !is_uncached_acl(inode->i_default_acl))
                posix_acl_release(inode->i_default_acl);
#endif
        this_cpu_dec(nr_inodes);
}

static void destroy_inode(struct inode *inode)
{
        const struct super_operations *ops = inode->i_sb->s_op;

        BUG_ON(!list_empty(&inode->i_lru));
        __destroy_inode(inode);
        if (ops->destroy_inode) {
                ops->destroy_inode(inode);
                if (!ops->free_inode)
                        return;
        }
        inode->free_inode = ops->free_inode;
        call_rcu(&inode->i_rcu, i_callback);
}

Yes, inode_free_by_rcu() is being called after the grace period of the associated
security_inode_free(). i_callback() is also called after the grace period, but is it
always the same grace period as in the case of inode_free_by_rcu()? If not in general,
maybe it could be a problem. Explanation below.

If there is a function call leading to the end of the grace period between
call_rcu(inode_free_by_rcu) and call_rcu(i_callback) (by reaching a CPU quiescent state
or another mechanism?), there will be a small time window, when the inode security
context is released, but the inode itself not, because call_rcu(i_callback) was not called
yet. So in that case each access to inode security blob leads to UAF.

While it should be possible for the inode's LSM blob to be free'd
prior to the inode itself, the RCU callback mechanism provided by
call_rcu() should ensure that both the LSM's free routine and the
inode's free routine happen at a point in time after the current RCU
critical sections have lapsed and the inode is no longer being

It is questionable whether the "current RCU CS" refers to both functions
together, see the diagram below.

accessed.  The LSM's inode_free_rcu callback can run independent of
the inode's callback as it doesn't access the inode and if it does

Agree, there are two independent calls as you described.

happen to run before the inode's RCU callback that should also be okay
as we are past the original RCU critical sections and the inode should
no longer be in use.  If the inode is still in use by the time the

I think the inode can be still in use in may_lookup() after the security
RCU callback function, see below.

LSM's RCU callback is triggered then there is a flaw in the inode
RCU/locking/synchronization code.

I don't think it is a flaw. It may be the use of the RCU mechanism with
incorrect assumption, that both RCU callbacks belong to the common GP.

It is also worth mentioning that while this patch shuffles around some
code at the LSM layer, the basic idea of the LSM using a RCU callback
to free state associated with an inode is far from new.  While that
doesn't mean there isn't a problem, we have a few years of experience
across a large number of systems, that would indicate this isn't a
problem.

I agree. But history only shows that it is very difficult to achieve this
race. And yes, I agree that we may address this issue when it turns out
to be relevant.

quoted

For example, see invoking ops->destroy_inode() after call_rcu(inode_free_by_rcu) but
*before* call_rcu(i_callback). If destroy_inode() may sleep, can be reached end of the
grace period? destroy_inode() is *before* call_rcu(i_callback), therefore simultaneous
access to the inode during path lookup may be performed. Note: I use destroy_inode() as
*an example* of the idea. I'm not expert at all in fsnotify, posix ACL, VFS in general
and RCU, too.

In the previous message I only mentioned fsnotify, but it was only as an example.
I think that destroy_inode() is a better example of the idea I wanted to express.

I repeat that I'm aware that this RFC does not aim to solve this issue. But it can be
unpleasant to get another UAF in a production environment.

I'm open to there being another fix needed, or a change to this fix,
but I don't believe the problems you are describing are possible.  Of
course it's very possible that I'm wrong, so if you are aware of an
issue I would appreciate a concrete example explaining the code path
and timeline between tasks A and B that would trigger the flaw ... and
yes, patches are always welcome ;)

Oh patches... Even from the message from Dave it can be seen that the
cooperation of people from VFS and some very good idea of a solution
are needed. Of course, provided that the scheme below was correct.
I would be very happy if someone could explain to me why this cannot be
so!

CPU related to
RCU callbacks            task A                                 task B
==================       =================================      =======================
                         ...
                         __destroy_inode()
                            ...
                            security_inode_free()
                               ...
                               call_rcu(inode_free_by_rcu)
                         ...
                         ops->destroy_inode() // *suppose* may sleep
// end of GP;
// inode *can* be used as
// i_callback does not
// belong to this GP
inode_free_by_rcu()
------------------------------------------------------------------------------------------------------
// start of another GP                                          ...
                         ...                                    rcu_read_lock()
                         call_rcu(i_callback)                   ...
                         ...                                    security_inode_permission() // <-- UAF
                                                                ...
                                                                rcu_read_unlock()
                                                                ...
// end of GP;
// right now inode is not in use anymore
i_callback()

Why is it difficult to achieve this race? The GP (grace period) between
two call_rcu() calls must come to an end. Again, I chose as an example
of this situation destroy_inode() function. But there can be others in
the code path, I really don't know.

I looked at the destroy_inode() functions (kernel v10) and from a quick
look I found overlayfs, which directly calls dput(), see [1].
If it is possible to force printk() to sleep, then it is possible to
consider afs [2] and, under certain circumstances, ext4 [3].
After a deeper analysis, maybe even more.

I think it is difficult to divide the GP exactly as this situation
requires. That's probably why this hasn't appeared yet. Or it is
impossible to achieve it. All the better. But that would require an audit
of the code between our two call_rcu()'s, whether at some point it cannot
come to the end of the GP.

And I agree with the opinion that as long as this type of error has not
yet occurred, we can just play possum. When it comes to the crunch, we
can deal with it more deeply.

[1] https://elixir.bootlin.com/linux/v6.10/source/fs/overlayfs/super.c#L181
[2] https://elixir.bootlin.com/linux/v6.10/source/fs/afs/super.c#L718
[3] https://elixir.bootlin.com/linux/v6.10/source/fs/ext4/super.c#L1448

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help