Thread (53 messages) 53 messages, 11 authors, 2024-11-25

Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program

From: Christian Brauner <brauner@kernel.org>
Date: 2024-11-20 09:28:32
Also in: bpf, linux-fsdevel, lkml
Subsystem: filesystems (vfs and infrastructure), the rest · Maintainers: Alexander Viro, Christian Brauner, Linus Torvalds

On Tue, Nov 19, 2024 at 09:53:20PM +0000, Song Liu wrote:
Hi Jeff and Amir, 

Thanks for your inputs!
quoted
On Nov 19, 2024, at 7:30 AM, Amir Goldstein [off-list ref] wrote:

On Tue, Nov 19, 2024 at 4:25 PM Amir Goldstein [off-list ref] wrote:
quoted
On Tue, Nov 19, 2024 at 3:21 PM Jeff Layton [off-list ref] wrote:
quoted
[...]
quoted
quoted
quoted
Longer term, I think it may be beneficial to come up with a way to attach
quoted
quoted
private info to the inode in a way that doesn't cost us one pointer per
funcionality that may possibly attach info to the inode. We already have
i_crypt_info, i_verity_info, i_flctx, i_security, etc. It's always a tough
call where the space overhead for everybody is worth the runtime &
complexity overhead for users using the functionality...
It does seem to be the right long term solution, and I am willing to
work on it. However, I would really appreciate some positive feedback
on the idea, so that I have better confidence my weeks of work has a
better chance to worth it.

Thanks,
Song

[1] https://github.com/systemd/systemd/blob/main/src/core/bpf/restrict_fs/restrict-fs.bpf.c
fsnotify is somewhat similar to file locking in that few inodes on the
machine actually utilize these fields.

For file locking, we allocate and populate the inode->i_flctx field on
an as-needed basis. The kernel then hangs on to that struct until the
inode is freed.
If we have some universal on-demand per-inode memory allocator, 
I guess we can move i_flctx to it?
quoted
quoted
quoted
We could do something similar here. We have this now:

#ifdef CONFIG_FSNOTIFY
       __u32                   i_fsnotify_mask; /* all events this inode cares about */
       /* 32-bit hole reserved for expanding i_fsnotify_mask */
       struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
#endif
And maybe some fsnotify fields too?

With a couple users, I think it justifies to have some universal
on-demond allocator. 
quoted
quoted
quoted
What if you were to turn these fields into a pointer to a new struct:

       struct fsnotify_inode_context {
               struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
               struct bpf_local_storage __rcu          *i_bpf_storage;
               __u32                                   i_fsnotify_mask; /* all events this inode cares about */
       };
The extra indirection is going to hurt for i_fsnotify_mask
it is being accessed frequently in fsnotify hooks, so I wouldn't move it
into a container, but it could be moved to the hole after i_state.
quoted
quoted
quoted
Then whenever you have to populate any of these fields, you just
allocate one of these structs and set the inode up to point to it.
They're tiny too, so don't bother freeing it until the inode is
deallocated.

It'd mean rejiggering a fair bit of fsnotify code, but it would give
the fsnotify code an easier way to expand per-inode info in the future.
It would also slightly shrink struct inode too.
I am hoping to make i_bpf_storage available to tracing programs. 
Therefore, I would rather not limit it to fsnotify context. We can
still use the universal on-demand allocator.
Can't we just do something like:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7e29433c5ecc..cc05a5485365 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -627,6 +627,12 @@ is_uncached_acl(struct posix_acl *acl)
 #define IOP_DEFAULT_READLINK   0x0010
 #define IOP_MGTIME     0x0020

+struct inode_addons {
+        struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
+        struct bpf_local_storage __rcu          *i_bpf_storage;
+        __u32                                   i_fsnotify_mask; /* all events this inode cares about */
+};
+
 /*
  * Keep mostly read-only and often accessed (especially for
  * the RCU path lookup and 'stat' data) fields at the beginning
@@ -731,12 +737,7 @@ struct inode {
                unsigned                i_dir_seq;
        };

-
-#ifdef CONFIG_FSNOTIFY
-       __u32                   i_fsnotify_mask; /* all events this inode cares about */
-       /* 32-bit hole reserved for expanding i_fsnotify_mask */
-       struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
-#endif
+       struct inode_addons *i_addons;

 #ifdef CONFIG_FS_ENCRYPTION
        struct fscrypt_inode_info       *i_crypt_info;
Then when either fsnotify or bpf needs that storage they can do a
cmpxchg() based allocation for struct inode_addons just like I did with
f_owner:

int file_f_owner_allocate(struct file *file)
{
	struct fown_struct *f_owner;

	f_owner = file_f_owner(file);
	if (f_owner)
		return 0;

	f_owner = kzalloc(sizeof(struct fown_struct), GFP_KERNEL);
	if (!f_owner)
		return -ENOMEM;

	rwlock_init(&f_owner->lock);
	f_owner->file = file;
	/* If someone else raced us, drop our allocation. */
	if (unlikely(cmpxchg(&file->f_owner, NULL, f_owner)))
		kfree(f_owner);
	return 0;
}

The internal allocations for specific fields are up to the subsystem
ofc. Does that make sense?
quoted
quoted
quoted
       struct fsnotify_inode_context {
               struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
               struct bpf_local_storage __rcu          *i_bpf_storage;
               __u32                                   i_fsnotify_mask; /* all events this inode cares about */
       };
quoted
quoted
This was already done for s_fsnotify_marks, so you can follow the recipe
of 07a3b8d0bf72 ("fsnotify: lazy attach fsnotify_sb_info state to sb")
and create an fsnotify_inode_info container.
On second thought, fsnotify_sb_info container is allocated and attached
in the context of userspace adding a mark.

If you will need allocate and attach fsnotify_inode_info in the content of
fast path fanotify hook in order to add the inode to the map, I don't
think that is going to fly??
Do you mean we may not be able to allocate memory in the fast path 
hook? AFAICT, the fast path is still in the process context, so I 
think this is not a problem?

Thanks,
Song 
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help