Thread (14 messages) 14 messages, 6 authors, 2024-12-02

Re: [PATCH v3 fanotify 1/2] fanotify: Introduce fanotify filter

From: Amir Goldstein <amir73il@gmail.com>
Date: 2024-11-24 06:25:34
Also in: bpf, linux-fsdevel, lkml

On Sat, Nov 23, 2024 at 12:00 AM Song Liu [off-list ref] wrote:
fanotify filter enables handling fanotify events within the kernel, and
thus saves a trip to the user space. fanotify filter can be useful in
many use cases. For example, if a user is only interested in events for
some files in side a directory, a filter can be used to filter out
irrelevant events.

fanotify filter is attached to fsnotify_group. At most one filter can
be attached to a fsnotify_group. The attach/detach of filter are
controlled by two new ioctls on the fanotify fds: FAN_IOC_ADD_FILTER
and FAN_IOC_DEL_FILTER.

fanotify filter is packaged in a kernel module. In the future, it is
also possible to package fanotify filter in a BPF program. Since loading
modules requires CAP_SYS_ADMIN, _loading_ fanotify filter in kernel
modules is limited to CAP_SYS_ADMIN. However, non-SYS_CAP_ADMIN users
can _attach_ filter loaded by sys admin to their fanotify fds. The owner
of the fanotify fitler can use flag FAN_FILTER_F_SYS_ADMIN_ONLY to
make a filter available only to users with CAP_SYS_ADMIN.

To make fanotify filters more flexible, a filter can take arguments at
attach time.

sysfs entry /sys/kernel/fanotify_filter is added to help users know
which fanotify filters are available. At the moment, these files are
added for each filter: flags, desc, and init_args.
It's a shame that we have fanotify knobs at /proc/sys/fs/fanotify/ and
in sysfs, but understand we don't want to make more use of proc for this.

Still I would add the filter files under a new /sys/fs/fanotify/ dir and not
directly under /sys/kernel/
quoted hunk ↗ jump to hunk
Signed-off-by: Song Liu <song@kernel.org>
---
 fs/notify/fanotify/Kconfig           |  13 ++
 fs/notify/fanotify/Makefile          |   1 +
 fs/notify/fanotify/fanotify.c        |  44 +++-
 fs/notify/fanotify/fanotify_filter.c | 289 +++++++++++++++++++++++++++
 fs/notify/fanotify/fanotify_user.c   |   7 +
 include/linux/fanotify.h             | 128 ++++++++++++
 include/linux/fsnotify_backend.h     |   6 +-
 include/uapi/linux/fanotify.h        |  36 ++++
 8 files changed, 520 insertions(+), 4 deletions(-)
 create mode 100644 fs/notify/fanotify/fanotify_filter.c
diff --git a/fs/notify/fanotify/Kconfig b/fs/notify/fanotify/Kconfig
index 0e36aaf379b7..abfd59d95f49 100644
--- a/fs/notify/fanotify/Kconfig
+++ b/fs/notify/fanotify/Kconfig
@@ -24,3 +24,16 @@ config FANOTIFY_ACCESS_PERMISSIONS
           hierarchical storage management systems.

           If unsure, say N.
+
+config FANOTIFY_FILTER
+       bool "fanotify in kernel filter"
+       depends on FANOTIFY
+       default y
+       help
+          Say Y here if you want to use fanotify in kernel filter.
+          The filter can be implemented in a kernel module or a
+          BPF program. The filter can speed up fanotify in many
+          use cases. For example, when the listener is only interested in
+          a subset of events.
+
+          If unsure, say Y.
\ No newline at end of file
diff --git a/fs/notify/fanotify/Makefile b/fs/notify/fanotify/Makefile
index 25ef222915e5..d95ec0aeffb5 100644
--- a/fs/notify/fanotify/Makefile
+++ b/fs/notify/fanotify/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-$(CONFIG_FANOTIFY)         += fanotify.o fanotify_user.o
+obj-$(CONFIG_FANOTIFY_FILTER)  += fanotify_filter.o
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 224bccaab4cc..c70184cd2d45 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -18,6 +18,8 @@

 #include "fanotify.h"

+extern struct srcu_struct fsnotify_mark_srcu;
+
 static bool fanotify_path_equal(const struct path *p1, const struct path *p2)
 {
        return p1->mnt == p2->mnt && p1->dentry == p2->dentry;
@@ -888,6 +890,7 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
        struct fsnotify_event *fsn_event;
        __kernel_fsid_t fsid = {};
        u32 match_mask = 0;
+       struct fanotify_filter_hook *filter_hook __maybe_unused;

        BUILD_BUG_ON(FAN_ACCESS != FS_ACCESS);
        BUILD_BUG_ON(FAN_MODIFY != FS_MODIFY);
@@ -921,6 +924,39 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
        pr_debug("%s: group=%p mask=%x report_mask=%x\n", __func__,
                 group, mask, match_mask);

+       if (FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS))
+               fsid = fanotify_get_fsid(iter_info);
+
+#ifdef CONFIG_FANOTIFY_FILTER
+       filter_hook = srcu_dereference(group->fanotify_data.filter_hook, &fsnotify_mark_srcu);
Do we actually need the sleeping rcu protection for calling the hook?
Can regular rcu read side be nested inside srcu read side?

Jan,

I don't remember why srcu is needed since we are not holding it
when waiting for userspace anymore?
quoted hunk ↗ jump to hunk
+       if (filter_hook) {
+               struct fanotify_filter_event filter_event = {
+                       .mask = mask,
+                       .data = data,
+                       .data_type = data_type,
+                       .dir = dir,
+                       .file_name = file_name,
+                       .fsid = &fsid,
+                       .match_mask = match_mask,
+               };
+
+               ret = filter_hook->ops->filter(group, filter_hook, &filter_event);
+
+               /*
+                * The filter may return
+                * - FAN_FILTER_RET_SEND_TO_USERSPACE => continue the rest;
+                * - FAN_FILTER_RET_SKIP_EVENT => return 0 now;
+                * - < 0 error => return error now.
+                *
+                * For the latter two cases, we can just return ret.
+                */
+               BUILD_BUG_ON(FAN_FILTER_RET_SKIP_EVENT != 0);
+
+               if (ret != FAN_FILTER_RET_SEND_TO_USERSPACE)
+                       return ret;
+       }
+#endif
+
        if (fanotify_is_perm_event(mask)) {
                /*
                 * fsnotify_prepare_user_wait() fails if we race with mark
@@ -930,9 +966,6 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
                        return 0;
        }

-       if (FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS))
-               fsid = fanotify_get_fsid(iter_info);
-
        event = fanotify_alloc_event(group, mask, data, data_type, dir,
                                     file_name, &fsid, match_mask);
        ret = -ENOMEM;
@@ -976,6 +1009,11 @@ static void fanotify_free_group_priv(struct fsnotify_group *group)

        if (mempool_initialized(&group->fanotify_data.error_events_pool))
                mempool_exit(&group->fanotify_data.error_events_pool);
+
+#ifdef CONFIG_FANOTIFY_FILTER
+       if (group->fanotify_data.filter_hook)
+               fanotify_filter_hook_free(group->fanotify_data.filter_hook);
+#endif
 }

 static void fanotify_free_path_event(struct fanotify_event *event)
diff --git a/fs/notify/fanotify/fanotify_filter.c b/fs/notify/fanotify/fanotify_filter.c
new file mode 100644
index 000000000000..9215612e2fcb
--- /dev/null
+++ b/fs/notify/fanotify/fanotify_filter.c
@@ -0,0 +1,289 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/fanotify.h>
+#include <linux/kobject.h>
+#include <linux/module.h>
+
+#include "fanotify.h"
+
+extern struct srcu_struct fsnotify_mark_srcu;
+
+static DEFINE_SPINLOCK(filter_list_lock);
+static LIST_HEAD(filter_list);
+
+static struct kobject *fan_filter_root_kobj;
+
+static struct {
+       enum fanotify_filter_flags flag;
+       const char *name;
+} fanotify_filter_flags_names[] = {
+       {
+               .flag = FAN_FILTER_F_SYS_ADMIN_ONLY,
+               .name = "SYS_ADMIN_ONLY",
+       }
+};
+
+static ssize_t flags_show(struct kobject *kobj,
+                         struct kobj_attribute *attr, char *buf)
+{
+       struct fanotify_filter_ops *ops;
+       ssize_t len = 0;
+       int i;
+
+       ops = container_of(kobj, struct fanotify_filter_ops, kobj);
+       for (i = 0; i < ARRAY_SIZE(fanotify_filter_flags_names); i++) {
+               if (ops->flags & fanotify_filter_flags_names[i].flag) {
+                       len += sysfs_emit_at(buf, len, "%s%s", len ? " " : "",
+                                            fanotify_filter_flags_names[i].name);
+               }
+       }
+       len += sysfs_emit_at(buf, len, "\n");
+       return len;
+}
+
+static ssize_t desc_show(struct kobject *kobj,
+                        struct kobj_attribute *attr, char *buf)
+{
+       struct fanotify_filter_ops *ops;
+
+       ops = container_of(kobj, struct fanotify_filter_ops, kobj);
+
+       return sysfs_emit(buf, "%s\n", ops->desc ?: "N/A");
+}
+
+static ssize_t init_args_show(struct kobject *kobj,
+                             struct kobj_attribute *attr, char *buf)
+{
+       struct fanotify_filter_ops *ops;
+
+       ops = container_of(kobj, struct fanotify_filter_ops, kobj);
+
+       return sysfs_emit(buf, "%s\n", ops->init_args ?: "N/A");
+}
+
+static struct kobj_attribute flags_kobj_attr = __ATTR_RO(flags);
+static struct kobj_attribute desc_kobj_attr = __ATTR_RO(desc);
+static struct kobj_attribute init_args_kobj_attr = __ATTR_RO(init_args);
+
+static struct attribute *fan_filter_attrs[] = {
+       &flags_kobj_attr.attr,
+       &desc_kobj_attr.attr,
+       &init_args_kobj_attr.attr,
+       NULL,
+};
+ATTRIBUTE_GROUPS(fan_filter);
+
+static void fan_filter_kobj_release(struct kobject *kobj)
+{
+}
+
+static const struct kobj_type fan_filter_ktype = {
+       .release = fan_filter_kobj_release,
+       .sysfs_ops = &kobj_sysfs_ops,
+       .default_groups = fan_filter_groups,
+};
+
+static struct fanotify_filter_ops *fanotify_filter_find(const char *name)
+{
+       struct fanotify_filter_ops *ops;
+
+       list_for_each_entry(ops, &filter_list, list) {
+               if (!strcmp(ops->name, name))
+                       return ops;
+       }
+       return NULL;
+}
+
+static void __fanotify_filter_unregister(struct fanotify_filter_ops *ops)
+{
+       spin_lock(&filter_list_lock);
+       list_del_init(&ops->list);
+       spin_unlock(&filter_list_lock);
+}
+
+/*
+ * fanotify_filter_register - Register a new filter.
+ *
+ * Add a filter to the filter_list. These filter are
+ * available for all users in the system.
+ *
+ * @ops:       pointer to fanotify_filter_ops to add.
+ *
+ * Returns:
+ *     0       - on success;
+ *     -EEXIST - filter of the same name already exists.
+ *     -ENODEV - fanotify filter was not properly initialized.
+ */
+int fanotify_filter_register(struct fanotify_filter_ops *ops)
+{
+       int ret;
+
+       if (!fan_filter_root_kobj)
+               return -ENODEV;
+
+       spin_lock(&filter_list_lock);
+       if (fanotify_filter_find(ops->name)) {
+               /* cannot register two filters with the same name */
+               spin_unlock(&filter_list_lock);
+               return -EEXIST;
+       }
+       list_add_tail(&ops->list, &filter_list);
+       spin_unlock(&filter_list_lock);
+
+
+       kobject_init(&ops->kobj, &fan_filter_ktype);
+       ret = kobject_add(&ops->kobj, fan_filter_root_kobj, "%s", ops->name);
+       if (ret) {
+               __fanotify_filter_unregister(ops);
+               return ret;
+       }
+       return 0;
+}
+EXPORT_SYMBOL_GPL(fanotify_filter_register);
+
+/*
+ * fanotify_filter_unregister - Unregister a new filter.
+ *
+ * Remove a filter from filter_list.
+ *
+ * @ops:       pointer to fanotify_filter_ops to remove.
+ */
+void fanotify_filter_unregister(struct fanotify_filter_ops *ops)
+{
+       kobject_put(&ops->kobj);
+       __fanotify_filter_unregister(ops);
+}
+EXPORT_SYMBOL_GPL(fanotify_filter_unregister);
+
+/*
+ * fanotify_filter_add - Add a filter to fsnotify_group.
+ *
+ * Add a filter from filter_list to a fsnotify_group.
+ *
+ * @group:     fsnotify_group that will have add
+ * @argp:      fanotify_filter_args that specifies the filter
+ *             and the init arguments of the filter.
+ *
+ * Returns:
+ *     0       - on success;
+ *     -EEXIST - filter of the same name already exists.
+ */
+int fanotify_filter_add(struct fsnotify_group *group,
+                       struct fanotify_filter_args __user *argp)
+{
+       struct fanotify_filter_hook *filter_hook;
+       struct fanotify_filter_ops *filter_ops;
+       struct fanotify_filter_args args;
+       void *init_args = NULL;
+       int ret = 0;
+
+       ret = copy_from_user(&args, argp, sizeof(args));
+       if (ret)
+               return -EFAULT;
+
+       if (args.init_args_size > FAN_FILTER_ARGS_MAX)
+               return -EINVAL;
+
+       args.name[FAN_FILTER_NAME_MAX - 1] = '\0';
+
+       fsnotify_group_lock(group);
+
+       if (rcu_access_pointer(group->fanotify_data.filter_hook)) {
+               fsnotify_group_unlock(group);
+               return -EBUSY;
+       }
+
+       filter_hook = kzalloc(sizeof(*filter_hook), GFP_KERNEL);
+       if (!filter_hook) {
+               ret = -ENOMEM;
+               goto out;
+       }
+
+       spin_lock(&filter_list_lock);
+       filter_ops = fanotify_filter_find(args.name);
+       if (!filter_ops || !try_module_get(filter_ops->owner)) {
+               spin_unlock(&filter_list_lock);
+               ret = -ENOENT;
+               goto err_free_hook;
+       }
+       spin_unlock(&filter_list_lock);
+
+       if (!capable(CAP_SYS_ADMIN) && (filter_ops->flags & FAN_FILTER_F_SYS_ADMIN_ONLY)) {
1. feels better to opt-in for UNPRIV (and maybe later on) rather than
make it the default.
2. need to check that filter_ops->flags has only "known" flags
3. probably need to add filter_ops->version check in case we want to
change the ABI
+               ret = -EPERM;
+               goto err_module_put;
+       }
+
+       if (filter_ops->filter_init) {
+               if (args.init_args_size != filter_ops->init_args_size) {
+                       ret = -EINVAL;
+                       goto err_module_put;
+               }
+               if (args.init_args_size) {
+                       init_args = kzalloc(args.init_args_size, GFP_KERNEL);
+                       if (!init_args) {
+                               ret = -ENOMEM;
+                               goto err_module_put;
+                       }
+                       if (copy_from_user(init_args, (void __user *)args.init_args,
+                                          args.init_args_size)) {
+                               ret = -EFAULT;
+                               goto err_free_args;
+                       }
+
+               }
+               ret = filter_ops->filter_init(group, filter_hook, init_args);
+               if (ret)
+                       goto err_free_args;
+               kfree(init_args);
+       }
+       filter_hook->ops = filter_ops;
+       rcu_assign_pointer(group->fanotify_data.filter_hook, filter_hook);
+
+out:
+       fsnotify_group_unlock(group);
+       return ret;
+
+err_free_args:
+       kfree(init_args);
+err_module_put:
+       module_put(filter_ops->owner);
+err_free_hook:
+       kfree(filter_hook);
+       goto out;
+}
+
+void fanotify_filter_hook_free(struct fanotify_filter_hook *filter_hook)
+{
+       if (filter_hook->ops->filter_free)
+               filter_hook->ops->filter_free(filter_hook);
+
+       module_put(filter_hook->ops->owner);
+       kfree(filter_hook);
+}
+
+/*
+ * fanotify_filter_del - Delete a filter from fsnotify_group.
+ */
+void fanotify_filter_del(struct fsnotify_group *group)
+{
+       struct fanotify_filter_hook *filter_hook;
+
+       fsnotify_group_lock(group);
+       filter_hook = group->fanotify_data.filter_hook;
+       if (!filter_hook)
+               goto out;
+
+       rcu_assign_pointer(group->fanotify_data.filter_hook, NULL);
+       fanotify_filter_hook_free(filter_hook);
The read side is protected with srcu and there is no srcu/rcu delay of freeing.
You will either need something along the lines of
fsnotify_connector_destroy_workfn() with synchronize_srcu()
or use regular rcu delay and read side (assuming that it can be nested inside
srcu read side?).

Thanks,
Amir.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help