Re: [PATCH v5 bpf-next 0/5] bpf path iterator
From: Christian Brauner <brauner@kernel.org>
Date: 2025-07-07 10:46:48
Also in:
bpf, linux-fsdevel, lkml
On Fri, Jun 27, 2025 at 08:51:21AM +1000, NeilBrown wrote:
On Fri, 27 Jun 2025, Song Liu wrote:quoted
quoted
On Jun 26, 2025, at 3:22 AM, NeilBrown [off-list ref] wrote:[...]quoted
quoted
I guess I misunderstood the proposal of vfs_walk_ancestors() initially, so some clarification: I think vfs_walk_ancestors() is good for the rcu-walk, and some rcu-then-ref-walk. However, I don’t think it fits all use cases. A reliable step-by-step ref-walk, like this set, works well with BPF, and we want to keep it.The distinction between rcu-walk and ref-walk is an internal implementation detail. You as a caller shouldn't need to think about the difference. You just want to walk. Note that LOOKUP_RCU is documented in namei.h as "semi-internal". The only uses outside of core-VFS code is in individual filesystem's d_revalidate handler - they are checking if they are allowed to sleep or not. You should never expect to pass LOOKUP_RCU to an VFS API - no other code does. It might be reasonable for you as a caller to have some control over whether the call can sleep or not. LOOKUP_CACHED is a bit like that. But for dotdot lookup the code will never sleep - so that is not relevant.Unfortunately, the BPF use case is more complicated. In some cases, the callback function cannot be call in rcu critical sections. For example, the callback may need to read xatter. For these cases, we we cannot use RCU walk at all.I really think you should stop using the terms RCU walk and ref-walk. I think they might be focusing your thinking in an unhelpful direction.
Thank you! I really appreciate you helping to shape this API and it aligns a lot with my thinking.
The key issue about reading xattrs is that it might need to sleep. Focusing on what might need to sleep and what will never need to sleep is a useful approach - the distinction is wide spread in the kernel and several function take a flag indicating if they are permitted to sleep, or if failure when sleeping would be required. So your above observation is better described as The vfs_walk_ancestors() API has an (implicit) requirement that the callback mustn't sleep. This is a problem for some use-cases where the call back might need to sleep - e.g. for accessing xattrs. That is a good and useful observation. I can see three possibly responses: 1/ Add a vfs_walk_ancestors_maysleep() API for which the callback is always allowed to sleep. I don't particularly like this approach.
Agreed.
2/ Use repeated calls to vfs_walk_parent() when the handling of each ancestor might need to sleep. I see no problem with supporting both vfs_walk_ancestors() and vfs_walk_parent(). There is plenty of precedent for having different interfaces for different use cases.
Meh.
3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
I think that's fine.