Re: [PATCH v2 7/9] NFSD: Use rhashtable for managing nfs4_file objects
From: Chuck Lever III <chuck.lever@oracle.com>
Date: 2022-10-12 15:02:10
On Oct 11, 2022, at 7:37 PM, NeilBrown [off-list ref] wrote: On Tue, 11 Oct 2022, Chuck Lever III wrote:quoted
quoted
On Oct 10, 2022, at 8:16 PM, NeilBrown [off-list ref] wrote: On Fri, 07 Oct 2022, Chuck Lever wrote:quoted
-static unsigned int file_hashval(struct svc_fh *fh) +/* + * The returned hash value is based solely on the address of an in-code + * inode, a pointer to a slab-allocated object. The entropy in such a + * pointer is concentrated in its middle bits.I think you need more justification than that for throwing away some of the entropy, even if you don't think it is much.We might have that justification: https://lore.kernel.org/linux-nfs/YrUFbLJ5uVbWtZbf@ZenIV/ (local) Actually I believe we are not discarding /any/ entropy in this function. The bits we discard are invariant.Ok, I can accept that this: + k = ptr >> L1_CACHE_SHIFT;
I searched for ">> *L1_CACHE_SHIFT". Apart from the nfsd filecache you mentioned I find two. One in quota and one in reiserfs. Both work with traditional hash tables which are more forgiving of longer chains. Do you have other evidence of this being a common trope?
This approach is based on the hash function in fs/inode.c,
which uses integer division instead of a shift.
509 static unsigned long hash(struct super_block *sb, unsigned long hashval)
510 {
511 unsigned long tmp;
512
513 tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
514 L1_CACHE_BYTES;
515 tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> i_hash_shift);
516 return tmp & i_hash_mask;
517 }
only discards invariant bits, but how can you justify this: + k &= 0x00ffffff; ??
After shifting an address, the top byte generally contains invariant bits as well.
And given that you pass it all to jhash anyway, why not just pass all of it?
I don't think this is a big deal, but these functions are basically the same as what was recently merged without complaint. It's not a high priority to revisit those. It might be worth a clean-up to share this hash function between the two hash tables... at that point we might consider removing the extra mask.
quoted
And, note that this is exactly the same situation we just merged in the filecache overhaul, and is a common trope amongst other hash tables that base their function on the inode's address.quoted
Presumably you think hashing 32 bits is faster than hashing 64 bits. Maybe it is, but is it worth it? rhashtable depends heavily on having a strong hash function. In particular if any bucket ends up with more than 16 elements it will choose a new seed and rehash. If you deliberately remove some bits that it might have been used to spread those 16 out, then you are asking for trouble. We know that two different file handles can refer to the same inode ("rarely"), and you deliberately place them in the same hash bucket. So if an attacker arranges to access 17 files with the same inode but different file handles, then the hashtable will be continuously rehashed. The preferred approach when you require things to share a hash chain is to use an rhl table.Again, this is the same situation for the filecache. Do you believe it is worth reworking that? I'm guessing "yes".As a matter of principle: yes. rhashtable is designed to assume that hash collisions are bad and can be changed by choosing a different seed. rhl_tables was explicitly added for cases when people wanted multiple elements to hash to the same value. The chance of it causing a problem without an attack are admittedly tiny. Attacks are only possible with subtree_check enabled, or if the underlying filesystem does something "clever" with file handles, so there wouldn't be many situations where an attack would even be possible. But if it were possible, it would likely be easy. The cost of the attack would be a minor-to-modest performance impact. So it is hard to argue "this code is dangerous and must be changed", but we have different tools for a reason, and I believe that rhl-tables is the right tool for this job.
Agreed. I wasn't suggesting it's an emergency situation, but it's something that should get fixed at some point if there's a problem with it, even a minor one. I think I stopped at the non-list variant of rhashtable because using rhl was more complex, and the non-list variant seemed to work fine. There's no architectural reason either file_hashtbl or the filecache must use the non-list variant. In any event, it's worth taking the trouble now to change the nfs4_file implementation proposed here as you suggest. -- Chuck Lever