Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

From: Guo Chao <hidden>
Date: 2012-09-27 08:42:01

On Wed, Sep 26, 2012 at 10:54:09AM +1000, Dave Chinner wrote:

On Tue, Sep 25, 2012 at 04:59:55PM +0800, Guo Chao wrote:

quoted

On Mon, Sep 24, 2012 at 06:26:54PM +1000, Dave Chinner wrote:

quoted

@@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode *inode);
 static struct inode *find_inode(struct super_block *sb,
 				struct hlist_head *head,
 				int (*test)(struct inode *, void *),
-				void *data)
+				void *data, bool locked)
 {
 	struct hlist_node *node;
 	struct inode *inode = NULL;

 repeat:
-	hlist_for_each_entry(inode, node, head, i_hash) {
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
 		spin_lock(&inode->i_lock);
+		if (inode_unhashed(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}

Is this check too early? If the unhashed inode happened to be the target
inode, we are wasting our time to continue the traversal and we do not wait 
on it.

If the inode is unhashed, then it is already passing through evict()
or has already passed through. If it has already passed through
evict() then it is too late to call __wait_on_freeing_inode() as the
wakeup occurs in evict() immediately after the inode is removed
from the hash. i.e:

        remove_inode_hash(inode);

        spin_lock(&inode->i_lock);
        wake_up_bit(&inode->i_state, __I_NEW);
        BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
        spin_unlock(&inode->i_lock);

i.e. if we get the case:

Thread 1, RCU hash traversal		Thread 2, evicting foo

rcu_read_lock()
found inode foo
					remove_inode_hash(inode);
					spin_lock(&foo->i_lock);
					wake_up(I_NEW)
					spin_unlock(&foo->i_lock);
					destroy_inode()
					......
	spin_lock(foo->i_lock)
	match sb, ino
	I_FREEING
	  rcu_read_unlock()

<rcu grace period can expire at any time now,
 so use after free is guaranteed at some point>

	  wait_on_freeing_inode
	    wait_on_bit(I_NEW)

<will never get woken>

Hence if the inode is unhashed, it doesn't matter what inode it is,
it is never valid to use it any further because it may have already
been freed and the only reason we can safely access here it is that
the RCU grace period will not expire until we call
rcu_read_unlock().

Yeah, looks right.

quoted

@@ -1078,8 +1098,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 		struct inode *old;

 		spin_lock(&inode_hash_lock);
-		/* We released the lock, so.. */
-		old = find_inode_fast(sb, head, ino);
+		old = find_inode_fast(sb, head, ino, true);
 		if (!old) {
 			inode->i_ino = ino;
 			spin_lock(&inode->i_lock);

Emmmm ... couldn't we use memory barrier API instead of irrelevant spin
lock on newly allocated inode to publish I_NEW?

Yes, we could.

However, having multiple synchronisation methods for a single
variable that should only be used in certain circumstances is
something that is easy to misunderstand and get wrong. Memory
barriers are much more subtle and harder to understand than spin
locks, and every memory barrier needs to be commented to explain
what the barrier is actually protecting against.

In the case where a spin lock is guaranteed to be uncontended and
the cache line hot in the CPU cache, it makes no sense to replace
the spin lock with a memory barrier, especially when every other
place we modify the i_state/i_hash fields we have to wrap them
with i_lock....

Simple code is good code - save the complexity for something that
needs it.

Emmm, I doubt "it's simpler and need no document". 

I bet someday there will be other guys stand out and ask "why take spin 
lock on a inode which apparently does not subject to any race condition?".

I know that the per-sb inode lru lock is currently the hotest of the
inode cache locks (performance limiting at somewhere in the range of
8-16way workloads on XFS), and I've got work in (slow) progress to
address that.  That work will also the address the per-sb dentry LRU
locks, which are the hotest dentry cache locks as well.

Glad to hear that.

Thank your for all your explanation, especially historical ones.

Regards,
Guo Chao

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help