Thread: 40 messages, 9 authors, 2013-08-19

Re: page fault scalability (ext3, ext4, xfs)

From: Dave Chinner <david@fromorbit.com>
Date: 2013-08-15 22:18:07
Also in: linux-fsdevel, lkml

On Thu, Aug 15, 2013 at 02:43:09PM -0700, Andy Lutomirski wrote:
On Thu, Aug 15, 2013 at 2:37 PM, Dave Chinner
[off-list ref] wrote:
On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
My behavior also means that, if an NFS
client reads and caches the file between the two writes, then it will
eventually find out that the data is stale.
"eventually" is very different behaviour to the current behaviour.

My understanding is that NFS v4 delegations require the underlying
filesystem to bump the version count on *any* modification made to
the file so that delegations can be recalled appropriately. So not
informing the filesystem that the file data has been changed is
going to cause problems.
We don't do that right now (and we can't without utterly destroying
performance) because we don't trap on every modification.  See
below...
We don't trap every mmap modification. We trap every modification
that the filesystem is informed about. That includes a c/mtime
update on every write page fault. It's as fine grained as we can get
without introducing serious performance killing overhead.

And nobody has made any compelling argument that what we do now is
problematic - all we've got is a microbenchmark that doesn't quite scale
linearly because filesystem updates through a global filesystem
structure (the journal) don't scale linearly.
The current behavior, on
the other hand, means that a single pass of mmapped writes through the
file will update the times much faster.

I could arrange for the first page fault to *also* update times when
the FS is exported or if a particular mount option is set.  (The ext4
change to request the new behavior is all of four lines, and it's easy
to adjust.)
What does "first page fault" mean?
The first write to the page triggers a page fault and marks the page
writable.  The second write to the page (assuming no writeback happens
in the meantime) does not trigger a page fault or notify the kernel
in any way.
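
The fault behaviour described above can be sketched as a toy model. This is only an illustration, not kernel code: the `Page` class and its `writable` and `faults` fields are hypothetical names, and the model assumes that writeback write-protects the page again so that the next write faults.

```python
class Page:
    """Toy model of one mmapped page-cache page (not kernel code)."""

    def __init__(self):
        self.writable = False  # assumption: a clean page is mapped read-only
        self.faults = 0        # write faults the kernel got to see

    def write(self):
        if not self.writable:
            # First write to a clean page: traps into the kernel.
            self.faults += 1
            self.writable = True
        # Any further write hits a writable mapping; the kernel sees nothing.

    def writeback(self):
        # Assumption: writeback cleans the page and write-protects it again.
        self.writable = False


page = Page()
page.write()
page.write()
print(page.faults)  # 1: the second write did not fault
page.writeback()
page.write()
print(page.faults)  # 2: the first write after writeback faults again
```

In this model the only points at which the filesystem can be told anything are the clean->dirty transitions, which is why the choice of what to do at that transition matters.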
IIUC, what you are saying is that you'll maintain the current behaviour
(i.e. clean->dirty does a timestamp update) if the filesystem
requires it? So the default behaviour of any filesystem that
supports NFSv4 is going to behave as it does now?

If that's the case, why bother changing anything, given that NFSv4 is the
default version that the kernel uses? (I'm playing devil's advocate
here).
In current kernels, this chain of events won't work:

 - Server goes down
 - Server comes up
 - Userspace on server calls mmap and writes something
 - Client reconnects and invalidates its cache
 - Userspace on server writes something else *to the same page*

The client will never notice the second write, because it won't update
any inode state. 
That's wrong. The server wrote the dirty page before the client
reconnected, therefore it got marked clean. The second write to the
server page marks it dirty again, causing page_mkwrite to be
called, thereby updating the timestamp/i_version field. So, the NFS
client will notice the second change on the server, and it will
notice it immediately after the second access has occurred, not some
time later when:
With my patches, the client will notice as soon as the
server starts writeback.
Your patches introduce a 30+ second window where a file can be dirty
on the server but the NFS server doesn't know about it and can't
tell the clients about it because i_version doesn't get bumped until
writeback.....
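
The two positions can be compared with a small simulation of the scenario above. It is a sketch under stated assumptions rather than a description of either the existing kernel code or the actual patches: the function `run` and the policy names `on_fault` and `on_writeback` are made up for illustration, and `i_version` is modelled as a plain counter that is bumped either on the clean->dirty write fault or when writeback starts.

```python
def run(policy):
    """Return the i_version a client would see at three points in time.

    policy is "on_fault" (bump on the clean->dirty write fault) or
    "on_writeback" (bump when writeback of a dirty page starts).
    Both names are hypothetical labels for the two behaviours discussed.
    """
    i_version = 0
    writable = False  # assumption: a clean page is mapped read-only
    seen = []

    def write():
        nonlocal i_version, writable
        if not writable:
            writable = True
            if policy == "on_fault":
                i_version += 1

    def writeback():
        nonlocal i_version, writable
        if writable:
            if policy == "on_writeback":
                i_version += 1
            writable = False

    write()
    writeback()             # the dirty page is written before the client reconnects
    seen.append(i_version)  # client reconnects and caches this version
    write()                 # second write to the same page
    seen.append(i_version)  # what the server can report right after that write
    writeback()
    seen.append(i_version)  # what it can report once writeback has run
    return seen


print(run("on_fault"))      # [1, 2, 2]
print(run("on_writeback"))  # [1, 1, 2]
```

The middle value is the point of disagreement: under the fault-time policy the version changes as soon as the page is redirtied, while under the writeback-time policy it stays the same until writeback runs, which is the window described above.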
So I think that there are cases where my changes make things better
and cases where they make things worse.
Right, and the issue is that it makes things worse for important use
cases that we have to support in default configurations.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs