Re: page fault scalability (ext3, ext4, xfs) | linux-ext4


On Thu, Aug 15, 2013 at 2:37 PM, Dave Chinner [off-list ref] wrote:
On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
I didn't think of that at all.

If userspace does:

ptr = mmap(...);
ptr[0] = 1;
sleep(1);
ptr[0] = 2;
sleep(1);
munmap();

Then current kernels will mark the inode changed on (only) the ptr[0]
= 1 line.  My patches will instead mark the inode changed when munmap
is called (or after ptr[0] = 2 if writepages gets called for any
reason).
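One way to see which of these behaviors a given kernel and filesystem has is to run the sequence above against a scratch file and sample the file's mtime after each step. The following is a rough sketch, not part of the patches under discussion: `mtime_ns` and `observe_mtimes` are hypothetical helper names, the 4096-byte mapping length is an assumption, and the `delay_s` parameter stands in for the `sleep(1)` calls. Which samples differ depends on the kernel, the filesystem, and its timestamp granularity, so no particular output is implied.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: mtime of fd in nanoseconds since the epoch, -1 on error. */
long long mtime_ns(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return (long long)st.st_mtim.tv_sec * 1000000000LL + st.st_mtim.tv_nsec;
}

/*
 * Hypothetical helper: replay the mmap/write/sleep/write/sleep/munmap
 * sequence on a scratch file at `path` and sample the mtime:
 *   out[0] after mmap, before any write
 *   out[1] after the first write  (ptr[0] = 1)
 *   out[2] after the second write (ptr[0] = 2)
 *   out[3] after munmap
 * Returns 0 on success, -1 on failure. The scratch file is removed.
 */
int observe_mtimes(const char *path, unsigned delay_s, long long out[4])
{
    assert(path != NULL && out != NULL);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, 4096) != 0) {
        close(fd);
        unlink(path);
        return -1;
    }

    void *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        unlink(path);
        return -1;
    }
    volatile char *ptr = map;   /* volatile so both stores are really issued */

    out[0] = mtime_ns(fd);
    ptr[0] = 1;                 /* first write to the page */
    out[1] = mtime_ns(fd);
    sleep(delay_s);
    ptr[0] = 2;                 /* second write to the same page */
    out[2] = mtime_ns(fd);
    sleep(delay_s);
    int rc = munmap(map, 4096);
    out[3] = mtime_ns(fd);

    close(fd);
    unlink(path);
    return rc == 0 ? 0 : -1;
}
```

Calling `observe_mtimes("/tmp/scratch.dat", 1, t)` and comparing `t[0]` through `t[3]` shows at which step the timestamp moved on the system under test; the path is only an example.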

I'm not sure which is better.  POSIX actually requires my behavior
(which is mostly irrelevant).
Not by my reading of it. POSIX states that c/mtime needs to be
updated between the first access and the next msync() call. We
update mtime on the first access, and so therefore we conform to the
POSIX requirement....
It says "between a write reference to the mapped region and the next
call to msync()."  Most write references don't cause page faults.

My behavior also means that, if an NFS
client reads and caches the file between the two writes, then it will
eventually find out that the data is stale.
"eventually" is very different behaviour to the current behaviour.

My understanding is that NFS v4 delegations require the underlying
filesystem to bump the version count on *any* modification made to
the file so that delegations can be recalled appropriately. So not
informing the filesystem that the file data has been changed is
going to cause problems.
We don't do that right now (and we can't without utterly destroying
performance) because we don't trap on every modification.  See
below...

The current behavior, on
the other hand, means that a single pass of mmapped writes through the
file will update the times much faster.

I could arrange for the first page fault to *also* update times when
the FS is exported or if a particular mount option is set.  (The ext4
change to request the new behavior is all of four lines, and it's easy
to adjust.)
What does "first page fault" mean?
The first write to the page triggers a page fault and marks the page
writable.  The second write to the page (assuming no writeback happens
in the meantime) does not trigger a page fault or notify the kernel
in any way.
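The difference between the first and second write can be observed from userspace by counting the process's page faults around each store. This is a minimal sketch under stated assumptions: `faults_now` and `count_write_faults` are hypothetical helper names, the counters come from `getrusage()` (minor plus major faults), and some environments may not maintain them. On a typical Linux system one would expect the first store to account for at least one fault and the second for none, but that is an expectation, not a guarantee.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: total page faults (minor + major) so far, -1 on error. */
long faults_now(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt + ru.ru_majflt;
}

/*
 * Hypothetical helper: map one page of a scratch file at `path` with
 * MAP_SHARED, store to the same byte twice, and report how many page
 * faults the process took during each store.
 * Returns 0 on success, -1 on failure. The scratch file is removed.
 */
int count_write_faults(const char *path, long *first, long *second)
{
    assert(path != NULL && first != NULL && second != NULL);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, page) != 0) {
        close(fd);
        unlink(path);
        return -1;
    }

    void *map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        unlink(path);
        return -1;
    }
    volatile char *ptr = map;   /* volatile so both stores are really issued */

    long before = faults_now();
    ptr[0] = 1;                 /* first write: the page is not yet writable */
    long middle = faults_now();
    ptr[0] = 2;                 /* second write: same page, already writable */
    long after = faults_now();

    munmap(map, (size_t)page);
    close(fd);
    unlink(path);

    if (before < 0 || middle < 0 || after < 0)
        return -1;
    *first = middle - before;
    *second = after - middle;
    return 0;
}
```

If writeback runs between the two stores, the page is write-protected again and the second store faults as well, which is the "assuming no writeback happens" caveat above.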

In current kernels, this chain of events won't work:

 - Server goes down
 - Server comes up
 - Userspace on server calls mmap and writes something
 - Client reconnects and invalidates its cache
 - Userspace on server writes something else *to the same page*

The client will never notice the second write, because it won't update
any inode state.  With my patches, the client will notice as soon as the
server starts writeback.

So I think that there are cases where my changes make things better
and cases where they make things worse.

--Andy

