Re: page fault scalability (ext3, ext4, xfs) | linux-ext4

On Thu, Aug 15, 2013 at 04:01:49PM +1000, Dave Chinner wrote:
On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner [off-list ref] wrote:
On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
It would be better to write zeros to it, so we aren't measuring the
cost of the unwritten->written conversion.
At the risk of beating a dead horse, how hard would it be to defer
this part until writeback?
Part of the work has to be done at write time because we need to
update allocation statistics (i.e., so that we don't have ENOSPC
problems).  The unwritten->written conversion does happen at writeback
(as does the actual block allocation if we are doing delayed
allocation).
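
To illustrate the split described above, here is a minimal toy model, not ext4 code. The class and method names (ToyFilesystem, write, writeback) are hypothetical. It assumes only what the paragraph says: the space accounting is updated at write time so that ENOSPC is reported there, while the actual placement of blocks is deferred to writeback.

```python
class NoSpace(Exception):
    """Stands in for ENOSPC in this toy model."""


class ToyFilesystem:
    """Hypothetical model: reserve space at write time, allocate at writeback."""

    def __init__(self, total_blocks):
        self.free_blocks = total_blocks  # neither reserved nor allocated
        self.reserved = 0                # promised at write time, not yet placed
        self.allocated = 0               # placed during writeback

    def write(self, nblocks):
        # Write time: only the accounting changes, so running out of
        # space is reported here rather than later during writeback.
        if nblocks > self.free_blocks:
            raise NoSpace("ENOSPC")
        self.free_blocks -= nblocks
        self.reserved += nblocks

    def writeback(self):
        # Writeback: the deferred allocation happens now. It cannot fail
        # for lack of space because the blocks were already reserved.
        placed = self.reserved
        self.allocated += placed
        self.reserved = 0
        return placed


fs = ToyFilesystem(total_blocks=8)
fs.write(5)                      # reserves 5 of 8 blocks
try:
    fs.write(4)                  # only 3 blocks left: fails at write time
    enospc_at_write = False
except NoSpace:
    enospc_at_write = True
placed = fs.writeback()          # the 5 reserved blocks are allocated now
```

In this sketch the failure surfaces in write(), which is the reason the statistics cannot simply be deferred along with the allocation.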

The point is that if the goal is to measure page fault scalability, we
shouldn't have this other stuff happening at the same time as the page
fault workload.
Sure, but the real problem is not the block mapping or allocation
path - even if the test is changed to take that out of the picture,
we still have timestamp updates being done on every single page
fault. ext4, XFS and btrfs all do transactional timestamp updates
and have nanosecond granularity, so every page fault is resulting in
a transaction to update the timestamp of the file being modified.
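
A rough way to see why granularity matters here is the toy model below. It is not kernel code: the function name, the fault spacing and the "one transaction per changed timestamp" rule are assumptions made for illustration. It counts how often the stored timestamp would change for a stream of write faults at a given timestamp granularity.

```python
def count_timestamp_transactions(fault_times_ns, granularity_ns):
    """Count timestamp updates for a series of write faults.

    Assumption of this sketch: a fault costs one transaction only when
    the timestamp, rounded down to the granularity, differs from the
    value already stored in the inode.
    """
    transactions = 0
    stored = None
    for now in fault_times_ns:
        stamp = now - (now % granularity_ns)
        if stamp != stored:
            stored = stamp
            transactions += 1
    return transactions


# Hypothetical workload: 1000 write faults, one every 10 microseconds.
faults = [i * 10_000 for i in range(1000)]

nanosecond = count_timestamp_transactions(faults, 1)
one_second = count_timestamp_transactions(faults, 1_000_000_000)
```

With these assumed numbers, nanosecond granularity gives every fault a distinct timestamp and therefore its own update, while one-second granularity collapses the whole burst into a single update.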
I have (unmergeable) patches to fix this:

http://comments.gmane.org/gmane.linux.kernel.mm/92476
The big problem with this approach is that not doing the
timestamp update on page faults is going to break the inode change
version counting because for ext4, btrfs and XFS it takes a
transaction to bump that counter. NFS needs to know the moment a
file is changed in memory, not when it is written to disk.
I don't think the in-memory updates of the data and the version have to
be completely atomic, if that's what you mean.

Also, NFS requires the change to the counter to be persistent over
server failures, so it needs to be changed as part of a transaction....
I'm not sure those two updates have to be a single atomic transaction on
disk, either.

(Though the reboot cases are more complicated, I may not have thought it
through.)

(By the way, I wonder what happens if we reuse a change attribute value
after a crash?  There's probably a (hard to hit) bug there.)
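
A toy sketch of that scenario follows, under the assumption that the counter is bumped in memory first and persisted later. Nothing here is NFS or filesystem code, and ToyServer and its methods are hypothetical names.

```python
class ToyServer:
    """Hypothetical server with an in-memory and an on-disk copy of a file."""

    def __init__(self):
        self.disk_version, self.disk_data = 0, ""
        self.mem_version, self.mem_data = 0, ""

    def modify(self, data):
        # The change and the counter bump are visible in memory at once.
        self.mem_data = data
        self.mem_version += 1

    def commit(self):
        # Persist the current in-memory state.
        self.disk_version, self.disk_data = self.mem_version, self.mem_data

    def crash_and_reboot(self):
        # Anything not committed is lost, including counter bumps.
        self.mem_version, self.mem_data = self.disk_version, self.disk_data


server = ToyServer()
server.modify("A")
server.commit()                  # version 1 is on disk
server.modify("B")               # version 2 exists only in memory

# A client reads now and caches the data together with the counter.
client_cache = (server.mem_version, server.mem_data)

server.crash_and_reboot()        # server is back at version 1, data "A"
server.modify("C")               # version 2 is handed out a second time

client_thinks_cache_valid = client_cache[0] == server.mem_version
cache_is_stale = client_cache[1] != server.mem_data
```

In this model the client's cached counter matches the server's even though the contents differ, which is the kind of reuse the question is about.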

--b.

IOWs, fixing the "filesystems need a transaction on each page_mkwrite
call" problem isn't as simple as changing how timestamps are
updated.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com