Thread: 40 messages, 9 authors, 2013-08-19

Re: page fault scalability (ext3, ext4, xfs)

From: Jan Kara <jack@suse.cz>
Date: 2013-08-15 07:45:31
Also in: linux-fsdevel, lkml

On Thu 15-08-13 17:11:42, Dave Chinner wrote:
On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner [off-list ref] wrote:
On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner [off-list ref] wrote:
On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
It would be better to write zeros to it, so we aren't measuring the
cost of the unwritten->written conversion.
At the risk of beating a dead horse, how hard would it be to defer
this part until writeback?
Part of the work has to be done at write time because we need to
update allocation statistics (i.e., so that we don't have ENOSPC
problems).  The unwritten->written conversion does happen at writeback
(as does the actual block allocation if we are doing delayed
allocation).

The point is that if the goal is to measure page fault scalability, we
shouldn't have this other stuff happening at the same time as the page
fault workload.
Sure, but the real problem is not the block mapping or allocation
path - even if the test is changed to take that out of the picture,
we still have timestamp updates being done on every single page
fault. ext4, XFS and btrfs all do transactional timestamp updates
and have nanosecond granularity, so every page fault is resulting in
a transaction to update the timestamp of the file being modified.
I have (unmergeable) patches to fix this:

http://comments.gmane.org/gmane.linux.kernel.mm/92476
The big problem with this approach is that not doing the
timestamp update on page faults is going to break the inode change
version counting because for ext4, btrfs and XFS it takes a
transaction to bump that counter. NFS needs to know the moment a
file is changed in memory, not when it is written to disk. Also, NFS
requires the change to the counter to be persistent over server
failures, so it needs to be changed as part of a transaction....
I've been running a kernel that has the file_update_time call
commented out for over a year now, and the only problem I've seen is
that the timestamp doesn't get updated :)

I think I must be misunderstanding you (or vice versa).  I'm currently
Yup, you are.
quoted
redoing the patches, and this time I'll do it for just the mm core and
ext4.  The only change I'm proposing to ext4's page_mkwrite is to
remove the file_update_time call.
Right. Where does that end up? All the way down in
ext4_mark_iloc_dirty(), and that does:

        if (IS_I_VERSION(inode))
                inode_inc_iversion(inode);

The XFS transaction code is the same - deep inside it where an inode
is marked as dirty in the transaction, it bumps the same counter and
adds it to the transaction.
  Yeah, I'd just add that ext4 maintains i_version only if it has been
mounted with the i_version mount option. But then the NFS server would depend
on the c/mtime update, so it won't help you much - you still should update at
least one of i_version, ctime, mtime on page fault. OTOH if the filesystem
isn't exported, you could avoid this relatively expensive dance and defer
things as Andy suggests.
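
To make the trade-off concrete, here is a rough user-space model of that
decision. It is only a sketch, not kernel code: struct model_inode,
model_page_mkwrite() and model_writeback() are made-up names, the "exported"
flag is an assumption (the kernel has no such per-inode flag today), and
"transactions" just counts how often a journal transaction would be needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
struct model_inode {
	bool exported;              /* assumed: filesystem is served over NFS */
	bool has_iversion;          /* assumed: mounted with i_version */
	bool time_dirty;            /* a deferred timestamp update is pending */
	uint64_t i_version;         /* change counter seen by NFS clients */
	unsigned long transactions; /* number of transactions we would start */
};

/* Models the write fault on a shared mapping. */
static void model_page_mkwrite(struct model_inode *inode)
{
	if (!inode->exported) {
		/* Nobody needs to see the change right now: just remember it. */
		inode->time_dirty = true;
		return;
	}

	/*
	 * Exported: at least one of i_version, ctime, mtime has to change
	 * at fault time, and each of those costs a transaction.
	 */
	if (inode->has_iversion)
		inode->i_version++;
	inode->transactions++;
}

/* Models writeback, where a deferred timestamp update is finally applied. */
static void model_writeback(struct model_inode *inode)
{
	if (inode->time_dirty) {
		inode->transactions++;
		inode->time_dirty = false;
	}
}
```

In this model, 1000 write faults on a non-exported inode followed by one
writeback cost a single transaction, while the same faults on an exported
inode cost 1000.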

								Honza
-- 
Jan Kara [off-list ref]
SUSE Labs, CR