Re: Splitting a THP beyond EOF

From: Matthew Wilcox <willy@infradead.org>
Date: 2020-10-20 11:21:48
Also in: linux-fsdevel, linux-xfs

On Tue, Oct 20, 2020 at 03:59:28PM +1100, Dave Chinner wrote:

On Tue, Oct 20, 2020 at 02:43:57AM +0100, Matthew Wilcox wrote:

quoted

This is a weird one ... which is good because it means the obvious
ones have been fixed and now I'm just tripping over the weird cases.
And fortunately, xfstests exercises the weird cases.

1. The file is 0x3d000 bytes long.
2. A readahead allocates an order-2 THP for 0x3c000-0x3ffff
3. We simulate a read error for 0x3c000-0x3cfff
4. Userspace writes to 0x3d697 to 0x3dfaa

So this is a write() beyond EOF, yes?

If yes, then we first go through this path:

	xfs_file_buffered_aio_write()
	  xfs_file_aio_write_checks()
	    iomap_zero_range(isize, pos - isize)

To zero the region between the current EOF and where the new write
starts. i.e. from 0x3d000 to 0x3d696.

Yes.  That calls iomap_write_begin() which calls iomap_split_page()
which is where we run into trouble.  I elided the exact path from the
description of the problem.

quoted

5. iomap_write_begin() gets the 0x3c page, sees it's THP and !Uptodate
   so it calls iomap_split_page() (passing page 0x3d)

Splitting the page because it's !Uptodate seems rather drastic to
me.  Why does it need to split the page here?

Because we can't handle Dirty, !Uptodate THPs in the truncate path.
Previous discussion:
https://lore.kernel.org/linux-mm/20200821144021.GV17456@casper.infradead.org/ (local)

The current assumption is that a !Uptodate THP is due to a read error,
and so the sensible thing to do is split it and handle read errors at
a single-page level.

I've been playing around with creating THPs in the write path, and that
offers a different pathway to creating Dirty, !Uptodate THPs, so this
may also change at some point.  I'd like to get what I have merged and
then figure out how to make this better.

Also, this concerns me: if we are exposing the cached EOF page via
mmap, it needs to contain only zeroes in the region beyond EOF so
that we don't expose stale data to userspace. Hence when a THP that
contains EOF is instantiated, we have to ensure that the region
beyond EOF is compeltely zeroed. It then follows that if we read all
the data in that THP up to EOF, then the page is actually up to
date...

We do that in iomap_readpage_actor().  Had the readahead I/O not "failed",
we'd've had an Uptodate THP which straddled EOF.  I dumped the page and
its uptodate bits are 0xfff0 (1kB block size filesystem, the three pages
0x3d-0x3f are uptodate because they were zeroed and 0x3c is !uptodate
because the I/O failed).

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help