Thread (10 messages) 10 messages, 3 authors, 2021-02-09

Re: [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT

From: Jens Axboe <axboe@kernel.dk>
Date: 2021-02-08 23:38:17
Also in: linux-mm

On 2/8/21 4:28 PM, Dave Chinner wrote:
On Mon, Feb 08, 2021 at 03:18:26PM -0700, Jens Axboe wrote:
quoted
Hi,

Ran into an issue with IOCB_NOWAIT and O_DIRECT, which causes a rather
serious performance issue. If IOCB_NOWAIT is set, the generic/iomap
iterators check for page cache presence in the given range, and return
-EAGAIN if any is there. This is rather simplistic and looks like
something that was never really finished. For !IOCB_NOWAIT, we simply
call filemap_write_and_wait_range() to issue (if any) and wait on the
range. The fact that we have page cache entries for this range does
not mean that we cannot safely do O_DIRECT IO to/from it.

This series adds filemap_range_needs_writeback(), which checks if
we have pages in the range that do require us to call
filemap_write_and_wait_range(). If we don't, then we can proceed just
fine with IOCB_NOWAIT.
Not exactly. If it is a write we are doing, we _must_ invalidate
the page cache pages over the range of the DIO write to maintain
some level of cache coherency between the DIO write and the page
cache contents. i.e. the DIO write makes the page cache contents
stale, so the page cache has to be invalidated before the DIO write
is started, and again when it completes to toss away racing updates
(mmap) while the DIO write was in flight...

Page invalidation can block (page locks, waits on writeback, taking
the mmap_sem to zap page tables, etc), and it can also fail because
pages are dirty (e.g. writeback+invalidation racing with mmap).

And if it fails because dirty pages then we fall back to buffered
IO, which serialises readers and writes and will block.
Right, not disagreeing with any of that.
quoted
The problem manifested itself in a production environment, where someone
is doing O_DIRECT on a raw block device. Due to other circumstances,
blkid was triggered on this device periodically, and blkid very helpfully
does a number of page cache reads on the device. Now the mapping has
page cache entries, and performance falls to pieces because we can no
longer reliably do IOCB_NOWAIT O_DIRECT.
If it was a DIO write, then the pages would have been invalidated
on the first write and the second write would issued with NOWAIT
just fine.

So the problem sounds to me like DIO reads from the block device are
not invalidating the page cache over the read range, so they persist
and prevent IOCB_NOWAIT IO from being submitted.
That is exactly the case I ran into indeed.
Historically speaking, this is why XFS always used to invalidate the
page cache for DIO - it didn't want to leave cached clean pages that
would prevent future DIOs from being issued concurrently because
coherency with the page cache caused performance issues. We
optimised away this invalidation because the data in the page cache
is still valid after a flush+DIO read, but it sounds to me like
there are still corner cases where "always invalidate cached pages"
is the right thing for DIO to be doing....

Not sure what the best way to go here it - the patch isn't correct
for NOWAIT DIO writes, but it looks necessary for reads. And I'm not
sure that we want to go back to "invalidate everything all the time"
either....
We still do the invalidation for writes with the patch for writes,
nothing has changed there. We just skip the
filemap_write_and_wait_range() if there's nothing to write. And if
there's nothing to write, _hopefully_ the invalidation should go
smoothly unless someone dirtied/locked/put-under-writeback the page
since we did the check. But that's always going to be racy, and there's
not a whole lot we can do about that.

-- 
Jens Axboe
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help