Re: [PATCH v3] mm: fix race between MADV_FREE reclaim and blkdev direct IO read
From: Mauricio Faria de Oliveira <hidden>
Date: 2022-02-02 16:29:49
Also in:
linux-mm
On Wed, Feb 2, 2022 at 11:03 AM Christoph Hellwig [off-list ref] wrote:
On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:quoted
Well, blkdev_direct_IO() gets references for all pages, and on READ operations it only sets them dirty _later_. So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct IO read from block devices, and page reclaim happens during __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages() returns, but BEFORE the pages are set dirty, the situation happens. The direct IO read eventually completes. Now, when userspace reads the buffers, the PTE is no longer there and the page fault handler do_anonymous_page() services that with the zero-page, NOT the data!So why not just set the pages dirty early like the other direct I/O implementations? Or if this is fine with the patch should we remove the early dirtying elsewhere?
In general, since this particular problem is specific to MADV_FREE, it seemed about right to go for a more contained/particular solution (than changes with broader impact/risk to things that aren't broken). This isn't to say either approach shouldn't be pursued, but just that the larger changes aren't strictly needed to actually fix _this_ issue (and might complicate landing the fix into the stable/distro kernels.) Now, specifically on the 2 suggestions you mentioned, I'm not very familiar with other implementations, thus I can't speak to that, sorry. However, on the 1st suggestion (set pages dirty early), John noted [1] there might be issues with that and advised not going there.
quoted
Reproducer: ========== @ test.c (simplified, but works)Can you add this to blktests or some other regularly run regression test suite?
Sure. The test also needs the kernel-side change (to trigger memory reclaim), which can probably be wired for blktests with a fault-injection capability. Does that sound good? Maybe there's a better way to do it.
quoted
+ smp_rmb(); + + /* + * The only page refs must be from the isolation + * plus one or more rmap's (dropped by discard:).Overly long line.
Hmm, checkpatch.pl didn't complain about it. Ah, it checks for 100 chars. Ok; v4.
quoted
+ */ + if ((ref_count == 1 + map_count) &&No need for the inner braces.
Ok; v4. I'll wait a bit in case more changes are needed, and send v4 w/ the above. Thanks! [1] https://lore.kernel.org/linux-mm/7094dbd6-de0c-9909-e657-e358e14dc6c3@nvidia.com/ (local) -- Mauricio Faria de Oliveira