Thread (12 messages) 12 messages, 3 authors, 2021-10-28

Re: Problem with direct IO

From: Zhengyuan Liu <hidden>
Date: 2021-10-19 03:39:52
Also in: linux-fsdevel, linux-mm

On Tue, Oct 19, 2021 at 2:43 AM Andrew Morton [off-list ref] wrote:
On Mon, 18 Oct 2021 09:09:06 +0800 Zhengyuan Liu [off-list ref] wrote:
quoted
Ping.

I think this problem is serious and someone may  also encounter it in
the future.


On Wed, Oct 13, 2021 at 9:46 AM Zhengyuan Liu
[off-list ref] wrote:
quoted
Hi, all

we are encounting following Mysql crash problem while importing tables :

    2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
    fsync() returned EIO, aborting.
    2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
    Assertion failure: ut0ut.cc:555 thread 281472996733168

At the same time , we found dmesg had following message:

    [ 4328.838972] Page cache invalidation failure on direct I/O.
    Possible data corruption due to collision with buffered I/O!
    [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
    625 Comm: kworker/42:1

Firstly, we doubled Mysql has operating the file with direct IO and
buffered IO interlaced, but after some checking we found it did only
do direct IO using aio. The problem is exactly from direct-io
interface (__generic_file_write_iter) itself.

ssize_t __generic_file_write_iter()
{
...
        if (iocb->ki_flags & IOCB_DIRECT) {
                loff_t pos, endbyte;

                written = generic_file_direct_write(iocb, from);
                /*
                 * If the write stopped short of completing, fall back to
                 * buffered writes.  Some filesystems do this for writes to
                 * holes, for example.  For DAX files, a buffered write will
                 * not succeed (even if it did, DAX does not handle dirty
                 * page-cache pages correctly).
                 */
                if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
                        goto out;

                status = generic_perform_write(file, from, pos = iocb->ki_pos);
...
}

From above code snippet we can see that direct io could fall back to
buffered IO under certain conditions, so even Mysql only did direct IO
it could interleave with buffered IO when fall back occurred. I have
no idea why FS(ext3) failed the direct IO currently, but it is strange
__generic_file_write_iter make direct IO fall back to buffered IO, it
seems  breaking the semantics of direct IO.
That makes sense.
quoted
quoted
The reproduced  environment is:
Platform:  Kunpeng 920 (arm64)
Kernel: V5.15-rc
PAGESIZE: 64K
Mysql:  V8.0
Innodb_page_size: default(16K)
This is all fairly mature code, I think.  Do you know if earlier
kernels were OK, and if so which versions?
we have tested v4.18 and v4.19 and the problem is still here,  the earlier
version such before v4.12 doesn't support Arm64 well  so we can't test.

I think this problem has something to do with page size,  if we change kernel
page size from 64K to 4k or just set Innodb_page_size to 64K then we cannot
reproduce this problem.  Typically we use 4k as kernel page size and FS block
size, if database use more than 4k as IO unit then it won't interleave for each
IO in kernel page cache as each one will occupy one or more page cache, that
means it is hard to trigger this problem on x84 or other platforms using 4k page
size.  But thing got changed when come to Arm64 64K page size, if database uses
a smaller IO unit, in our Mysql case that is 16K DIO, then two IO
could share one
page cache and if one falls back to buffered IO it can trigger the problem. For
example,  aio got two direct IO which share the same page cache to write , it
dispatched the first one to storage and begin process the second one before
the first one completed, if the second one fall back to buffered IO it will been
copy to page cache and mark the page as dirty, upon that the first one completed
it will check and invalidate it's page cache, if it is dirty then the
problem occured.

If my analysis isn't correct please point it out, thanks.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help