Thread: 12 messages, 3 authors, 2021-10-28

Re: Problem with direct IO

From: Zhengyuan Liu <hidden>
Date: 2021-10-21 02:22:12
Also in: linux-fsdevel, linux-mm

On Thu, Oct 21, 2021 at 1:37 AM Jan Kara [off-list ref] wrote:
On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:
Hi, all

We are encountering the following MySQL crash while importing tables:

    2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
    fsync() returned EIO, aborting.
    2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
    Assertion failure: ut0ut.cc:555 thread 281472996733168

At the same time, dmesg showed the following message:

    [ 4328.838972] Page cache invalidation failure on direct I/O.
    Possible data corruption due to collision with buffered I/O!
    [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
    625 Comm: kworker/42:1

At first we suspected that MySQL was interleaving direct IO and
buffered IO on the file, but after some checking we found that it only
does direct IO, using aio. The problem actually comes from the
direct IO write path (__generic_file_write_iter) itself.

ssize_t __generic_file_write_iter()
{
...
        if (iocb->ki_flags & IOCB_DIRECT) {
                loff_t pos, endbyte;

                written = generic_file_direct_write(iocb, from);
                /*
                 * If the write stopped short of completing, fall back to
                 * buffered writes.  Some filesystems do this for writes to
                 * holes, for example.  For DAX files, a buffered write will
                 * not succeed (even if it did, DAX does not handle dirty
                 * page-cache pages correctly).
                 */
                if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
                        goto out;

                status = generic_perform_write(file, from, pos = iocb->ki_pos);
...
}

From the above code snippet we can see that direct IO can fall back to
buffered IO under certain conditions, so even though MySQL only did
direct IO, it could end up interleaved with buffered IO when a fallback
occurred. I currently have no idea why the FS (ext3) failed the direct
IO, but it is strange that __generic_file_write_iter makes direct IO
fall back to buffered IO; this seems to break the semantics of direct IO.
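
The fallback condition in the snippet above can be restated as a small
user-space model. This is only a sketch for reasoning about the three
outcomes, not kernel code: the function name and its parameters are
hypothetical, with `remaining` standing in for iov_iter_count(from) and
`is_dax` for IS_DAX(inode).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>   /* ssize_t */

/*
 * Hypothetical model of the check in __generic_file_write_iter().
 *   written   - result of the direct write (negative means an error)
 *   remaining - bytes still left to write after the direct write
 *   is_dax    - true if the inode is a DAX inode
 * Returns true when the remainder would be handed to the buffered path.
 */
static bool falls_back_to_buffered(ssize_t written, size_t remaining,
                                   bool is_dax)
{
    /* Mirrors: if (written < 0 || !iov_iter_count(from) || IS_DAX(inode)) goto out; */
    if (written < 0 || remaining == 0 || is_dax)
        return false;

    /* Short (or zero-length) direct write: the rest is written through the page cache. */
    return true;
}
```

Under this model, a direct write that returns 0 or a short count with
bytes still pending is the only case that reaches
generic_perform_write(); an error, a complete write, or a DAX inode
skips it.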

The reproduction environment is:
Platform: Kunpeng 920 (arm64)
Kernel: V5.15-rc
PAGESIZE: 64K
MySQL: V8.0
Innodb_page_size: default(16K)
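
To make the size relationship in this environment concrete, here is a
minimal sketch of the arithmetic. It does not diagnose the root cause;
it only shows that with a 64K kernel page and 16K InnoDB pages, four
adjacent InnoDB pages are covered by the same page-cache page. The
macro and function names are hypothetical, and the sizes are taken
from the list above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KERNEL_PAGE_SIZE (64u * 1024u)  /* PAGESIZE from the report */
#define INNODB_PAGE_SIZE (16u * 1024u)  /* Innodb_page_size from the report */

/* Index of the page-cache page that covers a given byte offset. */
static uint64_t page_cache_index(uint64_t offset, uint32_t page_size)
{
    assert(page_size != 0);
    return offset / page_size;
}

/* True if two InnoDB pages (given by page number) share one kernel page. */
static bool innodb_pages_share_kernel_page(uint64_t a, uint64_t b)
{
    return page_cache_index(a * INNODB_PAGE_SIZE, KERNEL_PAGE_SIZE) ==
           page_cache_index(b * INNODB_PAGE_SIZE, KERNEL_PAGE_SIZE);
}
```

For example, InnoDB pages 0 through 3 all fall in page-cache page 0,
while page 4 starts page-cache page 1. With a 4K kernel page, each 16K
InnoDB page would instead span four page-cache pages of its own.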

Thanks for the report. I agree this should not happen. How hard is this to
reproduce? Any idea whether the fallback to buffered IO happens because
iomap_dio_rw() returns -ENOTBLK or because it returns a short write?

It is easy to reproduce in my test environment. As I said in my earlier
reply to Andrew, this problem is related to the kernel page size.

Can you post the output of "dumpe2fs -h <device>" for the filesystem where
the problem happens? Thanks!

Sure, the output is:

# dumpe2fs -h /dev/sda3
dumpe2fs 1.45.3 (14-Jul-2019)
Filesystem volume name:   <none>
Last mounted on:          /data
Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index
filetype needs_recovery sparse_super large_file
Filesystem flags:         unsigned_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              11034624
Block count:              44138240
Reserved block count:     2206912
Free blocks:              43168100
Free inodes:              11034613
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1013
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Filesystem created:       Thu Oct 21 09:42:03 2021
Last mount time:          Thu Oct 21 09:43:36 2021
Last write time:          Thu Oct 21 09:43:36 2021
Mount count:              1
Maximum mount count:      -1
Last checked:             Thu Oct 21 09:42:03 2021
Check interval:           0 (<none>)
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      a7b04e61-1209-496d-ab9d-a51009b51ddb
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke
Journal size:             1024M
Journal length:           262144
Journal sequence:         0x00000002
Journal start:            1

BTW, we have also tested ext4 and XFS and didn't see the direct write fallback.

Thanks,