Thread: 4 messages, 2 authors, 2021-05-09

Re: regression: data corruption with ext4 on LUKS on nvme with torvalds master

From: Alex Xu (Hello71) <hidden>
Date: 2021-05-09 02:30:08
Also in: linux-block, linux-nvme, lkml

Excerpts from Alex Xu (Hello71)'s message of May 8, 2021 1:54 pm:
> Hi all,
>
> Using torvalds master, I recently encountered data corruption on my ext4
> volume on LUKS on NVMe. Specifically, during heavy writes, the system
> partially hangs; SysRq-W shows that processes are blocked in the kernel
> on I/O. After forcibly rebooting, chunks of files are replaced with
> other, unrelated data. I'm not sure exactly what the data is; some of it
> is unknown binary data, but in at least one case, a list of file paths
> was inserted into a file, indicating that the data is misdirected after
> encryption.
>
> This issue appears to affect files receiving writes in the temporal
> vicinity of the hang, but affects both new and old data: for example, my
> shell history file was corrupted up to many months before.
>
> The drive reports no SMART issues.
>
> I believe this is a regression in the kernel related to something merged
> in the last few days, as it consistently occurs with my most recent
> kernel versions, but disappears when reverting to an older kernel.
>
> I haven't investigated further, such as by bisecting. I hope this is
> sufficient information to give someone a lead on the issue, and if it is
> a bug, nail it down before anybody else loses data.
>
> Regards,
> Alex.

I found the following test to reproduce a hang, which I guess may be the 
cause:

host$ cd /tmp
host$ truncate -s 10G drive
host$ qemu-system-x86_64 -drive format=raw,file=drive,if=none,id=drive -device nvme,drive=drive,serial=1 [... more VM setup options]
guest$ cryptsetup luksFormat /dev/nvme0n1
[accept warning, use any password]
guest$ cryptsetup open /dev/nvme0n1 test
[enter password]
guest$ mkfs.ext4 /dev/mapper/test
[normal output...]
Creating journal (16384 blocks): [hangs forever]

I bisected this issue to:

cd2c7545ae1beac3b6aae033c7f31193b3255946 is the first bad commit
commit cd2c7545ae1beac3b6aae033c7f31193b3255946
Author: Changheun Lee [off-list ref]
Date:   Mon May 3 18:52:03 2021 +0900

    bio: limit bio max size

I didn't try reverting this commit or further reducing the test case. 
Let me know if you need my kernel config or other information.

Regards,
Alex.
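
For anyone repeating the bisection, the hang check above could in principle be scripted for `git bisect run`, which treats exit status 0 as good, 125 as "skip this commit", and other statuses from 1 to 127 as bad. The following is only a sketch under assumptions: `bisect_status` is a hypothetical helper name, the 120-second limit is arbitrary, and nothing here is taken from the original report beyond the device path.

```shell
#!/bin/sh
# Sketch only: bisect_status is a hypothetical helper, not part of git or of
# the original report. It maps the exit status of a `timeout ... mkfs.ext4`
# run to the exit codes that `git bisect run` understands.
bisect_status() {
    case "$1" in
        0)       echo 0 ;;    # mkfs.ext4 completed: mark the commit good
        124|137) echo 1 ;;    # timeout(1) expired (124) or sent SIGKILL (137): bad
        *)       echo 125 ;;  # any other failure: ask git bisect to skip
    esac
}

# Example use inside the guest (device name from the steps above; the
# 120 s limit and 10 s kill grace period are arbitrary assumptions):
#   timeout -k 10 120 mkfs.ext4 -q /dev/mapper/test
#   exit "$(bisect_status $?)"
```

One caveat: a process blocked in uninterruptible I/O may not react to the signals sent by `timeout`, so in practice the time limit may need to be applied on the host, around the QEMU process, rather than inside the guest.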