Re: [general question] rare silent data corruption when writing data
From: John Stoffel <hidden>
Date: 2020-05-08 16:10:21
Also in:
dm-devel
quoted
quoted
quoted
quoted
"Michal" == Michal Soltys [off-list ref] writes:
And of course it should also go to dm-devel@redhat.com, my fault for not including that as well. I strongly suspect it's an thin-lv problem somewhere, but I don't know enough to help chase down the problem in detail. John Michal> note: as suggested, I'm also CCing this to linux-lvm; the full Michal> context with replies starts at: Michal> https://www.spinics.net/lists/raid/msg64364.html There is also Michal> the initial post at the bottom as well. Michal> On 5/8/20 2:54 AM, John Stoffel wrote:
quoted
quoted
quoted
quoted
quoted
quoted
"Michal" == Michal Soltys [off-list ref] writes:
Michal> On 20/05/07 23:01, John Stoffel wrote:
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
"Roger" == Roger Heflin [off-list ref] writes:
Roger> Have you tried the same file 2x and verified the corruption is in the Roger> same places and looks the same?
quoted
quoted
quoted
Are these 1tb files VMDK or COW images of VMs? How are these files made. And does it ever happen with *smaller* files? What about if you just use a sparse 2tb file and write blocks out past 1tb to see if there's a problem?
Michal> The VMs are always directly on lvm volumes. (e.g. Michal> /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the Michal> errors after we ran scrub on the filesystem.
quoted
Michal> Yes, the errors were also found on small files.
quoted
Those errors are in small files inside the VM, which is running btrfs ontop of block storage provided by your thin-lv, right?
Michal> Yea, the small files were in this case on that thin-lv. Michal> We also discovered (yesterday) file corruptions with VM hosting gitlab registry - this one was using the same thin-lv underneath, but the guest itself was using ext4 (in this case, docker simply reported incorrect sha checksum on (so far) 2 layers.
quoted
disks -> md raid5 -> pv -> vg -> lv-thin -> guest QCOW/LUN -> filesystem -> corruption
Michal> Those particular guests, yea. The host case it's just w/o "guest" step. Michal> But (so far) all corruption ended going via one of the lv-thin layers (and via one of md raids).
quoted
Michal> Since then we recreated the issue directly on the host, just Michal> by making ext4 filesystem on some LV, then doing write with Michal> checksum, sync, drop_caches, read and check checksum. The Michal> errors are, as I mentioned - always a full 4KiB chunks (always Michal> same content, always same position).
quoted
What position? Is it a 4k, 1.5m or some other consistent offset? And how far into the file? And this LV is a plain LV or a thin-lv? I'm running a debian box at home with RAID1 and I haven't seen this, but I'm not nearly as careful as you. Can you provide the output of:
Michal> What I meant that it doesn't "move" when verifying the same file (aka different reads from same test file). Between the tests, the errors are of course in different places - but it's always some 4KiB piece(s) - that look like correct pieces belonging somewhere else.
quoted
/sbin/lvs --version
Michal> LVM version: 2.03.02(2) (2018-12-18)
Michal> Library version: 1.02.155 (2018-12-18)
Michal> Driver version: 4.41.0
Michal> Configuration: ./configure --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --exec-prefix= --bindir=/bin --libdir=/lib/x86_64-linux-gnu --sbindir=/sbin --with-usrlibdir=/usr/lib/x86_64-linux-gnu --with-optimisation=-O2 --with-cache=internal --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --with-default-pid-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-thin=internal --with-thin-check=/usr/sbin/thin_check --with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair --enable-applib --enable-blkid_wiping --enable-cmdlib --enable-dmeventd --enable-dbus-service --enable-lvmlockd-dlm --enable-lvmlockd-sanlock --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig --enable-readline --enable-udev_rules --enable-udev_sync
quoted
too? Can you post your: /sbin/dmsetup status output too? There's a better command to use here, but I'm not an export. You might really want to copy this over to the linux-lvm@redhat.com mailing list as well.
Michal> x22v0-tp_ssd-tpool: 0 2577285120 thin-pool 19 8886/552960 629535/838960 - rw no_discard_passdown queue_if_no_space - 1024 Michal> x22v0-tp_ssd_tdata: 0 2147696640 linear Michal> x22v0-tp_ssd_tdata: 2147696640 429588480 linear Michal> x22v0-tp_ssd_tmeta_rimage_1: 0 4423680 linear Michal> x22v0-tp_ssd_tmeta: 0 4423680 raid raid1 2 AA 4423680/4423680 idle 0 0 - Michal> x22v0-gerrit--new: 0 268615680 thin 255510528 268459007 Michal> x22v0-btrfsnopool: 0 134430720 linear Michal> x22v0-gitlab_root: 0 629145600 thin 628291584 629145599 Michal> x22v0-tp_ssd_tmeta_rimage_0: 0 4423680 linear Michal> x22v0-nexus_old_storage: 0 10737500160 thin 5130817536 10737500159 Michal> x22v0-gitlab_reg: 0 2147696640 thin 1070963712 2147696639 Michal> x22v0-nexus_old_root: 0 268615680 thin 257657856 268615679 Michal> x22v0-tp_big_tmeta_rimage_1: 0 8601600 linear Michal> x22v0-tp_ssd_tmeta_rmeta_1: 0 245760 linear Michal> x22v0-micron_vol: 0 268615680 linear Michal> x22v0-tp_big_tmeta_rimage_0: 0 8601600 linear Michal> x22v0-tp_ssd_tmeta_rmeta_0: 0 245760 linear Michal> x22v0-gerrit--root: 0 268615680 thin 103388160 268443647 Michal> x22v0-btrfs_ssd_linear: 0 268615680 linear Michal> x22v0-btrfstest: 0 268615680 thin 40734720 268615679 Michal> x22v0-tp_ssd: 0 2577285120 linear Michal> x22v0-tp_big: 0 22164602880 linear Michal> x22v0-nexus3_root: 0 167854080 thin 21860352 167854079 Michal> x22v0-nusknacker--staging: 0 268615680 thin 268182528 268615679 Michal> x22v0-tmob2: 0 1048657920 linear Michal> x22v0-tp_big-tpool: 0 22164602880 thin-pool 35 35152/1075200 3870070/7215040 - rw no_discard_passdown queue_if_no_space - 1024 Michal> x22v0-tp_big_tdata: 0 4295147520 linear Michal> x22v0-tp_big_tdata: 4295147520 17869455360 linear Michal> x22v0-btrfs_ssd_test: 0 201523200 thin 191880192 201335807 Michal> x22v0-nussknacker2: 0 268615680 thin 58573824 268615679 Michal> x22v0-tmob1: 0 1048657920 linear Michal> x22v0-tp_big_tmeta: 0 8601600 raid raid1 2 AA 8601600/8601600 idle 0 0 - Michal> x22v0-nussknacker1: 0 268615680 thin 74376192 268615679 Michal> x22v0-touk--elk4: 0 839024640 linear Michal> x22v0-gerrit--backup: 0 268615680 thin 228989952 268443647 Michal> x22v0-tp_big_tmeta_rmeta_1: 0 245760 linear Michal> x22v0-openvpn--new: 0 134430720 thin 24152064 66272255 Michal> x22v0-k8sdkr: 0 268615680 linear Michal> x22v0-nexus3_storage: 0 10737500160 thin 4976683008 10737500159 Michal> x22v0-rocket: 0 167854080 thin 163602432 167854079 Michal> x22v0-tp_big_tmeta_rmeta_0: 0 245760 linear Michal> x22v0-roger2: 0 134430720 thin 33014784 134430719 Michal> x22v0-gerrit--new--backup: 0 268615680 thin 6552576 268443647 Michal> Also lvs -a with segment ranges: Michal> LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert LE Ranges Michal> btrfs_ssd_linear x22v0 -wi-a----- <128.09g /dev/md125:19021-20113 Michal> btrfs_ssd_test x22v0 Vwi-a-t--- 96.09g tp_ssd 95.21 Michal> btrfsnopool x22v0 -wi-a----- 64.10g /dev/sdt2:35-581 Michal> btrfstest x22v0 Vwi-a-t--- <128.09g tp_big 15.16 Michal> gerrit-backup x22v0 Vwi-aot--- <128.09g tp_big 85.25 Michal> gerrit-new x22v0 Vwi-a-t--- <128.09g tp_ssd 95.12 Michal> gerrit-new-backup x22v0 Vwi-a-t--- <128.09g tp_big 2.44 Michal> gerrit-root x22v0 Vwi-aot--- <128.09g tp_ssd 38.49 Michal> gitlab_reg x22v0 Vwi-a-t--- 1.00t tp_big 49.87 Michal> gitlab_reg_snapshot x22v0 Vwi---t--k 1.00t tp_big gitlab_reg Michal> gitlab_root x22v0 Vwi-a-t--- 300.00g tp_ssd 99.86 Michal> gitlab_root_snapshot x22v0 Vwi---t--k 300.00g tp_ssd gitlab_root Michal> k8sdkr x22v0 -wi-a----- <128.09g /dev/md126:20891-21983 Michal> [lvol0_pmspare] x22v0 ewi------- 4.10g /dev/sdt2:0-34 Michal> micron_vol x22v0 -wi-a----- <128.09g /dev/sdt2:582-1674 Michal> nexus3_root x22v0 Vwi-aot--- <80.04g tp_ssd 13.03 Michal> nexus3_storage x22v0 Vwi-aot--- 5.00t tp_big 46.35 Michal> nexus_old_root x22v0 Vwi-a-t--- <128.09g tp_ssd 95.92 Michal> nexus_old_storage x22v0 Vwi-a-t--- 5.00t tp_big 47.78 Michal> nusknacker-staging x22v0 Vwi-aot--- <128.09g tp_big 99.84 Michal> nussknacker1 x22v0 Vwi-aot--- <128.09g tp_big 27.69 Michal> nussknacker2 x22v0 Vwi-aot--- <128.09g tp_big 21.81 Michal> openvpn-new x22v0 Vwi-aot--- 64.10g tp_big 17.97 Michal> rocket x22v0 Vwi-aot--- <80.04g tp_ssd 97.47 Michal> roger2 x22v0 Vwi-a-t--- 64.10g tp_ssd 24.56 Michal> tmob1 x22v0 -wi-a----- <500.04g /dev/md125:8739-13005 Michal> tmob2 x22v0 -wi-a----- <500.04g /dev/md125:13006-17272 Michal> touk-elk4 x22v0 -wi-ao---- <400.08g /dev/md126:17477-20890 Michal> tp_big x22v0 twi-aot--- 10.32t 53.64 3.27 [tp_big_tdata]:0-90187 Michal> [tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:0-17476 Michal> [tp_big_tdata] x22v0 Twi-ao---- 10.32t /dev/md126:21984-94694 Michal> [tp_big_tmeta] x22v0 ewi-aor--- 4.10g 100.00 [tp_big_tmeta_rimage_0]:0-34,[tp_big_tmeta_rimage_1]:0-34 Michal> [tp_big_tmeta_rimage_0] x22v0 iwi-aor--- 4.10g /dev/sda3:30-64 Michal> [tp_big_tmeta_rimage_1] x22v0 iwi-aor--- 4.10g /dev/sdb3:30-64 Michal> [tp_big_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:29-29 Michal> [tp_big_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:29-29 Michal> tp_ssd x22v0 twi-aot--- 1.20t 75.04 1.61 [tp_ssd_tdata]:0-10486 Michal> [tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:0-8738 Michal> [tp_ssd_tdata] x22v0 Twi-ao---- 1.20t /dev/md125:17273-19020 Michal> [tp_ssd_tmeta] x22v0 ewi-aor--- <2.11g 100.00 [tp_ssd_tmeta_rimage_0]:0-17,[tp_ssd_tmeta_rimage_1]:0-17 Michal> [tp_ssd_tmeta_rimage_0] x22v0 iwi-aor--- <2.11g /dev/sda3:11-28 Michal> [tp_ssd_tmeta_rimage_1] x22v0 iwi-aor--- <2.11g /dev/sdb3:11-28 Michal> [tp_ssd_tmeta_rmeta_0] x22v0 ewi-aor--- 120.00m /dev/sda3:10-10 Michal> [tp_ssd_tmeta_rmeta_1] x22v0 ewi-aor--- 120.00m /dev/sdb3:10-10
quoted
quoted
quoted
Are the LVs split across RAID5 PVs by any chance?
Michal> raid5s are used as PVs, but a single logical volume always uses one only Michal> one physical volume underneath (if that's what you meant by split across).
quoted
Ok, that's what I was asking about. It shouldn't matter... but just trying to chase down the details.quoted
quoted
It's not clear if you can replicate the problem without using lvm-thin, but that's what I suspect you might be having problems with.
Michal> I'll be trying to do that, though the heavier tests will have to wait Michal> until I move all VMs to other hosts (as that is/was our production machnie).
quoted
Sure, makes sense.quoted
quoted
Can you give us the versions of the your tools, and exactly how you setup your test cases? How long does it take to find the problem?
Michal> Regarding this, currently: Michal> kernel: 5.4.0-0.bpo.4-amd64 #1 SMP Debian 5.4.19-1~bpo10+1 (2020-03-09) x86_64 GNU/Linux (was also happening with 5.2.0-0.bpo.3-amd64) Michal> LVM version: 2.03.02(2) (2018-12-18) Michal> Library version: 1.02.155 (2018-12-18) Michal> Driver version: 4.41.0 Michal> mdadm - v4.1 - 2018-10-01
quoted
Michal> Will get all the details tommorow (the host is on up to date debian Michal> buster, the VMs are mix of archlinuxes and debians (and the issue Michal> happened on both)).
quoted
Michal> As for how long, it's a hit and miss. Sometimes writing and reading back Michal> ~16gb file fails (the cheksum read back differs from what was written) Michal> after 2-3 tries. That's on the host.
quoted
Michal> On the guest, it's been (so far) a guaranteed thing when we were Michal> creating very large tar file (900gb+). As for past two weeks we were Michal> unable to create that file without errors even once.
quoted
Ouch! That's not good. Just to confirm, these corruptions are all in a thin-lv based filesystem, right? I'd be interested to know if you can create another plain LV and cause the same error. Trying to simplify the potential problems.
Michal> I have been trying to - but so far didn't manage to replicate this with: Michal> - a physical partition Michal> - filesystem directly on a physical partition Michal> - filesystem directly on mdraid Michal> - filesystem directly on a linear volume Michal> Note that this _doesn't_ imply that I _always_ get errors if lvm-thin is in use - as I also had lengthy period of attempts to cause corruption on some thin volume w/o any successes either. But the ones that failed had those in common (so far): md & lvm-thin - with 4 KiB piece(s) being incorrect
quoted
quoted
quoted
Can you compile the newst kernel and newest thin tools and try them out?
Michal> I can, but a bit later (once we move VMs out of the host).
quoted
quoted
quoted
How long does it take to replicate the corruption?
Michal> When it happens, it's usually few tries tries of writing a 16gb file Michal> with random patterns and reading it back (directly on host). The Michal> irritating thing is that it can be somewhat hard to reproduce (e.g. Michal> after machine's reboot).
quoted
quoted
quoted
Sorry for all the questions, but until there's a test case which is repeatable, it's going to be hard to chase this down. I wonder if running 'fio' tests would be something to try? And also changing your RAID5 setup to use the default stride and stripe widths, instead of the large values you're using.
Michal> The raid5 is using mdadm's defaults (which is 512 KiB these days for a Michal> chunk). LVM on top is using much longer extents (as we don't really need Michal> 4mb granularity) and the lvm-thin chunks were set to match (and align) Michal> to raid's stripe.
quoted
quoted
quoted
Good luck!
Roger> I have not as of yet seen write corruption (except when a vendors disk Roger> was resetting and it was lying about having written the data prior to Roger> the crash, these were ssds, if your disk write cache is on and you Roger> have a disk reset this can also happen), but have not seen "lost Roger> writes" otherwise, but would expect the 2 read corruption I have seen Roger> to also be able to cause write issues. So for that look for scsi Roger> notifications for disk resets that should not happen.
quoted
quoted
quoted
Roger> I have had a "bad" controller cause read corruptions, those Roger> corruptions would move around, replacing the controller resolved it, Roger> so there may be lack of error checking "inside" some paths in the Roger> card. Lucky I had a number of these controllers and had cold spares Roger> for them. The give away here was 2 separate buses with almost Roger> identical load with 6 separate disks each and all12 disks on 2 buses Roger> had between 47-52 scsi errors, which points to the only component Roger> shared (the controller).
quoted
quoted
quoted
Roger> The backplane and cables are unlikely in general cause this, there is Roger> too much error checking between the controller and the disk from what Roger> I know.
quoted
quoted
quoted
Roger> I have had pre-pcie bus (PCI-X bus, 2 slots shared, both set to 133 Roger> cause random read corruptions, lowering speed to 100 fixed it), this Roger> one was duplicated on multiple identical pieces of hw with all Roger> different parts on the duplication machine.
quoted
quoted
quoted
Roger> I have also seen lost writes (from software) because someone did a Roger> seek without doing a flush which in some versions of the libs loses Roger> the unfilled block when the seek happens (this is noted in the man Roger> page, and I saw it 20years ago, it is still noted in the man page, so Roger> no idea if it was ever fixed). So has more than one application been Roger> noted to see the corruption?
quoted
quoted
quoted
Roger> So one question, have you seen the corruption in a path that would Roger> rely on one controller, or all corruptions you have seen involving Roger> more than one controller? Isolate and test each controller if you Roger> can, or if you can afford to replace it and see if it continues.
quoted
quoted
quoted
Roger> On Thu, May 7, 2020 at 12:33 PM Michal Soltys [off-list ref] wrote:
quoted
quoted
quoted
quoted
quoted
Note: this is just general question - if anyone experienced something similar or could suggest how to pinpoint / verify the actual cause.quoted
Thanks to btrfs's checksumming we discovered somewhat (even if quite rare) nasty silent corruption going on on one of our hosts. Or perhaps "corruption" is not the correct word - the files simply have precise 4kb (1 page) of incorrect data. The incorrect pieces of data look on their own fine - as something that was previously in the place, or written from wrong source.quoted
The hardware is (can provide more detailed info of course):quoted
- Supermicro X9DR7-LN4F - onboard LSI SAS2308 controller (2 sff-8087 connectors, 1 connected to backplane) - 96 gb ram (ecc) - 24 disk backplanequoted
- 1 array connected directly to lsi controller (4 disks, mdraid5, internal bitmap, 512kb chunk) - 1 array on the backplane (4 disks, mdraid5, journaled) - journal for the above array is: mdraid1, 2 ssd disks (micron 5300 pro disks) - 1 btrfs raid1 boot array on motherboard's sata ports (older but still fine intel ssds from DC 3500 series)quoted
Raid 5 arrays are in lvm volume group, and the logical volumes are used by VMs. Some of the volumes are linear, some are using thin-pools (with metadata on the aforementioned intel ssds, in mirrored config). LVM uses large extent sizes (120m) and the chunk-size of thin-pools is set to 1.5m to match underlying raid stripe. Everything is cleanly aligned as well.quoted
With a doze of testing we managed to roughly rule out the following elements as being the cause:quoted
- qemu/kvm (issue occured directly on host) - backplane (issue occured on disks directly connected via LSI's 2nd connector) - cable (as a above, two different cables) - memory (unlikely - ECC for once, thoroughly tested, no errors ever reported via edac-util or memtest) - mdadm journaling (issue occured on plain mdraid configuration as well) - disks themselves (issue occured on two separate mdadm arrays) - filesystem (issue occured on both btrfs and ext4 (checksumed manually) )quoted
We did not manage to rule out (though somewhat _highly_ unlikely):quoted
- lvm thin (issue always - so far - occured on lvm thin pools) - mdraid (issue always - so far - on mdraid managed arrays) - kernel (tested with - in this case - debian's 5.2 and 5.4 kernels, happened with both - so it would imply rather already longstanding bug somewhere)quoted
And finally - so far - the issue never occured:quoted
- directly on a disk - directly on mdraid - on linear lvm volume on top of mdraidquoted
As far as the issue goes it's:quoted
- always a 4kb chunk that is incorrect - in a ~1 tb file it can be from a few to few dozens of such chunks - we also found (or rather btrfs scrub did) a few small damaged files as well - the chunks look like a correct piece of different or previous dataquoted
The 4kb is well, weird ? Doesn't really matter any chunk/stripes sizes anywhere across the stack (lvm - 120m extents, 1.5m chunks on thin pools; mdraid - default 512kb chunks). It does nicely fit a page though ...quoted
Anyway, if anyone has any ideas or suggestions what could be happening (perhaps with this particular motherboard or vendor) or how to pinpoint the cause - I'll be grateful for any.