RE: raid6 - data intefrity issue - data mis-compare on rebuilding RAID 6 - with 100 Mb resync speed.
From: Manibalan P <hidden>
Date: 2014-03-11 13:22:00
Hi Neal,
I don't know what kernel "CentOS 6.4" runs. Please report the actual
kernel version as well as distro details.
The kernel version is 2.6.32. CentOS distribution: 2.6.32-358.23.2.el6.x86_64 #1 SMP, x86_64 GNU/Linux
I know nothing about "dit32" and so cannot easily interpret the output.
Is it saying that just a few bytes were wrong?
It is not just a few bytes of corruption; it looks like a number of sectors are corrupted (for example, 40 sectors). dit32 writes a pattern of IO and, after each write cycle, reads it back and verifies it. The data written at the reported LBA is itself corrupted. In other words, this looks like write corruption.
Was the array fully synced before you started the test?
Yes, IO was started only after the resync had completed.
And to add more info,
I am seeing this mis-compare only at high resync speeds
(30M to 100M). I ran the same test with resync speed min 10M and max
30M without any issue. So the issue appears to be related to
sync_speed_max / min.
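For reference, a minimal sketch of how the two resync-speed configurations mentioned above could be expressed. It assumes the per-array sysfs files /sys/block/md0/md/sync_speed_min and sync_speed_max (values in KiB per second) were the knobs used; the thread does not say whether these or the global /proc/sys/dev/raid/speed_limit_* files were set. The function names are hypothetical.

```python
from pathlib import Path


def sync_speed_settings(md_name, min_kib, max_kib, sysfs_root="/sys/block"):
    """Return a {path: value} map for the per-array resync speed limits.

    Assumption: limits are given in KiB/s, as the md sysfs files expect.
    """
    if min_kib > max_kib:
        raise ValueError("sync_speed_min must not exceed sync_speed_max")
    base = Path(sysfs_root) / md_name / "md"
    return {
        base / "sync_speed_min": str(min_kib),
        base / "sync_speed_max": str(max_kib),
    }


def apply_settings(settings):
    """Write each value to its file (needs root on a real system)."""
    for path, value in settings.items():
        path.write_text(value + "\n")


# The two configurations described in this mail:
failing_setup = sync_speed_settings("md0", 100000, 100000)  # min = max = 100000
passing_setup = sync_speed_settings("md0", 10000, 30000)    # min 10M, max 30M
```

Calling apply_settings(failing_setup) as root would apply the configuration that showed the mis-compare; the sketch only builds the settings and does not write anything on import.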
I can't think of anything else that might cause an inconsistency. I test the
RAID6 recovery code from time to time and it always works flawlessly
for me.
Can you suggest any IO tool or test to ensure data integrity?
One more thing I would like to bring to your attention: I ran the same IO test on an Ubuntu 13 system (Linux ubuntu 3.8.0-19-generic #29-Ubuntu SMP Wed Apr 17 18:16:28 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux) and saw the same type of data corruption.
Thanks,
Manibalan.
-----Original Message-----
From: NeilBrown [mailto:neilb@suse.de]
Sent: Tuesday, March 11, 2014 8:34 AM
To: Manibalan P
Cc: linux-raid@vger.kernel.org
Subject: Re: raid6 - data intefrity issue - data mis-compare on rebuilding RAID 6 - with 100 Mb resync speed.
On Fri, 7 Mar 2014 14:18:59 +0530 "Manibalan P" [off-list ref] wrote:
Hi,
Hi, when posting to vger.kernel.org lists, please don't send HTML mail, just plain text. Because you did, the original email didn't get to the list.
We are facing a data integrity issue on RAID 6, on the CentOS 6.4 kernel.
I don't know what kernel "CentOS 6.4" runs. Please report the actual kernel version as well as distro details.
Details of the setup:
1. 7-drive RAID6 md device (md0) - Capacity 25 GB
2. Resync speed max and min set to 100000 (100Mb)
3. A script is running to simulate drive failure; this script will do the following:
a. Use mdadm to set two random drives on the md faulty, then mdadm remove those drives.
b. mdadm add one drive and wait for the rebuild to complete, then insert the next one.
c. Wait until the md becomes optimal, and continue the disk removal cycle again.
4. iSCSI target is configured to "/dev/md0"
5. From a Windows server, the md0 target is connected using the
Microsoft iSCSI initiator, and formatted with NTFS.
6. Dit32 IO tool is running on the formatted volume.
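The drive-failure cycle in step 3 can be sketched roughly as follows. This is not the attached RollingHotSpareTwoDriveFailure.sh, which is not reproduced here; it is a minimal approximation that only builds the mdadm command lines for one cycle. The member device names are hypothetical, and using "mdadm --wait" to block until recovery finishes is an assumption about how the wait is done.

```python
import random


def failure_cycle_commands(md_dev, members, rng=random):
    """Build the mdadm commands for one two-drive failure cycle.

    Steps, following the description in the mail:
      a. mark two random members faulty, then remove them
      b. add the drives back one at a time, waiting for each rebuild
    """
    victims = rng.sample(members, 2)
    cmds = []
    for dev in victims:
        cmds.append(["mdadm", md_dev, "--fail", dev])
    for dev in victims:
        cmds.append(["mdadm", md_dev, "--remove", dev])
    for dev in victims:
        cmds.append(["mdadm", md_dev, "--add", dev])
        # Assumption: block here until resync/recovery on the array is done.
        cmds.append(["mdadm", "--wait", md_dev])
    return cmds


# Hypothetical member names for a 7-drive array.
example_members = ["/dev/sd" + letter for letter in "bcdefgh"]
example_cycle = failure_cycle_commands("/dev/md0", example_members)
```

On a real test system each command list could be passed to subprocess.run(cmd, check=True) as root, and the cycle repeated once the array is optimal again.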
Issue#:
The Dit32 tool runs IO in multiple threads; in each thread, IO is written and verified.
During the verification cycle, we are getting a
mis-compare. Below is the log from the dit32 tool.
Thu Mar 06 23:19:31 2014 INFO: DITNT application started
Thu Mar 06 23:20:19 2014 INFO: Test started on Drive D:
Dir Sets=8, Dirs per Set=70, Files per Dir=75
File Size=512KB
Read Only=N, Debug Stamp=Y, Verify During Copy=Y
Build I/O Size range=1 to 128 sectors
Copy Read I/O Size range=1 to 128 sectors
Copy Write I/O Size range=1 to 128 sectors
Verify I/O Size range=1 to 128 sectors
Fri Mar 07 01:28:09 2014 ERROR: Miscompare Found: File "D:\dit\s6\d51\s6d51f37", offset=00048008
Expected Data: 06 33 25 01 0240 (dirSet, dirNo, fileNo, elementNo, sectorOffset)
Read Data: 05 08 2d 01 0240 (dirSet, dirNo, fileNo, elementNo, sectorOffset)
Read Request: offset=00043000, size=00008600
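One way to read the miscompare entry above is to decode the two debug stamps. The sketch below assumes the stamp fields are hexadecimal, that sectors are 512 bytes, and that file names follow an s<dirSet>d<dirNo>f<fileNo> pattern in decimal. None of this is documented in the thread; the assumptions are only suggested by the fact that the expected stamp then lines up with the reported file name and offset.

```python
SECTOR_BYTES = 512  # assumption: dit32 counts 512-byte sectors

FIELDS = ("dir_set", "dir_no", "file_no", "element_no", "sector_offset")


def parse_stamp(text):
    """Parse a stamp such as '06 33 25 01 0240' into named integer fields.

    Assumption: every field is written in hexadecimal.
    """
    values = [int(part, 16) for part in text.split()]
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))


def stamp_file_name(stamp):
    """Guess the file the stamp belongs to (naming scheme is an assumption)."""
    return "s%dd%df%d" % (stamp["dir_set"], stamp["dir_no"], stamp["file_no"])


expected = parse_stamp("06 33 25 01 0240")
read_back = parse_stamp("05 08 2d 01 0240")

print(stamp_file_name(expected))   # file the data should belong to
print(stamp_file_name(read_back))  # file the data appears to come from
print(hex(expected["sector_offset"] * SECTOR_BYTES))  # byte offset of that sector
```

Under these assumptions the expected stamp maps to s6d51f37 at byte offset 0x48000, which agrees with the reported file and offset=00048008, while the data read back carries the stamp of a different file (s5d8f45) at the same sector offset.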
The following files are attached to this mail for your reference:
1. Raid5.c and .h files - the code we are using.
2. RollingHotSpareTwoDriveFailure.sh - the script which simulates the two-disk failure.
3. dit32log.sav - Log file from the dit32 tool
4. s6d31f37 - the file where the corruption happened (hex format)
5. CentOS-system-info - md and system info
I didn't find any "CentOS-system-info" attached.
I know nothing about "dit32" and so cannot easily interpret the output. Is it saying that just a few bytes were wrong?
Was the array fully synced before you started the test?
I can't think of anything else that might cause an inconsistency. I test the RAID6 recovery code from time to time and it always works flawlessly for me.
NeilBrown
Thanks,
Manibalan.