Re: strange problem with raid6 read errors on active non-degraded array

From: NeilBrown <hidden>
Date: 2014-07-02 10:45:02

On Wed, 02 Jul 2014 10:32:41 +0100 Pedro Teixeira [off-list ref] wrote:

- I'm having the following problem on a raid6 md volume consisting of
16 1TB Seagate SSHDs (using kernel 3.15.3 or 3.14.0). mdadm is 3.3.

  - Every time I run fsck.ext4 I get the exact same errors (
...short read ). Forcing a repair on the md0 volume shows no errors
and completes without problems. All disks are active and the volume is
not degraded, yet I can't get rid of the short read errors on those 16
blocks, and when the filesystem is mounted the read errors come up
from time to time, probably because those blocks are in use.

- If I try to read those blocks with dd ( dd if=/dev/md0  of=test.txt
seek=458227712 count=6 bs=4096 ) it instantly creates a 1.8T file,
but the file doesn't appear to have anything in it ( and the file
doesn't take up 1.8T on disk, as the disk is much smaller )

- This started happening after a three-disk failure. I
recovered from that failure by recreating the array with the
13 non-failed disks plus the last failed one ( events didn't differ
much ). I then re-added the other disks. The failed disks are all
physically good; I tested them with hdat2 and they don't have read/write
errors, so I reused them. I don't know why they failed, maybe some
incompatibility between the SSHDs and the LSI HBA controller.

root@nas3:/# dd if=/dev/md0  of=teste.txt seek=458227712 count=6 bs=4096
6+0 records in
6+0 records out
24576 bytes (25 kB) copied, 0.0019239 s, 12.8 MB/s
root@nas3:/# ls -lah teste.txt
-rw-r--r-- 1 root root 1.8T Jul  2 10:22 teste.txt
root@nas3:/#



root@nas3:/# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sde[0] sdq[15] sdp[14] sdo[17] sdn[19] sdm[16]  
sdl[18] sdk[9] sdj[8] sdi[7] sdh[6] sdg[5] sdf[4] sdb[3] sdd[2] sdc[1]
       13672838144 blocks super 1.2 level 6, 512k chunk, algorithm 2  
[16/16] [UUUUUUUUUUUUUUUU]

- When doing a fsck.ext4 of /dev/md0 it returns the following ( and I  
can do it over and over again with the exact same errors) :

root@nas3:/# fsck.ext4 -f /dev/md0
e2fsck 1.42.10 (18-May-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Error reading block 458227712 (Attempt to read block from filesystem  
resulted in short read) while reading inode and block bitmaps.  Ignore  
error<y>? yes


Can't possibly happen!

(Don't worry, I say that a lot - I'm usually wrong).

What sort of computer?  In particular, is it 32-bit or 64-bit?

Try using 'dd' to read a few meg at various offsets (1G, 2G, 4G, 6G, 8G, ....)
and find out if there is a pattern, where it can read and where it cannot.
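
A minimal sketch of such a probe, in case it helps. The helper name `probe` and the 4 MiB read size are assumptions for illustration, not part of the original suggestion. It uses `skip=` so that the offset applies to the input device (with `seek=` the offset would apply to the output file instead), and it counts the bytes actually returned so that a short read shows up even when dd exits cleanly.

```shell
#!/bin/sh
# probe DEV GIB...: try to read 4 MiB at each offset (given in GiB) and
# report whether the full amount came back.
# "probe" is a hypothetical helper name used only for this sketch.
probe() {
    dev=$1; shift
    want=$((4 * 1048576))
    for gib in "$@"; do
        # skip= offsets the INPUT in units of bs (1 MiB here);
        # no of= is given, so the data goes to stdout and is counted by wc.
        got=$(dd if="$dev" bs=1048576 skip=$((gib * 1024)) count=4 2>/dev/null \
              | wc -c | tr -d ' ')
        if [ "$got" -eq "$want" ]; then
            echo "${gib}G: ok"
        else
            echo "${gib}G: short ($got of $want bytes)"
        fi
    done
}
```

For example, `probe /dev/md0 1 2 4 6 8 16 32` (device name taken from the report above; reading the device will normally need root) prints one line per offset, which should make any pattern easy to spot.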

NeilBrown

