Thread: 5 messages, 3 authors, 2014-07-25

Re: Fedora 20 RAID 6 errors on rebuild / check / repair

From: Kay Diederichs <hidden>
Date: 2014-07-24 07:33:21

On 07/24/2014 04:29 AM, George Rapp wrote:
Hi -

I have a Fedora 20 media server / MythTV backend utilizing a HighPoint
RocketRAID 2720SGL controller (Amazon product link:
http://is.gd/yqo2i1). The server performs fine under normal (minimal)
read-write operations, but during any high-I/O operations (rebuild
after mdadm --add, RAID check initiated by "echo check >
/sys/block/md6/md/sync_action" or "echo repair > ..."), I get sporadic
errors and poor performance on my RAID 6 array, /dev/md6.

Wondering if there is anything I can tweak to make my configuration
more stable. The inability to check or repair this RAID device has me
nervous.

The problems seem to start when I see the following error message in
/var/log/syslog:
Jul 22 21:23:37 backend3 kernel: [95876.375990] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jul 22 21:23:37 backend3 kernel: [95876.376153] ata5.00: failed command: READ DMA
Jul 22 21:23:37 backend3 kernel: [95876.376284] ata5.00: cmd c8/00:08:40:11:81/00:00:00:00:00/e3 tag 11 dma 4096 in
Jul 22 21:23:37 backend3 kernel: [95876.376284]          res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
Jul 22 21:23:37 backend3 kernel: [95876.376750] ata5.00: status: { DRDY }
Jul 22 21:23:37 backend3 kernel: [95876.376874] ata5: hard resetting link
Jul 22 21:23:37 backend3 kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jul 22 21:23:37 backend3 kernel: ata5.00: failed command: READ DMA
Jul 22 21:23:37 backend3 kernel: ata5.00: cmd c8/00:08:40:11:81/00:00:00:00:00/e3 tag 11 dma 4096 in
         res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
Jul 22 21:23:37 backend3 kernel: ata5.00: status: { DRDY }
Jul 22 21:23:37 backend3 kernel: ata5: hard resetting link
Jul 22 21:23:40 backend3 kernel: [95878.742281] ata5.00: configured for UDMA/133
Jul 22 21:23:40 backend3 kernel: [95878.742413] ata5.00: device reported invalid CHS sector 0
Jul 22 21:23:40 backend3 kernel: [95878.742542] ata5: EH complete
Jul 22 21:23:40 backend3 kernel: ata5.00: configured for UDMA/133
Jul 22 21:23:40 backend3 kernel: ata5.00: device reported invalid CHS sector 0
Jul 22 21:23:40 backend3 kernel: ata5: EH complete

I thought the problem might be caused by NCQ being enabled -- previous
iterations of this error included the string 'ncq', like this:
ata7.00: cmd 60/00:00:68:4b:75/03:00:04:00:00/40 tag 0 ncq 393216 in
         res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)

so I disabled NCQ by adding "libata.force=noncq" to my kernel boot
parameters. However, it didn't help, as I still get the "...frozen"
errors. (I have young children, so any error message that includes the
word "Frozen" makes me twitchy ... 8^)

Right now, I'm attempting to rebuild the degraded RAID 6 array after
swapping out a disk that was getting an increasing number of these
errors:
Device: /dev/sdf [SAT], 35 Currently unreadable (pending) sectors

I started the rebuild on Monday night in single-user mode via
# mdadm --manage /dev/md6 --add /dev/sdf1

(my other partitions are /dev/sd[bcde]4, but I only created a single
partition on the new disk, to see if reliability would be better by
placing my RAID partition on the first partition rather than the last)

At first, the rebuild was supposed to take 4.3 days. I Googled around
and found a couple of speed optimization techniques, which I applied:
# sysctl -w dev.raid.speed_limit_max=100000

# cd /sys/block/md6/md
# echo 16384 > stripe_cache_size

This initially raised the resync speed to 68000-70000K/sec, until I hit
the first "exception Emask" error like the one I described above --
now the speed has dropped to 30K/sec, and the rebuild is scheduled to
last 439 more days! I don't know if I should just mark the new device
as failed and stop the sync, or let it keep grinding and hope it
speeds up.

Any pointers or tips appreciated. I've been running Linux software
RAID for 4-5 years, but this is the first time I've experienced this
kind of trouble.

More data on my system:


[root@backend3 gwr]# uname -a
Linux backend3 3.14.4-200.fc20.i686+PAE #1 SMP Tue May 13 14:03:12 UTC 2014 i686 i686 i386 GNU/Linux

[root@backend3 gwr]# mdadm --version
mdadm - v3.3 - 3rd September 2013

[root@backend3 log]# mdadm --detail /dev/md6
/dev/md6:
        Version : 1.2
  Creation Time : Sun Apr 24 17:31:27 2011
     Raid Level : raid6
     Array Size : 5756723712 (5490.04 GiB 5894.89 GB)
  Used Dev Size : 1918907904 (1830.01 GiB 1964.96 GB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Jul 23 22:15:05 2014
          State : active, degraded, recovering
 Active Devices : 4
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

 Rebuild Status : 39% complete

           Name : backend3:md4
           UUID : 894bc20e:b9479ac9:7bfce54f:0ac12dd9
         Events : 1659380

    Number   Major   Minor   RaidDevice State
       0       8       36        0      active sync   /dev/sdc4
       1       8       20        1      active sync   /dev/sdb4
       5       8       68        2      active sync   /dev/sde4
       4       8       52        3      active sync   /dev/sdd4
       6       8       81        4      spare rebuilding   /dev/sdf1

George,

This is not necessarily a RAID problem. Can you rule out a hardware
problem on one or more of the remaining disks? The disk you replaced
showed
Device: /dev/sdf [SAT], 35 Currently unreadable (pending) sectors
Hardware problems would explain the symptoms you describe.

What does smartctl report about your disks, in particular:
Offline_Uncorrectable
Current_Pending_Sector
Reallocated_Sector_Ct
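
A minimal sketch of how those three attributes could be pulled out of
"smartctl -A" output for several disks at once. It assumes the usual
ten-column attribute table with the raw value in the last column; the
function names are made up for illustration, and smartctl normally has
to be run as root.

```python
import subprocess
import sys

# The three attributes named above.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for the watched attributes.

    Assumes rows of the form:
    ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    found = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            try:
                found[fields[1]] = int(fields[9])
            except ValueError:
                # Raw value is not a plain integer on this drive.
                found[fields[1]] = None
    return found


def read_smart_attributes(device):
    """Run 'smartctl -A <device>' and parse its output."""
    # smartctl uses its exit status as a bitmask, so a non-zero status
    # is not treated as an error here.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_smart_attributes(result.stdout)


if __name__ == "__main__":
    # Example: python smart_check.py /dev/sdb /dev/sdc /dev/sdd /dev/sde
    for dev in sys.argv[1:]:
        print(dev, read_smart_attributes(dev))
```

Non-zero raw values for any of these, especially ones that grow
between runs, would point at the disk rather than at md.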

Also, is it always ata5.00 that is mentioned in syslog? dmesg and the
"lsdrv" script (search for it online) are useful for diagnosing this.
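
One way to answer the "is it always the same port" question is to
count the "exception Emask" lines per ATA port in the log. The sketch
below assumes the message format shown in the excerpts above; the
function name and the log path in the usage comment are illustrative
only.

```python
import re
import sys
from collections import Counter

# Matches e.g. "ata5.00: exception Emask 0x0 SAct 0x0 ..."
EXCEPTION_RE = re.compile(r"\b(ata\d+(?:\.\d+)?): exception Emask")


def count_ata_exceptions(log_text):
    """Count 'exception Emask' lines per ATA port/device in log_text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Example: python ata_errors.py /var/log/syslog
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as log_file:
            counts = count_ata_exceptions(log_file.read())
        for port, n in counts.most_common():
            print(port, n)
```

If the log records each kernel message twice, as in the excerpt above,
every count is doubled, but the comparison between ports still holds.
Mapping an ataN port back to a /dev/sdX device is a separate step, for
which dmesg or lsdrv can help.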

HTH,
Kay
