Re: Crash during raid6 reshape, now cannot restart?

From: Neil Brown <hidden>
Date: 2010-12-10 20:43:05

On Fri, 10 Dec 2010 09:05:47 -0800 Phil Genera [off-list ref] wrote:

I had a power failure during a large raid6 reshape (6->8 disks) on one
of my arm systems last night, and can't seem to get it going again.

I did this:
# mdadm --grow --backup-file=./backup.mdadm --array-size=8 /dev/md0

which (I've now noticed) didn't seem to write a backup file. There was
a read error during the reshape, but it claimed recovery:
Dec  9 20:48:07 love kernel: sd 2:0:0:0: [sda] Unhandled sense code
Dec  9 20:48:07 love kernel: sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Dec  9 20:48:07 love kernel: sd 2:0:0:0: [sda] Sense Key : Medium Error [current]
Dec  9 20:48:07 love kernel: sd 2:0:0:0: [sda] Add. Sense: Unrecovered read error
Dec  9 20:48:07 love kernel: sd 2:0:0:0: [sda] CDB: Read(10): 28 00 00 02 09 60 00 00 20 00
Dec  9 20:48:07 love kernel: end_request: I/O error, dev sda, sector 133472
Dec  9 20:48:08 love kernel: raid5:md0: read error corrected (8 sectors at 133472 on sda)
Dec  9 20:48:08 love kernel: raid5:md0: read error corrected (8 sectors at 133480 on sda)
Dec  9 20:48:08 love kernel: raid5:md0: read error corrected (8 sectors at 133488 on sda)
Dec  9 20:48:08 love kernel: raid5:md0: read error corrected (8 sectors at 133496 on sda)

Some time during the night, the electricity went away, and on reboot I get this:

raid5: reshape_position too early for auto-recovery - aborting.

Something must be going wrong with the math in raid5:

               if (mddev->delta_disks < 0
                    ? (here_new * mddev->new_chunk_sectors <=
                       here_old * mddev->chunk_sectors)
                    : (here_new * mddev->new_chunk_sectors >=
                       here_old * mddev->chunk_sectors)) {
                        /* Reading from the same stripe as writing to - bad */
                        printk(KERN_ERR "raid5: reshape_position too early for "
                               "auto-recovery - aborting.\n");
                        return -EINVAL;
                }

There, 'here_new * new_chunk_sectors' must be overflowing.  So the size of the
array must only just fit into sector_t.
On an arm5 you would need to have CONFIG_LBD set - do you know if it is?

I guess I need to make that code more robust when sector_t doesn't have lots
more bits than the size of the device...
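
A minimal sketch of the kind of wrap-around described above, assuming
sector_t is only 32 bits wide (as it would be on a 32-bit ARM kernel built
without CONFIG_LBD).  It does not reproduce the exact expression in the
raid5 check; it only compares the array's usable size in 512-byte sectors
against what 32 bits can hold.  The function names are hypothetical and
exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* mdadm --detail prints "Used Dev Size" in KiB; one sector is 512 bytes. */
uint64_t kib_to_sectors(uint64_t kib)
{
    return kib * 2;
}

/* Usable array size in sectors, computed in 64 bits. */
uint64_t array_sectors_64(uint64_t dev_sectors, unsigned data_disks)
{
    assert(data_disks > 0);
    return dev_sectors * data_disks;
}

/* The same product as a 32-bit type would store it (truncated modulo 2^32). */
uint32_t array_sectors_32(uint64_t dev_sectors, unsigned data_disks)
{
    return (uint32_t)array_sectors_64(dev_sectors, data_disks);
}

/* True if a sector count is representable in 32 bits. */
bool fits_in_32_bits(uint64_t sectors)
{
    return sectors <= UINT32_MAX;
}
```

With the figures from the --detail output (390711296 KiB per device, raid6
so two devices' worth of parity), each device holds 781422592 sectors.  The
old 6-disk layout has 4 data disks, giving 3125690368 sectors, which still
fits in 32 bits.  The new 8-disk layout has 6 data disks, giving 4688535552
sectors, which does not fit; truncated to 32 bits it becomes 393568256.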

If you can compile your own kernel, you should be able to get it to work
easily.  If not ... complain to whoever provided you with a kernel.

NeilBrown

as well as when I try to assemble the array manually. There's nothing
critical I don't have backed up, but there's a lot of TV on there I
was planning to watch :).

Any good ideas? I'd sure appreciate some help. I'm guessing this is
just a crash in the critical section, and without a backup file I'm
screwed. I'm surprised the backup file is still needed 200gb into the
reshape though. Thanks!


Versions & status:

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : inactive sdg[0] sdj[7] sdi[6] sdf[5] sde[4] sdd[3] sdc[2] sdh[1]
      3125690368 blocks super 0.91

# uname -a
Linux love 2.6.32-5-kirkwood #1 Sun Oct 31 11:19:32 UTC 2010 armv5tel GNU/Linux
# mdadm --version
mdadm - v3.1.4 - 31st August 2010


More details (and --examine of all disks attached):

# mdadm --detail /dev/md0
/dev/md0:
        Version : 0.91
  Creation Time : Fri Oct  9 09:32:08 2009
     Raid Level : raid6
  Used Dev Size : 390711296 (372.61 GiB 400.09 GB)
   Raid Devices : 8
  Total Devices : 8
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Fri Dec 10 05:52:35 2010
          State : active, Not Started
 Active Devices : 8
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

  Delta Devices : 2, (6->8)

           UUID : 81ddccd8:5abf5b03:181548d9:47e92625
         Events : 0.1048248

    Number   Major   Minor   RaidDevice State
       0       8       96        0      active sync   /dev/sdg
       1       8      112        1      active sync   /dev/sdh
       2       8       32        2      active sync   /dev/sdc
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       80        5      active sync   /dev/sdf
       6       8      128        6      active sync   /dev/sdi
       7       8      144        7      active sync   /dev/sdj

--
Phil
