Re: Software RAID6 broke after power outage

From: Wols Lists <hidden>
Date: 2020-07-22 09:14:48

On 22/07/20 08:41, Cory Derenburger wrote:

My server lost power this morning. The server is running Linux Mint
(14?) on a battery backup, and I believe it shut down before losing
power. Upon restarting, the computer hung for a while, and after
resetting and booting up in recovery mode my RAID is now
nonfunctional.

The server was set up years ago with a RAID 6 array built with mdadm.
To be honest I don't really know what is wrong with the array, but it
seems to be an issue with disk sdc. I wanted to reach out for help to
confirm the issue and get some guidance before proceeding (or making
things worse).

Any assistance that can help me determine what steps to take to get
this server back up and running would be greatly appreciated. It's
been 4+ since I have touched RAID, and only attempted a recovery once.
If anyone can help I would be super appreciative.

https://raid.wiki.kernel.org/index.php/Linux_Raid#When_Things_Go_Wrogn
https://raid.wiki.kernel.org/index.php/Asking_for_help

I see you've included some stuff which is helpful, but can you do
everything that last page asks for? In particular, lsdrv.

Below I'm including outputs from various commands for the 3rd disk
which seems to be the culprit

dmesg - boot section where the first errors begin occurring
[    2.637856] md: bind<sdd1>
[    2.646987] random: nonblocking pool is initialized
[    2.647432] md: bind<sde1>
[    2.651429] md: bind<sdb1>
[    2.863538] ata3.00: exception Emask 0x0 SAct 0x10 SErr 0x0 action 0x0
[    2.863594] ata3.00: irq_stat 0x40000008
[    2.863643] ata3.00: failed command: READ FPDMA QUEUED
[    2.863695] ata3.00: cmd 60/08:20:08:08:00/00:00:00:00:00/40 tag 4 ncq 4096 in
[    2.863695]          res 41/40:00:09:08:00/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[    2.863775] ata3.00: status: { DRDY ERR }
[    2.863822] ata3.00: error: { UNC }
[    2.873407] ata3.00: configured for UDMA/133
[    2.873476] sd 2:0:0:0: [sdc] Unhandled sense code
[    2.873525] sd 2:0:0:0: [sdc]
[    2.873571] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[    2.873619] sd 2:0:0:0: [sdc]
[    2.873665] Sense Key : Medium Error [current] [descriptor]
[    2.873819] Descriptor sense data with sense descriptors (in hex):
[    2.873901]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[    2.874544]         00 00 08 09
[    2.874764] sd 2:0:0:0: [sdc]
[    2.874811] Add. Sense: Unrecovered read error - auto reallocate failed
[    2.874895] sd 2:0:0:0: [sdc] CDB:
[    2.874941] Read(10): 28 00 00 00 08 08 00 00 08 00
[    2.875428] end_request: I/O error, dev sdc, sector 2057
[    2.875478] Buffer I/O error on device sdc1, logical block 1

cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sdb1[0](S) sde1[3](S) sdd1[2](S)
      5860147464 blocks super 1.2

{not sure why these drives are now showing as spares}

This is very common when an array fails to assemble properly.
Unfortunately, when there's one error, it often triggers a cascade of
fake errors, and this is probably the case here.
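To make the "(S)" flags easier to read at a glance, here is a minimal Python sketch that parses the md0 line quoted above. The function names are made up for illustration, and the regular expression only covers the simple "name[number](flag)" form seen in this output. It shows that the array is inactive and that every listed member carries the (S) flag, which fits an array that failed to assemble rather than one that really has three spares.

```python
import re

# The md0 line exactly as it appears in the /proc/mdstat output above.
MDSTAT_LINE = "md0 : inactive sdb1[0](S) sde1[3](S) sdd1[2](S)"


def parse_mdstat_line(line):
    """Split one mdstat array line into (array name, state, members).

    Each member is a tuple (device, number, flag); flag is None when
    the device has no parenthesised flag such as (S) or (F).
    """
    name, rest = line.split(" : ", 1)
    fields = rest.split()
    state = fields[0]
    members = []
    for field in fields[1:]:
        # Fields like "raid6" (the level on an active array) do not
        # match this pattern and are skipped.
        match = re.fullmatch(r"(\w+)\[(\d+)\](?:\((\w)\))?", field)
        if match:
            members.append((match.group(1), int(match.group(2)), match.group(3)))
    return name, state, members


def all_flagged_spare(members):
    """True when there is at least one member and all of them show (S)."""
    return bool(members) and all(flag == "S" for _, _, flag in members)


if __name__ == "__main__":
    name, state, members = parse_mdstat_line(MDSTAT_LINE)
    print(name, state)
    for device, number, flag in members:
        print(f"  {device} number={number} flag={flag}")
    print("every listed member flagged (S):", all_flagged_spare(members))
```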

Below is the output of mdadm --examine for sdc.  sdb, sdd and sde appear fine when checked the same way.

mdadm --examine /dev/sdc
/dev/sdc:   MBR Magic : aa55
Partition[0] :   3907027120 sectors at         2048 (type fd)

mdadm --examine /dev/sdc1
mdadm: No md superblock detected on /dev/sdc1.
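One possible reading of the missing superblock, as a sketch rather than a diagnosis: a version 1.2 superblock sits 4 KiB from the start of the member device, and the unreadable sector in the dmesg output falls inside that same 4 KiB block of sdc1. The Python below just does the arithmetic using the numbers quoted in this mail (failed sector 2057, partition start 2048, 512-byte sectors). The 4096-byte block size is an assumption inferred from the "logical block 1" message, and the function name is hypothetical.

```python
SECTOR_BYTES = 512        # logical sector size reported by fdisk
PARTITION_START = 2048    # first sector of /dev/sdc1, from mdadm --examine /dev/sdc
FAILED_SECTOR = 2057      # from "end_request: I/O error, dev sdc, sector 2057"
BLOCK_BYTES = 4096        # assumed buffer block size (matches "logical block 1")
SUPERBLOCK_OFFSET = 4096  # v1.2 superblock: 4 KiB from the start of the member


def describe_failed_sector(failed_sector, partition_start,
                           sector_bytes=SECTOR_BYTES,
                           block_bytes=BLOCK_BYTES,
                           superblock_offset=SUPERBLOCK_OFFSET):
    """Translate a whole-disk sector number into partition-relative offsets."""
    offset_sectors = failed_sector - partition_start
    offset_bytes = offset_sectors * sector_bytes
    logical_block = offset_bytes // block_bytes
    superblock_block = superblock_offset // block_bytes
    return {
        "offset_sectors": offset_sectors,
        "offset_bytes": offset_bytes,
        "logical_block": logical_block,
        "same_block_as_superblock": logical_block == superblock_block,
    }


if __name__ == "__main__":
    info = describe_failed_sector(FAILED_SECTOR, PARTITION_START)
    for key, value in info.items():
        print(f"{key}: {value}")
```

If that reading is right, the metadata on sdc1 may simply be unreadable rather than absent, but the SMART data the wiki page asks for would be needed to say more.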

fdisk -l
Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes
81 heads, 63 sectors/track, 765633 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x38389fdc

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048  3907029167  1953513560   fd  Linux raid autodetect

Disk /dev/sdc: 2000.4 GB, 2000398934016 bytes
81 heads, 63 sectors/track, 765633 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xd108824d

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1            2048  3907029167  1953513560   fd  Linux raid autodetect

Disk /dev/sdd: 2000.4 GB, 2000398934016 bytes
81 heads, 63 sectors/track, 765633 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x6207659a

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1            2048  3907029167  1953513560   fd  Linux raid autodetect

Disk /dev/sde: 2000.4 GB, 2000398934016 bytes
81 heads, 63 sectors/track, 765633 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xd9a4afcf

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1            2048  3907029167  1953513560   fd  Linux raid autodetect


Is there other information needed to determine the issue?  Where do I
go from here?

How old is your Linux Mint install? Have you kept it up-to-date?
Unfortunately, it seems a lot of older systems suffer issues when the
kernel is heavily patched and mdadm is not updated, and this regularly
surfaces on this list where Ubuntu is concerned ...

mdadm --version
uname -a

Make sure you have a "latest and greatest" rescue disk to hand, and
we'll see what the others say.

Cheers,
Wol
