Re: Please show descriptive message about degraded raid when booting

From: Patrick Dung <hidden>
Date: 2020-03-24 04:45:11

By the way, my original post was from a virtual machine. I disconnected
one of the members of the RAID 1.
I can't simulate a hardware failure with a VM, so no 'SCT Error
Recovery Control/TLER' timeout is involved.

Thanks,
Patrick

On Tue, Mar 24, 2020 at 2:33 AM Patrick Dung [off-list ref] wrote:

Thanks for the reply.

The problem occurs both on my physical hardware and in a virtual machine
(where I can't set TLER).
The log in my original post was captured from a virtual machine in
which I simulated the failure.

The system is not 'hung'. If I boot with rd.debug, lots of messages
scroll by too quickly to read.

What I am asking for is a more descriptive message from MD RAID that
shows the status, for example:
Trying to activate md/raid1:md125, currently 1 of 2 disks online.
Timeout in X seconds.
Something like that.
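
Until the kernel prints something like this, a rough userspace
approximation is possible once /proc/mdstat is readable. The sketch below
is only an illustration: the function name degraded_arrays and the SAMPLE
text are made up for this example (the sample is not taken from the
machine in this thread), and it assumes the usual "[expected/active]"
counter that mdstat prints for active arrays.

```python
import re

ARRAY_RE = re.compile(r"^(md\S+) :")
LEVEL_RE = re.compile(r"\b(raid\d+|linear)\b")
COUNT_RE = re.compile(r"\[(\d+)/(\d+)\]")

# Illustrative input only, shaped like /proc/mdstat for a degraded mirror.
SAMPLE = """\
Personalities : [raid1]
md125 : active raid1 sda1[0]
      1048512 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return [(name, level, active, expected)] for arrays missing members."""
    results = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            level = LEVEL_RE.search(line)
            current = (header.group(1), level.group(1) if level else "unknown")
            continue
        counts = COUNT_RE.search(line)
        if current and counts:
            expected, active = int(counts.group(1)), int(counts.group(2))
            if active < expected:
                results.append((current[0], current[1], active, expected))
            current = None  # only the first counter after a header counts
    return results


def describe(entry):
    """Format one entry in the style requested in this mail."""
    name, level, active, expected = entry
    return f"md/{level}:{name}: currently {active} of {expected} disks online"


if __name__ == "__main__":
    for entry in degraded_arrays(SAMPLE):
        print(describe(entry))
```

On a real system the text would come from open("/proc/mdstat").read()
instead of SAMPLE. This only reports the state after assembly; it does
not show the remaining wait time during boot.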

Thanks,
Patrick

On Tue, Mar 24, 2020 at 2:14 AM Roger Heflin [off-list ref] wrote:

The system had hung.  The disks are failing inside the SCSI subsystem;
I don't believe the layers above it (RAID, LVM, multipath) know anything
about what is going on inside the SCSI layer.

The default SCSI timeouts are usually at least 30 seconds, but in the
past the SCSI subsystem did some retrying internally.  The timeout
needs to be higher than the length of time the disk could take.
Non-enterprise, non-RAID disks generally have this timeout set to 60-120
seconds, which is why MD waits to see whether the failure is a sector
read failure (no response until the disk times out) or a complete
disk failure (no response ever).

cat /sys/block/sda/device/timeout shows the timeout.
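
To check every disk at once rather than one cat per device, something
like the following could be used. It is a minimal sketch: the function
name read_scsi_timeouts is invented here, and it assumes each SCSI-backed
disk exposes device/timeout under /sys/block as in the command above.

```python
from pathlib import Path


def read_scsi_timeouts(sys_block="/sys/block"):
    """Return {device_name: timeout_seconds} for devices with device/timeout."""
    root = Path(sys_block)
    timeouts = {}
    if not root.is_dir():
        return timeouts  # not a Linux sysfs tree; nothing to report
    for dev in sorted(root.iterdir()):
        timeout_file = dev / "device" / "timeout"
        try:
            timeouts[dev.name] = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            # No SCSI timeout attribute (e.g. md or loop devices): skip.
            continue
    return timeouts


if __name__ == "__main__":
    for name, seconds in read_scsi_timeouts().items():
        print(f"{name}: {seconds}s")
```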

Read about scterc, TLER and smartctl for discussions of what is going on.

If you can turn down your disk's max timeout with the smartctl
commands, the disk will report back a sector failure faster, and
that is usually what is happening.  If you turn down the disk's timeout
to a max of say 7 seconds, then you can set the SCSI layer's timeout to
say 10 seconds.  Then the only time the SCSI timeout matters is if
the disk is there but not responding.
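
The 7-second / 10-second pairing can be sketched as below. This only
builds the command and checks the relationship; it does not run anything.
The helper names are made up for illustration. It relies on smartctl's
"-l scterc,READ,WRITE" option taking values in units of 100 ms, so 7
seconds becomes 70, and it assumes the disk supports SCT ERC at all.

```python
def scterc_value(seconds):
    """Convert seconds to the 100 ms units that smartctl's scterc uses."""
    return round(seconds * 10)


def smartctl_command(device, erc_seconds):
    """Build (not run) the command setting read and write ERC limits."""
    value = scterc_value(erc_seconds)
    return ["smartctl", "-l", f"scterc,{value},{value}", device]


def scsi_timeout_is_safe(erc_seconds, scsi_timeout_seconds):
    """The SCSI layer must wait longer than the disk's own recovery limit."""
    return scsi_timeout_seconds > erc_seconds


if __name__ == "__main__":
    print(" ".join(smartctl_command("/dev/sda", 7)))
    print(scsi_timeout_is_safe(erc_seconds=7, scsi_timeout_seconds=10))
```

The SCSI side is then set by writing the number of seconds to the same
/sys/block/sda/device/timeout file shown earlier. A disk left at a
60-120 second internal limit with a 30 second SCSI timeout fails the
check, which is the case that leads to the disk being dropped.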


On Fri, Mar 20, 2020 at 11:50 AM Patrick Dung [off-list ref] wrote:

Hello,

Bump.

Got a reply from Fedora support, but they asked me to take this upstream.
https://bugzilla.redhat.com/show_bug.cgi?id=1794139

Thanks,
Patrick

On Thu, Mar 5, 2020 at 10:57 PM Patrick Dung [off-list ref] wrote:

Hello,

The system has Linux software RAID (md) RAID 1.
One of the disks is missing or has a problem.

The RAID is degraded.
When the OS boots, the kernel message output stalls at about the
three-second mark.
There is no descriptive message that the RAID is degraded.
I know the cause because I had written zeros to one of the disks of the
RAID 1. If I didn't know the cause (maybe a loose cable or a disk
failure), it would be confusing.

Related log:

[    2.917387] sd 32:0:0:0: [sda] 56623104 512-byte logical blocks: (29.0 GB/27.0 GiB)
[    2.917446] sd 32:0:1:0: [sdb] 56623104 512-byte logical blocks: (29.0 GB/27.0 GiB)
[    2.917499] sd 32:0:0:0: [sda] Write Protect is off
[    2.917516] sd 32:0:0:0: [sda] Mode Sense: 61 00 00 00
[    2.917557] sd 32:0:1:0: [sdb] Write Protect is off
[    2.917575] sd 32:0:1:0: [sdb] Mode Sense: 61 00 00 00
[    2.917615] sd 32:0:0:0: [sda] Cache data unavailable
[    2.917636] sd 32:0:0:0: [sda] Assuming drive cache: write through
[    2.917661] sd 32:0:1:0: [sdb] Cache data unavailable
[    2.917677] sd 32:0:1:0: [sdb] Assuming drive cache: write through
[    2.927076] sd 32:0:0:0: [sda] Attached SCSI disk
[    2.927458]  sdb: sdb1 sdb2 sdb3 sdb4
[    2.929018] sd 32:0:1:0: [sdb] Attached SCSI disk
[    3.060855] vmxnet3 0000:0b:00.0 ens192: intr type 3, mode 0, 3 vectors allocated
[    3.061826] vmxnet3 0000:0b:00.0 ens192: NIC Link is Up 10000 Mbps
[  139.411464] md/raid1:md125: active with 1 out of 2 mirrors
[  139.412176] md125: detected capacity change from 0 to 1073676288
[  139.433441] md/raid1:md126: active with 1 out of 2 mirrors
[  139.434182] md126: detected capacity change from 0 to 314507264
[  139.436894]  md126:
[  139.455511] md/raid1:md127: active with 1 out of 2 mirrors
[  139.456739] md127: detected capacity change from 0 to 27582726144

So there are about 130 seconds without any descriptive messages. I
thought the system had hung.
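
For anyone who wants to locate such a silence in a saved boot log, the
arithmetic can be automated. This is a small sketch, not an existing
tool: largest_gap is a name chosen here, and it assumes the usual
"[ seconds.micros]" timestamp prefix seen in the log above. The three
sample lines are copied from that log.

```python
import re

STAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")

LINES = [
    "[    2.929018] sd 32:0:1:0: [sdb] Attached SCSI disk",
    "[    3.061826] vmxnet3 0000:0b:00.0 ens192: NIC Link is Up 10000 Mbps",
    "[  139.411464] md/raid1:md125: active with 1 out of 2 mirrors",
]


def largest_gap(log_lines):
    """Return (gap_seconds, line_before, line_after) for the longest silence."""
    best = (0.0, None, None)
    previous = None
    for line in log_lines:
        match = STAMP_RE.match(line)
        if not match:
            continue  # wrapped continuation lines carry no timestamp
        stamp = float(match.group(1))
        if previous is not None and stamp - previous[0] > best[0]:
            best = (stamp - previous[0], previous[1], line)
        previous = (stamp, line)
    return best


if __name__ == "__main__":
    gap, before, after = largest_gap(LINES)
    print(f"{gap:.1f}s of silence between:")
    print(before)
    print(after)
```

On these lines the longest silence falls between the vmxnet3 link-up
message and the first md/raid1 message, roughly 136 seconds.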

Could the kernel display more descriptive messages about the RAID?

If I use the rd.debug boot parameter, I can see the kernel is still
running, but the output scrolls by very fast and I can't tell what the
problem is.

Thanks,
Patrick
