Thread (33 messages) 33 messages, 2 authors, 2020-07-08

Re: Assemblin journaled array fails

From: Song Liu <song@kernel.org>
Date: 2020-06-17 17:11:07

On Wed, Jun 17, 2020 at 5:53 AM Michal Soltys [off-list ref] wrote:
On 6/17/20 2:48 AM, Song Liu wrote:
quoted
On Tue, Jun 9, 2020 at 2:51 PM Michal Soltys [off-list ref] wrote:
quoted
On 20/06/09 20:36, Song Liu wrote:
quoted
On Tue, Jun 9, 2020 at 2:36 AM Michal Soltys [off-list ref] wrote:
quoted
On 6/5/20 2:26 PM, Michal Soltys wrote:
quoted
On 6/4/20 12:07 AM, Song Liu wrote:
quoted
The hang happens at expected place.
quoted
[Jun 3 09:02] INFO: task mdadm:2858 blocked for more than 120 seconds.
[  +0.060545]       Tainted: G            E
5.4.19-msl-00001-gbf39596faf12 #2
[  +0.062932] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
Could you please try disable the timeout message with

echo 0 > /proc/sys/kernel/hung_task_timeout_secs

And during this wait (after message
"r5c_recovery_flush_data_only_stripes before wait_event"),
checks whether the raid disks (not the journal disk) are taking IOs
(using tools like iostat).
No activity on component drives.
To expand on that - while there is no i/o activity whatsoever at the component drives (as well as journal), the cpu is of course still fully loaded (5 days so far):

UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
root      8129  6755 15   740  1904  10 Jun04 pts/2    17:42:34 mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/sdj1 /dev/sdi1 /dev/sdg1 /dev/sdh1
root      8147     2 84     0     0  30 Jun04 ?        4-02:09:47 [md124_raid5]
I guess the md thread stuck at some stripe. Does the kernel have
CONFIG_DYNAMIC_DEBUG enabled? If so, could you please try enable some pr_debug()
in function handle_stripe()?

Thanks,
Song
Massive spam in dmesg with messages like these:

[464836.603033] handling stripe 1551540328, state=0x41 cnt=1, pd_idx=3,
qd_idx=-1
                  , check:0, reconstruct:0
<cut>
I am really sorry for the delay. A few questions:
No worries.
quoted
1. There are two pr_debug() calls in handle_stripe():
        pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
        pr_debug("locked=%d uptodate=%d to_read=%d"

     Did you enable all of them? Or only the first one?
I enabled all of them (I think), to be precise:

echo -n 'func handle_stripe +p' >/sys/kernel/debug/dynamic_debug/control

Haven't seen any `locked` lines though.
That's a little weird, and probably explains why we stuck. Could you
please try the attached patch?

Thanks,
Song

Attachments

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help