Re: regression caused by block: freeze the queue earlier in del_gendisk
From: Thorsten Leemhuis <hidden>
Date: 2022-08-28 10:24:28
Also in: lkml, regressions
Hi, this is your Linux kernel regression tracker.

Thx for the report. CCing the regression mailing list, as it should be in the loop for all regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

On 26.08.22 18:15, Dusty Mabe wrote:
I think I've found a regression introduced by: a09b314 o block: freeze the queue earlier in del_gendisk
Just FYI, in case you are not aware of it already: there was another report that this commit causes problems. See this thread for details:
https://lore.kernel.org/all/72a5bf2e-cd56-a85c-2b99-cb8729a66fed@deltatee.com/#t

Anyway, let me add this report to the regressions tracking:

[TLDR: I'm adding this regression report to the list of tracked regressions; all text from me you find below is based on a few template paragraphs you might have encountered already in similar form.]
In Fedora CoreOS we have tests that set up RAID1 on the /boot/ and /root/ partitions and then subsequently removes one of the disks to simulate a failure. Sometime recently this test started timing out occasionally. Looking a bit closer it appears instances are getting stuck during reboot with a bunch of looping messages:

[ 17.978854] block device autoloading is deprecated and will be removed.
[ 17.982555] block device autoloading is deprecated and will be removed.
[ 17.985537] block device autoloading is deprecated and will be removed.
[ 17.987546] block device autoloading is deprecated and will be removed.
[ 17.989540] block device autoloading is deprecated and will be removed.
[ 17.991547] block device autoloading is deprecated and will be removed.
[ 17.993555] block device autoloading is deprecated and will be removed.
[ 17.995539] block device autoloading is deprecated and will be removed.
[ 17.997577] block device autoloading is deprecated and will be removed.
[ 17.999544] block device autoloading is deprecated and will be removed.
[ 22.979465] blkdev_get_no_open: 1666 callbacks suppressed
...
...
...
[ 618.221270] blkdev_get_no_open: 1664 callbacks suppressed
[ 618.221273] block device autoloading is deprecated and will be removed.
[ 618.224274] block device autoloading is deprecated and will be removed.
[ 618.227267] block device autoloading is deprecated and will be removed.
[ 618.229274] block device autoloading is deprecated and will be removed.
[ 618.231277] block device autoloading is deprecated and will be removed.
[ 618.233277] block device autoloading is deprecated and will be removed.
[ 618.235282] block device autoloading is deprecated and will be removed.
[ 618.237370] block device autoloading is deprecated and will be removed.
[ 618.239356] block device autoloading is deprecated and will be removed.
[ 618.241290] block device autoloading is deprecated and will be removed.

Using the Fedora kernels I narrowed it down to being introduced between `kernel-5.19.0-0.rc3.27.fc37` (good) and `kernel-5.19.0-0.rc4.33.fc37` (bad).

I then did a bisect and found:

$ git bisect bad
a09b314005f3a0956ebf56e01b3b80339df577cc is the first bad commit
commit a09b314005f3a0956ebf56e01b3b80339df577cc
Author: Christoph Hellwig <hch@lst.de>
Date:   Tue Jun 14 09:48:27 2022 +0200

    block: freeze the queue earlier in del_gendisk

    Freeze the queue earlier in del_gendisk so that the state does not
    change while we remove debugfs and sysfs files.

    Ming mentioned that being able to observer request in debugfs might
    be useful while the queue is being frozen in del_gendisk, which is
    made possible by this change.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20220614074827.458955-5-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

 block/genhd.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Reverting this commit on top of latest git master (4c612826b) gave me successful results.

Any ideas on what could be amiss here? Luckily the patch is tiny so hopefully it might be obvious.

More details (including logs) in the following locations:

- https://bugzilla.redhat.com/show_bug.cgi?id=2121791
- https://github.com/coreos/fedora-coreos-tracker/issues/1282

Thanks!
Dusty
Thanks for the report. To be sure the issue below doesn't fall through the cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced a09b314005f3a0
#regzbot title block: timeouts when removing a disk from a RAID1
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already discussed somewhere else? It was fixed already? You want to clarify when the regression started to happen? Or point out I got the title or something else totally wrong? Then just reply -- ideally with also telling regzbot about it, as explained here:
https://linux-regtracking.leemhuis.info/tracked-regression/

Reminder for developers: When fixing the issue, add 'Link:' tags pointing to the report (the mail this one replies to), as explained in the Linux kernel's documentation; the webpage above explains why this is important for tracked regressions.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of reports and sometimes miss something important when writing mails like this. If that's the case here, don't hesitate to tell me in a public reply, it's in everyone's interest to set the public record straight.
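
For readers unfamiliar with the `#regzbot` lines used in this mail: each one is an ordinary line of the mail body that starts with `#regzbot`, followed by a command and, for some commands, an argument. The following is only a minimal sketch of how such lines could be picked out of a mail body. It is not regzbot's actual implementation, and the helper name `extract_regzbot_commands` and the regular expression are assumptions made for illustration.

```python
import re

# Hypothetical pattern (an assumption, not taken from regzbot itself):
# "#regzbot", a command word, and an optional free-form argument.
REGZBOT_LINE = re.compile(r"^#regzbot\s+(\S+)(?:\s+(.*\S))?\s*$")


def extract_regzbot_commands(mail_body):
    """Return (command, argument) pairs for every '#regzbot' line.

    Commands without an argument get an empty string as argument.
    """
    commands = []
    for line in mail_body.splitlines():
        match = REGZBOT_LINE.match(line.strip())
        if match:
            commands.append((match.group(1), match.group(2) or ""))
    return commands


# The three command lines from this mail.
example = """\
#regzbot ^introduced a09b314005f3a0
#regzbot title block: timeouts when removing a disk from a RAID1
#regzbot ignore-activity
"""

for command, argument in extract_regzbot_commands(example):
    print(command, argument)
```

Keeping one command per line, as in the mail, is what makes this kind of line-based matching possible; if the lines were run together, the title argument could not be separated from the next command.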