Re: regression caused by block: freeze the queue earlier in del_gendisk
From: Jens Axboe <axboe@kernel.dk>
Date: 2022-09-20 14:05:49
Also in: linux-block, lkml, regressions
On 9/20/22 3:11 AM, Thorsten Leemhuis wrote:
> Hi, this is your Linux kernel regression tracker.
>
> On 13.09.22 04:36, Dusty Mabe wrote:
>> On 9/12/22 21:55, Ming Lei wrote:
>>> On Mon, Sep 12, 2022 at 09:16:18AM +0200, Christoph Hellwig wrote:
>>>> On Fri, Sep 09, 2022 at 04:24:40PM +0800, Ming Lei wrote:
>>>>> On Wed, Sep 07, 2022 at 09:33:24AM +0200, Christoph Hellwig wrote:
>>>>>> On Thu, Sep 01, 2022 at 03:06:08PM +0800, Ming Lei wrote:
>>>>>>> It is a bit hard to associate the above commit with reported issue.
>>>>>>
>>>>>> So the messages clearly are about something trying to open a device that went away at the block layer, but somehow does not get removed in time by udev (which seems to be a userspace bug in CoreOS). But even with that we really should not hang.
>>>>>
>>>>> Xiao Ni provides one script[1] which can reproduce the issue more or less.
>>>>
>>>> I've run the reproducer 10000 times on current mainline, and while it prints one of the autoloading messages per run, I've not actually seen any kind of hang.
>>>
>>> I can't reproduce the hang too.
>>
>> I obviously can reproduce the issue with the test in our Fedora CoreOS test suite. It's part of a framework (i.e. it's not simply some script you can run) but it is very reproducible so one can add some instrumentation to the kernel and feed it through a build/test cycle to see different results or logs. I'm willing to share this with other people (maybe a screen share or some written down instructions) if anyone would be interested.
>
> This thread looked stalled, or was there any progress in the past week? If not: Fedora apparently removed the patch in their kernels a while ago, as quite a few users were hitting it. What is preventing us from doing the same in mainline and 5.19.y until the issue can be resolved? The description of a09b314005f3 ("block: freeze the queue earlier in del_gendisk") doesn't sound like the change does something crucial that can't wait a bit. I might be totally wrong with that, but I think it's my duty to ask that question at this point.

Christoph and I discussed this one last week, and he has a plan to try a flag approach. Christoph, did you get a chance to bang that out? Would be nice to get this one wrapped up.

-- 
Jens Axboe