RE: warning on flushing page cache on block device removal

From: Parav Pandit <hidden>
Date: 2025-06-18 09:27:51
Also in: linux-fsdevel

Hi Jan and others,

From: Parav Pandit
Sent: 03 June 2025 07:03 PM
To: Jan Kara <jack@suse.cz>
Cc: linux-block@vger.kernel.org; linux-fsdevel@vger.kernel.org
Subject: RE: warning on flushing page cache on block device removal

Hi Jan,

quoted

From: Parav Pandit <redacted>
Sent: Tuesday, May 27, 2025 7:55 PM

quoted

From: Jan Kara <jack@suse.cz>
Sent: Tuesday, May 27, 2025 7:51 PM

On Tue 27-05-25 12:07:20, Parav Pandit wrote:

quoted

From: Jan Kara <jack@suse.cz>
Sent: Tuesday, May 27, 2025 5:27 PM

On Tue 27-05-25 11:00:56, Parav Pandit wrote:

quoted

From: Jan Kara <jack@suse.cz>
Sent: Monday, May 26, 2025 10:09 PM

Hello!

On Sat 24-05-25 05:56:55, Parav Pandit wrote:

quoted

I am running a basic test of block device driver unbind,
bind while the fio is running random write IOs with
direct=0.  The test hits the WARN_ON assert on:

void pagecache_isize_extended(struct inode *inode, loff_t
from, loff_t
to) {
        int bsize = i_blocksize(inode);
        loff_t rounded_from;
        struct folio *folio;

        WARN_ON(to > inode->i_size);

This is because when the block device is removed during
driver unbind, the driver flow is,

del_gendisk()
    __blk_mark_disk_dead()
            set_capacity((disk, 0);
                bdev_set_nr_sectors()
                    i_size_write() -> This will set the
inode's isize to 0, while the

page cache is yet to be flushed.

quoted

Below is the kernel call trace.

Can someone help to identify, where should be the fix?
Should block layer to not set the capacity to 0?
Or page catch to overcome this dynamic changing of the size?
Or?

After thinking about this the proper fix would be for
i_size_write() to happen under i_rwsem because the change in
the middle of the write is what's confusing the iomap code.
I smell some deadlock potential here but it's perhaps worth
trying :)

Without it, I gave a quick try with inode_lock() unlock() in
i_size_write() and initramfs level it was stuck.  I am yet to
try with LOCKDEP.

You definitely cannot put inode_lock() into i_size_write().
i_size_write() is expected to be called under inode_lock. And
bdev_set_nr_sectors() is breaking this rule by not holding it.
So what you can try is to do
inode_lock() in bdev_set_nr_sectors() instead of grabbing bd_size_lock.

I replaced the bd_size_lock with inode_lock().
Was unable to reproduce the issue yet with the fix.

However, it right away breaks the Atari floppy driver who invokes
set_capacity() in queue_rq() at [1]. !!

[1]
https://elixir.bootlin.com/linux/v6.15/source/drivers/block/ataflop.c#L1544

With my limited knowledge I find the fix risky as bottom block layer is invoking
upper FS layer inode lock.
I suspect it may lead to A->B, B->A locking in some path.

Other than Atari floppy driver, I didn't find any other offending driver, but its
hard to say, its safe from A->B, B->A deadlock.
A = inode lock
B = block driver level lock

quoted

Ok. will try this.
I am off for few days on travel, so earliest I can do is on Sunday.

quoted

I was thinking, can the existing sequence lock be used for
64-bit case as well?

The sequence lock is about updating inode->i_size value itself.
But we need much larger scale protection here - we need to make
sure write to the block device is not happening while the device
size changes. And that's what inode_lock is usually used for.

Other option to explore (with my limited knowledge) is, When the
block device is removed, not to update the size,

Because queue dying flag and other barriers are placed to prevent
the IOs

entering lower layer or to fail them.

quoted

Can that be the direction to fix?

Well, that's definitely one line of defense and it's enough for
reads but for writes you don't want them to accumulate in the page
cache (and thus consume memory) when you know you have no way to
write

them

quoted

out. So there needs to be some way for buffered writes to recognize
the backing store is gone and stop them before dirtying pages.
Currently that's achieved by reducing i_size, we can think of other
mechanisms but reducing i_size is kind of elegant if we can
synchronize that

properly...

quoted

The block device notifies the bio layer by calling
blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue); Maybe we can

come

quoted

up with notification method that updates some flag to page cache layer
to drop buffered writes to floor.

Or other direction to explore, if the WAR_ON() is still valid, as it
can change anytime?

Is below WARN_ON() still valid, given the disk size can change any time?

void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to) {
        int bsize = i_blocksize(inode);
        loff_t rounded_from;
        struct folio *folio;

        WARN_ON(to > inode->i_size);

quoted

								Honza
--
Jan Kara [off-list ref]
SUSE Labs, CR

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help