RE: warning on flushing page cache on block device removal
From: Parav Pandit <hidden>
Date: 2025-06-18 09:27:51
Also in:
linux-fsdevel
Hi Jan and others,
From: Parav Pandit Sent: 03 June 2025 07:03 PM To: Jan Kara <jack@suse.cz> Cc: linux-block@vger.kernel.org; linux-fsdevel@vger.kernel.org Subject: RE: warning on flushing page cache on block device removal Hi Jan,quoted
From: Parav Pandit <redacted> Sent: Tuesday, May 27, 2025 7:55 PMquoted
From: Jan Kara <jack@suse.cz> Sent: Tuesday, May 27, 2025 7:51 PM On Tue 27-05-25 12:07:20, Parav Pandit wrote:quoted
quoted
From: Jan Kara <jack@suse.cz> Sent: Tuesday, May 27, 2025 5:27 PM On Tue 27-05-25 11:00:56, Parav Pandit wrote:quoted
quoted
From: Jan Kara <jack@suse.cz> Sent: Monday, May 26, 2025 10:09 PM Hello! On Sat 24-05-25 05:56:55, Parav Pandit wrote:quoted
I am running a basic test of block device driver unbind, bind while the fio is running random write IOs with direct=0. The test hits the WARN_ON assert on: void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to) { int bsize = i_blocksize(inode); loff_t rounded_from; struct folio *folio; WARN_ON(to > inode->i_size); This is because when the block device is removed during driver unbind, the driver flow is, del_gendisk() __blk_mark_disk_dead() set_capacity((disk, 0); bdev_set_nr_sectors() i_size_write() -> This will set the inode's isize to 0, while thepage cache is yet to be flushed.quoted
Below is the kernel call trace. Can someone help to identify, where should be the fix? Should block layer to not set the capacity to 0? Or page catch to overcome this dynamic changing of the size? Or?After thinking about this the proper fix would be for i_size_write() to happen under i_rwsem because the change in the middle of the write is what's confusing the iomap code. I smell some deadlock potential here but it's perhaps worth trying :)Without it, I gave a quick try with inode_lock() unlock() in i_size_write() and initramfs level it was stuck. I am yet to try with LOCKDEP.You definitely cannot put inode_lock() into i_size_write(). i_size_write() is expected to be called under inode_lock. And bdev_set_nr_sectors() is breaking this rule by not holding it. So what you can try is to do inode_lock() in bdev_set_nr_sectors() instead of grabbing bd_size_lock.I replaced the bd_size_lock with inode_lock(). Was unable to reproduce the issue yet with the fix. However, it right away breaks the Atari floppy driver who invokes set_capacity() in queue_rq() at [1]. !! [1] https://elixir.bootlin.com/linux/v6.15/source/drivers/block/ataflop.c#L1544 With my limited knowledge I find the fix risky as bottom block layer is invoking upper FS layer inode lock. I suspect it may lead to A->B, B->A locking in some path. Other than Atari floppy driver, I didn't find any other offending driver, but its hard to say, its safe from A->B, B->A deadlock. A = inode lock B = block driver level lockquoted
quoted
quoted
Ok. will try this. I am off for few days on travel, so earliest I can do is on Sunday.quoted
quoted
I was thinking, can the existing sequence lock be used for 64-bit case as well?The sequence lock is about updating inode->i_size value itself. But we need much larger scale protection here - we need to make sure write to the block device is not happening while the device size changes. And that's what inode_lock is usually used for.Other option to explore (with my limited knowledge) is, When the block device is removed, not to update the size, Because queue dying flag and other barriers are placed to prevent the IOsentering lower layer or to fail them.quoted
Can that be the direction to fix?Well, that's definitely one line of defense and it's enough for reads but for writes you don't want them to accumulate in the page cache (and thus consume memory) when you know you have no way to writethemquoted
out. So there needs to be some way for buffered writes to recognize the backing store is gone and stop them before dirtying pages. Currently that's achieved by reducing i_size, we can think of other mechanisms but reducing i_size is kind of elegant if we can synchronize thatproperly...quoted
The block device notifies the bio layer by calling blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue); Maybe we cancomequoted
up with notification method that updates some flag to page cache layer to drop buffered writes to floor. Or other direction to explore, if the WAR_ON() is still valid, as it can change anytime?
Is below WARN_ON() still valid, given the disk size can change any time?
void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to) {
int bsize = i_blocksize(inode);
loff_t rounded_from;
struct folio *folio;
WARN_ON(to > inode->i_size);
quoted
quoted
Honza -- Jan Kara [off-list ref] SUSE Labs, CR