Re: kernel BUG at fs/ext4/inode.c:1721!

From: Jan Kara <jack@suse.cz>
Date: 2021-10-20 16:38:39
Also in: lkml

On Mon 11-10-21 19:11:24, Eric Whitney wrote:

* Borislav Petkov [off-list ref]:

Hi Eric,

On Fri, Oct 08, 2021 at 01:33:05PM -0400, Eric Whitney wrote:

Hi, Boris - thanks very much for your report.

sure, np.

Was your kernel configured with the CONFIG_FS_ENCRYPTION option?

$ grep CONFIG_FS_ENCRYPTION /boot/config-5.15.0-rc4+ 
# CONFIG_FS_ENCRYPTION is not set

Could you please provide the output of the mount command for the affected
file system?

Well, I can't figure out from dmesg - it's all I have from that run -
which fs it was. So lemme give you all ext4 ones:

$ mount | grep ext4
/dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
/dev/sdc1 on /home type ext4 (rw,noatime)
/dev/sda1 on /mnt/oldhome type ext4 (rw,noatime)
/dev/sdb1 on /mnt/smr type ext4 (rw,noatime)
/dev/nvme1n1p1 on /mnt/kernel type ext4 (rw,nosuid,nodev,noatime,user)

Do you recall what sort of code might have been running on this system at
the time of failure (for example, kernel build, desktop apps, etc.)?

Good question. I'm not sure. Kernel build is likely as I do those on
that workstation constantly.

Unfortunately, I don't have an exact reproducer. And I can't debug stuff
on that box since it is my workstation and I've reverted it to 5.14.

What I can do is, I can slap 5.15-rc4 or whichever version you'd want me
to, on a test box and try running kernel builds or some other load to
see whether it would fire. I have a similar box to my workstation.

Or if you have a better idea...

Hi, Boris:

I've tried numerous kernel builds with -rc4 and rerun the full set of xfstests
we use when regressing ext4 each rc using a kernel that doesn't enable
FS_ENCRYPTION (I normally run with that) without luck.  The code that caused
the splat you saw is new and would run when an assertion is violated,
suggesting that there may be an unsuspected bug elsewhere in ext4.

Do you recall having seen any evidence of ENOMEM or ENOSPC conditions prior
to the failure?

If you're willing to share, please send along your kernel config file and I'll
try working with that as well.

In the meantime, should this bug get in your way, just revert the following
patch and you should be able to run without further trouble:

948ca5f30e1d "ext4: enforce buffer head state assertion in ext4_da_map_blocks"

I'll likely be posting a patch to revert this shortly, since it's going to
take some time to sort out what's going on without a reproducer.

Looking at this I can see that the assertion is 

BUG_ON(bh->b_blocknr != invalid_block);

and I suspect it is some kind of a race between ext4_da_map_blocks() and
writeback code? Writeback code holds only i_data_sem and page locks but
ext4_da_map_blocks() holds only page lock at that point. So page lock on
that particular page is the only thing that protects us from getting
outright out of date info from extent status tree. And I'm not sure all
extent status tree manipulations are careful enough to be also protected by
page locks of all pages that are inside given extent...
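
To make the suspected interleaving concrete, here is a user-space toy model.
It is not kernel code: the toy_* names are hypothetical stand-ins for the
buffer head and an extent status entry, the sentinel value is written from
memory of ext4_da_map_blocks() and should be treated as an assumption, and the
steps run sequentially rather than under real locking. It only shows that the
check fails when a buffer has already been given a real block while the extent
status entry still reads as delayed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Sentinel for "no real block yet" (assumed value, see the note above). */
#define TOY_INVALID_BLOCK (~((sector_t)0xffff))

/* Hypothetical stand-ins, not the kernel structures. */
struct toy_bh { sector_t b_blocknr; };  /* models struct buffer_head */
struct toy_es { bool delayed; };        /* models one extent status entry */

/* Models the check from the thread: for a delayed extent the buffer is
 * expected to still carry the sentinel block number. */
static bool toy_da_assert_holds(const struct toy_es *es,
                                const struct toy_bh *bh)
{
    if (!es->delayed)
        return true;    /* the check only applies to delayed extents */
    return bh->b_blocknr == TOY_INVALID_BLOCK;
}

/* Models writeback allocating a real block. If update_es is false, the
 * extent status entry is left stale, which is the suspected problem. */
static void toy_writeback(struct toy_es *es, struct toy_bh *bh,
                          sector_t blk, bool update_es)
{
    bh->b_blocknr = blk;
    if (update_es)
        es->delayed = false;
}

/* Runs: delayed write maps the buffer, writeback allocates, then the
 * delayed-allocation path looks at the same buffer again.
 * Returns whether the modelled assertion would hold at that point. */
static bool toy_run_scenario(bool update_es)
{
    struct toy_es es = { .delayed = true };
    struct toy_bh bh = { .b_blocknr = TOY_INVALID_BLOCK };

    toy_writeback(&es, &bh, 1234, update_es);
    return toy_da_assert_holds(&es, &bh);
}
```

With toy_run_scenario(true) the modelled check holds; with
toy_run_scenario(false) it does not, which corresponds to the stale extent
status case described above.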

								Honza
-- 
Jan Kara [off-list ref]
SUSE Labs, CR
