Re: JBD2: journal transaction 6943 on loop0-8 is corrupt.

From: Manfred Spraul <hidden>
Date: 2021-12-30 08:16:17

Hi Ted,

On 12/30/21 02:37, Theodore Ts'o wrote:

quoted hunk ↗ jump to hunk

On Tue, Dec 28, 2021 at 09:36:22PM +0100, Manfred Spraul wrote:

quoted

Hi,

with simulated power failures, I see a corrupted journal

[39056.200845] JBD2: journal transaction 6943 on loop0-8 is corrupt.
[39056.200851] EXT4-fs (loop0): error loading journal

This means that the journal replay found a commit which was *not* the
last commit, and which contained a CRC error.  If it's the last commit
(e.g., there is no valid subsequent commit block), then it's possible
that the journal commit was never completed before the system crashed

--- e.g., it was an interrupted commit.

It is the last commit, there are no valid subsequent commit blocks.

The current failure model is simple: all blocks up to block <n> are 
written, the blocks starting from <n+1> are discarded.

What I can't rule out, but I think this is not what I see:
The image is small (512 MB) and everything is in memory. Thus I would 
not rule out that the whole journal is filled within less than one second.
The commit header contains h_commit_sec and h_commit_nsec, but from what 
I see, do_one_pass() evaluates only h_commit_sec.

Your test is aborting the commit at various points in the write I/O
stream, so it should be simulating an interrupted commit (assuming
that it's not corrupting any I/O.  So the jbd2 layer should have
understood it was the last commit in the journal, and been OK with the
checksum failure.

I think the jbd2 layer understood that it was the last commit - but it 
nevertheless failed the recovery.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/jbd2/recovery.c?h=v5.16-rc7#n809

|
		if  (pass  ==  PASS_SCAN  &&
			!jbd2_commit_block_csum_verify(journal,
							bh->b_data))  {
			chksum_error:
				if  (commit_time  <  last_trans_commit_time)
					goto  ignore_crc_mismatch;
				info->end_transaction  =  next_commit_ID;

				if  (!jbd2_has_feature_async_commit(journal))  {
					journal->j_failed_commit  =
						next_commit_ID;
					brelse(bh);
					break;
				}
			}
|

async_commit() is false.

journal->j_failed_commit is set, and thus after journal playback, a 
failure is reported :-(


What I have done:

- I have removed the line journal->j_failed_commit = next_commit_ID, 
then the image is mounted.

- I had added pr_info() lines, and this confirms that it starts with a 
failure of jbd2_block_csum_verify().

dmesg with JBD2 debug output and additional pr_info lines:

 >>>

[  748.591348] next_commit_id increased: 6943.
[  748.593397] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for 
sequence ID 6943 at 4544/8192
[  748.594322] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking 
block 4544
[  748.595879] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 5, 
sequence 6943
[  748.596800] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for 
sequence ID 6943 at 4545/8192
[  748.600073] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking 
block 4545
[  748.601398] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 1, 
sequence 6943
[  748.603481] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for 
sequence ID 6943 at 4574/8192
[  748.606238] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking 
block 4574
[  748.607241] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 2, 
sequence 6943
[  748.608248] commit_csum_verify error: provided 81be34bd.
[  748.610199] direct block verify error.
[  748.612327] chksum_error.
[  748.617616] error line 2.
[  748.622085] error line 3.
[  748.623291] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for 
sequence ID 6943 at 4575/8192
[  748.624133] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking 
block 4575
[  748.625134] done, info->end_transaction is 6943.
[  748.626966] done with update, info->end_transaction is 6943.
[  748.629521] next_commit_id: initial value 6799.
[  748.630836] fs/jbd2/recovery.c: (do_one_pass, 491): Starting recovery 
pass 1
[  748.631529] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for 
sequence ID 6799 at 6734/8192
[  748.633346] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking 
block 6734

<<<

What do you think?
Is JBD2 too aggressive in declaring something a journal corruption?

What is definitively correct is that the failure is minor. fsck -f was 
able to correct it.
Thus: What is your understanding:
If a mount command fails due to a journal corruption, should user space 
first try a fsck -f before giving up?


--

     Manfred

Attachments

0001-jbd2-recovery.c-Continue-on-csum-failures-for-commit.patch [text/x-patch] 2744 bytes · preview
0003-DEBUG-patch-add-printk-to-fs-jbd2-recovery.c.patch [text/x-patch] 3320 bytes · preview

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help