Re: JBD2: journal transaction 6943 on loop0-8 is corrupt.
From: Manfred Spraul <hidden>
Date: 2021-12-30 08:16:17
Hi Ted, On 12/30/21 02:37, Theodore Ts'o wrote:
quoted hunk ↗ jump to hunk
On Tue, Dec 28, 2021 at 09:36:22PM +0100, Manfred Spraul wrote:quoted
Hi, with simulated power failures, I see a corrupted journal [39056.200845] JBD2: journal transaction 6943 on loop0-8 is corrupt. [39056.200851] EXT4-fs (loop0): error loading journalThis means that the journal replay found a commit which was *not* the last commit, and which contained a CRC error. If it's the last commit (e.g., there is no valid subsequent commit block), then it's possible that the journal commit was never completed before the system crashed--- e.g., it was an interrupted commit.
It is the last commit, there are no valid subsequent commit blocks. The current failure model is simple: all blocks up to block <n> are written, the blocks starting from <n+1> are discarded. What I can't rule out, but I think this is not what I see: The image is small (512 MB) and everything is in memory. Thus I would not rule out that the whole journal is filled within less than one second. The commit header contains h_commit_sec and h_commit_nsec, but from what I see, do_one_pass() evaluates only h_commit_sec.
Your test is aborting the commit at various points in the write I/O stream, so it should be simulating an interrupted commit (assuming that it's not corrupting any I/O. So the jbd2 layer should have understood it was the last commit in the journal, and been OK with the checksum failure.
I think the jbd2 layer understood that it was the last commit - but it nevertheless failed the recovery. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/jbd2/recovery.c?h=v5.16-rc7#n809
|
if (pass == PASS_SCAN &&
!jbd2_commit_block_csum_verify(journal,
bh->b_data)) {
chksum_error:
if (commit_time < last_trans_commit_time)
goto ignore_crc_mismatch;
info->end_transaction = next_commit_ID;
if (!jbd2_has_feature_async_commit(journal)) {
journal->j_failed_commit =
next_commit_ID;
brelse(bh);
break;
}
}
|async_commit() is false. journal->j_failed_commit is set, and thus after journal playback, a failure is reported :-( What I have done: - I have removed the line journal->j_failed_commit = next_commit_ID, then the image is mounted. - I had added pr_info() lines, and this confirms that it starts with a failure of jbd2_block_csum_verify(). dmesg with JBD2 debug output and additional pr_info lines: >>> [ 748.591348] next_commit_id increased: 6943. [ 748.593397] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for sequence ID 6943 at 4544/8192 [ 748.594322] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking block 4544 [ 748.595879] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 5, sequence 6943 [ 748.596800] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for sequence ID 6943 at 4545/8192 [ 748.600073] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking block 4545 [ 748.601398] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 1, sequence 6943 [ 748.603481] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for sequence ID 6943 at 4574/8192 [ 748.606238] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking block 4574 [ 748.607241] fs/jbd2/recovery.c: (do_one_pass, 549): Found magic 2, sequence 6943 [ 748.608248] commit_csum_verify error: provided 81be34bd. [ 748.610199] direct block verify error. [ 748.612327] chksum_error. [ 748.617616] error line 2. [ 748.622085] error line 3. [ 748.623291] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for sequence ID 6943 at 4575/8192 [ 748.624133] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking block 4575 [ 748.625134] done, info->end_transaction is 6943. [ 748.626966] done with update, info->end_transaction is 6943. [ 748.629521] next_commit_id: initial value 6799. [ 748.630836] fs/jbd2/recovery.c: (do_one_pass, 491): Starting recovery pass 1 [ 748.631529] fs/jbd2/recovery.c: (do_one_pass, 517): Scanning for sequence ID 6799 at 6734/8192 [ 748.633346] fs/jbd2/recovery.c: (do_one_pass, 526): JBD2: checking block 6734 <<< What do you think? Is JBD2 too aggressive in declaring something a journal corruption? What is definitively correct is that the failure is minor. fsck -f was able to correct it. Thus: What is your understanding: If a mount command fails due to a journal corruption, should user space first try a fsck -f before giving up? -- Manfred
Attachments
- 0001-jbd2-recovery.c-Continue-on-csum-failures-for-commit.patch [text/x-patch] 2744 bytes · preview
- 0003-DEBUG-patch-add-printk-to-fs-jbd2-recovery.c.patch [text/x-patch] 3320 bytes · preview