Re: reproducible corruption in journal

From: Seamus Connor <hidden>
Date: 2021-02-24 19:01:57

*) It appears that your test is generating a large number of very
small transactions, and you are then "crashing" the file system by
disconnecting the file system from further updates, and running e2fsck
to replay the journal, throwing away the block writes after the
"disconnection", and then remounting the file system.  I'm going to
further guess that the sizes of the small transactions are very similar,
and the amount of time between when the file system is mounted, and
when the file system is forcibly disconnected, is highly predictable
(e.g., always N seconds, plus or minus a small delta).

Yes, this matches the workload. I assume the transactions are very small 
because we are doing a large number of metadata operations, and 
because we are mounted sync?

Is that last point correct?  If so, that's a perfect storm where it's
possible for the journal replay to get confused, and mistake previous
blocks in the journal for ones that are part of the last valid file
system mount.  It's something which probably never happens in practice in
production, since users are generally not running a super-fixed
workload, and then causing the system to repeatedly crash after a
fixed interval, such that the mistake described above could happen.
That being said, it's arguably still a bug.

Is this hypothesis consistent with what you are seeing?

Yes, this is consistent with what I am seeing. The only thing to add is that
the workload isn't particularly fixed. The data being written is generated
by a production workload (we are recording statistics about hardware).
The interval at which we are shutting down the block device is regular
but not precise (+/- 30 seconds).

If so, I can see two possible solutions to avoid this:

1) When we initialize the journal, after replaying the journal and
writing a new journal superblock, we issue a discard for the rest of
the journal.  This won't help for block devices that don't support
discard, but it should slightly reduce work for the FTL, and perhaps
slightly improve the write endurance for flash.

Our virtual device doesn't support discard, could that be why others aren't
seeing this issue?

2) We should stop resetting the sequence number to zero, but instead,
keep the sequence number at the last used number.  For testing
purposes, we should have an option where the sequence number is forced
to (0U - 300) so that we test what happens when the 4 byte unsigned
integer wraps.
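
To make the wraparound case concrete, here is a minimal sketch of what a
forced start at (0U - 300) implies for a 4-byte sequence number. The
comparison helper is modelled on the wrap-safe style used for transaction
IDs in jbd2; the names TEST_START_SEQUENCE and seq_after are hypothetical
and only for illustration.

```c
#include <stdint.h>

typedef uint32_t tid_t;

/* Hypothetical test value: 300 transactions before the 32-bit wrap. */
#define TEST_START_SEQUENCE (0U - 300)

/*
 * Wrap-safe "x is newer than y": subtract in unsigned arithmetic, then
 * look at the sign of the result. This stays correct across the wrap as
 * long as the two IDs are less than 2^31 apart. The conversion to
 * int32_t is implementation-defined in ISO C, but behaves as two's
 * complement on mainstream compilers.
 */
static int seq_after(tid_t x, tid_t y)
{
    int32_t difference = (int32_t)(x - y);
    return difference > 0;
}

/* Advance a sequence number; unsigned overflow wraps to 0 by definition. */
static tid_t seq_next(tid_t t)
{
    return t + 1;
}
```

A plain "x > y" would report sequence 5 as older than TEST_START_SEQUENCE,
which is the kind of mistake the forced-wrap test option is meant to
flush out.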

I can give this a try with my workload. Just so I can be sure I understand, the 
hypothesis is that we are running into issues during do_one_pass(..., PASS_SCAN)
because we are getting unlucky with "if (sequence != next_commit_ID) {..."?
The solution is to reduce the occurrence of this issue (to basically zero) by not
resetting the sequence number? Have I understood you correctly? Looking
through e2fsprogs, I think there is a commit that already does this
(32448f50df7d974ded956bbc78a419cf65ec09a3) during replay. Another thing
that I could try is zeroing out the contents of inode 8 after a journal replay and
recreating the journal after each event.
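
To check my understanding, here is a toy model of the hypothesis. It is
not the real do_one_pass() logic: it assumes one journal slot per
transaction, every mount writing from slot 0, and a scan that only
compares sequence numbers. All names (toy_journal, toy_write, toy_scan)
are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define JOURNAL_SLOTS 16

/* Toy journal: each slot holds only a transaction's sequence number. */
struct toy_journal {
    uint32_t slot[JOURNAL_SLOTS];   /* 0 means "never written" */
};

/* Write `count` transactions from slot 0; return the next unused sequence. */
static uint32_t toy_write(struct toy_journal *j, uint32_t first_seq, size_t count)
{
    if (count > JOURNAL_SLOTS)
        count = JOURNAL_SLOTS;
    for (size_t i = 0; i < count; i++)
        j->slot[i] = first_seq + (uint32_t)i;
    return first_seq + (uint32_t)count;
}

/*
 * Scan from slot 0 and count transactions for as long as each slot
 * carries the expected sequence number (the analogue of the
 * "sequence != next_commit_ID" check ending the scan).
 */
static size_t toy_scan(const struct toy_journal *j, uint32_t first_seq)
{
    uint32_t next_commit_id = first_seq;
    size_t found = 0;

    for (size_t i = 0; i < JOURNAL_SLOTS; i++) {
        if (j->slot[i] != next_commit_id)
            break;
        next_commit_id++;
        found++;
    }
    return found;
}
```

In this model, if the first mount writes 10 transactions starting at
sequence 1 and the second mount resets to 1 and writes only 4, the scan
keeps going into the stale slots 5 through 10 because their sequence
numbers happen to line up. If the second mount instead continues from
sequence 11, the scan stops after the 4 transactions that were really
written.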

Thanks for your help!
