Re: Saw your commit: Use mutex_lock_io() for journal->j_checkpoint_mutex

From: Tejun Heo <tj@kernel.org>
Date: 2017-02-21 20:45:10

Hello, Ted.

> If this happens, it almost certainly means that the journal is too
> small.  This was something that a grad student I was mentoring found
> when we were benchmarking our SMR-friendly jbd2 changes.  There's a
> footnote to this effect in the FAST 2017 paper[1].
>
> [1] https://www.usenix.org/conference/fast17/technical-sessions/presentation/aghayev
>     (if you want early access to the paper let me know; it's currently
>     available to registered FAST 2017 attendees and will be opened up
>     at the start of the FAST 2017 conference next week)

> The short version is that on average, with a 5 second commit window
> and a 30 second dirty writeback timeout, if you assume the worst case
> of 100% of the metadata blocks being already in the buffer cache (so
> they don't need to be read from disk), in 5 seconds the journal thread
> could potentially spew 150*5 == 750MB in a journal transaction.  But
> that data won't be written back until 30 seconds later.  So if you are
> continuously deleting files for 30 seconds, the journal should have
> room for at least around 4500 megs worth of sequential writing.  Now,
> that's an extreme worst case.  In reality there will be some disk
> reads, not to mention the metadata writebacks, which will be random.
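
As a rough check of the worst-case arithmetic above, here is a minimal
Python sketch.  It assumes the "150" is a sustained sequential write
rate in MB/s (the message gives the number without a unit); the
constant and function names are made up for illustration.

```python
# Back-of-the-envelope journal sizing for the worst case described above.
# ASSUMPTION: 150 is a sequential write rate in MB/s; the unit is not
# stated in the message.
SEQ_WRITE_MB_PER_S = 150
COMMIT_INTERVAL_S = 5       # commit window from the message
WRITEBACK_TIMEOUT_S = 30    # dirty writeback timeout from the message


def worst_case_journal_mb(write_mb_per_s, window_s):
    """Return the MB the journal can receive in window_s seconds."""
    return write_mb_per_s * window_s


# Largest single transaction: one full commit window of sequential writes.
per_transaction_mb = worst_case_journal_mb(SEQ_WRITE_MB_PER_S, COMMIT_INTERVAL_S)

# Journal space needed before writeback starts freeing blocks.
needed_mb = worst_case_journal_mb(SEQ_WRITE_MB_PER_S, WRITEBACK_TIMEOUT_S)

print(f"per transaction: {per_transaction_mb} MB")
print(f"needed before writeback: {needed_mb} MB")
```

This only reproduces the upper bound; as noted above, real workloads
also do disk reads and random metadata writeback, so the journal fills
more slowly than this.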

I see.  Yeah, that's close to what we were seeing.  We had a
malfunctioning workload which was deleting an extremely high number of
files, locking up the filesystem and thus other things on the host.
This was clearly misbehavior on the workload's part, but debugging it
took longer than necessary because the waits didn't get accounted as
iowait, hence the patch.

> The bottom line is that 128MiB, which was the previous maximum journal
> size, is simply way too small.  So in the latest e2fsprogs 1.43.x
> release, the default has been changed so that for a sufficiently large
> disk, the default journal size is 1 gig.

> If you are using faster media (say, SSD or PCIe-attached flash), and
> you expect to have workloads that are extreme with respect to huge
> amounts of metadata changes, an even bigger journal might be called
> for.  (And these are the workloads where the lazy journalling that we
> studied in the FAST paper is helpful, even on conventional HDDs.)

> Anyway, you might want to pass on to the system administrators (or the
> SREs, as applicable :-) that if they were hitting this case often,
> they should seriously consider increasing the size of their ext4
> journal.

Thanks a lot for the explanation!

-- 
tejun
