Re: [PATCH v3 1/2] ext4: introduce EXT4_BG_WAS_TRIMMED to optimize trim

From: Andreas Dilger <hidden>
Date: 2020-08-12 23:14:40

On Aug 10, 2020, at 7:24 AM, tytso@mit.edu wrote:

On Sat, Aug 08, 2020 at 10:33:08PM -0600, Andreas Dilger wrote:

quoted

What about storing "s_min_freed_blocks_to_trim" persistently in the
superblock, and then the admin can adjust this as desired?  If it is
set =1, then the "lazy trim" optimization would be disabled (every
FITRIM request would honor the trim requests whenever there is a
freed block in a group).  I suppose we could allow =0 to mean "do not
store the WAS_TRIMMED flag persistently", so there would be no change
for current behavior, and it would require a tune2fs option to set the
new value into the superblock (though we might consider setting this
to a non-zero value in mke2fs by default).

Currently the the minimum blocks to trim is passed in to FITRIM from
userspace; so we would need to define how the passed-in value from the
fstrim program interacts with the value stored in the sueprblock.
Would we always ignore the value passed-in from userspace?  That
doesn't seem right...

The s_min_freed_blocks_to_trim value would decide whether the persistent
WAS_TRIMMED flag is cleared from the group or not when blocks are freed.
If =1 then every block freed in a group will clear the WAS_TRIMMED flag
(this is essentially the current behavior).  If > 1 (the default with
the second patch in the series) then multiple blocks need to be freed
from a group before it is reconsidered for trim, which is reasonable,
since few underlying devices have 1-block TRIM granularity and issuing
a new TRIM each time is just overhead.  If =0 then the WAS_TRIMMED flag
is never saved, and every FITRIM call will result in every group being
trimmed again, which is what currently happens when the filesystem is
newly mounted.

quoted

The other thing we were thinkgin about was changing the "-o discard" code
to leverage the WAS_TRIMMED flag, and just do bulk trim periodically
in the filesystem as blocks are freed from groups, rather than tracking
freed extents in memory and submitting trims actively during IO.  Instead,
it would track groups that exceed "s_min_freed_blocks_to_trim", and trim
the whole group in the background when the filesystem is not active.

Hmm, maybe.  That's an awful lot of complexity, which is my concern
with that approach.

Part of the problem here is that discard is being used for different
things for different use cases and devices with different discard
speeds.  Right now, one of the primary uses of -o discard is for
people who have fast discard implementations and/or people who really
want to make sure every freed block is immediately discard --- perhaps
to meet security / privacy requirements (such as HIPPA compliance,
etc.).   I don't want to break that.

We now have a requirement of people who have very slow discards --- I
think at one point people mentioned something about for devices using
HDD, probably in some kind of dm-thin use case?  One solution that we
can use for those is simply use fstrim -m 8M or some such.  But it
appears that part of the problem is people do want more precision than
that?

At least in the DDN case it isn't for HDDs but just for huge NVMe RAID
devices where the many millions of TRIM requests take a noticeable time.
Also, even if the TRIM is done by mke2fs, fstrim will TRIM the whole
filesystem again after each mount because this state is not persistently
stored in the group descriptors.

Another solution might be to skip trimming block groups if there have
been blocks that have been freshly freed that are pending a commit,
and skip that block group until the commit has completed.  That might
also help reduce contention on a busy file system.

I think even with fast TRIM devices, sending lots of them inline with
other requests causes IO serialization, so "-o discard" is bad for
performance and has complex in-memory tracking of extents to TRIM.  The
other drawback of "-o discard" is even with "perfect" TRIM, the size
or alignment may not be suitable for the underlying NAND, so TRIMs that
eventually cover the entire group could in fact release none of the
underlying space because individually they weren't the right size/offset.

The proposal is to allow efficiently (i.e. 1 bit per group on disk, one
int per group in memory) aggregating and tracking which groups can
benefit from a TRIM request (e.g. after multi-MB of blocks are freed
there), rather than "-o discard" sending hundreds of TRIMs when individual
blocks/extents are freed.  The alternative of running "fstrim" from cron
is also not ideal, because it is hard to know what interval to use, and
there is no way to know if the filesystem is idle and it is a good time
to run, or if it *was* idle but suddenly becomes busy after fstrim starts.


Adding filesystem load-detection for fstrim, so that it only trims groups
while idle and stops when there is IO, would solve the contention issue
with actual filesystem usage.  At that point, the "enhancement" would
just be to essentially have fstrim restart periodically internally when
"-o discard" is used (or maybe "-o fstrim" instead?), with optimizations
to speed up finding groups that need to be trimmed as opposed to a full
scan each time.  Maybe the optimizations are premature, and a full scan
is easy enough even when there are millions of groups?  I guess if mballoc
can do full-group scans then fstrim can do the same easily enough.

Yet another solution might be bias block allocations towards LBA
ranges that have been deleted recently --- since another way to avoid
trims is to simply overwrite those LBA's.  But then the question is
how much memory are we willing to dedicate towards tracking recently
released LBA's, and to what level of granularity?  Perhaps we just
track the freed extents, and if they don't get used within a certain
period, or if we start getting put under memory pressure, we then send
the discards at that point.

I'd think that overwriting just-freed blocks is the opposite of what we
want?  Overwriting NAND blocks results in an erase cycle, and TRIM is
intended to pre-erase the blocks so that this doesn't need to be done
when there is a new write.

Ultimately, though, this is a space full of trade offs, and I'm
reminded of one of my father's favorite Chinese sayings: "You're
demanding a horse which can run fast, but which doesn't eat much
grass." (又要马儿跑，又要马儿不吃草).  Or translated more
idiomatically, you can't have your cake and eat it too.  It seems
this desire transcends all cultures.  :-)


Cheers, Andreas

Attachments

signature.asc [application/pgp-signature] 873 bytes

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help