Re: [PATCH v2 4/5] ext4: improve cr 0 / cr 1 group scanning

From: Andreas Dilger <hidden>
Date: 2021-02-26 03:44:21

On Feb 21, 2021, at 8:59 PM, harshad shirwadkar [off-list ref] wrote:

Thank you for all the feedback Andreas and Artem. Some comments below:

Thank you for working on this.  It is definitely an area that can use
this improvement.

quoted

I was wondering about the cost of the list/tree maintenance as well,
especially since there was a post from "kernel test robot" that this
patch introduced a performance regression.

Yeah, I'm pretty sure that the kernel test robot is complaining mainly
because of list/tree maintenance. I think if this optimization is
turned off, we probably should not even maintain the tree / lists.
That has one downside that is that we will have to disallow setting
this option during remount, which I guess is okay?

I think it is reasonable to not be able to change this at runtime.
This would only make a difference for a limited number of testers,
and virtually all users will never know it exists at all.

quoted

It would also make sense for totally full groups to be kept out of the
rb tree entirely, since they do not provide any value in that case (the
full groups will never be selected for allocations), and they just add
to the tree depth and potentially cause an imbalance if there are many
of them.  That also has the benefit of the rbtree efficiency *improving*
as the filesystem gets more full, which is right when it is most needed.

Ack

quoted

It might also make sense to keep totally empty groups out of the rbtree,
since they should always be found in cr0 already if the allocation is
large enough to fill the whole group?  Having a smaller rbtree makes
every insertion/removal that much more efficient.

Ack

quoted

Those groups will naturally be re-added into the rbtree when they have
blocks freed or allocated, so not much added complexity.


Does it make sense to disable "mb_optimize_scan" if filesystems are
smaller than a certain threshold?  Clearly, if there are only 1-2
groups, maintaining a list and rbtree has no real value, and with
only a handful of groups (< 16?) linear searching is probably as fast
or faster than maintaining the two data structures.  That is similar
to e.g. bubble sort vs. quicksort, where it is more efficient to sort
a list of ~5-8 entries with a dumb/fast algorithm instead of a complex
algorithm that is more efficient at larger scales.  That would also
(likely) quiet the kernel test robot, if we think that its testing is
not representative of real-world usage.

Ack, these are good optimizations. I'll add these in V3.

For testing purposes it should be possible to have "mb_optimize_scan=1"
force the use of this option, even if the filesystem is small.

Besides the optimizations mentioned here, I also think we should add
"mb_optimize_linear_limit" or such sysfs tunable which will control
how many groups should mballoc search linearly before using tree /
lists for allocation? That would help us with the disk seek time
performance.

There is already a linear search threshold parameters for mballoc,
namely mb_min_to_scan and mb_max_to_scan that could be used for this.
I think we could use "mb_min_to_scan=10" (the current default), or
maybe shrink this a bit (4?) if "mb_optimize_scan" is enabled.

We discussed on our last call that we probably should consult with the
block device's request queue to check if the underlying block device
is rotational or not. However, we also discussed that for more complex
devices (such as DMs setup on top of HDD and SSD etc), whether the
device is rotational or not is not a binary answer and we would need a
more complex interface (such as logical range to "is_rotational"
table) to make intelligent choice in the file system. Also, in such
cases, it is not clear if such a table needs to be passed to the file
system during mkfs time? or at mount time? or at run time?

I don't think the hybrid case is very important yet.  By far the
most common case is going to be "rotational=1" or "rotational=0"
for the whole device, so we should start by only optimizing for
those cases.  DM looks like it returns "rotational=0" correctly
when a composite device it is made of entirely non-rotational
devices and "rotational=1" as it should when it is a hybrid
HDD/SSD device (which I have in my local system).

Given the number of unknowns in the above discussion, I propose that
we start simple and evolve later. So my proposal is that we add a
"mb_optimize_linear_limit" tunable that accepts an integer value. In
the kernel, for non-rotational devices, that value will be defaulted
to 0 (which means no linear scan) and for rotational devices, that
value will be defaulted to a reasonable value (-- not sure what that
value would be though - 4?). This default can be overridden using the
sysfs interface. We can later evolve this interface to accept more
complex input such as logical range to rotational status.

Does that sound reasonable?

Yes, modulo using the existing "mb_min_to_scan" parameter for this.
I think 4 or 8 or 10 groups is reasonable (512MB, 1GB, 1.25GB),
since if it needs a seek anyway then we may as well find a good
group for this.

Cheers, Andreas

quoted

On Feb 11, 2021, at 3:30 AM, Andreas Dilger [off-list ref] wrote:

quoted

This function would be more efficient to do the list move under a single
write lock if the order doesn't change.  The order loop would just
save the largest free order, then grab the write lock, do the list_del(),
set bb_largest_free_order, and list_add_tail():

mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
{
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    int i, new_order = -1;

    for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
            if (grp->bb_counters[i] > 0) {
                    new_order = i;
                    break;
            }
    }
    if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
            write_lock(&sbi->s_mb_largest_free_orders_locks[
                                          grp->bb_largest_free_order]);
            list_del_init(&grp->bb_largest_free_order_node);

            if (new_order != grp->bb_largest_free_order) {
                    write_unlock(&sbi->s_mb_largest_free_orders_locks[
                                          grp->bb_largest_free_order]);
                    grp->bb_largest_free_order = new_order;
                    write_lock(&sbi->s_mb_largest_free_orders_locks[
                                          grp->bb_largest_free_order]);
            }
            list_add_tail(&grp->bb_largest_free_order_node,
                  &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
            write_unlock(&sbi->s_mb_largest_free_orders_locks[
                                          grp->bb_largest_free_order]);
    }
}

In looking at my previous comment, I wonder if we could further reduce
the list locking here by not moving an entry to the end of the *same*
list if it is not currently at the head?  Since it was (presumably)
just moved to the end of the list by a recent allocation, it is very
likely that some other group will be chosen from the list head, so
moving within the list to maintain strict LRU is probably just extra
locking overhead that can be avoided...

Also, it isn't clear if *freeing* blocks from a group should move it
to the end of the same list, or just leave it as-is?  If there are
more frees from the list it is likely to be added to a new list soon,
and if there are no more frees, then it could stay in the same order.


Cheers, Andreas


Cheers, Andreas

Attachments

signature.asc [application/pgp-signature] 873 bytes

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help