Re: [PATCH v2 4/5] ext4: improve cr 0 / cr 1 group scanning
From: Andreas Dilger <hidden>
Date: 2021-02-26 03:44:21
On Feb 21, 2021, at 8:59 PM, harshad shirwadkar [off-list ref] wrote:
Thank you for all the feedback Andreas and Artem. Some comments below:
Thank you for working on this. It is definitely an area that can use this improvement.
quoted
I was wondering about the cost of the list/tree maintenance as well, especially since there was a post from "kernel test robot" that this patch introduced a performance regression.Yeah, I'm pretty sure that the kernel test robot is complaining mainly because of list/tree maintenance. I think if this optimization is turned off, we probably should not even maintain the tree / lists. That has one downside that is that we will have to disallow setting this option during remount, which I guess is okay?
I think it is reasonable to not be able to change this at runtime. This would only make a difference for a limited number of testers, and virtually all users will never know it exists at all.
quoted
It would also make sense for totally full groups to be kept out of the rb tree entirely, since they do not provide any value in that case (the full groups will never be selected for allocations), and they just add to the tree depth and potentially cause an imbalance if there are many of them. That also has the benefit of the rbtree efficiency *improving* as the filesystem gets more full, which is right when it is most needed.Ackquoted
It might also make sense to keep totally empty groups out of the rbtree, since they should always be found in cr0 already if the allocation is large enough to fill the whole group? Having a smaller rbtree makes every insertion/removal that much more efficient.Ackquoted
Those groups will naturally be re-added into the rbtree when they have blocks freed or allocated, so not much added complexity. Does it make sense to disable "mb_optimize_scan" if filesystems are smaller than a certain threshold? Clearly, if there are only 1-2 groups, maintaining a list and rbtree has no real value, and with only a handful of groups (< 16?) linear searching is probably as fast or faster than maintaining the two data structures. That is similar to e.g. bubble sort vs. quicksort, where it is more efficient to sort a list of ~5-8 entries with a dumb/fast algorithm instead of a complex algorithm that is more efficient at larger scales. That would also (likely) quiet the kernel test robot, if we think that its testing is not representative of real-world usage.Ack, these are good optimizations. I'll add these in V3.
For testing purposes it should be possible to have "mb_optimize_scan=1" force the use of this option, even if the filesystem is small.
Besides the optimizations mentioned here, I also think we should add "mb_optimize_linear_limit" or such sysfs tunable which will control how many groups should mballoc search linearly before using tree / lists for allocation? That would help us with the disk seek time performance.
There is already a linear search threshold parameters for mballoc, namely mb_min_to_scan and mb_max_to_scan that could be used for this. I think we could use "mb_min_to_scan=10" (the current default), or maybe shrink this a bit (4?) if "mb_optimize_scan" is enabled.
We discussed on our last call that we probably should consult with the block device's request queue to check if the underlying block device is rotational or not. However, we also discussed that for more complex devices (such as DMs setup on top of HDD and SSD etc), whether the device is rotational or not is not a binary answer and we would need a more complex interface (such as logical range to "is_rotational" table) to make intelligent choice in the file system. Also, in such cases, it is not clear if such a table needs to be passed to the file system during mkfs time? or at mount time? or at run time?
I don't think the hybrid case is very important yet. By far the most common case is going to be "rotational=1" or "rotational=0" for the whole device, so we should start by only optimizing for those cases. DM looks like it returns "rotational=0" correctly when a composite device it is made of entirely non-rotational devices and "rotational=1" as it should when it is a hybrid HDD/SSD device (which I have in my local system).
Given the number of unknowns in the above discussion, I propose that we start simple and evolve later. So my proposal is that we add a "mb_optimize_linear_limit" tunable that accepts an integer value. In the kernel, for non-rotational devices, that value will be defaulted to 0 (which means no linear scan) and for rotational devices, that value will be defaulted to a reasonable value (-- not sure what that value would be though - 4?). This default can be overridden using the sysfs interface. We can later evolve this interface to accept more complex input such as logical range to rotational status. Does that sound reasonable?
Yes, modulo using the existing "mb_min_to_scan" parameter for this. I think 4 or 8 or 10 groups is reasonable (512MB, 1GB, 1.25GB), since if it needs a seek anyway then we may as well find a good group for this. Cheers, Andreas
quoted
quoted
On Feb 11, 2021, at 3:30 AM, Andreas Dilger [off-list ref] wrote:quoted
quoted
This function would be more efficient to do the list move under a single write lock if the order doesn't change. The order loop would just save the largest free order, then grab the write lock, do the list_del(), set bb_largest_free_order, and list_add_tail(): mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp) { struct ext4_sb_info *sbi = EXT4_SB(sb); int i, new_order = -1; for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) { if (grp->bb_counters[i] > 0) { new_order = i; break; } } if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) { write_lock(&sbi->s_mb_largest_free_orders_locks[ grp->bb_largest_free_order]); list_del_init(&grp->bb_largest_free_order_node); if (new_order != grp->bb_largest_free_order) { write_unlock(&sbi->s_mb_largest_free_orders_locks[ grp->bb_largest_free_order]); grp->bb_largest_free_order = new_order; write_lock(&sbi->s_mb_largest_free_orders_locks[ grp->bb_largest_free_order]); } list_add_tail(&grp->bb_largest_free_order_node, &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]); write_unlock(&sbi->s_mb_largest_free_orders_locks[ grp->bb_largest_free_order]); } }In looking at my previous comment, I wonder if we could further reduce the list locking here by not moving an entry to the end of the *same* list if it is not currently at the head? Since it was (presumably) just moved to the end of the list by a recent allocation, it is very likely that some other group will be chosen from the list head, so moving within the list to maintain strict LRU is probably just extra locking overhead that can be avoided... Also, it isn't clear if *freeing* blocks from a group should move it to the end of the same list, or just leave it as-is? If there are more frees from the list it is likely to be added to a new list soon, and if there are no more frees, then it could stay in the same order. Cheers, Andreas
Cheers, Andreas
Attachments
- signature.asc [application/pgp-signature] 873 bytes