Re: [PATCH] btrfs: Stop kmalloc'ing btrfs_path in btrfs_lookup_bio_sums
From: David Sterba <hidden>
Date: 2021-06-30 10:16:47
On Tue, Jun 29, 2021 at 07:56:34PM +0800, Qu Wenruo wrote:
On 2021/6/28 下午11:06, Nikolay Borisov wrote:quoted
While doing some performance characterization of a workoad reading ~80g split among 4mb files via DIO I observed that btrfs_lookup_bio_sums was using rather excessive cpu cycles: 95.97%--ksys_pread64 95.96%--new_sync_read 95.95%--iomap_dio_rw 95.89%--iomap_apply 72.25%--iomap_dio_bio_actor 55.19%--iomap_dio_submit_bio 55.13%--btrfs_submit_direct 20.72%--btrfs_lookup_bio_sums 7.44%-- kmem_cache_alloc Timing the workload yielded: real 4m41.825s user 0m14.321s sys 10m38.419s Turns out this kmem_cache_alloc corresponds to the btrfs_path allocation that happens in btrfs_lookup_bio_sums. Nothing in btrfs_lookup_bio_sums warrants having the path be heap allocated. Just turn this allocation into a stack based one. After the change performance changes accordingly: real 4m21.742s user 0m13.679s sys 9m8.889s All in all there's a 20 seconds (~7%) reductino in real run time as well as the time spent in the kernel is reduced by 1:30 minutes (~14%). And btrfs_lookup_bio_sums ends up accounting for only 7.91% of cycles rather than 20% before. Signed-off-by: Nikolay Borisov <redacted>I'm surprised that btrfs_path in kernel no longer needs any initialization. I guess we can do a lot of more such conversion to use stack btrfs_path, just like what we did in btrfs-progs. But I'm not that sure if 112 bytes stack memory usage increase is OK or not for other locations.
In justified cases 112 on-stack instead of allocation could be considered but not in general. That progs to that is because it's in userspace where handling memory is way easier and stack sizes are megabytes, not a few kilobytes like in kernel. The btrfs_path is allocated from slab and the performance hit is negligible on the release build, as Nikolay found out.
But since this one get a pretty good performance increase, and this function itself doesn't have much stack memory usage anyway, it looks ok to me.
The stack size needs to take into account the whole call chain that could end up in this function, and in the IO paths we need to be conservative in case there are other layers on top or under btrfs that could need the stack space too.