Re: [PATCH v3 00/76] Optimize list lru memory consumption
From: Muchun Song <hidden>
Date: 2021-09-18 08:00:05
Also in:
linux-fsdevel, linux-mm, lkml
On Sat, Sep 18, 2021 at 2:56 PM Kari Argillander [off-list ref] wrote:
On Tue, Sep 14, 2021 at 03:28:22PM +0800, Muchun Song wrote:
We introduced alloc_inode_sb() in the previous version (v2); it sets up the inode reclaim context properly when allocating a filesystem-specific inode. All filesystems therefore have to be converted to the new API, which was done in one patch. Some filesystems are easy to convert (just replace kmem_cache_alloc() with alloc_inode_sb()), while others need more work. To make it easier for the maintainers of each filesystem to review their own part, I split that patch into per-filesystem patches in this version. I am not sure whether this is a good idea, because it results in more commits.

On our server, we found a suspected memory leak. The kmalloc-32 slab cache consumes more than 6GB of memory, while the other kmem_caches consume less than 2GB. After in-depth analysis, we found that list_lru_one allocations are the cause of the kmalloc-32 memory consumption.

  crash> p memcg_nr_cache_ids
  memcg_nr_cache_ids = $2 = 24574

memcg_nr_cache_ids is very large, and the memory consumption of each list_lru can be calculated with the following formula:

  num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)

There are 4 NUMA nodes in our system, so each list_lru consumes ~3MB.

  crash> list super_blocks | wc -l
  952

Every mount registers 2 list_lrus, one for inodes and another for dentries. There are 952 super_blocks, so the total memory is 952 * 2 * 3 MB (~5.6GB). But the number of memory cgroups is now less than 500, so I guess more than 12286 memory cgroups have been created on this machine (I do not know why there are so many cgroups; it may be a user's bug, or the user may really want to do that). Because memcg_nr_cache_ids has not been reduced to a suitable value, a lot of memory is wasted. If we want to reduce memcg_nr_cache_ids, we have to *reboot* the server, which is not what we want.

I had posted a patchset [1] to reduce memcg_nr_cache_ids, but it did not fundamentally solve the problem.
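The estimate in the quoted cover letter can be reproduced with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: the inputs are the figures reported in the message (4 NUMA nodes, memcg_nr_cache_ids = 24574, 952 super_blocks, 2 list_lrus per mount, 32-byte kmalloc-32 objects), and the function names are hypothetical, chosen for illustration rather than taken from the kernel.

```python
# Back-of-the-envelope check of the list_lru memory estimate.
# Inputs are the figures quoted in the message; the helper names are
# illustrative and do not correspond to kernel functions.

KMALLOC_32_OBJECT_SIZE = 32  # bytes per list_lru_one allocation (kmalloc-32)


def list_lru_bytes(num_numa_nodes: int, memcg_nr_cache_ids: int) -> int:
    """Bytes used by one list_lru: num_numa_node * memcg_nr_cache_ids * 32."""
    return num_numa_nodes * memcg_nr_cache_ids * KMALLOC_32_OBJECT_SIZE


def total_bytes(num_super_blocks: int, lrus_per_mount: int,
                num_numa_nodes: int, memcg_nr_cache_ids: int) -> int:
    """Bytes used by all list_lrus across every mounted super_block."""
    per_lru = list_lru_bytes(num_numa_nodes, memcg_nr_cache_ids)
    return num_super_blocks * lrus_per_mount * per_lru


if __name__ == "__main__":
    per_lru = list_lru_bytes(4, 24574)
    total = total_bytes(952, 2, 4, 24574)
    print(f"per list_lru: {per_lru / 2**20:.2f} MiB")
    print(f"all list_lrus: {total / 2**30:.2f} GiB")
```

The exact product is slightly below the rounded 952 * 2 * 3 MB used in the message, but it lands on the same ~5.6GB order of magnitude, which is consistent with the reported kmalloc-32 usage.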
We currently allocate scope for every memcg to be able to be tracked on every superblock instantiated in the system, regardless of whether that superblock is even accessible to that memcg. These huge memcg counts come from container hosts where memcgs are confined to just a small subset of the total number of superblocks that are instantiated at any given point in time.

For these systems with huge container counts, list_lru does not need the capability of tracking every memcg on every superblock. What it comes down to is that the list_lru is only needed for a given memcg if that memcg is instantiating and freeing objects on a given list_lru.

As Dave said, "Which makes me think we should be moving more towards 'add the memcg to the list_lru at the first insert' model rather than 'instantiate all at memcg init time just in case'."

This patchset aims to optimize the list lru memory consumption from different aspects.

Patches 1-6 are code simplification.
Patch 7 converts the array from per-memcg per-node to per-memcg.
Patch 8 introduces kmem_cache_alloc_lru().
Patch 9 introduces alloc_inode_sb().
Patches 10-66 convert each filesystem to alloc_inode_sb().

Nowadays there is also ntfs3. If you do not plan to convert it, please at least CC me so that I can do it when these patches land.

Argillander
Wow, a new filesystem. I didn't notice it before. I'll cover it in the next version and Cc you if you can do a review. Thanks for your reminder.