Re: [External] Re: [PATCH v2 17/21] mm: list_lru: replace linear array with xarray

From: Muchun Song <hidden>
Date: 2021-05-28 08:04:46
Also in: linux-fsdevel, linux-nfs, lkml

On Fri, May 28, 2021 at 11:43 AM Muchun Song [off-list ref] wrote:

On Thu, May 27, 2021 at 8:08 PM Matthew Wilcox [off-list ref] wrote:

quoted

On Thu, May 27, 2021 at 02:21:44PM +0800, Muchun Song wrote:

quoted

If we run 10k containers in the system, the size of the
list_lru_memcg->lrus can be ~96KB per list_lru. When we decrease the
number containers, the size of the array will not be shrinked. It is
not scalable. The xarray is a good choice for this case. We can save
a lot of memory when there are tens of thousands continers in the
system. If we use xarray, we also can remove the logic code of
resizing array, which can simplify the code.

I am all for this, in concept.  Some thoughts below ...

quoted

@@ -56,10 +51,8 @@ struct list_lru {
 #ifdef CONFIG_MEMCG_KMEM
      struct list_head        list;
      int                     shrinker_id;
-     /* protects ->memcg_lrus->lrus[i] */
-     spinlock_t              lock;
      /* for cgroup aware lrus points to per cgroup lists, otherwise NULL */
-     struct list_lru_memcg   __rcu *memcg_lrus;
+     struct xarray           *xa;
 #endif

Normally, we embed an xarray in its containing structure instead of
allocating it.  It's only a pointer, int and spinlock, so generally
16 bytes, as opposed to the 8 bytes for the pointer and a 16 byte
allocation.  There is a minor wrinkle in that currently 'NULL' is
used to indicate "is not cgroup aware".  Maybe there's another way
to indicate that?

Sure. I can drop patch 8 in this series. In that case, we can use
->memcg_aware to indicate that.

quoted

@@ -51,22 +51,12 @@ static int lru_shrinker_id(struct list_lru *lru)
 static inline struct list_lru_one *
 list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
 {
-     struct list_lru_memcg *memcg_lrus;
-     struct list_lru_node *nlru = &lru->node[nid];
+     if (list_lru_memcg_aware(lru) && idx >= 0) {
+             struct list_lru_per_memcg *mlru = xa_load(lru->xa, idx);

-     /*
-      * Either lock or RCU protects the array of per cgroup lists
-      * from relocation (see memcg_update_list_lru).
-      */
-     memcg_lrus = rcu_dereference_check(lru->memcg_lrus,
-                                        lockdep_is_held(&nlru->lock));
-     if (memcg_lrus && idx >= 0) {
-             struct list_lru_per_memcg *mlru;
-
-             mlru = rcu_dereference_check(memcg_lrus->lrus[idx], true);
              return mlru ? &mlru->nodes[nid] : NULL;
      }
-     return &nlru->lru;
+     return &lru->node[nid].lru;
 }

... perhaps we move the xarray out from under the #ifdef and use index 0
for non-memcg-aware lrus?  The XArray is specially optimised for arrays
which only have one entry at 0.

Sounds like a good idea. I can do a try.

I have thought more about this. If we do this, we need to allocate a
list_lru_per_memcg structure for the root memcg. Since the structure
of list_lru_node already aligns with cache line size. From this point
of view, this just wastes memory. We do not gain anything. Right?

Another approach is introducing a new structure of list_lru_nodes,
which is described as following.

struct list_lru_nodes {
         struct list_lru_node nodes[0];
};

Then we insert struct list_lru_nodes to the XArray (index == 0).
There will be two different types in the XArray. If index == 0,
the xa_load() returns list_lru_nodes pointer, otherwise, it returns
list_lru_per_memcg pointer. So list_lru_from_memcg_idx() still
need to handle different cases.

It looks like both approaches do not have any obvious
advantages.

What do you think about this, Mattew?

quoted

 int list_lru_memcg_alloc(struct list_lru *lru, struct mem_cgroup *memcg, gfp_t gfp)
 {
+     XA_STATE(xas, lru->xa, 0);
      unsigned long flags;
-     struct list_lru_memcg *memcg_lrus;
-     int i;
+     int i, ret = 0;

      struct list_lru_memcg_table {
              struct list_lru_per_memcg *mlru;

@@ -601,22 +522,45 @@ int list_lru_memcg_alloc(struct list_lru *lru, struct mem_cgroup *memcg, gfp_t g
              }
      }

-     spin_lock_irqsave(&lru->lock, flags);
-     memcg_lrus = rcu_dereference_protected(lru->memcg_lrus, true);
+     xas_lock_irqsave(&xas, flags);
      while (i--) {
              int index = memcg_cache_id(table[i].memcg);
              struct list_lru_per_memcg *mlru = table[i].mlru;

-             if (index < 0 || rcu_dereference_protected(memcg_lrus->lrus[index], true))
+             xas_set(&xas, index);
+retry:
+             if (unlikely(index < 0 || ret || xas_load(&xas))) {
                      kfree(mlru);
-             else
-                     rcu_assign_pointer(memcg_lrus->lrus[index], mlru);
+             } else {
+                     ret = xa_err(xas_store(&xas, mlru));

This is mixing advanced and normal XArray concepts ... sorry to have
confused you.  I think what you meant to do here was:

                        xas_store(&xas, mlru);
                        ret = xas_error(&xas);

Sure. Thanks for pointing it out. It's my bad usage.

quoted

Or you can avoid introducing 'ret' at all, and keep your errors in the
xa_state.  You're kind of mirroring the xa_state errors into 'ret'
anyway, so that seems easier to understand?

Make sense. I will do this in the next version. Thanks for your
all suggestions.

quoted

-     memcg_id = memcg_alloc_cache_id();
+     memcg_id = ida_simple_get(&memcg_cache_ida, 0, MEMCG_CACHES_MAX_SIZE,
+                               GFP_KERNEL);

        memcg_id = ida_alloc_max(&memcg_cache_ida,
                        MEMCG_CACHES_MAX_SIZE - 1, GFP_KERNEL);

... although i think there's actually a fencepost error, and this really
should be MEMCG_CACHES_MAX_SIZE.

Totally agree. I have fixed this issue in patch 19.

quoted

      objcg = obj_cgroup_alloc();
      if (!objcg) {
-             memcg_free_cache_id(memcg_id);
+             ida_simple_remove(&memcg_cache_ida, memcg_id);

                ida_free(&memcg_cache_ida, memcg_id);

I Will update to this new API.

quoted

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help