Re: [PATCH v5 5/6] memg: vmscan select victim node by weight

From: Michal Hocko <hidden>
Date: 2011-08-18 13:34:28
Also in: lkml

On Tue 09-08-11 19:12:02, KAMEZAWA Hiroyuki wrote:

This patch implements a node selection logic based on each node's weight.

This patch adds a new array of nodescan_tickets[]. This array holds
each node's scan weight in a tuple of 2 values. as

    for (i = 0, total_weight = 0; i < nodes; i++) {
        weight = node->weight;
        nodescan_tickets[i].start = total_weight;
        nodescan_tickets[i].length = weight;
    }

After this, a lottery logic as 'ticket = random32()/total_weight'
will make a ticket and bserach(ticket, nodescan_tickets[])
will find a node which holds [start, length] contains ticket.
(This is a lottery scheduling.)

By this, node will be selected in fair manner proportinal to
its weight.

The algorithm sounds interesting, I am just wondering how much gain it
gives over a simple node select with maximum weight (+ add some aging so
that we do not hammer a single node all the time). Have you tried that?

This patch improve the scan time. Following is a test result
ot kernel-make on 4-node fake-numa under 500M limit, with 8cpus.
2cpus per node.

[Before patch]
 772.52user 305.67system 4:11.48elapsed 428%CPU
 (0avgtext+0avgdata 1457264maxresident)k
 4797592inputs+5483240outputs (12550major+35707629minor)pagefaults 0swaps

Just to make sure I understand. Before means before this patch not the
entire patch set, right?

[After patch]
 773.73user 305.09system 3:51.28elapsed 466%CPU
 (0avgtext+0avgdata 1458464maxresident)k
 4400264inputs+4797056outputs (5578major+35690202minor)pagefaults 0swaps

elapsed time and major faults are reduced.

[...]

quoted hunk ↗ jump to hunk

Index: mmotm-Aug3/mm/memcontrol.c
===================================================================

--- mmotm-Aug3.orig/mm/memcontrol.c
+++ mmotm-Aug3/mm/memcontrol.c

[...]

quoted hunk ↗ jump to hunk

@@ -1660,6 +1671,46 @@ mem_cgroup_calc_numascan_weight(struct m
 }
 
 /*
+ * For lottery scheduling, this routine disributes "ticket" for
+ * scanning to each node. ticket will be recored into numascan_ticket
+ * array and this array will be used for scheduling, lator.
+ * For make lottery wair, we limit the sum of tickets almost 0xffff.
+ * Later, random() & 0xffff will do proportional fair lottery.
+ */
+#define NUMA_TICKET_SHIFT	(16)
+#define NUMA_TICKET_FACTOR	((1 << NUMA_TICKET_SHIFT) - 1)
+static void mem_cgroup_update_numascan_tickets(struct mem_cgroup *memcg)
+{
+	struct numascan_ticket *nt;
+	unsigned int node_ticket, assigned_tickets;
+	u64 weight;
+	int nid, assigned_num, generation;
+
+	/* update ticket information by double buffering */
+	generation = memcg->numascan_generation ^ 0x1;

Double buffering is used due to synchronization with consumers (they
are using the other one than is updated here), right?  Would be good to
mention in the coment...

quoted hunk ↗ jump to hunk

+
+	nt = memcg->numascan_tickets[generation];
+	assigned_tickets = 0;
+	assigned_num = 0;
+	for_each_node_mask(nid, memcg->scan_nodes) {
+		weight = memcg->info.nodeinfo[nid]->weight;
+		node_ticket = div64_u64(weight << NUMA_TICKET_SHIFT,
+					memcg->total_weight + 1);
+		if (!node_ticket)
+			node_ticket = 1;
+		nt->nid = nid;
+		nt->start = assigned_tickets;
+		nt->tickets = node_ticket;
+		assigned_tickets += node_ticket;
+		nt++;
+		assigned_num++;
+	}
+	memcg->numascan_tickets_num[generation] = assigned_num;
+	smp_wmb();
+	memcg->numascan_generation = generation;
+}
+
+/*
  * Update all node's scan weight in background.
  */
 static void mem_cgroup_numainfo_update_work(struct work_struct *work)

@@ -1672,6 +1723,8 @@ static void mem_cgroup_numainfo_update_w
 
 	memcg->total_weight = mem_cgroup_calc_numascan_weight(memcg);
 
+	synchronize_rcu();
+	mem_cgroup_update_numascan_tickets(memcg);

OK, so we have only one updater (this one) which will update generation.
Why do we need rcu here and for search? ACCESS_ONCE should be
sufficient (in mem_cgroup_select_victim_node) or are you afraid that we
could get to update twice during the search?

 	atomic_set(&memcg->numainfo_updating, 0);
 	css_put(&memcg->css);
 }

[...]

quoted hunk ↗ jump to hunk

@@ -1707,32 +1772,38 @@ static void mem_cgroup_may_update_nodema
  * we'll use or we've used. So, it may make LRU bad. And if several threads
  * hit limits, it will see a contention on a node. But freeing from remote
  * node means more costs for memory reclaim because of memory latency.
- *
- * Now, we use round-robin. Better algorithm is welcomed.
  */
-int mem_cgroup_select_victim_node(struct mem_cgroup *mem, nodemask_t **mask)
+int mem_cgroup_select_victim_node(struct mem_cgroup *memcg, nodemask_t **mask,
+				struct memcg_scanrecord *rec)
 {
-	int node;
+	int node = MAX_NUMNODES;
+	struct numascan_ticket *nt;
+	unsigned long lottery;
+	int generation;
 
+	if (rec->context == SCAN_BY_SHRINK)
+		goto out;

Why do we care about shrinking here? Due to overhead in node selection?

+
+	mem_cgroup_may_update_nodemask(memcg);
 	*mask = NULL;
-	mem_cgroup_may_update_nodemask(mem);
-	node = mem->last_scanned_node;
+	lottery = random32() & NUMA_TICKET_FACTOR;
 
-	node = next_node(node, mem->scan_nodes);
-	if (node == MAX_NUMNODES)
-		node = first_node(mem->scan_nodes);
-	/*
-	 * We call this when we hit limit, not when pages are added to LRU.
-	 * No LRU may hold pages because all pages are UNEVICTABLE or
-	 * memcg is too small and all pages are not on LRU. In that case,
-	 * we use curret node.
-	 */
-	if (unlikely(node == MAX_NUMNODES))
+	rcu_read_lock();
+	generation = memcg->numascan_generation;
+	nt = bsearch((void *)lottery,
+		memcg->numascan_tickets[generation],
+		memcg->numascan_tickets_num[generation],
+		sizeof(struct numascan_ticket), node_weight_compare);
+	rcu_read_unlock();
+	if (nt)
+		node = nt->nid;
+out:
+	if (unlikely(node == MAX_NUMNODES)) {
 		node = numa_node_id();
-	else
-		*mask = &mem->scan_nodes;
+		*mask = NULL;
+	} else
+		*mask = &memcg->scan_nodes;
 
-	mem->last_scanned_node = node;
 	return node;
 }

[...]

quoted hunk ↗ jump to hunk

Index: mmotm-Aug3/mm/vmscan.c
===================================================================

--- mmotm-Aug3.orig/mm/vmscan.c
+++ mmotm-Aug3/mm/vmscan.c

@@ -2378,9 +2378,9 @@ unsigned long try_to_free_mem_cgroup_pag
 	 * take care of from where we get pages. So the node where we start the
 	 * scan does not need to be the current node.
 	 */
-	nid = mem_cgroup_select_victim_node(mem_cont, &sc.nodemask);
+	nid = mem_cgroup_select_victim_node(mem_cont, &sc.nodemask, rec);
 
-	zonelist = NODE_DATA(nid)->node_zonelists;
+	zonelist = &NODE_DATA(nid)->node_zonelists[0];

Unnecessary change.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help