Re: [RFC][PATCH 05/13] mm/numa: automatically generate node migration order
From: Dave Hansen <hidden>
Date: 2021-02-01 19:14:00
Also in:
lkml
On 1/29/21 12:46 PM, Yang Shi wrote: ...
quoted
int next_demotion_node(int node) { - return node_demotion[node]; + /* + * node_demotion[] is updated without excluding + * this function from running. READ_ONCE() avoids + * reading multiple, inconsistent 'node' values + * during an update. + */Don't we need a smp_rmb() here? The single write barrier might be not enough in migration target set. Typically a write barrier should be used in pairs with a read barrier.
I don't think we need one, practically. Since there is no locking against node_demotion[] updates, although a smp_rmb() would ensure that this read is up-to-date, it could change freely after the smp_rmb(). In other words, smp_rmb() would shrink the window where a "stale" read could occur but would not eliminate it.
quoted
+ return READ_ONCE(node_demotion[node]);Why not consolidate the patch #4 in this patch? The patch #4 just add the definition of node_demotion and the function, then the function is changed in this patch, and the function is not used by anyone between the adding and changing.
I really wanted to highlight that the locking scheme and the READ_ONCE() (or lack thereof) was specifically due to how node_demotion[] was being updated. The READ_ONCE() is not, for instance, inherent to the data structure. ...
quoted
+/* + * When memory fills up on a node, memory contents can be + * automatically migrated to another node instead of + * discarded at reclaim. + * + * Establish a "migration path" which will start at nodes + * with CPUs and will follow the priorities used to build the + * page allocator zonelists. + * + * The difference here is that cycles must be avoided. If + * node0 migrates to node1, then neither node1, nor anything + * node1 migrates to can migrate to node0. + * + * This function can run simultaneously with readers of + * node_demotion[]. However, it can not run simultaneously + * with itself. Exclusion is provided by memory hotplug events + * being single-threaded.Maybe an example diagram for the physical topology and how the migration target is generated in the comment seems helpful to understand the code.
Sure. Were you thinking of a code comment, or enhanced changelog?
Let's say there's a system with two sockets each with the same three
classes of memory: fast, medium and slow. Each memory class is placed
in its own NUMA node and the CPUs are attached to the fast memory. That
leaves 6 NUMA nodes (0-5):
Socket A: 0, 1, 2
Socket B: 3, 4, 5
The migration path for this configuration path would start on the nodes
with the processors and fast memory, progress through medium and end
with the slow memory:
0 -> 1 -> 2 -> stop
3 -> 4 -> 5 -> stop
This is represented in the node_demotion[] like this:
{ 1, // Node 0 migrates to 1
2, // Node 1 migrates to 2
-1, // Node 2 does not migrate
4, // Node 3 migrates to 1
5, // Node 4 migrates to 2
-1} // Node 5 does not migrate
Is that what you were thinking of?
...quoted
+again: + this_pass = next_pass; + next_pass = NODE_MASK_NONE; + /* + * To avoid cycles in the migration "graph", ensure + * that migration sources are not future targets by + * setting them in 'used_targets'. Do this only + * once per pass so that multiple source nodes can + * share a target node. + * + * 'used_targets' will become unavailable in future + * passes. This limits some opportunities for + * multiple source nodes to share a desintation.s/desination/destination
Fixed, thanks.
quoted
+ */ + nodes_or(used_targets, used_targets, this_pass); + for_each_node_mask(node, this_pass) { + int target_node = establish_migrate_target(node, &used_targets); + + if (target_node == NUMA_NO_NODE) + continue; + + /* Visit targets from this pass in the next pass: */ + node_set(target_node, next_pass); + } + /* Is another pass necessary? */ + if (!nodes_empty(next_pass)) + goto again; +} + +void set_migration_target_nodes(void)It seems this function is not called outside migrate.c, so it should be static.
Fixed, thanks.