RE: [RFC PATCH v6 3/4] scheduler: scan idle cpu in cluster for tasks within one LLC | linux-arm-kernel


-----Original Message-----
From: Dietmar Eggemann [mailto:dietmar.eggemann@arm.com]
Sent: Friday, April 30, 2021 10:43 PM
To: Song Bao Hua (Barry Song) <redacted>; Vincent Guittot
[off-list ref]
Cc: tim.c.chen@linux.intel.com; catalin.marinas@arm.com; will@kernel.org;
rjw@rjwysocki.net; bp@alien8.de; tglx@linutronix.de; mingo@redhat.com;
lenb@kernel.org; peterz@infradead.org; rostedt@goodmis.org;
bsegall@google.com; mgorman@suse.de; msys.mizuma@gmail.com;
valentin.schneider@arm.com; gregkh@linuxfoundation.org; Jonathan Cameron
[off-list ref]; juri.lelli@redhat.com; mark.rutland@arm.com;
sudeep.holla@arm.com; aubrey.li@linux.intel.com;
linux-arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org;
linux-acpi@vger.kernel.org; x86@kernel.org; xuwei (O) [off-list ref];
Zengtao (B) [off-list ref]; guodong.xu@linaro.org; yangyicong
[off-list ref]; Liguozhu (Kenneth) [off-list ref];
linuxarm@openeuler.org; hpa@zytor.com
Subject: Re: [RFC PATCH v6 3/4] scheduler: scan idle cpu in cluster for tasks
within one LLC

On 29/04/2021 00:41, Song Bao Hua (Barry Song) wrote:

-----Original Message-----
From: Dietmar Eggemann [mailto:dietmar.eggemann@arm.com]
[...]

From: Dietmar Eggemann [mailto:dietmar.eggemann@arm.com]
[...]

On 20/04/2021 02:18, Barry Song wrote:
[...]

Though we will never go to the slow path, wake_wide() will affect want_affine,
and so eventually affect "new_cpu"?
yes.

	for_each_domain(cpu, tmp) {
		/*
		 * If both 'cpu' and 'prev_cpu' are part of this domain,
		 * cpu is a valid SD_WAKE_AFFINE target.
		 */
		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
			if (cpu != prev_cpu)
				new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);

			sd = NULL; /* Prefer wake_affine over balance flags */
			break;
		}

		if (tmp->flags & sd_flag)
			sd = tmp;
		else if (!want_affine)
			break;
	}

If want_affine is false, the block above won't execute, so new_cpu (the
target) will always be "prev_cpu"? So when task size > cluster size in
wake_wide(), we won't pull the wakee to the waker's cluster? That seems
sensible.
What is `task size` here?

The criterion is `!(slave < factor || master < slave * factor)` or
`slave >= factor && master >= slave * factor` to wake wide.
Yes. For "task size", I actually mean a bundle of waker-wakee tasks
which can make "slave >= factor && master >= slave * factor" either
true or false, then change the target cpu where we are going to scan
from.
Now that I have moved the scan to cluster level for tasks that are already
within the same LLC, it seems more sensible to use "cluster_size" as the
factor?

I see that since you effectively change the sched domain size from LLC
to CLUSTER (e.g. 24->6) for wakeups with cpu and prev_cpu sharing LLC
(hence the `numactl -N 0` in your workload), wake_wide() has to take
CLUSTER size into consideration.

I was wondering if you saw wake_wide() returning 1 with your use cases:

numactl -N 0 /usr/lib/lmbench/bin/stream -P [6,12] -M 1024M -N 5
I couldn't make wake_wide() return 1 with the above stream command.
And I can't reproduce it with a 1:1 (monogamous) hackbench "-f 1".

But I am able to reproduce this issue with an M:N hackbench, for example:

numactl -N 0 hackbench -p -T -f 10 -l 20000 -g 1

hackbench will create 10 senders which will send messages to 10
receivers. (Each sender can send messages to all 10 receivers.)

I've often seen flips like:
waker  wakee
1501   39
1509   17
11     1320
13     2016

11, 13 and 17 are smaller than the LLC size but larger than the cluster
size. So wake_wide() using the cluster size as factor will return 1; on the
other hand, if we always use llc_size as factor, it will return 0.
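
For readers following the arithmetic, here is a minimal userspace sketch of
the criterion quoted earlier (`slave >= factor && master >= slave * factor`),
applied to the flip counts above. It is not the kernel function: the name
wake_wide_sketch() and the constants CLUSTER_SIZE and LLC_SIZE are
hypothetical, and the values 6 and 24 are assumed from the "24->6" example
mentioned in this thread.

```c
/* Userspace sketch of the wake_wide() criterion; not the kernel code. */

/* Assumed sizes, taken from the "24->6" example in this thread. */
enum { CLUSTER_SIZE = 6, LLC_SIZE = 24 };

/*
 * Returns 1 ("wake wide") when slave >= factor && master >= slave * factor,
 * where master is the larger of the two flip counts and slave the smaller.
 * Returns 0 otherwise.
 */
static int wake_wide_sketch(unsigned int waker_flips,
			    unsigned int wakee_flips,
			    unsigned int factor)
{
	unsigned int master = waker_flips;
	unsigned int slave = wakee_flips;

	/* Order the pair so that master >= slave. */
	if (master < slave) {
		unsigned int tmp = master;

		master = slave;
		slave = tmp;
	}

	if (slave < factor || master < slave * factor)
		return 0;
	return 1;
}
```

With these assumed sizes, the pairs (1509, 17), (11, 1320) and (13, 2016)
give 1 for CLUSTER_SIZE and 0 for LLC_SIZE. The pair (1501, 39) gives 1 with
either factor, since 39 >= 24 and 1501 >= 39 * 24.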

However, the change in wake_wide() seems to have a negative effect on the
M:N relationship (-f 10), according to tests run today with:

numactl -N 0 hackbench -p -T -f 10 -l 20000 -g $1

g             =  1       2       3       4
cluster_size     0.5768  0.6578  0.8117  1.0119
LLC_size         0.5479  0.6162  0.6922  0.7754

Always using llc_size as the factor in wake_wide() still gives better
results in the 10:10 polygamous hackbench.

So it seems the `slave >= factor && master >= slave * factor` isn't
a suitable criterion for cluster size?

Thanks
Barry

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
