Re: [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info

From: Mel Gorman <mgorman@suse.de>
Date: 2020-02-21 14:20:20
Also in: linux-fsdevel, lkml

On Tue, Feb 18, 2020 at 09:39:35AM +0800, ?????? wrote:

On 2020/2/17 ??????10:16, Mel Gorman wrote:

quoted

On Mon, Feb 17, 2020 at 09:23:52PM +0800, ?????? wrote:

[snip]

quoted

IMHO the changing scan period should not be a problem now, since the
maximum period is defined by the user, so monitoring the accumulated page
access counters at the maximum period is always meaningful, correct?

It has meaning but the scan rate drives the fault rate, which is the basis
for the stats you accumulate. If the scan rate is high when accesses
are local, the stats can be skewed, making it appear the task is much
more local than it really is at a later point in time. The scan rate
affects the accuracy of the information. The counters have meaning but
they need careful interpretation.
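
The skew described here can be made concrete with a small worked example.
This is only a sketch: the fault rates, durations and local fractions are
made-up numbers, and the two helper functions are hypothetical, not part
of the patch. A task that is mostly local while the scan rate is high and
mostly remote once the scan rate drops still looks about 90% local in the
accumulated counters, although averaged over time it is only 67.5% local.

```python
def accumulated_locality(phases):
    """Locality as seen from counters summed over all phases."""
    local = sum(p["faults_per_s"] * p["seconds"] * p["local_fraction"]
                for p in phases)
    total = sum(p["faults_per_s"] * p["seconds"] for p in phases)
    return local / total


def time_weighted_locality(phases):
    """Locality averaged over wall-clock time, ignoring the fault rate."""
    total_s = sum(p["seconds"] for p in phases)
    return sum(p["local_fraction"] * p["seconds"] for p in phases) / total_s


# Hypothetical numbers, chosen only to illustrate the skew.
phases = [
    # fast scan rate while accesses are mostly local
    {"faults_per_s": 1000, "seconds": 10, "local_fraction": 0.95},
    # slow scan rate while accesses are mostly remote
    {"faults_per_s": 100, "seconds": 10, "local_fraction": 0.40},
]

if __name__ == "__main__":
    print(f"accumulated:   {accumulated_locality(phases):.3f}")
    print(f"time-weighted: {time_weighted_locality(phases):.3f}")
```

The phase with the higher fault rate contributes ten times as many
samples, so it dominates the accumulated figure.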

Yeah, condensing so much information from NUMA Balancing into a few
statistics is a challenge in itself, and locality is still not easy for a
NUMA newbie to understand :-P

Indeed, and if they do not take historical skew into account, they still
might not understand.

quoted

FYI, by monitoring locality, we found that the kvm vcpu thread is not
covered by NUMA Balancing: no matter how many maximum periods pass, the
counters do not increase, or increase only very slowly, although inside
the guest we are copying memory.

Later we found that such a task rarely exits to user space to trigger the
task work callbacks that the NUMA Balancing scan depends on, which helped
us realize the importance of enabling NUMA Balancing inside the guest with
the correct NUMA topology; a big performance risk, I'd say :-P

Which is a very interesting corner case in itself but also one that
could potentially have been inferred from monitoring numa_pte_updates in
/proc/vmstat, or on a per-task basis by monitoring /proc/PID/sched and
watching numa_scan_seq and total_numa_faults. Accumulating the information
on a per-cgroup basis would require a bit more legwork.
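
A minimal sketch of that kind of monitoring, assuming /proc/vmstat holds
one "name value" pair per line and /proc/PID/sched holds "name : value"
lines. The function names are hypothetical, and the scan counter is
matched by suffix on the assumption that it may be printed with a prefix
such as "mm->". Comparing two snapshots taken some time apart would show
whether a task is being scanned at all.

```python
from pathlib import Path


def parse_vmstat(text):
    """Parse /proc/vmstat-style text: one 'name value' pair per line."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def parse_sched(text):
    """Parse 'name : integer' lines; header and float lines are skipped."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.lstrip("-").isdigit():
            fields[name.strip()] = int(value)
    return fields


def numa_fields(sched_fields):
    """Pick numa_scan_seq (matched by suffix) and total_numa_faults."""
    picked = {}
    for name, value in sched_fields.items():
        if name.endswith("numa_scan_seq"):
            picked["numa_scan_seq"] = value
        elif name == "total_numa_faults":
            picked["total_numa_faults"] = value
    return picked


def snapshot(pid):
    """Read the live counters for one task (Linux with NUMA balancing)."""
    vm = parse_vmstat(Path("/proc/vmstat").read_text())
    task = numa_fields(parse_sched(Path(f"/proc/{pid}/sched").read_text()))
    return {"numa_pte_updates": vm.get("numa_pte_updates"), **task}
```

If numa_scan_seq and total_numa_faults stay flat between two snapshots
while the task is known to be touching memory, that points at a task that
is not being covered by the scanner.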

That doesn't work for daily monitoring...

Indeed, although at least /proc/vmstat is cheap to monitor and it could
at least be tracked whether the number of NUMA faults is abnormally low or
the ratio of remote to local hints is problematic.
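
As a rough illustration of such a check, the sketch below compares two
/proc/vmstat snapshots using the numa_hint_faults and
numa_hint_faults_local counters. The function name and both thresholds are
assumptions made for the example, not values suggested in this thread, and
the remote share is computed against all hinting faults rather than
against local faults to avoid dividing by zero.

```python
def hint_fault_report(before, after, min_faults=100, max_remote_share=0.5):
    """Compare two /proc/vmstat snapshots (dicts of counter -> int).

    min_faults and max_remote_share are illustrative thresholds only.
    """
    faults = after["numa_hint_faults"] - before["numa_hint_faults"]
    local = after["numa_hint_faults_local"] - before["numa_hint_faults_local"]
    remote = faults - local
    # Share of hinting faults in this interval that were remote.
    remote_share = remote / faults if faults else None
    return {
        "faults": faults,
        "remote_share": remote_share,
        "too_few_faults": faults < min_faults,
        "too_remote": remote_share is not None
        and remote_share > max_remote_share,
    }
```

A report is only meaningful for the interval between the two snapshots,
which is why the counters are differenced rather than read once.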

Besides, compared with locality, this requires a much deeper understanding
of the implementation, and it could be tough even for NUMA developers to
assemble all these statistics together.

My point is that even with the patch, the definition of locality is
subtle. At a single point in time, the locality might appear to be low
but it's due to an event that happened far in the past.

quoted

Maybe not a good example, but we are just trying to highlight that NUMA
Balancing can have issues in some cases, and we want them to be exposed
somehow, maybe by the locality.

Again, I'm somewhat neutral on the patch simply because I would not use
the information for debugging problems with NUMA balancing. I would try
using tracepoints and if the tracepoints were not good enough, I'd add or
fix them -- similar to what I had to do with sched_stick_numa recently.
The caveat is that I mostly look at this sort of problem as a developer.
Sysadmins have very different requirements, especially simplicity even
if the simplicity in this case is an illusion.

Fair enough, but I guess PeterZ still wants your Ack, so neutral means
refusal in this case :-(

I think the patch is functionally harmless and can be disabled but I also
would be wary of dealing with a bug report that was based on the numbers
provided by the locality metric. The bulk of the work related to the bug
would likely be spent on trying to explain the metric and I've dealt with
quite a few bugs that were essentially "We don't like this number and think
something is wrong because of it -- fix it". Even then, I would want the
workload isolated and then vmstat recorded over time to determine whether
it's a persistent problem or not. That's the reason why I'm reluctant to
ack it.
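
Recording vmstat over time could look roughly like the sketch below. All
names here are hypothetical, the 0.8 cut-off is arbitrary, and the reader
assumes /proc/vmstat holds one "name value" pair per line. The sleep and
clock functions are parameters so the sampling loop can be exercised
without waiting.

```python
import time


def read_vmstat(path="/proc/vmstat"):
    """Return the counters in a /proc/vmstat-style file as a dict."""
    counters = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2 and parts[1].isdigit():
                counters[parts[0]] = int(parts[1])
    return counters


def record(read_counters, samples, interval_s,
           sleep=time.sleep, clock=time.time):
    """Collect `samples` timestamped readings, `interval_s` seconds apart."""
    rows = []
    for i in range(samples):
        rows.append((clock(), read_counters()))
        if i + 1 < samples:
            sleep(interval_s)
    return rows


def per_interval(rows, name):
    """Change in one counter between consecutive readings."""
    return [b[name] - a[name] for (_, a), (_, b) in zip(rows, rows[1:])]


def is_persistent(deltas, is_bad, min_fraction=0.8):
    """True when at least `min_fraction` of the intervals look bad."""
    if not deltas:
        return False
    bad = sum(1 for d in deltas if is_bad(d))
    return bad / len(deltas) >= min_fraction
```

For example, record(read_vmstat, samples=60, interval_s=10) would gather
ten minutes of readings, and per_interval(rows, "numa_hint_faults") would
show whether a low fault count is a one-off or holds across the run.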

I fully acknowledge that this may have value for sysadmins and may be a
good enough reason to merge it for environments that typically build and
configure their own kernels. I doubt that general distributions would
enable it but that's a guess.

BTW, what do you think about the documentation in the second patch?

I think the documentation is great, it's clear and explains itself well.

Do you think it's necessary to have a doc to explain NUMA related statistics?

It would be nice but AFAIK, the stats in vmstat are not documented.
They are there because recording them over time can be very useful when
dealing with user bug reports.

-- 
Mel Gorman
SUSE Labs
