Re: [RFC PATCH 2/8] Documentation: arm: define DT cpu capacity bindings

From: Juri Lelli <hidden>
Date: 2015-12-15 12:22:45
Also in: linux-arm-kernel, linux-pm, lkml

On 14/12/15 16:59, Mark Brown wrote:

On Mon, Dec 14, 2015 at 12:36:16PM +0000, Juri Lelli wrote:

On 11/12/15 17:49, Mark Brown wrote:

The purpose of the capacity values is to influence the scheduler
behaviour and hence performance.  Without a concrete definition they're
just magic numbers which have meaning only in terms of their effect on
the performance of the system.  That is a sufficiently complex outcome
to ensure that there will be an element of taste in what the desired
outcomes are.  Sounds like tuneables to me.

Capacity values are meant to describe asymmetry (if any) of the system
CPUs to the scheduler. The scheduler can then use this additional bit of
information to try to make better scheduling decisions. Yes, having these
values available will end up giving you better performance, but I guess
this applies to any information we provide to the kernel (and scheduler);
the less dumb a subsystem is, the better we can make it work.

This information is a magic number, there's never going to be a right
answer.  If it needs changing it's not like the kernel is modeling a
concrete thing like the relative performance of the A53 and A57 poorly
or whatever, it's just that the relative values of number A and number B
are not what the system integrator desires.

If you are saying people should use other, more sensible, ways of
specifying the final values that actually get used in production then
why take the defaults from direct numbers in DT in the first place?  If you
are saying that people should tune and then put the values in here then
that's problematic for the reasons I outlined.

IMHO, people should come up with default values that describe
heterogeneity in their system. Then use other ways to tune the system at
run time (depending on the workload maybe).

My argument is that they should be describing the heterogeneity of their
system by describing concrete properties of their system rather than by
providing magic numbers.

As said, I understand your concerns; but what I still don't get is
where CPU capacity values are so different from, say, idle states
min-residency-us. AFAIK there is a per-SoC benchmarking phase required
to come up with those values as well; you have to pick some benchmark
that stresses worst case entry/exit while measuring energy, then make
calculations that tell you when it is wise to enter a particular idle
state. Ideally we should derive min residency from specs, but I'm not
sure that is how it works in practice.

Those at least have a concrete physical value that it is possible to
measure in a describable way that is unlikely to change based on the
internals of the kernel.  It would be kind of nice to have the broken
down numbers for entry time, exit time and power burn in suspend but
it's not clear it's worth the bother.  It's also one of these things
where we don't have any real proxies that get us anywhere in the
ballpark of where we want to be.

I'm proposing to add a new value because I couldn't find any proxies in
the current bindings that bring us anywhere close to what we need. If I
failed in looking for them, and they actually exist, I'll personally be
more than happy to just rely on them instead of adding more stuff :-).

Interestingly, to me it sounds like we could actually use your first
paragraph above almost as it is to describe how to come up with capacity
values. In the documentation I put the following:

"One simple way to estimate CPU capacities is to iteratively run a
well-known CPU user space benchmark (e.g., sysbench, dhrystone, etc.) on
each CPU at maximum frequency and then normalize values w.r.t. the best
performing CPU."

I don't see why this should change if we decide that the scheduler has
to change in the future.
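
As a rough illustration of the normalization step in the quoted
documentation text, here is a minimal sketch in C. It assumes that
benchmark scores are "higher is faster" and that the target scale is
1024 (the value of SCHED_CAPACITY_SCALE); normalize_capacities() and
CAPACITY_SCALE are hypothetical names used only for this example, not
part of the patch or of the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed scale for this sketch; 1024 matches SCHED_CAPACITY_SCALE. */
#define CAPACITY_SCALE 1024u

/*
 * Hypothetical helper: normalize per-CPU benchmark scores (higher is
 * faster) so that the best performing CPU gets CAPACITY_SCALE and every
 * other CPU gets a proportional share, rounded to the nearest integer.
 * Scores are assumed small enough that score * CAPACITY_SCALE does not
 * overflow 64 bits.
 */
static void normalize_capacities(const uint64_t *score, uint32_t *capacity,
				 size_t n)
{
	uint64_t best = 0;
	size_t i;

	/* Find the score of the best performing CPU. */
	for (i = 0; i < n; i++)
		if (score[i] > best)
			best = score[i];

	/* At least one CPU must have produced a non-zero score. */
	assert(best > 0);

	/* Scale each score relative to the best one. */
	for (i = 0; i < n; i++)
		capacity[i] = (uint32_t)((score[i] * CAPACITY_SCALE + best / 2) / best);
}
```

With made-up scores of 2000 for two big CPUs and 1000 for two little
CPUs, this yields capacities of 1024 and 512 respectively. Nothing in
the calculation depends on scheduler internals; only the choice of
benchmark does.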

Also, looking again at section 2 of idle-states bindings docs, we have a
nice and accurate description of what min-residency is, but not much
info about how we can actually measure that. Maybe, expanding the docs
section regarding CPU capacity could help?

It also seems a bit strange to expect people to do some tuning in one
place initially and then additional tuning somewhere else later, from
a user point of view I'd expect to always do my tuning in the same
place.

I think that runtime tuning needs are much more complex and have finer
grained needs than what you can achieve by playing with CPU capacities.
And I agree with you, users should only play with these other methods
I'm referring to; they should not mess around with platform description
bits. They should provide information about runtime needs, then the
scheduler (in this case) will do its best to give them acceptable
performance using improved knowledge about the platform.

So then why isn't it adequate to just have things like the core types in
there and work from there?  Are we really expecting the tuning to be so
much better than it's possible to come up with something that's so much
better on the scale that we're expecting this to be accurate that it's
worth just jumping straight to magic numbers?

I take your point here that having fine grained values might not really
give us appreciable differences (that is also why I proposed the
capacity-scale in the first instance), but I'm not sure I'm getting what
you are proposing here.

Today, and for arm only, we have a static table representing CPUs
"efficiency":

 /*
  * Table of relative efficiency of each processors
  * The efficiency value must fit in 20bit and the final
  * cpu_scale value must be in the range
  *   0 < cpu_scale < 3*SCHED_CAPACITY_SCALE/2
  * in order to return at most 1 when DIV_ROUND_CLOSEST
  * is used to compute the capacity of a CPU.
  * Processors that are not defined in the table,
  * use the default SCHED_CAPACITY_SCALE value for cpu_scale.
  */
 static const struct cpu_efficiency table_efficiency[] = {
 	{"arm,cortex-a15", 3891},
 	{"arm,cortex-a7",  2048},
 	{NULL, },
 };

When the clock-frequency property is defined in DT, we try to find a match
for the compatibility string in the table above and then use the
associated number to compute the capacity. Are you proposing to have
something like this for arm64 as well?
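
To make the lookup concrete, here is a simplified, self-contained
sketch of the matching step. It is not the code in
arch/arm/kernel/topology.c: the struct layout, the raw_capacity() name
and the "frequency shifted right by 20 bits, times efficiency" formula
are assumptions made for illustration, and the DT parsing and the later
rescaling into the cpu_scale range are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of a table entry; not shown in this thread. */
struct cpu_efficiency {
	const char *compatible;
	unsigned long efficiency;
};

/* Same values as the table quoted above. */
static const struct cpu_efficiency table_efficiency[] = {
	{"arm,cortex-a15", 3891},
	{"arm,cortex-a7",  2048},
	{NULL, 0},
};

/*
 * Hypothetical helper: return an unnormalized capacity for a CPU, given
 * its compatible string and its clock-frequency in Hz. Returns 0 when
 * the core type is not in the table, in which case a caller would fall
 * back to the default capacity.
 */
static unsigned long raw_capacity(const char *compatible, uint32_t clock_hz)
{
	const struct cpu_efficiency *e;

	for (e = table_efficiency; e->compatible; e++)
		if (strcmp(e->compatible, compatible) == 0)
			/* Roughly "MHz times efficiency"; fits in 32 bits. */
			return (unsigned long)(clock_hz >> 20) * e->efficiency;

	return 0;
}
```

The sketch shows why every new core type needs a new table entry: a
CPU whose compatible string is missing simply gets no per-type
information at all.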

BTW, the only info I could find about those numbers is from this thread

 http://lists.infradead.org/pipermail/linux-arm-kernel/2012-June/104072.html

Vincent, do we have more precise information about these numbers
somewhere else?

If I understand how that table was created, how do we think we will
extend it in the future to allow newer core types (say we replicate this
solution for arm64)?  It seems that we have to change it, rescaling
values, each time we have a new core on the market. How can we come up
with relative numbers, in the future, comparing newer cores to old ones
(that might be already out of the market by that time)?

Doing that and then switching to some other interface for real tuning
seems especially odd and I'm not sure that's something that users are
going to expect or understand.

As I'm saying above, users should not care about this first step of
platform description; no more than they care about other bits
in DTs that describe their platform.

That may be your intention but I don't see how it is realistic to expect
that this is what people will actually understand.  It's a number, it
has an effect and it's hard to see that people won't tune it, it's not
like people don't have to edit DTs during system integration.  People
won't reliably read documentation or look in mailing list threads and
other than that it has all the properties of a tuning interface.

Eh, sad but true. I guess we can, as we usually do, put more effort into
documenting how things are supposed to be used. Then, if people think
that they can make their system perform better without looking at
documentation or asking around, I'm not sure there is much we could do
to prevent them from doing things wrong. There are already a lot of things
people shouldn't touch if they don't know what they are doing. :-/

There's a tension here between what you're saying about people not being
supposed to care much about the numbers for tuning and the very fact
that there's a need for the DT to carry explicit numbers.

My point is that people with tuning needs shouldn't even look at DTs,
but put all their efforts into describing (using appropriate APIs) their
needs and how they apply to the workload they care about. Our job is to
put together information coming from users and knowledge of system
configuration to provide people with the desired outcomes.

Best,

- Juri
