Re: [PATCH v3] PM / QoS: Introduce new classes: DMA-Throughput and DVFS-Latency

From: mark gross <hidden>
Date: 2012-03-18 17:07:06
Also in: lkml

On Fri, Mar 16, 2012 at 05:30:33PM +0900, MyungJoo Ham wrote:

On Sun, Mar 11, 2012 at 7:53 AM, Rafael J. Wysocki [off-list ref] wrote:


On Friday, March 09, 2012, MyungJoo Ham wrote:


On Thu, Mar 8, 2012 at 12:47 PM, mark gross [off-list ref] wrote:


On Wed, Mar 07, 2012 at 02:02:01PM +0900, MyungJoo Ham wrote:


1. CPU_DMA_THROUGHPUT

...


2. DVFS_LATENCY

The cpu_dma_throughput looks ok to me.  I do, however, wonder about the
dvfs_lat_pm_qos.  Should that knob be exposed to user mode?  Does that
matter so much?  Why can't dvfs_lat use the cpu_dma_lat?

BTW I'll be out of town for the next 10 days and probably will not get
to this email account until I get home.

--mark

1. Should DVFS Latency be exposed to user mode?

It would depend on the policy of the given system; however, yes, there
are systems that require a user interface for DVFS Latency.
Taking user input response as an example (response to a click,
typing, touching, etc.), a user program (probably platform s/w or
middleware) may issue QoS requests. Besides, when a new "application"
is starting, such "middleware" may want faster responses from DVFS
mechanisms.

But this is a global knob, isn't it?  And it seems that a per-device one
is needed rather than that?

It also applies to your CPU_DMA_THROUGHPUT thing, doesn't it?


Yes, the two are global knobs, and both control multiple
devices simultaneously, not just a single device. I suppose per-device
QoS is appropriate for QoS requests directed to a single device. Am I
right about this one?


Let's assume that, in an example system, we have devfreq on the GPU,
memory interface, and main bus, plus CPUfreq (Exynos5 will have them all
separated).

If we use per-device QoS for DVFS LATENCY, in order to control the
DVFS response latency, we will need to make QoS requests to all the
four devices independently, not to the global DVFS LATENCY QOS CLASS.
There, we could have a shared single QoS request list for these four
DVFS devices, saying that the DVFS response should be done in "50ms"
after a sudden utilization increase.

We may be able to use "dev_pm_qos_add_notifier()" for a virtual device
representing "DVFS Latency" or "DMA Throughput" and let the GPU, CPU,
main-bus, and memory-interface listen to the events from the virtual
device. Hmm..., do you recommend this approach? creating a device
representing "DVFS" as a whole (both CPUFreq and device drivers of
devfreq).
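
To make the virtual-device idea concrete, here is a minimal userspace C model of the fan-out (not kernel code, and not a use of the real dev_pm_qos_add_notifier() API). One object holds the shared "DVFS Latency" value and notifies every registered DVFS driver when it changes. All names (virt_qos_dev, virt_qos_add_notifier, virt_qos_update, dvfs_driver) are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model; every name below is hypothetical, not kernel API. */
#define MAX_LISTENERS 8

typedef void (*qos_notify_fn)(void *ctx, int new_latency_ms);

struct qos_listener {
    qos_notify_fn fn;
    void *ctx;
};

/* Stand-in for a virtual "DVFS Latency" device owning one shared value. */
struct virt_qos_dev {
    int latency_ms;
    struct qos_listener listeners[MAX_LISTENERS];
    size_t nr_listeners;
};

/* Register a listener; returns 0 on success, -1 if the table is full. */
static int virt_qos_add_notifier(struct virt_qos_dev *d,
                                 qos_notify_fn fn, void *ctx)
{
    assert(d != NULL && fn != NULL);
    if (d->nr_listeners >= MAX_LISTENERS)
        return -1;
    d->listeners[d->nr_listeners].fn = fn;
    d->listeners[d->nr_listeners].ctx = ctx;
    d->nr_listeners++;
    return 0;
}

/* Update the shared constraint; notify each listener only on a change. */
static void virt_qos_update(struct virt_qos_dev *d, int latency_ms)
{
    assert(d != NULL);
    if (d->latency_ms == latency_ms)
        return;
    d->latency_ms = latency_ms;
    for (size_t i = 0; i < d->nr_listeners; i++)
        d->listeners[i].fn(d->listeners[i].ctx, latency_ms);
}

/* Example listener: a DVFS driver that records the latest constraint. */
struct dvfs_driver {
    const char *name;
    int latency_ms;
};

static void dvfs_driver_on_qos(void *ctx, int new_latency_ms)
{
    struct dvfs_driver *drv = ctx;
    drv->latency_ms = new_latency_ms;
}
```

In this model, registering the CPU, GPU, main-bus, and memory-interface drivers and then calling virt_qos_update() with 50 leaves all four holding a 50 ms constraint from a single request.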

CPU_DMA_THROUGHPUT is quite similar to CPU_DMA_LATENCY. However, we
think it is additionally needed because many IPs (in-SoC devices) need
to specify their DMA usage in "kbytes/sec", not "usecs/ops". For
example, a video-decoding chip device driver may say it requires
"750000kbytes/sec" for 1080p60, "300000kbytes/sec" for 720p60, and so
on, which affects CPUfreq, memory-interface, and main-bus at the same
time.
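
As a rough illustration of how such a class would combine requests, here is a minimal userspace C model (not kernel code). It assumes throughput requests aggregate by maximum, as the existing PM_QOS_NETWORK_THROUGHPUT class does, while latency requests aggregate by minimum like PM_QOS_CPU_DMA_LATENCY; whether DMA throughput should instead be summed across requesters is left open here. The names qos_class, qos_add_request, and qos_target_value are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only; this is not the kernel's pm_qos implementation. */
enum qos_type { QOS_MIN, QOS_MAX };   /* latency-like vs throughput-like */

#define QOS_MAX_REQUESTS 16

struct qos_class {
    enum qos_type type;
    int default_value;                /* reported when no request is active */
    int requests[QOS_MAX_REQUESTS];
    size_t nr_requests;
};

/* Add one request; returns its slot index, or -1 if the list is full. */
static int qos_add_request(struct qos_class *c, int value)
{
    assert(c != NULL);
    if (c->nr_requests >= QOS_MAX_REQUESTS)
        return -1;
    c->requests[c->nr_requests] = value;
    return (int)c->nr_requests++;
}

/* Aggregate all active requests into the single value the class exposes. */
static int qos_target_value(const struct qos_class *c)
{
    assert(c != NULL);
    if (c->nr_requests == 0)
        return c->default_value;

    int target = c->requests[0];
    for (size_t i = 1; i < c->nr_requests; i++) {
        int v = c->requests[i];
        /* MIN keeps the tightest latency; MAX keeps the largest demand. */
        if (c->type == QOS_MIN ? v < target : v > target)
            target = v;
    }
    return target;
}
```

With the 720p60 and 1080p60 requests above both active, a max-type class in this model reports 750000 kbytes/sec as its target, and CPUfreq, the memory interface, and the main bus would each react to that one value.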

I have an example of a need for cpu_dma_throughput for x86 SoCs as
well.  Mostly my example comes down to ondemand thinking the workload
is low (the GPU is doing all the work), yet the workload needs higher
clock rates between frame times to avoid buffer underruns in the gfx
pipe.

My version of the patch didn't fly too well because it failed to offer a
scalable definition of the units of cpu_dma_throughput.  I tried using
kHz as the unit (the unit used in cpufreq).  However, applications
written to assume Hz units on one system would need to be rewritten on the
next.  Perhaps using bandwidth would be better than throughput?


2. Does DVFS Latency matter?

Yes, in our experiments with Exynos4210 boards (roughly Galaxy S2
equivalents; not exactly, as I ran them on Tizen rather than Android),
we could see a noticeable difference with the naked eye in
user-input responses. When we shortened the DVFS polling interval on
touches, the touch responses improved greatly; e.g., from losing 10
frames to losing 0 or 1 frame for a sudden input rush.

Well, this basically means PM QoS matters, which is kind of obvious.
It doesn't mean that it can't be implemented in a better way, though.

For DVFS-Latency and DMA-Throughput, I think a normal pm-qos-dev (one
device per one qos knob) isn't appropriate because there are multiple
devices that are required to react simultaneously.

It is possible to let multiple devices react by adding notifiers with
dev_pm_qos_add_notifier(). However, I felt that this wasn't its purpose
and that it might make things ugly. Anyway, was allowing
multiple devices to change their frequencies/voltages for a single
per-device QoS list the purpose of dev_pm_qos_add_notifier()?


Just to throw out an idea, in case that was the purpose:
if we are going to do this (supporting multiple
devices per QoS knob without adding a QoS class), we had better create a
"qos class device" in /drivers/qos/ and let it handle the
multiple devices that depend on a single "qos class". This would probably
transform a "global PM-QoS class" that notifies related devices
into a "QoS class device" that notifies related devices.


3. Why not replace DVFS Latency w/ CPU-DMA-Latency/Throughput?

When we implement the user-input response enhancement with CPU-DMA QoS
requests, PM-QoS unconditionally increases CPU and bus
frequencies/voltages on user input. However, in many cases that is
unnecessary: a user input means that there will be unexpected
changes soon, but it does not mean that the load will
increase. Thus, allowing the DVFS mechanism to react faster was enough to
shorten the response time without raising frequencies and voltages
when not needed. There was a significant difference in power
consumption with this change when the user inputs did not involve
heavy graphics work; e.g., typing a text message.

Again, you're arguing for having PM QoS rather than not having it.  You don't
have to do that. :-)

Generally speaking, I don't think we should add any more PM QoS "classes"
as defined in pm_qos.h, since they are global and there's only one
list of requests per class.  While that may be good for CPU power
management (in an SMP system all CPUs are identical, so the same list of
requests may be applied to all of them), it generally isn't for I/O
devices (some of them work in different time scales, for example).

So, for example, most likely, a list of PM QoS requests for storage devices
shouldn't be applied to input devices (keyboards and mice to be precise) and
vice versa.

On the other hand, I don't think that applications should access PM QoS
interfaces associated with individual devices directly, because they may
not have enough information about the relationships between devices in the
system.  So, perhaps, there needs to be an interface allowing applications
to specify their PM QoS expectations in a general way (e.g. "I want <number>
disk I/O throughput") and a code layer between that interface and device
drivers translating those expectations into PM QoS requests for specific
devices.

With DVFS Latency PM QoS Class, we can say "I want the system to react
in 50ms for any sudden utilization increases.". Without it, we should
say, for example, "CPUFreq/Ondemand should set interval at 25ms,
Devfreq/Bus should set interval at 25ms, and Devfreq/GPU should set
interval at 10ms."

And with the CPU DMA Throughput PM QoS Class, we can say "I want 1000000
kbytes/sec DMA transfer". Without it, we would have to say "Memory-Interface
at 1000000 kbytes/sec, Exynos4412 core should be at least 500MHz, and
Bus should be at least 166MHz".
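
The translation from one global "react within 50ms" request to per-device polling intervals could look roughly like the userspace C sketch below. It is only an illustration under a stated assumption: each governor needs a fixed number of polling samples before it raises the frequency, so its interval is the latency budget divided by that sample count, and it never polls more slowly than its own default. The function name dvfs_polling_interval_ms and the sample counts are hypothetical; they are chosen so that the 25ms/25ms/10ms figures above fall out of a 50ms budget.

```c
#include <assert.h>

/*
 * Hypothetical helper: derive a governor's polling interval from a global
 * DVFS latency budget. Assumes the governor reacts after `samples_needed`
 * consecutive polls; this is an illustrative model, not kernel behaviour.
 */
static int dvfs_polling_interval_ms(int latency_budget_ms,
                                    int samples_needed,
                                    int default_interval_ms)
{
    assert(latency_budget_ms > 0);
    assert(samples_needed > 0);
    assert(default_interval_ms > 0);

    int interval = latency_budget_ms / samples_needed;
    if (interval < 1)
        interval = 1;                 /* never poll faster than 1 ms */

    /* The QoS request may only tighten the interval, never relax it. */
    return interval < default_interval_ms ? interval : default_interval_ms;
}
```

Under these assumptions, a 50ms budget gives 25ms for a governor needing two samples (CPUFreq/Ondemand, Devfreq/Bus) and 10ms for one needing five (Devfreq/GPU), while a driver whose default interval is already shorter keeps its default.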

What it comes down to is that we need to see if we can identify good
abstractions that are portable / scalable across ISAs and boards,
such that applications would not need to be changed to work correctly
across all of them.

One issue I have with adding a single DVFS latency and throughput pm-qos
parameter is that which device DVFS *really* refers to changes from one
board to the next, which makes it impossible to abstract to user mode.

--mark
