Thread (79 messages), 7 authors, 2018-10-26

Re: [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control

From: Kevin Traynor <hidden>
Date: 2018-06-27 17:33:06

On 06/26/2018 12:40 PM, Radu Nicolau wrote:
From: Liang Ma <redacted>

1. Abstract

For packet processing workloads such as DPDK, polling is continuous.
This means CPU cores always show 100% busy, independent of how much work
those cores are doing. It is critical to accurately determine how busy
a core is, for the following reasons:

   * No indication of overload conditions

   * Users do not know how much real load is on a system, resulting in
     wasted energy as no power management is utilized

Tried and failed schemes include calculating the cycles required from
the load on the core, in other words the busyness. For example,
how many cycles it costs to handle each packet and determining the
frequency cost per core. Due to the varying nature of traffic, types of
frames and cost in cycles to process, this mechanism quickly becomes
complex, whereas a simple scheme is required to solve these problems.

2. Proposed solution

For all polling mechanisms, the proposed solution focuses on how many times
an empty poll is executed instead of calculating how many cycles it costs to
handle each packet. A lower empty poll count means the current core is busy
processing its workload, so a higher frequency is needed. A
high empty poll count indicates the current core has plenty of spare time,
so the frequency can be lowered.
Hi Liang/Radu,

I can see the benefit of providing an API for the application to provide
the num rx from each poll, and then have the library step down/up the
freq based on that. However, not sure I follow why you are adding the
complexity of defining power states and training modes.
2.1 Power state definition:

	LOW:  the frequency is used for purge mode.

	MED:  the frequency is used to process modest traffic workload.

	HIGH: the frequency is used to process busy traffic workload.
Why does there need to be user defined freq levels? Why not just keep
stepping down the freq until there is some user-defined threshold of
zero polls reached. e.g. keep stepping down until 10% of polls are zero
poll and have a tail of some time (perhaps user defined) for the step down.
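
To make the suggestion above concrete, here is a minimal sketch of such a threshold-driven controller. Everything in it is hypothetical: the names, the four-level frequency table, and in particular the step-up rule (step back up as soon as a window falls to or below the threshold), which the mail does not specify. It only computes a frequency index and does not touch any librte_power call.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch, not part of the patch or of librte_power. */
#define SKETCH_NUM_FREQS 4          /* index 0 = highest frequency (assumed) */

struct sketch_lcore_ctl {
	uint64_t polls;             /* polls seen in the current window */
	uint64_t empty_polls;       /* polls that returned zero packets */
	unsigned int freq_idx;      /* current frequency index */
	unsigned int idle_windows;  /* consecutive windows above the threshold */
};

/* Record the result of one poll (nb_rx as returned by the rx call). */
static void
sketch_record_poll(struct sketch_lcore_ctl *c, uint16_t nb_rx)
{
	c->polls++;
	if (nb_rx == 0)
		c->empty_polls++;
}

/*
 * Close the current window and return the frequency index to use next.
 * empty_pct_threshold: e.g. 10 for "10% of polls are zero polls".
 * tail_windows: how many consecutive idle windows are required before
 * stepping down (the user-defined "tail").
 */
static unsigned int
sketch_end_window(struct sketch_lcore_ctl *c,
		  unsigned int empty_pct_threshold,
		  unsigned int tail_windows)
{
	assert(c->freq_idx < SKETCH_NUM_FREQS);

	if (c->polls > 0) {
		/* empty% > threshold, compared without a division */
		int idle = c->empty_polls * 100 >
			   (uint64_t)empty_pct_threshold * c->polls;

		if (idle) {
			c->idle_windows++;
			if (c->idle_windows >= tail_windows &&
			    c->freq_idx + 1 < SKETCH_NUM_FREQS) {
				c->freq_idx++;          /* step down */
				c->idle_windows = 0;
			}
		} else {
			c->idle_windows = 0;
			if (c->freq_idx > 0)
				c->freq_idx--;          /* assumed step-up rule */
		}
	}

	c->polls = 0;
	c->empty_polls = 0;
	return c->freq_idx;
}
```

With a scheme like this the application would only choose the threshold and the tail; no LOW/MED/HIGH levels or training period are needed.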
2.2 There are two phases to establish the power management system:

	a. Initialization/Training phase. With no traffic passing through,
	  the system measures the average empty poll count in each of the
	  LOW/MED/HIGH power states. Those averages become the baseline
	  for the normal phase. The system collects every core's counters
	  every 100ms. The training phase takes 5 seconds.
This is requiring an application to sit for 5 secs in order to train and
align poll numbers with states? That doesn't seem realistic to me.
	b. Normal phase. When real traffic passes through, the system
	  compares the run-time moving average of empty polls with the
	  baseline, then decides whether to move to the HIGH or MED power
	  state. The system collects every core's counters every 10ms.
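
For readers trying to picture the normal phase, a rough sketch of a moving-average comparison follows. It is not taken from the patch: the 8-sample window, the names, and the decision rule (go to HIGH when the average drops below a percentage of the MED baseline) are all assumptions made only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch; the patch's real algorithm may differ. */
#define SKETCH_WINDOW 8             /* assumed number of 10ms samples */

enum sketch_state { SKETCH_MED, SKETCH_HIGH };

struct sketch_avg {
	uint64_t window[SKETCH_WINDOW];
	unsigned int next;          /* slot for the next sample */
	unsigned int count;         /* valid samples, up to SKETCH_WINDOW */
};

/* Add one empty-poll sample and return the simple moving average. */
static uint64_t
sketch_avg_push(struct sketch_avg *a, uint64_t sample)
{
	uint64_t sum = 0;
	unsigned int i;

	a->window[a->next] = sample;
	a->next = (a->next + 1) % SKETCH_WINDOW;
	if (a->count < SKETCH_WINDOW)
		a->count++;

	for (i = 0; i < a->count; i++)
		sum += a->window[i];
	return sum / a->count;
}

/*
 * Assumed rule: few empty polls relative to the trained MED baseline
 * means the core is busy, so ask for HIGH; otherwise stay at MED.
 */
static enum sketch_state
sketch_decide(uint64_t avg_empty, uint64_t med_baseline,
	      unsigned int busy_pct)
{
	assert(busy_pct <= 100);
	return avg_empty * 100 < med_baseline * busy_pct ?
	       SKETCH_HIGH : SKETCH_MED;
}
```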
I only reviewed this commit msg and API usage, so maybe I didn't fully
get the use case or details, but it seems quite awkward from an
application perspective IMHO.
3. Proposed  API

1.  rte_power_empty_poll_stat_init(void);
which is used to initialize the power management system.
 
2.  rte_power_empty_poll_stat_free(void);
which is used to free the resources held by the power management system.
 
3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
which is used to update a specific core's empty poll counter; not thread safe.
 
4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
which is used to update a specific core's valid poll counter; not thread safe.
 
I think 4 could be dropped and 3 used instead. It could be a simple API
that takes in the core and nb_pkts from a poll. Seems clearer than
making a separate API for a special value of nb_pkts (i.e. 0) and the
application having to check to know which API should be called.
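
Roughly what I have in mind, as a sketch only. The function name, the stats layout and the lcore limit below are made up for illustration and are not in the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of a single update entry point. */
#define SKETCH_MAX_LCORE 128        /* assumed limit for the sketch */

struct sketch_poll_stats {
	uint64_t empty_polls;       /* polls that returned zero packets */
	uint64_t valid_polls;       /* polls that returned >= 1 packet */
};

static struct sketch_poll_stats sketch_stats[SKETCH_MAX_LCORE];

/*
 * One call per poll: the library, not the application, decides which
 * counter to bump. Like the proposed API, this is not thread safe and
 * is meant to be called only by the lcore that owns the counters.
 */
static void
sketch_poll_stat_update(unsigned int lcore_id, uint16_t nb_pkts)
{
	assert(lcore_id < SKETCH_MAX_LCORE);

	if (nb_pkts == 0)
		sketch_stats[lcore_id].empty_polls++;
	else
		sketch_stats[lcore_id].valid_polls++;
}
```

The application would then just pass the return value of its rx burst call (e.g. rte_eth_rx_burst()) straight through on every iteration, with no branch on its side.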
5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's empty poll counter.
 
6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's valid poll counter.

7.  rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
which allows the user to customize the frequency of a power state.

8.  rte_power_empty_poll_setup_timer(void);
which is used to set up the timer/callback that processes all of the above counters.
The new API should be experimental
ChangeLog:
v2: fix some coding style issues
v3: rename the filename, API name.
v4: updated makefile and symbol list

Signed-off-by: Liang Ma <redacted>
Signed-off-by: Radu Nicolau <redacted>
---
 lib/librte_power/Makefile               |   5 +-
 lib/librte_power/meson.build            |   5 +-
 lib/librte_power/rte_power_empty_poll.c | 521 ++++++++++++++++++++++++++++++++
 lib/librte_power/rte_power_empty_poll.h | 202 +++++++++++++
 lib/librte_power/rte_power_version.map  |  14 +-
 5 files changed, 742 insertions(+), 5 deletions(-)
 create mode 100644 lib/librte_power/rte_power_empty_poll.c
 create mode 100644 lib/librte_power/rte_power_empty_poll.h
Is there any in-tree documentation planned?

Kevin.