Re: Short process stall after assigning it to a cgroup

From: Ronny Meeus <hidden>
Date: 2021-06-25 07:32:59

On Wed, 23 Jun 2021 at 19:28, Michal Koutný [off-list ref] wrote:

Hello Ronny.

On Mon, Jun 14, 2021 at 05:29:35PM +0200, Ronny Meeus [off-list ref] wrote:

quoted

All apps are running in the realtime domain and I'm using kernel 4.9
and cgroup v1. [...]  when it enters a full load condition [...]
I start to gradually reduce the budget of the cgroup until the system
is idle enough.

Does your application have RT requirements, or is there another reason
why you use group RT allocations? (When your app seems to require all
CPU time, you decide to curb it. And it still fulfills RT requirements?)

The application does not have strict RT requirements.
The main reason for using cgroups is to reduce the load of the
high-consumer applications when the system is under high load, so that
lower-priority apps also get a portion of the CPU.
We initially worked with fixed cgroups, but this has the big
disadvantage that the unused budget configured in one group cannot be
used by another group, and as such the processing power is basically
lost.
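
For illustration only, one step of the "gradually reduce the budget
until the system is idle enough" loop could look like the sketch below.
The function name, the idle-percentage input, the step size and the
floor are all hypothetical; the real controller is not shown in this
thread.

```python
def next_runtime_us(runtime_us, idle_pct, target_idle_pct, step_us, floor_us):
    """Return the RT runtime budget to use for the next control interval.

    All parameters are illustrative assumptions:
      runtime_us       current cpu.rt_runtime_us of the heavy group
      idle_pct         measured system idle percentage
      target_idle_pct  idle percentage considered "idle enough"
      step_us          how much budget to take away per step
      floor_us         lowest budget the group is ever given
    """
    if idle_pct >= target_idle_pct:
        # System is idle enough: leave the budget alone.
        return runtime_us
    # System is too busy: shrink the budget, but never below the floor.
    return max(floor_us, runtime_us - step_us)
```

The value returned here would then be written to the group's
cpu.rt_runtime_us by whatever code manages the cgroup.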

quoted

But sometimes, immediately after the process assignment, it stops for
a short period (something like 1 or 2s) and then starts to consume 40%
again.

What if you reduce cpu.rt_period_us (and cpu.rt_runtime_us
proportionally)? (Are the pauses shorter?) Is there any useful info in
/proc/$PID/stack during these periods?

I tried shorter periods, e.g. 100ms instead of 1s, but the problem is
still observed.
Using a proportionally reducing algorithm is more complex to implement
and I do not think it would solve the issue either.
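
To make the "reduce both proportionally" suggestion concrete, here is a
minimal sketch that shortens the period while keeping the
runtime/period ratio unchanged. The cgroup directory is an assumption
passed in by the caller, the helper names are made up, and the 1s/400ms
numbers in the usage note are only illustrative. The write order is
chosen so that the runtime never exceeds the period in between the two
writes, since the kernel is expected to reject such a setting.

```python
from pathlib import Path


def scale_rt_budget(period_us, runtime_us, new_period_us):
    """Return the runtime that keeps runtime/period constant for new_period_us."""
    if period_us <= 0 or new_period_us <= 0:
        raise ValueError("periods must be positive")
    if not 0 <= runtime_us <= period_us:
        raise ValueError("runtime must be between 0 and the period")
    return runtime_us * new_period_us // period_us


def apply_rt_budget(cgroup_dir, new_period_us):
    """Rescale cpu.rt_period_us / cpu.rt_runtime_us in one cgroup v1 directory.

    cgroup_dir is an assumption, e.g. a group under the cpu controller mount.
    Returns the (period, runtime) pair that was written.
    """
    cgroup = Path(cgroup_dir)
    period_file = cgroup / "cpu.rt_period_us"
    runtime_file = cgroup / "cpu.rt_runtime_us"

    period_us = int(period_file.read_text())
    runtime_us = int(runtime_file.read_text())
    new_runtime_us = scale_rt_budget(period_us, runtime_us, new_period_us)

    if new_period_us < period_us:
        # Shrinking: lower the runtime first so it never exceeds the period.
        runtime_file.write_text(str(new_runtime_us))
        period_file.write_text(str(new_period_us))
    else:
        # Growing: raise the period first for the same reason.
        period_file.write_text(str(new_period_us))
        runtime_file.write_text(str(new_runtime_us))
    return new_period_us, new_runtime_us
```

For example, scale_rt_budget(1_000_000, 400_000, 100_000) gives 40_000,
i.e. the same 40% share with a 100ms period.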

About the stack: it is difficult to know from the software when the
issue happens, so dumping the stack is not easy, but it is a good
idea.
I will certainly think about it.
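
One possible way to catch the moment from software, sketched under
several assumptions: poll the CPU time in /proc/$PID/stat and dump
/proc/$PID/stack once the process has made no progress for longer than
one RT period (shorter gaps are normal throttling). The function names
and thresholds are hypothetical. Reading the stack file normally needs
root and a kernel with stack tracing enabled, the monitor itself must
not be starved by the load it observes, and for a multi-threaded
process the per-thread files under /proc/$PID/task/ may be more useful.

```python
import time
from pathlib import Path


def cpu_ticks_from_stat(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name may contain spaces or ')', so split after the last ')'.
    _, _, rest = stat_line.rpartition(")")
    fields = rest.split()
    # fields[0] is stat field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def watch_for_stall(pid, stall_after_s=1.0, interval_s=0.1, duration_s=60.0):
    """Collect (stall_seconds, kernel_stack) each time the process stops running.

    stall_after_s should be longer than cpu.rt_period_us so that ordinary
    throttling inside one period is not reported as a stall.
    """
    proc = Path("/proc") / str(pid)
    dumps = []
    prev = cpu_ticks_from_stat((proc / "stat").read_text())
    last_progress = time.monotonic()
    deadline = last_progress + duration_s
    dumped = False

    while time.monotonic() < deadline:
        time.sleep(interval_s)
        cur = cpu_ticks_from_stat((proc / "stat").read_text())
        now = time.monotonic()
        if cur != prev:
            # CPU time advanced: the process is running again.
            prev, last_progress, dumped = cur, now, False
        elif not dumped and now - last_progress >= stall_after_s:
            # No CPU time consumed for too long: record the stack once.
            dumps.append((now - last_progress, (proc / "stack").read_text()))
            dumped = True
    return dumps
```

Starting watch_for_stall() just before moving the process into the
cgroup would leave the stack of any stall in the returned list.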
To observe the system I use a Spirent traffic generator, which shows me
the number of processed packets in a nice graph. This makes it easy to
see the short intervals during which the system is not returning any
packets.

quoted

Is that expected behavior?

Someone with RT group scheduling knowledge may tell :-)

HTH,
Michal
