Re: [PATCH net-next] net/core: add optional threading for backlog processing

From: Felix Fietkau <nbd@nbd.name>
Date: 2023-03-28 09:47:31
Also in: linux-doc, lkml

On 28.03.23 11:29, Paolo Abeni wrote:

On Fri, 2023-03-24 at 18:57 +0100, Felix Fietkau wrote:

quoted

On 24.03.23 18:47, Jakub Kicinski wrote:

quoted

On Fri, 24 Mar 2023 18:35:00 +0100 Felix Fietkau wrote:

quoted

I'm primarily testing this on routers with 2 or 4 CPUs and limited 
processing power, handling routing/NAT. RPS is typically needed to 
properly distribute the load across all available CPUs. When there is 
only a small number of flows that are pushing a lot of traffic, a static 
RPS assignment often leaves some CPUs idle, whereas others become a 
bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but 
CPUs can become bottlenecked and fully loaded by a NAPI thread alone.

The NAPI thread becomes a bottleneck with RPS enabled?

The devices that I work with often only have a single rx queue. That can
easily become a bottleneck.

quoted

Making backlog processing threaded helps split up the processing work 
even more and distribute it onto remaining idle CPUs.

You'd want to have both threaded NAPI and threaded backlog enabled?

Yes

quoted

It can basically be used to make RPS a bit more dynamic and 
configurable, because you can assign multiple backlog threads to a set 
of CPUs and selectively steer packets from specific devices / rx queues

Can you give an example?

With the 4 CPU example, in case 2 queues are very busy - you're trying
to make sure that the RPS does not end up landing on the same CPU as
the other busy queue?

In this part I'm thinking about bigger systems where you want to have a
group of CPUs dedicated to dealing with network traffic without
assigning a fixed function (e.g. NAPI processing or RPS target) to each
one, allowing for more dynamic processing.

quoted

to them and allow the scheduler to take care of the rest.

You trust the scheduler much more than I do, I think :)

In my tests it brings down latency (both avg and p99) considerably in
some cases. I posted some numbers here:
https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/ (local)

It's still not 110% clear to me why/how this additional thread could
reduce latency. What/which threads are competing for the busy CPU[s]? I
suspect it could be easier/cleaner move away the others (non RPS)
threads.

In the tests that I'm doing, network processing load from routing/NAT is 
enough to occupy all available CPUs.
If I dedicate the NAPI thread to one core and use RPS to steer packet 
processing to the other cores, the core taking care of NAPI has some 
idle cycles that go to waste, while the other cores are busy.
If I include the core in the RPS mask, it can take too much away from 
the NAPI thread.
Also, with a smaller number of flows and depending on the hash 
distribution of those flows, there can be an imbalance between how much 
processing each RPS instance gets, making this even more difficult to 
get right.
With the extra threading it works out better in my tests, because the 
scheduler can deal with these kinds of imbalances.

- Felix

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help