Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue

From: Jamal Hadi Salim <jhs@mojatatu.com>
Date: 2022-02-16 23:39:22

On 2022-02-16 08:36, Tonghao Zhang wrote:

On Wed, Feb 16, 2022 at 8:17 AM Jamal Hadi Salim [off-list ref] wrote:


[...]
The mapping to hardware made sense. Sorry I missed it earlier.


Can you paste a more complete example of a sample setup on some egress
port including what the classifier would be looking at?

Hi

   +----+      +----+      +----+     +----+
   | P1 |      | P2 |      | PN |     | PM |
   +----+      +----+      +----+     +----+
     |           |           |           |
     +-----------+-----------+-----------+
                        |
                        | clsact/skbedit
                        |      MQ
                        v
     +-----------+-----------+-----------+
     | q0        | q1        | qn        | qm
     v           v           v           v
   HTB/FQ      HTB/FQ  ...  FIFO        FIFO

Below is still missing your MQ setup (If i understood your diagram
correctly). Can you please post that?
Are your classids essentially mapping to q0..m?
tc -s class show after you run some traffic should help
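
For readers following the diagram, one way an MQ layout like this might be built is sketched below. It is not the setup from the patch author, which was not posted in this thread. The helper name mq_layout_cmds, the device eth0, the queue counts, and the handle numbering are all assumptions for illustration. HTB classes and rates are left out, so the HTB qdiscs here would not limit anything yet. The function only prints the tc commands, so it is safe to run without root.

```shell
# Print (do not execute) tc commands for a hypothetical MQ layout:
# the first $2 tx queues get HTB, the rest up to $3 get pfifo.
# Device name, counts, and handle numbering are assumptions.
mq_layout_cmds() {
    dev=$1 nbulk=$2 total=$3
    echo "tc qdisc add dev $dev root handle 1: mq"
    i=1
    while [ "$i" -le "$total" ]; do
        if [ "$i" -le "$nbulk" ]; then kind=htb; else kind=pfifo; fi
        # mq class 1:<i> (hex minor) corresponds to tx queue i-1
        printf 'tc qdisc add dev %s parent 1:%x handle %x: %s\n' \
            "$dev" "$i" "$((i * 16))" "$kind"
        i=$((i + 1))
    done
    # Per-class (per tx queue) counters, to be read after running traffic
    echo "tc -s class show dev $dev"
}

mq_layout_cmds eth0 2 4
```

After reviewing the printed commands, they could be applied by piping the output to sh as root.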

NETDEV=eth0
tc qdisc add dev $NETDEV clsact
tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
    src_ip 192.168.122.100 action skbedit queue_mapping hash-type skbhash \
    n m

Have you observed a nice distribution here?
For the s/w side, tc -s class show after you run some traffic should help.
For the h/w side, ethtool -S.

IIUC, the hash of the ip header with src_ip 192.168.122.100
(and dst ip)
is being distributed across queues n..m
[because either 192.168.122.100 is talking to many destination
IPs and/or ports?]
Is this correct if packets are being forwarded as opposed to
being sourced from the host?
e.g. who sets the skb->hash (skb->l4_hash, skb->sw_hash, etc.)?

The packets from pod P1, whose IP is 192.168.122.100, will use tx queues n~m.
P1 is the pod with latency-sensitive traffic, so P1 uses the FIFO qdisc.

tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
    src_ip 192.168.122.200 action skbedit queue_mapping hash-type skbhash \
    0 1

The packets from pod P2, whose IP is 192.168.122.200, will use tx queues 0~1.
P2 is the pod with bulk traffic, so P2 uses the HTB qdisc to
limit its network rate, because we don't want P2 to use all the bandwidth
and affect P1.

Understood.


Your diagram was unclear how the load balancing was going to be
achieved using the qdiscs (or was it the hardware?).

First, in the clsact hook, we select one tx queue from qn to qm for P1
and use the qdisc of that tx queue, for example FIFO.
In the underlay driver, because we set skb->queue_mapping in
skbedit, the hw tx queue from qn to qm will be selected too.
Anyway, in the clsact hook, we can use the skbedit queue_mapping to
select both the software tx queue and the hw tx queue.

ethtool -S and tc -s class if you have this running somewhere..

To balance the load, we can use the skbhash/cpuid/cgroup classid to
select a tx queue from qn to qm for P1.
tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
    src_ip 192.168.122.100 action skbedit queue_mapping hash-type cpuid \
    n m
tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
    src_ip 192.168.122.100 action skbedit queue_mapping hash-type classid \
    n m

The skbhash should work fine if you have good entropy (varying dst ip
and dst port mostly; the srcip/srcport/protocol don't offer much entropy
unless you have a lot of pods on your system).
i.e. if it works correctly (forwarding vs host - see my question above)
then you should be able to pin a 5tuple flow to a tx queue.
If you have a large number of flows/pods then you could potentially
get a nice distribution.
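
To make the pinning argument concrete, here is a small userspace sketch, not kernel code. It assumes the queue is chosen as min + hash % (max - min + 1), which is an assumption about the patch rather than something stated in this thread. It uses cksum (CRC-32) as a stand-in for skb->hash, since the kernel computes its flow hash differently. The helper name pick_queue_by_hash, the addresses, and the queue range 2..5 are made up for illustration.

```shell
# Hypothetical helper: map a flow's 5-tuple string to a tx queue in [min, max].
# cksum (CRC-32) is only a stand-in for skb->hash.
pick_queue_by_hash() {
    tuple=$1 min=$2 max=$3
    hash=$(printf '%s' "$tuple" | cksum | cut -d' ' -f1)
    echo $(( min + hash % (max - min + 1) ))
}

# The same 5-tuple queried twice lands on the same queue both times.
pick_queue_by_hash "tcp 192.168.122.100:40000 10.0.0.1:80" 2 5
pick_queue_by_hash "tcp 192.168.122.100:40000 10.0.0.1:80" 2 5

# Varying only the destination port gives the hash some entropy,
# so different flows can land on different queues in 2..5.
for port in 80 81 82 83 84 85 86 87; do
    q=$(pick_queue_by_hash "tcp 192.168.122.100:40000 10.0.0.1:$port" 2 5)
    echo "dport $port -> queue $q"
done
```

With only a handful of flows the spread may be uneven; the distribution only has a chance to even out as the number of distinct flows grows.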

I may be missing something on the cpuid one - seems high likelihood
of having the same flow on multiple queues (based on what
raw_smp_processor_id() returns, which i believe is not guaranteed to be
consistent). IOW, you could be sending packets out of order for the
same 5 tuple flow (because they end up in different queues).
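
The reordering concern can be illustrated with the same kind of userspace sketch. It again assumes the queue is picked as min + cpu % (max - min + 1), which is an assumption about the patch, and it models a sender that migrates between CPUs. The helper name pick_queue_by_cpu, the CPU sequence, and the queue range 2..5 are hypothetical.

```shell
# Hypothetical helper: the cpuid variant picks the queue from the CPU that
# happens to be transmitting, not from anything in the packet itself.
pick_queue_by_cpu() {
    cpu=$1 min=$2 max=$3
    echo $(( min + cpu % (max - min + 1) ))
}

# One 5-tuple flow whose sender runs on CPUs 0, 3, 0, 1 in turn.
# Prints queues 2, 5, 2, 3: the same flow is spread over three queues.
for cpu in 0 3 0 1; do
    echo "packet sent on cpu $cpu -> queue $(pick_queue_by_cpu "$cpu" 2 5)"
done
```

Since the queues drain independently, packets of that single flow could leave the NIC in a different order than they were sent.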

As for classid variant - if these packets are already outside the
pod and into the host stack, is that field even valid?

We want to balance because we don't want to pin the packets
from a Pod to one tx queue (in k8s, pods are created and destroyed
frequently, and the number of Pods > the number of tx queues).
Sharing the tx queues equally is more important.

As long as the same flow is pinned to the same queue (see my comment
on cpuid).
Over a very long period what you describe may be true, but it also
seems to depend on many other variables.
I think it would help to actually show some data on how true the above
statement is (for example, the creation/destruction rate of the pods).
Or collect data over a very long period.

cheers,
jamal
