Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
From: Jamal Hadi Salim <jhs@mojatatu.com>
Date: 2022-02-16 23:39:22
On 2022-02-16 08:36, Tonghao Zhang wrote:
> On Wed, Feb 16, 2022 at 8:17 AM Jamal Hadi Salim [off-list ref] wrote:
[...] The mapping to hardware made sense. Sorry I missed it earlier.
>> Can you paste a more complete example of a sample setup on some egress
>> port, including what the classifier would be looking at?
>
> Hi
>
>   +----+      +----+      +----+      +----+
>   | P1 |      | P2 |      | PN |      | PM |
>   +----+      +----+      +----+      +----+
>     |           |           |           |
>     +-----------+-----------+-----------+
>                       |
>                       | clsact/skbedit
>                       |      MQ
>                       v
>     +-----------+-----------+-----------+
>     | q0        | q1        | qn        | qm
>     v           v           v           v
>   HTB/FQ      HTB/FQ  ...  FIFO        FIFO
The below is still missing your MQ setup (if I understood your diagram
correctly). Can you please post that?
Are your classids essentially mapping to q0..m?
The output of tc -s class show after you run some traffic should help.
> NETDEV=eth0
> tc qdisc add dev $NETDEV clsact
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
>     src_ip 192.168.122.100 action skbedit queue_mapping hash-type skbhash n m
Have you observed a nice distribution here?
For the s/w side, tc -s class show after you run some traffic should help.
For the h/w side, ethtool -S.

IIUC, the hash of the IP header with src_ip 192.168.122.100 (and dst IP) is
being distributed across queues n..m [because 192.168.122.100 is talking to
many destination IPs and/or ports?].
Is this correct if packets are being forwarded as opposed to being sourced
from the host? E.g., who sets skb->hash (skb->l4_hash, skb->sw_hash, etc.)?
> The packets from pod P1, whose IP is 192.168.122.100, will use tx queues
> n ~ m. P1 is the pod with latency-sensitive traffic, so P1 uses the FIFO
> qdisc.
>
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
>     src_ip 192.168.122.200 action skbedit queue_mapping hash-type skbhash 0 1
>
> The packets from pod P2, whose IP is 192.168.122.200, will use tx queues
> 0 ~ 1. P2 is the pod with bulk traffic, so P2 uses the HTB qdisc to limit
> its network rate, because we don't want P2 to use all the bandwidth and
> affect P1.
Understood.
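
For readers following along, the MQ side of the diagram could be set up roughly as below. This is only a sketch and not the configuration used by the poster: the handles (1:, 10:, 30:), the choice of queue 0 for HTB and queue 2 for pfifo, and the 1gbit rate are all assumptions for illustration. The helper name setup_mq and the TC override used for a dry run are likewise hypothetical.

```shell
# Hypothetical helper: attach mq at the root, then per-queue child qdiscs.
# TC can be overridden (e.g. TC=echo) to print the commands instead of
# running them.
setup_mq() {
    dev=$1
    tc_cmd=${TC:-tc}

    # mq exposes one class per tx queue: 1:1 is queue 0, 1:2 is queue 1, ...
    $tc_cmd qdisc add dev "$dev" root handle 1: mq

    # Queue 0 (bulk traffic): HTB with one rate-limited class.
    # The 1gbit rate is an assumption, not a value from this thread.
    $tc_cmd qdisc add dev "$dev" parent 1:1 handle 10: htb default 1
    $tc_cmd class add dev "$dev" parent 10: classid 10:1 htb rate 1gbit

    # Queue 2 (latency-sensitive traffic): plain FIFO.
    $tc_cmd qdisc add dev "$dev" parent 1:3 handle 30: pfifo
}
```

Running `TC=echo setup_mq eth0` prints the four commands without touching the device; `setup_mq eth0` as root would apply them.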
>> Your diagram was unclear on how the load balancing was going to be
>> achieved using the qdiscs (or was it the hardware?).
>
> Firstly, in the clsact hook, we select one tx queue from qn to qm for P1
> and use the qdisc of that tx queue, for example FIFO. In the underlying
> driver, because we set skb->queue_mapping in skbedit, the hw tx queue from
> qn to qm will be selected too. In any case, in the clsact hook we can use
> skbedit queue_mapping to select both the software tx queue and the hw tx
> queue.
Output of ethtool -S and tc -s class would help, if you have this running
somewhere.
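
One way to turn the software-side counters into a per-queue distribution is sketched below. It assumes that `tc -s class show` prints a "class <kind> <classid> ..." line followed by a "Sent <bytes> bytes <pkts> pkt ..." line for each class; the exact layout may differ between iproute2 versions. The function name txq_share is hypothetical.

```shell
# Summarise per-class packet counts from `tc -s class show` output on stdin.
# Prints "<classid> <packets> <share>" for each class, sorted by classid.
txq_share() {
    awk '
        $1 == "class" { cls = $3 }                      # "class mq 1:1 root"
        $1 == "Sent"  { pkts[cls] = $4; total += $4 }   # "Sent N bytes M pkt ..."
        END {
            for (c in pkts)
                printf "%s %d %.1f%%\n", c, pkts[c], total ? 100 * pkts[c] / total : 0
        }
    ' | sort
}
```

Typical use would be `tc -s class show dev "$NETDEV" | txq_share` after running some traffic.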
> To balance the load, we can use the skbhash/cpuid/cgroup classid to select
> a tx queue from qn to qm for P1.
>
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
>     src_ip 192.168.122.100 action skbedit queue_mapping hash-type cpuid n m
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
>     src_ip 192.168.122.100 action skbedit queue_mapping hash-type classid n m
The skbhash should work fine if you have good entropy (varying dst ip and
dst port mostly; the srcip/srcport/protocol don't offer much entropy unless
you have a lot of pods on your system).
I.e., if it works correctly (forwarding vs host - see my question above)
then you should be able to pin a 5-tuple flow to a tx queue.
If you have a large number of flows/pods then you could potentially
get a nice distribution.

I may be missing something on the cpuid one - there seems to be a high
likelihood of having the same flow on multiple queues (based on what
raw_smp_processor_id() returns, which I believe is not guaranteed to be
consistent). IOW, you could be sending packets out of order for the
same 5-tuple flow (because they end up in different queues).

As for the classid variant - if these packets are already outside the
pod and in the host stack, is that field even valid?
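
The reordering concern can be made concrete with a small sketch. It assumes the action picks a queue as n + (value mod (m - n + 1)), where value is the skb hash or the CPU id; that formula is an assumption for illustration and is not quoted from the patch. pick_txq is a hypothetical helper and the hash value is made up.

```shell
# Hypothetical model of the range selection: map a value into queues n..m.
# usage: pick_txq <value> <n> <m>
pick_txq() {
    echo $(( $2 + $1 % ($3 - $2 + 1) ))
}

# skbhash: every packet of a flow carries the same hash, so the flow
# stays on one queue.
flow_hash=305419896    # made-up hash for one 5-tuple flow
pick_txq "$flow_hash" 2 5
pick_txq "$flow_hash" 2 5

# cpuid: if the sending task moves between CPUs, packets of that same
# flow can land on different queues.
for cpu in 0 1 2 3; do
    pick_txq "$cpu" 2 5
done
```

The first two calls print the same queue number, while the loop prints a different queue for each CPU id, which is the case where packets of one flow could leave out of order.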
> The reason we want to balance is that we don't want to pin the packets
> from a pod to one tx queue (in k8s, pods are created and destroyed
> frequently, and the number of pods > the number of tx queues). Sharing the
> tx queues equally is more important.
As long as the same flow is pinned to the same queue (see my comment
on cpuid).
Over a very long period what you describe may be true, but it also
seems to depend on many other variables.
I think it would help to actually show some data on how true the above
statement is (for example, the creation/destruction rate of the pods),
or to collect data over a very long period.

cheers,
jamal