Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
From: Jamal Hadi Salim <jhs@mojatatu.com>
Date: 2022-02-22 11:44:40
On 2022-02-20 20:43, Tonghao Zhang wrote:
> On Mon, Feb 21, 2022 at 2:30 AM Jamal Hadi Salim [off-list ref] wrote:
>> On 2022-02-18 07:43, Tonghao Zhang wrote:
>>> On Thu, Feb 17, 2022 at 7:39 AM Jamal Hadi Salim [off-list ref] wrote:
>> That's a different use case than what you are presenting here, i.e. the
>> k8s pod scenario is purely a forwarding use case. But it doesn't matter
>> tbh since your data shows reasonable results.
>> [I didn't dig into the code but it is likely (based on your
>> experimental data) that both skb->l4_hash and skb->sw_hash will
>> _not be set_ and so skb_get_hash() will compute the skb->hash from
>> scratch.]
>
> No. For example, for tcp the hash is already set in __tcp_transmit_skb,
> which invokes skb_set_hash_from_sk, so in skbedit skb_get_hash only
> reads skb->hash.

There is no tcp anything in the forwarding case. Your use case was for forwarding. I understand the local host tcp/udp variant.

>>>> I may be missing something on the cpuid one - seems high likelihood
>>>> of having the same flow on multiple queues (based on what
>>>> raw_smp_processor_id() returns, which I believe is not guaranteed to
>>>> be consistent). IOW, you could be sending packets out of order for
>>>> the same 5 tuple flow (because they end up in different queues).
>>>
>>> Yes, but think about one case: we pin one pod to one cpu, so all the
>>> processes of the pod will use the same cpu. Then all packets from
>>> this pod will use the same tx queue.
>>
>> To Cong's point - if you already knew the pinned-to cpuid then you
>> could just as easily set that queue map from user space?
>
> Yes, we can set it from user space. If we can know the cpu which the
> pod uses, and select the one tx queue automatically in skbedit, that
> can make things easy?

Yes, but you know the CPU - so Cong's point is valid. You knew the CPU when you set up the cgroup for iperf by hand; you can use the same hand to set the skbedit queue map.
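
A minimal sketch of the arithmetic behind this point, assuming the proposed cpuid hash type picks a queue as min + cpu % (max - min + 1). That rule is an assumption inferred from the numbers quoted in this thread (CPU 4 with range "2 6" landing on tx_queue_6), not taken from the patch itself, and cpu_to_queue is a hypothetical helper name.

```shell
#!/bin/sh
# Assumed mapping (not confirmed by the patch text quoted here):
#   queue = min + cpu % (max - min + 1)
# cpu_to_queue is a hypothetical helper used only for illustration.
cpu_to_queue() {
    cpu=$1; min=$2; max=$3
    echo $(( min + cpu % (max - min + 1) ))
}

# Values from the thread: iperf3 pinned to CPU 4, queue range 2..6.
cpu_to_queue 4 2 6
```

If the process is pinned, the result is a constant, so the same number could be given as a static queue_mapping value from user space.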

>>> ip li set dev $NETDEV up
>>>
>>> tc qdisc del dev $NETDEV clsact 2>/dev/null
>>> tc qdisc add dev $NETDEV clsact
>>>
>>> ip link add ipv1 link $NETDEV type ipvlan mode l2
>>> ip netns add n1
>>> ip link set ipv1 netns n1
>>>
>>> ip netns exec n1 ip link set ipv1 up
>>> ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up
>>>
>>> tc filter add dev $NETDEV egress protocol ip prio 1 \
>>> flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping hash-type cpuid 2 6
>>>
>>> tc qdisc add dev $NETDEV handle 1: root mq
>>>
>>> tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
>>> tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
>>> tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit
>>>
>>> tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
>>> tc qdisc add dev $NETDEV parent 1:3 pfifo
>>> tc qdisc add dev $NETDEV parent 1:4 pfifo
>>> tc qdisc add dev $NETDEV parent 1:5 pfifo
>>> tc qdisc add dev $NETDEV parent 1:6 pfifo
>>> tc qdisc add dev $NETDEV parent 1:7 pfifo
>>>
>>> set the iperf3 to one cpu
>>> # mkdir -p /sys/fs/cgroup/cpuset/n0
>>> # echo 4 > /sys/fs/cgroup/cpuset/n0/cpuset.cpus
>>> # echo 0 > /sys/fs/cgroup/cpuset/n0/cpuset.mems
>>> # ip netns exec n1 iperf3 -c 2.2.2.1 -i 1 -t 1000 -P 10 -u -b 10G
>>> # echo $(pidof iperf3) > /sys/fs/cgroup/cpuset/n0/tasks
>>>
>>> # ethtool -S eth0 | grep -i tx_queue_[0-9]_bytes
>>> tx_queue_0_bytes: 7180
>>> tx_queue_1_bytes: 418
>>> tx_queue_2_bytes: 3015
>>> tx_queue_3_bytes: 4824
>>> tx_queue_4_bytes: 3738
>>> tx_queue_5_bytes: 716102781 # before setting iperf3 to cpu 4
>>> tx_queue_6_bytes: 17989642640 # after setting iperf3 to cpu 4, skbedit use this tx queue, and don't use tx queue 5
>>> tx_queue_7_bytes: 4364
>>> tx_queue_8_bytes: 42
>>> tx_queue_9_bytes: 3030
>>>
>>> # tc -s class show dev eth0
>>> class mq 1:1 root leaf 2:
>>>  Sent 9874 bytes 63 pkt (dropped 0, overlimits 0 requeues 0)
>>>  backlog 0b 0p requeues 0
>>> class mq 1:2 root leaf 8001:
>>>  Sent 418 bytes 3 pkt (dropped 0, overlimits 0 requeues 0)
>>>  backlog 0b 0p requeues 0
>>> class mq 1:3 root leaf 8002:
>>>  Sent 3015 bytes 13 pkt (dropped 0, overlimits 0 requeues 0)
>>>  backlog 0b 0p requeues 0
>>> class mq 1:4 root leaf 8003:
>>>  Sent 4824 bytes 8 pkt (dropped 0, overlimits 0 requeues 0)
>>>  backlog 0b 0p requeues 0
>>> class mq 1:5 root leaf 8004:
>>>  Sent 4074 bytes 19 pkt (dropped 0, overlimits 0 requeues 0)
>>>  backlog 0b 0p requeues 0
>>> class mq 1:6 root leaf 8005:
>>>  Sent 716102781 bytes 480624 pkt (dropped 0, overlimits 0 requeues 0)
>>>  backlog 0b 0p requeues 0
>>> class mq 1:7 root leaf 8006:
>>>  Sent 18157071781 bytes 12186100 pkt (dropped 0, overlimits 0 requeues 18)
>>>  backlog 0b 0p requeues 18
>>> class mq 1:8 root
>>>  Sent 4364 bytes 26 pkt (dropped 0, overlimits 0 requeues 0)
>>>  backlog 0b 0p requeues 0
>>> class mq 1:9 root
>>>  Sent 42 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
>>>  backlog 0b 0p requeues 0
>>> class mq 1:a root
>>>  Sent 3030 bytes 13 pkt (dropped 0, overlimits 0 requeues 0)
>>>  backlog 0b 0p requeues 0
>>> class tbf 8001:1 parent 8001:
>>>
>>> class htb 2:1 root prio 0 rate 100Kbit ceil 100Kbit burst 1600b cburst 1600b
>>>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>>  backlog 0b 0p requeues 0
>>>  lended: 0 borrowed: 0 giants: 0
>>>  tokens: 2000000 ctokens: 2000000
>>>
>>> class htb 2:2 root prio 0 rate 200Kbit ceil 200Kbit burst 1600b cburst 1600b
>>>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>>  backlog 0b 0p requeues 0
>>>  lended: 0 borrowed: 0 giants: 0
>>>  tokens: 1000000 ctokens: 1000000
>>
>> Yes, if you pin a flow/process to a cpu - this is expected. See my
>> earlier comment. You could argue that you are automating things but it
>> is not as strong as the hash setup (and will have to be documented
>> that it works only if you pin processes doing network i/o to cpus).
>
> OK, it should be documented in iproute2, and we will document this in
> the commit message too.

I think this part is iffy. You could argue the automation pov but I don't see much else.

>> Could you also post an example on the cgroups classid?
>
> The setup commands:
> NETDEV=eth0
>
> ip li set dev $NETDEV up
>
> tc qdisc del dev $NETDEV clsact 2>/dev/null
> tc qdisc add dev $NETDEV clsact
>
> ip link add ipv1 link $NETDEV type ipvlan mode l2
> ip netns add n1
> ip link set ipv1 netns n1
>
> ip netns exec n1 ip link set ipv1 up
> ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up
>
> tc filter add dev $NETDEV egress protocol ip prio 1 \
> flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping hash-type classid 2 6
>
> tc qdisc add dev $NETDEV handle 1: root mq
>
> tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
> tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
> tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit
>
> tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
> tc qdisc add dev $NETDEV parent 1:3 pfifo
> tc qdisc add dev $NETDEV parent 1:4 pfifo
> tc qdisc add dev $NETDEV parent 1:5 pfifo
> tc qdisc add dev $NETDEV parent 1:6 pfifo
> tc qdisc add dev $NETDEV parent 1:7 pfifo
>
> setup classid
> # mkdir -p /sys/fs/cgroup/net_cls/n0
> # echo 0x100001 > /sys/fs/cgroup/net_cls/n0/net_cls.classid
> # echo $(pidof iperf3) > /sys/fs/cgroup/net_cls/n0/tasks

I would say the same thing here as well. You know the classid, you manually set it above, you could have said:

src_ip 2.2.2.100 action skbedit queue_mapping 1

cheers,
jamal
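
For reference, the fragment above can be expanded into a full filter line by reusing the match and priority from the quoted setup. The sketch below only prints the command, so it can be tried without root or a real device; static_queue_filter is a hypothetical helper name, and the device, address and queue number are the ones used in this thread.

```shell
#!/bin/sh
# Print (do not run) a tc filter that sends traffic from one source IP
# to a fixed tx queue. static_queue_filter is a hypothetical helper;
# the filter options mirror the setup commands quoted in this thread.
static_queue_filter() {
    dev=$1; src_ip=$2; queue=$3
    echo "tc filter add dev $dev egress protocol ip prio 1" \
         "flower skip_hw src_ip $src_ip" \
         "action skbedit queue_mapping $queue"
}

# Values from the thread: eth0, source 2.2.2.100, queue 1.
static_queue_filter eth0 2.2.2.100 1
```

Printing first and pasting the line afterwards keeps the sketch safe on a machine where the clsact qdisc is not yet set up.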