Re: qdisc spin lock
From: Michael Ma <hidden>
Date: 2016-03-31 23:41:31
Thanks for the suggestion - I'll try the MQ solution out. It seems to be able to solve the problem well with the assumption that bandwidth can be statically partitioned.

2016-03-31 12:18 GMT-07:00 Jesper Dangaard Brouer [off-list ref]:
> On Wed, 30 Mar 2016 00:20:03 -0700 Michael Ma [off-list ref] wrote:
> > I know this might be an old topic so bear with me – what we are facing is that applications are sending small packets using hundreds of threads so the contention on spin lock in __dev_xmit_skb increases the latency of dev_queue_xmit significantly. We’re building a network QoS solution to avoid interference of different applications using HTB.
>
> Yes, as you have noticed with HTB there is a single qdisc lock, and congestion obviously happens :-)
>
> It is possible with different tricks to make it scale. I believe Google is using a variant of HTB, and it scales for them. They have not open sourced their modifications to HTB (which likely also involves a great deal of setup tricks).
>
> If your purpose is to limit traffic/bandwidth per "cloud" instance, then you can just use another TC setup structure. Like using MQ and assigning a HTB per MQ queue (where the MQ queues are bound to each CPU/HW queue)... But you have to figure out this setup yourself...
> > But in this case when some applications send massive numbers of small packets in parallel, the application to be protected will get its throughput affected (because it’s doing synchronous network communication using multiple threads and throughput is sensitive to the increased latency).
> >
> > Here is the profiling from perf:
> >
> > - 67.57% iperf [kernel.kallsyms] [k] _spin_lock
> >    - 99.94% dev_queue_xmit
> >       - 96.91% _spin_lock
> >       - 2.62% __qdisc_run
> >          - 98.98% sch_direct_xmit
> >             - 99.98% _spin_lock
> >
> > As far as I understand the design of TC is to simplify locking schema and minimize the work in __qdisc_run so that throughput won’t be affected, especially with large packets. However if the scenario is that multiple classes in the queueing discipline only have the shaping limit, there isn’t really a necessary correlation between different classes. The only synchronization point should be when the packet is dequeued from the qdisc queue and enqueued to the transmit queue of the device. My question is – is it worth investing in avoiding the locking contention by partitioning the queue/lock so that this scenario is addressed with relatively smaller latency?
>
> Yes, there is a lot to gain, but it is not easy ;-)
> > I must have oversimplified a lot of details since I’m not familiar with the TC implementation at this point – just want to get your input in terms of whether this is a worthwhile effort or there is something fundamental that I’m not aware of. If this is just a matter of quite some additional work, would also appreciate help outlining the required work here.
> >
> > Also would appreciate if there is any information about the latest status of this work
> > http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf
>
> This article seems to be very low quality... spelling errors, only 5 pages, no real code, etc.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer
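
For reference, the MQ-plus-HTB layout suggested in the quoted reply might look roughly like the sketch below. It is not a setup taken from this thread: the interface name (eth0), the queue count (4), the total rate (1000 mbit), the even split, and the handle numbering (11:, 12:, ...) are all assumptions chosen for illustration. The helper only builds the tc command strings and does not run them. Each HTB instance attached under an MQ class has its own qdisc lock, so senders that land on different TX queues no longer serialize on a single lock. The cost is the static partitioning mentioned at the top of this message: a queue cannot borrow unused bandwidth from another queue.

```python
def mq_htb_commands(dev, num_queues, total_rate_mbit):
    """Build tc commands for an MQ root qdisc with one HTB per TX queue.

    Assumption (not from the thread): bandwidth is split evenly and
    statically, so each queue gets total_rate_mbit // num_queues.
    """
    if num_queues < 1:
        raise ValueError("num_queues must be at least 1")
    per_queue_mbit = total_rate_mbit // num_queues
    if per_queue_mbit < 1:
        raise ValueError("total_rate_mbit is too small to split across queues")

    # Root MQ qdisc: exposes one class per hardware TX queue (1:1, 1:2, ...).
    cmds = [f"tc qdisc replace dev {dev} root handle 1: mq"]
    for q in range(1, num_queues + 1):
        mq_class = f"1:{q:x}"      # tc prints class minor numbers in hex
        htb = f"{0x10 + q:x}:"     # hypothetical handle scheme: 11:, 12:, ...
        # One independent HTB instance (and therefore one lock) per TX queue.
        cmds.append(
            f"tc qdisc add dev {dev} parent {mq_class} handle {htb} htb default 1"
        )
        # A single shaping class per HTB, capped at this queue's static share.
        cmds.append(
            f"tc class add dev {dev} parent {htb} classid {htb}1 "
            f"htb rate {per_queue_mbit}mbit"
        )
    return cmds


if __name__ == "__main__":
    # Hypothetical values; review the output before running it on a real host.
    for cmd in mq_htb_commands("eth0", 4, 1000):
        print(cmd)
```

Which TX queue a given sender uses depends on the device's queue selection (for example XPS or the flow hash), so whether this removes the contention for a particular application depends on how its threads are spread across CPUs and queues.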