Re: [RFC net] net/mlx5: Fix performance regression for request-response workloads
From: Alexandra Winter <wintera@linux.ibm.com>
Date: 2022-09-26 10:08:06
Also in:
linux-s390
On 08.09.22 14:41, Eric Dumazet wrote:
> On Thu, Sep 8, 2022 at 2:40 AM Christian Borntraeger [off-list ref] wrote:
>> On 07.09.22 at 18:06, Eric Dumazet wrote:
>>> On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter [off-list ref] wrote:
>>>> Since linear payload was removed even for single small messages,
>>>> an additional page is required and we are measuring performance
>>>> impact.
>>>>
>>>> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
>>>> explicitly allowed "payload in skb->head for first skb put in the
>>>> queue, to not impact RPC workloads."
>>>> 472c2e07eef0 ("tcp: add one skb cache for tx") made that obsolete
>>>> and removed it.
>>>> When d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache") reverted it,
>>>> this piece was not reverted and not added back in.
>>>>
>>>> When running uperf with a request-response pattern with 1k payload
>>>> and 250 parallel connections, we measure a 13% difference in
>>>> throughput for our PCI based network interfaces since 472c2e07eef0.
>>>> (our IOMMU is sensitive to the number of mapped pages)
>>>>
>>>> Could you please consider allowing linear payload for the first
>>>> skb in queue again? A patch proposal is appended below.
>>>
>>> No.
>>>
>>> Please add a workaround in your driver.
>>>
>>> You can increase throughput by 20% by premapping a coherent piece of
>>> memory in which you can copy small skbs (skb->head included).
>>>
>>> Something like 256 bytes per slot in the TX ring.
>>
>> FWIW this regression was with the standard Mellanox driver (nothing
>> s390 specific).
>
> I did not claim this was s390 specific.
>
> Only IOMMU mode.
>
> I would rather not add back something which makes the TCP stack slower
> (more tests in the fast path) for the majority of us _not_ using IOMMU.
>
> In our own tests, this trick of using linear skbs was only helping
> benchmarks, not real workloads.
>
> Many drivers have to map skb->head a second time if they contain TCP
> payload, thus adding yet another corner case in their fast path.
>
> - Typical RPC workloads are playing with TCP_NODELAY
> - Typical bulk flows never have empty write queues...
>
> Really, I do not want this optimization back, this is not worth it.
>
> Again, a driver knows better if it is using IOMMU and if pathological
> layouts can be optimized to non-SG ones, and using a pre-dma-map zone
> will also benefit pure TCP ACK packets (which do not have any payload).
>
> Here is the changelog of a patch I did for our GQ NIC (not yet
> upstreamed, but will be soon)
[...]

Saeed,

As discussed at LPC, could you please consider adding a workaround to the
Mellanox driver, to use non-SG SKBs for small messages? As mentioned above
we are seeing 13% throughput degradation, if 2 pages need to be mapped
instead of 1.

While Eric's ideas sound very promising, just using non-SG in these cases
should be enough to mitigate the performance regression we see.

Thank you in advance.
Alexandra
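
For illustration only, here is a rough userspace sketch of the copy-slot idea quoted above: small packets are copied into a per-slot area that is assumed to have been mapped once at ring setup, and larger ones fall back to per-packet mappings. This is not mlx5 or GQ NIC code. Every name in it (model_skb, model_tx_ring, model_xmit) is hypothetical, the "mappings" are just counters, and only the 256-byte slot size comes from the thread. In a real driver the copy area would presumably come from something like dma_alloc_coherent() and the copy from skb_copy_bits(), but that is an assumption and not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TX_RING_SLOTS   8    /* hypothetical ring size */
#define COPY_SLOT_BYTES 256  /* per-slot size suggested in the thread */
#define MAX_FRAGS       4    /* hypothetical limit for this model */

/* Simplified stand-in for an skb: a linear head plus page fragments. */
struct model_skb {
	const unsigned char *head;
	size_t head_len;
	size_t nr_frags;
	const unsigned char *frag[MAX_FRAGS];
	size_t frag_len[MAX_FRAGS];
};

struct model_tx_ring {
	/* Stands in for one region that was DMA-mapped once at setup. */
	unsigned char copy_area[TX_RING_SLOTS][COPY_SLOT_BYTES];
	size_t copied_len[TX_RING_SLOTS];
	unsigned long dma_maps;  /* per-packet mappings counted so far */
	unsigned int next;       /* next slot to use */
};

static size_t model_skb_len(const struct model_skb *skb)
{
	size_t len = skb->head_len;

	for (size_t i = 0; i < skb->nr_frags; i++)
		len += skb->frag_len[i];
	return len;
}

/* Queue one packet; returns how many per-packet mappings it needed. */
static unsigned int model_xmit(struct model_tx_ring *ring,
			       const struct model_skb *skb)
{
	unsigned int slot = ring->next;
	size_t len = model_skb_len(skb);
	unsigned int maps;

	ring->next = (ring->next + 1) % TX_RING_SLOTS;

	if (len <= COPY_SLOT_BYTES) {
		/* Small packet: copy head and frags into the premapped slot. */
		unsigned char *dst = ring->copy_area[slot];
		size_t off = 0;

		if (skb->head_len) {
			memcpy(dst, skb->head, skb->head_len);
			off = skb->head_len;
		}
		for (size_t i = 0; i < skb->nr_frags; i++) {
			memcpy(dst + off, skb->frag[i], skb->frag_len[i]);
			off += skb->frag_len[i];
		}
		ring->copied_len[slot] = len;
		maps = 0;
	} else {
		/* Large packet: one mapping for the head, one per fragment. */
		ring->copied_len[slot] = 0;
		maps = (skb->head_len ? 1u : 0u) + (unsigned int)skb->nr_frags;
	}

	ring->dma_maps += maps;
	return maps;
}
```

In this model a pure ACK (headers only) and a small request both need no per-packet mapping, while headers in the head plus a 1k payload in one fragment cost two mappings, which is the "2 pages instead of 1" case. Note that a 1k payload does not fit a 256-byte slot, so the uperf scenario from the thread would need a larger slot or a non-SG (linearized) path to benefit.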