Re: [RFC net] net/mlx5: Fix performance regression for request-response workloads

From: Alexandra Winter <wintera@linux.ibm.com>
Date: 2022-09-26 10:08:06
Also in: linux-s390


On 08.09.22 14:41, Eric Dumazet wrote:

On Thu, Sep 8, 2022 at 2:40 AM Christian Borntraeger
[off-list ref] wrote:

quoted

Am 07.09.22 um 18:06 schrieb Eric Dumazet:

quoted

On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter [off-list ref] wrote:

quoted

Since linear payload was removed even for single small messages,
an additional page is required and we are measuring performance impact.

3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
explicitely allowed "payload in skb->head for first skb put in the queue,
to not impact RPC workloads."
472c2e07eef0 ("tcp: add one skb cache for tx")
made that obsolete and removed it.
When
d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
reverted it, this piece was not reverted and not added back in.

When running uperf with a request-response pattern with 1k payload
and 250 connections parallel, we measure 13% difference in throughput
for our PCI based network interfaces since 472c2e07eef0.
(our IO MMU is sensitive to the number of mapped pages)

quoted

Could you please consider allowing linear payload for the first
skb in queue again? A patch proposal is appended below.

No.

Please add a work around in your driver.

You can increase throughput by 20% by premapping a coherent piece of
memory in which
you can copy small skbs (skb->head included)

Something like 256 bytes per slot in the TX ring.

FWIW this regression was withthe standard mellanox driver (nothing s390 specific).

I did not claim this was s390 specific.

Only IOMMU mode.

I would rather not add back something which makes TCP stack slower
(more tests in fast path)
for the majority of us _not_ using IOMMU.

In our own tests, this trick of using linear skbs was only helping
benchmarks, not real workloads.

Many drivers have to map skb->head a second time if they contain TCP payload,
thus adding yet another corner case in their fast path.

- Typical RPC workloads are playing with TCP_NODELAY
- Typical bulk flows never have empty write queues...

Really, I do not want this optimization back, this is not worth it.

Again, a driver knows better if it is using IOMMU and if pathological
layouts can be optimized
to non SG ones, and using a pre-dma-map zone will also benefit pure
TCP ACK packets (which do not have any payload)

Here is the changelog of a patch I did for our GQ NIC (not yet
upstreamed, but will be soon)

[...]

Saeed,
As discussed at LPC, could you please consider adding a workaround to the
Mellanox driver, to use non-SG SKBs for small messages? As mentioned above
we are seeing 13% throughput degradation, if 2 pages need to be mapped
instead of 1.

While Eric's ideas sound very promising, just using non-SG in these cases
should be enough to mitigate the performance regression we see.

Thank you in advance.
Alexandra

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help