Re: [PATCH net-next 0/5] udp: Generalize GSO for UDP tunnels

From: Tom Herbert <hidden>
Date: 2014-09-29 03:59:24

On Sat, Sep 27, 2014 at 12:26 PM, Or Gerlitz [off-list ref] wrote:

On Sat, Sep 27, 2014 at 2:04 AM, Tom Herbert [off-list ref] wrote:

quoted

On Fri, Sep 26, 2014 at 1:16 PM, Or Gerlitz [off-list ref] wrote:

quoted

On Fri, Sep 26, 2014 at 7:22 PM, Tom Herbert [off-list ref] wrote:
[...]

quoted

Notes:
  - GSO for GRE/UDP where GRE checksum is enabled does not work.
    Handling this will require some special case code.
  - Software GSO now supports many varieties of encapsulation with
    SKB_GSO_UDP_TUNNEL{_CSUM}. We still need a mechanism to query
    for device support of particular combinations (I intend to
    add ndo_gso_check for that).

Tom,

As I wrote you earlier on another thread, the fact is that there are
upstream drivers that advertise SKB_GSO_UDP_TUNNEL and at this point
aren't capable of issuing proper HW segmentation of anything that isn't
VXLAN.

Just to make sure, this series isn't expected to introduce a
regression, right? We don't expect the stack to attempt to xmit a
large 64KB UDP packet which isn't VXLAN through these devices.

quoted

I am planning to post ndo_gso_check shortly. These patches should not
cause a regression with currently deployed functionality (VXLAN).

Can you sum up (please) in one or two lines what the trick is to avoid
such a regression? That is, what/where is the knob that would prevent
such a giant chunk from being sent down to a NIC driver which does
advertise SKB_GSO_UDP_TUNNEL?

I posted patch for ndo_gso_check. Please let me know if you'll be able
to work with this. I'll also post the iproute changes soon so that the
FOU results can be repro'd.
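
For readers following the thread, the idea behind a per-device GSO check
can be sketched as a toy userspace model. This is not kernel code and not
the API from the posted patch: the `Packet` and `Device` classes, the
`gso_check` hook, and `choose_segmentation` are all hypothetical names
made up for illustration. The sketch only assumes what the thread states:
a device may advertise UDP tunnel GSO yet handle only VXLAN in hardware,
so the stack needs a way to ask before handing it a large packet.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Packet:
    encap: str   # hypothetical labels, e.g. "vxlan", "fou-gre", "fou-ipip"
    length: int  # bytes; a GSO packet may be far larger than the MTU


@dataclass
class Device:
    name: str
    advertises_udp_tunnel_gso: bool
    # Hypothetical per-device hook standing in for the proposed check.
    gso_check: Optional[Callable[[Packet], bool]] = None


def choose_segmentation(dev: Device, pkt: Packet, mtu: int = 1500) -> str:
    """Decide who segments the packet: nobody, software, or the device."""
    if pkt.length <= mtu:
        return "none"
    if not dev.advertises_udp_tunnel_gso:
        return "software"
    # The device advertises the feature, but may reject this combination.
    if dev.gso_check is not None and not dev.gso_check(pkt):
        return "software"
    return "hardware"


# A device that advertises UDP tunnel GSO but can only segment VXLAN.
vxlan_only_nic = Device("nic0", True, gso_check=lambda p: p.encap == "vxlan")
```

In this model a large VXLAN packet goes to the device as is, while a large
GRE-over-fou packet is segmented in software first, which is the
no-regression behaviour being asked about.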

quoted

  - MPLS seems to be the only previous user of inner_protocol. I don't
    believe these patches can affect that. For supporting GSO with
    MPLS over UDP, the inner_protocol should be set using the
    helper functions in this patch.
  - GSO for L2TP/UDP should also be straightforward now.

quoted

Tested GRE, IPIP, and SIT over fou as well as VXLAN. This was
done using 200 TCP_STREAMs in netperf.

[...]

quoted

   VXLAN
      TCP_STREAM TSO enabled on tun interface
        16.42% TX CPU utilization
        23.66% RX CPU utilization
        9081 Mbps
      TCP_STREAM TSO disabled on tun interface
        30.32% TX CPU utilization
        30.55% RX CPU utilization
        9185 Mbps

so TSO disabled has better BW vs TSO enabled?

Yes, I've noticed that on occasion, it does seem like TSO disabled
tends to get a little more throughput. I see this with plain GRE, so I
don't think it's directly related to fou or these patches. I suppose
there may be some subtle interactions with BQL or something like that.
I'd probably want to repro this on some other devices at some point to
dig deeper.
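
To put the two VXLAN runs quoted above side by side, here is a small
calculation using only the figures from the posted results. The metric
(Mbps delivered per 1% of TX CPU) is a rough convenience chosen for this
sketch, not something netperf reports, and the variable names are
illustrative.

```python
# Figures copied from the VXLAN results quoted in this thread.
results = {
    "TSO enabled": {"tx_cpu": 16.42, "rx_cpu": 23.66, "mbps": 9081},
    "TSO disabled": {"tx_cpu": 30.32, "rx_cpu": 30.55, "mbps": 9185},
}


def mbps_per_tx_cpu_percent(run: dict) -> float:
    """Crude efficiency metric: throughput per 1% of TX CPU utilization."""
    return run["mbps"] / run["tx_cpu"]


efficiency = {name: mbps_per_tx_cpu_percent(run) for name, run in results.items()}

# Relative throughput gain of the TSO-disabled run over the TSO-enabled run.
throughput_gain = results["TSO disabled"]["mbps"] / results["TSO enabled"]["mbps"] - 1

for name, value in efficiency.items():
    print(f"{name}: {value:.0f} Mbps per 1% TX CPU")
print(f"Throughput gain with TSO disabled: {throughput_gain:.2%}")
```

By this measure the throughput difference is on the order of one percent,
while the TX CPU cost with TSO disabled is nearly double.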

quoted

   Baseline (no encap, TSO and LRO enabled)
      TCP_STREAM
        11.85% TX CPU utilization
        15.13% RX CPU utilization
        9452 Mbps

I would strongly recommend to have a far better baseline when
developing and testing these changes in the stack in the form of 40Gbs
NICs.

The only point of putting the baseline was to show that encapsulation
with GSO/GRO/checksum-unnec-conversion is in the ballpark of
performance with native traffic which was a goal.

under (over...) 10Gbs, in the ballpark indeed.

We know nothing about what would happen with baseline of 38Gbs (SB 40Gbs
NIC) 56Gbs (two bonded ports of 40Gbs NIC on PCIe gen3) or 100Gbs
(tomorrow's NIC HW, probably coming up next year)

quoted

So I'm pretty happy
with this performance right now, although it probably does mean remote
checksum offload won't show so impressive results with this test (TX
csum with data in case isn't so expensive).
Out of curiosity, why do you think using 40Gbs is far better for a baseline?

Oh, simply b/c with 40Gbs NICs, the baseline I expect for few sessions
(1,2,4 or 200 as you did) of plain TCP is four times better vs. your
current one (38Gbs vs 9.5Gbs) and this should pose a harder challenge
for the GSO/encapsulating stack to catch up with, agree?

Sure, I agree that it would be nice to have this tested on different
devices (40G, 1G, wireless, etc.)-- but right now I don't see anything
particularly obvious why performance shouldn't scale linearly.

Or.
