Re: [PATCH net-next 0/5] udp: Generalize GSO for UDP tunnels
From: Tom Herbert <hidden>
Date: 2014-09-29 03:59:24
On Sat, Sep 27, 2014 at 12:26 PM, Or Gerlitz [off-list ref] wrote:
On Sat, Sep 27, 2014 at 2:04 AM, Tom Herbert [off-list ref] wrote:
On Fri, Sep 26, 2014 at 1:16 PM, Or Gerlitz [off-list ref] wrote:
On Fri, Sep 26, 2014 at 7:22 PM, Tom Herbert [off-list ref] wrote: [...]
Notes:
- GSO for GRE/UDP where GRE checksum is enabled does not work. Handling this will require some special case code.
- Software GSO now supports many varieties of encapsulation with SKB_GSO_UDP_TUNNEL{_CSUM}. We still need a mechanism to query for device support of particular combinations (I intend to add ndo_gso_check for that).

Tom,

As I wrote you earlier on another thread/s, fact is that there are upstream drivers that advertise SKB_GSO_UDP_TUNNEL and aren't capable at this point of issuing proper HW segmentation of something which isn't VXLAN. Just to make sure, this series isn't expected to introduce a regression, right? We don't expect the stack to attempt to xmit a large 64KB UDP packet which isn't VXLAN through these devices.
I am planning to post ndo_gso_check shortly. These patches should not cause a regression with currently deployed functionality (VXLAN).

Can you please sum up in a one- or two-liner what the trick is to avoid such a regression? That is, what/where is the knob that would prevent such a giant chunk from being sent down to a NIC driver which does advertise SKB_GSO_UDP_TUNNEL?
I posted a patch for ndo_gso_check. Please let me know if you'll be able to work with this. I'll also post the iproute changes soon so that the FOU results can be repro'd.
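
To make the question about the "knob" concrete, here is a rough user-space sketch of the general idea of a per-device GSO check. It is a toy model under stated assumptions, not the posted patch and not kernel code: the types `toy_pkt` and `toy_dev`, the `gso_check` member, and the functions `toy_vxlan_only_check` and `toy_can_offload` are all hypothetical names made up for illustration. The only thing taken from the thread is the concept: a device may advertise SKB_GSO_UDP_TUNNEL yet only handle VXLAN, so the stack would ask the device per packet and fall back to software segmentation when the answer is no.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; these are NOT the kernel's
 * struct sk_buff / struct net_device, and all names are hypothetical. */
enum toy_encap { TOY_ENCAP_VXLAN, TOY_ENCAP_FOU_GRE, TOY_ENCAP_FOU_IPIP };

struct toy_pkt {
    enum toy_encap encap;   /* which UDP encapsulation wraps the payload */
};

struct toy_dev {
    bool udp_tunnel_gso;    /* models advertising SKB_GSO_UDP_TUNNEL */
    /* Optional per-packet veto; NULL means "no further restrictions". */
    bool (*gso_check)(const struct toy_pkt *pkt);
};

/* A device whose hardware can only segment VXLAN-encapsulated packets. */
static bool toy_vxlan_only_check(const struct toy_pkt *pkt)
{
    return pkt->encap == TOY_ENCAP_VXLAN;
}

/* Returns true if the large packet may be handed to the device as-is,
 * false if the stack should segment it in software first. */
static bool toy_can_offload(const struct toy_dev *dev,
                            const struct toy_pkt *pkt)
{
    if (!dev->udp_tunnel_gso)
        return false;               /* feature not advertised at all */
    if (dev->gso_check != NULL && !dev->gso_check(pkt))
        return false;               /* advertised, but not for this packet */
    return true;
}
```

In this model a VXLAN-only NIC would still get large VXLAN packets, while a GRE-over-fou packet headed to the same NIC would be segmented in software before reaching the driver.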
- MPLS seems to be the only previous user of inner_protocol. I don't believe these patches can affect that. For supporting GSO with MPLS over UDP, the inner_protocol should be set using the helper functions in this patch.
- GSO for L2TP/UDP should also be straightforward now.
Tested GRE, IPIP, and SIT over fou as well as VXLAN. This was done using 200 TCP_STREAMs in netperf.
[...]
VXLAN
  TCP_STREAM TSO enabled on tun interface
    16.42% TX CPU utilization
    23.66% RX CPU utilization
    9081 Mbps
  TCP_STREAM TSO disabled on tun interface
    30.32% TX CPU utilization
    30.55% RX CPU utilization
    9185 Mbps

so TSO disabled has better BW vs TSO enabled?

Yes, I've noticed that on occasion, it does seem like TSO disabled tends to get a little more throughput. I see this with plain GRE, so I don't think it's directly related to fou or these patches. I suppose there may be some subtle interactions with BQL or something like that. I'd probably want to repro this on some other devices at some point to dig deeper.
Baseline (no encap, TSO and LRO enabled)
  TCP_STREAM
    11.85% TX CPU utilization
    15.13% RX CPU utilization
    9452 Mbps

I would strongly recommend having a far better baseline, in the form of 40Gbs NICs, when developing and testing these changes in the stack.

The only point of putting the baseline was to show that encapsulation with GSO/GRO/checksum-unnec-conversion is in the ballpark of performance with native traffic, which was a goal.

Under (over...) 10Gbs, in the ballpark indeed. We know nothing about what would happen with a baseline of 38Gbs (SB 40Gbs NIC), 56Gbs (two bonded ports of 40Gbs NIC on PCIe gen3) or 100Gbs (tomorrow's NIC HW, probably coming up next year).
So I'm pretty happy with this performance right now, although it probably does mean remote checksum offload won't show so impressive results with this test (TX csum with data in case isn't so expensive). Out of curiosity, why do you think using 40Gbs is far better for a baseline?

Oh, simply b/c with 40Gbs NICs, the baseline I expect for a few sessions (1, 2, 4 or 200 as you did) of plain TCP is four times better than your current one (38Gbs vs 9.5Gbs), and this should pose a harder challenge for the GSO/encapsulating stack to catch up with, agree?
Sure, I agree that it would be nice to have this tested on different devices (40G, 1G, wireless, etc.), but right now I don't see any obvious reason why performance shouldn't scale linearly.
Or.