Thread: 26 messages, 5 authors, 2016-05-05

Re: [net-next PATCH v2 5/9] mlx4: Add support for UDP tunnel segmentation with outer checksum offload

From: Alexander Duyck <hidden>
Date: 2016-05-05 22:00:15

On Thu, May 5, 2016 at 2:39 PM, Or Gerlitz [off-list ref] wrote:
On Wed, May 4, 2016 at 7:06 PM, Alex Duyck [off-list ref] wrote:
On Wed, May 4, 2016 at 8:50 AM, Or Gerlitz [off-list ref] wrote:
On 5/3/2016 6:29 PM, Alexander Duyck wrote:
We split the one that would be a different size off via GSO.  So we
end up sending up 2 frames to the device if there is going to be one
piece that doesn't quite match.  We split that one piece off via GSO.
That is one of the reasons why I referred to it as partial GSO, as all
we are using the software segmentation code for is to make sure that
the GSO block consists of segments that are all the same size.
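To make the split described above concrete, here is a minimal sketch of the arithmetic only. It is not the kernel's segmentation code; the function names are hypothetical, and the 1448-byte MSS in the usage note is just an illustrative value.

```python
from typing import Tuple


def partial_gso_split(payload_len: int, mss: int) -> Tuple[int, int]:
    """Return (uniform_block_len, tail_len) for one TCP payload.

    uniform_block_len is the largest multiple of mss that fits in the
    payload; tail_len is the odd-sized remainder that would have to be
    split off in software. (Hypothetical helper, illustration only.)
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    uniform = (payload_len // mss) * mss
    return uniform, payload_len - uniform


def frames_handed_to_device(payload_len: int, mss: int) -> int:
    """Count how many frames reach the driver: the block and/or the tail."""
    uniform, tail = partial_gso_split(payload_len, mss)
    return (1 if uniform else 0) + (1 if tail else 0)
```

With an MSS of 1448, a 65160-byte payload (45 full segments) goes down as one frame, while a 65000-byte payload goes down as a 63712-byte block plus a 1288-byte tail, i.e. two frames.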

I see, so if somehow it happens a lot that the TCP stack sends down
something which once segmented ends up with the last segment being of
different size from the other ones we would have to call the NIC xmit
function twice (BTW can we use xmit_more here?)  -- which could be affecting
performance, I guess.

GSO_UDP_TUNNEL_CSUM (commit 0f4f4ffa7 "net: Add GSO support for UDP tunnels
with checksum") came to mark "that a device is capable of computing the UDP
checksum in the encapsulating header of a UDP tunnel" -- and the way we use
it here is that we do advertise that bit towards the stack for devices whose
HW can **not** do that, and things work b/c of LCO (this is my
understanding).

I am missing something in the bigger picture here: what does this buy us,
e.g. vs just letting this (say) vxlan tunnel use zero checksum on the outer
UDP packet? Does that have something to do with RCO?
I think the piece you are missing is GSO_PARTIAL.  Basically
GSO_PARTIAL indicates that we can perform GSO as long as all segments
are the same size and also allows for ignoring one level of headers.
So in the case of ixgbe for instance we can support tunnel offloads as
long as we allow for the inner IPv4 ID to be a fixed value which is
identified by enabling TSO_MANGLEID.  In the case of i40e, mlx4, and
mlx5 the key bit is that we just have to have the frames the same size
for all segments and then we can support tunnels with outer checksum
because the checksum has been computed once and can be applied to all
of the segmented frames.
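The "computed once" point can be illustrated with a toy model of the one's-complement arithmetic. This is a sketch, not driver or kernel code: the helper names, addresses, ports, and the 50-byte stand-in for the outer UDP, VXLAN, inner Ethernet, and inner IPv4 headers are all assumptions made for illustration. It relies on the fact that a TCP segment carrying a valid checksum sums to the complement of its pseudo-header sum, so the outer checksum does not depend on the inner payload or sequence number.

```python
import struct


def csum_fold(data: bytes, acc: int = 0) -> int:
    """Folded 16-bit one's-complement sum of data, starting from acc."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    for (word,) in struct.iter_unpack("!H", data):
        acc += word
    while acc >> 16:
        acc = (acc & 0xFFFF) + (acc >> 16)
    return acc


def tcp_pseudo(src: bytes, dst: bytes, tcp_len: int) -> bytes:
    """IPv4 pseudo header for TCP (protocol 6)."""
    return src + dst + struct.pack("!BBH", 0, 6, tcp_len)


def tcp_segment(src: bytes, dst: bytes, seq: int, payload: bytes) -> bytes:
    """Build a 20-byte TCP header plus payload with a valid checksum."""
    hdr = struct.pack("!HHIIHHHH", 1234, 80, seq, 0,
                      (5 << 12) | 0x10, 65535, 0, 0)
    pseudo = tcp_pseudo(src, dst, len(hdr) + len(payload))
    csum = ~csum_fold(pseudo + hdr + payload) & 0xFFFF
    return hdr[:16] + struct.pack("!H", csum) + hdr[18:] + payload


def outer_csum_full(outer_pseudo: bytes, encap_hdrs: bytes,
                    inner_segment: bytes) -> int:
    """Outer checksum computed the slow way, over every byte."""
    return ~csum_fold(outer_pseudo + encap_hdrs + inner_segment) & 0xFFFF


def outer_csum_from_pseudo(outer_pseudo: bytes, encap_hdrs: bytes,
                           inner_pseudo: bytes) -> int:
    """Outer checksum from headers only, seeded with ~(inner pseudo sum)."""
    seed = ~csum_fold(inner_pseudo) & 0xFFFF
    return ~csum_fold(outer_pseudo + encap_hdrs, seed) & 0xFFFF
```

For two equal-sized segments with different payloads and sequence numbers, both functions give the same outer checksum, which is why one precomputed value can serve every frame in a same-size block.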
Yep, I think I basically follow the PARTIAL thing, which once
advertised by i40e, mlx4 and mlx5 allows them to support UDP (and GRE
in the i40e case) tunnels with outer checksum.

My question was what this buys us for the UDP case vs. using zero
checksum for the tunnel (outer packet). I tried to figure out if it
has something to do with the remote side, e.g. for RCO or alike.
Basically, under PARTIAL, in the worst case we could end up with 2x
packets xmitted to the NIC, e.g. if each TCP message which is to be
encapsulated by the stack and later segmented by the NIC HW is broken
in two b/c otherwise the last segmented packet would not be of equal
size to all the preceding ones.
There ends up being a few pieces to this.  In the case of i40e the Tx
gain seen is mostly for just transmitting the tunnel types with
checksums.  This is because without that we have to use software
segmentation and that is expensive because it requires 40+ frames to
transmit a single 64K block of TCP data.  In the case of GSO_PARTIAL
this data is usually all sent in a single packet because the TCP stack
tries to send MSS aligned blocks.
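As a rough check on the "40+ frames" figure, the count for pure software segmentation is just a ceiling division. The function name and the MSS values below are assumptions chosen for illustration; actual MSS depends on MTU, tunnel overhead, and TCP options.

```python
def software_gso_frames(payload_len: int, mss: int) -> int:
    """Frames the driver sees when the stack segments entirely in software."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return -(-payload_len // mss)  # ceiling division on integers
```

For example, 65536 bytes at an MSS of 1448 comes to 46 frames, and at an MSS of 1460 to 45 frames, versus a single frame handed to the device when the block is MSS aligned and the hardware does the segmentation.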

On the Rx side a gain can be seen if we exceed the number of ports
that can be used to support tunnels on the device.  This is because
the hardware can still offload the outer UDP checksum and as a result
it can still do GRO on the frame thanks to the code Tom Herbert added
that converts validated outer UDP checksums to checksum complete.
Without the outer UDP checksum present we wouldn't be able to do GRO
and throughput drops to the 6 - 9Gb/s range.
Or, being a bit more positive... is there an expected performance gain
when you use MANGLEID and/or PARTIAL to enable supporting UDP tunnel
segmentation checksum offload towards the stack? What is the reason
for that gain?
The TSO_MANGLEID bit is only really needed for igb and ixgbe.  Those
drivers don't support tunnel offloads directly.  Instead they can
support checksum offloads or a segmentation offload with an arbitrary
IP header size up to 511 bytes.  So in order to do segmentation for
tunnels what we are doing is repeating everything from the outer
transport header through the inner network header for each frame.  As
such we can only perform segmentation offloads for IPv4 in any type of
tunnel if we can repeat the IP ID for the inner header.  If we are
allowed to do that then we can move packets between functions on the
same device at 15Gb/s which is the upper limit of non-encapsulated
traffic for VF to VF.  Without that we are at 12Gb/s with outer
checksums and software segmentation, and only 6Gb/s with software
segmentation and outer checksum forced to 0.
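A toy model of the header replication described above may help show why the inner IPv4 ID ends up fixed. This is only a sketch: the `Frame` type and `replicate_headers` function are hypothetical, and the assumption that the outer IP ID increments per frame while the TCP sequence number advances is mine, not something stated in this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Frame:
    outer_ip_id: int
    inner_ip_id: int
    tcp_seq: int
    payload_len: int


def replicate_headers(outer_ip_id: int, inner_ip_id: int, tcp_seq: int,
                      payload_len: int, mss: int) -> List[Frame]:
    """Model segmentation where the inner network header is copied as-is."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    frames: List[Frame] = []
    offset = 0
    while offset < payload_len:
        chunk = min(mss, payload_len - offset)
        frames.append(Frame(
            outer_ip_id=(outer_ip_id + len(frames)) & 0xFFFF,  # assumed to increment
            inner_ip_id=inner_ip_id,                           # repeated verbatim
            tcp_seq=(tcp_seq + offset) & 0xFFFFFFFF,
            payload_len=chunk,
        ))
        offset += chunk
    return frames
```

Every frame in the result carries the same `inner_ip_id`, which is the behavior the stack has to accept before this kind of offload can be used.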
As for GRE tunnel segmentation checksum offload, I saw in your i40e
patch that it made your testbed go from 12Gb/s to 20Gb/s. Is this b/c
the stack cannot actually let the HW do the segmentation w/o checksum
offload? If not, can you help me understand the source of the gain?
The device didn't advertise NETIF_F_GRE_CSUM so if there was a
checksum in the GRE header the packet had to be segmented in software.
By using the GSO_PARTIAL approach the speed is improved and comes up
to about 20Gb/s which is what the hardware does for standard GRE
tunnels.  Basically the best software segmentation can do is 12Gb/s
for most NICs on a single flow.  With hardware segmentation or GSO partial
we can push somewhere around 20Gb/s or more depending on the
configuration.
Hope that helps.
Yes, your notes are very helpful, thanks for sparing the time.
No problem.

- Alex