RE: [PATCH net] gso: do GSO for local skb with size bigger than MTU

From: Du, Fan <hidden>
Date: 2014-12-03 01:58:16

-----Original Message-----
From: Flavio Leitner [mailto:fbl@redhat.com]
Sent: Wednesday, December 3, 2014 5:33 AM
To: Jesse Gross
Cc: Du, Fan; Jason Wang; netdev@vger.kernel.org; davem@davemloft.net;
fw@strlen.de
Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU

On Tue, Dec 02, 2014 at 10:06:53AM -0800, Jesse Gross wrote:

On Tue, Dec 2, 2014 at 7:44 AM, Flavio Leitner [off-list ref] wrote:

On Sun, Nov 30, 2014 at 10:08:32AM +0000, Du, Fan wrote:

-----Original Message-----
From: Jason Wang [mailto:jasowang@redhat.com]
Sent: Friday, November 28, 2014 3:02 PM
To: Du, Fan
Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du,
Fan
Subject: Re: [PATCH net] gso: do GSO for local skb with size
bigger than MTU



On Fri, Nov 28, 2014 at 2:33 PM, Fan Du [off-list ref] wrote:

Test scenario: two KVM guests sitting in different hosts
communicate to each other with a vxlan tunnel.

All interfaces use the default MTU of 1500 bytes. From the guest's
point of view, its skb gso_size can be as large as 1448 bytes.
However, after the guest skb goes through vxlan encapsulation, the
individual segments of a GSO packet can exceed the physical NIC MTU
of 1500 and will be lost at the receiver side.

So in a virtualized environment, the length of a locally created skb
after encapsulation can be bigger than the underlay MTU. In such a
case, it's reasonable to do GSO first, then fragment any packet
bigger than the MTU where possible.
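
A rough worked check of the sizes described above. This is only a sketch: it assumes IPv4 without IP options and a 32-byte TCP header (20 bytes plus a 12-byte timestamp option, which is what makes 1448 the payload size at an MTU of 1500). The function and constant names are illustrative and are not kernel APIs.

```c
#include <assert.h>

/* Header sizes in bytes (assumption: IPv4 without options). */
enum {
    ETH_HDR   = 14, /* inner Ethernet header carried inside the tunnel */
    IPV4_HDR  = 20,
    UDP_HDR   = 8,
    VXLAN_HDR = 8,
};

/* Length of one inner IP packet built from a single GSO segment. */
unsigned int inner_ip_len(unsigned int gso_size, unsigned int tcp_hdr_len)
{
    assert(tcp_hdr_len >= 20); /* a TCP header is at least 20 bytes */
    return IPV4_HDR + tcp_hdr_len + gso_size;
}

/* Length of the outer IP packet once the inner packet is VXLAN-encapsulated:
 * outer IPv4 + UDP + VXLAN + inner Ethernet + inner IP packet. */
unsigned int vxlan_outer_ip_len(unsigned int inner_len)
{
    return IPV4_HDR + UDP_HDR + VXLAN_HDR + ETH_HDR + inner_len;
}

/* Nonzero when an IP packet of ip_len bytes does not fit in mtu. */
int exceeds_mtu(unsigned int ip_len, unsigned int mtu)
{
    return ip_len > mtu;
}
```

With gso_size = 1448, inner_ip_len(1448, 32) is 1500, which fits the guest MTU, while vxlan_outer_ip_len(1500) is 1550, which no longer fits a physical NIC MTU of 1500.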

+---------------+ TX     RX +---------------+
|   KVM Guest   | -> ... -> |   KVM Guest   |
+-+-----------+-+           +-+-----------+-+
  |Qemu/VirtIO|               |Qemu/VirtIO|
  +-----------+               +-----------+
       |                            |
       v tap0                  tap0 v
  +-----------+               +-----------+
  | ovs bridge|               | ovs bridge|
  +-----------+               +-----------+
       | vxlan                vxlan |
       v                            v
  +-----------+               +-----------+
  |    NIC    |    <------>   |    NIC    |
  +-----------+               +-----------+

Steps to reproduce:
 1. Use the kernel's built-in openvswitch module to set up an ovs bridge.
 2. Run iperf without -M; the communication gets stuck.

Is this issue specific to ovs or ipv4? Path MTU discovery should
help in this case I believe.

The problem here is that the host stack pushes a local over-sized GSO
skb down to the NIC and performs GSO there, without any further IP
segmentation.

The reasonable behavior is to do GSO first at the IP level; if a
GSO-ed skb is bigger than the MTU && DF is set, then push an
ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to the sender so
that it adjusts its MTU.

For PMTU to work, that's another issue I will try to address later on.


Signed-off-by: Fan Du <redacted>
---
 net/ipv4/ip_output.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index bc6471d..558b5f8 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
   struct sk_buff *segs;
   int ret = 0;

-  /* common case: locally created skb or seglen is <= mtu */
-  if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
-        skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
+  /* Both locally created skb and forwarded skb could exceed
+   * MTU size, so make a unified rule for them all.
+   */
+  if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
           return ip_finish_output2(skb);
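
The effect of the hunk above can be modelled as two predicates. This is a simplified sketch, not kernel code: the function names are hypothetical, `forwarded` stands in for `IPCB(skb)->flags & IPSKB_FORWARDED`, and `seglen` stands in for `skb_gso_network_seglen(skb)`.

```c
#include <assert.h>
#include <stdbool.h>

/* Before the patch: a locally created skb always takes the fast path,
 * regardless of how large its segments are. */
bool fast_path_before(bool forwarded, unsigned int seglen, unsigned int mtu)
{
    assert(mtu > 0);
    return !forwarded || seglen <= mtu;
}

/* After the patch: local and forwarded skbs follow the same rule, so only
 * segments that fit the MTU take the fast path. */
bool fast_path_after(bool forwarded, unsigned int seglen, unsigned int mtu)
{
    (void)forwarded; /* no longer part of the decision */
    assert(mtu > 0);
    return seglen <= mtu;
}
```

For a local skb with 1550-byte segments and an MTU of 1500, fast_path_before() returns true and fast_path_after() returns false, so only the patched code goes on to segment the skb in software.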


Are you using kernel's vxlan device or openvswitch's vxlan device?

Because for kernel's vxlan devices the MTU accounts for the header
overhead so I believe your patch would work.  However, the MTU is
not visible for the ovs's vxlan devices, so that wouldn't work.

This is being called after the tunnel code, so the MTU that is being
looked at in all cases is the physical device's. Since the packet has
already been encapsulated, tunnel header overhead is already accounted
for in skb_gso_network_seglen() and this should be fine for both OVS
and non-OVS cases.

Right, it didn't work on my first try and that explanation came to mind.

Anyway, I am testing this with containers instead of VMs, so I am using veth and
not Virtio-net.

This is the actual stack trace:

[...]
 do_output
 ovs_vport_send
 vxlan_tnl_send
 vxlan_xmit_skb
 udp_tunnel_xmit_skb
 iptunnel_xmit
  \ skb_scrub_packet => skb->ignore_df = 0;
 ip_local_out_sk
 ip_output
 ip_finish_output (_gso is inlined)
 ip_fragment

and on ip_fragment() it does:

        if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
                     (IPCB(skb)->frag_max_size &&
                      IPCB(skb)->frag_max_size > mtu))) {
                IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
                icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
                          htonl(mtu));
                kfree_skb(skb);
                return -EMSGSIZE;
        }

Since IP_DF is set and skb->ignore_df is reset to 0, in my case the packet is
dropped and an ICMP is sent back. The connection remains stuck as before.
Doesn't virtio-net set DF bit?
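
To make the condition above easier to reason about, here is a small stand-alone model of it. It is a sketch only: the function name is hypothetical, and the parameters stand in for `iph->frag_off & htons(IP_DF)`, `skb->ignore_df`, and `IPCB(skb)->frag_max_size` (0 meaning unset).

```c
#include <assert.h>
#include <stdbool.h>

/* True when the quoted check would refuse to fragment, i.e. the packet is
 * dropped and ICMP_FRAG_NEEDED is sent back. */
bool frag_refused(bool df_set, bool ignore_df,
                  unsigned int frag_max_size, unsigned int mtu)
{
    assert(mtu > 0);
    return (df_set && !ignore_df) ||
           (frag_max_size && frag_max_size > mtu);
}
```

With DF set and ignore_df reset to 0, as in the trace above, frag_refused(true, false, 0, 1500) is true. With DF cleared it is false, and fragmentation can proceed.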

Thanks for giving it a try and seeing what really happens.

You're almost there! IP fragmentation honors IP_DF, and this bit is taken care of
by the vxlan interface. In a practical environment, the vxlan interface should take
a conservative attitude and allow fragmentation, by appending
"options: df_default=false" when the vxlan interface is created.

Why allow fragmentation? Because a guest or container may send over-MTU-sized
packets downwards. The host is expected to be prepared for such incidents. This is
just what happens in real-world cloud environments.

Thanks,
fbl
