Thread (37 messages) 37 messages, 6 authors, 2014-05-16

Re: [Xen-devel] [3.15-rc3] Bisected: xen-netback mangles packets between two guests on a bridge since merge of "TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy" series.

From: Sander Eikelenboom <hidden>
Date: 2014-05-01 19:39:56

Thursday, May 1, 2014, 7:39:36 PM, you wrote:

Thursday, May 1, 2014, 5:46:01 PM, you wrote:
quoted
On 01/05/14 14:59, Sander Eikelenboom wrote:
quoted
Thursday, May 1, 2014, 3:37:41 PM, you wrote:
quoted
On 30/04/14 23:25, Sander Eikelenboom wrote:
quoted
Wednesday, April 30, 2014, 10:53:39 PM, you wrote:
quoted
On 30/04/14 11:45, Sander Eikelenboom wrote:
quoted
Hi Zoltan,

Your series "TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy", merged into mainline with merge commit 4caeccb4de76440e433a15009636e77d003eb3d6,
seem to introduce a subtle bug on network traffic between 2 guests on a bridge on the same host.
I have one guest running apache as webdav server with SSL and another guest that is using that is uploading large files to that webdav server.
Small requests (some get's and propfind's) seem to work ok, but when the bulk uploading begins it fails with:

Attempt 1 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL routines:SSL3_READ_BYTES:sslv3 alert bad record mac
Attempt 2 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL routines:SSL3_READ_BYTES:sslv3 alert bad record mac
Attempt 3 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL routines:SSL3_READ_BYTES:sslv3 alert bad record mac
Attempt 4 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL routines:SSL3_READ_BYTES:sslv3 alert bad record mac

So some how large (probably fragmented) packets can get mangled when from guest to guest on the same host.
I don't see this with clients that upload large files from external sources.
Probably if SSL wasn't complaining it would probably be unnoticed for longer and doing some silent corruption.

I first blamed openssl, since it started around all the latest openssl mayhem and updates, but it turns out it is all xen-netback related again.

Since these commits break bisectabillity:
       - 1bb332af4cd889e4b64dacbf4a793ceb3a70445d  (note in commit message && kernel panic)
       - 62bad3199a4c20505fc36c169deef20b25e17c5f  (kernel panic)
i stopped bisecting at this point.

The upside is .. it's 100% reproduceable :-)
That's good :) Can you take tcpdump captures along the way (sending
guest, dom0, receiving guest), and try to work out which packets are
different, and where? Although taking captures in Dom0 might change your
result, as it triggers the pages to be copied and unmapped before they
reach their target.
quoted
Thanks,
Zoli

Hrrmm that sounds like a lot of data and a lot of work ..
If you could make captures in the sending and receiving guest with
tcpdump (take care of increasing snaplen so the whole packet is there,
and filter to the SSH connection itself), and upload it somewhere for
me, that would be enough for start. I will try to work out where the
corruption happens.
Also, do you have timestamps for the above mentioned log entries? I
guess they appear on the receiving side.
And some info about the componenets on the server, so I can work out
where is that _ssl.c:1415, and which part of the packet it actually
looks for.
They appear on the sending side (duplicity) .. the receiving side (apache +
mod_dav + ssl | gnu_tls) gives a "Could not get next bucket brigade (URI:"
I will try to repro this case in house. What versions of these 
components you used?
Both guests are debian wheezy.
The webdav server has:
ii  apache2-mpm-event                2.2.22-13+deb7u1           amd64        Apache HTTP Server - event driven model
ii  apache2-utils                    2.2.22-13+deb7u1           amd64        uti
ii  apache2.2-bin                    2.2.22-13+deb7u1           amd64        Apa
ii  apache2.2-common                 2.2.22-13+deb7u1           amd64        Apa
ii  libapache2-mod-gnutls            0.5.10-1.1                 amd64        Apa
ii  libssl1.0.0:amd64                1.0.1e-2+deb7u7            amd64        SSL
ii  openssl                          1.0.1e-2+deb7u7            amd64        Sec
The guest with duplicity currently has a duplicity version from unstable 
recompiled for wheezy. But i previously also tried a downgrade to the standard 
wheezy version. It uses the webdav backend and a volumesize of 100MB.
Unfortunately it seems duplicity doesn't bail out at first instance, it seems it 
only reports error after the so the full tcpdumps i got are also 100MB each.
Since the error seems to happen when it's going through 
"xenvif_handle_frag_list", i have added a bunch of ratelimited printk's.
Will run that for both the cases:
        skb->truesize -= skb->data_len;
        skb->truesize -= nskb->data_len;
Let's see what that does different and if that gives an insight in what is going 
wrong.
The debug patch that i used is attached.
Here is some of the output, skb->data_len size sometimes varies slightly as does skb->truesize (is that allowed to become > 64k ?). 
All of it attached in dmesg.txt

[  647.256275] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 1: skb->data_len: 64890, skb->len: 65018, skb->truesize: 65658, nskb->data_len: 0, nskb->len: 0, nskb->truesize: 768  MAX_SKB_FRAGS:17 offset:128
[  647.273483] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 2: skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:128
[  647.291478] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:0, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:4224
[  647.310199] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:1, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:8320
[  647.329874] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:2, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:12416
[  647.350268] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:3, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:16512
[  647.371562] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:4, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:20608
[  647.393707] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:5, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:24704
[  647.416604] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:6, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:28800
[  647.440482] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:7, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:32896
[  647.465490] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:8, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:36992
[  647.491216] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:9, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:41088
[  647.517698] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:10, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:45184
[  647.545022] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:11, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:49280
[  647.573482] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:12, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:53376
[  647.602458] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:13, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:57472
[  647.632271] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:14, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:61568
[  647.662608] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:15, len: 3658, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:65226
[  647.693632] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 4: i:16, skb->data_len: 65098, skb->len: 65226, skb->truesize: 66304, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:65226

<snip>

[  848.058828] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 1: skb->data_len: 63493, skb->len: 63621, skb->truesize: 64261, nskb->data_len: 0, nskb->len: 0, nskb->truesize: 768  MAX_SKB_FRAGS:17 offset:128 
[  848.096299] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 2: skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:128 
[  848.133427] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:0, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:4224 
[  848.170519] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:1, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:8320 
[  848.207578] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:2, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:12416 
[  848.244847] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:3, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:16512 
[  848.282044] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:4, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:20608 
[  848.319114] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:5, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:24704 
[  848.356198] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:6, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:28800 
[  848.393369] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:7, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:32896 
[  848.430450] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:8, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:36992 
[  848.467631] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:9, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:41088 
[  848.504711] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:10, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:45184 
[  848.541759] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:11, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:49280 
[  848.578914] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:12, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:53376 
[  848.616120] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:13, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:57472 
[  848.653251] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:14, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:61568 
[  848.690269] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:15, len: 3658, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:65226 
[  848.727394] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 4: i:16, skb->data_len: 65098, skb->len: 65226, skb->truesize: 66304, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:65226 

quoted
Zoli
quoted
quoted
quoted
quoted
how ever .. could it be just a type and would the following make sense ?
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 7666540..abeea10 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -1366,7 +1366,7 @@ static int xenvif_handle_frag_list(struct xenvif *vif, struct sk_buff *skb)

          xenvif_fill_frags(vif, nskb);
          /* Subtract frags size, we will correct it later */
-       skb->truesize -= skb->data_len;
+       skb->truesize -= nskb->data_len;
          skb->len += nskb->len;
          skb->data_len += nskb->len;
quoted
Nope, that's correct there: after that skb->truesize will be the size of
the struct plus the linear buffer itself. The code is just about the
ditch the original fragments plus the skb on the frag_list. When the new
pages are created, it will update it again.
Well i just went a head and tried this .. and the uploading does seem to work fine with this change
.. (that obviously doesn't say anything about correctness)
quoted
Also, this code path runs only if the guest sends more slots we can
handle (so we put the extra one to the frag_list until we can get rid of
it). On Linux it can only happen with 3.2 or older guest kernels, and
only occasionally. As you said, this is 100% reproducible, so I would
doubt the problem is with this part of the code.
Well this assumption seems to be incorrect:
         - both dom0 and guest kernels are 3.15-rc3's.
         - but we do end up in this code path
quoted
Zoli


Attachments

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help