Thread: 59 messages, 10 authors, 2012-10-04

Re: Possible networking regression in 3.6.0

From: Chris Clayton <hidden>
Date: 2012-09-18 15:51:15

Thanks for the reply, Eric.
-rc1 turned out to have the problem so I've bisected between 3.5 and
3.6-rc1. I arrived at:

$ git bisect bad
d2d68ba9fe8b38eb03124b3176a013bb8aa2b5e5 is the first bad commit
commit d2d68ba9fe8b38eb03124b3176a013bb8aa2b5e5
Author: David S. Miller [off-list ref]
Date:   Tue Jul 17 12:58:50 2012 -0700

      ipv4: Cache input routes in fib_info nexthops.

      Caching input routes is slightly simpler than output routes, since we
      don't need to be concerned with nexthop exceptions.  (locally
      destined, and routed packets, never trigger PMTU events or redirects
      that will be processed by us).

      However, we have to elide caching for the DIRECTSRC and non-zero itag
      cases.

      Signed-off-by: David S. Miller [off-list ref]

:040000 040000 6bbc75c1cbe62bf84ea412d3b98adf2b614779cd
3ad7256b4a71e63ca4530977c0550121ea803d35 M      include
:040000 040000 18c2a950a53c4eec9bfa12185d1e382dfed74af8
a2ab6157d6cd54930da395758c6ded3a225d1f04 M      net

The bisect log:
git bisect start
# bad: [0d7614f09c1ebdbaa1599a5aba7593f147bf96ee] Linux 3.6-rc1
git bisect bad 0d7614f09c1ebdbaa1599a5aba7593f147bf96ee
# good: [28a33cbc24e4256c143dce96c7d93bf423229f92] Linux 3.5
git bisect good 28a33cbc24e4256c143dce96c7d93bf423229f92
# bad: [614a6d4341b3760ca98a1c2c09141b71db5d1e90] Merge branch 'for-3.6'
of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
git bisect bad 614a6d4341b3760ca98a1c2c09141b71db5d1e90
# bad: [320f5ea0cedc08ef65d67e056bcb9d181386ef2c] genetlink: define
lockdep_genl_is_held() when CONFIG_LOCKDEP
git bisect bad 320f5ea0cedc08ef65d67e056bcb9d181386ef2c
# good: [0cd06647b7c24f6633e32a505930a9aa70138c22] Merge branch 'master'
of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
git bisect good 0cd06647b7c24f6633e32a505930a9aa70138c22
# good: [dbfa600148a25903976910863c75dae185f8d187] cxgb3: set maximal
number of default RSS queues
git bisect good dbfa600148a25903976910863c75dae185f8d187
# good: [efdfad3205403e1d1c5c0bdcbdb647ddd89bfaa3] bnx2: Try to recover
from PCI block reset
git bisect good efdfad3205403e1d1c5c0bdcbdb647ddd89bfaa3
# good: [1bf91cdc1bba94ea062a9147d924815c13f029f2] ixgbe: Drop
references to deprecated pci_ DMA api and instead use dma_ API
git bisect good 1bf91cdc1bba94ea062a9147d924815c13f029f2
# good: [b6dfd939fdc249fcf8cd7b8006f76239b33eb581] ixgbe: add support
for new 82599 device
git bisect good b6dfd939fdc249fcf8cd7b8006f76239b33eb581
# good: [3ba97381343b271296487bf073eb670d5465a8b8] net: ethernet:
davinci_emac: add pm_runtime support
git bisect good 3ba97381343b271296487bf073eb670d5465a8b8
# bad: [5e9965c15ba88319500284e590733f4a4629a288] Merge branch
'kill_rtcache'
git bisect bad 5e9965c15ba88319500284e590733f4a4629a288
# good: [f5b0a8743601a4477419171f5046bd07d1c080a0] net: Document
dst->obsolete better.
git bisect good f5b0a8743601a4477419171f5046bd07d1c080a0
# bad: [ba3f7f04ef2b19aace38f855aedd17fe43035d50] ipv4: Kill
FLOWI_FLAG_RT_NOCACHE and associated code.
git bisect bad ba3f7f04ef2b19aace38f855aedd17fe43035d50
# good: [f2bb4bedf35d5167a073dcdddf16543f351ef3ae] ipv4: Cache output
routes in fib_info nexthops.
git bisect good f2bb4bedf35d5167a073dcdddf16543f351ef3ae
# bad: [d2d68ba9fe8b38eb03124b3176a013bb8aa2b5e5] ipv4: Cache input
routes in fib_info nexthops.
git bisect bad d2d68ba9fe8b38eb03124b3176a013bb8aa2b5e5

Checking out the parent commit
(f2bb4bedf35d5167a073dcdddf16543f351ef3ae) and building and installing
the kernel gives a working configuration, so I'm pretty confident in the
outcome of the bisect. Reversing the patch gives errors, so I've not
tested master with the patch reversed.

Let me know if I can help in any way to identify a fix.
Sorry, I forgot to say that I also have tried running TinyCore Linux as
a KVM guest on a 3.6.0-rc6 kernel, and I can ping the router fine, so
the problem seems to be something specifically related to running
Windows XP as the guest. I don't have any other guests installed so
that's as much as I can say, although I could maybe install a Win7 guest
tomorrow if that would help.

I hope you've seen my later email, in which I reported the error in my
testing that led me to believe that all was OK with a Linux client. In
fact, the router is inaccessible from both the Windows XP and the Linux
clients.
It would help to have some traffic sample, maybe.
I'll need help here. How would I go about collecting that traffic? I
have wireshark installed, but haven't used it for years. Would a trace
from that be helpful? It might take me a while to figure out how to
capture it.
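
One possible way to collect such a sample is tcpdump on the KVM host, writing a capture file that wireshark can open afterwards. This is only a sketch: the interface name br0, the router address 192.168.0.1 and the output file name are assumptions that do not come from this thread and need to be adjusted to the actual setup.

```shell
# Assumed values, not from the thread: adjust to the actual host setup.
IFACE="${IFACE:-br0}"             # hypothetical bridge/tap interface the guest uses
ROUTER="${ROUTER:-192.168.0.1}"   # hypothetical router address
OUT="${OUT:-guest-ping.pcap}"     # capture file, readable later in wireshark

# Print a pcap filter limiting the capture to ARP and ICMP to/from the router.
capture_filter() {
    printf 'host %s and (arp or icmp)' "$ROUTER"
}

# Run as root while pinging the router from the guest, stop with Ctrl-C:
#   tcpdump -n -i "$IFACE" -w "$OUT" "$(capture_filter)"
```

Restricting the filter to ARP and ICMP keeps the file small, and it should show whether the guest's requests reach the host interface and whether any replies come back.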
Especially if the problem is not easily reproducible for us.

(I don't have Windows XP nor Win7)

Also the bisect might point to a commit with an already fixed bug:
This fix is already in 3.6.0-rc6. BTW, I've pulled the latest changes 
from kernel.org this afternoon, but that hasn't helped.
commit 4331debc51ee1ce319f4a389484e0e8e05de2aca
Author: Eric Dumazet [off-list ref]
Date:   Wed Jul 25 05:11:23 2012 +0000

     ipv4: rt_cache_valid must check expired routes

     commit d2d68ba9fe8 (ipv4: Cache input routes in fib_info nexthops.)
     introduced rt_cache_valid() helper. It unfortunately doesn't check if
     route is expired before caching it.

     I noticed sk_setup_caps() was constantly called on a tcp workload.

     Signed-off-by: Eric Dumazet [off-list ref]
     Signed-off-by: David S. Miller [off-list ref]
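
Whether a given fix is already part of the kernel being tested can be checked with git ancestry. The helper name is_fix_included below is hypothetical, and the tag name v3.6-rc6 in the usage comment is assumed to exist in the local kernel tree; this is a sketch, not a command taken from this thread.

```shell
# is_fix_included COMMIT REF
# Succeeds (exit status 0) when COMMIT is an ancestor of REF,
# i.e. when the tree at REF already contains COMMIT.
is_fix_included() {
    git merge-base --is-ancestor "$1" "$2"
}

# Inside a kernel tree, for the fix quoted above (tag name assumed):
#   is_fix_included 4331debc51ee1ce319f4a389484e0e8e05de2aca v3.6-rc6 \
#       && echo "fix is included"
```

A failing check would mean the bisect result could simply be pointing at the already-fixed bug, while a passing check, as reported above for 3.6.0-rc6, suggests a separate problem.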