Re: [v5 Patch 1/3] netpoll: add generic support for bridge and bonding devices
From: Cong Wang <hidden>
Date: 2010-05-28 02:43:34
Also in:
bridge, lkml
On 05/28/10 02:05, Flavio Leitner wrote:
Hi guys! I finally could test this to see if an old problem reported on bugzilla[1] was fixed now, but unfortunately it is still there. The ticket is private I guess, but basically the problem happens when bonding driver tries to print something after it had taken the write_lock (monitor functions, enslave/de-enslave), so the printk() will pass through netpoll, then on bonding again which no matter what mode you use, it will try to read_lock() the lock again. The result is a deadlock and the entire system hangs.
This is true, I already fixed some similar issues.
I manage to get a fresh backtrace with mode 1, see below:
[ 93.167079] Call Trace:
[ 93.167079] [<ffffffff81034cf9>] warn_slowpath_common+0x77/0x8f
[ 93.167079] [<ffffffff81034d5e>] warn_slowpath_fmt+0x3c/0x3e
[ 93.167079] [<ffffffff81366aef>] ? _raw_read_trylock+0x11/0x4b
[ 93.167079] [<ffffffffa02a2c42>] ? bond_start_xmit+0x12b/0x401 [bonding]
-> read_lock fails
[ 93.167079] [<ffffffffa02a2c9f>] bond_start_xmit+0x188/0x401 [bonding]
[ 93.167079] [<ffffffff81055b37>] ? trace_hardirqs_off+0xd/0xf
[ 93.167079] [<ffffffff812dfdb9>] netpoll_send_skb+0xbd/0x1f3
[ 93.167079] [<ffffffff812e00ed>] netpoll_send_udp+0x1fe/0x20d
[ 93.167079] [<ffffffffa02c017c>] write_msg+0x89/0xcd [netconsole]
[ 93.167079] [<ffffffff81034e65>] __call_console_drivers+0x67/0x79
[ 93.167079] [<ffffffff81034ed0>] _call_console_drivers+0x59/0x5d
[ 93.167079] [<ffffffff810352d3>] release_console_sem+0x121/0x1d7
[ 93.167079] [<ffffffff8103590a>] vprintk+0x35d/0x393
[ 93.167079] [<ffffffff8103f947>] ? add_timer+0x17/0x19
[ 93.167079] [<ffffffff81046ddf>] ? queue_delayed_work_on+0xa2/0xa9
[ 93.167079] [<ffffffff81363bb8>] printk+0x3c/0x44
[ 93.167079] [<ffffffffa02a3b17>] bond_select_active_slave+0x105/0x109 [bonding]
-> write_locked
[ 93.167079] [<ffffffffa02a4798>] bond_mii_monitor+0x479/0x4ed [bonding]
[ 93.167079] [<ffffffff81046009>] worker_thread+0x1ef/0x2e2
In this case, the message should be
"bonding: bond0: making interface eth0 the new active one"Hmm, you triggered a warning here, let me check the source code and try to reproduce it here.
quoted hunk ↗ jump to hunk
I did the following patch to discard the packet if it was IN_NETPOLL and the read_lock() fails, so I could go ahead testing it:diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 5e12462..a3b8bad 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c@@ -4258,8 +4258,19 @@ static int bond_xmit_activebackup(struct sk_buff *skb, struct net_device *bond_d struct bonding *bond = netdev_priv(bond_dev); int res = 1; - read_lock(&bond->lock); - read_lock(&bond->curr_slave_lock); + if (read_trylock(&bond->lock) == 0&& + (bond_dev->flags& IFF_IN_NETPOLL)) { + dev_kfree_skb(skb); + return NETDEV_TX_OK; + } + + if (read_trylock(&bond->curr_slave_lock) == 0&& + (bond_dev->flags& IFF_IN_NETPOLL)) { + read_unlock(&bond->lock); + dev_kfree_skb(skb); + return NETDEV_TX_OK; + } + if (!BOND_IS_OK(bond)) goto out;
This looks like a workaround, not a fix. :)
and I found another problem. The function netpoll_send_skb() checks
if the npinfo's queue length is zero and if it's not, it will queue
the packet to make sure it's in order and then schedule the thread
to run. Later, the thread wakes up running queue_process() which disables
interrupts before calling ndo_start_xmit(). However, dev_queue_xmit()
uses rcu_*_bh() and before return, it will enable the interrupts again,
spitting this:
------------[ cut here ]------------
WARNING: at kernel/softirq.c:143 local_bh_enable+0x3c/0x86()
Hardware name: Precision WorkStation 490
Modules linked in: netconsole bonding sunrpc ip6t_REJECT xt_tcpudp nf_conntrack_ipv6]
Pid: 17, comm: events/2 Not tainted 2.6.34-04700-gd938a70 #21
Call Trace:
[<ffffffff810381d6>] warn_slowpath_common+0x77/0x8f
[<ffffffff810381fd>] warn_slowpath_null+0xf/0x11
[<ffffffff8103d691>] local_bh_enable+0x3c/0x86
[<ffffffff812e4d85>] dev_queue_xmit+0x462/0x493
[<ffffffffa018805f>] bond_dev_queue_xmit+0x1bd/0x1e3 [bonding]
[<ffffffffa01881dd>] bond_start_xmit+0x158/0x37b [bonding]
-> interrupts disabled
[<ffffffff812f3fca>] queue_process+0x9d/0xf9
[<ffffffff8104d022>] worker_thread+0x19d/0x224
[<ffffffff812f3f2d>] ? queue_process+0x0/0xf9
[<ffffffff81050819>] ? autoremove_wake_function+0x0/0x34
[<ffffffff8104ce85>] ? worker_thread+0x0/0x224
[<ffffffff8105040b>] kthread+0x7a/0x82
[<ffffffff810036d4>] kernel_thread_helper+0x4/0x10
[<ffffffff81050391>] ? kthread+0x0/0x82
[<ffffffff810036d0>] ? kernel_thread_helper+0x0/0x10
---[ end trace 74e3904503fdb632 ]---
kernel/softirq.c:
141 static inline void _local_bh_enable_ip(unsigned long ip)
142 {
143 WARN_ON_ONCE(in_irq() || irqs_disabled());
144 #ifdef CONFIG_TRACE_IRQFLAGS
145 local_irq_disable();
146 #endif
147 /*
148 * Are softirqs going to be turned on now:
149 */
I am wondering if this was caused by the previous issue.
The git is updated up to: d938a70 be2net: increase POST timeout for EEH recovery Two slave interfaces, bonding mode 1, netconsole over bond0. [1] https://bugzilla.redhat.com/show_bug.cgi?id=248374#c5
How did you reproduce this? I will check that BZ to see if I can find how to reproduce this. Thanks.