Re: Kernel 4.1.12 crash

From: Guillaume Nault <hidden>
Date: 2015-11-30 20:42:12

[Adding Simon to the discussion]

On Mon, Nov 30, 2015 at 04:03:37PM +0100, Guillaume Nault wrote:

On Mon, Nov 30, 2015 at 12:05:13AM +0200, Andrew wrote:

26.11.2015 18:44, Guillaume Nault wrote:

On Wed, Nov 25, 2015 at 04:58:54PM +0200, Andrew wrote:

25.11.2015 16:10, Guillaume Nault wrote:

On Wed, Nov 25, 2015 at 12:59:52AM +0200, Andrew wrote:

Hi.

I tried to reproduce the errors in a virtual environment (some VMs on my
notebook).

I've tried to create 1000 client PPPoE sessions from this box via this script:
for i in `seq 1 1000`; do pppd plugin rp-pppoe.so user test password test \
nodefaultroute maxfail 0 persist nodefaultroute holdoff 1 noauth eth0; done

I've tried to reproduce the bug with your script, but couldn't get
anything to crash (VM is Debian Jessie i386 running on KVM with upstream
kernel 4.1.12). Does the crash happen before all sessions get
established?

Yes, the crash happens even before all daemon instances are started. Sessions
don't get established because the BRAS is configured to reject them (so a lot
of concurrent connection retries happen); I still haven't created an account
for the test user on it.

Ok, I got the crash too. In fact I had misunderstood your previous
message: the crash happens when PPP sessions don't get established
(authentication failures in my case).

I'll investigate that and let you know.

It seems the bug appears when many ppp devices are removed at once (I planned
to use this test environment to reproduce the periodic BRAS crashes, but
unexpectedly got crashes on the test client instead).

I've checked some kernels: the bug is present in 4.3.0 but not in
3.10.57. I'll try building 3.14/3.18 kernels to see how they behave
in this case.

Yes, it most likely was introduced by 287f3a943fef ("pppoe: Use
workqueue to die properly when a PADT is received"). I still have to
figure out why.

I confirm the bug comes from this commit.

It happens if pppoe_connect() reinitialises po->proto.pppoe.padt_work
after pppoe_disc_rcv() has added it to the system's work queue, and
before that work got scheduled. Then when scheduling occurs, the worker
thread tries to run a corrupted structure and crashes.
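
To make the ordering problem concrete, here is a rough userspace sketch of
it. This is not kernel code: struct work, work_init() and work_queue() are
hypothetical stand-ins that only loosely mimic a work item, INIT_WORK() and
queueing, and the real work_struct carries more state than this model does.
It only shows why resetting an item's list links while it is still queued
leaves the queue inconsistent.

```c
/* Minimal doubly linked list node, loosely modelled on a kernel list head. */
struct node {
    struct node *next;
    struct node *prev;
};

/* Hypothetical stand-in for a work item such as padt_work. */
struct work {
    struct node entry;              /* links the item into a queue */
    void (*func)(struct work *);    /* callback a worker would run */
};

static void node_init(struct node *n)
{
    n->next = n;
    n->prev = n;
}

static void node_add_tail(struct node *n, struct node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Assumed analogue of INIT_WORK(): resets the links unconditionally,
 * without checking whether the item is currently queued. */
static void work_init(struct work *w, void (*func)(struct work *))
{
    node_init(&w->entry);
    w->func = func;
}

/* Assumed analogue of queueing the work: append it to the pending list. */
static void work_queue(struct node *queue, struct work *w)
{
    node_add_tail(&w->entry, queue);
}

/* Walk a bounded number of nodes and check that every forward link is
 * mirrored by a backward link. Returns 1 if well formed, 0 otherwise. */
static int queue_is_consistent(const struct node *queue, int limit)
{
    const struct node *n = queue;
    int i;

    for (i = 0; i <= limit; i++) {
        if (n->next->prev != n)
            return 0;
        n = n->next;
        if (n == queue)
            return 1;
    }
    return 0;   /* never got back to the head: treat as corrupted */
}

static void dummy_handler(struct work *w)
{
    (void)w;
}

/* Returns 1 if re-initialising a queued item corrupts the queue,
 * 0 if the queue survives, -1 if the setup itself went wrong. */
int reinit_after_queue_corrupts(void)
{
    struct node queue;
    struct work padt;

    node_init(&queue);
    work_init(&padt, dummy_handler);

    work_queue(&queue, &padt);          /* role played by pppoe_disc_rcv() */
    if (!queue_is_consistent(&queue, 8))
        return -1;

    work_init(&padt, dummy_handler);    /* role played by pppoe_connect() */
    return !queue_is_consistent(&queue, 8);
}
```

In this model reinit_after_queue_corrupts() returns 1: the queue head still
points at the item, but the item's links now point back at itself, so anything
walking the queue no longer sees a well-formed list.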

I'm going to work on a patch.
