Re: Kernel 4.1.12 crash
From: Andrew <hidden>
Date: 2015-11-25 09:35:52
Hm, an older image with kernel 3.10.57 looks stable in the same test case, so at least one of the bugs should be fairly easy to bisect. I'll try downgrading the kernel while keeping the same userland for testing, and then bisect to find the buggy commit.

On 25.11.2015 00:59, Andrew wrote:
Hi.

I tried to reproduce the errors in a virtual environment (a few VMs on my notebook).

I tried to create 1000 client PPPoE sessions from this box with a script:

    for i in `seq 1 1000`; do
        pppd plugin rp-pppoe.so user test password test nodefaultroute \
            maxfail 0 persist nodefaultroute holdoff 1 noauth eth0
    done

On the VM used as the client I got strange random crashes (they occur only when the server is online, so they are network-related):

http://postimg.org/image/ohr2mu3rj/ - the crash is here:

    (gdb) list *process_one_work+0x32
    0xc10607b2 is in process_one_work (/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/workqueue.c:1952).
    1947    __releases(&pool->lock)
    1948    __acquires(&pool->lock)
    1949    {
    1950            struct pool_workqueue *pwq = get_work_pwq(work);
    1951            struct worker_pool *pool = worker->pool;
    1952            bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
    1953            int work_color;
    1954            struct worker *collision;
    1955    #ifdef CONFIG_LOCKDEP
    1956            /*

http://postimg.org/image/x9mychssx/ - the crash is here (seen twice):

    0xc10658bf is in kthread_data (/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/kthread.c:136).
    131      * The caller is responsible for ensuring the validity of @task when
    132      * calling this function.
    133      */
    134     void *kthread_data(struct task_struct *task)
    135     {
    136             return to_kthread(task)->data;
    137     }

which is reached from a strange place:

    (gdb) list *kthread_create_on_node+0x120
    0xc1065340 is in kthread (/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/kthread.c:176).
    171     {
    172             __kthread_parkme(to_kthread(current));
    173     }
    174
    175     static int kthread(void *_create)
    176     {
    177             /* Copy data: it's on kthread's stack */
    178             struct kthread_create_info *create = _create;
    179             int (*threadfn)(void *data) = create->threadfn;
    180             void *data = create->data;

And earlier:

    (gdb) list *ret_from_kernel_thread+0x21
    0xc13bb181 is at /var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/arch/x86/kernel/entry_32.S:312.
    307             popl_cfi %eax
    308             pushl_cfi $0x0202       # Reset kernel eflags
    309             popfl_cfi
    310             movl PT_EBP(%esp),%eax
    311             call *PT_EBX(%esp)
    312             movl $0,PT_EAX(%esp)
    313             jmp syscall_exit
    314             CFI_ENDPROC
    315     ENDPROC(ret_from_kernel_thread)
    316

Stack corruption?.. I'll try to set up a test environment on real hardware, and I'll also test with older kernels.

On 22.11.2015 07:17, Alexander Duyck wrote:
On 11/21/2015 12:16 AM, Andrew wrote:
> Memory corruption, if it is happening, IMHO shouldn't be hardware-related: almost all of these boxes, except the H61M-based box from the 1st log, have worked for a long time with uptimes of more than a year, and only the software on them was changed; the H61M-based box ran memtest86 for tens of hours without any error. If it were caused by hardware, they should have crashed even earlier.

I wasn't saying it was hardware related. My thought is that it could be some sort of use-after-free or double-free type issue. Basically what you end up with is the memory getting corrupted by software that is accessing regions it shouldn't be.
> Rarely, on different servers, I saw 'zram decompression error' messages (in this case I got such a message on the H61M-based box). Also, other people who use accel-ppp as BRAS software see various kernel panics/bugs/oopses on fresh kernels. I'll try to apply these patches, and I'll try switching back to kernels that were stable on some boxes.

If you could bisect this it would be useful. Basically we just need to determine where in the git history these issues started popping up so that we can then narrow down on the root cause.

- Alex
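
A minimal sketch of how the bisect suggested above could be automated with `git bisect run`, which decides each step from the exit status of a script: 0 marks the commit good, 125 skips it (for example when it does not build), and any other status from 1 to 127 marks it bad. The function name `bisect_verdict` and its two command arguments are hypothetical placeholders, not something from this thread; the real build and crash-test commands would have to be supplied.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): map a build command and a test
# command onto the exit statuses that `git bisect run` understands.
#   0   -> commit is good
#   125 -> commit cannot be tested (e.g. build failure), skip it
#   1   -> commit is bad (the test command failed)
bisect_verdict() {
    build_cmd=$1   # assumption: command that builds the kernel image
    test_cmd=$2    # assumption: command that exits non-zero if the crash shows up

    sh -c "$build_cmd" || return 125
    sh -c "$test_cmd" || return 1
    return 0
}
```

A step script would call the helper and exit with its status, e.g. `bisect_verdict 'make bzImage' './run-pppoe-test.sh'; exit $?` (both commands are placeholders). The run itself could then be `git bisect start v4.1 v3.10` followed by `git bisect run ./bisect-step.sh`, assuming those mainline tags bracket the regression; 3.10.57 and 4.1.12 are stable-branch releases, so the exact endpoints are an assumption. Because the crashes are random, the test command would need to run long enough to make a false "good" verdict unlikely.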