[WARNING: A/V UNSCANNABLE][Merge tag 'media/v4.11-1' of git] ff58d005cd:... | linux-arm-kernel

[WARNING: A/V UNSCANNABLE][Merge tag 'media/v4.11-1' of git] ff58d005cd: BUG: unable to handle kernel NULL pointer dereference at 0000039c

From: sean@mess.org (Sean Young)
Date: 2017-02-25 11:16:58
Also in: linux-amlogic, linux-devicetree, linux-input, linux-leds, linux-media, linux-mediatek, linux-omap, lkml

On Fri, Feb 24, 2017 at 11:15:51AM -0800, Linus Torvalds wrote:

Added more relevant people. I've debugged the immediate problem below,
but I think there's another problem that actually triggered this.

On Fri, Feb 24, 2017 at 10:28 AM, kernel test robot
[off-list ref] wrote:

quoted

0day kernel testing robot got the below dmesg and the first bad commit is

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master

commit ff58d005cd10fcd372787cceac547e11cf706ff6
Merge: 5ab3566 9eeb0ed

    Merge tag 'media/v4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

[...]

quoted

[    4.664940] rc rc0: lirc_dev: driver ir-lirc-codec (rc-loopback) registered at minor = 0
[    4.666322] BUG: unable to handle kernel NULL pointer dereference at 0000039c
[    4.666675] IP: serial_ir_irq_handler+0x189/0x410

This merge being fingered ends up being a subtle interaction with other changes.

Those "other changes" are (again) the interrupt retrigger code that
was reverted for 4.10, and then we tried to merge them again this
merge window.

Because the immediate cause is:

quoted

[    4.666675] EIP: serial_ir_irq_handler+0x189/0x410
[    4.666675] Call Trace:
[    4.666675]  <IRQ>
[    4.666675]  __handle_irq_event_percpu+0x57/0x100
[    4.666675]  handle_irq_event_percpu+0x1d/0x50
[    4.666675]  handle_irq_event+0x32/0x60
[    4.666675]  handle_edge_irq+0xa5/0x120
[    4.666675]  handle_irq+0x9d/0xd0
[    4.666675]  </IRQ>
[    4.666675]  do_IRQ+0x5f/0x130
[    4.666675]  common_interrupt+0x33/0x38
[    4.666675] EIP: hardware_init_port+0x3f/0x190
[    4.666675] EFLAGS: 00200246 CPU: 0
[    4.666675] EAX: c718990f EBX: 00000000 ECX: 00000000 EDX: 000003f9
[    4.666675] ESI: 000003f9 EDI: 000003f8 EBP: c0065d98 ESP: c0065d84
[    4.666675]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[    4.666675]  serial_ir_probe+0xbb/0x300
[    4.666675]  platform_drv_probe+0x48/0xb0

...

ie an interrupt came in immediately after the request_irq(), before
all the data was properly set up, which then causes the interrupt
handler to take a fault because it tries to access some field that
hasn't even been set up yet.

Oh dear. I've pointed out others making the same mistake when doing code
reviews, clearly I need review my own code better.

The code line is helpful, the faulting instruction is

      mov    0x39c(%rax),%eax   <--- fault
      call ..
      mov    someglobalvar,%edx

which together with the supplied config file makes me able to match it
up with the assembly generation around it:

        inb %dx, %al    # tmp254, value
        andb    $1, %al #, tmp255
        testb   %al, %al        # tmp255
        je      .L233   #,
  .L215:
        movl    serial_ir+8, %eax       # serial_ir.rcdev, serial_ir.rcdev
        xorl    %edx, %edx      # _66->timeout
        movl    924(%eax), %eax # _66->timeout, _66->timeout
        call    nsecs_to_jiffies        #
        movl    jiffies, %edx   # jiffies, jiffies.33_70
        addl    %eax, %edx      # _69, tmp259
        movl    $serial_ir+16, %eax     #,
        call    mod_timer       #
        movl    serial_ir+8, %eax       # serial_ir.rcdev,
        call    ir_raw_event_handle     #
        movl    $1, %eax        #, <retval>

so it's that "serial_ir.rcdev->timeout" access that faults. So this is
the faulting source code:

drivers/media/rc/serial_ir.c: 402

        mod_timer(&serial_ir.timeout_timer,
                  jiffies + nsecs_to_jiffies(serial_ir.rcdev->timeout));

        ir_raw_event_handle(serial_ir.rcdev);

        return IRQ_HANDLED;

and serial_ir.rcdev is NULL when ti tries to look up the timeout.

ir_raw_event_handle() call will also go bang if passed a null pointer, so
this problem existed before (since v4.10).

Thanks for debugging this, I'll send a patch as a reply to this email.


Sean

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help