Thread: 28 messages, 2 authors, 2018-02-21

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

From: "tj@kernel.org" <tj@kernel.org>
Date: 2018-02-08 17:00:10

Hello, Bart.

On Thu, Feb 08, 2018 at 04:31:43PM +0000, Bart Van Assche wrote:
> > That sounds more like a scsi hotplug bug than an issue in the timeout
> > code unless we messed up @req pointer to begin with.
>
> I don't think that this is related to SCSI hotplugging: this crash does not
> occur with the v4.15 block layer core and it does not occur with my timeout
> handler rework patch applied either. I think that means that we cannot
> exclude the block layer core timeout handler rework as a possible cause.
>
> The disassembler output is as follows:
>
> (gdb) disas /s scsi_times_out
> Dump of assembler code for function scsi_times_out:
> drivers/scsi/scsi_error.c:
> 282     {
>    0x0000000000005bd0 <+0>:     push   %r13
>    0x0000000000005bd2 <+2>:     push   %r12
>    0x0000000000005bd4 <+4>:     push   %rbp
> ./include/linux/blk-mq.h:
> 300             return rq + 1;
>    0x0000000000005bd5 <+5>:     lea    0x178(%rdi),%rbp
> drivers/scsi/scsi_error.c:
> 282     {
>    0x0000000000005bdc <+12>:    push   %rbx
> 283             struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
> 284             enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
> 285             struct Scsi_Host *host = scmd->device->host;
>    0x0000000000005bdd <+13>:    mov    0x1b0(%rdi),%rax
> 282     {
>    0x0000000000005be4 <+20>:    mov    %rdi,%rbx
> 283             struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
> 284             enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
> 285             struct Scsi_Host *host = scmd->device->host;
>    0x0000000000005be7 <+23>:    mov    (%rax),%r13
>    0x0000000000005bea <+26>:    nopl   0x0(%rax,%rax,1)
> [ ... ]
> (gdb) print /x sizeof(struct request)
> $2 = 0x178
> (gdb) print &(((struct scsi_cmnd*)0)->device)
> $4 = (struct scsi_device **) 0x38 <scsi_cmd_get_serial+8>
> (gdb) print &(((struct scsi_device*)0)->host)
> $5 = (struct Scsi_Host **) 0x0
>
> The crash is reported at address scsi_times_out+0x17 == scsi_times_out+23. The
> instruction at that address tries to dereference scsi_cmnd.device (%rax). The
> register dump shows that that pointer has the value NULL. The only function I
> know of that clears the scsi_cmnd.device pointer is scsi_req_init(). The only
> caller of that function in the SCSI core is scsi_initialize_rq(). That function
> has two callers, namely scsi_init_command() and blk_get_request(). However,
> the scsi_cmnd.device pointer is not cleared when a request finishes. This is
> why I think that the above crash report indicates that scsi_times_out() was
> called for a request that was being reinitialized and not by device hotplugging.

I could be misreading it but scsi_cmnd->device dereference should be
the following.

    0x0000000000005bdd <+13>:    mov    0x1b0(%rdi),%rax

%rdi is @req; 0x1b0(%rdi) seems to be the combined arithmetic of
blk_mq_rq_to_pdu() and the ->device dereference, i.e. 0x178 + 0x38 =
0x1b0.  The faulting access is (%rax), which is deref'ing host from
device.
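
The offset arithmetic can be checked with a small standalone sketch.
The struct layouts below are mock stand-ins, padded only so that they
reproduce the gdb numbers quoted in this thread (sizeof(struct request)
== 0x178, ->device at 0x38, ->host at 0x0); they are not the real kernel
definitions, and the two device_disp_*() helpers are hypothetical names
used for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Mock layouts, NOT the kernel's definitions: padded to match gdb output. */
struct Scsi_Host;                        /* opaque in this sketch */

struct scsi_device {
	struct Scsi_Host *host;          /* offset 0x0 per gdb */
};

struct request {
	unsigned char opaque[0x178];     /* sizeof == 0x178 per gdb */
};

struct scsi_cmnd {
	unsigned char pad[0x38];         /* fields before ->device, elided */
	struct scsi_device *device;      /* offset 0x38 per gdb */
};

static_assert(sizeof(struct request) == 0x178, "mock request size");

/* Same shape as the helper shown in the disassembly (blk-mq.h line 300):
 * the driver payload sits immediately after the request. */
void *blk_mq_rq_to_pdu(struct request *rq)
{
	return rq + 1;
}

/* Displacement of scmd->device from @req, computed from sizes/offsets. */
size_t device_disp_from_req(void)
{
	return sizeof(struct request) + offsetof(struct scsi_cmnd, device);
}

/* Same displacement, computed from real addresses; nothing is dereferenced. */
size_t device_disp_via_pointers(void)
{
	static _Alignas(8) unsigned char
		buf[sizeof(struct request) + sizeof(struct scsi_cmnd)];
	struct request *req = (struct request *)buf;
	struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);

	return (size_t)((unsigned char *)&scmd->device - (unsigned char *)req);
}
```

Both helpers give 0x1b0, the displacement in "mov 0x1b0(%rdi),%rax",
so under these assumptions that instruction loads scmd->device and the
following "mov (%rax),%r13" reads ->host at offset 0 of the device.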

Thanks.

-- 
tejun