Re: fsync hangs after scsi rejected a request

From: Ming Lei <tom.leiming@gmail.com>
Date: 2019-01-22 03:22:18

On Tue, Jan 22, 2019 at 5:13 AM Florian Stecker [off-list ref] wrote:

Hi everyone,

on my laptop, I am experiencing occasional hangs of applications during
fsync(), which are sometimes up to 30 seconds long. I'm using a BTRFS
which spans two partitions on the same SSD (one of them used to contain
a Windows, but I removed it and added the partition to the BTRFS volume
instead). Also, the problem only occurs when an I/O scheduler
(mq-deadline) is in use. I'm running kernel version 4.20.3.

From what I understand so far, what happens is that a sync request
fails in the SCSI/ATA layer, in ata_std_qc_defer(), because it is a
"Non-NCQ command" and can not be queued together with other commands.
This propagates up into blk_mq_dispatch_rq_list(), where the call

ret = q->mq_ops->queue_rq(hctx, &bd);

returns BLK_STS_DEV_RESOURCE. Later in blk_mq_dispatch_rq_list(), there
is the piece of code

needs_restart = blk_mq_sched_needs_restart(hctx);
if (!needs_restart ||
        (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
        blk_mq_run_hw_queue(hctx, true);
else if (needs_restart && (ret == BLK_STS_RESOURCE))
        blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);

which restarts the queue after a delay if BLK_STS_RESOURCE was returned,
but somehow not for BLK_STS_DEV_RESOURCE. Instead, nothing happens and
fsync() seems to hang until some other process wants to do I/O.

So if I do

- else if (needs_restart && (ret == BLK_STS_RESOURCE))
+ else if (needs_restart && (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE))

it fixes my problem. But was there a reason why BLK_STS_DEV_RESOURCE was
treated differently than BLK_STS_RESOURCE here?
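
As a rough illustration of the branch discussed above, here is a
user-space sketch of the decision it makes. This is not kernel code: the
enum names, the rerun_action() helper and the 'patched' flag are made up
for this example, and the real blk_mq_dispatch_rq_list() does much more.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's blk_status_t values. */
enum status { STS_OK, STS_RESOURCE, STS_DEV_RESOURCE };

/* What happens to requests left on the dispatch list. */
enum action { RUN_NOW, RUN_DELAYED, WAIT_FOR_RESTART };

/*
 * Model of the quoted if/else-if. 'patched' selects the variant from the
 * proposed diff, which also schedules a delayed rerun for
 * STS_DEV_RESOURCE. 'wait_entry_empty' stands in for
 * list_empty_careful(&hctx->dispatch_wait.entry).
 */
static enum action rerun_action(bool needs_restart, bool no_tag,
                                bool wait_entry_empty, enum status ret,
                                bool patched)
{
    if (!needs_restart || (no_tag && wait_entry_empty))
        return RUN_NOW;
    if (ret == STS_RESOURCE || (patched && ret == STS_DEV_RESOURCE))
        return RUN_DELAYED;
    /* Nothing is scheduled: progress depends on a later restart. */
    return WAIT_FOR_RESTART;
}
```

With needs_restart set and no tag shortage,
rerun_action(true, false, false, STS_DEV_RESOURCE, false) gives
WAIT_FOR_RESTART, while the same call with patched = true gives
RUN_DELAYED.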

Please see the comment:

/*
 * BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if
 * device related resources are unavailable, but the driver can guarantee
 * that the queue will be rerun in the future once resources become
 * available again. This is typically the case for device specific
 * resources that are consumed for IO. If the driver fails allocating these
 * resources, we know that inflight (or pending) IO will free these
 * resource upon completion.
 *
 * This is different from BLK_STS_RESOURCE in that it explicitly references
 * a device specific resource. For resources of wider scope, allocation
 * failure can happen without having pending IO. This means that we can't
 * rely on request completions freeing these resources, as IO may not be in
 * flight. Examples of that are kernel memory allocations, DMA mappings, or
 * any other system wide resources.
 */
#define BLK_STS_DEV_RESOURCE    ((__force blk_status_t)13)

In any case, it seems wrong to me that ret is used here at all, as it
just contains the return value of the last request in the list, and
whether we rerun the queue should probably not only depend on the last
request?

Can any of the experts tell me whether this makes sense, or whether I
got something completely wrong?

Sounds like a bug in the SCSI or ATA driver.

I remember there is a hole in SCSI wrt. returning BLK_STS_DEV_RESOURCE,
but I was never lucky enough to reproduce it.

scsi_queue_rq():
        ......
        case BLK_STS_RESOURCE:
                if (atomic_read(&sdev->device_busy) ||
                    scsi_device_blocked(sdev))
                        ret = BLK_STS_DEV_RESOURCE;

All in-flight requests may complete between reading 'sdev->device_busy'
and setting ret to 'BLK_STS_DEV_RESOURCE'. In that case no completion is
left to rerun the queue, and this IO hang may be triggered.
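
The ordering can be replayed in a small user-space sketch. Everything
here is hypothetical (struct model_sdev, the helper names, the
single-threaded replay); the real code uses atomic_read() and the
completion path runs concurrently, so this only shows why the window
matters, not how the driver is implemented.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two pieces of state the check reads. */
struct model_sdev {
    int device_busy;   /* number of in-flight commands */
    bool blocked;      /* stands in for scsi_device_blocked() */
};

enum model_status { MODEL_STS_RESOURCE, MODEL_STS_DEV_RESOURCE };

/* Step 1: sample the device state, as the quoted condition does. */
static bool sample_busy(const struct model_sdev *sdev)
{
    return sdev->device_busy > 0 || sdev->blocked;
}

/* Step 2: turn the (possibly stale) sample into a return status. */
static enum model_status pick_status(bool was_busy)
{
    return was_busy ? MODEL_STS_DEV_RESOURCE : MODEL_STS_RESOURCE;
}

/*
 * Replays one submission with a single command in flight. If
 * 'completion_in_window' is true, that command completes between the
 * sample and the status decision. Returns true when the request is
 * stranded: DEV_RESOURCE was returned, yet no in-flight IO remains whose
 * completion could rerun the queue.
 */
static bool replay(bool completion_in_window)
{
    struct model_sdev sdev = { .device_busy = 1, .blocked = false };
    bool was_busy = sample_busy(&sdev);

    if (completion_in_window)
        sdev.device_busy--;   /* last in-flight command completes here */

    enum model_status ret = pick_status(was_busy);
    return ret == MODEL_STS_DEV_RESOURCE && sdev.device_busy == 0;
}
```

In this model replay(true) reports a stranded request and replay(false)
does not, since in the second case the command is still in flight and
its completion can rerun the queue.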

Thanks,
Ming Lei
