Re: [PATCH v4 07/10] block, scsi: Rework runtime power management
From: Bart Van Assche <hidden>
Date: 2018-08-06 15:20:11
On Sat, 2018-08-04 at 18:01 +0800, Ming Lei wrote:
On Sat, Aug 4, 2018 at 8:03 AM, Bart Van Assche <bart.vanassche�
EA-wdc.com> wrote:
quoted
=20diff --git a/block/blk-pm.c b/block/blk-pm.c index 2a4632d0be4b..0708185ead61 100644 --- a/block/blk-pm.c +++ b/block/blk-pm.c@@ -107,17 +107,31 @@ int blk_pre_runtime�
AF8-suspend(struct request_queue *q)
quoted
if (!q->dev) return ret; =20 + WARN_ON_ONCE(q->rpm_status != RP=
M_ACTIVE);
quoted
+ blk_pm_runtime_lock(q); + /* + * Switch to preempt-only mode before calling sync=
hronize_rcu() such
quoted
+ * that later blk_queue_enter() calls see =
the preempt-only state. See
quoted
+ * also http://lwn.net/Articles/573497/. + */ + blk_set_pm_only(q); + ret = -EBUSY; + if (percpu_ref_read(&q->q_usage�
8-counter) == 1) {quoted
+ synchronize_rcu(); + if (percpu_ref_read(&q->q_=
usage_counter) == 1)
quoted
+ ret = 0; + }=20 The above code looks too tricky, and the following issue might exist =
in case
of potential WRITEs re-order from blk_queue_enter(). =20 1) suppose there are one in-flight request not completed yet =20 2) observer is the percpu_ref_read() following synchronize�
8-rcu().
=20 3) inside blk_queue_enter(), WRITE in percpu_ref_tryg=
et_live()/percpu_ref_put()
can be re-ordered from view of the observer in 2) =20 4) then percpu_ref_read() may return 1 even though the only i=
n-flight
request isn't completed =20 5) suspend still moves on, and data loss may be triggered =20 Cc Paul, and Alan have been CCed already, so we have several memory model experts here, :-), hope they may help to clarify if this reorde=
r case
may exist.
Hello Ming,
There are two cases:
(a) The percpu_ref_tryget_live() call in (3) is triggered by a =
runtime power
state transition (RPM_*).
(b) The percpu_ref_tryget_live() call in (3) is not triggered b=
y a runtime
power state transition.
In case (b) the driver that submits the request should protect the request
with scsi_autopm_get_device() / scsi_autopm_put_dev=
ice() or by any other
mechanism that prevents runtime power state transitions. Since
scsi_autopm_get_device() is called before blk_queue_ent=
er(), since
scsi_autopm_get_device() only returns after it has resumed a de=
vice and since
scsi_autopm_get_device() prevents runtime suspend, the function
blk_pre_runtime_suspend() won't be called concurrently with blk=
_queue_enter()
in case (b).
Since the power management serializes power state transitions also for case
(a) the race described above won't happen. blk_pre_runtime_susp=
end() will have
finished before any RQF_PM requests are submitted.
Bart.