Thread: 13 messages, 3 authors, 2017-05-25

Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

From: Paolo Valente <hidden>
Date: 2017-05-24 14:24:30
Also in: lkml

On 24 May 2017, at 12:53, Paolo Valente [off-list ref] wrote:

On 23 May 2017, at 21:42, Tejun Heo [off-list ref] wrote:

Hello, Paolo.

On Sat, May 20, 2017 at 09:27:33AM +0200, Paolo Valente wrote:
Consider a process or a group that is moved from a given source group
to a different group, or simply removed from a group (although I
haven't yet succeeded in just removing a process from a group :) ).
The pointer to the [b|c]fq_group contained in the schedulable entity
belonging to the source group *is not* updated, in BFQ, if the entity
is idle, and *is not* updated *unconditionally* in CFQ.  The update
will happen in bfq_get_rq_private or cfq_set_request, on the arrival
of a new request.  But, if the move happens right after the arrival of
a request, then all the scheduler functions executed until a new
request arrives for that entity will see a stale [b|c]fq_group.

Limited staleness is fine.  Especially in this case, it isn't too
weird to claim that the order between the two operations isn't clearly
defined.

ok

Much worse, if a blkcg_deactivate_policy or a blkg_destroy is also
executed right after the move, then both the policy data pointed to by
the [b|c]fq_group and the [b|c]fq_group itself may be deallocated.
So, all the functions of the scheduler invoked before the next request
arrives may use dangling references!

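The lazy update described in the May 20 text above can be modeled in a few lines of C. This is a minimal sketch under stated assumptions: `struct group`, `struct entity`, `struct task`, `move_task`, `set_request` and `group_seen_by_hook` are hypothetical stand-ins, not the real CFQ/BFQ structures or functions. `set_request` only plays the role of cfq_set_request/bfq_get_rq_private as the single place where the cached group pointer is refreshed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical, heavily simplified stand-ins for the scheduler objects.
 * These are NOT the real CFQ/BFQ structures; they only model which
 * object points at which.
 */
struct group {
	const char *name;
};

struct entity {
	struct group *cached_grp;	/* plays the role of cfqq->cfqg */
};

struct task {
	struct group *cur_grp;		/* group the task belongs to right now */
	struct entity ent;
};

/* A cgroup move updates only the task's membership, not the cache. */
static void move_task(struct task *t, struct group *dst)
{
	assert(dst != NULL);
	t->cur_grp = dst;
}

/*
 * Analogue of cfq_set_request()/bfq_get_rq_private(): the only point at
 * which the cached group pointer is brought up to date.
 */
static void set_request(struct task *t)
{
	t->ent.cached_grp = t->cur_grp;
}

/* Any other hook (insert, dispatch, completion) just trusts the cache. */
static const char *group_seen_by_hook(const struct task *t)
{
	return t->ent.cached_grp->name;
}
```

If a task's request is handled while it is in group `src` and the task is then moved to `dst`, `group_seen_by_hook()` keeps returning `src`'s name until the next `set_request()` call; that window is the limited staleness discussed in the reply. Both groups stay allocated in this model, so it shows staleness only, not a dangling pointer.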
Hmm... but cfq_group is allocated along with blkcg and blkcg always
ensures that there are no blkg left before freeing the pd area in
blkcg_css_offline().

Exactly, but even after all blkgs, as well as the cfq_group and pd, are
gone, the child cfq_queues of the gone cfq_group continue to point to
nonexistent objects, until new cfq_set_requests are executed for
those cfq_queues.  To try to make this statement clearer, here is the
critical sequence for a cfq_queue, say cfqq, belonging to a cfq_group,
say cfqg:

1 cfq_set_request for a request rq of cfqq

Sorry, this first event is not needed for the problem to occur.  What
matters is just that some scheduler hooks are invoked *after* the
deallocation of a cfq_group, and *before* a new cfq_set_request.

Paolo

2 removal of (the process associated with cfqq) from cfqg
3 destruction of the blkg that cfqg is associated with
4 destruction of the blkcg the above blkg belongs to
5 destruction of the pd pointed to by cfqg, and of cfqg itself
!!!-> from now on cfqq->cfqg is a dangling reference <-!!!
6 execution of cfq functions, different from cfq_set_request, on cfqq
	. cfq_insert, cfq_dispatch, cfq_completed_rq, ...
7 execution of a new cfq_set_request for cfqq
-> now cfqq->cfqg is again a sane pointer <-

Every function executed at step 6 sees a dangling reference for
cfqq->cfqg.

My fix for caching data doesn't solve this more serious problem.
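Steps 2 to 7 of the critical sequence can be modeled as a small C sketch. Here `struct group`, `struct queue`, `move_process`, `destroy_group`, `other_hook` and `set_request` are hypothetical names, not CFQ code. As an assumption made to keep the model well defined, "destruction" only clears a `live` flag instead of freeing memory, so a hook that runs in the step-6 window can be counted rather than triggering a real use-after-free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the critical sequence. destroy_group() does not
 * call free(); it only clears "live", so a use after "deallocation" is
 * detectable here instead of being undefined behaviour.
 */
struct group {
	bool live;			/* false once pd/cfqg would be freed */
};

struct queue {
	struct group *grp;		/* plays the role of cfqq->cfqg */
	struct group *owner_grp;	/* group the owning process is in now */
	int dangling_uses;		/* hooks that ran on a "freed" group */
};

/* Step 2: the process leaves its group; q->grp is left untouched. */
static void move_process(struct queue *q, struct group *dst)
{
	q->owner_grp = dst;
}

/* Steps 3-5: blkg, blkcg and finally pd/cfqg are torn down. */
static void destroy_group(struct group *g)
{
	g->live = false;
}

/* Step 6: any hook other than set_request dereferences q->grp. */
static void other_hook(struct queue *q)
{
	if (!q->grp->live)
		q->dangling_uses++;
}

/* Step 7: set_request re-reads the current group; q->grp is sane again. */
static void set_request(struct queue *q)
{
	assert(q->owner_grp != NULL && q->owner_grp->live);
	q->grp = q->owner_grp;
}
```

Calling `move_process()`, then `destroy_group()` on the old group, then `other_hook()` increments `dangling_uses` on every hook call; once `set_request()` runs, `q->grp` points at the current group again and later hooks are no longer counted. In the scenario described in the thread, the step-6 accesses would read freed memory rather than a cleared flag.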

Where am I mistaken?

Thanks,
Paolo

Thanks.

-- 
tejun