Thread (50 messages) 50 messages, 8 authors, 2012-09-03

Re: [PATCH v6 07/13] block: Avoid deadlocks with bio allocation by stacking drivers

From: Tejun Heo <hidden>
Date: 2012-08-24 20:29:06
Also in: dm-devel, lkml

Hello,

On Thu, Aug 23, 2012 at 10:55:54PM -0700, Kent Overstreet wrote:
quoted
Why aren't we turning off __GFP_WAIT instead?  e.g. What if the caller
has one of NUMA flags or __GFP_COLD specified?
Didn't think of that. The reason I did it that way is I wasn't sure if
just doing &= ~__GFP_WAIT would be correct, since that would leave
__GFP_IO|__GFP_FS set.
Using the appropriate __GFP_IO/FS flags is the caller's
responsibility.  The only thing bioset needs to worry and take action
about here is __GFP_WAIT causing indefinite wait in mempool.
quoted
Plesae don't mix struct definition relocation (or any relocation
really) with actual changes.  It buries the actual changes and makes
reviewing difficult.
Make a new patch that does nothing more than reorder the definitions,
then?
Yeap, with justification.
    block: Avoid deadlocks with bio allocation by stacking drivers
    
    Previously, if we ever try to allocate more than once from the same bio
    set while running under generic_make_request() (i.e. a stacking block
    driver), we risk deadlock.
    
    This is because of the code in generic_make_request() that converts
    recursion to iteration; any bios we submit won't actually be submitted
    (so they can complete and eventually be freed) until after we return -
    this means if we allocate a second bio, we're blocking the first one
    from ever being freed.
    
    Thus if enough threads call into a stacking block driver at the same
    time with bios that need multiple splits, and the bio_set's reserve gets
    used up, we deadlock.
    
    This can be worked around in the driver code - we could check if we're
    running under generic_make_request(), then mask out __GFP_WAIT when we
    go to allocate a bio, and if the allocation fails punt to workqueue and
    retry the allocation.
    
    But this is tricky and not a generic solution. This patch solves it for
    all users by inverting the previously described technique. We allocate a
    rescuer workqueue for each bio_set, and then in the allocation code if
    there are bios on current->bio_list we would be blocking, we punt them
    to the rescuer workqueue to be submitted.
    
    Tested it by forcing the rescue codepath to be taken (by disabling the
    first GFP_NOWAIT) attempt, and then ran it with bcache (which does a lot
    of arbitrary bio splitting) and verified that the rescuer was being
    invoked.
Yeah, the description looks good to me.
quoted hunk ↗ jump to hunk
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
+	gfp_t saved_gfp = gfp_mask;
 	unsigned front_pad;
 	unsigned inline_vecs;
 	unsigned long idx = BIO_POOL_NONE;
@@ -318,16 +336,44 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		p = kmalloc(sizeof(struct bio) +
 			    nr_iovecs * sizeof(struct bio_vec),
 			    gfp_mask);
+
 		front_pad = 0;
 		inline_vecs = nr_iovecs;
 	} else {
+		/*
+		 * generic_make_request() converts recursion to iteration; this
+		 * means if we're running beneath it, any bios we allocate and
+		 * submit will not be submitted (and thus freed) until after we
+		 * return.
+		 *
+		 * This exposes us to a potential deadlock if we allocate
+		 * multiple bios from the same bio_set() while running
+		 * underneath generic_make_request(). If we were to allocate
+		 * multiple bios (say a stacking block driver that was splitting
+		 * bios), we would deadlock if we exhausted the mempool's
+		 * reserve.
+		 *
+		 * We solve this, and guarantee forward progress, with a rescuer
+		 * workqueue per bio_set. If we go to allocate and there are
+		 * bios on current->bio_list, we first try the allocation
+		 * without __GFP_WAIT; if that fails, we punt those bios we
+		 * would be blocking to the rescuer workqueue before we retry
+		 * with the original gfp_flags.
+		 */
Can you please add a comment in generic_make_request() to describe the
issue briefly and link back here?
 void bioset_free(struct bio_set *bs)
 {
+	if (bs->rescue_workqueue)
Why is the conditional necessary?  Is it possible to have a bioset w/o
rescue_workqueue?
+		destroy_workqueue(bs->rescue_workqueue);
+
 	if (bs->bio_pool)
 		mempool_destroy(bs->bio_pool);
This makes bioset_free() require process context, which probably is
okay for bioset but still isn't nice.  Might worth noting in the patch
description.

Thanks.

-- 
tejun
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help