Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

From: NeilBrown <hidden>
Date: 2017-01-06 23:01:37
Also in: dm-devel, linux-bcache, linux-block, lkml
Subsystem: device-mapper (lvm), the rest · Maintainers: Alasdair Kergon, Mike Snitzer, Mikulas Patocka, Benjamin Marzinski, Linus Torvalds

On Sat, Jan 07 2017, Mike Snitzer wrote:

On Fri, Jan 06 2017 at 12:34pm -0500,
Mikulas Patocka [off-list ref] wrote:

quoted


On Fri, 6 Jan 2017, Mikulas Patocka wrote:

quoted


On Wed, 4 Jan 2017, Mike Snitzer wrote:

quoted

On Wed, Jan 04 2017 at 12:12am -0500,
NeilBrown [off-list ref] wrote:

quoted

Suggested-by: NeilBrown <redacted>
Signed-off-by: Jack Wang <redacted>
---
 block/blk-core.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 9e3ac56..47ef373 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c

@@ -2138,10 +2138,30 @@ blk_qc_t generic_make_request(struct bio *bio)
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
+			struct bio_list lower, same, hold;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			bio_list_init(&hold);
+			bio_list_merge(&hold, &bio_list_on_stack);
+			bio_list_init(&bio_list_on_stack);
 
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack, &lower);
+			bio_list_merge(&bio_list_on_stack, &same);
+			bio_list_merge(&bio_list_on_stack, &hold);
 
 			bio = bio_list_pop(current->bio_list);
 		} else {

-- 
2.7.4

Mikulas, would you be willing to try the below patch with the
dm-snapshot deadlock scenario and report back on whether it fixes that?

Patch below looks to be the same as here:
https://marc.info/?l=linux-raid&m=148232453107685&q=p3

Neil and/or others if that isn't the patch that should be tested please
provide a pointer to the latest.

Thanks,
Mike

The bad news is that this doesn't fix the snapshot deadlock.

I created a test program for the snapshot deadlock bug (it was originally 
created years ago to test for a different bug, so it contains some cruft). 
You also need to insert "if (ci->sector_count) msleep(100);" to the end of 
__split_and_process_non_flush to make the kernel sleep when splitting the 
bio.

And with the above above patch, the snapshot deadlock bug still happens.

That is really unfortunate.  Would be useful to dig in and understand
why.  Because ordering of the IO in generic_make_request() really should
take care of it.

I *think* you might be able to resolve this by changing
__split_and_process_bio() to only ever perform a single split.  No
looping.
i.e. if the bio is too big to handle directly, then split off the front
to a new bio, which you bio_chain to the original.  The original then
has bio_advance() called to stop over the front, then
generic_make_request() so it is queued.
Then the code proceeds to __clone_and_map_data_bio() on the front that
got split off.
When that completes it *doesn't* loop round, but returns into
generic_make_request() which does the looping and makes sure to handle
the lowest-level bio next.

something vaguely like this:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3086da5664f3..06ee0960e415 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c

@@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
 
 	len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
 
+	if (len < ci->sector_count) {
+		struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
+		bio_chain(split, bio);
+		generic_make_request(bio);
+		bio = split;
+		ci->sector_count = len;
+	}
+
 	r = __clone_and_map_data_bio(ci, ti, ci->sector, &len);
 	if (r < 0)
 		return r;

though I haven't tested, and the change (if it works) should probably be
more fully integrated into surrounding code.

You probably don't want to use "fs_bio_set" either - a target-local
pool would be best.

NeilBrown

Attachments

signature.asc [application/pgp-signature] 832 bytes

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help