Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER
From: Mike Snitzer <hidden>
Date: 2017-11-22 04:28:44
Also in:
dm-devel, lkml
On Tue, Nov 21 2017 at 11:00pm -0500, NeilBrown [off-list ref] wrote:
On Tue, Nov 21 2017, Mikulas Patocka wrote:quoted
On Tue, 21 Nov 2017, Mike Snitzer wrote:quoted
On Tue, Nov 21 2017 at 4:23pm -0500, Mikulas Patocka [off-list ref] wrote:quoted
This is not correct: 2206 static void dm_wq_work(struct work_struct *work) 2207 { 2208 struct mapped_device *md = container_of(work, struct mapped_device, work); 2209 struct bio *bio; 2210 int srcu_idx; 2211 struct dm_table *map; 2212 2213 if (!bio_list_empty(&md->rescued)) { 2214 struct bio_list list; 2215 spin_lock_irq(&md->deferred_lock); 2216 list = md->rescued; 2217 bio_list_init(&md->rescued); 2218 spin_unlock_irq(&md->deferred_lock); 2219 while ((bio = bio_list_pop(&list))) 2220 generic_make_request(bio); 2221 } 2222 2223 map = dm_get_live_table(md, &srcu_idx); 2224 2225 while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) { 2226 spin_lock_irq(&md->deferred_lock); 2227 bio = bio_list_pop(&md->deferred); 2228 spin_unlock_irq(&md->deferred_lock); 2229 2230 if (!bio) 2231 break; 2232 2233 if (dm_request_based(md)) 2234 generic_make_request(bio); 2235 else 2236 __split_and_process_bio(md, map, bio); 2237 } 2238 2239 dm_put_live_table(md, srcu_idx); 2240 } You can see that if we are in dm_wq_work in __split_and_process_bio, we will not process md->rescued list.Can you elaborate further? We cannot be "in dm_wq_work in __split_and_process_bio" simultaneously. Do you mean as a side-effect of scheduling away from __split_and_process_bio? The more detail you can share the better.Suppose this scenario: * dm_wq_work calls __split_and_process_bio * __split_and_process_bio eventually reaches the function snapshot_map * snapshot_map attempts to take the snapshot lock * the snapshot lock could be released only if some bios submitted by the snapshot driver to the underlying device complete * the bios submitted to the underlying device were already offloaded by some other task and they are waiting on the list md->rescued * the bios waiting on md->rescued are not processed, because dm_wq_work is blocked in snapshot_map (called from __split_and_process_bio)Yes, I think you are right. I think the solution is to get rid of the dm_offload() infrastructure and make it not necessary. i.e. discard my patches dm: prepare to discontinue use of BIOSET_NEED_RESCUER and dm: revise 'rescue' strategy for bio-based bioset allocations And build on "dm: ensure bio submission follows a depth-first tree walk" which was written after those and already makes dm_offload() less important. Since that "depth-first" patch, every request to the dm device, after the initial splitting, allocates just one dm_target_io structure, and makes just one __map_bio() call, and so will behave exactly the way generic_make_request() expects and copes with - thus avoiding awkward dependencies and deadlocks. Except....
Yes, FYI I've also verified that even with just the "depth-first" patch (and dm_offload disabled) the snapshot deadlock is fixed.
a/ If any target defines ->num_write_bios() to return >1, __clone_and_map_data_bio() will make multiple calls to alloc_tio() and __map_bio(), which might need rescuing. But no target defines num_write_bios, and none have since it was removed from dm-cache 4.5 years ago. Can we discard num_write_bios??
Yes.
b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios
to a value > 1, then __send_duplicate_bios() will also make multiple
calls to alloc_tio() and __map_bio().
Some do.
dm-cache-target: flush=2
dm-snap: flush=2
dm-stripe: discard, write_same, write_zeroes all set to 'stripes'.
These will only be a problem if the second (or subsequent) alloc_tio()
blocks waiting for an earlier allocation to complete. This will only
be a problem if multiple threads are each trying to allocate multiple
dm_target_io from the same bioset at the same time.
This is rare and should be easier to address than the current
dm_offload() approach.
One possibility would be to copy the approach taken by
crypt_alloc_buffer() which needs to allocate multiple entries from a
mempool.
It first tries the with GFP_NOWAIT. If that fails it take a mutex and
tries with GFP_NOIO. This mean only one thread will try to allocate
multiple bios at once, so there can be no deadlock.
Below are two RFC patches. The first removes num_write_bios.
The second is incomplete and makes a stab are allocating multiple bios
at once safely.
A third would be needed to remove dm_offload() etc... but I cannot quite
fit that in today - must be off.Great.
quoted hunk ↗ jump to hunk
From: NeilBrown <redacted> Date: Wed, 22 Nov 2017 14:25:18 +1100 Subject: [PATCH] DM: remove num_write_bios target interface. No target provides num_write_bios and none has done since 2013. Having the possibility of num_write_bios > 1 complicates bio allocation. So remove the interface and assume there is only one bio needed. If a target ever needs more, it must provide a suitable bioset and allocate itself based on its particular needs. Signed-off-by: NeilBrown <redacted> --- drivers/md/dm.c | 22 ++++------------------ include/linux/device-mapper.h | 15 --------------- 2 files changed, 4 insertions(+), 33 deletions(-)diff --git a/drivers/md/dm.c b/drivers/md/dm.c index b20febd6cbc7..8c1a05609eea 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c@@ -1323,27 +1323,13 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti, { struct bio *bio = ci->bio; struct dm_target_io *tio; - unsigned target_bio_nr; - unsigned num_target_bios = 1; int r = 0; - /* - * Does the target want to receive duplicate copies of the bio? - */ - if (bio_data_dir(bio) == WRITE && ti->num_write_bios) - num_target_bios = ti->num_write_bios(ti, bio); - - for (target_bio_nr = 0; target_bio_nr < num_target_bios; target_bio_nr++) { - tio = alloc_tio(ci, ti, target_bio_nr); - tio->len_ptr = len; - r = clone_bio(tio, bio, sector, *len); - if (r < 0) { - free_tio(tio); - break; - } + tio = alloc_tio(ci, ti, 0); + tio->len_ptr = len; + r = clone_bio(tio, bio, sector, *len); + if (r >= 0) __map_bio(tio); - } -
This bit is wrong, free_tio() is needed if clone_bio() fails. I can fix it up though. I'll work through your patches tomorrow. Thanks, Mike