Re: [PATCH 2/9] raid5-cache: add trim support for log

From: Shaohua Li <hidden>
Date: 2015-10-12 16:57:20

On Mon, Oct 12, 2015 at 05:00:14PM +1100, Neil Brown wrote:

Shaohua Li [off-list ref] writes:

quoted

Since superblock is updated infrequently, we do a simple trim of log
disk (a synchronous trim)

Signed-off-by: Shaohua Li <redacted>
---
 drivers/md/raid5-cache.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index caeec10..5dbe084 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c

@@ -85,6 +85,7 @@ struct r5l_log {
 	spinlock_t no_space_stripes_lock;
 
 	bool need_cache_flush;
+	bool in_teardown;
 };
 
 /*

@@ -644,6 +645,60 @@ void r5l_flush_stripe_to_raid(struct r5l_log *log)
 }
 
 static void r5l_write_super(struct r5l_log *log, sector_t cp);
+static void r5l_write_super_and_discard_space(struct r5l_log *log,
+	sector_t end)
+{
+	struct block_device *bdev = log->rdev->bdev;
+	struct mddev *mddev;
+
+	r5l_write_super(log, end);
+
+	if (!blk_queue_discard(bdev_get_queue(bdev)))
+		return;
+
+	mddev = log->rdev->mddev;
+	/*
+	 * This is to avoid a deadlock. r5l_quiesce holds reconfig_mutex and
+	 * wait for this thread to finish. This thread waits for
+	 * MD_CHANGE_PENDING clear, which is supposed to be done in
+	 * md_check_recovery(). md_check_recovery() tries to get
+	 * reconfig_mutex. Since r5l_quiesce already holds the mutex,
+	 * md_check_recovery() fails, so the PENDING never get cleared. The
+	 * in_teardown check workaround this issue.
+	 * */

Thanks for this comment (but can we just end the comment with "*/"
instead of  "* */" - that's what everyone else does).

sorry, I will set vim correctly.

It helps a lot.  If feels a bit clumsy, but maybe that is just me.  At
least I understand now and it make sense.

quoted

+	if (!log->in_teardown) {
+		set_bit(MD_CHANGE_DEVS, &mddev->flags);
+		set_bit(MD_CHANGE_PENDING, &mddev->flags);
+		md_wakeup_thread(mddev->thread);
+		wait_event(mddev->sb_wait,
+			!test_bit(MD_CHANGE_PENDING, &mddev->flags) ||
+			log->in_teardown);
+		/*
+		 * r5l_quiesce could run after in_teardown check and hold
+		 * mutex first. Superblock might get updated twice.
+		 * */
+		if (log->in_teardown)
+			md_update_sb(mddev, 1);
+	} else {
+		BUG_ON(!mddev_is_locked(mddev));
+		md_update_sb(mddev, 1);
+	}
+
+	if (log->last_checkpoint < end) {
+		blkdev_issue_discard(bdev,
+				log->last_checkpoint + log->rdev->data_offset,
+				end - log->last_checkpoint, GFP_NOIO, 0);
+	} else {
+		blkdev_issue_discard(bdev,
+				log->last_checkpoint + log->rdev->data_offset,
+				log->device_size - log->last_checkpoint,
+				GFP_NOIO, 0);
+		blkdev_issue_discard(bdev, log->rdev->data_offset, end,
+				GFP_NOIO, 0);
+	}
+}
+
+
 static void r5l_do_reclaim(struct r5l_log *log)
 {
 	sector_t reclaim_target = xchg(&log->reclaim_target, 0);

@@ -685,7 +740,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
 	 * here, because the log area might be reused soon and we don't want to
 	 * confuse recovery
 	 */
-	r5l_write_super(log, next_checkpoint);
+	r5l_write_super_and_discard_space(log, next_checkpoint);
 
 	mutex_lock(&log->io_mutex);
 	log->last_checkpoint = next_checkpoint;
@@ -721,9 +776,11 @@ static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
 
 void r5l_quiesce(struct r5l_log *log, int state)
 {
+	struct mddev *mddev;
 	if (!log || state == 2)
 		return;
 	if (state == 0) {
+		log->in_teardown = 0;
 		log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
 					log->rdev->mddev, "reclaim");
 	} else if (state == 1) {
@@ -731,6 +788,10 @@ void r5l_quiesce(struct r5l_log *log, int state)
 		 * at this point all stripes are finished, so io_unit is at
 		 * least in STRIPE_END state
 		 */
+		log->in_teardown = 1;
+		/* make sure r5l_write_super_and_discard_space exits */
+		mddev = log->rdev->mddev;
+		wake_up(&mddev->sb_wait);
 		r5l_wake_reclaim(log, -1L);
 		md_unregister_thread(&log->reclaim_thread);

I think this is racey (though you haven't changed the code, so it must
have been racy before).
There is no guarantee that the thread will actually wake up to handle
the reclaim before md_unregister_thread marks the thread as
kthread_should_stop().
Maybe the thing to do would be
   md_unregister_thread()
   log->reclaim_target = (unsigned long) -1;
   r5l_do_reclaim(log);

or something like that.  i.e. just do it synchronously.

This doesn't sound like a race. If the thread doesn't run because it
exits after kthread_should_stop check, the reclaim_target should remain
to be -1 after md_unregister_thread

Then.... maybe you don't need in_teardown.
When r5l_do_reclaim() is called from the thread, you wait.
When called from r5l_quiesce(), you md_update_sb().

Would that work?

It wouldn't. This only fixes the deadlock issue r5l_quiesce calls
r5l_do_reclaim() directly. The problem is r5l_wake_reclaim() will wake
the thread, md_unregister_thread will wait till the thread exits. At
that time the thread might try to lock the reconfig_mutex. Since
r5l_quesce already holds the mutex, the md_unregister_thread never
returns. In summary, r5l_do_reclaim() is called from the thread can't be
used to determine whether we need the deadlock avoidness.

Thanks,
Shaohua

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help