Re: [PATCH v2] md/cluster: fix deadlock when doing reshape job

From: heming.zhao@suse.com <hidden>
Date: 2020-11-13 13:24:02

On 11/13/20 10:50 AM, Xiao Ni wrote:

quoted hunk ↗ jump to hunk


----- Original Message -----

quoted

From: "heming zhao" <redacted>
To: "Xiao Ni" <redacted>, linux-raid@vger.kernel.org, song@kernel.org, "guoqing jiang"
[off-list ref]
Cc: "lidong zhong" <redacted>, neilb@suse.de, colyli@suse.de
Sent: Thursday, November 12, 2020 7:27:46 PM
Subject: Re: [PATCH v2] md/cluster: fix deadlock when doing reshape job

Hello,

On 11/12/20 1:08 PM, Xiao Ni wrote:

quoted


On 11/11/2020 11:51 PM, Zhao Heming wrote:

quoted

There is a similar deadlock in commit 0ba959774e93
("md-cluster: use sync way to handle METADATA_UPDATED msg")
My patch fixed issue is very like commit 0ba959774e93, except <c>.
0ba959774e93 step <c> is "update sb", my fix is "mdadm --remove"

... ...
+               !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state),
+               msecs_to_jiffies(5000));
+    if (rv)
+        return lock_token(cinfo, mddev_locked);
+    else
+        return -1;
   }

Hi Heming

Can we handle this problem like metadata_update_start? lock_comm and
metadata_update_start are two users that
want to get token lock. lock_comm can do the same thing as
metadata_update_start does. If so, process_recvd_msg
can call md_reload_sb without waiting. All threads can work well when the
initiated node release token lock. Resync
can send message and clear MD_CLUSTER_SEND_LOCK, then lock_comm can go on
working. In this way, all threads
work successfully without failure.

Currently MD_CLUSTER_SEND_LOCKED_ALREADY only for adding a new disk.
(please refer Documentation/driver-api/md/md-cluster.rst section: 5. Adding a
new Device")
During ADD_NEW_DISK process, md-cluster will block all other msg sending
until metadata_update_finish calls
unlock_comm.

With your idea, md-cluster will allow to concurrently send msg. I'm not very
familiar with all raid1 scenarios.
But at least, you break the rule of handling ADD_NEW_DISK. it will allow send
other msg during ADD_NEW_DISK.

Hi Heming

It doesn't need to change MD_CLUSTER_SEND_LOCKED_ALREADY. Is it ok to do something like this:

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 4aaf482..f6f576b 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c

@@ -664,29 +664,12 @@ static void recv_daemon(struct md_thread *thread)
   * Takes the lock on the TOKEN lock resource so no other
   * node can communicate while the operation is underway.
   */
-static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
+static int lock_token(struct md_cluster_info *cinfo)
  {
-	int error, set_bit = 0;
+	int error;
  	struct mddev *mddev = cinfo->mddev;
  
-	/*
-	 * If resync thread run after raid1d thread, then process_metadata_update
-	 * could not continue if raid1d held reconfig_mutex (and raid1d is blocked
-	 * since another node already got EX on Token and waitting the EX of Ack),
-	 * so let resync wake up thread in case flag is set.
-	 */
-	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
-				      &cinfo->state)) {
-		error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
-					      &cinfo->state);
-		WARN_ON_ONCE(error);
-		md_wakeup_thread(mddev->thread);
-		set_bit = 1;
-	}
  	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
-	if (set_bit)
-		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
-
  	if (error)
  		pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
  				__func__, __LINE__, error);

@@ -701,10 +684,30 @@ static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
   */
  static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
  {
+	int ret, set_bit = 0;
+
+	/*
+	 * If resync thread run after raid1d thread, then process_metadata_update
+	 * could not continue if raid1d held reconfig_mutex (and raid1d is blocked
+	 * since another node already got EX on Token and waitting the EX of Ack),
+	 * so let resync wake up thread in case flag is set.
+	 */
+	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
+				      &cinfo->state)) {
+		error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
+					      &cinfo->state);
+		WARN_ON_ONCE(error);
+		md_wakeup_thread(mddev->thread);
+		set_bit = 1;
+	}
+
  	wait_event(cinfo->wait,
  		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
  
-	return lock_token(cinfo, mddev_locked);
+	ret = lock_token(cinfo, mddev_locked);
+	if (ret && set_bit)
+		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
+	return ret;
  }
  
  static void unlock_comm(struct md_cluster_info *cinfo)


thank you for your comments & idea.
I totally understand your solution, it's better than mine.
I will use this idea to file v3 patch.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help