Re: [PATCH] mlx4_ib: Increase the timeout for CM cache

From: Håkon Bugge <hidden>
Date: 2019-02-06 08:51:11
Also in: linux-rdma, lkml

On 5 Feb 2019, at 23:36, Jason Gunthorpe [off-list ref] wrote:

On Thu, Jan 31, 2019 at 06:09:51PM +0100, Håkon Bugge wrote:

quoted

Using CX-3 virtual functions, either from a bare-metal machine or
pass-through from a VM, MAD packets are proxied through the PF driver.

Since the VMs have separate name spaces for MAD Transaction Ids
(TIDs), the PF driver has to re-map the TIDs and keep the bookkeeping
in a cache.

Following the RDMA CM protocol, it is clear when an entry has to be
evicted from the cache. But life is not perfect; remote peers may die
or be rebooted. Hence, there is a timeout to wipe out a cache entry, when
the PF driver assumes the remote peer has gone.

We have experienced an excessive number of DREQ retries during fail-over
testing, when running with eight VMs per database server.

The problem has been reproduced in a bare-metal system using one VM
per physical node. In this environment, running 256 processes in each
VM, each process uses RDMA CM to create an RC QP between itself and
all (256) remote processes. In all, 16K QPs.

When tearing down these 16K QPs, excessive DREQ retries (and
duplicates) are observed. With some cat/paste/awk wizardry on the
infiniband_cm sysfs, we observe:

cm_rx_duplicates:
     dreq:       5007
cm_rx_msgs:
     drep:       3838
     dreq:      13018
      rep:       8128
      req:       8256
      rtu:       8256
cm_tx_msgs:
     drep:       8011
     dreq:      68856
      rep:       8256
      req:       8128
      rtu:       8128
cm_tx_retries:
     dreq:      60483

Note that the active/passive side is distributed.

Enabling pr_debug in cm.c gives tons of:

[171778.814239] <mlx4_ib> mlx4_ib_multiplex_cm_handler: id{slave: 1,sl_cm_id: 0xd393089f} is NULL!

By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
tear-down phase of the application is reduced from 113 to 67
seconds. Retries/duplicates are also significantly reduced:

cm_rx_duplicates:
     dreq:       7726
[]
cm_tx_retries:
     drep:          1
     dreq:       7779

Increasing the timeout further didn't help, as these duplicates and
retries stem from a CMA timeout that was too short: 20 (~4 seconds)
on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
numbers fell to about one hundred for both of them.

Adjustment of the CMA timeout is _not_ part of this commit.

Signed-off-by: Håkon Bugge <redacted>
---
drivers/infiniband/hw/mlx4/cm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Jack? What do you think?

I am tempted to send a v2 making this a sysctl tunable. This is because full-rack testing using 8 servers, each with 8 VMs, showed only a 33% reduction in the occurrences of "mlx4_ib_multiplex_cm_handler: id{slave:1,sl_cm_id: 0xd393089f} is NULL" with this commit.

But sure, Jack's opinion matters.


Thxs, Håkon

quoted

diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c
index fedaf8260105..8c79a480f2b7 100644
--- a/drivers/infiniband/hw/mlx4/cm.c
+++ b/drivers/infiniband/hw/mlx4/cm.c

@@ -39,7 +39,7 @@
 
 #include "mlx4_ib.h"
 
-#define CM_CLEANUP_CACHE_TIMEOUT  (5 * HZ)
+#define CM_CLEANUP_CACHE_TIMEOUT  (30 * HZ)
 
 struct id_map_entry {
 	struct rb_node node;
-- 
2.20.1
