Re: [Bug 214523] New: RDMA Mellanox RoCE drivers are unresponsive to ARP updates during a reconnect

From: Chuck Lever III <hidden>
Date: 2021-09-26 17:36:10

Hi Leon-

Thanks for the suggestion! More below.

On Sep 26, 2021, at 4:02 AM, Leon Romanovsky [off-list ref] wrote:

On Fri, Sep 24, 2021 at 03:34:32PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:

https://bugzilla.kernel.org/show_bug.cgi?id=214523

           Bug ID: 214523
          Summary: RDMA Mellanox RoCE drivers are unresponsive to ARP
                   updates during a reconnect
          Product: Drivers
          Version: 2.5
   Kernel Version: 5.14
         Hardware: All
               OS: Linux
             Tree: Mainline
           Status: NEW
         Severity: normal
         Priority: P1
        Component: Infiniband/RDMA
         Assignee: drivers_infiniband-rdma@kernel-bugs.osdl.org
         Reporter: kolga@netapp.com
       Regression: No

A RoCE RDMA connection is established using the CMA protocol. During setup
the code uses hard-coded timeout/retry values. These values govern how a
Connect Request that is not being answered gets retried. During the retry
attempts, ARP updates for the destination server are ignored. The current
timeout values lead to a 4+ minute attempt to connect to a server that no
longer owns the IP address once the ARP update has happened.

The ask is to make the timeout/retry values configurable via procfs or sysfs.
This will allow environments that use RoCE to reduce the timeouts to more
reasonable values and react to ARP updates faster. Other CMA users (e.g. IB
or others) can continue to use the existing values.

I would rather not add a user-facing tunable. The fabric should
be better at detecting addressing changes within a reasonable
time. It would be helpful to provide a history of why the ARP
timeout is so lax -- do certain ULPs rely on it being long?

The problem exists in all kernel versions, but this bugzilla is filed against the 5.14 kernel.

The use case is (RoCE-based) NFSoRDMA where a server went down and another
server was brought up in its place. The RDMA layer takes 4+ minutes to
re-establish an RDMA connection and let I/O resume, because it cannot react
to the ARP update.

RDMA-CM has many different timeouts, so I hope that my answer is for the
right timeout.

We probably need to extend rdma_connect() to receive a remote_cm_response_timeout
value, so NFSoRDMA can set it to whatever value is appropriate.

The timewait will be calculated based on it in ib_send_cm_req().
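
To make the effect of such a per-connection value concrete, here is a minimal sketch of the arithmetic only. It assumes the usual IB CM encoding, in which a timeout field holds an exponent n meaning 4.096 usec * 2^n, and it ignores any other allowance (such as packet lifetime) that the kernel may add per attempt. The helper names cm_timeout_us() and cm_worst_case_wait_us() are hypothetical and are not kernel functions; the exponent and retry count used in the note below are illustrative assumptions, not values taken from this thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decode an IB CM timeout exponent into microseconds (rounded down).
 * Assumption: the field encodes 4.096 usec * 2^exp.
 */
static uint64_t cm_timeout_us(unsigned int exp)
{
	assert(exp <= 31);              /* the field is 5 bits wide */
	return (4096ULL << exp) / 1000; /* 4.096 usec == 4096 nsec */
}

/*
 * Upper bound on the time spent waiting for a reply: the initial
 * request plus max_retries resends, each waiting one full timeout.
 */
static uint64_t cm_worst_case_wait_us(unsigned int exp,
				      unsigned int max_retries)
{
	return cm_timeout_us(exp) * (max_retries + 1ULL);
}
```

With an assumed exponent of 20 and 15 retries, cm_timeout_us(20) is roughly 4.3 seconds and cm_worst_case_wait_us(20, 15) is roughly 69 seconds. Dropping the exponent to 17 with 3 retries brings the bound to about 2.1 seconds, which is the kind of reduction a ULL-supplied value would permit.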

I hope a mechanism can be found that behaves the same or nearly the
same way for all RDMA fabrics.

For those who are not NFS-savvy:

Simple NFS server failover is typically implemented with a heartbeat
between two similar platforms that both access the same backend
storage. When one platform fails, the other detects it and takes over
the failing platform's IP address. Clients detect connection loss
with the failing platform, and upon reconnection to that IP address
are transparently directed to the other platform.

NFS server vendors have tried to extend this behavior to RDMA fabrics,
with varying degrees of success.

In addition to enforcing availability SLAs, the time it takes to
re-establish a working connection is critical for NFSv4 because each
client maintains a lease to prevent the server from purging open and
lock state. If the reconnect takes too long, the client's lease is
jeopardized because other clients can then access files that client
might still have locked or open.
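
As a rough illustration of that lease budget, the sketch below checks whether a client can still renew before its lease runs out. It is a simplification under stated assumptions: it treats failure detection and reconnect as strictly sequential and ignores server grace periods and state reclaim. The function lease_survives() and its parameters are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return true when the client can still renew its lease after a
 * failover. The lease was last renewed since_renew_s seconds before
 * the failure; detection and reconnect must both finish before
 * lease_s seconds have elapsed since that renewal.
 */
static bool lease_survives(unsigned int lease_s, unsigned int since_renew_s,
			   unsigned int detect_s, unsigned int reconnect_s)
{
	assert(since_renew_s <= lease_s); /* lease was valid at failure */
	return since_renew_s + detect_s + reconnect_s < lease_s;
}
```

Assuming a 90-second lease that was renewed 30 seconds before the failure and 10 seconds to detect the connection loss, a 20-second reconnect fits within the lease, while a 240-second (4 minute) reconnect does not.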


--
Chuck Lever
