Thread (16 messages), 3 authors, 2026-04-10

RE: [EXTERNAL] Re: [PATCH rdma-next 0/8] RDMA/mana_ib: Handle service reset for RDMA resources

From: Long Li <longli@microsoft.com>
Date: 2026-03-17 23:43:58
Also in: linux-hyperv, linux-rdma, lkml

On Fri, Mar 13, 2026 at 01:59:28PM -0300, Jason Gunthorpe wrote:
On Sat, Mar 07, 2026 at 07:38:14PM +0200, Leon Romanovsky wrote:
On Fri, Mar 06, 2026 at 05:47:14PM -0800, Long Li wrote:
When the MANA hardware undergoes a service reset, the ETH auxiliary device
(mana.eth) used by DPDK persists across the reset cycle — it is
not removed and re-added like RC/UD/GSI QPs. This means userspace
RDMA consumers such as DPDK have no way of knowing that firmware
handles for their PD, CQ, WQ, QP and MR resources have become stale.
NAK to any of this.

In case of hardware reset, mana_ib AUX device needs to be destroyed
and recreated later.
Yeah, that is our general model for any serious RAS event where the
driver's view of resources becomes out of sync with the HW.

You have to tear down the ib_device by removing the aux and then bring
back a new one.

There is an IB_EVENT_DEVICE_FATAL, but the purpose of that event is to
tell userspace to close and re-open their uverbs FD.

We don't have a model where a uverbs FD in userspace can continue to
work after the device has a catastrophic RAS event.

There may be room to have a model where the ib device doesn't fully
unplug/replug so it retains its name and things, but that is core code
not driver stuff.
Good luck with that model. It is going to break RDMA-CM hotplug support.
   I think we can preserve RDMA-CM behavior without requiring ib_device
   unregister/re-register.

   On device reset, the driver can dispatch IB_EVENT_DEVICE_FATAL (or a
   new reset event) through ib_dispatch_event(). RDMA-CM already handles
   device events — we would add a handler that iterates all rdma_cm_ids
   on the device and sends RDMA_CM_EVENT_DEVICE_REMOVAL to each, same
   as cma_process_remove() does today. The difference: cma_device stays
   alive, so applications can reconnect on the same device after recovery
   instead of waiting for a new one to appear.
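
   To make the proposed flow concrete, here is a rough userspace model
   of it, not kernel code. Every model_* name is hypothetical and
   exists only for illustration; the real handler would live in the
   RDMA-CM core and use its own data structures and locking. The
   sketch assumes the handler only notifies each id and leaves the
   device registered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum model_cm_event { MODEL_CM_NONE, MODEL_CM_DEVICE_REMOVAL };

struct model_cm_id {
    enum model_cm_event last_event;   /* last event delivered to this id */
    struct model_cm_id *next;         /* next id bound to the same device */
};

struct model_cma_device {
    int registered;                   /* stays 1 across a reset in this model */
    struct model_cm_id *ids;          /* ids currently bound to the device */
};

/* Bind an id to the device by pushing it onto the head of the list. */
static void model_bind(struct model_cma_device *dev, struct model_cm_id *id)
{
    assert(dev != NULL && id != NULL);
    id->last_event = MODEL_CM_NONE;
    id->next = dev->ids;
    dev->ids = id;
}

/*
 * Reset handler: deliver a removal event to every id on the device,
 * but do not unregister the device itself. Returns the number of ids
 * that were notified.
 */
static int model_handle_device_reset(struct model_cma_device *dev)
{
    int notified = 0;

    assert(dev != NULL);
    for (struct model_cm_id *id = dev->ids; id != NULL; id = id->next) {
        id->last_event = MODEL_CM_DEVICE_REMOVAL;
        notified++;
    }
    return notified;
}
```

   In this model the only difference from a full remove is that
   dev->registered is never cleared, which is the property that would
   let applications reconnect on the same device.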

   The motivation for keeping ib_device alive is that some RDMA consumers
   — DPDK and NCCL — don't use RDMA-CM at all. They use raw verbs and
   manage QP state themselves. For these users, a persistent ib_device
   with IB_EVENT_PORT_ERR / IB_EVENT_PORT_ACTIVE notifications enables
   reliable in-place recovery without reopening the device.
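
   As a sketch of what in-place recovery could look like from the
   consumer side, the model below pauses on a port error and resumes on
   port active without reopening the device. It is a plain C state
   machine, not libibverbs code; all model_* names are hypothetical,
   and the "rebuild" step stands in for whatever QP state transitions a
   real application would need.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a raw-verbs consumer; not libibverbs API. */
enum model_port_event { MODEL_PORT_ERR, MODEL_PORT_ACTIVE };
enum model_app_state { MODEL_RUNNING, MODEL_PAUSED };

struct model_consumer {
    enum model_app_state state;
    int device_reopens;   /* times the device was closed and reopened */
    int qp_rebuilds;      /* times QP state was rebuilt in place */
};

/* React to a port event delivered on the still-open device. */
static void model_on_port_event(struct model_consumer *c,
                                enum model_port_event ev)
{
    assert(c != NULL);
    switch (ev) {
    case MODEL_PORT_ERR:
        /* Stop posting work requests until the port comes back. */
        c->state = MODEL_PAUSED;
        break;
    case MODEL_PORT_ACTIVE:
        if (c->state == MODEL_PAUSED) {
            /* Assumed step: bring QPs back to a usable state. */
            c->qp_rebuilds++;
            c->state = MODEL_RUNNING;
        }
        break;
    }
}
```

   Note that device_reopens is never incremented here: the point of the
   persistent ib_device is that this path does not need it.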

   This matters especially for PCI DPC recovery, which is becoming
   critical for large-scale GPU/storage deployments. See this talk for
   context on the value of surviving DPC events:
   https://www.youtube.com/watch?v=TpNNeMGEsdU&t=1619s

   Today a DPC event on one NIC kills all RDMA connections and can
   crash entire training jobs. If the ib_device persists and the driver
   recreates firmware resources after recovery, raw verbs users can
   resume without full teardown, and RDMA-CM users get the same
   disconnect/reconnect behavior they have today.

Thanks,
Long