Thread (27 messages), 3 authors, 2021-09-09

Re: [PATCH for-next 4/4] RDMA/efa: CQ notifications

From: Jason Gunthorpe <jgg@nvidia.com>
Date: 2021-09-02 15:41:29

On Thu, Sep 02, 2021 at 06:17:45PM +0300, Gal Pressman wrote:
On 02/09/2021 18:10, Jason Gunthorpe wrote:
On Thu, Sep 02, 2021 at 06:09:39PM +0300, Gal Pressman wrote:
On 02/09/2021 16:02, Jason Gunthorpe wrote:
On Thu, Sep 02, 2021 at 10:03:16AM +0300, Gal Pressman wrote:
On 01/09/2021 18:36, Jason Gunthorpe wrote:
On Wed, Sep 01, 2021 at 05:24:43PM +0300, Gal Pressman wrote:
On 01/09/2021 14:57, Jason Gunthorpe wrote:
On Wed, Sep 01, 2021 at 02:50:42PM +0300, Gal Pressman wrote:
On 20/08/2021 21:27, Jason Gunthorpe wrote:
On Wed, Aug 11, 2021 at 06:11:31PM +0300, Gal Pressman wrote:
diff --git a/drivers/infiniband/hw/efa/efa_main.c b/drivers/infiniband/hw/efa/efa_main.c
index 417dea5f90cf..29db4dec02f0 100644
+++ b/drivers/infiniband/hw/efa/efa_main.c
@@ -67,6 +67,46 @@ static void efa_release_bars(struct efa_dev *dev, int bars_mask)
     pci_release_selected_regions(pdev, release_bars);
 }

+static void efa_process_comp_eqe(struct efa_dev *dev, struct efa_admin_eqe *eqe)
+{
+    u16 cqn = eqe->u.comp_event.cqn;
+    struct efa_cq *cq;
+
+    cq = xa_load(&dev->cqs_xa, cqn);
+    if (unlikely(!cq)) {
This seems unlikely to be correct, what prevents cq from being
destroyed concurrently?

A comp_handler cannot be running after cq destroy completes.
Sorry for the long turnaround, was OOO.

The CQ cannot be destroyed until all completion events are acked.
https://github.com/linux-rdma/rdma-core/blob/7fd01f0c6799f0ecb99cae03c22cf7ff61ffbf5a/libibverbs/man/ibv_get_cq_event.3#L45
https://github.com/linux-rdma/rdma-core/blob/7fd01f0c6799f0ecb99cae03c22cf7ff61ffbf5a/libibverbs/cmd_cq.c#L208
That is something quite different, and in userspace.

What in the kernel prevents the xa_load and the xa_erase from racing together?
Good point.
I think we need to surround efa_process_comp_eqe() with an rcu_read_lock() and
have a synchronize_rcu() after removing it from the xarray in
destroy_cq.
Try to avoid synchronize_rcu()
I don't see how that's possible?
Usually people use call_rcu() instead
Oh nice, thanks.

I think the code would be much simpler using synchronize_rcu(), and the
destroy_cq flow is usually on the cold path anyway. I also prefer to be certain
that the CQ is freed once the destroy verb returns and not rely on the callback
scheduling.
I would not be happy to see synchronize_rcu on uverbs destroy
functions, it is too easy to DOS the kernel with that.
OK, but isn't the fact that the uverb can return before the CQ is actually
destroyed problematic?
Yes, you can't allow that, something other than RCU needs to prevent
that
Maybe it's an extreme corner case, but if I created max_cq CQs, destroyed one,
and try to create another one, it is not guaranteed that the create operation
would succeed - even though the destroy has finished.
More importantly a driver cannot call completion callbacks once
destroy cq has returned.

Jason
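
The ordering constraint discussed in this thread can be sketched as a small single-threaded userspace model. This is not kernel or EFA driver code: every name below (model_cq, model_handler_begin, model_destroy_finish, and so on) is hypothetical, the plain array stands in for the cqn-to-CQ xarray, and the counter stands in for whatever mechanism the driver uses to know that an event handler is still running. Under those assumptions it shows two things: once the CQ is unpublished no new handler can find it, and destroy may not complete (nor may the slot be reused) while a handler that already looked the CQ up is still in flight.

```c
/* Userspace model only: none of these names are kernel or EFA driver APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CQ 4

struct model_cq {
    bool in_use;         /* slot is allocated and counts against MAX_CQ */
    int events_handled;  /* completion callbacks that ran on this CQ */
};

static struct model_cq cq_pool[MAX_CQ];    /* stands in for device CQ resources */
static struct model_cq *cq_table[MAX_CQ];  /* stands in for the cqn -> cq lookup */
static int handlers_in_flight;             /* handlers between begin and end */

/* Create a CQ; returns its cqn, or -1 when every slot is taken. */
static int model_create_cq(void)
{
    for (int cqn = 0; cqn < MAX_CQ; cqn++) {
        if (!cq_pool[cqn].in_use) {
            cq_pool[cqn].in_use = true;
            cq_pool[cqn].events_handled = 0;
            cq_table[cqn] = &cq_pool[cqn];
            return cqn;
        }
    }
    return -1;
}

/* Handler entry: look the CQ up; NULL means it was already unpublished. */
static struct model_cq *model_handler_begin(int cqn)
{
    if (cqn < 0 || cqn >= MAX_CQ)
        return NULL;
    struct model_cq *cq = cq_table[cqn];
    if (cq)
        handlers_in_flight++;
    return cq;
}

/* Handler exit: run the "completion callback" and leave the handler. */
static void model_handler_end(struct model_cq *cq)
{
    assert(cq->in_use);  /* touching a freed CQ would be a use-after-free */
    cq->events_handled++;
    handlers_in_flight--;
}

/* Destroy, step 1: unpublish so that no new handler can find the CQ. */
static void model_destroy_unpublish(int cqn)
{
    cq_table[cqn] = NULL;
}

/*
 * Destroy, step 2: release the slot only once in-flight handlers have
 * drained. Returns false when the caller still has to wait.
 */
static bool model_destroy_finish(int cqn)
{
    if (handlers_in_flight > 0)
        return false;
    cq_pool[cqn].in_use = false;
    return true;
}
```

In this model, a destroy that returned right after model_destroy_unpublish() would reproduce both problems raised above: a handler that had already obtained the pointer could still run its callback, and a create issued immediately afterwards could fail because the slot had not been released yet.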