Thread (10 messages) 10 messages, 2 authors, 2021-09-02

RE: [PATCH rdma-core] irdma: Restore full memory barrier for doorbell optimization

From: "Nikolova, Tatyana E" <tatyana.e.nikolova@intel.com>
Date: 2021-08-09 20:07:38

-----Original Message-----
From: Jason Gunthorpe <jgg@nvidia.com>
Sent: Thursday, August 5, 2021 8:29 PM
To: Nikolova, Tatyana E <tatyana.e.nikolova@intel.com>
Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org
Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier for
doorbell optimization

On Thu, Aug 05, 2021 at 11:11:34AM -0500, Tatyana Nikolova wrote:
quoted
During the irdma library upstream submission we agreed to replace
atomic_thread_fence(memory_order_seq_cst) in the irdma doorbell
optimization algorithm with udma_to_device_barrier().
However, further regression testing uncovered cases where in absence
of a full memory barrier, the algorithm incorrectly skips ringing the
doorbell.

There has been a discussion about the necessity of a full memory
barrier for the doorbell optimization in the past:
https://lore.kernel.org/linux-rdma/20170301172920.GA11340@ssaleem-
MOBL
quoted
4.amr.corp.intel.com/

The algorithm decides whether to ring the doorbell based on input from
the shadow memory (hw_tail). If the hw_tail is behind the sq_head,
then the algorithm doesn't ring the doorbell, because it assumes that
the HW is still actively processing WQEs.

The shadow area indicates the last WQE processed by the HW and it is
okay if the shadow area value isn't the most current. However there
can't be a window of time between reading the shadow area and setting
the valid bit for the last WQE posted, because the HW may go idle and
the algorithm won't detect this.

The following two cases illustrate this issue and are identical,
except for ops ordering. The first case is an example of how the wrong
order results in not ringing the doorbell when the HW has gone idle.
I can't really understand this explanation. since this seemes to be about a
concurrency problem can you please explain it using a normal ladder
diagram? (eg 1a3402d93c73 ("posix-cpu-timers: Fix rearm racing against
process tick") to pick an example at random)
Hi Jason,

The doorbell optimization algorithm involves the following operations:

1.	Software writing the valid bit in the WQE.
2.	Software reading shadow memory (hw_tail) value.
3.	Software deciding whether or not to ring the doorbell.
4.	Hardware reading that WQE.

Without the MFENCE after step 1, the processor is free to move instructions around - it can move the write of the valid bit later (after software reads the hw_tail value).  

MFENCE ensures the order of 1 and 2. 

MFENCE (Vol. 1 11-12)
"The MFENCE instruction establishes a memory fence for both loads and stores. The processor ensures that no load or store after MFENCE will become globally visible until all loads and stores before MFENCE are globally visible"

If the order of 1 and 2 is not maintained, the algorithm can fail because the hardware may not see the valid bit and the doorbell optimization algorithm does not ring the doorbell. This causes hardware to go idle even though there is work to do.

The failure scenario without the MFENCE is shown in a ladder diagram:

SOFTWARE                        CPU                                  DMA DEVICE

Writes the valid bit
SFENCE
Reads the hw_tail value

                                              <Reorders instructions>
                                              Reads hw_tail value 

Decides not to ring the doorbell.

                                                                                        Reads WQE - <valid bit is not set and hardware goes idle>
                                              Writes valid bit
			   SFENCE
	
The order of 1 and 2 is not maintained which causes hardware to go idle even though there is work to do.

Our doorbell optimization algorithm has always required a full memory barrier because it prevents reordering of stores and loads.
There used to be an MFENCE in the i40iw doorbell algorithm. In a previous discussion, https://lore.kernel.org/linux-rdma/20170306194052.GB31672@obsidianresearch.com/ (local), you suggested the use of atomic_thread_fence(memory_order_seq_cst), which is a C11 barrier, and this did work for us.

Thank you,
Tatyana

 
quoted
diff --git a/providers/irdma/uk.c b/providers/irdma/uk.c index
c7053c52..d63996db 100644
+++ b/providers/irdma/uk.c
@@ -118,7 +118,7 @@ void irdma_uk_qp_post_wr(struct irdma_qp_uk
*qp)
quoted
 	__u32 sw_sq_head;

 	/* valid bit is written and loads completed before reading shadow */
-	udma_to_device_barrier();
+	atomic_thread_fence(memory_order_seq_cst);
Because it certainly looks wrong to replace a DMA barrier with something
that is not a DMA barrier.

I'm guessing this problem is that the shadow memory is not locked and also
not using using atomics to control concurrent access it? If so then the fix is to
use atomics for the shadow memory and place the proper order
requirement on the atomic itself.

Jason
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help