Re: RFC on writel and writel_relaxed
From: Arnd Bergmann <arnd@arndb.de>
Date: 2018-03-26 21:30:03
Also in:
linux-rdma
On Mon, Mar 26, 2018 at 11:09 PM, Jason Gunthorpe [off-list ref] wrote:
On Mon, Mar 26, 2018 at 10:43:43PM +0200, Arnd Bergmann wrote:quoted
On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe [off-list ref] wrote:quoted
On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:quoted
On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe [off-list ref] wrote:quoted
On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:quoted
quoted
quoted
This is a super performance critical operation for most drivers and directly impacts network performance.Perhaps there ought to be writel_nobarrier() (etc) that never contain any barriers at all. This might mean that they are always just the memory operation, but it would make it more obvious what the driver was doing.I think that is what writel_relaxed is supposed to be. The only restriction it has is that the writes to a single device using UC memory must be kept in program order..Not sure about whether we have ever defined what happens to writel_relaxed() on WC memory though: On ARM, we disallow the compiler to combine writes, but the CPU still might.If the driver uses WC memory then I think it should not expect anything in terms of how writes map to TLPs other than nothing combines across mmiowb() and mmiowb() is fully globally ordered when enclosed in a spinlock. The entire point of using WC memory is usually to get combining :) If the driver doesn't want that then it should map UC..Usually, WC memory is used with memcpy_toio() though, which by definition doesn't have any barriers between accesses, and is required to get the correct byte ordering on writes to memory buffers.memcpy_toio is too expensive to actually use for anything performance though. It is too pessimistic. What the drivers usually want is a unwound block of 4 or 8 8-byte copies. No function calls, no branching. Everything is already known to be aligned. Most of the drivers have a unwound loop with writeq() or something to do it.
But isn't the writeq() barrier much more expensive than anything you'd do in function calls?
quoted
quoted
The same document says that _relaxed() does not give that guarentee. The lwn articule on this went into some depth on the interaction with spinlocks. As far as I can see, containment in a spinlock seems to be the only different between writel and writel_relaxed..I was always puzzled by this: The intention of _relaxed() on ARM (where it originates) was to skip the barrier that serializes DMA with MMIO, not to skip the serialization between MMIO and locks.But that was never a requirement of writel(), Documentation/memory-barriers.txt gives an explicit example demanding the wmb() before writel() for ordering system memory against writel.
Indeed, but it's in an example for when to use dma_wmb(), not wmb().
Adding Alexander Duyck to Cc, he added that section as part of
1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
dma_wmb()"). Also adding the other people that were involved with that.
I actually have no idea why ARM had that barrier, I always assumed it was to give program ordering to the accesses and that _relaxed allowed re-ordering (the usual meaning of relaxed).. But the barrier document makes it pretty clear that the only difference between the two is spinlock containment, and WillD wrote this text, so I belive it is accurate for ARM. Very confusing.
It does mention serialization with both DMA and locks in the
section about readX_relaxed()/writeX_relaxed(). The part
about DMA is very clear here, and I must have just forgotten
the exact semantics with regards to spinlocks. I'm still not
sure what prevents a writel() from leaking out the end of a
spinlock section that doesn't happen with writel_relaxed(), since
the barrier in writel() comes before the access, and the
spin_unlock() shouldn't affect the external buses.
Arnd