Thread (42 messages) 42 messages, 11 authors, 2018-04-02

Re: RFC on writel and writel_relaxed

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: 2018-03-28 04:42:59
Also in: linux-rdma, linuxppc-dev

Possibly related (same subject, not in this thread)

On Tue, 2018-03-27 at 23:24 -0400, Sinan Kaya wrote:
On 3/27/2018 10:51 PM, Linus Torvalds wrote:
quoted
quoted
The discussion at hand is about

        dma_buffer->foo = 1;                    /* WB */
        writel(KICK, DMA_KICK_REGISTER);        /* UC */
Yes. That certainly is ordered on x86. In fact, afaik it's ordered
even if that writel() might be of type WC, because that only delays
writes, it doesn't move them earlier.
Now that we clarified x86 myth, Is this guaranteed on all architectures?
If not we need to fix it. It's guaranteed on the "main" ones (arm,
arm64, powerpc, i386, x86_64). We might need to check with other arch
maintainers for the rest.

We really want Linux to provide well defined "sane" semantics for the
basic writel accessors.

Note: We still have open questions about how readl() relates to
surrounding memory accesses. It looks like ARM and powerpc do different
things here.
We keep getting IA64 exception example. Maybe, this is also corrected since
then.
I would think ia64 fixed it back when it was all discussed. I was under
the impression all ia64 had "special" was the writel vs. spin_unlock
which requires mmiowb, but maybe that was never completely fixed ?
Jose Abreu says "I don't know about x86 but arc architecture doesn't
have a wmb() in the writel() function (in some configs)".
Well, it probably should then.
As long as we have these exceptions, these wmb() in common drivers is not
going anywhere and relaxed-arches will continue paying performance penalty.
Well, let's fix them or leave them broken, at this point, it doesn't
matter. We can give all arch maintainers a wakeup call and start making
drivers work based on the documented assumptions.
I see 15% performance loss on ARM64 servers using Intel i40e network
drivers and an XL710 adapter due to CPU keeping itself busy doing barriers
most of the time rather than real work because of sequences like this all over
the place.

         dma_buffer->foo = 1;                    /* WB */
	 wmb()
         writel(KICK, DMA_KICK_REGISTER);        /* UC */

I posted several patches last week to remove duplicate barriers on ARM while
trying to make the code friendly with other architectures.

Basically changing it to

dma_buffer->foo = 1;                    /* WB */
wmb()
writel_relaxed(KICK, DMA_KICK_REGISTER);        /* UC */
mmiowb()

This is a small step in the performance direction until we remove all exceptions.

https://www.spinics.net/lists/netdev/msg491842.html
https://www.spinics.net/lists/linux-rdma/msg62434.html
https://www.spinics.net/lists/arm-kernel/msg642336.html

Discussion started to move around the need for relaxed API on PPC and then
why wmb() question came up.
I'm working on the problem of relaxed APIs for powerpc, but we can keep
that orthogonal. As is, today, a wmb() + writel() and a wmb() +
writel_relaxed() on powerpc are identical. So changing them will not
break us.

But I don't see the point of doing that transformation if we can just
get the straying archs fixed. It's not like any of them has a
significant market presence these days anyway.

Cheers,
Ben.
Sinan
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help