Re: [PATCH v4 for 4.14 1/3] membarrier: Provide register expedited private command

From: Nicholas Piggin <npiggin@gmail.com>
Date: 2017-09-29 11:39:20
Also in: linux-arch, lkml

On Fri, 29 Sep 2017 12:31:31 +0200
Peter Zijlstra [off-list ref] wrote:

On Fri, Sep 29, 2017 at 02:27:57AM +1000, Nicholas Piggin wrote:

quoted

The biggest power boxes are more tightly coupled than those big
SGI systems, but even so just plodding along taking and releasing
locks in turn would be fine on those SGI ones as well really. Not DoS
level. This is not a single mega hot cache line or lock that is
bouncing over the entire machine, but one process grabbing a line and
lock from each of 1000 CPUs.

Slight disturbance sure, but each individual CPU will see it as 1/1000th
of a disturbance, most of the cost will be concentrated in the syscall
caller.

But once the:

	while (1)
		sys_membarrier()

thread has all those (lock) lines in M state locally, it will become
very hard for the remote CPUs to claim them back, because it's constantly

Not really. There is some ability to hold onto a line for a time, but
there is no way to starve them, let alone starve hundreds of other
CPUs. They will request the cacheline exclusive and eventually get it.
Then the membarrier CPU has to pay to get it back. If there is a lot of
activity on the locks, the membarrier CPU will have a hard time taking
each one.

I'm not saying there is zero cost or that it can't interfere with others,
only that it does not seem particularly bad compared with other things.
Once you restrict it to mm_cpumask, it's quite partitionable.
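
To make the cost argument concrete, here is a rough userspace sketch of
"taking and releasing locks in turn", restricted to a mask. It is not
kernel code: NR_CPUS, rq_lock and membarrier_lock_walk are hypothetical
names, a 64-bit word stands in for mm_cpumask, and pthread mutexes stand
in for the per-CPU locks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NR_CPUS 64	/* hypothetical; fits a 64-bit stand-in for mm_cpumask */

/* Stand-ins for the per-CPU locks (assumption, not the kernel's types). */
static pthread_mutex_t rq_lock[NR_CPUS];

static void rq_locks_init(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		int err = pthread_mutex_init(&rq_lock[cpu], NULL);

		assert(err == 0);
		(void)err;
	}
}

/*
 * Take and release the lock of every CPU set in mm_mask, one at a time.
 * Returns the number of locks touched, which scales with the weight of
 * the mask rather than with NR_CPUS.
 */
static int membarrier_lock_walk(uint64_t mm_mask)
{
	int touched = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(mm_mask & (UINT64_C(1) << cpu)))
			continue;
		pthread_mutex_lock(&rq_lock[cpu]);
		/* The lock/unlock pair is the only work done per CPU. */
		pthread_mutex_unlock(&rq_lock[cpu]);
		touched++;
	}
	return touched;
}
```

Only one lock is held at any time, so each remote CPU sees at most one
short acquisition per walk, and a process confined to a few CPUs only
ever touches those CPUs' locks.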

I would really prefer to go this way on powerpc first. We could add the
registration APIs as basically no-ops, but which would allow the
locking approach to be changed if we find it causes issues. I'll try to
find some time and a big system when I can.
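
For reference, a sketch of what the register-then-use flow looks like
from userspace. It assumes the command names from the
<linux/membarrier.h> uapi header (MEMBARRIER_CMD_QUERY,
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, MEMBARRIER_CMD_PRIVATE_EXPEDITED)
and a kernel that implements them; the wrapper and helper names are made
up for the example.

```c
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw syscall (hypothetical helper name). */
static int membarrier(int cmd, int flags)
{
	return (int)syscall(__NR_membarrier, cmd, flags);
}

/*
 * Register the process for private expedited barriers, then issue one.
 * Returns 0 on success, -1 if the running kernel does not support the
 * commands or a call fails.
 */
static int expedited_private_barrier(void)
{
	int mask = membarrier(MEMBARRIER_CMD_QUERY, 0);

	if (mask < 0 || !(mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED))
		return -1;

	/* With a no-op registration this step would cost next to nothing. */
	if ((mask & MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED) &&
	    membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) < 0)
		return -1;

	return membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) < 0 ? -1 : 0;
}
```

If registration is a no-op on a given architecture, callers written this
way keep working unchanged should the implementation behind it be
switched later.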

touching them. Sure, it will touch 1000 other lines before it's back to
this one, but if they're all local that's fairly quick.

But you're right, your big machines have far smaller NUMA factors.

quoted

Bouncing that lock across the machine is *painful*; I have vague
memories of cases where the lock ping-pong was most of the time spent.

But only Power needs this, all the other architectures are fine with the
lockless approach for MEMBAR_EXPEDITED_PRIVATE.

Yes, we can add an iterator function that power can override in a few
lines. Less arch specific code than this proposal.
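
Roughly the shape I have in mind, as a userspace sketch rather than a
patch. All names here (iter_generic, iter_locked, membarrier_iter,
lock_ops) are hypothetical, a 64-bit word stands in for the cpumask, and
a counter stands in for the lock operations.

```c
#include <stdint.h>

typedef void (*visit_fn)(int cpu, void *arg);
typedef void (*iter_fn)(uint64_t mask, visit_fn visit, void *arg);

/* Generic, lockless walk: the default every architecture gets. */
static void iter_generic(uint64_t mask, visit_fn visit, void *arg)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			visit(cpu, arg);
}

/* Counts lock/unlock pairs, purely for illustration. */
static int lock_ops;

/* Hypothetical arch override: same walk, bracketed by a per-CPU lock. */
static void iter_locked(uint64_t mask, visit_fn visit, void *arg)
{
	for (int cpu = 0; cpu < 64; cpu++) {
		if (!(mask & (UINT64_C(1) << cpu)))
			continue;
		lock_ops++;	/* stands in for taking that CPU's lock */
		visit(cpu, arg);
		/* ...and the matching unlock would go here */
	}
}

/* Arch code would repoint this; every caller goes through it. */
static iter_fn membarrier_iter = iter_generic;

/* Example visitor: counts the CPUs it is called for. */
static void count_cpu(int cpu, void *arg)
{
	(void)cpu;
	(*(int *)arg)++;
}
```

The generic caller never needs to know which variant is installed, so the
arch-specific part stays confined to the one override.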

A semi-related issue: I suppose we can do an arch upcall to flush_tlb_mm
and reset the mm_cpumask when we change cpuset groups.

For powerpc we have been looking at how mm_cpumask can be improved.
It has real drawbacks even when you don't consider this new syscall.

Thanks,
Nick
