Thread (14 messages) 14 messages, 5 authors, 2019-03-22

Re: per-cpu thoughts

From: Mark Rutland <mark.rutland@arm.com>
Date: 2019-03-12 11:24:13
Also in: linux-riscv

On Mon, Mar 11, 2019 at 11:39:56AM -0700, Paul Walmsley wrote:
On Mon, 11 Mar 2019, Mark Rutland wrote:
quoted
On Mon, Mar 11, 2019 at 08:26:45AM -0700, Paul Walmsley wrote:
quoted
On Mon, 11 Mar 2019, Björn Töpel wrote:
quoted
But the generic one disables interrupts, right?

I believe the rational for RV is similar to ARM's; AMO+preemption
disable regions is *slightly* better than the generic, but not as good
as the IA one. Or am I missing something?
There's been a discussion going on in a private thread about this that I 
unfortunately didn't add you to.  The discussion is still ongoing, but I 
think Christoph and myself and a few other folks have agreed that the 
preempt_disable/enable is not needed for the amoadd approach.  This is 
since the apparent intention of the preemption disable/enable is to ensure 
the correctness of the counter increment; however there is no risk of 
incorrectness in an amoadd sequence since the atomic add is locked across 
all of the cache coherency domain. 
We also thought that initially, but there's a sbutle race that can
occur, and so we added code to disable preemption in commit:

  f3eab7184ddcd486 ("arm64: percpu: Make this_cpu accessors pre-empt safe")

The problem on arm64 is that our atomics take a single base register,
and we have to generate the percpu address with separate instructions
from the atomic itself. That means we can get preempted between address
generation and the atomic, which is problematic for sequences like:

	// Thread-A			// Thread-B

	this_cpu_add(var)
					local_irq_disable(flags)
					...
					v = __this_cpu_read(var);
					v = some_function(v);
					__this_cpu_write(var, v);
					...
					local_irq_restore(flags)

... which can unexpectedly race as:


	// Thread-A			// Thread-B
	
	< generate CPU X addr >
	< preempted >

					< scheduled on CPU X >
					local_irq_disable(flags);
					v = __this_cpu_read(var);

	< scheduled on CPU Y >
	< add to CPU X's var >
					v = some_function(v);
					__this_cpu_write(var, v);
					local_irq_restore(flags);

... and hence we lose an update to a percpu variable.
Makes sense, and thank you very much for the detailed sequences.  
Open-coded per-cpu code sequences would also cause RISC-V to be exposed to 
the same issue.  Christoph mentioned this, but at the time my attention 
was focused on per-cpu counters, and not the broader range of per-cpu code 
sequences.
quoted
I suspect RISC-V would have the same problem, unless its AMOs can
generate the percpu address and perform the update in a single
instruction.
My understanding is that many of Christoph's per-cpu performance concerns 
revolve around counters in the VM code, such as:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/vmstat.c#n355
The mod_*_state() functions are the only ones which mess with
preemption, and that should only mandate a few locally-visible
modifications of preempt_count.

Similar cases apply within SLUB, and I'd hoped to improve that with my
this-cpu-reg branch, but I didn't see a measureable improvement on
workloads I tried.

Have you seen a measureable performance problem here?
and probably elsewhere by now.  It may be worth creating a distinct API 
for those counters.  If only increment, decrement, and read operations are 
needed, there shouldn't be a need to disable or re-enable 
preemption in those code paths - assuming that one is either able to 
tolerate the occasional cache line bounce or retries in a long LL/SC 
sequence.  Any opinions on that?
I'm afraid I don't understand this code well enough to say whether that
would be safe.

It's not clear to me whether there would be a measureable performance
difference, as I'd expect fiddling with preempt_count to be relatively
cheap. The AMOs themselves don't need to enforce ordering here, and only
a few compiler barriers are necessary.

Thanks,
Mark.
quoted
FWIW, I had a go at building percpu ops that didn't need to disable
preemption, but that required LL/SC atomics, reserving a GPR for the
percpu offset, and didn't result in a measurable difference in practice.
The patches are at:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/this-cpu-reg&id=84ee5f23f93d4a650e828f831da9ed29c54623c5
Very interesting indeed.  Thank you for sharing that,

- Paul

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help