Thread (274 messages) 274 messages, 15 authors, 2014-03-07

Re: [RFC][PATCH 0/5] arch: atomic rework

From: Torvald Riegel <hidden>
Date: 2014-02-12 06:07:12
Also in: lkml

On Tue, 2014-02-11 at 07:59 -0800, Paul E. McKenney wrote:
On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote:
quoted
On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel [off-list ref] wrote:
quoted
Intuitively, this is wrong because this let's the program take a step
the abstract machine wouldn't do.  This is different to the sequential
code that Peter posted because it uses atomics, and thus one can't
easily assume that the difference is not observable.
Btw, what is the definition of "observable" for the atomics?

Because I'm hoping that it's not the same as for volatiles, where
"observable" is about the virtual machine itself, and as such volatile
accesses cannot be combined or optimized at all.

Now, I claim that atomic accesses cannot be done speculatively for
writes, and not re-done for reads (because the value could change),
but *combining* them would be possible and good.

For example, we often have multiple independent atomic accesses that
could certainly be combined: testing the individual bits of an atomic
value with helper functions, causing things like "load atomic, test
bit, load same atomic, test another bit". The two atomic loads could
be done as a single load without possibly changing semantics on a real
machine, but if "visibility" is defined in the same way it is for
"volatile", that wouldn't be a valid transformation. Right now we use
"volatile" semantics for these kinds of things, and they really can
hurt.

Same goes for multiple writes (possibly due to setting bits):
combining multiple accesses into a single one is generally fine, it's
*adding* write accesses speculatively that is broken by design..

At the same time, you can't combine atomic loads or stores infinitely
- "visibility" on a real machine definitely is about timeliness.
Removing all but the last write when there are multiple consecutive
writes is generally fine, even if you unroll a loop to generate those
writes. But if what remains is a loop, it might be a busy-loop
basically waiting for something, so it would be wrong ("untimely") to
hoist a store in a loop entirely past the end of the loop, or hoist a
load in a loop to before the loop.

Does the standard allow for that kind of behavior?
You asked!  ;-)

So the current standard allows merging of both loads and stores, unless of
course ordring constraints prevent the merging.  Volatile semantics may be
used to prevent this merging, if desired, for example, for real-time code.
Agreed.
Infinite merging is intended to be prohibited, but I am not certain that
the current wording is bullet-proof (1.10p24 and 1.10p25).
Yeah, maybe not.  But it at least seems to rather clearly indicate the
intent ;)
The only prohibition against speculative stores that I can see is in a
non-normative note, and it can be argued to apply only to things that are
not atomics (1.10p22).
I think this one is specifically about speculative stores that would
affect memory locations that the abstract machine would not write to,
and that might be observable or create data races.  While a compiler
could potentially prove that such stores aren't leading to a difference
in the behavior of the program (e.g., by proving that there are no
observers anywhere and this isn't overlapping with any volatile
locations), I think that this is hard in general and most compilers will
just not do such things.  In GCC, bugs in that category were fixed after
researchers doing fuzz-testing found them (IIRC, speculative stores by
loops).
I don't see any prohibition against reordering
a store to precede a load preceding a conditional branch -- which would
not be speculative if the branch was know to be taken and the load
hit in the store buffer.  In a system where stores could be reordered,
some other CPU might perceive the store as happening before the load
that controlled the conditional branch.  This needs to be addressed.
I don't know the specifics of your example, but from how I understand
it, I don't see a problem if the compiler can prove that the store will
always happen.

To be more specific, if the compiler can prove that the store will
happen anyway, and the region of code can be assumed to always run
atomically (e.g., there's no loop or such in there), then it is known
that we have one atomic region of code that will always perform the
store, so we might as well do the stuff in the region in some order.

Now, if any of the memory accesses are atomic, then the whole region of
code containing those accesses is often not atomic because other threads
might observe intermediate results in a data-race-free way.

(I know that this isn't a very precise formulation, but I hope it brings
my line of reasoning across.)
Why this hole?  At the time, the current formalizations of popular
CPU architectures did not exist, and it was not clear that all popular
hardware avoided speculative stores.  
I'm not quite sure which hole you see there.  Can you elaborate?
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help