Android 3.0 and later platform versions are optimized to support +multiprocessor architectures. This document introduces issues that +can arise when writing code for symmetric multiprocessor systems in C, C++, and the Java +programming language (hereafter referred to simply as “Java” for the sake of +brevity). It's intended as a primer for Android app developers, not as a complete +discussion on the subject. The focus is on the ARM CPU architecture.
+ +If you’re in a hurry, you can skip the Theory section +and go directly to Practice for best practices, but this +is not recommended.
+ + +SMP is an acronym for “Symmetric Multi-Processor”. It describes a design in +which two or more identical CPU cores share access to main memory. Until +a few years ago, all Android devices were UP (Uni-Processor).
+ +Most — if not all — Android devices do have multiple CPUs, but generally one +of them is used to run applications while others manage various bits of device +hardware (for example, the radio). The CPUs may have different architectures, and the +programs running on them can’t use main memory to communicate with each +other.
+ +Most Android devices sold today are built around SMP designs, +making things a bit more complicated for software developers. The sorts of race +conditions you might encounter in a multi-threaded program are much worse on SMP +when two or more of your threads are running simultaneously on different cores. +What’s more, SMP on ARM is more challenging to work with than SMP on x86. Code +that has been thoroughly tested on x86 may break badly on ARM.
+ +The rest of this document will explain why, and tell you what you need to do +to ensure that your code behaves correctly.
+ + +This is a high-speed, glossy overview of a complex subject. Some areas will +be incomplete, but none of it should be misleading or wrong.
+ +See Further reading at the end of the document for +pointers to more thorough treatments of the subject.
+ +Memory consistency models, or often just “memory models”, describe the +guarantees the hardware architecture makes about memory accesses. For example, +if you write a value to address A, and then write a value to address B, the +model might guarantee that every CPU core sees those writes happen in that +order.
+ +The model most programmers are accustomed to is sequential +consistency, which is described like this (Adve & +Gharachorloo):
+ +If you look at a bit of code and see that it does some reads and writes from +memory, on a sequentially-consistent CPU architecture you know that the code +will do those reads and writes in the expected order. It’s possible that the +CPU is actually reordering instructions and delaying reads and writes, but there +is no way for code running on the device to tell that the CPU is doing anything +other than execute instructions in a straightforward manner. (We’re ignoring +memory-mapped device driver I/O for the moment.)
+ +To illustrate these points it’s useful to consider small snippets of code, +commonly referred to as litmus tests. These are assumed to execute in +program order, that is, the order in which the instructions appear here is +the order in which the CPU will execute them. We don’t want to consider +instruction reordering performed by compilers just yet.
+ +Here’s a simple example, with code running on two threads:
| Thread 1 | Thread 2 |
|---|---|
| A = 3 | reg0 = B |
| B = 5 | reg1 = A |
In this and all future litmus examples, memory locations are represented by +capital letters (A, B, C) and CPU registers start with “reg”. All memory is +initially zero. Instructions are executed from top to bottom. Here, thread 1 +stores the value 3 at location A, and then the value 5 at location B. Thread 2 +loads the value from location B into reg0, and then loads the value from +location A into reg1. (Note that we’re writing in one order and reading in +another.)
+ +Thread 1 and thread 2 are assumed to execute on different CPU cores. You +should always make this assumption when thinking about +multi-threaded code.
+ +Sequential consistency guarantees that, after both threads have finished +executing, the registers will be in one of the following states:
| Registers | States |
|---|---|
| reg0=5, reg1=3 | possible (thread 1 ran first) |
| reg0=0, reg1=0 | possible (thread 2 ran first) |
| reg0=0, reg1=3 | possible (concurrent execution) |
| reg0=5, reg1=0 | never |
To get into a situation where we see B=5 before we see the store to A, either +the reads or the writes would have to happen out of order. On a +sequentially-consistent machine, that can’t happen.
+ +Most uni-processors, including x86 and ARM, are sequentially consistent. +Most SMP systems, including x86 and ARM, are not.
+ +x86 SMP provides processor consistency, which is slightly weaker than +sequential. While the architecture guarantees that loads are not reordered with +respect to other loads, and stores are not reordered with respect to other +stores, it does not guarantee that a store followed by a load will be observed +in the expected order.
+ +Consider the following example, which is a piece of Dekker’s Algorithm for +mutual exclusion:
| Thread 1 | Thread 2 |
|---|---|
| A = true | B = true |
| reg1 = B | reg2 = A |
| if (reg1 == false) critical-stuff | if (reg2 == false) critical-stuff |
The idea is that thread 1 uses A to indicate that it’s busy, and thread 2 +uses B. Thread 1 sets A and then checks to see if B is set; if not, it can +safely assume that it has exclusive access to the critical section. Thread 2 +does something similar. (If a thread discovers that both A and B are set, a +turn-taking algorithm is used to ensure fairness.)
+ +On a sequentially-consistent machine, this works correctly. On x86 and ARM +SMP, the store to A and the load from B in thread 1 can be “observed” in a +different order by thread 2. If that happened, we could actually appear to +execute this sequence (where blank lines have been inserted to highlight the +apparent order of operations):
| Thread 1 | Thread 2 |
|---|---|
| reg1 = B | |
| | reg2 = A |
| A = true | B = true |
| if (reg1 == false) critical-stuff | if (reg2 == false) critical-stuff |
This results in both reg1 and reg2 set to “false”, allowing the threads to +execute code in the critical section simultaneously. To understand how this can +happen, it’s useful to know a little about CPU caches.
+ +This is a substantial topic in and of itself. An extremely brief overview +follows. (The motivation for this material is to provide some basis for +understanding why SMP systems behave as they do.)
+ +Modern CPUs have one or more caches between the processor and main memory. +These are labeled L1, L2, and so on, with the higher numbers being successively +“farther” from the CPU. Cache memory adds size and cost to the hardware, and +increases power consumption, so the ARM CPUs used in Android devices typically +have small L1 caches and little or no L2/L3.
+ +Loading or storing a value into the L1 cache is very fast. Doing the same to +main memory can be 10-100x slower. The CPU will therefore try to operate out of +the cache as much as possible. The write policy of a cache determines when data +written to it is forwarded to main memory. A write-through cache will initiate +a write to memory immediately, while a write-back cache will wait until it runs +out of space and has to evict some entries. In either case, the CPU will +continue executing instructions past the one that did the store, possibly +executing dozens of them before the write is visible in main memory. (While the +write-through cache has a policy of immediately forwarding the data to main +memory, it only initiates the write. It does not have to wait +for it to finish.)
+ +The cache behavior becomes relevant to this discussion when each CPU core has +its own private cache. In a simple model, the caches have no way to interact +with each other directly. The values held by core #1’s cache are not shared +with or visible to core #2’s cache except as loads or stores from main memory. +The long latencies on memory accesses would make inter-thread interactions +sluggish, so it’s useful to define a way for the caches to share data. This +sharing is called cache coherency, and the coherency rules are defined +by the CPU architecture’s cache consistency model.
+ +With that in mind, let’s return to the Dekker example. When core 1 executes +“A = 1”, the value gets stored in core 1’s cache. When core 2 executes “if (A +== 0)”, it might read from main memory or it might read from core 2’s cache; +either way it won’t see the store performed by core 1. (“A” could be in core +2’s cache because of a previous load from “A”.)
+ +For the memory consistency model to be sequentially consistent, core 1 would +have to wait for all other cores to be aware of “A = 1” before it could execute +“if (B == 0)” (either through strict cache coherency rules, or by disabling the +caches entirely so everything operates out of main memory). This would impose a +performance penalty on every store operation. Relaxing the rules for the +ordering of stores followed by loads improves performance but imposes a burden +on software developers.
+ +The other guarantees made by the processor consistency model are less +expensive to make. For example, to ensure that memory writes are not observed +out of order, it just needs to ensure that the stores are published to other +cores in the same order that they were issued. It doesn’t need to wait for +store #1 to finish being published before it can start on store +#2, it just needs to ensure that it doesn’t finish publishing #2 before it +finishes publishing #1. This avoids a performance bubble.
+ +Relaxing the guarantees even further can provide additional opportunities for +CPU optimization, but creates more opportunities for code to behave in ways the +programmer didn’t expect.
+ +One additional note: CPU caches don’t operate on individual bytes. Data is +read or written as cache lines; for many ARM CPUs these are 32 bytes. If you +read data from a location in main memory, you will also be reading some adjacent +values. Writing data will cause the cache line to be read from memory and +updated. As a result, you can cause a value to be loaded into cache as a +side-effect of reading or writing something nearby, adding to the general aura +of mystery.
+ +Before going further, it’s useful to define in a more rigorous fashion what +is meant by “observing” a load or store. Suppose core 1 executes “A = 1”. The +store is initiated when the CPU executes the instruction. At some +point later, possibly through cache coherence activity, the store is +observed by core 2. In a write-through cache it doesn’t really +complete until the store arrives in main memory, but the memory +consistency model doesn’t dictate when something completes, just when it can be +observed.
+ + +(In a kernel device driver that accesses memory-mapped I/O locations, it may +be very important to know when things actually complete. We’re not going to go +into that here.)
Observability may be defined informally as follows, where “you” and “I” are CPU cores:

- I have observed your write when I can read what you wrote.
- I have observed your read when I can no longer affect the value you read.
+ +The notion of observing a write is intuitive; observing a read is a bit less +so (don’t worry, it grows on you).
+ +With this in mind, we’re ready to talk about ARM.
+ +ARM SMP provides weak memory consistency guarantees. It does not guarantee that +loads or stores are ordered with respect to each other.
| Thread 1 | Thread 2 |
|---|---|
| A = 41 | loop_until (B == 1) |
| B = 1 | reg = A |
Recall that all addresses are initially zero. The “loop_until” instruction +reads B repeatedly, looping until we read 1 from B. The idea here is that +thread 2 is waiting for thread 1 to update A. Thread 1 sets A, and then sets B +to 1 to indicate data availability.
+ +On x86 SMP, this is guaranteed to work. Thread 2 will observe the stores +made by thread 1 in program order, and thread 1 will observe thread 2’s loads in +program order.
+ +On ARM SMP, the loads and stores can be observed in any order. It is +possible, after all the code has executed, for reg to hold 0. It’s also +possible for it to hold 41. Unless you explicitly define the ordering, you +don’t know how this will come out.
+ +(For those with experience on other systems, ARM’s memory model is equivalent +to PowerPC in most respects.)
+ + +Memory barriers provide a way for your code to tell the CPU that memory +access ordering matters. ARM/x86 uniprocessors offer sequential consistency, +and thus have no need for them. (The barrier instructions can be executed but +aren’t useful; in at least one case they’re hideously expensive, motivating +separate builds for SMP targets.)
There are four basic situations to consider:

1. A store followed by another store
2. A load followed by another load
3. A load followed by a store
4. A store followed by a load
+ +Recall our earlier example:
| Thread 1 | Thread 2 |
|---|---|
| A = 41 | loop_until (B == 1) |
| B = 1 | reg = A |
Thread 1 needs to ensure that the store to A happens before the store to B. +This is a “store/store” situation. Similarly, thread 2 needs to ensure that the +load of B happens before the load of A; this is a load/load situation. As +mentioned earlier, the loads and stores can be observed in any order.
+ +Going back to the cache discussion, assume A and B are on separate cache +lines, with minimal cache coherency. If the store to A stays local but the +store to B is published, core 2 will see B=1 but won’t see the update to A. On +the other side, assume we read A earlier, or it lives on the same cache line as +something else we recently read. Core 2 spins until it sees the update to B, +then loads A from its local cache, where the value is still zero.
+We can fix it like this:
| Thread 1 | Thread 2 |
|---|---|
| A = 41 | loop_until (B == 1) |
| store/store barrier | load/load barrier |
| B = 1 | reg = A |
The store/store barrier guarantees that all observers will +observe the write to A before they observe the write to B. It makes no +guarantees about the ordering of loads in thread 1, but we don’t have any of +those, so that’s okay. The load/load barrier in thread 2 makes a similar +guarantee for the loads there.
+ +Since the store/store barrier guarantees that thread 2 observes the stores in +program order, why do we need the load/load barrier in thread 2? Because we +also need to guarantee that thread 1 observes the loads in program order.
+ +The store/store barrier could work by flushing all +dirty entries out of the local cache, ensuring that other cores see them before +they see any future stores. The load/load barrier could purge the local cache +completely and wait for any “in-flight” loads to finish, ensuring that future +loads are observed after previous loads. What the CPU actually does doesn’t +matter, so long as the appropriate guarantees are kept. If we use a barrier in +core 1 but not in core 2, core 2 could still be reading A from its local +cache.
+Because the architectures have different memory models, these barriers are +required on ARM SMP but not x86 SMP.
+ +The Dekker’s Algorithm fragment shown earlier illustrated the need for a +store/load barrier. Here’s an example where a load/store barrier is +required:
| Thread 1 | Thread 2 |
|---|---|
| reg = A | loop_until (B == 1) |
| B = 1 | A = 41 |
Thread 2 could observe thread 1’s store of B=1 before it observes thread 1’s load from A, and as a result store A=41 before thread 1 has a chance to read A. Inserting a load/store barrier in each thread solves the problem:
| Thread 1 | Thread 2 |
|---|---|
| reg = A | loop_until (B == 1) |
| load/store barrier | load/store barrier |
| B = 1 | A = 41 |
A store to local cache may be observed before a load from main memory, +because accesses to main memory are so much slower. In this case, assume core +1’s cache has the cache line for B but not A. The load from A is initiated, and +while that’s in progress execution continues. The store to B happens in local +cache, and by some means becomes available to core 2 while the load from A is +still in progress. Thread 2 is able to exit the loop before it has observed +thread 1’s load from A.
+ +A thornier question is: do we need a barrier in thread 2? If the CPU doesn’t +perform speculative writes, and doesn’t execute instructions out of order, can +thread 2 store to A before thread 1’s read if thread 1 guarantees the load/store +ordering? (Answer: no.) What if there’s a third core watching A and B? +(Answer: now you need one, or you could observe B==0 / A==41 on the third core.) + It’s safest to insert barriers in both places and not worry about the +details.
+As mentioned earlier, store/load barriers are the only kind required on x86 +SMP.
Different CPUs provide different flavors of barrier instruction. For example:

- ARMv7 has “dmb st” (a store/store barrier) and “dmb sy” (a full barrier).
- x86 provides “sfence” (store/store), “lfence” (load/load), and “mfence” (full barrier).

“Full barrier” means all four categories are included.
+ +It is important to recognize that the only thing guaranteed by barrier +instructions is ordering. Do not treat them as cache coherency “sync points” or +synchronous “flush” instructions. The ARM “dmb” instruction has no direct +effect on other cores. This is important to understand when trying to figure +out where barrier instructions need to be issued.
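For reference, a full barrier can be requested from C without writing assembly. The sketch below uses the GCC/Clang __sync_synchronize() builtin, which also acts as a compiler reorder barrier and compiles down to a dmb on ARMv7; the variable names are illustrative, and the reader side would still need its own barrier, as discussed above:

int payload;
int ready;

void publish(void) {
    payload = 42;           /* plain store */
    __sync_synchronize();   /* full barrier: the store to payload is
                               observed before the store to ready */
    ready = 1;
}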
+ + +(This is a slightly more advanced topic and can be skipped.) + +
The ARM CPU provides one special case where a load/load barrier can be +avoided. Consider the following example from earlier, modified slightly:
| Thread 1 | Thread 2 |
|---|---|
| [A+8] = 41 | loop: |
| store/store barrier | reg0 = B |
| B = 1 | if (reg0 == 0) goto loop |
| | reg1 = 8 |
| | reg2 = [A+reg1] |
This introduces a new notation. If “A” refers to a memory address, “A+n” refers to a memory address offset by n bytes from A. If A is the base address of an object or array, [A+8] could be a field in the object or an element in the array.
+ +The “loop_until” seen in previous examples has been expanded to show the load +of B into reg0. reg1 is assigned the numeric value 8, and reg2 is loaded from +the address [A+reg1] (same location that thread 1 is accessing).
+ +This will not behave correctly because the load from B could be observed +after the load from [A+reg1]. We can fix this with a load/load barrier after +the loop, but on ARM we can also just do this:
| Thread 1 | Thread 2 |
|---|---|
| [A+8] = 41 | loop: |
| store/store barrier | reg0 = B |
| B = 1 | if (reg0 == 0) goto loop |
| | reg1 = 8 + (reg0 & 0) |
| | reg2 = [A+reg1] |
What we’ve done here is change the assignment of reg1 from a constant (8) to +a value that depends on what we loaded from B. In this case, we do a bitwise +AND of the value with 0, which yields zero, which means reg1 still has the value +8. However, the ARM CPU believes that the load from [A+reg1] depends upon the +load from B, and will ensure that the two are observed in program order.
+ +This is called an address dependency. Address dependencies exist +when the value returned by a load is used to compute the address of a subsequent +load or store. It can let you avoid the need for an explicit barrier in certain +situations.
+ +ARM does not provide control dependency guarantees. To illustrate +this it’s necessary to dip into ARM code for a moment: (Barrier Litmus Tests and Cookbook).
LDR r1, [r0]
CMP r1, #55
LDRNE r2, [r3]
The loads from r0 and r3 may be observed out of order, even though the load +from r3 will not execute at all if [r0] doesn’t hold 55. Inserting AND r1, r1, +#0 and replacing the last instruction with LDRNE r2, [r3, r1] would ensure +proper ordering without an explicit barrier. (This is a prime example of why +you can’t think about consistency issues in terms of instruction execution. +Always think in terms of memory accesses.)
+ +While we’re hip-deep, it’s worth noting that ARM does not provide causal +consistency:
| Thread 1 | Thread 2 | Thread 3 |
|---|---|---|
| A = 1 | loop_until (A == 1) | loop: |
| | B = 1 | reg0 = B |
| | | if (reg0 == 0) goto loop |
| | | reg1 = reg0 & 0 |
| | | reg2 = [A+reg1] |
+
Here, thread 1 sets A, signaling thread 2. Thread 2 sees that and sets B to +signal thread 3. Thread 3 sees it and loads from A, using an address dependency +to ensure that the load of B and the load of A are observed in program +order.
+ +It’s possible for reg2 to hold zero at the end of this. The fact that a +store in thread 1 causes something to happen in thread 2 which causes something +to happen in thread 3 does not mean that thread 3 will observe the stores in +that order. (Inserting a load/store barrier in thread 2 fixes this.)
+ +Barriers come in different flavors for different situations. While there can +be performance advantages to using exactly the right barrier type, there are +code maintenance risks in doing so — unless the person updating the code +fully understands it, they might introduce the wrong type of operation and cause +a mysterious breakage. Because of this, and because ARM doesn’t provide a wide +variety of barrier choices, many atomic primitives use full +barrier instructions when a barrier is required.
+ +The key thing to remember about barriers is that they define ordering. Don’t +think of them as a “flush” call that causes a series of actions to happen. +Instead, think of them as a dividing line in time for operations on the current +CPU core.
+ + +Atomic operations guarantee that an operation that requires a series of steps +always behaves as if it were a single operation. For example, consider a +non-atomic increment (“++A”) executed on the same variable by two threads +simultaneously:
| Thread 1 | Thread 2 |
|---|---|
| reg = A | reg = A |
| reg = reg + 1 | reg = reg + 1 |
| A = reg | A = reg |
If the threads execute concurrently from top to bottom, both threads will +load 0 from A, increment it to 1, and store it back, leaving a final result of +1. If we used an atomic increment operation, you would be guaranteed that the +final result will be 2.
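As a point of reference, the C11 &lt;stdatomic.h&gt; interface (which postdates the primitives discussed in this article) expresses such an atomic increment directly; this is a minimal sketch:

#include &lt;stdatomic.h&gt;

atomic_int A;               /* initially zero */

void increment(void) {
    /* The load/add/store sequence executes as a single atomic step,
       so two concurrent callers always leave A == 2. */
    atomic_fetch_add(&A, 1);
}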
+ +The most fundamental operations — loading and storing 32-bit values +— are inherently atomic on ARM so long as the data is aligned on a 32-bit +boundary. For example:
| Thread 1 | Thread 2 |
|---|---|
| reg = 0x00000000 | reg = 0xffffffff |
| A = reg | A = reg |
The CPU guarantees that A will hold 0x00000000 or 0xffffffff. It will never +hold 0x0000ffff or any other partial “mix” of bytes.
+ +The atomicity guarantee is lost if the data isn’t aligned. Misaligned data +could straddle a cache line, so other cores could see the halves update +independently. Consequently, the ARMv7 documentation declares that it provides +“single-copy atomicity” for all byte accesses, halfword accesses to +halfword-aligned locations, and word accesses to word-aligned locations. +Doubleword (64-bit) accesses are not atomic, unless the +location is doubleword-aligned and special load/store instructions are used. +This behavior is important to understand when multiple threads are performing +unsynchronized updates to packed structures or arrays of primitive types.
+There is no need for 32-bit “atomic read” or “atomic write” functions on ARM +or x86. Where one is provided for completeness, it just does a trivial load or +store.
+ +Operations that perform more complex actions on data in memory are +collectively known as read-modify-write (RMW) instructions, because +they load data, modify it in some way, and write it back. CPUs vary widely in +how these are implemented. ARM uses a technique called “Load Linked / Store +Conditional”, or LL/SC.
+ +A linked or locked load reads the data from memory as +usual, but also establishes a reservation, tagging the physical memory address. +The reservation is cleared when another core tries to write to that address. To +perform an LL/SC, the data is read with a reservation, modified, and then a +conditional store instruction is used to try to write the data back. If the +reservation is still in place, the store succeeds; if not, the store will fail. +Atomic functions based on LL/SC usually loop, retrying the entire +read-modify-write sequence until it completes without interruption.
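The retry loop can be sketched in C. The load_linked() and store_conditional() functions below are hypothetical stand-ins for the ARM LDREX/STREX instructions, shown only to illustrate the shape of the loop; they are not real library calls:

/* Hypothetical wrappers around LDREX/STREX, for illustration only. */
extern int load_linked(int* addr);                /* read value, set reservation */
extern int store_conditional(int* addr, int val); /* 0 on success, nonzero if the
                                                     reservation was lost */

void atomic_increment(int* addr) {
    int old;
    do {
        old = load_linked(addr);                      /* load and tag the address */
    } while (store_conditional(addr, old + 1) != 0);  /* retry if interrupted */
}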
+It’s worth noting that the read-modify-write operations would not work +correctly if they operated on stale data. If two cores perform an atomic +increment on the same address, and one of them is not able to see what the other +did because each core is reading and writing from local cache, the operation +won’t actually be atomic. The CPU’s cache coherency rules ensure that the +atomic RMW operations remain atomic in an SMP environment.
+ +This should not be construed to mean that atomic RMW operations use a memory +barrier. On ARM, atomics have no memory barrier semantics. While a series of +atomic RMW operations on a single address will be observed in program order by +other cores, there are no guarantees when it comes to the ordering of atomic and +non-atomic operations.
+ +It often makes sense to pair barriers and atomic operations together. The +next section describes this in more detail.
+ +As usual, it’s useful to illuminate the discussion with an example. We’re +going to consider a basic mutual-exclusion primitive called a spin +lock. The idea is that a memory address (which we’ll call “lock”) +initially holds zero. When a thread wants to execute code in the critical +section, it sets the lock to 1, executes the critical code, and then changes it +back to zero when done. If another thread has already set the lock to 1, we sit +and spin until the lock changes back to zero.
+ +To make this work we use an atomic RMW primitive called +compare-and-swap. The function takes three arguments: the memory +address, the expected current value, and the new value. If the value currently +in memory matches what we expect, it is replaced with the new value, and the old +value is returned. If the current value is not what we expect, we don’t change +anything. A minor variation on this is called compare-and-set; instead +of returning the old value it returns a boolean indicating whether the swap +succeeded. For our needs either will work, but compare-and-set is slightly +simpler for examples, so we use it and just refer to it as “CAS”.
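The compare-and-set described here maps directly onto a C11 compare-exchange call. The sketch below shows one way the atomic_cas used in the next example could be expressed with it; relaxed ordering keeps it barrier-free, matching the separate full_memory_barrier call in the example:

#include &lt;stdatomic.h&gt;
#include &lt;stdbool.h&gt;

/* Compare-and-set: if *addr holds `expected`, store `newval` and return
   true; otherwise leave *addr unchanged and return false. */
static bool atomic_cas(atomic_int* addr, int expected, int newval) {
    return atomic_compare_exchange_strong_explicit(
        addr, &expected, newval,
        memory_order_relaxed, memory_order_relaxed);
}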
+ +The acquisition of the spin lock is written like this (using a C-like +language):
+ +do {
+ success = atomic_cas(&lock, 0, 1)
+} while (!success)
+
+full_memory_barrier()
+
+critical-section
+
+If no thread holds the lock, the lock value will be 0, and the CAS operation +will set it to 1 to indicate that we now have it. If another thread has it, the +lock value will be 1, and the CAS operation will fail because the expected +current value does not match the actual current value. We loop and retry. +(Note this loop is on top of whatever loop the LL/SC code might be doing inside +the atomic_cas function.)
+ +On SMP, a spin lock is a useful way to guard a small critical section. If we +know that another thread is going to execute a handful of instructions and then +release the lock, we can just burn a few cycles while we wait our turn. +However, if the other thread happens to be executing on the same core, we’re +just wasting time because the other thread can’t make progress until the OS +schedules it again (either by migrating it to a different core or by preempting +us). A proper spin lock implementation would optimistically spin a few times +and then fall back on an OS primitive (such as a Linux futex) that allows the +current thread to sleep while waiting for the other thread to finish up. On a +uniprocessor you never want to spin at all. For the sake of brevity we’re +ignoring all this.
+The memory barrier is necessary to ensure that other threads observe the +acquisition of the lock before they observe any loads or stores in the critical +section. Without that barrier, the memory accesses could be observed while the +lock is not held.
+ +The full_memory_barrier call here actually does
+two independent operations. First, it issues the CPU’s full
+barrier instruction. Second, it tells the compiler that it is not allowed to
+reorder code around the barrier. That way, we know that the
+atomic_cas call will be executed before anything in the critical
+section. Without this compiler reorder barrier, the compiler has a
+great deal of freedom in how it generates code, and the order of instructions in
+the compiled code might be much different from the order in the source code.
Of course, we also want to make sure that none of the memory accesses +performed in the critical section are observed after the lock is released. The +full version of the simple spin lock is:
+ +do {
+ success = atomic_cas(&lock, 0, 1) // acquire
+} while (!success)
+full_memory_barrier()
+
+critical-section
+
+full_memory_barrier()
+atomic_store(&lock, 0) // release
+
+We perform our second CPU/compiler memory barrier immediately +before we release the lock, so that loads and stores in the +critical section are observed before the release of the lock.
+ +As mentioned earlier, the atomic_store operation is a simple
+assignment on ARM and x86. Unlike the atomic RMW operations, we don’t guarantee
+that other threads will see this value immediately. This isn’t a problem,
+though, because we only need to keep the other threads out. The
+other threads will stay out until they observe the store of 0. If it takes a
+little while for them to observe it, the other threads will spin a little
+longer, but we will still execute code correctly.
It’s convenient to combine the atomic operation and the barrier call into a +single function. It also provides other advantages, which will become clear +shortly.
+ + +When acquiring the spinlock, we issue the atomic CAS and then the barrier. +When releasing the spinlock, we issue the barrier and then the atomic store. +This inspires a particular naming convention: operations followed by a barrier +are “acquiring” operations, while operations preceded by a barrier are +“releasing” operations. (It would be wise to install the spin lock example +firmly in mind, as the names are not otherwise intuitive.)
+ +Rewriting the spin lock example with this in mind:
+ +do {
+ success = atomic_acquire_cas(&lock, 0, 1)
+} while (!success)
+
+critical-section
+
+atomic_release_store(&lock, 0)
+
+This is a little more succinct and easier to read, but the real motivation +for doing this lies in a couple of optimizations we can now perform.
+ +First, consider atomic_release_store. We need to ensure that
+the store of zero to the lock word is observed after any loads or stores in the
+critical section above it. In other words, we need a load/store and store/store
+barrier. In an earlier section we learned that these aren’t necessary on x86
+SMP -- only store/load barriers are required. The implementation of
+atomic_release_store on x86 is therefore just a compiler reorder
+barrier followed by a simple store. No CPU barrier is required.
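A sketch of what that x86 implementation might look like in GCC-flavored C; the empty asm statement is the usual compiler reorder barrier, and the function name follows this article's convention rather than any real library:

/* x86 only: the hardware does not reorder a store with earlier loads or
   stores, so a compiler reorder barrier plus a plain store suffices. */
static void atomic_release_store(volatile int* addr, int value) {
    __asm__ __volatile__("" ::: "memory");  /* compiler barrier, no CPU barrier */
    *addr = value;
}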
The second optimization mostly applies to the compiler (although some CPUs, +such as the Itanium, can take advantage of it as well). The basic principle is +that code can move across acquire and release barriers, but only in one +direction.
+ +Suppose we have a mix of locally-visible and globally-visible memory +accesses, with some miscellaneous computation as well:
+ +local1 = arg1 / 41
+local2 = threadStruct->field2
+threadStruct->field3 = local2
+
+do {
+ success = atomic_acquire_cas(&lock, 0, 1)
+} while (!success)
+
+local5 = globalStruct->field5
+globalStruct->field6 = local5
+
+atomic_release_store(&lock, 0)
+
+Here we see two completely independent sets of operations. The first set +operates on a thread-local data structure, so we’re not concerned about clashes +with other threads. The second set operates on a global data structure, which +must be protected with a lock.
+ +A full compiler reorder barrier in the atomic ops will ensure that the +program order matches the source code order at the lock boundaries. However, +allowing the compiler to interleave instructions can improve performance. Loads +from memory can be slow, but the CPU can continue to execute instructions that +don’t require the result of that load while waiting for it to complete. The +code might execute more quickly if it were written like this instead:
+ +do {
+ success = atomic_acquire_cas(&lock, 0, 1)
+} while (!success)
+
+local2 = threadStruct->field2
+local5 = globalStruct->field5
+local1 = arg1 / 41
+threadStruct->field3 = local2
+globalStruct->field6 = local5
+
+atomic_release_store(&lock, 0)
+
+We issue both loads, do some unrelated computation, and then execute the +instructions that make use of the loads. If the integer division takes less +time than one of the loads, we essentially get it for free, since it happens +during a period where the CPU would have stalled waiting for a load to +complete.
+ +Note that all of the operations are now happening inside the +critical section. Since none of the “threadStruct” operations are visible +outside the current thread, nothing else can see them until we’re finished here, +so it doesn’t matter exactly when they happen.
+ +In general, it is always safe to move operations into a +critical section, but never safe to move operations out of a +critical section. Put another way, you can migrate code “downward” across an +acquire barrier, and “upward” across a release barrier. If the atomic ops used +a full barrier, this sort of migration would not be possible.
Returning to an earlier point, we can state that on x86 all loads are acquiring loads, and all stores are releasing stores. As a result:

- store/store barriers are unnecessary (stores are not reordered with other stores).
- load/load barriers are unnecessary (loads are not reordered with other loads).
- load/store barriers are unnecessary (a load followed by a store is observed in that order).

Hence, you only need store/load barriers on x86 SMP.
+ +Labeling atomic operations with “acquire” or “release” describes not only +whether the barrier is executed before or after the atomic operation, but also +how the compiler is allowed to reorder code.
+ +Debugging memory consistency problems can be very difficult. If a missing +memory barrier causes some code to read stale data, you may not be able to +figure out why by examining memory dumps with a debugger. By the time you can +issue a debugger query, the CPU cores will have all observed the full set of +accesses, and the contents of memory and the CPU registers will appear to be in +an “impossible” state.
+ +Here we present some examples of incorrect code, along with simple ways to +fix them. Before we do that, we need to discuss the use of a basic language +feature.
+ +When writing single-threaded code, declaring a variable “volatile” can be +very useful. The compiler will not omit or reorder accesses to volatile +locations. Combine that with the sequential consistency provided by the +hardware, and you’re guaranteed that the loads and stores will appear to happen +in the expected order.
+ +However, accesses to volatile storage may be reordered with non-volatile +accesses, so you have to be careful in multi-threaded uniprocessor environments +(explicit compiler reorder barriers may be required). There are no atomicity +guarantees, and no memory barrier provisions, so “volatile” doesn’t help you at +all in multi-threaded SMP environments. The C and C++ language standards are +being updated to address this with built-in atomic operations.
+ +If you think you need to declare something “volatile”, that is a strong +indicator that you should be using one of the atomic operations instead.
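Those built-in atomic operations have since been standardized as C11's &lt;stdatomic.h&gt; (and C++11's &lt;atomic&gt;). As a small illustration (a sketch, with illustrative names), a flag that might be tempting to declare volatile can instead be declared atomic:

#include &lt;stdatomic.h&gt;
#include &lt;stdbool.h&gt;

/* Instead of "volatile bool done;" */
atomic_bool done;

void finish(void) {
    atomic_store(&done, true);   /* sequentially-consistent store */
}

bool is_done(void) {
    return atomic_load(&done);   /* sequentially-consistent load */
}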
+ +In most cases you’d be better off with a synchronization primitive (like a +pthread mutex) rather than an atomic operation, but we will employ the latter to +illustrate how they would be used in a practical situation.
+ +For the sake of brevity we’re ignoring the effects of compiler optimizations +here — some of this code is broken even on uniprocessors — so for +all of these examples you must assume that the compiler generates +straightforward code (for example, compiled with gcc -O0). The fixes presented here do +solve both compiler-reordering and memory-access-ordering issues, but we’re only +going to discuss the latter.
+ +MyThing* gGlobalThing = NULL;
+
+void initGlobalThing() // runs in thread 1
+{
+    MyThing* thing = malloc(sizeof(*thing));
+ memset(thing, 0, sizeof(*thing));
+ thing->x = 5;
+ thing->y = 10;
+ /* initialization complete, publish */
+ gGlobalThing = thing;
+}
+
+void useGlobalThing() // runs in thread 2
+{
+ if (gGlobalThing != NULL) {
+ int i = gGlobalThing->x; // could be 5, 0, or uninitialized data
+ ...
+ }
+}
+
+The idea here is that we allocate a structure, initialize its fields, and at +the very end we “publish” it by storing it in a global variable. At that point, +any other thread can see it, but that’s fine since it’s fully initialized, +right? At least, it would be on x86 SMP or a uniprocessor (again, making the +erroneous assumption that the compiler outputs code exactly as we have it in the +source).
+ +Without a memory barrier, the store to gGlobalThing could be observed before
+the fields are initialized on ARM. Another thread reading from thing->x could
+see 5, 0, or even uninitialized data.
This can be fixed by changing the last assignment to:
+ +atomic_release_store(&gGlobalThing, thing);+ +
That ensures that all other threads will observe the writes in the proper
+order, but what about reads? In this case we should be okay on ARM, because the
+address dependency rules will ensure that any loads from an offset of
+gGlobalThing are observed after the load of
+gGlobalThing. However, it’s unwise to rely on architectural
+details, since it means your code will be very subtly unportable. The complete
+fix also requires a barrier after the load:
MyThing* thing = atomic_acquire_load(&gGlobalThing); + int i = thing->x;+ +
Now we know the ordering will be correct. This may seem like an awkward way +to write code, and it is, but that’s the price you pay for accessing data +structures from multiple threads without using locks. Besides, address +dependencies won’t always save us:
+ +MyThing gGlobalThing;
+
+void initGlobalThing() // runs in thread 1
+{
+ gGlobalThing.x = 5;
+ gGlobalThing.y = 10;
+ /* initialization complete */
+ gGlobalThing.initialized = true;
+}
+
+void useGlobalThing() // runs in thread 2
+{
+ if (gGlobalThing.initialized) {
+ int i = gGlobalThing.x; // could be 5 or 0
+ }
+}
+
+Because there is no relationship between the initialized field and the
+others, the reads and writes can be observed out of order. (Note global data is
+initialized to zero by the OS, so it shouldn’t be possible to read “random”
+uninitialized data.)
We need to replace the store with:
+atomic_release_store(&gGlobalThing.initialized, true);+ +
and replace the load with:
+int initialized = atomic_acquire_load(&gGlobalThing.initialized);+ +
Another example of the same problem occurs when implementing +reference-counted data structures. The reference count itself will be +consistent so long as atomic increment and decrement operations are used, but +you can still run into trouble at the edges, for example:
+ +void RefCounted::release()
+{
+ int oldCount = atomic_dec(&mRefCount);
+ if (oldCount == 1) { // was decremented to zero
+ recycleStorage();
+ }
+}
+
+void useSharedThing(RefCountedThing sharedThing)
+{
+ int localVar = sharedThing->x;
+ sharedThing->release();
+ sharedThing = NULL; // can’t use this pointer any more
+ doStuff(localVar); // value of localVar might be wrong
+}
+
+The release() call decrements the reference count using a
+barrier-free atomic decrement operation. Because this is an atomic RMW
+operation, we know that it will work correctly. If the reference count goes to
+zero, we recycle the storage.
The useSharedThing() function extracts what it needs from
+sharedThing and then releases its copy. However, because we didn’t
+use a memory barrier, and atomic and non-atomic operations can be reordered,
+it’s possible for other threads to observe the read of
+sharedThing->x after they observe the recycle
+operation. It’s therefore possible for localVar to hold a value
+from "recycled" memory, for example a new object created in the same
+location by another thread after release() is called.
This can be fixed by replacing the call to atomic_dec() with
+atomic_release_dec(). The barrier ensures that the reads from
+sharedThing are observed before we recycle the object.
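As a point of comparison, the same pattern is often written with C11 atomics as a release decrement followed by an acquire fence before the storage is recycled. This is a sketch, with the example's class rendered as a plain struct and illustrative names:

#include &lt;stdatomic.h&gt;
#include &lt;stdlib.h&gt;

typedef struct {
    atomic_int refCount;
    int x;
} RefCountedThing;

void release(RefCountedThing* thing) {
    /* Release ordering: our earlier reads of thing->x are observed
       before other threads observe the decrement. */
    int oldCount = atomic_fetch_sub_explicit(&thing->refCount, 1,
                                             memory_order_release);
    if (oldCount == 1) {          /* we just dropped the last reference */
        /* Acquire fence: pairs with the release decrements performed by
           other threads, so their accesses complete before the free. */
        atomic_thread_fence(memory_order_acquire);
        free(thing);
    }
}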
In most cases the above won’t actually fail, because the “recycle” function
+is likely guarded by functions that themselves employ barriers (libc heap
+free()/delete(), or an object pool guarded by a
+mutex). If the recycle function used a lock-free algorithm implemented without
+barriers, however, the above code could fail on ARM SMP.
We haven’t discussed some relevant Java language features, so we’ll take a +quick look at those first.
+ +The “synchronized” keyword provides the Java language’s in-built locking +mechanism. Every object has an associated “monitor” that can be used to provide +mutually exclusive access.
+ +The implementation of the “synchronized” block has the same basic structure +as the spin lock example: it begins with an acquiring CAS, and ends with a +releasing store. This means that compilers and code optimizers are free to +migrate code into a “synchronized” block. One practical consequence: you must +not conclude that code inside a synchronized block happens +after the stuff above it or before the stuff below it in a function. Going +further, if a method has two synchronized blocks that lock the same object, and +there are no operations in the intervening code that are observable by another +thread, the compiler may perform “lock coarsening” and combine them into a +single block.
+ +The other relevant keyword is “volatile”. As defined in the specification +for Java 1.4 and earlier, a volatile declaration was about as weak as its C +counterpart. The spec for Java 1.5 was updated to provide stronger guarantees, +almost to the level of monitor synchronization.
The effects of volatile accesses can be illustrated with an example. If thread 1 writes to a volatile field, and thread 2 subsequently reads from that same field, then thread 2 is guaranteed to see that write and all writes previously made by thread 1. More generally, the writes made by any thread up to the point where it writes the field will be visible to thread 2 when it does the read. In effect, writing to a volatile is like a monitor release, and reading from a volatile is like a monitor acquire.

Non-volatile accesses may be reordered with respect to volatile accesses in the usual ways, for example the compiler could move a non-volatile load or store “above” a volatile store, but couldn’t move it “below”. Volatile accesses may not be reordered with respect to each other. The VM takes care of issuing the appropriate memory barriers.
+ +It should be mentioned that, while loads and stores of object references and
+most primitive types are atomic, long and double
+fields are not accessed atomically unless they are marked as volatile.
+Multi-threaded updates to non-volatile 64-bit fields are problematic even on
+uniprocessors.
Here’s a simple, incorrect implementation of a monotonic counter: (Java +theory and practice: Managing volatility).
+ +class Counter {
+ private int mValue;
+
+ public int get() {
+ return mValue;
+ }
+ public void incr() {
+ mValue++;
+ }
+}
+
+Assume get() and incr() are called from multiple
+threads, and we want to be sure that every thread sees the current count when
+get() is called. The most glaring problem is that
+mValue++ is actually three operations:
1. reg = mValue
2. reg = reg + 1
3. mValue = reg

If two threads execute incr() simultaneously, one of the
+updates could be lost. To make the increment atomic, we need to declare
+incr() “synchronized”. With this change, the code will run
+correctly in multi-threaded uniprocessor environments.
It’s still broken on SMP, however. Different threads might see different
+results from get(), because we’re reading the value with an ordinary load. We
+can correct the problem by declaring get() to be synchronized.
+With this change, the code is obviously correct.
Unfortunately, we’ve introduced the possibility of lock contention, which
+could hamper performance. Instead of declaring get() to be
+synchronized, we could declare mValue with “volatile”. (Note
incr() must still use synchronized.) Now we know that
+the volatile write to mValue will be visible to any subsequent volatile read of
+mValue. incr() will be slightly slower, but
+get() will be faster, so even in the absence of contention this is
+a win if reads outnumber writes. (See also {@link
+java.util.concurrent.atomic.AtomicInteger}.)
Here’s another example, similar in form to the earlier C examples:
+ +class MyGoodies {
+ public int x, y;
+}
+class MyClass {
+ static MyGoodies sGoodies;
+
+ void initGoodies() { // runs in thread 1
+ MyGoodies goods = new MyGoodies();
+ goods.x = 5;
+ goods.y = 10;
+ sGoodies = goods;
+ }
+
+ void useGoodies() { // runs in thread 2
+ if (sGoodies != null) {
+ int i = sGoodies.x; // could be 5 or 0
+ ....
+ }
+ }
+}
+
+This has the same problem as the C code, namely that the assignment
+sGoodies = goods might be observed before the initialization of the
+fields in goods. If you declare sGoodies with the
+volatile keyword, you can think about the loads as if they were
+atomic_acquire_load() calls, and the stores as if they were
+atomic_release_store() calls.
(Note that only the sGoodies reference itself is volatile. The
+accesses to the fields inside it are not. The statement z =
+sGoodies.x will perform a volatile load of MyClass.sGoodies
+followed by a non-volatile load of sGoodies.x. If you make a local
+reference MyGoodies localGoods = sGoodies, z =
+localGoods.x will not perform any volatile loads.)
A more common idiom in Java programming is the infamous “double-checked +locking”:
+ +class MyClass {
+ private Helper helper = null;
+
+ public Helper getHelper() {
+ if (helper == null) {
+ synchronized (this) {
+ if (helper == null) {
+ helper = new Helper();
+ }
+ }
+ }
+ return helper;
+ }
+}
+
+The idea is that we want to have a single instance of a Helper
+object associated with an instance of MyClass. We must only create
+it once, so we create and return it through a dedicated getHelper()
+function. To avoid a race in which two threads create the instance, we need to
+synchronize the object creation. However, we don’t want to pay the overhead for
+the “synchronized” block on every call, so we only do that part if
+helper is currently null.
This doesn’t work correctly on uniprocessor systems, unless you’re using a traditional Java source compiler and an interpreter-only VM. Once you add fancy code optimizers and JIT compilers it breaks down. See the “‘Double Checked Locking is Broken’ Declaration” link in the appendix for more details, or Item 71 (“Use lazy initialization judiciously”) in Josh Bloch’s Effective Java, 2nd Edition.
+ +Running this on an SMP system introduces an additional way to fail. Consider
+the same code rewritten slightly, as if it were compiled into a C-like language
+(I’ve added a couple of integer fields to represent Helper’s
+constructor activity):
if (helper == null) {
+ // acquire monitor using spinlock
+ while (atomic_acquire_cas(&this.lock, 0, 1) != success)
+ ;
+ if (helper == null) {
+ newHelper = malloc(sizeof(Helper));
+ newHelper->x = 5;
+ newHelper->y = 10;
+ helper = newHelper;
+ }
+ atomic_release_store(&this.lock, 0);
+}
+
+Now the problem should be obvious: the store to helper is
+happening before the memory barrier, which means another thread could observe
+the non-null value of helper before the stores to the
+x/y fields.
You could try to ensure that the store to helper happens after
+the atomic_release_store() on this.lock by rearranging
+the code, but that won’t help, because it’s okay to migrate code upward —
+the compiler could move the assignment back above the
+atomic_release_store() to its original position.
There are two ways to fix this:

1. Do the simple thing and delete the outer check. This ensures that we never examine the value of helper outside a synchronized block.
2. Declare helper volatile. With this one small change, the code in Example J-3 will work correctly on Java 1.5 and later. (You may want to take a minute to convince yourself that this is true.)
+a minute to convince yourself that this is true.)This next example illustrates two important issues when using volatile:
+ +class MyClass {
+ int data1, data2;
+ volatile int vol1, vol2;
+
+ void setValues() { // runs in thread 1
+ data1 = 1;
+ vol1 = 2;
+ data2 = 3;
+ }
+
+ void useValues1() { // runs in thread 2
+ if (vol1 == 2) {
+ int l1 = data1; // okay
+ int l2 = data2; // wrong
+ }
+ }
+ void useValues2() { // runs in thread 2
+ int dummy = vol2;
+ int l1 = data1; // wrong
+ int l2 = data2; // wrong
+ }
+
+Looking at useValues1(), if thread 2 hasn’t yet observed the
+update to vol1, then it can’t know if data1 or
+data2 has been set yet. Once it sees the update to
+vol1, it knows that the change to data1 is also
+visible, because that was made before vol1 was changed. However,
+it can’t make any assumptions about data2, because that store was
+performed after the volatile store.
The code in useValues2() uses a second volatile field,
+vol2, in an attempt to force the VM to generate a memory barrier.
+This doesn’t generally work. To establish a proper “happens-before”
+relationship, both threads need to be interacting with the same volatile field.
+You’d have to know that vol2 was set after data1/data2
+in thread 1. (The fact that this doesn’t work is probably obvious from looking
+at the code; the caution here is against trying to cleverly “cause” a memory
+barrier instead of creating an ordered series of accesses.)
In C/C++, use the pthread operations, like mutexes and
+semaphores. These include the proper memory barriers, providing correct and
+efficient behavior on all Android platform versions. Be sure to use them
+correctly, for example be wary of signaling a condition variable without holding the
+corresponding mutex.
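For example, the usual condition-variable pattern changes the shared state and signals while holding the mutex, and re-checks the predicate in a loop on the waiting side. A minimal sketch (the flag name is illustrative):

#include &lt;pthread.h&gt;
#include &lt;stdbool.h&gt;

static pthread_mutex_t gLock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gCond = PTHREAD_COND_INITIALIZER;
static bool gReady = false;

void announce(void) {
    pthread_mutex_lock(&gLock);
    gReady = true;                   /* update the state under the lock */
    pthread_cond_signal(&gCond);     /* signal while holding the mutex  */
    pthread_mutex_unlock(&gLock);
}

void await_ready(void) {
    pthread_mutex_lock(&gLock);
    while (!gReady)                  /* loop: wakeups can be spurious */
        pthread_cond_wait(&gCond, &gLock);
    pthread_mutex_unlock(&gLock);
}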
It's best to avoid using atomic functions directly. Locking and +unlocking a pthread mutex require a single atomic operation each if there’s no +contention, so you’re not going to save much by replacing mutex calls with +atomic ops. If you need a lock-free design, you must fully understand the +concepts in this entire document before you begin (or, better yet, find an +existing code library that is known to be correct on SMP ARM).
+ +Be extremely circumspect with "volatile” in C/C++. It often indicates a +concurrency problem waiting to happen.
+ +In Java, the best answer is usually to use an appropriate utility class from +the {@link java.util.concurrent} package. The code is well written and well +tested on SMP.
Perhaps the safest thing you can do is make your class immutable. Objects from classes like String and Integer hold data that cannot be changed once the object is created, avoiding all synchronization issues. The book Effective Java, 2nd Ed. has specific instructions in “Item 15: Minimize Mutability”. Note in particular the importance of declaring fields “final” (Bloch).
+ +If neither of these options is viable, the Java “synchronized” statement +should be used to guard any field that can be accessed by more than one thread. +If mutexes won’t work for your situation, you should declare shared fields +“volatile”, but you must take great care to understand the interactions between +threads. The volatile declaration won’t save you from common concurrent +programming mistakes, but it will help you avoid the mysterious failures +associated with optimizing compilers and SMP mishaps.
The Java Memory Model guarantees that assignments to final fields are visible to all threads once the constructor has finished — this is what ensures proper synchronization of fields in immutable classes. This guarantee does not hold if a partially-constructed object is allowed to become visible to other threads. It is necessary to follow safe construction practices (Safe Construction Techniques in Java).
+ +The pthread library and VM make a couple of useful guarantees: all accesses
+previously performed by a thread that creates a new thread are observable by
+that new thread as soon as it starts, and all accesses performed by a thread
+that is exiting are observable when a join() on that thread
+returns. This means you don’t need any additional synchronization when
+preparing data for a new thread or examining the results of a joined thread.
Whether or not these guarantees apply to interactions with pooled threads +depends on the thread pool implementation.
+ +In C/C++, the pthread library guarantees that any accesses made by a thread
+before it unlocks a mutex will be observable by another thread after it locks
+that same mutex. It also guarantees that any accesses made before calling
+signal() or broadcast() on a condition variable will
+be observable by the woken thread.
Java language threads and monitors make similar guarantees for the comparable +operations.
+ +The C and C++ language standards are evolving to include a sophisticated +collection of atomic operations. A full matrix of calls for common data types +is defined, with selectable memory barrier semantics (choose from relaxed, +consume, acquire, release, acq_rel, seq_cst).
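For instance, the publication example from the earlier C discussion can be written against that interface with explicit memory-order arguments. This sketch is illustrative, not part of the original examples:

#include &lt;stdatomic.h&gt;
#include &lt;stdlib.h&gt;

typedef struct { int x, y; } MyThing;

static _Atomic(MyThing*) gGlobalThing;   /* NULL at program start */

void initGlobalThing(void) {             /* runs in thread 1 */
    MyThing* thing = malloc(sizeof(*thing));
    thing->x = 5;
    thing->y = 10;
    /* plays the role of atomic_release_store() */
    atomic_store_explicit(&gGlobalThing, thing, memory_order_release);
}

void useGlobalThing(void) {              /* runs in thread 2 */
    /* plays the role of atomic_acquire_load() */
    MyThing* thing = atomic_load_explicit(&gGlobalThing, memory_order_acquire);
    if (thing != NULL) {
        int i = thing->x;                /* guaranteed to see 5 */
        (void)i;
    }
}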
+ +See the Further Reading section for pointers to the +specifications.
+ + +While this document does more than merely scratch the surface, it doesn’t +manage more than a shallow gouge. This is a very broad and deep topic. Some +areas for further exploration:
+ +@ThreadSafe and
+@GuardedBy (from net.jcip.annotations).The Further Reading section in the appendix has links to +documents and web sites that will better illuminate these topics.
+ +This document describes a lot of “weird” things that can, in theory, happen. +If you’re not convinced that these issues are real, a practical example may be +useful.
+ +Bill Pugh’s Java memory model web site has a few +test programs on it. One interesting test is ReadAfterWrite.java, which does +the following:
| Thread 1 | Thread 2 |
|---|---|
| for (int i = 0; i < ITERATIONS; i++) { | for (int i = 0; i < ITERATIONS; i++) { |
| a = i + 1; | b = i + 1; |
| BB[i] = b; | AA[i] = a; |
| } | } |
Where a and b are declared as volatile
+int fields, and AA and BB are ordinary
+integer arrays.
+
+
This is trying to determine if the VM ensures that, after a value is written +to a volatile, the next read from that volatile sees the new value. The test +code executes these loops a million or so times, and then runs through afterward +and searches the results for inconsistencies.
At the end of execution, AA and BB will be full of
+gradually-increasing integers. The threads will not run side-by-side in a
+predictable way, but we can assert a relationship between the array contents.
+For example, consider this execution fragment:
| Thread 1 | Thread 2 |
|---|---|
| (initially a == 1534) | (initially b == 165) |
| a = 1535 | b = 166 |
| BB[1535] = 165 | AA[166] = 1536 |
| a = 1536 | b = 167 |
| BB[1536] = 165 | AA[167] = 1536 |
| a = 1537 | ... |
| BB[1537] = 167 | |
| ... | |
(This is written as if the threads were taking turns executing so that it’s +more obvious when results from one thread should be visible to the other, but in +practice that won’t be the case.)
+ +Look at the assignment of AA[166] in thread 2. We are capturing
+the fact that, at the point where thread 2 was on iteration 166, it can see that
+thread 1 was on iteration 1536. If we look one step in the future, at thread
+1’s iteration 1537, we expect to see that thread 1 saw that thread 2 was at
+iteration 166 (or later). BB[1537] holds 167, so it appears things
+are working.
Now suppose we fail to observe a volatile write to b:
| Thread 1 | Thread 2 |
|---|---|
| (initially a == 1534) | (initially b == 165) |
| a = 1535 | b = 166 |
| BB[1535] = 165 | AA[166] = 1536 |
| a = 1536 | b = 167 |
| BB[1536] = 165 | AA[167] = 1536 |
| a = 1537 | ... |
| BB[1537] = 165 | |
| ... | |
+
Now, BB[1537] holds 165, a smaller value than we expected, so we
+know we have a problem. Put succinctly, for i=166, BB[AA[i]+1] < i. (This also
+catches failures by thread 2 to observe writes to a, for example if we
+miss an update and assign AA[166] = 1535, we will get
+BB[AA[166]+1] == 165.)
If you run the test program under Dalvik (Android 3.0 “Honeycomb” or later)
+on an SMP ARM device, it will never fail. If you remove the word “volatile”
+from the declarations of a and b, it will consistently
+fail. The program is testing to see if the VM is providing sequentially
+consistent ordering for accesses to a and b, so you
+will only see correct behavior when the variables are volatile. (It will also
+succeed if you run the code on a uniprocessor device, or run it while something
+else is using enough CPU that the kernel doesn’t schedule the test threads on
+separate cores.)
If you run the modified test a few times you will note that it doesn’t fail +in the same place every time. The test fails consistently because it performs +the operations a million times, and it only needs to see out-of-order accesses +once. In practice, failures will be infrequent and difficult to locate. This +test program could very well succeed on a broken VM if things just happen to +work out.
+ +(This isn’t something most programmers will find themselves implementing, +but the discussion is illuminating.)
+ +Consider once again volatile accesses in Java. Earlier we made reference to +their similarities with acquiring loads and releasing stores, which works as a +starting point but doesn’t tell the full story.
+ +We start with a fragment of Dekker’s algorithm. Initially both
+flag1 and flag2 are false:
| Thread 1 | Thread 2 |
|---|---|
| flag1 = true | flag2 = true |
| if (flag2 == false) critical-stuff | if (flag1 == false) critical-stuff |
flag1 and flag2 are declared as volatile boolean
+fields. The rules for acquiring loads and releasing stores would allow the
+accesses in each thread to be reordered, breaking the algorithm. Fortunately,
the JMM has a few things to say here. Informally:

- A write to a volatile field happens-before every subsequent read of that same field.
- Volatile accesses may not be reordered with respect to each other.
+Taken together, these rules say that the volatile accesses in our example +must be observable in program order by all threads. Thus, we will never see +these threads executing the “critical-stuff” simultaneously.
+ +Another way to think about this is in terms of data races. A data race
+occurs if two accesses to the same memory location by different threads are not
+ordered, at least one of them stores to the memory location, and at least one of
+them is not a synchronization action (Boehm and McKenney). The memory model
+declares that a program free of data races must behave as if executed by a
+sequentially-consistent machine. Because both flag1 and
+flag2 are volatile, and volatile accesses are considered
+synchronization actions, there are no data races and this code must execute in a
+sequentially consistent manner.
As we saw in an earlier section, we need to insert a store/load barrier +between the two operations. The code executed in the VM for a volatile access +will look something like this:
| volatile load | volatile store |
|---|---|
| reg = A | store/store barrier |
| load/load + load/store barrier | A = reg |
| | store/load barrier |
The volatile load is just an acquiring load. The volatile store is similar +to a releasing store, but we’ve omitted load/store from the pre-store barrier, +and added a store/load barrier afterward.
+ +What we’re really trying to guarantee, though, is that (using thread 1 as an +example) the write to flag1 is observed before the read of flag2. We could +issue the store/load barrier before the volatile load instead and get the same +result, but because loads tend to outnumber stores it’s best to associate it +with the store.
+ +On some architectures, it’s possible to implement volatile stores with an +atomic operation and skip the explicit store/load barrier. On x86, for example, +atomics provide a full barrier. The ARM LL/SC operations don’t include a +barrier, so for ARM we must use explicit barriers.
+ +(Much of this is due to Doug Lea and his “JSR-133 Cookbook for Compiler +Writers” page.)
+ +Web pages and documents that provide greater depth or breadth. The more generally useful articles are nearer the top of the list.
- The {@link java.util.concurrent} package documentation. Near the bottom of the page is a section entitled “Memory Consistency Properties” that explains the guarantees made by the various classes.
+