
The x86 Memory Model

Often I’ve found myself wanting to point someone to a description of the x86’s memory model, but there wasn’t any that quite laid it out the way I wanted. So this is my take on how shared memory works on multiprocessor x86 systems. The guts of this description are adapted/copied from “A Better x86 Memory Model: x86-TSO” by Scott Owens, Susmit Sarkar, and Peter Sewell; this presentation strips away most of the math and recasts the model in a more operational style. Any mistakes are almost certainly mine and not theirs.

Components of the System

There is a memory subsystem that supports the following operations: store, load, fence, lock, unlock. The memory subsystem contains the following:

  1. Memory: A map from addresses to values
  2. Write buffers: Per-processor lists of (address, value) pairs; these are pending writes, waiting to be sent to memory
  3. “The Lock”: Which processor holds the lock, or None if it is not held. Roughly speaking, while the lock is held, only the processor that holds it can perform memory operations.

There is a set of processors that execute instructions in program order, dispatching commands to the memory subsystem when they need to do memory operations. Atomic instructions are implemented by taking “the lock”, doing whatever reads and writes are necessary, and then dropping “the lock”. We abstract away from this.

Definitions

A processor is “not blocked” if either the lock is unheld or it holds the lock.

Memory System Operation

Processors issue commands to the memory subsystem. The subsystem loops, processing commands; each iteration it can pick the command issued by any of the processors to execute. (Each processor has at most one outstanding command.) Some of the commands issued by processors may not be eligible to execute because their preconditions do not hold. A sketch of this machine as code follows the list.

  1. If a processor p wants to read from address a and p is not blocked:
    a. If there are no pending writes to a in p’s write buffer, return the value from memory
    b. If there is a pending write to a in p’s write buffer, return the most recent value in the write buffer
  2. If a processor p wants to write value v to address a, add (a, v) to the back of p’s write buffer
  3. At any time, if a processor p is not blocked, the memory subsystem can remove the oldest entry (a, v) from p’s write buffer and update memory so that a maps to v
  4. If a processor p wants to issue a barrier:
    a. If the barrier is an MFENCE, p’s write buffer must be empty
    b. If the barrier is an LFENCE/SFENCE, there are no preconditions; these are no-ops **
  5. If a processor p wants to lock the lock, the lock must not be held and p’s write buffer must be empty; the lock is set to be p
  6. If a processor p wants to unlock the lock, the lock must be held by p and p’s write buffer must be empty; the lock is set to be None
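
To make the rules concrete, here is a minimal sketch of this abstract machine as C++ (C++17). The types and names are my own invention, not from the x86-TSO paper, and MFENCE is modeled by draining the buffer, which amounts to the same thing as waiting for it to be empty:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <optional>
#include <unordered_map>
#include <utility>

using Addr = std::uintptr_t;
using Val  = std::uint64_t;
using Proc = int;

struct TSOMachine {
    std::unordered_map<Addr, Val> memory;   // 1. Memory (absent entries read as 0)
    std::unordered_map<Proc, std::deque<std::pair<Addr, Val>>> buffers;  // 2. Write buffers
    std::optional<Proc> lock_holder;        // 3. "The Lock" (empty = not held)

    // A processor is "not blocked" if the lock is unheld or it holds the lock.
    bool not_blocked(Proc p) const {
        return !lock_holder.has_value() || *lock_holder == p;
    }

    // Rule 1: a load returns the newest buffered write to a, if any, else memory.
    Val load(Proc p, Addr a) {
        assert(not_blocked(p));
        auto& buf = buffers[p];
        for (auto it = buf.rbegin(); it != buf.rend(); ++it)
            if (it->first == a) return it->second;
        return memory[a];
    }

    // Rule 2: a store just appends to the back of p's write buffer.
    void store(Proc p, Addr a, Val v) {
        buffers[p].emplace_back(a, v);
    }

    // Rule 3: at any time, the memory subsystem may flush the oldest entry
    // of an unblocked processor's write buffer into memory.
    bool flush_one(Proc p) {
        if (!not_blocked(p) || buffers[p].empty()) return false;
        auto [a, v] = buffers[p].front();
        buffers[p].pop_front();
        memory[a] = v;
        return true;
    }

    // Rule 4a: MFENCE cannot execute until p's write buffer is empty.
    void mfence(Proc p) {
        assert(not_blocked(p));
        while (!buffers[p].empty()) flush_one(p);
    }

    // Rule 5: lock requires the lock unheld and p's write buffer empty.
    void lock(Proc p) {
        assert(!lock_holder.has_value());
        mfence(p);
        lock_holder = p;
    }

    // Rule 6: unlock requires p to hold the lock and its buffer to be empty.
    void unlock(Proc p) {
        assert(lock_holder == p);
        mfence(p);
        lock_holder.reset();
    }
};
```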

Remarks

So, the only funny business that can happen is that a load can execute before a prior store to a different location has been flushed from the write buffer into memory. This means that if CPU0 executes “x = 1; r0 = y” and CPU1 executes “y = 1; r1 = x”, with x and y both initially zero, we can get “r0 == r1 == 0”.
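
For concreteness, here is the same store-buffering litmus test written with C++11 relaxed atomics (a sketch; any single run may or may not exhibit the reordering):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r0, r1;

int main() {
    std::thread cpu0([] {
        x.store(1, std::memory_order_relaxed);   // x = 1
        r0 = y.load(std::memory_order_relaxed);  // r0 = y
    });
    std::thread cpu1([] {
        y.store(1, std::memory_order_relaxed);   // y = 1
        r1 = x.load(std::memory_order_relaxed);  // r1 = x
    });
    cpu0.join();
    cpu1.join();
    // On x86, r0 == 0 && r1 == 0 is a permitted (and observable) outcome:
    // each load can execute before the other CPU's store has left its
    // write buffer.
    std::printf("r0 = %d, r1 = %d\n", r0, r1);
}
```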

The common intuition that atomic instructions act like there is an MFENCE before and after them is basically right: MFENCE requires the write buffer to be empty before it can execute, and so do lock and unlock.

x86 is a pleasure to compile atomics code for. The “release” and “acquire” operations in the C++11 memory model don’t require any fencing to work. Neither do the notions of “execution order” and “visibility order” in my advisor’s and my RMC memory model.
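
As a sketch of what that means in practice: both the release store and the acquire load below compile to ordinary MOV instructions on x86 (with, e.g., gcc or clang at -O2); no fence instruction is emitted:

```cpp
#include <atomic>

std::atomic<int> flag{0};
int data;

void producer() {
    data = 42;                                  // plain write
    flag.store(1, std::memory_order_release);   // a plain MOV on x86
}

int consumer() {
    while (flag.load(std::memory_order_acquire) == 0) {}  // a plain MOV on x86
    return data;                                // guaranteed to see 42
}
```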

** The story about LFENCE/SFENCE is a little complicated. Some sources insist that they actually do things. The Cambridge model models them as no-ops. The guarantees that they are documented to provide are just true all the time, though. I think they are useful when using non-temporal memory accesses (which I’ve never done), but not in general.


Why We Fight

Why We Fight, or

Why Your Language Needs A (Good) Memory Model, or

The Tragedy Of memory_order_consume’s Unimplementability

This, one of the most terrifying technical documents I’ve ever read, is why we fight: https://www.kernel.org/doc/Documentation/RCU/rcu_dereference.txt.

Background

For background, RCU is a mechanism used heavily in the Linux kernel for locking around read-mostly data structures; that is, data structures that are read frequently but modified fairly infrequently. It is a scheme that allows for blazingly fast read-side critical sections (no atomic operations, no memory barriers, not even any writing to cache lines that other CPUs may write to) at the expense of write-side critical sections being quite expensive.

The catch is that writers might be modifying the data structure as readers access it: writers are allowed to modify the data structure (often a linked list) as long as they do not free any removed memory until it is “safe”. Since writers can be modifying data structures as readers are reading from them, without any synchronization between them, we are now in danger of running afoul of memory reordering. In particular, if a writer initializes some structure (say, a routing table entry) and adds it to an RCU-protected linked list, it is important that any reader that sees that the entry has been added to the list also sees the writes that initialized the entry! While this will always be the case on the well-behaved x86 processor, architectures like ARM and POWER don’t provide this guarantee.

The simple solution to make the memory order work out is to add barriers on both sides on platforms where it is needed: after initializing the object but before adding it to the list, and after reading a pointer from the list but before accessing its members (including the next pointer). This cost is totally acceptable on the write-side, but is probably more than we are willing to pay on the read-side. Fortunately, we have an out: essentially all architectures (except for the notoriously poorly behaved Alpha) will not reorder instructions that have a data dependency between them. This means that we can get away with only issuing a barrier on the write-side and taking advantage of the data dependency on the read-side (between loading a pointer to an entry and reading fields out of that entry). In Linux this is implemented with the macros “rcu_assign_pointer” (which issues a barrier if necessary, and then writes the pointer) on the write-side and “rcu_dereference” (which reads the value and then issues a barrier on Alpha) on the read-side.
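
Here is a userspace sketch of that pattern, using C++11 atomics as stand-ins for the kernel’s rcu_assign_pointer and rcu_dereference macros (a release store and a consume load, respectively); this is my approximation of the pattern, not the kernel’s implementation:

```cpp
#include <atomic>

struct entry {
    int val;
    entry* next;
};

std::atomic<entry*> head{nullptr};

// Write side: initialize the entry, then publish it. The release store
// plays the role of rcu_assign_pointer: a barrier (where needed), then
// the pointer write.
void publish(entry* e) {
    e->val = 42;
    e->next = head.load(std::memory_order_relaxed);
    head.store(e, std::memory_order_release);  // ~ rcu_assign_pointer(head, e)
}

// Read side: the consume load plays the role of rcu_dereference: the data
// dependency from the loaded pointer to e->val is what orders the reads
// (plus a barrier on Alpha).
int read_first() {
    entry* e = head.load(std::memory_order_consume);  // ~ rcu_dereference(head)
    return e ? e->val : -1;
}
```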

There is a catch, though: the compiler. There is no guarantee that something that looks like a data dependency in your C source code will be compiled as a data dependency. The most obvious way this could happen is by optimizing “r[i ^ i]” or the like into “r[0]”, but there are many other ways, some quite subtle. The document linked above is the Linux kernel team’s effort to list all of the ways a compiler might screw you when you are using rcu_dereference, so that you can avoid them.
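
A small illustration of the “r[i ^ i]” case (hypothetical code, with the index published through a consume-style load):

```cpp
#include <atomic>

int table[16];
std::atomic<int> published_index{0};

int reader() {
    // The reader relies on the data dependency from idx to the table
    // access to order the two loads.
    int idx = published_index.load(std::memory_order_consume);
    // In the source, table[idx ^ idx] depends on idx; but since
    // idx ^ idx is always 0, the compiler may rewrite this as table[0],
    // which depends on nothing, freeing the hardware to perform this
    // load before the load of idx.
    return table[idx ^ idx];
}
```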

This is no way to run a railway.

Language Memory Models

Programming by attempting to quantify over all possible optimizations a compiler might perform and avoiding them is a dangerous way to live. It’s easy to mess up, hard to educate people about, and fragile: compiler writers are feverishly working to invent new optimizations that will violate the blithe assumptions of kernel writers! The solution to this sort of problem is that the language needs to provide the set of concurrency primitives that are used as building blocks (so that the compiler can constrain its code transformations as needed) and a memory model describing how they work and how they interact with regular memory accesses (so that programmers can reason about their code). Hans Boehm makes this argument in the well-known paper “Threads Cannot be Implemented as a Library”.

One of the big new features of C++11 and C11 is a memory model which attempts to make precise what values can be read by threads in concurrent programs and to provide useful tools to programmers at various levels of abstraction and simplicity. It is complicated, and has a lot of moving parts, but overall it is definitely a step forward.

One place it falls short, however, is in its handling of “rcu_dereference”-style code, as described above. One of the possible memory orders in C11 is “memory_order_consume”, which establishes an ordering relationship with all operations after it that are data dependent on it. There are two problems here. First, these operations deeply complicate the semantics: the C11 memory model relies heavily on a relation called “happens before” to determine what writes are visible to reads, and with consume, this relation is no longer transitive. Yuck! Second, it seems to be nearly unimplementable: tracking down all the dependencies and maintaining them is difficult, and no compiler yet does it; clang and gcc both just emit barriers. So now we have a nasty semantics for our memory model and we’re still stuck trying to reason about all possible optimizations. (There is work being done to try to repair this situation; we will see how it turns out.)
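
You can see the “just emit barriers” behavior directly: with current gcc and clang, these two functions compile to identical code, because consume is silently strengthened to acquire:

```cpp
#include <atomic>

int load_consume(const std::atomic<int>& a) {
    return a.load(std::memory_order_consume);  // treated as acquire today
}

int load_acquire(const std::atomic<int>& a) {
    return a.load(std::memory_order_acquire);  // an acquiring load or a
                                               // barrier on ARM/POWER
}
```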

Shameless Plug

My advisor, Karl Crary, and I are working on designing an alternate memory model (called RMC) for C and C++ based on explicitly specifying the execution and visibility constraints that you depend on. We have a paper on it, and I gave a talk about it at POPL this year. The paper is mostly about the theory, but the talk tried to be more practical, and I’ll be posting more about RMC shortly. RMC is quite flexible: all of the C++11 model apart from consume can be implemented in terms of RMC (although that’s probably not the best way to use it), and consume-style operations are done in a more explicit and more implementable (and implemented!) way.