Most lock-free data structures rely on atomic compare and swap (CAS) operation and in order to solve ABA issue the CAS must work on a double word (a pointer and a counter.) Implementing such kind of pointer is often tedious and error prone. In particular, for Intel x86, 32bit and 64bit code use a different mnemonic. This article present a template based implementation that hides the hard stuff.

Since the introduction of multi-core processors, parallel computing is growing in attention. While lock-based techniques have been studied for a long time, modern HPC require more scalable data structures. Lock-free data structures have proven better scaling by avoiding thread blocking.

But avoiding locks introduces new issues, among which the so-called ABA problem (see next section.) There exists various strategy to avoid this issue, and among all the use of atomic CAS on double word pointers seems the less intrusive one (other choices may rely on aspect that may not be available in all context: garbage collector or Load-Link/Store-Condition for example.)

Implementing a double-word CAS is tedious, you have to inline some assembly code and most of all your code is word size dependent. Here is a simple solution to implement such a CAS with gcc-style inline assembly and C++ template for x86 and x86-64 processors.

But, let’s start with quick description of the ABA problem.

The ABA problem

Classical lock-free data structures often use a simple retry-strategy based on the idea that one fetch the desired pointer, access inner data and then try to update the pointer in the structure, if the pointer has changed since we fetch it, the algorithm simply retry by fetching it again.

The ABA issue may appears when the change to the pointer may not be visible simply reading the address. Let’s see the scenario:

  • Thread 0 access the pointer in the structure and read address A
  • Thread 1 change the pointer with address B and invalidate address A
  • Thread 2 (or Thread 1) allocate a new cell at address A and again update the main pointer and write A
  • Thread 0 test the pointer (with a CAS) and find A, since the pointer doesn’t seem to have changed, it will silently consider that nothing has changed.

If the first thread has already read content of the cell, this content will be out-of-date.

To prevent this issue, we have several strategies. Most strategies rely on different memory management approach such as garbage collection (in a garbage collected paradigm, the cell at address A won’t be invalidate since it is hold by someone, and thus it can’t be reused.)

But garbage collection introduces more parallel issues and requires some sort of language integration. It is acceptable when it is already part of the environment (like in Java), but you don’t want it when building high-performance application in less high-level languages (such as C/C++.)

Other solutions based on transactional memory or LL/SC operations have other drawback such as hardware requirements that are not standard.

Double-word CAS

A more simple way to solve the ABA issue used for example in the article from Micheal and Scott Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms, is to replace your pointer by a pair with a pointer and a counter.

The strategy is simple, each time the pointer is changed the counter is incremented, thus even if the address is the same the counter value will differ.

The only remaining issue is how to perform a double word CAS ?

Atomic CAS requires processor’s specific instructions, on the x86/x86-64 processors you’ll find that the CAS instruction is named cmpxchg.

But, as usual, this instruction has various versions depending on the size of the value to swap. So, if you want a double-word CAS for 32bit wide pointers, you’ll use cmpxchg8b and cmpxchg16b for 64bit wide pointers.

Note: GCC provides atomic operations, but these operations are limited to integral types, that is types with size less (or equal) to 8 bytes, but we need 16 bytes wide CAS.

Using cmpxchg for 32bit wide pointers

So, let’s start with the easy case: 32bit wide pointers. It’s easy because you have 64bit integral types (here we need some unsigned integer) directly defined.

The double-word CAS will look like that:

uint64_t cas(uint64_t* adr, uint64_t nval, uint64_t cmp)
{
  uint64_t old;
  __asm__ __volatile__(
    "lock cmpxchg8b %0\n\t"
     : "=a" (old)
      ,"+m" (*adr)
     : "d" ((uint32_t)(cmp >> 32)), "a" ((uint32_t)(cmp & 0xffffffff))
      ,"c" ((uint32_t)(nval >> 32)), "b" ((uint32_t)(nval & 0xffffffff))
     : "cc"
  );
}

Note: the previous code is doing some supposition upon the memory layout of the unsigned integer.

The cmpxchg8b atomically compare the given 8 bytes with the given values (in d and a), if the values matches it replaces the old value with the new one. In any case it returns the old value.

You may have notice the lock prefix. While cmpxchg8b is guaranteed to be atomic, the instruction doesn’t implies any memory barrier: re-ordering of fetch and store operations by the processor are not required to be consistent with the relative ordering of the instructions flow (so write operations lexically before the cmpxchg8b can actually take place after the CAS itself.) But, and this is more important, other processors may change the memory state during the execution of the cmpxchg8b ! The lock prefix enforce a global memory barrier (for the current processor but also for the other.) It also force cache line invalidation so other attempt to read the memory cell will require a real fetch in memory.

Why the memory barrier is not the default behavior of an atomic operation ? The memory barrier have a cost, and in some situation it is not required, thus it is simpler to provide the barrier as an option rather than a default behavior.

And a 64bit version ?

At first, it seems logical to simply replace cmpxchg8b by cmpxchg16b is the previous code to obtain a double-word CAS for 64bit wide pointer, no ?

Of course not, we don’t have a 128bits wide integer type (some compiler may provide such a type) so we have to embedded our pair in a struct (we’ll see the code later in the template version.) Beware that cmpxchg16b requires a memory operand aligned on 16 bytes boundaries.

But that’s not all. In the previous version, the CAS operation returns the old value, which is then often used to test if the operation has succeed or not. But, the compiler won’t let us simply compare structures like integers.

Hopefully, cmpxchg16b (as cmpxchg8b and cmpxchg) set an arithmetic flag indicating whether the operation has succeed or not ! Thus, we only have to do a setz to some Boolean-like value.

Taking everything together:

  • we need a struct holding our pair (in fact we probably have it already)
  • our compare and swap will return a Boolean value
  • depending on the pointer size we should use cmpxchg8b or cmpxchg16b

Now, how can we have a unified code size dependent ?

Using template

In order to switch upon the size of pointer, we’ll use template.

Let see the code:

template<typename T,unsigned N=sizeof (uint32_t)>
struct DPointer {
public:
  union {
    uint64_t ui;
    struct {
      T* ptr;
      size_t count;
    };
  };

  DPointer() : ptr(NULL), count(0) {}
  DPointer(T* p) : ptr(p), count(0) {}
  DPointer(T* p, size_t c) : ptr(p), count(c) {}

  bool cas(DPointer<T,N> const& nval, DPointer<T,N> const & cmp)
  {
    bool result;
    __asm__ __volatile__(
        "lock cmpxchg8b %1\n\t"
        "setz %0\n"
        : "=q" (result)
         ,"+m" (ui)
        : "a" (cmp.ptr), "d" (cmp.count)
         ,"b" (nval.ptr), "c" (nval.count)
        : "cc"
    );
    return result;
  }

  // We need == to work properly
  bool operator==(DPointer<T,N> const&x) { return x.ui == ui; }

};

template<typename T>
struct DPointer <T,sizeof (uint64_t)> {
public:
  union {
    uint64_t ui[2];
    struct {
      T* ptr;
      size_t count;
    } __attribute__ (( __aligned__( 16 ) ));
  };

  DPointer() : ptr(NULL), count(0) {}
  DPointer(T* p) : ptr(p), count(0) {}
  DPointer(T* p, size_t c) : ptr(p), count(c) {}

  bool cas(DPointer<T,8> const& nval, DPointer<T,8> const& cmp)
  {
    bool result;
    __asm__ __volatile__ (
        "lock cmpxchg16b %1\n\t"
        "setz %0\n"
        : "=q" ( result )
         ,"+m" ( ui )
        : "a" ( cmp.ptr ), "d" ( cmp.count )
         ,"b" ( nval.ptr ), "c" ( nval.count )
        : "cc"
    );
    return result;
  }

  // We need == to work properly
  bool operator==(DPointer<T,8> const&x)
  {
    return x.ptr == ptr && x.count == count;
  }
};

The first trick is to use an anonymous union (and an anonymous struct) in order to have access to the pointer and the counter directly, and a direct value access to the value as a whole itself for the assembly code. In fact, we probably can had done it without, but it simpler to read like that.

As you can see, the template as an integer parameter (that is not use) and is specialized upon it (for N=8.) Now, when you want to use our pointer, all you have to do is to instantiate our template with N=sizeof (void*).

An Example

Here is an quick’n’dirty implementation of the non-blocking concurrent queue described in the article by Micheal and Scott.

template<typename T>
class Queue
{
public:
  struct Node;
  typedef DPointer<Node,sizeof (size_t)> Pointer;

  struct Node
  {
    T           value;
    Pointer     next;

    Node() : next(NULL) {}
    Node(T x, Node* nxt) : value(x), next(nxt) {}
  };

  Pointer       Head, Tail;

  Queue() {
    Node       *node = new Node();
    Head.ptr = Tail.ptr = node;
  }

  void push(T x);
  bool take(T& pvalue);
};

template<typename T>
void Queue<T>::push(T x) {
  Node         *node = new Node(x,NULL);
  Pointer       tail, next;
  do {
    tail = Tail;
    next = tail.ptr->next;
    if (tail == Tail) {
      if (next.ptr == NULL) {
        if (tail.ptr->next.cas(Pointer(node,next.count+1),next))
          break;
      } else {
        Tail.cas(Pointer(next.ptr,tail.count+1), tail);
      }
    }
  } while (true);
  Tail.cas(Pointer(node,tail.count+1), tail);
}

template<typename T>
bool Queue<T>::take(T& pvalue) {
  Pointer       head, tail, next;
  do {
    head = Head;
    tail = Tail;
    next = head.ptr->next;
    if (head == Head)
      if (head.ptr == tail.ptr) {
        if (next.ptr == NULL) return false;
        Tail.cas(Pointer(next.ptr,tail.count+1), tail);
      } else {
        pvalue = next.ptr->value;
        if (Head.cas(Pointer(next.ptr,head.count+1), head))
          break;
      }
  } while (true);
  delete(head.ptr);
  return true;
}

Going further

There’s some possible enhancement of our pointer template:

  • Our assembly code doesn’t support -fPIC relocation: by convention, ebx is supposed to be preserved in each block of code, so, we have to backup its value before using in the asm inline block.
  • Not all operation are done atomically, in order to have a better complete implementation, we should override some operators.