-
Implementing generic double-word compare and swap for x86/x86-64
Written by Marwan Burelle
February 27, 2013 at 10:10Most lock-free data structures rely on atomic compare and swap (CAS) operation and in order to solve ABA issue the CAS must work on a double word (a pointer and a counter.) Implementing such kind of pointer is often tedious and error prone. In particular, for Intel x86, 32bit and 64bit code use a different mnemonic. This article present a template based implementation that hides the hard stuff.
Since the introduction of multi-core processors, parallel computing is growing in attention. While lock-based techniques have been studied for a long time, modern HPC require more scalable data structures. Lock-free data structures have proven better scaling by avoiding thread blocking.
But avoiding locks introduces new issues, among which the so-called ABA problem (see next section.) There exists various strategy to avoid this issue, and among all the use of atomic CAS on double word pointers seems the less intrusive one (other choices may rely on aspect that may not be available in all context: garbage collector or Load-Link/Store-Condition for example.)
Implementing a double-word CAS is tedious, you have to inline some assembly code and most of all your code is word size dependent. Here is a simple solution to implement such a CAS with gcc-style inline assembly and C++ template for x86 and x86-64 processors.
But, let's start with quick description of the ABA problem.
The ABA problem
Classical lock-free data structures often use a simple retry-strategy based on the idea that one fetch the desired pointer, access inner data and then try to update the pointer in the structure, if the pointer has changed since we fetch it, the algorithm simply retry by fetching it again.
The ABA issue may appears when the change to the pointer may not be visible simply reading the address. Let's see the scenario:
- Thread 0 access the pointer in the structure and read address A
- Thread 1 change the pointer with address B and invalidate address A
- Thread 2 (or Thread 1) allocate a new cell at address A and again update the main pointer and write A
- Thread 0 test the pointer (with a CAS) and find A, since the pointer doesn't seem to have changed, it will silently consider that nothing has changed.
If the first thread has already read content of the cell, this content will be out-of-date.
To prevent this issue, we have several strategies. Most strategies rely on different memory management approach such as garbage collection (in a garbage collected paradigm, the cell at address A won't be invalidate since it is hold by someone, and thus it can't be reused.)
But garbage collection introduces more parallel issues and requires some sort of language integration. It is acceptable when it is already part of the environment (like in Java), but you don't want it when building high-performance application in less high-level languages (such as C/C++.)
Other solutions based on transactional memory or LL/SC operations have other drawback such as hardware requirements that are not standard.
Double-word CAS
A more simple way to solve the ABA issue used for example in the article from Micheal and Scott Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms, is to replace your pointer by a pair with a pointer and a counter.
The strategy is simple, each time the pointer is changed the counter is incremented, thus even if the address is the same the counter value will differ.
The only remaining issue is how to perform a double word CAS ?
Atomic CAS requires processor's specific instructions, on the x86/x86-64 processors you'll find that the CAS instruction is named cmpxchg.
But, as usual, this instruction has various versions depending on the size of the value to swap. So, if you want a double-word CAS for 32bit wide pointers, you'll use cmpxchg8b and cmpxchg16b for 64bit wide pointers.
Note: GCC provides atomic operations, but these operations are limited to integral types, that is types with size less (or equal) to 8 bytes, but we need 16 bytes wide CAS.
Using cmpxchg for 32bit wide pointers
So, let's start with the easy case: 32bit wide pointers. It's easy because you have 64bit integral types (here we need some unsigned integer) directly defined.
The double-word CAS will look like that:
1 2 3 4 5 6 7 8 9 10 11 12
uint64_t cas(uint64_t* adr, uint64_t nval, uint64_t cmp) { uint64_t old; __asm__ __volatile__( "lock cmpxchg8b %0\n\t" : "=a" (old) ,"+m" (*adr) : "d" ((uint32_t)(cmp >> 32)), "a" ((uint32_t)(cmp & 0xffffffff)) ,"c" ((uint32_t)(nval >> 32)), "b" ((uint32_t)(nval & 0xffffffff)) : "cc" ); }
Note: the previous code is doing some supposition upon the memory layout of the unsigned integer.
The cmpxchg8b atomically compare the given 8 bytes with the given values (in d and a), if the values matches it replaces the old value with the new one. In any case it returns the old value.
You may have notice the lock prefix. While cmpxchg8b is guaranteed to be atomic, the instruction doesn't implies any memory barrier: re-ordering of fetch and store operations by the processor are not required to be consistent with the relative ordering of the instructions flow (so write operations lexically before the cmpxchg8b can actually take place after the CAS itself.) But, and this is more important, other processors may change the memory state during the execution of the cmpxchg8b ! The lock prefix enforce a global memory barrier (for the current processor but also for the other.) It also force cache line invalidation so other attempt to read the memory cell will require a real fetch in memory.
Why the memory barrier is not the default behavior of an atomic operation ? The memory barrier have a cost, and in some situation it is not required, thus it is simpler to provide the barrier as an option rather than a default behavior.
And a 64bit version ?
At first, it seems logical to simply replace cmpxchg8b by cmpxchg16b is the previous code to obtain a double-word CAS for 64bit wide pointer, no ?
Of course not, we don't have a 128bits wide integer type (some compiler may provide such a type) so we have to embedded our pair in a struct (we'll see the code later in the template version.) Beware that cmpxchg16b requires a memory operand aligned on 16 bytes boundaries.
But that's not all. In the previous version, the CAS operation returns the old value, which is then often used to test if the operation has succeed or not. But, the compiler won't let us simply compare structures like integers.
Hopefully, cmpxchg16b (as cmpxchg8b and cmpxchg) set an arithmetic flag indicating whether the operation has succeed or not ! Thus, we only have to do a setz to some Boolean-like value.
Taking everything together:
- we need a struct holding our pair (in fact we probably have it already)
- our compare and swap will return a Boolean value
- depending on the pointer size we should use cmpxchg8b or cmpxchg16b
Now, how can we have a unified code size dependent ?
Using template
In order to switch upon the size of pointer, we'll use template.
Let see the code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
template<typename T,unsigned N=sizeof (uint32_t)> struct DPointer { public: union { uint64_t ui; struct { T* ptr; size_t count; }; }; DPointer() : ptr(NULL), count(0) {} DPointer(T* p) : ptr(p), count(0) {} DPointer(T* p, size_t c) : ptr(p), count(c) {} bool cas(DPointer<T,N> const& nval, DPointer<T,N> const & cmp) { bool result; __asm__ __volatile__( "lock cmpxchg8b %1\n\t" "setz %0\n" : "=q" (result) ,"+m" (ui) : "a" (cmp.ptr), "d" (cmp.count) ,"b" (nval.ptr), "c" (nval.count) : "cc" ); return result; } // We need == to work properly bool operator==(DPointer<T,N> const&x) { return x.ui == ui; } }; template<typename T> struct DPointer <T,sizeof (uint64_t)> { public: union { uint64_t ui[2]; struct { T* ptr; size_t count; } __attribute__ (( __aligned__( 16 ) )); }; DPointer() : ptr(NULL), count(0) {} DPointer(T* p) : ptr(p), count(0) {} DPointer(T* p, size_t c) : ptr(p), count(c) {} bool cas(DPointer<T,8> const& nval, DPointer<T,8> const& cmp) { bool result; __asm__ __volatile__ ( "lock cmpxchg16b %1\n\t" "setz %0\n" : "=q" ( result ) ,"+m" ( ui ) : "a" ( cmp.ptr ), "d" ( cmp.count ) ,"b" ( nval.ptr ), "c" ( nval.count ) : "cc" ); return result; } // We need == to work properly bool operator==(DPointer<T,8> const&x) { return x.ptr == ptr && x.count == count; } };
The first trick is to use an anonymous union (and an anonymous struct) in order to have access to the pointer and the counter directly, and a direct value access to the value as a whole itself for the assembly code. In fact, we probably can had done it without, but it simpler to read like that.
As you can see, the template as an integer parameter (that is not use) and is specialized upon it (for N=8.) Now, when you want to use our pointer, all you have to do is to instantiate our template with N=sizeof (void*).
An Example
Here is an quick'n'dirty implementation of the non-blocking concurrent queue described in the article by Micheal and Scott.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
template<typename T> class Queue { public: struct Node; typedef DPointer<Node,sizeof (size_t)> Pointer; struct Node { T value; Pointer next; Node() : next(NULL) {} Node(T x, Node* nxt) : value(x), next(nxt) {} }; Pointer Head, Tail; Queue() { Node *node = new Node(); Head.ptr = Tail.ptr = node; } void push(T x); bool take(T& pvalue); }; template<typename T> void Queue<T>::push(T x) { Node *node = new Node(x,NULL); Pointer tail, next; do { tail = Tail; next = tail.ptr->next; if (tail == Tail) { if (next.ptr == NULL) { if (tail.ptr->next.cas(Pointer(node,next.count+1),next)) break; } else { Tail.cas(Pointer(next.ptr,tail.count+1), tail); } } } while (true); Tail.cas(Pointer(node,tail.count+1), tail); } template<typename T> bool Queue<T>::take(T& pvalue) { Pointer head, tail, next; do { head = Head; tail = Tail; next = head.ptr->next; if (head == Head) if (head.ptr == tail.ptr) { if (next.ptr == NULL) return false; Tail.cas(Pointer(next.ptr,tail.count+1), tail); } else { pvalue = next.ptr->value; if (Head.cas(Pointer(next.ptr,head.count+1), head)) break; } } while (true); delete(head.ptr); return true; }
Going further
There's some possible enhancement of our pointer template:
- Our assembly code doesn't support -fPIC relocation: by convention, ebx is supposed to be preserved in each block of code, so, we have to backup its value before using in the asm inline block.
- Not all operation are done atomically, in order to have a better complete implementation, we should override some operators.
TweetPermalink & comments -
PythonGDB tutorial for reverse engineering - part 1
Written by Pierre Bourdon
May 01, 2012 at 11:50Dynamic analysis of computer software is usually done using two tools:
- A disassembler, in order to get a good overview of the global program structure (or more locally, a good overview of a function control flow). Most people use Hex-Rays' IDA to achieve that goal: it works on all 3 major operating systems, can handle almost anything you throw at him and is very extensible using IDAPython for scripting. It also has great support for analysis, allowing the reverse engineer to rename symbols or add comments to the disassembly.
- A debugger, in order to trace the program execution, break on specific code locations or conditions, show the program state at any point, etc. Debuggers are usually specific to a system: Windows people mostly use OllyDBG (has a lot of community support and plugins), Immunity Debugger (a fork of an old OllyDBG version which went its on way, it supports Python plugins and has a lot of very interesting community contributions like mona) or WinDBG (mostly for remote kernel/drivers debugging). On Linux or Mac OS X most people use the very simple GDB, which I am going to talk about in this article.
Until very recently GDB did not have any support for scripting and plugins. The GDB command language was very simple and didn't provide a lot of facilities to write complex functions: no way to interface with external libraries, very limited control flow features, no advanced variable types like lists or hashtables. This is ok when you only use your debugger to break at some point and display a variable, but when you want to write complex plugins such as mona (which is used to find sequences of instructions in memory that can be used to bypass NX/DEP) the GDB command language is really not enough.
In october 2009 GDB 7.0 was released, including changes from Tom Tromey from RedHat which added Python scripting support for GDB. Tromey also wrote a long list of articles about why Python scripting was a major feature, with a lot of examples including using PythonGDB to write pretty printers for custom data types. Recently I used PythonGDB a bit in order to reverse engineer some Linux binaries I did not have the source of, so I wanted to show some of the useful stuff PythonGDB can help for in your reverse engineering work.
Basic PythonGDB usage
First, check that the GDB version you have installed does have Python support:
1 2 3 4 5 6 7 8 9 10 11 12
(gdb) help python Evaluate a Python command. The command can be given as an argument, for instance: python print 23 If no argument is given, the following lines are read and used as the Python commands. Type a line containing "end" to indicate the end of the command. (gdb) python print "Hello, World!" Hello, World!There are three ways to run Python code from inside GDB:
- Using the
pythoncommand as shown above: good for one-shot Python instructions which you will likely not use anymore. Also good for testing the PythonGDB API directly from inside GDB. - Using the
sourcecommand to load a Python script from the filesystem. This is the best way to load scripts manually. - Using the autoloading mechanisms to load specific scripts for a binary. If a
script file named
<binary>-gdb.pyexists in the current directory, GDB will automatically load it. The binary itself can also specify GDB scripts to load when it is being debugged by putting the scripts in the.debug_gdb_scriptsELF section.
GDB exports a Python module called
gdbwhich provides all of the required functions to interface with the debugger: reading memory from the process, getting the list of all breakpoints, registering commands and pretty printers, as well as a lot of additional features.Writing our first PythonGDB plugin
Format strings bugs can be tricky to detect: unless you try to send format specifiers like
%neverywhere you could send a string in the program, you will most likely never notice that one of your format strings can be user-controlled and provide a way for malicious users to inject code remotely into your process.Our first PythonGDB plugin will try to help detecting these bugs by dynamically analyzing calls to the
*printffamily functions and break the execution if one of those functions is called with an argument that is not in read-only memory (like.rodata).PythonGDB allows us to define custom breakpoints which execute Python code when they are triggered and can either stop the program or allow it to continue. We are going to use this to break on the
printffunctions, parse/proc/<pid>/mapsto check if the memory location used as the format string is in read-write memory, and if it is, display an error and break execution. If it is not, just continue executing the program without interruption.Here is how you would implement this with PythonGDB:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
import gdb class CheckFmtBreakpoint(gdb.Breakpoint): """Breakpoint checking whether the first argument of a function call is in read-only location, stopping program execution if it is not.""" def __init__(self, spec, fmt_idx): # spec: specifies where to break # # gdb.BP_BREAKPOINT: specified that we are a breakpoint and not a # watchpoint. # # internal=True: the breakpoint won't show up in "info breakpoints" and # commands like this. super(CheckFmtBreakpoint, self).__init__( spec, gdb.BP_BREAKPOINT, internal=True ) # Argument index of the format string (printf = 1, sprintf = 2) self.fmt_idx = fmt_idx def stop(self): """Method called by GDB when the breakpoint is triggered.""" # Read the i-th argument of an x86_64 function call args = ["$rdi", "$rsi", "$rdx", "$rcx"] fmt_addr = int(gdb.parse_and_eval(args[self.fmt_idx])) # Parse /proc/<pid>/maps for this process proc_map = [] with open("/proc/%d/maps" % gdb.selected_inferior().pid) as fp: proc_map = self._parse_map(fp.read()) # Find the memory range which contains our format address for mapping in proc_map: if mapping["start"] <= fmt_addr < mapping["end"]: break else: print "%016x belongs to an unknown memory range" % fmt_addr return True # Check the memory permissions if "w" in mapping["perms"]: print "Format string in writable memory!" return True return False def _parse_map(self, file_contents): """Parse a /proc/<pid>/maps file to a list of dictionaries containing these fields: - start: the start address of the range - end: the end address of the range - perms: the permissions string""" zones = [] for line in file_contents.split('\n'): if not line: continue memrange, perms, _ = line.split(None, 2) start, end = memrange.split('-') zones.append({ 'start': int(start, 16), 'end': int(end, 16), 'perms': perms }) return zones # Set breakpoints on all *printf functions CheckFmtBreakpoint("printf", 0) CheckFmtBreakpoint("fprintf", 1) CheckFmtBreakpoint("sprintf", 1) CheckFmtBreakpoint("snprintf", 2) CheckFmtBreakpoint("vprintf", 0) CheckFmtBreakpoint("vfprintf", 1) CheckFmtBreakpoint("vsprintf", 1) CheckFmtBreakpoint("vsnprintf", 2)
We can then test this with a very simple program that will execute code that allows the user to control a format string:
1 2 3 4 5 6 7
#include <stdio.h> int main(int argc, char** argv) { printf(argv[1]); return 0; }
Using our simple plugin on this executable gives the following output:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
(gdb) source checkfmt.py Function "fprintf" not defined. Function "sprintf" not defined. Function "snprintf" not defined. Function "vprintf" not defined. Function "vfprintf" not defined. Function "vsprintf" not defined. Function "vsnprintf" not defined. (gdb) r "Test %x" Starting program: /home/delroth/test/fstring/a.out "Test %x" Format string in writable memory! Breakpoint -1, 0x00007ffff7a88b00 in printf () from /lib/libc.so.6 (gdb) bt #0 0x00007ffff7a88b00 in printf () from /lib/libc.so.6 #1 0x0000000000400503 in main ()
This ends the first part of my PythonGDB for reverse engineering tutorial. The next part will be about execution tracing using GDB: performing timing attacks on binaries by counting the number of executed instructions, as well as counting the number of times a breakpoint was triggered to guess what a virtual machine is doing (kind of like I did in my PlaidCTF "simple" binary cracking).