LSE Blog

Operating systems, computer security, languages theory, and even more!

  • About us

    • Main website
    • Git repositories
    • @lse_epita

    RSS Feed

  • Categories
    • Events
    • Hardware
    • Language
    • Reverse Engineering
    • Security
    • System
      • Linux
    • Tutorials
      • Parallelism
      • PythonGDB
    • Writeups
      • CSAW CTF 2012 Quals
      • DEFCON 2013 Quals
      • DEFCON2K12 Prequals
      • Hack.lu CTF 2012
      • Hack.lu CTF 2013
      • NDH2K12 Prequals
      • NDH2K13 Quals
      • Olympic-CTF 2014
      • PlaidCTF 2012
      • SecuInside2K12 Prequals
      • ebCTF 2013
  • Authors
    • ✉ Samuel Angebault
    • ✉ Remi Audebert
    • ✉ Jean-Loup Bogalho
    • ✉ Pierre Bourdon
    • ✉ Marwan Burelle
    • ✉ Samuel Chevet
    • ✉ Pierre-Marie de Rodat
    • ✉ Ivan Delalande
    • ✉ Corentin Derbois
    • ✉ Nassim Eddequiouaq
    • ✉ Louis Feuvrier
    • ✉ Fabien Goncalves
    • ✉ Nicolas Hureau
    • ✉ Gabriel Laskar
    • ✉ Stanislas Lejay
    • ✉ Franck Michea
    • ✉ Bruno Pujos
    • ✉ Clement Rouault
    • ✉ Pierre Surply
    • ✉ Kevin Tavukciyan
    • More »
  • PlaidCTF 2017 - no_mo_flo writeup (RE)

    Written by Gabriel Laskar and Stanislas Lejay
    2017-04-24 10:12:00

    Can you go with the flow?

    no_mo_flo is a reverse engineering challenge from this year's PlaidCTF. It's a 64-bit executable that reads 32 characters from stdin, and tells you if this is the correct flag or not (classic).

    Opening it in IDA reveals that it takes the input and breaks it into two 16 bytes buffers:

    1
    2
    3
    4
    5
    for ( i = 0; i <= 15; ++i )
    {
      v5[i] = buf[2 * i];
      v6[i] = buf[2 * i + 1];
    }
    

    It will then register a SIGFPE handler and trigger divisions by 0. When triggerred, the handler will emulate jumps depending on $rflags, $r10, and $r11.

    The SIGFPE handler looks like this

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    int __fastcall sigfpe_handler(__int64 a1, siginfo_t *a2, ucontext_t *ctx)
    {
        greg_t reg_r11;
        greg_t reg_eflags;
        greg_t reg_r10;
        char *str;
    
        if ( custom_flow_enabled ) {
            reg_r10 = ctx->uc_mcontext.gregs[REG_R10];
            reg_eflags = ctx->uc_mcontext.gregs[REG_EFL];
            reg_r11 = ctx->uc_mcontext.gregs[REG_R11];
            switch ( reg_r11 ) {
                case CUSTOM_JMP:
                    reg_r11 = custom_jmp(reg_r10);
                    break;
                case CUSTOM_JNL:
                    reg_r11 = custom_jnl(reg_r10, reg_eflags);
                    break;
                case CUSTOM_JNG:
                    reg_r11 = custom_jng(reg_r10, reg_eflags);
                    break;
                case CUSTOM_JG:
                    reg_r11 = custom_jg(reg_r10, reg_eflags);
                    break;
                case CUSTOM_JL:
                    reg_r11 = custom_jl(reg_r10, reg_eflags);
                    break;
                case CUSTOM_JNE:
                    reg_r11 = custom_jne(reg_r10, reg_eflags);
                    break;
                case CUSTOM_JE:
                    reg_r11 = custom_je(reg_r10, reg_eflags);
                    break;
                default:
                    break;
            }
            zero = 1LL;
            custom_flow_enabled = 0LL;
        } else {
            reg_r11 = sigaction(8, 0LL, 0LL);
            if ( (signed int)reg_r11 < 0 ) {
                str = strerror(errno);
                reg_r11 = printf("sigaction install fail %s\n", str);
            }
        }
        return reg_r11;
    }
    

    As we can see, $r11 is used as an opcode, and $r10 to store the jump value. If we look at the function called inside the switch, we have a reimplementation of the x86 opcodes, for example with jne:

    1
    2
    3
    4
    5
    6
    7
    8
    9
    __int64 custom_jne(__int64 reg_r10, __int64 reg_eflags)
    {
        if (reg_eflags & X86_EFLAGS_ZF)
            custom_jmp_to = reg_r10;
        else
            custom_jmp_to = custom_jmp_from + 56;
    
        return custom_jmp_to;
    }
    

    The code that triggers the SIGFPE handler looks like this:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    .text:0000000000400F18 check_odd_bytes:                        ; CODE XREF: main+AA
    .text:0000000000400F18                 sub     rsp, 8
    .text:0000000000400F1C                 mov     esi, 1
    .text:0000000000400F21                 mov     eax, 0
    .text:0000000000400F26                 mov     edx, eax
    .text:0000000000400F28                 shl     edx, 2
    .text:0000000000400F2B                 movsxd  rdx, edx
    .text:0000000000400F2E                 mov     rax, rdi
    .text:0000000000400F31                 add     rax, rdx
    .text:0000000000400F34                 mov     rdx, rax
    .text:0000000000400F37                 mov     eax, [rdx]
    .text:0000000000400F39                 mov     edx, eax
    .text:0000000000400F3B                 sub     edx, 3
    .text:0000000000400F3E                 mov     eax, edx
    .text:0000000000400F40                 cmp     eax, 40h
    .text:0000000000400F43                 lea     r10, check_odd_byte_1
    .text:0000000000400F4B                 mov     r11, CUSTOM_JNE
    .text:0000000000400F52                 mov     dword ptr ds:custom_flow_enabled, 1
    .text:0000000000400F5D                 mov     ds:custom_save_rax, rax
    .text:0000000000400F65                 mov     rax, 0
    .text:0000000000400F6C                 mov     ds:custom_save_rdx, rdx
    .text:0000000000400F74                 lea     rdx, loc_400F7B
    .text:0000000000400F7B
    .text:0000000000400F7B loc_400F7B:                             ; DATA XREF: check_odd_bytes+5C
    .text:0000000000400F7B                 mov     ds:custom_jmp_from, rdx
    .text:0000000000400F83                 cdq
    .text:0000000000400F84                 idiv    ds:zero
    .text:0000000000400F8C                 mov     ds:zero, 0
    .text:0000000000400F98                 mov     rax, ds:custom_save_rax
    .text:0000000000400FA0                 mov     rdx, ds:custom_save_rdx
    .text:0000000000400FA8                 mov     r11, ds:custom_jmp_to
    .text:0000000000400FB0                 jmp     r11
    
    .text:0000000000400FB3                 mov     esi, 0
    .text:0000000000400FB8                 lea     r10, check_odd_byte_1
    .text:0000000000400FC0                 mov     r11, CUSTOM_JMP
    .text:0000000000400FC7                 mov     dword ptr ds:custom_flow_enabled, 1
    .text:0000000000400FD2                 mov     ds:custom_save_rax, rax
    .text:0000000000400FDA                 mov     rax, 0
    .text:0000000000400FE1                 mov     ds:custom_save_rdx, rdx
    .text:0000000000400FE9                 lea     rdx, loc_400FF0
    .text:0000000000400FF0
    .text:0000000000400FF0 loc_400FF0:                             ; DATA XREF: .text:0000000000400FE9
    .text:0000000000400FF0                 mov     ds:custom_jmp_from, rdx
    .text:0000000000400FF8                 cdq
    .text:0000000000400FF9                 idiv    ds:zero
    .text:0000000000401001                 mov     ds:zero, 0
    .text:000000000040100D                 mov     rax, ds:custom_save_rax
    .text:0000000000401015                 mov     rdx, ds:custom_save_rdx
    .text:000000000040101D                 mov     r11, ds:custom_jmp_to
    .text:0000000000401025                 jmp     r11
    

    If we look a little inside it, this roughly translates into:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    check_odd_bytes:                        ; CODE XREF: main+AA
                    sub     rsp, 8
                    mov     esi, 1
                    mov     eax, 0
                    mov     edx, eax
                    shl     edx, 2
                    movsxd  rdx, edx
                    mov     rax, rdi
                    add     rax, rdx
                    mov     rdx, rax
                    mov     eax, [rdx]
                    mov     edx, eax
                    sub     edx, 3
                    mov     eax, edx
                    cmp     eax, 40h
                    jne     check_odd_byte_1 ; Here is the change
    
                    mov     esi, 0
                    jmp     check_odd_byte_1 ; and the second one
    

    Two main functions will then be called, sub_4006c6 and sub_400f18, that will respectively verify the first buffer and the second one. Two nice solving techniques are broken by this scheme: symbolic analysis (like angr) is very hard with stuff like signal handlers and instruction counting is impossible since characters are not checked sequentially (here, they are checked two by two, the even ones, then the odd ones).

    While gaby was reversing and simplifying the jumps handling to NOP out the divisions by 0 (see above), he figured out that the first function was not using the handler at all. So I tried to launch angr on the first function only, and managed to get the first half of the flag like this:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    #!/usr/bin/env python2
    
    import angr
    from simuvex.procedures.stubs.UserHook import UserHook
    
    p = angr.Project('no_flo_f51e2f24345e094cd2080b7b690f69fb')
    
    
    # You win
    find = 0x4027ce
    # You lose
    main = 0x40272e
    # Basic blocks that get eax to be reset in the first function
    # (see get_basic_blocks.py)
    avoid = (0x4027f8, 0x40071d, 0x40077a, 0x4007d7, 0x400834, 0x400894, 0x4008f4,
             0x400950, 0x4009a8, 0x400a09, 0x400a6f, 0x400ac7, 0x400b24, 0x400b81,
             0x400bd9, 0x400c31, 0x400c8e, 0x400ce6, 0x400d3e, 0x400d96, 0x400df2,
             0x400e4a, 0x400ea0, 0x400eeb)
     flag_addr = 0
    
    
    def read(state):
        state.regs.rax = 32
        global flag_addr
        flag_addr = state.regs.rsi
    
        for i in range(31):
            if i % 2 == 0:
                # We are interested by the bytes that go into the first function
                state.mem[state.regs.rsi + i].char = state.se.BVS('c', 8)
            else:
                if i > 4 and i < 31:
                    # Other are put to '`' to be computed later with
                    # v0lt (see solve2.py)
                    state.mem[state.regs.rsi + i].char = '`'
                elif i == 1:
                    state.add_constraints(state.memory.load(flag_addr, 5) == int("PCTF{".encode("hex"), 16))
                    state.mem[state.regs.rsi + 31].char = '}'
    
    def clear_rax(state):
        state.regs.rax = 0
    
    def do_nothing(state):
        # There might be an angr builtin
        # no time to read the docs!
        pass
    
    # sighandler does nothing
    p.hook(0x4027ba, angr.Hook(UserHook, user_func=do_nothing, length=5))
    # read
    p.hook(0x40274a, angr.Hook(UserHook, user_func=read, length=5))
    # second function is completely bypassed for now
    p.hook(0x4027d1, angr.Hook(UserHook, user_func=clear_rax, length=(0x4027e0 - 0x4027d1)))
    
    init = p.factory.blank_state(addr=main)
    pgp = p.factory.path_group(init)
    ex = pgp.explore(find=find, avoid=avoid)
    
    # Print half the flag to pipe it into v0lt
    print(ex.found[0].state.se.any_str(ex.found[0].state.memory.load(flag_addr, 32)))
    

    Basic blocks' addresses from the previous script were dumped from IDA with this:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    from idautils import *
    from bisect import *
    
    START = 0x4006c6
    END = 0x400f13
    
    # From https://reverseengineering.stackexchange.com/a/1648/11827
    class BBWrapper(object):
        def __init__(self, ea, bb):
            self.ea_ = ea
            self.bb_ = bb
    
            def get_bb(self):
                return self.bb_
    
            def __lt__(self, other):
                return self.ea_ < other.ea_
    
    class BBCache(object):
        def __init__(self, f):
            self.bb_cache_ = []
            for bb in idaapi.FlowChart(f):
                self.bb_cache_.append(BBWrapper(bb.startEA, bb))
            self.bb_cache_ = sorted(self.bb_cache_)
    
    def find_block(self, ea):
        i = bisect_right(self.bb_cache_, BBWrapper(ea, None))
        if i:
            return self.bb_cache_[i-1].get_bb()
        else:
            return None
    
    
    
    bb_cache = BBCache(idaapi.get_func(START))
    for func in Functions(START, END):
        addr = func
        while addr < END:
            disasm = GetDisasm(addr)
            if "mov" in disasm and "r8d, 0" in disasm:
                print("{0}".format(hex(bb_cache.find_block(addr).startEA)))
            decoded = DecodeInstruction(addr)
            addr += decoded.size if decoded else 1
    

    angr runs for less than 10 seconds and gives us half the flag:

    1
    2
    3
    p1kachu@GreenLabOfGazon:no_mo_flow$ ./solve1.py
    PCTF{`0`f`0`_`0`l`k`_`h`h`l`_`0}
    p1kachu@GreenLabOfGazon:no_mo_flow$
    

    and now, that we have this, we can "bruteforce" the other half using instruction counting since we will always pass the first check (the even characters)!

    Using v0lt, we are able to get the second half of the flag:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    #!/usr/bin/env python3
    
    from v0lt import *
    
    # Get half the flag from angr
    first_half = input()
    
    # Create an instruction counting instance that reads from stdin a password
    # of 32 chars, and try to recover the other half of it
    ic = InstructionCounter("/home/p1kachu/Desktop/tools/pin/",
                            "/home/p1kachu/no_flo_f51e2f24345e094cd2080b7b690f69fb",
                            binary_args=" &> /dev/null",
                            length=32,
                            input_form=InputForm.STDIN,
                            fixed_chars=first_half)
    
    # ¯\_(ツ)_/¯
    flag = ic.Accurate();
    

    And, a little while later, the Russian Anthem was played ;)

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    p1kachu@GreenLabOfGazon:Downloads$ ./solve1.py | ./solve2.py
    [+]SUCCESS  char known:   P -> P
    [+]SUCCESS  char known:   C -> PC
    [+]SUCCESS  char known:   T -> PCT
    [+]SUCCESS  char known:   F -> PCTF
    [+]SUCCESS  char known:   { -> PCTF{
    [+]SUCCESS  char guessed: n -> PCTF{n
    [+]SUCCESS  char known:   0 -> PCTF{n0
    [+]SUCCESS  char guessed: _ -> PCTF{n0_
    [+]SUCCESS  char known:   f -> PCTF{n0_f
    [+]SUCCESS  char guessed: l -> PCTF{n0_fl
    [+]SUCCESS  char known:   0 -> PCTF{n0_fl0
    [+]SUCCESS  char guessed: ? -> PCTF{n0_fl0?
    [+]SUCCESS  char known:   _ -> PCTF{n0_fl0?_
    [+]SUCCESS  char guessed: m -> PCTF{n0_fl0?_m
    [+]SUCCESS  char known:   0 -> PCTF{n0_fl0?_m0
    [+]SUCCESS  char guessed: _ -> PCTF{n0_fl0?_m0_
    [+]SUCCESS  char known:   l -> PCTF{n0_fl0?_m0_l
    [+]SUCCESS  char guessed: i -> PCTF{n0_fl0?_m0_li
    [+]SUCCESS  char known:   k -> PCTF{n0_fl0?_m0_lik
    [+]SUCCESS  char guessed: e -> PCTF{n0_fl0?_m0_like
    [+]SUCCESS  char known:   _ -> PCTF{n0_fl0?_m0_like_
    [+]SUCCESS  char guessed: a -> PCTF{n0_fl0?_m0_like_a
    [+]SUCCESS  char known:   h -> PCTF{n0_fl0?_m0_like_ah
    [+]SUCCESS  char guessed: _ -> PCTF{n0_fl0?_m0_like_ah_
    [+]SUCCESS  char known:   h -> PCTF{n0_fl0?_m0_like_ah_h
    [+]SUCCESS  char guessed: 3 -> PCTF{n0_fl0?_m0_like_ah_h3
    [+]SUCCESS  char known:   l -> PCTF{n0_fl0?_m0_like_ah_h3l
    [+]SUCCESS  char guessed: l -> PCTF{n0_fl0?_m0_like_ah_h3ll
    [+]SUCCESS  char known:   _ -> PCTF{n0_fl0?_m0_like_ah_h3ll_
    [+]SUCCESS  char guessed: n -> PCTF{n0_fl0?_m0_like_ah_h3ll_n
    [+]SUCCESS  char known:   0 -> PCTF{n0_fl0?_m0_like_ah_h3ll_n0
    [+]SUCCESS  char known:   } -> PCTF{n0_fl0?_m0_like_ah_h3ll_n0}
    [+]SUCCESS  pass found: PCTF{n0_fl0?_m0_like_ah_h3ll_n0}
    p1kachu@GreenLabOfGazon:no_mo_flow$ ./no_flo_f51e2f24345e094cd2080b7b690f69fb
    PCTF{n0_fl0?_m0_like_ah_h3ll_n0}
    Good flow!!
    p1kachu@GreenLabOfGazon:no_mo_flow$
    

    flag: PCTF{n0_fl0?_m0_like_ah_h3ll_n0}

    This was nice, because we could clearly see that the binary had been made such that these kind of techniques would not work! Almost no reversing was necessary for this (even if we did a lot before figuring out this). A little bit of hacking never hurts ;) Thanks PPP!

    You can find the binary and sripts here

    Tweet
    Permalink & comments
  • LSE Week 2017 Announcement

    Written by Kevin Tavukciyan
    2017-04-20 15:27:52

    For the seventh year, we are going to give a 3 day conference to show the work we are doing here at the LSE, about various themes we like, have encountered or overall judge interesting.

    The exact planning and subjects addressed will be announced later, as well as the exact timetable. As we did last year, we are also opening the talks to external contributors and all LSE members, present or past.

    The presentations will be held in French as usual and we will try to record everything.

    All details are on the main page of the event: LSE Summer Week 2017

    Tweet
    Permalink & comments
  • Playing with Mach-O binaries and dyld

    Written by Stanislas Lejay
    2017-03-14 20:17:40

    One cool way to get your hands dirty when discovering something is to try to make it do simple stuff in some stupid/overkill way.

    When I first had "fun" with the Linux ELF format, I was told to call printf without using it directly, by finding which address to call from inside the binary. For this, one would start from the mapped program header, find the r_debug structure which would give the program's link map containing the mapped libc's base address. From it, one would find printf by iterating over the library's symbol table and find where it is, before calling it. No syscall's allowed, so everything must come from the process's own memory and structures.

    Recently I wanted to give a closer look at macOS, and decided to try the same thing with Mach-O binaries. This post will be a sum-up for me to remember, and for anybody that might want to learn anything about macOS in general. I will not re-explain what already exists on other websites, I'll just link them instead.

    Prerequisites

    First things first, we are looking for printf, from the libc. To find it, just write a simple program, and open gdb. There are multiple ways to determine which library we are looking for, but using info sharedlibrary and determining in which range falls printf is one of the simplest. In our case, we care about /usr/lib/system/libsystem_c.dylib.

    This binary is what file calls a Mach-O universal binary, which in fact is a wrapper around multiple Mach-Os. Also called Fat binaries in the old days, they were used to mix x86 and PPC binaries in a single blob. Now, it ships libraries for both 32 and 64 bits architectures.

    1
    2
    p1kachu@OrangeLabOfSun:osx$ file libsystem_c.dylib
    help/libsystem_c.dylib: Mach-O universal binary with 2 architectures: [x86_64: Mach-O 64-bit x86_64 dynamically linked shared library, flags:<NOUNDEFS|DYLDLINK|TWOLEVEL|NO_REEXPORTED_DYLIBS|APP_EXTENSION_SAFE>] [i386: Mach-O i386 dynamically linked shared library, flags:<NOUNDEFS|DYLDLINK|TWOLEVEL|NO_REEXPORTED_DYLIBS|APP_EXTENSION_SAFE>]
    

    A universal binary consists of a fat binary header, and multiple Mach-Os. So we'll only take a look at one of the Mach-O, the one used by our system (in our case, the first one). Here is, however, an overview of the format:

    fatbinaries

    Finding the libc

    In memory will only be mapped the corresponding Mach-O, so that's what we are going to look for in our process's address space. We first need to understand how the dynamic linker maps it. Let's take a look at /usr/include/mach-o/* to try to find some informations. The interesting stuff lies into dyld_images.h and loader.h. We see that the structure dyld_images.h:dyld_all_image_infos has two interesting fields: a pointer (infoArray) to an array of struct dyld_image_info, which gives us every mapped binary in memory, and infoArrayCount which gives the number of binaries in the array. We can thus iterate over these structures to find the libsystem_c.dylib address in memory.

    Here are the important parts from dyld_images.h (macOS Sierra). Comments have been moved/reduced for more readability.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    struct dyld_image_info {
            /* base address image is mapped into */
            const struct mach_header* imageLoadAddress;
            /* path dyld used to load the image */
            const char*               imageFilePath;
            /* time_t of image file */
            uintptr_t                 imageFileModDate;
    // ...
    };
    
    // ...
    
    /* internal limit */
    #define DYLD_MAX_PROCESS_INFO_NOTIFY_COUNT  8
    
    struct dyld_all_image_infos {
            uint32_t                      version;  /* 1 in Mac OS X 10.4 and 10.5 */
            uint32_t                      infoArrayCount;
            const struct dyld_image_info* infoArray;
            dyld_image_notifier           notification;
            bool                          processDetachedFromSharedRegion;
            /* Mac OS X 10.6, iPhoneOS 2.0 and later */
            bool                          libSystemInitialized;
            const struct mach_header*     dyldImageLoadAddress;
            /* Mac OS X 10.6, iPhoneOS 3.0 and later */
            void*                         jitInfo;
            /* Mac OS X 10.6, iPhoneOS 3.0 and later */
            const char*                   dyldVersion;
            const char*                   errorMessage;
            uintptr_t                     terminationFlags;
            /* Mac OS X 10.6, iPhoneOS 3.1 and later */
            void*                         coreSymbolicationShmPage;
            /* Mac OS X 10.6, iPhoneOS 3.1 and later */
            uintptr_t                     systemOrderFlag;
            /* Mac OS X 10.7, iPhoneOS 3.1 and later */
            uintptr_t                     uuidArrayCount;
            const struct dyld_uuid_info*  uuidArray; /* only images not in dyld shared cache */
            /* Mac OS X 10.7, iOS 4.0 and later */
            struct dyld_all_image_infos*  dyldAllImageInfosAddress;
            /* Mac OS X 10.7, iOS 4.2 and later */
            uintptr_t                     initialImageCount;
            /* Mac OS X 10.7, iOS 4.2 and later */
            uintptr_t                     errorKind;
            const char*                   errorClientOfDylibPath;
            const char*                   errorTargetDylibPath;
            const char*                   errorSymbol;
            /* Mac OS X 10.7, iOS 4.3 and later */
            uintptr_t                     sharedCacheSlide;
            /* Mac OS X 10.9, iOS 7.0 and later */
            uint8_t                       sharedCacheUUID[16];
            /* (macOS 10.12, iOS 10.0 and later */
            uintptr_t                     sharedCacheBaseAddress;
            uint64_t                      infoArrayChangeTimestamp;
            const char*                   dyldPath;
            mach_port_t                   notifyPorts[DYLD_MAX_PROCESS_INFO_NOTIFY_COUNT];
    #if __LP64__
            uintptr_t                     reserved[13-(DYLD_MAX_PROCESS_INFO_NOTIFY_COUNT/2)];
    #else
            uintptr_t                     reserved[12-DYLD_MAX_PROCESS_INFO_NOTIFY_COUNT];
    #endif
    };
    

    By looking at this structure afterwards, one can notice that other fields from this structure could have been quite useful for our purpose !

    However, we first have to find its address in memory. The function /usr/include/mach/task.h:task_info does exactly this, but uses a mach port, which is a kernel-provided inter-process communication mechanism. It's not exactly a syscall, but still, it's a little bit like cheating. I don't think there is any reliable way of doing it without (as of Yosemite at least).

    Phew! We are now able to get the base address of libsystem_c.dylib:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    static char *find_libc(void)
    {
            // Get DYLD task infos
            struct task_dyld_info dyld_info;
            mach_msg_type_number_t count = TASK_DYLD_INFO_COUNT;
            kern_return_t ret;
            ret = task_info(mach_task_self_,
                            TASK_DYLD_INFO,
                            (task_info_t)&dyld_info,
                            &count);
            if (ret != KERN_SUCCESS) {
                    return NULL;
            }
    
            // Get image array's size and address
            mach_vm_address_t image_infos = dyld_info.all_image_info_addr;
            struct dyld_all_image_infos *infos;
            infos = (struct dyld_all_image_infos *)image_infos;
            uint32_t image_count = infos->infoArrayCount;
            struct dyld_image_info *image_array = infos->infoArray;
    
            // Find libsystem_c.dylib among them
            struct dyld_image_info *image;
            for (int i = 0; i < image_count; ++i) {
                    image = image_array + i;
    
                    // Find libsystem_c.dylib's load address
                    if (strstr(image->imageFilePath, "libsystem_c.dylib")) {
                            return (char*)image->imageLoadAddress;
                    }
        }
    }
    

    Getting printf

    Right. So now we have the binary in memory, let's finally take a look at the Mach-O format. A good introduction has already been written here, so let's not dive in too deep and directly look for what interests us, accessing the symbol table. Thus, we are looking for the LC_SYMTAB command, which will give us the strtab and symtab offsets on which we will iterate to find printf.

    The process of looking up the name of a given entry in the lazy or non-lazy pointer tables looks like this:

    fatbinaries

    Analysing it with jtool gives us an overview on what we are supposed to find:

    1
    2
    3
    4
    5
    6
    p1kachu@OrangeLabOfSun:osx$ ./jtool.ELF64 -arch x86_64 -l libsystem_c.dylib
    [...]
    LC 05: LC_SYMTAB
            Symbol table is at offset 0x9da70 (645744), 2372 entries
            String table is at offset 0xa7708 (685832), 32264 bytes
    [...]
    

    However, the values recovered from memory are quite different:

    1
    2
    3
    P1kachu@GreyLabOfSteel:~/D/L/W/c/osx$ ./get_symcmd
    symoff: 0x134596ef
    stroff: 0x141ad9f4
    

    And then began the SIGSEGV ballet. Something was definitely off.

    The shared cache

    Let's take a step back in late 2009, with iOS 3.1. One change in the way iOS handled libraries was introduced by the mean of the Dyld shared cache, which combines all system (private and public) libraries into a big cache file to improve performance. On macOS, the same thing happened. The shared caches live in /private/var/db/dyld/ and regroups a lot of libraries (~400 for Yosemite and ~670 for Sierra, as for the x86_64 versions). The file format isn't documented and changes between versions, so we must trick a little bit. Some informations about it can be retrieved using jtool again:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    p1kachu@OrangeLabOfSun:osx$ ./jtool.ELF64 -h dyld_shared_cache_x86_64h_yosemite
    File is a shared cache containing 414 images (use -l to list)
    Header size: 0x70 bytes
    Got gap of -8 bytes:
    3 mappings starting from 0x68. 414 Images starting from 0xc8
    mapping r-x/r-x  251MB     7fff80000000 -> 7fff8fb31000      (0-fb31000)
    mapping rw-/rw-   38MB     7fff70000000 -> 7fff72604000      (fb31000-12135000)
    mapping r--/r--   75MB     7fff8fb31000 -> 7fff9466d000      (12135000-16c71000)
    DYLD base address: 7fff5fc00000
    Local Symbols:  0x0-0x0 (0 bytes)
    Code Signature: 0x16c71000-0x16e38a07 (1866247 bytes)
    Slide info:     0x16ba7000-0x16c71000 (827392 bytes)
            Slide Info version 1, TOC offset: 24, count 9732, entries: 6309 of size 128
    p1kachu@OrangeLabOfSun:osx$ ./jtool.ELF64 -h dyld_shared_cache_x86_64h_sierra
    File is a shared cache containing 675 images (use -l to list)
    Header size: 0x70 bytes
    Got gap of 40 bytes: 0xf8 0x00 0x00 0x00 0x00 0x00 0x5790 0x00 0x29d 0x00
    3 mappings starting from 0x98. 675 Images starting from 0xf8
    mapping r-x/r-x  424MB     7fff70000000 -> 7fff8a824000      (0-1a824000)
    mapping rw-/rw-   75MB     7fff8e824000 -> 7fff933a7000      (1a824000-1f3a7000)
    mapping r--/r--  118MB     7fff973a7000 -> 7fff9ea3c000      (1f3a7000-26a3c000)
    DYLD base address: 0
    Local Symbols:  0x0-0x0 (0 bytes)
    Code Signature: 0x26a3c000-0x26f14000 (5079040 bytes)
    Slide info:     0x1f3a7000-0x1f3b1000 (40960 bytes)
            Slide Info version 2, TOC offset: 4096, count 40, entries: 38702 of size 0
    

    Memory layout subtlety

    On Yosemite (and probably other versions that I didn't look at), the cache memory mapping differs from the file layout: as can be seen using jtool's output above, the TEXT mapping is after the DATA, while it is the opposite in the file layout. This was put back to normal between Yosemite and Sierra.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    DYLD SHARED CACHE MAPPINGS ON YOSEMITE *
    ========================================
    
    (*): Without ASLR slide
    
      ----------------------  0x7fff70000000
     |                      |
     |                      |
     |                      |
     |                      |
     |         RW-          |
     |                      |
     |                      |
     |                      |
     |----------------------| 0x7fff70000000 + [RW-].size
     |         Junk         |
     |----------------------| 0x7fff80000000
     |     Cache Header     |
     |----------------------|
     |                      |
     |         R-X          |
     |                      |
     |         ...          |
     |  libsystem_c.dylib   |
     |         ...          |
     |                      |
     |                      |
     |----------------------| 0x7fff80000000 + [R-X].size
     |                      |
     |                      |
     |                      |
     |         R--          |
     |                      |
     |                      |
     |                      |
     |                      |
      ----------------------  0x7fff80000000 + [R-X].size + [R--].size
    
      cache.base = [R-X].address + [R-X].size - [R--].offset
    

    Among these cached libraries is our libsystem_c, and thus we simply understand that the {str,sym}tabs offsets are from the beginning of the cache file.

    Finding it on Yosemite was not trivial without issuing syscalls, and I thus went for the stupid way: I first found the loaded library with the smallest load address (the first one contained in the shared cache), and got back into memory until finding the shared cache magic string (dyld_v1 x86_64\0).

    On Sierra, however, one can observe that the dyld_all_image_infos structure contains a nice field named sharedCacheBaseAddress. I used it to avoid memcmping more memory.

    With this, we can find the symtab, iterate over each of them and check the corresponding strings, looking for _printf.

    Conclusion

    The final code, compatible with at least Yosemite and Sierra, is available here.

    I may have skipped some informations. I read way too much from different sources to be able to put everything down. If anything is unclear, feel free to ping me by mail or twitter.

    Interesting auxilliary stuff

    Shared cache and ASLR

    The shared cache is loaded in memory at boot and is the same for every process. Even if affected by ASLR, it will not be re-randomized on a per program basis, and thus any program leaking addresses from it actually leaks system-wide addresses, which is nice!

    Links

    • Code for calling printf
    • Slides of the corresponding Lightning talk
    • Dynamic Symbol table duel - ELF vs Mach-O
    • Mach-O executables
    • /usr/include/mach-o/*

    Tweet
    Permalink & comments
  • One Device to drive them all

    Written by Pierre Surply
    2016-10-24 15:50:00

    Prologue

    Three Devices for logic analysis of passively captured traces,
    Seven for inter-chip communication driven by hardwired interfaces,
    Nine for in-circuit debugging limited to specific purpose,
    One for complex hardware hacking scenarios.

    Three tinkerers took those words as they are. Overthrown by the complexity implied by the multiplicity of inefficient tools, they thought that time had come to undertake this problem from another angle.

    All they needed was a simple way to manipulate the exotic devices that they required for their projects. Manufactured by foreign organizations, devices referred here were designed to fulfill a predefined purpose and were intended to be used as black boxes. Without any knowledge of the internal mechanisms involved in their operations, it was conceivable to integrate them if they were in the kind of environment that they were promised to.

    But those tinkerers though differently. Their situation was mostly complicated by the fact that they had already acquired a good control of their personal computers that they considered as their main and perfect workstation. Well defined and roughly understood, they were too stubborn to learn another way to work as they unanimously decided that this method was the most effective and compliant with the rest of their work.

    So instead of reworking there methodology, they agreed that defining a third device whose only purpose was to handle the interfacing between the workstation and the device under test were inescapable. The first member of the group asked to others what options were available to fit this position.

    The second one said that he already made an intensive usage of the Arduino for that. Providing an easy access and control of its GPIO and some hardwired bus controllers, it was suitable for the most simple cases.

    The third one discussed the merits of the Bus Pirate from Dangerous Prototype. Mature and widely-used, this tool provided a direct control of its interface via USB without the need to develop a specific firmware to be actually used.

    The first one replied to these proposals that they had a common issue: they simply performed the communication with the host by using an interface based on the translation of USB to UART speeded at 115200 bauds. For him, it prohibited a fine-grained configuration and then limited the full capacities provided by the USB protocol.

    They all agreed on this last point and started to work on a first prototype of their response to this situation.

    It was based on a STM32F072 microcontroller and mapped SPI, I2C, UART and CAN signals to physical headers. As this chip was able to drive USB signals, a USB mini-connector was directly connected to it.

    Concerning the software side, one interesting idea here was to expose the hardware interfaces using the corresponding subsystem in the Linux kernel. Even though these subsystems were mostly used to describe on-chip interfaces, adapting them to wrap up the USB functions was feasible. For instance, the SPI exposed by the device could be manipulated as a regular spidev.

    Although the concept of such board was appealing at the time, limitations quickly appeared. First of all, most of the USB protocol had to be implemented via software on the STM32F072 which led to a significant overhead on each USB transaction. Secondly, fully implementing the host driver in kernel space implied a rigid configuration and error-prone if not implemented correctly. Finally, the global stability of the STM32F072 MCU was quite poor especially during a development phase where on-chip debugging had to be frequently used.

    One year passed and no one was actually enthusiastic to use this dead-born project in a real context. The first one, whose credibility was at its lower point, got the bravery to propose to the two others to rethink the project from the beginning. And they accepted, against all odds.

    This write-up must be considered as the collection of thoughts that led them to the design and the manufacture of a second version of this small, unpretentious, and unfinished electronic board.

    Chapter I: Forging the One Device

    The first step for them was to clearly define how and what could make the second version of the board better than the previous one. The main issue was related to the lack of flexibility of the design and they wondered how they could handle a protocol not supported by the microcontroller they used.

    Then they decided to take a look at the wide range of Programmable Logic Devices available nowadays. As a first prototype, a CPLD appeared to be the best choice for such application. Compared to a regular FPGA, these non-volatile PLD were cheaper and required a much more simpler configuration circuit. They also thought that the prototype was designed to only prove a concept and moving to a more powerful FPGA for next versions was conceivable.

    Section I: From Ink...

    From a high-level point of view, the board had been specified to expose a reasonable number of IOs directly connected to a controller, here an Altera Max V CPLD. As the flaky soft USB implementation of the previous version was quite inconvenient to maintain and to keep reliable, the job here had been assigned to a well-known and solid dedicated USB controller: the FX2LP from Cypress Semiconductor. This highly integrated USB 2.0 microcontroller implemented most of the protocol logic in silicon and only burdened its integrated 8051's firmware with the high-level configuration aspect of USB.

    And then came the question about the communication between the USB controller and the IO controller. The FX2LP embedded a powerful mechanism to forward the content of a USB entrypoint to an hardware FIFO without any interaction with the internal 8051. These EP buffer's words could then be dequeued by an external component using an hardware interface.

    However, this one was defined by a 16-bit data bus and 6 control signals which was quite pin-consuming for the CPLD they chose. Fortunately, another mechanisms offered by the FX2LP allowed the programming of a custom protocol to transmit and receive these data with the external world: the General Programmable Interface. As for the regular FIFO interface, this hardware unit was almost completely independent from the 8051. The firmware was only responsible to program the hardware state-machines used to represent the waveforms of a one-word transmission.

    In their case, they chose to allocate 8 wires for the bidirectional data bus, 3 control signals driven by the USB controller and 2 'ready' signals initiated by the IO controller. At that point, none of them had actually thought about the exact shape of the waveforms and the purpose of the control signals but planned to consider that once the first board would be fully manufactured.

    The USB device interface was composed of 3 endpoints. The endpoint 0 acted as a regular control endpoint and was used to transfer small requests. Meanwhile, endpoints 2 and 6 were dedicated to bulk transmissions and receptions respectively. The two last were directly connected to the internal FIFO while the first one was completely handled by the 8051.

    To power these components, the 5V supplied by the USB were firstly shifted to 3.3V using a low-dropout voltage regulator to power the USB controller and the IO banks of the CPLD while a 1.8V regulator powered the CPLD's internal logic.

    The main clock was managed by the FX2LP. Connected to a 24MHz crystal, the internal PLL were configured by the 8051 firmware allowing a CPU clock frequency of 48MHz, 24MHz or 12MHz. As the output of the phase-locked loop was also exposed outside the USB controller by the CLKOUT pin, the CPLD used it as a system clock.

    The GPIF unit had a dedicated clock that could be fed internally or imposed by an external device. All operations on this interface were aligned to this signal. In order to avoid to deal with multiple clock domains in the CPLD, they arranged to drive the IFCLK signal from the IO controller at the half frequency of the system clock.

    An I2C EEPROM had been connected to USB controller in order to store its firmware in a persistent way. The internal reset logic of the FX2LP was designed to scan the I2C bus for EEPROM from where a valid firmware could be loaded. Once the program was fully copied to internal RAM, no operations were performed on this bus.

    After several tries, they finally validated the following schematic:

    Section II: ...To Copper

    Once the design approved, the next step consisted to draw the printed circuit board. Two layers were enough to route the entire netlist in a surface of 5x5cm.

    The top layer was dedicated to voltage regulation, CPLD, connectors and a couple of switches and LEDs. Meanwhile, the bottom one contained the whole circuit required to make the USB controller working: crystal, EEPROM, I2C pull-up resistors, ...

    IOs from the CPLD were exposed via 2 dual-row 20-pin female headers of 2.54mm pitch.

    As the board was manually soldered, it was not conceivable for them to use BGA components for this prototype. So the 100-pin LQFP version of the CPLD had been used as well as the 56-pin SSOP package of the Cypress's chip.

    After hours of painful electrical tests, a first sample of a fully soldered board was born by the end of the Spring:

    Chapter II: On Reprogrammability They Hoped

    Although the physical board was ready, a firmware was still needed to make it working. The situation was more complex than just a simple binary located in a single ROM as most of the boards of this category are.

    First of all, the firmware for the FX2LP had been implemented which basically consisted to configure the USB and the GPIF units of the chip. Nothing uncommon here: writing applications for this kind of microcontroller was quite easy as it was well-documented and that tons of similar usages of this chip already existed and were publicly available. The code has been written in a couple of hours and no new features have been added since as they decided to make the firmware serving only one unique purpose: translate USB data to IO controller in the most simple and lightweight way.

    For them, most of the customizations that would be needed should be fully-implemented at the IO controller level. The real challenge here was to take advantage of the CPLD as a powerful and programmable IO controller.

    One solution would be to base the CPLD's design on a soft-processor: modifying IO's behaviour would mean loading a new firmware into its RAM. Although this architecture was quite common when using an FPGA, it became more inconvenient when basing it on a CPLD due to the lack of memory blocks.

    The second solution would be to generate and configure the design of the CPLD according to the user's needs dynamically. As pursuing this concept using a regular hardware description language seemed almost impossible for them, they decided to fully base the design generation on Migen. This python module allowed the meta-programming of synchronous register transfer level design and handled the generation of a verilog file that could then be synthesised by the regular Altera's toolchain.

    Section I: Modularity And Modulation

    They fully defined the architecture around the concept of modularity. To demonstrate how it would transpire in a real context, they took the example of a Pulse-Width Modulation interface.

    The main principal of such technique was to use a rectangular pulse wave whose pulse width was modulated resulting in the variation of the average value of the waveform.

    A possible implementation of a PWM module could be achieved by using a counter whose width defined the period of the signal and a digital comparator to generate the needed duty cycle.

    In this case, the only signal that was likely exposed externally would be the output of the comparator, negated or not. Moreover, a 'parameter' of this circuit would be the left-input of the comparator and was typically the kind of signal that would be interesting to implement as a register writable from the host.

    For their example, they also considered that the counter value could be watched from the host.

    The 'parameter' signals were called 'Control Registers' and were intended to be readable and/or writable from the host while the signals that would be eligible to be mapped to a physical pin of the CPLD were called 'IO Signals'.

    In a more generic way, this kind of module, that they called 'IO Module', could always be represented according to the following template:

    • An internal logic block that could contain both combinational and sequential logic left to IO Module's discretion.

    • 'Control Registers' connected to an internal bus and used to watch and control the activity of the internal logic from the host.

    • 'IO Signals' intended to interact with an external component and to be mapped to real pin.

    Imposing such kind of interface also meant imposing a huge, redundant and overblown part of HDL code only to ensure the glue logic between the core logic of the module and the rest of the design. This was where meta-programming became appropriated.

    A python module called bmii had been developed to extend the structures provided by Migen. For instance, an extension of the 'Module' objects was included in this library to add all facilities needed to generate the intended glue logic.

    1
    2
    3
    from bmii import *
    
    iom = IOModule("pwm")
    

    This object contained the cregs special attribute which was used to manage the control registers of the IOModule. CtrlReg was charged to construct a special 8-bit width Migen's Signal which embedded extra information needed to build the control registers network. The direction of such register had to be manually specified during instantiation. It could be:

    • RDONLY: Only readable from the host. The signal had to be driven by the internal logic of the IOModule.
    • WRONLY: The signal could only be latched from the host but could not read it back. This direction was useful to suggest the toolchain to synthesise this signal as a wire instead of a verilog's reg.
    • RDWR: The signal could be read and written from the host. Synthesis of this kind of signal would likely result to verilog's reg.

    For the PWM IOModule, only the pulse's WIDTH and the COUNTER signals had to be accessed from the host.

    1
    2
    iom.cregs += CtrlReg("WIDTH", CtrlRegDir.RDWR)
    iom.cregs += CtrlReg("COUNTER", CtrlRegDir.RDONLY)
    

    In the same way, iosignals attribute handled the signals intended to be mapped to physical pins. An IOSignal always correspond to a 1-bit width signal. The direction of an IOSignal was also needed to be explicitly specified.

    • OUT: Signal driven by the IOModule.
    • IN: Signal driven by an external component and read by the IOModule's logic.
    • DIRCTL: Signal driven by the IOModule and used to control the tri-state buffer of a pin.

    The PWM only used two outputs:

    1
    2
    iom.iosignals += IOSignal("OUT", IOSignalDir.OUT)
    iom.iosignals +=  IOSignal("NOUT", IOSignalDir.OUT)
    

    Finally, the internal logic could be described by using Migen's special attributes:

    1
    2
    3
    iom.sync += iom.cregs.COUNTER.eq(iom.cregs.COUNTER + 1)
    iom.comb += iom.iosignals.OUT.eq(iom.cregs.COUNTER < iom.cregs.WIDTH)
    iom.comb += iom.iosignals.NOUT.eq(~iom.iosignals.OUT)
    

    Section II: An Iron Hand In A Velvet Glove

    The concept of control register was illustrated and justified. Their aim was then to think about how to make them accessible from the host by using USB.

    Concretely, this step meant defining a unit that would be able to translate GPIF waveforms to a more convenient protocol to drive the internal bus. This unit had been called 'Northbridge'.

    The internal bus had been defined as follow:

    • MOSI[0:7] and MISO[0:7] represented the both directions of the data bus.
    • WR distinguished a read or a write operation.
    • MADDR[0:2] and RADDR[0:4] were used to generate the chip select signal for a module and a control register respectively.
    • REQ informed the control register that an operation was going to be performed.

    The issue here was related to the fact that the GPIF data bus had exactly the same width that a control register. This meant that the addressing and the read/write operations on the internal bus could not be achieved in a single clock tick.

    From the GPIF point of view, performing an operation on the internal bus meant sending the module/control register address (latched by the Northbridge) before proceeding to the actual read/write operation.

    The northbridge managed the GPIF's control signals as follow:

    • CTL0 and CTL1 were basically forwarded to the REQ and WR signals of internal bus respectively.
    • CTL2 was used to indicate that the USB controller was latching an address and that the current operation must not be considered as a regular write operation.

    The northbridge was polling for operation by checking the value of the CTL0 signal when clocking the interface clock.

    In addition of containing a value, control registers were generated with extra signals used to represent the operation currently performed on it and then facilitated their usage from the internal logic.

    The wr and rd signals indicated that the control register was selected and that a write or read operation respectively was going to be performed. These signals were asserted during several clock ticks as they were directly forwarded by the northbridge from the GPIF. So to facilitate the use of them in a synchronous circuit, wr_pulse and rd_pulse were derived from the previous signals. By using a 'level to pulse' state machine, wr_pulse were implemented to be asserted during exactly one clock tick when the write operation was completed and then indicated to the internal logic that a valid value was available in the register. In a meantime, rd_pulse pulsed the beginning of the read operation to inform the IOModule that the control register was going to be read and then gave it time to feed a correct value before the next falling edge of rd signal, moment when its value was actually captured by the northbridge.

    At that point, any control register could be accessed from the host using the correct USB request. In order to make the usage of the USB easier from the host point of view, an additional interface had been introduced: the BMIIModule.

    A python object of this type contained two special attributes: the first one was the IOModule which represented the RTL design while the second was called the driver of the BMIIModule. Automatically created, the drv attribute was able to inspect the IOModule to generate the correct USB request according to the information specified in the RTL about the control registers addresses and directions.

    1
    pwm = BMIIModule(iom)
    

    To finalize the generation of the IO controller design, the BMII object acted as a top-level representation of the whole design of the board. It must be informed that a new module had to be added by using its add_module method.

    A call to this procedure meant connecting the IOModule to the internal bus, allocating module and control registers addresses.

    1
    2
    b = BMII()
    b.add_module(pwm)
    

    Once the CPLD configured, the host could easily accessed the control registers by simply setting the attributes of the drv aliased with the control registers names:

    1
    2
    pwm.drv.WIDTH = 42
    cnt = int(pwm.drv.COUNTER)
    

    Section III: The Signal Goes South

    In the same way the northbridge managed the communication with the external USB controller, a other dedicated unit had been defined to handle the multiplexing of the IOSignals to physical IO pins. Obviously called the southbridge, it was implemented as a special IOModule which had no IOSignals and was only charged to manage the signals coming from other modules. For each physical pin, the southbridge was charged to generate the following circuit:

    Each pin was considered bidirectional and the direction could be configured with an IOSignal defined as such. An unlimited number of signals could read the value of a pin while only one could drive it.

    To inform the southbridge that an IOSignal had to be connected to a pin, assignment to pins attribute of this unit had to be performed as follow:

    1
    b.ioctl.sb.pins.LED0 += pwm.iomodule.iosignals.OUT
    

    The direction declared during the definition of the IOSignal were used to determine where the signal had to be connected on the pin multiplexing circuit.

    As the southbridge was considered as a regular IOModule, it was connected to the internal bus and then exposed its own control registers. This opportunity was leveraged to make the pins controllable from host bypassing the need of defining a specific IOModule when a simple operation had to be performed on the IOs.

    PINDIR, PINDIRMUX, PINOUT, PINMUX and PINSCAN signals of each pin were accessible using southbridge's control registers. For instance, making the LED blinked could be commanded by:

    1
    2
    3
    b.modules.southbridge.drv.PINMUXMISC.LED1 = 1       # Make the southbridge drive the LED0 pin
    b.modules.southbridge.drv.PINOUTMISC.LED1 = \
        int(b.modules.southbridge.drv.PINSCANMISC.LED1) ^ 1     # Toggle the LED0 pin
    

    For the example design previously defined, a complete mapping of the internal bus's address space looked as follow:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    b.list_modules()
    -- 
    0x0: northbridge
        0x0: IDCODE (CtrlRegDir.RDONLY)
        0x1: SCRATCH (CtrlRegDir.RDWR)
    0x1: southbridge
        0x0: PINDIR1L (CtrlRegDir.RDWR)
        0x1: PINDIR1H (CtrlRegDir.RDWR)
        0x2: PINDIR2L (CtrlRegDir.RDWR)
        0x3: PINDIR2H (CtrlRegDir.RDWR)
        0x4: PINSCAN1L (CtrlRegDir.RDONLY)
        0x5: PINSCAN1H (CtrlRegDir.RDONLY)
        0x6: PINSCAN2L (CtrlRegDir.RDONLY)
        0x7: PINSCAN2H (CtrlRegDir.RDONLY)
        0x8: PINSCANMISC (CtrlRegDir.RDONLY)
        0x9: PINMUX1L (CtrlRegDir.RDWR)
        0xa: PINMUX1H (CtrlRegDir.RDWR)
        0xb: PINMUX2L (CtrlRegDir.RDWR)
        0xc: PINMUX2H (CtrlRegDir.RDWR)
        0xd: PINDIRMUX1L (CtrlRegDir.RDWR)
        0xe: PINDIRMUX1H (CtrlRegDir.RDWR)
        0xf: PINDIRMUX2L (CtrlRegDir.RDWR)
        0x10: PINDIRMUX2H (CtrlRegDir.RDWR)
        0x11: PINMUXMISC (CtrlRegDir.RDWR)
        0x12: PINOUT1L (CtrlRegDir.RDWR)
        0x13: PINOUT1H (CtrlRegDir.RDWR)
        0x14: PINOUT2L (CtrlRegDir.RDWR)
        0x15: PINOUT2H (CtrlRegDir.RDWR)
        0x16: PINOUTMISC (CtrlRegDir.RDWR)
    0x2: PWM
        0x0: WIDTH (CtrlRegDir.RDWR)
        0x1: COUNTER (CtrlRegDir.RDONLY)
    

    The northbridge used two control registers defined for testing purposes only. The IDCODE contained a magic number read by the USB controller to verify the validity of the CPLD's configuration while the SCRATCH register was used to test write operations on the bus.

    To sum up, the following architecture had been defined as the basis for further improvements:

    Section IV: An Autarchical Sequence

    As this architecture was mainly based on the flexibility provided by the CPLD, one issue still remained before becoming truly usable: the compiling and programming sequences of a BMII's design had to stay self-contained and to avoid the need of external hardware tools.

    The building sequence aimed to produce the binary blob of the USB firmware as well as the bitstream of the IO controller. For the FX2LP, a ninja build file was generated to proceed to the compiling of the custom firmware using sdcc.

    Concerning the IO controller, the verilog generation was left to Migen while the building of the bitstream was ensured by Quartus.

    1
    b.build_all()
    

    The programming sequence was a bit more tricky. A first and trivial way to achieve this was to use a USB Blaster JTAG probe to configure the CPLD with the desired bitstream. In order to be self-programmed, the CPLD's JTAG signals had been connected to a tri-state buffer in addition to the regular 10-pin JTAG header. Ensured by a standard 74244, this buffer was driven by the USB controller. The goal of this circuit was to give the ability to communicate with the CPLD via JTAG when the JTAGE was asserted.

    To be able to reuse Quartus Programmer software to program the CPLD, the open-source implementation of the USB Blaster protocol for FX2LP (ixo.de USB JTAG) had been adapted to match the wiring of their circuit.

    1
    b.program_all()
    

    The programming sequence could be summarize as follow:

    • The first step was to load the custom USB Blaster firmware into the USB controller using fxload.
    • If a JTAG IDCODE scan was successful, the bitstream was uploaded using Quartus Programmer.
    • To be able to write their own FX2LP firmware to the EEPROM, a second stage firmware loader was programmed in the chip. It added a new USB vendor command allowing writing operations on the I2C bus.
    • Finally, the regular firmware was loaded in the USB controller.

    Chapter III: The Fellowship Of The Joint Test

    As a first application of there board, the second tinkerer proposed to implement a full-featured JTAG probe that anyone could use as an alternative to Flyswatter, Bus Blaster or any other cheap JTAG probe.

    The JTAG defines an electrical standard for on-chip instrumentation by using a dedicated debug port implementing a serial communication interface. This protocol was well-defined and simple enough to be used as a comprehensive example.

    The third one replied that demonstrating the usefulness of their project by trying to mimic other well-known and mature JTAG probes was a waste of time since reaching comparable performance would required more effort that he could imagine at the time.

    The first tinkerer mitigated that argument by pointing the fact that no cheap JTAG probe was generic enough to be compatible with a very wide range of platforms and very few of them were designed to be used in contexts other than just CPU's on-chip debugging. He agreed and started to think about a possible implementation of such protocol using their project.

    Section I: The Bridge Of Shockley

    Even though the JTAG standard was quite strict about the communication logic, the electrical characteristics of the signals were left to the target device. This meant that the probe had the responsibility to drive them with the target voltage.

    Assuming that the main board was only able to drive 3.3V IOs, expanding it with the needed interface was required.

    A first version had been implemented using voltage level shifters and worked well with some mainstream devices. However, some platforms from specific manufacturers pull-up JTAG signals with very low resistors, which forced the probe to drive more current than most of the voltage level shifters could supply.

    As a quick fix, the expansion board had been equipped with bipolar junction transistors for output signals.

    In a more generic way, they though that being forced to design expansion board to electrically convert signals from the main board to the driven target was not a big deal. Main board's IO could simply not be electrically universal.

    Section II: The Self-Surgery

    For a naive implementation of JTAG protocol, the IOModule consisted of simply connecting the TMS and TDI outputs to a write-only control register while wiring the TCK to its wr_pulse signal. In this configuration, each JTAG clock tick was triggered by writing to this control register.

    Each devices on a JTAG's daisy chain communicated via a Test Access Port. This hardware unit implemented a stateful protocol to expose its debug facilities. As it was possible to make all of them converged to a reset and stable state, it was easy to walk though this state machine by keeping all TAPs synchronized.

    Assuming this, a unique state machine was implemented in the IOModule to keep the track of the current TAP state. A control register had been allocated to allow the host to check this state when needed.

    Devices responded to JTAG scans with the TDO signal. The FIFO block was used to buffer received data before being read by the host thought a read-only register. This case perfectly demonstrated the usage of the rd_pulse signal since it was used to dequeue the next value of the FIFO submodule.

    Although most platforms's JTAG daisy chain were short and fixed, some of them could dynamically append TAP to the chain, making the usage of general purpose JTAG tools unusable. To describe this kind of situation, facilities had been implemented to describe a dynamic TAP network.

    1
    from bmii.modules.jtag import JTAG, TAP, DR
    

    A JTAG object extended a regular BMIIModule to abstract the low-level operations to the JTAG's IOModule.

    TAP and DR were provided to describe the current layout of the TAP network. For instance, describing the Max V's JTAG would look like this:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    class AlteraMaxVJTAG(JTAG):
        def __init__(self):
            JTAG.__init__(self)
    
            tap = TAP("CPLDTAP", 10)    # 10-bit instrwuction register
    
            #          name             instr. reg.   length
            tap += DR("SAMPLE/PRELOAD", 0b0000000101, 480)
            tap += DR("EXTEST",         0b0000001111, 480)
            tap += DR("BYPASS",         0b1111111111, 1)
            tap += DR("USERCODE",       0b0000000111, 32)
            tap += DR("IDCODE",         0b0000000110, 32)
            tap += DR("HIGHZ",          0b0000001011, 1)
            tap += DR("CLAMP",          0b0000001010, 32)
            tap += DR("USER0",          0b0000001100, 32)
            tap += DR("USER1",          0b0000001110, 32)
    
            self.add_tap(tap)
    
            @classmethodw
            def default(cls, bmii):
                jtag = cls()
                bmii.add_module(jtag)
    
                bmii.ioctl.sb.pins.IO10 += jtag.iomodule.iosignals.TMS
                bmii.ioctl.sb.pins.IO11 += jtag.iomodule.iosignals.TCK
                bmii.ioctl.sb.pins.IO12 += jtag.iomodule.iosignals.TRST
                bmii.ioctl.sb.pins.IO13 += jtag.iomodule.iosignals.TDI
                bmii.ioctl.sb.pins.IO21 += jtag.iomodule.iosignals.TDO
    
                return jtag
    

    According to that description, scanning the IDCODE of the device could be simply done by:

    1
    2
    3
    4
    5
    b = BMII()
    jtag = AlteraMaxVJTAG.default(b)
    
    jtag.reset()
    jtag.irdrscan("CPLDTAP", "IDCODE")
    

    A possible improvement for this would be to generate this tap network directly from the BSDL files of daisy chained devices. The usage of BJT to drive JTAG signals was also a very quick and easy response to the low pull-up resistance problem. The third tinkerer complained that many other solutions could be implemented there as the BJT had a very long switching time and then forced to drive signals at 12MHz when many targets supported to be clocked up to 100MHz in their debug port.

    Chapter IV: And In Darkness Bind Them

    Sceptical about the results of the first application, the third tinkerer thought about a niche application that only few people would actually need. Enthusiastic but upset by the pragmatism of the two other, he left the group to develop his idea by his own.

    For him, a second purpose for this board was purely and simply to act as a test bench for analysing black-boxed devices. To demonstrate his idea, he chose the first device he could found on his drawer: a Z80 packaged in a DIP-40.

    Primary sold by Zilog as an improved Intel 8080, it had become a very popular processor for simple embedded applications since it was truly easy to make this chip working with a custom circuit. This device was then the perfect guinea pig for his experiences.

    Section I: The Calm Before The Storm

    Before trying to blow up the chip, defining the RTL needed to correctly drive the CPU was necessary.

    1
    iom = IOModule("Z80TB")
    

    The DIP-40 version of this CPU exposed a 16-bit address bus and a 8-bit data bus. As the last one was bidirectional, three different IOSignals had to be defined: DIN, DOUT and DDIR. In order to keep the main board and the device under test synchronized, the CPU's clock was managed by the IOModule. All other required control signals were defined as IOSignals.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    ADDRESS_WIDTH = 14 # Truncated, actually 16.
    DATA_WIDTH = 8
    
    iom.iosignals += IOSignal("CLK", IOSignalDir.OUT)
    iom.iosignals += IOSignal("_M1", IOSignalDir.IN)
    iom.iosignals += IOSignal("_MREQ", IOSignalDir.IN)
    iom.iosignals += IOSignal("_IOREQ", IOSignalDir.IN)
    iom.iosignals += IOSignal("_RD", IOSignalDir.IN)
    iom.iosignals += IOSignal("_WR", IOSignalDir.IN)
    iom.iosignals += IOSignal("_WAIT", IOSignalDir.OUT)
    iom.iosignals += IOSignal("_HALT", IOSignalDir.IN)
    iom.iosignals += IOSignal("_RESET", IOSignalDir.OUT)
    iom.iosignals += IOSignal("_RFSH", IOSignalDir.IN)
    
    for i in range(ADDRESS_WIDTH):
        iom.iosignals += IOSignal("A{}".format(i), IOSignalDir.IN)
    
    oe = Signal()
    for i in range(DATA_WIDTH):
        iom.iosignals += IOSignal("DIN{}".format(i), IOSignalDir.IN)
        iom.iosignals += IOSignal("DOUT{}".format(i), IOSignalDir.OUT)
        iom.iosignals += IOSignal("DDIR{}".format(i), IOSignalDir.DIRCTL)
        iom.comb += getattr(iom.iosignals,"DDIR{}".format(i)).eq(oe)
    

    From the host point of view, the only reasonable access points was the information about the state of the CPU, the address it was accessing and the data it transferred.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    iom.cregs += CtrlReg("STATE", CtrlRegDir.RDONLY)
    
    iom.cregs += CtrlReg("DIN", CtrlRegDir.RDONLY)
    for i in range(DATA_WIDTH):
        iom.comb += iom.cregs.DIN[i].eq(getattr(iom.iosignals, "DIN{}".format(i)))
    
    iom.cregs += CtrlReg("DOUT", CtrlRegDir.WRONLY)
    for i in range(DATA_WIDTH):
        iom.comb += getattr(iom.iosignals, "DOUT{}".format(i)).eq(iom.cregs.DOUT[i])
    
    iom.cregs += CtrlReg("ADDRL", CtrlRegDir.RDONLY)
    iom.cregs += CtrlReg("ADDRH", CtrlRegDir.RDONLY)
    for i in range(ADDRESS_WIDTH):
        if i < 8:
            addr = iom.cregs.ADDRL
        else:
            addr = iom.cregs.ADDRH
        iom.comb += addr[i % 8].eq(getattr(iom.iosignals, "A{}".format(i)))
    

    A special control register had been added to perform special control operations on the CPU. It was mainly used to manually control the RESET signal forcing the reset of the chip from any CPU state.

    1
    2
    3
    4
    iom.cregs += CtrlReg("CTL", CtrlRegDir.RDWR)
    iom.cregs.CTL[0] = "RESET"
    
    iom.comb += iom.iosignals._RESET.eq(~iom.cregs.CTL.RESET)
    

    The clock signal of the Z80 had been fixed to half the frequency of the system clock. Due to clocking requirement of the chip, this signal was fixed to 8MHz.

    1
    iom.sync += iom.iosignals.CLK.eq(~iom.iosignals.CLK)
    

    Requests from the Z80 CPU followed 3 stages. When it was not halted, the testbench entered an IDLE state. During this one, the CPU was still performing operations internally but did not request any external resources.

    The second stage followed a request detection. The goal here was to freeze the CPU execution until the host provided an instruction to the testbench about how to handle the request.

    Finally, the last stage meant actually responding to CPU's request according to host instructions.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    from enum import IntEnum
    
    class Z80State(IntEnum):
        UNKNOWN = 0b00000000
        IDLE    = 0b00000001
        FETCH   = 0b00000010
        MEMRD   = 0b00000100
        MEMWR   = 0b00001000
        IORD    = 0b00010000
        IOWR    = 0b00100000
        HALTED  = 0b01000000
    

    To implement this state machine in the RTL, Migen provided a facilities to define FSM in its generic library:

    1
    2
    3
    4
    from migen.genlib
    
    fsm = FSM()
    iom.submodules += fsm
    

    According to Z80 waveforms, the request for bus access was asserted using _MREQ or _IOREQ. During the request initiation, _RD, _WR and address bus are driven and valid.

    When living the IDLE state, the testbench could determined what kind of request was going to be performed and could notified the host about that.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    fsm.act("IDLE",
        iom.cregs.STATE.eq(Z80State.IDLE),
        If(~iom.iosignals._HALT, NextState("HALTED")).\
        Else(
            If(~iom.iosignals._MREQ & iom.iosignals._RFSH,
                If(~iom.iosignals._RD,
                    If(~iom.iosignals._M1, NextState("FETCH")).\
                    Else(NextState("MEMRD"))).\
                Elif(~iom.iosignals._WR, NextState("MEMWR"))).\
            Elif(~iom.iosignals._IOREQ,
                If(~iom.iosignals._WR, NextState("IOWR")).\
                Elif(~iom.iosignals._RD, NextState("IORD")))))
    
    fsm.act("HALTED",
        iom.cregs.STATE.eq(Z80State.HALTED),
        If(iom.iosignals._HALT, NextState("IDLE")))
    

    While waiting for an answer from the host, the trick here was to assert the _WAIT input of the CPU in order to notify it that bus cycle could not be completed at that moment. This left enough time for the host to communicate its desired operation. To finalize a write operation, the host just had to read from the WRITE register. Completed a read operation was performed by writing to READ control register.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    bus_access = Signal()
    iom.comb += iom.iosignals._WAIT.eq(~bus_access)
    
    def goto_rd():
        return If(iom.cregs.DOUT.wr_pulse, NextState("READ"))
    
    def goto_wr():
        return If(iom.cregs.DIN.rd_pulse, NextState("WRITE"))
    
    fsm.act("FETCH",
        iom.cregs.STATE.eq(Z80State.FETCH),
        bus_access.eq(1),
        goto_rd())
    
    fsm.act("MEMRD",
        iom.cregs.STATE.eq(Z80State.MEMRD),
        bus_access.eq(1),
        goto_rd())
    
    fsm.act("MEMWR",
        iom.cregs.STATE.eq(Z80State.MEMWR),
        bus_access.eq(1),
        goto_wr())
    
    fsm.act("IORD",
        iom.cregs.STATE.eq(Z80State.IORD),
        bus_access.eq(1),
        goto_rd())
    
    fsm.act("IOWR",
        iom.cregs.STATE.eq(Z80State.IOWR),
        bus_access.eq(1),
        goto_wr())
    

    To finally complete the bus cycle after intervention from the host, the data bus just had to be driven in the corresponding direction:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    def goto_idle():
        return If(iom.iosignals._MREQ & iom.iosignals._IOREQ, NextState("IDLE"))
    
    fsm.act("READ",
        iom.cregs.STATE.eq(Z80State.IDLE),
        oe.eq(1),
        goto_idle())
    
    fsm.act("WRITE",
        iom.cregs.STATE.eq(Z80State.IDLE),
        goto_idle())
    

    Section II: The Gates Open

    Once the testbench logic defined, the BMIIModule could then be integrated to a final BMII design:

    1
    2
    3
    4
    z80tb = BMIIModule(iom)
    
    b = BMII()
    b.add_module(z80tb)
    

    The actual wiring to the tested Z80 looked as follow. Due to the lake of physical IO pins on the main board, the two last pins of the address bus had been ignored.

    The southbridge had to be informed to this configuration. Any changes on the physical circuit only implied rerouting of the testbench's IOModule on the southbridge unit:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    b.ioctl.sb.pins.IO28 += iom.iosignals._RESET
    b.ioctl.sb.pins.IO29 += iom.iosignals._WAIT
    b.ioctl.sb.pins.IO2A += iom.iosignals.CLK
    b.ioctl.sb.pins.IO2B += iom.iosignals._M1
    b.ioctl.sb.pins.IO2C += iom.iosignals._MREQ
    b.ioctl.sb.pins.IO2D += iom.iosignals._IOREQ
    b.ioctl.sb.pins.IO2E += iom.iosignals._RD
    b.ioctl.sb.pins.IO2F += iom.iosignals._WR
    b.ioctl.sb.pins.IO1F += iom.iosignals._HALT
    b.ioctl.sb.pins.IO1E += iom.iosignals._RFSH
    
    for i in range(ADDRESS_WIDTH):
        pin = getattr(b.ioctl.sb.pins, "IO1{}".format(hex(i)[2:].upper()))
        pin += getattr(iom.iosignals, "A{}".format(i))
    
    for i in range(DATA_WIDTH):
        pin = getattr(b.ioctl.sb.pins, "IO2{}".format(i))
        pin += getattr(iom.iosignals, "DIN{}".format(i))
        pin += getattr(iom.iosignals, "DOUT{}".format(i))
        pin += getattr(iom.iosignals, "DDIR{}".format(i))
    

    Section III: La Grande Illusion

    As the IO controller design was completed, the host driver had to be completed in order to define the exact behaviour of the testbench.

    For this example, the goal was to be able to execute a very short piece of code on the connected Z80. The content of the main memory had been defined as:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    def ld_hl_nn(nn):
        return [0x2A, nn & 0xFF, (nn >> 8) & 0xFF]
    
    def ld_b_n(n):
        return [0x06, n]
    
    def ld_c_n(n):
        return [0x0E, n]
    
    def otir():
        return [0xED, 0xB3]
    
    def halt():
        return [0x76]
    
    
    from itertools import chain, islice, repeat
    
    s = "LSE"
    instrs = chain(
        # Instructions
        ld_hl_nn(0x000A),    # 0000 - Load string address
        ld_b_n(len(s)),      # 0003 - Load string length
        ld_c_n(0),           # 0005 - Set IO port address
        otir(),              # 0007 - Output the string
        halt(),              # 0009 - Halt the CPU
    
        # Data
        [0x0C, 0x00],        # 000A - String address
        [ord(c) for c in s], # 000C - String content
    
        # Padding
        repeat(halt())       # Fill the rest of the memory
                             # with HALT instruction
    )
    
    mem = list(islice(instrs, 256))
    

    The only job of the host was to poll the STATUS register and to reply by reading from the DIN control register or by writing to DOUT according to the CPU's request.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    recvbuff = ""
    
    # Reset the CPU by pulsing the _RESET signal
    z80tb.drv.CTL.RESET = 1
    z80tb.drv.CTL.RESET = 0
    
    while True:
        state = int(z80tb.drv.STATE)
        print("{}   \t-- Addr: {:04x}".format(str(Z80State(state)),
                (int(z80tb.drv.ADDRH) << 8) | int(z80tb.drv.ADDRL)),
             end='')
    
        # Emulate main memory reading
        if (state in [Z80State.FETCH, Z80State.MEMRD]):
            z80tb.drv.DOUT = mem[int(z80tb.drv.ADDRL)]
    
        # Emulate main memory writing
        elif (state == Z80State.MEMWR):
            mem[int(z80tb.drv.ADDRL)] = int(z80tb.drv.DIN)
    
        # Emulate reading from device
        elif (state == Z80State.IORD):
            z80tb.drv.DOUT = 0xFF
    
        # Emulate writing to device
        elif (state == Z80State.IOWR):
            data = int(z80tb.drv.DIN)
            recvbuff += chr(data)
            print(" | Data: {:02x} ({})".format(data, chr(data)), end='')
    
        # Stop main loop when CPU reaches the halt state
        elif (state == Z80State.HALTED):
            break
        print()
    
    print("Received string: [{}]".format(recvbuff))
    
    -- 
    Z80State.FETCH      -- Addr: 0000
    Z80State.MEMRD      -- Addr: 0001
    Z80State.MEMRD      -- Addr: 0002
    Z80State.MEMRD      -- Addr: 000a
    Z80State.MEMRD      -- Addr: 000b
    Z80State.FETCH      -- Addr: 0003
    Z80State.MEMRD      -- Addr: 0004
    Z80State.FETCH      -- Addr: 0005
    Z80State.MEMRD      -- Addr: 0006
    Z80State.FETCH      -- Addr: 0007
    Z80State.FETCH      -- Addr: 0008
    Z80State.MEMRD      -- Addr: 000c
    Z80State.IOWR       -- Addr: 0200 | Data: 4c (L)
    Z80State.FETCH      -- Addr: 0007
    Z80State.FETCH      -- Addr: 0008
    Z80State.MEMRD      -- Addr: 000d
    Z80State.IOWR       -- Addr: 0100 | Data: 53 (S)
    Z80State.FETCH      -- Addr: 0007
    Z80State.FETCH      -- Addr: 0008
    Z80State.MEMRD      -- Addr: 000e
    Z80State.IOWR       -- Addr: 0000 | Data: 45 (E)
    Z80State.FETCH      -- Addr: 0009
    Z80State.HALTED     -- Addr: 001f
    
    Received string: [LSE]
    

    Chapter V: The Feebleness Appears

    In a meantime, the two other tinkerers were focussed on testing the main board on some more pragmatic scenarios in order to check its limitations with the hope to serve a real purpose.

    Section I: The Relativity of Space...

    Their experience with the implementation of a JTAG module were marked by the difficulty to debug and trace the state of the digital design. As the northbridge and the internal bus logic were considered reliable enough, they decided to implement an IOModule exclusively designed to probe any other signals of the IO controller design.

    Acting as an internal logic analyser, a probing circuit composed of one control register fed by a FIFO was generated for each probed signals.

    The capture was triggered by a special configurable signal and could be reset by the host at any moment.

    As an example, the following design made the main board to act as a very cheap logic analyzer where all IO signals were simultaneously probed. The trigger was wired to the physical switch input:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    b = BMII()
    la = LogicAnalyzer(4) # Probing FIFO of 4 elements
    b.add_module(la)
    
    sb = b.modules.southbridge.iomodule
    
    #        Probe name     Width   Signal
    la.probe("IO1L",        8,      sb.cregs.PINSCAN1L)
    la.probe("IO1H",        8,      sb.cregs.PINSCAN1H)
    la.probe("IO2L",        8,      sb.cregs.PINSCAN2L)
    la.probe("IO2H",        8,      sb.cregs.PINSCAN2H)
    la.probe("IOMISC",      8,      sb.cregs.PINSCANMISC)
    
    la.set_trigger(~sb.cregs.PINSCANMISC.SW)
    

    In parallel of that, an implementation of a master SPI module was in development. It was a perfect test case for the logic analyzer as it was not yet tested on a real SPI slave.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    from bmii.modules.spi import SPIMaster
    from bmii.modules.spidev import SerialFlash
    
    b = BMII.default()
    spi = SPIMaster.default(b)
    
    la.probe("SCLK", 1, spi.iomodule.iosignals.SCLK)
    la.probe("SS0", 1, spi.iomodule.iosignals.SS0)
    la.probe("MOSI", 1, spi.iomodule.iosignals.MOSI)
    
    la.set_trigger(spi.iomodule.cregs.TX.wr_pulse)
    

    The SPI module initiated a transaction when its TX register was written. Its wr_pulse was then used to define the trigger of the logic analyzer as the goal was to analyse the output signal during an SPI activity.

    The capture method of a logic analyzer object waited for a capture be completed and then dequeued the samples by reading the control register of each probe.

    1
    2
    3
    4
    5
    6
    la.reset()
    spi.select_slave(0)
    spi.tranceive(42)
    
    la.capture()
    la.show()
    

    Finally, the show method could be used to generate the captured waveforms to a VCD file and to display it using gtkwave:

    However, each probe circuit was significantly logicblock-consuming which limited the use of tiny FIFO making the logic analyser useless on complex circuit.

    Section II: ...And Time

    After this first disappointment related to the quite limited space provided by the CPLD, they pursue their work on the SPI module by implementing required operations to drive a JEDEC-compliant serial flash memory.

    1
    2
    3
    4
    5
    6
    sf = SerialFlash.default(b, spi, slave_id=0)
    sf.read_id()
    -- 
    Manufacturer ID: 0xC2 (Macronix)
    Memory Type: 0x20
    Memory Capacity: 0x15 (16Mb)
    

    Driving the SPI flash was actually quite easy when it was previously extracted from its original circuit. This one was desoldered from a PC motherboard:

    1
    2
    3
    sf.dump(0x1FE000, size=25)
    
    b'Award BootBlock BIOS v1.0'
    

    The real challenge could be to probe the SPI packet in a passive way. This implied to base the IOModule logic on the SPI clock imposed by an external device instead of the regular system clock. Even though all this logic had been implemented and tested on simple devices, it was still returning malformed data when used on a PC motherboard since the BIOS flash was clocked at a frequency higher than 40MHz.

    Their guess for the reason of this issue was based on the fact that no IO pins were connected to a clock input of the CPLD. This meant that the SPI clock was gated by a regular IO input not designed to support such high frequency.

    Chapter VI: Displayed As Of Yore

    Affected by these previous failures, the two first tinkerers doubted about the real efficiency of the current hardware design of their board. By curiosity and driven by their discouragement, they look for the third one, probably lost in his solo projects.

    They found him in its basement, soldering wires and axial resistors to a VGA connector. He explained that he was oddly trying to make the main board acting as a video card. That was a plain useless job but he was glad to do it. Bored, the two other tried to helped him to finish and agreed that it would be their last experience with their board.

    Section I: The Dilemma Of Etching Copper

    Although driving VGA signals was something quite simple, they estimated that creating a dedicated expansion board would make their job easier. Firstly, it would allow the mechanical integration of a decent VGA connector. Secondly, it was a good opportunity to add some extra memories to the board as the CPLD would not be able to store enough data needed to implement a video card. A standard 128KB static RAM packaged in a SSOP package has been chosen due to to its simple interface and its fast respond time.

    The VGA's RGB pins must be driven by analog signals which implied the use of Digital to Analog Converters to be controlled from the CPLD. As these signals were defined to be ground terminated by a 75 Ohm resistor on the monitor side, a cheap equivalent of a DAC could be obtained by connecting different resistors to several CPLD's outputs, connected in parallel and acting as a voltage divider with monitor's termination resistor (see R1 to R6).

    By allocating 6 outputs for driving RGB signals, 64 colors could be generated. However, the limited number of IO pins prevented the usage of all of the 17-pin SRAM's address bus in the same time that the 6 pins of the RGB signals.

    In order to postpone this design decision, jumpers had been added to the extension PCB to allow the configuration at soldering-time. The first setting allowed the usage of 8 colors with a 256KB video RAM while the second one constrained the use of a 16KB RAM but could drive 64 colors (see table at the bottom layer of the PCB).

    Section II: A Proselytized Static Memory

    On a regular video card, framebuffer was supposed to be stored on a dual-port RAM in order to allow the controller to write displayed frame in the same time that it was read by the signal generator. As this kind of device must be controlled by a large number of pin, a regular SRAM had been used to substitute a real VRAM.

    Of course, this tweak forced a tighter management of the VRAM as two independents actors were using it at the same time while providing a unique interface.

    From a high-level point of view the simple video card could be represented as an IOModule by following this architecture:

    To manage the VRAM, the trick was to exploit the fact that the pixel clock required to display with a resolution of 640x480 at 60Hz was fixed to 25.175 MHz. As the IO controller was clocked at 48MHz, odd ticks were used to read from VRAM and to drive the pixel clock at 24Mhz which was acceptable for most of the recent VGA monitors. Meanwhile, even ticks where used to perform the write operations on the VRAM. To ensure that writing operations were successful, the read operation that followed a writing was cancelled which was not critical most of the time but could led to small display glitches

    The VRAM management unit could be described with the following state-machine:

    • 1: If a write operation has to be performed, then, drive the data and the address bus. Else, drive the address bus for the next reading.
    • 2: Reading state: Capture the output of the VRAM
    • 3: Writing state: Indicate to the VRAM that the data bus is ready to be read for a memory writing.

    Section III: Words Engraved In A Black Screen

    As the VRAM management core logic and the VGA signal generation was correctly working, only the logic needed to drive the read from the VRAM and to drive RGB signals according to VRAM's data had to be adapted to modify the displaying.

    To demonstrate how the VRAM could be managed, a simple text mode had been implemented.

    VRAM had been organized as follow: - 0x0000 - Text framebuffer: as the VGA-compatible text mode implemented on PC platforms, each characters consisted of one byte for the ASCII code and a second contained the color. - 0x0700 - Character set (3KB): Sprites representing each character. A font similar to the IBM's code page 437 was used.

    As only one reading on the VRAM was possible per pixel clock tick, reading sequence had to be aligned to the character display. While the three last pixels of a character, the VRAM reading logic fetched the ASCII code and the color of the next character on the framebuffer and provided to the display logic the corresponding sprite's row from the character set.

    Epilogue

    Surprisingly, the two first tinkerers found unexpected satisfaction to complete this dumb video card. The result of this last experience reflected the childish feelings that pushed them to start their first board: a satisfying design serving a useless objective.

    This forced step-back helped them to highlight the items that could improve the next version of the board, if someone would be brave enough to go on on their footsteps. The lack of logic blocks could be easily solved by switching to an FPGA. A lot of decent ones were still available in 144-pin EQFP packages. Allocating pins to an external RAM would also not be a waste. Many other applications were blocked by the lack of an embedded and easy to use memory.

    Concerning the timing issues encountered while probing the SPI flash, simply mapping some clock inputs to physical headers would be enough to unscramble most of them.

    After that, the tinkerers team split up. Each of them had been aligned to the 'state-of-art'-ish folk and they finally scattered, where engineers dwell...

    References

    Sources

    • Git repository
    • Migen
    • PyUSB

    Datasheets

    • Altera Max V handbook
    • Cypress EZ-USB FX2LP
    • Z84C00 Specifications

    Tweet
    Permalink & comments
  • LSE Week 2016: Schedule

    Written by Stanislas Lejay
    2016-07-05 13:31:31

    Our schedule for the LSE Week 2016 is out !

    The schedule will be as follow:

    • July, Thursday the 14th all day long
    • July, Friday the 15th in the evening
    • July, Saturday the 16th all day long

    The complete schedule is available on the page dedicated to the event

    Tweet
    Permalink & comments
    • More »

© LSE 2012 — Main website — RSS Feed