LSE Blog

Operating systems, computer security, languages theory, and even more!

  • About us

    • Main website
    • Git repositories
    • @lse_epita

    RSS Feed

  • Categories
    • Events
    • Hardware
    • Language
    • Reverse Engineering
    • Security
    • System
      • Linux
    • Tutorials
      • Parallelism
      • PythonGDB
    • Writeups
      • CSAW CTF 2012 Quals
      • DEFCON 2013 Quals
      • DEFCON2K12 Prequals
      • Hack.lu CTF 2012
      • Hack.lu CTF 2013
      • NDH2K12 Prequals
      • NDH2K13 Quals
      • Olympic-CTF 2014
      • PlaidCTF 2012
      • SecuInside2K12 Prequals
      • ebCTF 2013
  • Authors
    • ✉ Samuel Angebault
    • ✉ Remi Audebert
    • ✉ Jean-Loup Bogalho
    • ✉ Pierre Bourdon
    • ✉ Marwan Burelle
    • ✉ Samuel Chevet
    • ✉ Pierre-Marie de Rodat
    • ✉ Ivan Delalande
    • ✉ Corentin Derbois
    • ✉ Nassim Eddequiouaq
    • ✉ Louis Feuvrier
    • ✉ Fabien Goncalves
    • ✉ Nicolas Hureau
    • ✉ Gabriel Laskar
    • ✉ Stanislas Lejay
    • ✉ Franck Michea
    • ✉ Bruno Pujos
    • ✉ Clement Rouault
    • ✉ Pierre Surply
    • ✉ Kevin Tavukciyan
  • C! - system oriented programming

    Written by Marwan Burelle
    2012-05-09 14:52:00

    This is a first article, intended to be an introduction to to C!, more articles presenting syntax and inner parts of the compiler will follow.

    C! is one of our projects here at LSE (System Lab of EPITA.) It is a programming language oriented toward lower-level system programming (kernel or driver for example.)

    We were looking for a modern programming language for kernel programming and after trying some (D, C++, OCaml … ) it appears that most of them were too oriented toward userland to be used in kernel programming context.

    So we decide to modify and extend C to fit our need and quickly aim toward a new programming language: C!

    Modern languages and kernel programming

    Most parts of kernel code are rather classical: data structures, algorithms and a lot of glue. But, some crucial aspects require lower-level programming: direct management of memory, talking to specific CPU part (interruption management, MMU … ), complete control over data layout, bit-by-bit data manipulation …

    Thus, to write some kernel code (or a complete kernel) we need a native language with direct access to this kind of low-level operations. This implies the ability to include ASM code somehow, to manage function calls from ASM and to build standard functions (so you can have function pointers for various interruption mechanisms.)

    And, since you're not in user-land, you can't use user-land facilities (standard system libs for example). For most languages this means that you must rewrite memory allocators and tools that come shipped with them (especially for managed memory languages using garbage collection).

    Another issue is the binary format: when writing user-land programs, your compiler builds a file suited for the kernel binary loader. On most current Unix systems, your file will respect the ELF format. Of course, you can write an ELF loader in your bootloader (or any part of your booting process for that matters) but since you are managing memory and memory mapping, you can't rely on the way a program is loaded on your system and thus the organization of your ELF must reflect these constraints.

    Of course, this issue is not language dependent, even with a pure ASM or C kernel, you will have to control the way your linker builds the final binary. But, in C (and obviously in ASM) there are no major issues there, the structure of your program will be sufficiently simple so that the only important question is: where will I be in memory?

    So, what's wrong with modern languages?

    For the most evolved ones such as languages with transparent memory management and garbage collection, one of the most important problem is to provide a replacement for all aspects of the standard libraries of the system: memory allocator, threads and locks management, etc. And in that case some aspects just can't be rewritten the same way it is in user-land.

    The C++ situation is somehow better and worse: in theory there is less runtime needs than most modern languages. The good part is that you can bypass the most problematic elements of C++ (such as RTTI or exceptions) so you don't have to fight against them. Once you've deactivated problematic features and found what can't be used without them, you have to provide runtime elements needed by your code: start-up code, pure virtual fallback, dynamic stack allocation code, dynamic memory allocation for new and delete operators (for objects and array) …

    Roaming here and there, you'll find documentation on how you can write your C++ kernel, but let's face it: is the required work really worth the pain?

    What's wrong with C

    So, if you're still reading me, this means that you're partially convinced that using C++ (or D, or OCaml, or … ) is not a good idea for your kernel. But, why not go on with the good old C programming language?

    Since it was designed for that job, it is probably the best (or one of the best) fit for it. But, we want more.

    Here is a quick list of what we may find wrong or missing in C:

    • The C syntax contains a lot of ambiguous traps
    • While the type system of C is basically size based, a lot of types have an ambiguous size (int for example)
    • Controlling size and signedness of integers is often painful
    • There is no clean way to provide some form of genericity or polymorphism
    • There's no typed macros
    • The type system and most static verification mechanisms are too basic compared to what could be done now
    • C miss a namespace (or module) mechanism
    • While you can do object oriented programming, it is tedious and error prone

    In fact, the above list can divided in two categories:

    • syntax and base language issues
    • missing modern features

    Genese of C!

    Once we stated what was wrong with C, I came up with the idea that we could write a simple syntactic front-end to C or a kind of preprocessor, where we would fix most syntax issues. Since we were playing with syntax, we could add some syntactic sugar as well.

    We then decided to take a look at object oriented C: smart usage of function pointers and structures let you build some basic objects. You can even have fully object oriented code. But while it is dead simple to use code with object oriented design, the code itself is complex, tedious, error prone and most of the time unreadable. So, all the gain of the OOP on the outer side, is lost on the inner side.

    So why not encapsulated object oriented code in our syntax extension ?

    But, OOP means typing (Ok, I wanted static typing for OOP.) And thus, we need to write our own type system and type checker.

    Finally, from a simple syntax preprocessor, we ended up with a language of its own.

    Compiler-to-compiler

    Designing and implementing a programing language implies a lot of work: parsing, static analysis (mostly type checking), managing syntactic sugar, and a lot of code transformations in order to produce machine code.

    While syntactic parts and typing are unavoidable, code productions can be shortened somehow: you write a frontend for an existing compiler or use a generic backend such as LLVM. But you still need to produce some kind of abstract ASM, a kind of generic machine code that will be transformed into target specific machine code.

    The fact is that a normal compiler will already have done a lot of optimization and smart code transformation before the backend stage. In our case, this means that we should do an important part of the job of a complete C compiler while we are working with code that is mainly C (with a different concrete syntax.)

    The last solution (the one we chose) is to produce code for another compiler: in that case all the magic is in the target compiler and we can concentrate our effort on syntax, typing and extensions that can be expressed in the target language.

    Based on our previous discussion, you can deduce that we chose to produce C code. Presenting all aspects of using C as a target language will be discussed further in a future article.

    Syntactic Sugar

    An interesting aspect of building a high-level language is that we can add new shiny syntax extensions quite simply. We decided to focus on syntax extensions that offer comfort without introducing hidden complexity.

    Integers as bit-arrays

    In lower-level code, you often manipulate integers a bit at a time, so we decided to add a syntax to do that without manipulating masks and bitwise logical operands.

    Thus, any integer value (even signed, but this may change, or trigger a warning) can be used as array in left and right position (you can test and assign bit per bit your integer!).

    A small example (wait for the next article for full syntax description):

    1
    2
    3
    4
    5
    x : int<+32> = 0b001011; // yes, binary value!
    t : int<+1>;
    t = x[0];
    x[0] = x[5];
    x[5] = t;
    

    Assembly blocks

    When writing kernel code, you need assembly code blocks. The syntax provided by gcc is annoying, you have to correctly manage the string yourself (adding newline and so on.)

    On the other hand, I don't want to add a full assembly parser (as in D compiler for example.) Despite the fact that it is boring and tedious, it implies that the language is stuck to some architectures and we have to rewrite the parser for each new architecture we need …

    In the end, I found a way to integrate asm blocks without the noise of gcc but keeping it close enough to be able to translate it directly. Of course, this means that you still have to write clobber lists and stuff.

    A little example (using a typed macro function):

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    #cas(val: volatile void**, cmp: volatile void*, exch: volatile void*) : void*
    {
      old: volatile void*;
      asm ("=a" (old); "r" (val), "a" (cmp), "r" (exch);)
      {
                lock
                cmpxchg %3, (%1)
      }
      return old;
    }
    

    Macro stuff

    Actually, C! has no dedicated preprocessing tools but we included some syntax to provide code that will be macro rather than functions or variables.

    First, you can transform any variable or function declaration into a kind of typed macro by simply adding a sharp in front of the name (see previous example). The generated code will be a traditional C macro with all the "dirty" code needed to manage return, call by value and so on.

    The other nice syntax extension is macro classes: a macro class provides methods (in fact macro functions) on non object types. The idea is to define simple and recurring operations on a value without boxing it (next article will provide examples).

    Modules

    Another missing feature of C is a proper module mechanism. We provide a simple module infrastructure sharing a lot (but far simpler) with C++ namespaces. Basically, every C! file is a module and referring to symbols from that module requires the module name, like namespaces. Of course you can also open the module, that is making directly available (without namespace) every symbol of the module.

    Namespaces provide a simple way to avoid name specialization: inside the module you can refer to it directly and outside you use the module name and thus no inter-module conflict could happen.

    What's next

    In the next article of this series I will present you with C! syntax, the very basis of the object system and macro stuff.

    The compiler is still in a prototype state: all features described here are working, but some details are still a bit fuzzy and may need you to do some adjustments in the generated code.

    As of now, you can clone C! on its LSE git repository, take a look at the examples in the tests directory and begin writing your own code. Unfortunately, we don't have an automated build procedure yet, so you will have to do it step by step.

    Tweet
    Permalink & comments
  • C! - system oriented programming - syntax explanation

    Written by Marwan Burelle
    2012-06-26 10:40:00

    Following the previous article introducing C! I now present the language itself. I kept presentation as short as possible and present relation to C syntax when it's relevant.

    Basic syntax: statement and expressions

    Globally C! code will look like C code. There're few details due to some adjustement but you'll find usual operators, functions call, loop and if statements … The global structure of the code will look very familiar.

    Among minor differencies are: cast, function pointer usage and types syntax.

    Declarations

    The most striking differencies is probably declarations syntax. In C, there's no clear separtation between the declared entity (variables, functions or type names) and the type description of the entity. For example, in C, if you declare an array of characters you'll write something like:

    1
    char t[256];
    

    The variable name is t and its type is array of char (the size being some extra information.)

    In C!, we choose to break things more clearly, and have in the declaration a part naming the entity and a part describing its type, the previous expression becomes:

    1
    t : char[256];
    

    This clarified the question of the position of the star when declaring a pointer, for example in C, we shall write:

    1
    char *p;
    

    and in C!:

    1
    p : char*;
    

    The star no longer needs to be attached to p and you can't write ambiguous declarations like:

    1
    char* p, c;
    

    Where c is character and not a pointer to character. Of course, the drawback is that we must write two lines for that example:

    1
    2
    p : char*;
    c : char;
    

    The same logic appears on function declaration, for example the following C code:

    1
    2
    3
    4
    5
    6
    char f(char c)
    {
      if (c < 'a' || c > 'z')
        return c;
      return 'A' + c - 'a';
    }
    

    Will be written in C!:

    1
    2
    3
    4
    5
    6
    f(c : char) : char
    {
      if (c < 'a' || c > 'z')
        return c;
      return 'A' + c - 'a';
    }
    

    We apply the same idea to cast, thus the following code:

    1
    2
    3
    4
    void f(void *p, char *c)
    {
      *c = *((char*)p);
    }
    

    becomes:

    1
    2
    3
    4
    f(p : void*, c : char*) : void
    {
      *c = *(p : char*);
    }
    

    The same logic is shown in type name definitions:

    1
    typedef char *string;
    

    becomes:

    1
    typedef string = char*;
    

    Again, function pointer have a simplified syntax: the name of the variable is no longer inside the type. So the following C code:

    1
    char (*f)(char,char*);
    

    Becomes:

    1
    f : <(char,char*)> : char;
    

    Of course, you can add initilization expressions:

    1
    a : char = 'a';
    

    Integer and floating point numbers

    We decide to have explicit size and signedness in integer types. Thus, integer will be declared as follow:

    1
    2
    3
    x : int<32>;  // a signed 32bits integer
    y : int<+16>; // an unsigned 16bits integer
    z : int<24>;  // uncommon size declaration
    

    Sizes not belongings to standard sizes are stored using available integer types in C99 (the ones defined in stdint.h) and are masked when needed to prevent usage of unwanted values.

    The same ideas apply to floating point numbers:

    1
    f : float<64>; // a double float
    

    Of course, you can define some types name (but you can't use int, char and float):

    1
    typedef short = int<16>;
    

    Sized integer in structure definition are directly translated as bitfields, so we have a single syntax.

    We extends the language syntax with a notion of bits arrays: that is an integer can be used as an array of bits:

    1
    2
    3
    4
    x : int<+32> = 41;
    x[31] = 1;           // set the most significant bit to 1
    x[31] = 0;           // set the most significant bit to 0
    x += (x[0] ? 1 : 0); // make x even if not
    

    When setting bit, value other than 0 are transformed into 1.

    Object Oriented Extension

    We introduce a classical, but yet simple, OOP extension to our language. So first, you can define classes with attributes, methods and constructors:

    1
    2
    3
    4
    5
    6
    7
    class A {
      x : int<32>;
      get() : int<32> { return x; }
      set(y : int<32>) : void { x = y; }
      // A simple constructor
      init(y : int<32>) { x = y; }
    }
    

    We have simple inheritance and methods are true methods (that is virtual methods):

    1
    2
    3
    4
    5
    6
    7
    8
    class B : A {
      y : float<32>;
      init(a : int<32>, b : float<32>) {
        A(this, a) // call A constuctor
        y = b;
      }
      get() : int<32> { return x + (y : int<32>); }
    }
    

    We don't have (yet ?) method overloading, only overriding.

    Object in C! are always pointer and you should allocate them by yourself (so we don't rely on predefined allocator) but you can create some kind of « local object » that is an object defined on the stack or as global value.

    1
    2
    3
    og : A = A(some_pointer, 41); // object creation require pre-allocation
    ol : local A(42);             // object on the stack
    og.set(og.get() + 1);
    

    There's no implicit destructor calls for now, but depending on real nead we may add it for local objects.

    Since we only have pointed-object there's no implicit copy as in C++ nor there's need for references. Access to content (all is public) is done with the simple dot syntax.

    The constructor for an object is a simple function that take a pointer to the concrete object (the object pointer) and any needed parameters. It returns the object pointer. If you're object is "compatible" with the object built by a given constructor, you safely can pass it to the constructor (as in the previous example.)

    Local object are not automatically initialized, in the following code

    1
    o : local A;
    

    Object o is allocated on the local scope but not initiliazed: methods table is "empty" (a method call will fail … ) In near future we probably be able to detect that, or at least provide a minimal initialization.

    We also provide interface and abstract methods.

    I may explain generated code in some future article.

    Typed macro and Macro Class

    We introduce a simple way to define typed macro constants and macro functions: you just a # at the begining of a declaration:

    1
    2
    3
    4
    5
    #X : int<32> = 42;
    #square(x : int<32>) : int<32>
    {
      return x * x;
    }
    

    Our macro functions enjoy a real call by value semantics (using some tricks in the generated code) and (once typed by C!) are real cpp macro in the generated code!

    The other macro extension is the macro class concept: we syntactically embeded a value (of any type) in some kind of object with methods. The result produces special macro but let you use your values just like an object.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    macro class A : int<32> // storage kind
    {
      get() const  : int<32> // won't modify inner storage
      {
        return this; // this represent the inner storage value
      }
      set(x : int<32>) : void // non const can modify inner storage
      {
        this = x;
      }
    }
    

    For now, all "macro code" generate CPP macro (with a lot of tricks to respect call by value and return management. It is not excluded to generate inlined functions in the future as long as we are sure that semantics is preserved.

    One of the idea behind macro class is to provide a simple syntax (OO like) for constructions that do not require functions (or worse the burden of a whole object.)

    Properties

    Properties is an other extension (very young and poorly tested) in the same spirit than macro class.

    The idea is quite simple: it provides a way to overload access to any kind of value (structured or not) and make it appears as another type (the virtual type.) You just have to provide a getter and a setter and when context requires the virtual type the compiler automatically insert the right accesser.

    For example, you have a 32 bits unsigned integer stored in two different locations but you want to access it as if it is a plain and simple integer. Suppose you have a structure s storing the two pointer, you'll have use it that way in plain old C:

    1
    2
    3
    4
    unsigned x, y = 70703;
    x = ((*(s.high)) << 16) + *(s.low); // getting the value
    *(s.high) = y >> 16;                // setting the value
    *(s.low) = y & (0xffff);
    

    You can declare a property that way (I included the structure describing our splitted integer):

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    struct segint {
      high: int<+16>*;
      low: int<+16>*;
    }
    
    property V(segint) : int<+32>
    {
      get() {
        return ((*(this.high)) << 16) + *(this.low);
      }
      set(y : int<+32>) {
        *(this.high) = y >> 16;
        *(this.low) = y & (0xffff);
      }
    }
    

    And then, to use it:

    1
    2
    3
    4
    5
    6
    s : segint;
    s.high = &high; s.low = &low; // init the struct
    x : V = s; // warning: x is a copy of s
    y : int<+32>;
    y = x + 1; // accessing the value
    x = 70703  // setting it
    

    Since a property can have any real type you want, it can be part of an object and have its own this pointer corresponding to a pointer to the object (since every thing is public the property have a fool access to the object.)

    As of now, accessors are generated as macro and access to the real value is done through a reference (so it can be modified.)

    Support for op-assign (operators like +=) and other similar operators (mainly ++) will probably be added later.

    Tweet
    Permalink & comments

© LSE 2012 — Main website — RSS Feed