0xCAFEBABE ? - Java class file format, an overview
Lately, we’ve been looking into Java. First, we tried to understand the file format. A Java application is often distributed as a .jar, which is basically a zip archive (you can also find .war files, which are zip archives too). Inside this archive you'll find several files, in particular the .class files, which contain the Java bytecode. Those are the files we'll look into.
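Since a .jar is a zip archive, the .class files can be listed with any zip library. Here is a minimal sketch using Python's standard zipfile module; the helper name list_class_entries and the archive content are made up for the example.

```python
import io
import zipfile


def list_class_entries(jar):
    """Return the names of the .class entries of a .jar/.war archive.

    `jar` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(jar) as archive:
        return [name for name in archive.namelist() if name.endswith(".class")]


# Build a tiny in-memory archive so the example runs without a real jar.
# The entry names and contents are placeholders, not a real application.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
    archive.writestr("com/example/Main.class", b"\xca\xfe\xba\xbe")

buffer.seek(0)
print(list_class_entries(buffer))
```

On a real application you would pass the path of the .jar instead of the in-memory buffer.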
The file begins with a header including the magic number (0xCAFEBABE), the minor version, which is 0, and the major version, which for Java SE 7 is 0x0033 (51.00). Every number in the class file is stored in big-endian. Right after that header comes the constant pool count, which is the number of entries in the constant pool table plus one, followed by the table itself. The entries represent various items such as constants, classes, etc.
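To make the header layout concrete, here is a small sketch that decodes those first ten bytes with Python's struct module. The function name read_header and the sample bytes are ours, chosen for illustration; it is not taken from the parser mentioned at the end of this post.

```python
import struct


def read_header(data):
    """Decode magic, minor/major version and constant pool count."""
    # ">" selects big-endian; I is a 4-byte unsigned, H a 2-byte unsigned.
    magic, minor, major, cp_count = struct.unpack_from(">IHHH", data, 0)
    if magic != 0xCAFEBABE:
        raise ValueError("not a class file: bad magic 0x%08X" % magic)
    return {
        "minor": minor,
        "major": major,
        # The stored count is the number of constant pool slots plus one.
        "constant_pool_count": cp_count,
    }


# Hand-written sample: magic, minor 0, major 0x0033 (51), count 0x0010.
sample_header = bytes.fromhex("cafebabe" "0000" "0033" "0010")
header = read_header(sample_header)
print(header["major"], header["minor"], header["constant_pool_count"])
```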
After that come the access flags of the class and the this_class and super_class identifiers, which are indices into the constant pool referring to the current class and its superclass. This is followed by the interface table and its size; the table contains all the interfaces that the current class implements. Then we find the field table and its size, followed by the methods and the attributes of the class.
Here is an overview of the class file.
The constant pool is probably the most important part of the class file. It contains all the information needed by the other parts of the file. The constant pool is an array of entries whose index starts at 1, not 0. The different structures in the table do not have the same size, so the constant pool has a variable size. Each entry begins with a one-byte tag indicating the type of entry:
CONSTANT_Utf8: a modified UTF-8 entry. Java uses its own variant of UTF-8 to represent constant string values.
CONSTANT_Integer: a constant integer on 4 bytes; like everything in the class file format, the integer is big-endian.
CONSTANT_Float: a float on 4 bytes; it follows the IEEE 754 single-precision floating-point format and can represent both infinity and NaN.
CONSTANT_Long: same as CONSTANT_Integer but represents the integer on 8 bytes. A particularity of this entry is that it counts twice in the constant pool’s number of entries.
CONSTANT_Double: same as CONSTANT_Float, but it follows the IEEE 754 double format; like CONSTANT_Long, it stores the number on 8 bytes and also counts twice in the constant pool.
CONSTANT_Class: this one represents a class or an interface. It has only one characteristic, which is an index in the constant pool to a CONSTANT_Utf8 indicating the name of the class.
CONSTANT_String: its goal is to represent a constant object of string type. Like CONSTANT_Class, it contains a single piece of information, which is the index of a CONSTANT_Utf8 in the constant pool holding the string’s value.
CONSTANT_Fieldref: a reference to a field. It contains the index of a CONSTANT_Class representing the class or interface in which the field is declared, and the index of a CONSTANT_NameAndType (see below) representing the field’s name and type (http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.3.2).
CONSTANT_Methodref: like CONSTANT_Fieldref, it contains a CONSTANT_Class index and a CONSTANT_NameAndType index. The CONSTANT_Class must represent a class and not an interface. The CONSTANT_NameAndType must represent a method descriptor (http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.3.3).
CONSTANT_InterfaceMethodref: it is similar to the CONSTANT_Methodref type except that the CONSTANT_Class entry must represent an interface.
CONSTANT_NameAndType: this structure represents a field or method without indicating the class or interface it belongs to. It contains two indices in the constant pool, both of which must point to a CONSTANT_Utf8: the first represents the name and the second a valid descriptor of the field or method.
CONSTANT_MethodHandle: this entry is used to resolve a symbolic reference to a method handle. How the handle is resolved depends on its bytecode behavior, which is indicated by a kind indicator (from 1 to 9). It also contains a two-byte reference, which is an index in the constant pool pointing to a CONSTANT_Fieldref, CONSTANT_Methodref or CONSTANT_InterfaceMethodref depending on the kind.
CONSTANT_MethodType: this entry represents a method type; it contains an index to a CONSTANT_Utf8 which must hold the method’s descriptor.
CONSTANT_InvokeDynamic: this structure is used by the invokedynamic instruction to specify a bootstrap method. It contains an index into the bootstrap method table (see attributes below) and an index into the constant pool to a CONSTANT_NameAndType representing the method name and method descriptor.
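The entry types listed above can be walked with a simple tag-driven loop. The sketch below is one possible way to do it, not the code of our parser: the FIXED table and the read_constant_pool function are names invented for this example, and the tag values are the ones given by the JVM specification. Utf8 payloads are kept as raw bytes because modified UTF-8 is not always valid standard UTF-8.

```python
import struct

# tag -> (name, struct format of the fixed-size payload following the tag)
FIXED = {
    3: ("Integer", ">i"),
    4: ("Float", ">f"),
    5: ("Long", ">q"),
    6: ("Double", ">d"),
    7: ("Class", ">H"),
    8: ("String", ">H"),
    9: ("Fieldref", ">HH"),
    10: ("Methodref", ">HH"),
    11: ("InterfaceMethodref", ">HH"),
    12: ("NameAndType", ">HH"),
    15: ("MethodHandle", ">BH"),
    16: ("MethodType", ">H"),
    18: ("InvokeDynamic", ">HH"),
}


def read_constant_pool(data, offset, count):
    """Return (pool, next_offset). `count` is the stored constant pool count.

    pool[0] and the slot following a Long or Double stay None.
    """
    pool = [None] * count
    index = 1  # the constant pool is indexed from 1
    while index < count:
        tag = data[offset]
        offset += 1
        if tag == 1:  # CONSTANT_Utf8: 2-byte length, then the raw bytes
            (length,) = struct.unpack_from(">H", data, offset)
            offset += 2
            pool[index] = ("Utf8", data[offset:offset + length])
            offset += length
        elif tag in FIXED:
            name, fmt = FIXED[tag]
            pool[index] = (name,) + struct.unpack_from(fmt, data, offset)
            offset += struct.calcsize(fmt)
        else:
            raise ValueError("unknown constant pool tag %d" % tag)
        # Long and Double count twice in the number of entries.
        index += 2 if tag in (5, 6) else 1
    return pool, offset


# Hand-built pool: Utf8 "Main", Class -> #1, Long 42 (two slots), Integer 7.
raw_pool = (b"\x01\x00\x04Main" + b"\x07\x00\x01"
            + b"\x05" + struct.pack(">q", 42)
            + b"\x03" + struct.pack(">i", 7))
pool, end = read_constant_pool(raw_pool, 0, 6)
print(pool)
```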
Here is a global overview of each of those structures:
General and Interfaces
After the constant pool, we find general information about the current class: its name and its superclass, plus general properties stored in the access flags. There are several access flag types for classes, fields and methods. The different kinds of access flags are:
After the general information field, there is an information field about the interfaces. All the interfaces the class implements are represented in the interface table. Each entry in that table is a constant pool index pointing to a CONSTANT_Class which must be an interface.
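As an illustration of this part of the file, the sketch below decodes the access flags, the two class indices and the interface table. The flag values are the class-level ones from the JVM specification; the names CLASS_FLAGS and read_class_info, and the sample bytes, are assumptions made for the example.

```python
import struct

# Class-level access flags as defined by the JVM specification.
CLASS_FLAGS = {
    0x0001: "ACC_PUBLIC",
    0x0010: "ACC_FINAL",
    0x0020: "ACC_SUPER",
    0x0200: "ACC_INTERFACE",
    0x0400: "ACC_ABSTRACT",
    0x1000: "ACC_SYNTHETIC",
    0x2000: "ACC_ANNOTATION",
    0x4000: "ACC_ENUM",
}


def read_class_info(data, offset):
    """Decode access flags, this_class, super_class and the interface table."""
    access_flags, this_class, super_class, interfaces_count = struct.unpack_from(
        ">HHHH", data, offset)
    offset += 8
    # Each interface entry is a 2-byte constant pool index.
    interfaces = list(
        struct.unpack_from(">%dH" % interfaces_count, data, offset))
    offset += 2 * interfaces_count
    flags = [name for mask, name in sorted(CLASS_FLAGS.items())
             if access_flags & mask]
    info = {
        "flags": flags,
        "this_class": this_class,
        "super_class": super_class,
        "interfaces": interfaces,
    }
    return info, offset


# Sample: public class (0x0021), this_class #2, super_class #4, one interface #6.
sample_info = struct.pack(">HHHHH", 0x0021, 2, 4, 1, 6)
info, end = read_class_info(sample_info, 0)
print(info)
```

The indices returned here still have to be looked up in the constant pool to get the actual class names.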
Each field, method and class has other characteristics and information, which are stored inside attributes. There are several attribute types, and each of them can be applied to one or several of fields, methods, classes and code. The attributes are used to represent:
- Local variables, constant value, information about the stack and exceptions
- Inner Classes, Bootstrap Methods, Enclosing methods
- Information for debug/decompilation
- Complementary information (Deprecated, Signature...)
Each attribute begins with an index into the constant pool, which must point to a CONSTANT_Utf8 entry naming the type of the attribute. Then, since the different types of attributes have different structures, the attribute length is given. A Java Virtual Machine implementation does not need to handle every kind of attribute: knowing the length lets it skip an unhandled attribute and still execute the file correctly.
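The common attribute header makes generic reading straightforward. Here is a minimal sketch, assuming the caller already knows how many attributes follow; read_attributes is a name made up for this example. It keeps each payload as raw bytes, so an unknown attribute is skipped without being understood.

```python
import struct


def read_attributes(data, offset, count):
    """Read `count` attributes; return ([(name_index, payload)], next_offset)."""
    attributes = []
    for _ in range(count):
        # Common header: 2-byte name index, 4-byte payload length.
        name_index, length = struct.unpack_from(">HI", data, offset)
        offset += 6
        attributes.append((name_index, data[offset:offset + length]))
        offset += length  # the length lets us jump over any payload
    return attributes, offset


# Sample: one attribute with a 2-byte payload, then one with an empty payload.
raw_attrs = (struct.pack(">HI", 9, 2) + b"\x00\x0a"
             + struct.pack(">HI", 11, 0))
attrs, end = read_attributes(raw_attrs, 0, 2)
print(attrs)
```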
The most important attribute is probably the Code:
It begins with the common header; the attribute name index should point to a CONSTANT_Utf8 representing the string "Code".
It is followed by two variables, max stack and max locals, which give the maximum depth of the operand stack and the number of local variables, including those used to pass arguments to the method. Then come the code length and the code itself, which is the bytecode that the JVM executes when the method is called.
Right after that, you’ll find a table describing the exception handlers of the method. Each entry gives the start and the end of the zone in which the exception should be caught, the start of the handler to run if the exception is raised, and the catch type, which is an index to a CONSTANT_Class in the constant pool. The catch type can also be 0, in which case the handler is called for every exception; this is generally used for the finally statement.
After the exceptions section, it is possible to add attributes to the code, in particular about the stack and the local variables. Code is thus an attribute that may itself contain other attributes.
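Putting the pieces of the Code attribute together, the following sketch decodes its payload, that is, the bytes that come after the common attribute header. The function name parse_code and the sample payload are invented for the example; nested attributes are only counted, not decoded.

```python
import struct


def parse_code(body):
    """Decode the payload of a Code attribute (without the common header)."""
    max_stack, max_locals, code_length = struct.unpack_from(">HHI", body, 0)
    offset = 8
    code = body[offset:offset + code_length]
    offset += code_length
    (handler_count,) = struct.unpack_from(">H", body, offset)
    offset += 2
    handlers = []
    for _ in range(handler_count):
        # start_pc, end_pc, handler_pc, catch_type (0 means "catch everything")
        handlers.append(struct.unpack_from(">HHHH", body, offset))
        offset += 8
    (attributes_count,) = struct.unpack_from(">H", body, offset)
    return {
        "max_stack": max_stack,
        "max_locals": max_locals,
        "code": code,
        "handlers": handlers,
        "attributes_count": attributes_count,
    }


# Sample payload: two bytes of bytecode (nop, return), one catch-all handler,
# no nested attributes. It only exercises the parsing, not a real method.
raw_code = (struct.pack(">HHI", 1, 1, 2) + b"\x00\xb1"
            + struct.pack(">H", 1) + struct.pack(">HHHH", 0, 1, 1, 0)
            + struct.pack(">H", 0))
parsed = parse_code(raw_code)
print(parsed)
```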
Fields & Methods
The fields and methods are stored in two tables whose entries share the same layout. The access flags differ between fields and methods; they are the ones presented above.
After the flag section, we find the name, which is an index to a CONSTANT_Utf8 in the constant pool representing the name of the method/field. The descriptor index is also an index to a CONSTANT_Utf8, which holds a descriptor defining the method or field type.
Finally, methods and fields can have attributes; in particular, a method will contain a Code attribute, which itself contains the method's code.
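Because both tables share the same entry layout, a single routine can read either of them. The sketch below assumes the caller has already read the 2-byte table size; read_members is a hypothetical name, and attribute payloads are kept as raw bytes.

```python
import struct


def read_members(data, offset, count):
    """Read `count` field or method entries; return (members, next_offset)."""
    members = []
    for _ in range(count):
        access_flags, name_index, descriptor_index, attributes_count = (
            struct.unpack_from(">HHHH", data, offset))
        offset += 8
        attributes = []
        for _ in range(attributes_count):
            # Same common header as every attribute: name index, then length.
            attr_name_index, length = struct.unpack_from(">HI", data, offset)
            offset += 6
            attributes.append((attr_name_index, data[offset:offset + length]))
            offset += length
        members.append({
            "access_flags": access_flags,
            "name_index": name_index,
            "descriptor_index": descriptor_index,
            "attributes": attributes,
        })
    return members, offset


# Sample: one entry without attributes, one entry with an empty attribute.
raw_members = (struct.pack(">HHHH", 0x0002, 5, 6, 0)
               + struct.pack(">HHHH", 0x0001, 7, 8, 1)
               + struct.pack(">HI", 9, 0))
members, end = read_members(raw_members, 0, 2)
print(members)
```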
The class file is really important for the JVM, and having a look at the file format explains a lot about the way the JVM works internally.
Java SE 8 was released recently. There are several small differences with Java SE 7, even though most of the class file format has not changed. In particular, it defines new attributes.
There are also several modifications in different sections that change the default behaviour of the JVM. It also adds clarifications and constraints to parts of the class file. The version number is 51.00 for Java SE 7 and 52.00 for Java SE 8.
We’ve written a parser for the class file format in Python 3 that you can find here: java.py.
It uses the srddl module for Python, available here: https://bitbucket.org/kushou/srddl