0xCAFEBABE ? - java class file format, an overview
Lately, we’ve been having a look into java. First, we tried to understand the file-format. A java application is often presented in a .jar, which is basically a zip archive (you can also find .war files which are also zip archive). Inside this archive you’ll find several files, especially some .class files which are the one containing the java bytecode. Those files are the one we’ll look into.
The file begins with a header including the magic number (0xCAFEBABE
), the minor version which is 0
and the major version for Java SE 7: 0x0033
(51.00). Every number in the class file are stored in big-endian.
Right after that header, we can find the Constant Pool count which is the number of entries in the constant pool table plus one and
then the array. There are several entries representing several items in the constant
pool like constants, classes, etc..
After that, there is the access flag
of the class, the this_class
and super_class
identifiers which are indices in the constant
pool in order to refer to the current class and the super class. This is followed by
the interface table and its size, the table contains all the interfaces from which the current class
inherits. Then we find the field table and size, followed by the methods and the attributes of the class.
Here is mainly an overview of the class file.
Constant Pool
The constant pool is probably the most important part of the Class file. It contains all the information that will be needed on the other part of the file. The constant pool is an array containing several entries, the index of the array starts at 1, not 0. The different structures in the table do not have the same size, and so the constant pool may have a variable size. Each entry begins with a tag on one byte, indicating the type of entry:
-
CONSTANT_Utf8
: indicating an utf8 modified entry. Java uses a particular type of utf8 for representing the constant string values. -
CONSTANT_Integer
: representing a constant integer on 4 bytes, just like everything in the class file format the integer is a big-endian. -
CONSTANT_Float
: representing a float on 4 bytes, it follows the IEEE 754 floating point format, with possibility of representing both infinity and NaN. -
CONSTANT_Long
: same asCONSTANT_Integer
but represents the integer on 8 bytes. Something particular about this entry is that it is counting twice in the constant pool’s number of entries. -
CONSTANT_Double
: asCONSTANT_Float
it follows the IEEE 754 for the double format, likeCONSTANT_Long
it stores the number on 8 bytes and also counts twice in the constant pool. -
CONSTANT_Class
: this one is used to represent a class or an interface, it has only one caracteristic which is an index in the constant pool to aCONSTANT_Utf8
indicating the name of the class. -
CONSTANT_String
: its goal is to represent constant object of string type. likeCONSTANT_Class
, it only contains one information which is the index of aCONSTANT_Utf8
in the constant pool to represent the string’svalue. -
CONSTANT_Fieldref
: this represents a reference to a field. it contains the index ofCONSTANT_Class
to represent the class or interface in which the field is and the index of aCONSTANT_NameAndType
(see below) for representing the name and the field’s type. (http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.3.2). -
CONSTANT_Methodref
: likeCONSTANT_Fieldref
, it contains aCONSTANT_Class
index and aCONSTANT_NameAndType
. TheCONSTANT_Class
must represent a class and not an interface. TheCONSTANT_NameAndType
must represent a method descriptor (http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.3.3). -
CONSTANT_InterfaceMethodref
: it is similar to theCONSTANT_Methodref
type except that theCONSTANT_Class
entry must represent an interface. -
CONSTANT_NameAndType
: this structure is used to represent a field or method without indicating the class or interface it belongs to, it contains two indices in the constant pool which must have the typeCONSTANT_Utf8
, the first represents the name and the other one represents a valid descriptor of the field or the method. -
CONSTANT_MethodHandle
: this field is used to resolve symbolic reference to a method handle. The way of resolving a method depends on something called the bytecode behavior which is indicated by a kind indicator (from 1 to 9). It also contains a reference on two bytes which is an index in the constant pool pointing on aCONSTANT_Fieldref
,CONSTANT_Methodref
orCONSTANT_InterfaceMethodref
depending on the kind. -
CONSTANT_MethodType
: this field is used to resolve the method’s type, it contains an index to aCONSTANT_Utf8
which should represent the method’s type. -
CONSTANT_InvokeDynamic
: this structure is used by the invokedynamic instruction to specify a bootstrap method. It contains an index into the bootstrap method table (see attributes below) and an index into the constant pool to aCONSTANT_NameAndType
representing the method name and method descriptor.
Here is global overview of each of those structures:
General and Interfaces
After the Constant Pool, we can find several information about the current class, there is an information about the class name and the superclass. There are also general information about the class in the access flag. There are several access flag types for classes, fields and methods. The different kind of access flags are:
After the general information field, there is an information field about the interfaces. All
the interfaces the class implements are represented in the interface table.
Each entry in that table is a constant pool index representing a CONSTANT_Class
which must be an interface.
Attributes
Each field, method and class have others characteristics and informations. These information are contained inside attributes. There are several attribute types, each one of them can be applied to one or several fields, methods, classes and codes. The attributes are used to represent :
- Code
- Local variables, constant value, information about the stack and exceptions
- Inner Classes, Bootstrap Methods, Enclosing methods
- Annotations
- Information for debug/decompilation
- Complementary information (Deprecated, Signature…)
Each attribute begins by an index into the constant pool, it must point to a
CONSTANT_Utf8
entry telling which type of attribute this is. Afterward, since
the different types of attributes have different structures, the attribute length
is indicated. An implementation of the Java Virtual Machine is not
necessary in order to handle each kind of attribute because knowing the length allows to pass an unhandle attribute and execute correctly the file.
The most important attribute is probably the Code:
It begins with the common header, the attribute name index should point to a
CONSTANT_Utf8
representing the string "Code"
.
It is followed by two variables: max stack and max locals which represent the stack size and the size of the local variables including the one used for passing arguments to methods. Then there is the code length and the code which is the bytecode that will be executed by the JVM when the method is called.
Right after that, you’ll find a table representing the exception handlers inside the
functions, it indicates the start and the end of the zone where the exception
should be catched, the start of the entry if the exception is raised and the catch
type which is an index to a CONSTANT_Class
into the constant pool.
The catch type can also be 0, in this case it will be called with every exceptions, this is in
generally used for the finally statement.
After the exceptions section, it is possible to add some attributes for the code especially about the stack and the local variable. The Code is an attribute that may contain other attribute.
Fields & Methods
The fields and methods are added in two tables which contain the same elements. The access flags are different for the fields and methods since they are represented above.
After the flag section, we find the name which is an index to a CONSTANT_Utf8
in the constant pool representing the name of the method/field. The
descriptor index is also an index to a CONSTANT_Utf8
which represents a
descriptor defining the method or field type.
Finally, the method and field can have attributes, moreover a method will contain a code attribute which will contain itself the method code.
Conclusion
The class file is really important for the JVM and having a look at the file format explains a lot of things about the way the JVM work internally.
Recently Java SE 8 has been released, there are several small differences with
Java SE 7 even though the major part of the class file has not changed. In
particular, it defines new attributes : RuntimeVisibleTypeAnnotations
,
RuntimeInvisibleTypeAnnotations
and MethodParameters
.
There are also several modifications in different sections changing the default behaviour of the JVM. It also adds precision and constraints to parts of the class file. The version number for Java SE 7 is 51.00 and 52.00 for Java SE 8.
We’ve written a parser of the class file format in Python3 that you can find here : java.py.
It uses the srddl module for python, available here : https://bitbucket.org/kushou/srddl