Lately, we’ve been looking into Java. First, we tried to understand the file format. A Java application is often distributed as a .jar, which is basically a zip archive (you can also find .war files, which are zip archives too). Inside this archive you’ll find several files, in particular the .class files that contain the Java bytecode. Those are the files we’ll look into.
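
Since a .jar is a zip archive, the standard zipfile module is enough to find the .class files. Here is a minimal sketch; the helper name list_class_files and the archive contents are made up for the illustration:

```python
import io
import zipfile

def list_class_files(jar):
    """Return the names of the .class entries of a .jar/.war (path or file object)."""
    with zipfile.ZipFile(jar) as archive:
        return [name for name in archive.namelist() if name.endswith(".class")]

# Build a tiny in-memory archive for the demonstration (placeholder contents).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
    archive.writestr("com/example/Main.class", b"\xca\xfe\xba\xbe")

print(list_class_files(buf))  # ['com/example/Main.class']
```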

The file begins with a header containing the magic number (0xCAFEBABE), the minor version, which is 0, and the major version, which is 0x0033 (51.0) for Java SE 7. Every multi-byte number in the class file is stored in big-endian order. Right after that header comes the constant pool count, which is the number of entries in the constant pool table plus one, followed by the table itself. The entries represent the various items the class refers to, such as constants, classes, etc.
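
As a rough illustration, the header can be read with the struct module. This is only a sketch: read_header is a hypothetical helper and the sample bytes are hand-written, not taken from a real class file.

```python
import struct

def read_header(data):
    """Parse magic, minor_version, major_version and constant_pool_count."""
    # ">" selects big-endian: one u4 followed by three u2 values.
    magic, minor, major, cp_count = struct.unpack_from(">IHHH", data, 0)
    if magic != 0xCAFEBABE:
        raise ValueError("not a class file")
    return {"minor": minor, "major": major, "constant_pool_count": cp_count}

# Hand-written header: magic, minor 0, major 51 (Java SE 7), 16 as pool count.
sample = bytes.fromhex("cafebabe" "0000" "0033" "0010")
print(read_header(sample))
# {'minor': 0, 'major': 51, 'constant_pool_count': 16}
```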

After that come the access flags of the class, then this_class and super_class, which are indices into the constant pool referring to the current class and its superclass. This is followed by the size of the interface table and the table itself, which lists all the interfaces the current class directly implements. Then we find the size of the field table and the table itself, followed by the methods and the attributes of the class.

Here is an overview of the class file:

Class overview

Constant Pool

The constant pool is probably the most important part of the class file. It contains all the information that the other parts of the file refer to. The constant pool is an array of entries whose index starts at 1, not 0. The different structures in the table do not have the same size, so the size of the constant pool in bytes cannot be deduced from its entry count alone. Each entry begins with a one-byte tag indicating the type of the entry:

  • CONSTANT_Utf8 : a string in modified UTF-8. Java uses its own variant of UTF-8 to represent constant string values.

  • CONSTANT_Integer : a constant integer on 4 bytes; like everything in the class file format, it is stored in big-endian.

  • CONSTANT_Float : a float on 4 bytes, following the IEEE 754 single-precision format, which can represent both infinities and NaN.

  • CONSTANT_Long : same as CONSTANT_Integer, but the integer is stored on 8 bytes. A particularity of this entry is that it counts twice in the constant pool’s number of entries: the index that follows it is unused.

  • CONSTANT_Double : like CONSTANT_Float, it follows IEEE 754, here the double-precision format. Like CONSTANT_Long, it stores the number on 8 bytes and counts twice in the constant pool.

  • CONSTANT_Class : represents a class or an interface. It has a single field, an index into the constant pool to a CONSTANT_Utf8 giving the name of the class.

  • CONSTANT_String : represents a constant object of type String. Like CONSTANT_Class, it contains a single piece of information, the index of a CONSTANT_Utf8 in the constant pool holding the string’s value.

  • CONSTANT_Fieldref : a reference to a field. It contains the index of a CONSTANT_Class representing the class or interface that declares the field, and the index of a CONSTANT_NameAndType (see below) giving the field’s name and type (http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.3.2).

  • CONSTANT_Methodref : like CONSTANT_Fieldref, it contains the index of a CONSTANT_Class and the index of a CONSTANT_NameAndType. The CONSTANT_Class must represent a class and not an interface. The CONSTANT_NameAndType must carry a method descriptor (http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.3.3).

  • CONSTANT_InterfaceMethodref : similar to CONSTANT_Methodref, except that the CONSTANT_Class entry must represent an interface.

  • CONSTANT_NameAndType : represents a field or method without indicating the class or interface it belongs to. It contains two indices into the constant pool, both of which must point to a CONSTANT_Utf8: the first one is the name and the second one is a valid descriptor of the field or method.

  • CONSTANT_MethodHandle : this structure is used to resolve a symbolic reference to a method handle. The way the handle is resolved depends on its bytecode behavior, which is indicated by a kind indicator (from 1 to 9). It also contains a two-byte reference, which is an index into the constant pool pointing to a CONSTANT_Fieldref, CONSTANT_Methodref or CONSTANT_InterfaceMethodref depending on the kind.

  • CONSTANT_MethodType : this structure represents a method type. It contains an index to a CONSTANT_Utf8 which must hold a method descriptor.

  • CONSTANT_InvokeDynamic : this structure is used by the invokedynamic instruction to specify a bootstrap method. It contains an index into the bootstrap method table (see attributes below) and an index into the constant pool to a CONSTANT_NameAndType representing the method name and method descriptor.

Here is a global overview of each of those structures:

Constant Pool
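
The constant pool entries described in this section could be read with a loop like the one below. It is a sketch, not our actual parser: read_constant_pool and the FIXED table are names invented for the example, the tag values are the ones from the JVM specification, and CONSTANT_Utf8 payloads are kept as raw bytes because modified UTF-8 is not always valid standard UTF-8.

```python
import struct

# tag -> (name, struct format of the payload that follows the tag byte)
FIXED = {
    3: ("Integer", ">i"), 4: ("Float", ">f"),
    5: ("Long", ">q"), 6: ("Double", ">d"),
    7: ("Class", ">H"), 8: ("String", ">H"),
    9: ("Fieldref", ">HH"), 10: ("Methodref", ">HH"),
    11: ("InterfaceMethodref", ">HH"), 12: ("NameAndType", ">HH"),
    15: ("MethodHandle", ">BH"), 16: ("MethodType", ">H"),
    18: ("InvokeDynamic", ">HH"),
}

def read_constant_pool(data, offset, count):
    """Return (pool, next_offset); pool[0] stays None so indices match the file."""
    pool = [None] * count
    index = 1
    while index < count:
        tag = data[offset]
        offset += 1
        if tag == 1:  # CONSTANT_Utf8: u2 length, then that many bytes
            (length,) = struct.unpack_from(">H", data, offset)
            offset += 2
            pool[index] = ("Utf8", data[offset:offset + length])
            offset += length
        elif tag in FIXED:
            name, fmt = FIXED[tag]
            pool[index] = (name,) + struct.unpack_from(fmt, data, offset)
            offset += struct.calcsize(fmt)
        else:
            raise ValueError("unknown constant pool tag %d" % tag)
        # Long and Double occupy two slots; the second one stays unused.
        index += 2 if tag in (5, 6) else 1
    return pool, offset

# Hand-written pool with count 5: Utf8 "Foo", Class -> #1, Long 42 (slots 3 and 4).
sample = bytes.fromhex("010003466f6f" "070001" "05000000000000002a")
pool, end = read_constant_pool(sample, 0, 5)
print(pool)
# [None, ('Utf8', b'Foo'), ('Class', 1), ('Long', 42), None]
```

Keeping the unused slot 0 (and the unused slot after a Long or Double) in the list means every index found elsewhere in the file can be used directly.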

General and Interfaces

After the constant pool, we find some general information about the current class: its name, its superclass, and its properties, which are stored in the access flags. There are several sets of access flags, one each for classes, fields and methods. The different kinds of access flags are:

Access Flag
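
Access flags are a bit mask, so decoding them is a matter of testing each bit. The sketch below only covers the class-level flags, with the values given in the JVM specification; decode_class_flags is a name chosen for the example, and fields and methods have their own tables.

```python
# Class-level access flags (fields and methods use different tables).
CLASS_FLAGS = {
    0x0001: "ACC_PUBLIC", 0x0010: "ACC_FINAL", 0x0020: "ACC_SUPER",
    0x0200: "ACC_INTERFACE", 0x0400: "ACC_ABSTRACT",
    0x1000: "ACC_SYNTHETIC", 0x2000: "ACC_ANNOTATION", 0x4000: "ACC_ENUM",
}

def decode_class_flags(value):
    """Return the names of the flags set in a class access_flags value."""
    return [name for bit, name in sorted(CLASS_FLAGS.items()) if value & bit]

print(decode_class_flags(0x0021))  # ['ACC_PUBLIC', 'ACC_SUPER']
print(decode_class_flags(0x0601))  # ['ACC_PUBLIC', 'ACC_INTERFACE', 'ACC_ABSTRACT']
```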

After the general information comes the information about the interfaces. All the interfaces the class directly implements are listed in the interface table. Each entry in that table is a constant pool index pointing to a CONSTANT_Class, which must represent an interface.
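
Put together, the general information and the interface table are a short run of two-byte values. A possible sketch, where read_class_info is a hypothetical helper and the indices in the sample are arbitrary:

```python
import struct

def read_class_info(data, offset):
    """Parse access_flags, this_class, super_class and the interface table."""
    access, this_class, super_class, count = struct.unpack_from(">HHHH", data, offset)
    offset += 8
    # The table is `count` constant pool indices of two bytes each.
    interfaces = list(struct.unpack_from(">%dH" % count, data, offset))
    info = {
        "access_flags": access,
        "this_class": this_class,
        "super_class": super_class,
        "interfaces": interfaces,
    }
    return info, offset + 2 * count

# Hand-written sample: flags 0x0021, this_class #2, super_class #4,
# two interfaces at constant pool indices #6 and #8.
sample = bytes.fromhex("0021" "0002" "0004" "0002" "0006" "0008")
info, end = read_class_info(sample, 0)
print(info["interfaces"], end)  # [6, 8] 12
```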

Attributes

Each field, method and class has other characteristics and information, which are stored in attributes. There are several attribute types, and each of them can be applied to one or several of the following: fields, methods, classes and code. The attributes are used to represent:

  • Code
  • Local variables, constant value, information about the stack and exceptions
  • Inner Classes, Bootstrap Methods, Enclosing methods
  • Annotations
  • Information for debug/decompilation
  • Complementary information (Deprecated, Signature…)

Each attribute begins with an index into the constant pool, which must point to a CONSTANT_Utf8 entry giving the type of the attribute. Since the different attribute types have different structures, the attribute length (on 4 bytes) comes next. An implementation of the Java Virtual Machine does not have to handle every kind of attribute: knowing the length allows it to skip an attribute it does not recognize and still read the rest of the file correctly.
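
The common header is what makes generic handling possible: a reader can collect or skip any attribute without understanding its body. A minimal sketch, assuming the number of attributes has already been read (read_attributes is a name invented here):

```python
import struct

def read_attributes(data, offset, count):
    """Read `count` attributes as (name_index, raw_body) pairs."""
    attributes = []
    for _ in range(count):
        # u2 attribute_name_index, u4 attribute_length
        name_index, length = struct.unpack_from(">HI", data, offset)
        offset += 6
        attributes.append((name_index, data[offset:offset + length]))
        offset += length  # the length alone is enough to jump to the next one
    return attributes, offset

# Two hand-written attributes: name #5 with a 2-byte body, name #9 with no body.
sample = bytes.fromhex("0005" "00000002" "0007" "0009" "00000000")
attrs, end = read_attributes(sample, 0, 2)
print(attrs, end)  # [(5, b'\x00\x07'), (9, b'')] 14
```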

The most important attribute is probably the Code:

Code Attribute

It begins with the common header; the attribute name index must point to a CONSTANT_Utf8 representing the string "Code".

It is followed by two values, max stack and max locals, which give the maximum depth of the operand stack and the number of local variables, including the ones used to pass arguments to the method. Then come the code length and the code itself, which is the bytecode the JVM executes when the method is called.

Right after that, you’ll find a table describing the exception handlers of the method. Each entry gives the start and the end of the range in which the exception should be caught, the start of the handler to jump to if the exception is raised, and the catch type, which is an index into the constant pool to a CONSTANT_Class. The catch type can also be 0, in which case the handler is called for every exception; this is generally used for the finally statement.

After the exception table, the Code attribute can carry attributes of its own, in particular about the stack and the local variables. The Code is thus an attribute that may contain other attributes.
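
To make the layout concrete, here is a sketch that parses the body of a Code attribute, that is, the bytes located after the common header. parse_code is a hypothetical helper, the nested attributes are only counted, and the sample is hand-written: a single return instruction (opcode 0xb1) with one catch-all handler.

```python
import struct

def parse_code(body):
    """Parse the body of a Code attribute (without the 6-byte common header)."""
    max_stack, max_locals, code_length = struct.unpack_from(">HHI", body, 0)
    offset = 8
    code = body[offset:offset + code_length]
    offset += code_length
    (handler_count,) = struct.unpack_from(">H", body, offset)
    offset += 2
    handlers = []
    for _ in range(handler_count):
        # (start_pc, end_pc, handler_pc, catch_type); catch_type 0 catches everything
        handlers.append(struct.unpack_from(">HHHH", body, offset))
        offset += 8
    (attributes_count,) = struct.unpack_from(">H", body, offset)
    return {
        "max_stack": max_stack,
        "max_locals": max_locals,
        "code": code,
        "handlers": handlers,
        "attributes_count": attributes_count,
    }

sample = bytes.fromhex(
    "0001" "0001"               # max_stack, max_locals
    "00000001" "b1"             # code_length, code (return)
    "0001" "0000000100000000"   # one handler: start 0, end 1, handler 0, catch 0
    "0000"                      # no nested attributes
)
parsed = parse_code(sample)
print(parsed["code"], parsed["handlers"])  # b'\xb1' [(0, 1, 0, 0)]
```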

Fields & Methods

The fields and methods are stored in two tables whose entries share the same layout. Each entry begins with its access flags, which differ between fields and methods; they are the ones presented above.

After the flag section, we find the name which is an index to a CONSTANT_Utf8 in the constant pool representing the name of the method/field. The descriptor index is also an index to a CONSTANT_Utf8 which represents a descriptor defining the method or field type.

Finally, methods and fields can have attributes. In particular, a method that is neither abstract nor native contains a Code attribute, which itself holds the method’s bytecode.
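
Because fields and methods share the same layout, a single function can read either table. The following is a sketch under that assumption; read_members is a made-up name, attributes are kept as raw (name_index, body) pairs, and the sample indices are arbitrary.

```python
import struct

def read_members(data, offset):
    """Read a field or method table: u2 count, then that many entries."""
    (count,) = struct.unpack_from(">H", data, offset)
    offset += 2
    members = []
    for _ in range(count):
        access, name_index, descriptor_index, attr_count = struct.unpack_from(
            ">HHHH", data, offset)
        offset += 8
        attributes = []
        for _ in range(attr_count):
            # Common attribute header: u2 name index, u4 length, then the body.
            attr_name, length = struct.unpack_from(">HI", data, offset)
            offset += 6
            attributes.append((attr_name, data[offset:offset + length]))
            offset += length
        members.append({
            "access_flags": access,
            "name_index": name_index,
            "descriptor_index": descriptor_index,
            "attributes": attributes,
        })
    return members, offset

# One hand-written entry: flags 0x0009 (public static), name #5, descriptor #6,
# and a single empty attribute whose name is at index #7.
sample = bytes.fromhex("0001" "0009" "0005" "0006" "0001" "0007" "00000000")
members, end = read_members(sample, 0)
print(members[0]["name_index"], members[0]["attributes"], end)  # 5 [(7, b'')] 16
```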

Conclusion

The class file is really important for the JVM, and having a look at the file format explains a lot about the way the JVM works internally.

Java SE 8 was released recently. It brings several small differences from Java SE 7, even though most of the class file format is unchanged. In particular, it defines new attributes: RuntimeVisibleTypeAnnotations, RuntimeInvisibleTypeAnnotations and MethodParameters.

There are also several modifications in different sections that change the default behaviour of the JVM. The new specification also clarifies parts of the class file format and adds constraints to them. The version number is 51.00 for Java SE 7 and 52.00 for Java SE 8.

We’ve written a parser for the class file format in Python 3, which you can find here: java.py.

It uses the srddl module for Python, available here: https://bitbucket.org/kushou/srddl