blob: a79ce5721fecb864b9aa8712139ed332970cb435 [file] [log] [blame]
\documentclass[12pt,twoside]{article}
\usepackage{epsf,a4wide,moreverb,url}
\usepackage{palatino}
\newcommand\jc{{\sffamily BCEL }}
\newcommand\cp{{constant pool }}
\newcommand\cpe{constant pool}
\newcommand\jvm{{Java Virtual Machine }}
\newcommand\jvme{{Java Virtual Machine}}
\newcommand\vm{{Virtual Machine }}
\newcommand\href[2]{#2}
\begin{document}
\title{Byte Code Engineering Library (BCEL)\\
Description and usage manual\\
{\small \textbf{Version 1.0}}}
\author{{\Large Markus Dahm}\\\\
\href{mailto:markus.dahm@inf.fu-berlin.de}{\texttt{markus.dahm@berlin.de}}}
\maketitle
%\tableofcontents
\begin{abstract}
Extensions and improvements of the programming language Java and its
related execution environment (Java Virtual Machine, JVM) are the
subject of a large number of research projects and proposals. There
are projects, for instance, to add parameterized types to Java, to
implement ``Aspect-Oriented Programming'', to perform sophisticated
static analysis, and to improve the run-time performance.
Since Java classes are compiled into portable binary class files
(called \emph{byte code}), it is the most convenient and
platform-independent way to implement these improvements not by
writing a new compiler or changing the JVM, but by transforming the
byte code. These transformations can either be performed after
compile-time, or at load-time. Many programmers are doing this by
implementing their own specialized byte code manipulation tools, which
are, however, restricted in the range of their re-usability.
To deal with the necessary class file transformations, we introduce an
API that helps developers to conveniently implement their
transformations.
\end{abstract}
\section{Introduction}\label{sec:intro}
The Java language \cite{gosling} has become very popular and many
research projects deal with further improvements of the language or
its run-time behavior. The possibility to extend a language with new
concepts is surely a desirable feature, but implementation issues
should be hidden from the user. Fortunately, the concepts of the \jvm
permit the user-transparent implementation of such extensions with
relatively little effort.
Because the target language of Java is an interpreted language with a
small and easy-to-understand set of instructions (the \emph{byte
code}), developers can implement and test their concepts in a very
elegant way. One can write a plug-in replacement for the system's
class loader which is responsible for dynamically loading class files
at run-time and passing the byte code to the \vm (see section
\ref{sec:classloaders}). Class loaders may thus be used to intercept
the loading process and transform classes before they get actually
executed by the JVM \cite{classloader}. While the original class
files always remain unaltered, the behavior of the class loader may be
reconfigured for every execution or instrumented dynamically.
The \jc API (Byte Code Engineering Library), formerly known as
JavaClass, is a toolkit for the static analysis and dynamic creation
or transformation of Java class files. It enables developers to
implement the desired features on a high level of abstraction without
handling all the internal details of the Java class file format and
thus re-inventing the wheel every time. \jc is written entirely in
Java and freely available under the terms of the Apache Software
License. \footnote{The distribution is available at
\url{http://jakarta.apache.org/bcel/}, including several code
examples and javadoc manuals. }
This paper is structured as follows: We give a brief description of
the \jvm and the class file format in section \ref{sec:jvm}. Section
\ref{sec:api} introduces the \jc API. Section \ref{sec:application}
describes some typical application areas and example projects. The
appendix contains code examples that are to long to be presented in
the main part of this paper. All examples are included in the
down-loadable distribution.
\subsection{Related work}
There are a number of proposals and class libraries that have some
similarities with \textsc{BCEL}: The JOIE \cite{joie} toolkit can
be used to instrument class loaders with dynamic behavior. Similarly,
``Binary Component Adaptation'' \cite{bca} allows components to be
adapted and evolved on-the-fly. Han Lee's ``Byte-code Instrumenting
Tool'' \cite{bit} allows the user to insert calls to analysis methods
anywhere in the byte code. The Jasmin language \cite{jasmin} can be
used to hand-write or generate pseudo-assembler code. D-Java
\cite{classfile} and JCF \cite{inside} are class viewing tools.
In contrast to these projects, \jc is intended to be a general purpose
tool for ``byte code engineering''. It gives full control to the
developer on a high level of abstraction and is not restricted to any
particular application area.
\section{The Java Virtual Machine}\label{sec:jvm}
Readers already familiar with the \jvm and the Java class file format
may want to skip this section and proceed with section \ref{sec:api}.
Programs written in the Java language are compiled into a portable
binary format called \emph{byte code}. Every class is represented by
a single class file containing class related data and byte code
instructions. These files are loaded dynamically into an interpreter
(\jvme, JVM) and executed.
Figure \ref{fig:jvm} illustrates the procedure of compiling and
executing a Java class: The source file (\texttt{HelloWorld.java}) is
compiled into a Java class file (\texttt{HelloWorld.class}), loaded by
the byte code interpreter and executed. In order to implement
additional features, researchers may want to transform class files
(drawn with bold lines) before they get actually executed. This
application area is one of the main issues of this article.
\begin{figure}[htbp]
\begin{center}
\leavevmode
\epsfxsize\textwidth
\epsfbox{eps/jvm.eps}
\caption{Compilation and execution of Java classes}
\label{fig:jvm}
\end{center}
\end{figure}
Note that the use of the general term ``Java'' implies two meanings:
on the one hand, Java as a programming language is meant, on the other
hand, the Java Virtual Machine, which is not necessarily targeted by
the Java language exclusively, but may be used by other languages as
well (e.g. Eiffel \cite{eiffel}, or Ada \cite{ada}). We assume the
reader to be familiar with the Java language and to have a general
understanding of the Virtual Machine.
\subsection{Java class file format}\label{sec:format}
Giving a full overview of the design issues of the Java class file
format and the associated byte code instructions is beyond the scope
of this paper. We will just give a brief introduction covering the
details that are necessary for understanding the rest of this
paper. The format of class files and the byte code instruction set are
described in more detail in the ``\jvm Specification'' \cite{jvm}
\footnote{Also available online at
\url{http://www.javasoft.com/docs/books/vmspec/index.html}}, and in
\cite{jasmin}. Especially, we will not deal with the security
constraints that the \jvm has to check at run-time, i.e. the byte code
verifier.
Figure \ref{fig:classfile} shows a simplified example of the contents
of a Java class file: It starts with a header containing a ``magic
number'' (\texttt{0xCAFEBABE}) and the version number, followed by the
\emph{\cpe}, which can be roughly thought of as the text segment of an
executable, the \emph{access rights} of the class encoded by a bit
mask, a list of interfaces implemented by the class, lists containing
the fields and methods of the class, and finally the \emph{class
attributes}, e.g. the \texttt{SourceFile} attribute telling the name
of the source file. Attributes are a way of putting additional,
e.g. user-defined, information into class file data structures. For
example, a custom class loader may evaluate such attribute data in
order to perform its transformations. The JVM specification declares
that unknown, i.e. user-defined attributes must be ignored by any \vm
implementation.
\begin{figure}[htbp]
\begin{center}
\leavevmode
\epsfxsize\textwidth
\epsfbox{eps/classfile.eps}
\caption{Java class file format}
\label{fig:classfile}
\end{center}
\end{figure}
Because all of the information needed to dynamically resolve the
symbolic references to classes, fields and methods at run-time is
coded with string constants, the \cp contains in fact the largest
portion of an average class file, approximately 60\% \cite{statistic}.
The byte code instructions themselves just make up 12\%.
The right upper box shows a ``zoomed'' excerpt of the \cpe, while the
rounded box below depicts some instructions that are contained within
a method of the example class. These instructions represent the
straightforward translation of the well-known statement:
\begin{verbatim}
System.out.println("Hello, world");
\end{verbatim}
The first instruction loads the contents of the field \texttt{out} of
class \texttt{java.lang.System} onto the operand stack. This is an
instance of the class \texttt{java.io.PrintStream}. The \texttt{ldc}
(``Load constant'') pushes a reference to the string "Hello world" on
the stack. The next instruction invokes the instance method
\texttt{println} which takes both values as parameters (Instance
methods always implicitly take an instance reference as their first
argument).
Instructions, other data structures within the class file and
constants themselves may refer to constants in the \cpe. Such
references are implemented via fixed indexes encoded directly into the
instructions. This is illustrated for some items of the figure
emphasized with a surrounding box.
For example, the \texttt{invokevirtual} instruction refers to a
\texttt{MethodRef} constant that contains information about the name
of the called method, the signature (i.e. the encoded argument and
return types), and to which class the method belongs. In fact, as
emphasized by the boxed value, the \texttt{MethodRef} constant itself
just refers to other entries holding the real data, e.g. it refers to
a \texttt{ConstantClass} entry containing a symbolic reference to the
class \texttt{java.io.PrintStream}. To keep the class file compact,
such constants are typically shared by different instructions.
Similarly, a field is represented by a \texttt{Fieldref} constant that
includes information about the name, the type and the containing class
of the field.
The \cp basically holds the following types of constants: References
to methods, fields and classes, strings, integers, floats, longs, and
doubles.
\subsection{Byte code instruction set}\label{sec:code}
The JVM is a stack-oriented interpreter that creates a local stack
frame of fixed size for every method invocation. The size of the local
stack has to be computed by the compiler. Values may also be stored
intermediately in a frame area containing \emph{local variables} which
can be used like a set of registers. These local variables are
numbered from 0 to 65535, i.e. you have a maximum of 65536 of local
variables. The stack frames of caller and callee method are
overlapping, i.e. the caller pushes arguments onto the operand stack
and the called method receives them in local variables.
The byte code instruction set currently consists of 212 instructions,
44 opcodes are marked as reserved and may be used for future
extensions or intermediate optimizations within the Virtual
Machine. The instruction set can be roughly grouped as follows:
\begin{description}
\item[Stack operations:] Constants can be pushed onto the stack either
by loading them from the \cp with the \texttt{ldc} instruction or with
special ``short-cut'' instructions where the operand is encoded into
the instructions, e.g. \texttt{iconst\_0} or \texttt{bipush} (push
byte value).
\item[Arithmetic operations:] The instruction set of the \jvm
distinguishes its operand types using different instructions to
operate on values of specific type. Arithmetic operations starting
with \texttt{i}, for example, denote an integer operation. E.g.,
\texttt{iadd} that adds two integers and pushes the result back on the
stack. The Java types \texttt{boolean}, \texttt{byte},
\texttt{short}, and \texttt{char} are handled as integers by the JVM.
\item[Control flow:] There are branch instructions like \texttt{goto}
and \texttt{if\_icmpeq}, which compares two integers for
equality. There is also a \texttt{jsr} (jump sub-routine) and
\texttt{ret} pair of instructions that is used to implement the
\texttt{finally} clause of \texttt{try-catch} blocks. Exceptions may
be thrown with the \texttt{athrow} instruction.
Branch targets are coded as offsets from the current byte code
position, i.e. with an integer number.
\item[Load and store operations] for local variables like
\texttt{iload} and \texttt{istore}. There are also array operations
like \texttt{iastore} which stores an integer value into an array.
\item[Field access:] The value of an instance field may be retrieved
with \texttt{getfield} and written with \texttt{putfield}. For static
fields, there are \texttt{getstatic} and \texttt{putstatic}
counterparts.
\item[Method invocation:] Methods may either be called via static
references with \texttt{invokesta\-tic} or be bound virtually with the
\texttt{invokevirtual} instruction. Super class methods and private
methods are invoked with \texttt{invokespecial}.
\item[Object allocation:] Class instances are allocated with the
\texttt{new} instruction, arrays of basic type like \texttt{int[]}
with \texttt{newarray}, arrays of references like \texttt{String[][]}
with \texttt{anewarray} or \texttt{multianewarray}.
\item[Conversion and type checking:] For stack operands of basic type
there exist casting operations like \texttt{f2i} which converts a
float value into an integer. The validity of a type cast may be
checked with \texttt{checkcast} and the \texttt{instanceof} operator
can be directly mapped to the equally named instruction.
\end{description}
Most instructions have a fixed length, but there are also some
variable-length instructions: In particular, the \texttt{lookupswitch}
and \texttt{tableswitch} instructions, which are used to implement
\texttt{switch()} statements. Since the number of \texttt{case}
clauses may vary, these instructions contain a variable number of
statements.
We will not list all byte code instructions here, since these are
explained in detail in the JVM specification. The opcode names are
mostly self-explaining, so understanding the following code examples
should be fairly intuitive.
\subsection{Method code}\label{sec:code2}
Non-abstract methods contain an attribute (\texttt{Code}) that holds
the following data: The maximum size of the method's stack frame, the
number of local variables and an array of byte code
instructions. Optionally, it may also contain information about the
names of local variables and source file line numbers that can be used
by a debugger.
Whenever an exception is thrown, the JVM performs exception handling
by looking into a table of exception handlers. The table marks
handlers, i.e. pieces of code, to be responsible for exceptions of
certain types that are raised within a given area of the byte
code. When there is no appropriate handler the exception is propagated
back to the caller of the method. The handler information is itself
stored in an attribute contained within the \texttt{Code} attribute.
\subsection{Byte code offsets}\label{sec:offsets}
Targets of branch instructions like \texttt{goto} are encoded as
relative offsets in the array of byte codes. Exception handlers and
local variables refer to absolute addresses within the byte code. The
former contains references to the start and the end of the
\texttt{try} block, and to the instruction handler code. The latter
marks the range in which a local variable is valid, i.e. its scope.
This makes it difficult to insert or delete code areas on this level
of abstraction, since one has to recompute the offsets every time and
update the referring objects. We will see in section \ref{sec:cgapi}
how \jc remedies this restriction.
\subsection{Type information}\label{sec:types}
Java is a type-safe language and the information about the types of
fields, local variables, and methods is stored in
\emph{signatures}. These are strings stored in the \cp and encoded in
a special format. For example the argument and return types of the
\texttt{main} method
\begin{verbatim}
public static void main(String[] argv)
\end{verbatim}
are represented by the signature
\begin{verbatim}
([java/lang/String;)V
\end{verbatim}
Classes and arrays are internally represented by strings like
\texttt{"java/lang/String"}, basic types like \texttt{float} by an
integer number. Within signatures they are represented by single
characters, e.g., \texttt{"I"}, for integer.
\subsection{Code example}\label{sec:fac}
The following example program prompts for a number and prints the
faculty of it. The \texttt{readLine()} method reading from the
standard input may raise an \texttt{IOException} and if a misspelled
number is passed to \texttt{parseInt()} it throws a
\texttt{NumberFormatException}. Thus, the critical area of code must be
encapsulated in a \texttt{try-catch} block.
{\small \begin{verbatim}
import java.io.*;
public class Faculty {
private static BufferedReader in = new BufferedReader(new
InputStreamReader(System.in));
public static final int fac(int n) {
return (n == 0)? 1 : n * fac(n - 1);
}
public static final int readInt() {
int n = 4711;
try {
System.out.print("Please enter a number> ");
n = Integer.parseInt(in.readLine());
} catch(IOException e1) { System.err.println(e1); }
catch(NumberFormatException e2) { System.err.println(e2); }
return n;
}
public static void main(String[] argv) {
int n = readInt();
System.out.println("Faculty of " + n + " is " + fac(n));
}}
\end{verbatim}}
This code example typically compiles to the following chunks of byte
code:
\subsubsection{Method fac}
{\small \begin{verbatim}
0: iload_0
1: ifne #8
4: iconst_1
5: goto #16
8: iload_0
9: iload_0
10: iconst_1
11: isub
12: invokestatic Faculty.fac (I)I (12)
15: imul
16: ireturn
LocalVariable(start_pc = 0, length = 16, index = 0:int n)
\end{verbatim}}
The method \texttt{fac} has only one local variable, the argument
\texttt{n}, stored in slot 0. This variable's scope ranges from the
start of the byte code sequence to the very end. If the value of
\texttt{n} (stored in local variable 0, i.e. the value fetched with
\texttt{iload\_0}) is not equal to 0, the \texttt{ifne} instruction
branches to the byte code at offset 8, otherwise a 1 is pushed onto
the operand stack and the control flow branches to the final return.
For ease of reading, the offsets of the branch instructions, which are
actually relative, are displayed as absolute addresses in these
examples.
If recursion has to continue, the arguments for the multiplication
(\texttt{n} and \texttt{fac(n - 1)}) are evaluated and the results
pushed onto the operand stack. After the multiplication operation has
been performed the function returns the computed value from the top of
the stack.
\subsubsection{Method readInt}
{\small \begin{verbatim}
0: sipush 4711
3: istore_0
4: getstatic java.lang.System.out Ljava/io/PrintStream;
7: ldc "Please enter a number> "
9: invokevirtual java.io.PrintStream.print (Ljava/lang/String;)V
12: getstatic Faculty.in Ljava/io/BufferedReader;
15: invokevirtual java.io.BufferedReader.readLine ()Ljava/lang/String;
18: invokestatic java.lang.Integer.parseInt (Ljava/lang/String;)I
21: istore_0
22: goto #44
25: astore_1
26: getstatic java.lang.System.err Ljava/io/PrintStream;
29: aload_1
30: invokevirtual java.io.PrintStream.println (Ljava/lang/Object;)V
33: goto #44
36: astore_1
37: getstatic java.lang.System.err Ljava/io/PrintStream;
40: aload_1
41: invokevirtual java.io.PrintStream.println (Ljava/lang/Object;)V
44: iload_0
45: ireturn
Exception handler(s) =
From To Handler Type
4 22 25 java.io.IOException(6)
4 22 36 NumberFormatException(10)
\end{verbatim}}
First the local variable \texttt{n} (in slot 0) is initialized to the
value 4711. The next instruction, \texttt{getstatic}, loads the
static \texttt{System.out} field onto the stack. Then a string is
loaded and printed, a number read from the standard input and
assigned to \texttt{n}.
If one of the called methods (\texttt{readLine()} and
\texttt{parseInt()}) throws an exception, the \jvm calls one of the
declared exception handlers, depending on the type of the exception.
The \texttt{try}-clause itself does not produce any code, it merely
defines the range in which the following handlers are active. In the
example the specified source code area maps to a byte code area
ranging from offset 4 (inclusive) to 22 (exclusive). If no exception
has occurred (``normal'' execution flow) the \texttt{goto}
instructions branch behind the handler code. There the value of
\texttt{n} is loaded and returned.
For example the handler for \texttt{java.io.IOException} starts at
offset 25. It simply prints the error and branches back to the normal
execution flow, i.e. as if no exception had occurred.
\section{The BCEL API}\label{sec:api}
The \jc API abstracts from the concrete circumstances of the \jvm and
how to read and write binary Java class files. The API mainly
consists of three parts:
\begin{enumerate}
\item A package that contains classes that describe ``static''
constraints of class files, i.e., reflect the class file format and
is not intended for byte code modifications. The classes may be
used to read and write class files from or to a file. This is
useful especially for analyzing Java classes without having the
source files at hand. The main data structure is called
\texttt{JavaClass} which contains methods, fields, etc..
\item A package to dynamically generate or modify \texttt{JavaClass}
objects. It may be used e.g. to insert analysis code, to strip
unnecessary information from class files, or to implement the code
generator back-end of a Java compiler.
\item Various code examples and utilities like a class file viewer, a
tool to convert class files into HTML, and a converter from class
files to the Jasmin assembly language \cite{jasmin}.
\end{enumerate}
\subsection{JavaClass}\label{sec:javaclass}
The ``static'' component of the \jc API resides in the package
\path{de.fub.bytecode.classfile} and represents class files. All of the
binary components and data structures declared in the JVM
specification \cite{jvm} and described in section \ref{sec:jvm} are
mapped to classes. Figure \ref{fig:umljc} shows an UML diagram of the
hierarchy of classes of the \jc API. Figure \ref{fig:umlcp} in the
appendix also shows a detailed diagram of the \texttt{ConstantPool}
components.
\begin{figure}[htbp]
\begin{center}
\leavevmode
\epsfysize0.93\textheight
\epsfbox{eps/javaclass.eps}
\caption{UML diagram for the \jc API}\label{fig:umljc}
\end{center}
\end{figure}
The top-level data structure is \texttt{JavaClass}, which in most
cases is created by a \texttt{Class\-Par\-ser} object that is capable
of parsing binary class files. A \texttt{JavaClass} object basically
consists of fields, methods, symbolic references to the super class
and to the implemented interfaces.
The \cp serves as some kind of central repository and is thus of
outstanding importance for all components. \texttt{ConstantPool}
objects contain an array of fixed size of \texttt{Constant} entries,
which may be retrieved via the \texttt{getConstant()} method taking an
integer index as argument. Indexes to the \cp may be contained in
instructions as well as in other components of a class file and in \cp
entries themselves.
Methods and fields contain a signature, symbolically defining their
types. Access flags like \texttt{public static final} occur in
several places and are encoded by an integer bit mask, e.g.
\texttt{public static final} matches to the Java expression
\begin{verbatim}
int access_flags = ACC_PUBLIC | ACC_STATIC | ACC_FINAL;
\end{verbatim}
As mentioned in section \ref{sec:format} already, several components
may contain \emph{attribute} objects: classes, fields, methods, and
\texttt{Code} objects (introduced in section \ref{sec:code2}). The
latter is an attribute itself that contains the actual byte code
array, the maximum stack size, the number of local variables, a table
of handled exceptions, and some optional debugging information coded
as \texttt{LineNumberTable} and \texttt{LocalVariableTable}
attributes. Attributes are in general specific to some data structure,
i.e. no two components share the same kind of attribute, though this
is not explicitly forbidden. In the figure the \texttt{Attribute}
classes are marked with the component they belong to.
\subsection{Class repository}
Using the provided \texttt{Repository} class, reading class files into
a \texttt{JavaClass} object is quite simple:
\begin{verbatim}
JavaClass clazz = Repository.lookupClass("java.lang.String");
\end{verbatim}
The repository also contains methods providing the dynamic equivalent
of the \texttt{instanceof} operator, and other useful routines:
\begin{verbatim}
if(Repository.instanceOf(clazz, super_class) {
...
}
\end{verbatim}
\subsubsection{Accessing class file data}
Information within the class file components may be accessed like Java
Beans via intuitive set/get methods. All of them also define a
\texttt{toString()} method so that implementing a simple class viewer
is very easy. In fact all of the examples used here have been produced
this way:
{\small \begin{verbatim}
System.out.println(clazz);
printCode(clazz.getMethods());
...
public static void printCode(Method[] methods) {
for(int i=0; i < methods.length; i++) {
System.out.println(methods[i]);
Code code = methods[i].getCode();
if(code != null) // Non-abstract method
System.out.println(code);
}
}
\end{verbatim}}
\subsubsection{Analyzing class data}
Last but not least, \jc supports the \emph{Visitor} design
pattern \cite{design}, so one can write visitor objects to traverse
and analyze the contents of a class file. Included in the distribution
is a class \texttt{JasminVisitor} that converts class files into the
Jasmin assembler language \cite{jasmin}.
\subsection{ClassGen}\label{sec:cgapi}
This part of the API (package \path{ork.apache.bcel.generic}) supplies
an abstraction level for creating or transforming class files
dynamically. It makes the static constraints of Java class files like
the hard-coded byte code addresses generic. The generic \cpe, for
example, is implemented by the class \texttt{ConstantPoolGen} which
offers methods for adding different types of constants. Accordingly,
\texttt{ClassGen} offers an interface to add methods, fields, and
attributes. Figure \ref{fig:umlcg} gives an overview of this part of
the API.
\begin{figure}[htbp]
\begin{center}
\leavevmode
\epsfysize0.93\textheight
\epsfbox{eps/classgen.eps}
\caption{UML diagram of the ClassGen API}\label{fig:umlcg}
\end{center}
\end{figure}
\subsubsection{Types}
We abstract from the concrete details of the type signature syntax
(see \ref{sec:types}) by introducing the \texttt{Type} class, which is
used, for example, by methods to define their return and argument
types. Concrete sub-classes are \texttt{BasicType},
\texttt{ObjectType}, and \texttt{ArrayType} which consists of the
element type and the number of dimensions. For commonly used types the
class offers some predefined constants. For example the method
signature of the \texttt{main} method as shown in section
\ref{sec:types} is represented by:
\begin{verbatim}
Type return_type = Type.VOID;
Type[] arg_types = new Type[] { new ArrayType(Type.STRING, 1) };
\end{verbatim}
\texttt{Type} also contains methods to convert types into textual
signatures and vice versa. The sub-classes contain implementations of
the routines and constraints specified by the Java Language
Specification \cite{gosling}.
\subsubsection{Generic fields and methods}
Fields are represented by \texttt{FieldGen} objects, which may be
freely modified by the user. If they have the access rights
\texttt{static final}, i.e. are constants and of basic type, they may
optionally have an initializing value.
Generic methods contain methods to add exceptions the method may
throw, local variables, and exception handlers. The latter two are
represented by user-configurable objects as well. Because exception
handlers and local variables contain references to byte code
addresses, they also take the role of an \emph{instruction targeter}
in our terminology. Instruction targeters contain a method
\texttt{updateTarget()} to redirect a reference. Generic
(non-abstract) methods refer to \emph{instruction lists} that consist
of instruction objects. References to byte code addresses are
implemented by handles to instruction objects. This is explained in
more detail in the following sections.
The maximum stack size needed by the method and the maximum number of
local variables used may be set manually or computed via the
\texttt{setMaxStack()} and \texttt{setMaxLocals()} methods
automatically.
\subsubsection{Instructions}
Modeling instructions as objects may look somewhat odd at first sight,
but in fact enables programmers to obtain a high-level view upon
control flow without handling details like concrete byte code offsets.
Instructions consist of a tag, i.e. an opcode, their length in bytes
and an offset (or index) within the byte code. Since many instructions
are immutable, the \texttt{InstructionConstants} interface offers
shareable predefined ``fly-weight'' constants to use.
Instructions are grouped via sub-classing, the type hierarchy of
instruction classes is illustrated by (incomplete) figure
\ref{fig:umlinstr} in the appendix. The most important family of
instructions are the \emph{branch instructions}, e.g. \texttt{goto},
that branch to targets somewhere within the byte code. Obviously,
this makes them candidates for playing an \texttt{InstructionTargeter}
role, too. Instructions are further grouped by the interfaces they
implement, there are, e.g., \texttt{TypedInstruction}s that are
associated with a specific type like \texttt{ldc}, or
\texttt{ExceptionThrower} instructions that may raise exceptions when
executed.
All instructions can be traversed via \texttt{accept(Visitor v)} methods,
i.e., the Visitor design pattern. There is however some special trick
in these methods that allows to merge the handling of certain
instruction groups. The \texttt{accept()} do not only call the
corresponding \texttt{visit()} method, but call \texttt{visit()}
methods of their respective super classes and implemented interfaces
first, i.e. the most specific \texttt{visit()} call is last. Thus one
can group the handling of, say, all \texttt{BranchInstruction}s into
one single method.
For debugging purposes it may even make sense to ``invent'' your own
instructions. In a sophisticated code generator like the one used as a
backend of the Barat framework \cite{barat} one often has to insert
temporary \texttt{nop} (No operation) instructions. When examining
the produced code it may be very difficult to track back where the
\texttt{nop} was actually inserted. One could think of a derived
\texttt{nop2} instruction that contains additional debugging
information. When the instruction list is dumped to byte code, the
extra data is simply dropped.
One could also think of new byte code instructions operating on
complex numbers that are replaced by normal byte code upon load-time
or are recognized by a new JVM.
\subsubsection{Instruction lists}\label{sec:il}
An \emph{instruction list} is implemented by a list of
\emph{instruction handles} encapsulating instruction objects.
References to instructions in the list are thus not implemented by
direct pointers to instructions but by pointers to instruction
\emph{handles}. This makes appending, inserting and deleting areas of
code very simple. Since we use symbolic references, computation of
concrete byte code offsets does not need to occur until finalization,
i.e. until the user has finished the process of generating or
transforming code. We will use the term instruction handle and
instruction synonymously throughout the rest of the paper.
Instruction handles may contain additional user-defined data using the
\texttt{addAttribute()} method.
\paragraph{Appending.}
One can append instructions or other instruction lists anywhere to an
existing list. The instructions are appended after the given
instruction handle. All append methods return a new instruction
handle which may then be used as the target of a branch instruction,
e.g..
{\small \begin{verbatim}
InstructionList il = new InstructionList();
...
GOTO g = new GOTO(null);
il.append(g);
...
InstructionHandle ih = il.append(InstructionConstants.ACONST_NULL);
g.setTarget(ih);
\end{verbatim}}
\paragraph{Inserting.}
Instructions may be inserted anywhere into an existing list. They are
inserted before the given instruction handle. All insert methods
return a new instruction handle which may then be used as the start
address of an exception handler, for example.
{\small \begin{verbatim}
InstructionHandle start = il.insert(insertion_point,
InstructionConstants.NOP);
...
mg.addExceptionHandler(start, end, handler, "java.io.IOException");
\end{verbatim}}
\paragraph{Deleting.}
Deletion of instructions is also very straightforward; all instruction
handles and the contained instructions within a given range are
removed from the instruction list and disposed. The \texttt{delete()}
method may however throw a \texttt{TargetLostException} when there are
instruction targeters still referencing one of the deleted
instructions. The user is forced to handle such exceptions in a
\texttt{try-catch} block and redirect these references elsewhere. The
\emph{peep hole} optimizer described in section \ref{sec:nop} gives a
detailed example for this.
{\small \begin{verbatim}
try {
il.delete(first, last);
} catch(TargetLostException e) {
InstructionHandle[] targets = e.getTargets();
for(int i=0; i < targets.length; i++) {
InstructionTargeter[] targeters = targets[i].getTargeters();
for(int j=0; j < targeters.length; j++)
targeters[j].updateTarget(targets[i], new_target);
}
}
\end{verbatim}}
\paragraph{Finalizing.}
When the instruction list is ready to be dumped to pure byte code, all
symbolic references must be mapped to real byte code offsets. This is
done by the \texttt{getByteCode()} method which is called by default
by \texttt{MethodGen.getMethod()}. Afterwards you should call
\texttt{dispose()} so that the instruction handles can be reused
internally. This helps to reduce memory usage.
\begin{verbatim}
InstructionList il = new InstructionList();
ClassGen cg = new ClassGen("HelloWorld", "java.lang.Object",
"<generated>", ACC_PUBLIC | ACC_SUPER,
null);
MethodGen mg = new MethodGen(ACC_STATIC | ACC_PUBLIC,
Type.VOID, new Type[] {
new ArrayType(Type.STRING, 1)
}, new String[] { "argv" },
"main", "HelloWorld", il, cp);
...
cg.addMethod(mg.getMethod());
il.dispose(); // Reuse instruction handles of list
\end{verbatim}
\subsubsection{Code example revisited}
Using instruction lists gives us a generic view upon the code: In
Figure \ref{fig:il} we again present the code chunk of the
\texttt{readInt()} method of the faculty example in section
\ref{sec:fac}: The local variables \texttt{n} and \texttt{e1} both
hold two references to instructions, defining their scope. There are
two \texttt{goto}s branching to the \texttt{iload} at the end of the
method. One of the exception handlers is displayed, too: it references
the start and the end of the \texttt{try} block and also the exception
handler code.
\begin{figure}[htbp]
\begin{center}
\leavevmode
\epsfxsize\textwidth
\epsfbox{eps/il.eps}
\caption{Instruction list for \texttt{readInt()} method}
\label{fig:il}
\end{center}
\end{figure}
\subsubsection{Instruction factories}\label{sec:compound}
To simplify the creation of certain instructions the user can use the
supplied \texttt{InstructionFactory} class which offers a lot of
useful methods to create instructions from scratch. Alternatively, he
can also use \emph{compound instructions}: When producing byte code,
some patterns typically occur very frequently, for instance the
compilation of arithmetic or comparison expressions. You certainly do
not want to rewrite the code that translates such expressions into
byte code in every place they may appear. In order to support this,
the \jc API includes a \emph{compound instruction} (an interface with
a single \texttt{getInstructionList()} method). Instances of this
class may be used in any place where normal instructions would occur,
particularly in append operations.
\paragraph{Example: Pushing constants.}
Pushing constants onto the operand stack may be coded in different
ways. As explained in section \ref{sec:code} there are some
``short-cut'' instructions that can be used to make the produced byte
code more compact. The smallest instruction to push a single
\texttt{1} onto the stack is \texttt{iconst\_1}, other possibilities
are \texttt{bipush} (can be used to push values between -128 and 127),
\texttt{sipush} (between -32768 and 32767), or \texttt{ldc} (load
constant from \cpe).
Instead of repeatedly selecting the most compact instruction in, say,
a switch, one can use the compound \texttt{PUSH} instruction whenever
pushing a constant number or string. It will produce the appropriate
byte code instruction and insert entries into to \cp if necessary.
\begin{verbatim}
il.append(new PUSH(cp, "Hello, world"));
il.append(new PUSH(cp, 4711));
\end{verbatim}
\subsubsection{Code patterns using regular expressions}\label{sec:peephole}
When transforming code, for instance during optimization or when
inserting analysis method calls, one typically searches for certain
patterns of code to perform the transformation at. To simplify
handling such situations \jc introduces a special feature: One can
search for given code patterns within an instruction list using
\emph{regular expressions}. In such expressions, instructions are
represented by symbolic names, e.g. "\texttt{`IfInstruction'}". Meta
characters like \verb|+|, \verb|*|, and \verb@(..|..)@ have their
usual meanings. Thus, the expression
\begin{verbatim}
"`NOP'+(`ILOAD__'|`ALOAD__')*"
\end{verbatim}
represents a piece of code consisting of at least one \texttt{NOP}
followed by a possibly empty sequence of \texttt{ILOAD} and
\texttt{ALOAD} instructions.
The \texttt{search()} method of class \texttt{FindPattern} gets an
instruction list and a regular expression as arguments and returns an
array describing the area of matched instructions. Additional
constraints to the matching area of instructions, which can not be
implemented via regular expressions, may be expressed via \emph{code
constraints}.
\subsubsection{Example: Optimizing boolean expressions.}
In Java, boolean values are mapped to 1 and to 0, respectively. Thus,
the simplest way to evaluate boolean expressions is to push a 1 or a 0
onto the operand stack depending on the truth value of the expression.
But this way, the subsequent combination of boolean expressions (with
\verb|&&|, e.g) yields long chunks of code that push lots of 1s and
0s onto the stack.
When the code has been finalized these chunks can be optimized with a
\emph{peep hole} algorithm: An \texttt{IfInstruction} (e.g. the
comparison of two integers: \texttt{if\_icmpeq}) that either produces
a 1 or a 0 on the stack and is followed by an \texttt{ifne}
instruction (branch if stack value $\neq$ 0) may be replaced by the
\texttt{IfInstruction} with its branch target replaced by the target
of the \texttt{ifne} instruction:
{\small \verbatimtabinput{bool.java}}
The applied code constraint object ensures that the matched code
really corresponds to the targeted expression pattern. Subsequent
application of this algorithm removes all unnecessary stack operations
and branch instructions from the byte code. If any of the deleted
instructions is still referenced by an \texttt{InstructionTargeter}
object, the reference has to be updated in the \texttt{catch}-clause.
Code example \ref{sec:hello} gives a verbose example of how to create
a class file, while example \ref{sec:nop} shows how to implement a
simple peephole optimizer and how to deal with \texttt{TargetLost}
exceptions.
\paragraph{Example application:}
The expression
\begin{verbatim}
if((a == null) || (i < 2))
System.out.println("Ooops");
\end{verbatim}
can be mapped to both of the chunks of byte code shown in figure
\ref{fig:code}. The left column represents the unoptimized code while
the right column displays the same code after an aggressively
optimizing peep hole algorithm has been applied:
\begin{figure}[hpt]
\begin{minipage}{0.49\textwidth}
{\small \verbatimtabinput{unopt}} \vfil
\end{minipage}
\begin{minipage}{0.49\textwidth}
{\small \verbatimtabinput{opt}} \vfil
\end{minipage}\label{fig:code}\caption{Optimizing boolean expressions}
\begin{center}
\end{center}
\end{figure}
\section{Application areas}\label{sec:application}
There are many possible application areas for \jc ranging from class
browsers, profilers, byte code optimizers, and compilers to
sophisticated run-time analysis tools and extensions to the Java
language \cite{agesen, myers}.
Compilers like the Barat compiler \cite{barat} use \jc to implement a
byte code generating back end. Other possible application areas are
the static analysis of byte code \cite{thies} or examining the
run-time behavior of classes by inserting calls to profiling methods
into the code. Further examples are extending Java with Eiffel-like
assertions \cite{jawa}, automated delegation \cite{classfilters}, or
with the concepts of ``Aspect-Oriented Programming'' \cite{aspect}.
\subsection{Class loaders}\label{sec:classloaders}
Class loaders are responsible for loading class files from the file
system or other resources and passing the byte code to the \vm
\cite{classloader}. A custom \texttt{ClassLoader} object may be used
to intercept the standard procedure of loading a class, i.e. the
system class loader, and perform some transformations before actually
passing the byte code to the JVM.
A possible scenario is described in figure \ref{fig:classloader}:
During run-time the \vm requests a custom class loader to load a given
class. But before the JVM actually sees the byte code, the class
loader makes a ``side-step'' and performs some transformation to the
class. To make sure that the modified byte code is still valid and
does not violate any of the JVM's rules it is checked by the verifier
before the JVM finally executes it.
\begin{figure}[ht]
\begin{center}
\leavevmode
\epsfxsize\textwidth
\epsfbox{eps/classloader.eps}
\caption{Class loaders}\label{fig:classloader}
\end{center}
\end{figure}
Using class loaders is an elegant way of extending the \jvm with new
features without actually modifying it. This concept enables
developers to use \emph{load-time reflection} to implement their ideas
as opposed to the static reflection supported by the Java Reflection
API \cite{reflection}. Load-time transformations supply the user with
a new level of abstraction. He is not strictly tied to the static
constraints of the original authors of the classes but may customize
the applications with third-party code in order to benefit from new
features. Such transformations may be executed on demand and neither
interfere with other users, nor alter the original byte code. In fact,
class loaders may even create classes \emph{ad hoc} without loading a
file at all.
\subsubsection{Example: Poor Man's Genericity}
The ``Poor Man's Genericity'' project \cite{pmg} that extends Java
with parameterized classes, for example, uses \jc in two places to
generate instances of parameterized classes: During compile-time (the
standard \texttt{javac} with some slightly changed classes) and at
run-time using a custom class loader. The compiler puts some
additional type information into class files which is evaluated at
load-time by the class loader. The class loader performs some
transformations on the loaded class and passes them to the VM. The
following algorithm illustrates how the load method of the class
loader fulfills the request for a parameterized class,
e.g. \verb|Stack<String>|
\begin{enumerate}
\item Search for class \texttt{Stack}, load it, and check for a
certain class attribute containing additional type information. I.e.
the attribute defines the ``real'' name of the class,
i.e. \verb|Stack<A>|.
\item Replace all occurrences and references to the formal type
\texttt{A} with references to the actual type \texttt{String}. For
example the method
\begin{verbatim}
void push(A obj) { ... }
\end{verbatim}
becomes
\begin{verbatim}
void push(String obj) { ... }
\end{verbatim}
\item Return the resulting class to the Virtual Machine.
\end{enumerate}
\bibliographystyle{alpha}\bibliography{manual}
\newpage\appendix
\pagestyle{empty}
\include{appendix}
\include{diagrams}
\end{document}
% LocalWords: Freie Universit Institut Informatik dahm inf fu berlin de JOIE
% LocalWords: JCF HelloWorld Jasmin Eiffel SourceFile classfile lang io ldc ar
% LocalWords: PrintStream invokevirtual MethodRef ConstantClass Fieldref dup
% LocalWords: iconst bipush iadd cmpeq jsr athrow iload istore iastore er ic
% LocalWords: getfield putfield getstatic putstatic invokestatic newarray nop
% LocalWords: anewarray checkcast instanceof lookupswitch tableswitch Barat VM
% LocalWords: ConstantPool getConstant LineNumberTable LocalvariableTable ifne
% LocalWords: invokesta invokespecial multianewarray ConstantPoolGen MethodGen
% LocalWords: BasicType ObjectType ArrayType InstructionTarge getByteCode bool
% LocalWords: BranchInstruction InstructionTargeter InstructionHandle regex
% LocalWords: TargetLostException codeconstraint CompoundInstruction sipush
% LocalWords: getInstructionList FindPattern CodeConstraint IfInstruction fac
% LocalWords: TargetLost javac readLine IOException parseInt readInt CLASSPATH
% LocalWords: NumberFormatException toString JasminVisitor getSignature JVM's
% LocalWords: FieldGen updateTarget LGPL BCEL LocalVariableTable setMaxStack
% LocalWords: setMaxLocals InstructionConstants TypedInstruction addAttribute
% LocalWords: ExceptionThrower getMethod InstructionFactory ALOAD