| \documentclass[12pt,twoside]{article} |
| |
| \usepackage{epsf,a4wide,moreverb,url} |
| \usepackage{palatino} |
| |
| \newcommand\jc{{\sffamily BCEL }} |
| \newcommand\cp{{constant pool }} |
| \newcommand\cpe{constant pool} |
| \newcommand\jvm{{Java Virtual Machine }} |
| \newcommand\jvme{{Java Virtual Machine}} |
| \newcommand\vm{{Virtual Machine }} |
| \newcommand\href[2]{#2} |
| |
| \begin{document} |
| |
| \title{Byte Code Engineering Library (BCEL)\\ |
| Description and usage manual\\ |
| {\small \textbf{Version 1.0}}} |
| |
| \author{{\Large Markus Dahm}\\\\ |
| \href{mailto:markus.dahm@inf.fu-berlin.de}{\texttt{markus.dahm@berlin.de}}} |
| |
| \maketitle |
| |
| %\tableofcontents |
| |
| \begin{abstract} |
| Extensions and improvements of the programming language Java and its |
| related execution environment (Java Virtual Machine, JVM) are the |
| subject of a large number of research projects and proposals. There |
| are projects, for instance, to add parameterized types to Java, to |
| implement ``Aspect-Oriented Programming'', to perform sophisticated |
| static analysis, and to improve the run-time performance. |
| |
| Since Java classes are compiled into portable binary class files |
| (called \emph{byte code}), it is the most convenient and |
| platform-independent way to implement these improvements not by |
| writing a new compiler or changing the JVM, but by transforming the |
| byte code. These transformations can either be performed after |
| compile-time, or at load-time. Many programmers are doing this by |
| implementing their own specialized byte code manipulation tools, which |
| are, however, restricted in the range of their re-usability. |
| |
| To deal with the necessary class file transformations, we introduce an |
| API that helps developers to conveniently implement their |
| transformations. |
| \end{abstract} |
| |
| \section{Introduction}\label{sec:intro} |
| |
| The Java language \cite{gosling} has become very popular and many |
| research projects deal with further improvements of the language or |
| its run-time behavior. The possibility to extend a language with new |
| concepts is surely a desirable feature, but implementation issues |
| should be hidden from the user. Fortunately, the concepts of the \jvm |
| permit the user-transparent implementation of such extensions with |
| relatively little effort. |
| |
| Because the target language of Java is an interpreted language with a |
| small and easy-to-understand set of instructions (the \emph{byte |
| code}), developers can implement and test their concepts in a very |
| elegant way. One can write a plug-in replacement for the system's |
| class loader which is responsible for dynamically loading class files |
| at run-time and passing the byte code to the \vm (see section |
| \ref{sec:classloaders}). Class loaders may thus be used to intercept |
| the loading process and transform classes before they get actually |
| executed by the JVM \cite{classloader}. While the original class |
| files always remain unaltered, the behavior of the class loader may be |
| reconfigured for every execution or instrumented dynamically. |
| |
| The \jc API (Byte Code Engineering Library), formerly known as |
| JavaClass, is a toolkit for the static analysis and dynamic creation |
| or transformation of Java class files. It enables developers to |
| implement the desired features on a high level of abstraction without |
| handling all the internal details of the Java class file format and |
| thus re-inventing the wheel every time. \jc is written entirely in |
| Java and freely available under the terms of the Apache Software |
| License. \footnote{The distribution is available at |
| \url{http://jakarta.apache.org/bcel/}, including several code |
| examples and javadoc manuals. } |
| |
| This paper is structured as follows: We give a brief description of |
| the \jvm and the class file format in section \ref{sec:jvm}. Section |
| \ref{sec:api} introduces the \jc API. Section \ref{sec:application} |
| describes some typical application areas and example projects. The |
| appendix contains code examples that are to long to be presented in |
| the main part of this paper. All examples are included in the |
| down-loadable distribution. |
| |
| \subsection{Related work} |
| |
| There are a number of proposals and class libraries that have some |
| similarities with \textsc{BCEL}: The JOIE \cite{joie} toolkit can |
| be used to instrument class loaders with dynamic behavior. Similarly, |
| ``Binary Component Adaptation'' \cite{bca} allows components to be |
| adapted and evolved on-the-fly. Han Lee's ``Byte-code Instrumenting |
| Tool'' \cite{bit} allows the user to insert calls to analysis methods |
| anywhere in the byte code. The Jasmin language \cite{jasmin} can be |
| used to hand-write or generate pseudo-assembler code. D-Java |
| \cite{classfile} and JCF \cite{inside} are class viewing tools. |
| |
| In contrast to these projects, \jc is intended to be a general purpose |
| tool for ``byte code engineering''. It gives full control to the |
| developer on a high level of abstraction and is not restricted to any |
| particular application area. |
| |
| \section{The Java Virtual Machine}\label{sec:jvm} |
| |
| Readers already familiar with the \jvm and the Java class file format |
| may want to skip this section and proceed with section \ref{sec:api}. |
| |
| Programs written in the Java language are compiled into a portable |
| binary format called \emph{byte code}. Every class is represented by |
| a single class file containing class related data and byte code |
| instructions. These files are loaded dynamically into an interpreter |
| (\jvme, JVM) and executed. |
| |
| Figure \ref{fig:jvm} illustrates the procedure of compiling and |
| executing a Java class: The source file (\texttt{HelloWorld.java}) is |
| compiled into a Java class file (\texttt{HelloWorld.class}), loaded by |
| the byte code interpreter and executed. In order to implement |
| additional features, researchers may want to transform class files |
| (drawn with bold lines) before they get actually executed. This |
| application area is one of the main issues of this article. |
| |
| \begin{figure}[htbp] |
| \begin{center} |
| \leavevmode |
| \epsfxsize\textwidth |
| \epsfbox{eps/jvm.eps} |
| \caption{Compilation and execution of Java classes} |
| \label{fig:jvm} |
| \end{center} |
| \end{figure} |
| |
| Note that the use of the general term ``Java'' implies two meanings: |
| on the one hand, Java as a programming language is meant, on the other |
| hand, the Java Virtual Machine, which is not necessarily targeted by |
| the Java language exclusively, but may be used by other languages as |
| well (e.g. Eiffel \cite{eiffel}, or Ada \cite{ada}). We assume the |
| reader to be familiar with the Java language and to have a general |
| understanding of the Virtual Machine. |
| |
| \subsection{Java class file format}\label{sec:format} |
| |
| Giving a full overview of the design issues of the Java class file |
| format and the associated byte code instructions is beyond the scope |
| of this paper. We will just give a brief introduction covering the |
| details that are necessary for understanding the rest of this |
| paper. The format of class files and the byte code instruction set are |
| described in more detail in the ``\jvm Specification'' \cite{jvm} |
| \footnote{Also available online at |
| \url{http://www.javasoft.com/docs/books/vmspec/index.html}}, and in |
| \cite{jasmin}. Especially, we will not deal with the security |
| constraints that the \jvm has to check at run-time, i.e. the byte code |
| verifier. |
| |
| Figure \ref{fig:classfile} shows a simplified example of the contents |
| of a Java class file: It starts with a header containing a ``magic |
| number'' (\texttt{0xCAFEBABE}) and the version number, followed by the |
| \emph{\cpe}, which can be roughly thought of as the text segment of an |
| executable, the \emph{access rights} of the class encoded by a bit |
| mask, a list of interfaces implemented by the class, lists containing |
| the fields and methods of the class, and finally the \emph{class |
| attributes}, e.g. the \texttt{SourceFile} attribute telling the name |
| of the source file. Attributes are a way of putting additional, |
| e.g. user-defined, information into class file data structures. For |
| example, a custom class loader may evaluate such attribute data in |
| order to perform its transformations. The JVM specification declares |
| that unknown, i.e. user-defined attributes must be ignored by any \vm |
| implementation. |
| |
| \begin{figure}[htbp] |
| \begin{center} |
| \leavevmode |
| \epsfxsize\textwidth |
| \epsfbox{eps/classfile.eps} |
| \caption{Java class file format} |
| \label{fig:classfile} |
| \end{center} |
| \end{figure} |
| |
| Because all of the information needed to dynamically resolve the |
| symbolic references to classes, fields and methods at run-time is |
| coded with string constants, the \cp contains in fact the largest |
| portion of an average class file, approximately 60\% \cite{statistic}. |
| The byte code instructions themselves just make up 12\%. |
| |
| The right upper box shows a ``zoomed'' excerpt of the \cpe, while the |
| rounded box below depicts some instructions that are contained within |
| a method of the example class. These instructions represent the |
| straightforward translation of the well-known statement: |
| |
| \begin{verbatim} |
| System.out.println("Hello, world"); |
| \end{verbatim} |
| |
| The first instruction loads the contents of the field \texttt{out} of |
| class \texttt{java.lang.System} onto the operand stack. This is an |
| instance of the class \texttt{java.io.PrintStream}. The \texttt{ldc} |
| (``Load constant'') pushes a reference to the string "Hello world" on |
| the stack. The next instruction invokes the instance method |
| \texttt{println} which takes both values as parameters (Instance |
| methods always implicitly take an instance reference as their first |
| argument). |
| |
| Instructions, other data structures within the class file and |
| constants themselves may refer to constants in the \cpe. Such |
| references are implemented via fixed indexes encoded directly into the |
| instructions. This is illustrated for some items of the figure |
| emphasized with a surrounding box. |
| |
| For example, the \texttt{invokevirtual} instruction refers to a |
| \texttt{MethodRef} constant that contains information about the name |
| of the called method, the signature (i.e. the encoded argument and |
| return types), and to which class the method belongs. In fact, as |
| emphasized by the boxed value, the \texttt{MethodRef} constant itself |
| just refers to other entries holding the real data, e.g. it refers to |
| a \texttt{ConstantClass} entry containing a symbolic reference to the |
| class \texttt{java.io.PrintStream}. To keep the class file compact, |
| such constants are typically shared by different instructions. |
| Similarly, a field is represented by a \texttt{Fieldref} constant that |
| includes information about the name, the type and the containing class |
| of the field. |
| |
| The \cp basically holds the following types of constants: References |
| to methods, fields and classes, strings, integers, floats, longs, and |
| doubles. |
| |
| \subsection{Byte code instruction set}\label{sec:code} |
| |
| The JVM is a stack-oriented interpreter that creates a local stack |
| frame of fixed size for every method invocation. The size of the local |
| stack has to be computed by the compiler. Values may also be stored |
| intermediately in a frame area containing \emph{local variables} which |
| can be used like a set of registers. These local variables are |
| numbered from 0 to 65535, i.e. you have a maximum of 65536 of local |
| variables. The stack frames of caller and callee method are |
| overlapping, i.e. the caller pushes arguments onto the operand stack |
| and the called method receives them in local variables. |
| |
| The byte code instruction set currently consists of 212 instructions, |
| 44 opcodes are marked as reserved and may be used for future |
| extensions or intermediate optimizations within the Virtual |
| Machine. The instruction set can be roughly grouped as follows: |
| |
| \begin{description} |
| \item[Stack operations:] Constants can be pushed onto the stack either |
| by loading them from the \cp with the \texttt{ldc} instruction or with |
| special ``short-cut'' instructions where the operand is encoded into |
| the instructions, e.g. \texttt{iconst\_0} or \texttt{bipush} (push |
| byte value). |
| |
| \item[Arithmetic operations:] The instruction set of the \jvm |
| distinguishes its operand types using different instructions to |
| operate on values of specific type. Arithmetic operations starting |
| with \texttt{i}, for example, denote an integer operation. E.g., |
| \texttt{iadd} that adds two integers and pushes the result back on the |
| stack. The Java types \texttt{boolean}, \texttt{byte}, |
| \texttt{short}, and \texttt{char} are handled as integers by the JVM. |
| |
| \item[Control flow:] There are branch instructions like \texttt{goto} |
| and \texttt{if\_icmpeq}, which compares two integers for |
| equality. There is also a \texttt{jsr} (jump sub-routine) and |
| \texttt{ret} pair of instructions that is used to implement the |
| \texttt{finally} clause of \texttt{try-catch} blocks. Exceptions may |
| be thrown with the \texttt{athrow} instruction. |
| |
| Branch targets are coded as offsets from the current byte code |
| position, i.e. with an integer number. |
| |
| \item[Load and store operations] for local variables like |
| \texttt{iload} and \texttt{istore}. There are also array operations |
| like \texttt{iastore} which stores an integer value into an array. |
| |
| \item[Field access:] The value of an instance field may be retrieved |
| with \texttt{getfield} and written with \texttt{putfield}. For static |
| fields, there are \texttt{getstatic} and \texttt{putstatic} |
| counterparts. |
| |
| \item[Method invocation:] Methods may either be called via static |
| references with \texttt{invokesta\-tic} or be bound virtually with the |
| \texttt{invokevirtual} instruction. Super class methods and private |
| methods are invoked with \texttt{invokespecial}. |
| |
| \item[Object allocation:] Class instances are allocated with the |
| \texttt{new} instruction, arrays of basic type like \texttt{int[]} |
| with \texttt{newarray}, arrays of references like \texttt{String[][]} |
| with \texttt{anewarray} or \texttt{multianewarray}. |
| |
| \item[Conversion and type checking:] For stack operands of basic type |
| there exist casting operations like \texttt{f2i} which converts a |
| float value into an integer. The validity of a type cast may be |
| checked with \texttt{checkcast} and the \texttt{instanceof} operator |
| can be directly mapped to the equally named instruction. |
| \end{description} |
| |
| Most instructions have a fixed length, but there are also some |
| variable-length instructions: In particular, the \texttt{lookupswitch} |
| and \texttt{tableswitch} instructions, which are used to implement |
| \texttt{switch()} statements. Since the number of \texttt{case} |
| clauses may vary, these instructions contain a variable number of |
| statements. |
| |
| We will not list all byte code instructions here, since these are |
| explained in detail in the JVM specification. The opcode names are |
| mostly self-explaining, so understanding the following code examples |
| should be fairly intuitive. |
| |
| \subsection{Method code}\label{sec:code2} |
| |
| Non-abstract methods contain an attribute (\texttt{Code}) that holds |
| the following data: The maximum size of the method's stack frame, the |
| number of local variables and an array of byte code |
| instructions. Optionally, it may also contain information about the |
| names of local variables and source file line numbers that can be used |
| by a debugger. |
| |
| Whenever an exception is thrown, the JVM performs exception handling |
| by looking into a table of exception handlers. The table marks |
| handlers, i.e. pieces of code, to be responsible for exceptions of |
| certain types that are raised within a given area of the byte |
| code. When there is no appropriate handler the exception is propagated |
| back to the caller of the method. The handler information is itself |
| stored in an attribute contained within the \texttt{Code} attribute. |
| |
| \subsection{Byte code offsets}\label{sec:offsets} |
| |
| Targets of branch instructions like \texttt{goto} are encoded as |
| relative offsets in the array of byte codes. Exception handlers and |
| local variables refer to absolute addresses within the byte code. The |
| former contains references to the start and the end of the |
| \texttt{try} block, and to the instruction handler code. The latter |
| marks the range in which a local variable is valid, i.e. its scope. |
| This makes it difficult to insert or delete code areas on this level |
| of abstraction, since one has to recompute the offsets every time and |
| update the referring objects. We will see in section \ref{sec:cgapi} |
| how \jc remedies this restriction. |
| |
| \subsection{Type information}\label{sec:types} |
| |
| Java is a type-safe language and the information about the types of |
| fields, local variables, and methods is stored in |
| \emph{signatures}. These are strings stored in the \cp and encoded in |
| a special format. For example the argument and return types of the |
| \texttt{main} method |
| |
| \begin{verbatim} |
| public static void main(String[] argv) |
| \end{verbatim} |
| |
| are represented by the signature |
| |
| \begin{verbatim} |
| ([java/lang/String;)V |
| \end{verbatim} |
| |
| Classes and arrays are internally represented by strings like |
| \texttt{"java/lang/String"}, basic types like \texttt{float} by an |
| integer number. Within signatures they are represented by single |
| characters, e.g., \texttt{"I"}, for integer. |
| |
| \subsection{Code example}\label{sec:fac} |
| |
| The following example program prompts for a number and prints the |
| faculty of it. The \texttt{readLine()} method reading from the |
| standard input may raise an \texttt{IOException} and if a misspelled |
| number is passed to \texttt{parseInt()} it throws a |
| \texttt{NumberFormatException}. Thus, the critical area of code must be |
| encapsulated in a \texttt{try-catch} block. |
| |
| {\small \begin{verbatim} |
| import java.io.*; |
| public class Faculty { |
| private static BufferedReader in = new BufferedReader(new |
| InputStreamReader(System.in)); |
| public static final int fac(int n) { |
| return (n == 0)? 1 : n * fac(n - 1); |
| } |
| public static final int readInt() { |
| int n = 4711; |
| try { |
| System.out.print("Please enter a number> "); |
| n = Integer.parseInt(in.readLine()); |
| } catch(IOException e1) { System.err.println(e1); } |
| catch(NumberFormatException e2) { System.err.println(e2); } |
| return n; |
| } |
| public static void main(String[] argv) { |
| int n = readInt(); |
| System.out.println("Faculty of " + n + " is " + fac(n)); |
| }} |
| \end{verbatim}} |
| |
| This code example typically compiles to the following chunks of byte |
| code: |
| |
| \subsubsection{Method fac} |
| |
| {\small \begin{verbatim} |
| 0: iload_0 |
| 1: ifne #8 |
| 4: iconst_1 |
| 5: goto #16 |
| 8: iload_0 |
| 9: iload_0 |
| 10: iconst_1 |
| 11: isub |
| 12: invokestatic Faculty.fac (I)I (12) |
| 15: imul |
| 16: ireturn |
| |
| LocalVariable(start_pc = 0, length = 16, index = 0:int n) |
| \end{verbatim}} |
| |
| The method \texttt{fac} has only one local variable, the argument |
| \texttt{n}, stored in slot 0. This variable's scope ranges from the |
| start of the byte code sequence to the very end. If the value of |
| \texttt{n} (stored in local variable 0, i.e. the value fetched with |
| \texttt{iload\_0}) is not equal to 0, the \texttt{ifne} instruction |
| branches to the byte code at offset 8, otherwise a 1 is pushed onto |
| the operand stack and the control flow branches to the final return. |
| For ease of reading, the offsets of the branch instructions, which are |
| actually relative, are displayed as absolute addresses in these |
| examples. |
| |
| If recursion has to continue, the arguments for the multiplication |
| (\texttt{n} and \texttt{fac(n - 1)}) are evaluated and the results |
| pushed onto the operand stack. After the multiplication operation has |
| been performed the function returns the computed value from the top of |
| the stack. |
| |
| \subsubsection{Method readInt} |
| |
| {\small \begin{verbatim} |
| 0: sipush 4711 |
| 3: istore_0 |
| 4: getstatic java.lang.System.out Ljava/io/PrintStream; |
| 7: ldc "Please enter a number> " |
| 9: invokevirtual java.io.PrintStream.print (Ljava/lang/String;)V |
| 12: getstatic Faculty.in Ljava/io/BufferedReader; |
| 15: invokevirtual java.io.BufferedReader.readLine ()Ljava/lang/String; |
| 18: invokestatic java.lang.Integer.parseInt (Ljava/lang/String;)I |
| 21: istore_0 |
| 22: goto #44 |
| 25: astore_1 |
| 26: getstatic java.lang.System.err Ljava/io/PrintStream; |
| 29: aload_1 |
| 30: invokevirtual java.io.PrintStream.println (Ljava/lang/Object;)V |
| 33: goto #44 |
| 36: astore_1 |
| 37: getstatic java.lang.System.err Ljava/io/PrintStream; |
| 40: aload_1 |
| 41: invokevirtual java.io.PrintStream.println (Ljava/lang/Object;)V |
| 44: iload_0 |
| 45: ireturn |
| |
| Exception handler(s) = |
| From To Handler Type |
| 4 22 25 java.io.IOException(6) |
| 4 22 36 NumberFormatException(10) |
| \end{verbatim}} |
| |
| First the local variable \texttt{n} (in slot 0) is initialized to the |
| value 4711. The next instruction, \texttt{getstatic}, loads the |
| static \texttt{System.out} field onto the stack. Then a string is |
| loaded and printed, a number read from the standard input and |
| assigned to \texttt{n}. |
| |
| If one of the called methods (\texttt{readLine()} and |
| \texttt{parseInt()}) throws an exception, the \jvm calls one of the |
| declared exception handlers, depending on the type of the exception. |
| The \texttt{try}-clause itself does not produce any code, it merely |
| defines the range in which the following handlers are active. In the |
| example the specified source code area maps to a byte code area |
| ranging from offset 4 (inclusive) to 22 (exclusive). If no exception |
| has occurred (``normal'' execution flow) the \texttt{goto} |
| instructions branch behind the handler code. There the value of |
| \texttt{n} is loaded and returned. |
| |
| For example the handler for \texttt{java.io.IOException} starts at |
| offset 25. It simply prints the error and branches back to the normal |
| execution flow, i.e. as if no exception had occurred. |
| |
| \section{The BCEL API}\label{sec:api} |
| |
| The \jc API abstracts from the concrete circumstances of the \jvm and |
| how to read and write binary Java class files. The API mainly |
| consists of three parts: |
| |
| \begin{enumerate} |
| |
| \item A package that contains classes that describe ``static'' |
| constraints of class files, i.e., reflect the class file format and |
| is not intended for byte code modifications. The classes may be |
| used to read and write class files from or to a file. This is |
| useful especially for analyzing Java classes without having the |
| source files at hand. The main data structure is called |
| \texttt{JavaClass} which contains methods, fields, etc.. |
| |
| \item A package to dynamically generate or modify \texttt{JavaClass} |
| objects. It may be used e.g. to insert analysis code, to strip |
| unnecessary information from class files, or to implement the code |
| generator back-end of a Java compiler. |
| |
| \item Various code examples and utilities like a class file viewer, a |
| tool to convert class files into HTML, and a converter from class |
| files to the Jasmin assembly language \cite{jasmin}. |
| \end{enumerate} |
| |
| \subsection{JavaClass}\label{sec:javaclass} |
| |
| The ``static'' component of the \jc API resides in the package |
| \path{de.fub.bytecode.classfile} and represents class files. All of the |
| binary components and data structures declared in the JVM |
| specification \cite{jvm} and described in section \ref{sec:jvm} are |
| mapped to classes. Figure \ref{fig:umljc} shows an UML diagram of the |
| hierarchy of classes of the \jc API. Figure \ref{fig:umlcp} in the |
| appendix also shows a detailed diagram of the \texttt{ConstantPool} |
| components. |
| |
| \begin{figure}[htbp] |
| \begin{center} |
| \leavevmode |
| \epsfysize0.93\textheight |
| \epsfbox{eps/javaclass.eps} |
| \caption{UML diagram for the \jc API}\label{fig:umljc} |
| \end{center} |
| \end{figure} |
| |
| The top-level data structure is \texttt{JavaClass}, which in most |
| cases is created by a \texttt{Class\-Par\-ser} object that is capable |
| of parsing binary class files. A \texttt{JavaClass} object basically |
| consists of fields, methods, symbolic references to the super class |
| and to the implemented interfaces. |
| |
| The \cp serves as some kind of central repository and is thus of |
| outstanding importance for all components. \texttt{ConstantPool} |
| objects contain an array of fixed size of \texttt{Constant} entries, |
| which may be retrieved via the \texttt{getConstant()} method taking an |
| integer index as argument. Indexes to the \cp may be contained in |
| instructions as well as in other components of a class file and in \cp |
| entries themselves. |
| |
| Methods and fields contain a signature, symbolically defining their |
| types. Access flags like \texttt{public static final} occur in |
| several places and are encoded by an integer bit mask, e.g. |
| \texttt{public static final} matches to the Java expression |
| |
| \begin{verbatim} |
| int access_flags = ACC_PUBLIC | ACC_STATIC | ACC_FINAL; |
| \end{verbatim} |
| |
| As mentioned in section \ref{sec:format} already, several components |
| may contain \emph{attribute} objects: classes, fields, methods, and |
| \texttt{Code} objects (introduced in section \ref{sec:code2}). The |
| latter is an attribute itself that contains the actual byte code |
| array, the maximum stack size, the number of local variables, a table |
| of handled exceptions, and some optional debugging information coded |
| as \texttt{LineNumberTable} and \texttt{LocalVariableTable} |
| attributes. Attributes are in general specific to some data structure, |
| i.e. no two components share the same kind of attribute, though this |
| is not explicitly forbidden. In the figure the \texttt{Attribute} |
| classes are marked with the component they belong to. |
| |
| \subsection{Class repository} |
| |
| Using the provided \texttt{Repository} class, reading class files into |
| a \texttt{JavaClass} object is quite simple: |
| |
| \begin{verbatim} |
| JavaClass clazz = Repository.lookupClass("java.lang.String"); |
| \end{verbatim} |
| |
| The repository also contains methods providing the dynamic equivalent |
| of the \texttt{instanceof} operator, and other useful routines: |
| |
| \begin{verbatim} |
| if(Repository.instanceOf(clazz, super_class) { |
| ... |
| } |
| \end{verbatim} |
| |
| \subsubsection{Accessing class file data} |
| |
| Information within the class file components may be accessed like Java |
| Beans via intuitive set/get methods. All of them also define a |
| \texttt{toString()} method so that implementing a simple class viewer |
| is very easy. In fact all of the examples used here have been produced |
| this way: |
| |
| {\small \begin{verbatim} |
| System.out.println(clazz); |
| printCode(clazz.getMethods()); |
| ... |
| public static void printCode(Method[] methods) { |
| for(int i=0; i < methods.length; i++) { |
| System.out.println(methods[i]); |
| |
| Code code = methods[i].getCode(); |
| if(code != null) // Non-abstract method |
| System.out.println(code); |
| } |
| } |
| \end{verbatim}} |
| |
| \subsubsection{Analyzing class data} |
| |
| Last but not least, \jc supports the \emph{Visitor} design |
| pattern \cite{design}, so one can write visitor objects to traverse |
| and analyze the contents of a class file. Included in the distribution |
| is a class \texttt{JasminVisitor} that converts class files into the |
| Jasmin assembler language \cite{jasmin}. |
| |
| \subsection{ClassGen}\label{sec:cgapi} |
| |
| This part of the API (package \path{ork.apache.bcel.generic}) supplies |
| an abstraction level for creating or transforming class files |
| dynamically. It makes the static constraints of Java class files like |
| the hard-coded byte code addresses generic. The generic \cpe, for |
| example, is implemented by the class \texttt{ConstantPoolGen} which |
| offers methods for adding different types of constants. Accordingly, |
| \texttt{ClassGen} offers an interface to add methods, fields, and |
| attributes. Figure \ref{fig:umlcg} gives an overview of this part of |
| the API. |
| |
| \begin{figure}[htbp] |
| \begin{center} |
| \leavevmode |
| \epsfysize0.93\textheight |
| \epsfbox{eps/classgen.eps} |
| \caption{UML diagram of the ClassGen API}\label{fig:umlcg} |
| \end{center} |
| \end{figure} |
| |
| \subsubsection{Types} |
| |
| We abstract from the concrete details of the type signature syntax |
| (see \ref{sec:types}) by introducing the \texttt{Type} class, which is |
| used, for example, by methods to define their return and argument |
| types. Concrete sub-classes are \texttt{BasicType}, |
| \texttt{ObjectType}, and \texttt{ArrayType} which consists of the |
| element type and the number of dimensions. For commonly used types the |
| class offers some predefined constants. For example the method |
| signature of the \texttt{main} method as shown in section |
| \ref{sec:types} is represented by: |
| |
| \begin{verbatim} |
| Type return_type = Type.VOID; |
| Type[] arg_types = new Type[] { new ArrayType(Type.STRING, 1) }; |
| \end{verbatim} |
| |
| \texttt{Type} also contains methods to convert types into textual |
| signatures and vice versa. The sub-classes contain implementations of |
| the routines and constraints specified by the Java Language |
| Specification \cite{gosling}. |
| |
| \subsubsection{Generic fields and methods} |
| |
| Fields are represented by \texttt{FieldGen} objects, which may be |
| freely modified by the user. If they have the access rights |
| \texttt{static final}, i.e. are constants and of basic type, they may |
| optionally have an initializing value. |
| |
| Generic methods contain methods to add exceptions the method may |
| throw, local variables, and exception handlers. The latter two are |
| represented by user-configurable objects as well. Because exception |
| handlers and local variables contain references to byte code |
| addresses, they also take the role of an \emph{instruction targeter} |
| in our terminology. Instruction targeters contain a method |
| \texttt{updateTarget()} to redirect a reference. Generic |
| (non-abstract) methods refer to \emph{instruction lists} that consist |
| of instruction objects. References to byte code addresses are |
| implemented by handles to instruction objects. This is explained in |
| more detail in the following sections. |
| |
| The maximum stack size needed by the method and the maximum number of |
| local variables used may be set manually or computed via the |
| \texttt{setMaxStack()} and \texttt{setMaxLocals()} methods |
| automatically. |
| |
| \subsubsection{Instructions} |
| |
| Modeling instructions as objects may look somewhat odd at first sight, |
| but in fact enables programmers to obtain a high-level view upon |
| control flow without handling details like concrete byte code offsets. |
| Instructions consist of a tag, i.e. an opcode, their length in bytes |
| and an offset (or index) within the byte code. Since many instructions |
| are immutable, the \texttt{InstructionConstants} interface offers |
| shareable predefined ``fly-weight'' constants to use. |
| |
| Instructions are grouped via sub-classing, the type hierarchy of |
| instruction classes is illustrated by (incomplete) figure |
| \ref{fig:umlinstr} in the appendix. The most important family of |
| instructions are the \emph{branch instructions}, e.g. \texttt{goto}, |
| that branch to targets somewhere within the byte code. Obviously, |
| this makes them candidates for playing an \texttt{InstructionTargeter} |
| role, too. Instructions are further grouped by the interfaces they |
| implement, there are, e.g., \texttt{TypedInstruction}s that are |
| associated with a specific type like \texttt{ldc}, or |
| \texttt{ExceptionThrower} instructions that may raise exceptions when |
| executed. |
| |
| All instructions can be traversed via \texttt{accept(Visitor v)} methods, |
| i.e., the Visitor design pattern. There is however some special trick |
| in these methods that allows to merge the handling of certain |
| instruction groups. The \texttt{accept()} do not only call the |
| corresponding \texttt{visit()} method, but call \texttt{visit()} |
| methods of their respective super classes and implemented interfaces |
| first, i.e. the most specific \texttt{visit()} call is last. Thus one |
| can group the handling of, say, all \texttt{BranchInstruction}s into |
| one single method. |
| |
| For debugging purposes it may even make sense to ``invent'' your own |
| instructions. In a sophisticated code generator like the one used as a |
| backend of the Barat framework \cite{barat} one often has to insert |
| temporary \texttt{nop} (No operation) instructions. When examining |
| the produced code it may be very difficult to track back where the |
| \texttt{nop} was actually inserted. One could think of a derived |
| \texttt{nop2} instruction that contains additional debugging |
| information. When the instruction list is dumped to byte code, the |
| extra data is simply dropped. |
| |
| One could also think of new byte code instructions operating on |
| complex numbers that are replaced by normal byte code upon load-time |
| or are recognized by a new JVM. |
| |
| \subsubsection{Instruction lists}\label{sec:il} |
| |
| An \emph{instruction list} is implemented by a list of |
| \emph{instruction handles} encapsulating instruction objects. |
| References to instructions in the list are thus not implemented by |
| direct pointers to instructions but by pointers to instruction |
| \emph{handles}. This makes appending, inserting and deleting areas of |
| code very simple. Since we use symbolic references, computation of |
| concrete byte code offsets does not need to occur until finalization, |
| i.e. until the user has finished the process of generating or |
| transforming code. We will use the term instruction handle and |
| instruction synonymously throughout the rest of the paper. |
| Instruction handles may contain additional user-defined data using the |
| \texttt{addAttribute()} method. |
| |
| \paragraph{Appending.} |
| One can append instructions or other instruction lists anywhere to an |
| existing list. The instructions are appended after the given |
| instruction handle. All append methods return a new instruction |
| handle which may then be used as the target of a branch instruction, |
| e.g.. |
| |
| {\small \begin{verbatim} |
| InstructionList il = new InstructionList(); |
| ... |
| GOTO g = new GOTO(null); |
| il.append(g); |
| ... |
| InstructionHandle ih = il.append(InstructionConstants.ACONST_NULL); |
| g.setTarget(ih); |
| \end{verbatim}} |
| |
| \paragraph{Inserting.} |
| Instructions may be inserted anywhere into an existing list. They are |
| inserted before the given instruction handle. All insert methods |
| return a new instruction handle which may then be used as the start |
| address of an exception handler, for example. |
| |
| {\small \begin{verbatim} |
| InstructionHandle start = il.insert(insertion_point, |
| InstructionConstants.NOP); |
| ... |
| mg.addExceptionHandler(start, end, handler, "java.io.IOException"); |
| \end{verbatim}} |
| |
| |
| \paragraph{Deleting.} |
| Deletion of instructions is also very straightforward; all instruction |
| handles and the contained instructions within a given range are |
| removed from the instruction list and disposed. The \texttt{delete()} |
| method may however throw a \texttt{TargetLostException} when there are |
| instruction targeters still referencing one of the deleted |
| instructions. The user is forced to handle such exceptions in a |
| \texttt{try-catch} block and redirect these references elsewhere. The |
| \emph{peep hole} optimizer described in section \ref{sec:nop} gives a |
| detailed example for this. |
| |
| {\small \begin{verbatim} |
| try { |
| il.delete(first, last); |
| } catch(TargetLostException e) { |
| InstructionHandle[] targets = e.getTargets(); |
| for(int i=0; i < targets.length; i++) { |
| InstructionTargeter[] targeters = targets[i].getTargeters(); |
| for(int j=0; j < targeters.length; j++) |
| targeters[j].updateTarget(targets[i], new_target); |
| } |
| } |
| \end{verbatim}} |
| |
| \paragraph{Finalizing.} |
| When the instruction list is ready to be dumped to pure byte code, all |
| symbolic references must be mapped to real byte code offsets. This is |
| done by the \texttt{getByteCode()} method which is called by default |
| by \texttt{MethodGen.getMethod()}. Afterwards you should call |
| \texttt{dispose()} so that the instruction handles can be reused |
| internally. This helps to reduce memory usage. |
| |
| \begin{verbatim} |
| InstructionList il = new InstructionList(); |
| |
| ClassGen cg = new ClassGen("HelloWorld", "java.lang.Object", |
| "<generated>", ACC_PUBLIC | ACC_SUPER, |
| null); |
| MethodGen mg = new MethodGen(ACC_STATIC | ACC_PUBLIC, |
| Type.VOID, new Type[] { |
| new ArrayType(Type.STRING, 1) |
| }, new String[] { "argv" }, |
| "main", "HelloWorld", il, cp); |
| ... |
| cg.addMethod(mg.getMethod()); |
| il.dispose(); // Reuse instruction handles of list |
| \end{verbatim} |
| |
| \subsubsection{Code example revisited} |
| |
| Using instruction lists gives us a generic view upon the code: In |
| Figure \ref{fig:il} we again present the code chunk of the |
| \texttt{readInt()} method of the faculty example in section |
| \ref{sec:fac}: The local variables \texttt{n} and \texttt{e1} both |
| hold two references to instructions, defining their scope. There are |
| two \texttt{goto}s branching to the \texttt{iload} at the end of the |
| method. One of the exception handlers is displayed, too: it references |
| the start and the end of the \texttt{try} block and also the exception |
| handler code. |
| |
| \begin{figure}[htbp] |
| \begin{center} |
| \leavevmode |
| \epsfxsize\textwidth |
| \epsfbox{eps/il.eps} |
| \caption{Instruction list for \texttt{readInt()} method} |
| \label{fig:il} |
| \end{center} |
| \end{figure} |
| |
| \subsubsection{Instruction factories}\label{sec:compound} |
| |
| To simplify the creation of certain instructions the user can use the |
| supplied \texttt{InstructionFactory} class which offers a lot of |
| useful methods to create instructions from scratch. Alternatively, he |
| can also use \emph{compound instructions}: When producing byte code, |
| some patterns typically occur very frequently, for instance the |
| compilation of arithmetic or comparison expressions. You certainly do |
| not want to rewrite the code that translates such expressions into |
| byte code in every place they may appear. In order to support this, |
| the \jc API includes a \emph{compound instruction} (an interface with |
| a single \texttt{getInstructionList()} method). Instances of this |
| class may be used in any place where normal instructions would occur, |
| particularly in append operations. |
| |
| \paragraph{Example: Pushing constants.} |
| Pushing constants onto the operand stack may be coded in different |
| ways. As explained in section \ref{sec:code} there are some |
| ``short-cut'' instructions that can be used to make the produced byte |
| code more compact. The smallest instruction to push a single |
| \texttt{1} onto the stack is \texttt{iconst\_1}, other possibilities |
| are \texttt{bipush} (can be used to push values between -128 and 127), |
| \texttt{sipush} (between -32768 and 32767), or \texttt{ldc} (load |
| constant from \cpe). |
| |
| Instead of repeatedly selecting the most compact instruction in, say, |
| a switch, one can use the compound \texttt{PUSH} instruction whenever |
| pushing a constant number or string. It will produce the appropriate |
| byte code instruction and insert entries into to \cp if necessary. |
| |
| \begin{verbatim} |
| il.append(new PUSH(cp, "Hello, world")); |
| il.append(new PUSH(cp, 4711)); |
| \end{verbatim} |
| |
| \subsubsection{Code patterns using regular expressions}\label{sec:peephole} |
| |
| When transforming code, for instance during optimization or when |
| inserting analysis method calls, one typically searches for certain |
| patterns of code to perform the transformation at. To simplify |
| handling such situations \jc introduces a special feature: One can |
| search for given code patterns within an instruction list using |
| \emph{regular expressions}. In such expressions, instructions are |
| represented by symbolic names, e.g. "\texttt{`IfInstruction'}". Meta |
| characters like \verb|+|, \verb|*|, and \verb@(..|..)@ have their |
| usual meanings. Thus, the expression |
| |
| \begin{verbatim} |
| "`NOP'+(`ILOAD__'|`ALOAD__')*" |
| \end{verbatim} |
| |
| represents a piece of code consisting of at least one \texttt{NOP} |
| followed by a possibly empty sequence of \texttt{ILOAD} and |
| \texttt{ALOAD} instructions. |
| |
| The \texttt{search()} method of class \texttt{FindPattern} gets an |
| instruction list and a regular expression as arguments and returns an |
| array describing the area of matched instructions. Additional |
| constraints to the matching area of instructions, which can not be |
| implemented via regular expressions, may be expressed via \emph{code |
| constraints}. |
| |
| \subsubsection{Example: Optimizing boolean expressions.} |
| |
| In Java, boolean values are mapped to 1 and to 0, respectively. Thus, |
| the simplest way to evaluate boolean expressions is to push a 1 or a 0 |
| onto the operand stack depending on the truth value of the expression. |
| But this way, the subsequent combination of boolean expressions (with |
| \verb|&&|, e.g) yields long chunks of code that push lots of 1s and |
| 0s onto the stack. |
| |
| When the code has been finalized these chunks can be optimized with a |
| \emph{peep hole} algorithm: An \texttt{IfInstruction} (e.g. the |
| comparison of two integers: \texttt{if\_icmpeq}) that either produces |
| a 1 or a 0 on the stack and is followed by an \texttt{ifne} |
| instruction (branch if stack value $\neq$ 0) may be replaced by the |
| \texttt{IfInstruction} with its branch target replaced by the target |
| of the \texttt{ifne} instruction: |
| |
| {\small \verbatimtabinput{bool.java}} |
| |
| The applied code constraint object ensures that the matched code |
| really corresponds to the targeted expression pattern. Subsequent |
| application of this algorithm removes all unnecessary stack operations |
| and branch instructions from the byte code. If any of the deleted |
| instructions is still referenced by an \texttt{InstructionTargeter} |
| object, the reference has to be updated in the \texttt{catch}-clause. |
| |
| Code example \ref{sec:hello} gives a verbose example of how to create |
| a class file, while example \ref{sec:nop} shows how to implement a |
| simple peephole optimizer and how to deal with \texttt{TargetLost} |
| exceptions. |
| |
| \paragraph{Example application:} |
| The expression |
| |
| \begin{verbatim} |
| if((a == null) || (i < 2)) |
| System.out.println("Ooops"); |
| \end{verbatim} |
| |
| can be mapped to both of the chunks of byte code shown in figure |
| \ref{fig:code}. The left column represents the unoptimized code while |
| the right column displays the same code after an aggressively |
| optimizing peep hole algorithm has been applied: |
| |
| \begin{figure}[hpt] |
| \begin{minipage}{0.49\textwidth} |
| {\small \verbatimtabinput{unopt}} \vfil |
| \end{minipage} |
| \begin{minipage}{0.49\textwidth} |
| {\small \verbatimtabinput{opt}} \vfil |
| \end{minipage}\label{fig:code}\caption{Optimizing boolean expressions} |
| \begin{center} |
| |
| |
| \end{center} |
| \end{figure} |
| \section{Application areas}\label{sec:application} |
| |
| There are many possible application areas for \jc ranging from class |
| browsers, profilers, byte code optimizers, and compilers to |
| sophisticated run-time analysis tools and extensions to the Java |
| language \cite{agesen, myers}. |
| |
| Compilers like the Barat compiler \cite{barat} use \jc to implement a |
| byte code generating back end. Other possible application areas are |
| the static analysis of byte code \cite{thies} or examining the |
| run-time behavior of classes by inserting calls to profiling methods |
| into the code. Further examples are extending Java with Eiffel-like |
| assertions \cite{jawa}, automated delegation \cite{classfilters}, or |
| with the concepts of ``Aspect-Oriented Programming'' \cite{aspect}. |
| |
| \subsection{Class loaders}\label{sec:classloaders} |
| |
| Class loaders are responsible for loading class files from the file |
| system or other resources and passing the byte code to the \vm |
| \cite{classloader}. A custom \texttt{ClassLoader} object may be used |
| to intercept the standard procedure of loading a class, i.e. the |
| system class loader, and perform some transformations before actually |
| passing the byte code to the JVM. |
| |
| A possible scenario is described in figure \ref{fig:classloader}: |
| During run-time the \vm requests a custom class loader to load a given |
| class. But before the JVM actually sees the byte code, the class |
| loader makes a ``side-step'' and performs some transformation to the |
| class. To make sure that the modified byte code is still valid and |
| does not violate any of the JVM's rules it is checked by the verifier |
| before the JVM finally executes it. |
| |
| \begin{figure}[ht] |
| \begin{center} |
| \leavevmode |
| \epsfxsize\textwidth |
| \epsfbox{eps/classloader.eps} |
| \caption{Class loaders}\label{fig:classloader} |
| \end{center} |
| \end{figure} |
| |
| Using class loaders is an elegant way of extending the \jvm with new |
| features without actually modifying it. This concept enables |
| developers to use \emph{load-time reflection} to implement their ideas |
| as opposed to the static reflection supported by the Java Reflection |
| API \cite{reflection}. Load-time transformations supply the user with |
| a new level of abstraction. He is not strictly tied to the static |
| constraints of the original authors of the classes but may customize |
| the applications with third-party code in order to benefit from new |
| features. Such transformations may be executed on demand and neither |
| interfere with other users, nor alter the original byte code. In fact, |
| class loaders may even create classes \emph{ad hoc} without loading a |
| file at all. |
| |
| \subsubsection{Example: Poor Man's Genericity} |
| |
| The ``Poor Man's Genericity'' project \cite{pmg} that extends Java |
| with parameterized classes, for example, uses \jc in two places to |
| generate instances of parameterized classes: During compile-time (the |
| standard \texttt{javac} with some slightly changed classes) and at |
| run-time using a custom class loader. The compiler puts some |
| additional type information into class files which is evaluated at |
| load-time by the class loader. The class loader performs some |
| transformations on the loaded class and passes them to the VM. The |
| following algorithm illustrates how the load method of the class |
| loader fulfills the request for a parameterized class, |
| e.g. \verb|Stack<String>| |
| |
| \begin{enumerate} |
| \item Search for class \texttt{Stack}, load it, and check for a |
| certain class attribute containing additional type information. I.e. |
| the attribute defines the ``real'' name of the class, |
| i.e. \verb|Stack<A>|. |
| |
| \item Replace all occurrences and references to the formal type |
| \texttt{A} with references to the actual type \texttt{String}. For |
| example the method |
| |
| \begin{verbatim} |
| void push(A obj) { ... } |
| \end{verbatim} |
| |
| becomes |
| |
| \begin{verbatim} |
| void push(String obj) { ... } |
| \end{verbatim} |
| |
| \item Return the resulting class to the Virtual Machine. |
| \end{enumerate} |
| |
| \bibliographystyle{alpha}\bibliography{manual} |
| |
| \newpage\appendix |
| |
| \pagestyle{empty} |
| \include{appendix} |
| \include{diagrams} |
| |
| \end{document} |
| % LocalWords: Freie Universit Institut Informatik dahm inf fu berlin de JOIE |
| % LocalWords: JCF HelloWorld Jasmin Eiffel SourceFile classfile lang io ldc ar |
| % LocalWords: PrintStream invokevirtual MethodRef ConstantClass Fieldref dup |
| % LocalWords: iconst bipush iadd cmpeq jsr athrow iload istore iastore er ic |
| % LocalWords: getfield putfield getstatic putstatic invokestatic newarray nop |
| % LocalWords: anewarray checkcast instanceof lookupswitch tableswitch Barat VM |
| % LocalWords: ConstantPool getConstant LineNumberTable LocalvariableTable ifne |
| % LocalWords: invokesta invokespecial multianewarray ConstantPoolGen MethodGen |
| % LocalWords: BasicType ObjectType ArrayType InstructionTarge getByteCode bool |
| % LocalWords: BranchInstruction InstructionTargeter InstructionHandle regex |
| % LocalWords: TargetLostException codeconstraint CompoundInstruction sipush |
| % LocalWords: getInstructionList FindPattern CodeConstraint IfInstruction fac |
| % LocalWords: TargetLost javac readLine IOException parseInt readInt CLASSPATH |
| % LocalWords: NumberFormatException toString JasminVisitor getSignature JVM's |
| % LocalWords: FieldGen updateTarget LGPL BCEL LocalVariableTable setMaxStack |
| % LocalWords: setMaxLocals InstructionConstants TypedInstruction addAttribute |
| % LocalWords: ExceptionThrower getMethod InstructionFactory ALOAD |
| |