docs/manual.tex - commons-bcel - Git at Google

 \documentclass[12pt,twoside]{article}

 \usepackage{epsf,a4wide,moreverb,url}
 \usepackage{palatino}

 \newcommand\jc{{\sffamily BCEL }}
 \newcommand\cp{{constant pool }}
 \newcommand\cpe{constant pool}
 \newcommand\jvm{{Java Virtual Machine }}
 \newcommand\jvme{{Java Virtual Machine}}
 \newcommand\vm{{Virtual Machine }}
 \newcommand\href[2]{#2}

 \begin{document}

 \title{Byte Code Engineering Library (BCEL)\\
        Description and usage manual\\
        {\small \textbf{Version 1.0}}}

 \author{{\Large Markus Dahm}\\\\
         \href{mailto:markus.dahm@inf.fu-berlin.de}{\texttt{markus.dahm@berlin.de}}}

 \maketitle

 %\tableofcontents

 \begin{abstract}
 Extensions  and improvements of the  programming language Java and its
 related  execution environment (Java   Virtual Machine,  JVM)  are the
 subject  of a large number  of research  projects and proposals. There
 are  projects, for  instance, to add   parameterized types to Java, to
 implement ``Aspect-Oriented Programming'', to perform sophisticated
 static analysis, and to improve the run-time performance.

 Since  Java classes  are  compiled into  portable  binary class  files
 (called   \emph{byte   code}),  it   is   the   most  convenient   and
 platform-independent  way  to  implement  these  improvements  not  by
 writing a  new compiler or changing  the JVM, but  by transforming the
 byte  code.   These  transformations  can either  be  performed  after
 compile-time,  or at load-time.   Many programmers  are doing  this by
 implementing their own specialized byte code manipulation tools, which
 are, however, restricted in the range of their re-usability.

 To deal with the necessary class file transformations, we introduce an
 API   that   helps   developers   to  conveniently   implement   their
 transformations.
 \end{abstract}

 \section{Introduction}\label{sec:intro}

 The  Java language  \cite{gosling} has  become very  popular  and many
 research projects  deal with further  improvements of the  language or
 its run-time behavior.  The possibility  to extend a language with new
 concepts  is surely  a  desirable feature,  but implementation  issues
 should be hidden from the user.  Fortunately, the concepts of the \jvm
 permit  the user-transparent  implementation of  such  extensions with
 relatively little effort.

 Because the target language of  Java is an interpreted language with a
 small  and  easy-to-understand  set  of instructions  (the  \emph{byte
 code}), developers  can implement  and test their  concepts in  a very
 elegant way.   One can  write a plug-in  replacement for  the system's
 class loader which is  responsible for dynamically loading class files
 at  run-time  and  passing the  byte  code  to  the \vm  (see  section
 \ref{sec:classloaders}).  Class loaders may  thus be used to intercept
 the  loading process and  transform classes  before they  get actually
 executed  by the  JVM  \cite{classloader}.  While  the original  class
 files always remain unaltered, the behavior of the class loader may be
 reconfigured for every execution or instrumented dynamically.

 The \jc API (Byte Code Engineering Library), formerly known as
 JavaClass, is a toolkit for the static analysis and dynamic creation
 or transformation of Java class files.  It enables developers to
 implement the desired features on a high level of abstraction without
 handling all the internal details of the Java class file format and
 thus re-inventing the wheel every time.  \jc is written entirely in
 Java and freely available under the terms of the Apache Software
 License.  \footnote{The distribution is available at
   \url{http://jakarta.apache.org/bcel/}, including several code
   examples and javadoc manuals.  }

 This paper is structured as follows:  We give a  brief description of
 the \jvm and the class  file format in section \ref{sec:jvm}.  Section
 \ref{sec:api} introduces  the \jc API.   Section \ref{sec:application}
 describes  some typical  application areas  and example  projects. The
 appendix contains  code examples that are  to long to  be presented in
 the  main  part  of this  paper.  All  examples  are included  in  the
 down-loadable distribution.

 \subsection{Related work}

 There are  a number  of proposals and  class libraries that  have some
 similarities with \textsc{BCEL}: The JOIE \cite{joie} toolkit can
 be used to instrument class loaders with dynamic behavior.  Similarly,
 ``Binary  Component Adaptation''  \cite{bca} allows  components  to be
 adapted and  evolved on-the-fly.  Han  Lee's ``Byte-code Instrumenting
 Tool'' \cite{bit} allows the user  to insert calls to analysis methods
 anywhere in the  byte code.  The Jasmin language  \cite{jasmin} can be
 used  to   hand-write  or  generate   pseudo-assembler  code.   D-Java
 \cite{classfile} and JCF \cite{inside} are class viewing tools.

 In contrast to these projects, \jc is intended to be a general purpose
 tool  for ``byte  code engineering''.   It gives  full control  to the
 developer on a high level of  abstraction and is not restricted to any
 particular application area.

 \section{The Java Virtual Machine}\label{sec:jvm}

 Readers already familiar with the \jvm and the Java class file format
 may want to skip this section and proceed with section \ref{sec:api}.

 Programs written  in the  Java language are  compiled into  a portable
 binary format called \emph{byte  code}.  Every class is represented by
 a  single class  file  containing  class related  data  and byte  code
 instructions. These  files are loaded dynamically  into an interpreter
 (\jvme, JVM) and executed.

 Figure  \ref{fig:jvm}  illustrates  the  procedure  of  compiling  and
 executing a Java class:  The source file (\texttt{HelloWorld.java}) is
 compiled into a Java class file (\texttt{HelloWorld.class}), loaded by
 the  byte  code  interpreter  and  executed.  In  order  to  implement
 additional  features, researchers  may want  to transform  class files
 (drawn  with  bold lines)  before  they  get  actually executed.  This
 application area is one of the main issues of this article.

 \begin{figure}[htbp]
   \begin{center}
     \leavevmode
     \epsfxsize\textwidth
     \epsfbox{eps/jvm.eps}
     \caption{Compilation and execution of Java classes}
     \label{fig:jvm}
   \end{center}
 \end{figure}

 Note that the  use of the general term  ``Java'' implies two meanings:
 on the one hand, Java as a programming language is meant, on the other
 hand, the Java  Virtual Machine, which is not  necessarily targeted by
 the Java language  exclusively, but may be used  by other languages as
 well (e.g.   Eiffel \cite{eiffel}, or Ada \cite{ada}).   We assume the
 reader to  be familiar with  the Java language  and to have  a general
 understanding of the Virtual Machine.

 \subsection{Java class file format}\label{sec:format}

 Giving a  full overview of  the design issues  of the Java  class file
 format and the  associated byte code instructions is  beyond the scope
 of this paper.   We will just give a  brief introduction covering the
 details  that  are  necessary  for  understanding  the  rest  of  this
 paper. The format of class files and the byte code instruction set are
 described  in more  detail  in the  ``\jvm Specification''  \cite{jvm}
 \footnote{Also             available             online             at
 \url{http://www.javasoft.com/docs/books/vmspec/index.html}},   and  in
 \cite{jasmin}.   Especially,  we  will  not  deal  with  the  security
 constraints that the \jvm has to check at run-time, i.e. the byte code
 verifier.

 Figure \ref{fig:classfile} shows a  simplified example of the contents
 of a  Java class file:  It starts with  a header containing  a ``magic
 number'' (\texttt{0xCAFEBABE}) and the version number, followed by the
 \emph{\cpe}, which can be roughly thought of as the text segment of an
 executable, the  \emph{access rights}  of the class  encoded by  a bit
 mask, a list of interfaces  implemented by the class, lists containing
 the  fields and  methods of  the  class, and  finally the  \emph{class
 attributes}, e.g.  the  \texttt{SourceFile} attribute telling the name
 of  the source  file.  Attributes  are  a way  of putting  additional,
 e.g. user-defined,  information into class file  data structures.  For
 example, a  custom class  loader may evaluate  such attribute  data in
 order to perform its  transformations.  The JVM specification declares
 that unknown, i.e.  user-defined attributes must be ignored by any \vm
 implementation.

 \begin{figure}[htbp]
   \begin{center}
     \leavevmode
     \epsfxsize\textwidth
     \epsfbox{eps/classfile.eps}
     \caption{Java class file format}
     \label{fig:classfile}
   \end{center}
 \end{figure}

 Because  all of  the  information needed  to  dynamically resolve  the
 symbolic  references to  classes, fields  and methods  at  run-time is
 coded  with string  constants, the  \cp contains  in fact  the largest
 portion of an average class file, approximately 60\% \cite{statistic}.
 The byte code instructions themselves just make up 12\%.

 The right upper box shows a  ``zoomed'' excerpt of the \cpe, while the
 rounded box below depicts  some instructions that are contained within
 a  method  of the  example  class.  These  instructions represent  the
 straightforward translation of the well-known statement:

 \begin{verbatim}
    System.out.println("Hello, world");
 \end{verbatim}

 The first instruction loads the  contents of the field \texttt{out} of
 class  \texttt{java.lang.System} onto  the operand  stack. This  is an
 instance of  the class \texttt{java.io.PrintStream}.  The \texttt{ldc}
 (``Load constant'') pushes a reference  to the string "Hello world" on
 the  stack.    The  next  instruction  invokes   the  instance  method
 \texttt{println}  which  takes  both  values as  parameters  (Instance
 methods always  implicitly take an  instance reference as  their first
 argument).

 Instructions,  other  data  structures   within  the  class  file  and
 constants  themselves  may  refer  to  constants  in  the  \cpe.  Such
 references are implemented via fixed indexes encoded directly into the
 instructions.   This  is illustrated  for  some  items  of the  figure
 emphasized    with   a    surrounding   box.

 For  example,  the  \texttt{invokevirtual}  instruction  refers  to  a
 \texttt{MethodRef} constant  that contains information  about the name
 of the  called method, the  signature (i.e.  the encoded  argument and
 return types),  and to  which class the  method belongs.  In  fact, as
 emphasized by the boxed  value, the \texttt{MethodRef} constant itself
 just refers to other entries holding the real data, e.g.  it refers to
 a \texttt{ConstantClass} entry containing  a symbolic reference to the
 class \texttt{java.io.PrintStream}.   To keep the  class file compact,
 such  constants  are   typically  shared  by  different  instructions.
 Similarly, a field is represented by a \texttt{Fieldref} constant that
 includes information about the name, the type and the containing class
 of the field.

 The \cp  basically holds the following types  of constants: References
 to methods, fields and  classes, strings, integers, floats, longs, and
 doubles.

 \subsection{Byte code instruction set}\label{sec:code}

 The JVM  is a  stack-oriented interpreter that  creates a  local stack
 frame of fixed size for every method invocation. The size of the local
 stack has to  be computed by the compiler.  Values  may also be stored
 intermediately in a frame area containing \emph{local variables} which
 can  be used  like  a set  of  registers.  These  local variables  are
 numbered from 0  to 65535, i.e.  you have a maximum  of 65536 of local
 variables.   The  stack  frames   of  caller  and  callee  method  are
 overlapping, i.e.  the caller  pushes arguments onto the operand stack
 and the called method receives them in local variables.

 The byte code instruction  set currently consists of 212 instructions,
 44  opcodes  are  marked  as  reserved  and may  be  used  for  future
 extensions   or   intermediate   optimizations  within   the   Virtual
 Machine. The instruction set can be roughly grouped as follows:

 \begin{description}
 \item[Stack operations:] Constants can be pushed onto the stack either
 by loading them from the \cp with the \texttt{ldc} instruction or with
 special ``short-cut''  instructions where the operand  is encoded into
 the  instructions, e.g.   \texttt{iconst\_0} or  \texttt{bipush} (push
 byte value).

 \item[Arithmetic  operations:]   The  instruction  set   of  the  \jvm
 distinguishes  its  operand  types  using  different  instructions  to
 operate on  values of  specific type.  Arithmetic  operations starting
 with  \texttt{i}, for  example,  denote  an  integer  operation.  E.g.,
 \texttt{iadd} that adds two integers and pushes the result back on the
 stack.     The    Java    types    \texttt{boolean},    \texttt{byte},
 \texttt{short}, and \texttt{char} are handled as integers by the JVM.

 \item[Control flow:] There  are branch instructions like \texttt{goto}
 and   \texttt{if\_icmpeq},    which   compares   two    integers   for
 equality.  There  is  also   a  \texttt{jsr}  (jump  sub-routine)  and
 \texttt{ret} pair of instructions that  is  used to implement the
 \texttt{finally} clause of  \texttt{try-catch} blocks.  Exceptions may
 be thrown with the \texttt{athrow} instruction.

 Branch  targets  are coded  as  offsets  from  the current  byte  code
 position, i.e. with an integer number.

 \item[Load   and   store operations]    for   local   variables   like
 \texttt{iload} and  \texttt{istore}.  There are also  array operations
 like \texttt{iastore} which stores an integer value into an array.

 \item[Field access:] The  value of an instance field  may be retrieved
 with \texttt{getfield} and written with \texttt{putfield}.  For static
 fields,   there    are   \texttt{getstatic}   and   \texttt{putstatic}
 counterparts.

 \item[Method  invocation:] Methods  may  either be  called via  static
 references with \texttt{invokesta\-tic} or be bound virtually with the
 \texttt{invokevirtual}  instruction. Super  class methods  and private
 methods are invoked with \texttt{invokespecial}.

 \item[Object  allocation:]  Class  instances  are allocated  with  the
 \texttt{new}  instruction, arrays  of basic  type  like \texttt{int[]}
 with \texttt{newarray}, arrays  of references like \texttt{String[][]}
 with \texttt{anewarray} or \texttt{multianewarray}.

 \item[Conversion and type checking:]  For stack operands of basic type
 there  exist casting  operations  like \texttt{f2i}  which converts  a
 float  value into  an integer.   The validity  of a  type cast  may be
 checked with  \texttt{checkcast} and the  \texttt{instanceof} operator
 can be directly mapped to the equally named instruction.
 \end{description}

 Most  instructions  have a  fixed  length,  but  there are  also  some
 variable-length instructions: In particular, the \texttt{lookupswitch}
 and  \texttt{tableswitch} instructions,  which are  used  to implement
 \texttt{switch()}  statements.   Since  the  number  of  \texttt{case}
 clauses  may vary,  these instructions  contain a  variable  number of
 statements.

 We  will not list  all byte  code instructions  here, since  these are
 explained in  detail in the  JVM specification.  The opcode  names are
 mostly self-explaining,  so understanding the  following code examples
 should be fairly intuitive.

 \subsection{Method code}\label{sec:code2}

 Non-abstract methods  contain an attribute  (\texttt{Code}) that holds
 the following data: The maximum  size of the method's stack frame, the
 number   of   local   variables    and   an   array   of   byte   code
 instructions. Optionally,  it may  also contain information  about the
 names of local variables and source file line numbers that can be used
 by a debugger.

 Whenever  an exception is thrown, the  JVM performs exception handling
 by looking   into a  table  of exception  handlers.   The table  marks
 handlers, i.e.  pieces  of code, to  be responsible for  exceptions of
 certain types  that  are raised   within a  given  area  of  the  byte
 code. When there is no appropriate handler the exception is propagated
 back to the caller of the method. The handler information is itself
 stored in an attribute contained within the \texttt{Code} attribute.

 \subsection{Byte code offsets}\label{sec:offsets}

 Targets  of  branch instructions  like  \texttt{goto}  are encoded  as
 relative offsets  in the array  of byte codes. Exception  handlers and
 local variables refer to absolute addresses within the byte code.  The
 former  contains  references   to  the  start  and  the   end  of  the
 \texttt{try} block,  and to the instruction handler  code.  The latter
 marks the  range in which a  local variable is valid,  i.e. its scope.
 This makes it  difficult to insert or delete code  areas on this level
 of abstraction, since one has  to recompute the offsets every time and
 update the referring objects. We will see in section \ref{sec:cgapi}
 how \jc remedies this restriction.

 \subsection{Type information}\label{sec:types}

 Java is  a type-safe language and  the information about  the types of
 fields,    local    variables,    and    methods    is    stored    in
 \emph{signatures}. These are strings stored  in the \cp and encoded in
 a special  format.  For example the  argument and return  types of the
 \texttt{main} method

 \begin{verbatim}
   public static void main(String[] argv)
 \end{verbatim}

 are represented by the signature

 \begin{verbatim}
   ([java/lang/String;)V
 \end{verbatim}

 Classes  and  arrays  are   internally  represented  by  strings  like
 \texttt{"java/lang/String"},  basic types  like  \texttt{float} by  an
 integer number. Within signatures they are represented by single
 characters, e.g., \texttt{"I"}, for integer.

 \subsection{Code example}\label{sec:fac}

 The  following example  program prompts  for a  number and  prints the
 faculty  of  it.  The  \texttt{readLine()}  method  reading  from  the
 standard input  may raise an \texttt{IOException} and  if a misspelled
 number    is    passed   to    \texttt{parseInt()}    it   throws    a
 \texttt{NumberFormatException}. Thus, the critical area of code must be
 encapsulated in a \texttt{try-catch} block.

 {\small \begin{verbatim}
 import java.io.*;
 public class Faculty {
   private static BufferedReader in = new BufferedReader(new
                                 InputStreamReader(System.in));
   public static final int fac(int n) {
     return (n == 0)? 1 : n * fac(n - 1);
   }
   public static final int readInt() {
     int n = 4711;
     try {
       System.out.print("Please enter a number> ");
       n = Integer.parseInt(in.readLine());
     } catch(IOException e1) { System.err.println(e1); }
       catch(NumberFormatException e2) { System.err.println(e2); }
     return n;
   }
   public static void main(String[] argv) {
     int n = readInt();
     System.out.println("Faculty of " + n + " is " + fac(n));
   }}
 \end{verbatim}}

 This code example  typically compiles to the following  chunks of byte
 code:

 \subsubsection{Method fac}

 {\small \begin{verbatim}
 0:  iload_0
 1:  ifne            #8
 4:  iconst_1
 5:  goto            #16
 8:  iload_0
 9:  iload_0
 10: iconst_1
 11: isub
 12: invokestatic    Faculty.fac (I)I (12)
 15: imul
 16: ireturn

 LocalVariable(start_pc = 0, length = 16, index = 0:int n)
 \end{verbatim}}

 The  method \texttt{fac}  has only  one local  variable,  the argument
 \texttt{n}, stored in  slot 0.  This variable's scope  ranges from the
 start of  the byte  code sequence to  the very  end.  If the  value of
 \texttt{n} (stored  in local variable  0, i.e. the value  fetched with
 \texttt{iload\_0}) is  not equal  to 0, the  \texttt{ifne} instruction
 branches to  the byte code at offset  8, otherwise a 1  is pushed onto
 the operand stack  and the control flow branches  to the final return.
 For ease of reading, the offsets of the branch instructions, which are
 actually   relative,  are displayed  as  absolute  addresses in  these
 examples.

 If  recursion has to  continue, the  arguments for  the multiplication
 (\texttt{n}  and \texttt{fac(n -  1)}) are  evaluated and  the results
 pushed onto the operand stack.  After the multiplication operation has
 been performed the function returns the computed value from the top of
 the stack.

 \subsubsection{Method readInt}

 {\small \begin{verbatim}
 0:  sipush        4711
 3:  istore_0
 4:  getstatic     java.lang.System.out Ljava/io/PrintStream;
 7:  ldc           "Please enter a number> "
 9:  invokevirtual java.io.PrintStream.print (Ljava/lang/String;)V
 12: getstatic     Faculty.in Ljava/io/BufferedReader;
 15: invokevirtual java.io.BufferedReader.readLine ()Ljava/lang/String;
 18: invokestatic  java.lang.Integer.parseInt (Ljava/lang/String;)I
 21: istore_0
 22: goto          #44
 25: astore_1
 26: getstatic     java.lang.System.err Ljava/io/PrintStream;
 29: aload_1
 30: invokevirtual java.io.PrintStream.println (Ljava/lang/Object;)V
 33: goto          #44
 36: astore_1
 37: getstatic     java.lang.System.err Ljava/io/PrintStream;
 40: aload_1
 41: invokevirtual java.io.PrintStream.println (Ljava/lang/Object;)V
 44: iload_0
 45: ireturn

 Exception handler(s) =
 From    To      Handler Type
 4       22      25      java.io.IOException(6)
 4       22      36      NumberFormatException(10)
 \end{verbatim}}

 First the local variable \texttt{n}  (in slot 0) is initialized to the
 value  4711.   The  next  instruction, \texttt{getstatic},  loads  the
 static  \texttt{System.out} field onto  the stack.   Then a  string is
 loaded and  printed, a number   read from the  standard input and
 assigned to \texttt{n}.

 If    one   of   the    called   methods    (\texttt{readLine()}   and
 \texttt{parseInt()}) throws  an exception, the  \jvm calls one  of the
 declared exception  handlers, depending on the type  of the exception.
 The \texttt{try}-clause  itself does not  produce any code,  it merely
 defines the range in which  the following handlers are active.  In the
 example  the specified  source  code area  maps  to a  byte code  area
 ranging from offset 4 (inclusive)  to 22 (exclusive).  If no exception
 has   occurred   (``normal''   execution   flow)   the   \texttt{goto}
 instructions  branch behind  the  handler code.   There  the value  of
 \texttt{n} is loaded and returned.

 For  example the handler   for \texttt{java.io.IOException}  starts at
 offset 25. It simply prints the error  and branches back to the normal
 execution flow, i.e. as if no exception had occurred.

 \section{The BCEL API}\label{sec:api}

 The \jc API abstracts from  the concrete circumstances of the \jvm and
 how  to  read and  write  binary Java  class  files.   The API  mainly
 consists of three parts:

 \begin{enumerate}

 \item A package that contains classes that describe ``static''
   constraints of class files, i.e., reflect the class file format and
   is not intended for byte code modifications.  The classes may be
   used to read and write class files from or to a file.  This is
   useful especially for analyzing Java classes without having the
   source files at hand.  The main data structure is called
   \texttt{JavaClass} which contains methods, fields, etc..

 \item A  package to dynamically generate  or modify \texttt{JavaClass}
 objects.  It  may be  used  e.g. to  insert  analysis  code, to  strip
 unnecessary  information from class  files, or  to implement  the code
 generator back-end of a Java compiler.

 \item Various code examples and  utilities like a class file viewer, a
 tool  to convert class  files into  HTML, and  a converter  from class
 files to the Jasmin assembly language \cite{jasmin}.
 \end{enumerate}

 \subsection{JavaClass}\label{sec:javaclass}

 The  ``static''  component of  the  \jc  API  resides in  the  package
 \path{de.fub.bytecode.classfile} and represents class files.  All of the
 binary   components  and   data   structures  declared   in  the   JVM
 specification  \cite{jvm} and described  in section  \ref{sec:jvm} are
 mapped to classes.  Figure \ref{fig:umljc} shows an UML diagram of the
 hierarchy of  classes of the  \jc API.  Figure \ref{fig:umlcp}  in the
 appendix also  shows a  detailed diagram of  the \texttt{ConstantPool}
 components.

 \begin{figure}[htbp]
   \begin{center}
     \leavevmode
     \epsfysize0.93\textheight
     \epsfbox{eps/javaclass.eps}
     \caption{UML diagram for the \jc API}\label{fig:umljc}
   \end{center}
 \end{figure}

 The  top-level data  structure  is \texttt{JavaClass},  which in  most
 cases is created by  a \texttt{Class\-Par\-ser} object that is capable
 of parsing  binary class files. A  \texttt{JavaClass} object basically
 consists of  fields, methods, symbolic  references to the  super class
 and to the implemented interfaces.

 The  \cp serves  as some  kind of  central repository  and is  thus of
 outstanding  importance  for  all  components.   \texttt{ConstantPool}
 objects contain  an array of fixed size  of \texttt{Constant} entries,
 which may be retrieved via the \texttt{getConstant()} method taking an
 integer  index as argument.  Indexes to  the \cp  may be  contained in
 instructions as well as in other components of a class file and in \cp
 entries themselves.

 Methods and  fields contain  a signature, symbolically  defining their
 types.   Access  flags  like  \texttt{public static  final}  occur  in
 several  places  and  are  encoded   by  an  integer  bit  mask,  e.g.
 \texttt{public static final} matches to the Java expression

 \begin{verbatim}
   int access_flags = ACC_PUBLIC | ACC_STATIC | ACC_FINAL;
 \end{verbatim}

 As mentioned in section \ref{sec:format} already, several components
 may contain \emph{attribute} objects: classes, fields, methods, and
 \texttt{Code} objects (introduced in section \ref{sec:code2}).  The
 latter is an attribute itself that contains the actual byte code
 array, the maximum stack size, the number of local variables, a table
 of handled exceptions, and some optional debugging information coded
 as \texttt{LineNumberTable} and \texttt{LocalVariableTable}
 attributes. Attributes are in general specific to some data structure,
 i.e. no two components share the same kind of attribute, though this
 is not explicitly forbidden. In the figure the \texttt{Attribute}
 classes are marked with the component they belong to.

 \subsection{Class repository}

 Using the provided \texttt{Repository} class, reading class files into
 a \texttt{JavaClass} object is quite simple:

 \begin{verbatim}
   JavaClass clazz = Repository.lookupClass("java.lang.String");
 \end{verbatim}

 The repository also contains methods providing the dynamic equivalent
 of the \texttt{instanceof} operator, and other useful routines:

 \begin{verbatim}
   if(Repository.instanceOf(clazz, super_class) {
     ...
   }
 \end{verbatim}

 \subsubsection{Accessing class file data}

 Information within the class file components may be accessed like Java
 Beans via intuitive set/get methods.  All of them also define a
 \texttt{toString()} method so that implementing a simple class viewer
 is very easy. In fact all of the examples used here have been produced
 this way:

 {\small \begin{verbatim}
   System.out.println(clazz);
   printCode(clazz.getMethods());
   ...
   public static void printCode(Method[] methods) {
     for(int i=0; i < methods.length; i++) {
       System.out.println(methods[i]);

       Code code = methods[i].getCode();
       if(code != null) // Non-abstract method
         System.out.println(code);
     }
   }
 \end{verbatim}}

 \subsubsection{Analyzing class data}

 Last but not least, \jc supports the \emph{Visitor} design
 pattern \cite{design},  so one can write visitor  objects to traverse
 and analyze the contents of a class file. Included in the distribution
 is a  class \texttt{JasminVisitor} that converts class  files into the
 Jasmin assembler language \cite{jasmin}.

 \subsection{ClassGen}\label{sec:cgapi}

 This part of the API (package \path{ork.apache.bcel.generic}) supplies
 an abstraction level for creating or transforming class files
 dynamically.  It makes the static constraints of Java class files like
 the hard-coded byte code addresses generic.  The generic \cpe, for
 example, is implemented by the class \texttt{ConstantPoolGen} which
 offers methods for adding different types of constants.  Accordingly,
 \texttt{ClassGen} offers an interface to add methods, fields, and
 attributes.  Figure \ref{fig:umlcg} gives an overview of this part of
 the API.

 \begin{figure}[htbp]
   \begin{center}
     \leavevmode
     \epsfysize0.93\textheight
     \epsfbox{eps/classgen.eps}
     \caption{UML diagram of the ClassGen API}\label{fig:umlcg}
   \end{center}
 \end{figure}

 \subsubsection{Types}

 We abstract from the concrete details of the type signature syntax
 (see \ref{sec:types}) by introducing the \texttt{Type} class, which is
 used, for example, by methods to define their return and argument
 types.  Concrete sub-classes are \texttt{BasicType},
 \texttt{ObjectType}, and \texttt{ArrayType} which consists of the
 element type and the number of dimensions. For commonly used types the
 class offers some predefined constants.  For example the method
 signature of the \texttt{main} method as shown in section
 \ref{sec:types} is represented by:

 \begin{verbatim}
   Type   return_type = Type.VOID;
   Type[] arg_types   = new Type[] { new ArrayType(Type.STRING, 1) };
 \end{verbatim}

 \texttt{Type} also contains methods to convert types into textual
 signatures and vice versa. The sub-classes contain implementations of
 the routines and constraints specified by the Java Language
 Specification \cite{gosling}.

 \subsubsection{Generic fields and methods}

 Fields  are represented  by  \texttt{FieldGen} objects,  which may  be
 freely  modified  by  the  user.   If  they  have  the  access  rights
 \texttt{static final}, i.e. are constants  and of basic type, they may
 optionally have an initializing value.

 Generic  methods contain  methods  to add  exceptions  the method  may
 throw,  local variables, and  exception handlers.  The latter  two are
 represented by  user-configurable objects as  well.  Because exception
 handlers  and   local  variables  contain  references   to  byte  code
 addresses, they  also take the role of  an \emph{instruction targeter}
 in   our  terminology.    Instruction  targeters   contain   a  method
 \texttt{updateTarget()}    to   redirect    a    reference.    Generic
 (non-abstract) methods refer  to \emph{instruction lists} that consist
 of  instruction  objects.   References  to  byte  code  addresses  are
 implemented by  handles to instruction  objects. This is  explained in
 more detail in the following sections.

 The maximum stack size needed by the method and the maximum number of
 local variables used may be set manually or computed via the
 \texttt{setMaxStack()} and \texttt{setMaxLocals()} methods
 automatically.

 \subsubsection{Instructions}

 Modeling instructions as objects may look somewhat odd at first sight,
 but in fact enables programmers to obtain a high-level view upon
 control flow without handling details like concrete byte code offsets.
 Instructions consist of a tag, i.e. an opcode, their length in bytes
 and an offset (or index) within the byte code. Since many instructions
 are immutable, the \texttt{InstructionConstants} interface offers
 shareable predefined ``fly-weight'' constants to use.

 Instructions are grouped via sub-classing, the type hierarchy of
 instruction classes is illustrated by (incomplete) figure
 \ref{fig:umlinstr} in the appendix.  The most important family of
 instructions are the \emph{branch instructions}, e.g.  \texttt{goto},
 that branch to targets somewhere within the byte code.  Obviously,
 this makes them candidates for playing an \texttt{InstructionTargeter}
 role, too. Instructions are further grouped by the interfaces they
 implement, there are, e.g., \texttt{TypedInstruction}s that are
 associated with a specific type like \texttt{ldc}, or
 \texttt{ExceptionThrower} instructions that may raise exceptions when
 executed.

 All instructions can be traversed via \texttt{accept(Visitor v)} methods,
 i.e., the Visitor design pattern. There is however some special trick
 in these methods that allows to merge the handling of certain
 instruction groups. The \texttt{accept()} do not only call the
 corresponding \texttt{visit()} method, but call \texttt{visit()}
 methods of their respective super classes and implemented interfaces
 first, i.e. the most specific \texttt{visit()} call is last. Thus one
 can group the handling of, say, all \texttt{BranchInstruction}s into
 one single method.

 For debugging purposes  it may even make sense  to ``invent'' your own
 instructions. In a sophisticated code generator like the one used as a
 backend of  the Barat framework  \cite{barat} one often has  to insert
 temporary  \texttt{nop} (No  operation) instructions.   When examining
 the produced  code it may  be very difficult  to track back  where the
 \texttt{nop}  was actually  inserted.  One  could think  of  a derived
 \texttt{nop2}   instruction   that   contains   additional   debugging
 information. When  the instruction  list is dumped  to byte  code, the
 extra data is simply dropped.

 One  could also  think  of  new byte  code  instructions operating  on
 complex numbers that  are replaced by normal byte  code upon load-time
 or are recognized by a new JVM.

 \subsubsection{Instruction lists}\label{sec:il}

 An \emph{instruction list} is implemented by a list of
 \emph{instruction handles} encapsulating instruction objects.
 References to instructions in the list are thus not implemented by
 direct pointers to instructions but by pointers to instruction
 \emph{handles}. This makes appending, inserting and deleting areas of
 code very simple. Since we use symbolic references, computation of
 concrete byte code offsets does not need to occur until finalization,
 i.e.  until the user has finished the process of generating or
 transforming code.  We will use the term instruction handle and
 instruction synonymously throughout the rest of the paper.
 Instruction handles may contain additional user-defined data using the
 \texttt{addAttribute()} method.

 \paragraph{Appending.}
 One can append instructions or  other instruction lists anywhere to an
 existing  list.   The  instructions   are  appended  after  the  given
 instruction  handle.   All append  methods  return  a new  instruction
 handle which may  then be used as the target  of a branch instruction,
 e.g..

 {\small \begin{verbatim}
   InstructionList il = new InstructionList();
   ...
   GOTO g = new GOTO(null);
   il.append(g);
   ...
   InstructionHandle ih = il.append(InstructionConstants.ACONST_NULL);
   g.setTarget(ih);
 \end{verbatim}}

 \paragraph{Inserting.}
 Instructions may be  inserted anywhere into an existing  list.  They are
 inserted  before the  given  instruction handle.   All insert  methods
 return a  new instruction handle which  may then be used  as the start
 address of an exception handler, for example.

 {\small \begin{verbatim}
   InstructionHandle start = il.insert(insertion_point,
                                       InstructionConstants.NOP);
   ...
   mg.addExceptionHandler(start, end, handler, "java.io.IOException");
 \end{verbatim}}


 \paragraph{Deleting.}
 Deletion of instructions is also very straightforward; all instruction
 handles and the contained instructions within a given range are
 removed from the instruction list and disposed.  The \texttt{delete()}
 method may however throw a \texttt{TargetLostException} when there are
 instruction targeters still referencing one of the deleted
 instructions.  The user is forced to handle such exceptions in a
 \texttt{try-catch} block and redirect these references elsewhere. The
 \emph{peep hole} optimizer described in section \ref{sec:nop} gives a
 detailed example for this.

 {\small \begin{verbatim}
   try {
     il.delete(first, last);
   } catch(TargetLostException e) {
     InstructionHandle[] targets = e.getTargets();
     for(int i=0; i < targets.length; i++) {
       InstructionTargeter[] targeters = targets[i].getTargeters();
       for(int j=0; j < targeters.length; j++)
          targeters[j].updateTarget(targets[i], new_target);
     }
   }
 \end{verbatim}}

 \paragraph{Finalizing.}
 When the instruction list is ready to be dumped to pure byte code, all
 symbolic references must be mapped to real byte code offsets.  This is
 done by the \texttt{getByteCode()} method which is called by default
 by \texttt{MethodGen.getMethod()}. Afterwards you should call
 \texttt{dispose()} so that the instruction handles can be reused
 internally. This helps to reduce memory usage.

 \begin{verbatim}
   InstructionList il = new InstructionList();

   ClassGen  cg = new ClassGen("HelloWorld", "java.lang.Object",
                               "<generated>", ACC_PUBLIC | ACC_SUPER,
                               null);
   MethodGen mg = new MethodGen(ACC_STATIC | ACC_PUBLIC,
                                Type.VOID, new Type[] {
                                  new ArrayType(Type.STRING, 1)
                                }, new String[] { "argv" },
                                "main", "HelloWorld", il, cp);
   ...
   cg.addMethod(mg.getMethod());
   il.dispose(); // Reuse instruction handles of list
 \end{verbatim}

 \subsubsection{Code example revisited}

 Using  instruction lists gives  us a  generic view  upon the  code: In
 Figure  \ref{fig:il}   we  again  present   the  code  chunk   of  the
 \texttt{readInt()}   method  of   the  faculty   example   in  section
 \ref{sec:fac}:  The local  variables \texttt{n}  and  \texttt{e1} both
 hold two references to  instructions, defining their scope.  There are
 two \texttt{goto}s branching  to the \texttt{iload} at the  end of the
 method. One of the exception handlers is displayed, too: it references
 the start and the end of the \texttt{try} block and also the exception
 handler code.

 \begin{figure}[htbp]
   \begin{center}
     \leavevmode
     \epsfxsize\textwidth
     \epsfbox{eps/il.eps}
     \caption{Instruction list for \texttt{readInt()} method}
     \label{fig:il}
   \end{center}
 \end{figure}

 \subsubsection{Instruction factories}\label{sec:compound}

 To simplify the creation of certain instructions the user can use the
 supplied \texttt{InstructionFactory} class which offers a lot of
 useful methods to create instructions from scratch. Alternatively, he
 can also use \emph{compound instructions}: When producing byte code,
 some patterns typically occur very frequently, for instance the
 compilation of arithmetic or comparison expressions.  You certainly do
 not want to rewrite the code that translates such expressions into
 byte code in every place they may appear. In order to support this,
 the \jc API includes a \emph{compound instruction} (an interface with
 a single \texttt{getInstructionList()} method).  Instances of this
 class may be used in any place where normal instructions would occur,
 particularly in append operations.

 \paragraph{Example: Pushing constants.}
 Pushing constants  onto the  operand stack may  be coded  in different
 ways.  As   explained  in   section  \ref{sec:code}  there   are  some
 ``short-cut'' instructions that can be  used to make the produced byte
 code  more  compact.  The   smallest  instruction  to  push  a  single
 \texttt{1} onto  the stack is  \texttt{iconst\_1}, other possibilities
 are \texttt{bipush} (can be used to push values between -128 and 127),
 \texttt{sipush}  (between  -32768 and  32767),  or \texttt{ldc}  (load
 constant from \cpe).

 Instead of repeatedly selecting  the most compact instruction in, say,
 a switch, one can  use the compound \texttt{PUSH} instruction whenever
 pushing a constant  number or string. It will  produce the appropriate
 byte code instruction and insert entries into to \cp if necessary.

 \begin{verbatim}
   il.append(new PUSH(cp, "Hello, world"));
   il.append(new PUSH(cp, 4711));
 \end{verbatim}

 \subsubsection{Code patterns using regular expressions}\label{sec:peephole}

 When  transforming  code, for  instance  during  optimization or  when
 inserting analysis  method calls,  one typically searches  for certain
 patterns  of  code to  perform  the  transformation  at.  To  simplify
 handling such situations \jc introduces a special feature: One can
 search  for  given code  patterns  within  an  instruction list  using
 \emph{regular  expressions}.   In  such expressions,  instructions  are
 represented by symbolic names, e.g.  "\texttt{`IfInstruction'}".  Meta
 characters  like  \verb|+|, \verb|*|,  and  \verb@(..|..)@ have  their
 usual meanings. Thus, the expression

 \begin{verbatim}
   "`NOP'+(`ILOAD__'|`ALOAD__')*"
 \end{verbatim}

 represents a  piece of  code consisting of  at least  one \texttt{NOP}
 followed  by   a  possibly   empty  sequence  of   \texttt{ILOAD}  and
 \texttt{ALOAD} instructions.

 The  \texttt{search()} method  of class  \texttt{FindPattern}  gets an
 instruction list and a regular  expression as arguments and returns an
 array  describing   the  area  of   matched  instructions.  Additional
 constraints to  the matching  area of instructions,  which can  not be
 implemented via  regular expressions, may be  expressed via \emph{code
 constraints}.

 \subsubsection{Example: Optimizing boolean expressions.}

 In Java, boolean  values are mapped to 1 and  to 0, respectively. Thus,
 the simplest way to evaluate boolean expressions is to push a 1 or a 0
 onto the operand stack depending on the truth value of the expression.
 But this  way, the subsequent combination of  boolean expressions (with
 \verb|&&|, e.g) yields  long chunks of code that push  lots of 1s and
 0s onto the stack.

 When the code has been finalized  these chunks can be optimized with a
 \emph{peep  hole}  algorithm:  An  \texttt{IfInstruction}  (e.g.   the
 comparison of two  integers: \texttt{if\_icmpeq}) that either produces
 a  1  or  a 0  on  the  stack  and  is  followed by  an  \texttt{ifne}
 instruction (branch  if stack value $\neq$  0) may be  replaced by the
 \texttt{IfInstruction} with  its branch target replaced  by the target
 of  the  \texttt{ifne}  instruction:

 {\small \verbatimtabinput{bool.java}}

 The  applied code  constraint  object ensures  that  the matched  code
 really  corresponds to  the targeted  expression  pattern.  Subsequent
 application of this algorithm removes all unnecessary stack operations
 and  branch instructions from  the byte  code. If  any of  the deleted
 instructions  is still  referenced by  an \texttt{InstructionTargeter}
 object, the reference has to be updated in the \texttt{catch}-clause.

 Code example \ref{sec:hello} gives a  verbose example of how to create
 a class  file, while  example \ref{sec:nop} shows  how to  implement a
 simple  peephole optimizer  and how  to deal  with \texttt{TargetLost}
 exceptions.

 \paragraph{Example application:}
 The expression

 \begin{verbatim}
   if((a == null) || (i < 2))
     System.out.println("Ooops");
 \end{verbatim}

 can be mapped to both of the chunks of byte code shown in figure
 \ref{fig:code}. The left column represents the unoptimized code while
 the right column displays the same code after an aggressively
 optimizing peep hole algorithm has been applied:

 \begin{figure}[hpt]
 \begin{minipage}{0.49\textwidth}
   {\small \verbatimtabinput{unopt}} \vfil
 \end{minipage}
 \begin{minipage}{0.49\textwidth}
   {\small \verbatimtabinput{opt}} \vfil
 \end{minipage}\label{fig:code}\caption{Optimizing boolean expressions}
 \begin{center}


 \end{center}
 \end{figure}
 \section{Application areas}\label{sec:application}

 There are many possible application areas for \jc ranging from class
 browsers, profilers, byte code optimizers, and compilers to
 sophisticated run-time analysis tools and extensions to the Java
 language \cite{agesen, myers}.

 Compilers like the Barat compiler  \cite{barat} use \jc to implement a
 byte code  generating back end.  Other possible  application areas are
 the  static  analysis   of  byte code \cite{thies}  or  examining  the
 run-time behavior  of classes by inserting calls  to profiling methods
 into the  code. Further examples  are extending Java  with Eiffel-like
 assertions  \cite{jawa}, automated delegation  \cite{classfilters}, or
 with the concepts of ``Aspect-Oriented Programming'' \cite{aspect}.

 \subsection{Class loaders}\label{sec:classloaders}

 Class loaders  are responsible for  loading class files from  the file
 system  or  other resources  and  passing the  byte  code  to the  \vm
 \cite{classloader}.  A custom  \texttt{ClassLoader} object may be used
 to  intercept the  standard procedure  of  loading a  class, i.e.  the
 system class loader, and  perform some transformations before actually
 passing the byte code to the JVM.

 A  possible  scenario is  described  in figure  \ref{fig:classloader}:
 During run-time the \vm requests a custom class loader to load a given
 class.  But before  the JVM  actually sees  the byte  code,  the class
 loader makes  a ``side-step'' and performs some  transformation to the
 class. To  make sure that  the modified byte  code is still  valid and
 does not violate any of the  JVM's rules it is checked by the verifier
 before the JVM finally executes it.

 \begin{figure}[ht]
   \begin{center}
     \leavevmode
     \epsfxsize\textwidth
     \epsfbox{eps/classloader.eps}
     \caption{Class loaders}\label{fig:classloader}
   \end{center}
 \end{figure}

 Using class loaders  is an elegant way of extending  the \jvm with new
 features  without   actually  modifying  it.    This  concept  enables
 developers to use \emph{load-time reflection} to implement their ideas
 as opposed to  the static reflection supported by  the Java Reflection
 API \cite{reflection}.  Load-time transformations supply the user with
 a new  level of abstraction.   He is not  strictly tied to  the static
 constraints of the  original authors of the classes  but may customize
 the applications  with third-party code  in order to benefit  from new
 features. Such  transformations may be executed on  demand and neither
 interfere with other users, nor alter the original byte code. In fact,
 class loaders may even create  classes \emph{ad hoc} without loading a
 file at all.

 \subsubsection{Example: Poor Man's Genericity}

 The  ``Poor Man's  Genericity'' project  \cite{pmg} that  extends Java
 with parameterized  classes, for  example, uses \jc  in two  places to
 generate instances of  parameterized classes: During compile-time (the
 standard  \texttt{javac} with  some slightly  changed classes)  and at
 run-time  using  a  custom  class  loader.   The  compiler  puts  some
 additional  type information into  class files  which is  evaluated at
 load-time  by  the  class  loader.   The class  loader  performs  some
 transformations on  the loaded  class and passes  them to the  VM. The
 following  algorithm illustrates  how  the load  method  of the  class
 loader   fulfills    the   request   for    a   parameterized   class,
 e.g. \verb|Stack<String>|

 \begin{enumerate}
 \item  Search for  class  \texttt{Stack},  load it,  and  check for  a
 certain class  attribute containing additional  type information. I.e.
 the   attribute   defines   the    ``real''   name   of   the   class,
 i.e. \verb|Stack<A>|.

 \item  Replace  all  occurrences  and  references to  the  formal  type
 \texttt{A}  with references  to the  actual type  \texttt{String}. For
 example the method

 \begin{verbatim}
   void push(A obj) { ... }
 \end{verbatim}

 becomes

 \begin{verbatim}
   void push(String obj) { ... }
 \end{verbatim}

 \item Return the resulting class to the Virtual Machine.
 \end{enumerate}

 \bibliographystyle{alpha}\bibliography{manual}

 \newpage\appendix

 \pagestyle{empty}
 \include{appendix}
 \include{diagrams}

 \end{document}
 % LocalWords:  Freie Universit Institut Informatik dahm inf fu berlin de JOIE
 % LocalWords:  JCF HelloWorld Jasmin Eiffel SourceFile classfile lang io ldc ar
 % LocalWords:  PrintStream invokevirtual MethodRef ConstantClass Fieldref dup
 % LocalWords:  iconst bipush iadd cmpeq jsr athrow iload istore iastore er ic
 % LocalWords:  getfield putfield getstatic putstatic invokestatic newarray nop
 % LocalWords:  anewarray checkcast instanceof lookupswitch tableswitch Barat VM
 % LocalWords:  ConstantPool getConstant LineNumberTable LocalvariableTable ifne
 % LocalWords:  invokesta invokespecial multianewarray ConstantPoolGen MethodGen
 % LocalWords:  BasicType ObjectType ArrayType InstructionTarge getByteCode bool
 % LocalWords:  BranchInstruction InstructionTargeter InstructionHandle regex
 % LocalWords:  TargetLostException codeconstraint CompoundInstruction sipush
 % LocalWords:  getInstructionList FindPattern CodeConstraint IfInstruction fac
 % LocalWords:  TargetLost javac readLine IOException parseInt readInt CLASSPATH
 % LocalWords:  NumberFormatException toString JasminVisitor getSignature JVM's
 % LocalWords:  FieldGen updateTarget LGPL BCEL LocalVariableTable setMaxStack
 % LocalWords:  setMaxLocals InstructionConstants TypedInstruction addAttribute
 % LocalWords:  ExceptionThrower getMethod InstructionFactory ALOAD