<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<document>
<properties>
<author email="akarasulu@apache.org">Alex Karasulu</author>
<title>Refactoring the ASN.1 Runtime</title>
</properties>
<body>
<section name="Refactoring the ASN.1 Runtime">
<p>
The use of Snacc4J as the runtime ASN.1 BER codec for LDAP imposed an
IP issue for the new Directory Project under incubation. This resulted
in the creation of our own implementation: the Apache ASN.1
Runtime library.
</p>
<p>
Before continuing any further it might be a good idea to read about
the existing architecture to understand the changes that are being
proposed.
</p>
<subsection name="High Level Goals and Changes">
<p>
The internal 0.2 release was the first successful attempt to produce a
replacement for Snacc4J. As of release 0.8 of ApacheDS it provides
BER encoders and decoders for LDAP requests and responses. The library
was designed with performance in mind. Some very good ideas were
introduced and really put to the test. However the library does have
performance problems. The designs to make this into a high performance
library were not totally followed through. Furthermore the code base
is very difficult to maintain and needs reorganization. We hope to
refactor the library so it is more efficient, and easier to maintain
while reducing the number of dependencies it has. In the process we
would like to introduce some new features and improvements which are
listed below:
</p>
<ul>
<li>
Better ByteBuffer utilization by splicing buffers instead of copying
them.
</li>
<li>
Replace the current Tuple class with well defined Tuple interfaces:
specifically we need to remove TLV field processing from a Tuple
as well as tag cooking functionality. Tag cooking refers to the
application of transformations that turn tag bytes into a 4 byte
Java primitive integer. These functions need to be localized
within utility classes.
</li>
<li>
Some BER based protocols only use a subset of the encoding rules.
For example LDAP only uses definite length encodings for constructed
tuples. A reduced set of rules is much easier to code and maintain,
and often performs significantly better than codecs designed for
the entire rule set. The key here however is to make sure that
the core of the codec can be replaced transparently without imposing
code changes.
</li>
<li>
The Tuples of primitives like binary values store the Tag, Length
and Value of the primitive TLV Tuple in memory. Sometimes primitive
values can be dangerously large for a server to encode or decode.
Primitive tuples could be blobs of large binaries like images. If
tuple values are larger than some application defined limit they
ought to be streamed to disk rather than kept in main memory.
Streaming to disk makes the server more efficient overall since it
can maintain a constant sized decoding footprint. However switching
to disk based storage will rightfully slow down the current operation
which involves a large primitive. This is a tradeoff that should
be configurable by API users and ultimately ApacheDS administrators.
</li>
<li>
Better logging and error handling for codecs with perhaps some
management interfaces to control the properties of codecs.
</li>
<li>
A single deployable artifact where the ber and codec jars are fused.
</li>
<li>
Make the code easier to maintain while improving its structure.
</li>
</ul>
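<p>
As an illustration of the first item above, here is a minimal sketch
of splicing a value field out of a PDU buffer with the NIO API, so
the splice shares the PDU's backing storage instead of copying bytes.
The class and method names are hypothetical.
</p>

```java
import java.nio.ByteBuffer;

// Hypothetical helper: carve a zero-copy window over a TLV value field.
class SpliceDemo
{
    /**
     * Returns a view of [offset, offset + length) within pdu that shares
     * the same backing storage -- no bytes are copied.
     */
    static ByteBuffer splice( ByteBuffer pdu, int offset, int length )
    {
        ByteBuffer dup = pdu.duplicate();   // independent position and limit
        dup.position( offset );
        dup.limit( offset + length );
        return dup.slice();                 // zero-copy window on the value
    }

    public static void main( String[] args )
    {
        // a tiny PDU: SEQUENCE (0x30) of length 3 wrapping INTEGER 5
        ByteBuffer pdu = ByteBuffer.wrap( new byte[] { 0x30, 0x03, 0x02, 0x01, 0x05 } );
        ByteBuffer value = splice( pdu, 2, 3 );
        System.out.println( value.remaining() );   // 3
    }
}
```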
</subsection>
</section>
<section name="Tuple Interface/Class Hierarchies">
<p>
Presently Tuples contain the functionality to decode and encode
fields. Tuples can even encode themselves to a buffer as BER or
DER. A Tuple should be a simple bean and nothing more.
Hence one of our goals is to factor out this additional functionality.
</p>
<p>
A Tuple is a single class that acts more like a union of different
types rather than using inheritance to differentiate. There are
distinct types of tuples, constructed versus primitive for example.
Instead of using complex logic to determine what kind of Tuple an
instance is, it is much better to differentiate the Tuple into
subclasses. Hence we propose a new interface and implementation
hierarchy for Tuples.
</p>
<p>
Let's start by proposing a minimal Tuple interface.
</p>
<source>
interface Tuple
{
    /**
     * Gets the zero based index into a PDU where the first byte of this
     * Tuple's tag resides.
     *
     * @return zero based index of Tag's first byte in the PDU
     */
    int getTagStartIndex();

    /**
     * Gets this TLV Tuple's Tag (T) as a type safe enumeration.
     *
     * @return type safe enumeration for the Tag
     */
    TagEnum getTag();

    /**
     * Gets whether or not this Tuple is constructed.
     *
     * @return true if the Tag is constructed, false if it is primitive
     */
    boolean isConstructed();
}
</source>
<p>
This interface gives the minimum information needed for a Tuple
that is not specific to another specialized type of Tuple. Meaning
all Tuples share these methods. We can also go a step further and
implement an AbstractTuple where protected members are used to
implement these methods. Note that isConstructed() will probably be
left abstract so subclasses can just return true or false. For
brevity this code is not shown but other classes in the section below
will extend from AbstractTuple.
</p>
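<p>
For the curious, the AbstractTuple mentioned above might look roughly
like the sketch below. The member names are assumptions, and the
Tuple and TagEnum declarations are repeated as minimal stubs only so
the fragment stands alone.
</p>

```java
// Minimal stub stand-ins so the sketch compiles on its own.
interface TagEnum {}

interface Tuple
{
    int getTagStartIndex();
    TagEnum getTag();
    boolean isConstructed();
}

// A possible shape for AbstractTuple; member names are assumptions.
abstract class AbstractTuple implements Tuple
{
    /** zero based index of the Tag's first byte; -1 until decoded */
    protected int tagStartIndex = -1;
    /** this TLV Tuple's Tag as a type safe enumeration */
    protected TagEnum tag = null;

    public int getTagStartIndex()
    {
        return tagStartIndex;
    }

    public TagEnum getTag()
    {
        return tag;
    }

    // left abstract so each subclass simply returns true or false
    public abstract boolean isConstructed();
}
```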
<subsection name="Primitive Vs. Constructed Tuples">
<p>
We need to go a step further and start differentiating between Tuples
that are primitive and those that are constructed. In this step we
introduce two new abstract classes PrimitiveTuple and
ConstructedTuple.
</p>
<p>
These two classes will be described below but one might ask why both
are still abstract. This is because we need to differentiate further
for buffered versus streamed Tuples in the case of primitive Tuples.
For constructed Tuples we need to differentiate between definite
length versus indefinite length Tuples. With our approach, only the
leaf nodes of the inheritance hierarchy will be concrete. Below is
the definition for the PrimitiveTuple.
</p>
<source>
public abstract class PrimitiveTuple extends AbstractTuple
{
    /** the number of bytes used to compose the Tuple's length field */
    protected int lengthFieldSz = 0;
    /** the number of bytes used to compose the Tuple's value field */
    protected int valueFieldSz = 0;

    ...

    public final boolean isConstructed()
    {
        return false;
    }

    /**
     * Gets whether or not this Tuple's value is buffered in memory or
     * streamed to disk.
     *
     * @return true if the value is buffered in memory, false if it is
     * streamed to disk
     */
    public abstract boolean isBuffered();

    /**
     * Gets the number of bytes in the length (L) field of this TLV Tuple.
     *
     * @return number of bytes for the length
     */
    public final int getLengthFieldSize()
    {
        return lengthFieldSz;
    }

    /**
     * Gets the number of bytes in the value (V) field of this TLV Tuple.
     *
     * @return number of bytes for the value
     */
    public final int getValueFieldSize()
    {
        return valueFieldSz;
    }

    ...
}
</source>
<p>
This abstract class adds two new concrete methods for tracking the
size of the length and value fields. Constructed Tuples may not
necessarily have a length value associated with them if they are
of the indefinite form. Furthermore the value of a constructed
Tuple is the nested child Tuples subordinate to it. So there
is no need to track the value now for anything other
than primitive Tuples.
</p>
<p>
Note that the isConstructed() method is implemented as final and
always returns false for this lineage of Tuples. A final modifier on
the method makes sense and sometimes helps the compiler inline this
method so we don't always pay a price for using it in addition to
subclassing. A new abstract method isBuffered() is introduced which
is discussed in detail within the Buffered Vs. Streamed section.
</p>
<p>
Now let's take a look at the ConstructedTuple abstract class.
</p>
<source>
public abstract class ConstructedTuple extends AbstractTuple
{
    public final boolean isConstructed()
    {
        return true;
    }

    /**
     * Gets whether or not the length of this constructed Tuple is of the
     * definite form or of the indefinite length form.
     *
     * @return true if the length is definite, false if the length is of
     * the indefinite form
     */
    public abstract boolean isLengthDefinate();
}
</source>
<p>
ConstructedTuple implements the <code>isConstructed()</code> method
as final since it will always return true for this lineage of
Tuples. Also a new abstract method isLengthDefinate() is introduced
to tell whether or not the Tuple uses the indefinite length form.
</p>
</subsection>
<subsection name="Definite Vs. Indefinite Length">
<p>
The ConstructedTuple can be further differentiated into two
subclasses to represent definite and indefinite length constructed
TLV Tuples. The indefinite form does not have a length value
associated with it whereas the definite length form does. Let's
explore the concrete IndefiniteLength definition.
</p>
<source>
public class IndefiniteLength extends ConstructedTuple
{
    public final boolean isLengthDefinate()
    {
        return false;
    }
}
</source>
<p>
Yep this is pretty simple. There is very little to track for this
Tuple since most of the tracking is handled by its descendant Tuples.
The class also is concrete. What about the DefinateLength
implementation ...
</p>
<source>
public class DefinateLength extends ConstructedTuple
{
    /** the number of bytes used to compose the Tuple's length field */
    protected int lengthFieldSz = 0;
    /** the number of bytes used to compose the Tuple's value field */
    protected int valueFieldSz = 0;

    ...

    public final boolean isLengthDefinate()
    {
        return true;
    }

    /**
     * Gets the number of bytes in the length (L) field of this TLV Tuple.
     *
     * @return number of bytes for the length
     */
    public final int getLengthFieldSize()
    {
        return lengthFieldSz;
    }

    /**
     * Gets the number of bytes in the value (V) field of this TLV Tuple.
     *
     * @return number of bytes for the value
     */
    public final int getValueFieldSize()
    {
        return valueFieldSz;
    }
}
</source>
<p>
Now this introduces two new concrete methods for getting the length
of the length field and the length of the value field. A definite
length TLV has a valid value within the Length (L) field. The value
of the length field is the length of the value field. Hence the
reason why we include both these concrete methods.
</p>
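<p>
To make the definite length forms concrete, here is a sketch of
decoding the length (L) field itself: the short form fits in one
octet, while the long form's first octet carries the count of length
octets that follow. The class and method names are illustrative.
</p>

```java
// Illustrative reader for the definite length (L) field of a BER TLV.
class LengthReader
{
    /**
     * Decodes a definite length starting at offset within pdu.
     * Short form: first octet below 0x80 is the length itself.
     * Long form: first octet 0x81..0x84 gives the count of octets
     * that follow, holding the length big-endian.
     */
    static int readLength( byte[] pdu, int offset )
    {
        int first = pdu[offset] & 0xff;
        if ( first < 0x80 )
        {
            return first;                 // short form
        }
        int numOctets = first & 0x7f;     // long form octet count
        int length = 0;
        for ( int ii = 1; ii <= numOctets; ii++ )
        {
            length <<= 8;
            length |= pdu[offset + ii] & 0xff;
        }
        return length;
    }
}
```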
</subsection>
<subsection name="Buffered Vs. Streamed PrimitiveTuples">
<p>
As we mentioned before, there are two kinds of primitive Tuples:
those that keep their value in a buffer within the TLV Tuple object,
in which case it is buffered within memory, and those that stream
the value to disk and store a reference to the value on disk. These
two beasts are so different it makes sense to differentiate between
them using subclasses. Let's take a look at a BufferedTuple which
is the simplest one.
</p>
<source>
public class BufferedTuple extends PrimitiveTuple
{
    /** contains ByteBuffers which contain parts of the value */
    private final ArrayList value = new ArrayList();
    /** pre-fab final unmodifiable wrapper around our modifiable list */
    private final List unmodifiable = Collections.unmodifiableList( value );

    public final boolean isBuffered()
    {
        return true;
    }

    /**
     * Gets the value of this Tuple as a List of ByteBuffers.
     *
     * @return a list of ByteBuffers containing parts of the value
     */
    public final List getValue()
    {
        return unmodifiable;
    }
}
</source>
<p>
The implementation introduces a final <code>getValue()</code> method
which returns an unmodifiable wrapper around a modifiable list of
ByteBuffers. The <code>isBuffered()</code> method is made final and
implemented to return true all the time. This is easy so let's now
take a look at the StreamedTuple implementation.
</p>
<source>
public abstract class StreamedTuple extends PrimitiveTuple
{
    public final boolean isBuffered()
    {
        return false;
    }

    // might experiment with a getURL to represent the source of
    // the data stream - we need to discuss this on the list

    /**
     * Depending on the backing store used for accessing streamed data there
     * may need to be multiple subclasses that implement this method.
     *
     * @return an InputStream that can be used to read this Tuple's streamed
     * value data
     */
    public abstract InputStream getValueStream();

    // another question is whether or not to offer a readable Channel instead
    // of an InputStream? This is another topic for discussion.
}
</source>
<p>
At this point we know that there could be multiple ways to implement
this kind of StreamedTuple. Notice though the value is accessed
through a stream provided by the Tuple. This way the large value
stored on disk need not all be kept in memory at one time during the
decode or encode process.
</p>
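<p>
One conceivable concrete subclass, sketched below, assumes the
decoder spooled the oversized value into a temporary file. The class
name, the File member and the checked exception are illustrative
choices, not settled design.
</p>

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical file-backed streamed Tuple; shown standalone here so
// the fragment compiles without the rest of the hierarchy.
class FileStreamedTuple /* extends StreamedTuple */
{
    /** temporary file holding the streamed value */
    private final File valueFile;

    FileStreamedTuple( File valueFile )
    {
        this.valueFile = valueFile;
    }

    public boolean isBuffered()
    {
        return false;   // the value lives on disk, not in main memory
    }

    public InputStream getValueStream() throws IOException
    {
        return new BufferedInputStream( new FileInputStream( valueFile ) );
    }
}
```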
</subsection>
<p>
Some code will be removed from the Tuple class during the
refactoring and kept in a TupleUtils class. Functionality like
the encoding and decoding of Tuple fields and tag cooking can be
offloaded to this class.
</p>
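<p>
For example, the tag cooking function destined for TupleUtils might
look like the sketch below, packing up to four raw tag octets into a
single Java int with the most significant octet first. The method
name is an assumption.
</p>

```java
// Sketch of "tag cooking": turning raw tag bytes into a 4 byte int.
class TupleUtils
{
    /**
     * Packs up to four tag octets into an int, most significant first.
     */
    static int cookTag( byte[] tagOctets )
    {
        if ( tagOctets.length > 4 )
        {
            throw new IllegalArgumentException( "tag too long to cook into an int" );
        }
        int cooked = 0;
        for ( int ii = 0; ii < tagOctets.length; ii++ )
        {
            cooked <<= 8;
            cooked |= tagOctets[ii] & 0xff;
        }
        return cooked;
    }
}
```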
</section>
<section name="Notes">
<p>
By far the largest part of the refactoring effort is in introducing
this new hierarchy and introducing some patterns that improve the
maintainability of the code like the State pattern. Other minor
details for this dev cycle are discussed below.
</p>
<subsection name="Termination Tuples">
<p>
A lot of effort is made to track the position of a Tuple within a
PDU. This is why we have methods like getTagStartIndex(). We want
to know where the first byte of a Tuple's tag is within a PDU. This
positional accounting enables better error reporting when problems
arise. It also lets us detect when we start and stop
processing a PDU.
</p>
<p>
The minimum amount of information needed to track the position of a
Tuple within a PDU or the start and stop points of a PDU is to have
the Tuple's tag start index, and the lengths of fields within the
Tuple.
</p>
<p>
In a decoder for example we know that we've processed the last
topmost Tuple of a PDU when we get a Tuple whose <code>
getTagStartIndex()</code> returns 0. <b>WARNING</b>: AbstractTuple
should default the tag start index to -1 so it cannot
be interpreted as a terminator.
</p>
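<p>
The check and the recommended default can be sketched as follows;
the names are illustrative only.
</p>

```java
// Sketch of the positional accounting above: the tag start index
// defaults to -1 (unset) so an uninitialized Tuple can never be
// mistaken for the final topmost Tuple, whose tag starts at index 0.
class PositionDemo
{
    static class Tuple
    {
        int tagStartIndex = -1;   // default recommended by the warning above
    }

    static boolean terminatesPdu( Tuple t )
    {
        return t.tagStartIndex == 0;
    }
}
```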
</subsection>
<subsection name="New Coherent Replacement for Stateful Codec API">
<p>
There have been many complaints about the codec API being too
generic or the callback mechanism being somewhat unintuitive.
Perhaps we can work on more specific interfaces which incorporate
the concepts of producer and consumer. Plus let's see if we can
make these interfaces specific so we don't have ugly casts
all over the place.
</p>
<p>
Also in the end we want to do away with this codec API which was
originally intended to fuse back into commons. I've abandoned this
idea because it is too difficult to make all parties happy. The
best thing to do is create our own interfaces that fit well and
enable them to be wrapped for other APIs. Hence going towards custom
codec APIs is not an issue. The old codec stuff can be pushed into
the protocol framework API.
</p>
<p>
Furthermore at the end of the day we want there to be a single runtime
jar without any dependencies for the ASN.1 stuff. That means the
codec API can no longer ship as its own jar within the ASN.1 project
as it does today.
</p>
<p>
Some new producer consumer interface ideas are listed below:
</p>
<ul>
<li>
BufferConsumer: consumes ByteBuffers. Something like <code>void
consume(ByteBuffer bb)</code> comes to mind. Perhaps even with
overloads to take a list or array of BBs.
</li>
<li>
TupleProducer: generates Tuples (often is a BufferConsumer). Something
like <code>void setConsumer(TupleConsumer consumer)</code>
comes to mind.
</li>
<li>
TupleConsumer: consumes Tuples generated by a TupleProducer.
Something like <code>void consume(Tuple tlv)</code> comes to mind.
</li>
<li>
MessageProducer: produces populated message stubs
</li>
</ul>
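<p>
Pulling the list items together, the interfaces might be declared as
below, with a trivial wiring example. These signatures are proposals
only, not an existing API.
</p>

```java
import java.nio.ByteBuffer;

// proposed interfaces from the list above
interface Tuple {}

interface BufferConsumer
{
    void consume( ByteBuffer bb );
}

interface TupleConsumer
{
    void consume( Tuple tlv );
}

interface TupleProducer extends BufferConsumer
{
    void setConsumer( TupleConsumer consumer );
}

// trivial demo wiring: this toy consumer just counts Tuples
class CountingConsumer implements TupleConsumer
{
    int count = 0;

    public void consume( Tuple tlv )
    {
        count++;
    }
}

// toy producer that emits one Tuple for every buffer it consumes
class OneTuplePerBufferProducer implements TupleProducer
{
    private TupleConsumer consumer;

    public void setConsumer( TupleConsumer consumer )
    {
        this.consumer = consumer;
    }

    public void consume( ByteBuffer bb )
    {
        consumer.consume( new Tuple() {} );
    }
}
```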
</subsection>
<subsection name="Possibly Merging TupleNode and Tuple">
<p>
Right now to build Tuple trees we use yet another class to wrap
Tuples called TupleNode. This kept the contents of the Tuple
class less congested. The monolithic Tuple class will no longer exist
so the congestion issue is no longer valid. The question now is: is it
worth keeping parent child methods in TupleNode when creating trees
while paying for extra object creation?
<p>
Note that the TupleNode methods are not required on Tuple to process
a byte stream of encoded TLV data in a SAX-like fashion. These
methods are only required for higher level operations like visitations
from visitors during the encoding process. The question really is
whether we will make Tuple impure to save a little time so we don't
have to create TupleNode objects to wrap Tuples and model the
hierarchy. This is something that needs to be discussed.
</p>
<p>
Contrary to the purist approach of keeping Tuple and TupleNode
separate, one can merge the two. A codec need not honor these methods
by building the tree, meaning these tree node (TupleNode) methods
may simply return null. If these methods are honored then it is the
intent of the codec to build a tree. If the tree is built the
processing is more like DOM and if not then it is more like SAX. We
should not tax the DOM-like processing use case by forcing the need
to create extra wrappers just to accommodate the purist view.
</p>
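<p>
A rough sketch of the merged idea: tree methods live directly on the
Tuple, and a SAX-like codec simply never populates them so they
return null. All names here are illustrative.
</p>

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative merged Tuple/TupleNode: tree accessors return null
// unless a DOM-like codec chose to build the tree.
class MergedTuple
{
    private MergedTuple parent;     // null in SAX-like processing
    private List children;          // null until a codec builds the tree

    MergedTuple getParent()
    {
        return parent;
    }

    List getChildren()
    {
        return children;
    }

    // only a DOM-like codec calls this while building the tree
    void addChild( MergedTuple child )
    {
        if ( children == null )
        {
            children = new ArrayList();
        }
        children.add( child );
        child.parent = this;
    }
}
```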
</subsection>
<subsection name="Removing the Digester Concept">
<p>
I don't know what I was thinking when I devised this rule based
approach similar to the Digester in commons. This was a big mistake
and IMO one of the reasons why we have performance issues. This
data structure can be removed entirely from the upper layers that
depend on it.
</p>
<p>
Granted this means we are going to have to weave once again our own
classes for handling LDAP specific PDUs, however I think this will be
easy to do. I will essentially rewrite the LDAP provider based on
our runtime to hardcode the switching rather than using this rule
based triggering approach. The new approach is also going to
simplify the code significantly, making it more maintainable.
Hopefully these changes will also speed up the code since fewer
objects will need to be created every time a decoder is instantiated.
</p>
</subsection>
<subsection name="It's Time For DER and CER">
<p>
We need to find a way to make the rules used while decoding and
encoding Tuples pluggable. This way we can change the rules to
encode as generic BER, or as reduced BER for performance gains
when a specific protocol needs only a subset. DER likewise is a
reduced set of BER with restrictions on the encoding and range of
values that can be interpreted from primitive values. If the
pluggability is there the runtime is a flexible TLV Tuple codec
that can change the rules used to handle the substrate.
</p>
<p>
We could easily have a BerDecoder, a CerDecoder and even protocol
specific decoders such as an LdapBerDecoder for those BER decoding
rules that only apply to LDAP.
</p>
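<p>
One hedged sketch of what such pluggability could look like: a common
decoder interface with rule-set specific implementations, where the
LDAP variant rejects the indefinite length form LDAP never uses. The
interface and method names are assumptions, not the project's actual
API.
</p>

```java
// Hypothetical pluggable rule-set interface for TLV decoders.
interface TlvDecoder
{
    /**
     * Whether this rule set accepts the given first octet of a
     * length (L) field.
     */
    boolean acceptsLength( int firstLengthOctet );
}

// full BER rule set: definite and indefinite forms are both legal
class BerDecoder implements TlvDecoder
{
    public boolean acceptsLength( int firstLengthOctet )
    {
        return true;
    }
}

// reduced rule set: 0x80 marks the indefinite form, which LDAP never uses
class LdapBerDecoder implements TlvDecoder
{
    public boolean acceptsLength( int firstLengthOctet )
    {
        return firstLengthOctet != 0x80;
    }
}
```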
</subsection>
</section>
</body>
</document>