design/entities.html - xerces2-j - Git at Google

 <!-- $Id$ -->
 <html>
  <head>
   <title>Xerces 2 | Entities</title>
   <link rel='stylesheet' type='text/css' href='css/site.css'>
   <link rel='stylesheet' type='text/css' href='css/diagram.css'>
   <style type='text/css'>
    .note { font-size: smaller }
   </style>
  </head>
  <body>
   <span class='netscape'>
   <a name='TOP'></a>
   <h1>Entity Management</h1>
   <a name='TOC'></a>
   <h2>Table of Contents</h2>
   <p>
    <ul>
     <li><a href='#Overview'>Overview</a>
      <ul>
       <li><a href='#Overview.Xerces'>Xerces</a></li>
       <li><a href='#Overview.Crimson'>Crimson</a></li>
      </ul>
     </li>
     <li><a href='#Assumptions'>Assumptions</a></li>
     <li><a href='#EntityManager'>EntityManager</a></li>
     <li><a href='#EntityScanner'>EntityScanner</a></li>
     <li><a href='#Notes'>Notes</a>
      <ul>
       <li><a href='#Notes.OpenIssues'>Open Issues</a></li>
      </ul>
     </li>
    </ul>
   </p>
   <hr>
   <a name='Overview'></a>
   <h2>Overview</h2>
   <p>
    An XML document is comprised of various entities which can
    be encoded using different character encodings. The document
    instance is known as the <em>document entity</em> whereas
    we'll call the DTD the <em>dtd entity</em>. In addition,
    <em>general entities</em> and <em>parameter entities</em>
    act as macros for inserting fragments into the parse stream
    when the entity is referenced in the document and DTD,
    respectively.
   </p>
   <p>
    There must be a way to declare, locate, and read
    entities in their respective character encoding. The entity
    manager handles locating entities and obtaining an entity
    scanner capable of scanning the entity content. Depending on
    the character encoding, there may be custom readers for
    performance reasons. Regardless of the character encoding,
    though, the interface to scan the underlying content must
    be consistent and simple.
   </p>
   <a name='Overview.Xerces'></a>
   <h3>Xerces</h3>
   <p>
    The complexity of the original Xerces code resulted, in
    large part, from the readers and entity management. The
    entity readers were defined with a large set of methods
    so that read operations could be optimized for each
    reader and that character transcoding could be deferred.
    However, this meant that every reader had to implement
    all of the methods separately which introduced more
    chances for bugs in the code and made it harder to
    understand the system.
   </p>
   <a name='Overview.Crimson'></a>
   <h3>Crimson</h3>
   <p>
    Crimson took a simpler approach to reading entities.
    There is only one reader class that delegates the read
    calls to a few optimized input stream readers. And
    without attempting to defer character encoding, the
    code path is greatly simplified. But its not all roses.
    If you look deep enough in the code you'll find that
    the entity management code is <em>somewhat</em> complex
    because of the nature of XML entities.
   </p>
   <a name='Assumptions'></a>
   <h2>Assumptions</h2>
   <p>
    Before designing the entity management, a few assumptions
    were made:
    <ul>
     <li>
      Characters are <em>always</em> transcoded<br>
      <span class='note'>
       This greatly simplifies the system and allows us to avoid
       using a string pool. There <em>is</em> a performance cost
       but the simplicity and understandability of the code far
       outweighs any performance lost.
      </span>
     </li>
     <li>
      There will be a single entity manager per parser instance<br>
      <span class='note'>
       Scanners need to have a way of locating entities and
       reading their contents. An entity manager would provide
       that mechanism.
      </span>
     </li>
     <li>
      There will be a single entity scanner per parser instance that
      XML scanners will use<br>
      <span class='note'>
       This entity scanner can still delegate to custom, optimized
       input stream readers for performance.
      </span>
     </li>
    </ul>
   </p>
   <a name='EntityManager'></a>
   <h2>Entity Manager</h2>
   <p>
    The entity manager is a core component in any parser
    configuration and there is only one entity manager per parser
    instance. Some of the responsibilities of the entity
    manager are:
    <ul>
     <li>Registering declared entities</li>
     <li>Resolving external entities</li>
     <li>Starting entities</li>
    </ul>
   </p>
   <p>
    The <code><a href='design.html#XMLEntityManager'>XMLEntityManager</a></code>
    class implements the entity management in the parser. This
    class contains methods for registering general and parameter
    entities; resolving entities either by default or by using
    the SAX <code>EntityResolver</code> registered by the user;
    and starting named and unnamed entities. The various XML
    scanners query the entity scanner by calling
    <code>getEntityScanner</code> on the entity manager.
   </p>
   <a name='EntityScanner'></a>
   <h2>Entity Scanner</h2>
   <p>
    The entity scanner is responsible for scanning "primitive"
    XML structure from an entity and reporting the parse location.
    The <code><a href='design.html#XMLEntityScanner'>XMLEntityScanner</a></code>
    class contains methods to peek at the current character; scan
    names and content; etc.
   </p>
   <p>
    There is only one entity scanner per entity manager.
    The entity scanner works directly with the entity manager in
    order to read from the underlying character readers. This
    makes scanning of the entities transparent to the caller.
    Changing readers; auto-detecting encodings from input
    streams; and buffering is done "under the covers" and does
    not affect how the caller interacts with the entity scanner.
   </p>
   <p>
    If both the entity manager and entity scanner are singletons
    per parser instance, why aren't they a single object?
    The manager and scanner <em>could</em> be a single object
    but they are separate in order to have a cleaner separation
    of functionality and API. Even though they are separate,
    they share common data, as shown in the following diagram.
   </p>
   <p>
    <table border='2' cellpadding='10' cellspacing='0'>
     <tr class='diagram'>
      <td>
       <table border='0' cellpadding='2' cellspacing='0'>
        <tr>
         <td class='config-component'>
          <table cellpadding='7' cellspacing='2'>
           <tr class='diagram'>
            <td class='component'>Entity<br>Manager</td>
            <td class='component'>Entity<br>Scanner</td>
           </tr>
          </table>
          <li>entity resolver</li>
          <li>reader stack</li>
          <li>entity handler</li>
         </td>
        </tr>
       </table>
      </td>
     </tr>
    </table>
   </p>
   <a name='Notes'></a>
   <h2>Notes</h2>
   <p>
    It is expected that the entity management and readers will
    need to be re-evaluated as the Xerces 2 concept is
    implemented. The operation of reading entities directly
    impacts the performance of the parser and while this isn't
    an initial requirement it is important.
   </p>
   <a name='Notes.OpenIssues'></a>
   <h3>Open Issues</h3>
   <p>
    There are currently some open issues. [Note: these should
    move to <a href='issues.html'>implementation issues</a>.]
    <dl>
     <dt>Entity Encoding</dt>
     <dd>
      When an entity is started that is read from an input stream,
      the encoding must first be auto-detected. Then, as the
      appropriate scanner parses the XMLDecl or TextDecl line,
      a new encoding must be set on the entity scanner. An API
      must be created in order to set the encoding. However,
      the work of swapping out the character reader is done
      transparently from the caller.
     </dd>
     <dt>Open Readers</dt>
     <dd>
      Who closes open readers? The parser should close all readers
      that it created but should not close any readers that are
      passed to the parser via the <code>parse(InputSource)</code>
      method. And at what time are the readers closed in the case
      of an unrecoverable error?
     </dd>
    </dl>
   </p>
   </span>
   <a name='BOTTOM'></a>
   <hr>
   <span class='netscape'>
    Last modified: $Date$
   </span>
  </body>
 </html>
	<!-- $Id$ -->
	<html>
	<head>
	<title>Xerces 2 \| Entities</title>
	<link rel='stylesheet' type='text/css' href='css/site.css'>
	<link rel='stylesheet' type='text/css' href='css/diagram.css'>
	<style type='text/css'>
	.note { font-size: smaller }
	</style>
	</head>
	<body>
	<span class='netscape'>
	<a name='TOP'></a>
	<h1>Entity Management</h1>
	<a name='TOC'></a>
	<h2>Table of Contents</h2>
	<p>
	<ul>
	<li><a href='#Overview'>Overview</a>
	<ul>
	<li><a href='#Overview.Xerces'>Xerces</a></li>
	<li><a href='#Overview.Crimson'>Crimson</a></li>
	</ul>
	</li>
	<li><a href='#Assumptions'>Assumptions</a></li>
	<li><a href='#EntityManager'>EntityManager</a></li>
	<li><a href='#EntityScanner'>EntityScanner</a></li>
	<li><a href='#Notes'>Notes</a>
	<ul>
	<li><a href='#Notes.OpenIssues'>Open Issues</a></li>
	</ul>
	</li>
	</ul>
	</p>
	<hr>
	<a name='Overview'></a>
	<h2>Overview</h2>
	<p>
	An XML document is comprised of various entities which can
	be encoded using different character encodings. The document
	instance is known as the <em>document entity</em> whereas
	we'll call the DTD the <em>dtd entity</em>. In addition,
	<em>general entities</em> and <em>parameter entities</em>
	act as macros for inserting fragments into the parse stream
	when the entity is referenced in the document and DTD,
	respectively.
	</p>
	<p>
	There must be a way to declare, locate, and read
	entities in their respective character encoding. The entity
	manager handles locating entities and obtaining an entity
	scanner capable of scanning the entity content. Depending on
	the character encoding, there may be custom readers for
	performance reasons. Regardless of the character encoding,
	though, the interface to scan the underlying content must
	be consistent and simple.
	</p>
	<a name='Overview.Xerces'></a>
	<h3>Xerces</h3>
	<p>
	The complexity of the original Xerces code resulted, in
	large part, from the readers and entity management. The
	entity readers were defined with a large set of methods
	so that read operations could be optimized for each
	reader and that character transcoding could be deferred.
	However, this meant that every reader had to implement
	all of the methods separately which introduced more
	chances for bugs in the code and made it harder to
	understand the system.
	</p>
	<a name='Overview.Crimson'></a>
	<h3>Crimson</h3>
	<p>
	Crimson took a simpler approach to reading entities.
	There is only one reader class that delegates the read
	calls to a few optimized input stream readers. And
	without attempting to defer character encoding, the
	code path is greatly simplified. But its not all roses.
	If you look deep enough in the code you'll find that
	the entity management code is <em>somewhat</em> complex
	because of the nature of XML entities.
	</p>
	<a name='Assumptions'></a>
	<h2>Assumptions</h2>
	<p>
	Before designing the entity management, a few assumptions
	were made:
	<ul>
	<li>
	Characters are <em>always</em> transcoded<br>
	<span class='note'>
	This greatly simplifies the system and allows us to avoid
	using a string pool. There <em>is</em> a performance cost
	but the simplicity and understandability of the code far
	outweighs any performance lost.
	</span>
	</li>
	<li>
	There will be a single entity manager per parser instance<br>
	<span class='note'>
	Scanners need to have a way of locating entities and
	reading their contents. An entity manager would provide
	that mechanism.
	</span>
	</li>
	<li>
	There will be a single entity scanner per parser instance that
	XML scanners will use<br>
	<span class='note'>
	This entity scanner can still delegate to custom, optimized
	input stream readers for performance.
	</span>
	</li>
	</ul>
	</p>
	<a name='EntityManager'></a>
	<h2>Entity Manager</h2>
	<p>
	The entity manager is a core component in any parser
	configuration and there is only one entity manager per parser
	instance. Some of the responsibilities of the entity
	manager are:
	<ul>
	<li>Registering declared entities</li>
	<li>Resolving external entities</li>
	<li>Starting entities</li>
	</ul>
	</p>
	<p>
	The <code><a href='design.html#XMLEntityManager'>XMLEntityManager</a></code>
	class implements the entity management in the parser. This
	class contains methods for registering general and parameter
	entities; resolving entities either by default or by using
	the SAX <code>EntityResolver</code> registered by the user;
	and starting named and unnamed entities. The various XML
	scanners query the entity scanner by calling
	<code>getEntityScanner</code> on the entity manager.
	</p>
	<a name='EntityScanner'></a>
	<h2>Entity Scanner</h2>
	<p>
	The entity scanner is responsible for scanning "primitive"
	XML structure from an entity and reporting the parse location.
	The <code><a href='design.html#XMLEntityScanner'>XMLEntityScanner</a></code>
	class contains methods to peek at the current character; scan
	names and content; etc.
	</p>
	<p>
	There is only one entity scanner per entity manager.
	The entity scanner works directly with the entity manager in
	order to read from the underlying character readers. This
	makes scanning of the entities transparent to the caller.
	Changing readers; auto-detecting encodings from input
	streams; and buffering is done "under the covers" and does
	not affect how the caller interacts with the entity scanner.
	</p>
	<p>
	If both the entity manager and entity scanner are singletons
	per parser instance, why aren't they a single object?
	The manager and scanner <em>could</em> be a single object
	but they are separate in order to have a cleaner separation
	of functionality and API. Even though they are separate,
	they share common data, as shown in the following diagram.
	</p>
	<p>
	<table border='2' cellpadding='10' cellspacing='0'>
	<tr class='diagram'>
	<td>
	<table border='0' cellpadding='2' cellspacing='0'>
	<tr>
	<td class='config-component'>
	<table cellpadding='7' cellspacing='2'>
	<tr class='diagram'>
	<td class='component'>Entity<br>Manager</td>
	<td class='component'>Entity<br>Scanner</td>
	</tr>
	</table>
	<li>entity resolver</li>
	<li>reader stack</li>
	<li>entity handler</li>
	</td>
	</tr>
	</table>
	</td>
	</tr>
	</table>
	</p>
	<a name='Notes'></a>
	<h2>Notes</h2>
	<p>
	It is expected that the entity management and readers will
	need to be re-evaluated as the Xerces 2 concept is
	implemented. The operation of reading entities directly
	impacts the performance of the parser and while this isn't
	an initial requirement it is important.
	</p>
	<a name='Notes.OpenIssues'></a>
	<h3>Open Issues</h3>
	<p>
	There are currently some open issues. [Note: these should
	move to <a href='issues.html'>implementation issues</a>.]
	<dl>
	<dt>Entity Encoding</dt>
	<dd>
	When an entity is started that is read from an input stream,
	the encoding must first be auto-detected. Then, as the
	appropriate scanner parses the XMLDecl or TextDecl line,
	a new encoding must be set on the entity scanner. An API
	must be created in order to set the encoding. However,
	the work of swapping out the character reader is done
	transparently from the caller.
	</dd>
	<dt>Open Readers</dt>
	<dd>
	Who closes open readers? The parser should close all readers
	that it created but should not close any readers that are
	passed to the parser via the <code>parse(InputSource)</code>
	method. And at what time are the readers closed in the case
	of an unrecoverable error?
	</dd>
	</dl>
	</p>
	</span>
	<a name='BOTTOM'></a>
	<hr>
	<span class='netscape'>
	Last modified: $Date$
	</span>
	</body>
	</html>