site/dev/design-notes/term-sharing-in-schema-compiler.adoc - daffodil-site - Git at Google

 :page-layout: page
 :keywords: schema-compiler performance alignment optimization
 // ///////////////////////////////////////////////////////////////////////////
 //
 // This file is written in AsciiDoc.
 //
 // If you can read this comment, your browser is not rendering asciidoc automatically.
 //
 // You need to install the asciidoc plugin to Chrome or Firefox
 // so that this page will be properly rendered for your viewing pleasure.
 //
 // You can get the plugins by searching the web for 'asciidoc plugin'
 //
 // You will want to change plugin settings to enable diagrams (they're off by default.)
 //
 // You need to view this page with Chrome or Firefox.
 //
 // ///////////////////////////////////////////////////////////////////////////
 //
 // When editing, please start each sentence on a new line.
 // See https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one sentence-per-line writing technique.]
 // It is acceptable to start each sentence on a new line, but then wrap the lines to a reasonable length.
 // It's the starting on a new line that matters.
 //
 // This makes textual diffs of this file useful in a similar way to the way they work for code.
 //
 // //////////////////////////////////////////////////////////////////////////

 == Term Sharing in the Schema Compiler

 === Introduction

 The DFDL language has a composition property known as _referential transparency_.
 It is not supposed to matter whether you lift a part of the schema out
 and create a reusable group, reusable type, or reusable element from
 it.

 Because DFDL (version 1.0) does not allow recursive definitions of any
 kind, it is theoretically possible to implement this by treating all
 reusable definitions in the language like macros.
 I.e., inline-expand all definitions at their point of use.
 This will work for schemas up to some size. However, it leads to an
 unacceptable expansion in schema compilation time and space, as the
 size of the schema once all these inline substitutions have been
 performed can be exponentially larger than the original
 non-substituted schema.

 To avoid this undesirable combinatorial explosion, the Daffodil schema
 compiler arranges to achieve the same macro-like semantics, while
 still sharing the core parts of the reusable components so they need
 only be compiled once for each unique way the component is used.

 This is also part of eventually enabling an extension of the DFDL
 language to allow recursive definitions.

 The dfdl:alignment property is one of the largest contributors to
 complexity in sharing objects by the schema compiler.

 Alignment will be used as the example in this design note to
 illustrate the schema compiler behavior.

 === DSOM Term Objects

 DSOM is the Daffodil Schema Object Model.
 Within this model, the components from a DFDL schema that actually have
 representations in the data stream or are computed, are called _terms_ and DSOM models them as
 subclasses of the class/trait Term.
 Terms which are computed (via dfdl:inputValueCalc) are said to have _no representation_ in the data stream.

 For this discussion we are concerned only with _concrete_ terms, meaning those that do have representation.

 The classes in Daffodil that are used for concrete terms are:

 * ElementRef
 * Root (A degenerate element ref to the root element)
 * LocalElementDecl
 * the quasi-elements:

 ** PrefixLengthQuasiElementDecl (Fake local element decl used for the
    length fields of Prefixed-length types)

 ** RepTypeQuasiElementDecl (Fake local element decl used as the
    representation of elements that are computed via the dfdl:repType
    mechanism.)

 * Sequence
 * SequenceGroupRef

 * ChoiceBranchImpliedSequence (The sequence that is inserted when a
   choice branch is a single element decl/ref)

 * Choice
 * ChoiceGroupRef

 A https://cwiki.apache.org/confluence/display/DAFFODIL/DFDL+Schema+Object+Model+%28DSOM%29+with+UML[UML Diagram] is available showing these classes.

 === Term Representation Regions

 Consider the representation of a term, which we call the term's
 _region_, as appearing between two other term regions in the data
 stream as illustrated below:

 [ditaa]
 ....
 ---+ +-----------------------------------------------+ +---
    | |                    term                       | |
    | |                                               | |
    | | +---------------+ +--------+ +--------------+ | |
 ...| | | unique before | | shared | | shared after | | |...
    | | +---------------+ +--------+ +--------------+ | |
  --+ +-----------------------------------------------+ +--
 ....
 In the above, lower position bits are to the left.
 Higher position bits to the right.
 In the diagram, we see that the term's region consists of 3
 sub-regions, named _unique before_, _shared_, and _shared after_. Each
 term appears in a context of possible additional terms before it and
 after it which are shown with ellipsis above.

 A term can be an element or a model-group (sequence or choice).
 By far the most common situation is that a term appears inside a model group.
 There are two exceptions. The _root_, and the model-group of a complex type.

 The core idea here is that we are separating the representation of a
 term into unique and shared parts.
 The unique region is affected by the surrounding model group context.
 The shared regions are, in many cases, sharable across instances of
 this term and is independent of the surrounding model-group context.
 So if the term was defined by way of a reusable global definition,
 then the goal is that there need be only one implementation of the
 shared part(s).

 There are some limits to this sharing.
 A single implementation is not always achievable, but generally one or
 a small number of implementations are possible.

 === Limits to Sharing

 Consider if this term above was defined by way of a XSD group
 reference.
 The DFDL properties expressed on the group reference must
 be combined with those expressed on the group definition, to form the
 complete set of properties associated with the term.
 This combining creates situations where a given shared group definition can have
 wildly different representations for different uses.
 Consider this:

 ```xml
 <group name="g">
   <sequence>
     <element name="a" type="xs:int"/>
     <element name="b" type="xs:int"/>
     <element name="c" type="xs:int"/>
   </sequence>
 </group>

 ....

 <element name="e">
   <complexType>
     <sequence>
       ... some stuff here ...
       <group ref="tns:g" dfdl:separator="|"/> <!-- separated -->
       ... more stuff here ...
       <group ref="tns:g"/> <!-- not separated -->
       ... yet more stuff here ...
     </sequence>
   </complexType>
 </element>
 ```

 The two uses of the group definition for 'g' are wildly different, in
 that one has separators, the other does not.

 This is a rather extreme example, but DFDL implementations must allow
 for this.

 Experience with DFDL schemas to-date (2020-01-21) is that this is very
 rare and appears only in test cases designed to exercise it.
 However, less extreme versions of this are possible, such as different
 uses all being separated and delimited textual data, but where the
 distinct uses do not all use the same separator characters, so the
 separator to be used would be specified differently on each group
 reference. Another expected variation could be separated sequence
 groups where some uses are infix separator and others are prefix or
 postfix separator.

 Anecdotally, the vast majority of schemas seen to-date reuse groups
 entirely, that is specifying no properties at all on the group
 reference.

 ==== Property Environment or PropEnv

 We define the _property environment_ or _PropEnv_ of a term, as the
 complete set of properties for that term, combining them from all the
 schema components that define the term. This can include element
 references, group references, local or global element declarations,
 global group definitions, and local and global type definitions.

 A key observation is that the PropEnv of a term is entirely defined by:

 * local properties (including any dfdl:ref property) on the schema component for the term.
 ** E.g., for an element reference, the local properties expressed on the element reference itself.
 * default format object at top level of the schema where the term lexically appears.
 * definition object being referenced.
 ** E.g., for an element reference, the global element declaration being referenced.

 So if two terms, say group references, have the same PropEnv, then we
 can share much about their definitions when implementing them.

 Hence, when constructing the Daffodil schema compiler's representation
 for a term, we can maintain a cache for each global definition, with
 the key of the cache being the PropEnv.
 When a given term uses a global definition, if the point of use has
 the same PropEnv as another, both can share the part of the term that
 is context independent, that is, the shared region.

 The actual components of the PropEnv of a term are slighly more
 complicated than described above.
 The table below gives the components of the PropEnv for the various
 kinds of terms. Note that the "local properties" below includes any
 dfdl:ref property if it appears, but equality of any property which
 has as its value a QName, like dfdl:ref, the property value is based
 on a resolved QName, not the QName string.

 [cols="2,6a"]
 .PropEnv Components
 |===
 |Term Definition (one PropEnv subtype per) | PropEnv Components

 |Local Element Decl with type reference
 |
 * Local properties expressed on the Local Element Decl (Set of property name + value pairs)
 * Lexically enclosing default format (object ref)
 * Type definition (object ref) or primitive type (object ref)
 | Local Element Decl with anonymous Simple Type Definition
 |
 * Combined set of local properties expressed on the Local Element Decl and the anoymous Simple Type definition.
 * Lexically enclosing default format (object ref)
 * Base simple type definition (object ref), or primitive type (object ref)
 |Local Element Decl with anonymous complex type definition
 |
 * Local properties expressed on the local element decl
 * Lexically enclosing default format (object ref)
 * Anonymous complex type (object ref)
 |Element Reference to Global Element Decl
 |
 * Local properties expressed on the element reference
 * Lexically enclosing default format (object ref)
 * Global Element Decl (object ref)
 |===

 Lookups by PropEnv compare sets by equality of members, and object
 references by pointer equality i.e, eq comparison.

 This definition of PropEnv can be improved by memoizing the
 construction of default format objects, so that equivalent default
 formats from different schema documents are represented by the same
 object.
 That is, so that multiple schema documents each containing:
 ```xml
   <dfdl:format ref="prefix:formatName"/>
 ```
 are instantiated as the same object when the QName resolves to the same format object.

 ==== Details of Unique and Shared Regions

 The division of the reprsentation into the unique part which appears
 before the shared parts indicates where we can share implementation,
 and where we must have unique context-specific implementation for the
 term.

 To understand this better, the below breaks down the sub-regions of a term further.
 First we look at the details of the unique-before region:

 //
 // DITAA max line length (for github) ruler -------------------------------------------------------------------------|
 //

 [ditaa]
 ....
                      +--+ +-----------------------------------------------+ +--+
                         | |                    term                       | |
                         | |                                               | |
                         | | +---------------+ +--------+ +--------------+ | |
                      ...| | | unique before | | shared | | shared after | | |...
                         | | +---------------+ +--------+ +--------------+ | |
                       +-+ +-----------------------------------------------+ +-+
                             |               |
                             |               |
 +---------------------------+               +----------------------------------+
 |                                                                              |
 v                                                                              v
 +------------------------------------------------------------------------------+
 | +-------------------+                    unique before                       |
 | |  sequence before  |                                                        |
 | | +------+ +------+ | +-----+ +---------+ +------+ +------+ +------+ +-----+ |
 | | |prefix| |prefix| | |lSkip| |alignFill| |init'r| |init'r| |prefix| +value| |
 | | |infix | |infix | | |     | |         | | MTA  | |      | |length| | MTA | |
 | | |sep'r | |sep'r | | |     | |         | |      | |      | |      | |     | |
 | | | MTA  | |      | | |     | |         | |      | |      | |      | |     | |
 | | +------+ +------+ | +-----+ +---------+ +------+ +------+ +------+ +-----+ |
 | +-------------------+                                                        |
 +------------------------------------------------------------------------------+

 ....
 The sub-regions within the unique before region are:

 * prefix infix sep'r MTA - mandatory text alignment for a separator in infix or prefix position.
 When a separator is defined, this region is populated with up to 7 bits to
 obtain text alignment for the characters of the separator in the specified charset encoding.
 Note that this encoding and that of the term's initator/terminator are not necessarily the same encoding.
 * prefix infix sep'r - separator when defined and in prefix or infix position.
 This consists of text characters.

 * lSkip - leading skip region.
 This is fixed length (commonly 0)
 * alignFill - alignment fill region.
 This dynamically sized region depends on the bit position where it starts.
 * init'r MTA - mandatory text alignment for initiator.
 When an initiator is defined, this region is populated with up to 7 bits to
 obtain text alignment for the characters of the initiator in specified charset encoding.
 * init'r - initiator.
 This consists of text characters.
 * prefix length - (element terms only) the prefix length.
 Sub-detail of this region is not shown here.
 The PrefixLengthQuasiElementDecl class is used to represent this subregion in DSOM.
 * value MTA - (element terms only, simple type only) mandatory text alignment for the value.
 When the value has text representation this insures those characters start
 on the right bit-boundary for characters of the specified charset encoding.

 The sizes of the alignment regions (alignFill, and the MTA regions) in
 the above all depend on where the start of this term is.
 That depends on where the prior term ends, if there is a prior term.
 That is, the sizes of the alignment regions depend on the enclosing
 model groups, and what can appear before this term in that nest.

 Now we look at an expansion of the shared regions:

 [ditaa]
 ....


                   +--+ +-----------------------------------------------+ +--+
                      | |                    term                       | |
                      | |                                               | |
                      | | +---------------+ +--------+ +--------------+ | |
                   ...| | | unique before | | shared | | shared after | | |...
                      | | +---------------+ +--------+ +--------------+ | |
                    +-+ +-----------------------------------------------+ +-+
                                            |                         |
                                            |                         |
            +-------------------------------+                         +-----------------+
            |                                                                           |
            |                                                                           |
            v                                                                           v
            +---------------------+ +---------------------------------------------------+
            |shared               | |       shared after        +---------------------+ |
            |                     | |                           |    sequence after   | |
            | +--------+ +------+ | | +------+ +------+ +-----+ | +-------+ +-------+ | |
            | | content| |elt or| | | |term'r| |term'r| |tSkip| | |postfix| |postfix| | |
            | |   or   | |choice| | | | MTA  | |      | |     | | | sep'r | | sep'r | | |
            | | value  | |unused| | | |      | |      | |     | | |  MTA  | |       | | |
            | |        | |      | | | |      | |      | |     | | |       | |       | | |
            | +--------+ +------+ | | +------+ +------+ +-----+ | +-------+ +-------+ | |
            |                     | |                           +---------------------+ |
            +---------------------+ +---------------------------------------------------+

            ^
            |
            |
            +
    Known Alignment Point

 ....

 It is key that the shared region can assume the alignment
 fill and MTA regions have been sized per the requirements of this
 term.
 Hence, the shared region is context independent and its implementation - in schema
 compiler objects and in runtime structure representation, need only be represented once.

 .Corner Case: Interior Alignment
 [IMPORTANT]
 --
 The semantics here aren't identical to those of macro expansion, but the difference is hard to
 detect.

 The https://opendfdl.github.io/gwdrp-dfdl-v1.0.5_r11/gwdrp-dfdl-v1.0.5-r11-change-tracking.htm[DFDL Spec (recent draft)]
 includes a link:https://opendfdl.github.io/gwdrp-dfdl-v1.0.5_r11/gwdrp-dfdl-v1.0.5-r11-change-tracking.htm#_Toc27061169[section (23.6 in the linked draft)] on
 this
 // Hmmm. just writing _interior alignment problem_ didn't work. This passthrough to HTML will work for sure.
 pass:[<i>interior alignment problem</i>].

 The technique for approximating alignment info described here is slighly less precise in its
 analysis because it resets its knowledge of the alignment at each known alignment point.
 Whereas pure macro expansion could carry forward knowledge of alignment and perhaps optimize out more
 alignment fill regions.

 Hence, it's possible that the technique here will leave some alignment fill regions in place and thereby some
 formats will experience circular deadlocks when unparsing due to this variable-length interior-alignment.

 The Daffodil test suite includes examples of this interior alignment that cause unparser circular deadlocks.
 (see testOutputValueCalcVariableLengthThenAlignmentDeadlock)

 --

 The shared region contains a text or simple binary value, a nil representation, or inductively, a sequence of terms.
 When the term is an element of complex type, an element unused region can follow (depending on the length kind),
 and if the term is a choice with dfdl:choiceLengthKind='explicit' then a choice unused region can follow.

 Then, after the shared part, the shared-after region has these subregions:

 * term'r MTA - mandatory text alignment for the terminator.
 * term'r - terminator
 * tSkip - trailing skip region. This is fixed length (commonly 0)
 * postfix sep'r MTA - - mandatory text alignment for a separator in postfix position.
 When a separator is defined, this region is populated with up to 7 bits to
 obtain text alignment for the characters of the separator in the specified charset encoding.
 Note that this encoding and that of the term's initator/terminator are not necessarily the same encoding.
 * postfix sep'r - separator when defined and in postfix position.

 The terminator MTA region varies in size based on its starting
 alignment, and so it depends on the size of the shared region.

 The postfix separator MTA region similarly varies in size based on where the
 tSkip region ended, and so it depends on the size of the shared region, and
 the sizes of the terminator MTA region, terminator, and tSkip regions.

 NOTE: The shared-after region does not have any sub-regions the size
 of which depend on the context.
 However, we separate it from the central shared region in order to
 separate all concerns around alignment and fill into separate regions from the
 central shared region, which could perhaps be called the shared
 content region.
 The central shared region therefore does not deal with alignment at all.

 The central idea here is that all the regions to the left (before) the known alignment point are context dependent.
 That is, whether these alignment fill and mandatory text alignment regions can be statically determined
 in size is dependent on where the prior term ended.

 In contrast, the known alignment point is a fresh start for alignment. The alignment as required by the alignment properties
 will have been achieved at that point. Hence, from there and to the right (after), everything can be assessed relative to
 this new fresh anchor.

 Inductively, even the context dependent regions are not dependent on terms very far back.
 Let's consider the regions in a straddle of two adjacent terms in a sequence:

 //
 // DITAA max line length (for github) ruler -------------------------------------------------------------------------|
 //
 [ditaa]
 ....
 +----------------------------------------------+ +------------------------------------------------
                                 Term 1         | |           Term 2
                    +-------------------------+ | |  +-------------------------------------+
  +---+ +---------+ |shared after             | | |  |unique before                        | +---+
      | | shared  | |                         | | |  |                                     | |shared
      | |         | |             +---------+ | | |  | +---------+                         | |
      | | +-+ +-+ | | +-+ +-+ +-+ |sequence | | | |  | |sequence | +-+ +-+ +-+ +-+ +-+ +-+ | |
      | | | | | | | | | | | | | | | after   | | | |  | | before  | | | | | | | | | | | | | | |
      | | | | | | | | | | | | | | | +-+ +-+ | | | |  | | +-+ +-+ | | | | | | | | | | | | | | |
      | | | | | | | | | | | | | | | | | | | | | | |  | | | | | | | | | | | | | | | | | | | | |
  ... | | | | | | | | | | | | | | | | | | | | | | |  | | | | | | | | | | | | | | | | | | | | | ...
      | | +-+ +-+ | | +-+ +-+ +-+ | | | | | | | | |  | | | | | | | +-+ +-+ +-+ +-+ +-+ +-+ | |
      | |         | |             | +-+ +-+ | | | |  | | +-+ +-+ |                         | |
  +---+ +---------+ |             +---------+ | | |  | +---------+                         | +---+
        ^           +-------------------------+ | |  +-------------------------------------+ ^
        |                                       | |                                          |
 +----------------------------------------------+ +-------------------------------------------------
        |                                                                                    |
        |                                                                                    |
        |                                                                                    |
        |                                                                                    |
        +                                                                                    +
   Known Alignment Point 1                                                  Known Alignment Point 2

 ....

 The Known Alignment Point 2 is only context dependent on the possible prior terms back to Known Alignment Point 1.
 This is true certaintly when Term 1 is a scalar element or a model-group itself.

 In reality the induction here is more complex because Term 1 might be an optional element or an array, in which case the prior term before Term 1 may also be involved
 in computing the region sizes for the Term 2 unique-before regions.
 But this never needs to consider anything further back than the prior scalar term, or the start of the shared region of the enclosing model group, which need
 only be considered if Term 2 can be first within its enclosing model group.

 CAUTION: Separators may be present even if terms are not, based on dfdl:separatorSuppressionPolicy and a few other properties.
 The diagrams show sequence before region as if it was part of the term, but really it may appear even if the term is absent.


 Given the above, for any term, we can compute its PropEnv, and create
 a distinct shared instance for that term for each unique PropEnv.

 Since the shared region contains all the children of any
 child-containing term, sharing this insures that the schema
 compilation space/speed is proportional to, roughly, the size of the
 schema without any non-constant multiplicative factor.
 This insures no combinatorial explosion of compilation time.

 === Optimizing Alignment Regions

 A primary task of the schema compiler is to eliminate alignment fill
 (and mandatory text alignment aka MTA) regions that are unnecessary
 because the data can be proven to always be properly aligned.

 This is done by computing, for each region, a compile-time alignment approximation.
 The AlignmentMultipleOf class represents this information.
 AlignmentMultipleOf objects form a mathematical lattice where perfect
 alignment is the bottom of the lattice, no alignment (or alignment
 multiple of 1 bit) is the top of the lattice, and various multiples
 (8, 32, etc.) live in between. For two lattice values a and b, we say
 a < b (a is 'weaker than' b) if a is a multiple of b.

 A key observation is this: Each time we create an alignment constraint
 for a shared region, this is not dependent on any other constraint.
 That is, the constraints on alignment do not propagate from term to term.
 A shared region can assume it starts at the alignment that is required
 for the term based on the dfdl:alignment property, the initiator MTA
 if there is an initiator, and the value MTA for text-representation
 simple values.

 Hence, terms that appear down-stream of any shared region do *not*
 depend on any constraints propagating from before the shared region.
 This greatly reduces the complexity and the distance across the schema
 that constraint-changes propagate.
 It also insures there is no need for an iterative contraint solver, since nothing
 about alignment propagates from one term to the next.
 Terms are separated by the fact that the shared region of a term starts from a
 known alignment (specified by its alignment and alignment units
 property).


 === Other Context-Dependent Computations

 By dividing up the representation of a term into the unique context-dependent
 and shared context-indepenent parts explicitly we provide a home for the various
 calculations the schema compiler performs in order to do other optimizations.


 [plantuml, target="unshared-shared", format="png"]
 ....
 hide empty members

 class UnsharedTerm {

   isAlignmentFillNeeded: Boolean
   isInitiatorMTANeeded: Boolean
   isValueMTANeeded:Boolean

   isPrefixInfixSeparatorMTANeeded: Boolean

   isPotentiallyTrailing: Boolean

   needsBitOrderChange: Boolean

   isScanningForDelimiter: Boolean
   isScanningForTerminator: Boolean

   initiatedContentCheck: Unit

 }

 class SharedTerm {
  isTerminatorMTANeeded: Boolean
  isPostfixSeparatorMTANeeded: Boolean
  // along with everything else

 }

 UnsharedTerm -right-> SharedTerm : shared

 ....

 In the above class diagram, members that must be computed on the unshared term object are
 shown in the class.
 This set of things should always be minimized.
 That is, as few things should be computed on unshared term objects as posible.
 It is assumed that everything else is computed on the shared term object, but members
 specifically about regions discussed in sections above are called out.

 The UnsharedTerm object is effectively state of the enclosing model group DSOM object.
 It exists in 1 to 1 correspondence (aka it is "owned" by the enclosing model group object.)

 In the Runtime 1 backend, a processor (parser/unparser) is generated for the shared term object,
 and that processor is referenced from the processor generated for the unshared term object.