| :page-layout: page |
| :keywords: schema-compiler performance alignment optimization |
| // /////////////////////////////////////////////////////////////////////////// |
| // |
| // This file is written in AsciiDoc. |
| // |
| // If you can read this comment, your browser is not rendering asciidoc automatically. |
| // |
| // You need to install the asciidoc plugin to Chrome or Firefox |
| // so that this page will be properly rendered for your viewing pleasure. |
| // |
| // You can get the plugins by searching the web for 'asciidoc plugin' |
| // |
| // You will want to change plugin settings to enable diagrams (they're off by default.) |
| // |
| // You need to view this page with Chrome or Firefox. |
| // |
| // /////////////////////////////////////////////////////////////////////////// |
| // |
| // When editing, please start each sentence on a new line. |
| // See https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one sentence-per-line writing technique.] |
| // It is acceptable to start each sentence on a new line, but then wrap the lines to a reasonable length. |
| // It's the starting on a new line that matters. |
| // |
| // This makes textual diffs of this file useful in a similar way to the way they work for code. |
| // |
| // ////////////////////////////////////////////////////////////////////////// |
| |
| == Term Sharing in the Schema Compiler |
| |
| === Introduction |
| |
| The DFDL language has a composition property known as _referential transparency_. |
| It is not supposed to matter whether you lift a part of the schema out |
| and create a reusable group, reusable type, or reusable element from |
| it. |
| |
| Because DFDL (version 1.0) does not allow recursive definitions of any |
| kind, it is theoretically possible to implement this by treating all |
| reusable definitions in the language like macros. |
| I.e., inline-expand all definitions at their point of use. |
| This will work for schemas up to some size. However, it leads to an |
| unacceptable expansion in schema compilation time and space, as the |
| size of the schema once all these inline substitutions have been |
| performed can be exponentially larger than the original |
| non-substituted schema. |
| |
| To avoid this undesirable combinatorial explosion, the Daffodil schema |
| compiler arranges to achieve the same macro-like semantics, while |
| still sharing the core parts of the reusable components so they need |
| only be compiled once for each unique way the component is used. |
| |
| This is also part of eventually enabling an extension of the DFDL |
| language to allow recursive definitions. |
| |
| The dfdl:alignment property is one of the largest contributors to |
| complexity in sharing objects by the schema compiler. |
| |
| Alignment will be used as the example in this design note to |
| illustrate the schema compiler behavior. |
| |
| === DSOM Term Objects |
| |
| DSOM is the Daffodil Schema Object Model. |
| Within this model, the components from a DFDL schema that actually have |
| representations in the data stream or are computed, are called _terms_ and DSOM models them as |
| subclasses of the class/trait Term. |
| Terms which are computed (via dfdl:inputValueCalc) are said to have _no representation_ in the data stream. |
| |
| For this discussion we are concerned only with _concrete_ terms, meaning those that do have representation. |
| |
| The classes in Daffodil that are used for concrete terms are: |
| |
| * ElementRef |
| * Root (A degenerate element ref to the root element) |
| * LocalElementDecl |
| * the quasi-elements: |
| |
| ** PrefixLengthQuasiElementDecl (Fake local element decl used for the |
| length fields of Prefixed-length types) |
| |
| ** RepTypeQuasiElementDecl (Fake local element decl used as the |
| representation of elements that are computed via the dfdl:repType |
| mechanism.) |
| |
| * Sequence |
| * SequenceGroupRef |
| |
| * ChoiceBranchImpliedSequence (The sequence that is inserted when a |
| choice branch is a single element decl/ref) |
| |
| * Choice |
| * ChoiceGroupRef |
| |
| A https://cwiki.apache.org/confluence/display/DAFFODIL/DFDL+Schema+Object+Model+%28DSOM%29+with+UML[UML Diagram] is available showing these classes. |
| |
| === Term Representation Regions |
| |
| Consider the representation of a term, which we call the term's |
| _region_, as appearing between two other term regions in the data |
| stream as illustrated below: |
| |
| [ditaa] |
| .... |
| ---+ +-----------------------------------------------+ +--- |
| | | term | | |
| | | | | |
| | | +---------------+ +--------+ +--------------+ | | |
| ...| | | unique before | | shared | | shared after | | |... |
| | | +---------------+ +--------+ +--------------+ | | |
| --+ +-----------------------------------------------+ +-- |
| .... |
| In the above, lower position bits are to the left. |
| Higher position bits to the right. |
| In the diagram, we see that the term's region consists of 3 |
| sub-regions, named _unique before_, _shared_, and _shared after_. Each |
| term appears in a context of possible additional terms before it and |
| after it which are shown with ellipsis above. |
| |
| A term can be an element or a model-group (sequence or choice). |
| By far the most common situation is that a term appears inside a model group. |
| There are two exceptions. The _root_, and the model-group of a complex type. |
| |
| The core idea here is that we are separating the representation of a |
| term into unique and shared parts. |
| The unique region is affected by the surrounding model group context. |
| The shared regions are, in many cases, sharable across instances of |
| this term and is independent of the surrounding model-group context. |
| So if the term was defined by way of a reusable global definition, |
| then the goal is that there need be only one implementation of the |
| shared part(s). |
| |
| There are some limits to this sharing. |
| A single implementation is not always achievable, but generally one or |
| a small number of implementations are possible. |
| |
| === Limits to Sharing |
| |
| Consider if this term above was defined by way of a XSD group |
| reference. |
| The DFDL properties expressed on the group reference must |
| be combined with those expressed on the group definition, to form the |
| complete set of properties associated with the term. |
| This combining creates situations where a given shared group definition can have |
| wildly different representations for different uses. |
| Consider this: |
| |
| ```xml |
| <group name="g"> |
| <sequence> |
| <element name="a" type="xs:int"/> |
| <element name="b" type="xs:int"/> |
| <element name="c" type="xs:int"/> |
| </sequence> |
| </group> |
| |
| .... |
| |
| <element name="e"> |
| <complexType> |
| <sequence> |
| ... some stuff here ... |
| <group ref="tns:g" dfdl:separator="|"/> <!-- separated --> |
| ... more stuff here ... |
| <group ref="tns:g"/> <!-- not separated --> |
| ... yet more stuff here ... |
| </sequence> |
| </complexType> |
| </element> |
| ``` |
| |
| The two uses of the group definition for 'g' are wildly different, in |
| that one has separators, the other does not. |
| |
| This is a rather extreme example, but DFDL implementations must allow |
| for this. |
| |
| Experience with DFDL schemas to-date (2020-01-21) is that this is very |
| rare and appears only in test cases designed to exercise it. |
| However, less extreme versions of this are possible, such as different |
| uses all being separated and delimited textual data, but where the |
| distinct uses do not all use the same separator characters, so the |
| separator to be used would be specified differently on each group |
| reference. Another expected variation could be separated sequence |
| groups where some uses are infix separator and others are prefix or |
| postfix separator. |
| |
| Anecdotally, the vast majority of schemas seen to-date reuse groups |
| entirely, that is specifying no properties at all on the group |
| reference. |
| |
| ==== Property Environment or PropEnv |
| |
| We define the _property environment_ or _PropEnv_ of a term, as the |
| complete set of properties for that term, combining them from all the |
| schema components that define the term. This can include element |
| references, group references, local or global element declarations, |
| global group definitions, and local and global type definitions. |
| |
| A key observation is that the PropEnv of a term is entirely defined by: |
| |
| * local properties (including any dfdl:ref property) on the schema component for the term. |
| ** E.g., for an element reference, the local properties expressed on the element reference itself. |
| * default format object at top level of the schema where the term lexically appears. |
| * definition object being referenced. |
| ** E.g., for an element reference, the global element declaration being referenced. |
| |
| So if two terms, say group references, have the same PropEnv, then we |
| can share much about their definitions when implementing them. |
| |
| Hence, when constructing the Daffodil schema compiler's representation |
| for a term, we can maintain a cache for each global definition, with |
| the key of the cache being the PropEnv. |
| When a given term uses a global definition, if the point of use has |
| the same PropEnv as another, both can share the part of the term that |
| is context independent, that is, the shared region. |
| |
| The actual components of the PropEnv of a term are slighly more |
| complicated than described above. |
| The table below gives the components of the PropEnv for the various |
| kinds of terms. Note that the "local properties" below includes any |
| dfdl:ref property if it appears, but equality of any property which |
| has as its value a QName, like dfdl:ref, the property value is based |
| on a resolved QName, not the QName string. |
| |
| [cols="2,6a"] |
| .PropEnv Components |
| |=== |
| |Term Definition (one PropEnv subtype per) | PropEnv Components |
| |
| |Local Element Decl with type reference |
| | |
| * Local properties expressed on the Local Element Decl (Set of property name + value pairs) |
| * Lexically enclosing default format (object ref) |
| * Type definition (object ref) or primitive type (object ref) |
| | Local Element Decl with anonymous Simple Type Definition |
| | |
| * Combined set of local properties expressed on the Local Element Decl and the anoymous Simple Type definition. |
| * Lexically enclosing default format (object ref) |
| * Base simple type definition (object ref), or primitive type (object ref) |
| |Local Element Decl with anonymous complex type definition |
| | |
| * Local properties expressed on the local element decl |
| * Lexically enclosing default format (object ref) |
| * Anonymous complex type (object ref) |
| |Element Reference to Global Element Decl |
| | |
| * Local properties expressed on the element reference |
| * Lexically enclosing default format (object ref) |
| * Global Element Decl (object ref) |
| |=== |
| |
| Lookups by PropEnv compare sets by equality of members, and object |
| references by pointer equality i.e, eq comparison. |
| |
| This definition of PropEnv can be improved by memoizing the |
| construction of default format objects, so that equivalent default |
| formats from different schema documents are represented by the same |
| object. |
| That is, so that multiple schema documents each containing: |
| ```xml |
| <dfdl:format ref="prefix:formatName"/> |
| ``` |
| are instantiated as the same object when the QName resolves to the same format object. |
| |
| ==== Details of Unique and Shared Regions |
| |
| The division of the reprsentation into the unique part which appears |
| before the shared parts indicates where we can share implementation, |
| and where we must have unique context-specific implementation for the |
| term. |
| |
| To understand this better, the below breaks down the sub-regions of a term further. |
| First we look at the details of the unique-before region: |
| |
| // |
| // DITAA max line length (for github) ruler -------------------------------------------------------------------------| |
| // |
| |
| [ditaa] |
| .... |
| +--+ +-----------------------------------------------+ +--+ |
| | | term | | |
| | | | | |
| | | +---------------+ +--------+ +--------------+ | | |
| ...| | | unique before | | shared | | shared after | | |... |
| | | +---------------+ +--------+ +--------------+ | | |
| +-+ +-----------------------------------------------+ +-+ |
| | | |
| | | |
| +---------------------------+ +----------------------------------+ |
| | | |
| v v |
| +------------------------------------------------------------------------------+ |
| | +-------------------+ unique before | |
| | | sequence before | | |
| | | +------+ +------+ | +-----+ +---------+ +------+ +------+ +------+ +-----+ | |
| | | |prefix| |prefix| | |lSkip| |alignFill| |init'r| |init'r| |prefix| +value| | |
| | | |infix | |infix | | | | | | | MTA | | | |length| | MTA | | |
| | | |sep'r | |sep'r | | | | | | | | | | | | | | | |
| | | | MTA | | | | | | | | | | | | | | | | | |
| | | +------+ +------+ | +-----+ +---------+ +------+ +------+ +------+ +-----+ | |
| | +-------------------+ | |
| +------------------------------------------------------------------------------+ |
| |
| .... |
| The sub-regions within the unique before region are: |
| |
| * prefix infix sep'r MTA - mandatory text alignment for a separator in infix or prefix position. |
| When a separator is defined, this region is populated with up to 7 bits to |
| obtain text alignment for the characters of the separator in the specified charset encoding. |
| Note that this encoding and that of the term's initator/terminator are not necessarily the same encoding. |
| * prefix infix sep'r - separator when defined and in prefix or infix position. |
| This consists of text characters. |
| |
| * lSkip - leading skip region. |
| This is fixed length (commonly 0) |
| * alignFill - alignment fill region. |
| This dynamically sized region depends on the bit position where it starts. |
| * init'r MTA - mandatory text alignment for initiator. |
| When an initiator is defined, this region is populated with up to 7 bits to |
| obtain text alignment for the characters of the initiator in specified charset encoding. |
| * init'r - initiator. |
| This consists of text characters. |
| * prefix length - (element terms only) the prefix length. |
| Sub-detail of this region is not shown here. |
| The PrefixLengthQuasiElementDecl class is used to represent this subregion in DSOM. |
| * value MTA - (element terms only, simple type only) mandatory text alignment for the value. |
| When the value has text representation this insures those characters start |
| on the right bit-boundary for characters of the specified charset encoding. |
| |
| The sizes of the alignment regions (alignFill, and the MTA regions) in |
| the above all depend on where the start of this term is. |
| That depends on where the prior term ends, if there is a prior term. |
| That is, the sizes of the alignment regions depend on the enclosing |
| model groups, and what can appear before this term in that nest. |
| |
| Now we look at an expansion of the shared regions: |
| |
| [ditaa] |
| .... |
| |
| |
| +--+ +-----------------------------------------------+ +--+ |
| | | term | | |
| | | | | |
| | | +---------------+ +--------+ +--------------+ | | |
| ...| | | unique before | | shared | | shared after | | |... |
| | | +---------------+ +--------+ +--------------+ | | |
| +-+ +-----------------------------------------------+ +-+ |
| | | |
| | | |
| +-------------------------------+ +-----------------+ |
| | | |
| | | |
| v v |
| +---------------------+ +---------------------------------------------------+ |
| |shared | | shared after +---------------------+ | |
| | | | | sequence after | | |
| | +--------+ +------+ | | +------+ +------+ +-----+ | +-------+ +-------+ | | |
| | | content| |elt or| | | |term'r| |term'r| |tSkip| | |postfix| |postfix| | | |
| | | or | |choice| | | | MTA | | | | | | | sep'r | | sep'r | | | |
| | | value | |unused| | | | | | | | | | | MTA | | | | | |
| | | | | | | | | | | | | | | | | | | | | |
| | +--------+ +------+ | | +------+ +------+ +-----+ | +-------+ +-------+ | | |
| | | | +---------------------+ | |
| +---------------------+ +---------------------------------------------------+ |
| |
| ^ |
| | |
| | |
| + |
| Known Alignment Point |
| |
| .... |
| |
| It is key that the shared region can assume the alignment |
| fill and MTA regions have been sized per the requirements of this |
| term. |
| Hence, the shared region is context independent and its implementation - in schema |
| compiler objects and in runtime structure representation, need only be represented once. |
| |
| .Corner Case: Interior Alignment |
| [IMPORTANT] |
| -- |
| The semantics here aren't identical to those of macro expansion, but the difference is hard to |
| detect. |
| |
| The https://opendfdl.github.io/gwdrp-dfdl-v1.0.5_r11/gwdrp-dfdl-v1.0.5-r11-change-tracking.htm[DFDL Spec (recent draft)] |
| includes a link:https://opendfdl.github.io/gwdrp-dfdl-v1.0.5_r11/gwdrp-dfdl-v1.0.5-r11-change-tracking.htm#_Toc27061169[section (23.6 in the linked draft)] on |
| this |
| // Hmmm. just writing _interior alignment problem_ didn't work. This passthrough to HTML will work for sure. |
| pass:[<i>interior alignment problem</i>]. |
| |
| The technique for approximating alignment info described here is slighly less precise in its |
| analysis because it resets its knowledge of the alignment at each known alignment point. |
| Whereas pure macro expansion could carry forward knowledge of alignment and perhaps optimize out more |
| alignment fill regions. |
| |
| Hence, it's possible that the technique here will leave some alignment fill regions in place and thereby some |
| formats will experience circular deadlocks when unparsing due to this variable-length interior-alignment. |
| |
| The Daffodil test suite includes examples of this interior alignment that cause unparser circular deadlocks. |
| (see testOutputValueCalcVariableLengthThenAlignmentDeadlock) |
| |
| -- |
| |
| The shared region contains a text or simple binary value, a nil representation, or inductively, a sequence of terms. |
| When the term is an element of complex type, an element unused region can follow (depending on the length kind), |
| and if the term is a choice with dfdl:choiceLengthKind='explicit' then a choice unused region can follow. |
| |
| Then, after the shared part, the shared-after region has these subregions: |
| |
| * term'r MTA - mandatory text alignment for the terminator. |
| * term'r - terminator |
| * tSkip - trailing skip region. This is fixed length (commonly 0) |
| * postfix sep'r MTA - - mandatory text alignment for a separator in postfix position. |
| When a separator is defined, this region is populated with up to 7 bits to |
| obtain text alignment for the characters of the separator in the specified charset encoding. |
| Note that this encoding and that of the term's initator/terminator are not necessarily the same encoding. |
| * postfix sep'r - separator when defined and in postfix position. |
| |
| The terminator MTA region varies in size based on its starting |
| alignment, and so it depends on the size of the shared region. |
| |
| The postfix separator MTA region similarly varies in size based on where the |
| tSkip region ended, and so it depends on the size of the shared region, and |
| the sizes of the terminator MTA region, terminator, and tSkip regions. |
| |
| NOTE: The shared-after region does not have any sub-regions the size |
| of which depend on the context. |
| However, we separate it from the central shared region in order to |
| separate all concerns around alignment and fill into separate regions from the |
| central shared region, which could perhaps be called the shared |
| content region. |
| The central shared region therefore does not deal with alignment at all. |
| |
| The central idea here is that all the regions to the left (before) the known alignment point are context dependent. |
| That is, whether these alignment fill and mandatory text alignment regions can be statically determined |
| in size is dependent on where the prior term ended. |
| |
| In contrast, the known alignment point is a fresh start for alignment. The alignment as required by the alignment properties |
| will have been achieved at that point. Hence, from there and to the right (after), everything can be assessed relative to |
| this new fresh anchor. |
| |
| Inductively, even the context dependent regions are not dependent on terms very far back. |
| Let's consider the regions in a straddle of two adjacent terms in a sequence: |
| |
| // |
| // DITAA max line length (for github) ruler -------------------------------------------------------------------------| |
| // |
| [ditaa] |
| .... |
| +----------------------------------------------+ +------------------------------------------------ |
| Term 1 | | Term 2 |
| +-------------------------+ | | +-------------------------------------+ |
| +---+ +---------+ |shared after | | | |unique before | +---+ |
| | | shared | | | | | | | |shared |
| | | | | +---------+ | | | | +---------+ | | |
| | | +-+ +-+ | | +-+ +-+ +-+ |sequence | | | | | |sequence | +-+ +-+ +-+ +-+ +-+ +-+ | | |
| | | | | | | | | | | | | | | | after | | | | | | before | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | +-+ +-+ | | | | | | +-+ +-+ | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
| ... | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ... |
| | | +-+ +-+ | | +-+ +-+ +-+ | | | | | | | | | | | | | | | | +-+ +-+ +-+ +-+ +-+ +-+ | | |
| | | | | | +-+ +-+ | | | | | | +-+ +-+ | | | |
| +---+ +---------+ | +---------+ | | | | +---------+ | +---+ |
| ^ +-------------------------+ | | +-------------------------------------+ ^ |
| | | | | |
| +----------------------------------------------+ +------------------------------------------------- |
| | | |
| | | |
| | | |
| | | |
| + + |
| Known Alignment Point 1 Known Alignment Point 2 |
| |
| .... |
| |
| The Known Alignment Point 2 is only context dependent on the possible prior terms back to Known Alignment Point 1. |
| This is true certaintly when Term 1 is a scalar element or a model-group itself. |
| |
| In reality the induction here is more complex because Term 1 might be an optional element or an array, in which case the prior term before Term 1 may also be involved |
| in computing the region sizes for the Term 2 unique-before regions. |
| But this never needs to consider anything further back than the prior scalar term, or the start of the shared region of the enclosing model group, which need |
| only be considered if Term 2 can be first within its enclosing model group. |
| |
| CAUTION: Separators may be present even if terms are not, based on dfdl:separatorSuppressionPolicy and a few other properties. |
| The diagrams show sequence before region as if it was part of the term, but really it may appear even if the term is absent. |
| |
| |
| |
| |
| |
| Given the above, for any term, we can compute its PropEnv, and create |
| a distinct shared instance for that term for each unique PropEnv. |
| |
| Since the shared region contains all the children of any |
| child-containing term, sharing this insures that the schema |
| compilation space/speed is proportional to, roughly, the size of the |
| schema without any non-constant multiplicative factor. |
| This insures no combinatorial explosion of compilation time. |
| |
| === Optimizing Alignment Regions |
| |
| A primary task of the schema compiler is to eliminate alignment fill |
| (and mandatory text alignment aka MTA) regions that are unnecessary |
| because the data can be proven to always be properly aligned. |
| |
| This is done by computing, for each region, a compile-time alignment approximation. |
| The AlignmentMultipleOf class represents this information. |
| AlignmentMultipleOf objects form a mathematical lattice where perfect |
| alignment is the bottom of the lattice, no alignment (or alignment |
| multiple of 1 bit) is the top of the lattice, and various multiples |
| (8, 32, etc.) live in between. For two lattice values a and b, we say |
| a < b (a is 'weaker than' b) if a is a multiple of b. |
| |
| A key observation is this: Each time we create an alignment constraint |
| for a shared region, this is not dependent on any other constraint. |
| That is, the constraints on alignment do not propagate from term to term. |
| A shared region can assume it starts at the alignment that is required |
| for the term based on the dfdl:alignment property, the initiator MTA |
| if there is an initiator, and the value MTA for text-representation |
| simple values. |
| |
| Hence, terms that appear down-stream of any shared region do *not* |
| depend on any constraints propagating from before the shared region. |
| This greatly reduces the complexity and the distance across the schema |
| that constraint-changes propagate. |
| It also insures there is no need for an iterative contraint solver, since nothing |
| about alignment propagates from one term to the next. |
| Terms are separated by the fact that the shared region of a term starts from a |
| known alignment (specified by its alignment and alignment units |
| property). |
| |
| |
| === Other Context-Dependent Computations |
| |
| By dividing up the representation of a term into the unique context-dependent |
| and shared context-indepenent parts explicitly we provide a home for the various |
| calculations the schema compiler performs in order to do other optimizations. |
| |
| |
| [plantuml, target="unshared-shared", format="png"] |
| .... |
| hide empty members |
| |
| class UnsharedTerm { |
| |
| isAlignmentFillNeeded: Boolean |
| isInitiatorMTANeeded: Boolean |
| isValueMTANeeded:Boolean |
| |
| isPrefixInfixSeparatorMTANeeded: Boolean |
| |
| isPotentiallyTrailing: Boolean |
| |
| needsBitOrderChange: Boolean |
| |
| isScanningForDelimiter: Boolean |
| isScanningForTerminator: Boolean |
| |
| initiatedContentCheck: Unit |
| |
| } |
| |
| class SharedTerm { |
| isTerminatorMTANeeded: Boolean |
| isPostfixSeparatorMTANeeded: Boolean |
| // along with everything else |
| |
| } |
| |
| UnsharedTerm -right-> SharedTerm : shared |
| |
| .... |
| |
| In the above class diagram, members that must be computed on the unshared term object are |
| shown in the class. |
| This set of things should always be minimized. |
| That is, as few things should be computed on unshared term objects as posible. |
| It is assumed that everything else is computed on the shared term object, but members |
| specifically about regions discussed in sections above are called out. |
| |
| The UnsharedTerm object is effectively state of the enclosing model group DSOM object. |
| It exists in 1 to 1 correspondence (aka it is "owned" by the enclosing model group object.) |
| |
| In the Runtime 1 backend, a processor (parser/unparser) is generated for the shared term object, |
| and that processor is referenced from the processor generated for the unshared term object. |
| |