site/dev/design-notes/infoset-inputter-streaming-unparser.adoc - daffodil-site - Git at Google

 :page-layout: page
 :keywords: unparser InfosetInputter NextElementResolver Streaming
 // ///////////////////////////////////////////////////////////////////////////
 //
 // This file is written in AsciiDoc.
 //
 // If you can read this comment, your browser is not rendering asciidoc automatically.
 //
 // You need to install the asciidoc plugin to Chrome or Firefox
 // so that this page will be properly rendered for your viewing pleasure.
 //
 // You can get the plugins by searching the web for 'asciidoc plugin'
 //
 // You will want to change plugin settings to enable diagrams (they're off by default.)
 //
 // You need to view this page with Chrome or Firefox.
 //
 // ///////////////////////////////////////////////////////////////////////////
 //
 // When editing, please start each sentence on a new line.
 // See https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one sentence-per-line writing technique.]
 // This makes textual diffs of this file useful in a similar way to the way they work for code.
 //
 // //////////////////////////////////////////////////////////////////////////

 == Infoset Inputters and Runtime 1 Streaming Unparser & NextElementResolver

 == Introduction

 Describes the way the streaming unparser in Runtime 1 works, in particular how
 the unparser resolves the next element - identifying which ERD (Element Runtime Data)
 corresponds to the incoming infoset event.

 == Design Principles

 Daffodil Runtime 1 keeps track of the nest of dynamic context terms at runtime on a stack which
 is maintained by the Unparsers.

 === Schema Compilation

 At schema compilation time, we compute, for each Term, the set of possible next
 elements that can follow it, but only as far as the end of the _lexically
 enclosing_ model group.

 This is done in the possibleNextLexicalSiblingElementsInInfoset member of
 the Term class.

 From this information, we compute a element resolver that handles only this
 partial resolution, which is an instance of class PartialNextElementResolver.
 This is associated with every Term and carried fro runtime on the TermRuntimeData
 objects.

 This is done in the partialNextElementResolver member of the TermRuntime1Mixin,
 as the PartialNextElementResolver is a Runtime 1 data structure/object.
 This object is serialized as part of the TermRuntimeData.

 The fact that this is computed using only information going out to the end of the
 single lexically enclosing model-group avoids a combinatorial
 explosion in the schema compiler.

 === Unparser Runtime

 At runtime for the Unparser, element resolving is done by the InfosetInputter.

 Complete/total element name resolution is done by combining the partial
 next element resolvers from the dynamic nest of terms.
 This is done with a stack in the UState.
 Specifically, the stack, called `trdStack` is state of the InfosetInputter.
 Unparsers are burdened with having to push/pop TermRuntimeData objects to maintain this stack.

 The algorithm which iterates down the stack entries to carry out a next element
 resolution for the InfosetInputter, is carried on the NextElementResolver trait,
 which is mixed into the InfosetInputter trait, so is shared by all InfosetInputter
 derived classes.
 Note that no state is shared.
 The stack is private to a given instance of an InfosetInputter class,
 so is private per-thread state just like the rest of the UState.

 This design incurs runtime overhead to push/pop TRDs on the stack,
 and next element resolution is no longer just a single hash-table lookup, but
 potentially an iteration through multiple such lookups.

 === Schema Compiler Support for Dynamic Context Stack

 Now we focus in on just Term, ModelGroup, and ElementBase.
 The diagram below shows the Term class, with the possibleNextLexicalSiblingElementsInInfoset
 member.
 All Terms have as a member their (1-based) position within the enclosing
 lexical model group.

 The set of possible elements that can follow a given Term can be closed,
 as when a required element is found downstream. If all downstream elements
 in the enclosing lexical model group are optional, then the set of possible
 elements is open ended. The Open/Closed status of the list is
 signified by a PossibleNextElements case class object correspondingly.

 One final detail is that when compiling the set of possible next elements for
 a choice, there can be elements on branches that are required, but in
 the context of that choice, if any branch has all optional content, then
 all the possible next elements for all branches are optional.

 The Term itself, if it is an ElementBase, has a isRequiredInInfoset member, which is determined
 based on properties.
 When combined in a choice, we need a way to override
 this, and we encapsulate the ElementBase in a Sibling case class object which
 carries its own isRequiredInInfoset member which can be set to false to
 override the underlying ElementBase's isRequiredInInfoset due to that
 corresponding element appearing in a choice branch.

 [plantuml, format="png"]
 ....
 hide empty members

 package core <<rectangle>> {

 class TermRuntime1Mixin << trait >> {
   partialNextElementResolver : PartialNextElementResolver
   termRuntimeData : TermRuntimeData
 }

 class TermGrammarMixin << trait >>

 TermGrammarMixin -up-|> TermRuntime1Mixin

 abstract class Term {
    {field} position : Int
    {field} possibleNextLexicalSiblingElementsInInfoset : PossibleNextElements
 }
 Term -up-|> TermGrammarMixin

 abstract class ModelGroup

 Term <|-- ElementBase
 Term <|-- ModelGroup


 ModelGroup "1" --> "*" Term : group members
 ModelGroup "0..1" <-- "1" ElementBase : model group

 class Sibling {
   e: ElementBase
   isRequiredInInfoset : Boolean
 }

 abstract class PossibleNextElements {
   sibs : Seq[Sibling]
 }

 Term "1" *-- "1" PossibleNextElements : possibles >
 PossibleNextElements <|-- Closed
 PossibleNextElements <|-- Open
 Sibling "*" <-right- "1" PossibleNextElements : sibs <


 ElementBase "1" <-left- "1" Sibling : e <

 }

 package runtime_1 <<rectangle>> {

 abstract class TermRuntimeData{
   partialNextElementResolver : PartialNextElementResolver
 }

 abstract class PartialNextElementResolver {
   maybeNextElement : Maybe[ElementRuntimeData]
 }
 TermRuntimeData -down-> PartialNextElementResolver

 TermRuntimeData <-right- TermRuntime1Mixin : creates <

 PossibleNextElements .right. PartialNextElementResolver : compiles into >

 }

 ....

 The set of possible subsequent ElementBase objects that can
 follow another is computed inductively starting from the last element
 or group within a sequence.
 It is limited to the length of the current sequence, since sequences can be
 shared.

 (Note that unshared i.e., local sequences could in principle be
 special cased - collapsed together - for purposes of this analysis.
 We assume here that we will treat all sequence groups as if they were
 shared via group refs.)

 === Runtime Support for Dynamic Context Stack

 Below we show the runtime 1 objects that implement the next element resolution
 for the streaming unparser.

 The Unparser has a UState, which has an InfosetInputter.
 The InfosetInputter trait mixes in the implementation of NextElementResolver
 which provides the push/pop API for maintaining the TRD stack, and it
 provides the nextElement method which does complete next element resolving, chaining
 through the PartialNextElementResolvers obtained from the stack of TRDs.

 [plantuml, format="png"]
 ....
 hide empty members

 package runtime_1 <<rectangle>> {

 abstract class PartialNextElementResolver {
   maybeNextElement : Maybe[ElementRuntimeData]
 }

 abstract class NextElementResolver <<trait>> {
    def nextElement(name, ns, useNS) : ElementRuntimeData
    def pushTRD(trd)
    def popTRD()
 }

 NextElementResolver -right-> TermRuntimeData : has stack of

 abstract class InfosetInputter


 abstract class TermRuntimeData

 TermRuntimeData *-down- "1" PartialNextElementResolver : has

 abstract class UStateMain{
   pushTRD(trd)
   popTRD()
 }

 abstract class Unparser {
   pushTRD(trd)
   popTRD()
 }

 Unparser -down-> UStateMain : push/pop TRDs >

 UStateMain -right-> InfosetInputter : push/pop TRDs >

 InfosetInputter -up-|> NextElementResolver : is a


 }
 ....

 Any given PartialNextElementResolver can resolve a name (+ namespace when used)
 to an ERD, or it can not be able to perform the resolution, which is not
 an error.

 As a whole the next-element resoluiton algorithm requires that the unparser
 maintain the stack of TermRuntimeData objects and the resolution
 algorithm works down the stack using the partial resolver from the most-
 recently pushed runtime data object first.
 If that does not resolve the element it either fails - because the partial
 resolver knows that the incoming element event *must* be one of the possibilities
 it represents (Closed set of possibles), or it moves to the next deeper runtime-data
 object on the stack having a partial next element resolver (for Open set of possibles),
 and tries that.
 This continues until it succeeds, or until an ERD is found on the stack, at
 which point the resolution fails and an unparse error (fatal) is issued.

 Note that the context of next-element-resolution cannot span the boundary of a
 complex element.
 This is because an end-element event must be received before any subsequent
 element start events.

 This dynamic-context TRD stack need not be copied to UStates for Suspensions as
 those only occur after Infoset elements have been created.

 === Some interactions, or non-interactions with other aspects of the Unparser

 * Orthogonal to suspensions - next element resolution is over before a
   suspension is constructed for the element.

 ==== Interaction with ChoiceBranchMap

 Next element resolving for the infoset inputter is orthogonal to
 choiceBranchMap - next element resolution must be done
 first, and the result of it is used by the choiceBranchMap to choose
 which branch of the choice is implied by arrival of that element.

 === Testing and Design for Test

 * Schema Compiler
 ** Unit tests in scala code for schema compiler methods that compute the _possibles_ object.
  These should cover various combinations of sequences/choices with required
  and optional elements and groups shared using group references.
 ** Instrumentation in the schema compiler to measure the amount of work going
  on so that combinatorial explosions are detected earlier.
  This should output a report of the metrics at the end of compilation, or perhaps
  incrementally as compilation proceeds.

 * Runtime 1
 ** Faking TRDs so as to test XML InfosetOutputters in unit testing. Only the
  partialNextElementResolver member of the TRD is needed.
  It may be possible to use
  the schema compiler to create the appropriate scenario of TRDs and ERDs and
  then the InfosetOutputter push(trd) method can be called to create a stack
  that matches a test scenario. This allows testing the nextElement resolver
  algorithm in isolation from the push/pop logic that Unparsers must maintain.
 ** Unparsers must push/pop appropriately. Possibly built in checking can be
  put in place that detects when a pop is missing? This is TBD as the push/pop
  may or may not be able to be centralized in the code base.


 === Summary of Features

 * Dynamic context stack (stack of RuntimeData) in UState
 * Unparsers maintain dynamic context stack, pushing and popping TRDs as model-groups are processed.
 * Stack algorithm for NextElementResolver runs down stack trying partial resolutions
 * Algorithms on Term to determine sets of possibly following infoset events.

 == Appendix: Review of DSOM Design

 === DSOM Term Classes

 In Daffodil, DSOM now enables sharing of ModelGroup members, which are instances
 of classes derived from Term.

 The Term class is central to Daffodil because Terms are the entities that can
 can resolve scoped properties by combining those from an element ref or group ref, to
 and element decl or group def, from element decl to type def, etc.
 Term is the start point for the chains of non-default property providers,
 and for the chains of default property providers, from which scoped properties are resolved.

 For review, a DSOM class diagram showing the Term class hierarchy is below.

 [plantuml, format="png"]
 ....
 hide empty members
 abstract class Term

 together {
   abstract class SequenceTermBase
   abstract class ChoiceTermBase
   abstract class ElementBase
 }

 abstract class ModelGroup

 Term <|-- ElementBase
 Term <|-- ModelGroup
 ModelGroup <|-- SequenceTermBase
 ModelGroup <|-- ChoiceTermBase

 abstract class LocalElementDeclBase
 ElementBase <|-- LocalElementDeclBase
 LocalElementDeclBase <|-- LocalElementDecl
 abstract class AbstractElementRef
 ElementBase <|-- AbstractElementRef
 AbstractElementRef <|-- Root
 AbstractElementRef <|-- ElementRef

 together {
   class LocalElementDecl
   class ElementRef

   class Sequence
   class SequenceGroupRef
   class Choice
   class ChoiceGroupRef
 }
 together {
   class Root
   class PrefixLengthQuasiElementDecl
   class RepTypeQuasiElementDecl
   class ChoiceBranchImpliedSequence
 }
 abstract class QuasiElementDeclBase
 LocalElementDeclBase <|-- QuasiElementDeclBase
 QuasiElementDeclBase <|-- PrefixLengthQuasiElementDecl
 QuasiElementDeclBase <|-- RepTypeQuasiElementDecl

 abstract class SequenceGroupTermBase
 SequenceTermBase <|-- SequenceGroupTermBase
 SequenceTermBase <|-- ChoiceBranchImpliedSequence
 SequenceGroupTermBase <|-- Sequence
 SequenceGroupTermBase <|-- SequenceGroupRef

 abstract class ChoiceGroupTermBase
 ChoiceTermBase <|-- ChoiceGroupTermBase
 ChoiceGroupTermBase <|-- Choice
 ChoiceGroupTermBase <|-- ChoiceGroupRef

 ModelGroup "1" ..> "*" Term : group members
 ModelGroup "0..1" <-- "1" ElementBase : model group

 ....

 The above reflects the status quo of objects.
 The concrete classes are all at the bottom (marked with C).

 Most of organizing principles depend on just the core abstract classes.
 Here's the same diagram, but focusing in on just the core abstract classes.

 [plantuml, format="png"]
 ....
 hide empty members
 abstract class Term

 together {
   abstract class SequenceTermBase
   abstract class ChoiceTermBase
   abstract class ElementBase
 }

 abstract class ModelGroup
 abstract class GroupDefLike <<trait>>
 abstract class GroupRef <<trait>>

 Term <|-- ElementBase
 Term <|-up- ModelGroup
 ModelGroup <|-- SequenceTermBase
 ModelGroup <|-- ChoiceTermBase
 GroupDefLike <|.. ModelGroup : some are
 GroupRef <|.. ModelGroup : some are
 GroupRef -left-> GroupDefLike : group ref

 GroupDefLike "1" -up-> "*" Term : group members
 ModelGroup "0..1" <-- "1" ElementBase : model group

 ....

 The dashed lines from ModelGroup are to illustrate that some ModelGroup classes
 inherit from GroupDefLike, as they define lexically, a surround of their group
 members.
 Other ModelGroup classes inherit GroupRef (SequenceGroupRef and ChoiceGroupRef)
 which share the group definition.
 Ignoring the Root element for a moment, every Term has a lexically enclosing
 ModelGroup which inherits from GroupDefLike. This is the mechanism by which
 ModelGroups and their contents can be shared in multiple distinct contexts.

 With those classes in mind, we can look into the changes to Daffodil
 to support a _dynamic context stack_.

 CAUTION: Some code in the schema compiler has moved around to prepare for an era where
 Daffodil supports more than one backend/runtime system.
 See the <<Appendix: Schema Compiler grammar and runtime1 Packages,Appendix: Schema Compiler grammar and runtime1 Packages>>


 == Appendix: Schema Compiler grammar and runtime1 Packages

 The basic DSOM object classes are composed by mixin with components coming
 from different packages that separate concerns.

 Some basic non-function-changing refactoring begins the process of breaking out
 runtime1-specific code and separating that from code that will be in common
 to all runtime backends.

 The primary packages in the schema compiler (daffodil-core) are:

 * dsom - The Daffodil Schema Object Model - object corresponding to the parts
 of a DFDL schema. Handles properties, property scoping, include/import,
 namespaces, and all XML-schema-related functionality.
 * grammar - Introduces combinators and primitives of the data syntax grammar.
 Supposed to contain all backend/runtime independent logic.
 * runtime1 - Logic (mixins, methods, members, classes, etc.) specific to the
 _Runtime 1_ backend.
 For example, this includes calculations that create the data structures
 needed to support streaming unparsing, a complex behavior that other Daffodil
 runtime backends are unlikely to implement.

 Some primary classes and mixins from these packages are shown here:

 [plantuml, format="png"]
 ....
 hide empty members
 package core <<rectangle>> {

 package grammar {

 abstract class Gram <<trait>> {
 }

 abstract class TermGrammarMixin <<trait>> {
   def gram : Gram
   }
 abstract class ModelGroupGrammarMixin <<trait>>
 abstract class ElementBaseGrammarMixin <<trait>>
 abstract class SequenceGrammarMixin <<trait>>
 abstract class ChoiceGrammarMixin <<trait>>
 }

 note bottom of grammar
   Contains Combinators and Primitives that
   represent the grammar regions that make up
   the data syntax grammar.

   Optimizes this in a runtime-backend independent
   manner.
 end note

 package runtime1 {

 abstract class GramRuntime1Mixin <<trait>> {
   def parser(): Parser
   def unparser(): Unparser
 }
 Gram -up-|> GramRuntime1Mixin
 abstract class TermRuntime1Mixin <<trait>> {
   def termRuntimeData
 }
 TermGrammarMixin -up-> TermRuntime1Mixin
 abstract class ModelGroupRuntime1Mixin <<trait>> {
   def termRuntimeData = modelGroupRuntimeData
   def modelGroupRuntimeData
 }
 ModelGroupGrammarMixin -up-> ModelGroupRuntime1Mixin
 abstract class ElementBaseRuntime1Mixin <<trait>> {
   def termRuntimeData = elementRuntimeData
   lazy val elementRuntimeData
 }
 ElementBaseGrammarMixin -up-> ElementBaseRuntime1Mixin
 abstract class SequenceRuntime1Mixin <<trait>> {
   def modelGroupRuntimeData = sequenceRuntimeData
   lazy val sequenceRuntimeData
 }
 SequenceGrammarMixin -up-> SequenceRuntime1Mixin
 abstract class ChoiceRuntime1Mixin <<trait>> {
   def modelGroupRuntimeData = choiceRuntimeData
   lazy val choiceRuntimeData
 }
 ChoiceGrammarMixin -up-> ChoiceRuntime1Mixin

 }

 note bottom of runtime1
   Contains mixins with methods and members
   specific to the Runtime 1 backend.
 end note

 package dsom {
 abstract class Term
 Term -up-|> TermGrammarMixin
 Term -up-> Gram : gram

 together {
   abstract class SequenceTermBase
   abstract class ChoiceTermBase
   abstract class ElementBase
 }

 abstract class ModelGroup
 ModelGroup -up-|> ModelGroupGrammarMixin
 ElementBase -up-|> ElementBaseGrammarMixin
 SequenceTermBase -up-> SequenceGrammarMixin
 ChoiceTermBase -up-|> ChoiceGrammarMixin

 Term <|-- ElementBase
 Term <|-- ModelGroup
 ModelGroup <|-- SequenceTermBase
 ModelGroup <|-- ChoiceTermBase


 ModelGroup "1" --> "*" Term : group members
 ModelGroup "0..1" <-- "1" ElementBase : model group
 }
 }
 ....

 CAUTION: The grammar.primitives package has not yet been refactored to break
 runtime1-specific methods out into runtime1-package mixins.
 Each Primitive or
 Combinator should mixin a trait for each runtime.

 CAUTION: DPath expressions are another whole area that are not as yet
 refactored to separate specific runtime 1 functionality from general
 functionality.

 CAUTION: Eventually this mixin of runtime should be unmixed and converted
 into a delegation structure so that one can choose a runtime instead of
 having all runtimes mixed in. E.g., today every runtime has to use
 different method names.

 == Appendix: An Aside about a Recursive Future

 In a future version of DFDL we plan to allow recursive definitions.
 It is theoretically possible to define a recursive structure using
 only reusable groups:

     <xs:element name="r">
       <xs:complexType>
         <xs:group ref="g"/>
       </xs:complexType>
     </xs:eleemnt>

     <xs:group name="g">
       <xs:choice>
         <xs:sequence>
           <xs:sequence>
           <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
             <dfdl:assert testKind="pattern" testPattern="."/><!-- there is more data -->
           </xs:appinfo></xs:annotation>
           </xs:sequence>
           <xs:element name="x" type="xs:string" dfdl:length="1" dfdl:lengthKind="explicit"/>
           <xs:group ref="g"/>
         </xs:sequence>
         <xs:sequence/>
      </xs:choice>
     <xs:group>

 The above example simulates an array using group recursion.

 CAUTION: It is not clear that even if DFDL allows recursion it would allow it on
 groups alone.
 Requiring recursion to utilize elements would seem to be consistent with DFDL's
 current restriction which requires repeating/optional things to be elements.

 CAUTION: Furthermore, it is not clear if the above is allowed in XML Schema.
 XML Schema's UPA rules may in fact require elements to be used in the
 formulation of recursive structure.

 Recursion requires evolution of the current Daffodil architecture to one that
 enables sharing of definitions of groups, elements and types.

 == Notes on Rationale and Evolution from Earlier Design Points

 This section describes the implementation prior to the fix to Daffodil-2192, and
 how it evolved. The changes from the prior design were significant.

 The code design is somewhat simpler in that prior code has two different kinds of
 next element resolvers, one for next-siblings, and one for first children of
 an element of complex type.
 This is simplified in the current design. There is only one PartialNextElementResolver
 per TermRuntimeData object.

 Note that the prior NextElementResolver class went away and
 its functionality appears in the PartialNextElementResolver to
 reflect that the class itself implements only
 part of the algorithm.

 === Prior Behavior (Flawed) Part 1

 Currently, the InfosetInputter architecture is dependent on being able to
 take an element name (+ namespace when used), have as a context a current
 element ElementRuntimeData object, and then given a hash-table which is
 stored on the element ERD, lookup the name and namespace (when used) to
 get the next element ERD.
 This lookup hash table is encapsulated in the NextElementResolver classes.
 At schema compile time, given each element ERD, it is precomputed what the
 possible valid set of subsequent following ERDs can be (including the current
 one again, in case of an array).
 This set of possibilities is captured in an instance of the NextElementResolver
  class, and is stored on each ERD.

 This process makes unparsing a self-validating kind of processing.
 As an external representation of the Infoset is converted into an
 actual stream of Infoset events, each incoming element event is
 scrutinized against the schema for whether it is allowed.
 This scrutiny is built into the fact that the incoming element
 name+namespace is being looked up in the NextElementResolver, and then
 once each element is created, a new NextElementResolver object is
 taken from its ERD for the next infoset event lookup.

 An invariant (which remains true in current behavior and in proposed new
 changes) is that the next element resolution process cannot cross a complex
 -type element boundary, as an end-element event ends the scope of the possible
 legal subsequent elements that can arrive next.
 However, within a complex type, there can be many groups and group references.

 This architecture depends on the fact that the ElementRuntimeData (ERD) objects
 are unique per element, and per element context - that is, they are depending
 on the fact that element ERDs are not shared and so they uniquely identify both
 the element, and the context of that element (nest of all enclosing groups).

 As part of the changes to fix our schema compiler speed/space issues
 (DAFFODIL-1444), this assumption is no longer valid.
 Elements (specifically DSOM ElementBase objects) can be shared. Groups
 can be shared.
 The expected improvement in schema compiler performance and the space
 reduction is fundamentally driven from this sharing.
 This sharing is also needed to implement a desirable future feature
 which is to enable recursive definitions in DFDL.

 Ultimately, the required change is from one where an ERD is a unique
 identifier of an element and all its dynamic context, to one where
 because ERDs are shared, the Unparser runtime must maintain a stack of
 the dynamic context information sufficient to perform next-element
 resolution as each infoset event must be synthesized.

 See also <<Appendix: An Aside about a Recursive Future,Appendix: An Aside about a Recursive Future>>

 === Prior Behavior (Flawed) Part 2

 As part of changes to _eventually_ improve schema compiler space and speed usage,
 (DAFFODIL-1444) we introduced sharing information to the DSOM objects in the schema.

 The sharing information is present - every global schema component
 object can examine every context in the schema from which it is shared.

 This unfortunately creates a situation where we get a combinatorial explosion
 of computation.
 The sharable objects, such as a group definition, are still today NOT being
 shared, they are being copied.
 Yet within each copy, each element in it behaves as if it was being shared.
 So, for example, if you ask the last element of a group definition what possible
 elements can follow it, it responds with a list of every element that can
 follow for every context where the group is referenced.
 What's worse, this same thing is calculated repeatedly since the sharable
 group is, in fact, not being shared currently so there are many copies of
 this final element to do this computation on.

 The result is something which grows in algorithmic complexity very fast.
 If you have a very small schema or wait long enough, then compilation of the
 schema should complete and be correct.
 Hence, most all small tests work fine.
 Larger schemas fail to finish compiling in any reasonable amount of time or
 memory space.