site/dev/design-notes/runtime2-todos.adoc - daffodil-site - Git at Google

 :page-layout: page
 :keywords: dfdl-to-c backend code-generator runtime2
 // ///////////////////////////////////////////////////////////////////////////
 //
 // This file is written in https://asciidoctor.org/docs/what-is-asciidoc/[AsciiDoc]
 // with https://rhodesmill.org/brandon/2012/one-sentence-per-line/[semantic linefeeds].
 //
 // When editing, please start each sentence on a new line.
 // This makes textual diffs of this file useful
 // in a similar way to the way they work for code.
 //
 // //////////////////////////////////////////////////////////////////////////

 == Runtime2 ToDos

 === Overview

 We have built an initial DFDL-to-C backend
 and code generator for Apache Daffodil.
 Currently the C code generator can support
 binary boolean, integer, and real numbers,
 arrays of simple and complex elements,
 choice groups using dispatch/branch keys,
 validation of "fixed" attributes,
 and padding of explicit length complex elements with fill bytes.
 We plan to continue building out the C code generator
 until it supports a minimal subset of the DFDL 1.0 specification
 for embedded devices.

 We are using this document
 to keep track of some changes
 requested by reviewers
 so we don't forget to make these changes.
 If someone wants to help
 (which would be appreciated),
 please let the mailto:dev@daffodil.apache.org[dev] list know
 in order to avoid duplication.

 === Remove dependence on argp and gcc

 Update BUILD.md and the github actions to use clang instead of gcc.
 Will have to find the correct steps on Fedora / Linux / Windows
 and check they work (e.g., install clang-10 package on Ubuntu
 and set CC=clang-10 and AR=llvm-ar-10 environment variables).

 Likewise, replace argp calls with getopt calls
 in daffodil_main.c and daffodil_argp.c
 even though it means giving up long options
 and other nice features that argp has.

 === Report hanging problem running sbt (really dev.dirs) from MSYS2 on Windows

 We need to open a issue with a reproducible test case
 in the dev.dirs/directories-jvm project on GitHub.
 Note that dev.dirs exhibits the problem
 but they may or may not be responsible for it.
 Their code which tries to run a Windows PowerShell script
 using a Java subprocess call hangs
 when run from MSYS2 on Windows
 although it works fine when run from CMD on Windows.
 Then we need to wait until
 the hanging problem is fixed in the directories library,
 coursier picks up the new directories version,
 sbt picks up the new coursier version,
 and daffodil picks up the new sbt version,
 before we can remove the "echo >> $GITHUB_ENV" lines
 from .github/workflows/main.yml.

 === Reporting errors using structs, not strings

 We have replaced error message strings
 with error structs everywhere now.
 However, we may need to expand the error struct
 to include a pointer (pstate/ustate for data position)
 and another pointer (ERD or static context object
 for schema filename/line number).

 We also may want to implement error logging variants
 that both do and don't humanize the errors,
 e.g., a hardware/FPGA-type implementation might just output numbers
 and an external tool might have to "humanize" these numbers
 using knowledge of the schema and runtime data objects,
 like an offline log processor does.

 === Recovering after errors

 As we continue to build out runtime2,
 we may need to distinguish more types of errors
 and allow backtracking and retrying.
 Right now we handle only parse/unparse and
 validation errors in limited ways.
 Parse/unparse errors abort the parsing/unparsing
 and return to the caller immediately
 without resetting the stream's position.
 Validation errors are collected in an array
 and printed after parsing or unparsing.
 The only places where there are calls to stop the program
 are in daffodil_main.c (top-level error handling)
 and stack.c (empty, overflow, underflow errors which should never happen).

 Most of the parse functions set pstate->error
 only if they couldn't read data into their buffer
 due to an I/O error or EOF,
 which doesn't seem recoverable to me.
 Likewise, the unparse functions set ustate->error
 only if they couldn't write data from their buffer
 due to an I/O error, which doesn't seem recoverable to me.

 Only the parse_endian_bool functions set pstate->error
 if they read an integer which doesn't match either true_rep or false_rep
 when an exact match to either is required.
 If we decide to implement backtracking and retrying,
 they should call fseek to reset the stream's position
 back to where they started reading the integer
 before they return to their callers.
 Right now all parse calls are followed by
 if statements to check for error and return immediately.
 The code generator would have to generate code
 which can advance the stream's position by some byte(s)
 and try the parse call again as an attempt
 to resynchronize with a correct data stream
 after a bunch of failures.

 Note that we actually run the generated code in an embedded processor
 and call our own fread/frwrite functions
 which replace the stdio fread/fwrite functions
 since the C code runs bare metal without OS functions.
 We can implement fseek but we should have a good use case.

 === Javadoc-like tool for C code

 We should consider adopting one of the javadoc-like tools for C code
 and structuring our comments that way.

 === DSOM "fixed" getter

 Note: If we change runtime1 to validate "fixed" values
 like runtime2 does, then we can resolve
 https://issues.apache.org/jira/browse/DAFFODIL-117[DAFFODIL-117].

 === Improve TDML Runner

 We want to improve the TDML Runner
 to make it easier to run TDML tests
 with both runtime1 and runtime2.
 We want to eliminate the need
 to configure a `daf:tdmlImplementation` tunable
 in the TDML test using 12 lines of code.

 I had an initial idea which was that
 the TDML Runner could run both runtime1 and runtime2
 automatically (in parallel or serially)
 if it sees a TDML root attribute
 saying `defaultImplementations="daffodil daffodil-runtime2"`
 or a parser/unparseTestCase attribute
 saying `implementations="daffodil daffodil-runtime2"`.
 To make running the same test on runtime1/runtime2 easier
 we also could add an implementation attribute
 to tdml:errors/warnings elements
 saying which implementation they are for
 and tell the TDML Runner to check errors/warnings
 for runtime2 as well as runtime1.

 Then I had another idea which might be easier to implement.
 If we could find a way to set Daffodil's tdmlImplementation tunable
 using a command line option or environment variable
 or some other way to change TDML Runner's behavior
 when running both "sbt test" and "daffodil test"
 then we could simply run "sbt test" or "daffodil test" twice
 (first using runtime1 and then using runtime2)
 in order to verify all the cross tests work on both.
 I think this way would be easier than making TDML Runner
 automatically run all the implementations it can find
 in parallel or serially when running cross tests.

 If the second idea works as I hope it does,
 then we can start the process of adding "daffodil-runtime2"
 to some of the cross tests we have for daffodil and ibm.
 We also chould change ibm's ProcessFactory class
 to have a different name than daffodil's ProcessFactory class
 and update TDML Runner's match expression to use the new class name.
 Then some developers could add the ibmDFDLCrossTester plugin
 to their daffodil checkout permanently
 instead of having to do & undo that change
 each time they want to run daffodil/ibm cross tests.

 === C struct/field name collisions

 To avoid possible name collisions,
 we should prepend struct names and field names with namespace prefixes
 if their infoset elements have non-null namespace prefixes.
 Alternatively, we may need to use enclosing elements' names
 as prefixes to avoid name collisions without namespaces.

 === Anonymous/multiple choice groups

 We already handle elements having xs:choice complex types.
 In addition, we should support anonymous/multiple choice groups.
 We may need to refine the choice runtime structure
 in order to allow multiple choice groups
 to be inlined into parent elements.
 Here is an example schema
 and corresponding C code to demonstrate:

 [source,xml]
 ----
   <xs:complexType name="NestedUnionType">
     <xs:sequence>
       <xs:element name="first_tag" type="idl:int32"/>
       <xs:choice dfdl:choiceDispatchKey="{xs:string(./first_tag)}">
         <xs:element name="foo" type="idl:FooType" dfdl:choiceBranchKey="1 2"/>
         <xs:element name="bar" type="idl:BarType" dfdl:choiceBranchKey="3 4"/>
       </xs:choice>
       <xs:element name="second_tag" type="idl:int32"/>
       <xs:choice dfdl:choiceDispatchKey="{xs:string(./second_tag)}">
         <xs:element name="fie" type="idl:FieType" dfdl:choiceBranchKey="1"/>
         <xs:element name="fum" type="idl:FumType" dfdl:choiceBranchKey="2"/>
       </xs:choice>
     </xs:sequence>
   </xs:complexType>
 ----

 [source,c]
 ----
 typedef struct NestedUnion
 {
     InfosetBase _base;
     int32_t     first_tag;
     size_t      _choice_1; // choice of which union field to use
     union
     {
         foo foo;
         bar bar;
     };
     int32_t     second_tag;
     size_t      _choice_2; // choice of which union field to use
     union
     {
         fie fie;
         fum fum;
     };
 } NestedUnion;
 ----

 === Choice dispatch key expressions

 We currently support only a very restricted
 and simple subset of choice dispatch key expressions.
 We would like to refactor the DPath expression compiler
 and make it generate C code
 in order to support arbitrary choice dispatch key expressions.

 === No match between choice dispatch key and choice branch keys

 Right now c-daffodil is more strict than scala-daffodil
 when unparsing infoset XML files with no matches (or mismatches)
 between choice dispatch keys and branch keys.
 Perhaps c-daffodil should load such an XML file
 without a no match processing error
 and unparse the infoset to a binary data file
 without a no match processing error.
 We would have to code and call a choice branch resolver in C
 which peeks at the next XML element,
 figures out which branch
 does that element indicate exists
 inside the choice group,
 and initializes the choice and element runtime data
 (_choice and childNode->erd member fields) accordingly.
 We probably would replace the initChoice() call in walkInfosetNode()
 with a call to that choice branch resolver
 and we might not need to call initChoice() in unparseSelf().
 When I called initChoice() in all these parse, walk, and unparse places,
 I was pondering removing the _choice member field
 and calling initChoice() as a function
 to tell us which element to visit next,
 but we probably should have a mutable choice runtime data structure
 that applications can override if they want to.

 === Floating point numbers

 Right now runtime2 prints floating point numbers
 in XML infosets slightly differently than runtime1 does.
 This means we may need to use different XML infosets
 in TDML tests depending on the runtime implementation.
 In order to use the same XML infoset in TDML tests,
 we should make the TDML Runner
 compare floating point numbers numerically, not textually,
 as discussed in https://issues.apache.org/jira/browse/DAFFODIL-2402[DAFFODIL-2402].

 === Arrays

 Instead of expanding arrays inline within childrenERDs,
 we may want to store a single entry
 for an array in childrenERDs
 giving the array's offset and size of all its elements.
 We would have to write code
 for special case treatment of array member fields
 versus scalar member fields
 but we could save space/memory in childrenERDs
 for use cases with very large arrays.
 An array element's ERD should have minOccurs and maxOccurs
 where minOccurs is unsigned
 and maxOccurs is signed with -1 meaning "unbounded".
 The actual number of children in an array instance
 would have to be stored with the array instance
 in the C struct or the ERD.
 An array node has to be a different kind of infoset node
 with a place for this number of actual children to be stored.
 Probably all ERDs should just get minOccurs and maxOccurs
 and a scalar is just one with 1, 1 as those values,
 an optional element is 0, 1,
 and an array is all other legal combinations
 like N, -1 and N, and M with N<=M.
 A restriction that minOccurs is 0, 1,
 or equal to maxOccurs (which is not -1)
 is acceptable.
 A restriction that maxOccurs is 1, -1,
 or equal to minOccurs
 is also fine
 (means variable-length arrays always have unbounded number of elements).

 === Daffodil module/subdirectory names

 When Daffodil is ready to move from a 3.x to a 4.x release,
 rename the modules to have shorter and easier to understand names
 as discussed in https://issues.apache.org/jira/browse/DAFFODIL-2406[DAFFODIL-2406].
	:page-layout: page
	:keywords: dfdl-to-c backend code-generator runtime2
	// ///////////////////////////////////////////////////////////////////////////
	//
	// This file is written in https://asciidoctor.org/docs/what-is-asciidoc/[AsciiDoc]
	// with https://rhodesmill.org/brandon/2012/one-sentence-per-line/[semantic linefeeds].
	//
	// When editing, please start each sentence on a new line.
	// This makes textual diffs of this file useful
	// in a similar way to the way they work for code.
	//
	// //////////////////////////////////////////////////////////////////////////

	== Runtime2 ToDos

	=== Overview

	We have built an initial DFDL-to-C backend
	and code generator for Apache Daffodil.
	Currently the C code generator can support
	binary boolean, integer, and real numbers,
	arrays of simple and complex elements,
	choice groups using dispatch/branch keys,
	validation of "fixed" attributes,
	and padding of explicit length complex elements with fill bytes.
	We plan to continue building out the C code generator
	until it supports a minimal subset of the DFDL 1.0 specification
	for embedded devices.

	We are using this document
	to keep track of some changes
	requested by reviewers
	so we don't forget to make these changes.
	If someone wants to help
	(which would be appreciated),
	please let the mailto:dev@daffodil.apache.org[dev] list know
	in order to avoid duplication.

	=== Remove dependence on argp and gcc

	Update BUILD.md and the github actions to use clang instead of gcc.
	Will have to find the correct steps on Fedora / Linux / Windows
	and check they work (e.g., install clang-10 package on Ubuntu
	and set CC=clang-10 and AR=llvm-ar-10 environment variables).

	Likewise, replace argp calls with getopt calls
	in daffodil_main.c and daffodil_argp.c
	even though it means giving up long options
	and other nice features that argp has.

	=== Report hanging problem running sbt (really dev.dirs) from MSYS2 on Windows

	We need to open a issue with a reproducible test case
	in the dev.dirs/directories-jvm project on GitHub.
	Note that dev.dirs exhibits the problem
	but they may or may not be responsible for it.
	Their code which tries to run a Windows PowerShell script
	using a Java subprocess call hangs
	when run from MSYS2 on Windows
	although it works fine when run from CMD on Windows.
	Then we need to wait until
	the hanging problem is fixed in the directories library,
	coursier picks up the new directories version,
	sbt picks up the new coursier version,
	and daffodil picks up the new sbt version,
	before we can remove the "echo >> $GITHUB_ENV" lines
	from .github/workflows/main.yml.

	=== Reporting errors using structs, not strings

	We have replaced error message strings
	with error structs everywhere now.
	However, we may need to expand the error struct
	to include a pointer (pstate/ustate for data position)
	and another pointer (ERD or static context object
	for schema filename/line number).

	We also may want to implement error logging variants
	that both do and don't humanize the errors,
	e.g., a hardware/FPGA-type implementation might just output numbers
	and an external tool might have to "humanize" these numbers
	using knowledge of the schema and runtime data objects,
	like an offline log processor does.

	=== Recovering after errors

	As we continue to build out runtime2,
	we may need to distinguish more types of errors
	and allow backtracking and retrying.
	Right now we handle only parse/unparse and
	validation errors in limited ways.
	Parse/unparse errors abort the parsing/unparsing
	and return to the caller immediately
	without resetting the stream's position.
	Validation errors are collected in an array
	and printed after parsing or unparsing.
	The only places where there are calls to stop the program
	are in daffodil_main.c (top-level error handling)
	and stack.c (empty, overflow, underflow errors which should never happen).

	Most of the parse functions set pstate->error
	only if they couldn't read data into their buffer
	due to an I/O error or EOF,
	which doesn't seem recoverable to me.
	Likewise, the unparse functions set ustate->error
	only if they couldn't write data from their buffer
	due to an I/O error, which doesn't seem recoverable to me.

	Only the parse_endian_bool functions set pstate->error
	if they read an integer which doesn't match either true_rep or false_rep
	when an exact match to either is required.
	If we decide to implement backtracking and retrying,
	they should call fseek to reset the stream's position
	back to where they started reading the integer
	before they return to their callers.
	Right now all parse calls are followed by
	if statements to check for error and return immediately.
	The code generator would have to generate code
	which can advance the stream's position by some byte(s)
	and try the parse call again as an attempt
	to resynchronize with a correct data stream
	after a bunch of failures.

	Note that we actually run the generated code in an embedded processor
	and call our own fread/frwrite functions
	which replace the stdio fread/fwrite functions
	since the C code runs bare metal without OS functions.
	We can implement fseek but we should have a good use case.

	=== Javadoc-like tool for C code

	We should consider adopting one of the javadoc-like tools for C code
	and structuring our comments that way.

	=== DSOM "fixed" getter

	Note: If we change runtime1 to validate "fixed" values
	like runtime2 does, then we can resolve
	https://issues.apache.org/jira/browse/DAFFODIL-117[DAFFODIL-117].

	=== Improve TDML Runner

	We want to improve the TDML Runner
	to make it easier to run TDML tests
	with both runtime1 and runtime2.
	We want to eliminate the need
	to configure a `daf:tdmlImplementation` tunable
	in the TDML test using 12 lines of code.

	I had an initial idea which was that
	the TDML Runner could run both runtime1 and runtime2
	automatically (in parallel or serially)
	if it sees a TDML root attribute
	saying `defaultImplementations="daffodil daffodil-runtime2"`
	or a parser/unparseTestCase attribute
	saying `implementations="daffodil daffodil-runtime2"`.
	To make running the same test on runtime1/runtime2 easier
	we also could add an implementation attribute
	to tdml:errors/warnings elements
	saying which implementation they are for
	and tell the TDML Runner to check errors/warnings
	for runtime2 as well as runtime1.

	Then I had another idea which might be easier to implement.
	If we could find a way to set Daffodil's tdmlImplementation tunable
	using a command line option or environment variable
	or some other way to change TDML Runner's behavior
	when running both "sbt test" and "daffodil test"
	then we could simply run "sbt test" or "daffodil test" twice
	(first using runtime1 and then using runtime2)
	in order to verify all the cross tests work on both.
	I think this way would be easier than making TDML Runner
	automatically run all the implementations it can find
	in parallel or serially when running cross tests.

	If the second idea works as I hope it does,
	then we can start the process of adding "daffodil-runtime2"
	to some of the cross tests we have for daffodil and ibm.
	We also chould change ibm's ProcessFactory class
	to have a different name than daffodil's ProcessFactory class
	and update TDML Runner's match expression to use the new class name.
	Then some developers could add the ibmDFDLCrossTester plugin
	to their daffodil checkout permanently
	instead of having to do & undo that change
	each time they want to run daffodil/ibm cross tests.

	=== C struct/field name collisions

	To avoid possible name collisions,
	we should prepend struct names and field names with namespace prefixes
	if their infoset elements have non-null namespace prefixes.
	Alternatively, we may need to use enclosing elements' names
	as prefixes to avoid name collisions without namespaces.

	=== Anonymous/multiple choice groups

	We already handle elements having xs:choice complex types.
	In addition, we should support anonymous/multiple choice groups.
	We may need to refine the choice runtime structure
	in order to allow multiple choice groups
	to be inlined into parent elements.
	Here is an example schema
	and corresponding C code to demonstrate:

	[source,xml]
	----
	<xs:complexType name="NestedUnionType">
	<xs:sequence>
	<xs:element name="first_tag" type="idl:int32"/>
	<xs:choice dfdl:choiceDispatchKey="{xs:string(./first_tag)}">
	<xs:element name="foo" type="idl:FooType" dfdl:choiceBranchKey="1 2"/>
	<xs:element name="bar" type="idl:BarType" dfdl:choiceBranchKey="3 4"/>
	</xs:choice>
	<xs:element name="second_tag" type="idl:int32"/>
	<xs:choice dfdl:choiceDispatchKey="{xs:string(./second_tag)}">
	<xs:element name="fie" type="idl:FieType" dfdl:choiceBranchKey="1"/>
	<xs:element name="fum" type="idl:FumType" dfdl:choiceBranchKey="2"/>
	</xs:choice>
	</xs:sequence>
	</xs:complexType>
	----

	[source,c]
	----
	typedef struct NestedUnion
	{
	InfosetBase _base;
	int32_t first_tag;
	size_t _choice_1; // choice of which union field to use
	union
	{
	foo foo;
	bar bar;
	};
	int32_t second_tag;
	size_t _choice_2; // choice of which union field to use
	union
	{
	fie fie;
	fum fum;
	};
	} NestedUnion;
	----

	=== Choice dispatch key expressions

	We currently support only a very restricted
	and simple subset of choice dispatch key expressions.
	We would like to refactor the DPath expression compiler
	and make it generate C code
	in order to support arbitrary choice dispatch key expressions.

	=== No match between choice dispatch key and choice branch keys

	Right now c-daffodil is more strict than scala-daffodil
	when unparsing infoset XML files with no matches (or mismatches)
	between choice dispatch keys and branch keys.
	Perhaps c-daffodil should load such an XML file
	without a no match processing error
	and unparse the infoset to a binary data file
	without a no match processing error.
	We would have to code and call a choice branch resolver in C
	which peeks at the next XML element,
	figures out which branch
	does that element indicate exists
	inside the choice group,
	and initializes the choice and element runtime data
	(_choice and childNode->erd member fields) accordingly.
	We probably would replace the initChoice() call in walkInfosetNode()
	with a call to that choice branch resolver
	and we might not need to call initChoice() in unparseSelf().
	When I called initChoice() in all these parse, walk, and unparse places,
	I was pondering removing the _choice member field
	and calling initChoice() as a function
	to tell us which element to visit next,
	but we probably should have a mutable choice runtime data structure
	that applications can override if they want to.

	=== Floating point numbers

	Right now runtime2 prints floating point numbers
	in XML infosets slightly differently than runtime1 does.
	This means we may need to use different XML infosets
	in TDML tests depending on the runtime implementation.
	In order to use the same XML infoset in TDML tests,
	we should make the TDML Runner
	compare floating point numbers numerically, not textually,
	as discussed in https://issues.apache.org/jira/browse/DAFFODIL-2402[DAFFODIL-2402].

	=== Arrays

	Instead of expanding arrays inline within childrenERDs,
	we may want to store a single entry
	for an array in childrenERDs
	giving the array's offset and size of all its elements.
	We would have to write code
	for special case treatment of array member fields
	versus scalar member fields
	but we could save space/memory in childrenERDs
	for use cases with very large arrays.
	An array element's ERD should have minOccurs and maxOccurs
	where minOccurs is unsigned
	and maxOccurs is signed with -1 meaning "unbounded".
	The actual number of children in an array instance
	would have to be stored with the array instance
	in the C struct or the ERD.
	An array node has to be a different kind of infoset node
	with a place for this number of actual children to be stored.
	Probably all ERDs should just get minOccurs and maxOccurs
	and a scalar is just one with 1, 1 as those values,
	an optional element is 0, 1,
	and an array is all other legal combinations
	like N, -1 and N, and M with N<=M.
	A restriction that minOccurs is 0, 1,
	or equal to maxOccurs (which is not -1)
	is acceptable.
	A restriction that maxOccurs is 1, -1,
	or equal to minOccurs
	is also fine
	(means variable-length arrays always have unbounded number of elements).

	=== Daffodil module/subdirectory names

	When Daffodil is ready to move from a 3.x to a 4.x release,
	rename the modules to have shorter and easier to understand names
	as discussed in https://issues.apache.org/jira/browse/DAFFODIL-2406[DAFFODIL-2406].