:page-layout: page
:keywords: dfdl-to-c backend code-generator runtime2
// ///////////////////////////////////////////////////////////////////////////
//
// This file is written in https://asciidoctor.org/docs/what-is-asciidoc/[AsciiDoc]
// with https://rhodesmill.org/brandon/2012/one-sentence-per-line/[semantic linefeeds].
//
// When editing, please start each sentence on a new line.
// This makes textual diffs of this file useful
// in a similar way to the way they work for code.
//
// //////////////////////////////////////////////////////////////////////////
== Runtime2 ToDos
=== Overview
We have built an initial DFDL-to-C backend
and code generator for Apache Daffodil.
Currently the C code generator can support
binary boolean, integer, and real numbers,
arrays of simple and complex elements,
choice groups using dispatch/branch keys,
validation of "fixed" attributes,
and padding of explicit length complex elements with fill bytes.
We plan to continue building out the C code generator
until it supports a minimal subset of the DFDL 1.0 specification
for embedded devices.
We are using this document
to keep track of some changes
requested by reviewers
so we don't forget to make these changes.
If someone wants to help
(which would be appreciated),
please let the mailto:dev@daffodil.apache.org[dev] list know
in order to avoid duplication.
=== Reporting errors using structs, not strings
We have replaced error message strings
with error structs everywhere now.
However, we may need to expand the error struct
to include a pointer (pstate/ustate for data position)
and another pointer (ERD or static context object
for schema filename/line number).
We also may want to implement error logging variants
that both do and don't humanize the errors,
e.g., a hardware/FPGA-type implementation might just output numbers
and an external tool might have to "humanize" these numbers
using knowledge of the schema and runtime data objects,
like an offline log processor does.
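The following is only a rough sketch of how an expanded error struct and the two logging variants might fit together.
All names in it (`ErrorCode`, `SchemaLocation`, `Error`, `log_error_raw`, `log_error_humanized`) are hypothetical and are not the actual runtime2 definitions.
It also assumes the data position is copied into the struct rather than reached through a pstate/ustate pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

// Hypothetical error codes; the real runtime2 enumeration may differ.
typedef enum { ERR_NONE, ERR_EOF, ERR_BOOL_MISMATCH, ERR_COUNT } ErrorCode;

// Hypothetical static context giving an element's schema location.
typedef struct
{
    const char *schema_file;
    int         schema_line;
} SchemaLocation;

// Hypothetical expanded error struct: a code plus two pieces of context.
typedef struct
{
    ErrorCode             code;
    size_t                position; // data position (assumed copied from pstate/ustate)
    const SchemaLocation *where;    // static context, may be NULL
} Error;

// Non-humanized variant: writes numbers only ("<code> <position>"),
// leaving interpretation to an external tool.
int log_error_raw(const Error *e, char *buf, size_t size)
{
    assert(e != NULL && buf != NULL);
    return snprintf(buf, size, "%d %zu", (int)e->code, e->position);
}

// Humanized variant: expands the code into text and adds the schema location.
int log_error_humanized(const Error *e, char *buf, size_t size)
{
    static const char *const messages[ERR_COUNT] = {
        "no error",
        "unexpected end of data",
        "boolean matches neither true_rep nor false_rep",
    };
    assert(e != NULL && buf != NULL);
    const char *text = (e->code >= 0 && e->code < ERR_COUNT) ? messages[e->code] : "unknown error";
    const char *file = e->where ? e->where->schema_file : "?";
    int         line = e->where ? e->where->schema_line : 0;
    return snprintf(buf, size, "%s at byte %zu (%s:%d)", text, e->position, file, line);
}
```

A hardware-style implementation would only need something like `log_error_raw`,
while an offline tool holding the schema could produce the humanized text later.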
=== Recovering after errors
As we continue to build out runtime2,
we may need to distinguish more types of errors
and allow backtracking and retrying.
Right now we handle only parse/unparse and
validation errors in limited ways.
Parse/unparse errors abort the parsing/unparsing
and return to the caller immediately
without resetting the stream's position.
Validation errors are collected in an array
and printed after parsing or unparsing.
The only places where there are calls to stop the program
are in daffodil_main.c (top-level error handling)
and stack.c (empty, overflow, underflow errors which should never happen).
Most of the parse functions set pstate->error
only if they couldn't read data into their buffer
due to an I/O error or EOF,
which doesn't seem recoverable to me.
Likewise, the unparse functions set ustate->error
only if they couldn't write data from their buffer
due to an I/O error, which doesn't seem recoverable to me.
Only the parse_endian_bool functions set pstate->error
if they read an integer which doesn't match either true_rep or false_rep
when an exact match to either is required.
If we decide to implement backtracking and retrying,
they should call fseek to reset the stream's position
back to where they started reading the integer
before they return to their callers.
Right now all parse calls are followed by
if statements to check for error and return immediately.
The code generator would have to generate code
which can advance the stream's position by some byte(s)
and try the parse call again as an attempt
to resynchronize with a correct data stream
after a bunch of failures.
Note that we actually run the generated code on an embedded processor
and call our own fread/fwrite functions
which replace the stdio fread/fwrite functions
since the C code runs bare metal without OS functions.
We can implement fseek but we should have a good use case.
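To make the backtracking and resynchronizing ideas above more concrete, here is a minimal sketch written against hosted stdio (`ftell`, `fseek`, `fread`).
The pared-down `PState` and the functions `parse_be_bool32_rewind` and `resync_be_bool32` are hypothetical names for illustration,
and a bare metal build would need equivalents of `ftell` and `fseek` before anything like this could work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical, pared-down parser state; the real PState has more fields.
typedef struct
{
    FILE *stream;
    bool  error;
} PState;

// Reads a big-endian 32-bit boolean which must equal true_rep or false_rep.
// On a mismatch, rewinds the stream to where the read began so a caller
// could backtrack and try something else.
void parse_be_bool32_rewind(bool *value, uint32_t true_rep, uint32_t false_rep, PState *pstate)
{
    assert(value != NULL && pstate != NULL && pstate->stream != NULL);
    long          start = ftell(pstate->stream);
    unsigned char bytes[4];
    if (fread(bytes, 1, sizeof bytes, pstate->stream) != sizeof bytes)
    {
        pstate->error = true; // I/O error or EOF, treated as unrecoverable
        return;
    }
    uint32_t n = ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
                 ((uint32_t)bytes[2] << 8) | (uint32_t)bytes[3];
    if (n == true_rep)
        *value = true;
    else if (n == false_rep)
        *value = false;
    else
    {
        pstate->error = true;
        if (start >= 0) fseek(pstate->stream, start, SEEK_SET);
    }
}

// Retries the parse at successive byte offsets, giving up after max_skip
// skipped bytes or as soon as the stream reports EOF or an I/O error.
bool resync_be_bool32(bool *value, uint32_t true_rep, uint32_t false_rep, PState *pstate, int max_skip)
{
    for (int skipped = 0; skipped <= max_skip; skipped++)
    {
        pstate->error = false;
        parse_be_bool32_rewind(value, true_rep, false_rep, pstate);
        if (!pstate->error) return true;
        if (feof(pstate->stream) || ferror(pstate->stream)) break;
        if (fseek(pstate->stream, 1, SEEK_CUR) != 0) break;
    }
    return false;
}
```

Whether skipping one byte at a time is the right resynchronization policy depends on the data format,
which is part of why a good use case should come first.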
=== Javadoc-like tool for C code
We should consider adopting
one of the javadoc-like tools for C code
and structuring our comments that way.
=== DSOM "fixed" getter
Note: If we change runtime1 to validate "fixed" values
like runtime2 does, then we can resolve {% jira 117 %}.
=== Update to TDML Runner
We want to update the TDML Runner
to make it easier to run TDML tests
with both runtime1 and runtime2.
We want to eliminate the need
to configure a `daf:tdmlImplementation` tunable
in the TDML test using 12 lines of code.
The TDML Runner should configure itself
to run runtime1, runtime2, or both
just from seeing a root attribute
saying `defaultImplementations="daffodil runtime2"`
or a parserTestCase/unparserTestCase attribute saying `implementations="runtime2"`.
Maybe we also want to add an implementation attribute
to tdml:errors/warnings elements
saying which implementation they are for too.
If we do that,
we should tell the TDML Runner
that runtime2 tests are not cross tests
so it will check their errors/warnings.
=== C struct/field name collisions
To avoid possible name collisions,
we should prepend struct names and field names with namespace prefixes
if their infoset elements have non-null namespace prefixes.
Alternatively, we may need to use enclosing elements' names
as prefixes to avoid name collisions without namespaces.
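As a rough illustration of the first option, a name-building helper could look something like the sketch below.
The function `make_c_name` is hypothetical (the real code generator is not written in C),
and the rule of replacing every non-alphanumeric character with an underscore is an assumption made for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper: writes "<prefix>_<name>" into out, or just "<name>"
// when the prefix is NULL or empty, then replaces every character that is
// not a letter or digit with '_'. Returns false if out is too small.
// It does not guard against a name starting with a digit.
bool make_c_name(const char *prefix, const char *name, char *out, size_t size)
{
    assert(name != NULL && out != NULL);
    bool   has_prefix = prefix != NULL && prefix[0] != '\0';
    size_t needed = strlen(name) + 1 + (has_prefix ? strlen(prefix) + 1 : 0);
    if (needed > size) return false;
    if (has_prefix)
        snprintf(out, size, "%s_%s", prefix, name);
    else
        snprintf(out, size, "%s", name);
    for (char *p = out; *p != '\0'; p++)
    {
        if (!isalnum((unsigned char)*p)) *p = '_';
    }
    return true;
}
```

With this sketch, an element `FooType` with prefix `idl` would become `idl_FooType`,
while an element without a prefix would keep its local name.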
=== Anonymous/multiple choice groups
We already handle elements having xs:choice complex types.
In addition, we should support anonymous/multiple choice groups.
We may need to refine the choice runtime structure
in order to allow multiple choice groups
to be inlined into parent elements.
Here is an example schema
and corresponding C code to demonstrate:
[source,xml]
----
<xs:complexType name="NestedUnionType">
  <xs:sequence>
    <xs:element name="first_tag" type="idl:int32"/>
    <xs:choice dfdl:choiceDispatchKey="{xs:string(./first_tag)}">
      <xs:element name="foo" type="idl:FooType" dfdl:choiceBranchKey="1 2"/>
      <xs:element name="bar" type="idl:BarType" dfdl:choiceBranchKey="3 4"/>
    </xs:choice>
    <xs:element name="second_tag" type="idl:int32"/>
    <xs:choice dfdl:choiceDispatchKey="{xs:string(./second_tag)}">
      <xs:element name="fie" type="idl:FieType" dfdl:choiceBranchKey="1"/>
      <xs:element name="fum" type="idl:FumType" dfdl:choiceBranchKey="2"/>
    </xs:choice>
  </xs:sequence>
</xs:complexType>
----
[source,c]
----
typedef struct NestedUnion
{
    InfosetBase _base;
    int32_t     first_tag;
    size_t      _choice_1; // choice of which union field to use
    union
    {
        foo foo;
        bar bar;
    };
    int32_t     second_tag;
    size_t      _choice_2; // choice of which union field to use
    union
    {
        fie fie;
        fum fum;
    };
} NestedUnion;
----
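A sketch of how the two `_choice` fields in such a struct might be initialized from their dispatch keys follows.
It mirrors the branch keys in the example schema ("1 2" selects foo, "3 4" selects bar, and so on),
but the functions `init_choice_1` and `init_choice_2`, the `NO_CHOICE` marker, and the stand-in branch types are all hypothetical,
and `_base` is omitted to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Stand-ins for the branch types; the real structs come from the schema.
typedef struct { int32_t value; } foo;
typedef struct { int32_t value; } bar;
typedef struct { int32_t value; } fie;
typedef struct { int32_t value; } fum;

#define NO_CHOICE ((size_t)-1) // hypothetical "no branch selected" marker

typedef struct NestedUnion
{
    int32_t first_tag;
    size_t  _choice_1; // choice of which union field to use
    union
    {
        foo foo;
        bar bar;
    };
    int32_t second_tag;
    size_t  _choice_2; // choice of which union field to use
    union
    {
        fie fie;
        fum fum;
    };
} NestedUnion;

// Maps first_tag to a branch index; returns false when no branch key matches.
bool init_choice_1(NestedUnion *instance)
{
    assert(instance != NULL);
    switch (instance->first_tag)
    {
    case 1: case 2: instance->_choice_1 = 0; return true; // foo
    case 3: case 4: instance->_choice_1 = 1; return true; // bar
    default: instance->_choice_1 = NO_CHOICE; return false;
    }
}

// Maps second_tag to a branch index; returns false when no branch key matches.
bool init_choice_2(NestedUnion *instance)
{
    assert(instance != NULL);
    switch (instance->second_tag)
    {
    case 1: instance->_choice_2 = 0; return true; // fie
    case 2: instance->_choice_2 = 1; return true; // fum
    default: instance->_choice_2 = NO_CHOICE; return false;
    }
}
```

Each choice group gets its own index field and its own initializer,
so the two groups can be resolved independently within the same parent element.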
=== Choice dispatch key expressions
We currently support only a very restricted
and simple subset of choice dispatch key expressions.
We would like to refactor the DPath expression compiler
and make it generate C code
in order to support arbitrary choice dispatch key expressions.
=== No match between choice dispatch key and choice branch keys
Right now c-daffodil is more strict than scala-daffodil
when unparsing infoset XML files with no matches (or mismatches)
between choice dispatch keys and branch keys.
Perhaps c-daffodil should load such an XML file
and unparse the infoset to a binary data file
without raising a no-match processing error in either step.
We would have to code and call a choice branch resolver in C
which peeks at the next XML element,
figures out which branch inside the choice group
that element indicates,
and initializes the choice and element runtime data
(_choice and childNode->erd member fields) accordingly.
We probably would replace the initChoice() call in walkInfosetNode()
with a call to that choice branch resolver
and we might not need to call initChoice() in unparseSelf().
When I called initChoice() in all these parse, walk, and unparse places,
I was pondering removing the _choice member field
and calling initChoice() as a function
to tell us which element to visit next,
but we probably should have a mutable choice runtime data structure
that applications can override if they want to.
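The core of such a choice branch resolver might be as small as the lookup sketched below.
It assumes the peeked element name is already available as a string and that each branch's ERD carries its local name.
The pared-down `ERD`, the `NO_CHOICE` marker, and `resolve_choice_branch` are hypothetical and ignore namespaces entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical, pared-down element runtime data; the real ERD has more fields.
typedef struct
{
    const char *name; // local name of the element
} ERD;

#define NO_CHOICE ((size_t)-1) // hypothetical "no branch selected" marker

// Looks up the peeked XML element name among the choice group's branches.
// On success, stores the branch index in *choice and returns that branch's
// ERD; otherwise stores NO_CHOICE and returns NULL.
const ERD *resolve_choice_branch(const char *next_element, const ERD *const *branches,
                                 size_t num_branches, size_t *choice)
{
    assert(next_element != NULL && branches != NULL && choice != NULL);
    for (size_t i = 0; i < num_branches; i++)
    {
        if (strcmp(branches[i]->name, next_element) == 0)
        {
            *choice = i;
            return branches[i];
        }
    }
    *choice = NO_CHOICE;
    return NULL;
}
```

The caller would copy the returned index and ERD into the `_choice` and `childNode->erd` fields,
so the dispatch key's value would no longer decide which branch is unparsed.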
=== Floating point numbers
Right now runtime2 prints floating point numbers
in XML infosets slightly differently than runtime1 does.
This means we may need to use different XML infosets
in TDML tests depending on the runtime implementation.
In order to use the same XML infoset in TDML tests,
we should make the TDML Runner
compare floating point numbers numerically, not textually,
as discussed in https://issues.apache.org/jira/browse/DAFFODIL-2402[DAFFODIL-2402].
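To illustrate what comparing numerically rather than textually could mean, here is a small sketch in C (the TDML Runner itself is not written in C).
The function `floats_match` and its relative tolerance parameter are assumptions made for this example
and are not taken from the JIRA issue.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>
#include <stdlib.h>

// Absolute value without needing to link the math library.
static double magnitude(double x)
{
    return x < 0 ? -x : x;
}

// Returns true if both strings parse completely as numbers and the values
// agree within a relative tolerance. NaN matches only NaN, and an infinity
// matches only the same infinity.
bool floats_match(const char *expected, const char *actual, double rel_tol)
{
    assert(expected != NULL && actual != NULL);
    char  *end_e = NULL;
    char  *end_a = NULL;
    double e = strtod(expected, &end_e);
    double a = strtod(actual, &end_a);
    if (end_e == expected || *end_e != '\0') return false; // not a number
    if (end_a == actual || *end_a != '\0') return false;   // not a number
    if (e != e || a != a) return e != e && a != a;          // NaN handling
    if (e == a) return true;                                // exact, incl. infinities
    if (magnitude(e) > DBL_MAX || magnitude(a) > DBL_MAX) return false;
    double scale = magnitude(e) > magnitude(a) ? magnitude(e) : magnitude(a);
    return magnitude(e - a) <= rel_tol * scale;
}
```

Under this sketch, `"1.0E0"` and `"1"` would match even though their text differs,
which is the kind of difference that otherwise forces separate expected infosets per runtime.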
=== Arrays
Instead of expanding arrays inline within childrenERDs,
we may want to store a single entry
for an array in childrenERDs
giving the array's offset and the total size of its elements.
We would have to write code
for special case treatment of array member fields
versus scalar member fields
but we could save space/memory in childrenERDs
for use cases with very large arrays.
An array element's ERD should have minOccurs and maxOccurs
where minOccurs is unsigned
and maxOccurs is signed with -1 meaning "unbounded".
The actual number of children in an array instance
would have to be stored with the array instance
in the C struct or the ERD.
An array node has to be a different kind of infoset node
with a place for this number of actual children to be stored.
Probably all ERDs should just get minOccurs and maxOccurs,
where a scalar has 1, 1 as those values,
an optional element has 0, 1,
and an array has any other legal combination
such as N, -1 or N, M with N<=M.
A restriction that minOccurs is 0, 1,
or equal to maxOccurs (which is not -1)
is acceptable.
A restriction that maxOccurs is 1, -1,
or equal to minOccurs
is also fine
(it means variable-length arrays always have an unbounded number of elements).
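The scalar/optional/array distinction described above can be sketched as a small classification function.
The names `OccursKind`, `classify_occurs`, `occurs_allows`, and `UNBOUNDED` are hypothetical,
and the sketch accepts every legal combination rather than applying either of the optional restrictions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define UNBOUNDED (-1) // maxOccurs value meaning "unbounded"

typedef enum
{
    OCCURS_INVALID,
    OCCURS_SCALAR,   // minOccurs 1, maxOccurs 1
    OCCURS_OPTIONAL, // minOccurs 0, maxOccurs 1
    OCCURS_ARRAY     // any other legal combination
} OccursKind;

// Classifies an element from its minOccurs (unsigned) and maxOccurs
// (signed, with -1 meaning unbounded).
OccursKind classify_occurs(unsigned minOccurs, int maxOccurs)
{
    if (maxOccurs < UNBOUNDED) return OCCURS_INVALID;
    if (maxOccurs != UNBOUNDED && minOccurs > (unsigned)maxOccurs) return OCCURS_INVALID;
    if (minOccurs == 1 && maxOccurs == 1) return OCCURS_SCALAR;
    if (minOccurs == 0 && maxOccurs == 1) return OCCURS_OPTIONAL;
    return OCCURS_ARRAY;
}

// Checks whether an actual number of children is within the declared bounds.
bool occurs_allows(unsigned minOccurs, int maxOccurs, size_t count)
{
    assert(classify_occurs(minOccurs, maxOccurs) != OCCURS_INVALID);
    if (count < minOccurs) return false;
    return maxOccurs == UNBOUNDED || count <= (size_t)maxOccurs;
}
```

The actual count passed to something like `occurs_allows` would be the number stored with the array instance,
wherever we decide to keep it.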
=== Daffodil module/subdirectory names
When Daffodil is ready to move from a 3.x to a 4.x release,
rename the modules to have shorter and easier to understand names
as discussed in https://issues.apache.org/jira/browse/DAFFODIL-2406[DAFFODIL-2406].