| :page-layout: page |
| :keywords: dfdl-to-c backend code-generator runtime2 |
| // /////////////////////////////////////////////////////////////////////////// |
| // |
| // This file is written in https://asciidoctor.org/docs/what-is-asciidoc/[AsciiDoc] |
| // with https://rhodesmill.org/brandon/2012/one-sentence-per-line/[semantic linefeeds]. |
| // |
| // When editing, please start each sentence on a new line. |
| // This makes textual diffs of this file useful |
| // in a similar way to the way they work for code. |
| // |
| // ////////////////////////////////////////////////////////////////////////// |
| |
| == Runtime2 ToDos |
| |
| === Overview |
| |
| We have built an initial DFDL-to-C backend |
| and code generator for Apache Daffodil. |
| Currently the C code generator can support |
| binary boolean, integer, and real numbers, |
| arrays of simple and complex elements, |
| choice groups using dispatch/branch keys, |
| validation of "fixed" attributes, |
| and padding of explicit length complex elements with fill bytes. |
| We plan to continue building out the C code generator |
| until it supports a minimal subset of the DFDL 1.0 specification |
| for embedded devices. |
| |
| We are using this document |
| to keep track of some changes |
| requested by reviewers |
| so we don't forget to make these changes. |
| If someone wants to help |
| (which would be appreciated), |
| please let the mailto:dev@daffodil.apache.org[dev] list know |
| in order to avoid duplication. |
| |
| === Remove dependence on argp and gcc |
| |
| Update BUILD.md and the github actions to use clang instead of gcc. |
| Will have to find the correct steps on Fedora / Linux / Windows |
| and check they work (e.g., install clang-10 package on Ubuntu |
| and set CC=clang-10 and AR=llvm-ar-10 environment variables). |
| |
| Likewise, replace argp calls with getopt calls |
| in daffodil_main.c and daffodil_argp.c |
| even though it means giving up long options |
| and other nice features that argp has. |
| |
| === Report hanging problem running sbt (really dev.dirs) from MSYS2 on Windows |
| |
| We need to open a issue with a reproducible test case |
| in the dev.dirs/directories-jvm project on GitHub. |
| Note that dev.dirs exhibits the problem |
| but they may or may not be responsible for it. |
| Their code which tries to run a Windows PowerShell script |
| using a Java subprocess call hangs |
| when run from MSYS2 on Windows |
| although it works fine when run from CMD on Windows. |
| Then we need to wait until |
| the hanging problem is fixed in the directories library, |
| coursier picks up the new directories version, |
| sbt picks up the new coursier version, |
| and daffodil picks up the new sbt version, |
| before we can remove the "echo >> $GITHUB_ENV" lines |
| from .github/workflows/main.yml. |
| |
| === Reporting errors using structs, not strings |
| |
| We have replaced error message strings |
| with error structs everywhere now. |
| However, we may need to expand the error struct |
| to include a pointer (pstate/ustate for data position) |
| and another pointer (ERD or static context object |
| for schema filename/line number). |
| |
| We also may want to implement error logging variants |
| that both do and don't humanize the errors, |
| e.g., a hardware/FPGA-type implementation might just output numbers |
| and an external tool might have to "humanize" these numbers |
| using knowledge of the schema and runtime data objects, |
| like an offline log processor does. |
| |
| === Recovering after errors |
| |
| As we continue to build out runtime2, |
| we may need to distinguish more types of errors |
| and allow backtracking and retrying. |
| Right now we handle only parse/unparse and |
| validation errors in limited ways. |
| Parse/unparse errors abort the parsing/unparsing |
| and return to the caller immediately |
| without resetting the stream's position. |
| Validation errors are collected in an array |
| and printed after parsing or unparsing. |
| The only places where there are calls to stop the program |
| are in daffodil_main.c (top-level error handling) |
| and stack.c (empty, overflow, underflow errors which should never happen). |
| |
| Most of the parse functions set pstate->error |
| only if they couldn't read data into their buffer |
| due to an I/O error or EOF, |
| which doesn't seem recoverable to me. |
| Likewise, the unparse functions set ustate->error |
| only if they couldn't write data from their buffer |
| due to an I/O error, which doesn't seem recoverable to me. |
| |
| Only the parse_endian_bool functions set pstate->error |
| if they read an integer which doesn't match either true_rep or false_rep |
| when an exact match to either is required. |
| If we decide to implement backtracking and retrying, |
| they should call fseek to reset the stream's position |
| back to where they started reading the integer |
| before they return to their callers. |
| Right now all parse calls are followed by |
| if statements to check for error and return immediately. |
| The code generator would have to generate code |
| which can advance the stream's position by some byte(s) |
| and try the parse call again as an attempt |
| to resynchronize with a correct data stream |
| after a bunch of failures. |
| |
| Note that we actually run the generated code in an embedded processor |
| and call our own fread/frwrite functions |
| which replace the stdio fread/fwrite functions |
| since the C code runs bare metal without OS functions. |
| We can implement fseek but we should have a good use case. |
| |
| === Javadoc-like tool for C code |
| |
| We should consider adopting one of the javadoc-like tools for C code |
| and structuring our comments that way. |
| |
| === DSOM "fixed" getter |
| |
| Note: If we change runtime1 to validate "fixed" values |
| like runtime2 does, then we can resolve |
| https://issues.apache.org/jira/browse/DAFFODIL-117[DAFFODIL-117]. |
| |
| === Improve TDML Runner |
| |
| We want to improve the TDML Runner |
| to make it easier to run TDML tests |
| with both runtime1 and runtime2. |
| We want to eliminate the need |
| to configure a `daf:tdmlImplementation` tunable |
| in the TDML test using 12 lines of code. |
| |
| I had an initial idea which was that |
| the TDML Runner could run both runtime1 and runtime2 |
| automatically (in parallel or serially) |
| if it sees a TDML root attribute |
| saying `defaultImplementations="daffodil daffodil-runtime2"` |
| or a parser/unparseTestCase attribute |
| saying `implementations="daffodil daffodil-runtime2"`. |
| To make running the same test on runtime1/runtime2 easier |
| we also could add an implementation attribute |
| to tdml:errors/warnings elements |
| saying which implementation they are for |
| and tell the TDML Runner to check errors/warnings |
| for runtime2 as well as runtime1. |
| |
| Then I had another idea which might be easier to implement. |
| If we could find a way to set Daffodil's tdmlImplementation tunable |
| using a command line option or environment variable |
| or some other way to change TDML Runner's behavior |
| when running both "sbt test" and "daffodil test" |
| then we could simply run "sbt test" or "daffodil test" twice |
| (first using runtime1 and then using runtime2) |
| in order to verify all the cross tests work on both. |
| I think this way would be easier than making TDML Runner |
| automatically run all the implementations it can find |
| in parallel or serially when running cross tests. |
| |
| If the second idea works as I hope it does, |
| then we can start the process of adding "daffodil-runtime2" |
| to some of the cross tests we have for daffodil and ibm. |
| We also chould change ibm's ProcessFactory class |
| to have a different name than daffodil's ProcessFactory class |
| and update TDML Runner's match expression to use the new class name. |
| Then some developers could add the ibmDFDLCrossTester plugin |
| to their daffodil checkout permanently |
| instead of having to do & undo that change |
| each time they want to run daffodil/ibm cross tests. |
| |
| === C struct/field name collisions |
| |
| To avoid possible name collisions, |
| we should prepend struct names and field names with namespace prefixes |
| if their infoset elements have non-null namespace prefixes. |
| Alternatively, we may need to use enclosing elements' names |
| as prefixes to avoid name collisions without namespaces. |
| |
| === Anonymous/multiple choice groups |
| |
| We already handle elements having xs:choice complex types. |
| In addition, we should support anonymous/multiple choice groups. |
| We may need to refine the choice runtime structure |
| in order to allow multiple choice groups |
| to be inlined into parent elements. |
| Here is an example schema |
| and corresponding C code to demonstrate: |
| |
| [source,xml] |
| ---- |
| <xs:complexType name="NestedUnionType"> |
| <xs:sequence> |
| <xs:element name="first_tag" type="idl:int32"/> |
| <xs:choice dfdl:choiceDispatchKey="{xs:string(./first_tag)}"> |
| <xs:element name="foo" type="idl:FooType" dfdl:choiceBranchKey="1 2"/> |
| <xs:element name="bar" type="idl:BarType" dfdl:choiceBranchKey="3 4"/> |
| </xs:choice> |
| <xs:element name="second_tag" type="idl:int32"/> |
| <xs:choice dfdl:choiceDispatchKey="{xs:string(./second_tag)}"> |
| <xs:element name="fie" type="idl:FieType" dfdl:choiceBranchKey="1"/> |
| <xs:element name="fum" type="idl:FumType" dfdl:choiceBranchKey="2"/> |
| </xs:choice> |
| </xs:sequence> |
| </xs:complexType> |
| ---- |
| |
| [source,c] |
| ---- |
| typedef struct NestedUnion |
| { |
| InfosetBase _base; |
| int32_t first_tag; |
| size_t _choice_1; // choice of which union field to use |
| union |
| { |
| foo foo; |
| bar bar; |
| }; |
| int32_t second_tag; |
| size_t _choice_2; // choice of which union field to use |
| union |
| { |
| fie fie; |
| fum fum; |
| }; |
| } NestedUnion; |
| ---- |
| |
| === Choice dispatch key expressions |
| |
| We currently support only a very restricted |
| and simple subset of choice dispatch key expressions. |
| We would like to refactor the DPath expression compiler |
| and make it generate C code |
| in order to support arbitrary choice dispatch key expressions. |
| |
| === No match between choice dispatch key and choice branch keys |
| |
| Right now c-daffodil is more strict than scala-daffodil |
| when unparsing infoset XML files with no matches (or mismatches) |
| between choice dispatch keys and branch keys. |
| Perhaps c-daffodil should load such an XML file |
| without a no match processing error |
| and unparse the infoset to a binary data file |
| without a no match processing error. |
| We would have to code and call a choice branch resolver in C |
| which peeks at the next XML element, |
| figures out which branch |
| does that element indicate exists |
| inside the choice group, |
| and initializes the choice and element runtime data |
| (_choice and childNode->erd member fields) accordingly. |
| We probably would replace the initChoice() call in walkInfosetNode() |
| with a call to that choice branch resolver |
| and we might not need to call initChoice() in unparseSelf(). |
| When I called initChoice() in all these parse, walk, and unparse places, |
| I was pondering removing the _choice member field |
| and calling initChoice() as a function |
| to tell us which element to visit next, |
| but we probably should have a mutable choice runtime data structure |
| that applications can override if they want to. |
| |
| === Floating point numbers |
| |
| Right now runtime2 prints floating point numbers |
| in XML infosets slightly differently than runtime1 does. |
| This means we may need to use different XML infosets |
| in TDML tests depending on the runtime implementation. |
| In order to use the same XML infoset in TDML tests, |
| we should make the TDML Runner |
| compare floating point numbers numerically, not textually, |
| as discussed in https://issues.apache.org/jira/browse/DAFFODIL-2402[DAFFODIL-2402]. |
| |
| === Arrays |
| |
| Instead of expanding arrays inline within childrenERDs, |
| we may want to store a single entry |
| for an array in childrenERDs |
| giving the array's offset and size of all its elements. |
| We would have to write code |
| for special case treatment of array member fields |
| versus scalar member fields |
| but we could save space/memory in childrenERDs |
| for use cases with very large arrays. |
| An array element's ERD should have minOccurs and maxOccurs |
| where minOccurs is unsigned |
| and maxOccurs is signed with -1 meaning "unbounded". |
| The actual number of children in an array instance |
| would have to be stored with the array instance |
| in the C struct or the ERD. |
| An array node has to be a different kind of infoset node |
| with a place for this number of actual children to be stored. |
| Probably all ERDs should just get minOccurs and maxOccurs |
| and a scalar is just one with 1, 1 as those values, |
| an optional element is 0, 1, |
| and an array is all other legal combinations |
| like N, -1 and N, and M with N<=M. |
| A restriction that minOccurs is 0, 1, |
| or equal to maxOccurs (which is not -1) |
| is acceptable. |
| A restriction that maxOccurs is 1, -1, |
| or equal to minOccurs |
| is also fine |
| (means variable-length arrays always have unbounded number of elements). |
| |
| === Daffodil module/subdirectory names |
| |
| When Daffodil is ready to move from a 3.x to a 4.x release, |
| rename the modules to have shorter and easier to understand names |
| as discussed in https://issues.apache.org/jira/browse/DAFFODIL-2406[DAFFODIL-2406]. |