| :page-layout: page |
| :keywords: dfdl-to-c backend code-generator runtime2 |
| // /////////////////////////////////////////////////////////////////////////// |
| // |
| // This file is written in https://asciidoctor.org/docs/what-is-asciidoc/[AsciiDoc] |
| // with https://rhodesmill.org/brandon/2012/one-sentence-per-line/[semantic linefeeds]. |
| // |
| // When editing, please start each sentence on a new line. |
| // This makes textual diffs of this file useful |
| // in a similar way to the way they work for code. |
| // |
| // ////////////////////////////////////////////////////////////////////////// |
| |
| == Runtime2 ToDos |
| |
| === Overview |
| |
| We have built an initial DFDL-to-C backend |
| and code generator for Apache Daffodil. |
| Currently the C code generator can support |
| binary boolean, integer, and real numbers, |
| arrays of simple and complex elements, |
| choice groups using dispatch/branch keys, |
| validation of "fixed" attributes, |
| and padding of explicit length complex elements with fill bytes. |
| We plan to continue building out the C code generator |
| until it supports a minimal subset of the DFDL 1.0 specification |
| for embedded devices. |
| |
| We are using this document |
| to keep track of some changes |
| requested by reviewers |
| so we don't forget to make these changes. |
| If someone wants to help |
| (which would be appreciated), |
| please let the mailto:dev@daffodil.apache.org[dev] list know |
| in order to avoid duplication. |
| |
| === Anonymous choice groups not allowed |
| |
| We handle elements having xs:choice complex types. |
| However, we don't support anonymous choice groups |
| (that is, an unnamed choice group in the middle, beginning, |
| or end of a sequence which may contain other elements). |
| A DFDL schema author may write a sequence like this: |
| |
| [source,xml] |
| ---- |
| <xs:complexType name="NestedUnionType"> |
| <xs:sequence> |
| <xs:element name="first_tag" type="idl:int32"/> |
| <xs:choice dfdl:choiceDispatchKey="{xs:string(./first_tag)}"> |
| <xs:element name="foo" type="idl:FooType" dfdl:choiceBranchKey="1 2"/> |
| <xs:element name="bar" type="idl:BarType" dfdl:choiceBranchKey="3 4"/> |
| </xs:choice> |
| <xs:element name="second_tag" type="idl:int32"/> |
| <xs:choice dfdl:choiceDispatchKey="{xs:string(./second_tag)}"> |
| <xs:element name="fie" type="idl:FieType" dfdl:choiceBranchKey="1"/> |
| <xs:element name="fum" type="idl:FumType" dfdl:choiceBranchKey="2"/> |
| </xs:choice> |
| </xs:sequence> |
| </xs:complexType> |
| ---- |
| |
| Daffodil will parse and unparse the above sequence fine, |
| but the C code generator will not generate correct code |
| (no _choice members or unions will be declared for the type). |
| It might be possible to generate C code that looks like this: |
| |
| [source,c] |
| ---- |
| typedef struct NestedUnion |
| { |
| InfosetBase _base; |
| int32_t first_tag; |
| size_t _choice_1; // choice of which union field to use |
| union |
| { |
| foo foo; |
| bar bar; |
| }; |
| int32_t second_tag; |
| size_t _choice_2; // choice of which union field to use |
| union |
| { |
| fie fie; |
| fum fum; |
| }; |
| } NestedUnion; |
| ---- |
| |
| However, the Daffodil devs have looked at DFDL integration |
| for other systems like Apache Drill, NiFi, Avro, etc., |
| and these systems generally do not allow anonymous choices. |
| Hence, any DFDL schema having anonymous choices |
| doesn't integrate well with any of these systems |
| unless we generate a child element with a generated name |
| (which makes paths awkward, etc.). |
| Hence, it seems better to say that |
| the runtime2 DFDL subset doesn't allow anonymous choices |
| and DFDL schema authors should write their schema like this: |
| |
| [source,xml] |
| ---- |
| <xs:complexType name="NestedUnionType"> |
| <xs:sequence> |
| <xs:element name="first_tag" type="idl:int32"/> |
| <xs:element name="first_choice"> |
| <xs:complexType> |
| <xs:choice dfdl:choiceDispatchKey="{xs:string(../first_tag)}"> |
| <xs:element name="foo" type="idl:FooType" dfdl:choiceBranchKey="1 2"/> |
| <xs:element name="bar" type="idl:BarType" dfdl:choiceBranchKey="3 4"/> |
| </xs:choice> |
| </xs:complexType> |
| </xs:element> |
| <xs:element name="second_tag" type="idl:int32"/> |
| <xs:element name="second_choice"> |
| <xs:complexType> |
| <xs:choice dfdl:choiceDispatchKey="{xs:string(../second_tag)}"> |
| <xs:element name="fie" type="idl:FieType" dfdl:choiceBranchKey="1"/> |
| <xs:element name="fum" type="idl:FumType" dfdl:choiceBranchKey="2"/> |
| </xs:choice> |
| </xs:complexType> |
| </xs:element> |
| </xs:sequence> |
| </xs:complexType> |
| ---- |
| |
| The C code generator will generate _choice members and unions |
| for the first_choice and second_choice elements, |
| and such a schema will integrate better with other systems too. |
| |
| === Replace size_t with choice_t |
| |
| It has been pointed out that it is actually not obvious |
| whether _choice should be a signed or unsigned type. |
| One thought had been that _choice should be unsigned |
| to avoid cutting the usable range in half |
| and it should be size_t because |
| size_t is the maximum allowable length of any type of C array. |
| However, there are equally compelling reasons why |
| indices should be signed instead of unsigned as well |
| (<https://www.quora.com/Why-is-size_t-sometimes-used-instead-of-int-for-declaring-an-array-index-in-C-Is-there-any-difference>). |
| There appears to be no One Right Answer |
| what type _choice should have, |
| so defining a choice_t type in only one place |
| will allow us to change our mind if we need to |
| although we still would need to re-evaluate |
| every use of _choice very carefully. |
| |
| === Arrays |
| |
| Currently we create an ERD for an array with the array's name |
| and the scalar type of its first element, |
| but the ERD has no numChildren and the rest of its fields are NULL. |
| Then in the parent element's ERD, we expand and inline the array |
| into the parent element's offsets and childrenERDs |
| with incrementing offsets for each array element |
| and the same pointer to the same array ERD for each array element. |
| We also expand and inline the array |
| into the parent element's parseSelf and unparseSelf functions |
| with as many parse and unparse calls as there are array elements. |
| |
| We need to change this approach to handle arrays |
| having undetermined lengths at compile time. |
| One possible approach might be to define an ERD for an array |
| like an ERD for a complex element with one child. |
| The typeCode might become ARRAY or remain COMPLEX, |
| the numChildren would be 1, |
| the offsets would be the offset of the first array element |
| (allowing room to skip over an actual number of elements |
| stored in the C struct to the offset of the actual array, |
| or to point to memory allocated from the heap), |
| the childrenERDs would be the ERD of the first array element, |
| the parseSelf would be a function to parse all array members, |
| and the unparseSelf would be a function to unparse all array members. |
| These functions would know how to find the number of elements |
| depending on dfdl:occursCountKind when parsing |
| (fixed, implicit, parsed, expression, or stopValue) |
| and depending on a count stored in the C struct when unparsing. |
| These functions also would know how to loop as many times |
| as needed to parse or unparse each array element using the |
| first array element's ERD in childrenERDs every time. |
| |
| Note that we don't have to store a count |
| of the actual number of array elements in the C struct |
| for a dfdl:occursCountKind of fixed, expression, or stopValue. |
| Fixed means the count is a known constant at compile time. |
| Expression means the count is already stored in |
| another C struct field which we just have to find |
| via the expression when parsing and unparsing. |
| StopValue means we only need to look inside the array |
| for a stopValue when parsing and unparsing. |
| However, we do need to store an actual count in the C struct |
| for a dfdl:occursCountKind of implicit or parsed |
| because we will have no other possible way |
| to find the actual count when unparsing. |
| Our C code also should allow the count to be zero |
| without the code blowing up. |
| |
| If we want the C code to validate the array's count |
| against the array's minOccurs and maxOccurs, |
| we can inline the array's minOccurs and maxOccurs |
| into the array's parseSelf and unparseSelf functions. |
| However, we should allow the normal case to be no validation, |
| since Daffodil must not enforce min/maxOccurs |
| if the user wants to parse and unparse well-formed but invalid data |
| for forensic analysis. |
| However, we still can let min/maxOccurs influence the generated C code. |
| If maxOccurs is unbounded or the largest possible array size |
| (maxOccurs - minOccurs) is larger than a heuristic or tunable, |
| we should allocate storage for the array from the heap |
| instead of declaring storage for the array inline in the C struct. |
| The normal case should be to inline the array into the C struct |
| with the array's maximum size since bare metal C and VHDL |
| will not be able to allocate memory from a heap dynamically. |
| |
| === Making infosets more efficient |
| |
| Right now all of our C structs (infoset nodes) store an ERD pointer |
| within their first field. |
| This makes it possible to take a pointer to any infoset node |
| and interpret the infoset node correctly in all the ways we need |
| (walk the infoset node, unparse the infoset node to XML, etc.) |
| because we can indirect over to the ERD to get all the static info. |
| |
| In most cases, the ERD needed for a child complex element |
| is static information of the enclosing parent's ERD, |
| so could be stored only in the parent's ERD. |
| Inductively, most infoset nodes should not need ERD pointers |
| since the ERD "nest" up to the root is all static information. |
| Logically, we should be able to remove ERD pointers |
| from the first field of most C structs (infoset nodes), |
| avoiding taking up the first field's space |
| multiplied by however many infoset nodes the data contains. |
| |
| We probably just need to find all the places in the code |
| where we pass a pointer to an infoset node and |
| make these places pass both a pointer to an infoset node |
| and a separate pointer to the infoset node's ERD at the same time. |
| Then we can remove the infoset node's pointer to the same ERD |
| since it would already be passed into all the places needed. |
| |
| === Javadoc-like tool for C code |
| |
| We may want to adopt one of the javadoc-like tools for C code |
| and restructure our comments to create some API documentation. |
| |
| === Choice dispatch key expressions |
| |
| We currently support only a very restricted |
| and simple subset of choice dispatch key expressions. |
| We would like to refactor the DPath expression compiler |
| and make it generate C code |
| in order to support arbitrary choice dispatch key expressions. |
| |
| === Daffodil module/subdirectory names |
| |
| When Daffodil is ready to move from a 3.x to a 4.x release, |
| rename the modules to have shorter and easier to understand names |
| as discussed in https://issues.apache.org/jira/browse/DAFFODIL-2406[DAFFODIL-2406]. |
| |
| === Remove workaround for problem running sbt (really dev.dirs) from MSYS2 on Windows |
| |
| We need to open a issue with a reproducible test case |
| in the dev.dirs/directories-jvm project on GitHub. |
| Note that dev.dirs exhibits the problem |
| but they may or may not be responsible for it. |
| Their code which tries to run a Windows PowerShell script |
| using a Java subprocess call hangs |
| when run from MSYS2 on Windows |
| although it works fine when run from CMD on Windows. |
| Then we need to wait until |
| the hanging problem is fixed in the directories library, |
| coursier picks up the new directories version, |
| sbt picks up the new coursier version, |
| and daffodil picks up the new sbt version, |
| before we can remove the "echo >> $GITHUB_ENV" lines |
| from .github/workflows/main.yml |
| which prevent the sbt hanging problem. |
| |
| === Reporting data/schema locations in errors |
| |
| We have replaced error message strings |
| with error structs everywhere now. |
| However, we may need to expand the error struct |
| to include a pointer (pstate/ustate for data position) |
| and another pointer (ERD or static context object |
| for schema filename/line number). |
| |
| We also may want to implement error logging variants |
| that both do and don't humanize the errors, |
| e.g., a hardware/FPGA-type implementation might just output numbers |
| and an external tool might have to "humanize" these numbers |
| using knowledge of the schema and runtime data objects, |
| like an offline log processor does. |
| |
| === Recovering after errors |
| |
| As we continue to build out runtime2, |
| we may need to distinguish more types of errors |
| and allow backtracking and retrying. |
| Right now we handle only parse/unparse and |
| validation errors in limited ways. |
| Parse/unparse errors abort the parsing/unparsing |
| and return to the caller immediately |
| without resetting the stream's position. |
| Validation errors are collected in an array |
| and printed after parsing or unparsing. |
| The only places where there are calls to stop the program |
| are in daffodil_main.c (top-level error handling) |
| and stack.c (empty, overflow, underflow errors which should never happen). |
| |
| Most of the parse functions set pstate->error |
| only if they couldn't read data into their buffer |
| due to an I/O error or EOF, |
| which doesn't seem recoverable to me. |
| Likewise, the unparse functions set ustate->error |
| only if they couldn't write data from their buffer |
| due to an I/O error, which doesn't seem recoverable to me. |
| |
| Only the parse_endian_bool functions set pstate->error |
| if they read an integer which doesn't match either true_rep or false_rep |
| when an exact match to either is required. |
| If we decide to implement backtracking and retrying, |
| they should call fseek to reset the stream's position |
| back to where they started reading the integer |
| before they return to their callers. |
| Right now all parse calls are followed by |
| if statements to check for error and return immediately. |
| The code generator would have to generate code |
| which can advance the stream's position by some byte(s) |
| and try the parse call again as an attempt |
| to resynchronize with a correct data stream |
| after a bunch of failures. |
| |
| Note that we sometimes run the generated code in an embedded processor |
| and call our own fread/frwrite functions |
| which replace the stdio fread/fwrite functions |
| since the C code runs bare metal without OS functions. |
| We can implement the fseek function on the embedded processor too |
| but we would need a good use case requiring recovering after errors. |
| |
| === Validate "fixed" values in runtime1 too |
| |
| If we change runtime1 to validate "fixed" values |
| like runtime2 does, then we can resolve |
| https://issues.apache.org/jira/browse/DAFFODIL-117[DAFFODIL-117]. |
| |
| === No match between choice dispatch key and choice branch keys |
| |
| Right now c/daffodil is more strict than daffodil |
| when unparsing infoset XML files with no matches (or mismatches) |
| between choice dispatch keys and branch keys. |
| Such a situation always makes c/daffodil exit with an error, |
| which is too strict. |
| We should make c/daffodil load such an XML file |
| without a no match processing error |
| and unparse the infoset to a binary data file |
| without a no match processing error, |
| even if the choiceDispatchKey is invalid. |
| The choiceDispatchKey should not be evaluated |
| at unparse time, only at parse time. |
| If the schema writer wants to enforce that |
| the choiceDispatchKey is the right one |
| matching the unparsed choice branch, |
| the writer must write an explicit dfdl:outputValueCalc |
| expression to replace the choiceDispatchKey |
| even though supporting dfdl:outputValueCalc |
| in runtime2 is likely a distant goal. |