Improve performance of padding removal when parsing

The current algorithm to remove right padding of left justified strings
first reverses the String, removes leading pad characters using
dropWhile, and then reverses the result. The two reverses are linear in
the length of the String, and require allocating multiple String
instances and copying characters from one to the other. This is done
regardless of how many, if any, pad chars exist in the String. This
logic is very clear, but is fairly inefficient, enough to show up while
profiling.
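
For reference, the reverse-based approach described above can be
sketched roughly as follows. This is an illustration only, not the
actual Daffodil code: the object name, method name, and parameters are
hypothetical.

```scala
object ReverseTrimSketch {
  // Hypothetical helper: strips trailing pad characters by reversing the
  // String, dropping the now-leading pad characters, and reversing again.
  // Both reverse calls walk the whole String, even when nothing is padded.
  def removeRightPadding(str: String, padChar: Char): String =
    str.reverse.dropWhile(_ == padChar).reverse
}
```

For example, ReverseTrimSketch.removeRightPadding("abc   ", ' ') yields
"abc", but only after building two reversed intermediate Strings.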

To improve performance, this rewrites the algorithm to scan through the
String in reverse to find where the trailing pad characters begin and
then uses the substring() function to create a new String with those
pad characters removed. The scan is now linear in the number of pad
characters in a String instead of the full length of the String.
Additionally, the use of substring() avoids the intermediate Strings
and repeated character copies of the reverse-based approach: on current
JVMs it makes at most one copy of the retained characters and returns
the original String unchanged when there are no pad characters to
remove.
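
A minimal sketch of the reverse scan plus substring() idea, assuming a
single pad character. The names here are hypothetical and the real
change may differ in its details.

```scala
object RightTrimSketch {
  // Hypothetical helper: removes right padding from a left justified String.
  def removeRightPadding(str: String, padChar: Char): String = {
    // Walk backwards from the end while the characters are padding.
    var index = str.length - 1
    while (index >= 0 && str.charAt(index) == padChar) index -= 1
    // index is now the last non-pad position, or -1 if everything was
    // padding, so index + 1 is the exclusive end of the kept characters.
    str.substring(0, index + 1)
  }
}
```

The loop only touches the trailing pad characters plus one more, so an
unpadded String costs a single character comparison.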

I have not looked in detail at how Scala implements dropWhile() for
Strings (skimming the code, it looks like it will allocate a new String
and copy characters), but for consistency and maximum performance, this
also updates the algorithm that removes left padding of right justified
strings to use logic similar to the new right padding algorithm. By
using substring() we should avoid possible copies.
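
The left padding case can be sketched the same way, scanning forward
instead of backward. Again, the names are hypothetical and this is only
an illustration of the approach described above.

```scala
object LeftTrimSketch {
  // Hypothetical helper: removes left padding from a right justified String.
  def removeLeftPadding(str: String, padChar: Char): String = {
    // Walk forward from the start while the characters are padding.
    var index = 0
    while (index < str.length && str.charAt(index) == padChar) index += 1
    // index is now the first non-pad position, or str.length if the whole
    // String was padding, in which case substring returns "".
    str.substring(index)
  }
}
```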

In one test with lots of left justified strings, many of which are
padded, this saw about a 15% improvement in parse times (excluding
infoset creation by using the null infoset outputter), and padding removal
no longer shows up while profiling.

DAFFODIL-2868
1 file changed
README.md

Apache Daffodil is an open-source implementation of the DFDL specification that uses DFDL data descriptions to parse fixed format data into an infoset. This infoset is commonly converted into XML or JSON to enable the use of well-established XML or JSON technologies and libraries to consume, inspect, and manipulate fixed format data in existing solutions. Daffodil is also capable of serializing or “unparsing” data back to the original data format. The DFDL infoset can also be converted directly to/from the data structures carried by data processing frameworks so as to bypass any XML/JSON overheads.

For more information about Daffodil, see https://daffodil.apache.org/.

Build Requirements

  • Java 8 or higher
  • sbt 0.13.8 or higher
  • C compiler C99 or higher
  • Mini-XML Version 3.0 or higher

See BUILD.md for more details and DEVELOP.md for a developer guide.

Getting Started

sbt is the officially supported tool to build Daffodil. Below are some of the more commonly used commands for Daffodil development.

Compile

Compile source code:

sbt compile

Test

Check all unit tests pass:

sbt test

Check all integration tests pass:

sbt daffodil-test-integration/test

Format

Check format of source and sbt files:

sbt scalafmtCheckAll scalafmtSbtCheck

Reformat source and sbt files if necessary:

sbt scalafmtAll scalafmtSbt

Build

Build the Daffodil command line interface (Linux and Windows shell scripts in daffodil-cli/target/universal/stage/bin/; see the Command Line Interface documentation for details on their usage):

sbt daffodil-cli/stage

Publish the Daffodil jars to a Maven repository (for Java projects) or Ivy repository (for Scala or schema projects).

Maven (for Java or mvn):

sbt publishM2

Ivy (for Scala or sbt):

sbt publishLocal

Check Licenses

Run Apache RAT (license audit report in target/rat.txt and error if any unapproved licenses are found):

sbt ratCheck

Check Coverage

Run sbt-scoverage (report in target/scala-ver/scoverage-report/):

sbt clean coverage test daffodil-test-integration/test
sbt coverageAggregate

Getting Help

You can ask questions on the dev@daffodil.apache.org or users@daffodil.apache.org mailing lists. You can report bugs via the Daffodil JIRA.

License

Apache Daffodil is licensed under the Apache License, v2.0.