commit | e0617809b7deeb76ae0ca21c7faf7bd82e96c74d |
---|---|---
author | Antoine Pitrou <antoine@python.org> | Thu Nov 08 19:27:40 2018 -0500
committer | Wes McKinney <wesm+git@apache.org> | Thu Nov 08 19:27:40 2018 -0500
tree | 5b25613f4a51b159fb8a06f4c4426e3e53110c31 |
parent | a76bab8303ebf575c62b69be3474d00039c021cb |
ARROW-3536: [C++] Add UTF8 validation functions

The baseline UTF8 decoder is adapted from Bjoern Hoehrmann's DFA-based implementation. The common case of runs of ASCII chars benefits from a fast path handling 8 bytes at a time.

Benchmark results (on a Ryzen 7 machine with gcc 7.3):

```
-----------------------------------------------------------------------------
Benchmark                                   Time         CPU   Iterations
-----------------------------------------------------------------------------
BM_ValidateTinyAscii/repeats:1             3 ns        3 ns    245245630   3.26202GB/s
BM_ValidateTinyNonAscii/repeats:1          7 ns        7 ns    104679950   1.54295GB/s
BM_ValidateSmallAscii/repeats:1           10 ns       10 ns     66365983   13.0928GB/s
BM_ValidateSmallAlmostAscii/repeats:1     37 ns       37 ns     18755439   3.69415GB/s
BM_ValidateSmallNonAscii/repeats:1        68 ns       68 ns     10267387   1.82934GB/s
BM_ValidateLargeAscii/repeats:1         4140 ns     4140 ns       171331   22.5003GB/s
BM_ValidateLargeAlmostAscii/repeats:1  24472 ns    24468 ns        28565   3.80816GB/s
BM_ValidateLargeNonAscii/repeats:1     50420 ns    50411 ns        13830   1.84927GB/s
```

The case of tiny strings is probably the most important for the use case of CSV type inference.

PS: benchmarks on the same machine with clang 6.0:

```
-----------------------------------------------------------------------------
Benchmark                                   Time         CPU   Iterations
-----------------------------------------------------------------------------
BM_ValidateTinyAscii/repeats:1             3 ns        3 ns    213945214   2.84658GB/s
BM_ValidateTinyNonAscii/repeats:1          8 ns        8 ns     90916423   1.33072GB/s
BM_ValidateSmallAscii/repeats:1            7 ns        7 ns     91498265   17.4425GB/s
BM_ValidateSmallAlmostAscii/repeats:1     34 ns       34 ns     20750233   4.08138GB/s
BM_ValidateSmallNonAscii/repeats:1        58 ns       58 ns     12063206   2.14002GB/s
BM_ValidateLargeAscii/repeats:1         3999 ns     3999 ns       175099   23.2937GB/s
BM_ValidateLargeAlmostAscii/repeats:1  21783 ns    21779 ns        31738   4.27822GB/s
BM_ValidateLargeNonAscii/repeats:1     55162 ns    55153 ns        12526   1.69028GB/s
```

Author: Antoine Pitrou <antoine@python.org>

Closes #2916 from pitrou/ARROW-3536-utf8-validation and squashes the following commits:

9c9713b78 <Antoine Pitrou> Improve benchmarks
e6f23963a <Antoine Pitrou> Use a larger state table allowing for single lookups
29d6e347c <Antoine Pitrou> Help clang code gen
e621b220f <Antoine Pitrou> Use memcpy for safe aligned reads, and improve speed of non-ASCII runs
89f6843d9 <Antoine Pitrou> ARROW-3536: Add UTF8 validation functions
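To illustrate the structure the commit message describes, here is a minimal sketch of an 8-bytes-at-a-time ASCII fast path built on a `memcpy`-based read, with a per-sequence fallback. All names here (`AllAscii8`, `ValidateUTF8Sketch`) are hypothetical rather than Arrow's actual API, and the fallback uses plain range checks instead of the Hoehrmann DFA table the commit actually adapts:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not Arrow's API): true if the next 8 bytes are all
// ASCII, i.e. the high bit is clear in every byte.
static inline bool AllAscii8(const uint8_t* p) {
  uint64_t word;
  std::memcpy(&word, p, 8);  // memcpy keeps the unaligned read well-defined
  return (word & 0x8080808080808080ULL) == 0;
}

// Sketch of a validator combining the ASCII fast path with a slow path.
// The slow path below uses plain range checks, not the commit's DFA.
bool ValidateUTF8Sketch(const uint8_t* data, size_t size) {
  size_t i = 0;
  while (i < size) {
    // Fast path: consume whole 8-byte blocks while they are pure ASCII.
    while (i + 8 <= size && AllAscii8(data + i)) {
      i += 8;
    }
    if (i == size) break;
    const uint8_t byte = data[i];
    if (byte < 0x80) {  // lone ASCII byte near a non-ASCII run or the end
      ++i;
      continue;
    }
    size_t len;
    uint32_t cp;
    if ((byte & 0xE0) == 0xC0) {
      len = 2; cp = byte & 0x1F;
    } else if ((byte & 0xF0) == 0xE0) {
      len = 3; cp = byte & 0x0F;
    } else if ((byte & 0xF8) == 0xF0) {
      len = 4; cp = byte & 0x07;
    } else {
      return false;  // invalid lead byte (stray continuation or 0xF8..0xFF)
    }
    if (i + len > size) return false;  // truncated sequence
    for (size_t j = 1; j < len; ++j) {
      if ((data[i + j] & 0xC0) != 0x80) return false;  // bad continuation
      cp = (cp << 6) | (data[i + j] & 0x3F);
    }
    // Reject overlong encodings, UTF-16 surrogates, and values > U+10FFFF.
    if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
        (len == 4 && cp < 0x10000) || (cp >= 0xD800 && cp <= 0xDFFF) ||
        cp > 0x10FFFF) {
      return false;
    }
    i += len;
  }
  return true;
}
```

In the committed code, the per-byte slow path is instead a single table-driven state transition in Hoehrmann's DFA, which keeps non-ASCII runs branch-light; the `memcpy` read corresponds to the "Use memcpy for safe ... reads" commit listed above.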