ARROW-12146: [C++][Gandiva] Implement CONVERT_FROM(expression, replacement char) function

Implement CONVERT_FROM(expression, 'UTF8', replacement char)

Converts the byte data in expression to UTF-8. The expression can be a literal string or a field name. Any invalid UTF-8 byte sequence in the input is replaced with the replacement character.

Note: Currently, only a single-byte replacement character is supported.
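
The replacement behavior described above can be illustrated with a minimal standalone sketch. This is not the Gandiva implementation: the names `utf8_seq_len` and `replace_invalid_utf8` are hypothetical, the sketch only checks the lead byte and the two most significant bits of each continuation byte (it does not reject overlong encodings or surrogates), and it assumes that each offending byte is replaced individually.

```cpp
#include <cassert>  // only needed for the usage checks mentioned below
#include <cstddef>
#include <string>

// Hypothetical helper: sequence length implied by a UTF-8 lead byte,
// or 0 if the byte cannot start a sequence.
static std::size_t utf8_seq_len(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;                             // stray continuation or invalid lead
}

// Hypothetical sketch: copy valid sequences through unchanged and emit the
// single-byte replacement for every byte that does not start a valid one.
std::string replace_invalid_utf8(const std::string& in, char replacement) {
  std::string out;
  out.reserve(in.size());
  std::size_t i = 0;
  while (i < in.size()) {
    std::size_t len = utf8_seq_len(static_cast<unsigned char>(in[i]));
    // The whole sequence must fit in the remaining input.
    bool valid = len != 0 && i + len <= in.size();
    // Every continuation byte must look like 10xxxxxx.
    for (std::size_t k = 1; valid && k < len; ++k) {
      valid = (static_cast<unsigned char>(in[i + k]) & 0xC0) == 0x80;
    }
    if (valid) {
      out.append(in, i, len);
      i += len;
    } else {
      out.push_back(replacement);
      ++i;  // skip only the offending byte, then resynchronize
    }
  }
  return out;
}
```

Under these assumptions, `replace_invalid_utf8("a\xFFz", '?')` yields `"a?z"`, while an input that is already valid UTF-8 comes back byte-for-byte unchanged.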

Closes #9844 from jpedroantunes/feature/convert-replace-utf8 and squashes the following commits:

bef6eafda <João Pedro> Add optimization for returning original string if no invalid chars were found
e7c6a71db <João Pedro> Refactor memcpy unnecessary for single byte
7aac875e7 <João Pedro> Add handler for cases with 0 char len on replace char
6544583f0 <João Pedro> Apply proper indentation on types.h and string_ops.cc in gandiva
c66efb8e4 <João Pedro> Apply corrections and optimization on convert replace function
d815f854c <João Pedro> Add validation for MSBs on convert replace utf8 Gandiva function
8e44d413d <João Pedro> Add validation for defined char length greater than 1 on convert replace
a2ea61bee <João Pedro> Adapt convert_from method to support single char on replacement (defined with dremio team)
7d4cec02c <João Pedro> Adapt convert_from method to support multiple char on replacement
1a1734b9a <João Pedro> Change string ops test for defining int variables instead of size_t
b96dfc750 <João Pedro> Fix lint problems on string ops and test files
8f9a4bde0 <João Pedro> Fix indentation on string files on gandiva module
875a1dd87 <João Pedro> Add integration test for convert replace utf8 method
536fd3a63 <João Pedro> Add definition of convert replace str method to types.h
c950c8a45 <João Pedro> Add base tests for convert replace invalid chars
2a5fe944e <João Pedro> Add base logic for convert replace utf8 invalid chars

Authored-by: João Pedro <joaop@simbioseventures.com>
Signed-off-by: Praveen <praveen@dremio.com>
5 files changed
tree: bc43608a3099bc1b5fcfdd40c63a475844fb93d8
  1. .github/
  2. c_glib/
  3. ci/
  4. cpp/
  5. csharp/
  6. dev/
  7. docs/
  8. format/
  9. go/
  10. java/
  11. js/
  12. julia/
  13. matlab/
  14. python/
  15. r/
  16. ruby/
  17. rust/
  18. .asf.yaml
  19. .clang-format
  20. .clang-tidy
  21. .clang-tidy-ignore
  22. .dir-locals.el
  23. .dockerignore
  24. .env
  25. .gitattributes
  26. .gitignore
  27. .gitmodules
  28. .hadolint.yaml
  29. .pre-commit-config.yaml
  30. .readthedocs.yml
  31. .travis.yml
  32. appveyor.yml
  33. CHANGELOG.md
  34. cmake-format.py
  35. CODE_OF_CONDUCT.md
  36. CONTRIBUTING.md
  37. docker-compose.yml
  38. header
  39. LICENSE.txt
  40. NOTICE.txt
  41. README.md
  42. run-cmake-format.py
README.md

Apache Arrow

Powering In-Memory Analytics

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast.

Major components of the project include:

Arrow is an Apache Software Foundation project. Learn more at arrow.apache.org.

What's in the Arrow libraries?

The reference Arrow libraries contain many distinct software components:

  • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types
  • Fast, language agnostic metadata messaging layer (using Google's Flatbuffers library)
  • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files
  • IO interfaces to local and remote filesystems
  • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)
  • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)
  • Conversions to and from other in-memory data structures
  • Readers and writers for various widely-used file formats (such as Parquet, CSV)

Implementation status

The official Arrow libraries in this repository are in different stages of implementing the Arrow format and related features. See our current feature matrix on git master.

How to Contribute

Please read our latest project contribution guide.

Getting involved

Even if you do not plan to contribute to Apache Arrow itself or Arrow integrations in other projects, we'd be happy to have you involved: