ARROW-12288: [C++] Create Scanner interface

To prepare for the AsyncScanner, this PR creates a Scanner interface and, along the way, simplifies the current Scanner API so that the new scanner has less surface area to match.

## What is removed

* `Scanner::GetFragments` was only used in `FileSystemDataset::Write`.  The correct source of truth for fragments is the `Dataset`.  Note: the Python implementation exposed this method, but it was not documented or used in any unit test.  I think it can be safely removed and we need not worry about deprecation.
* `Scanner::schema` is redundant and ambiguous.  There are two schemas at the scan level: the dataset schema (the unified master schema that we expect all fragment schemas to be a subset of) and the projected schema (a combination of the dataset schema and the projection expression).  Both are available on the scan options object, and there is an accessor for those options, so the caller might as well get them from there (see the sketch after this list).  This schema function was exposed via R and used internally there, but I think any uses can easily be changed to use the options.
* `FileFormat::splittable` and `Fragment::splittable`.  These were intended to advertise that batch readahead was available on the given fragment/format.  However, there is no need to advertise this.  They are not used by the `SyncScanner`, and the `AsyncScanner` will simply assume that the format/fragment will utilize readahead if it can (respecting the readahead options in `ScanOptions`).
* Direct instantiation of `Scanner`.  All `Scanner` creation should now go through `ScannerBuilder`, as shown in the sketch below.  This allows the `ScannerBuilder` to determine which implementation to use.  This was mostly how things were implemented already; only a few tests instantiated a `Scanner` directly.
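
For illustration, a minimal sketch (in the spirit of this PR, not a verbatim excerpt) of what creation and schema access look like after these removals.  The `dataset` argument, the helper names, and the projected column names `"a"`/`"b"` are assumptions; exact signatures may vary:

```cpp
#include <memory>

#include <arrow/dataset/api.h>
#include <arrow/result.h>
#include <arrow/type.h>

namespace ds = arrow::dataset;

// `dataset` is assumed to be an already-constructed ds::Dataset.
arrow::Result<std::shared_ptr<ds::Scanner>> MakeScanner(
    std::shared_ptr<ds::Dataset> dataset) {
  // All Scanner creation now goes through ScannerBuilder, which picks
  // the implementation for the caller.
  ds::ScannerBuilder builder(std::move(dataset));
  // Hypothetical projection; "a" and "b" are assumed column names.
  ARROW_RETURN_NOT_OK(builder.Project({"a", "b"}));
  return builder.Finish();
}

// Instead of the removed Scanner::schema, read both schemas off the options.
void InspectSchemas(const ds::Scanner& scanner) {
  std::shared_ptr<ds::ScanOptions> options = scanner.options();
  std::shared_ptr<arrow::Schema> dataset_schema = options->dataset_schema;
  std::shared_ptr<arrow::Schema> projected_schema = options->projected_schema;
  // ... use the schemas ...
}
```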

## What is deprecated

* `Scanner::Scan` is going to be deprecated (ARROW-11797).  It will not be implemented by `AsyncScanner`.  I do not actually deprecate it in this PR, as I reserve that for ARROW-11797.  Unfortunately, this method was exposed via Python & R and was likely used, so deprecation is recommended over outright removal.

## What is new

* `Scanner::ScanBatches` and `Scanner::ScanBatchesUnordered` have been added.  These will be the new preferred "scan" methods going forward (see the scan-loop sketch after this list).  They allow the parallelization (batch readahead, file readahead, etc.) to be handled by C++ and simplify the user's life.
* `ScanOptions::batch_readahead` and `ScanOptions::fragment_readahead` allow more fine-grained control over how readahead is performed.  One technicality is that these options will not be respected well by the `SyncScanner` (although I think the current ARROW-11797 work utilizes batch readahead), so they are more placeholders for when we implement `AsyncScanner`.
* `ScanOptions::cpu_executor` and `ScanOptions::io_context` are added and should be fairly self-explanatory.
* `ScanOptions::use_async` will toggle which scanner to use.
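
As a hedged sketch of the new scan loop: `ScanAll` is a hypothetical helper, and the `UseAsync` builder toggle is assumed to map onto `ScanOptions::use_async` (exact setter names may differ; the readahead fields could equally be set on the options directly):

```cpp
#include <memory>

#include <arrow/dataset/api.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/util/iterator.h>

namespace ds = arrow::dataset;

arrow::Status ScanAll(std::shared_ptr<ds::Dataset> dataset) {
  ds::ScannerBuilder builder(std::move(dataset));
  // Assumed builder toggle corresponding to ScanOptions::use_async.
  ARROW_RETURN_NOT_OK(builder.UseAsync(true));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder.Finish());

  // ScanBatches yields record batches tagged with the fragment they came
  // from; batch/fragment readahead happens inside the scanner, governed by
  // ScanOptions::batch_readahead and ScanOptions::fragment_readahead.
  ARROW_ASSIGN_OR_RAISE(auto batches, scanner->ScanBatches());
  for (;;) {
    ARROW_ASSIGN_OR_RAISE(ds::TaggedRecordBatch tagged, batches.Next());
    if (arrow::IsIterationEnd(tagged)) break;
    std::shared_ptr<arrow::RecordBatch> batch = tagged.record_batch;
    // ... process `batch`; `tagged.fragment` identifies its source ...
  }
  return arrow::Status::OK();
}
```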

Closes #9947 from westonpace/feature/arrow-12288

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>