ARROW-12288: [C++] Create Scanner interface

To prepare for the AsyncScanner, this PR creates a Scanner interface and, along the way, simplifies the current Scanner API so that the new scanner has less surface area to match.

## What is removed

* `Scanner::GetFragments` was only used in `FileSystemDataset::Write`.  The correct source of truth for fragments is the `Dataset`.  Note: the Python implementation exposed this method, but it was not documented or used in any unit test.  I think it can be safely removed and we need not worry about deprecation.
* `Scanner::schema` is redundant and ambiguous.  There are two schemas at the scan level: the dataset schema (the unified master schema that we expect all fragment schemas to be a subset of) and the projected schema (a combination of the dataset schema and the projection expression).  Both are available on the scan options object, and there is an accessor for those options, so the caller might as well get them from there (see the sketch after this list).  This schema function was exposed via R and used internally there, but I think any uses can easily be changed to use the options.
* `FileFormat::splittable` and `Fragment::splittable`.  These were intended to advertise that batch readahead was available on the given fragment/format.  However, there is no need to advertise this.  They are not used by the `SyncScanner`, and the `AsyncScanner` will simply assume that the format/fragment will utilize readahead if it can (respecting the readahead options in `ScanOptions`).
* Direct instantiation of `Scanner`.  All `Scanner` creation should now go through `ScannerBuilder`, as shown in the sketch below.  This allows the `ScannerBuilder` to determine which implementation to use.  This was mostly how things were implemented already; only a few tests instantiated a `Scanner` directly.
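
For illustration, a minimal sketch (in the spirit of this PR, not a verbatim excerpt) of what creation and schema access look like after these removals.  The `dataset` argument, the helper names, and the projected column names `"a"`/`"b"` are assumptions; exact signatures may vary:

```cpp
#include <memory>

#include <arrow/dataset/api.h>
#include <arrow/result.h>
#include <arrow/type.h>

namespace ds = arrow::dataset;

// `dataset` is assumed to be an already-constructed ds::Dataset.
arrow::Result<std::shared_ptr<ds::Scanner>> MakeScanner(
    std::shared_ptr<ds::Dataset> dataset) {
  // All Scanner creation now goes through ScannerBuilder, which picks
  // the implementation for the caller.
  ds::ScannerBuilder builder(std::move(dataset));
  // Hypothetical projection; "a" and "b" are assumed column names.
  ARROW_RETURN_NOT_OK(builder.Project({"a", "b"}));
  return builder.Finish();
}

// Instead of the removed Scanner::schema, read both schemas off the options.
void InspectSchemas(const ds::Scanner& scanner) {
  std::shared_ptr<ds::ScanOptions> options = scanner.options();
  std::shared_ptr<arrow::Schema> dataset_schema = options->dataset_schema;
  std::shared_ptr<arrow::Schema> projected_schema = options->projected_schema;
  // ... use the schemas ...
}
```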

## What is deprecated

* `Scanner::Scan` is going to be deprecated (ARROW-11797).  It will not be implemented by `AsyncScanner`.  I do not actually deprecate it in this PR, as I reserve that for ARROW-11797.  Unfortunately, this method was exposed via Python & R and was likely used, so deprecation is recommended over outright removal.

## What is new

* `Scanner::ScanBatches` and `Scanner::ScanBatchesUnordered` have been added.  These will be the new preferred "scan" methods going forward (see the scan-loop sketch after this list).  They allow the parallelization (batch readahead, file readahead, etc.) to be handled by C++ and simplify the user's life.
* `ScanOptions::batch_readahead` and `ScanOptions::fragment_readahead` allow more fine-grained control over how readahead is performed.  One technicality is that these options will not be respected well by the `SyncScanner` (although I think the current ARROW-11797 work utilizes batch readahead), so they are more placeholders for when we implement `AsyncScanner`.
* `ScanOptions::cpu_executor` and `ScanOptions::io_context` are added and should be fairly self-explanatory.
* `ScanOptions::use_async` will toggle which scanner to use.
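
As a hedged sketch of the new scan loop: `ScanAll` is a hypothetical helper, and the `UseAsync` builder toggle is assumed to map onto `ScanOptions::use_async` (exact setter names may differ; the readahead fields could equally be set on the options directly):

```cpp
#include <memory>

#include <arrow/dataset/api.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/util/iterator.h>

namespace ds = arrow::dataset;

arrow::Status ScanAll(std::shared_ptr<ds::Dataset> dataset) {
  ds::ScannerBuilder builder(std::move(dataset));
  // Assumed builder toggle corresponding to ScanOptions::use_async.
  ARROW_RETURN_NOT_OK(builder.UseAsync(true));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder.Finish());

  // ScanBatches yields record batches tagged with the fragment they came
  // from; batch/fragment readahead happens inside the scanner, governed by
  // ScanOptions::batch_readahead and ScanOptions::fragment_readahead.
  ARROW_ASSIGN_OR_RAISE(auto batches, scanner->ScanBatches());
  for (;;) {
    ARROW_ASSIGN_OR_RAISE(ds::TaggedRecordBatch tagged, batches.Next());
    if (arrow::IsIterationEnd(tagged)) break;
    std::shared_ptr<arrow::RecordBatch> batch = tagged.record_batch;
    // ... process `batch`; `tagged.fragment` identifies its source ...
  }
  return arrow::Status::OK();
}
```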

Closes #9947 from westonpace/feature/arrow-12288

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>