commit | a102ba2f8b0054871eb441bbf6dc007a9b448ee7 | |
---|---|---|
author | Weston Pace <weston.pace@gmail.com> | Tue Apr 13 08:51:51 2021 -0400 |
committer | David Li <li.davidm96@gmail.com> | Tue Apr 13 08:51:51 2021 -0400 |
tree | e44ae344a8ad2385090025ebf49ac7e573f8ff54 | |
parent | 1ed681912be7246695cdd938ea632e1751403f67 | |
ARROW-12288: [C++] Create Scanner interface

To prepare for the `AsyncScanner`, this PR creates a `Scanner` interface and, along the way, simplifies the current `Scanner` API so that the new scanner won't need to match the parts that are removed.

## What is removed

* `Scanner::GetFragments` was only used in `FileSystemDataset::Write`. The correct source of truth for fragments is the `Dataset`. Note: the Python implementation exposed this method, but it was not documented or used in any unit test. I think it can be safely removed and we need not worry about deprecation.
* `Scanner::schema` is redundant and ambiguous. There are two schemas at the scan level: the dataset schema (the unified master schema that we expect all fragment schemas to be a subset of) and the projection schema (a combination of the dataset schema and the projection expression). Both are available on the scan options object, and there is an accessor for these options, so the caller might as well get them from there. This schema function was exposed via R and used internally there, but I think any uses can easily be changed to use the options.
* `FileFormat::splittable` and `Fragment::splittable`. These were intended to advertise that batch readahead was available on the given fragment/format. However, there is no need to advertise this. They are not used by the `SyncScanner`, and the `AsyncScanner` will just assume that the formats/fragments will utilize readahead if they can (respecting the readahead options in `ScanOptions`).
* Direct instantiation of `Scanner`. All `Scanner` creation should go through `ScannerBuilder` now. This allows the `ScannerBuilder` to determine which implementation to use. This was mostly the way things were implemented already. Only a few tests instantiated a `Scanner` directly.

## What is deprecated

* `Scanner::Scan` is going to be deprecated (ARROW-11797). It will not be implemented by `AsyncScanner`. I do not actually deprecate it in this PR as I reserve that for ARROW-11797. Unfortunately, this method was exposed via Python and R and was likely used, so deprecation is recommended over outright removal.

## What is new

* `Scanner::ScanBatches` and `Scanner::ScanBatchesUnordered` have been added. These functions will be the preferred "scan" methods going forward. This allows the parallelization (batch readahead, file readahead, etc.) to be handled by C++ and simplifies the user's life.
* `ScanOptions::batch_readahead` and `ScanOptions::fragment_readahead` options allow more fine-grained control over how to perform readahead. One technicality is that these options will not be respected well by the `SyncScanner` (although I think the current ARROW-11797 utilizes batch readahead), so they are more placeholders for when we implement `AsyncScanner`.
* `ScanOptions::cpu_executor` and `ScanOptions::io_context` are added and should be fairly self-explanatory.
* `ScanOptions::use_async` will toggle which scanner to use.

Closes #9947 from westonpace/feature/arrow-12288

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
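The builder-selects-the-implementation idea described in the commit message can be sketched in plain C++. This is a simplified model for illustration, not Arrow's actual code: the classes below only borrow the names `Scanner`, `SyncScanner`, `AsyncScanner`, `ScannerBuilder`, `ScanOptions`, and `use_async` from the message, while `Batch`, `name()`, `UseAsync()`, the constructor signatures, and the in-memory "dataset" are assumptions made up for the example. The hypothetical `AsyncScanner` here does no asynchronous work at all; it exists only to show the dispatch.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a record batch; not an Arrow type.
using Batch = std::string;

// Simplified options object. Only use_async is taken from the commit message.
struct ScanOptions {
  bool use_async = false;  // toggles which scanner implementation is built
};

// Abstract interface: callers only see this type.
class Scanner {
 public:
  virtual ~Scanner() = default;
  // Returns all batches in order (the real API returns an iterator instead).
  virtual std::vector<Batch> ScanBatches() = 0;
  // Illustrative helper so the chosen implementation can be inspected.
  virtual std::string name() const = 0;
  const std::shared_ptr<ScanOptions>& options() const { return options_; }

 protected:
  Scanner(std::vector<Batch> data, std::shared_ptr<ScanOptions> options)
      : data_(std::move(data)), options_(std::move(options)) {}
  std::vector<Batch> data_;
  std::shared_ptr<ScanOptions> options_;
};

// Constructors are private so that scanners can only come from the builder.
class SyncScanner : public Scanner {
 public:
  std::vector<Batch> ScanBatches() override { return data_; }
  std::string name() const override { return "SyncScanner"; }

 private:
  friend class ScannerBuilder;
  using Scanner::Scanner;
};

class AsyncScanner : public Scanner {
 public:
  // Placeholder: a real asynchronous scanner would schedule readahead here.
  std::vector<Batch> ScanBatches() override { return data_; }
  std::string name() const override { return "AsyncScanner"; }

 private:
  friend class ScannerBuilder;
  using Scanner::Scanner;
};

// The builder owns the options and decides which implementation to create.
class ScannerBuilder {
 public:
  explicit ScannerBuilder(std::vector<Batch> data)
      : data_(std::move(data)), options_(std::make_shared<ScanOptions>()) {}

  ScannerBuilder& UseAsync(bool use_async) {
    options_->use_async = use_async;
    return *this;
  }

  std::shared_ptr<Scanner> Finish() const {
    assert(options_ != nullptr);
    if (options_->use_async) {
      return std::shared_ptr<Scanner>(new AsyncScanner(data_, options_));
    }
    return std::shared_ptr<Scanner>(new SyncScanner(data_, options_));
  }

 private:
  std::vector<Batch> data_;
  std::shared_ptr<ScanOptions> options_;
};
```

In this sketch, `ScannerBuilder(data).UseAsync(true).Finish()` yields a `Scanner` pointer backed by the hypothetical `AsyncScanner`, and the caller reaches the options through `options()` rather than through a separate schema accessor, which mirrors the reasoning given for removing `Scanner::schema`.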
Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast.
Major components of the project include:
Arrow is an Apache Software Foundation project. Learn more at arrow.apache.org.
The reference Arrow libraries contain many distinct software components:
The official Arrow libraries in this repository are in different stages of implementing the Arrow format and related features. See our current feature matrix on git master.
Please read our latest project contribution guide.
Even if you do not plan to contribute to Apache Arrow itself or Arrow integrations in other projects, we'd be happy to have you involved: