commit | 89af2cbe53fe4e53c948a4c389bdd14da3303687 | |
---|---|---|
author | Quanlong Huang <huangquanlong@gmail.com> | Tue Jan 25 08:20:01 2022 +0800 |
committer | GitHub <noreply@github.com> | Mon Jan 24 16:20:01 2022 -0800 |
tree | 96defb982a557e31239e4426fc13995386c737cb | |
parent | 7e12a8ccac81548fde6e5d0538da2e68c77df601 | |

ORC-1098: [C++] Support specifying type ids or column names in cpp tools (#1020)

### What changes were proposed in this pull request?

This is a follow-up task of #921. Currently the tools have options to work on specified top-level column fields. However, ACID ORC files usually have a nested structure, so type ids are needed to specify nested columns. As an extension, this PR also adds support for column names, so users don't need to convert column names to type ids manually. The tools also report the valid values when an invalid column name is given.

This PR extracts the option-parsing code into ToolsHelper so that similar cpp tools can share the same option set.

### Why are the changes needed?

It makes the tools more useful in practice.

### How was this patch tested?

Added unit tests for the new options.

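
To illustrate why type ids can address nested columns, here is a minimal Python sketch of how ORC numbers the types in a schema: ids are assigned by a pre-order walk of the type tree, with the root struct as id 0. This is not the ToolsHelper code from the PR. The `TypeNode` class, the helper names, and the two fields under `row` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TypeNode:
    """Hypothetical stand-in for one node of an ORC schema tree."""
    name: str
    kind: str
    children: list = field(default_factory=list)


def assign_type_ids(root):
    """Map dotted column names to type ids using pre-order numbering (root = 0)."""
    ids = {}
    counter = 0

    def visit(node, path):
        nonlocal counter
        ids[path] = counter
        counter += 1
        for child in node.children:
            child_path = f"{path}.{child.name}" if path else child.name
            visit(child, child_path)

    visit(root, "")
    return ids


def resolve_column(ids, name):
    """Return the type id for a column name, listing valid names on failure."""
    if name not in ids:
        valid = ", ".join(sorted(n for n in ids if n))
        raise ValueError(f"Unknown column '{name}'. Valid names: {valid}")
    return ids[name]


# An ACID-style layout: metadata fields followed by a nested "row" struct.
# The fields inside "row" are made up for this example.
ACID_SCHEMA = TypeNode("", "struct", [
    TypeNode("operation", "int"),
    TypeNode("originalTransaction", "bigint"),
    TypeNode("bucket", "int"),
    TypeNode("rowId", "bigint"),
    TypeNode("currentTransaction", "bigint"),
    TypeNode("row", "struct", [
        TypeNode("id", "int"),
        TypeNode("name", "string"),
    ]),
])

TYPE_IDS = assign_type_ids(ACID_SCHEMA)
```

With this numbering, a top-level field index cannot reach `row.id`, but its type id can. A lookup such as `resolve_column(TYPE_IDS, "row.id")` returns the id directly, and an unknown name raises an error that lists the valid names.
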
ORC is a self-describing type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query. Because ORC files are type-aware, the writer chooses the most appropriate encoding for the type and builds an internal index as the file is written. Predicate pushdown uses those indexes to determine which stripes in a file need to be read for a particular query and the row indexes can narrow the search to a particular set of 10,000 rows. ORC supports the complete set of types in Hive, including the complex types: structs, lists, maps, and unions.
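
The row-skipping idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the ORC reader: it assumes a single integer column, keeps only minimum and maximum statistics per group of 10,000 rows, and uses hypothetical names (`RowGroupStats`, `build_row_index`, `row_groups_to_read`).

```python
from dataclasses import dataclass

ROW_GROUP_SIZE = 10_000  # rows covered by one index entry, as in the text above


@dataclass
class RowGroupStats:
    """Min/max statistics for one group of rows (hypothetical structure)."""
    minimum: int
    maximum: int


def build_row_index(values, group_size=ROW_GROUP_SIZE):
    """Compute min/max per row group, as a writer would while writing the column."""
    index = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        index.append(RowGroupStats(min(chunk), max(chunk)))
    return index


def row_groups_to_read(index, lo, hi):
    """Return the row groups whose [min, max] range overlaps the predicate lo <= v <= hi."""
    return [
        i for i, stats in enumerate(index)
        if stats.maximum >= lo and stats.minimum <= hi
    ]


# A sorted column of 50,000 values yields five row groups.
column = list(range(50_000))
row_index = build_row_index(column)

# Only the second row group can contain values in [12345, 12400].
selected = row_groups_to_read(row_index, 12_345, 12_400)
```

Here `selected` is `[1]`, so a reader would skip four of the five row groups for this predicate. The statistics only show that a group may contain matching rows, so the reader still has to evaluate the predicate on the rows it reads.
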
This project includes both a Java library and a C++ library for reading and writing the Optimized Row Columnar (ORC) file format. The C++ and Java libraries are completely independent of each other, and each reads all versions of ORC files. However, the C++ library currently writes only the original (Hive 0.11) version of ORC files; this will be extended in the future.
Releases:
The current build status:
Bug tracking: Apache Jira
The subdirectories are:
To build a release version with debug information:

    % mkdir build
    % cd build
    % cmake ..
    % make package
    % make test-out

To build a debug version:

    % mkdir build
    % cd build
    % cmake .. -DCMAKE_BUILD_TYPE=DEBUG
    % make package
    % make test-out

To build a release version without debug information:

    % mkdir build
    % cd build
    % cmake .. -DCMAKE_BUILD_TYPE=RELEASE
    % make package
    % make test-out

To build only the Java library:

    % cd java
    % ./mvnw package

To build only the C++ library:

    % mkdir build
    % cd build
    % cmake .. -DBUILD_JAVA=OFF
    % make package
    % make test-out