ORC-1384: Fix `ArrayIndexOutOfBoundsException` when reading a dictionary stream bigger than the dictionary

### What changes were proposed in this pull request?
Avoid `ArrayIndexOutOfBoundsException` when reading a dictionary stream that is bigger than the dictionary. Check the size of the dictionary and of the input, and read only the minimum of the two.
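
The idea can be illustrated with a minimal, standalone sketch. It is not the code from this patch: the class name `BoundedDictionaryRead`, the method `readDictionary`, and its parameters are hypothetical and chosen only for illustration. It shows a copy loop that clamps each copy to the space left in the dictionary buffer, so a 4 kB input block cannot overrun a smaller dictionary.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; names and structure are assumptions,
// not the actual ORC reader code.
final class BoundedDictionaryRead {

  // Fills a buffer of dictionarySize bytes from the stream, reading in
  // chunks of chunkSize and never copying past the end of the buffer.
  static byte[] readDictionary(InputStream in, int dictionarySize, int chunkSize)
      throws IOException {
    if (chunkSize <= 0) {
      throw new IllegalArgumentException("chunkSize must be positive");
    }
    byte[] dictionary = new byte[dictionarySize];
    byte[] chunk = new byte[chunkSize];
    int pos = 0;
    while (pos < dictionarySize) {
      int bytesRead = in.read(chunk, 0, chunk.length);
      if (bytesRead < 0) {
        break; // stream ended before the dictionary was full
      }
      // The stream may hand back more bytes than the dictionary can hold,
      // so copy only the minimum of what was read and what still fits.
      int toCopy = Math.min(bytesRead, dictionarySize - pos);
      System.arraycopy(chunk, 0, dictionary, pos, toCopy);
      pos += toCopy;
    }
    return dictionary;
  }

  public static void main(String[] args) throws IOException {
    // A 4096-byte input block feeding a 100-byte dictionary.
    byte[] block = new byte[4096];
    byte[] dictionary = readDictionary(new ByteArrayInputStream(block), 100, 4096);
    System.out.println(dictionary.length); // prints 100
  }
}
```

Without the `Math.min` clamp, the `System.arraycopy` call in this sketch would attempt to copy all 4096 bytes into the 100-byte buffer and fail with an out-of-bounds exception.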

### Why are the changes needed?
When Hive reads through LLAP, data is read in 4 kB blocks, which leads to an `ArrayIndexOutOfBoundsException` when the dictionary is smaller than the block that was read.

### How was this patch tested?
It was tested with Hive's qtest, since the necessary subclasses are not available in this repository.

Closes #1431 from zratkai/ORC-1384.

Lead-authored-by: Zoltan Ratkai <zratkai@cloudera.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 8cf9057fc498f977125be3b721daf2170330b3f9)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
1 file changed
README.md

Apache ORC

ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query. Because ORC files are type-aware, the writer chooses the most appropriate encoding for the type and builds an internal index as the file is written. Predicate pushdown uses those indexes to determine which stripes in a file need to be read for a particular query, and the row indexes can narrow the search to a particular set of 10,000 rows. ORC supports the complete set of types in Hive, including the complex types: structs, lists, maps, and unions.

ORC File Library

This project includes both a Java library and a C++ library for reading and writing the Optimized Row Columnar (ORC) file format. The C++ and Java libraries are completely independent of each other and will each read all versions of ORC files.

Releases:

  • Latest: Apache ORC releases
  • Maven Central: Maven Central
  • Downloads: Apache ORC downloads
  • Release tags: Apache ORC release tags
  • Plan: Apache ORC future release plan

The current build status:

  • Main branch main build status

Bug tracking: Apache Jira

The subdirectories are:

  • c++ - the C++ reader and writer
  • cmake_modules - the CMake modules
  • docker - Docker scripts to build and test on various Linux distributions
  • examples - various ORC example files that are used to test compatibility
  • java - the Java reader and writer
  • proto - the protocol buffer definition for the ORC metadata
  • site - the website and documentation
  • tools - the C++ tools for reading and inspecting ORC files

Building

  • Install Java 1.8 or higher
  • Install Maven 3.8.6 or higher
  • Install CMake

To build a release version with debug information:

% mkdir build
% cd build
% cmake ..
% make package
% make test-out

To build a debug version:

% mkdir build
% cd build
% cmake .. -DCMAKE_BUILD_TYPE=DEBUG
% make package
% make test-out

To build a release version without debug information:

% mkdir build
% cd build
% cmake .. -DCMAKE_BUILD_TYPE=RELEASE
% make package
% make test-out

To build only the Java library:

% cd java
% ./mvnw package

To build only the C++ library:

% mkdir build
% cd build
% cmake .. -DBUILD_JAVA=OFF
% make package
% make test-out