PARQUET-834: Support I/O of arrow::ListArray

Author: Korn, Uwe <Uwe.Korn@blue-yonder.com>

Closes #229 from xhochy/PARQUET-834 and squashes the following commits:

ba68dec [Korn, Uwe] Remove signed/unsigned comparisons
0967992 [Korn, Uwe] Remove signed/unsigned comparisons
05979c3 [Korn, Uwe] Add missing RETURN_NOT_OK
6484e86 [Korn, Uwe] Remove unused member
e58a4e9 [Korn, Uwe] ListofLists finally work
e8267c7 [Korn, Uwe] Add test for 2 level List
f59da0c [Korn, Uwe] No need to distinguish anymore between different array types
1dc3bbe [Korn, Uwe] Determine values inputs
0ec90e9 [Korn, Uwe] Style fixes
ee609e5 [Korn, Uwe] Unify level generation
17cfe15 [Korn, Uwe] Write lists of any depth
75a4871 [Korn, Uwe] Directly use TypedWriteBatch
89b3e35 [Korn, Uwe] Remove unused import
ccdf25c [Korn, Uwe] Use TypedWriteBatch for all list cases
d7e09cf [Korn, Uwe] Reuse TypedWriteBatch for lists
d1b82d3 [Korn, Uwe] Activate fast path for timestamp type
0b98475 [Korn, Uwe] TypedWriteBatch should be applicable for all definition levels
34bea2f [Korn, Uwe] Push level generation one level up
89aaa8c [Korn, Uwe] Remove empty if section
0cda75b [Korn, Uwe] Refactor level generation into separate method
c50f9f7 [Korn, Uwe] Adjust WriteSpaced to behave as ReadSpaced
c76b7f3 [Korn, Uwe] Simplify list unittest
fbfe2a4 [Korn, Uwe] Review comments
bcef2b9 [Korn, Uwe] Make compatible schema detection more readable
be05282 [Korn, Uwe] Reuse repeated test code
856f75c [Korn, Uwe] Fix signed comparison
75c920d [Korn, Uwe] Correctly handle empty lists
93e92ab [Korn, Uwe] Fix benchmark compilation
0201578 [Korn, Uwe] Remove dead ASSERTs
e78cb13 [Korn, Uwe] Add support for lists with max_definition_level = 2
f43effc [Korn, Uwe] Remove 'Flat' from the reader API
e3b7f58 [Korn, Uwe] Update arrow hash
e44c1d8 [Korn, Uwe] Support boolean lists of lists
f46c056 [Korn, Uwe] Add UINT32 support
d20f2be [Korn, Uwe] Read string and binary listarrays
8d6b08b [Korn, Uwe] Remove 'Flat' from the writer API
f9ab91d [Korn, Uwe] PARQUET-834: Support I/O of arrow::ListArray
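
For context, below is a minimal sketch of the kind of round trip this patch enables: writing an arrow::ListArray to Parquet through the parquet::arrow layer. This is not code from the patch; the API names used here (ListBuilder, WriteTable, FileOutputStream::Open returning arrow::Result) follow current Apache Arrow releases and may differ from the parquet-cpp revision this commit targeted.

#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>

// Sketch: build a list<int32> array [[1, 2], [], [3]] and write it to Parquet.
arrow::Status WriteListColumn(const std::string& path) {
  arrow::MemoryPool* pool = arrow::default_memory_pool();

  auto value_builder = std::make_shared<arrow::Int32Builder>(pool);
  arrow::ListBuilder list_builder(pool, value_builder);
  ARROW_RETURN_NOT_OK(list_builder.Append());
  ARROW_RETURN_NOT_OK(value_builder->AppendValues({1, 2}));
  ARROW_RETURN_NOT_OK(list_builder.Append());  // an empty list
  ARROW_RETURN_NOT_OK(list_builder.Append());
  ARROW_RETURN_NOT_OK(value_builder->Append(3));

  std::shared_ptr<arrow::Array> array;
  ARROW_RETURN_NOT_OK(list_builder.Finish(&array));

  auto schema = arrow::schema({arrow::field("values", arrow::list(arrow::int32()))});
  auto table = arrow::Table::Make(schema, {array});

  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  return parquet::arrow::WriteTable(*table, pool, sink, /*chunk_size=*/1024);
}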
README.md

parquet-cpp: a C++ library to read and write the Apache Parquet columnar data format.

Third Party Dependencies

  • Apache Arrow (memory management, built-in IO, optional Array adapters)
  • snappy
  • zlib
  • Thrift 0.7+
  • googletest 1.7.0 (cannot be installed with package managers)
  • Google Benchmark (only required if building benchmarks)

You can either install these dependencies separately, or they will be built automatically as part of the build.

Note that Thrift will not be built inside the project on macOS. Instead, you should install it via Homebrew:

brew install thrift

Build

  • cmake .

    • You can customize build dependency locations through various environment variables, as shown in the example after this list:
      • ARROW_HOME customizes the Apache Arrow installed location.
      • THRIFT_HOME customizes the Apache Thrift (C++ libraries and compiler) installed location.
      • SNAPPY_HOME customizes the Snappy installed location.
      • ZLIB_HOME customizes the zlib installed location.
      • BROTLI_HOME customizes the Brotli installed location.
      • GTEST_HOME customizes the googletest installed location (if you are building the unit tests).
      • GBENCHMARK_HOME customizes the Google Benchmark installed location (if you are building the benchmarks).
  • make
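
For example, pointing the build at dependencies installed under custom prefixes (the paths below are placeholders, not defaults):

export ARROW_HOME=/opt/arrow
export THRIFT_HOME=/opt/thrift
cmake .
make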

The binaries will be built to ./debug, which contains the libraries to link against as well as a few example executables.

To disable testing (which requires googletest), pass -DPARQUET_BUILD_TESTS=Off to cmake.

For release-level builds (enable optimizations and disable debugging), pass -DCMAKE_BUILD_TYPE=Release to cmake.
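
For example, to configure a release build with the unit tests disabled (both flags as documented above):

cmake -DCMAKE_BUILD_TYPE=Release -DPARQUET_BUILD_TESTS=Off .
make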

Incremental builds can be done afterwards with just make.

Using with Apache Arrow

Arrow provides some of the memory management and IO interfaces that we use in parquet-cpp. By default, Parquet links to Arrow's shared libraries. If you wish to link the Arrow symbols statically instead, pass -DPARQUET_ARROW_LINKAGE=static.
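
For example, using the flag documented above:

cmake -DPARQUET_ARROW_LINKAGE=static .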

Testing

This library uses Google's googletest unit test framework. After building with make, you can run the test suite by running

make unittest

The test suite relies on an environment variable PARQUET_TEST_DATA pointing to the data directory in the source checkout, for example:

export PARQUET_TEST_DATA=`pwd`/data

See ctest --help for configuration details about ctest. On GNU/Linux systems, you can use valgrind with ctest to look for memory leaks:

valgrind --tool=memcheck --leak-check=yes ctest

Building/Running benchmarks

Follow the directions for the simple build, except run cmake with the PARQUET_BUILD_BENCHMARKS option enabled:

cmake -DPARQUET_BUILD_BENCHMARKS=ON ..

and, instead of make unittest, run either make; ctest (to run both unit tests and benchmarks) or make runbenchmark (to run only the benchmarks).

Benchmark logs will be placed in the build directory under build/benchmark-logs.

Out-of-source builds

parquet-cpp supports out-of-source builds. For example:

mkdir test-build
cd test-build
cmake ..
make
ctest -L unittest

By using out-of-source builds, you can preserve your current build state if you need to switch to another git branch.

Design

The library consists of three layers that map to the three units in the Parquet format.

The first layer is the encodings, which correspond to data pages. The APIs at this level return single values.

The second layer is the column reader, which corresponds to column chunks. The APIs at this level return a triple: definition level, repetition level, and value. It also handles reading pages and managing compression and encodings.

The third layer would handle reading/writing records.
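
As a minimal sketch of the second layer, the following reads the first column of the first row group using the batch API; it assumes the low-level reader header parquet/api/reader.h and a file whose first column is physical type INT64:

#include <memory>
#include <string>

#include <parquet/api/reader.h>

void ReadFirstColumn(const std::string& path) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);
  std::shared_ptr<parquet::ColumnReader> col = reader->RowGroup(0)->Column(0);
  auto* typed = static_cast<parquet::Int64Reader*>(col.get());

  int16_t def_level, rep_level;
  int64_t value, values_read;
  while (typed->HasNext()) {
    // Each triple: (definition level, repetition level, value).
    // Batch size 1 for clarity; production code should read larger batches.
    typed->ReadBatch(1, &def_level, &rep_level, &value, &values_read);
  }
}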

Developer Notes

The project adheres to the Google C++ coding convention (http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml) with two notable exceptions: we do not encourage anonymous namespaces, and the line length is 90 characters.

You can run cpplint through the build system with

make lint

The project prefers C++-style memory management: new/delete should be used over malloc/free, and explicit new/delete should itself be avoided wherever possible in favor of STL/Boost facilities, for example scoped_ptr instead of explicit new/delete and std::vector instead of manually allocated buffers. Currently, C++11 features are not used.

For error handling, this project uses exceptions.

In general, many of the APIs at these layers are interface-based for extensibility. To minimize the cost of virtual calls, the APIs should be batch-centric. For example, encoding should operate on batches of values rather than a single value, as the hypothetical sketch below illustrates.
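
A hypothetical illustration of this design rule (not an actual parquet-cpp class):

#include <cstdint>

// Hypothetical interface, for illustration only.
class ValueDecoder {
 public:
  virtual ~ValueDecoder() {}

  // Batch-centric: a single virtual call decodes up to max_values values
  // into buffer and returns the number decoded, amortizing the dispatch
  // cost over the whole batch.
  virtual int Decode(int32_t* buffer, int max_values) = 0;

  // By contrast, a per-value API such as `virtual int32_t Next()` would
  // pay one virtual dispatch per decoded value.
};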

Using clang with a custom gcc toolchain

Suppose you are building libraries with a third-party gcc toolchain (not the built-in system one) on Linux. To use clang for development while linking against the proper toolchain, you can do the following (for out-of-source builds):

export CMAKE_CLANG_OPTIONS=--gcc-toolchain=$TOOLCHAIN/gcc-4.9.2

export CC=$TOOLCHAIN/llvm-3.7.0/bin/clang
export CXX=$TOOLCHAIN/llvm-3.7.0/bin/clang++

cmake -DCMAKE_CLANG_OPTIONS=$CMAKE_CLANG_OPTIONS \
	  -DCMAKE_CXX_FLAGS="-Werror" ..

Code Coverage

To build with gcov code coverage and upload the results to http://coveralls.io or http://codecov.io, follow these instructions.

First, build the project with coverage enabled and run the test suite:

cd $PARQUET_HOME
mkdir coverage-build
cd coverage-build
cmake -DPARQUET_GENERATE_COVERAGE=1 ..
make -j$PARALLEL
ctest -L unittest

The gcov artifacts are not located in a place that works well with either coveralls or codecov, so there is a helper script you need to run:

mkdir coverage_artifacts
python ../build-support/collect_coverage.py CMakeFiles/parquet.dir/src/ coverage_artifacts

For codecov.io (using the provided project token; be sure to keep this private):

cd coverage_artifacts
codecov --token $PARQUET_CPP_CODECOV_TOKEN --gcov-args '\-l' --root $PARQUET_ROOT

For coveralls, install cpp_coveralls:

pip install cpp_coveralls

Then run the coveralls upload script:

coveralls -t $PARQUET_CPP_COVERAGE_TOKEN --gcov-options '\-l' -r $PARQUET_ROOT --exclude $PARQUET_ROOT/thirdparty --exclude $PARQUET_ROOT/build --exclude $NATIVE_TOOLCHAIN --exclude $PARQUET_ROOT/src/parquet/thrift

Note that gcov throws off artifacts from the STL, so the toolchain root stored in $NATIVE_TOOLCHAIN is excluded above to avoid a cluttered coverage report.