commit 65e7db1965a1117852df45904fcd21eb40c6d6b5
author: Uwe L. Korn <uwelk@xhochy.com>  Tue Jan 17 19:10:45 2017 -0500
committer: Wes McKinney <wes.mckinney@twosigma.com>  Tue Jan 17 19:10:45 2017 -0500
tree: 7543947b681333acff3a3b7fa65ce195710ec51a
parent: 0804faf4fc8ecb448643d107f9cfe60021460546
PARQUET-820: Decoders should directly emit arrays with spacing for null entries

Old:

```
In [3]: import pyarrow.io as paio
   ...: import pyarrow.parquet as pq
   ...:
   ...: with open('yellow_tripdata_2016-01.parquet', 'r') as f:
   ...:     buf = f.read()
   ...: buf = paio.buffer_from_bytes(buf)
   ...:
   ...: def read_parquet():
   ...:     reader = paio.BufferReader(buf)
   ...:     df = pq.read_table(reader)
   ...:
   ...: %timeit read_parquet()

1 loop, best of 3: 1.21 s per loop
```

New:

```
In [1]: import pyarrow.io as paio
   ...: import pyarrow.parquet as pq
   ...:
   ...: with open('yellow_tripdata_2016-01.parquet', 'r') as f:
   ...:     buf = f.read()
   ...: buf = paio.buffer_from_bytes(buf)
   ...:
   ...: def read_parquet():
   ...:     reader = paio.BufferReader(buf)
   ...:     df = pq.read_table(reader)
   ...:
   ...: %timeit read_parquet()

1 loop, best of 3: 906 ms per loop
```

Arrow->Pandas conversion for comparison:

```
In [5]: %timeit df.to_pandas()
1 loop, best of 3: 567 ms per loop
```

All benchmarks were done on a single CPU core.

I have to add better test coverage before this can go in. There is still some room for future improvements that won't be done in this PR:

* `DefinitionLevelsToBitmap` should be done in the DefinitionLevelsDecoder
* `GetBatchWithDictSpaced` is something for a vectorization/bitmap ninja.

Author: Uwe L. Korn <uwelk@xhochy.com>
Author: Korn, Uwe <Uwe.Korn@blue-yonder.com>

Closes #218 from xhochy/PARQUET-820 and squashes the following commits:

e6db697 [Korn, Uwe] Add INIT_BITSET macro
8f17db9 [Korn, Uwe] Use arrow::TypeTraits
8dcab1b [Uwe L. Korn] Adjust documentation for ReadBatchSpaced
798bc83 [Uwe L. Korn] Test ReadSpaced
9dc6dc0 [Uwe L. Korn] Test DecodeSpaced
ccb70dc [Uwe L. Korn] Add fast path for non-nullable-batches
6f99191 [Uwe L. Korn] Move bit reading into a macro
393d99a [Uwe L. Korn] Explicitly mark overrides
3424ae3 [Uwe L. Korn] Make more use of the bitmaps
685ad34 [Uwe L. Korn] Remove unused include
9b0f105 [Uwe L. Korn] Use bitset in the whole GetBatchWithDict loop
907c165 [Uwe L. Korn] Use bitset in literalbatch
0ec4b38 [Uwe L. Korn] Remove unused code
f6c4b5e [Uwe L. Korn] ninja format
cbf0176 [Uwe L. Korn] DecodeSpaced in dictionary encoder
3dfa43b [Uwe L. Korn] Directly read valid_bits
15aa324 [Uwe L. Korn] Only use ReadSpaced where needed
96dd347 [Korn, Uwe] PARQUET-820: Decoders should directly emit arrays with spacing for null entries
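For background, "spaced" decoding means the decoder writes each value at its final slot in the output array, leaving gaps where entries are null according to a validity bitmap, rather than emitting a dense array that must be re-expanded during the Arrow conversion. The following is a minimal sketch of that expansion step, assuming a dense input buffer and an LSB-ordered validity bitmap; it is illustrative only, not the parquet-cpp implementation, and the function name is invented:

```cpp
#include <cstdint>

// Hypothetical helper: expand num_values densely packed values into
// num_values + null_count output slots, placing each value at the
// position where its validity bit is set (LSB bit order).
//
//   dense:      [v0, v1, v2]      valid_bits: 1 0 1 1  (bit 0 first)
//   spaced out: [v0, __, v1, v2]
template <typename T>
void SpaceValues(const T* dense, int64_t num_values, int64_t null_count,
                 const uint8_t* valid_bits, T* spaced) {
  int64_t num_slots = num_values + null_count;
  int64_t value_idx = num_values - 1;
  // Walk backwards so the expansion also works in place (spaced == dense).
  for (int64_t i = num_slots - 1; i >= 0; --i) {
    if (valid_bits[i / 8] & (1 << (i % 8))) {
      spaced[i] = dense[value_idx--];
    }
  }
}
```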
You can either install these dependencies separately, or they will be built automatically as part of the build.

Note that Thrift will not be built inside the project on macOS. Instead, you should install it via Homebrew:

```
brew install thrift
```

To run a simple build:

```
cmake .
make
```
The binaries will be built under ./debug, which contains the libraries to link against as well as a few example executables.
To disable testing (which requires googletest), pass `-DPARQUET_BUILD_TESTS=Off` to `cmake`.
For release-level builds (enable optimizations and disable debugging), pass `-DCMAKE_BUILD_TYPE=Release` to `cmake`.
Incremental builds can be done afterwards with just `make`.
Arrow provides some of the memory management and IO interfaces that we use in parquet-cpp. By default, Parquet links to Arrow's shared libraries. If you wish to statically link the Arrow symbols instead, pass `-DPARQUET_ARROW_LINKAGE=static`.
This library uses Google's googletest unit test framework. After building with `make`, you can run the test suite by running:

```
make unittest
```
The test suite relies on an environment variable `PARQUET_TEST_DATA` pointing to the `data` directory in the source checkout, for example:

```
export PARQUET_TEST_DATA=`pwd`/data
```
See `ctest --help` for configuration details about ctest. On GNU/Linux systems, you can use valgrind with ctest to look for memory leaks:

```
valgrind --tool=memcheck --leak-check=yes ctest
```
Follow the directions for the simple build, except run `cmake` with the `PARQUET_BUILD_BENCHMARKS` option enabled:

```
cmake -DPARQUET_BUILD_BENCHMARKS=ON ..
```
Then, instead of `make unittest`, run either `make; ctest` to run both unit tests and benchmarks, or `make runbenchmark` to run only the benchmark tests. Benchmark logs will be placed in the build directory under `build/benchmark-logs`.
parquet-cpp supports out-of-source builds. For example:

```
mkdir test-build
cd test-build
cmake ..
make
ctest -L unittest
```
By using out-of-source builds you can preserve your current build state in case you need to switch to another git branch.
The library consists of three layers that map to the three units in the Parquet format.

The first layer is the encodings, which correspond to data pages. The APIs at this level return single values.

The second layer is the column reader, which corresponds to column chunks. The APIs at this level return a triple: definition level, repetition level, and value. It also handles reading pages, compression, and managing encodings.

The third layer would handle reading/writing records.
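To make the second layer concrete, here is roughly what consuming a typed column reader looks like. The sketch below assumes an API shaped like parquet-cpp's `TypedColumnReader::ReadBatch`; treat the include path and exact signatures as approximate:

```cpp
#include <cstdint>
#include <vector>
#include <parquet/column/reader.h>  // header path at the time of writing

// Reads an INT64 column chunk in batches; reader comes from a RowGroupReader.
void ReadInt64Column(parquet::Int64Reader* reader) {
  const int64_t kBatchSize = 1024;
  std::vector<int16_t> def_levels(kBatchSize);
  std::vector<int16_t> rep_levels(kBatchSize);
  std::vector<int64_t> values(kBatchSize);
  while (reader->HasNext()) {
    int64_t values_read = 0;
    // Returns the number of (definition, repetition) level pairs read;
    // values_read is the number of non-null values decoded.
    int64_t levels_read =
        reader->ReadBatch(kBatchSize, def_levels.data(), rep_levels.data(),
                          values.data(), &values_read);
    (void)levels_read;  // process levels and values here
  }
}
```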
The project adheres to the Google C++ coding conventions (http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml), with two notable exceptions: we do not encourage anonymous namespaces, and the line length is 90 characters.
You can run `cpplint` through the build system with:

```
make lint
```
The project prefers C++-style memory management: new/delete should be used over malloc/free. Explicit new/delete should in turn be avoided wherever possible by using STL/Boost instead, for example scoped_ptr rather than explicit new/delete, and std::vector rather than manually allocated buffers. Currently, C++11 features are not used.
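For instance, the preference looks like this in practice (a generic illustration, not code from this repository; it uses boost::scoped_ptr rather than a C++11 smart pointer, consistent with the no-C++11 rule above):

```cpp
#include <cstdint>
#include <vector>
#include <boost/scoped_ptr.hpp>

void Preferred() {
  // Preferred: the container and scoped pointer release memory automatically.
  std::vector<uint8_t> buffer(1024);
  boost::scoped_ptr<int> value(new int(42));
}  // buffer and value are freed here, even if an exception is thrown

void Discouraged() {
  // Discouraged: explicit new/delete (and malloc/free even more so);
  // an early return or exception between these lines would leak.
  uint8_t* buffer = new uint8_t[1024];
  delete[] buffer;
}
```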
For error handling, this project uses exceptions.
In general, many of the APIs at each layer are interface-based for extensibility. To minimize the cost of virtual calls, the APIs should be batch-centric: for example, encoding should operate on batches of values rather than on a single value.
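A hypothetical sketch of that principle: a single virtual call decodes an entire batch, so the dispatch overhead is amortized across many values (the interface and method names here are invented for illustration):

```cpp
#include <cstdint>

// Hypothetical decoder interface, names invented for illustration.
class Int32Decoder {
 public:
  virtual ~Int32Decoder() {}

  // Batch-centric: one virtual call decodes up to max_values values into out
  // and returns the number actually decoded, amortizing the dispatch cost.
  virtual int Decode(int32_t* out, int max_values) = 0;

  // By contrast, a value-at-a-time method such as
  //   virtual int32_t DecodeOne() = 0;
  // would pay one virtual call per value.
};
```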
Suppose you are building libraries with a third-party gcc toolchain (not a built-in system one) on Linux. To use clang for development while linking against the proper toolchain, you can do the following (for out-of-source builds):
```
export CMAKE_CLANG_OPTIONS=--gcc-toolchain=$TOOLCHAIN/gcc-4.9.2
export CC=$TOOLCHAIN/llvm-3.7.0/bin/clang
export CXX=$TOOLCHAIN/llvm-3.7.0/bin/clang++

cmake -DCMAKE_CLANG_OPTIONS=$CMAKE_CLANG_OPTIONS \
      -DCMAKE_CXX_FLAGS="-Werror" ..
```
To build with `gcov` code coverage and upload the results to http://coveralls.io or http://codecov.io, here are some instructions.

First, build the project with coverage enabled and run the test suite:

```
cd $PARQUET_HOME
mkdir coverage-build
cd coverage-build
cmake -DPARQUET_GENERATE_COVERAGE=1 ..
make -j$PARALLEL
ctest -L unittest
```
The `gcov` artifacts are not located in a place that works well with either coveralls or codecov, so there is a helper script you need to run:

```
mkdir coverage_artifacts
python ../build-support/collect_coverage.py CMakeFiles/parquet.dir/src/ coverage_artifacts
```
For codecov.io (using the provided project token; be sure to keep this private):

```
cd coverage_artifacts
codecov --token $PARQUET_CPP_CODECOV_TOKEN --gcov-args '\-l' --root $PARQUET_ROOT
```
For coveralls, install `cpp_coveralls`:

```
pip install cpp_coveralls
```
And run the coveralls upload script:

```
coveralls -t $PARQUET_CPP_COVERAGE_TOKEN --gcov-options '\-l' -r $PARQUET_ROOT --exclude $PARQUET_ROOT/thirdparty --exclude $PARQUET_ROOT/build --exclude $NATIVE_TOOLCHAIN --exclude $PARQUET_ROOT/src/parquet/thrift
```
Note that `gcov` throws off artifacts from the STL, so I excluded my toolchain root stored in `$NATIVE_TOOLCHAIN` to avoid a cluttered coverage report.