|author||Phillip Cloud <firstname.lastname@example.org>||Sun Nov 19 23:19:03 2017 -0500|
|committer||Wes McKinney <email@example.com>||Sun Nov 19 23:19:03 2017 -0500|
PARQUET-1095: [C++] Read and write Arrow decimal values

This depends on:
- [x] [ARROW-1607](https://github.com/apache/arrow/pull/1128)
- [x] [ARROW-1656](https://github.com/apache/arrow/pull/1184)
- [x] [ARROW-1588](https://github.com/apache/arrow/pull/1211)
- [x] Add tests for writing different sizes of values

Author: Phillip Cloud <firstname.lastname@example.org>
Author: Wes McKinney <email@example.com>

Closes #403 from cpcloud/PARQUET-1095 and squashes the following commits:

8c3d222 [Phillip Cloud] Remove loop from BytesToInteger
63018bc [Wes McKinney] Suppress C4996 due to arrow/util/variant.h
e4b02d3 [Phillip Cloud] Refactor types.h
83948ec [Phillip Cloud] Add last_value_ init
51965cd [Phillip Cloud] Min commit that contains the unique kernel in arrow
e25c59b [Phillip Cloud] Fix reader writer test for unique kernel addition
da0a7eb [Phillip Cloud] Update for ARROW-1811
16935de [Phillip Cloud] Reverse operand order and explicit cast
6036ca5 [Phillip Cloud] ARROW-1811
c5c4294 [Phillip Cloud] Fix issues
32a4abe [Phillip Cloud] Cleanup iteration a bit
920832a [Phillip Cloud] Update arrow version
9f97c1d [Phillip Cloud] Update for ARROW-1794: rename DecimalArray to Decimal128Array
b2e0290 [Phillip Cloud] IWYU
64748a8 [Phillip Cloud] Copy from arrow for now
6c9e2a7 [Phillip Cloud] Reduce the number of decimal test cases
7ab2e5c [Phillip Cloud] Parameterize on precision
30655d6 [Phillip Cloud] Use arrow random_decimals
9ff7eb4 [Phillip Cloud] Remove specific template parameters
1eee6a9 [Phillip Cloud] Remove specific randint call
8808e4c [Phillip Cloud] Bump arrow version
659fbc1 [Phillip Cloud] Fix deprecated API call
e162ca1 [Phillip Cloud] Allocate scratch space to hold the byteswapped values
5c9292b [Phillip Cloud] Proper dcheck call
1782da0 [Phillip Cloud] Use arrow
3d243d5 [Phillip Cloud] Checkpoint [ci skip]
028fb03 [Phillip Cloud] Remove garbage values
46dff15 [Phillip Cloud] Clean up uint32 test
613255e [Phillip Cloud] Do not use std::copy when reinterpret_cast will suffice
2917a62 [Phillip Cloud] PARQUET-1095: [C++] Read and write Arrow decimal values
We use the CMake build system and require a minimum version of 3.2. If you are using an older Linux distribution, you may need to use a PPA (for apt users) or build CMake from source.
parquet-cpp requires gcc 4.8 or higher on Linux.
To build parquet-cpp out of the box, you must install some build prerequisites for the thirdparty dependencies. On Debian/Ubuntu, these can be installed with:
sudo apt-get install libboost-dev libboost-filesystem-dev \
    libboost-program-options-dev libboost-regex-dev \
    libboost-system-dev libboost-test-dev \
    libssl-dev libtool bison flex pkg-config
On OS X, you must use Xcode 6 or higher. We recommend using Homebrew to install Boost, which is required for Thrift:
brew install boost
Check Windows developer guide for instructions to build parquet-cpp on Windows.
You can either install these dependencies separately, or they will be built automatically as part of the build.
Symbols from Thrift, Snappy, and ZLib are statically-linked into the
libparquet shared library, so these dependencies must be built with
-fPIC on Linux and OS X. Since Linux package managers do not consistently compile the static libraries for these components with
-fPIC, you may have issues with Linux packages such as
libsnappy-dev. It may be easier to depend on the thirdparty toolchain that parquet-cpp builds automatically.
The binaries will be built to ./debug, which contains the libraries to link against as well as a few example executables.
To disable the testing (which requires googletest), pass -DPARQUET_BUILD_TESTS=Off to cmake.
For release-level builds (enable optimizations and disable debugging), pass -DCMAKE_BUILD_TYPE=Release to cmake.
To build only the library with minimal dependencies, pass -DPARQUET_MINIMAL_DEPENDENCY=ON to cmake. Note that the executables, tests, and benchmarks should be disabled as well.
Incremental builds can be done afterwards with just make.
Arrow provides some of the memory management and IO interfaces that we use in parquet-cpp. By default, Parquet links to Arrow's shared libraries. If you wish to statically link the Arrow symbols instead, pass -DPARQUET_ARROW_LINKAGE=static to cmake.
This library uses Google's googletest unit test framework. After building with make, you can run the test suite by running make unittest.
The test suite relies on an environment variable PARQUET_TEST_DATA pointing to the data directory in the source checkout, for example:

export PARQUET_TEST_DATA=$(pwd)/data
See ctest --help for configuration details about ctest. On GNU/Linux systems, you can use valgrind with ctest to look for memory leaks:
valgrind --tool=memcheck --leak-check=yes ctest
Follow the directions for the simple build, except run cmake with the PARQUET_BUILD_BENCHMARKS parameter set correctly:
cmake -DPARQUET_BUILD_BENCHMARKS=ON ..
and instead of make unittest, run either make; ctest (to run both unit tests and benchmarks) or make runbenchmark (to run only the benchmark tests).
Benchmark logs will be placed in the build directory under
parquet-cpp supports out of source builds. For example:
mkdir test-build
cd test-build
cmake ..
make
ctest -L unittest
By using out-of-source builds you can preserve your current build state in case you need to switch to another git branch.
The library consists of three layers that map to the three units in the Parquet format.

The first layer is the encodings, which correspond to data pages. The APIs at this level return single values.

The second layer is the column reader, which corresponds to column chunks. The APIs at this level return a triple: definition level, repetition level, and value. It also handles reading pages, compression, and managing encodings.

The third layer would handle reading/writing records.
The project adheres to the Google C++ coding conventions (http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml), with two notable exceptions: we do not encourage anonymous namespaces, and the line length is 90 characters.
You can run cpplint through the build system with make lint.
The project prefers C++-style memory management: use new/delete over malloc/free, and avoid explicit new/delete wherever possible by using the STL and Boost (for example, scoped_ptr instead of explicit new/delete, and std::vector instead of manually allocated buffers). Currently, C++11 features are not used.
For error handling, this project uses exceptions.
In general, many of the APIs at each layer are interface-based for extensibility. To minimize the cost of virtual calls, the APIs should be batch-centric; for example, encoding should operate on batches of values rather than a single value.
Suppose you are building libraries with a thirdparty gcc toolchain (not a built-in system one) on Linux. To use clang for development while linking to the proper toolchain, you can do the following (for out-of-source builds):
export CMAKE_CLANG_OPTIONS=--gcc-toolchain=$TOOLCHAIN/gcc-4.9.2
export CC=$TOOLCHAIN/llvm-3.7.0/bin/clang
export CXX=$TOOLCHAIN/llvm-3.7.0/bin/clang++
cmake -DCMAKE_CLANG_OPTIONS=$CMAKE_CLANG_OPTIONS \
    -DCMAKE_CXX_FLAGS="-Werror" ..
First, build the project with coverage enabled and run the test suite:
cd $PARQUET_HOME
mkdir coverage-build
cd coverage-build
cmake -DPARQUET_GENERATE_COVERAGE=1 ..
make -j$PARALLEL
ctest -L unittest
The gcov artifacts are not located in a place that works well with either coveralls or codecov, so there is a helper script you need to run:
mkdir coverage_artifacts
python ../build-support/collect_coverage.py CMakeFiles/parquet.dir/src/ coverage_artifacts
For codecov.io (using the provided project token -- be sure to keep this private):
cd coverage_artifacts
codecov --token $PARQUET_CPP_CODECOV_TOKEN --gcov-args '\-l' --root $PARQUET_ROOT
For coveralls, install the cpp_coveralls tool:

pip install cpp_coveralls
Then run the coveralls upload script:
coveralls -t $PARQUET_CPP_COVERAGE_TOKEN --gcov-options '\-l' -r $PARQUET_ROOT --exclude $PARQUET_ROOT/thirdparty --exclude $PARQUET_ROOT/build --exclude $NATIVE_TOOLCHAIN --exclude $PARQUET_ROOT/src/parquet/thrift
gcov throws off artifacts from the STL, so the toolchain root stored in $NATIVE_TOOLCHAIN is excluded to avoid a cluttered coverage report.