<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
## parquet-cpp: a C++ library to read and write the Apache Parquet columnar data format.
<table>
<tr>
<td>Build Status</td>
<td>
<a href="https://travis-ci.org/apache/parquet-cpp">
<img src="https://travis-ci.org/apache/parquet-cpp.svg?branch=master" alt="travis build status" />
</a>
</td>
</tr>
<tr>
<td>Test coverage</td>
<td>
<a href='https://coveralls.io/github/apache/parquet-cpp?branch=master'><img src='https://coveralls.io/repos/github/apache/parquet-cpp/badge.svg?branch=master' alt='Coverage Status' /></a>
</td>
</tr>
</table>
## System Dependencies
We use the CMake build system and require a minimum version of 3.2. If you are
using an older Linux distribution, you may need to use a PPA (for apt users) or
build CMake from source.
### Linux
parquet-cpp requires gcc 4.8 or higher on Linux.
To build parquet-cpp out of the box, you must install some build prerequisites
for the thirdparty dependencies. On Debian/Ubuntu, these can be installed with:
```
sudo apt-get install libboost-dev libboost-filesystem-dev \
libboost-program-options-dev libboost-regex-dev \
libboost-system-dev libboost-test-dev \
libssl-dev libtool bison flex pkg-config
```
### OS X
You must use Xcode 6 or higher. We recommend using Homebrew to install Boost,
which is required for Thrift:
```
brew install boost
```
### Windows
See the [Windows developer guide][1] for instructions on building parquet-cpp on Windows.
## Third Party Dependencies
- Apache Arrow >= 0.7.0 (memory management, compression, IO, optional columnar
data adapters)
- Thrift 0.7+ [install instructions](https://thrift.apache.org/docs/install/)
- googletest 1.7.0 (cannot be installed with package managers)
- Google Benchmark (only required if building benchmarks)
You can install these dependencies separately; otherwise, they will be
built automatically as part of the build.
Symbols from Thrift, Snappy, and ZLib are statically-linked into the
`libparquet` shared library, so these dependencies must be built with `-fPIC`
on Linux and OS X. Since Linux package managers do not consistently compile the
static libraries for these components with `-fPIC`, you may have issues with
Linux packages such as `libsnappy-dev`. It may be easier to depend on the
thirdparty toolchain that parquet-cpp builds automatically.
## Build
- `cmake .`
  - You can customize build dependency locations through various environment variables:
    - `ARROW_HOME` customizes the Apache Arrow installed location.
    - `THRIFT_HOME` customizes the Apache Thrift (C++ libraries and compiler)
      installed location.
    - `GTEST_HOME` customizes the googletest installed location (if you are
      building the unit tests).
    - `GBENCHMARK_HOME` customizes the Google Benchmark installed location (if
      you are building the benchmarks).
- `make`
The binaries are built into `./debug`, which contains the libraries to link against as
well as a few example executables.
To disable building the tests (which require `googletest`), pass
`-DPARQUET_BUILD_TESTS=Off` to `cmake`.
For release-level builds (enable optimizations and disable debugging), pass
`-DCMAKE_BUILD_TYPE=Release` to `cmake`.
To build only the library with minimal dependencies, pass
`-DPARQUET_MINIMAL_DEPENDENCY=ON` to `cmake`.
Note that the executables, tests, and benchmarks should be disabled as well when using this option.
Incremental builds can be done afterwards with just `make`.
## Using with Apache Arrow
Arrow provides some of the memory management and IO interfaces that we use in
parquet-cpp. By default, Parquet links to Arrow's shared libraries. If you wish
to statically-link the Arrow symbols instead, pass
`-DPARQUET_ARROW_LINKAGE=static`.
## Testing
This library uses Google's `googletest` unit test framework. After building
with `make`, you can run the test suite by running
```
make unittest
```
The test suite relies on an environment variable `PARQUET_TEST_DATA` pointing
to the `data` directory in the source checkout, for example:
```
export PARQUET_TEST_DATA=`pwd`/data
```
See `ctest --help` for configuration details about ctest. On GNU/Linux systems,
you can use valgrind with ctest to look for memory leaks:
```
valgrind --tool=memcheck --leak-check=yes ctest
```
## Building/Running benchmarks
Follow the directions for a simple build, but run `cmake` with the
`PARQUET_BUILD_BENCHMARKS` option enabled:

    cmake -DPARQUET_BUILD_BENCHMARKS=ON ..

Then, instead of `make unittest`, run either `make; ctest` to run both the unit tests
and the benchmarks, or `make runbenchmark` to run only the benchmarks.
Benchmark logs will be placed in the build directory under `build/benchmark-logs`.
## Out-of-source builds
parquet-cpp supports out-of-source builds. For example:
```
mkdir test-build
cd test-build
cmake ..
make
ctest -L unittest
```
By using out-of-source builds you can preserve your current build state in case
you need to switch to another git branch.
Design
========
The library consists of three layers that map to the three units in the Parquet format.
The first layer is the encodings, which correspond to data pages. The APIs at this level
return single values.
The second layer is the column reader, which corresponds to column chunks. The APIs at
this level return a triple: definition level, repetition level, and value. This layer also
handles reading pages, compression, and managing encodings.
The third layer would handle reading and writing records.
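
To make the second layer's triple concrete, here is a minimal sketch of a reader that returns a definition level, a repetition level, and a value per entry. The names `LevelValue` and `ToyColumnReader` are hypothetical and are not parquet-cpp's API. The sketch assumes a flat optional column whose levels and values are already decoded in memory, and it sticks to C++98-compatible constructs.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <stdint.h>
#include <vector>

// Hypothetical types for illustration; these are NOT parquet-cpp's API.

// One entry returned by the column-reader layer.
struct LevelValue {
  int16_t definition_level;
  int16_t repetition_level;
  bool has_value;  // true when definition_level equals the max definition level
  int64_t value;   // only meaningful when has_value is true
};

// Toy reader over levels and values that are already decoded in memory.
// `values` holds entries only for slots that actually carry a value.
class ToyColumnReader {
 public:
  ToyColumnReader(const std::vector<int16_t>& def_levels,
                  const std::vector<int16_t>& rep_levels,
                  const std::vector<int64_t>& values,
                  int16_t max_def_level)
      : def_levels_(def_levels), rep_levels_(rep_levels), values_(values),
        max_def_level_(max_def_level), level_pos_(0), value_pos_(0) {
    if (def_levels_.size() != rep_levels_.size()) {
      throw std::invalid_argument("level vectors must have equal length");
    }
  }

  bool HasNext() const { return level_pos_ < def_levels_.size(); }

  // Returns the next (definition level, repetition level, value) triple.
  LevelValue Next() {
    if (!HasNext()) throw std::out_of_range("no more entries in column");
    LevelValue out;
    out.definition_level = def_levels_[level_pos_];
    out.repetition_level = rep_levels_[level_pos_];
    out.has_value = (out.definition_level == max_def_level_);
    out.value = 0;
    if (out.has_value) {
      if (value_pos_ >= values_.size()) {
        throw std::out_of_range("fewer values than non-null levels");
      }
      out.value = values_[value_pos_++];
    }
    ++level_pos_;
    return out;
  }

 private:
  std::vector<int16_t> def_levels_;
  std::vector<int16_t> rep_levels_;
  std::vector<int64_t> values_;
  int16_t max_def_level_;
  size_t level_pos_;
  size_t value_pos_;
};

// Usage sketch: a flat optional column holding [10, null, 20].
void DemoToyColumnReader() {
  std::vector<int16_t> def(3);
  def[0] = 1; def[1] = 0; def[2] = 1;
  std::vector<int16_t> rep(3);  // all zeros: no repeated fields
  std::vector<int64_t> vals;
  vals.push_back(10);
  vals.push_back(20);

  ToyColumnReader reader(def, rep, vals, 1);
  LevelValue a = reader.Next();
  LevelValue b = reader.Next();
  LevelValue c = reader.Next();
  assert(a.has_value && a.value == 10);
  assert(!b.has_value);
  assert(c.has_value && c.value == 20);
  assert(!reader.HasNext());
}
```

The null entry consumes a level pair but no slot in `values`, which is why the reader tracks two separate positions.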
Developer Notes
========
The project adheres to the Google C++ style guide:
http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml
with two notable exceptions: we do not encourage anonymous namespaces, and the line
length limit is 90 characters.
You can run `cpplint` through the build system with
```
make lint
```
The project prefers C++-style memory management. Use new/delete rather than malloc/free,
and avoid explicit new/delete wherever the STL or Boost offers an alternative. For
example, use scoped_ptr instead of explicit new/delete, and std::vector instead of
manually allocated buffers. Currently, C++11 features are not used.
For error handling, this project uses exceptions.
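
As a small illustration of these two conventions, the sketch below uses `std::vector` for an owned buffer instead of `new[]`/`delete[]`, and reports a bad request by throwing an exception. `CopyRange` is a hypothetical helper written for this example. It is not a parquet-cpp function, and it uses only C++98-compatible constructs.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <stdint.h>
#include <vector>

// Hypothetical helper for illustration; NOT part of parquet-cpp.
// Copies `length` bytes starting at `offset` out of `source` into a new,
// owned buffer. Errors are reported with an exception, not a return code.
std::vector<uint8_t> CopyRange(const std::vector<uint8_t>& source,
                               size_t offset, size_t length) {
  // Written to avoid overflow in `offset + length`.
  if (offset > source.size() || length > source.size() - offset) {
    throw std::out_of_range("CopyRange: requested range exceeds source buffer");
  }
  // The vector owns its storage and releases it on every exit path,
  // so no explicit new[]/delete[] (or malloc/free) is needed.
  return std::vector<uint8_t>(source.begin() + offset,
                              source.begin() + offset + length);
}

// Usage sketch: copy the middle three bytes of a five-byte buffer.
void DemoCopyRange() {
  std::vector<uint8_t> source;
  for (uint8_t b = 1; b <= 5; ++b) source.push_back(b);

  std::vector<uint8_t> middle = CopyRange(source, 1, 3);
  assert(middle.size() == 3);
  assert(middle[0] == 2 && middle[1] == 3 && middle[2] == 4);
}
```

A caller that asks for a range past the end of `source` gets a `std::out_of_range` exception, and no cleanup code is needed at the call site because nothing was allocated by hand.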
In general, many of the APIs at each layer are interface-based for extensibility. To
minimize the cost of virtual calls, the APIs should be batch-centric. For example,
encoding should operate on batches of values rather than a single value.
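
The batch-centric idea can be sketched as an interface whose single virtual method fills a caller-provided buffer, so the virtual-call cost is paid once per batch rather than once per value. `Int32Decoder`, `InMemoryInt32Decoder`, and `SumAll` are hypothetical names chosen for this example and are not parquet-cpp's decoder API. The "decoder" here simply hands out values already held in memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <stdint.h>
#include <vector>

// Hypothetical interface for illustration; NOT parquet-cpp's decoder API.
class Int32Decoder {
 public:
  virtual ~Int32Decoder() {}
  // Writes up to max_values values into out and returns how many were
  // written (0 once the input is exhausted). One virtual call per batch.
  virtual size_t DecodeBatch(int32_t* out, size_t max_values) = 0;
};

// Trivial implementation that hands out values already held in memory.
class InMemoryInt32Decoder : public Int32Decoder {
 public:
  explicit InMemoryInt32Decoder(const std::vector<int32_t>& values)
      : values_(values), pos_(0) {}

  virtual size_t DecodeBatch(int32_t* out, size_t max_values) {
    const size_t n = std::min(max_values, values_.size() - pos_);
    std::copy(values_.begin() + pos_, values_.begin() + pos_ + n, out);
    pos_ += n;
    return n;
  }

 private:
  std::vector<int32_t> values_;
  size_t pos_;
};

// Consumes a decoder in fixed-size batches and sums everything it yields.
int64_t SumAll(Int32Decoder* decoder, size_t batch_size) {
  if (batch_size == 0) {
    throw std::invalid_argument("batch_size must be positive");
  }
  std::vector<int32_t> batch(batch_size);
  int64_t total = 0;
  for (;;) {
    const size_t n = decoder->DecodeBatch(&batch[0], batch.size());
    if (n == 0) break;
    for (size_t i = 0; i < n; ++i) total += batch[i];
  }
  return total;
}

// Usage sketch: ten values are read in batches of 4, 4, and 2.
void DemoBatchDecoding() {
  std::vector<int32_t> values;
  for (int32_t v = 1; v <= 10; ++v) values.push_back(v);
  InMemoryInt32Decoder decoder(values);
  assert(SumAll(&decoder, 4) == 55);
}
```

With a per-value interface the same loop would make ten virtual calls. Here it makes four, including the final call that returns 0, and the gap widens as the batch size grows.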
## Using clang with a custom gcc toolchain
Suppose you are building libraries with a thirdparty gcc toolchain (not a
built-in system one) on Linux. To use clang for development while linking to
the proper toolchain, you can do (for out of source builds):
```shell
export CMAKE_CLANG_OPTIONS=--gcc-toolchain=$TOOLCHAIN/gcc-4.9.2
export CC=$TOOLCHAIN/llvm-3.7.0/bin/clang
export CXX=$TOOLCHAIN/llvm-3.7.0/bin/clang++
cmake -DCMAKE_CLANG_OPTIONS=$CMAKE_CLANG_OPTIONS \
-DCMAKE_CXX_FLAGS="-Werror" ..
```
## Code Coverage
The following steps build the project with `gcov` code coverage and upload the
results to http://coveralls.io or http://codecov.io.
First, build the project with coverage and run the test suite:
```
cd $PARQUET_HOME
mkdir coverage-build
cd coverage-build
cmake -DPARQUET_GENERATE_COVERAGE=1 ..
make -j$PARALLEL
ctest -L unittest
```
The `gcov` artifacts are not located in a place that works well with either
coveralls or codecov, so you need to run a helper script to collect them:
```
mkdir coverage_artifacts
python ../build-support/collect_coverage.py CMakeFiles/parquet.dir/src/ coverage_artifacts
```
For codecov.io (using the provided project token -- be sure to keep this
private):
```
cd coverage_artifacts
codecov --token $PARQUET_CPP_CODECOV_TOKEN --gcov-args '\-l' --root $PARQUET_ROOT
```
For coveralls, install `cpp_coveralls`:
```
pip install cpp_coveralls
```
Then run the coveralls upload script:
```
coveralls -t $PARQUET_CPP_COVERAGE_TOKEN --gcov-options '\-l' -r $PARQUET_ROOT --exclude $PARQUET_ROOT/thirdparty --exclude $PARQUET_ROOT/build --exclude $NATIVE_TOOLCHAIN --exclude $PARQUET_ROOT/src/parquet/thrift
```
Note that `gcov` also produces artifacts for the STL, so the command above excludes
the toolchain root stored in `$NATIVE_TOOLCHAIN` to avoid a cluttered coverage report.
[1]: https://github.com/apache/parquet-cpp/blob/master/docs/Windows.md