Welcome to the implementation of Arrow, the popular in-memory columnar format, in Rust.
This part of the Arrow project is divided into four main components:
| Crate | Description | Documentation |
|-------|-------------|---------------|
| Arrow | Core functionality (memory layout, arrays, low-level computations) | (README) |
| Arrow-flight | Arrow data between processes | (README) |
| DataFusion | In-memory query engine with SQL support | (README) |
| Ballista | Distributed query execution | (README) |
Independently, they support a vast array of functionality for in-memory computations.
Together, they allow users to write an SQL query or a DataFrame (using the datafusion crate), run it against a parquet file (using the parquet crate), evaluate it in-memory using Arrow's columnar format (using the arrow crate), and send it to another process (using the arrow-flight crate).
Generally speaking, the arrow crate offers functionality to develop code that uses Arrow arrays, while datafusion offers most operations typically found in SQL, with a few notable exceptions.
There are too many features to enumerate here, but some notable mentions:
- Arrow implements all formats in the specification except certain dictionaries
- Arrow supports SIMD operations for some of its vertical operations
- DataFusion supports user-defined functions, aggregates, and whole execution nodes
You can find more details about each crate in their respective READMEs.
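The crates above all build on Arrow's columnar representation: a column's values are stored contiguously, with a separate validity bitmap marking nulls. As a rough illustration in plain Rust (a hand-rolled sketch of the idea, not the arrow crate's actual API):

```rust
// Toy columnar array: a contiguous value buffer plus a validity bitmap,
// loosely mirroring Arrow's layout (illustrative only).
struct Int32Column {
    values: Vec<i32>,    // contiguous data buffer
    validity: Vec<bool>, // true = value present, false = null
}

impl Int32Column {
    fn from_options(input: &[Option<i32>]) -> Self {
        let values = input.iter().map(|v| v.unwrap_or(0)).collect();
        let validity = input.iter().map(|v| v.is_some()).collect();
        Int32Column { values, validity }
    }

    // Vertical (column-wise) sum that skips nulls: the contiguous access
    // pattern that makes columnar formats friendly to SIMD.
    fn sum(&self) -> i64 {
        self.values
            .iter()
            .zip(&self.validity)
            .filter(|(_, valid)| **valid)
            .map(|(v, _)| *v as i64)
            .sum()
    }
}

fn main() {
    let col = Int32Column::from_options(&[Some(1), None, Some(3)]);
    assert_eq!(col.sum(), 4);
    println!("sum = {}", col.sum());
}
```

The real arrow crate adds typed builders, buffer sharing, and packed bitmaps on top of this basic shape.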
We use the official ASF Slack for informal discussions and coordination. This is a great place to meet other contributors and get guidance on where to contribute; join us there.
We use ASF JIRA as the system of record for new features and bug fixes; it plays a critical role in the release process.
For design discussions we generally collaborate on Google documents and file a JIRA linking to the document.
There is also a bi-weekly Rust-specific sync call for the Arrow Rust community. This is hosted on Google Meet at https://meet.google.com/ctp-yujs-aee on alternate Wednesdays at 09:00 US/Pacific, 12:00 US/Eastern. During US daylight saving time this corresponds to 16:00 UTC, and at other times 17:00 UTC.
This is a standard cargo project with workspaces. To build it, you need to have Rust and Cargo installed:

```
cd /rust && cargo build
```
You can also use Rust's official Docker image:

```
docker run --rm -v $(pwd)/rust:/rust -it rust /bin/bash -c "cd /rust && cargo build"
```
The command above assumes that you are in the root directory of the project, not in the same directory as this README.md.
You can also compile specific workspaces:

```
cd /rust/arrow && cargo build
```
Before running tests and examples, it is necessary to set up the local development environment.
The tests rely on test data that is contained in git submodules.
To pull down this data run the following:
```
git submodule update --init
```
This populates data in two git submodules:
- `../parquet-testing/data` (sourced from https://github.com/apache/parquet-testing.git)
- `../testing` (sourced from https://github.com/apache/arrow-testing)
`cargo test` will look for these directories at their standard location. The following environment variables can be used to override the location:

```
# Optionally specify a different location for test data
export PARQUET_TEST_DATA=$(cd ../parquet-testing/data; pwd)
export ARROW_TEST_DATA=$(cd ../testing/data; pwd)
```
From here on, this is a pure Rust project and `cargo` can be used to run tests, benchmarks, docs and examples as usual. Run tests using the standard `cargo test` command:

```
# run all tests
cargo test

# run only tests for the arrow crate
cargo test -p arrow
```
Our CI uses `rustfmt` to check code formatting. Before submitting a PR, be sure to run the following and check for lint issues:

```
cargo +stable fmt --all -- --check
```
We recommend using `clippy` for checking lints during development. While we do not yet enforce `clippy` checks, we recommend not introducing new `clippy` errors or warnings. Run the following to check for `clippy` lints.
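A common way to check the whole workspace with clippy (these flags are illustrative; CI may use a different invocation):

```shell
cargo clippy --workspace --all-targets
```

If you want the check to fail loudly, warnings can be promoted to errors by appending `-- -D warnings`.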
If you use Visual Studio Code with the rust-analyzer plugin, you can enable `clippy` to run each time you save a file. See https://users.rust-lang.org/t/how-to-use-clippy-in-vs-code-with-rust-analyzer/41881.
One of the concerns with `clippy` is that it often produces false positives, and some of its recommendations can hurt readability. We do not have a policy on which lints are ignored, but if you disagree with a `clippy` lint, you may disable the lint and briefly justify your decision. Search for `allow(clippy::` in the codebase to identify lints that are currently ignored/allowed. We currently prefer ignoring lints on the lowest unit possible.
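For example, a narrowly scoped allow with a one-line justification might look like this (hypothetical code, not taken from the codebase):

```rust
fn sum_pairs(a: &[i32], b: &[i32]) -> i32 {
    let mut total = 0;
    // justified: indexing both slices in lock-step keeps the pairing explicit
    #[allow(clippy::needless_range_loop)]
    for i in 0..a.len() {
        total += a[i] + b[i];
    }
    total
}

fn main() {
    assert_eq!(sum_pairs(&[1, 2], &[3, 4]), 10);
}
```

Scoping the attribute to the single statement (rather than the function or module) keeps the rest of the code subject to the lint.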
You can use a git pre-commit hook to automate various kinds of pre-commit checking and formatting. Assuming you are in the root directory of the project, first check whether the hook file already exists:

```
ls -l .git/hooks/pre-commit
```

If the file already exists, check its link target or contents before overriding it. Otherwise, it is safe to symlink pre-commit.sh as the hook:

```
ln -s ../../rust/pre-commit.sh .git/hooks/pre-commit
```
If you want to commit without running the checks, pass `--no-verify` to `git commit`:

```
git commit --no-verify -m "... commit message ..."
```