This crate contains a native Rust implementation of the Arrow columnar format.
Common information for all Rust libraries in this project, including testing, code formatting, and lints, can be found in the main Arrow Rust README.md.
Please refer to lib.rs for an introduction to this specific crate and its current functionality.
This crate heavily uses unsafe due to how memory is allocated in cache lines. We have a small tool to verify that this crate does not leak memory (beyond what the compiler already does).
Run it with:
cargo test --features memory-check --lib -- --test-threads 1
This runs all unit tests on a single thread and counts all allocations and deallocations.
The examples folder shows how to construct some different types of Arrow arrays, including dynamic arrays created at runtime.
Examples can be run using the cargo run --example command. For example:
cargo run --example builders
cargo run --example dynamic_types
cargo run --example read_csv
The expected flatc version is 1.12.0+, built from flatbuffers master at a fixed commit ID by regen.sh.
The IPC flatbuffer code was generated by running this command from the root of the project:
./regen.sh
The above script will run the flatc compiler and perform some adjustments to the source code:
- Replace type__ with type_
- Remove the org::apache::arrow::flatbuffers namespace
Arrow uses the following features:
- simd: Arrow uses the packed_simd crate to optimize many of the implementations in the compute module using SIMD intrinsics. These optimizations are turned off by default. If the simd feature is enabled, an unstable version of Rust is required (we test with nightly-2021-03-24).
- flight: contains useful functions to convert between the Flight wire format and Arrow data.
- prettyprint: a utility for printing record batches.
Other than simd, all the other features are enabled by default. Disabling prettyprint might be necessary in order to compile Arrow to the wasm32-unknown-unknown WASM target.
Guidelines on the usage of unsafe
unsafe has a high maintenance cost because debugging and testing it is difficult and time consuming, often requires external tools (e.g. valgrind), and demands higher-than-usual attention to detail. Undefined behavior is particularly difficult to identify and test, and usage of unsafe is the primary cause of undefined behavior in a program written in Rust. For two real-world examples of where unsafe has consumed time in the past in this project, see #8545 and #8829. This crate only accepts the usage of unsafe code upon careful consideration, and strives to avoid it to the largest possible extent.
When can unsafe be used?
Generally, unsafe should only be used when a safe counterpart is not available and there is no safe way to achieve additional performance in that area. The following is a summary of the current components of the crate that require unsafe:
The Arrow format recommends storing buffers aligned with cache lines, and this crate adopts this behavior. However, Rust's global allocator does not allocate memory aligned with cache lines. As such, many of the low-level operations related to memory management require unsafe.
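As an illustration of why aligned allocation needs unsafe, the following is a minimal sketch using only the standard library. The AlignedBytes type and the 64-byte ALIGNMENT constant are hypothetical names and values chosen for this example; they are not this crate's actual buffer implementation, and the real cache-line size is platform dependent.

```rust
use std::alloc::{alloc_zeroed, dealloc, handle_alloc_error, Layout};
use std::ptr::NonNull;

// Assumed cache-line size for this sketch; the real value depends on the platform.
const ALIGNMENT: usize = 64;

/// Hypothetical owner of a zero-initialized, cache-line-aligned byte region.
pub struct AlignedBytes {
    ptr: NonNull<u8>,
    layout: Layout,
}

impl AlignedBytes {
    /// Allocates `len` zeroed bytes aligned to `ALIGNMENT`. `len` must be non-zero.
    pub fn new(len: usize) -> Self {
        assert!(len > 0, "zero-sized allocations are not handled in this sketch");
        let layout = Layout::from_size_align(len, ALIGNMENT).expect("invalid layout");
        // SAFETY: `layout` has a non-zero size, as `alloc_zeroed` requires.
        let raw = unsafe { alloc_zeroed(layout) };
        let ptr = NonNull::new(raw).unwrap_or_else(|| handle_alloc_error(layout));
        Self { ptr, layout }
    }

    pub fn as_ptr(&self) -> *const u8 {
        self.ptr.as_ptr()
    }

    pub fn as_slice(&self) -> &[u8] {
        // SAFETY: `ptr` points to `layout.size()` zero-initialized bytes owned by `self`.
        unsafe { std::slice::from_raw_parts(self.ptr.as_ptr(), self.layout.size()) }
    }
}

impl Drop for AlignedBytes {
    fn drop(&mut self) {
        // SAFETY: `ptr` was allocated with exactly this `layout` and is freed only once.
        unsafe { dealloc(self.ptr.as_ptr(), self.layout) };
    }
}
```

Every unsafe block here carries an invariant (non-zero size, matching layout on deallocation) that the compiler cannot check, which is the maintenance cost described above.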
The Arrow format is specified in bytes (u8), which can be logically represented as certain types depending on the DataType. For many operations, such as access, representation, numerical computation and string manipulation, it is often necessary to interpret bytes as other physical types (e.g. i32).
Usage of unsafe for the purpose of interpreting bytes in their corresponding type (according to the Arrow specification) is allowed. Specifically, the pointer to the byte slice must be aligned to the type that it intends to represent, and the length of the slice must be a multiple of the size of the target type of the transmutation.
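The two conditions above (alignment and length) can be checked explicitly before reinterpreting the bytes. The sketch below is one way to do it with the standard library; bytes_as_i32 and i32_as_bytes are hypothetical helper names for this example, not functions provided by this crate.

```rust
use std::mem::{align_of, size_of};

/// Reinterprets `bytes` as a slice of `i32`.
/// Returns `None` if the pointer is misaligned for `i32` or the length
/// is not a multiple of `size_of::<i32>()`.
pub fn bytes_as_i32(bytes: &[u8]) -> Option<&[i32]> {
    let aligned = (bytes.as_ptr() as usize) % align_of::<i32>() == 0;
    let whole = bytes.len() % size_of::<i32>() == 0;
    if !(aligned && whole) {
        return None;
    }
    // SAFETY: the pointer is aligned for `i32`, the length covers a whole number
    // of `i32` values, every bit pattern is a valid `i32`, and the returned slice
    // borrows from `bytes`, so it cannot outlive the underlying memory.
    Some(unsafe {
        std::slice::from_raw_parts(bytes.as_ptr() as *const i32, bytes.len() / size_of::<i32>())
    })
}

/// Views a slice of `i32` as raw bytes. This direction needs no checks
/// because `u8` has an alignment of 1.
pub fn i32_as_bytes(values: &[i32]) -> &[u8] {
    // SAFETY: the region spans exactly `values.len() * size_of::<i32>()` initialized bytes.
    unsafe {
        std::slice::from_raw_parts(values.as_ptr() as *const u8, values.len() * size_of::<i32>())
    }
}
```

Returning Option keeps the public function safe: callers that pass a misaligned or truncated slice get None instead of undefined behavior.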
The Arrow format declares an ABI for zero-copy from and to libraries that implement the specification (foreign interfaces). In Rust, receiving and sending pointers via FFI requires usage of unsafe because the compiler cannot derive the invariants (such as lifetime, null pointers, and pointer alignment) from the source code alone, as they are part of the FFI contract.
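To make the FFI point concrete, here is a minimal sketch of turning a pointer and length received from foreign code into a slice. The function slice_from_foreign is a hypothetical name for this example and does not model the Arrow C data interface itself. Null and alignment can be checked at runtime; the lifetime and the validity of the pointed-to memory cannot, so they remain the caller's obligation.

```rust
use std::mem::align_of;

/// Builds a slice from a pointer and element count received over FFI.
///
/// # Safety
/// The caller must guarantee that `ptr` points to `len` initialized `i32`
/// values that stay valid and unmodified for the whole lifetime `'a`.
pub unsafe fn slice_from_foreign<'a>(ptr: *const i32, len: usize) -> Option<&'a [i32]> {
    // These two invariants can be verified at runtime.
    if ptr.is_null() || (ptr as usize) % align_of::<i32>() != 0 {
        return None;
    }
    // SAFETY: non-null and aligned were checked above; validity of the memory
    // and of the lifetime is part of the contract documented on this function.
    Some(unsafe { std::slice::from_raw_parts(ptr, len) })
}
```

The function stays unsafe because part of its contract cannot be enforced by any check inside it.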
The Arrow format declares an IPC protocol, which this crate supports. IPC is equivalent to an FFI in that the Rust compiler can't reason about the contract's invariants.
The API provided by the packed_simd library is currently unsafe. However, SIMD offers a significant performance improvement over non-SIMD operations.
Some operations are significantly faster when unsafe is used.
A common usage of unsafe is to offer an API to access the i-th element of an array (e.g. UInt32Array). This requires accessing the values buffer, e.g. array.buffers()[0], picking the slice [i * size_of::<u32>(), (i + 1) * size_of::<u32>()], and then transmuting it to u32. In safe Rust, this operation requires boundary checks that are detrimental to performance.
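The trade-off can be sketched with a checked and an unchecked accessor side by side. U32Values is a hypothetical wrapper over a Vec<u32> used only for illustration; it is not this crate's array type, and the benefit stated in the comment is a placeholder rather than a measured result.

```rust
/// Hypothetical stand-in for an array of `u32` values.
pub struct U32Values {
    values: Vec<u32>,
}

impl U32Values {
    pub fn new(values: Vec<u32>) -> Self {
        Self { values }
    }

    pub fn len(&self) -> usize {
        self.values.len()
    }

    /// Safe accessor: panics if `i` is out of bounds.
    pub fn value(&self, i: usize) -> u32 {
        self.values[i]
    }

    /// Unchecked accessor that skips the bounds check.
    ///
    /// # Safety
    /// The caller must guarantee that `i < self.len()`.
    pub unsafe fn value_unchecked(&self, i: usize) -> u32 {
        // Catches contract violations in debug builds at no cost in release builds.
        debug_assert!(i < self.values.len(), "index out of bounds");
        // JUSTIFICATION
        //  Benefit
        //      Avoids a bounds check in hot loops (placeholder, not benchmarked here).
        //  Soundness
        //      The caller guarantees `i < len`, and `values` is never resized.
        unsafe { *self.values.get_unchecked(i) }
    }
}
```

The unchecked variant is itself marked unsafe, so the invariant it relies on is visible at every call site.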
Usage of unsafe for performance reasons is justified only when all other alternatives have been exhausted and the performance benefits are sufficiently large (e.g. >~10%).
Considerations when introducing unsafe
Usage of unsafe in this crate must:
- not expose a public API as safe when there are necessary invariants for that API to be defined behavior
- have code documentation for why safe is not used / possible
- if applicable, add debug_asserts for relevant invariants (e.g. bound checks)
Example of code documentation:
// JUSTIFICATION
//  Benefit
//      Describe the benefit of using unsafe. E.g.
//      "30% performance degradation if the safe counterpart is used, see bench X."
//  Soundness
//      Describe why the code remains sound (according to the definition of rust's unsafe code guidelines). E.g.
//      "We bounded check these values at initialization and the array is immutable."
let ... = unsafe { ... };
When adding this documentation to existing code that is not sound and cannot trivially be fixed, we should file specific JIRA issues and reference them in these code comments. For example:
//  Soundness
//      This is not sound because .... see https://issues.apache.org/jira/browse/ARROW-nnnnn
Please see the release documentation for details on how to create Arrow releases.