This crate contains a native Rust implementation of the Arrow columnar format. It uses nightly Rust.
Here you can find general information about this crate's content and its organization.
Every array in Arrow has a data type, which specifies how the data should be laid out in memory and cast, and an optional null bitmap, which specifies whether each value is null or not. Thus, a central enum of this crate is arrow::datatypes::DataType, which contains the set of valid data types in the specification, for example arrow::datatypes::DataType::Utf8.
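To make the role of the null bitmap concrete, here is a minimal sketch that uses only the standard library. It is not this crate's API: NullBitmap and its methods are hypothetical names, and the sketch assumes the Arrow convention that a set bit marks a valid (non-null) value, with bits numbered from the least-significant bit of each byte.

```rust
/// Hypothetical validity bitmap: one bit per value, 1 = valid, 0 = null.
/// Bits are numbered from the least-significant bit of each byte.
pub struct NullBitmap {
    bits: Vec<u8>,
    len: usize,
}

impl NullBitmap {
    /// Builds a bitmap from a slice of flags, where `true` means "valid".
    pub fn from_validity(valid: &[bool]) -> Self {
        // One byte holds eight flags, so round the byte count up.
        let mut bits = vec![0u8; (valid.len() + 7) / 8];
        for (i, &is_valid) in valid.iter().enumerate() {
            if is_valid {
                bits[i / 8] |= 1u8 << (i % 8);
            }
        }
        NullBitmap { bits, len: valid.len() }
    }

    /// Returns true when the value at `index` is null (its bit is unset).
    pub fn is_null(&self, index: usize) -> bool {
        assert!(index < self.len, "index out of bounds");
        (self.bits[index / 8] & (1u8 << (index % 8))) == 0
    }

    /// Counts the null entries by scanning every position.
    pub fn null_count(&self) -> usize {
        (0..self.len).filter(|&i| self.is_null(i)).count()
    }
}
```

Storing one bit per value keeps the nullability information compact and separate from the values themselves.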
Many (but not all) data types have an associated Rust native type. The trait that represents this relationship is arrow::datatypes::ArrowNativeType, which most native types implement.
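The relationship between native types and data types can be sketched with an associated constant. Everything below is hypothetical and simplified for illustration: SimpleDataType, NativeType and describe are not the crate's DataType or ArrowNativeType, and the sketch assumes that a variable-length type such as Utf8 has no single native counterpart.

```rust
/// Simplified stand-in for a data type enum (hypothetical, not the crate's `DataType`).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SimpleDataType {
    Int32,
    UInt32,
    Float64,
    /// Variable-length strings: assumed to have no single native type in this sketch.
    Utf8,
}

/// Hypothetical trait tying a Rust native type to its data type.
pub trait NativeType: Copy {
    const DATA_TYPE: SimpleDataType;
}

impl NativeType for i32 {
    const DATA_TYPE: SimpleDataType = SimpleDataType::Int32;
}

impl NativeType for u32 {
    const DATA_TYPE: SimpleDataType = SimpleDataType::UInt32;
}

impl NativeType for f64 {
    const DATA_TYPE: SimpleDataType = SimpleDataType::Float64;
}

/// Returns the data type and the width in bytes of a native type.
pub fn describe<T: NativeType>() -> (SimpleDataType, usize) {
    (T::DATA_TYPE, std::mem::size_of::<T>())
}
```

Generic code can then be written once over `T: NativeType` and still recover the matching data type at compile time.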
arrow::datatypes::Field is a struct that contains an array's metadata (its data type and whether its values can be null) and a name. arrow::datatypes::Schema is a vector of fields with optional metadata.
Finally, arrow::record_batch::RecordBatch is a struct with a Schema and a vector of Arrays, all with the same len. A record batch is the highest-level struct that this crate currently offers.
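The way fields, a schema and a record batch fit together can be sketched with plain structs. MiniField, MiniSchema, MiniBatch and Column are hypothetical stand-ins, not the crate's types; the sketch only illustrates the invariant described above, namely that every column in a batch has the same length.

```rust
/// Hypothetical, simplified field: a name, a data type name, and nullability.
#[derive(Debug, Clone, PartialEq)]
pub struct MiniField {
    pub name: String,
    pub data_type: String,
    pub nullable: bool,
}

/// Hypothetical schema: an ordered list of fields.
#[derive(Debug, Clone, PartialEq)]
pub struct MiniSchema {
    pub fields: Vec<MiniField>,
}

/// Stand-in for an array; a real array would also carry a null bitmap.
pub type Column = Vec<i64>;

/// Hypothetical record batch: a schema plus equally long columns.
#[derive(Debug)]
pub struct MiniBatch {
    schema: MiniSchema,
    columns: Vec<Column>,
}

impl MiniBatch {
    /// Rejects input whose column count differs from the schema,
    /// or whose columns differ in length.
    pub fn try_new(schema: MiniSchema, columns: Vec<Column>) -> Result<Self, String> {
        if schema.fields.len() != columns.len() {
            return Err(format!(
                "expected {} columns, got {}",
                schema.fields.len(),
                columns.len()
            ));
        }
        let num_rows = columns.first().map_or(0, |c| c.len());
        if columns.iter().any(|c| c.len() != num_rows) {
            return Err("all columns must have the same length".to_string());
        }
        Ok(MiniBatch { schema, columns })
    }

    /// Number of rows, taken from the first column (0 for an empty batch).
    pub fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |c| c.len())
    }

    pub fn schema(&self) -> &MiniSchema {
        &self.schema
    }
}
```

Validating lengths once at construction means later code can index any column by row without re-checking.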
The central trait of this crate is arrow::array::Array, a dynamically-typed trait that can be downcast to specific implementations, such as arrow::array::UInt32Array.
Array has Array::len(), Array::data_type(), and the nullability of each of its entries, which can be obtained via Array::is_null(index). To downcast an Array to a specific implementation, you can use:
let specific_array = array.as_any().downcast_ref::<UInt32Array>().unwrap();
Once downcast, it offers two calls to retrieve specific values (and nullability):
let is_null_0: bool = specific_array.is_null(0);
let value_at_0: u32 = specific_array.value(0);
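The downcasting pattern above relies on std::any::Any. The following sketch reproduces it with the standard library alone, so it can be read without the crate: DynArray, U32Values and sum_u32 are hypothetical names, and returning 0 for a null slot is an arbitrary choice made for this example. Unlike the unwrap() call above, it returns None when the array is not of the expected type.

```rust
use std::any::Any;

/// Hypothetical dynamically-typed array trait, modelled on the description above.
pub trait DynArray {
    fn len(&self) -> usize;
    fn is_null(&self, index: usize) -> bool;
    /// Exposes the concrete type so that callers can downcast.
    fn as_any(&self) -> &dyn Any;
}

/// Hypothetical concrete array of optional u32 values.
pub struct U32Values {
    values: Vec<Option<u32>>,
}

impl U32Values {
    pub fn new(values: Vec<Option<u32>>) -> Self {
        U32Values { values }
    }

    /// Returns the value at `index`, or 0 for a null slot (arbitrary for this sketch).
    pub fn value(&self, index: usize) -> u32 {
        self.values[index].unwrap_or(0)
    }
}

impl DynArray for U32Values {
    fn len(&self) -> usize {
        self.values.len()
    }

    fn is_null(&self, index: usize) -> bool {
        self.values[index].is_none()
    }

    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Sums the non-null values if `array` is a `U32Values`; returns None otherwise.
pub fn sum_u32(array: &dyn DynArray) -> Option<u64> {
    let typed = array.as_any().downcast_ref::<U32Values>()?;
    let total = (0..typed.len())
        .filter(|&i| !typed.is_null(i))
        .map(|i| u64::from(typed.value(i)))
        .sum();
    Some(total)
}
```

Checking is_null before calling value matters here, because value alone cannot distinguish a stored 0 from a null slot.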
You can access the whole buffer of an Array via Array::data(), which returns an arrow::data::ArrayData. This struct holds the array's DataType, its arrow::buffer::Buffers, and childs (child arrays, which are themselves ArrayData).
The central structs that array implementations use to allocate and refer to memory aligned according to the specification are arrow::buffer::Buffer and arrow::buffer::MutableBuffer. These are the lowest-level abstractions of this crate, and are used throughout the crate to efficiently allocate, write, read and deallocate memory.
This implementation uses an architecture-dependent alignment, with sizes that are multiples of 64 bytes.
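As an illustration of what aligned allocation involves, the sketch below allocates zeroed memory with std::alloc. It is not how Buffer or MutableBuffer are implemented: AlignedBytes, ALIGNMENT and round_up_to_multiple_of_64 are hypothetical, and the sketch fixes the alignment at 64 bytes instead of choosing it per architecture.

```rust
use std::alloc::{alloc_zeroed, dealloc, handle_alloc_error, Layout};

/// Alignment assumed for this sketch (fixed at 64 bytes).
pub const ALIGNMENT: usize = 64;

/// Rounds `size` up to the next multiple of 64.
pub fn round_up_to_multiple_of_64(size: usize) -> usize {
    (size + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT
}

/// Hypothetical owned, zero-initialised, 64-byte aligned memory region.
pub struct AlignedBytes {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBytes {
    /// Allocates at least `min_capacity` bytes, rounded up to a multiple of 64.
    pub fn new(min_capacity: usize) -> Self {
        // Request at least one block so the layout size is never zero.
        let capacity = round_up_to_multiple_of_64(min_capacity.max(1));
        let layout = Layout::from_size_align(capacity, ALIGNMENT).expect("invalid layout");
        // SAFETY: `layout` has a non-zero size.
        let ptr = unsafe { alloc_zeroed(layout) };
        if ptr.is_null() {
            handle_alloc_error(layout);
        }
        AlignedBytes { ptr, layout }
    }

    pub fn capacity(&self) -> usize {
        self.layout.size()
    }

    pub fn as_ptr(&self) -> *const u8 {
        self.ptr
    }

    pub fn as_slice(&self) -> &[u8] {
        // SAFETY: `ptr` is valid for `capacity` bytes, all initialised to zero.
        unsafe { std::slice::from_raw_parts(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedBytes {
    fn drop(&mut self) {
        // SAFETY: `ptr` was allocated with exactly this layout.
        unsafe { dealloc(self.ptr, self.layout) };
    }
}
```

Keeping the Layout next to the pointer ensures that deallocation uses the same size and alignment as the allocation.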
This crate offers many operations (called kernels) that operate on Arrays, which you can find at arrow::compute::kernels.
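To give a feel for what a kernel does, here is a sketch of an element-wise addition over nullable values. It does not use or mirror any function in arrow::compute::kernels: add_nullable is a hypothetical name, Option<i32> slices stand in for arrays, and both the null propagation rule (a null in either input gives a null output) and the wrapping overflow behaviour are assumptions of this example.

```rust
/// Adds two nullable columns element-wise.
/// A null in either input yields a null output (assumed rule for this sketch).
/// Overflow wraps around (also an assumption of this sketch).
pub fn add_nullable(
    left: &[Option<i32>],
    right: &[Option<i32>],
) -> Result<Vec<Option<i32>>, String> {
    if left.len() != right.len() {
        return Err(format!(
            "length mismatch: {} vs {}",
            left.len(),
            right.len()
        ));
    }
    let sums = left
        .iter()
        .zip(right)
        .map(|(l, r)| match (l, r) {
            (Some(a), Some(b)) => Some(a.wrapping_add(*b)),
            _ => None,
        })
        .collect();
    Ok(sums)
}
```

A kernel of this shape takes whole columns and returns a new column, leaving the inputs untouched.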
The current status is:
The examples folder shows how to construct some different types of Arrow arrays, including dynamic arrays created at runtime.
Examples can be run using the cargo run --example command. For example:
cargo run --example builders
cargo run --example dynamic_types
cargo run --example read_csv
The IPC flatbuffer code was generated by running this command from the root of the project, using flatc version 1.10.0:
./regen.sh
The above script will run the flatc compiler and perform some adjustments to the source code, involving:
type__, which is replaced with type_
the org::apache::arrow::flatbuffers namespace
Arrow uses the following features:
simd - Arrow uses the packed_simd crate to optimize many of the implementations in the compute module using SIMD intrinsics. These optimizations are turned off by default.
flight - contains useful functions to convert between the Flight wire format and Arrow data.
prettyprint - a utility for printing record batches.
Other than simd, all the other features are enabled by default. Disabling prettyprint might be necessary in order to compile Arrow to the wasm32-unknown-unknown WASM target.
After an official project release has been made, an Arrow committer can publish this crate to crates.io using the following instructions.
Follow the crates.io instructions to create an account and log in before asking to be added as an owner of the arrow crate.
Check out the tag for the version to be released. For example:
git checkout apache-arrow-0.11.0
If the Cargo.toml in this tag already contains version = "0.11.0" (as it should), then the crate can be published with the following command:
cargo publish
If the Cargo.toml does not have the correct version, then it will be necessary to modify it manually. Since there will then be a locally modified file that is not committed to GitHub, it will be necessary to use the following command:
cargo publish --allow-dirty