# arrow-avro

Transfer data between the Apache Arrow memory format and Apache Avro.
This crate provides:

- reading Avro (OCF files, single‑object encoding, and the Confluent wire format) into Arrow `RecordBatch`es; and
- writing Arrow `RecordBatch`es into Avro (OCF or SOE).

The latest API docs for `main` (unreleased) are published on the Arrow website: `arrow_avro`.
```toml
[dependencies]
arrow-avro = "58.0.0"
```
Disable defaults and pick only what you need (see Feature Flags):
```toml
[dependencies]
arrow-avro = { version = "58.0.0", default-features = false, features = ["deflate", "snappy"] }
```
Read an Avro OCF file into Arrow `RecordBatch`es:

```rust
use std::fs::File;
use std::io::BufReader;

use arrow_array::RecordBatch;
use arrow_avro::reader::ReaderBuilder;

fn main() -> anyhow::Result<()> {
    let file = BufReader::new(File::open("data/example.avro")?);
    let mut reader = ReaderBuilder::new().build(file)?;
    while let Some(batch) = reader.next() {
        let batch: RecordBatch = batch?;
        println!("rows: {}", batch.num_rows());
    }
    Ok(())
}
```
Write Arrow data to an in‑memory Avro OCF:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_avro::writer::AvroWriter;
use arrow_schema::{DataType, Field, Schema};

fn main() -> anyhow::Result<()> {
    let schema = Schema::new(vec![Field::new("id", DataType::Int32, false)]);
    let batch = RecordBatch::try_new(
        Arc::new(schema.clone()),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )?;

    let sink: Vec<u8> = Vec::new();
    let mut w = AvroWriter::new(sink, schema)?;
    w.write(&batch)?;
    w.finish()?;
    assert!(!w.into_inner().is_empty());
    Ok(())
}
```
See the crate docs for runnable SOE and Confluent round‑trip examples.
Async reading from object storage (requires the `object_store` feature):

```rust
use std::sync::Arc;

use arrow_avro::reader::{AsyncAvroFileReader, AvroObjectReader};
use futures::TryStreamExt;
use object_store::local::LocalFileSystem;
use object_store::path::Path;
use object_store::ObjectStore;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let store: Arc<dyn ObjectStore> = Arc::new(LocalFileSystem::new());
    let path = Path::from("data/example.avro");
    let meta = store.head(&path).await?;

    let reader = AvroObjectReader::new(store, path);
    let stream = AsyncAvroFileReader::builder(reader, meta.size, 1024)
        .try_build()
        .await?;

    let batches: Vec<_> = stream.try_collect().await?;
    for batch in batches {
        println!("rows: {}", batch.num_rows());
    }
    Ok(())
}
```
## Feature flags

arrow-avro supports the Avro‑standard OCF codecs. The defaults include all five: `deflate`, `snappy`, `zstd`, `bzip2`, and `xz`.
| Feature | Default | What it enables | When to use |
|---|---|---|---|
| `deflate` | ✅ | DEFLATE compression via `flate2` (pure‑Rust backend) | Most compatible; widely supported; good compression, slower than Snappy. |
| `snappy` | ✅ | Snappy block compression via `snap`, with the CRC‑32 required by Avro | Fastest encode/decode; common in streaming/data‑lake pipelines. (Avro requires a 4‑byte big‑endian CRC of the uncompressed block.) |
| `zstd` | ✅ | Zstandard block compression via `zstd` | Great compression/speed trade‑off on modern systems. May pull in a native library. |
| `bzip2` | ✅ | BZip2 block compression | Compatibility with older datasets that used BZip2. Slower; larger deps. |
| `xz` | ✅ | XZ/LZMA block compression | Highest compression for archival data; slowest; larger deps. |
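To make the CRC note in the `snappy` row concrete: Avro requires each snappy‑compressed block to be followed by a 4‑byte big‑endian CRC‑32 (the standard ISO‑HDLC/zlib polynomial) of the *uncompressed* data. Below is a std‑only sketch of that checksum; arrow-avro uses the `crc` crate internally, so this bitwise version is purely illustrative.

```rust
/// CRC-32 (ISO-HDLC, reflected polynomial 0xEDB88320), bitwise variant.
/// Avro appends this checksum of the *uncompressed* bytes, big-endian,
/// after each snappy-compressed OCF block.
fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // 0xFFFFFFFF if low bit set
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

fn main() {
    // Well-known CRC-32 check value for the ASCII bytes "123456789".
    assert_eq!(crc32(b"123456789"), 0xCBF4_3926);
    // The 4 bytes Avro would append after the compressed block:
    let trailer = crc32(b"123456789").to_be_bytes();
    assert_eq!(trailer, [0xCB, 0xF4, 0x39, 0x26]);
    println!("ok");
}
```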
Avro defines these codecs for OCF: `null` (no compression), `deflate`, `snappy`, `bzip2`, `xz`, and `zstandard` (recent spec versions).
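For orientation, the codec a file uses is recorded in the OCF header: per the Avro spec, every OCF begins with the 4‑byte magic `Obj` + `0x01`, followed by a metadata map whose `avro.codec` entry names one of the codecs above (absent means `null`). A std‑only sketch of the magic check; the helper name is mine, not part of arrow-avro:

```rust
/// Per the Avro spec, an Object Container File begins with the bytes
/// b"Obj" followed by 0x01; the header metadata map that follows
/// carries "avro.codec" (and "avro.schema").
const OCF_MAGIC: [u8; 4] = [b'O', b'b', b'j', 0x01];

/// Illustrative helper (not an arrow-avro API): sniff the OCF magic.
fn looks_like_ocf(bytes: &[u8]) -> bool {
    bytes.len() >= 4 && bytes[..4] == OCF_MAGIC
}

fn main() {
    assert!(looks_like_ocf(b"Obj\x01rest-of-header"));
    assert!(!looks_like_ocf(b"PAR1")); // e.g. a Parquet file
    assert!(!looks_like_ocf(b"Ob")); // too short
    println!("ok");
}
```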
Notes:

- The `compression` module is specifically for OCF blocks.
- `deflate` uses `flate2` with the `rust_backend` (no system zlib required).

Async and object‑storage features:

| Feature | Default | What it enables | When to use |
|---|---|---|---|
| `async` | ⬜ | Async APIs for reading Avro via `futures` and `tokio` | Enable for non‑blocking async Avro reading with `AsyncAvroFileReader`. |
| `object_store` | ⬜ | Integration with the `object_store` crate (implies `async`) | Enable for reading Avro from cloud storage (S3, GCS, Azure Blob, etc.). |
Schema‑fingerprint and type‑mapping features:

| Feature | Default | What it enables | When to use |
|---|---|---|---|
| `md5` | ⬜ | `md5` dep for optional MD5 schema fingerprints | If you want to compute MD5 fingerprints of writer schemas (e.g. for custom prefixing/validation). |
| `sha256` | ⬜ | `sha2` dep for optional SHA‑256 schema fingerprints | If you prefer longer fingerprints; affects max prefix length (e.g. when framing). |
| `small_decimals` | ⬜ | Extra handling for small decimal logical types (`Decimal32` and `Decimal64`) | If your Avro decimal values are small and you want more compact Arrow representations. |
| `avro_custom_types` | ⬜ | Annotates Avro values using Arrow‑specific custom logical types | Enable when you need arrow-avro to reinterpret certain Avro fields as Arrow types that Avro doesn't natively model. |
| `canonical_extension_types` | ⬜ | Re‑exports Arrow's canonical extension type support from `arrow-schema` | Enable if your workflow uses Arrow canonical extension types and you want arrow-avro to respect them. |
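Beyond the optional `md5`/`sha256` helpers, the Avro spec's default schema fingerprint is CRC‑64‑AVRO, a 64‑bit Rabin fingerprint (this is what the single‑object encoding prefix embeds). A std‑only sketch of the algorithm as defined in the Avro spec, for illustration only; arrow-avro ships its own implementation:

```rust
/// 64-bit Rabin fingerprint ("CRC-64-AVRO") as defined in the Avro spec.
/// EMPTY is the fingerprint of the empty byte string and also seeds the table.
const EMPTY: u64 = 0xc15d_213a_a4d7_a795;

fn fingerprint64(buf: &[u8]) -> u64 {
    // Build the 256-entry lookup table (the spec computes this once, statically).
    let mut table = [0u64; 256];
    for (i, entry) in table.iter_mut().enumerate() {
        let mut fp = i as u64;
        for _ in 0..8 {
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        *entry = fp;
    }
    let mut fp = EMPTY;
    for &b in buf {
        fp = (fp >> 1) ^ table[((fp ^ b as u64) & 0xff) as usize];
    }
    fp
}

fn main() {
    // Fingerprints are taken over the schema's Parsing Canonical Form.
    let a = fingerprint64(br#""int""#);
    let b = fingerprint64(br#""long""#);
    assert_ne!(a, b); // different schemas, different fingerprints
    assert_eq!(fingerprint64(b""), EMPTY); // empty input yields EMPTY
    println!("{a:#018x}");
}
```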
Lower‑level/internal toggles (rarely used directly): `flate2`, `snap`, `crc`, `zstd`, `bzip2`, and `xz` are optional dependencies wired to the user‑facing features above. You normally enable `deflate`/`snappy`/`zstd`/`bzip2`/`xz`, not these directly.

Minimal, fast build (common pipelines):
```toml
arrow-avro = { version = "58", default-features = false, features = ["deflate", "snappy"] }
```
Include Zstandard too (modern data lakes):
```toml
arrow-avro = { version = "58", default-features = false, features = ["deflate", "snappy", "zstd"] }
```
Async reading from object stores (S3, GCS, etc.):
```toml
arrow-avro = { version = "58", features = ["object_store"] }
```
Fingerprint helpers:
```toml
arrow-avro = { version = "58", features = ["md5", "sha256"] }
```
Supported single‑message prefix formats:

- Confluent wire format: `0x00` + 4‑byte big‑endian schema ID + Avro body; decode and encode helpers are supported.
- Avro single‑object encoding (SOE): `0xC3 0x01` + 8‑byte little‑endian CRC‑64‑AVRO fingerprint + Avro body; decode and encode helpers are supported.

There are additional examples under `arrow-avro/examples/` in the repository.
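The two prefixes above are plain byte framing, so they can be sketched with std only. arrow-avro's own encode/decode helpers handle this for you; the function names below are illustrative, not crate APIs:

```rust
/// Confluent wire format: 0x00 magic + 4-byte big-endian schema ID + Avro body.
fn confluent_prefix(schema_id: u32) -> [u8; 5] {
    let mut p = [0u8; 5];
    p[1..5].copy_from_slice(&schema_id.to_be_bytes());
    p
}

/// Avro single-object encoding: 0xC3 0x01 marker + 8-byte little-endian
/// CRC-64-AVRO schema fingerprint + Avro body.
fn soe_prefix(fingerprint: u64) -> [u8; 10] {
    let mut p = [0u8; 10];
    p[0] = 0xC3;
    p[1] = 0x01;
    p[2..10].copy_from_slice(&fingerprint.to_le_bytes());
    p
}

fn main() {
    // Schema ID 7, big-endian.
    assert_eq!(confluent_prefix(7), [0x00, 0x00, 0x00, 0x00, 0x07]);
    // Fingerprint bytes are little-endian after the C3 01 marker.
    let soe = soe_prefix(0x0102_0304_0506_0708);
    assert_eq!(&soe[..2], &[0xC3, 0x01]);
    assert_eq!(&soe[2..], &[0x08, 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01]);
    println!("ok");
}
```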