
arrow-avro

crates.io docs.rs

Transfer data between the Apache Arrow memory format and Apache Avro.

This crate provides:

  • a reader that decodes Avro Object Container Files (OCF), Avro Single‑Object Encoding (SOE), and the Confluent Schema Registry wire format into Arrow RecordBatches; and
  • a writer that encodes Arrow RecordBatches into Avro (OCF or SOE).

The latest API docs for main (unreleased) are published on the Arrow website: arrow_avro.


Install

[dependencies]
arrow-avro = "58.0.0"

Disable defaults and pick only what you need (see Feature Flags):

[dependencies]
arrow-avro = { version = "58.0.0", default-features = false, features = ["deflate", "snappy"] }

Quick start

Read an Avro OCF file into Arrow

use std::fs::File;
use std::io::BufReader;

use arrow_avro::reader::ReaderBuilder;
use arrow_array::RecordBatch;

fn main() -> anyhow::Result<()> {
    let file = BufReader::new(File::open("data/example.avro")?);
    let mut reader = ReaderBuilder::new().build(file)?;
    while let Some(batch) = reader.next() {
        let batch: RecordBatch = batch?;
        println!("rows: {}", batch.num_rows());
    }
    Ok(())
}

Write Arrow to Avro OCF (in‑memory)

use std::sync::Arc;

use arrow_avro::writer::AvroWriter;
use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};

fn main() -> anyhow::Result<()> {
    let schema = Schema::new(vec![Field::new("id", DataType::Int32, false)]);
    let batch = RecordBatch::try_new(
        Arc::new(schema.clone()),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )?;

    let sink: Vec<u8> = Vec::new();
    let mut w = AvroWriter::new(sink, schema)?;
    w.write(&batch)?;
    w.finish()?;
    assert!(!w.into_inner().is_empty());
    Ok(())
}

See the crate docs for runnable SOE and Confluent round‑trip examples.

Async reading from object stores (object_store feature)

use std::sync::Arc;
use arrow_avro::reader::{AsyncAvroFileReader, AvroObjectReader};
use futures::TryStreamExt;
use object_store::ObjectStore;
use object_store::local::LocalFileSystem;
use object_store::path::Path;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let store: Arc<dyn ObjectStore> = Arc::new(LocalFileSystem::new());
    let path = Path::from("data/example.avro");

    let meta = store.head(&path).await?;
    let reader = AvroObjectReader::new(store, path);

    let stream = AsyncAvroFileReader::builder(reader, meta.size, 1024)
        .try_build()
        .await?;

    let batches: Vec<_> = stream.try_collect().await?;
    for batch in batches {
        println!("rows: {}", batch.num_rows());
    }
    Ok(())
}

Feature Flags (what they do and when to use them)

Compression codecs (OCF block compression)

arrow-avro supports the Avro‑standard OCF codecs. The defaults include all five: deflate, snappy, zstd, bzip2, and xz.

| Feature | Default | What it enables | When to use |
|---|---|---|---|
| `deflate` | Yes | DEFLATE compression via `flate2` (pure‑Rust backend) | Most compatible; widely supported; good compression, slower than Snappy. |
| `snappy` | Yes | Snappy block compression via `snap`, with the CRC‑32 required by Avro | Fastest decode/encode; common in streaming/data‑lake pipelines. (Avro requires a 4‑byte big‑endian CRC of the uncompressed block.) |
| `zstd` | Yes | Zstandard block compression via `zstd` | Great compression/speed trade‑off on modern systems. May pull in a native library. |
| `bzip2` | Yes | BZip2 block compression | For compatibility with older datasets that used BZip2. Slower; larger deps. |
| `xz` | Yes | XZ/LZMA block compression | Highest compression for archival data; slowest; larger deps. |

Avro defines these codecs for OCF: null (no compression), deflate, snappy, bzip2, xz, and zstandard (recent spec versions).

Notes

  • Only OCF uses these codecs (they compress per‑block). They do not apply to raw Avro frames used by Confluent wire format or SOE. The crate’s compression module is specifically for OCF blocks.
  • deflate uses flate2 with the rust_backend (no system zlib required).
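To make the Snappy CRC note above concrete, here is a std-only sketch of the byte layout Avro requires for a Snappy-compressed OCF block: the compressed data followed by a 4-byte big-endian CRC-32 of the uncompressed bytes. The Snappy compression itself is elided, and `crc32_ieee` is a hypothetical helper (a minimal bitwise implementation of the standard reflected CRC-32, polynomial 0xEDB88320); in the real crate this comes from the `crc` and `snap` dependencies.

```rust
/// Minimal CRC-32 (IEEE, reflected, poly 0xEDB88320) — illustrative only;
/// the crate uses the `crc` dependency for this.
fn crc32_ieee(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0xEDB8_8320 } else { crc >> 1 };
        }
    }
    !crc
}

/// The 4-byte suffix Avro appends to a Snappy-compressed OCF block:
/// a big-endian CRC-32 of the *uncompressed* block contents.
fn snappy_block_suffix(uncompressed: &[u8]) -> [u8; 4] {
    crc32_ieee(uncompressed).to_be_bytes()
}

fn main() {
    // "123456789" is the standard CRC-32 check input; its checksum is 0xCBF43926.
    assert_eq!(crc32_ieee(b"123456789"), 0xCBF4_3926);
    assert_eq!(snappy_block_suffix(b"123456789"), [0xCB, 0xF4, 0x39, 0x26]);
    println!("ok");
}
```

The checksum covers the uncompressed data, so a reader can verify integrity after decompression.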

Async & Object Store

| Feature | Default | What it enables | When to use |
|---|---|---|---|
| `async` | No | Async APIs for reading Avro via `futures` and `tokio` | Enable for non‑blocking async Avro reading with `AsyncAvroFileReader`. |
| `object_store` | No | Integration with the `object_store` crate (implies `async`) | Enable for reading Avro from cloud storage (S3, GCS, Azure Blob, etc.). |

Schema fingerprints & custom logical type helpers

| Feature | Default | What it enables | When to use |
|---|---|---|---|
| `md5` | No | `md5` dependency for optional MD5 schema fingerprints | If you want to compute MD5 fingerprints of writer schemas (e.g. for custom prefixing/validation). |
| `sha256` | No | `sha2` dependency for optional SHA‑256 schema fingerprints | If you prefer longer fingerprints; affects the maximum prefix length (e.g. when framing). |
| `small_decimals` | No | Extra handling for small decimal logical types (`Decimal32` and `Decimal64`) | If your Avro decimal values are small and you want more compact Arrow representations. |
| `avro_custom_types` | No | Annotates Avro values with Arrow‑specific custom logical types | Enable when you need arrow-avro to reinterpret certain Avro fields as Arrow types that Avro doesn't natively model. |
| `canonical_extension_types` | No | Re‑exports Arrow's canonical extension type support from `arrow-schema` | Enable if your workflow uses Arrow canonical extension types and you want arrow-avro to respect them. |

Lower‑level/internal toggles (rarely used directly)

  • flate2, snap, crc, zstd, bzip2, xz are optional dependencies wired to the user‑facing features above. You normally enable deflate/snappy/zstd/bzip2/xz, not these directly.

Feature snippets

  • Minimal, fast build (common pipelines):

    arrow-avro = { version = "58", default-features = false, features = ["deflate", "snappy"] }
    
  • Include Zstandard too (modern data lakes):

    arrow-avro = { version = "58", default-features = false, features = ["deflate", "snappy", "zstd"] }
    
  • Async reading from object stores (S3, GCS, etc.):

    arrow-avro = { version = "58", features = ["object_store"] }
    
  • Fingerprint helpers:

    arrow-avro = { version = "58", features = ["md5", "sha256"] }
    

What formats are supported?

  • OCF (Object Container Files): self‑describing Avro files with header, optional compression, sync markers; reader and writer supported.
  • Confluent Schema Registry wire format: 1‑byte magic 0x00 + 4‑byte BE schema ID + Avro body; supports decode + encode helpers.
  • Avro Single‑Object Encoding (SOE): 2‑byte magic 0xC3 0x01 + 8‑byte LE CRC‑64‑AVRO fingerprint + Avro body; supports decode + encode helpers.
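The two single-record framings above can be sketched with plain byte manipulation, independent of any arrow-avro API. The helpers `confluent_frame` and `soe_frame` below are illustrative names, not crate functions; they simply lay out the prefixes the list describes around an already Avro-encoded body.

```rust
/// Confluent wire format: 1-byte magic 0x00, 4-byte big-endian schema ID,
/// then the Avro-encoded body.
fn confluent_frame(schema_id: u32, body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(5 + body.len());
    out.push(0x00);
    out.extend_from_slice(&schema_id.to_be_bytes());
    out.extend_from_slice(body);
    out
}

/// Avro Single-Object Encoding: 2-byte magic 0xC3 0x01, 8-byte little-endian
/// CRC-64-AVRO fingerprint of the writer schema, then the Avro-encoded body.
fn soe_frame(fingerprint: u64, body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(10 + body.len());
    out.extend_from_slice(&[0xC3, 0x01]);
    out.extend_from_slice(&fingerprint.to_le_bytes());
    out.extend_from_slice(body);
    out
}

fn main() {
    let body = [0x02u8]; // e.g. the Avro zig-zag encoding of the long `1`
    assert_eq!(confluent_frame(42, &body), vec![0x00, 0, 0, 0, 42, 0x02]);
    let soe = soe_frame(0x0123_4567_89AB_CDEF, &body);
    assert_eq!(&soe[..2], &[0xC3, 0x01]);
    println!("framed {} and {} bytes", 6, soe.len());
}
```

Note the endianness difference: the Confluent schema ID is big-endian, while the SOE fingerprint is little-endian, so the two prefixes are never interchangeable.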

Examples

  • Read/write OCF in memory and from files (see crate docs “OCF round‑trip”).
  • Confluent wire‑format and SOE quickstarts are provided as runnable snippets in docs.

There are additional examples under arrow-avro/examples/ in the repository.