
arrow-avro

crates.io docs.rs

Transfer data between the Apache Arrow memory format and Apache Avro.

This crate provides:

  • a reader that decodes Avro Object Container Files (OCF), Avro Single‑Object Encoding (SOE), and the Confluent Schema Registry wire format into Arrow RecordBatches; and
  • a writer that encodes Arrow RecordBatches into Avro (OCF or SOE).

The latest API docs for main (unreleased) are published on the Arrow website: arrow_avro.


Install

[dependencies]
arrow-avro = "58.0.0"

Disable defaults and pick only what you need (see Feature Flags):

[dependencies]
arrow-avro = { version = "58.0.0", default-features = false, features = ["deflate", "snappy"] }

Quick start

Read an Avro OCF file into Arrow

use std::fs::File;
use std::io::BufReader;

use arrow_avro::reader::ReaderBuilder;
use arrow_array::RecordBatch;

fn main() -> anyhow::Result<()> {
    let file = BufReader::new(File::open("data/example.avro")?);
    let mut reader = ReaderBuilder::new().build(file)?;
    while let Some(batch) = reader.next() {
        let batch: RecordBatch = batch?;
        println!("rows: {}", batch.num_rows());
    }
    Ok(())
}

Write Arrow to Avro OCF (in‑memory)

use std::sync::Arc;

use arrow_avro::writer::AvroWriter;
use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};

fn main() -> anyhow::Result<()> {
    let schema = Schema::new(vec![Field::new("id", DataType::Int32, false)]);
    let batch = RecordBatch::try_new(
        Arc::new(schema.clone()),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )?;

    let sink: Vec<u8> = Vec::new();
    let mut w = AvroWriter::new(sink, schema)?;
    w.write(&batch)?;
    w.finish()?;
    assert!(!w.into_inner().is_empty());
    Ok(())
}

See the crate docs for runnable SOE and Confluent round‑trip examples.

Async reading from object stores (object_store feature)

use std::sync::Arc;
use arrow_avro::reader::{AsyncAvroFileReader, AvroObjectReader};
use futures::TryStreamExt;
use object_store::ObjectStore;
use object_store::local::LocalFileSystem;
use object_store::path::Path;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let store: Arc<dyn ObjectStore> = Arc::new(LocalFileSystem::new());
    let path = Path::from("data/example.avro");

    let meta = store.head(&path).await?;
    let reader = AvroObjectReader::new(store, path);

    let stream = AsyncAvroFileReader::builder(reader, meta.size, 1024)
        .try_build()
        .await?;

    let batches: Vec<_> = stream.try_collect().await?;
    for batch in batches {
        println!("rows: {}", batch.num_rows());
    }
    Ok(())
}

Feature Flags (what they do and when to use them)

Compression codecs (OCF block compression)

arrow-avro supports the Avro‑standard OCF codecs. The defaults include all five: deflate, snappy, zstd, bzip2, and xz.

| Feature | Default | What it enables | When to use |
|---|---|---|---|
| `deflate` | Yes | DEFLATE compression via `flate2` (pure‑Rust backend) | Most compatible; widely supported; good compression, slower than Snappy. |
| `snappy` | Yes | Snappy block compression via `snap`, with the CRC‑32 required by Avro | Fastest decode/encode; common in streaming/data‑lake pipelines. (Avro requires a 4‑byte big‑endian CRC of the uncompressed block.) |
| `zstd` | Yes | Zstandard block compression via `zstd` | Great compression/speed trade‑off on modern systems. May pull in a native library. |
| `bzip2` | Yes | BZip2 block compression | For compatibility with older datasets that used BZip2. Slower; larger deps. |
| `xz` | Yes | XZ/LZMA block compression | Highest compression for archival data; slowest; larger deps. |

Avro defines these codecs for OCF: null (no compression), deflate, snappy, bzip2, xz, and zstandard (recent spec versions).

Notes

  • Only OCF uses these codecs (they compress per‑block). They do not apply to raw Avro frames used by Confluent wire format or SOE. The crate’s compression module is specifically for OCF blocks.
  • deflate uses flate2 with the rust_backend (no system zlib required).
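To make the Snappy CRC note above concrete, here is a std-only sketch of the byte layout Avro requires for a Snappy-compressed OCF block: the compressed data followed by a 4-byte big-endian CRC-32 of the uncompressed bytes. The Snappy compression itself is elided, and `crc32_ieee` is a hypothetical helper (a minimal bitwise implementation of the standard reflected CRC-32, polynomial 0xEDB88320); in the real crate this comes from the `crc` and `snap` dependencies.

```rust
/// Minimal CRC-32 (IEEE, reflected, poly 0xEDB88320) — illustrative only;
/// the crate uses the `crc` dependency for this.
fn crc32_ieee(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0xEDB8_8320 } else { crc >> 1 };
        }
    }
    !crc
}

/// The 4-byte suffix Avro appends to a Snappy-compressed OCF block:
/// a big-endian CRC-32 of the *uncompressed* block contents.
fn snappy_block_suffix(uncompressed: &[u8]) -> [u8; 4] {
    crc32_ieee(uncompressed).to_be_bytes()
}

fn main() {
    // "123456789" is the standard CRC-32 check input; its checksum is 0xCBF43926.
    assert_eq!(crc32_ieee(b"123456789"), 0xCBF4_3926);
    assert_eq!(snappy_block_suffix(b"123456789"), [0xCB, 0xF4, 0x39, 0x26]);
    println!("ok");
}
```

The checksum covers the uncompressed data, so a reader can verify integrity after decompression.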

Async & Object Store

| Feature | Default | What it enables | When to use |
|---|---|---|---|
| `async` | No | Async APIs for reading Avro via `futures` and `tokio` | Enable for non‑blocking async Avro reading with `AsyncAvroFileReader`. |
| `object_store` | No | Integration with the `object_store` crate (implies `async`) | Enable for reading Avro from cloud storage (S3, GCS, Azure Blob, etc.). |

Schema fingerprints & custom logical type helpers

| Feature | Default | What it enables | When to use |
|---|---|---|---|
| `md5` | No | `md5` dependency for optional MD5 schema fingerprints | If you want to compute MD5 fingerprints of writer schemas (e.g. for custom prefixing/validation). |
| `sha256` | No | `sha2` dependency for optional SHA‑256 schema fingerprints | If you prefer longer fingerprints; affects the maximum prefix length (e.g. when framing). |
| `small_decimals` | No | Extra handling for small decimal logical types (`Decimal32` and `Decimal64`) | If your Avro decimal values are small and you want more compact Arrow representations. |
| `avro_custom_types` | No | Annotates Avro values with Arrow‑specific custom logical types | Enable when you need arrow-avro to reinterpret certain Avro fields as Arrow types that Avro doesn't natively model. |
| `canonical_extension_types` | No | Re‑exports Arrow's canonical extension type support from `arrow-schema` | Enable if your workflow uses Arrow canonical extension types and you want arrow-avro to respect them. |

Lower‑level/internal toggles (rarely used directly)

  • flate2, snap, crc, zstd, bzip2, xz are optional dependencies wired to the user‑facing features above. You normally enable deflate/snappy/zstd/bzip2/xz, not these directly.

Feature snippets

  • Minimal, fast build (common pipelines):

    arrow-avro = { version = "58", default-features = false, features = ["deflate", "snappy"] }
    
  • Include Zstandard too (modern data lakes):

    arrow-avro = { version = "58", default-features = false, features = ["deflate", "snappy", "zstd"] }
    
  • Async reading from object stores (S3, GCS, etc.):

    arrow-avro = { version = "58", features = ["object_store"] }
    
  • Fingerprint helpers:

    arrow-avro = { version = "58", features = ["md5", "sha256"] }
    

What formats are supported?

  • OCF (Object Container Files): self‑describing Avro files with header, optional compression, sync markers; reader and writer supported.
  • Confluent Schema Registry wire format: 1‑byte magic 0x00 + 4‑byte BE schema ID + Avro body; supports decode + encode helpers.
  • Avro Single‑Object Encoding (SOE): 2‑byte magic 0xC3 0x01 + 8‑byte LE CRC‑64‑AVRO fingerprint + Avro body; supports decode + encode helpers.
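The two single-record framings above can be sketched with plain byte manipulation, independent of any arrow-avro API. The helpers `confluent_frame` and `soe_frame` below are illustrative names, not crate functions; they simply lay out the prefixes the list describes around an already Avro-encoded body.

```rust
/// Confluent wire format: 1-byte magic 0x00, 4-byte big-endian schema ID,
/// then the Avro-encoded body.
fn confluent_frame(schema_id: u32, body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(5 + body.len());
    out.push(0x00);
    out.extend_from_slice(&schema_id.to_be_bytes());
    out.extend_from_slice(body);
    out
}

/// Avro Single-Object Encoding: 2-byte magic 0xC3 0x01, 8-byte little-endian
/// CRC-64-AVRO fingerprint of the writer schema, then the Avro-encoded body.
fn soe_frame(fingerprint: u64, body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(10 + body.len());
    out.extend_from_slice(&[0xC3, 0x01]);
    out.extend_from_slice(&fingerprint.to_le_bytes());
    out.extend_from_slice(body);
    out
}

fn main() {
    let body = [0x02u8]; // e.g. the Avro zig-zag encoding of the long `1`
    assert_eq!(confluent_frame(42, &body), vec![0x00, 0, 0, 0, 42, 0x02]);
    let soe = soe_frame(0x0123_4567_89AB_CDEF, &body);
    assert_eq!(&soe[..2], &[0xC3, 0x01]);
    println!("framed {} and {} bytes", 6, soe.len());
}
```

Note the endianness difference: the Confluent schema ID is big-endian, while the SOE fingerprint is little-endian, so the two prefixes are never interchangeable.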

Examples

  • Read/write OCF in memory and from files (see crate docs “OCF round‑trip”).
  • Confluent wire‑format and SOE quickstarts are provided as runnable snippets in docs.

There are additional examples under arrow-avro/examples/ in the repository.