tpchgen-cli: only needed to generate test data for the Parquet integration test (cargo install tpchgen-cli).
Maven is bundled via the ./mvnw wrapper; no separate Maven install is required.
make test
This builds the native Rust crate and runs the JUnit tests. The steps can be run individually:
cargo build --workspace
./mvnw test
The native library must be built before running JVM tests.
Before pushing, run make format to apply the Java + Rust formatters in place. CI verifies formatting, clippy, and license headers on every PR.
The first build in a fresh checkout reaches out to raw.githubusercontent.com to fetch the DataFusion .proto files used to generate the datafusion-proto Java classes. Subsequent builds are offline; the download-maven-plugin cache under ~/.m2/repository/.cache/ satisfies them.
The Parquet integration test reads TPC-H SF1 data (~345 MB across 8 tables in Snappy-compressed Parquet). Generate it once with:
make tpch-data
Tests that need this data skip cleanly if it is missing. make clean does not remove tpch-data/ — delete it manually to reclaim the disk space.
The repository is a multi-module Maven build:
Cargo.toml: Rust workspace root declaring the crate members (native, native-common) and [workspace.dependencies] that pin shared versions in one place. Cargo writes artifacts to rust-target/ (overridden in .cargo/config.toml) so mvn clean at the repo root does not nuke the Rust build cache.
pom.xml: parent POM declaring the core and examples modules and shared plugin/dependency versions.
core/: datafusion-java library module (Java sources, tests, and generated protobuf classes).
examples/: datafusion-java-examples module containing runnable examples that depend on the library; built alongside the library so they cannot fall out of sync with the API.
native/: datafusion-jni Rust crate (JNI + Arrow C Data Interface).
native-common/: datafusion-jni-common Rust crate with the JNI plumbing shared across native crates (error→exception mapping, the per-cdylib Tokio runtime singleton, the async-stream→FFI_ArrowArrayStream bridge).
proto/: Protobuf definitions shared between Java and Rust.
Makefile: top-level build orchestration (make test, make format, make tpch-data).
mvnw, mvnw.cmd: bundled Maven wrapper.
docs/: Sphinx documentation source and build scripts.

The examples module wires exec-maven-plugin with the right java.library.path and --add-opens flags. Install the library to the local Maven repository once, then run any example by main class:
./mvnw install -DskipTests
./mvnw -pl :datafusion-java-examples exec:exec \
    -Dexec.mainClass=org.apache.datafusion.examples.SqlQueryExample
The bundled examples (under examples/src/main/java/org/apache/datafusion/examples/):
SqlQueryExample: register a CSV and run a SQL aggregation.
DataFrameExample: read CSV → filter / select / rename / distinct → write Parquet → read back.
ProtoPlanExample: build a DataFusion LogicalPlanNode directly via the generated protobuf classes and execute it through SessionContext.fromProto.

Re-run ./mvnw install -DskipTests whenever you change the library.
When a JNI call needs to carry more than a handful of scalar arguments — for example, a struct of nullable knobs like CsvReadOptions or SessionOptions — encode the call's configuration as a protobuf message rather than expanding the JNI signature with one parameter per field.
Add a .proto file under proto/, declare package datafusion_java;, and follow the conventions already in use:
Use optional for fields whose unset-ness must survive the boundary (so the Rust side can leave a DataFusion default in place).
Prefix enum values with the enum name (FILE_COMPRESSION_TYPE_GZIP, not bare GZIP) because proto3 enum values are scoped at the package level, and make the zero value a _UNSPECIFIED sentinel; the Rust side should reject UNSPECIFIED rather than silently default it.
Pass byte[] payloads as separate JNI arguments next to the options proto, not inside it.
Suffix the message name with Proto if a sibling Java class would otherwise shadow it (e.g. CsvReadOptionsProto vs the public CsvReadOptions).

The proto is compiled by both prost-build (Rust, via native/build.rs) and the Maven protobuf-maven-plugin (Java). The Java side builds the message, serializes it to bytes, and passes the byte array through JNI; the Rust side decodes once and folds the fields into the corresponding DataFusion struct.
This pattern keeps JNI signatures short, makes nullable and enum fields explicit in a single typed schema, and lets new fields be added without touching the signature on either side.
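As a rough illustration of the Rust-side "decode once and fold" step, here is a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: the schema in the leading comment, the field names, the enum numbering, and the CsvOptions defaults are hypothetical, and the structs are hand-written stand-ins for what prost would generate and for the real DataFusion options type. It only shows the two conventions in action: an unset optional field leaves the default untouched, and an explicit UNSPECIFIED enum value is rejected.

```rust
// Hypothetical schema assumed by this sketch (not the project's actual .proto):
//
//   enum FileCompressionType {
//     FILE_COMPRESSION_TYPE_UNSPECIFIED = 0;
//     FILE_COMPRESSION_TYPE_GZIP = 1;
//     FILE_COMPRESSION_TYPE_UNCOMPRESSED = 2;
//   }
//   message CsvReadOptionsProto {
//     optional bool has_header = 1;
//     optional uint32 delimiter = 2;
//     optional FileCompressionType file_compression_type = 3;
//   }

use std::convert::TryFrom;

/// Hand-written stand-in for the prost-generated message: `optional` fields
/// become `Option<T>`, and enum fields are carried as raw `i32` values.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct CsvReadOptionsProto {
    pub has_header: Option<bool>,
    pub delimiter: Option<u32>,
    pub file_compression_type: Option<i32>,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Compression {
    Gzip,
    Uncompressed,
}

/// Hypothetical stand-in for the DataFusion-side options struct.
#[derive(Debug, Clone, PartialEq)]
pub struct CsvOptions {
    pub has_header: bool,
    pub delimiter: u8,
    pub compression: Compression,
}

impl Default for CsvOptions {
    fn default() -> Self {
        // Illustrative defaults only; the real defaults live in DataFusion.
        CsvOptions {
            has_header: true,
            delimiter: b',',
            compression: Compression::Uncompressed,
        }
    }
}

/// Maps the raw enum number to a typed value, rejecting the zero sentinel
/// and any number this side does not know about.
fn decode_compression(raw: i32) -> Result<Compression, String> {
    match raw {
        1 => Ok(Compression::Gzip),
        2 => Ok(Compression::Uncompressed),
        0 => Err("file_compression_type is UNSPECIFIED".to_string()),
        other => Err(format!("unknown file_compression_type value {}", other)),
    }
}

/// Folds the decoded message into the options struct. Fields the Java side
/// never set stay `None`, so the default is left in place.
pub fn fold_options(proto: &CsvReadOptionsProto) -> Result<CsvOptions, String> {
    let mut opts = CsvOptions::default();
    if let Some(has_header) = proto.has_header {
        opts.has_header = has_header;
    }
    if let Some(delimiter) = proto.delimiter {
        opts.delimiter = u8::try_from(delimiter)
            .map_err(|_| format!("delimiter {} does not fit in one byte", delimiter))?;
    }
    if let Some(raw) = proto.file_compression_type {
        opts.compression = decode_compression(raw)?;
    }
    Ok(opts)
}
```

Calling fold_options on CsvReadOptionsProto::default() returns CsvOptions::default() unchanged, while a message with file_compression_type set to Some(0) yields an error instead of silently picking a codec. In a real crate the message struct would come from prost-build and the error would presumably go through the shared error→exception mapping rather than a String.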