DataFusion Java supports two query interfaces: SQL strings via SessionContext.sql(String), and a programmatic DataFrame API.
    try (DataFrame df = ctx.sql("SELECT a, b FROM t WHERE a > 10")) {
        df.show();
    }
sql(String) plans the query and returns a DataFrame. Execution does not start until you pull results.
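The split between planning and execution can be illustrated with a small self-contained sketch. LazyQuerySketch below is a hypothetical stand-in written only for this illustration; it is not part of DataFusion Java and makes no claim about how the library is implemented. It records an event when the query is "planned" and another only when results are pulled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a lazily executed query; NOT the DataFusion Java API.
final class LazyQuerySketch {
    private final Supplier<List<Integer>> plan;

    LazyQuerySketch(List<String> events, List<Integer> source, int threshold) {
        events.add("planned");
        // Capture the work in a Supplier; nothing is evaluated at this point.
        this.plan = () -> {
            events.add("executed");
            List<Integer> out = new ArrayList<>();
            for (int v : source) {
                if (v > threshold) {
                    out.add(v);
                }
            }
            return out;
        };
    }

    // Pulling results is what triggers the deferred work.
    List<Integer> collect() {
        return plan.get();
    }

    static List<String> demo() {
        List<String> events = new ArrayList<>();
        LazyQuerySketch q = new LazyQuerySketch(events, List.of(5, 20, 30), 10);
        events.add("returned");
        List<Integer> rows = q.collect();
        events.add("rows=" + rows);
        return events;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In the recorded events, "executed" appears only after "returned", which is the ordering the paragraph above describes for sql(String).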
The DataFrame API exposes select, filter, limit, distinct, dropColumns, and withColumnRenamed.
    try (DataFrame df = ctx.readParquet("/path/to/orders.parquet");
         DataFrame filtered = df.filter("o_orderpriority = '1-URGENT'")) {
        filtered.show();
    }
Each transformation returns a new DataFrame that must be closed.
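Because every intermediate result is its own resource, a single try-with-resources statement with several declarations is a convenient way to manage them: Java closes the resources in the reverse order of their declaration. The sketch below demonstrates that ordering with FrameSketch, a hypothetical AutoCloseable stand-in written for this illustration. It is not the DataFusion Java DataFrame class, and its filter method only mimics the "returns a new object" behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a native-backed DataFrame; NOT the DataFusion Java class.
final class FrameSketch implements AutoCloseable {
    private final String name;
    private final List<String> log;

    FrameSketch(String name, List<String> log) {
        this.name = name;
        this.log = log;
        log.add("open " + name);
    }

    // Returns a separate object with its own lifetime, like a transformation would.
    FrameSketch filter(String predicate) {
        return new FrameSketch(name + ".filter", log);
    }

    @Override
    public void close() {
        log.add("close " + name);
    }
}

final class CloseOrderSketch {
    static List<String> run() {
        List<String> log = new ArrayList<>();
        // Both frames are declared in one statement; each is closed exactly once.
        try (FrameSketch df = new FrameSketch("df", log);
             FrameSketch filtered = df.filter("a > 10")) {
            log.add("use df.filter");
        }
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The derived frame is closed first and the source frame last, so a derived object never outlives the one it was built from.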
Three patterns are available for consuming the results of a DataFrame:
Stream as Arrow batches. Use collect(allocator) to pull the result set as Arrow record batches via the Arrow C Data Interface:
    try (DataFrame df = ctx.sql("SELECT ...");
         ArrowReader reader = df.collect(allocator)) {
        while (reader.loadNextBatch()) {
            var batch = reader.getVectorSchemaRoot();
            // process batch...
        }
    }
Count rows. df.count() returns the row count without materializing the rows in the JVM.
Print for inspection. df.show() and df.show(int n) print results to standard output. Useful for exploration; not appropriate for production code paths.
To get the schema of a registered table without running a query:
org.apache.arrow.vector.types.pojo.Schema schema = ctx.tableSchema("orders");
A DataFusion logical plan can be deserialized from datafusion-proto bytes via SessionContext.fromProto(byte[]). The datafusion-proto Java classes are generated by the Maven build. This is useful for accepting plans produced by other DataFusion-aware tooling.