docs/source/user-guide/table-provider.md - datafusion-java - Git at Google

 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->

 # Java table providers

 `SessionContext.registerTable(name, provider)` registers a Java-implemented
 table. SQL queries that reference `name` call back into your `TableProvider`
 to fetch batches. Data flows from Java to native code via the Arrow C Data
 Interface, so there are no extra copies in the hot path. This is the Java
 counterpart to DataFusion's Rust `SessionContext::register_table`.

 ## Implement

 ```java
 import org.apache.arrow.memory.BufferAllocator;
 import org.apache.arrow.vector.ipc.ArrowReader;
 import org.apache.arrow.vector.types.pojo.Schema;
 import org.apache.datafusion.TableProvider;

 public final class MyTable implements TableProvider {
     private final Schema schema;

     public MyTable(Schema schema) {
         this.schema = schema;
     }

     @Override public Schema schema() { return schema; }

     @Override
     public ArrowReader scan(BufferAllocator allocator) {
         // Return a fresh ArrowReader. The reader must allocate its buffers
         // from `allocator` (or a child of it) — the framework needs the
         // allocator hierarchy to share a root.
         return openMyReader(allocator);
     }
 }
 ```

 For the common case of "I have a schema and a function that returns an
 `ArrowReader`," `SimpleTableProvider` packages those two into a ready-made
 `TableProvider` without having to subclass:

 ```java
 TableProvider t = new SimpleTableProvider(mySchema(), allocator -> openMyReader(allocator));
 ctx.registerTable("t", t);
 ```

 ## Register and query

 ```java
 try (SessionContext ctx = new SessionContext();
      BufferAllocator allocator = new RootAllocator()) {
     ctx.registerTable("t", new MyTable(mySchema()));

     try (DataFrame df = ctx.sql("SELECT * FROM t WHERE x > 10");
          ArrowReader r = df.collect(allocator)) {
         while (r.loadNextBatch()) {
             // ...
         }
     }
 }
 ```

 ## Contract

 - `schema()` is called exactly once, on the caller's thread, at registration
   time. Throwing from it aborts registration with the original exception.
 - `scan(allocator)` is called once per SQL query that touches the table, on a
   worker thread. It must return a fresh, independent `ArrowReader` on every
   call — this is what makes self-joins and `UNION ALL` over the same table
   work.
 - The reader returned by `scan` must allocate its buffers from the supplied
   `allocator` (or a child of it). Arrow Java's `Data.exportArrayStream`
   requires the reader's allocator and the export allocator to share a root.
 - The returned reader's schema must equal the schema returned by `schema()`.
   A mismatch fails the query.
 - You do not need to close the returned reader yourself. The framework
   installs a release callback that closes it when the underlying FFI stream
   is dropped.

 ## Errors

 Exceptions thrown from `scan()` or from the returned reader surface in the
 `RuntimeException` raised by `collect()`. The error message includes the Java
 exception class and `getMessage()`, in the same format used for scalar UDF
 errors.

 ## Threading

 `SessionContext` is single-threaded, but `scan(allocator)` may be invoked from
 any DataFusion worker thread. If your implementation maintains mutable state
 across scans, synchronise it.

 ## Limitations (v1)

 - Single-partition scans only. DataFusion sees the table as one partition;
   multi-partition parallelism is a follow-up.
 - No projection or filter pushdown. DataFusion applies projection and
   filters on top of the batches you return; the Java side always sees the
   full schema. The interface is intentionally minimal so it can grow these
   capabilities (as default methods) without breaking existing implementations.
 - No `deregisterTable`. Tables live until the `SessionContext` is closed.
	<!--
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.
	-->

	# Java table providers

	`SessionContext.registerTable(name, provider)` registers a Java-implemented
	table. SQL queries that reference `name` call back into your `TableProvider`
	to fetch batches. Data flows from Java to native code via the Arrow C Data
	Interface, so there are no extra copies in the hot path. This is the Java
	counterpart to DataFusion's Rust `SessionContext::register_table`.

	## Implement

	```java
	import org.apache.arrow.memory.BufferAllocator;
	import org.apache.arrow.vector.ipc.ArrowReader;
	import org.apache.arrow.vector.types.pojo.Schema;
	import org.apache.datafusion.TableProvider;

	public final class MyTable implements TableProvider {
	private final Schema schema;

	public MyTable(Schema schema) {
	this.schema = schema;
	}

	@Override public Schema schema() { return schema; }

	@Override
	public ArrowReader scan(BufferAllocator allocator) {
	// Return a fresh ArrowReader. The reader must allocate its buffers
	// from `allocator` (or a child of it) — the framework needs the
	// allocator hierarchy to share a root.
	return openMyReader(allocator);
	}
	}
	```

	For the common case of "I have a schema and a function that returns an
	`ArrowReader`," `SimpleTableProvider` packages those two into a ready-made
	`TableProvider` without having to subclass:

	```java
	TableProvider t = new SimpleTableProvider(mySchema(), allocator -> openMyReader(allocator));
	ctx.registerTable("t", t);
	```

	## Register and query

	```java
	try (SessionContext ctx = new SessionContext();
	BufferAllocator allocator = new RootAllocator()) {
	ctx.registerTable("t", new MyTable(mySchema()));

	try (DataFrame df = ctx.sql("SELECT * FROM t WHERE x > 10");
	ArrowReader r = df.collect(allocator)) {
	while (r.loadNextBatch()) {
	// ...
	}
	}
	}
	```

	## Contract

	- `schema()` is called exactly once, on the caller's thread, at registration
	time. Throwing from it aborts registration with the original exception.
	- `scan(allocator)` is called once per SQL query that touches the table, on a
	worker thread. It must return a fresh, independent `ArrowReader` on every
	call — this is what makes self-joins and `UNION ALL` over the same table
	work.
	- The reader returned by `scan` must allocate its buffers from the supplied
	`allocator` (or a child of it). Arrow Java's `Data.exportArrayStream`
	requires the reader's allocator and the export allocator to share a root.
	- The returned reader's schema must equal the schema returned by `schema()`.
	A mismatch fails the query.
	- You do not need to close the returned reader yourself. The framework
	installs a release callback that closes it when the underlying FFI stream
	is dropped.

	## Errors

	Exceptions thrown from `scan()` or from the returned reader surface in the
	`RuntimeException` raised by `collect()`. The error message includes the Java
	exception class and `getMessage()`, in the same format used for scalar UDF
	errors.

	## Threading

	`SessionContext` is single-threaded, but `scan(allocator)` may be invoked from
	any DataFusion worker thread. If your implementation maintains mutable state
	across scans, synchronise it.

	## Limitations (v1)

	- Single-partition scans only. DataFusion sees the table as one partition;
	multi-partition parallelism is a follow-up.
	- No projection or filter pushdown. DataFusion applies projection and
	filters on top of the batches you return; the Java side always sees the
	full schema. The interface is intentionally minimal so it can grow these
	capabilities (as default methods) without breaking existing implementations.
	- No `deregisterTable`. Tables live until the `SessionContext` is closed.