<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->

# DataFrame and SQL

DataFusion Java supports two query interfaces: SQL strings via
`SessionContext.sql(String)`, and a programmatic DataFrame API.

## SQL

```java
try (DataFrame df = ctx.sql("SELECT a, b FROM t WHERE a > 10")) {
    df.show();
}
```

`sql(String)` plans the query and returns a `DataFrame`. Execution does
not start until you pull results.

## DataFrame transformations

The DataFrame API exposes `select`, `filter`, `limit`, `distinct`,
`dropColumns`, and `withColumnRenamed`.

```java
try (DataFrame df = ctx.readParquet("/path/to/orders.parquet")) {
    try (DataFrame filtered = df.filter("o_orderpriority = '1-URGENT'")) {
        filtered.show();
    }
}
```

Each transformation returns a new `DataFrame` that must be closed.

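When several transformations are chained, every intermediate object needs its own `close()`. The sketch below illustrates how a single try-with-resources header handles that. It uses a hypothetical stand-in class, `TrackedResource`, rather than the real `DataFrame`, so it runs without the DataFusion library; the `derive` method and the resource names are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class CloseOrderSketch {
    // Hypothetical stand-in for a DataFrame: records its name when closed.
    static final class TrackedResource implements AutoCloseable {
        private final String name;
        private final List<String> log;

        TrackedResource(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        // Mimics a transformation: returns a new, separately owned resource.
        TrackedResource derive(String childName) {
            return new TrackedResource(childName, log);
        }

        @Override
        public void close() {
            log.add(name);
        }
    }

    // Returns the names of the resources in the order they were closed.
    static List<String> run() {
        List<String> closed = new ArrayList<>();
        try (TrackedResource df = new TrackedResource("df", closed);
             TrackedResource filtered = df.derive("filtered");
             TrackedResource limited = filtered.derive("limited")) {
            // Work with `limited` here; all three are closed on exit.
        }
        return closed;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [limited, filtered, df]
    }
}
```

Java closes resources declared in one try header in reverse declaration order, so the most derived object is released first and the source last. Declaring the whole chain in one header avoids nested blocks while still closing every intermediate result.
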
## Pulling results

Three patterns are available:

**Stream as Arrow batches.** Use `collect(allocator)` to pull the result
set as Arrow record batches via the [Arrow C Data Interface]:

```java
try (DataFrame df = ctx.sql("SELECT ...");
     ArrowReader reader = df.collect(allocator)) {
    while (reader.loadNextBatch()) {
        var batch = reader.getVectorSchemaRoot();
        // process batch...
    }
}
```

[Arrow C Data Interface]: https://arrow.apache.org/docs/format/CDataInterface.html

**Count rows.** `df.count()` returns the row count without materializing
the rows in the JVM.

**Print for inspection.** `df.show()` and `df.show(int n)` print results
to standard output. Useful for exploration; not appropriate for
production code paths.

## Schema introspection

To get the schema of a registered table without running a query:

```java
org.apache.arrow.vector.types.pojo.Schema schema = ctx.tableSchema("orders");
```

## Plan input

A DataFusion logical plan can be deserialized from `datafusion-proto`
bytes via `SessionContext.fromProto(byte[])`. The `datafusion-proto` Java
classes are generated by the Maven build. This is useful for accepting
plans produced by other DataFusion-aware tooling.