DataFrame API

A DataFrame represents a logical set of rows with the same named columns, similar to a Pandas DataFrame or Spark DataFrame.

DataFrames are typically created by calling a method on SessionContext, such as read_csv, and can then be modified by calling the transformation methods, such as filter, select, aggregate, and limit to build up a query definition.

The query can be executed by calling the collect method.

The DataFrame struct is part of DataFusion's prelude and can be imported with the following statement.

use datafusion::prelude::*;

Here is a minimal example showing the execution of a query using the DataFrame API.

let ctx = SessionContext::new();
let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
let df = df.filter(col("a").lt_eq(col("b")))?
           .aggregate(vec![col("a")], vec![min(col("b"))])?
           .limit(0, Some(100))?;
// Print results
df.show();

The DataFrame API is well documented in the API reference on docs.rs.

Refer to the Expressions Reference for available functions for building logical expressions for use with the DataFrame API.

DataFrame Transformations

These methods create a new DataFrame after applying a transformation to the logical plan that the DataFrame represents.

DataFusion DataFrames use lazy evaluation, meaning that each transformation is just creating a new query plan and not actually performing any transformations. This approach allows for the overall plan to be optimized before execution. The plan is evaluated (executed) when an action method is invoked, such as collect.

Function	Notes
aggregate	Perform an aggregate query with optional grouping expressions.
distinct	Filter out duplicate rows.
except	Calculate the exception of two DataFrames. The two DataFrames must have exactly the same schema
filter	Filter a DataFrame to only include rows that match the specified filter expression.
intersect	Calculate the intersection of two DataFrames. The two DataFrames must have exactly the same schema
join	Join this DataFrame with another DataFrame using the specified columns as join keys.
join_on	Join this DataFrame with another DataFrame using arbitrary expressions.
limit	Limit the number of rows returned from this DataFrame.
repartition	Repartition a DataFrame based on a logical partitioning scheme.
sort	Sort the DataFrame by the specified sorting expressions. Any expression can be turned into a sort expression by calling its `sort` method.
select	Create a projection based on arbitrary expressions. Example: `df..select(vec![col("c1"), abs(col("c2"))])?`
select_columns	Create a projection based on column names. Example: `df.select_columns(&["id", "name"])?`.
union	Calculate the union of two DataFrames, preserving duplicate rows. The two DataFrames must have exactly the same schema.
union_distinct	Calculate the distinct union of two DataFrames. The two DataFrames must have exactly the same schema.
with_column	Add an additional column to the DataFrame.
with_column_renamed	Rename one column by applying a new projection.

DataFrame Actions

These methods execute the logical plan represented by the DataFrame and either collects the results into memory, prints them to stdout, or writes them to disk.

Function	Notes
collect	Executes this DataFrame and collects all results into a vector of RecordBatch.
collect_partitioned	Executes this DataFrame and collects all results into a vector of vector of RecordBatch maintaining the input partitioning.
count	Executes this DataFrame to get the total number of rows.
execute_stream	Executes this DataFrame and returns a stream over a single partition.
execute_stream_partitioned	Executes this DataFrame and returns one stream per partition.
show	Execute this DataFrame and print the results to stdout.
show_limit	Execute this DataFrame and print a subset of results to stdout.
write_csv	Execute this DataFrame and write the results to disk in CSV format.
write_json	Execute this DataFrame and write the results to disk in JSON format.
write_parquet	Execute this DataFrame and write the results to disk in Parquet format.

Other DataFrame Methods

Function	Notes
explain	Return a DataFrame with the explanation of its plan so far.
registry	Return a `FunctionRegistry` used to plan udf's calls.
schema	Returns the schema describing the output of this DataFrame in terms of columns returned, where each column has a name, data type, and nullability attribute.
to_logical_plan	Return the optimized logical plan represented by this DataFrame.
to_unoptimized_plan	Return the unoptimized logical plan represented by this DataFrame.