DataFrame API

A DataFrame represents a logical set of rows with the same named columns, similar to a Pandas DataFrame or Spark DataFrame.

DataFrames are typically created by calling a method on SessionContext, such as read_csv, and can then be modified by calling the transformation methods, such as filter, select, aggregate, and limit to build up a query definition.

The query can be executed by calling the collect method.

The DataFrame struct is part of DataFusion's prelude and can be imported with the following statement.

use datafusion::prelude::*;

Here is a minimal example showing the execution of a query using the DataFrame API.

let ctx = SessionContext::new();
let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
let df = df.filter(col("a").lt_eq(col("b")))?
           .aggregate(vec![col("a")], vec![min(col("b"))])?
           .limit(0, Some(100))?;
// Execute the query and print the results
df.show().await?;

The DataFrame API is well documented in the API reference on docs.rs.

Refer to the Expressions Reference for the functions available for building logical expressions to use with the DataFrame API.

DataFrame Transformations

These methods create a new DataFrame after applying a transformation to the logical plan that the DataFrame represents.

DataFusion DataFrames use lazy evaluation: each transformation only creates a new query plan and does not process any data. This allows the overall plan to be optimized before execution. The plan is executed when an action method, such as collect, is invoked.

| Function | Notes |
| --- | --- |
| aggregate | Perform an aggregate query with optional grouping expressions. |
| distinct | Filter out duplicate rows. |
| except | Calculate the set difference of two DataFrames (rows in this DataFrame that do not appear in the other). The two DataFrames must have exactly the same schema. |
| filter | Filter a DataFrame to only include rows that match the specified filter expression. |
| intersect | Calculate the intersection of two DataFrames. The two DataFrames must have exactly the same schema. |
| join | Join this DataFrame with another DataFrame using the specified columns as join keys. |
| join_on | Join this DataFrame with another DataFrame using arbitrary expressions. |
| limit | Limit the number of rows returned from this DataFrame. |
| repartition | Repartition a DataFrame based on a logical partitioning scheme. |
| sort | Sort the DataFrame by the specified sorting expressions. Any expression can be turned into a sort expression by calling its sort method. |
| select | Create a projection based on arbitrary expressions. Example: df.select(vec![col("c1"), abs(col("c2"))])?. |
| select_columns | Create a projection based on column names. Example: df.select_columns(&["id", "name"])?. |
| union | Calculate the union of two DataFrames, preserving duplicate rows. The two DataFrames must have exactly the same schema. |
| union_distinct | Calculate the distinct union of two DataFrames. The two DataFrames must have exactly the same schema. |
| with_column | Add an additional column to the DataFrame. |
| with_column_renamed | Rename one column by applying a new projection. |

DataFrame Actions

These methods execute the logical plan represented by the DataFrame and either collect the results into memory, print them to stdout, or write them to disk.

| Function | Notes |
| --- | --- |
| collect | Executes this DataFrame and collects all results into a vector of RecordBatch. |
| collect_partitioned | Executes this DataFrame and collects all results into a vector of vector of RecordBatch maintaining the input partitioning. |
| count | Executes this DataFrame to get the total number of rows. |
| execute_stream | Executes this DataFrame and returns a stream over a single partition. |
| execute_stream_partitioned | Executes this DataFrame and returns one stream per partition. |
| show | Execute this DataFrame and print the results to stdout. |
| show_limit | Execute this DataFrame and print a subset of results to stdout. |
| write_csv | Execute this DataFrame and write the results to disk in CSV format. |
| write_json | Execute this DataFrame and write the results to disk in JSON format. |
| write_parquet | Execute this DataFrame and write the results to disk in Parquet format. |

Other DataFrame Methods

| Function | Notes |
| --- | --- |
| explain | Return a DataFrame with the explanation of its plan so far. |
| registry | Return a FunctionRegistry used to plan UDF calls. |
| schema | Return the schema describing the output of this DataFrame in terms of columns returned, where each column has a name, data type, and nullability attribute. |
| to_logical_plan | Return the optimized logical plan represented by this DataFrame. |
| to_unoptimized_plan | Return the unoptimized logical plan represented by this DataFrame. |