Expressions
DataFrame methods such as select and filter accept one or more logical expressions and there are many functions available for creating logical expressions. These are documented below.
Expressions can be chained together using a fluent-style API:
// create the expression `(a > 6) AND (b < 7)`
col("a").gt(lit(6)).and(col("b").lt(lit(7)))
Identifiers
| Function | Notes |
|---|
| col | Reference a column in a dataframe col("a") |
Literal Values
| Function | Notes |
|---|
| lit | Literal value such as lit(123) or lit("hello") |
Boolean Expressions
| Function | Notes |
|---|
| and | and(expr1, expr2) or expr1.and(expr2) |
| or | or(expr1, expr2) or expr1.or(expr2) |
| not | not(expr) or expr.not() |
Comparison Expressions
| Function | Notes |
|---|
| eq | expr1.eq(expr2) |
| gt | expr1.gt(expr2) |
| gt_eq | expr1.gt_eq(expr2) |
| lt | expr1.lt(expr2) |
| lt_eq | expr1.lt_eq(expr2) |
| not_eq | expr1.not_eq(expr2) |
Math Functions
In addition to the math functions listed here, some Rust operators are implemented for expressions, allowing expressions such as col("a") + col("b") to be used.
| Function | Notes |
|---|
| abs(x) | absolute value |
| acos(x) | inverse cosine |
| asin(x) | inverse sine |
| atan(x) | inverse tangent |
| atan2(y, x) | inverse tangent of y / x |
| ceil(x) | nearest integer greater than or equal to argument |
| cos(x) | cosine |
| exp(x) | exponential |
| floor(x) | nearest integer less than or equal to argument |
| ln(x) | natural logarithm |
| log(base, x) | logarithm of x for a particular base |
| log10(x) | base 10 logarithm |
| log2(x) | base 2 logarithm |
| power(base, exponent) | base raised to the power of exponent |
| round(x) | round to nearest integer |
| signum(x) | sign of the argument (-1, 0, +1) |
| sin(x) | sine |
| sqrt(x) | square root |
| tan(x) | tangent |
| trunc(x) | truncate toward zero |
Math functions usage notes:
Unlike to some databases the math functions in Datafusion works the same way as Rust math functions, avoiding failing on corner cases e.g
❯ select log(-1), log(0), sqrt(-1);
+----------------+---------------+-----------------+
| log(Int64(-1)) | log(Int64(0)) | sqrt(Int64(-1)) |
+----------------+---------------+-----------------+
| NaN | -inf | NaN |
+----------------+---------------+-----------------+
Bitwise Operators
| Operator | Notes |
|---|
| & | Bitwise AND => (expr1 & expr2) |
| | | Bitwise OR => (expr1 | expr2) |
| # | Bitwise XOR => (expr1 # expr2) |
| << | Bitwise left shift => (expr1 << expr2) |
| >> | Bitwise right shift => (expr1 << expr2) |
Conditional Expressions
| Function | Notes |
|---|
| coalesce | Returns the first of its arguments that is not null. Null is returned only if all arguments are null. It is often used to substitute a default value for null values when data is retrieved for display. |
| case | CASE expression. Example: case(expr).when(expr, expr).when(expr, expr).otherwise(expr).end(). |
| nullif | Returns a null value if value1 equals value2; otherwise it returns value1. This can be used to perform the inverse operation of the coalesce expression. |
String Expressions
| Function | Notes |
|---|
| ascii | |
| bit_length | |
| btrim | |
| char_length | |
| character_length | |
| concat | |
| concat_ws | |
| chr | |
| initcap | |
| left | |
| length | |
| lower | |
| lpad | |
| ltrim | |
| md5 | |
| octet_length | |
| repeat | |
| replace | |
| reverse | |
| right | |
| rpad | |
| rtrim | |
| digest | |
| split_part | |
| starts_with | |
| strpos | |
| substr | |
| translate | |
| trim | |
| upper | |
Regular Expressions
| Function | Notes |
|---|
| regexp_match | |
| regexp_replace | |
Temporal Expressions
| Function | Notes |
|---|
| date_part | |
| date_trunc | |
| from_unixtime | |
| to_timestamp | |
| to_timestamp_millis | |
| to_timestamp_micros | |
| to_timestamp_seconds | |
| now() | current time |
Other Expressions
| Function | Notes |
|---|
| array | |
| in_list | |
| random | |
| sha224 | |
| sha256 | |
| sha384 | |
| sha512 | |
| struct | |
| to_hex | |
Aggregate Functions
| Function | Notes |
|---|
| avg | |
| approx_distinct | |
| approx_median | |
| approx_percentile_cont | |
| approx_percentile_cont_with_weight | |
| count | |
| count_distinct | |
| cube | |
| grouping_set | |
| max | |
| median | |
| min | |
| rollup | |
| sum | |
Subquery Expressions
| Function | Notes |
|---|
| exists | |
| in_subquery | df1.filter(in_subquery(col("foo"), df2))? is the equivalent of the SQL WHERE foo IN <df2> |
| not_exists | |
| not_in_subquery | |
| scalar_subquery | |
User-Defined Function Expressions
| Function | Notes |
|---|
| create_udf | |
| create_udaf | |