blob: ccb47a4e758ca0d65542e9fa23ccbbc0d4d47317 [file] [log] [blame]
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.
Functions
=========
DataFusion provides a large number of built-in functions for performing complex queries without requiring user-defined functions.
In here we will cover some of the more popular use cases. If you want to view all the functions go to the :py:mod:`Functions <datafusion.functions>` API Reference.
We'll use the pokemon dataset in the following examples.
.. ipython:: python
from datafusion import SessionContext
ctx = SessionContext()
ctx.register_csv("pokemon", "pokemon.csv")
df = ctx.table("pokemon")
Mathematical
------------
DataFusion offers mathematical functions such as :py:func:`~datafusion.functions.pow` or :py:func:`~datafusion.functions.log`
.. ipython:: python
from datafusion import col, literal, string_literal, str_lit
from datafusion import functions as f
df.select(
f.pow(col('"Attack"'), literal(2)) - f.pow(col('"Defense"'), literal(2))
).limit(10)
Conditional
-----------
There 3 conditional functions in DataFusion :py:func:`~datafusion.functions.coalesce`, :py:func:`~datafusion.functions.nullif` and :py:func:`~datafusion.functions.case`.
.. ipython:: python
df.select(
f.coalesce(col('"Type 1"'), col('"Type 2"')).alias("dominant_type")
).limit(10)
Temporal
--------
For selecting the current time use :py:func:`~datafusion.functions.now`
.. ipython:: python
df.select(f.now())
Convert to timestamps using :py:func:`~datafusion.functions.to_timestamp`
.. ipython:: python
df.select(f.to_timestamp(col('"Total"')).alias("timestamp"))
Extracting parts of a date using :py:func:`~datafusion.functions.date_part` (alias :py:func:`~datafusion.functions.extract`)
.. ipython:: python
df.select(
f.date_part(literal("month"), f.to_timestamp(col('"Total"'))).alias("month"),
f.extract(literal("day"), f.to_timestamp(col('"Total"'))).alias("day")
)
String
------
In the field of data science, working with textual data is a common task. To make string manipulation easier,
DataFusion offers a range of helpful options.
.. ipython:: python
df.select(
f.char_length(col('"Name"')).alias("len"),
f.lower(col('"Name"')).alias("lower"),
f.left(col('"Name"'), literal(4)).alias("code")
)
This also includes the functions for regular expressions like :py:func:`~datafusion.functions.regexp_replace` and :py:func:`~datafusion.functions.regexp_match`
.. ipython:: python
df.select(
f.regexp_match(col('"Name"'), literal("Char")).alias("dragons"),
f.regexp_replace(col('"Name"'), literal("saur"), literal("fleur")).alias("flowers")
)
Casting
-------
Casting expressions to different data types using :py:func:`~datafusion.functions.arrow_cast`
.. ipython:: python
df.select(
f.arrow_cast(col('"Total"'), string_literal("Float64")).alias("total_as_float"),
f.arrow_cast(col('"Total"'), str_lit("Int32")).alias("total_as_int")
)
Other
-----
The function :py:func:`~datafusion.functions.in_list` allows to check a column for the presence of multiple values:
.. ipython:: python
types = [literal("Grass"), literal("Fire"), literal("Water")]
(
df.select(f.in_list(col('"Type 1"'), types, negated=False).alias("basic_types"))
.limit(20)
.to_pandas()
)
Handling Missing Values
=======================
DataFusion provides methods to handle missing values in DataFrames:
fill_null
---------
The ``fill_null()`` method replaces NULL values in specified columns with a provided value:
.. code-block:: python
# Fill all NULL values with 0 where possible
df = df.fill_null(0)
# Fill NULL values only in specific string columns
df = df.fill_null("missing", subset=["name", "category"])
The fill value will be cast to match each column's type. If casting fails for a column, that column remains unchanged.