blob: b4bfb9611d647dbce49232aec1ae5e8676446e0f [file] [log] [blame]
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.
SQL
===
DataFusion also offers a SQL API, read the full reference `here <https://arrow.apache.org/datafusion/user-guide/sql/index.html>`_
.. ipython:: python
import datafusion
from datafusion import DataFrame, SessionContext
# create a context
ctx = datafusion.SessionContext()
# register a CSV
ctx.register_csv("pokemon", "pokemon.csv")
# create a new statement via SQL
df = ctx.sql('SELECT "Attack"+"Defense", "Attack"-"Defense" FROM pokemon')
# collect and convert to pandas DataFrame
df.to_pandas()
Parameterized queries
---------------------
In DataFusion-Python 51.0.0 we introduced the ability to pass parameters
in a SQL query. These are similar in concept to
`prepared statements <https://datafusion.apache.org/user-guide/sql/prepared_statements.html>`_,
but allow passing named parameters into a SQL query. Consider this simple
example.
.. ipython:: python
def show_attacks(ctx: SessionContext, threshold: int) -> None:
ctx.sql(
'SELECT "Name", "Attack" FROM pokemon WHERE "Attack" > $val', val=threshold
).show(num=5)
show_attacks(ctx, 75)
When passing parameters like the example above we convert the Python objects
into their string representation. We also have special case handling
for :py:class:`~datafusion.dataframe.DataFrame` objects, since they cannot simply
be turned into string representations for an SQL query. In these cases we
will register a temporary view in the :py:class:`~datafusion.context.SessionContext`
using a generated table name.
The formatting for passing string replacement objects is to precede the
variable name with a single ``$``. This works for all dialects in
the SQL parser except ``hive`` and ``mysql``. Since these dialects do not
support named placeholders, we are unable to do this type of replacement.
We recommend either switching to another dialect or using Python
f-string style replacement.
.. warning::
To support DataFrame parameterized queries, your session must support
registration of temporary views. The default
:py:class:`~datafusion.catalog.CatalogProvider` and
:py:class:`~datafusion.catalog.SchemaProvider` do have this capability.
If you have implemented custom providers, it is important that temporary
views do not persist across :py:class:`~datafusion.context.SessionContext`
or you may get unintended consequences.
The following example shows passing in both a :py:class:`~datafusion.dataframe.DataFrame`
object as well as a Python object to be used in parameterized replacement.
.. ipython:: python
def show_column(
ctx: SessionContext, column: str, df: DataFrame, threshold: int
) -> None:
ctx.sql(
'SELECT "Name", $col FROM $df WHERE $col > $val',
col=column,
df=df,
val=threshold,
).show(num=5)
df = ctx.table("pokemon")
show_column(ctx, '"Defense"', df, 75)
The approach implemented for conversion of variables into a SQL query
relies on string conversion. This has the potential for data loss,
specifically for cases like floating point numbers. If you need to pass
variables into a parameterized query and it is important to maintain the
original value without conversion to a string, then you can use the
optional parameter ``param_values`` to specify these. This parameter
expects a dictionary mapping from the parameter name to a Python
object. Those objects will be cast into a
`PyArrow Scalar Value <https://arrow.apache.org/docs/python/generated/pyarrow.Scalar.html>`_.
Using ``param_values`` will rely on the SQL dialect you have configured
for your session. This can be set using the :ref:`configuration options <configuration>`
of your :py:class:`~datafusion.context.SessionContext`. Similar to how
`prepared statements <https://datafusion.apache.org/user-guide/sql/prepared_statements.html>`_
work, these parameters are limited to places where you would pass in a
scalar value, such as a comparison.
.. ipython:: python
def param_attacks(ctx: SessionContext, threshold: int) -> None:
ctx.sql(
'SELECT "Name", "Attack" FROM pokemon WHERE "Attack" > $val',
param_values={"val": threshold},
).show(num=5)
param_attacks(ctx, 75)