.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.
.. _user_guide_data_sources:
Data Sources
============
DataFusion provides a wide variety of ways to get data into a DataFrame to perform operations.
Local file
----------
DataFusion has the ability to read from a variety of popular file formats, such as :ref:`Parquet <io_parquet>`,
:ref:`CSV <io_csv>`, :ref:`JSON <io_json>`, and :ref:`AVRO <io_avro>`.
.. ipython:: python
from datafusion import SessionContext
ctx = SessionContext()
df = ctx.read_csv("pokemon.csv")
df.show()
Create in-memory
----------------
Sometimes it can be convenient to create a small DataFrame from a Python list or dictionary object.
To do this in DataFusion, you can use one of the three functions
:py:func:`~datafusion.context.SessionContext.from_pydict`,
:py:func:`~datafusion.context.SessionContext.from_pylist`, or
:py:func:`~datafusion.context.SessionContext.create_dataframe`.
As their names suggest, ``from_pydict`` and ``from_pylist`` will create DataFrames from Python
dictionary and list objects, respectively. ``create_dataframe`` expects a list of lists of
`PyArrow Record Batches <https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html>`_.
The following three examples will all create identical DataFrames:
.. ipython:: python
import pyarrow as pa
ctx.from_pylist([
{ "a": 1, "b": 10.0, "c": "alpha" },
{ "a": 2, "b": 20.0, "c": "beta" },
{ "a": 3, "b": 30.0, "c": "gamma" },
]).show()
ctx.from_pydict({
"a": [1, 2, 3],
"b": [10.0, 20.0, 30.0],
"c": ["alpha", "beta", "gamma"],
}).show()
batch = pa.RecordBatch.from_arrays(
[
pa.array([1, 2, 3]),
pa.array([10.0, 20.0, 30.0]),
pa.array(["alpha", "beta", "gamma"]),
],
names=["a", "b", "c"],
)
ctx.create_dataframe([[batch]]).show()
Object Store
------------
DataFusion has support for multiple storage options in addition to local files.
The example below requires an S3 bucket and the appropriate access credentials.
The supported Object Stores are:
- :py:class:`~datafusion.object_store.AmazonS3`
- :py:class:`~datafusion.object_store.GoogleCloud`
- :py:class:`~datafusion.object_store.Http`
- :py:class:`~datafusion.object_store.LocalFileSystem`
- :py:class:`~datafusion.object_store.MicrosoftAzure`
.. code-block:: python
import os
from datafusion.object_store import AmazonS3
region = "us-east-1"
bucket_name = "yellow-trips"
s3 = AmazonS3(
bucket_name=bucket_name,
region=region,
access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)
path = f"s3://{bucket_name}/"
ctx.register_object_store("s3://", s3, None)
ctx.register_parquet("trips", path)
ctx.table("trips").show()
Other DataFrame Libraries
-------------------------
DataFusion can import DataFrames directly from other libraries, such as
`Polars <https://pola.rs/>`_ and `Pandas <https://pandas.pydata.org/>`_.
Since DataFusion version 42.0.0, any DataFrame library that supports the Arrow FFI PyCapsule
interface can be imported into DataFusion using the
:py:func:`~datafusion.context.SessionContext.from_arrow` function. Older versions of Polars may
not support the Arrow interface. In those cases, you can still import them via the
:py:func:`~datafusion.context.SessionContext.from_polars` function.
.. code-block:: python
import pandas as pd
data = { "a": [1, 2, 3], "b": [10.0, 20.0, 30.0], "c": ["alpha", "beta", "gamma"] }
pandas_df = pd.DataFrame(data)
datafusion_df = ctx.from_arrow(pandas_df)
datafusion_df.show()
.. code-block:: python
import polars as pl
polars_df = pl.DataFrame(data)
datafusion_df = ctx.from_arrow(polars_df)
datafusion_df.show()
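On older versions of Polars without PyCapsule support, the same import can be done with
:py:func:`~datafusion.context.SessionContext.from_polars`. A minimal sketch, reusing the ``data``
dictionary from the Pandas example above:

.. code-block:: python

    import polars as pl

    polars_df = pl.DataFrame(data)
    datafusion_df = ctx.from_polars(polars_df)
    datafusion_df.show()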
Delta Lake
----------
DataFusion 43.0.0 and later support registering table providers from sources such
as Delta Lake. This requires a recent version of
`deltalake <https://delta-io.github.io/delta-rs/>`_ to provide the required interfaces.
.. code-block:: python
from deltalake import DeltaTable
delta_table = DeltaTable("path_to_table")
ctx.register_table("my_delta_table", delta_table)
df = ctx.table("my_delta_table")
df.show()
On older versions of ``deltalake`` (prior to 0.22) you can use the
`Arrow DataSet <https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html>`_
interface to import into DataFusion, but this does not support features such as filter pushdown,
which can lead to a significant performance difference.
.. code-block:: python
from deltalake import DeltaTable
delta_table = DeltaTable("path_to_table")
ctx.register_dataset("my_delta_table", delta_table.to_pyarrow_dataset())
df = ctx.table("my_delta_table")
df.show()
Apache Iceberg
--------------
DataFusion 45.0.0 and later support registering Apache Iceberg tables as table providers through the Custom Table Provider interface.
This requires either the `pyiceberg <https://pypi.org/project/pyiceberg/>`__ library (>=0.10.0) or the `pyiceberg-core <https://pypi.org/project/pyiceberg-core/>`__ library (>=0.5.0).
* The ``pyiceberg-core`` library exposes Iceberg Rust's implementation of the Custom Table Provider interface as Python bindings.
* The ``pyiceberg`` library uses the ``pyiceberg-core`` bindings under the hood and provides a native way for Python users to interact with DataFusion.
.. code-block:: python
from datafusion import SessionContext
from pyiceberg.catalog import load_catalog
import pyarrow as pa
# Load catalog and create/load a table
catalog = load_catalog("catalog", type="in-memory")
catalog.create_namespace_if_not_exists("default")
# Create some sample data
data = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})
iceberg_table = catalog.create_table("default.test", schema=data.schema)
iceberg_table.append(data)
# Register the table with DataFusion
ctx = SessionContext()
ctx.register_table_provider("test", iceberg_table)
# Query the table using DataFusion
ctx.table("test").show()
Note that the DataFusion integration relies on features from the `Iceberg Rust <https://github.com/apache/iceberg-rust/>`_ implementation rather than the `PyIceberg <https://github.com/apache/iceberg-python/>`_ implementation.
Features that are available in PyIceberg but not yet in Iceberg Rust will not be available when using DataFusion.
Custom Table Provider
---------------------
You can implement a custom table provider in Rust and expose it to DataFusion through
the interface described in the :ref:`Custom Table Provider <io_custom_table_provider>`
section. This is an advanced topic, but a
`user example <https://github.com/apache/datafusion-python/tree/main/examples/datafusion-ffi-example>`_
is provided in the DataFusion repository.
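Once such a provider has been compiled into a Python extension module, registering it from Python
follows the same pattern as the Apache Iceberg example above. A minimal sketch, where
``my_ffi_module`` and ``MyTableProvider`` are hypothetical placeholders for whatever names your
extension exposes:

.. code-block:: python

    from datafusion import SessionContext

    # Hypothetical module and class names: substitute the ones exported by your
    # own PyO3 extension implementing the DataFusion FFI table provider interface.
    from my_ffi_module import MyTableProvider

    ctx = SessionContext()
    ctx.register_table_provider("my_custom_table", MyTableProvider())
    ctx.table("my_custom_table").show()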
Catalog
=======
A common technique for organizing tables is to use a three-level hierarchy. DataFusion
supports this form of organization using the :py:class:`~datafusion.catalog.Catalog`,
:py:class:`~datafusion.catalog.Schema`, and :py:class:`~datafusion.catalog.Table` classes. By default,
a :py:class:`~datafusion.context.SessionContext` comes with a single Catalog and a single Schema
with the names ``datafusion`` and ``default``, respectively.
The default implementation uses an in-memory approach for the catalog and schema. We also have
support for adding additional in-memory catalogs and schemas. This can be done as in the following
example:
.. code-block:: python
from datafusion.catalog import Catalog, Schema
my_catalog = Catalog.memory_catalog()
my_schema = Schema.memory_schema()
my_catalog.register_schema("my_schema_name", my_schema)
ctx.register_catalog("my_catalog_name", my_catalog)
You can then register tables in ``my_schema`` and access them either through the DataFrame
API or via SQL commands such as ``"SELECT * FROM my_catalog_name.my_schema_name.my_table"``.
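A brief usage sketch (``my_table`` is a hypothetical table assumed to have been registered in
``my_schema`` beforehand):

.. code-block:: python

    # Query through SQL using the fully qualified three-level name.
    ctx.sql("SELECT * FROM my_catalog_name.my_schema_name.my_table").show()

    # The DataFrame API accepts the same qualified name (assumed behavior).
    ctx.table("my_catalog_name.my_schema_name.my_table").show()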
User Defined Catalog and Schema
-------------------------------
If the in-memory catalogs are insufficient for your needs, there are two approaches you can take
to implementing a custom catalog and/or schema. In the discussion below, we describe how to
implement these for a Catalog, but the approach for a Schema is nearly
identical.
DataFusion supports Catalogs written in either Rust or Python. If you write a Catalog in Rust,
you will need to export it as a Python library via PyO3. There is a complete example of a
catalog implemented this way in the
`examples folder <https://github.com/apache/datafusion-python/tree/main/examples/>`_
of our repository. Writing catalog providers in Rust typically leads to significant
performance improvements over the Python-based approach.
To implement a Catalog in Python, you will need to inherit from the abstract base class
:py:class:`~datafusion.catalog.CatalogProvider`. There are examples in the
`unit tests <https://github.com/apache/datafusion-python/tree/main/python/tests>`_ of
implementing a basic Catalog in Python where we simply keep a dictionary of the
registered Schemas.
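A minimal sketch of such a Catalog is shown below. It assumes the methods to implement are
``schema_names`` and ``schema``, mirroring the dictionary-backed examples in the unit tests;
check the :py:class:`~datafusion.catalog.CatalogProvider` reference for the exact required
signatures:

.. code-block:: python

    from datafusion.catalog import CatalogProvider

    class MyCatalogProvider(CatalogProvider):
        """A Catalog that keeps its registered Schemas in a plain dictionary."""

        def __init__(self):
            self.schemas = {}

        def schema_names(self):
            # Names of all Schemas currently registered in this Catalog.
            return set(self.schemas.keys())

        def schema(self, name):
            # Return the Schema for the given name, or None if it is unknown.
            return self.schemas.get(name)

        def register_schema(self, name, schema):
            # Optional: allow new Schemas to be added at runtime.
            self.schemas[name] = schema

        def deregister_schema(self, name, cascade):
            # Optional: remove a Schema from this Catalog.
            self.schemas.pop(name, None)

    # Method name assumed; see the SessionContext API reference for how to
    # register a custom catalog provider with your session.
    ctx.register_catalog_provider("my_python_catalog", MyCatalogProvider())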
One important note for developers is that when a Catalog is defined in Python, there are
two different ways of accessing it. First, we register the catalog with a Rust
wrapper. This allows any Rust-based code to call the Python functions as necessary.
Second, if the user accesses the Catalog via the Python API, we identify this and return
the original Python object that implements the Catalog. This is an important distinction
for developers because we do *not* return a Python wrapper around the Rust wrapper of the
original Python object.