| datafusion.context |
| ================== |
| |
| .. py:module:: datafusion.context |
| |
| .. autoapi-nested-parse:: |
| |
Session Context and its associated configuration.
| |
| |
| |
| Classes |
| ------- |
| |
| .. autoapisummary:: |
| |
| datafusion.context.ArrowArrayExportable |
| datafusion.context.ArrowStreamExportable |
| datafusion.context.CatalogProviderExportable |
| datafusion.context.RuntimeConfig |
| datafusion.context.RuntimeEnvBuilder |
| datafusion.context.SQLOptions |
| datafusion.context.SessionConfig |
| datafusion.context.SessionContext |
| datafusion.context.TableProviderExportable |
| |
| |
| Module Contents |
| --------------- |
| |
| .. py:class:: ArrowArrayExportable |
| |
| Bases: :py:obj:`Protocol` |
| |
| |
| Type hint for object exporting Arrow C Array via Arrow PyCapsule Interface. |
| |
| https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html |
| |
| |
| .. py:method:: __arrow_c_array__(requested_schema: object | None = None) -> tuple[object, object] |
| |
| |
| .. py:class:: ArrowStreamExportable |
| |
| Bases: :py:obj:`Protocol` |
| |
| |
| Type hint for object exporting Arrow C Stream via Arrow PyCapsule Interface. |
| |
| https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html |
| |
| |
| .. py:method:: __arrow_c_stream__(requested_schema: object | None = None) -> object |
| |
| |
| .. py:class:: CatalogProviderExportable |
| |
| Bases: :py:obj:`Protocol` |
| |
| |
Type hint for object that has a ``__datafusion_catalog_provider__`` PyCapsule.
| |
| https://docs.rs/datafusion/latest/datafusion/catalog/trait.CatalogProvider.html |
| |
| |
| .. py:method:: __datafusion_catalog_provider__() -> object |
| |
| |
| .. py:class:: RuntimeConfig |
| |
| Bases: :py:obj:`RuntimeEnvBuilder` |
| |
| |
See :py:class:`RuntimeEnvBuilder`.
| |
| Create a new :py:class:`RuntimeEnvBuilder` with default values. |
| |
| |
| .. py:class:: RuntimeEnvBuilder |
| |
| Runtime configuration options. |
| |
| Create a new :py:class:`RuntimeEnvBuilder` with default values. |
| |
| |
| .. py:method:: with_disk_manager_disabled() -> RuntimeEnvBuilder |
| |
Disable the disk manager; attempts to create temporary files will error.
| |
| :returns: A new :py:class:`RuntimeEnvBuilder` object with the updated setting. |
| |
| |
| |
| .. py:method:: with_disk_manager_os() -> RuntimeEnvBuilder |
| |
| Use the operating system's temporary directory for disk manager. |
| |
| :returns: A new :py:class:`RuntimeEnvBuilder` object with the updated setting. |
| |
| |
| |
| .. py:method:: with_disk_manager_specified(*paths: str | pathlib.Path) -> RuntimeEnvBuilder |
| |
| Use the specified paths for the disk manager's temporary files. |
| |
| :param paths: Paths to use for the disk manager's temporary files. |
| |
| :returns: A new :py:class:`RuntimeEnvBuilder` object with the updated setting. |
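
Example usage (a minimal sketch; the directory paths are illustrative)::

    runtime = RuntimeEnvBuilder().with_disk_manager_specified(
        "/tmp/scratch1", "/tmp/scratch2"
    )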
| |
| |
| |
| .. py:method:: with_fair_spill_pool(size: int) -> RuntimeEnvBuilder |
| |
| Use a fair spill pool with the specified size. |
| |
| This pool works best when you know beforehand the query has multiple spillable |
| operators that will likely all need to spill. Sometimes it will cause spills |
| even when there was sufficient memory (reserved for other operators) to avoid |
| doing so:: |
| |
| ┌───────────────────────z──────────────────────z───────────────┐ |
| │ z z │ |
| │ z z │ |
| │ Spillable z Unspillable z Free │ |
| │ Memory z Memory z Memory │ |
| │ z z │ |
| │ z z │ |
| └───────────────────────z──────────────────────z───────────────┘ |
| |
| :param size: Size of the memory pool in bytes. |
| |
| :returns: A new :py:class:`RuntimeEnvBuilder` object with the updated setting. |
| |
Example usage::
| |
| config = RuntimeEnvBuilder().with_fair_spill_pool(1024) |
| |
| |
| |
| .. py:method:: with_greedy_memory_pool(size: int) -> RuntimeEnvBuilder |
| |
| Use a greedy memory pool with the specified size. |
| |
| This pool works well for queries that do not need to spill or have a single |
| spillable operator. See :py:func:`with_fair_spill_pool` if there are |
| multiple spillable operators that all will spill. |
| |
| :param size: Size of the memory pool in bytes. |
| |
| :returns: A new :py:class:`RuntimeEnvBuilder` object with the updated setting. |
| |
| Example usage:: |
| |
| config = RuntimeEnvBuilder().with_greedy_memory_pool(1024) |
| |
| |
| |
| .. py:method:: with_temp_file_path(path: str | pathlib.Path) -> RuntimeEnvBuilder |
| |
| Use the specified path to create any needed temporary files. |
| |
| :param path: Path to use for temporary files. |
| |
| :returns: A new :py:class:`RuntimeEnvBuilder` object with the updated setting. |
| |
| Example usage:: |
| |
| config = RuntimeEnvBuilder().with_temp_file_path("/tmp") |
| |
| |
| |
| .. py:method:: with_unbounded_memory_pool() -> RuntimeEnvBuilder |
| |
| Use an unbounded memory pool. |
| |
| :returns: A new :py:class:`RuntimeEnvBuilder` object with the updated setting. |
| |
| |
| |
| .. py:attribute:: config_internal |
| |
| |
| .. py:class:: SQLOptions |
| |
| Options to be used when performing SQL queries. |
| |
| Create a new :py:class:`SQLOptions` with default values. |
| |
| The default values are: |
| - DDL commands are allowed |
| - DML commands are allowed |
| - Statements are allowed |
| |
| |
| .. py:method:: with_allow_ddl(allow: bool = True) -> SQLOptions |
| |
| Should DDL (Data Definition Language) commands be run? |
| |
| Examples of DDL commands include ``CREATE TABLE`` and ``DROP TABLE``. |
| |
| :param allow: Allow DDL commands to be run. |
| |
| :returns: A new :py:class:`SQLOptions` object with the updated setting. |
| |
| Example usage:: |
| |
| options = SQLOptions().with_allow_ddl(True) |
| |
| |
| |
| .. py:method:: with_allow_dml(allow: bool = True) -> SQLOptions |
| |
| Should DML (Data Manipulation Language) commands be run? |
| |
| Examples of DML commands include ``INSERT INTO`` and ``DELETE``. |
| |
| :param allow: Allow DML commands to be run. |
| |
| :returns: A new :py:class:`SQLOptions` object with the updated setting. |
| |
| Example usage:: |
| |
| options = SQLOptions().with_allow_dml(True) |
| |
| |
| |
| .. py:method:: with_allow_statements(allow: bool = True) -> SQLOptions |
| |
| Should statements such as ``SET VARIABLE`` and ``BEGIN TRANSACTION`` be run? |
| |
| :param allow: Allow statements to be run. |
| |
:returns: A new :py:class:`SQLOptions` object with the updated setting.
| |
| Example usage:: |
| |
| options = SQLOptions().with_allow_statements(True) |
| |
| |
| |
| .. py:attribute:: options_internal |
| |
| |
| .. py:class:: SessionConfig(config_options: dict[str, str] | None = None) |
| |
| Session configuration options. |
| |
| Create a new :py:class:`SessionConfig` with the given configuration options. |
| |
| :param config_options: Configuration options. |
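
Example usage (a sketch; the configuration key and values shown are illustrative)::

    config = (
        SessionConfig({"datafusion.execution.parquet.pushdown_filters": "true"})
        .with_batch_size(8192)
        .with_target_partitions(8)
    )
    ctx = SessionContext(config)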
| |
| |
| .. py:method:: set(key: str, value: str) -> SessionConfig |
| |
| Set a configuration option. |
| |
:param key: Option key.
:param value: Option value.
| |
| :returns: A new :py:class:`SessionConfig` object with the updated setting. |
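
Example usage (a sketch; the configuration key is illustrative)::

    config = SessionConfig().set("datafusion.execution.batch_size", "4096")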
| |
| |
| |
| .. py:method:: with_batch_size(batch_size: int) -> SessionConfig |
| |
| Customize batch size. |
| |
| :param batch_size: Batch size. |
| |
| :returns: A new :py:class:`SessionConfig` object with the updated setting. |
| |
| |
| |
| .. py:method:: with_create_default_catalog_and_schema(enabled: bool = True) -> SessionConfig |
| |
Control whether the default catalog and schema are automatically created.
| |
| :param enabled: Whether the default catalog and schema will be |
| automatically created. |
| |
| :returns: A new :py:class:`SessionConfig` object with the updated setting. |
| |
| |
| |
| .. py:method:: with_default_catalog_and_schema(catalog: str, schema: str) -> SessionConfig |
| |
| Select a name for the default catalog and schema. |
| |
| :param catalog: Catalog name. |
| :param schema: Schema name. |
| |
| :returns: A new :py:class:`SessionConfig` object with the updated setting. |
| |
| |
| |
| .. py:method:: with_information_schema(enabled: bool = True) -> SessionConfig |
| |
| Enable or disable the inclusion of ``information_schema`` virtual tables. |
| |
| :param enabled: Whether to include ``information_schema`` virtual tables. |
| |
| :returns: A new :py:class:`SessionConfig` object with the updated setting. |
| |
| |
| |
| .. py:method:: with_parquet_pruning(enabled: bool = True) -> SessionConfig |
| |
| Enable or disable the use of pruning predicate for parquet readers. |
| |
| Pruning predicates will enable the reader to skip row groups. |
| |
| :param enabled: Whether to use pruning predicate for parquet readers. |
| |
| :returns: A new :py:class:`SessionConfig` object with the updated setting. |
| |
| |
| |
| .. py:method:: with_repartition_aggregations(enabled: bool = True) -> SessionConfig |
| |
| Enable or disable the use of repartitioning for aggregations. |
| |
| Enabling this improves parallelism. |
| |
| :param enabled: Whether to use repartitioning for aggregations. |
| |
| :returns: A new :py:class:`SessionConfig` object with the updated setting. |
| |
| |
| |
| .. py:method:: with_repartition_file_min_size(size: int) -> SessionConfig |
| |
| Set minimum file range size for repartitioning scans. |
| |
| :param size: Minimum file range size. |
| |
| :returns: A new :py:class:`SessionConfig` object with the updated setting. |
| |
| |
| |
| .. py:method:: with_repartition_file_scans(enabled: bool = True) -> SessionConfig |
| |
| Enable or disable the use of repartitioning for file scans. |
| |
| :param enabled: Whether to use repartitioning for file scans. |
| |
| :returns: A new :py:class:`SessionConfig` object with the updated setting. |
| |
| |
| |
| .. py:method:: with_repartition_joins(enabled: bool = True) -> SessionConfig |
| |
| Enable or disable the use of repartitioning for joins to improve parallelism. |
| |
| :param enabled: Whether to use repartitioning for joins. |
| |
| :returns: A new :py:class:`SessionConfig` object with the updated setting. |
| |
| |
| |
| .. py:method:: with_repartition_sorts(enabled: bool = True) -> SessionConfig |
| |
Enable or disable the use of repartitioning for sorts.
| |
| This may improve parallelism. |
| |
:param enabled: Whether to use repartitioning for sorts.
| |
| :returns: A new :py:class:`SessionConfig` object with the updated setting. |
| |
| |
| |
| .. py:method:: with_repartition_windows(enabled: bool = True) -> SessionConfig |
| |
| Enable or disable the use of repartitioning for window functions. |
| |
| This may improve parallelism. |
| |
| :param enabled: Whether to use repartitioning for window functions. |
| |
| :returns: A new :py:class:`SessionConfig` object with the updated setting. |
| |
| |
| |
| .. py:method:: with_target_partitions(target_partitions: int) -> SessionConfig |
| |
| Customize the number of target partitions for query execution. |
| |
| Increasing partitions can increase concurrency. |
| |
| :param target_partitions: Number of target partitions. |
| |
| :returns: A new :py:class:`SessionConfig` object with the updated setting. |
| |
| |
| |
| .. py:attribute:: config_internal |
| |
| |
| .. py:class:: SessionContext(config: SessionConfig | None = None, runtime: RuntimeEnvBuilder | None = None) |
| |
| This is the main interface for executing queries and creating DataFrames. |
| |
| See :ref:`user_guide_concepts` in the online documentation for more information. |
| |
| Main interface for executing queries with DataFusion. |
| |
Maintains the state of the connection between a user and an instance
of the DataFusion engine.
| |
| :param config: Session configuration options. |
| :param runtime: Runtime configuration options. |
| |
| Example usage: |
| |
| The following example demonstrates how to use the context to execute |
| a query against a CSV data source using the :py:class:`DataFrame` API:: |
| |
| from datafusion import SessionContext |
| |
| ctx = SessionContext() |
| df = ctx.read_csv("data.csv") |
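
A further sketch showing explicit configuration (the option values and memory
pool size are illustrative)::

    from datafusion.context import RuntimeEnvBuilder, SessionConfig, SessionContext

    config = SessionConfig().with_target_partitions(8)
    runtime = RuntimeEnvBuilder().with_greedy_memory_pool(1024 * 1024 * 1024)
    ctx = SessionContext(config, runtime)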
| |
| |
| .. py:method:: __repr__() -> str |
| |
| Print a string representation of the Session Context. |
| |
| |
| |
| .. py:method:: _convert_file_sort_order(file_sort_order: collections.abc.Sequence[collections.abc.Sequence[datafusion.expr.SortKey]] | None) -> list[list[datafusion._internal.expr.SortExpr]] | None |
| :staticmethod: |
| |
| |
| Convert nested ``SortKey`` sequences into raw sort expressions. |
| |
| Each ``SortKey`` can be a column name string, an ``Expr``, or a |
| ``SortExpr`` and will be converted using |
| :func:`datafusion.expr.sort_list_to_raw_sort_list`. |
| |
| |
| |
| .. py:method:: _convert_table_partition_cols(table_partition_cols: list[tuple[str, str | pyarrow.DataType]]) -> list[tuple[str, pyarrow.DataType]] |
| :staticmethod: |
| |
| |
| |
| .. py:method:: catalog(name: str = 'datafusion') -> datafusion.catalog.Catalog |
| |
| Retrieve a catalog by name. |
| |
| |
| |
| .. py:method:: catalog_names() -> set[str] |
| |
Return the set of catalog names registered in this context.
| |
| |
| |
| .. py:method:: create_dataframe(partitions: list[list[pyarrow.RecordBatch]], name: str | None = None, schema: pyarrow.Schema | None = None) -> datafusion.dataframe.DataFrame |
| |
| Create and return a dataframe using the provided partitions. |
| |
| :param partitions: :py:class:`pa.RecordBatch` partitions to register. |
| :param name: Resultant dataframe name. |
| :param schema: Schema for the partitions. |
| |
:returns: DataFrame representation of the provided partitions.
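
Example usage (a minimal sketch, assuming ``ctx`` is an existing
:py:class:`SessionContext`)::

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["a"])
    df = ctx.create_dataframe([[batch]], name="t")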
| |
| |
| |
| .. py:method:: create_dataframe_from_logical_plan(plan: datafusion.plan.LogicalPlan) -> datafusion.dataframe.DataFrame |
| |
| Create a :py:class:`~datafusion.dataframe.DataFrame` from an existing plan. |
| |
| :param plan: Logical plan. |
| |
| :returns: DataFrame representation of the logical plan. |
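
A sketch, assuming ``ctx`` is an existing :py:class:`SessionContext` and that
``logical_plan()`` is available on the intermediate DataFrame::

    plan = ctx.sql("SELECT 1 AS one").logical_plan()
    df = ctx.create_dataframe_from_logical_plan(plan)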
| |
| |
| |
| .. py:method:: deregister_table(name: str) -> None |
| |
| Remove a table from the session. |
| |
| |
| |
| .. py:method:: empty_table() -> datafusion.dataframe.DataFrame |
| |
| Create an empty :py:class:`~datafusion.dataframe.DataFrame`. |
| |
| |
| |
| .. py:method:: enable_url_table() -> SessionContext |
| |
Control whether local files can be queried as tables.
| |
| :returns: A new :py:class:`SessionContext` object with url table enabled. |
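
Example usage (a sketch; ``data.csv`` is an illustrative local file)::

    ctx = SessionContext().enable_url_table()
    df = ctx.sql("SELECT * FROM 'data.csv'")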
| |
| |
| |
| .. py:method:: execute(plan: datafusion.plan.ExecutionPlan, partitions: int) -> datafusion.record_batch.RecordBatchStream |
| |
| Execute the ``plan`` and return the results. |
| |
| |
| |
| .. py:method:: from_arrow(data: ArrowStreamExportable | ArrowArrayExportable, name: str | None = None) -> datafusion.dataframe.DataFrame |
| |
| Create a :py:class:`~datafusion.dataframe.DataFrame` from an Arrow source. |
| |
| The Arrow data source can be any object that implements either |
| ``__arrow_c_stream__`` or ``__arrow_c_array__``. For the latter, it must return |
| a struct array. |
| |
The Arrow data source can be, for example, a Polars DataFrame, a pandas
DataFrame, or a PyArrow Table.
| |
| :param data: Arrow data source. |
| :param name: Name of the DataFrame. |
| |
| :returns: DataFrame representation of the Arrow table. |
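
Example usage (a minimal sketch, assuming ``ctx`` is an existing
:py:class:`SessionContext`; any object implementing the Arrow PyCapsule
Interface works the same way)::

    import pyarrow as pa

    table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
    df = ctx.from_arrow(table, name="t")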
| |
| |
| |
| .. py:method:: from_arrow_table(data: pyarrow.Table, name: str | None = None) -> datafusion.dataframe.DataFrame |
| |
| Create a :py:class:`~datafusion.dataframe.DataFrame` from an Arrow table. |
| |
| This is an alias for :py:func:`from_arrow`. |
| |
| |
| |
| .. py:method:: from_pandas(data: pandas.DataFrame, name: str | None = None) -> datafusion.dataframe.DataFrame |
| |
| Create a :py:class:`~datafusion.dataframe.DataFrame` from a Pandas DataFrame. |
| |
| :param data: Pandas DataFrame. |
| :param name: Name of the DataFrame. |
| |
| :returns: DataFrame representation of the Pandas DataFrame. |
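
Example usage (a sketch, assuming ``ctx`` is an existing
:py:class:`SessionContext` and pandas is installed)::

    import pandas as pd

    pandas_df = pd.DataFrame({"a": [1, 2, 3]})
    df = ctx.from_pandas(pandas_df, name="t")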
| |
| |
| |
| .. py:method:: from_polars(data: polars.DataFrame, name: str | None = None) -> datafusion.dataframe.DataFrame |
| |
| Create a :py:class:`~datafusion.dataframe.DataFrame` from a Polars DataFrame. |
| |
| :param data: Polars DataFrame. |
| :param name: Name of the DataFrame. |
| |
| :returns: DataFrame representation of the Polars DataFrame. |
| |
| |
| |
| .. py:method:: from_pydict(data: dict[str, list[Any]], name: str | None = None) -> datafusion.dataframe.DataFrame |
| |
| Create a :py:class:`~datafusion.dataframe.DataFrame` from a dictionary. |
| |
| :param data: Dictionary of lists. |
| :param name: Name of the DataFrame. |
| |
| :returns: DataFrame representation of the dictionary of lists. |
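
Example usage (a minimal sketch, assuming ``ctx`` is an existing
:py:class:`SessionContext`)::

    df = ctx.from_pydict({"a": [1, 2, 3], "b": ["x", "y", "z"]}, name="t")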
| |
| |
| |
| .. py:method:: from_pylist(data: list[dict[str, Any]], name: str | None = None) -> datafusion.dataframe.DataFrame |
| |
| Create a :py:class:`~datafusion.dataframe.DataFrame` from a list. |
| |
| :param data: List of dictionaries. |
| :param name: Name of the DataFrame. |
| |
| :returns: DataFrame representation of the list of dictionaries. |
| |
| |
| |
| .. py:method:: global_ctx() -> SessionContext |
| :classmethod: |
| |
| |
Retrieve the global context as a :py:class:`SessionContext` wrapper.

:returns: A :py:class:`SessionContext` object that wraps the global ``SessionContextInternal``.
| |
| |
| |
| .. py:method:: read_avro(path: str | pathlib.Path, schema: pyarrow.Schema | None = None, file_partition_cols: list[tuple[str, str | pyarrow.DataType]] | None = None, file_extension: str = '.avro') -> datafusion.dataframe.DataFrame |
| |
Create a :py:class:`~datafusion.dataframe.DataFrame` for reading an Avro data source.
| |
| :param path: Path to the Avro file. |
| :param schema: The data source schema. |
| :param file_partition_cols: Partition columns. |
| :param file_extension: File extension to select. |
| |
:returns: DataFrame representation of the read Avro file.
| |
| |
| |
| .. py:method:: read_csv(path: str | pathlib.Path | list[str] | list[pathlib.Path], schema: pyarrow.Schema | None = None, has_header: bool = True, delimiter: str = ',', schema_infer_max_records: int = 1000, file_extension: str = '.csv', table_partition_cols: list[tuple[str, str | pyarrow.DataType]] | None = None, file_compression_type: str | None = None) -> datafusion.dataframe.DataFrame |
| |
| Read a CSV data source. |
| |
| :param path: Path to the CSV file |
| :param schema: An optional schema representing the CSV files. If None, the |
| CSV reader will try to infer it based on data in file. |
:param has_header: Whether the CSV file has a header. If schema inference
| is run on a file with no headers, default column names are |
| created. |
| :param delimiter: An optional column delimiter. |
| :param schema_infer_max_records: Maximum number of rows to read from CSV |
| files for schema inference if needed. |
| :param file_extension: File extension; only files with this extension are |
| selected for data input. |
| :param table_partition_cols: Partition columns. |
| :param file_compression_type: File compression type. |
| |
:returns: DataFrame representation of the read CSV files.
| |
| |
| |
| .. py:method:: read_json(path: str | pathlib.Path, schema: pyarrow.Schema | None = None, schema_infer_max_records: int = 1000, file_extension: str = '.json', table_partition_cols: list[tuple[str, str | pyarrow.DataType]] | None = None, file_compression_type: str | None = None) -> datafusion.dataframe.DataFrame |
| |
| Read a line-delimited JSON data source. |
| |
| :param path: Path to the JSON file. |
| :param schema: The data source schema. |
| :param schema_infer_max_records: Maximum number of rows to read from JSON |
| files for schema inference if needed. |
| :param file_extension: File extension; only files with this extension are |
| selected for data input. |
| :param table_partition_cols: Partition columns. |
| :param file_compression_type: File compression type. |
| |
| :returns: DataFrame representation of the read JSON files. |
| |
| |
| |
| .. py:method:: read_parquet(path: str | pathlib.Path, table_partition_cols: list[tuple[str, str | pyarrow.DataType]] | None = None, parquet_pruning: bool = True, file_extension: str = '.parquet', skip_metadata: bool = True, schema: pyarrow.Schema | None = None, file_sort_order: collections.abc.Sequence[collections.abc.Sequence[datafusion.expr.SortKey]] | None = None) -> datafusion.dataframe.DataFrame |
| |
Read a Parquet source into a :py:class:`~datafusion.dataframe.DataFrame`.
| |
| :param path: Path to the Parquet file. |
| :param table_partition_cols: Partition columns. |
| :param parquet_pruning: Whether the parquet reader should use the predicate |
| to prune row groups. |
| :param file_extension: File extension; only files with this extension are |
| selected for data input. |
| :param skip_metadata: Whether the parquet reader should skip any metadata |
| that may be in the file schema. This can help avoid schema |
| conflicts due to metadata. |
| :param schema: An optional schema representing the parquet files. If None, |
| the parquet reader will try to infer it based on data in the |
| file. |
| :param file_sort_order: Sort order for the file. Each sort key can be |
| specified as a column name (``str``), an expression |
| (``Expr``), or a ``SortExpr``. |
| |
:returns: DataFrame representation of the read Parquet files.
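
Example usage (a sketch, assuming ``ctx`` is an existing
:py:class:`SessionContext`; the file paths and sort column are illustrative)::

    df = ctx.read_parquet("data.parquet")
    sorted_df = ctx.read_parquet("sorted.parquet", file_sort_order=[["a"]])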
| |
| |
| |
| .. py:method:: read_table(table: datafusion.catalog.Table | TableProviderExportable | datafusion.dataframe.DataFrame | pyarrow.dataset.Dataset) -> datafusion.dataframe.DataFrame |
| |
| Creates a :py:class:`~datafusion.dataframe.DataFrame` from a table. |
| |
| |
| |
| .. py:method:: register_avro(name: str, path: str | pathlib.Path, schema: pyarrow.Schema | None = None, file_extension: str = '.avro', table_partition_cols: list[tuple[str, str | pyarrow.DataType]] | None = None) -> None |
| |
| Register an Avro file as a table. |
| |
The registered table can be referenced from SQL statements executed against
| this context. |
| |
| :param name: Name of the table to register. |
| :param path: Path to the Avro file. |
| :param schema: The data source schema. |
| :param file_extension: File extension to select. |
| :param table_partition_cols: Partition columns. |
| |
| |
| |
| .. py:method:: register_catalog_provider(name: str, provider: CatalogProviderExportable | datafusion.catalog.CatalogProvider | datafusion.catalog.Catalog) -> None |
| |
| Register a catalog provider. |
| |
| |
| |
| .. py:method:: register_csv(name: str, path: str | pathlib.Path | list[str | pathlib.Path], schema: pyarrow.Schema | None = None, has_header: bool = True, delimiter: str = ',', schema_infer_max_records: int = 1000, file_extension: str = '.csv', file_compression_type: str | None = None) -> None |
| |
| Register a CSV file as a table. |
| |
The registered table can be referenced from SQL statements executed against
this context.
| |
| :param name: Name of the table to register. |
| :param path: Path to the CSV file. It also accepts a list of Paths. |
| :param schema: An optional schema representing the CSV file. If None, the |
| CSV reader will try to infer it based on data in file. |
:param has_header: Whether the CSV file has a header. If schema inference
| is run on a file with no headers, default column names are |
| created. |
| :param delimiter: An optional column delimiter. |
| :param schema_infer_max_records: Maximum number of rows to read from CSV |
| files for schema inference if needed. |
| :param file_extension: File extension; only files with this extension are |
| selected for data input. |
| :param file_compression_type: File compression type. |
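
Example usage (a sketch; the table name and file path are illustrative)::

    ctx.register_csv("sales", "sales.csv")
    df = ctx.sql("SELECT COUNT(*) FROM sales")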
| |
| |
| |
| .. py:method:: register_dataset(name: str, dataset: pyarrow.dataset.Dataset) -> None |
| |
| Register a :py:class:`pa.dataset.Dataset` as a table. |
| |
| :param name: Name of the table to register. |
| :param dataset: PyArrow dataset. |
| |
| |
| |
| .. py:method:: register_json(name: str, path: str | pathlib.Path, schema: pyarrow.Schema | None = None, schema_infer_max_records: int = 1000, file_extension: str = '.json', table_partition_cols: list[tuple[str, str | pyarrow.DataType]] | None = None, file_compression_type: str | None = None) -> None |
| |
| Register a JSON file as a table. |
| |
The registered table can be referenced from SQL statements executed
| against this context. |
| |
| :param name: Name of the table to register. |
| :param path: Path to the JSON file. |
| :param schema: The data source schema. |
| :param schema_infer_max_records: Maximum number of rows to read from JSON |
| files for schema inference if needed. |
| :param file_extension: File extension; only files with this extension are |
| selected for data input. |
| :param table_partition_cols: Partition columns. |
| :param file_compression_type: File compression type. |
| |
| |
| |
| .. py:method:: register_listing_table(name: str, path: str | pathlib.Path, table_partition_cols: list[tuple[str, str | pyarrow.DataType]] | None = None, file_extension: str = '.parquet', schema: pyarrow.Schema | None = None, file_sort_order: collections.abc.Sequence[collections.abc.Sequence[datafusion.expr.SortKey]] | None = None) -> None |
| |
| Register multiple files as a single table. |
| |
| Registers a :py:class:`~datafusion.catalog.Table` that can assemble multiple |
| files from locations in an :py:class:`~datafusion.object_store.ObjectStore` |
| instance. |
| |
| :param name: Name of the resultant table. |
| :param path: Path to the file to register. |
| :param table_partition_cols: Partition columns. |
| :param file_extension: File extension of the provided table. |
| :param schema: The data source schema. |
| :param file_sort_order: Sort order for the file. Each sort key can be |
| specified as a column name (``str``), an expression |
| (``Expr``), or a ``SortExpr``. |
| |
| |
| |
| .. py:method:: register_object_store(schema: str, store: Any, host: str | None = None) -> None |
| |
| Add a new object store into the session. |
| |
| :param schema: The data source schema. |
| :param store: The :py:class:`~datafusion.object_store.ObjectStore` to register. |
| :param host: URL for the host. |
| |
| |
| |
| .. py:method:: register_parquet(name: str, path: str | pathlib.Path, table_partition_cols: list[tuple[str, str | pyarrow.DataType]] | None = None, parquet_pruning: bool = True, file_extension: str = '.parquet', skip_metadata: bool = True, schema: pyarrow.Schema | None = None, file_sort_order: collections.abc.Sequence[collections.abc.Sequence[datafusion.expr.SortKey]] | None = None) -> None |
| |
| Register a Parquet file as a table. |
| |
The registered table can be referenced from SQL statements executed
| against this context. |
| |
| :param name: Name of the table to register. |
| :param path: Path to the Parquet file. |
| :param table_partition_cols: Partition columns. |
| :param parquet_pruning: Whether the parquet reader should use the |
| predicate to prune row groups. |
| :param file_extension: File extension; only files with this extension are |
| selected for data input. |
| :param skip_metadata: Whether the parquet reader should skip any metadata |
| that may be in the file schema. This can help avoid schema |
| conflicts due to metadata. |
| :param schema: The data source schema. |
| :param file_sort_order: Sort order for the file. Each sort key can be |
| specified as a column name (``str``), an expression |
| (``Expr``), or a ``SortExpr``. |
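
Example usage (a sketch; the table name, file path, and sort column are
illustrative)::

    ctx.register_parquet(
        "events",
        "events.parquet",
        file_sort_order=[["event_time"]],
    )
    df = ctx.sql("SELECT * FROM events ORDER BY event_time")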
| |
| |
| |
| .. py:method:: register_record_batches(name: str, partitions: list[list[pyarrow.RecordBatch]]) -> None |
| |
| Register record batches as a table. |
| |
| This function will convert the provided partitions into a table and |
| register it into the session using the given name. |
| |
| :param name: Name of the resultant table. |
| :param partitions: Record batches to register as a table. |
| |
| |
| |
| .. py:method:: register_table(name: str, table: datafusion.catalog.Table | TableProviderExportable | datafusion.dataframe.DataFrame | pyarrow.dataset.Dataset) -> None |
| |
| Register a :py:class:`~datafusion.Table` with this context. |
| |
| The registered table can be referenced from SQL statements executed against |
| this context. |
| |
| :param name: Name of the resultant table. |
| :param table: Any object that can be converted into a :class:`Table`. |
| |
| |
| |
| .. py:method:: register_table_provider(name: str, provider: datafusion.catalog.Table | TableProviderExportable | datafusion.dataframe.DataFrame | pyarrow.dataset.Dataset) -> None |
| |
| Register a table provider. |
| |
| Deprecated: use :meth:`register_table` instead. |
| |
| |
| |
| .. py:method:: register_udaf(udaf: datafusion.user_defined.AggregateUDF) -> None |
| |
| Register a user-defined aggregation function (UDAF) with the context. |
| |
| |
| |
| .. py:method:: register_udf(udf: datafusion.user_defined.ScalarUDF) -> None |
| |
| Register a user-defined function (UDF) with the context. |
| |
| |
| |
| .. py:method:: register_udtf(func: datafusion.user_defined.TableFunction) -> None |
| |
Register a user-defined table function (UDTF) with the context.
| |
| |
| |
| .. py:method:: register_udwf(udwf: datafusion.user_defined.WindowUDF) -> None |
| |
| Register a user-defined window function (UDWF) with the context. |
| |
| |
| |
| .. py:method:: register_view(name: str, df: datafusion.dataframe.DataFrame) -> None |
| |
| Register a :py:class:`~datafusion.dataframe.DataFrame` as a view. |
| |
| :param name: The name to register the view under. |
| :type name: str |
| :param df: The DataFrame to be converted into a view and registered. |
| :type df: DataFrame |
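
Example usage (a minimal sketch, assuming ``ctx`` is an existing
:py:class:`SessionContext`)::

    df = ctx.sql("SELECT 1 AS a")
    ctx.register_view("my_view", df)
    result = ctx.sql("SELECT a FROM my_view")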
| |
| |
| |
| .. py:method:: session_id() -> str |
| |
| Return an id that uniquely identifies this :py:class:`SessionContext`. |
| |
| |
| |
| .. py:method:: sql(query: str, options: SQLOptions | None = None) -> datafusion.dataframe.DataFrame |
| |
| Create a :py:class:`~datafusion.DataFrame` from SQL query text. |
| |
Note: This API implements DDL statements such as ``CREATE TABLE`` and
``CREATE VIEW`` and DML statements such as ``INSERT INTO`` with an
in-memory default implementation. See
:py:func:`~datafusion.context.SessionContext.sql_with_options`.
| |
| :param query: SQL query text. |
| :param options: If provided, the query will be validated against these options. |
| |
| :returns: DataFrame representation of the SQL query. |
| |
| |
| |
| .. py:method:: sql_with_options(query: str, options: SQLOptions) -> datafusion.dataframe.DataFrame |
| |
| Create a :py:class:`~datafusion.dataframe.DataFrame` from SQL query text. |
| |
| This function will first validate that the query is allowed by the |
| provided options. |
| |
| :param query: SQL query text. |
| :param options: SQL options. |
| |
| :returns: DataFrame representation of the SQL query. |
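
Example usage (a sketch, assuming ``ctx`` is an existing
:py:class:`SessionContext`; the query is illustrative and is expected to be
rejected because DDL is disallowed)::

    options = SQLOptions().with_allow_ddl(False)
    ctx.sql_with_options("CREATE TABLE t (a INT)", options)  # expected to raise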
| |
| |
| |
| .. py:method:: table(name: str) -> datafusion.dataframe.DataFrame |
| |
| Retrieve a previously registered table by name. |
| |
| |
| |
| .. py:method:: table_exist(name: str) -> bool |
| |
| Return whether a table with the given name exists. |
| |
| |
| |
| .. py:attribute:: ctx |
| |
| |
| .. py:class:: TableProviderExportable |
| |
| Bases: :py:obj:`Protocol` |
| |
| |
Type hint for object that has a ``__datafusion_table_provider__`` PyCapsule.
| |
| https://datafusion.apache.org/python/user-guide/io/table_provider.html |
| |
| |
| .. py:method:: __datafusion_table_provider__() -> object |
| |
| |