| datafusion.dataframe |
| ==================== |
| |
| .. py:module:: datafusion.dataframe |
| |
| .. autoapi-nested-parse:: |
| |
| :py:class:`DataFrame` is one of the core concepts in DataFusion. |
| |
| See :ref:`user_guide_concepts` in the online documentation for more information. |
| |
| |
| |
| Classes |
| ------- |
| |
| .. autoapisummary:: |
| |
| datafusion.dataframe.Compression |
| datafusion.dataframe.DataFrame |
| datafusion.dataframe.DataFrameWriteOptions |
| datafusion.dataframe.InsertOp |
| datafusion.dataframe.ParquetColumnOptions |
| datafusion.dataframe.ParquetWriterOptions |
| |
| |
| Module Contents |
| --------------- |
| |
| .. py:class:: Compression(*args, **kwds) |
| |
| Bases: :py:obj:`enum.Enum` |
| |
| |
| Enum representing the available compression types for Parquet files. |
| |
| |
| .. py:method:: from_str(value: str) -> Compression |
| :classmethod: |
| |
| |
| Convert a string to a Compression enum value. |
| |
| :param value: The string representation of the compression type. |
| |
| :returns: The Compression enum value matching the given string. |
| |
| :raises ValueError: If the string does not match any Compression enum value. |
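| |
| Example (an illustrative sketch):: |
| |
| from datafusion.dataframe import Compression |
| |
| Compression.from_str("zstd")  # Compression.ZSTD |
| Compression.from_str("snappy")  # Compression.SNAPPY |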
| |
| |
| |
| .. py:method:: get_default_level() -> Optional[int] |
| |
| Get the default compression level for the compression type. |
| |
| :returns: The default compression level for the compression type. |
| |
| |
| |
| .. py:attribute:: BROTLI |
| :value: 'brotli' |
| |
| |
| |
| .. py:attribute:: GZIP |
| :value: 'gzip' |
| |
| |
| |
| .. py:attribute:: LZ4 |
| :value: 'lz4' |
| |
| |
| |
| .. py:attribute:: LZ4_RAW |
| :value: 'lz4_raw' |
| |
| |
| |
| .. py:attribute:: SNAPPY |
| :value: 'snappy' |
| |
| |
| |
| .. py:attribute:: UNCOMPRESSED |
| :value: 'uncompressed' |
| |
| |
| |
| .. py:attribute:: ZSTD |
| :value: 'zstd' |
| |
| |
| |
| .. py:class:: DataFrame(df: datafusion._internal.DataFrame) |
| |
| Two-dimensional table representation of data. |
| |
| See :ref:`user_guide_concepts` in the online documentation for more information. |
| |
| This constructor is not to be used by the end user. |
| |
| See :py:class:`~datafusion.context.SessionContext` for methods to |
| create a :py:class:`DataFrame`. |
| |
| |
| .. py:method:: __arrow_c_stream__(requested_schema: object | None = None) -> object |
| |
| Export an Arrow PyCapsule Stream. |
| |
| This will execute and collect the DataFrame. We will attempt to respect the |
| requested schema, but only trivial transformations will be applied, such as |
| returning only the fields listed in the requested schema if their data types |
| match those in the DataFrame. |
| |
| :param requested_schema: Attempt to provide the DataFrame using this schema. |
| |
| :returns: Arrow PyCapsule object. |
| |
| |
| |
| .. py:method:: __getitem__(key: str | list[str]) -> DataFrame |
| |
| Return a new :py:class:`DataFrame` with the specified column or columns. |
| |
| :param key: Column name or list of column names to select. |
| |
| :returns: DataFrame with the specified column or columns. |
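| |
| Example (a minimal sketch; assumes ``df`` has columns ``a`` and ``b``):: |
| |
| df["a"]  # DataFrame containing only column "a" |
| df[["a", "b"]]  # DataFrame containing columns "a" and "b" |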
| |
| |
| |
| .. py:method:: __repr__() -> str |
| |
| Return a string representation of the DataFrame. |
| |
| :returns: String representation of the DataFrame. |
| |
| |
| |
| .. py:method:: _repr_html_() -> str |
| |
| |
| .. py:method:: aggregate(group_by: collections.abc.Sequence[datafusion.expr.Expr | str] | datafusion.expr.Expr | str, aggs: collections.abc.Sequence[datafusion.expr.Expr] | datafusion.expr.Expr) -> DataFrame |
| |
| Aggregates the rows of the current DataFrame. |
| |
| :param group_by: Sequence of expressions or column names to group by. |
| :param aggs: Sequence of expressions to aggregate. |
| |
| :returns: DataFrame after aggregation. |
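| |
| Example (a minimal sketch; assumes ``ctx`` is an existing |
| :py:class:`~datafusion.context.SessionContext`):: |
| |
| from datafusion import col, functions as f |
| |
| df = ctx.from_pydict({"a": ["x", "x", "y"], "b": [1, 2, 3]}) |
| df.aggregate([col("a")], [f.sum(col("b")).alias("b_total")]) |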
| |
| |
| |
| .. py:method:: cache() -> DataFrame |
| |
| Cache the DataFrame as a memory table. |
| |
| :returns: Cached DataFrame. |
| |
| |
| |
| .. py:method:: cast(mapping: dict[str, pyarrow.DataType[Any]]) -> DataFrame |
| |
| Cast one or more columns to a different data type. |
| |
| :param mapping: Mapping with the column name as key and the target data type as value. |
| |
| :returns: DataFrame after casting columns. |
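| |
| Example (a minimal sketch; assumes column ``a`` exists in ``df``):: |
| |
| import pyarrow as pa |
| |
| df = df.cast({"a": pa.int64()}) |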
| |
| |
| |
| .. py:method:: collect() -> list[pyarrow.RecordBatch] |
| |
| Execute this :py:class:`DataFrame` and collect results into memory. |
| |
| Prior to calling ``collect``, modifying a DataFrame simply updates a plan |
| (no actual computation is performed). Calling ``collect`` triggers the |
| computation. |
| |
| :returns: List of :py:class:`pyarrow.RecordBatch` collected from the DataFrame. |
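| |
| Example (a minimal sketch; assumes ``ctx`` is an existing |
| :py:class:`~datafusion.context.SessionContext`):: |
| |
| df = ctx.sql("SELECT 1 AS a")  # only builds a logical plan |
| batches = df.collect()  # triggers execution |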
| |
| |
| |
| .. py:method:: collect_partitioned() -> list[list[pyarrow.RecordBatch]] |
| |
| Execute this DataFrame and collect all partitioned results. |
| |
| This operation returns :py:class:`pyarrow.RecordBatch` maintaining the input |
| partitioning. |
| |
| :returns: |
| |
| List of lists of :py:class:`RecordBatch` collected from the |
| DataFrame. |
| |
| |
| |
| .. py:method:: count() -> int |
| |
| Return the total number of rows in this :py:class:`DataFrame`. |
| |
| Note that this method will actually run a plan to calculate the |
| count, which may be slow for large or complicated DataFrames. |
| |
| :returns: Number of rows in the DataFrame. |
| |
| |
| |
| .. py:method:: default_str_repr(batches: list[pyarrow.RecordBatch], schema: pyarrow.Schema, has_more: bool, table_uuid: str | None = None) -> str |
| :staticmethod: |
| |
| |
| Return the default string representation of a DataFrame. |
| |
| This method is used by the default formatter and implemented in Rust for |
| performance reasons. |
| |
| |
| |
| .. py:method:: describe() -> DataFrame |
| |
| Return the statistics for this DataFrame. |
| |
| Currently only numeric data types are summarized; nulls are returned |
| for non-numeric data types. |
| |
| The output format is modeled after pandas. |
| |
| :returns: A summary DataFrame containing statistics. |
| |
| |
| |
| .. py:method:: distinct() -> DataFrame |
| |
| Return a new :py:class:`DataFrame` with all duplicated rows removed. |
| |
| :returns: DataFrame after removing duplicates. |
| |
| |
| |
| .. py:method:: drop(*columns: str) -> DataFrame |
| |
| Drop an arbitrary number of columns. |
| |
| Column names are case-sensitive and, unlike other operations such as |
| ``select``, do not require double quotes. Leading and trailing double quotes |
| are allowed and will be automatically stripped if present. |
| |
| :param columns: Column names to drop from the dataframe. Both ``column_name`` |
| and ``"column_name"`` are accepted. |
| |
| :returns: DataFrame with those columns removed in the projection. |
| |
| Example Usage:: |
| |
| df.drop('ID_For_Students') # Works |
| df.drop('"ID_For_Students"') # Also works (quotes stripped) |
| |
| |
| |
| .. py:method:: except_all(other: DataFrame) -> DataFrame |
| |
| Calculate the exception of two :py:class:`DataFrame`. |
| |
| The two :py:class:`DataFrame` must have exactly the same schema. |
| |
| :param other: DataFrame to calculate exception with. |
| |
| :returns: DataFrame after exception. |
| |
| |
| |
| .. py:method:: execute_stream() -> datafusion.record_batch.RecordBatchStream |
| |
| Executes this DataFrame and returns a stream over a single partition. |
| |
| :returns: Record Batch Stream over a single partition. |
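| |
| Example (a minimal sketch; assumes the returned stream is consumed by |
| iteration):: |
| |
| for batch in df.execute_stream(): |
|     ...  # process each record batch as it is produced |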
| |
| |
| |
| .. py:method:: execute_stream_partitioned() -> list[datafusion.record_batch.RecordBatchStream] |
| |
| Executes this DataFrame and returns a stream for each partition. |
| |
| :returns: One record batch stream per partition. |
| |
| |
| |
| .. py:method:: execution_plan() -> datafusion.plan.ExecutionPlan |
| |
| Return the execution/physical plan. |
| |
| :returns: Execution plan. |
| |
| |
| |
| .. py:method:: explain(verbose: bool = False, analyze: bool = False) -> None |
| |
| Print an explanation of the DataFrame's plan so far. |
| |
| If ``analyze`` is specified, runs the plan and reports metrics. |
| |
| :param verbose: If ``True``, more details will be included. |
| :param analyze: If ``True``, the plan will run and metrics reported. |
| |
| |
| |
| .. py:method:: fill_null(value: Any, subset: list[str] | None = None) -> DataFrame |
| |
| Fill null values in specified columns with a value. |
| |
| :param value: Value to replace nulls with. Will be cast to match column type. |
| :param subset: Optional list of column names to fill. If None, fills all columns. |
| |
| :returns: DataFrame with null values replaced where type casting is possible |
| |
| .. rubric:: Examples |
| |
| >>> df = df.fill_null(0) # Fill all nulls with 0 where possible |
| >>> # Fill nulls in specific string columns |
| >>> df = df.fill_null("missing", subset=["name", "category"]) |
| |
| .. rubric:: Notes |
| |
| - Only fills nulls in columns where the value can be cast to the column type |
| - For columns where casting fails, the original column is kept unchanged |
| - For columns not in subset, the original column is kept unchanged |
| |
| |
| |
| .. py:method:: filter(*predicates: datafusion.expr.Expr) -> DataFrame |
| |
| Return a DataFrame for which ``predicate`` evaluates to ``True``. |
| |
| Rows for which ``predicate`` evaluates to ``False`` or ``None`` are filtered |
| out. If more than one predicate is provided, these predicates will be |
| combined as a logical AND. Each ``predicate`` must be an |
| :class:`~datafusion.expr.Expr` created using helper functions such as |
| :func:`datafusion.col` or :func:`datafusion.lit`. |
| If more complex logic is required, see the logical operations in |
| :py:mod:`~datafusion.functions`. |
| |
| Example:: |
| |
| from datafusion import col, lit |
| df.filter(col("a") > lit(1)) |
| |
| :param predicates: Predicate expression(s) to filter the DataFrame. |
| |
| :returns: DataFrame after filtering. |
| |
| |
| |
| .. py:method:: head(n: int = 5) -> DataFrame |
| |
| Return a new :py:class:`DataFrame` with a limited number of rows. |
| |
| :param n: Number of rows to take from the head of the DataFrame. |
| |
| :returns: DataFrame after limiting. |
| |
| |
| |
| .. py:method:: intersect(other: DataFrame) -> DataFrame |
| |
| Calculate the intersection of two :py:class:`DataFrame`. |
| |
| The two :py:class:`DataFrame` must have exactly the same schema. |
| |
| :param other: DataFrame to intersect with. |
| |
| :returns: DataFrame after intersection. |
| |
| |
| |
| .. py:method:: into_view() -> datafusion.catalog.Table |
| |
| Convert ``DataFrame`` into a :class:`~datafusion.Table`. |
| |
| .. rubric:: Examples |
| |
| >>> from datafusion import SessionContext |
| >>> ctx = SessionContext() |
| >>> df = ctx.sql("SELECT 1 AS value") |
| >>> view = df.into_view() |
| >>> ctx.register_table("values_view", view) |
| >>> df.collect() # The DataFrame is still usable |
| >>> ctx.sql("SELECT value FROM values_view").collect() |
| |
| |
| |
| .. py:method:: join(right: DataFrame, on: str | collections.abc.Sequence[str], how: Literal['inner', 'left', 'right', 'full', 'semi', 'anti'] = 'inner', *, left_on: None = None, right_on: None = None, join_keys: None = None) -> DataFrame |
| join(right: DataFrame, on: None = None, how: Literal['inner', 'left', 'right', 'full', 'semi', 'anti'] = 'inner', *, left_on: str | collections.abc.Sequence[str], right_on: str | collections.abc.Sequence[str], join_keys: tuple[list[str], list[str]] | None = None) -> DataFrame |
| join(right: DataFrame, on: None = None, how: Literal['inner', 'left', 'right', 'full', 'semi', 'anti'] = 'inner', *, join_keys: tuple[list[str], list[str]], left_on: None = None, right_on: None = None) -> DataFrame |
| |
| Join this :py:class:`DataFrame` with another :py:class:`DataFrame`. |
| |
| Either ``on`` must be provided, or both ``left_on`` and ``right_on`` together. |
| |
| :param right: Other DataFrame to join with. |
| :param on: Column names to join on in both dataframes. |
| :param how: Type of join to perform. Supported types are "inner", "left", |
| "right", "full", "semi", "anti". |
| :param left_on: Join column of the left dataframe. |
| :param right_on: Join column of the right dataframe. |
| :param join_keys: Tuple of two lists of column names to join on. [Deprecated] |
| |
| :returns: DataFrame after join. |
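| |
| Example (a minimal sketch; assumes ``df`` and ``other`` share an ``id`` |
| column, and ``other`` also has a ``customer_id`` column):: |
| |
| df.join(other, on="id", how="left") |
| df.join(other, left_on="id", right_on="customer_id") |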
| |
| |
| |
| .. py:method:: join_on(right: DataFrame, *on_exprs: datafusion.expr.Expr, how: Literal['inner', 'left', 'right', 'full', 'semi', 'anti'] = 'inner') -> DataFrame |
| |
| Join two :py:class:`DataFrame` using the specified expressions. |
| |
| Join predicates must be :class:`~datafusion.expr.Expr` objects, typically |
| built with :func:`datafusion.col`. On expressions are used to support |
| inequality predicates; equality predicates are correctly optimized. |
| |
| Example:: |
| |
| from datafusion import col |
| df.join_on(other_df, col("id") == col("other_id")) |
| |
| :param right: Other DataFrame to join with. |
| :param on_exprs: single or multiple (in)-equality predicates. |
| :param how: Type of join to perform. Supported types are "inner", "left", |
| "right", "full", "semi", "anti". |
| |
| :returns: DataFrame after join. |
| |
| |
| |
| .. py:method:: limit(count: int, offset: int = 0) -> DataFrame |
| |
| Return a new :py:class:`DataFrame` with a limited number of rows. |
| |
| :param count: Number of rows to limit the DataFrame to. |
| :param offset: Number of rows to skip. |
| |
| :returns: DataFrame after limiting. |
| |
| |
| |
| .. py:method:: logical_plan() -> datafusion.plan.LogicalPlan |
| |
| Return the unoptimized ``LogicalPlan``. |
| |
| :returns: Unoptimized logical plan. |
| |
| |
| |
| .. py:method:: optimized_logical_plan() -> datafusion.plan.LogicalPlan |
| |
| Return the optimized ``LogicalPlan``. |
| |
| :returns: Optimized logical plan. |
| |
| |
| |
| .. py:method:: parse_sql_expr(expr: str) -> datafusion.expr.Expr |
| |
| Creates a logical expression from SQL query text. |
| |
| The expression is created and processed against the current schema. |
| |
| Example:: |
| |
| from datafusion import col, lit |
| df.parse_sql_expr("a > 1") |
| |
| should produce: |
| |
| col("a") > lit(1) |
| |
| :param expr: Expression string to be converted to datafusion expression |
| |
| :returns: Logical expression. |
| |
| |
| |
| .. py:method:: repartition(num: int) -> DataFrame |
| |
| Repartition a DataFrame into ``num`` partitions. |
| |
| Batches are allocated to partitions using a round-robin algorithm. |
| |
| :param num: Number of partitions to repartition the DataFrame into. |
| |
| :returns: Repartitioned DataFrame. |
| |
| |
| |
| .. py:method:: repartition_by_hash(*exprs: datafusion.expr.Expr, num: int) -> DataFrame |
| |
| Repartition a DataFrame using a hash partitioning scheme. |
| |
| :param exprs: Expressions to evaluate and perform hashing on. |
| :param num: Number of partitions to repartition the DataFrame into. |
| |
| :returns: Repartitioned DataFrame. |
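| |
| Example (a minimal sketch; assumes column ``a`` exists in ``df``):: |
| |
| from datafusion import col |
| |
| df = df.repartition_by_hash(col("a"), num=4) |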
| |
| |
| |
| .. py:method:: schema() -> pyarrow.Schema |
| |
| Return the :py:class:`pyarrow.Schema` of this DataFrame. |
| |
| The output schema contains information on the name, data type, and |
| nullability for each column. |
| |
| :returns: Schema describing the DataFrame. |
| |
| |
| |
| .. py:method:: select(*exprs: datafusion.expr.Expr | str) -> DataFrame |
| |
| Project arbitrary expressions into a new :py:class:`DataFrame`. |
| |
| :param exprs: Either column names or :py:class:`~datafusion.expr.Expr` to select. |
| |
| :returns: DataFrame after projection. It has one column for each expression. |
| |
| Example usage: |
| |
| The following example will return 3 columns from the original dataframe. |
| The first two columns will be the original columns ``a`` and ``b`` since the |
| string "a" is assumed to refer to column selection. Also a duplicate of |
| column ``a`` will be returned with the column name ``alternate_a``:: |
| |
| df = df.select("a", col("b"), col("a").alias("alternate_a")) |
| |
| |
| |
| |
| .. py:method:: select_columns(*args: str) -> DataFrame |
| |
| Filter the DataFrame by columns. |
| |
| :returns: DataFrame only containing the specified columns. |
| |
| |
| |
| .. py:method:: show(num: int = 20) -> None |
| |
| Execute the DataFrame and print the result to the console. |
| |
| :param num: Number of rows to show. |
| |
| |
| |
| .. py:method:: sort(*exprs: datafusion.expr.SortKey) -> DataFrame |
| |
| Sort the DataFrame by the specified sorting expressions or column names. |
| |
| Note that any expression can be turned into a sort expression by |
| calling its ``sort`` method. |
| |
| :param exprs: Sort expressions or column names, applied in order. |
| |
| :returns: DataFrame after sorting. |
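| |
| Example (a minimal sketch; assumes columns ``a`` and ``b`` exist in ``df``):: |
| |
| from datafusion import col |
| |
| df.sort(col("a").sort(ascending=False), "b") |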
| |
| |
| |
| .. py:method:: tail(n: int = 5) -> DataFrame |
| |
| Return a new :py:class:`DataFrame` with a limited number of rows. |
| |
| Be aware this can be expensive, since the total number of rows in the |
| DataFrame must be determined first, which is done by collecting it. |
| |
| :param n: Number of rows to take from the tail of the DataFrame. |
| |
| :returns: DataFrame after limiting. |
| |
| |
| |
| .. py:method:: to_arrow_table() -> pyarrow.Table |
| |
| Execute the :py:class:`DataFrame` and convert it into an Arrow Table. |
| |
| :returns: Arrow Table. |
| |
| |
| |
| .. py:method:: to_pandas() -> pandas.DataFrame |
| |
| Execute the :py:class:`DataFrame` and convert it into a Pandas DataFrame. |
| |
| :returns: Pandas DataFrame. |
| |
| |
| |
| .. py:method:: to_polars() -> polars.DataFrame |
| |
| Execute the :py:class:`DataFrame` and convert it into a Polars DataFrame. |
| |
| :returns: Polars DataFrame. |
| |
| |
| |
| .. py:method:: to_pydict() -> dict[str, list[Any]] |
| |
| Execute the :py:class:`DataFrame` and convert it into a dictionary of lists. |
| |
| :returns: Dictionary of lists. |
| |
| |
| |
| .. py:method:: to_pylist() -> list[dict[str, Any]] |
| |
| Execute the :py:class:`DataFrame` and convert it into a list of dictionaries. |
| |
| :returns: List of dictionaries. |
| |
| |
| |
| .. py:method:: transform(func: Callable[Ellipsis, DataFrame], *args: Any) -> DataFrame |
| |
| Apply a function to the current DataFrame which returns another DataFrame. |
| |
| This is useful for chaining together multiple functions. For example:: |
| |
| def add_3(df: DataFrame) -> DataFrame: |
|     return df.with_column("modified", lit(3)) |
| |
| def within_limit(df: DataFrame, limit: int) -> DataFrame: |
|     return df.filter(col("a") < lit(limit)).distinct() |
| |
| df = df.transform(add_3).transform(within_limit, 4) |
| |
| :param func: A callable function that takes a DataFrame as its first argument. |
| :param args: Zero or more arguments to pass to ``func``. |
| |
| :returns: The result of applying ``func`` to the original DataFrame. |
| :rtype: DataFrame |
| |
| |
| |
| .. py:method:: union(other: DataFrame, distinct: bool = False) -> DataFrame |
| |
| Calculate the union of two :py:class:`DataFrame`. |
| |
| The two :py:class:`DataFrame` must have exactly the same schema. |
| |
| :param other: DataFrame to union with. |
| :param distinct: If ``True``, duplicate rows will be removed. |
| |
| :returns: DataFrame after union. |
| |
| |
| |
| .. py:method:: union_distinct(other: DataFrame) -> DataFrame |
| |
| Calculate the distinct union of two :py:class:`DataFrame`. |
| |
| The two :py:class:`DataFrame` must have exactly the same schema. |
| Any duplicate rows are discarded. |
| |
| :param other: DataFrame to union with. |
| |
| :returns: DataFrame after union. |
| |
| |
| |
| .. py:method:: unnest_columns(*columns: str, preserve_nulls: bool = True) -> DataFrame |
| |
| Expand columns of arrays into a single row per array element. |
| |
| :param columns: Column names to perform unnest operation on. |
| :param preserve_nulls: If ``False``, rows with null entries will not be |
| returned. |
| |
| :returns: A DataFrame with the columns expanded. |
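| |
| Example (a minimal sketch; assumes ``ctx`` is an existing |
| :py:class:`~datafusion.context.SessionContext`):: |
| |
| df = ctx.from_pydict({"a": [[1, 2], [3]], "b": ["x", "y"]}) |
| df.unnest_columns("a")  # one row per element of each array in "a" |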
| |
| |
| |
| .. py:method:: with_column(name: str, expr: datafusion.expr.Expr) -> DataFrame |
| |
| Add an additional column to the DataFrame. |
| |
| The ``expr`` must be an :class:`~datafusion.expr.Expr` constructed with |
| :func:`datafusion.col` or :func:`datafusion.lit`. |
| |
| Example:: |
| |
| from datafusion import col, lit |
| df.with_column("b", col("a") + lit(1)) |
| |
| :param name: Name of the column to add. |
| :param expr: Expression to compute the column. |
| |
| :returns: DataFrame with the new column. |
| |
| |
| |
| .. py:method:: with_column_renamed(old_name: str, new_name: str) -> DataFrame |
| |
| Rename one column by applying a new projection. |
| |
| This is a no-op if the column to be renamed does not exist. |
| |
| The method supports case-sensitive renames by wrapping the column name |
| in one of the following quote characters: ``"``, ``'``, or \`. |
| |
| :param old_name: Old column name. |
| :param new_name: New column name. |
| |
| :returns: DataFrame with the column renamed. |
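| |
| Example (a minimal sketch; assumes the columns shown exist in ``df``):: |
| |
| df.with_column_renamed("a", "alpha") |
| df.with_column_renamed('"MixedCase"', "mixed_case")  # case-sensitive rename |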
| |
| |
| |
| .. py:method:: with_columns(*exprs: datafusion.expr.Expr | Iterable[datafusion.expr.Expr], **named_exprs: datafusion.expr.Expr) -> DataFrame |
| |
| Add columns to the DataFrame. |
| |
| Columns may be passed as expressions, iterables of expressions, or named |
| expressions. |
| All expressions must be :class:`~datafusion.expr.Expr` objects created via |
| :func:`datafusion.col` or :func:`datafusion.lit`. |
| To pass named expressions use the form ``name=Expr``. |
| |
| Example usage: The following will add 4 columns labeled ``a``, ``b``, ``c``, |
| and ``d``:: |
| |
| from datafusion import col, lit |
| df = df.with_columns( |
| col("x").alias("a"), |
| [lit(1).alias("b"), col("y").alias("c")], |
| d=lit(3) |
| ) |
| |
| :param exprs: Either a single expression or an iterable of expressions to add. |
| :param named_exprs: Named expressions in the form of ``name=expr`` |
| |
| :returns: DataFrame with the new columns added. |
| |
| |
| |
| .. py:method:: write_csv(path: str | pathlib.Path, with_header: bool = False, write_options: DataFrameWriteOptions | None = None) -> None |
| |
| Execute the :py:class:`DataFrame` and write the results to a CSV file. |
| |
| :param path: Path of the CSV file to write. |
| :param with_header: If ``True``, output the CSV header row. |
| :param write_options: Options that impact how the DataFrame is written. |
| |
| |
| |
| .. py:method:: write_json(path: str | pathlib.Path, write_options: DataFrameWriteOptions | None = None) -> None |
| |
| Execute the :py:class:`DataFrame` and write the results to a JSON file. |
| |
| :param path: Path of the JSON file to write. |
| :param write_options: Options that impact how the DataFrame is written. |
| |
| |
| |
| .. py:method:: write_parquet(path: str | pathlib.Path, compression: str, compression_level: int | None = None, write_options: DataFrameWriteOptions | None = None) -> None |
| write_parquet(path: str | pathlib.Path, compression: Compression = Compression.ZSTD, compression_level: int | None = None, write_options: DataFrameWriteOptions | None = None) -> None |
| write_parquet(path: str | pathlib.Path, compression: ParquetWriterOptions, compression_level: None = None, write_options: DataFrameWriteOptions | None = None) -> None |
| |
| Execute the :py:class:`DataFrame` and write the results to a Parquet file. |
| |
| Available compression types are: |
| |
| - "uncompressed": No compression. |
| - "snappy": Snappy compression. |
| - "gzip": Gzip compression. |
| - "brotli": Brotli compression. |
| - "lz4": LZ4 compression. |
| - "lz4_raw": LZ4_RAW compression. |
| - "zstd": Zstandard compression. |
| |
| LZO compression is not yet implemented in arrow-rs and is therefore |
| excluded. |
| |
| :param path: Path of the Parquet file to write. |
| :param compression: Compression type to use. Default is "ZSTD". |
| :param compression_level: Compression level to use. For ZSTD, the |
| recommended range is 1 to 22, with the default being 4. Higher levels |
| provide better compression but slower speed. |
| :param write_options: Options that impact how the DataFrame is written. |
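| |
| Example (a minimal sketch; ``"data.parquet"`` is a hypothetical output path):: |
| |
| df.write_parquet("data.parquet", compression="zstd", compression_level=5) |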
| |
| |
| |
| .. py:method:: write_parquet_with_options(path: str | pathlib.Path, options: ParquetWriterOptions, write_options: DataFrameWriteOptions | None = None) -> None |
| |
| Execute the :py:class:`DataFrame` and write the results to a Parquet file. |
| |
| Allows advanced writer options to be set with ``ParquetWriterOptions``. |
| |
| :param path: Path of the Parquet file to write. |
| :param options: Sets the parquet writer options (see ``ParquetWriterOptions``). |
| :param write_options: Options that impact how the DataFrame is written. |
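| |
| Example (a minimal sketch; ``"data.parquet"`` is a hypothetical output path):: |
| |
| from datafusion.dataframe import ParquetWriterOptions |
| |
| options = ParquetWriterOptions(compression="zstd(5)", max_row_group_size=100_000) |
| df.write_parquet_with_options("data.parquet", options) |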
| |
| |
| |
| .. py:method:: write_table(table_name: str, write_options: DataFrameWriteOptions | None = None) -> None |
| |
| Execute the :py:class:`DataFrame` and write the results to a table. |
| |
| The table must be registered with the session to perform this operation. |
| Not all table providers support writing operations. See the individual |
| implementations for details. |
| |
| |
| |
| .. py:attribute:: df |
| |
| |
| .. py:class:: DataFrameWriteOptions(insert_operation: InsertOp | None = None, single_file_output: bool = False, partition_by: str | collections.abc.Sequence[str] | None = None, sort_by: datafusion.expr.Expr | datafusion.expr.SortExpr | collections.abc.Sequence[datafusion.expr.Expr] | collections.abc.Sequence[datafusion.expr.SortExpr] | None = None) |
| |
| Writer options for DataFrame. |
| |
| There is no guarantee the table provider supports all writer options. |
| See the individual implementation and documentation for details. |
| |
| Instantiate writer options for DataFrame. |
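| |
| Example (a minimal sketch; assumes ``df`` is an existing DataFrame and |
| ``"out"`` and column ``year`` are hypothetical):: |
| |
| from datafusion import col |
| from datafusion.dataframe import DataFrameWriteOptions, InsertOp |
| |
| opts = DataFrameWriteOptions( |
|     insert_operation=InsertOp.APPEND, |
|     partition_by=["year"], |
|     sort_by=col("year").sort(), |
| ) |
| df.write_parquet("out", write_options=opts) |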
| |
| |
| .. py:attribute:: _raw_write_options |
| |
| |
| .. py:class:: InsertOp(*args, **kwds) |
| |
| Bases: :py:obj:`enum.Enum` |
| |
| |
| Insert operation mode. |
| |
| These modes are used by the table writing feature to define how record |
| batches should be written to a table. |
| |
| |
| .. py:attribute:: APPEND |
| |
| Appends new rows to the existing table without modifying any existing rows. |
| |
| |
| .. py:attribute:: OVERWRITE |
| |
| Overwrites all existing rows in the table with the new rows. |
| |
| |
| .. py:attribute:: REPLACE |
| |
| Replace existing rows that collide with the inserted rows. |
| |
| Replacement is typically based on a unique key or primary key. |
| |
| |
| .. py:class:: ParquetColumnOptions(encoding: Optional[str] = None, dictionary_enabled: Optional[bool] = None, compression: Optional[str] = None, statistics_enabled: Optional[str] = None, bloom_filter_enabled: Optional[bool] = None, bloom_filter_fpp: Optional[float] = None, bloom_filter_ndv: Optional[int] = None) |
| |
| Parquet options for individual columns. |
| |
| Contains the available options that can be applied to an individual Parquet column, |
| replacing the global options in ``ParquetWriterOptions``. |
| |
| Initialize the ParquetColumnOptions. |
| |
| :param encoding: Sets encoding for the column path. Valid values are: ``plain``, |
| ``plain_dictionary``, ``rle``, ``bit_packed``, ``delta_binary_packed``, |
| ``delta_length_byte_array``, ``delta_byte_array``, ``rle_dictionary``, |
| and ``byte_stream_split``. These values are not case-sensitive. If |
| ``None``, uses the default parquet options |
| :param dictionary_enabled: Sets if dictionary encoding is enabled for the column |
| path. If `None`, uses the default parquet options |
| :param compression: Sets default parquet compression codec for the column path. |
| Valid values are ``uncompressed``, ``snappy``, ``gzip(level)``, ``lzo``, |
| ``brotli(level)``, ``lz4``, ``zstd(level)``, and ``lz4_raw``. These |
| values are not case-sensitive. If ``None``, uses the default parquet |
| options. |
| :param statistics_enabled: Sets if statistics are enabled for the column. Valid |
| values are ``none``, ``chunk``, and ``page``. These values are not |
| case-sensitive. If ``None``, uses the default parquet options. |
| :param bloom_filter_enabled: Sets if bloom filter is enabled for the column path. |
| If ``None``, uses the default parquet options. |
| :param bloom_filter_fpp: Sets bloom filter false positive probability for the |
| column path. If ``None``, uses the default parquet options. |
| :param bloom_filter_ndv: Sets bloom filter number of distinct values. If ``None``, |
| uses the default parquet options. |
| |
| |
| .. py:attribute:: bloom_filter_enabled |
| :value: None |
| |
| |
| |
| .. py:attribute:: bloom_filter_fpp |
| :value: None |
| |
| |
| |
| .. py:attribute:: bloom_filter_ndv |
| :value: None |
| |
| |
| |
| .. py:attribute:: compression |
| :value: None |
| |
| |
| |
| .. py:attribute:: dictionary_enabled |
| :value: None |
| |
| |
| |
| .. py:attribute:: encoding |
| :value: None |
| |
| |
| |
| .. py:attribute:: statistics_enabled |
| :value: None |
| |
| |
| |
| .. py:class:: ParquetWriterOptions(data_pagesize_limit: int = 1024 * 1024, write_batch_size: int = 1024, writer_version: str = '1.0', skip_arrow_metadata: bool = False, compression: Optional[str] = 'zstd(3)', compression_level: Optional[int] = None, dictionary_enabled: Optional[bool] = True, dictionary_page_size_limit: int = 1024 * 1024, statistics_enabled: Optional[str] = 'page', max_row_group_size: int = 1024 * 1024, created_by: str = 'datafusion-python', column_index_truncate_length: Optional[int] = 64, statistics_truncate_length: Optional[int] = None, data_page_row_count_limit: int = 20000, encoding: Optional[str] = None, bloom_filter_on_write: bool = False, bloom_filter_fpp: Optional[float] = None, bloom_filter_ndv: Optional[int] = None, allow_single_file_parallelism: bool = True, maximum_parallel_row_group_writers: int = 1, maximum_buffered_record_batches_per_stream: int = 2, column_specific_options: Optional[dict[str, ParquetColumnOptions]] = None) |
| |
| Advanced parquet writer options. |
| |
| Allows setting the writer options that apply to the entire file. Some options can |
| also be set on a column-by-column basis, with the field ``column_specific_options`` |
| (see ``ParquetColumnOptions``). |
| |
| Initialize the ParquetWriterOptions. |
| |
| :param data_pagesize_limit: Sets best effort maximum size of data page in bytes. |
| :param write_batch_size: Sets write_batch_size in bytes. |
| :param writer_version: Sets parquet writer version. Valid values are ``1.0`` and |
| ``2.0``. |
| :param skip_arrow_metadata: Skip encoding the embedded arrow metadata in the |
| KV_meta. |
| :param compression: Compression type to use. Default is ``zstd(3)``. |
| Available compression types are: |
| |
| - ``uncompressed``: No compression. |
| - ``snappy``: Snappy compression. |
| - ``gzip(n)``: Gzip compression with level n. |
| - ``brotli(n)``: Brotli compression with level n. |
| - ``lz4``: LZ4 compression. |
| - ``lz4_raw``: LZ4_RAW compression. |
| - ``zstd(n)``: Zstandard compression with level n. |
| :param compression_level: Compression level to set. |
| :param dictionary_enabled: Sets if dictionary encoding is enabled. If ``None``, |
| uses the default parquet writer setting. |
| :param dictionary_page_size_limit: Sets best effort maximum dictionary page size, |
| in bytes. |
| :param statistics_enabled: Sets if statistics are enabled for any column. Valid |
| values are ``none``, ``chunk``, and ``page``. If ``None``, uses the |
| default parquet writer setting. |
| :param max_row_group_size: Target maximum number of rows in each row group |
| (defaults to 1M rows). Writing larger row groups requires more memory |
| to write, but can get better compression and be faster to read. |
| :param created_by: Sets "created by" property. |
| :param column_index_truncate_length: Sets column index truncate length. |
| :param statistics_truncate_length: Sets statistics truncate length. If ``None``, |
| uses the default parquet writer setting. |
| :param data_page_row_count_limit: Sets best effort maximum number of rows in a data |
| page. |
| :param encoding: Sets default encoding for any column. Valid values are ``plain``, |
| ``plain_dictionary``, ``rle``, ``bit_packed``, ``delta_binary_packed``, |
| ``delta_length_byte_array``, ``delta_byte_array``, ``rle_dictionary``, |
| and ``byte_stream_split``. If ``None``, uses the default parquet writer |
| setting. |
| :param bloom_filter_on_write: Write bloom filters for all columns when creating |
| parquet files. |
| :param bloom_filter_fpp: Sets bloom filter false positive probability. If ``None``, |
| uses the default parquet writer setting. |
| :param bloom_filter_ndv: Sets bloom filter number of distinct values. If ``None``, |
| uses the default parquet writer setting. |
| :param allow_single_file_parallelism: Controls whether DataFusion will attempt to |
| speed up writing parquet files by serializing them in parallel. Each |
| column in each row group in each output file is serialized in parallel, |
| leveraging a maximum possible core count of |
| ``n_files * n_row_groups * n_columns``. |
| :param maximum_parallel_row_group_writers: By default parallel parquet writer is |
| tuned for minimum memory usage in a streaming execution plan. You may |
| see a performance benefit when writing large parquet files by increasing |
| ``maximum_parallel_row_group_writers`` and |
| ``maximum_buffered_record_batches_per_stream`` if your system has idle |
| cores and can tolerate additional memory usage. Boosting these values is |
| likely worthwhile when writing out already in-memory data, such as from |
| a cached data frame. |
| :param maximum_buffered_record_batches_per_stream: See |
| ``maximum_parallel_row_group_writers``. |
| :param column_specific_options: Overrides options for specific columns. If a column |
| is not part of this dictionary, the options provided here are used. |
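| |
| Example (a minimal sketch; column ``user_id`` and the output path are |
| hypothetical):: |
| |
| from datafusion.dataframe import ParquetColumnOptions, ParquetWriterOptions |
| |
| options = ParquetWriterOptions( |
|     compression="snappy", |
|     column_specific_options={ |
|         "user_id": ParquetColumnOptions(bloom_filter_enabled=True), |
|     }, |
| ) |
| df.write_parquet_with_options("data.parquet", options) |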
| |
| |
| .. py:attribute:: allow_single_file_parallelism |
| :value: True |
| |
| |
| |
| .. py:attribute:: bloom_filter_fpp |
| :value: None |
| |
| |
| |
| .. py:attribute:: bloom_filter_ndv |
| :value: None |
| |
| |
| |
| .. py:attribute:: bloom_filter_on_write |
| :value: False |
| |
| |
| |
| .. py:attribute:: column_index_truncate_length |
| :value: 64 |
| |
| |
| |
| .. py:attribute:: column_specific_options |
| :value: None |
| |
| |
| |
| .. py:attribute:: created_by |
| :value: 'datafusion-python' |
| |
| |
| |
| .. py:attribute:: data_page_row_count_limit |
| :value: 20000 |
| |
| |
| |
| .. py:attribute:: data_pagesize_limit |
| :value: 1048576 |
| |
| |
| |
| .. py:attribute:: dictionary_enabled |
| :value: True |
| |
| |
| |
| .. py:attribute:: dictionary_page_size_limit |
| :value: 1048576 |
| |
| |
| |
| .. py:attribute:: encoding |
| :value: None |
| |
| |
| |
| .. py:attribute:: max_row_group_size |
| :value: 1048576 |
| |
| |
| |
| .. py:attribute:: maximum_buffered_record_batches_per_stream |
| :value: 2 |
| |
| |
| |
| .. py:attribute:: maximum_parallel_row_group_writers |
| :value: 1 |
| |
| |
| |
| .. py:attribute:: skip_arrow_metadata |
| :value: False |
| |
| |
| |
| .. py:attribute:: statistics_enabled |
| :value: 'page' |
| |
| |
| |
| .. py:attribute:: statistics_truncate_length |
| :value: None |
| |
| |
| |
| .. py:attribute:: write_batch_size |
| :value: 1024 |
| |
| |
| |
| .. py:attribute:: writer_version |
| :value: '1.0' |
| |
| |
| |