| <!--- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| # Format Options |
| |
DataFusion supports customizing how data is read from or written to disk as a result of a `COPY`, `INSERT INTO`, or `CREATE EXTERNAL TABLE` statement. There are a few special options, file-format-specific options (e.g., for CSV or Parquet), and Parquet column-specific options. In some cases, options can be specified in multiple ways with a set order of precedence.
| |
| ## Specifying Options and Order of Precedence |
| |
| Format-related options can be specified in three ways, in decreasing order of precedence: |
| |
| - `CREATE EXTERNAL TABLE` syntax |
| - `COPY` option tuples |
| - Session-level config defaults |
| |
| For a list of supported session-level config defaults, see [Configuration Settings](../configs). These defaults apply to all operations but have the lowest level of precedence. |
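
For example, in `datafusion-cli`, a session default can be changed with a `SET` statement, and a statement-level option tuple then overrides it for a single command. A minimal sketch of this precedence, assuming a hypothetical `source_table` and the `datafusion.execution.parquet.compression` key from the configuration settings:

```sql
-- Change the session-wide default Parquet compression
SET datafusion.execution.parquet.compression = 'gzip(6)';

-- The statement-level option tuple overrides the session
-- default for this COPY only
COPY source_table TO '/tmp/out.parquet'
STORED AS PARQUET
OPTIONS (compression snappy);
```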
| |
When creating an external table, table-specific format options can be specified using the `OPTIONS` clause:
| |
| ```sql |
| CREATE EXTERNAL TABLE |
| my_table(a bigint, b bigint) |
| STORED AS csv |
| LOCATION '/tmp/my_csv_table/' |
| OPTIONS( |
| NULL_VALUE 'NAN', |
| 'has_header' 'true', |
| 'format.delimiter' ';' |
| ); |
| ``` |
| |
When running `INSERT INTO my_table ...`, the options from the `CREATE TABLE` statement will be respected (e.g., the custom null value, `;` delimiter, and header row). Note that settings such as compression, header, and delimiter can be specified either with dedicated SQL syntax or within the `OPTIONS` tuple list. Dedicated syntax within the SQL statement always takes precedence over arbitrary option tuples, so if both are specified, the `OPTIONS` setting will be ignored.
| |
| For example, with the table defined above, running the following command: |
| |
| ```sql |
| INSERT INTO my_table VALUES(1,2); |
| ``` |
| |
| Results in a new CSV file with the specified options: |
| |
| ```shell |
| $ cat /tmp/my_csv_table/bmC8zWFvLMtWX68R_0.csv |
| a;b |
| 1;2 |
| ``` |
| |
Finally, options can be passed when running a `COPY` command:
| |
| ```sql |
| COPY source_table |
| TO 'test/table_with_options' |
| PARTITIONED BY (column3, column4) |
| OPTIONS ( |
| format parquet, |
| compression snappy, |
  'compression::column1' 'zstd(5)'
);
| ``` |
| |
In this example, we write the entire `source_table` out to a folder of Parquet files. One Parquet file will be written in parallel to the folder for each partition in the query. The `compression` option set to `snappy` indicates that, unless otherwise specified, all columns should use the snappy compression codec. The `compression::column1` option sets an override, so that the column `column1` in the Parquet file will use the ZSTD compression codec with compression level `5`. In general, Parquet options that support column-specific settings can be specified with the syntax `OPTION::COLUMN.NESTED.PATH`, as in the sketch below.
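
For example, a minimal sketch of the nested-path syntax, assuming `source_table` has a (hypothetical) struct column `address` with a nested field `street`:

```sql
-- Compress only the nested leaf column address.street with
-- gzip level 6; all other columns keep the default codec
COPY source_table
TO 'test/nested_options'
STORED AS PARQUET
OPTIONS ('compression::address.street' 'gzip(6)');
```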
| |
| # Available Options |
| |
| ## JSON Format Options |
| |
| The following options are available when reading or writing JSON files. Note: If any unsupported option is specified, an error will be raised and the query will fail. |
| |
| | Option | Description | Default Value | |
| | ----------- | ---------------------------------------------------------------------------------------------------------------------------------- | ------------- | |
| | COMPRESSION | Sets the compression that should be applied to the entire JSON file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. | UNCOMPRESSED | |
| |
| **Example:** |
| |
| ```sql |
| CREATE EXTERNAL TABLE t(a int) |
| STORED AS JSON |
| LOCATION '/tmp/foo/' |
| OPTIONS('COMPRESSION' 'gzip'); |
| ``` |
| |
| ## CSV Format Options |
| |
| The following options are available when reading or writing CSV files. Note: If any unsupported option is specified, an error will be raised and the query will fail. |
| |
| | Option | Description | Default Value | |
| | -------------------- | --------------------------------------------------------------------------------------------------------------------------------- | ------------------ | |
| | COMPRESSION | Sets the compression that should be applied to the entire CSV file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. | UNCOMPRESSED | |
| HAS_HEADER           | Sets whether the CSV file has (when reading) or should include (when writing) a header row. If not set, uses the session or system default. | None               |
| | DELIMITER | Sets the character which should be used as the column delimiter within the CSV file. | `,` (comma) | |
| | QUOTE | Sets the character which should be used for quoting values within the CSV file. | `"` (double quote) | |
| | TERMINATOR | Sets the character which should be used as the line terminator within the CSV file. | None | |
| | ESCAPE | Sets the character which should be used for escaping special characters within the CSV file. | None | |
| DOUBLE_QUOTE         | Sets whether quotes within quoted fields should be escaped by doubling them (e.g., `"aaa""bbb"`).                                   | None               |
| NEWLINES_IN_VALUES   | Sets whether newlines in quoted values are supported. If not set, uses the session or system default.                               | None               |
| | DATE_FORMAT | Sets the format that dates should be encoded in within the CSV file. | None | |
| | DATETIME_FORMAT | Sets the format that datetimes should be encoded in within the CSV file. | None | |
| | TIMESTAMP_FORMAT | Sets the format that timestamps should be encoded in within the CSV file. | None | |
| | TIMESTAMP_TZ_FORMAT | Sets the format that timestamps with timezone should be encoded in within the CSV file. | None | |
| | TIME_FORMAT | Sets the format that times should be encoded in within the CSV file. | None | |
| | NULL_VALUE | Sets the string which should be used to indicate null values within the CSV file. | None | |
| | NULL_REGEX | Sets the regex pattern to match null values when loading CSVs. | None | |
| | SCHEMA_INFER_MAX_REC | Sets the maximum number of records to scan to infer the schema. | None | |
| | COMMENT | Sets the character which should be used to indicate comment lines in the CSV file. | None | |
| |
| **Example:** |
| |
| ```sql |
| CREATE EXTERNAL TABLE t (col1 varchar, col2 int, col3 boolean) |
| STORED AS CSV |
| LOCATION '/tmp/foo/' |
| OPTIONS('DELIMITER' '|', 'HAS_HEADER' 'true', 'NEWLINES_IN_VALUES' 'true'); |
| ``` |
| |
| ## Parquet Format Options |
| |
| The following options are available when reading or writing Parquet files. If any unsupported option is specified, an error will be raised and the query will fail. If a column-specific option is specified for a column that does not exist, the option will be ignored without error. |
| |
| | Option | Can be Column Specific? | Description | OPTIONS Key | Default Value | |
| | ------------------------------------------ | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------- | ------------------------ | |
| | COMPRESSION | Yes | Sets the internal Parquet **compression codec** for data pages, optionally including the compression level. Applies globally if set without `::col`, or specifically to a column if set using `'compression::column_name'`. Valid values: `uncompressed`, `snappy`, `gzip(level)`, `lzo`, `brotli(level)`, `lz4`, `zstd(level)`, `lz4_raw`. | `'compression'` or `'compression::col'` | zstd(3) | |
| ENCODING                                   | Yes                     | Sets the **encoding** scheme for data pages. Valid values: `plain`, `plain_dictionary`, `rle`, `bit_packed`, `delta_binary_packed`, `delta_length_byte_array`, `delta_byte_array`, `rle_dictionary`, `byte_stream_split`.                                                                                                                     | `'encoding'` or `'encoding::col'`                     | None                     |
| | DICTIONARY_ENABLED | Yes | Sets whether dictionary encoding should be enabled globally or for a specific column. | `'dictionary_enabled'` or `'dictionary_enabled::col'` | true | |
| | STATISTICS_ENABLED | Yes | Sets the level of statistics to write (`none`, `chunk`, `page`). | `'statistics_enabled'` or `'statistics_enabled::col'` | page | |
| | BLOOM_FILTER_ENABLED | Yes | Sets whether a bloom filter should be written for a specific column. | `'bloom_filter_enabled::column_name'` | None | |
| | BLOOM_FILTER_FPP | Yes | Sets bloom filter false positive probability (global or per column). | `'bloom_filter_fpp'` or `'bloom_filter_fpp::col'` | None | |
| | BLOOM_FILTER_NDV | Yes | Sets bloom filter number of distinct values (global or per column). | `'bloom_filter_ndv'` or `'bloom_filter_ndv::col'` | None | |
| | MAX_ROW_GROUP_SIZE | No | Sets the maximum number of rows per row group. Larger groups require more memory but can improve compression and scan efficiency. | `'max_row_group_size'` | 1048576 | |
| | ENABLE_PAGE_INDEX | No | If true, reads the Parquet data page level metadata (the Page Index), if present, to reduce I/O and decoding. | `'enable_page_index'` | true | |
| | PRUNING | No | If true, enables row group pruning based on min/max statistics. | `'pruning'` | true | |
| | SKIP_METADATA | No | If true, skips optional embedded metadata in the file schema. | `'skip_metadata'` | true | |
| | METADATA_SIZE_HINT | No | Sets the size hint (in bytes) for fetching Parquet file metadata. | `'metadata_size_hint'` | None | |
| | PUSHDOWN_FILTERS | No | If true, enables filter pushdown during Parquet decoding. | `'pushdown_filters'` | false | |
| | REORDER_FILTERS | No | If true, enables heuristic reordering of filters during Parquet decoding. | `'reorder_filters'` | false | |
| | SCHEMA_FORCE_VIEW_TYPES | No | If true, reads Utf8/Binary columns as view types. | `'schema_force_view_types'` | true | |
| | BINARY_AS_STRING | No | If true, reads Binary columns as strings. | `'binary_as_string'` | false | |
| DATA_PAGESIZE_LIMIT                        | No                      | Sets the best-effort maximum size of a data page, in bytes.                                                                                                                                                                                                                                                                                  | `'data_pagesize_limit'`                               | 1048576                  |
| DATA_PAGE_ROW_COUNT_LIMIT                  | No                      | Sets the best-effort maximum number of rows per data page.                                                                                                                                                                                                                                                                                   | `'data_page_row_count_limit'`                         | 20000                    |
| DICTIONARY_PAGE_SIZE_LIMIT                 | No                      | Sets the best-effort maximum dictionary page size, in bytes.                                                                                                                                                                                                                                                                                 | `'dictionary_page_size_limit'`                        | 1048576                  |
| | WRITE_BATCH_SIZE | No | Sets write_batch_size in bytes. | `'write_batch_size'` | 1024 | |
| | WRITER_VERSION | No | Sets the Parquet writer version (`1.0` or `2.0`). | `'writer_version'` | 1.0 | |
| | SKIP_ARROW_METADATA | No | If true, skips writing Arrow schema information into the Parquet file metadata. | `'skip_arrow_metadata'` | false | |
| | CREATED_BY | No | Sets the "created by" string in the Parquet file metadata. | `'created_by'` | datafusion version X.Y.Z | |
| | COLUMN_INDEX_TRUNCATE_LENGTH | No | Sets the length (in bytes) to truncate min/max values in column indexes. | `'column_index_truncate_length'` | 64 | |
| STATISTICS_TRUNCATE_LENGTH                 | No                      | Sets the length to truncate min/max statistics values. If unset, uses the Parquet writer default.                                                                                                                                                                                                                                            | `'statistics_truncate_length'`                        | None                     |
| | BLOOM_FILTER_ON_WRITE | No | Sets whether bloom filters should be written for all columns by default (can be overridden per column). | `'bloom_filter_on_write'` | false | |
| | ALLOW_SINGLE_FILE_PARALLELISM | No | Enables parallel serialization of columns in a single file. | `'allow_single_file_parallelism'` | true | |
| | MAXIMUM_PARALLEL_ROW_GROUP_WRITERS | No | Maximum number of parallel row group writers. | `'maximum_parallel_row_group_writers'` | 1 | |
| | MAXIMUM_BUFFERED_RECORD_BATCHES_PER_STREAM | No | Maximum number of buffered record batches per stream. | `'maximum_buffered_record_batches_per_stream'` | 2 | |
| KEY_VALUE_METADATA                         | No (per metadata key)   | Adds custom key-value pairs to the file metadata. Use the format `'metadata::your_key_name' 'your_value'`. Multiple entries are allowed.                                                                                                                                                                                                     | `'metadata::key_name'`                                | None                     |
| |
| **Example:** |
| |
| ```sql |
| CREATE EXTERNAL TABLE t (id bigint, value double, category varchar) |
| STORED AS PARQUET |
| LOCATION '/tmp/parquet_data/' |
OPTIONS(
  'COMPRESSION::category' 'snappy',
  'ENCODING::id' 'delta_binary_packed',
  'MAX_ROW_GROUP_SIZE' '1000000',
  'BLOOM_FILTER_ENABLED::id' 'true'
);
| ``` |