<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Write Options

DataFusion supports customizing how data is written out to disk as a result of a `COPY` or `INSERT INTO` query. There are a few special options, file format specific options (e.g. for CSV or parquet), and parquet column specific options. In some cases, options can be specified in multiple ways, with a set order of precedence.

## Specifying Options and Order of Precedence

Write related options can be specified in the following ways:

- Session level config defaults
- `CREATE EXTERNAL TABLE` options
- `COPY` option tuples

For a list of supported session level config defaults see [Configuration Settings](../configs). These defaults apply to all write operations but have the lowest level of precedence.
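
For example, a session default can be changed with a SQL `SET` statement before any writes run. This is a minimal sketch; the key shown (`datafusion.execution.parquet.compression`) is just one setting from the configuration list, and the chosen codec is arbitrary:

```sql
-- Session-wide default: subsequent parquet writes in this session use
-- zstd level 3 unless a table or COPY option overrides it
SET datafusion.execution.parquet.compression = 'zstd(3)';
```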

If inserting into an external table, table specific write options can be specified when the table is created using the `OPTIONS` clause:

```sql
CREATE EXTERNAL TABLE
my_table(a bigint, b bigint)
STORED AS csv
COMPRESSION TYPE gzip
DELIMITER ';'
LOCATION '/test/location/my_csv_table/'
OPTIONS(
  NULL_VALUE 'NAN',
  'has_header' 'true'
)
```

When running `INSERT INTO my_table ...`, the options from the `CREATE EXTERNAL TABLE` statement will be respected (gzip compression, the custom delimiter, and an included header row). A single output file is written if the output path is not in folder format, i.e. does not end with a `/`. Note that compression, header, and delimiter settings can also be specified within the `OPTIONS` tuple list. Dedicated syntax within the SQL statement always takes precedence over arbitrary option tuples, so if both are specified the `OPTIONS` setting will be ignored. `NULL_VALUE` is a CSV format specific option that determines how null values should be encoded within the CSV file.
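
As a brief sketch, an insert into the table defined above could look like the following; the literal values are arbitrary placeholders:

```sql
-- Rows written by this statement pick up the table's write options:
-- gzip compression, the ';' delimiter, a header row, and 'NAN' for nulls
INSERT INTO my_table VALUES (1, 2), (3, NULL);
```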

Finally, options can be passed when running a `COPY` command.

<!--
Test the following example with:
CREATE TABLE source_table AS VALUES ('1','2','3','4');
-->

```sql
COPY source_table
TO 'test/table_with_options'
PARTITIONED BY (column3, column4)
OPTIONS (
  format parquet,
  compression snappy,
  'compression::column1' 'zstd(5)'
)
```

In this example, we write the entirety of `source_table` out to a folder of parquet files. The first option, `format parquet`, selects the output file format. One parquet file will be written in parallel to the folder for each partition in the query. The `compression` option set to `snappy` indicates that, unless otherwise specified, all columns should use the snappy compression codec. The option `compression::column1` sets an override, so that the column `column1` in the parquet file will use the `ZSTD` compression codec with compression level `5`. In general, parquet options which support column specific settings can be specified with the syntax `OPTION::COLUMN.NESTED.PATH`.
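
As an illustration of the nested path form, the sketch below applies a compression override to a single leaf field. The column `my_struct.inner_field` is hypothetical (the `source_table` in this example has no nested columns) and is only meant to show the syntax:

```sql
COPY source_table
TO 'test/table_with_nested_options'
OPTIONS (
  format parquet,
  -- hypothetical nested column path: compress only this leaf field with zstd
  'compression::my_struct.inner_field' 'zstd(5)'
)
```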

## Available Options

### Execution Specific Options

The following options are available when executing a `COPY` query.

| Option                              | Description                                                                         | Default Value |
| ----------------------------------- | ----------------------------------------------------------------------------------- | ------------- |
| execution.keep_partition_by_columns | Flag to retain the columns in the output data when using `PARTITIONED BY` queries. | false         |

Note: the `execution.keep_partition_by_columns` flag can also be enabled through `ExecutionOptions` within `SessionConfig`.
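
A hedged sketch of setting this flag directly on a `COPY` statement follows; it assumes the option key is spelled exactly as in the table above and quoted like the other option tuples:

```sql
COPY source_table
TO 'test/partitioned_output'
PARTITIONED BY (column3, column4)
OPTIONS (
  format parquet,
  -- assumed key spelling: keep column3 and column4 in the written files
  -- as well as in the partition directory names
  'execution.keep_partition_by_columns' 'true'
)
```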

### JSON Format Specific Options

The following options are available when writing JSON files. Note: If any unsupported option is specified, an error will be raised and the query will fail.

| Option      | Description                                                                                                                        | Default Value |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------- | ------------- |
| COMPRESSION | Sets the compression that should be applied to the entire JSON file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. | UNCOMPRESSED  |
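
As a minimal sketch, a gzip-compressed JSON write could look like the following, assuming `format json` is accepted in `OPTIONS` the same way `format parquet` is in the earlier example; the output path is arbitrary:

```sql
COPY source_table
TO 'test/json_output'
OPTIONS (
  format json,
  -- compress the entire output file with gzip
  compression gzip
)
```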

### CSV Format Specific Options

The following options are available when writing CSV files. Note: if any unsupported option is specified, an error will be raised and the query will fail.

| Option          | Description                                                                                                                        | Default Value    |
| --------------- | ---------------------------------------------------------------------------------------------------------------------------------- | ---------------- |
| COMPRESSION     | Sets the compression that should be applied to the entire CSV file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED.  | UNCOMPRESSED     |
| HEADER          | Sets if the CSV file should include column headers                                                                                 | false            |
| DATE_FORMAT     | Sets the format that dates should be encoded in within the CSV file                                                                | arrow-rs default |
| DATETIME_FORMAT | Sets the format that datetimes should be encoded in within the CSV file                                                            | arrow-rs default |
| TIME_FORMAT     | Sets the format that times should be encoded in within the CSV file                                                                | arrow-rs default |
| RFC3339         | If true, uses RFC3339 format for date and time encodings                                                                           | arrow-rs default |
| NULL_VALUE      | Sets the string which should be used to indicate null values within the CSV file.                                                  | arrow-rs default |
| DELIMITER       | Sets the character which should be used as the column delimiter within the CSV file.                                               | arrow-rs default |
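
The sketch below combines several of these options in one `COPY`; the delimiter, null string, and output path are arbitrary choices, and the lowercase option spelling follows the earlier examples:

```sql
COPY source_table
TO 'test/csv_output'
OPTIONS (
  format csv,
  compression gzip,
  -- include a header row and separate columns with ';'
  header true,
  delimiter ';',
  -- encode nulls as the string 'NAN'
  null_value 'NAN'
)
```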

### Parquet Format Specific Options

The following options are available when writing parquet files. If any unsupported option is specified an error will be raised and the query will fail. If a column specific option is specified for a column which does not exist, the option will be ignored without error. For default values, see: [Configuration Settings](https://datafusion.apache.org/user-guide/configs.html).

| Option                       | Can be Column Specific? | Description                                                                                                                          |
| ---------------------------- | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| COMPRESSION                  | Yes                     | Sets the compression codec and if applicable compression level to use                                                               |
| MAX_ROW_GROUP_SIZE           | No                      | Sets the maximum number of rows that can be encoded in a single row group. Larger row groups require more memory to write and read. |
| DATA_PAGESIZE_LIMIT          | No                      | Sets the best effort maximum page size in bytes                                                                                     |
| WRITE_BATCH_SIZE             | No                      | Maximum number of rows written for each column in a single batch                                                                    |
| WRITER_VERSION               | No                      | Parquet writer version (1.0 or 2.0)                                                                                                 |
| DICTIONARY_PAGE_SIZE_LIMIT   | No                      | Sets best effort maximum dictionary page size in bytes                                                                              |
| CREATED_BY                   | No                      | Sets the "created by" property in the parquet file                                                                                  |
| COLUMN_INDEX_TRUNCATE_LENGTH | No                      | Sets the max length of min/max value fields in the column index.                                                                    |
| DATA_PAGE_ROW_COUNT_LIMIT    | No                      | Sets best effort maximum number of rows in a data page.                                                                             |
| BLOOM_FILTER_ENABLED         | Yes                     | Sets whether a bloom filter should be written into the file.                                                                        |
| ENCODING                     | Yes                     | Sets the encoding that should be used (e.g. PLAIN or RLE)                                                                           |
| DICTIONARY_ENABLED           | Yes                     | Sets if dictionary encoding is enabled. Use this instead of ENCODING to set dictionary encoding.                                    |
| STATISTICS_ENABLED           | Yes                     | Sets if statistics are enabled at PAGE or ROW_GROUP level.                                                                          |
| MAX_STATISTICS_SIZE          | Yes                     | Sets the maximum size in bytes that statistics can take up.                                                                          |
| BLOOM_FILTER_FPP             | Yes                     | Sets the false positive probability (fpp) for the bloom filter. Implicitly sets BLOOM_FILTER_ENABLED to true.                       |
| BLOOM_FILTER_NDV             | Yes                     | Sets the number of distinct values (ndv) for the bloom filter. Implicitly sets BLOOM_FILTER_ENABLED to true.                        |
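
As a sketch, several of these options (including column specific overrides) might be combined in a single `COPY`; the column names reuse `column1` and `column2` from `source_table`, and the numeric values and codec choices are arbitrary:

```sql
COPY source_table
TO 'test/parquet_output'
OPTIONS (
  format parquet,
  -- default codec for every column unless overridden below
  compression 'zstd(5)',
  max_row_group_size 100000,
  -- column specific overrides use the OPTION::COLUMN syntax
  'bloom_filter_enabled::column1' 'true',
  'encoding::column2' 'plain'
)
```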