Format Options

DataFusion supports customizing how data is read from or written to disk as a result of COPY, INSERT INTO, and CREATE EXTERNAL TABLE statements. There are a few special options, file-format-specific options (e.g., for CSV or Parquet), and Parquet column-specific options. In some cases, options can be specified in multiple ways with a set order of precedence.

Specifying Options and Order of Precedence

Format-related options can be specified in three ways, in decreasing order of precedence:

  • CREATE EXTERNAL TABLE syntax
  • COPY option tuples
  • Session-level config defaults

For a list of supported session-level config defaults, see Configuration Settings. These defaults apply to all operations but have the lowest level of precedence.
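For example, a session-level default can be set with a SET statement and then overridden for a single statement. The following is a minimal sketch, assuming the session config key datafusion.execution.parquet.compression and an illustrative output path:

-- Session-level default: lowest precedence
SET datafusion.execution.parquet.compression = 'gzip(6)';

-- The statement-level option tuple overrides the session default for this COPY only
COPY source_table
  TO 'test/snappy_output'
  OPTIONS (
    format parquet,
    compression snappy
  );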

When creating an external table, table-specific format options can be specified using the OPTIONS clause:

CREATE EXTERNAL TABLE
  my_table(a bigint, b bigint)
  STORED AS csv
  LOCATION '/tmp/my_csv_table/'
  OPTIONS(
    NULL_VALUE 'NAN',
    'has_header' 'true',
    'format.delimiter' ';'
  );

When running INSERT INTO my_table ..., the options from the CREATE EXTERNAL TABLE statement will be respected (here, the NAN null value, the header row, and the ';' delimiter). Note that settings such as compression, header, and delimiter can be specified either with dedicated SQL syntax or within the OPTIONS tuple list. Dedicated syntax within the SQL statement always takes precedence over arbitrary option tuples, so if both are specified, the OPTIONS setting will be ignored.
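For instance, a sketch of dedicated syntax overriding an option tuple (the COMPRESSION TYPE clause shown here is available in some DataFusion versions and is used purely for illustration):

CREATE EXTERNAL TABLE
  my_gzip_table(a bigint, b bigint)
  STORED AS csv
  COMPRESSION TYPE GZIP
  LOCATION '/tmp/my_gzip_table/'
  OPTIONS('format.compression' 'uncompressed'); -- ignored in favor of GZIP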

For example, with my_table defined above, running the following command:

INSERT INTO my_table VALUES(1,2);

results in a new CSV file written with the specified options:

$ cat /tmp/my_csv_table/bmC8zWFvLMtWX68R_0.csv
a;b
1;2

Finally, options can be passed directly when running a COPY command:

COPY source_table
  TO 'test/table_with_options'
  PARTITIONED BY (column3, column4)
  OPTIONS (
    format parquet,
    compression snappy,
    'compression::column1' 'zstd(5)'
  )

In this example, we write the entire source_table out to a folder of Parquet files. One Parquet file will be written in parallel to the folder for each partition in the query. The compression option, set to snappy, indicates that unless otherwise specified all columns should use the snappy compression codec. The option 'compression::column1' sets an override, so that the column column1 in the Parquet file will use the ZSTD compression codec with compression level 5. In general, Parquet options that support column-specific settings can be specified with the syntax OPTION::COLUMN.NESTED.PATH.
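Column-specific overrides can also reach into nested fields. The following is a minimal sketch; nested_table and its struct column info with nested child field name are hypothetical:

COPY nested_table
  TO 'test/nested_output'
  OPTIONS (
    format parquet,
    -- 'info' is a hypothetical struct column with a nested child field 'name'
    'compression::info.name' 'snappy'
  )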

Available Options

JSON Format Options

The following options are available when reading or writing JSON files. Note: If any unsupported option is specified, an error will be raised and the query will fail.

| Option      | Description                                                                                                                         | Default Value |
| ----------- | ----------------------------------------------------------------------------------------------------------------------------------- | ------------- |
| COMPRESSION | Sets the compression that should be applied to the entire JSON file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. | UNCOMPRESSED  |

Example:

CREATE EXTERNAL TABLE t(a int)
STORED AS JSON
LOCATION '/tmp/foo/'
OPTIONS('COMPRESSION' 'gzip');

CSV Format Options

The following options are available when reading or writing CSV files. Note: If any unsupported option is specified, an error will be raised and the query will fail.

| Option               | Description                                                                                                                        | Default Value    |
| -------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | ---------------- |
| COMPRESSION          | Sets the compression that should be applied to the entire CSV file. Supported values are GZIP, BZIP2, XZ, ZSTD, and UNCOMPRESSED. | UNCOMPRESSED     |
| HAS_HEADER           | Sets if the CSV file should include column headers. If not set, uses the session or system default.                               | None             |
| DELIMITER            | Sets the character which should be used as the column delimiter within the CSV file.                                              | , (comma)        |
| QUOTE                | Sets the character which should be used for quoting values within the CSV file.                                                   | " (double quote) |
| TERMINATOR           | Sets the character which should be used as the line terminator within the CSV file.                                               | None             |
| ESCAPE               | Sets the character which should be used for escaping special characters within the CSV file.                                      | None             |
| DOUBLE_QUOTE         | Sets if quotes within quoted fields should be escaped by doubling them (e.g., "aaa""bbb").                                         | None             |
| NEWLINES_IN_VALUES   | Sets if newlines in quoted values are supported. If not set, uses the session or system default.                                  | None             |
| DATE_FORMAT          | Sets the format that dates should be encoded in within the CSV file.                                                              | None             |
| DATETIME_FORMAT      | Sets the format that datetimes should be encoded in within the CSV file.                                                          | None             |
| TIMESTAMP_FORMAT     | Sets the format that timestamps should be encoded in within the CSV file.                                                         | None             |
| TIMESTAMP_TZ_FORMAT  | Sets the format that timestamps with timezone should be encoded in within the CSV file.                                           | None             |
| TIME_FORMAT          | Sets the format that times should be encoded in within the CSV file.                                                              | None             |
| NULL_VALUE           | Sets the string which should be used to indicate null values within the CSV file.                                                 | None             |
| NULL_REGEX           | Sets the regex pattern to match null values when loading CSVs.                                                                    | None             |
| SCHEMA_INFER_MAX_REC | Sets the maximum number of records to scan to infer the schema.                                                                   | None             |
| COMMENT              | Sets the character which should be used to indicate comment lines in the CSV file.                                                | None             |

Example:

CREATE EXTERNAL TABLE t (col1 varchar, col2 int, col3 boolean)
STORED AS CSV
LOCATION '/tmp/foo/'
OPTIONS('DELIMITER' '|', 'HAS_HEADER' 'true', 'NEWLINES_IN_VALUES' 'true');

Parquet Format Options

The following options are available when reading or writing Parquet files. If any unsupported option is specified, an error will be raised and the query will fail. If a column-specific option is specified for a column that does not exist, the option will be ignored without error.

| Option | Can be Column Specific? | Description | OPTIONS Key | Default Value |
| ------ | ----------------------- | ----------- | ----------- | ------------- |
| COMPRESSION | Yes | Sets the internal Parquet compression codec for data pages, optionally including the compression level. Applies globally if set without ::col, or to a specific column if set using 'compression::column_name'. Valid values: uncompressed, snappy, gzip(level), lzo, brotli(level), lz4, zstd(level), lz4_raw. | 'compression' or 'compression::col' | zstd(3) |
| ENCODING | Yes | Sets the encoding scheme for data pages. Valid values: plain, plain_dictionary, rle, bit_packed, delta_binary_packed, delta_length_byte_array, delta_byte_array, rle_dictionary, byte_stream_split. | 'encoding' or 'encoding::col' | None |
| DICTIONARY_ENABLED | Yes | Sets whether dictionary encoding should be enabled globally or for a specific column. | 'dictionary_enabled' or 'dictionary_enabled::col' | true |
| STATISTICS_ENABLED | Yes | Sets the level of statistics to write (none, chunk, page). | 'statistics_enabled' or 'statistics_enabled::col' | page |
| BLOOM_FILTER_ENABLED | Yes | Sets whether a bloom filter should be written for a specific column. | 'bloom_filter_enabled::column_name' | None |
| BLOOM_FILTER_FPP | Yes | Sets the bloom filter false positive probability (global or per column). | 'bloom_filter_fpp' or 'bloom_filter_fpp::col' | None |
| BLOOM_FILTER_NDV | Yes | Sets the bloom filter number of distinct values (global or per column). | 'bloom_filter_ndv' or 'bloom_filter_ndv::col' | None |
| MAX_ROW_GROUP_SIZE | No | Sets the maximum number of rows per row group. Larger groups require more memory but can improve compression and scan efficiency. | 'max_row_group_size' | 1048576 |
| ENABLE_PAGE_INDEX | No | If true, reads the Parquet data page level metadata (the Page Index), if present, to reduce I/O and decoding. | 'enable_page_index' | true |
| PRUNING | No | If true, enables row group pruning based on min/max statistics. | 'pruning' | true |
| SKIP_METADATA | No | If true, skips optional embedded metadata in the file schema. | 'skip_metadata' | true |
| METADATA_SIZE_HINT | No | Sets the size hint (in bytes) for fetching Parquet file metadata. | 'metadata_size_hint' | None |
| PUSHDOWN_FILTERS | No | If true, enables filter pushdown during Parquet decoding. | 'pushdown_filters' | false |
| REORDER_FILTERS | No | If true, enables heuristic reordering of filters during Parquet decoding. | 'reorder_filters' | false |
| SCHEMA_FORCE_VIEW_TYPES | No | If true, reads Utf8/Binary columns as view types. | 'schema_force_view_types' | true |
| BINARY_AS_STRING | No | If true, reads Binary columns as strings. | 'binary_as_string' | false |
| DATA_PAGESIZE_LIMIT | No | Sets the best-effort maximum size of a data page, in bytes. | 'data_pagesize_limit' | 1048576 |
| DATA_PAGE_ROW_COUNT_LIMIT | No | Sets the best-effort maximum number of rows in a data page. | 'data_page_row_count_limit' | 20000 |
| DICTIONARY_PAGE_SIZE_LIMIT | No | Sets the best-effort maximum dictionary page size, in bytes. | 'dictionary_page_size_limit' | 1048576 |
| WRITE_BATCH_SIZE | No | Sets write_batch_size in bytes. | 'write_batch_size' | 1024 |
| WRITER_VERSION | No | Sets the Parquet writer version (1.0 or 2.0). | 'writer_version' | 1.0 |
| SKIP_ARROW_METADATA | No | If true, skips writing Arrow schema information into the Parquet file metadata. | 'skip_arrow_metadata' | false |
| CREATED_BY | No | Sets the "created by" string in the Parquet file metadata. | 'created_by' | datafusion version X.Y.Z |
| COLUMN_INDEX_TRUNCATE_LENGTH | No | Sets the length (in bytes) to truncate min/max values in column indexes. | 'column_index_truncate_length' | 64 |
| STATISTICS_TRUNCATE_LENGTH | No | Sets the statistics truncate length. | 'statistics_truncate_length' | None |
| BLOOM_FILTER_ON_WRITE | No | Sets whether bloom filters should be written for all columns by default (can be overridden per column). | 'bloom_filter_on_write' | false |
| ALLOW_SINGLE_FILE_PARALLELISM | No | Enables parallel serialization of columns within a single file. | 'allow_single_file_parallelism' | true |
| MAXIMUM_PARALLEL_ROW_GROUP_WRITERS | No | Maximum number of parallel row group writers. | 'maximum_parallel_row_group_writers' | 1 |
| MAXIMUM_BUFFERED_RECORD_BATCHES_PER_STREAM | No | Maximum number of buffered record batches per stream. | 'maximum_buffered_record_batches_per_stream' | 2 |
| KEY_VALUE_METADATA | No (key is specific) | Adds custom key-value pairs to the file metadata. Use the format 'metadata::your_key_name' 'your_value'. Multiple entries allowed. | 'metadata::key_name' | None |

Example:

CREATE EXTERNAL TABLE t (id bigint, value double, category varchar)
STORED AS PARQUET
LOCATION '/tmp/parquet_data/'
OPTIONS(
  'COMPRESSION::category' 'snappy',
  'ENCODING::id' 'delta_binary_packed',
  'MAX_ROW_GROUP_SIZE' '1000000',
  'BLOOM_FILTER_ENABLED::id' 'true'
);
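
The KEY_VALUE_METADATA option from the table above uses one tuple per entry. A minimal sketch, with illustrative key names and values:

COPY source_table
  TO 'test/table_with_metadata'
  OPTIONS (
    format parquet,
    -- each 'metadata::<key>' tuple adds one key-value pair to the file metadata
    'metadata::owner' 'analytics_team',
    'metadata::source' 'example_pipeline'
  )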