blob: 05e6c366e742da82f952dc2c229ba28862827db1 [file] [log] [blame]
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="parquet_block_size" id="parquet_file_size">
<title>PARQUET_FILE_SIZE Query Option</title>
<titlealts audience="PDF"><navtitle>PARQUET_FILE_SIZE</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Parquet"/>
<data name="Category" value="ETL"/>
<data name="Category" value="File Formats"/>
<data name="Category" value="Impala Query Options"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="hidden">PARQUET_FILE_SIZE query option</indexterm>
Specifies the maximum size of each Parquet data file produced by Impala <codeph>INSERT</codeph> statements.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
<p>
Specify the size in bytes, or with a trailing <codeph>m</codeph> or <codeph>g</codeph> character to indicate
megabytes or gigabytes. For example:
</p>
<codeblock>-- 128 megabytes.
set PARQUET_FILE_SIZE=134217728
INSERT OVERWRITE parquet_table SELECT * FROM text_table;
-- 512 megabytes.
set PARQUET_FILE_SIZE=512m;
INSERT OVERWRITE parquet_table SELECT * FROM text_table;
-- 1 gigabyte.
set PARQUET_FILE_SIZE=1g;
INSERT OVERWRITE parquet_table SELECT * FROM text_table;
</codeblock>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
With tables that are small or finely partitioned, the default Parquet block size (formerly 1 GB, now 256 MB
in Impala 2.0 and later) could be much larger than needed for each data file. For <codeph>INSERT</codeph>
operations into such tables, you can increase parallelism by specifying a smaller
<codeph>PARQUET_FILE_SIZE</codeph> value, resulting in more HDFS blocks that can be processed by different
nodes.
<!-- Reducing the file size also reduces the memory required to buffer each block before writing it to disk. -->
</p>
<p>
<b>Type:</b> numeric, with optional unit specifier
</p>
<note type="important">
<p>
Currently, the maximum value for this setting is 1 gigabyte (<codeph>1g</codeph>).
Setting a value higher than 1 gigabyte could result in errors during
an <codeph>INSERT</codeph> operation.
</p>
</note>
<p>
<b>Default:</b> 0 (produces files with a target size of 256 MB; files might be larger for very wide tables)
</p>
<p conref="../shared/impala_common.xml#common/adls_block_splitting"/>
<p conref="../shared/impala_common.xml#common/isilon_blurb"/>
<p conref="../shared/impala_common.xml#common/isilon_block_size_caveat"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
For information about the Parquet file format, and how the number and size of data files affects query
performance, see <xref href="impala_parquet.xml#parquet"/>.
</p>
<!-- Examples actually folded into Syntax earlier. <p conref="../shared/impala_common.xml#common/example_blurb"/> -->
</conbody>
</concept>