<?xml version="1.0" encoding="UTF-8"?>
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="parquet">
<title>Using the Parquet File Format with Impala Tables</title>
<titlealts audience="PDF">
<navtitle>Parquet Data Files</navtitle>
<data name="Category" value="Impala"/>
<data name="Category" value="File Formats"/>
<data name="Category" value="Parquet"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Schemas"/>
Impala allows you to create, manage, and query Parquet tables. Parquet is a
column-oriented binary file format intended to be highly efficient for the types of
large-scale queries that Impala is best at. Parquet is especially good for queries
scanning particular columns within a table, for example, to query <q>wide</q> tables with
many columns, or to perform aggregation operations such as <codeph>SUM()</codeph> and
<codeph>AVG()</codeph> that need to process most or all of the values from a column. Each
Parquet data file written by Impala contains the values for a set of rows (referred to as
the <q>row group</q>). Within a data file, the values from each column are organized so
that they are all adjacent, enabling good compression for the values from that column.
Queries against a Parquet table can retrieve and analyze these values from any column
quickly and with minimal I/O.
See <xref href="impala_file_formats.xml#file_formats"/> for the summary of Parquet format
<p outputclass="toc inpage"/>
<concept id="parquet_ddl">
<title>Creating Parquet Tables in Impala</title>
To create a table that uses the Parquet format, use a command like the following,
substituting your own table name, column names, and data types:
<codeblock>[impala-host:21000] &gt; create table <varname>parquet_table_name</varname> (x INT, y STRING) STORED AS PARQUET;</codeblock>
Or, to clone the column names and data types of an existing table:
<codeblock>[impala-host:21000] &gt; create table <varname>parquet_table_name</varname> LIKE <varname>other_table_name</varname> STORED AS PARQUET;</codeblock>
<p rev="1.4.0">
In Impala 1.4.0 and higher, you can derive column definitions from a raw Parquet data
file, even without an existing Impala table. For example, you can create an external
table pointing to an HDFS directory, and base the column definitions on one of the files
in that directory:
<codeblock rev="1.4.0">CREATE EXTERNAL TABLE ingest_existing_files LIKE PARQUET '/user/etl/destination/datafile1.dat'
LOCATION '/user/etl/destination';
Or, you can refer to an existing data file and create a new empty table with suitable
column definitions. Then you can use <codeph>INSERT</codeph> to create new data files or
<codeph>LOAD DATA</codeph> to transfer existing data files into the new table.
<codeblock rev="1.4.0">CREATE TABLE columns_from_data_file LIKE PARQUET '/user/etl/destination/datafile1.dat'
The default properties of the newly created table are the same as for any other
<codeph>CREATE TABLE</codeph> statement. For example, the default file format is text;
if you want the new table to use the Parquet file format, also include the <codeph>STORED AS
PARQUET</codeph> clause.
In this example, the new table is partitioned by year, month, and day. These partition
key columns are not part of the data file, so you specify them in the <codeph>CREATE
TABLE</codeph> statement:
<codeblock rev="1.4.0">CREATE TABLE columns_from_data_file LIKE PARQUET '/user/etl/destination/datafile1.dat'
<p rev="1.4.0">
See <xref href="impala_create_table.xml#create_table"/> for more details about the
<codeph>CREATE TABLE LIKE PARQUET</codeph> syntax.
Once you have created a table, to insert data into that table, use a command similar to
the following, again with your own table names:
<codeblock>[impala-host:21000] &gt; insert overwrite table <varname>parquet_table_name</varname> select * from <varname>other_table_name</varname>;</codeblock>
If the Parquet table has a different number of columns or different column names than
the other table, specify the names of columns from the other table rather than
<codeph>*</codeph> in the <codeph>SELECT</codeph> statement.
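<p>
As a minimal sketch of naming columns explicitly, assume a hypothetical source table
whose columns <codeph>c1</codeph> and <codeph>c2</codeph> correspond to the
<codeph>x</codeph> and <codeph>y</codeph> columns of the Parquet table. These column
names are illustrative only:
</p>
<codeblock>-- Hypothetical column names: c1 and c2 exist only for illustration.
-- Columns are matched by position, so list them in the order of the destination columns.
INSERT OVERWRITE TABLE <varname>parquet_table_name</varname>
  SELECT c1, c2 FROM <varname>other_table_name</varname>;</codeblock>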
<concept id="parquet_etl">
<title>Loading Data into Parquet Tables</title>
<data name="Category" value="ETL"/>
Choose from the following techniques for loading data into Parquet tables, depending on
whether the original data is already in an Impala table, or exists as raw data files
outside Impala.
If you already have data in an Impala or Hive table, perhaps in a different file format
or partitioning scheme, you can transfer the data to a Parquet table using the Impala
<codeph>INSERT...SELECT</codeph> syntax. You can convert, filter, repartition, and do
other things to the data as part of this same <codeph>INSERT</codeph> statement. See
<xref href="#parquet_compression"/> for some examples showing how to insert
data into Parquet tables.
When inserting into partitioned tables, especially using the Parquet file format, you
can include a hint in the <codeph>INSERT</codeph> statement to fine-tune the overall
performance of the operation and its resource usage. See <keyword keyref="hints"/> for
using hints in the <codeph>INSERT</codeph> statements.
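<p>
For instance, a hinted insert into a partitioned Parquet table might look like the
following sketch. The table and column names are hypothetical;
<codeph>SHUFFLE</codeph> is one of the hints that Impala accepts for
<codeph>INSERT ... SELECT</codeph>, placed immediately before the
<codeph>SELECT</codeph> keyword:
</p>
<codeblock>-- Hypothetical tables: sales_parquet is partitioned by year; sales_text is the source.
-- The SHUFFLE hint asks Impala to redistribute rows so that each partition
-- is written by one node, reducing the number of files open at once.
INSERT INTO sales_parquet PARTITION (year)
  /* +SHUFFLE */
  SELECT id, amount, year FROM sales_text;</codeblock>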
<p conref="../shared/impala_common.xml#common/insert_parquet_blocksize"/>
Avoid the <codeph>INSERT...VALUES</codeph> syntax for Parquet tables, because
<codeph>INSERT...VALUES</codeph> produces a separate tiny data file for each
<codeph>INSERT...VALUES</codeph> statement, and the strength of Parquet is in its
handling of data (compressing, parallelizing, and so on) in
<ph rev="parquet_block_size">large</ph> chunks.
If you have one or more Parquet data files produced outside of Impala, you can quickly
make the data queryable through Impala by one of the following methods:
The <codeph>LOAD DATA</codeph> statement moves a single data file or a directory full
of data files into the data directory for an Impala table. It does no validation or
conversion of the data. The original data files must be somewhere in HDFS, not the
local filesystem.
The <codeph>CREATE TABLE</codeph> statement with the <codeph>LOCATION</codeph> clause
creates a table where the data continues to reside outside the Impala data directory.
The original data files must be somewhere in HDFS, not the local filesystem. For extra
safety, if the data is intended to be long-lived and reused by other applications, you
can use the <codeph>CREATE EXTERNAL TABLE</codeph> syntax so that the data files are
not deleted by an Impala <codeph>DROP TABLE</codeph> statement.
If the Parquet table already exists, you can copy Parquet data files directly into it,
then use the <codeph>REFRESH</codeph> statement to make Impala recognize the newly
added data. Remember to preserve the block size of the Parquet data files by using the
<codeph>hadoop distcp -pb</codeph> command rather than a <codeph>-put</codeph> or
<codeph>-cp</codeph> operation on the Parquet files. See
<xref href="#parquet_compression_multiple"/> for an example of this kind of operation.
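<p>
The three methods above can be sketched as follows. All table names and HDFS paths
here are hypothetical placeholders:
</p>
<codeblock>-- 1. Move files that are already in HDFS into the table's data directory.
LOAD DATA INPATH '/user/etl/staging/parquet_files' INTO TABLE parquet_table_name;

-- 2. Leave the files in place and point an external table at their directory.
CREATE EXTERNAL TABLE external_parquet_table (x INT, y STRING)
  STORED AS PARQUET
  LOCATION '/user/etl/destination';

-- 3. After copying files into an existing table's directory outside of Impala,
--    make Impala recognize the new files.
REFRESH parquet_table_name;</codeblock>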
Recent versions of Sqoop can produce Parquet output files using the
<codeph>--as-parquetfile</codeph> option.
<p conref="../shared/impala_common.xml#common/sqoop_timestamp_caveat"/>
If the data exists outside Impala and is in some other format, combine both of the
preceding techniques. First, use a <codeph>LOAD DATA</codeph> or <codeph>CREATE EXTERNAL
TABLE ... LOCATION</codeph> statement to bring the data into an Impala table that uses
the appropriate file format. Then, use an <codeph>INSERT...SELECT</codeph> statement to
copy the data to the Parquet table, converting to Parquet format as part of the process.
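<p>
A sketch of this two-step conversion, assuming comma-delimited text files in a
hypothetical HDFS directory, with hypothetical table and column names:
</p>
<codeblock>-- Step 1: expose the existing text files through a table in the matching format.
CREATE EXTERNAL TABLE raw_csv_data (id INT, val INT, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/user/etl/raw_csv';

-- Step 2: create the Parquet table and convert the data while copying it.
CREATE TABLE converted_parquet (id INT, val INT, name STRING) STORED AS PARQUET;
INSERT INTO converted_parquet SELECT id, val, name FROM raw_csv_data;</codeblock>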
Loading data into Parquet tables is a memory-intensive operation, because the incoming
data is buffered until it reaches <ph
rev="parquet_block_size">one data
block</ph> in size, then that chunk of data is organized and compressed in memory before
being written out. The memory consumption can be larger when inserting data into
partitioned Parquet tables, because a separate data file is written for each combination
of partition key column values, potentially requiring several
<ph rev="parquet_block_size">large</ph> chunks to be manipulated in memory at once.
When inserting into a partitioned Parquet table, Impala redistributes the data among the
nodes to reduce memory consumption. You might still need to temporarily increase the
memory dedicated to Impala during the insert operation, or break up the load operation
into several <codeph>INSERT</codeph> statements, or both.
All the preceding techniques assume that the data you are loading matches the structure
of the destination table, including column order, column names, and partition layout. To
transform or reorganize the data, start by loading the data into a Parquet table that
matches the underlying structure of the data, then use one of the table-copying
techniques such as <codeph>CREATE TABLE AS SELECT</codeph> or <codeph>INSERT ...
SELECT</codeph> to reorder or rename columns, divide the data among multiple partitions,
and so on. For example, to take a single comprehensive Parquet data file and load it into
a partitioned table, you would use an <codeph>INSERT ... SELECT</codeph> statement with
dynamic partitioning to let Impala create separate data files with the appropriate
partition values; for an example, see <xref
<concept id="parquet_performance">
<title>Query Performance for Impala Parquet Tables</title>
<data name="Category" value="Performance"/>
Query performance for Parquet tables depends on the number of columns needed to process
the <codeph>SELECT</codeph> list and <codeph>WHERE</codeph> clauses of the query, the
way data is divided into <ph rev="parquet_block_size">large data files with block size
equal to file size</ph>, the reduction in I/O by reading the data for each column in
compressed format, which data files can be skipped (for partitioned tables), and the CPU
overhead of decompressing the data for each column.
For example, the following is an efficient query for a Parquet table:
<codeblock>select avg(income) from census_data where state = 'CA';</codeblock>
The query processes only 2 columns out of a large number of total columns. If the table
is partitioned by the <codeph>STATE</codeph> column, it is even more efficient because
the query only has to read and decode 1 column from each data file, and it can read only
the data files in the partition directory for the state <codeph>'CA'</codeph>, skipping
the data files for all the other states, which are physically located in other
directories.
The following is a relatively inefficient query for a Parquet table:
<codeblock>select * from census_data;</codeblock>
Impala would have to read the entire contents of each
<ph rev="parquet_block_size">large</ph> data file, and decompress the contents of each
column for each row group, negating the I/O optimizations of the column-oriented format.
This query might still be faster for a Parquet table than a table with some other file
format, but it does not take advantage of the unique strengths of Parquet data files.
Impala can optimize queries on Parquet tables, especially join queries, better when
statistics are available for all the tables. Issue the <codeph>COMPUTE STATS</codeph>
statement for each table after substantial amounts of data are loaded into or appended
to it. See <xref href="impala_compute_stats.xml#compute_stats"/> for details.
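<p>
For example, using the <codeph>census_data</codeph> table from the queries above,
statistics can be gathered and then inspected as in this sketch. The output depends
on your data:
</p>
<codeblock>-- Gather table and column statistics after a substantial load.
COMPUTE STATS census_data;

-- Confirm that row counts and per-column statistics are now populated.
SHOW TABLE STATS census_data;
SHOW COLUMN STATS census_data;</codeblock>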
<p rev="2.5.0">
The runtime filtering feature, available in <keyword keyref="impala25_full"/> and
higher, works best with Parquet tables. The per-row filtering aspect only applies to
Parquet tables. See <xref href="impala_runtime_filtering.xml#runtime_filtering"/> for
details.
<p conref="../shared/impala_common.xml#common/s3_block_splitting"/>
<p>Starting in Impala 3.4.0, use the query option
<codeph>PARQUET_OBJECT_STORE_SPLIT_SIZE</codeph> to control the
Parquet split size for non-block stores such as S3 and ADLS. The
default value is 256 MB.</p>
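<p>
As an illustration, the following sets the split size to 128 MB for the current
session. The value is an arbitrary example and is assumed here to be expressed in
bytes:
</p>
<codeblock>-- 128 * 1024 * 1024 = 134217728; an arbitrary illustrative value, assumed to be in bytes.
SET PARQUET_OBJECT_STORE_SPLIT_SIZE=134217728;</codeblock>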
<p rev="IMPALA-3909">
In <keyword keyref="impala29"/> and higher, Parquet files written by Impala include
embedded metadata specifying the minimum and maximum values for each column, within each
row group and each data page within the row group. Impala-written Parquet files
typically contain a single row group; a row group can contain many data pages. Impala
uses this information (currently, only the metadata for each row group) when reading
each Parquet data file during a query, to quickly determine whether each row group
within the file potentially includes any rows that match the conditions in the
<codeph>WHERE</codeph> clause. For example, if the column <codeph>X</codeph> within a
particular Parquet file has a minimum value of 1 and a maximum value of 100, then a
query including the clause <codeph>WHERE x &gt; 200</codeph> can quickly determine that
it is safe to skip that particular file, instead of scanning all the associated column
values. This optimization technique is especially effective for tables that use the
<codeph>SORT BY</codeph> clause for the columns most frequently checked in
<codeph>WHERE</codeph> clauses, because any <codeph>INSERT</codeph> operation on such
tables produces Parquet data files with relatively narrow ranges of column values within
each file.
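<p>
A sketch of how this might be used, with hypothetical table and column names. The
<codeph>SORT BY</codeph> clause in <codeph>CREATE TABLE</codeph> causes each
subsequent <codeph>INSERT</codeph> to sort rows by the named column before writing:
</p>
<codeblock>-- Hypothetical table: rows are sorted by x within each data file written.
CREATE TABLE sorted_events (x INT, payload STRING)
  SORT BY (x)
  STORED AS PARQUET;

INSERT INTO sorted_events SELECT x, payload FROM raw_events;

-- Row groups whose maximum x is 200 or less can be skipped without
-- scanning their column values.
SELECT COUNT(*) FROM sorted_events WHERE x &gt; 200;</codeblock>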
<p>To prevent Impala from writing the Parquet page index when creating
Parquet files, set the <codeph>PARQUET_WRITE_PAGE_INDEX</codeph> query
option to <codeph>FALSE</codeph>.</p>
<concept id="parquet_partitioning">
<title>Partitioning for Parquet Tables</title>
As explained in <xref href="impala_partitioning.xml#partitioning"/>, partitioning is
an important performance technique for Impala generally. This section explains some of
the performance considerations for partitioned Parquet tables.
The Parquet file format is ideal for tables containing many columns, where most
queries only refer to a small subset of the columns. As explained in
<xref href="#parquet_data_files"/>, the physical layout of Parquet data files lets
Impala read only a small fraction of the data for many queries. The performance
benefits of this approach are amplified when you use Parquet tables in combination
with partitioning. Impala can skip the data files for certain partitions entirely,
based on the comparisons in the <codeph>WHERE</codeph> clause that refer to the
partition key columns. For example, queries on partitioned tables often analyze data
for time intervals based on columns such as <codeph>YEAR</codeph>,
<codeph>MONTH</codeph>, and/or <codeph>DAY</codeph>, or for geographic regions.
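<p>
For example, with a hypothetical table partitioned by year and month, a query that
filters on the partition key columns reads only the matching partition directories:
</p>
<codeblock>-- Hypothetical table and columns, for illustration only.
CREATE TABLE sales_by_month (id BIGINT, amount DOUBLE)
  PARTITIONED BY (year INT, month INT)
  STORED AS PARQUET;

-- Only data files under year=2010/month=1 are scanned, and within those
-- files only the amount column is read.
SELECT SUM(amount) FROM sales_by_month WHERE year = 2010 AND month = 1;</codeblock>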
Remember that Parquet data files use a <ph rev="parquet_block_size">large</ph> block
size, so when deciding how finely to partition the data, try to find a granularity
where each partition contains <ph rev="parquet_block_size">256 MB</ph> or more of
data, rather than creating a large number of smaller files split among many
partitions.
Inserting into a partitioned Parquet table can be a resource-intensive operation,
because each Impala node could potentially be writing a separate data file to HDFS for
each combination of different values for the partition key columns. The large number
of simultaneous open files could exceed the HDFS <q>transceivers</q> limit. To avoid
exceeding this limit, consider the following techniques:
Load different subsets of data using separate <codeph>INSERT</codeph> statements
with specific values for the <codeph>PARTITION</codeph> clause, such as
<codeph>PARTITION (year=2010)</codeph>.
Increase the <q>transceivers</q> value for HDFS, sometimes spelled <q>xcievers</q>
(sic). The property value in the <filepath>hdfs-site.xml</filepath> configuration
file is <codeph>dfs.datanode.max.transfer.threads</codeph>. For example, if you were
loading 12 years of data partitioned by year, month, and day, even a value of 4096
might not be high enough. This
<xref keyref="hbase-hadoop-xceivers">blog post</xref> explores the
considerations for setting this value higher or lower, using HBase examples for
Use the <codeph>COMPUTE STATS</codeph> statement to collect
<xref href="impala_perf_stats.xml#perf_column_stats">column statistics</xref> on the
source table from which data is being copied, so that the Impala query can estimate
the number of different values in the partition key columns and distribute the work
<concept id="parquet_compression">
<title>Compression for Parquet Data Files</title>
<data name="Category" value="Snappy"/>
<data name="Category" value="Gzip"/>
<data name="Category" value="Compression"/>
When Impala writes Parquet data files using the <codeph>INSERT</codeph> statement, the
underlying compression is controlled by the <codeph>COMPRESSION_CODEC</codeph> query
option. (Prior to Impala 2.0, the query option name was
<codeph>PARQUET_COMPRESSION_CODEC</codeph>.) The allowed values for this query option
are <codeph>snappy</codeph> (the default), <codeph>gzip</codeph>, <codeph>zstd</codeph>,
<codeph>lz4</codeph>, and <codeph>none</codeph>. The option value is not case-sensitive.
If the option is set to an unrecognized value, all kinds of queries will fail due to
the invalid option setting, not just queries involving Parquet tables.
<concept id="parquet_snappy">
<title>Example of Parquet Table with Snappy Compression</title>
By default, the underlying data files for a Parquet table are compressed with Snappy.
The combination of fast compression and decompression makes it a good choice for many
data sets. To ensure Snappy compression is used, for example after experimenting with
other compression codecs, set the <codeph>COMPRESSION_CODEC</codeph> query option to
<codeph>snappy</codeph> before inserting the data:
<codeblock>[localhost:21000] &gt; create database parquet_compression;
[localhost:21000] &gt; use parquet_compression;
[localhost:21000] &gt; create table parquet_snappy like raw_text_data stored as parquet;
[localhost:21000] &gt; set COMPRESSION_CODEC=snappy;
[localhost:21000] &gt; insert into parquet_snappy select * from raw_text_data;
Inserted 1000000000 rows in 181.98s
<concept id="parquet_gzip">
<title>Example of Parquet Table with GZip Compression</title>
If you need more intensive compression (at the expense of more CPU cycles for
uncompressing during queries), set the <codeph>COMPRESSION_CODEC</codeph> query option
to <codeph>gzip</codeph> before inserting the data:
<codeblock>[localhost:21000] &gt; create table parquet_gzip like raw_text_data stored as parquet;
[localhost:21000] &gt; set COMPRESSION_CODEC=gzip;
[localhost:21000] &gt; insert into parquet_gzip select * from raw_text_data;
Inserted 1000000000 rows in 1418.24s
<concept id="parquet_none">
<title>Example of Uncompressed Parquet Table</title>
If your data compresses very poorly, or you want to avoid the CPU overhead of
compression and decompression entirely, set the <codeph>COMPRESSION_CODEC</codeph>
query option to <codeph>none</codeph> before inserting the data:
<codeblock>[localhost:21000] &gt; create table parquet_none like raw_text_data stored as parquet;
[localhost:21000] &gt; set COMPRESSION_CODEC=none;
[localhost:21000] &gt; insert into parquet_none select * from raw_text_data;
Inserted 1000000000 rows in 146.90s
<concept id="parquet_compression_examples">
<title>Examples of Sizes and Speeds for Compressed Parquet Tables</title>
Here are some examples showing differences in data sizes and query speeds for 1
billion rows of synthetic data, compressed with each kind of codec. As always, run
similar tests with realistic data sets of your own. The actual compression ratios, and
relative insert and query speeds, will vary depending on the characteristics of the
actual data.
In this case, switching from Snappy to GZip compression shrinks the data by an
additional 40% or so, while switching from Snappy compression to no compression
expands the data also by about 40%:
<codeblock>$ hdfs dfs -du -h /user/hive/warehouse/parquet_compression.db
23.1 G /user/hive/warehouse/parquet_compression.db/parquet_snappy
13.5 G /user/hive/warehouse/parquet_compression.db/parquet_gzip
32.8 G /user/hive/warehouse/parquet_compression.db/parquet_none
Because Parquet data files are typically <ph rev="parquet_block_size">large</ph>, each
directory will have a different number of data files and the row groups will be
arranged differently.
At the same time, the less aggressive the compression, the faster the data can be
decompressed. In this case using a table with a billion rows, a query that evaluates
all the values for a particular column runs faster with no compression than with
Snappy compression, and faster with Snappy compression than with GZip compression.
Query performance depends on several other factors, so as always, run your own
benchmarks with your own data to determine the ideal tradeoff between data size, CPU
efficiency, and speed of insert and query operations.
<codeblock>[localhost:21000] &gt; desc parquet_snappy;
Query finished, fetching results ...
| name | type | comment |
| id | int | |
| val | int | |
| zfill | string | |
| name | string | |
| assertion | boolean | |
Returned 5 row(s) in 0.14s
[localhost:21000] &gt; select avg(val) from parquet_snappy;
Query finished, fetching results ...
| _c0 |
| 250000.93577915 |
Returned 1 row(s) in 4.29s
[localhost:21000] &gt; select avg(val) from parquet_gzip;
Query finished, fetching results ...
| _c0 |
| 250000.93577915 |
Returned 1 row(s) in 6.97s
[localhost:21000] &gt; select avg(val) from parquet_none;
Query finished, fetching results ...
| _c0 |
| 250000.93577915 |
Returned 1 row(s) in 3.67s
<concept id="parquet_compression_multiple">
<title>Example of Copying Parquet Data Files</title>
Here is a final example, to illustrate how the data files using the various
compression codecs are all compatible with each other for read operations. The
metadata about the compression format is written into each data file, and can be
decoded during queries regardless of the <codeph>COMPRESSION_CODEC</codeph> setting in
effect at the time. In this example, we copy data files from the
<codeph>PARQUET_SNAPPY</codeph>, <codeph>PARQUET_GZIP</codeph>, and
<codeph>PARQUET_NONE</codeph> tables used in the previous examples, each containing 1
billion rows, all to the data directory of a new table
<codeph>PARQUET_EVERYTHING</codeph>. A couple of sample queries demonstrate that the
new table now contains 3 billion rows featuring a variety of compression codecs for
the data files.
First, we create the table in Impala so that there is a destination directory in HDFS
to put the data files:
<codeblock>[localhost:21000] &gt; create table parquet_everything like parquet_snappy;
Query: create table parquet_everything like parquet_snappy
Then in the shell, we copy the relevant data files into the data directory for this
new table. Rather than using <codeph>hdfs dfs -cp</codeph> as with typical files, we
use <codeph>hadoop distcp -pb</codeph> to ensure that the special
<ph rev="parquet_block_size"> block size</ph> of the Parquet data files is preserved.
<codeblock>$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_snappy \
...<varname>MapReduce output</varname>...
$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_gzip \
...<varname>MapReduce output</varname>...
$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_none \
...<varname>MapReduce output</varname>...
Back in the <cmdname>impala-shell</cmdname> interpreter, we use the
<codeph>REFRESH</codeph> statement to alert the Impala server to the new data files
for this table, then we can run queries demonstrating that the data files represent 3
billion rows, and the values for one of the numeric columns match what was in the
original smaller tables:
<codeblock>[localhost:21000] &gt; refresh parquet_everything;
Query finished, fetching results ...
Returned 0 row(s) in 0.32s
[localhost:21000] &gt; select count(*) from parquet_everything;
Query finished, fetching results ...
| _c0 |
| 3000000000 |
Returned 1 row(s) in 8.18s
[localhost:21000] &gt; select avg(val) from parquet_everything;
Query finished, fetching results ...
| _c0 |
| 250000.93577915 |
Returned 1 row(s) in 13.35s
<concept rev="2.3.0" id="parquet_complex_types">
<title>Parquet Tables for Impala Complex Types</title>
<p conref="../shared/impala_common.xml#common/complex_types_short_intro"/>
<concept id="parquet_interop">
<title>Exchanging Parquet Data Files with Other Hadoop Components</title>
<data name="Category" value="Hadoop"/>
You can read and write Parquet data files from other Hadoop components. See
<xref keyref="cdh_ig_parquet"/> for details.
<!-- These couple of paragraphs reused in the release notes 'incompatible changes' section. -->
<!-- But conbodydiv tag too restrictive, can't have just paragraphs and codeblocks inside. -->
<!-- So I will physically copy the info for the time being. -->
<!-- <conbodydiv id="upgrade_parquet_metadata"> -->
Previously, it was not possible to create Parquet data through Impala and reuse that
table within Hive. Now that Parquet support is available for Hive, reusing existing
Impala Parquet data files in Hive requires updating the table metadata. Use the
following command if you are already running Impala 1.1.1 or higher:
<codeblock>ALTER TABLE <varname>table_name</varname> SET FILEFORMAT PARQUET;
If you are running a level of Impala that is older than 1.1.1, do the metadata update
through Hive:
<codeblock>ALTER TABLE <varname>table_name</varname> SET SERDE 'parquet.hive.serde.ParquetHiveSerDe';
ALTER TABLE <varname>table_name</varname> SET FILEFORMAT
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat";
Impala 1.1.1 and higher can reuse Parquet data files created by Hive, without any action
required.
<!-- </conbodydiv> -->
<p rev="2.2.0">
Impala supports the scalar data types that you can encode in a Parquet data file, but
not composite or nested types such as maps or arrays. In
<keyword keyref="impala22_full"/> and higher, Impala can query Parquet data files that
include composite or nested types, as long as the query only refers to columns with
scalar types.
<!-- TK: could include an example here, but would require setup in Hive or Pig or something. -->
If you copy Parquet data files between nodes, or even between different directories on
the same node, make sure to preserve the block size by using the command <codeph>hadoop
distcp -pb</codeph>. To verify that the block size was preserved, issue the command
<codeph>hdfs fsck -blocks <varname>HDFS_path_of_impala_table_dir</varname></codeph> and
check that the average block size is at or near <ph rev="parquet_block_size">256 MB (or
whatever other size is defined by the <codeph>PARQUET_FILE_SIZE</codeph> query
option)</ph>. (The <codeph>hadoop distcp</codeph> operation typically leaves some
directories behind, with names matching <filepath>_distcp_logs_*</filepath>, that you
can delete from the destination directory afterward.)
<!-- The Apache wiki page keeps disappearing, even though Google still points to it as of Nov. 11/2014. -->
<!-- Now there is a 'distcp2' guide: but I haven't tried that so let's play it safe for now and hide the link. -->
<!-- See the <xref href="" scope="external" format="html">Hadoop DistCP Guide</xref> for details. -->
Issue the command <cmdname>hadoop distcp</cmdname> for details about
<cmdname>distcp</cmdname> command syntax.
<!-- Sample commands/output for when the 'distcp' business is expanded into a tutorial later.
<codeblock>$ hdfs fsck -blocks /user/impala/warehouse/parquet_compression.db/parquet_everything
Connecting to namenode via
FSCK started by jrussell (auth:SIMPLE) from / for path /user/impala/warehouse/parquet_compression.db/parquet_everything at Fri Aug 23 11:35:37 PDT 2013
............................................................................Status: HEALTHY
Total size: 74504481213 B
Total dirs: 1
Total files: 76
Total blocks (validated): 76 (avg. block size 980322121 B)
Minimally replicated blocks: 76 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 4
Number of racks: 1
FSCK ended at Fri Aug 23 11:35:37 PDT 2013 in 8 milliseconds
The filesystem under path '/user/impala/warehouse/parquet_compression.db/parquet_everything' is HEALTHY
<p conref="../shared/impala_common.xml#common/impala_parquet_encodings_caveat"/>
<p conref="../shared/impala_common.xml#common/parquet_tools_blurb"/>
<concept id="parquet_data_files">
<title>How Parquet Data Files Are Organized</title>
<data name="Category" value="Concepts"/>
Although Parquet is a column-oriented file format, do not expect to find one data file
for each column. Parquet keeps all the data for a row within the same data file, to
ensure that the columns for a row are always available on the same node for processing.
Instead, Parquet sets a large HDFS block size and a matching maximum data file
size, to ensure that I/O and network transfer requests apply to large batches of data.
Within that data file, the data for a set of rows is rearranged so that all the values
from the first column are organized in one contiguous block, then all the values from
the second column, and so on. Putting the values from the same column next to each other
lets Impala use effective compression techniques on the values in that column.
Impala <codeph>INSERT</codeph> statements write Parquet data files using an HDFS block
size <ph rev="parquet_block_size">that matches the data file size</ph>, to ensure that
each data file is represented by a single HDFS block, and the entire file can be
processed on a single node without requiring any remote reads.
If you create Parquet data files outside of Impala, such as through a MapReduce or Pig
job, ensure that the HDFS block size is greater than or equal to the file size, so
that the <q>one file per block</q> relationship is maintained. Set the
<codeph>dfs.block.size</codeph> or the <codeph>dfs.blocksize</codeph> property large
enough that each file fits within a single HDFS block, even if that size is larger
than the normal HDFS block size.
If the block size is reset to a lower value during a file copy, you will see lower
performance for queries involving those files, and the <codeph>PROFILE</codeph>
statement will reveal that some I/O is being done suboptimally, through remote reads.
See <xref href="impala_parquet.xml#parquet_compression_multiple"/> for an example
showing how to preserve the block size when copying Parquet data files.
When Impala retrieves or tests the data for a particular column, it opens all the data
files, but only reads the portion of each file containing the values for that column.
The column values are stored consecutively, minimizing the I/O required to process the
values within a single column. If other columns are named in the <codeph>SELECT</codeph>
list or <codeph>WHERE</codeph> clauses, the data for all columns in the same row is
available within that same data file.
If an <codeph>INSERT</codeph> statement brings in less than
<ph rev="parquet_block_size">one Parquet block's worth</ph> of data, the resulting data
file is smaller than ideal. Thus, if you do split up an ETL job to use multiple
<codeph>INSERT</codeph> statements, try to keep the volume of data for each
<codeph>INSERT</codeph> statement to approximately <ph rev="parquet_block_size">256 MB,
or a multiple of 256 MB</ph>.
<concept id="parquet_encoding">
<title>RLE and Dictionary Encoding for Parquet Data Files</title>
Parquet uses some automatic compression techniques, such as run-length encoding (RLE)
and dictionary encoding, based on analysis of the actual data values. Once the data
values are encoded in a compact form, the encoded data can optionally be further
compressed using a compression algorithm. Parquet data files created by Impala can use
Snappy, GZip, or no compression; the Parquet spec also allows LZO compression, but
currently Impala does not support LZO-compressed Parquet files.
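The following is a minimal sketch of choosing the codec for subsequent
<codeph>INSERT</codeph> statements through the <codeph>COMPRESSION_CODEC</codeph> query
option. The destination table names are hypothetical, and older releases might use a
different option name, so check the documentation for your release.
<codeblock>-- Snappy is the default codec for Parquet files written by Impala.
set compression_codec=snappy;
insert into parquet_snappy select * from text_table;
set compression_codec=gzip;
insert into parquet_gzip select * from text_table;
-- RLE and dictionary encoding still apply when no codec is used.
set compression_codec=none;
insert into parquet_none select * from text_table;
</codeblock>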
RLE and dictionary encoding are compression techniques that Impala applies
automatically to groups of Parquet data values, in addition to any Snappy or GZip
compression applied to the entire data files. These automatic optimizations can save
you time and planning that are normally needed for a traditional data warehouse. For
example, dictionary encoding reduces the need to create numeric IDs as abbreviations
for longer string values.
Run-length encoding condenses sequences of repeated data values. For example, if many
consecutive rows all contain the same value for a country code, those repeating values
can be represented by the value followed by a count of how many times it appears consecutively.
Dictionary encoding takes the different values present in a column, and represents
each one in compact 2-byte form rather than the original value, which could be several
bytes. (Additional compression is applied to the compacted values, for extra space
savings.) This type of encoding applies when the number of different values for a
column is less than 2**16 (65,536). It does not apply to columns of data type
<codeph>BOOLEAN</codeph>, which are already very short. <codeph>TIMESTAMP</codeph>
columns sometimes have a unique value for each row, in which case they can quickly
exceed the 2**16 limit on distinct values. The 2**16 limit on different values within
a column is reset for each data file, so if several different data files each
contained 10,000 different city names, the city name column in each data file could
still be condensed using dictionary encoding.
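To get a rough idea of whether a column is a candidate for dictionary encoding, you can
compare its number of distinct values against the 2**16 limit. This is only a sketch: the
table and column names are hypothetical, and because the limit applies to each data file
separately, a table-wide count above the limit does not rule out dictionary encoding
within individual files.
<codeblock>-- Exact count; can be expensive on a large table.
select count(distinct city_name) from customers_parquet;
-- Approximate count, typically faster.
select ndv(city_name) from customers_parquet;
</codeblock>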
<concept rev="1.4.0" id="parquet_compacting">
<title>Compacting Data Files for Parquet Tables</title>
If you reuse existing table structures or ETL processes for Parquet tables, you might
encounter a <q>many small files</q> situation, which is suboptimal for query efficiency.
For example, statements like these might produce inefficiently organized data files:
<codeblock>-- In an N-node cluster, each node produces a data file
-- for the INSERT operation. If you have less than
-- N GB of data to copy, some files are likely to be
-- much smaller than the <ph rev="parquet_block_size">default Parquet</ph> block size.
insert into parquet_table select * from text_table;
-- Even if this operation involves an overall large amount of data,
-- when split up by year/month/day, each partition might only
-- receive a small amount of data. Then the data files for
-- the partition might be divided between the N nodes in the cluster.
-- A multi-gigabyte copy operation might produce files of only
-- a few MB each.
insert into partitioned_parquet_table partition (year, month, day)
select year, month, day, url, referer, user_agent, http_code, response_time
from web_stats;</codeblock>
Here are techniques to help you produce large data files in Parquet
<codeph>INSERT</codeph> operations, and to compact existing too-small data files:
When inserting into a partitioned Parquet table, use statically partitioned
<codeph>INSERT</codeph> statements where the partition key values are specified as
constant values. Ideally, use a separate <codeph>INSERT</codeph> statement for each partition.
<p conref="../shared/impala_common.xml#common/num_nodes_tip"/>
Be prepared to reduce the number of partition key columns from what you are used to
with traditional analytic database systems.
Do not expect Impala-written Parquet files to fill up the entire Parquet block size.
Impala estimates on the conservative side when figuring out how much data to write
to each Parquet file. Typically, the volume of uncompressed data in memory is substantially
reduced on disk by the compression and encoding techniques in the Parquet file format.
Impala reserves <ph rev="parquet_block_size">1 GB</ph> of memory to buffer the data before writing,
but the actual data file might be smaller, in the hundreds of megabytes.
The final data file size varies depending on the compressibility of the data.
Therefore, it is not an indication of a problem if <ph rev="parquet_block_size">256
MB</ph> of text data is turned into 2 Parquet data files, each less than
<ph rev="parquet_block_size">256 MB</ph>.
If you accidentally end up with a table with many small data files, consider using
one or more of the preceding techniques and copying all the data into a new Parquet
table, either through <codeph>CREATE TABLE AS SELECT</codeph> or <codeph>INSERT ...
SELECT</codeph> statements.
To avoid rewriting queries to change table names, you can adopt a convention of
always running important queries against a view. Changing the view definition
immediately switches any subsequent queries to use the new underlying tables:
<codeblock>create view production_table as select * from table_with_many_small_files;
-- CTAS or INSERT...SELECT all the data into a more efficient layout...
alter view production_table as select * from table_with_few_big_files;
select * from production_table where c1 = 100 and c2 &lt; 50 and ...;</codeblock>
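As a minimal sketch of two of these techniques, the following statements reuse the
hypothetical table names from the earlier examples. The constant partition key values are
illustrative only.
<codeblock>-- Statically partitioned INSERT: all partition key values are constants,
-- so the statement writes into a single partition.
insert into partitioned_parquet_table partition (year=2012, month=1, day=1)
  select url, referer, user_agent, http_code, response_time
  from web_stats where year = 2012 and month = 1 and day = 1;

-- Compact an existing table by copying it into a new Parquet table.
create table table_with_few_big_files stored as parquet
  as select * from table_with_many_small_files;

-- Compare the number and size of data files before and after.
show table stats table_with_many_small_files;
show table stats table_with_few_big_files;
</codeblock>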
<concept rev="1.4.0" id="parquet_schema_evolution">
<title>Schema Evolution for Parquet Tables</title>
Schema evolution refers to using the statement <codeph>ALTER TABLE ... REPLACE
COLUMNS</codeph> to change the names, data type, or number of columns in a table. You
can perform schema evolution for Parquet tables as follows:
The Impala <codeph>ALTER TABLE</codeph> statement never changes any data files in
the tables. From the Impala side, schema evolution involves interpreting the same
data files in terms of a new table definition. Some types of schema changes make
sense and are represented correctly. Other types of changes cannot be represented in
a sensible way, and produce special result values or conversion errors during queries.
The <codeph>INSERT</codeph> statement always creates data using the latest table
definition. You might end up with data files with different numbers of columns or
internal data representations if you do a sequence of <codeph>INSERT</codeph> and
<codeph>ALTER TABLE ... REPLACE COLUMNS</codeph> statements.
If you use <codeph>ALTER TABLE ... REPLACE COLUMNS</codeph> to define additional
columns at the end, when the original data files are used in a query, these final
columns are considered to be all <codeph>NULL</codeph> values.
If you use <codeph>ALTER TABLE ... REPLACE COLUMNS</codeph> to define fewer columns
than before, when the original data files are used in a query, the unused columns
still present in the data file are ignored.
Parquet represents the <codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, and
<codeph>INT</codeph> types the same internally, all stored in 32-bit integers.
That means it is easy to promote a <codeph>TINYINT</codeph> column to
<codeph>SMALLINT</codeph> or <codeph>INT</codeph>, or a <codeph>SMALLINT</codeph>
column to <codeph>INT</codeph>. The numbers are represented exactly the same in
the data file, and the columns being promoted would not contain any out-of-range values.
If you change any of these column types to a smaller type, any values that are
out-of-range for the new type are returned incorrectly, typically as negative numbers.
You cannot change a <codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, or
<codeph>INT</codeph> column to <codeph>BIGINT</codeph>, or the other way around.
Although the <codeph>ALTER TABLE</codeph> succeeds, any attempt to query those
columns results in conversion errors.
Any other type conversion for columns produces a conversion error during
queries. For example, <codeph>INT</codeph> to <codeph>STRING</codeph>,
<codeph>FLOAT</codeph> to <codeph>DOUBLE</codeph>, <codeph>TIMESTAMP</codeph> to
<codeph>STRING</codeph>, <codeph>DECIMAL(9,0)</codeph> to
<codeph>DECIMAL(5,2)</codeph>, and so on.
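The following sketch illustrates two of the supported changes described above: promoting
a <codeph>TINYINT</codeph> column to <codeph>INT</codeph>, and adding a column at the end.
The table and column names are hypothetical, and the comments restate the behavior
described in this section rather than showing captured output.
<codeblock>-- Hypothetical table, for illustration only.
create table evolve_demo (id tinyint, name string) stored as parquet;
insert into evolve_demo values (1, 'one');

-- Promote ID from TINYINT to INT and add a column at the end.
-- The existing data file is not modified.
alter table evolve_demo replace columns (id int, name string, added_later string);

-- The original data file is interpreted using the new definition:
-- ID values are unchanged, and ADDED_LATER is NULL for rows
-- written before the ALTER TABLE statement.
select id, name, added_later from evolve_demo;
</codeblock>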
<p rev="2.6.0 IMPALA-2835">
You might find that you have Parquet files where the columns do not line up in the same
order as in your Impala table. For example, you might have a Parquet file that was part
of a table with columns <codeph>C1,C2,C3,C4</codeph>, and now you want to reuse the same
Parquet file in a table with columns <codeph>C4,C2</codeph>. By default, Impala expects
the columns in the data file to appear in the same order as the columns defined for the
table, making it impractical to do some kinds of file reuse or schema evolution. In
<keyword keyref="impala26_full"/> and higher, the query option
<codeph>PARQUET_FALLBACK_SCHEMA_RESOLUTION=name</codeph> lets Impala resolve columns by
name, and therefore handle out-of-order or extra columns in the data file. For example:
<codeblock conref="../shared/impala_common.xml#common/parquet_fallback_schema_resolution_example"/>
See
<xref href="impala_parquet_fallback_schema_resolution.xml#parquet_fallback_schema_resolution"/>
for more details.
<concept id="parquet_data_types">
<title>Data Type Considerations for Parquet Tables</title>
The Parquet format defines a set of data types whose names differ from the names of the
corresponding Impala data types. If you are preparing Parquet files using other Hadoop
components such as Pig or MapReduce, you might need to work with the type names defined
by Parquet. The following tables list the Parquet-defined types and the equivalent types
in Impala.
<b>Primitive types</b>
<simpletable frame="all" id="simpletable_am3_rxn_wgb">
<stentry>Parquet type</stentry>
<stentry>Impala type</stentry>
<b>Logical types</b>
Parquet uses type annotations to extend the types that it can store, by specifying how
the primitive types should be interpreted.
<simpletable frame="all" id="simpletable_az3_byn_wgb">
<stentry>Parquet primitive type and annotation</stentry>
<stentry>Impala type</stentry>
<stentry>BINARY annotated with the UTF8 OriginalType</stentry>
<stentry>BINARY annotated with the STRING LogicalType</stentry>
<stentry>BINARY annotated with the ENUM OriginalType</stentry>
<stentry>BINARY annotated with the DECIMAL OriginalType</stentry>
<stentry>INT64 annotated with the TIMESTAMP_MILLIS OriginalType</stentry>
<stentry>TIMESTAMP (in <keyword keyref="impala32"/> or higher), or
BIGINT (for backward compatibility)</stentry>
<stentry>INT64 annotated with the TIMESTAMP_MICROS OriginalType</stentry>
<stentry>TIMESTAMP (in <keyword keyref="impala32"/> or higher), or
BIGINT (for backward compatibility)</stentry>
<stentry>INT64 annotated with the TIMESTAMP LogicalType</stentry>
<stentry>TIMESTAMP (in <keyword keyref="impala32"/> or higher), or
BIGINT (for backward compatibility)</stentry>
<p rev="2.3.0">
<b>Complex types:</b>
<p rev="2.3.0">
For the complex types (<codeph>ARRAY</codeph>, <codeph>MAP</codeph>, and
<codeph>STRUCT</codeph>) available in <keyword keyref="impala23_full"/> and higher,
Impala only supports queries against those types in Parquet tables.