<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="parquet">
<title>Using the Parquet File Format with Impala Tables</title>
<titlealts audience="PDF"><navtitle>Parquet Data Files</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="File Formats"/>
<data name="Category" value="Parquet"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Tables"/>
<data name="Category" value="Schemas"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="hidden">Parquet support in Impala</indexterm>
Impala helps you to create, manage, and query Parquet tables. Parquet is a column-oriented binary file format
intended to be highly efficient for the types of large-scale queries that Impala is best at. Parquet is
especially good for queries scanning particular columns within a table, for example to query <q>wide</q>
tables with many columns, or to perform aggregation operations such as <codeph>SUM()</codeph> and
<codeph>AVG()</codeph> that need to process most or all of the values from a column. Each data file contains
the values for a set of rows (the <q>row group</q>). Within a data file, the values from each column are
organized so that they are all adjacent, enabling good compression for the values from that column. Queries
against a Parquet table can retrieve and analyze these values from any column quickly and with minimal I/O.
</p>
<table>
<title>Parquet Format Support in Impala</title>
<tgroup cols="5">
<colspec colname="1" colwidth="10*"/>
<colspec colname="2" colwidth="10*"/>
<colspec colname="3" colwidth="20*"/>
<colspec colname="4" colwidth="30*"/>
<colspec colname="5" colwidth="30*"/>
<thead>
<row>
<entry>
File Type
</entry>
<entry>
Format
</entry>
<entry>
Compression Codecs
</entry>
<entry>
Impala Can CREATE?
</entry>
<entry>
Impala Can INSERT?
</entry>
</row>
</thead>
<tbody>
<row conref="impala_file_formats.xml#file_formats/parquet_support">
<entry/>
</row>
</tbody>
</tgroup>
</table>
<p outputclass="toc inpage"/>
</conbody>
<concept id="parquet_ddl">
<title>Creating Parquet Tables in Impala</title>
<conbody>
<p>
To create a table named <codeph>PARQUET_TABLE</codeph> that uses the Parquet format, you would use a
command like the following, substituting your own table name, column names, and data types:
</p>
<codeblock>[impala-host:21000] &gt; create table <varname>parquet_table_name</varname> (x INT, y STRING) STORED AS PARQUET;</codeblock>
<!--
<note>
Formerly, the <codeph>STORED AS</codeph> clause required the keyword <codeph>PARQUETFILE</codeph>.
In Impala 1.2.2 and higher, you can use <codeph>STORED AS PARQUET</codeph>.
This <codeph>PARQUET</codeph> keyword is recommended for new code.
</note>
-->
<p>
Or, to clone the column names and data types of an existing table:
</p>
<codeblock>[impala-host:21000] &gt; create table <varname>parquet_table_name</varname> LIKE <varname>other_table_name</varname> STORED AS PARQUET;</codeblock>
<p rev="1.4.0">
In Impala 1.4.0 and higher, you can derive column definitions from a raw Parquet data file, even without an
existing Impala table. For example, you can create an external table pointing to an HDFS directory, and
base the column definitions on one of the files in that directory:
</p>
<codeblock rev="1.4.0">CREATE EXTERNAL TABLE ingest_existing_files LIKE PARQUET '/user/etl/destination/datafile1.dat'
STORED AS PARQUET
LOCATION '/user/etl/destination';
</codeblock>
<p>
Or, you can refer to an existing data file and create a new empty table with suitable column definitions.
Then you can use <codeph>INSERT</codeph> to create new data files or <codeph>LOAD DATA</codeph> to transfer
existing data files into the new table.
</p>
<codeblock rev="1.4.0">CREATE TABLE columns_from_data_file LIKE PARQUET '/user/etl/destination/datafile1.dat'
STORED AS PARQUET;
</codeblock>
<p>
The default properties of the newly created table are the same as for any other <codeph>CREATE
TABLE</codeph> statement. For example, the default file format is text; if you want the new table to use
the Parquet file format, include the <codeph>STORED AS PARQUET</codeph> clause also.
</p>
<p>
In this example, the new table is partitioned by year, month, and day. These partition key columns are not
part of the data file, so you specify them in the <codeph>CREATE TABLE</codeph> statement:
</p>
<codeblock rev="1.4.0">CREATE TABLE columns_from_data_file LIKE PARQUET '/user/etl/destination/datafile1.dat'
PARTITIONED BY (year INT, month TINYINT, day TINYINT)
STORED AS PARQUET;
</codeblock>
<p rev="1.4.0">
See <xref href="impala_create_table.xml#create_table"/> for more details about the <codeph>CREATE TABLE
LIKE PARQUET</codeph> syntax.
</p>
<p>
Once you have created a table, to insert data into that table, use a command similar to the following,
again with your own table names:
</p>
<codeblock>[impala-host:21000] &gt; insert overwrite table <varname>parquet_table_name</varname> select * from <varname>other_table_name</varname>;</codeblock>
<p>
If the Parquet table has a different number of columns or different column names than the other table,
specify the names of columns from the other table rather than <codeph>*</codeph> in the
<codeph>SELECT</codeph> statement.
</p>
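<p>
For example, the following sketch (with hypothetical column names) uses an explicit column list
instead of <codeph>*</codeph>, and also shows the equivalent <codeph>CREATE TABLE AS SELECT</codeph>
technique, which creates and populates the Parquet table in a single statement:
</p>
<codeblock>insert overwrite table <varname>parquet_table_name</varname> select c1, c3, c5 from <varname>other_table_name</varname>;

-- Or create and populate the destination table in one step:
create table parquet_ctas_example stored as parquet
  as select c1, c3, c5 from <varname>other_table_name</varname>;
</codeblock>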
</conbody>
</concept>
<concept id="parquet_etl">
<title>Loading Data into Parquet Tables</title>
<prolog>
<metadata>
<data name="Category" value="ETL"/>
</metadata>
</prolog>
<conbody>
<p>
Choose from the following techniques for loading data into Parquet tables, depending on whether the
original data is already in an Impala table, or exists as raw data files outside Impala.
</p>
<p>
If you already have data in an Impala or Hive table, perhaps in a different file format or partitioning
scheme, you can transfer the data to a Parquet table using the Impala <codeph>INSERT...SELECT</codeph>
syntax. You can convert, filter, repartition, and do other things to the data as part of this same
<codeph>INSERT</codeph> statement. See <xref href="#parquet_compression"/> for some examples showing how to
insert data into Parquet tables.
</p>
<p conref="../shared/impala_common.xml#common/insert_hints"/>
<p conref="../shared/impala_common.xml#common/insert_parquet_blocksize"/>
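<p>
For example, the following sketch converts data from a hypothetical text-format table into a new
Parquet table. The optional <codeph>PARQUET_FILE_SIZE</codeph> setting shown here controls the
target size of each resulting data file; the table names and the 256 MB figure are illustrative only:
</p>
<codeblock>create table converted_parquet like text_staging_table stored as parquet;
-- Optional: adjust the target size of each Parquet data file.
set PARQUET_FILE_SIZE=256m;
insert into converted_parquet select * from text_staging_table;
</codeblock>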
<p>
Avoid the <codeph>INSERT...VALUES</codeph> syntax for Parquet tables, because
<codeph>INSERT...VALUES</codeph> produces a separate tiny data file for each
<codeph>INSERT...VALUES</codeph> statement, and the strength of Parquet is in its handling of data
(compressing, parallelizing, and so on) in <ph rev="parquet_block_size">large</ph> chunks.
</p>
<p>
If you have one or more Parquet data files produced outside of Impala, you can quickly make the data
queryable through Impala by one of the following methods:
</p>
<ul>
<li>
The <codeph>LOAD DATA</codeph> statement moves a single data file or a directory full of data files into
the data directory for an Impala table. It does no validation or conversion of the data. The original
data files must be somewhere in HDFS, not the local filesystem.
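For example, with a hypothetical staging path and table name:
<codeblock>-- LOAD DATA moves the files; the source directory is left empty afterward.
load data inpath '/user/etl/staging/parquet_files' into table target_parquet_table;
</codeblock>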
</li>
<li>
The <codeph>CREATE TABLE</codeph> statement with the <codeph>LOCATION</codeph> clause creates a table
where the data continues to reside outside the Impala data directory. The original data files must be
somewhere in HDFS, not the local filesystem. For extra safety, if the data is intended to be long-lived
and reused by other applications, you can use the <codeph>CREATE EXTERNAL TABLE</codeph> syntax so that
the data files are not deleted by an Impala <codeph>DROP TABLE</codeph> statement.
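For example, with a hypothetical path and column definitions:
<codeblock>create external table external_parquet (id bigint, name string)
  stored as parquet
  location '/user/etl/destination';
</codeblock>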
</li>
<li>
If the Parquet table already exists, you can copy Parquet data files directly into it, then use the
<codeph>REFRESH</codeph> statement to make Impala recognize the newly added data. Remember to preserve
the block size of the Parquet data files by using the <codeph>hadoop distcp -pb</codeph> command rather
than a <codeph>-put</codeph> or <codeph>-cp</codeph> operation on the Parquet files. See
<xref href="#parquet_compression_multiple"/> for an example of this kind of operation.
</li>
</ul>
<note conref="../shared/impala_common.xml#common/restrictions_nonimpala_parquet"/>
<p>
Recent versions of Sqoop can produce Parquet output files using the <codeph>--as-parquetfile</codeph>
option.
</p>
<p conref="../shared/impala_common.xml#common/sqoop_timestamp_caveat"/>
<p>
If the data exists outside Impala and is in some other format, combine both of the preceding techniques.
First, use a <codeph>LOAD DATA</codeph> or <codeph>CREATE EXTERNAL TABLE ... LOCATION</codeph> statement to
bring the data into an Impala table that uses the appropriate file format. Then, use an
<codeph>INSERT...SELECT</codeph> statement to copy the data to the Parquet table, converting to Parquet
format as part of the process.
</p>
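<p>
For example, the following sketch defines a staging table over hypothetical comma-delimited text
files already in HDFS, then converts that data to Parquet with a single <codeph>CREATE TABLE AS
SELECT</codeph> statement. The path, table names, and columns are illustrative only:
</p>
<codeblock>create external table staging_csv (id bigint, name string)
  row format delimited fields terminated by ','
  stored as textfile
  location '/user/etl/raw_csv';

create table final_parquet stored as parquet
  as select id, name from staging_csv;
</codeblock>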
<p>
Loading data into Parquet tables is a memory-intensive operation, because the incoming data is buffered
until it reaches <ph rev="parquet_block_size">one data block</ph> in size, then that chunk of data is
organized and compressed in memory before being written out. The memory consumption can be larger when
inserting data into partitioned Parquet tables, because a separate data file is written for each
combination of partition key column values, potentially requiring several
<ph rev="parquet_block_size">large</ph> chunks to be manipulated in memory at once.
</p>
<p>
When inserting into a partitioned Parquet table, Impala redistributes the data among the nodes to reduce
memory consumption. You might still need to temporarily increase the memory dedicated to Impala during the
insert operation, or break up the load operation into several <codeph>INSERT</codeph> statements, or both.
</p>
<note>
All the preceding techniques assume that the data you are loading matches the structure of the destination
table, including column order, column names, and partition layout. To transform or reorganize the data,
start by loading the data into a Parquet table that matches the underlying structure of the data, then use
one of the table-copying techniques such as <codeph>CREATE TABLE AS SELECT</codeph> or <codeph>INSERT ...
SELECT</codeph> to reorder or rename columns, divide the data among multiple partitions, and so on. For
example, to take a single comprehensive Parquet data file and load it into a partitioned table, you would
use an <codeph>INSERT ... SELECT</codeph> statement with dynamic partitioning to let Impala create separate
data files with the appropriate partition values; for an example, see
<xref href="impala_insert.xml#insert"/>.
</note>
</conbody>
</concept>
<concept id="parquet_performance">
<title>Query Performance for Impala Parquet Tables</title>
<prolog>
<metadata>
<data name="Category" value="Performance"/>
</metadata>
</prolog>
<conbody>
<p>
Query performance for Parquet tables depends on the number of columns needed to process the
<codeph>SELECT</codeph> list and <codeph>WHERE</codeph> clauses of the query, the way data is divided into
<ph rev="parquet_block_size">large data files with block size equal to file size</ph>, the reduction in I/O
by reading the data for each column in compressed format, which data files can be skipped (for partitioned
tables), and the CPU overhead of decompressing the data for each column.
</p>
<p>
For example, the following is an efficient query for a Parquet table:
<codeblock>select avg(income) from census_data where state = 'CA';</codeblock>
The query processes only 2 columns out of a large number of total columns. If the table is partitioned by
the <codeph>STATE</codeph> column, it is even more efficient because the query only has to read and decode
1 column from each data file, and it can read only the data files in the partition directory for the state
<codeph>'CA'</codeph>, skipping the data files for all the other states, which will be physically located
in other directories.
</p>
<p>
The following is a relatively inefficient query for a Parquet table:
<codeblock>select * from census_data;</codeblock>
Impala would have to read the entire contents of each <ph rev="parquet_block_size">large</ph> data file,
and decompress the contents of each column for each row group, negating the I/O optimizations of the
column-oriented format. This query might still be faster for a Parquet table than a table with some other
file format, but it does not take advantage of the unique strengths of Parquet data files.
</p>
<p>
Impala can optimize queries on Parquet tables, especially join queries, better when statistics are
available for all the tables. Issue the <codeph>COMPUTE STATS</codeph> statement for each table after
substantial amounts of data are loaded into or appended to it. See
<xref href="impala_compute_stats.xml#compute_stats"/> for details.
</p>
<p rev="2.5.0">
The runtime filtering feature, available in <keyword keyref="impala25_full"/> and higher, works best with Parquet tables.
The per-row filtering aspect only applies to Parquet tables.
See <xref href="impala_runtime_filtering.xml#runtime_filtering"/> for details.
</p>
<p conref="../shared/impala_common.xml#common/s3_block_splitting"/>
<p rev="IMPALA-3909">
In <keyword keyref="impala29"/> and higher, Parquet files written by Impala include
embedded metadata specifying the minimum and maximum values for each column, within
each row group and each data page within the row group. Impala-written Parquet files
typically contain a single row group; a row group can contain many data pages.
Impala uses this information (currently, only the metadata for each row group)
when reading each Parquet data file during a query, to quickly determine whether each
row group within the file potentially includes any rows that match the conditions in the
<codeph>WHERE</codeph> clause. For example, if the column <codeph>X</codeph> within
a particular Parquet file has a minimum value of 1 and a maximum value of 100, then
a query including the clause <codeph>WHERE x &gt; 200</codeph> can quickly determine
that it is safe to skip that particular file, instead of scanning all the associated
column values. This optimization technique is especially effective for tables that
use the <codeph>SORT BY</codeph> clause for the columns most frequently checked in
<codeph>WHERE</codeph> clauses, because any <codeph>INSERT</codeph> operation on
such tables produces Parquet data files with relatively narrow ranges of column values
within each file.
</p>
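<p>
The following sketch illustrates the pattern with hypothetical table and column names: the
<codeph>SORT BY</codeph> clause keeps the range of values within each data file narrow, and the
embedded minimum and maximum values let Impala skip row groups that cannot satisfy the
<codeph>WHERE</codeph> condition:
</p>
<codeblock>create table events_sorted (event_id bigint, event_ts timestamp, payload string)
  sort by (event_id)
  stored as parquet;
-- Row groups whose maximum event_id falls below the threshold are skipped.
select count(*) from events_sorted where event_id &gt; 1000000;
</codeblock>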
</conbody>
<concept id="parquet_partitioning">
<title>Partitioning for Parquet Tables</title>
<conbody>
<p>
As explained in <xref href="impala_partitioning.xml#partitioning"/>, partitioning is an important
performance technique for Impala generally. This section explains some of the performance considerations
for partitioned Parquet tables.
</p>
<p>
The Parquet file format is ideal for tables containing many columns, where most queries only refer to a
small subset of the columns. As explained in <xref href="#parquet_data_files"/>, the physical layout of
Parquet data files lets Impala read only a small fraction of the data for many queries. The performance
benefits of this approach are amplified when you use Parquet tables in combination with partitioning.
Impala can skip the data files for certain partitions entirely, based on the comparisons in the
<codeph>WHERE</codeph> clause that refer to the partition key columns. For example, queries on
partitioned tables often analyze data for time intervals based on columns such as <codeph>YEAR</codeph>,
<codeph>MONTH</codeph>, and/or <codeph>DAY</codeph>, or for geographic regions. Remember that Parquet
data files use a <ph rev="parquet_block_size">large</ph> block size, so when deciding how finely to
partition the data, try to find a granularity where each partition contains
<ph rev="parquet_block_size">256 MB</ph> or more of data, rather than creating a large number of smaller
files split among many partitions.
</p>
<p>
Inserting into a partitioned Parquet table can be a resource-intensive operation, because each Impala
node could potentially be writing a separate data file to HDFS for each combination of different values
for the partition key columns. The large number of simultaneous open files could exceed the HDFS
<q>transceivers</q> limit. To avoid exceeding this limit, consider the following techniques:
</p>
<ul>
<li>
Load different subsets of data using separate <codeph>INSERT</codeph> statements with specific values
for the <codeph>PARTITION</codeph> clause, such as <codeph>PARTITION (year=2010)</codeph>.
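For example, with hypothetical table and column names:
<codeblock>insert into sales_parquet partition (year=2010)
  select customer_id, amount from sales_staging where year = 2010;
</codeblock>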
</li>
<li>
Increase the <q>transceivers</q> value for HDFS, sometimes spelled <q>xcievers</q> (sic). The property
value in the <filepath>hdfs-site.xml</filepath> configuration file is
<!-- Old name, now deprecated: <codeph>dfs.datanode.max.xcievers</codeph>. -->
<codeph>dfs.datanode.max.transfer.threads</codeph>. For example, if you were loading 12 years of data
partitioned by year, month, and day, even a value of 4096 might not be high enough. This
<xref keyref="hbase-hadoop-xceivers">blog post</xref> explores the considerations for setting this value
higher or lower, using HBase examples for illustration.
</li>
<li>
Use the <codeph>COMPUTE STATS</codeph> statement to collect
<xref href="impala_perf_stats.xml#perf_column_stats">column statistics</xref> on the source table from
which data is being copied, so that the Impala query can estimate the number of different values in the
partition key columns and distribute the work accordingly.
</li>
</ul>
</conbody>
</concept>
</concept>
<concept id="parquet_compression">
<title>Snappy and GZip Compression for Parquet Data Files</title>
<prolog>
<metadata>
<data name="Category" value="Snappy"/>
<data name="Category" value="Gzip"/>
<data name="Category" value="Compression"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="hidden">COMPRESSION_CODEC query option</indexterm>
When Impala writes Parquet data files using the <codeph>INSERT</codeph> statement, the underlying
compression is controlled by the <codeph>COMPRESSION_CODEC</codeph> query option. (Prior to Impala 2.0, the
query option name was <codeph>PARQUET_COMPRESSION_CODEC</codeph>.) The allowed values for this query option
are <codeph>snappy</codeph> (the default), <codeph>gzip</codeph>, and <codeph>none</codeph>. The option
value is not case-sensitive. If the option is set to an unrecognized value, all kinds of queries will fail
due to the invalid option setting, not just queries involving Parquet tables.
</p>
</conbody>
<concept id="parquet_snappy">
<title>Example of Parquet Table with Snappy Compression</title>
<conbody>
<p>
<indexterm audience="hidden">compression</indexterm>
By default, the underlying data files for a Parquet table are compressed with Snappy. The combination of
fast compression and decompression makes it a good choice for many data sets. To ensure Snappy
compression is used, for example after experimenting with other compression codecs, set the
<codeph>COMPRESSION_CODEC</codeph> query option to <codeph>snappy</codeph> before inserting the data:
</p>
<codeblock>[localhost:21000] &gt; create database parquet_compression;
[localhost:21000] &gt; use parquet_compression;
[localhost:21000] &gt; create table parquet_snappy like raw_text_data;
[localhost:21000] &gt; set COMPRESSION_CODEC=snappy;
[localhost:21000] &gt; insert into parquet_snappy select * from raw_text_data;
Inserted 1000000000 rows in 181.98s
</codeblock>
</conbody>
</concept>
<concept id="parquet_gzip">
<title>Example of Parquet Table with GZip Compression</title>
<conbody>
<p>
If you need more intensive compression (at the expense of more CPU cycles for uncompressing during
queries), set the <codeph>COMPRESSION_CODEC</codeph> query option to <codeph>gzip</codeph> before
inserting the data:
</p>
<codeblock>[localhost:21000] &gt; create table parquet_gzip like raw_text_data;
[localhost:21000] &gt; set COMPRESSION_CODEC=gzip;
[localhost:21000] &gt; insert into parquet_gzip select * from raw_text_data;
Inserted 1000000000 rows in 1418.24s
</codeblock>
</conbody>
</concept>
<concept id="parquet_none">
<title>Example of Uncompressed Parquet Table</title>
<conbody>
<p>
If your data compresses very poorly, or you want to avoid the CPU overhead of compression and
decompression entirely, set the <codeph>COMPRESSION_CODEC</codeph> query option to <codeph>none</codeph>
before inserting the data:
</p>
<codeblock>[localhost:21000] &gt; create table parquet_none like raw_text_data;
[localhost:21000] &gt; set COMPRESSION_CODEC=none;
[localhost:21000] &gt; insert into parquet_none select * from raw_text_data;
Inserted 1000000000 rows in 146.90s
</codeblock>
</conbody>
</concept>
<concept id="parquet_compression_examples">
<title>Examples of Sizes and Speeds for Compressed Parquet Tables</title>
<conbody>
<p>
Here are some examples showing differences in data sizes and query speeds for 1 billion rows of synthetic
data, compressed with each kind of codec. As always, run similar tests with realistic data sets of your
own. The actual compression ratios, and relative insert and query speeds, will vary depending on the
characteristics of the actual data.
</p>
<p>
In this case, switching from Snappy to GZip compression shrinks the data by an additional 40% or so,
while switching from Snappy compression to no compression expands the data also by about 40%:
</p>
<codeblock>$ hdfs dfs -du -h /user/hive/warehouse/parquet_compression.db
23.1 G /user/hive/warehouse/parquet_compression.db/parquet_snappy
13.5 G /user/hive/warehouse/parquet_compression.db/parquet_gzip
32.8 G /user/hive/warehouse/parquet_compression.db/parquet_none
</codeblock>
<p>
Because Parquet data files are typically <ph rev="parquet_block_size">large</ph>, each directory will
have a different number of data files and the row groups will be arranged differently.
</p>
<p>
At the same time, the less aggressive the compression, the faster the data can be decompressed. In this
case, using a table with a billion rows, a query that evaluates all the values for a particular column
runs faster with no compression than with Snappy compression, and faster with Snappy compression than
with Gzip compression. Query performance depends on several other factors, so as always, run your own
benchmarks with your own data to determine the ideal tradeoff between data size, CPU efficiency, and
speed of insert and query operations.
</p>
<codeblock>[localhost:21000] &gt; desc parquet_snappy;
Query finished, fetching results ...
+-----------+---------+---------+
| name | type | comment |
+-----------+---------+---------+
| id | int | |
| val | int | |
| zfill | string | |
| name | string | |
| assertion | boolean | |
+-----------+---------+---------+
Returned 5 row(s) in 0.14s
[localhost:21000] &gt; select avg(val) from parquet_snappy;
Query finished, fetching results ...
+-----------------+
| _c0 |
+-----------------+
| 250000.93577915 |
+-----------------+
Returned 1 row(s) in 4.29s
[localhost:21000] &gt; select avg(val) from parquet_gzip;
Query finished, fetching results ...
+-----------------+
| _c0 |
+-----------------+
| 250000.93577915 |
+-----------------+
Returned 1 row(s) in 6.97s
[localhost:21000] &gt; select avg(val) from parquet_none;
Query finished, fetching results ...
+-----------------+
| _c0 |
+-----------------+
| 250000.93577915 |
+-----------------+
Returned 1 row(s) in 3.67s
</codeblock>
</conbody>
</concept>
<concept id="parquet_compression_multiple">
<title>Example of Copying Parquet Data Files</title>
<conbody>
<p>
Here is a final example, to illustrate how the data files using the various compression codecs are all
compatible with each other for read operations. The metadata about the compression format is written into
each data file, and can be decoded during queries regardless of the <codeph>COMPRESSION_CODEC</codeph>
setting in effect at the time. In this example, we copy data files from the
<codeph>PARQUET_SNAPPY</codeph>, <codeph>PARQUET_GZIP</codeph>, and <codeph>PARQUET_NONE</codeph> tables
used in the previous examples, each containing 1 billion rows, all to the data directory of a new table
<codeph>PARQUET_EVERYTHING</codeph>. A couple of sample queries demonstrate that the new table now
contains 3 billion rows featuring a variety of compression codecs for the data files.
</p>
<p>
First, we create the table in Impala so that there is a destination directory in HDFS to put the data
files:
</p>
<codeblock>[localhost:21000] &gt; create table parquet_everything like parquet_snappy;
Query: create table parquet_everything like parquet_snappy
</codeblock>
<p>
Then in the shell, we copy the relevant data files into the data directory for this new table. Rather
than using <codeph>hdfs dfs -cp</codeph> as with typical files, we use <codeph>hadoop distcp -pb</codeph>
to ensure that the special <ph rev="parquet_block_size"> block size</ph> of the Parquet data files is
preserved.
</p>
<codeblock>$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_snappy \
/user/hive/warehouse/parquet_compression.db/parquet_everything
...<varname>MapReduce output</varname>...
$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_gzip \
/user/hive/warehouse/parquet_compression.db/parquet_everything
...<varname>MapReduce output</varname>...
$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_none \
/user/hive/warehouse/parquet_compression.db/parquet_everything
...<varname>MapReduce output</varname>...
</codeblock>
<p>
Back in the <cmdname>impala-shell</cmdname> interpreter, we use the <codeph>REFRESH</codeph> statement to
alert the Impala server to the new data files for this table, then we can run queries demonstrating that
the data files represent 3 billion rows, and the values for one of the numeric columns match what was in
the original smaller tables:
</p>
<codeblock>[localhost:21000] &gt; refresh parquet_everything;
Query finished, fetching results ...
Returned 0 row(s) in 0.32s
[localhost:21000] &gt; select count(*) from parquet_everything;
Query finished, fetching results ...
+------------+
| _c0 |
+------------+
| 3000000000 |
+------------+
Returned 1 row(s) in 8.18s
[localhost:21000] &gt; select avg(val) from parquet_everything;
Query finished, fetching results ...
+-----------------+
| _c0 |
+-----------------+
| 250000.93577915 |
+-----------------+
Returned 1 row(s) in 13.35s
</codeblock>
</conbody>
</concept>
</concept>
<concept rev="2.3.0" id="parquet_complex_types">
<title>Parquet Tables for Impala Complex Types</title>
<conbody>
<p>
In <keyword keyref="impala23_full"/> and higher, Impala supports the complex types
<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>.
See <xref href="impala_complex_types.xml#complex_types"/> for details.
Because these data types are currently supported only for the Parquet file format,
if you plan to use them, become familiar with the performance and storage aspects
of Parquet first.
</p>
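<p>
For instance, a hypothetical table definition and query using an <codeph>ARRAY</codeph> column
might look like the following. (Impala can create and query such tables; the complex-type data
itself is typically produced by Hive or other components.)
</p>
<codeblock>create table contacts (id bigint, name string, phones array&lt;string&gt;)
  stored as parquet;
-- Query each array element alongside the scalar columns.
select c.id, c.name, p.item from contacts c, c.phones p;
</codeblock>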
</conbody>
</concept>
<concept id="parquet_interop">
<title>Exchanging Parquet Data Files with Other Hadoop Components</title>
<prolog>
<metadata>
<data name="Category" value="Hadoop"/>
</metadata>
</prolog>
<conbody>
<p>
You can read and write Parquet data files from other <keyword keyref="distro"/> components.
See <xref keyref="cdh_ig_parquet"/> for details.
</p>
<!-- These couple of paragraphs reused in the release notes 'incompatible changes' section. -->
<!-- But conbodydiv tag too restrictive, can't have just paragraphs and codeblocks inside. -->
<!-- So I will physically copy the info for the time being. -->
<!-- <conbodydiv id="upgrade_parquet_metadata"> -->
<p>
Previously, it was not possible to create Parquet data through Impala and reuse that table within Hive. Now
that Parquet support is available for Hive, reusing existing Impala Parquet data files in Hive
requires updating the table metadata. Use the following command if you are already running Impala 1.1.1 or
higher:
</p>
<codeblock>ALTER TABLE <varname>table_name</varname> SET FILEFORMAT PARQUET;
</codeblock>
<p>
If you are running a level of Impala that is older than 1.1.1, do the metadata update through Hive:
</p>
<codeblock>ALTER TABLE <varname>table_name</varname> SET SERDE 'parquet.hive.serde.ParquetHiveSerDe';
ALTER TABLE <varname>table_name</varname> SET FILEFORMAT
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat";
</codeblock>
<p>
Impala 1.1.1 and higher can reuse Parquet data files created by Hive, without any action required.
</p>
<!-- </conbodydiv> -->
<p rev="2.2.0">
Impala supports the scalar data types that you can encode in a Parquet data file, but not composite or
nested types such as maps or arrays. In <keyword keyref="impala22_full"/> and higher, Impala can query Parquet data
files that include composite or nested types, as long as the query only refers to columns with scalar
types.
<!-- TK: could include an example here, but would require setup in Hive or Pig or something. -->
</p>
<p>
If you copy Parquet data files between nodes, or even between different directories on the same node, make
sure to preserve the block size by using the command <codeph>hadoop distcp -pb</codeph>. To verify that the
block size was preserved, issue the command <codeph>hdfs fsck -blocks
<varname>HDFS_path_of_impala_table_dir</varname></codeph> and check that the average block size is at or
near <ph rev="parquet_block_size">256 MB (or whatever other size is defined by the
<codeph>PARQUET_FILE_SIZE</codeph> query option)</ph>. (The <codeph>hadoop distcp</codeph> operation
typically leaves some directories behind, with names matching <filepath>_distcp_logs_*</filepath>, that you
can delete from the destination directory afterward.)
<!-- The Apache wiki page keeps disappearing, even though Google still points to it as of Nov. 11/2014. -->
<!-- Now there is a 'distcp2' guide: http://hadoop.apache.org/docs/r1.2.1/distcp2.html but I haven't tried that so let's play it safe for now and hide the link. -->
<!-- See the <xref href="http://hadoop.apache.org/docs/r0.19.0/distcp.html" scope="external" format="html">Hadoop DistCP Guide</xref> for details. -->
Issue the command <cmdname>hadoop distcp</cmdname> with no arguments to see a usage summary and
details about <cmdname>distcp</cmdname> command syntax.
</p>
<!-- Sample commands/output for when the 'distcp' business is expanded into a tutorial later.
<codeblock>$ hdfs fsck -blocks /user/impala/warehouse/parquet_compression.db/parquet_everything
Connecting to namenode via http://a1730.example.com:50070
FSCK started by jrussell (auth:SIMPLE) from /10.20.198.130 for path /user/impala/warehouse/parquet_compression.db/parquet_everything at Fri Aug 23 11:35:37 PDT 2013
............................................................................Status: HEALTHY
Total size: 74504481213 B
Total dirs: 1
Total files: 76
Total blocks (validated): 76 (avg. block size 980322121 B)
Minimally replicated blocks: 76 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 4
Number of racks: 1
FSCK ended at Fri Aug 23 11:35:37 PDT 2013 in 8 milliseconds
The filesystem under path '/user/impala/warehouse/parquet_compression.db/parquet_everything' is HEALTHY
</codeblock>
-->
<p conref="../shared/impala_common.xml#common/impala_parquet_encodings_caveat"/>
<p conref="../shared/impala_common.xml#common/parquet_tools_blurb"/>
</conbody>
</concept>
<concept id="parquet_data_files">
<title>How Parquet Data Files Are Organized</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
</metadata>
</prolog>
<conbody>
<p>
Although Parquet is a column-oriented file format, do not expect to find one data file for each column.
Parquet keeps all the data for a row within the same data file, to ensure that the columns for a row are
always available on the same node for processing. Instead, Parquet sets a large HDFS block size and a
matching maximum data file size, to ensure that I/O and network transfer requests apply to large batches of
data.
</p>
<p>
Within that data file, the data for a set of rows is rearranged so that all the values from the first
column are organized in one contiguous block, then all the values from the second column, and so on.
Putting the values from the same column next to each other lets Impala use effective compression techniques
on the values in that column.
</p>
<note>
<p>
Impala <codeph>INSERT</codeph> statements write Parquet data files using an HDFS block size
<ph rev="parquet_block_size">that matches the data file size</ph>, to ensure that each data file is
represented by a single HDFS block, and the entire file can be processed on a single node without
requiring any remote reads.
</p>
<p>
If you create Parquet data files outside of Impala, such as through a MapReduce or Pig job, ensure that
the HDFS block size is greater than or equal to the file size, so that the <q>one file per block</q>
relationship is maintained. Set the <codeph>dfs.block.size</codeph> or the <codeph>dfs.blocksize</codeph>
property large enough that each file fits within a single HDFS block, even if that size is larger than
the normal HDFS block size.
</p>
<p>
If the block size is reset to a lower value during a file copy, you will see lower performance for
queries involving those files, and the <codeph>PROFILE</codeph> statement will reveal that some I/O is
being done suboptimally, through remote reads. See
<xref href="impala_parquet.xml#parquet_compression_multiple"/> for an example showing how to preserve the
block size when copying Parquet data files.
</p>
</note>
<p>
When Impala retrieves or tests the data for a particular column, it opens all the data files, but only
reads the portion of each file containing the values for that column. The column values are stored
consecutively, minimizing the I/O required to process the values within a single column. If other columns
are named in the <codeph>SELECT</codeph> list or <codeph>WHERE</codeph> clauses, the data for all columns
in the same row is available within that same data file.
</p>
<p>
If an <codeph>INSERT</codeph> statement brings in less than <ph rev="parquet_block_size">one Parquet
block's worth</ph> of data, the resulting data file is smaller than ideal. Thus, if you do split up an ETL
job to use multiple <codeph>INSERT</codeph> statements, try to keep the volume of data for each
<codeph>INSERT</codeph> statement to approximately <ph rev="parquet_block_size">256 MB, or a multiple of
256 MB</ph>.
</p>
</conbody>
<concept id="parquet_encoding">
<title>RLE and Dictionary Encoding for Parquet Data Files</title>
<conbody>
<p>
Parquet uses some automatic compression techniques, such as run-length encoding (RLE) and dictionary
encoding, based on analysis of the actual data values. Once the data values are encoded in a compact
form, the encoded data can optionally be further compressed using a compression algorithm. Parquet data
files created by Impala can use Snappy, GZip, or no compression; the Parquet spec also allows LZO
compression, but currently Impala does not support LZO-compressed Parquet files.
</p>
<p>
RLE and dictionary encoding are compression techniques that Impala applies automatically to groups of
Parquet data values, in addition to any Snappy or GZip compression applied to the entire data files.
These automatic optimizations can save you time and planning that are normally needed for a traditional
data warehouse. For example, dictionary encoding reduces the need to create numeric IDs as abbreviations
for longer string values.
</p>
<p>
Run-length encoding condenses sequences of repeated data values. For example, if many consecutive rows
all contain the same value for a country code, those repeating values can be represented by the value
followed by a count of how many times it appears consecutively.
</p>
<p>
Dictionary encoding takes the different values present in a column, and represents each one in compact
2-byte form rather than the original value, which could be several bytes. (Additional compression is
applied to the compacted values, for extra space savings.) This type of encoding applies when the number
of different values for a column is less than 2**16 (65,536). It does not apply to columns of data type
<codeph>BOOLEAN</codeph>, which are already very short. <codeph>TIMESTAMP</codeph> columns sometimes have
a unique value for each row, in which case they can quickly exceed the 2**16 limit on distinct values.
The 2**16 limit on different values within a column is reset for each data file, so if several different
data files each contained 10,000 different city names, the city name column in each data file could still
be condensed using dictionary encoding.
</p>
</conbody>
</concept>
</concept>
<concept rev="1.4.0" id="parquet_compacting">
<title>Compacting Data Files for Parquet Tables</title>
<conbody>
<p>
If you reuse existing table structures or ETL processes for Parquet tables, you might encounter a <q>many
small files</q> situation, which is suboptimal for query efficiency. For example, statements like these
might produce inefficiently organized data files:
</p>
<codeblock>-- In an N-node cluster, each node produces a data file
-- for the INSERT operation. If you have less than
-- N GB of data to copy, some files are likely to be
-- much smaller than the <ph rev="parquet_block_size">default Parquet</ph> block size.
insert into parquet_table select * from text_table;
-- Even if this operation involves an overall large amount of data,
-- when split up by year/month/day, each partition might only
-- receive a small amount of data. Then the data files for
-- the partition might be divided between the N nodes in the cluster.
-- A multi-gigabyte copy operation might produce files of only
-- a few MB each.
insert into partitioned_parquet_table partition (year, month, day)
select year, month, day, url, referer, user_agent, http_code, response_time
from web_stats;
</codeblock>
<p>
Here are techniques to help you produce large data files in Parquet <codeph>INSERT</codeph> operations, and
to compact existing too-small data files:
</p>
<ul>
<li>
<p>
When inserting into a partitioned Parquet table, use statically partitioned <codeph>INSERT</codeph>
statements where the partition key values are specified as constant values. Ideally, use a separate
<codeph>INSERT</codeph> statement for each partition.
</p>
</li>
<li>
<p conref="../shared/impala_common.xml#common/num_nodes_tip"/>
</li>
<li>
<p>
Be prepared to reduce the number of partition key columns from what you are used to with traditional
analytic database systems.
</p>
</li>
<li>
<p>
Do not expect Impala-written Parquet files to fill up the entire Parquet block size. Impala estimates
on the conservative side when figuring out how much data to write to each Parquet file. Typically, the
amount of uncompressed data in memory is substantially reduced on disk by the compression and encoding
techniques in the Parquet file format.
<!--
Impala reserves <ph rev="parquet_block_size">1 GB</ph> of memory to buffer the data before writing,
but the actual data file might be smaller, in the hundreds of megabytes.
-->
The final data file size varies depending on the compressibility of the data. Therefore, it is not an
indication of a problem if <ph rev="parquet_block_size">256 MB</ph> of text data is turned into 2
Parquet data files, each less than <ph rev="parquet_block_size">256 MB</ph>.
</p>
</li>
<li>
<p>
If you accidentally end up with a table with many small data files, consider using one or more of the
preceding techniques and copying all the data into a new Parquet table, either through <codeph>CREATE
TABLE AS SELECT</codeph> or <codeph>INSERT ... SELECT</codeph> statements.
</p>
<p>
To avoid rewriting queries to change table names, you can adopt a convention of always running
important queries against a view. Changing the view definition immediately switches any subsequent
queries to use the new underlying tables:
</p>
<codeblock>create view production_table as select * from table_with_many_small_files;
-- CTAS or INSERT...SELECT all the data into a more efficient layout...
alter view production_table as select * from table_with_few_big_files;
select * from production_table where c1 = 100 and c2 &lt; 50 and ...;
</codeblock>
</li>
</ul>
</conbody>
</concept>
<concept rev="1.4.0" id="parquet_schema_evolution">
<title>Schema Evolution for Parquet Tables</title>
<conbody>
<p>
Schema evolution refers to using the statement <codeph>ALTER TABLE ... REPLACE COLUMNS</codeph> to change
the names, data types, or number of columns in a table. You can perform schema evolution for Parquet tables
as follows:
</p>
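<p>
For example, a minimal sketch of adding a trailing column (the table and column names are
hypothetical):
</p>
<codeblock>-- Original definition: (id INT, name STRING).
alter table schema_evolution_demo replace columns (id int, name string, extra_info string);
-- Rows from data files written before the change return NULL for extra_info.
select count(*) from schema_evolution_demo where extra_info is null;
</codeblock>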
<ul>
<li>
<p>
The Impala <codeph>ALTER TABLE</codeph> statement never changes any data files in the tables. From the
Impala side, schema evolution involves interpreting the same data files in terms of a new table
definition. Some types of schema changes make sense and are represented correctly. Other types of
changes cannot be represented in a sensible way, and produce special result values or conversion errors
during queries.
</p>
</li>
<li>
<p>
The <codeph>INSERT</codeph> statement always creates data using the latest table definition. You might
end up with data files with different numbers of columns or internal data representations if you do a
sequence of <codeph>INSERT</codeph> and <codeph>ALTER TABLE ... REPLACE COLUMNS</codeph> statements.
</p>
</li>
<li>
<p>
If you use <codeph>ALTER TABLE ... REPLACE COLUMNS</codeph> to define additional columns at the end,
when the original data files are used in a query, these final columns are considered to be all
<codeph>NULL</codeph> values.
</p>
</li>
<li>
<p>
If you use <codeph>ALTER TABLE ... REPLACE COLUMNS</codeph> to define fewer columns than before, when
the original data files are used in a query, the unused columns still present in the data file are
ignored.
</p>
</li>
<li>
<p>
Parquet represents the <codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, and <codeph>INT</codeph>
types the same internally, all stored in 32-bit integers.
</p>
<ul>
<li>
That means it is easy to promote a <codeph>TINYINT</codeph> column to <codeph>SMALLINT</codeph> or
<codeph>INT</codeph>, or a <codeph>SMALLINT</codeph> column to <codeph>INT</codeph>. The numbers are
represented exactly the same in the data file, and the columns being promoted would not contain any
out-of-range values.
</li>
<li>
<p>
If you change any of these column types to a smaller type, any values that are out-of-range for the
new type are returned incorrectly, typically as negative numbers.
</p>
</li>
<li>
<p>
You cannot change a <codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, or <codeph>INT</codeph>
column to <codeph>BIGINT</codeph>, or the other way around. Although the <codeph>ALTER
TABLE</codeph> succeeds, any attempt to query those columns results in conversion errors.
</p>
</li>
<li>
<p>
Any other type conversion for columns produces a conversion error during queries. For example,
<codeph>INT</codeph> to <codeph>STRING</codeph>, <codeph>FLOAT</codeph> to <codeph>DOUBLE</codeph>,
<codeph>TIMESTAMP</codeph> to <codeph>STRING</codeph>, <codeph>DECIMAL(9,0)</codeph> to
<codeph>DECIMAL(5,2)</codeph>, and so on.
</p>
</li>
</ul>
</li>
</ul>
<p rev="2.6.0 IMPALA-2835">
You might find that you have Parquet files where the columns do not line up in the same
order as in your Impala table. For example, you might have a Parquet file that was part of
a table with columns <codeph>C1,C2,C3,C4</codeph>, and now you want to reuse the same
Parquet file in a table with columns <codeph>C4,C2</codeph>. By default, Impala expects the
columns in the data file to appear in the same order as the columns defined for the table,
making it impractical to do some kinds of file reuse or schema evolution. In <keyword keyref="impala26_full"/>
and higher, the query option <codeph>PARQUET_FALLBACK_SCHEMA_RESOLUTION=name</codeph> lets Impala
resolve columns by name, and therefore handle out-of-order or extra columns in the data file.
For example:
<codeblock conref="../shared/impala_common.xml#common/parquet_fallback_schema_resolution_example"/>
See <xref href="impala_parquet_fallback_schema_resolution.xml#parquet_fallback_schema_resolution"/>
for more details.
</p>
</conbody>
</concept>
<concept id="parquet_data_types">
<title>Data Type Considerations for Parquet Tables</title>
<conbody>
<p>
The Parquet format defines a set of data types whose names differ from the names of the corresponding
Impala data types. If you are preparing Parquet files using other Hadoop components such as Pig or
MapReduce, you might need to work with the type names defined by Parquet. The following figure lists the
Parquet-defined types and the equivalent types in Impala.
</p>
<p>
<b>Primitive types:</b>
</p>
<codeblock>BINARY -&gt; STRING
BOOLEAN -&gt; BOOLEAN
DOUBLE -&gt; DOUBLE
FLOAT -&gt; FLOAT
INT32 -&gt; INT
INT64 -&gt; BIGINT
INT96 -&gt; TIMESTAMP
</codeblock>
<p>
<b>Logical types:</b>
</p>
<codeblock>BINARY + OriginalType UTF8 -&gt; STRING
BINARY + OriginalType ENUM -&gt; STRING
BINARY + OriginalType DECIMAL -&gt; DECIMAL
</codeblock>
<p rev="2.3.0">
<b>Complex types:</b>
</p>
<p rev="2.3.0">
For the complex types (<codeph>ARRAY</codeph>, <codeph>MAP</codeph>, and <codeph>STRUCT</codeph>)
available in <keyword keyref="impala23_full"/> and higher, Impala only supports queries
against those types in Parquet tables.
</p>
</conbody>
</concept>
</concept>