| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="file_formats"> |
| |
| <title>How Impala Works with Hadoop File Formats</title> |
| |
| <titlealts audience="PDF"> |
| |
| <navtitle>File Formats</navtitle> |
| |
| </titlealts> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Concepts"/> |
| <data name="Category" value="Hadoop"/> |
| <data name="Category" value="File Formats"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Data Analysts"/> |
| <data name="Category" value="Stub Pages"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| Impala supports several familiar file formats used in Apache Hadoop. Impala can load and |
| query data files produced by other Hadoop components such as Spark, and data files |
| produced by Impala can be used by other components also. The following sections discuss |
| the procedures, limitations, and performance considerations for using each file format |
| with Impala. |
| </p> |
| |
| <p> |
| The file format used for an Impala table has significant performance consequences. Some |
| file formats include compression support that affects the size of data on the disk and, |
| consequently, the amount of I/O and CPU resources required to deserialize data. The |
| amounts of I/O and CPU resources required can be a limiting factor in query performance |
| since querying often begins with moving and decompressing data. To reduce the potential |
| impact of this part of the process, data is often compressed. By compressing data, a |
| smaller total number of bytes are transferred from disk to memory. This reduces the amount |
| of time taken to transfer the data, but a tradeoff occurs when the CPU decompresses the |
| content. |
| </p> |
| |
| <p> |
| For the file formats that Impala cannot write to, create the table from within Impala |
| whenever possible and insert data using another component such as Hive or Spark. See the |
| table below for specific file formats. |
| </p> |
| |
| <p> |
| The following table lists the file formats that Impala supports. |
| </p> |
| |
| <table> |
| <tgroup cols="5"> |
| <colspec colname="1" colwidth="10*"/> |
| <colspec colname="2" colwidth="10*"/> |
| <colspec colname="3" colwidth="20*"/> |
| <colspec colname="4" colwidth="30*"/> |
| <colspec colname="5" colwidth="30*"/> |
| <thead> |
| <row> |
| <entry> |
| File Type |
| </entry> |
| <entry> |
| Format |
| </entry> |
| <entry> |
| Compression Codecs |
| </entry> |
| <entry> |
| Impala Can CREATE? |
| </entry> |
| <entry> |
| Impala Can INSERT? |
| </entry> |
| </row> |
| </thead> |
| <tbody> |
| <row id="parquet_support"> |
| <entry> |
| <xref href="impala_parquet.xml#parquet">Parquet</xref> |
| </entry> |
| <entry> |
| Structured |
| </entry> |
| <entry> Snappy, gzip, zstd, lz4; currently Snappy by default </entry> |
| <entry> |
| Yes. |
| </entry> |
| <entry> |
| Yes: <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD |
| DATA</codeph>, and query. |
| </entry> |
| </row> |
| <row id="orc_support"> |
| <entry> |
| <xref href="impala_orc.xml#orc">ORC</xref> |
| </entry> |
| <entry> |
| Structured |
| </entry> |
| <entry> |
| gzip, Snappy, LZO, LZ4; currently gzip by default |
| </entry> |
| <entry> |
| Yes, in Impala 2.12.0 and higher. |
| |
| <p> |
| The ORC support is an experimental feature since Impala-2.12. To disable it, set |
| <codeph>--enable_orc_scanner</codeph> to <codeph>false</codeph> when starting |
| the cluster. |
| </p> |
| </entry> |
| <entry> |
| No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the |
| right format, or use <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH |
| <varname>table_name</varname></codeph> in Impala. |
| </entry> |
| </row> |
| <row id="txtfile_support"> |
| <entry> |
| <xref href="impala_txtfile.xml#txtfile">Text</xref> |
| </entry> |
| <entry> |
| Unstructured |
| </entry> |
| <entry rev="2.0.0"> |
| LZO, gzip, bzip2, Snappy |
| </entry> |
| <entry> |
| Yes. For <codeph>CREATE TABLE</codeph> with no <codeph>STORED AS</codeph> clause, |
| the default file format is uncompressed text, with values separated by ASCII |
| <codeph>0x01</codeph> characters (typically represented as Ctrl-A). |
| </entry> |
| <entry> Yes if uncompressed.<p>No if compressed.</p><p>If LZO |
| compression is used, you must create the table and load data in |
| Hive.</p><p>If other kinds of compression are used, you must |
| load data through <codeph>LOAD DATA</codeph>, Hive, or manually |
| in HDFS. </p></entry> |
| </row> |
| <row id="avro_support"> |
| <entry> |
| <xref href="impala_avro.xml#avro">Avro</xref> |
| </entry> |
| <entry> |
| Structured |
| </entry> |
| <entry> |
| Snappy, gzip, deflate |
| </entry> |
| <entry rev="1.4.0"> |
| Yes, in Impala 1.4.0 and higher. In lower versions, create the table using Hive. |
| </entry> |
| <entry> |
| No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the |
| right format, or use <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH |
| <varname>table_name</varname></codeph> in Impala. |
| </entry> |
| <!-- <entry rev="2.0.0">Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.</entry> --> |
| </row> |
| <row id="rcfile_support"> |
| <entry> |
| <xref href="impala_rcfile.xml#rcfile">RCFile</xref> |
| </entry> |
| <entry> |
| Structured |
| </entry> |
| <entry> |
| Snappy, gzip, deflate, bzip2 |
| </entry> |
| <entry> |
| Yes. |
| </entry> |
| <entry> |
| No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the |
| right format, or use <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH |
| <varname>table_name</varname></codeph> in Impala. |
| </entry> |
| <!-- |
| <entry rev="2.0.0">Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.</entry> |
| --> |
| </row> |
| <row id="sequencefile_support"> |
| <entry> |
| <xref href="impala_seqfile.xml#seqfile">SequenceFile</xref> |
| </entry> |
| <entry> |
| Structured |
| </entry> |
| <entry> |
| Snappy, gzip, deflate, bzip2 |
| </entry> |
| <entry> |
| Yes. |
| </entry> |
| <entry> |
| No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the |
| right format, or use <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH |
| <varname>table_name</varname></codeph> in Impala. |
| </entry> |
| <!-- |
| <entry rev="2.0.0"> |
| Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD |
| DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive. |
| </entry> |
| --> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| |
| <p> |
| Impala supports the following compression codecs: |
| </p> |
| |
| <dl> |
| <dlentry rev="2.0.0"> |
| |
| <dt> |
| Snappy |
| </dt> |
| |
| <dd> |
| <p> |
| Recommended for its effective balance between compression ratio and decompression |
| speed. Snappy compression is very fast, but gzip provides greater space savings. |
| Supported for text, RC, Sequence, and Avro files in Impala 2.0 and higher. |
| </p> |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry rev="2.0.0"> |
| |
| <dt> |
| Gzip |
| </dt> |
| |
| <dd> |
| <p> |
| Recommended when achieving the highest level of compression (and therefore greatest |
| disk-space savings) is desired. Supported for text, RC, Sequence and Avro files in |
| Impala 2.0 and higher. |
| </p> |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry rev="2.0.0"> |
| |
| <dt> |
| Deflate |
| </dt> |
| |
| <dd> |
| <p> |
| Not supported for text files. |
| </p> |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry rev="2.0.0"> |
| |
| <dt> |
| Bzip2 |
| </dt> |
| |
| <dd> |
| <p> |
| Supported for text, RC, and Sequence files in Impala 2.0 and higher. |
| </p> |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry rev="2.0.0"> |
| |
| <dt> |
| LZO |
| </dt> |
| |
| <dd> |
| <p> |
| For text files only. Impala can query LZO-compressed text tables, but currently |
| cannot create them or insert data into them. You need to perform these operations in |
| Hive. |
| </p> |
| </dd> |
| |
| </dlentry> |
| <dlentry> |
| <dt>Zstd</dt> |
| <dd>For Parquet files only.</dd> |
| </dlentry> |
| |
| <dlentry> |
| <dt>Lz4</dt> |
| <dd>For Parquet files only.</dd> |
| </dlentry> |
| </dl> |
| |
| </conbody> |
| |
| <concept id="file_format_choosing"> |
| |
| <title>Choosing the File Format for a Table</title> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Planning"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| Different file formats and compression codecs work better for different data sets. |
| Choosing the proper format for your data can yield performance improvements. Use the |
| following considerations to decide which combination of file format and compression to |
| use for a particular table: |
| </p> |
| |
| <ul> |
| <li> |
| If you are working with existing files that are already in a supported file format, |
| use the same format for the Impala table if performance is acceptable. If the original |
| format does not yield acceptable query performance or resource usage, consider |
| creating a new Impala table with different file format or compression characteristics, |
| and doing a one-time conversion by rewriting the data to the new table. |
| </li> |
| |
| <li> |
| Text files are convenient to produce through many different tools, and are |
| human-readable for ease of verification and debugging. Those characteristics are why |
| text is the default format for an Impala <codeph>CREATE TABLE</codeph> statement. |
| However, when performance and resource usage are the primary considerations, use one |
| of the structured file formats that include metadata and built-in compression. |
| <p> |
| A typical workflow might involve bringing data into an Impala table by copying CSV |
| or TSV files into the appropriate data directory, and then using the <codeph>INSERT |
| ... SELECT</codeph> syntax to rewrite the data into a table using a different, more |
| compact file format. |
| </p> |
| </li> |
| </ul> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |