 <?xml version="1.0" encoding="UTF-8"?>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept id="file_formats">

   <title>How Impala Works with Hadoop File Formats</title>

   <titlealts audience="PDF">

     <navtitle>File Formats</navtitle>

   </titlealts>

   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
       <data name="Category" value="Concepts"/>
       <data name="Category" value="Hadoop"/>
       <data name="Category" value="File Formats"/>
       <data name="Category" value="Developers"/>
       <data name="Category" value="Data Analysts"/>
       <data name="Category" value="Stub Pages"/>
     </metadata>
   </prolog>

   <conbody>

     <p>
       Impala supports several familiar file formats used in Apache Hadoop. Impala can load and
       query data files produced by other Hadoop components such as Spark, and other components
       can also use data files produced by Impala. The following sections discuss the
       procedures, limitations, and performance considerations for using each file format
       with Impala.
     </p>

     <p>
       The file format used for an Impala table has significant performance consequences. Some
       file formats include compression support that affects the size of data on disk and,
       consequently, the amount of I/O and CPU resources required to deserialize data. These
       I/O and CPU costs can be a limiting factor in query performance, because a query often
       begins by moving and decompressing data. Compressing data reduces the number of bytes
       transferred from disk to memory, and therefore the time taken to transfer the data. The
       tradeoff is the CPU time spent decompressing the content.
     </p>

     <p>
       For the file formats that Impala cannot write to, create the table from within Impala
       whenever possible and insert data using another component such as Hive or Spark. See the
       table below for specific file formats.
     </p>

     <p>
       The following table lists the file formats that Impala supports.
     </p>

     <table>
       <tgroup cols="5">
         <colspec colname="1" colwidth="10*"/>
         <colspec colname="2" colwidth="10*"/>
         <colspec colname="3" colwidth="20*"/>
         <colspec colname="4" colwidth="30*"/>
         <colspec colname="5" colwidth="30*"/>
         <thead>
           <row>
             <entry>
               File Type
             </entry>
             <entry>
               Format
             </entry>
             <entry>
               Compression Codecs
             </entry>
             <entry>
               Impala Can CREATE?
             </entry>
             <entry>
               Impala Can INSERT?
             </entry>
           </row>
         </thead>
         <tbody>
           <row id="parquet_support">
             <entry>
               <xref href="impala_parquet.xml#parquet">Parquet</xref>
             </entry>
             <entry>
               Structured
             </entry>
             <entry> Snappy, gzip, zstd, lz4; currently Snappy by default </entry>
             <entry>
               Yes.
             </entry>
             <entry>
               Yes: <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD
               DATA</codeph>, and query.
             </entry>
           </row>
           <row id="orc_support">
             <entry>
               <xref href="impala_orc.xml#orc">ORC</xref>
             </entry>
             <entry>
               Structured
             </entry>
             <entry>
               gzip, Snappy, LZO, LZ4; currently gzip by default
             </entry>
             <entry>
               Yes, in Impala 2.12.0 and higher.

               <p>
                 ORC support has been an experimental feature since Impala 2.12. To disable it, set
                 <codeph>--enable_orc_scanner</codeph> to <codeph>false</codeph> when starting
                 the cluster.
               </p>
             </entry>
             <entry>
               No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the
               right format, or use <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH
               <varname>table_name</varname></codeph> in Impala.
             </entry>
           </row>
           <row id="txtfile_support">
             <entry>
               <xref href="impala_txtfile.xml#txtfile">Text</xref>
             </entry>
             <entry>
               Unstructured
             </entry>
             <entry rev="2.0.0">
               LZO, gzip, bzip2, Snappy
             </entry>
             <entry>
               Yes. For <codeph>CREATE TABLE</codeph> with no <codeph>STORED AS</codeph> clause,
               the default file format is uncompressed text, with values separated by ASCII
               <codeph>0x01</codeph> characters (typically represented as Ctrl-A).
             </entry>
             <entry>
               Yes if uncompressed.
               <p>No if compressed.</p>
               <p>If LZO compression is used, you must create the table and load data in Hive.</p>
               <p>If other kinds of compression are used, you must load data through
               <codeph>LOAD DATA</codeph>, through Hive, or by copying files manually into HDFS.</p>
             </entry>
           </row>
           <row id="avro_support">
             <entry>
               <xref href="impala_avro.xml#avro">Avro</xref>
             </entry>
             <entry>
               Structured
             </entry>
             <entry>
               Snappy, gzip, deflate
             </entry>
             <entry rev="1.4.0">
               Yes, in Impala 1.4.0 and higher. In lower versions, create the table using Hive.
             </entry>
             <entry>
               No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the
               right format, or use <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH
               <varname>table_name</varname></codeph> in Impala.
             </entry>
 <!-- <entry rev="2.0.0">Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.</entry> -->
           </row>
           <row id="rcfile_support">
             <entry>
               <xref href="impala_rcfile.xml#rcfile">RCFile</xref>
             </entry>
             <entry>
               Structured
             </entry>
             <entry>
               Snappy, gzip, deflate, bzip2
             </entry>
             <entry>
               Yes.
             </entry>
             <entry>
               No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the
               right format, or use <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH
               <varname>table_name</varname></codeph> in Impala.
             </entry>
 <!--
             <entry rev="2.0.0">Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.</entry>
             -->
           </row>
           <row id="sequencefile_support">
             <entry>
               <xref href="impala_seqfile.xml#seqfile">SequenceFile</xref>
             </entry>
             <entry>
               Structured
             </entry>
             <entry>
               Snappy, gzip, deflate, bzip2
             </entry>
             <entry>
               Yes.
             </entry>
             <entry>
               No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the
               right format, or use <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH
               <varname>table_name</varname></codeph> in Impala.
             </entry>
 <!--
             <entry rev="2.0.0">
               Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD
               DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.
             </entry>
 -->
           </row>
         </tbody>
       </tgroup>
     </table>
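
     <p>
       For the formats marked <q>No</q> in the last column, the pattern described in the table
       might look like the following sketch. The table names <codeph>events_rc</codeph> and
       <codeph>events_staging</codeph> are hypothetical, and the example assumes that Impala and
       Hive share the same metastore. Treat it as an illustration rather than a prescribed
       procedure.
     </p>

<codeblock>-- In Impala: create the table in a format that Impala can query but not write.
CREATE TABLE events_rc (id BIGINT, name STRING) STORED AS RCFILE;

-- In Hive: write the data. events_staging is a hypothetical table
-- that already holds the rows to copy.
INSERT INTO TABLE events_rc SELECT id, name FROM events_staging;

-- Back in Impala: pick up the data files that Hive wrote.
REFRESH events_rc;
SELECT COUNT(*) FROM events_rc;
</codeblock>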

     <p>
       Impala supports the following compression codecs:
     </p>

     <dl>
       <dlentry rev="2.0.0">

         <dt>
           Snappy
         </dt>

         <dd>
           <p>
             Recommended for its effective balance between compression ratio and decompression
             speed. Snappy compression is very fast, but gzip provides greater space savings.
             Supported for text, RC, Sequence, and Avro files in Impala 2.0 and higher.
           </p>
         </dd>

       </dlentry>

       <dlentry rev="2.0.0">

         <dt>
           Gzip
         </dt>

         <dd>
           <p>
             Recommended when the highest level of compression (and therefore the greatest
             disk-space savings) is desired. Supported for text, RC, Sequence, and Avro files in
             Impala 2.0 and higher.
           </p>
         </dd>

       </dlentry>

       <dlentry rev="2.0.0">

         <dt>
           Deflate
         </dt>

         <dd>
           <p>
             Not supported for text files.
           </p>
         </dd>

       </dlentry>

       <dlentry rev="2.0.0">

         <dt>
           Bzip2
         </dt>

         <dd>
           <p>
             Supported for text, RC, and Sequence files in Impala 2.0 and higher.
           </p>
         </dd>

       </dlentry>

       <dlentry rev="2.0.0">

         <dt>
           LZO
         </dt>

         <dd>
           <p>
             For text files only. Impala can query LZO-compressed text tables, but currently
             cannot create them or insert data into them. Perform these operations in Hive.
           </p>
         </dd>

       </dlentry>
       <dlentry>
         <dt>Zstd</dt>
         <dd>For Parquet files only.</dd>
       </dlentry>

       <dlentry>
         <dt>Lz4</dt>
         <dd>For Parquet files only.</dd>
       </dlentry>
     </dl>
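
     <p>
       As a hedged illustration of choosing a codec for a format that Impala can write, the
       sketch below assumes that the <codeph>COMPRESSION_CODEC</codeph> query option is available
       in your Impala release and that it governs the codec used for Parquet
       <codeph>INSERT</codeph> operations. The table names <codeph>logs_parquet</codeph> and
       <codeph>logs_text</codeph> are hypothetical.
     </p>

<codeblock>-- Hypothetical Parquet table; Snappy applies unless a codec is chosen.
CREATE TABLE logs_parquet (ts TIMESTAMP, msg STRING) STORED AS PARQUET;

-- Favor smaller files over faster decompression for this session.
SET COMPRESSION_CODEC=gzip;
INSERT INTO logs_parquet SELECT ts, msg FROM logs_text;

-- Return to Snappy for later statements in the session.
SET COMPRESSION_CODEC=snappy;
</codeblock>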

   </conbody>

   <concept id="file_format_choosing">

     <title>Choosing the File Format for a Table</title>

     <prolog>
       <metadata>
         <data name="Category" value="Planning"/>
       </metadata>
     </prolog>

     <conbody>

       <p>
         Different file formats and compression codecs work better for different data sets.
         Choosing the proper format for your data can yield performance improvements. Use the
         following considerations to decide which combination of file format and compression to
         use for a particular table:
       </p>

       <ul>
         <li>
           If you are working with existing files that are already in a supported file format,
           use the same format for the Impala table if performance is acceptable. If the original
           format does not yield acceptable query performance or resource usage, consider
           creating a new Impala table with different file format or compression characteristics,
           and doing a one-time conversion by rewriting the data to the new table.
         </li>

         <li>
           Text files are convenient to produce through many different tools, and are
           human-readable for ease of verification and debugging. Those characteristics are why
           text is the default format for an Impala <codeph>CREATE TABLE</codeph> statement.
           However, when performance and resource usage are the primary considerations, use one
           of the structured file formats that include metadata and built-in compression.
           <p>
             A typical workflow might involve bringing data into an Impala table by copying CSV
             or TSV files into the appropriate data directory, and then using the <codeph>INSERT
             ... SELECT</codeph> syntax to rewrite the data into a table using a different, more
             compact file format.
           </p>
         </li>
       </ul>
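
       <p>
         The text-to-compact-format workflow described above might be sketched as follows. The
         table names, columns, and delimiter are assumptions made for illustration; adjust them
         to match the actual data files.
       </p>

<codeblock>-- Hypothetical text table over comma-separated files.
CREATE TABLE sales_csv (id BIGINT, region STRING, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;

-- After copying the CSV files into the table's data directory,
-- make Impala aware of the new files.
REFRESH sales_csv;

-- Create a table with the same columns in a more compact format,
-- then rewrite the data into it.
CREATE TABLE sales_parquet LIKE sales_csv STORED AS PARQUET;
INSERT INTO sales_parquet SELECT * FROM sales_csv;
</codeblock>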

     </conbody>

   </concept>

 </concept>