<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="file_formats">
<title>How Impala Works with Hadoop File Formats</title>
<titlealts audience="PDF"><navtitle>File Formats</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Concepts"/>
<data name="Category" value="Hadoop"/>
<data name="Category" value="File Formats"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<!-- Like Impala Administration, this page has a fair bit of info already, but it could benefit from wiki-style embedded of intro text from those other pages. -->
<!-- In this case, that would also enable a good in-page TOC since there is already one lonely subtopic on this same page. -->
<data name="Category" value="Stub Pages"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="hidden">file formats</indexterm>
<indexterm audience="hidden">compression</indexterm>
Impala supports several familiar file formats used in Apache Hadoop. Impala can load and query data files
produced by other Hadoop components such as Pig or MapReduce, and data files produced by Impala can also
be used by other components. The following sections discuss the procedures, limitations, and performance
considerations for using each file format with Impala.
</p>
<p>
The file format used for an Impala table has significant performance consequences. Some file formats include
compression support that affects the size of data on disk and, consequently, the amount of I/O and CPU
resources required to read and deserialize it. Because querying often begins by moving and decompressing
data, I/O and CPU capacity can each become the limiting factor in query performance. Compressing data
reduces the total number of bytes transferred from disk to memory, which shortens transfer time; the
tradeoff is the CPU time spent decompressing the content.
</p>
<p>
Impala can query files encoded with most of the popular file formats and compression codecs used in Hadoop.
Impala can create tables in, and insert data into, some of these file formats but not others. For file
formats that Impala cannot write to, create the table in Hive, issue the <codeph>INVALIDATE METADATA <varname>table_name</varname></codeph>
statement in <codeph>impala-shell</codeph>, and then query the table through Impala; a sketch of this
workflow follows the table. File formats can be structured, in which case they may include metadata and
built-in compression. Supported formats include:
</p>
<table>
<title>File Format Support in Impala</title>
<tgroup cols="5">
<colspec colname="1" colwidth="10*"/>
<colspec colname="2" colwidth="10*"/>
<colspec colname="3" colwidth="20*"/>
<colspec colname="4" colwidth="30*"/>
<colspec colname="5" colwidth="30*"/>
<thead>
<row>
<entry>
File Type
</entry>
<entry>
Format
</entry>
<entry>
Compression Codecs
</entry>
<entry>
Impala Can CREATE?
</entry>
<entry>
Impala Can INSERT?
</entry>
</row>
</thead>
<tbody>
<row id="parquet_support">
<entry>
<xref href="impala_parquet.xml#parquet">Parquet</xref>
</entry>
<entry>
Structured
</entry>
<entry>
Snappy, gzip; currently Snappy by default
</entry>
<entry>
Yes.
</entry>
<entry>
Yes: <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, and query.
</entry>
</row>
<row id="txtfile_support">
<entry>
<xref href="impala_txtfile.xml#txtfile">Text</xref>
</entry>
<entry>
Unstructured
</entry>
<entry rev="2.0.0">
LZO, gzip, bzip2, Snappy
</entry>
<entry>
Yes. For <codeph>CREATE TABLE</codeph> with no <codeph>STORED AS</codeph> clause, the default file
format is uncompressed text, with values separated by ASCII <codeph>0x01</codeph> characters
(typically represented as Ctrl-A).
</entry>
<entry>
Yes: <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, and query.
If LZO compression is used, you must create the table and load the data in Hive. If other kinds of
compression are used, you must load the data through <codeph>LOAD DATA</codeph>, through Hive, or
by placing the files manually in HDFS.
<!-- <ph rev="2.0.0">Impala 2.0 and higher can write LZO-compressed text data; for earlier Impala releases, you must create the table and load data in Hive.</ph> -->
</entry>
</row>
<row id="avro_support">
<entry>
<xref href="impala_avro.xml#avro">Avro</xref>
</entry>
<entry>
Structured
</entry>
<entry>
Snappy, gzip, deflate, bzip2
</entry>
<entry rev="1.4.0">
Yes, in Impala 1.4.0 and higher. Before that, create the table using Hive.
</entry>
<entry>
No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
<codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
</entry>
<!-- <entry rev="2.0.0">Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.</entry> -->
</row>
<row id="rcfile_support">
<entry>
<xref href="impala_rcfile.xml#rcfile">RCFile</xref>
</entry>
<entry>
Structured
</entry>
<entry>
Snappy, gzip, deflate, bzip2
</entry>
<entry>
Yes.
</entry>
<entry>
No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
<codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
</entry>
<!--
<entry rev="2.0.0">Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.</entry>
-->
</row>
<row id="sequencefile_support">
<entry>
<xref href="impala_seqfile.xml#seqfile">SequenceFile</xref>
</entry>
<entry>
Structured
</entry>
<entry>
Snappy, gzip, deflate, bzip2
</entry>
<entry>Yes.</entry>
<entry>
No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
<codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
</entry>
<!--
<entry rev="2.0.0">
Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD
DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.
</entry>
-->
</row>
</tbody>
</tgroup>
</table>
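<p>
For example, for a format such as Avro that Impala cannot write to, the typical round trip is to create
and populate the table in Hive, then expose it to Impala. The following is a minimal sketch of that
workflow; the <codeph>avro_events</codeph> and <codeph>staging_events</codeph> table names and their
columns are hypothetical, and <codeph>STORED AS AVRO</codeph> assumes a Hive version that supports
that shorthand:
</p>
<codeblock>-- In Hive: create the Avro table and insert data into it.
CREATE TABLE avro_events (event_id BIGINT, event_name STRING)
  STORED AS AVRO;
INSERT INTO TABLE avro_events
  SELECT event_id, event_name FROM staging_events;

-- In impala-shell: make the Hive-created table visible, then query it.
INVALIDATE METADATA avro_events;
SELECT COUNT(*) FROM avro_events;

-- After subsequent INSERT statements in Hive, pick up the new data files.
REFRESH avro_events;</codeblock>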
<p rev="DOCS-1370">
Impala can only query the file formats listed in the preceding table.
In particular, Impala does not support the ORC file format.
</p>
<p>
Impala supports the following compression codecs:
</p>
<ul>
<li rev="2.0.0">
Snappy. Recommended for its effective balance between compression ratio and decompression speed. Snappy
compression is very fast, but gzip provides greater space savings. Supported for text files in Impala 2.0
and higher.
<!-- Not supported for text files. -->
</li>
<li rev="2.0.0">
Gzip. Recommended when you want the highest level of compression, and therefore the greatest
disk-space savings. Supported for text files in Impala 2.0 and higher.
</li>
<li>
Deflate. Not supported for text files.
</li>
<li rev="2.0.0">
Bzip2. Supported for text files in Impala 2.0 and higher.
<!-- Not supported for text files. -->
</li>
<li>
<p rev="2.0.0">
LZO, for text files only. Impala can query LZO-compressed text tables, but currently cannot create
them or insert data into them; perform these operations in Hive, as shown in the sketch after this list.
</p>
</li>
</ul>
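<p>
As a minimal sketch of the Hive side of that LZO workflow: the <codeph>lzo_logs</codeph> table name
and its column are hypothetical, and the input and output format class names assume the common
Hadoop-LZO packaging on your cluster:
</p>
<codeblock>-- In Hive: define a text table backed by LZO-compressed files.
CREATE TABLE lzo_logs (line STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
  STORED AS
    INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

-- In impala-shell: make the Hive-created table visible, then query it.
INVALIDATE METADATA lzo_logs;
SELECT COUNT(*) FROM lzo_logs;</codeblock>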
</conbody>
<concept id="file_format_choosing">
<title>Choosing the File Format for a Table</title>
<prolog>
<metadata>
<data name="Category" value="Planning"/>
</metadata>
</prolog>
<conbody>
<p>
Different file formats and compression codecs work better for different data sets. While Impala typically
provides performance gains regardless of file format, choosing the proper format for your data can yield
further performance improvements. Use the following considerations to decide which combination of file
format and compression to use for a particular table:
</p>
<ul>
<li>
If you are working with existing files that are already in a supported file format, use the same format
for the Impala table where practical. If the original format does not yield acceptable query performance
or resource usage, consider creating a new Impala table with different file format or compression
characteristics, and doing a one-time conversion by copying the data to the new table using the
<codeph>INSERT</codeph> statement. Depending on the file format, you might run the
<codeph>INSERT</codeph> statement in <codeph>impala-shell</codeph> or in Hive.
</li>
<li>
Text files are convenient to produce through many different tools, and are human-readable for ease of
verification and debugging. Those characteristics are why text is the default format for an Impala
<codeph>CREATE TABLE</codeph> statement. When performance and resource usage are the primary
considerations, use one of the other file formats and consider using compression. A typical workflow
might involve bringing data into an Impala table by copying CSV or TSV files into the appropriate data
directory, and then using the <codeph>INSERT ... SELECT</codeph> syntax to copy the data into a table
that uses a different, more compact file format, as illustrated in the sketch after this list.
</li>
<li>
If your architecture involves storing data to be queried in memory, do not compress the data. There
are no I/O savings because the data does not need to be moved from disk, but there is a CPU cost to
decompress the data.
</li>
</ul>
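<p>
The following is a minimal sketch of that conversion workflow, run in <codeph>impala-shell</codeph>;
the table names, columns, and HDFS path are hypothetical:
</p>
<codeblock>-- Text table whose data directory receives the raw CSV files.
CREATE TABLE csv_staging (id BIGINT, name STRING, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;

-- Move an existing HDFS file into the table's data directory.
LOAD DATA INPATH '/user/etl/incoming/sales.csv' INTO TABLE csv_staging;

-- One-time conversion into a more compact, query-friendly format.
CREATE TABLE sales_parquet STORED AS PARQUET
  AS SELECT * FROM csv_staging;</codeblock>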
</conbody>
</concept>
</concept>