| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="txtfile"> |
| |
| <title>Using Text Data Files with Impala Tables</title> |
| <titlealts audience="PDF"><navtitle>Text Data Files</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="File Formats"/> |
| <data name="Category" value="Tables"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Data Analysts"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="Cloudera">Text support in Impala</indexterm> |
| Impala supports using text files as the storage format for input and output. Text files are a |
| convenient format to use for interchange with other applications or scripts that produce or read delimited |
| text files, such as CSV or TSV with commas or tabs for delimiters. |
| </p> |
| |
| <p> |
| Text files are also very flexible in their column definitions. For example, a text file could have more |
| fields than the Impala table, and those extra fields are ignored during queries; or it could have fewer |
| fields than the Impala table, and those missing fields are treated as <codeph>NULL</codeph> values in |
| queries. You could have fields that were treated as numbers or timestamps in a table, then use <codeph>ALTER |
| TABLE ... REPLACE COLUMNS</codeph> to switch them to strings, or the reverse. |
| </p> |
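
    <p>
      For example, the following sketch (with an illustrative table name) switches a numeric column to a string
      and back again, reinterpreting the same underlying text data files each time:
    </p>

<codeblock>create table flexible_txt (id int, s string, n int);
-- Hypothetical example: reinterpret the third field as a string rather than a number.
alter table flexible_txt replace columns (id int, s string, n string);
-- Switch back to the original numeric interpretation.
alter table flexible_txt replace columns (id int, s string, n int);</codeblock>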
| |
| <table> |
| <title>Text Format Support in Impala</title> |
| <tgroup cols="5"> |
| <colspec colname="1" colwidth="10*"/> |
| <colspec colname="2" colwidth="10*"/> |
| <colspec colname="3" colwidth="20*"/> |
| <colspec colname="4" colwidth="30*"/> |
| <colspec colname="5" colwidth="30*"/> |
| <thead> |
| <row> |
| <entry> |
| File Type |
| </entry> |
| <entry> |
| Format |
| </entry> |
| <entry> |
| Compression Codecs |
| </entry> |
| <entry> |
| Impala Can CREATE? |
| </entry> |
| <entry> |
| Impala Can INSERT? |
| </entry> |
| </row> |
| </thead> |
| <tbody> |
| <row conref="impala_file_formats.xml#file_formats/txtfile_support"> |
| <entry/> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| |
| <p outputclass="toc inpage"/> |
| |
| </conbody> |
| |
| <concept id="text_performance"> |
| |
| <title>Query Performance for Impala Text Tables</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Performance"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| Data stored in text format is relatively bulky, and not as efficient to query as binary formats such as |
      Parquet. You typically use text tables with Impala if that is the format in which you receive the data and
      you do not have control over that process, or if you are a relatively new Hadoop user and are not familiar with
| techniques to generate files in other formats. (Because the default format for <codeph>CREATE |
| TABLE</codeph> is text, you might create your first Impala tables as text without giving performance much |
| thought.) Either way, look for opportunities to use more efficient file formats for the tables used in your |
| most performance-critical queries. |
| </p> |
| |
| <p> |
| For frequently queried data, you might load the original text data files into one Impala table, then use an |
| <codeph>INSERT</codeph> statement to transfer the data to another table that uses the Parquet file format; |
| the data is converted automatically as it is stored in the destination table. |
| </p> |
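
    <p>
      A minimal sketch of this conversion, with illustrative table names:
    </p>

<codeblock>-- Copy the text data into a new Parquet table, converting it as it is stored.
create table analysis_parquet stored as parquet
  as select * from text_staging;</codeblock>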
| |
| <p> |
      For more compact data, consider using LZO compression for the text files. LZO is the only splittable
      compression codec that Impala supports for text data, meaning that different nodes can work on different
      parts of the same compressed file in parallel. See <xref href="impala_txtfile.xml#lzo"/> for
| details. |
| </p> |
| |
| <p rev="2.0.0"> |
| In Impala 2.0 and later, you can also use text data compressed in the gzip, bzip2, or Snappy formats. |
| Because these compressed formats are not <q>splittable</q> in the way that LZO is, there is less |
| opportunity for Impala to parallelize queries on them. Therefore, use these types of compressed data only |
| for convenience if that is the format in which you receive the data. Prefer to use LZO compression for text |
| data if you have the choice, or convert the data to Parquet using an <codeph>INSERT ... SELECT</codeph> |
| statement to copy the original data into a Parquet table. |
| </p> |
| |
| <note rev="2.2.0"> |
| <p> |
        Impala supports bzip2 files created by the <codeph>bzip2</codeph> command, but not bzip2 files with
        multiple streams created by the <codeph>pbzip2</codeph> command. Impala decodes only the data from the
        first part of such files, leading to incomplete results.
| </p> |
| |
| <p> |
        The maximum size that Impala can accommodate for an individual bzip2 file is 1 GB (after uncompression).
| </p> |
| </note> |
| |
| <p conref="../shared/impala_common.xml#common/s3_block_splitting"/> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="text_ddl"> |
| |
| <title>Creating Text Tables</title> |
| |
| <conbody> |
| |
| <p> |
| <b>To create a table using text data files:</b> |
| </p> |
| |
| <p> |
| If the exact format of the text data files (such as the delimiter character) is not significant, use the |
| <codeph>CREATE TABLE</codeph> statement with no extra clauses at the end to create a text-format table. For |
| example: |
| </p> |
| |
| <codeblock>create table my_table(id int, s string, n int, t timestamp, b boolean); |
| </codeblock> |
| |
| <p> |
      The data files created by any <codeph>INSERT</codeph> statements will use the Ctrl-A character (hex 01) as
      the separator between column values.
| </p> |
| |
| <p> |
| A common use case is to import existing text files into an Impala table. The syntax is more verbose; the |
| significant part is the <codeph>FIELDS TERMINATED BY</codeph> clause, which must be preceded by the |
| <codeph>ROW FORMAT DELIMITED</codeph> clause. The statement can end with a <codeph>STORED AS |
| TEXTFILE</codeph> clause, but that clause is optional because text format tables are the default. For |
| example: |
| </p> |
| |
| <codeblock>create table csv(id int, s string, n int, t timestamp, b boolean) |
| row format delimited |
| <ph id="csv">fields terminated by ',';</ph> |
| |
| create table tsv(id int, s string, n int, t timestamp, b boolean) |
| row format delimited |
| <ph id="tsv">fields terminated by '\t';</ph> |
| |
| create table pipe_separated(id int, s string, n int, t timestamp, b boolean) |
| row format delimited |
| <ph id="psv">fields terminated by '|'</ph> |
| stored as textfile; |
| </codeblock> |
| |
| <p> |
| You can create tables with specific separator characters to import text files in familiar formats such as |
| CSV, TSV, or pipe-separated. You can also use these tables to produce output data files, by copying data |
| into them through the <codeph>INSERT ... SELECT</codeph> syntax and then extracting the data files from the |
| Impala data directory. |
| </p> |
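
    <p>
      For example, using the csv table defined above and a hypothetical source table named
      <codeph>other_table</codeph>:
    </p>

<codeblock>-- Each copied row becomes one comma-separated line in a data file
-- under the csv table's data directory.
insert into csv select id, s, n, t, b from other_table;</codeblock>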
| |
| <p rev="1.3.1"> |
      In Impala 1.3.1 and higher, you can specify the delimiter character <codeph>'\0'</codeph> to
      use the ASCII 0 (<codeph>nul</codeph>) character as the delimiter for text tables:
| </p> |
| |
| <codeblock rev="1.3.1">create table nul_separated(id int, s string, n int, t timestamp, b boolean) |
| row format delimited |
| fields terminated by '\0' |
| stored as textfile; |
| </codeblock> |
| |
| <note> |
| <p> |
| Do not surround string values with quotation marks in text data files that you construct. If you need to |
| include the separator character inside a field value, for example to put a string value with a comma |
| inside a CSV-format data file, specify an escape character on the <codeph>CREATE TABLE</codeph> statement |
| with the <codeph>ESCAPED BY</codeph> clause, and insert that character immediately before any separator |
| characters that need escaping. |
| </p> |
| </note> |
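
    <p>
      For example, the following sketch (with an illustrative table name) designates the backslash as the
      escape character, so that a backslash before a comma in the data file marks a literal comma within a
      field value:
    </p>

<codeblock>create table escaped_csv (id int, s string)
row format delimited
fields terminated by ','
escaped by '\\';

-- A data file line such as:   1,one\, two
-- is read back as the field value 'one, two'.</codeblock>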
| |
| <!-- |
| <p> |
| In the <cmdname>impala-shell</cmdname> interpreter, issue a command similar to: |
| </p> |
| |
| <codeblock>create table textfile_table (<varname>column_specs</varname>) stored as textfile; |
| /* If the STORED AS clause is omitted, the default is a TEXTFILE with hex 01 characters as the delimiter. */ |
| create table default_table (<varname>column_specs</varname>); |
| /* Some optional clauses in the CREATE TABLE statement apply only to Text tables. */ |
| create table csv_table (<varname>column_specs</varname>) row format delimited fields terminated by ','; |
| create table tsv_table (<varname>column_specs</varname>) row format delimited fields terminated by '\t'; |
| create table dos_table (<varname>column_specs</varname>) lines terminated by '\r';</codeblock> |
| --> |
| |
| <p> |
| Issue a <codeph>DESCRIBE FORMATTED <varname>table_name</varname></codeph> statement to see the details of |
| how each table is represented internally in Impala. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/complex_types_unsupported_filetype"/> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="text_data_files"> |
| |
| <title>Data Files for Text Tables</title> |
| |
| <conbody> |
| |
| <p> |
| When Impala queries a table with data in text format, it consults all the data files in the data directory |
| for that table, with some exceptions: |
| </p> |
| |
| <ul rev="2.2.0"> |
| <li> |
| <p> |
| Impala ignores any hidden files, that is, files whose names start with a dot or an underscore. |
| </p> |
| </li> |
| |
| <li> |
| <p conref="../shared/impala_common.xml#common/ignore_file_extensions"/> |
| </li> |
| |
| <li> |
| <!-- Copied and slightly adapted text from later on in this same file. Turn into a conref. --> |
| <p> |
| Impala uses suffixes to recognize when text data files are compressed text. For Impala to recognize the |
| compressed text files, they must have the appropriate file extension corresponding to the compression |
| codec, either <codeph>.gz</codeph>, <codeph>.bz2</codeph>, or <codeph>.snappy</codeph>. The extensions |
| can be in uppercase or lowercase. |
| </p> |
| </li> |
| |
        <li>
          <p>
            Otherwise, the file names are not significant. When you put files into an HDFS directory through ETL
            jobs, or point Impala to an existing HDFS directory with the <codeph>CREATE EXTERNAL TABLE</codeph>
            statement, or move data files under external control with the <codeph>LOAD DATA</codeph> statement,
            Impala preserves the original filenames.
          </p>
        </li>
| </ul> |
| |
| <p> |
      Data files produced by Impala <codeph>INSERT</codeph> statements are given unique names to
      avoid filename conflicts.
| </p> |
| |
| <p> |
| An <codeph>INSERT ... SELECT</codeph> statement produces one data file from each node that processes the |
| <codeph>SELECT</codeph> part of the statement. An <codeph>INSERT ... VALUES</codeph> statement produces a |
      separate data file for each statement; because Impala is more efficient at querying a small number of huge
| files than a large number of tiny files, the <codeph>INSERT ... VALUES</codeph> syntax is not recommended |
| for loading a substantial volume of data. If you find yourself with a table that is inefficient due to too |
| many small data files, reorganize the data into a few large files by doing <codeph>INSERT ... |
| SELECT</codeph> to transfer the data to a new table. |
| </p> |
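
    <p>
      A minimal sketch of this reorganization, with illustrative table names:
    </p>

<codeblock>-- Copying the data coalesces it into a smaller number of larger files.
CREATE TABLE compacted LIKE fragmented;
INSERT INTO compacted SELECT * FROM fragmented;
-- After verifying the copy, swap the tables.
ALTER TABLE fragmented RENAME TO fragmented_old;
ALTER TABLE compacted RENAME TO fragmented;</codeblock>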
| |
| <p> |
| <b>Special values within text data files:</b> |
| </p> |
| |
| <ul> |
| <li rev="1.4.0"> |
| <p> |
| Impala recognizes the literal strings <codeph>inf</codeph> for infinity and <codeph>nan</codeph> for |
| <q>Not a Number</q>, for <codeph>FLOAT</codeph> and <codeph>DOUBLE</codeph> columns. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Impala recognizes the literal string <codeph>\N</codeph> to represent <codeph>NULL</codeph>. When using |
| Sqoop, specify the options <codeph>--null-non-string</codeph> and <codeph>--null-string</codeph> to |
| ensure all <codeph>NULL</codeph> values are represented correctly in the Sqoop output files. By default, |
| Sqoop writes <codeph>NULL</codeph> values using the string <codeph>null</codeph>, which causes a |
| conversion error when such rows are evaluated by Impala. (A workaround for existing tables and data files |
| is to change the table properties through <codeph>ALTER TABLE <varname>name</varname> SET |
| TBLPROPERTIES("serialization.null.format"="null")</codeph>.) |
| </p> |
| </li> |
| |
| <li> |
| <p conref="../shared/impala_common.xml#common/skip_header_lines"/> |
| </li> |
| </ul> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="text_etl"> |
| |
| <title>Loading Data into Impala Text Tables</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="ETL"/> |
| <data name="Category" value="Ingest"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| To load an existing text file into an Impala text table, use the <codeph>LOAD DATA</codeph> statement and |
| specify the path of the file in HDFS. That file is moved into the appropriate Impala data directory. |
| </p> |
| |
| <p> |
| To load multiple existing text files into an Impala text table, use the <codeph>LOAD DATA</codeph> |
| statement and specify the HDFS path of the directory containing the files. All non-hidden files are moved |
| into the appropriate Impala data directory. |
| </p> |
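
    <p>
      For example, assuming a text table named <codeph>t1</codeph> and staging files already in HDFS (the
      paths are illustrative only):
    </p>

<codeblock>-- Move a single file into the table's data directory.
LOAD DATA INPATH '/user/etl/staging/data_1.csv' INTO TABLE t1;

-- Move all the non-hidden files from a staging directory.
LOAD DATA INPATH '/user/etl/staging' INTO TABLE t1;</codeblock>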
| |
| <p> |
| To convert data to text from any other file format supported by Impala, use a SQL statement such as: |
| </p> |
| |
| <codeblock>-- Text table with default delimiter, the hex 01 character. |
| CREATE TABLE text_table AS SELECT * FROM other_file_format_table; |
| |
| -- Text table with user-specified delimiter. Currently, you cannot specify |
| -- the delimiter as part of CREATE TABLE LIKE or CREATE TABLE AS SELECT. |
| -- But you can change an existing text table to have a different delimiter. |
| CREATE TABLE csv LIKE other_file_format_table; |
| ALTER TABLE csv SET SERDEPROPERTIES ('serialization.format'=',', 'field.delim'=','); |
| INSERT INTO csv SELECT * FROM other_file_format_table;</codeblock> |
| |
| <p> |
| This can be a useful technique to see how Impala represents special values within a text-format data file. |
| Use the <codeph>DESCRIBE FORMATTED</codeph> statement to see the HDFS directory where the data files are |
| stored, then use Linux commands such as <codeph>hdfs dfs -ls <varname>hdfs_directory</varname></codeph> and |
| <codeph>hdfs dfs -cat <varname>hdfs_file</varname></codeph> to display the contents of an Impala-created |
| text file. |
| </p> |
| |
| <p> |
| To create a few rows in a text table for test purposes, you can use the <codeph>INSERT ... VALUES</codeph> |
| syntax: |
| </p> |
| |
| <codeblock>INSERT INTO <varname>text_table</varname> VALUES ('string_literal',100,hex('hello world'));</codeblock> |
| |
| <note> |
| Because Impala and the HDFS infrastructure are optimized for multi-megabyte files, avoid the <codeph>INSERT |
| ... VALUES</codeph> notation when you are inserting many rows. Each <codeph>INSERT ... VALUES</codeph> |
| statement produces a new tiny file, leading to fragmentation and reduced performance. When creating any |
| substantial volume of new data, use one of the bulk loading techniques such as <codeph>LOAD DATA</codeph> |
| or <codeph>INSERT ... SELECT</codeph>. Or, <xref href="impala_hbase.xml#impala_hbase">use an HBase |
| table</xref> for single-row <codeph>INSERT</codeph> operations, because HBase tables are not subject to the |
| same fragmentation issues as tables stored on HDFS. |
| </note> |
| |
| <p> |
| When you create a text file for use with an Impala text table, specify <codeph>\N</codeph> to represent a |
| <codeph>NULL</codeph> value. For the differences between <codeph>NULL</codeph> and empty strings, see |
| <xref href="impala_literals.xml#null"/>. |
| </p> |
| |
| <p> |
      If a text file has fewer fields than the columns in the corresponding Impala table, the columns with no
      corresponding fields are set to <codeph>NULL</codeph> when the data in that file is read by an Impala query.
| </p> |
| |
| <p> |
| If a text file has more fields than the columns in the corresponding Impala table, the extra fields are |
| ignored when the data in that file is read by an Impala query. |
| </p> |
| |
| <p> |
| You can also use manual HDFS operations such as <codeph>hdfs dfs -put</codeph> or <codeph>hdfs dfs |
| -cp</codeph> to put data files in the data directory for an Impala table. When you copy or move new data |
| files into the HDFS directory for the Impala table, issue a <codeph>REFRESH |
| <varname>table_name</varname></codeph> statement in <cmdname>impala-shell</cmdname> before issuing the next |
| query against that table, to make Impala recognize the newly added files. |
| </p> |
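
    <p>
      A typical sequence, with illustrative paths and table names:
    </p>

<codeblock>$ hdfs dfs -put new_data.csv /user/hive/warehouse/mydb.db/t1/
$ impala-shell -q 'refresh t1'</codeblock>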
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="lzo"> |
| |
| <title>Using LZO-Compressed Text Files</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="LZO"/> |
| <data name="Category" value="Compression"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="Cloudera">LZO support in Impala</indexterm> |
| |
| <indexterm audience="Cloudera">compression</indexterm> |
| Impala supports using text data files that employ LZO compression. <ph rev="upstream">Cloudera</ph> recommends compressing |
| text data files when practical. Impala queries are usually I/O-bound; reducing the amount of data read from |
| disk typically speeds up a query, despite the extra CPU work to uncompress the data in memory. |
| </p> |
| |
| <p> |
      Impala can work with LZO-compressed text files, which are preferable to files compressed by other codecs
      because LZO-compressed files are <q>splittable</q>, meaning that different portions of a file can be
      uncompressed and processed independently by different nodes.
| </p> |
| |
| <p> |
| Impala does not currently support writing LZO-compressed text files. |
| </p> |
| |
| <p> |
| Because Impala can query LZO-compressed files but currently cannot write them, you use Hive to do the |
| initial <codeph>CREATE TABLE</codeph> and load the data, then switch back to Impala to run queries. For |
| instructions on setting up LZO compression for Hive <codeph>CREATE TABLE</codeph> and |
| <codeph>INSERT</codeph> statements, see |
| <xref href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO" scope="external" format="html">the |
| LZO page on the Hive wiki</xref>. Once you have created an LZO text table, you can also manually add |
| LZO-compressed text files to it, produced by the |
| <xref href="http://www.lzop.org/" scope="external" format="html"> <cmdname>lzop</cmdname></xref> command |
| or similar method. |
| </p> |
| |
| <section id="lzo_setup"> |
| |
| <title>Preparing to Use LZO-Compressed Text Files</title> |
| |
| <p> |
| Before using LZO-compressed tables in Impala, do the following one-time setup for each machine in the |
        cluster. Install the necessary packages using either the Cloudera public repository, a private
        repository you establish, or individually downloaded packages. You must do these steps manually,
        whether or not the cluster is
| managed by the Cloudera Manager product. |
| </p> |
| |
| <ol> |
| <li> |
| <b>Prepare your systems to work with LZO by downloading and installing the appropriate libraries:</b> |
| <p> |
| <b>On systems managed by Cloudera Manager using parcels:</b> |
| </p> |
| |
| <p> |
| See the setup instructions for the LZO parcel in the Cloudera Manager documentation for |
| <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_gpl_extras.html" scope="external" format="html">Cloudera |
| Manager 5</xref>. |
| </p> |
| |
| <p> |
| <b>On systems managed by Cloudera Manager using packages, or not managed by Cloudera Manager:</b> |
| </p> |
| |
| <p> |
| Download and install the appropriate file to each machine on which you intend to use LZO with Impala. |
| These files all come from the Cloudera |
| <xref href="https://archive.cloudera.com/gplextras/redhat/5/x86_64/gplextras/" scope="external" format="html">GPL |
            extras</xref> download site. Install the appropriate file for your operating system:
| </p> |
| <ul> |
| <li> |
| <xref href="https://archive.cloudera.com/gplextras/redhat/5/x86_64/gplextras/cloudera-gplextras4.repo" scope="external" format="repo">Red |
| Hat 5 repo file</xref> to <filepath>/etc/yum.repos.d/</filepath>. |
| </li> |
| |
| <li> |
| <xref href="https://archive.cloudera.com/gplextras/redhat/6/x86_64/gplextras/cloudera-gplextras4.repo" scope="external" format="repo">Red |
| Hat 6 repo file</xref> to <filepath>/etc/yum.repos.d/</filepath>. |
| </li> |
| |
| <li> |
| <xref href="https://archive.cloudera.com/gplextras/sles/11/x86_64/gplextras/cloudera-gplextras4.repo" scope="external" format="repo">SUSE |
| repo file</xref> to <filepath>/etc/zypp/repos.d/</filepath>. |
| </li> |
| |
| <li> |
| <xref href="https://archive.cloudera.com/gplextras/ubuntu/lucid/amd64/gplextras/cloudera.list" scope="external" format="list">Ubuntu |
| 10.04 list file</xref> to <filepath>/etc/apt/sources.list.d/</filepath>. |
| </li> |
| |
| <li> |
| <xref href="https://archive.cloudera.com/gplextras/ubuntu/precise/amd64/gplextras/cloudera.list" scope="external" format="list">Ubuntu |
| 12.04 list file</xref> to <filepath>/etc/apt/sources.list.d/</filepath>. |
| </li> |
| |
| <li> |
| <xref href="https://archive.cloudera.com/gplextras/debian/squeeze/amd64/gplextras/cloudera.list" scope="external" format="list">Debian |
| list file</xref> to <filepath>/etc/apt/sources.list.d/</filepath>. |
| </li> |
| </ul> |
| </li> |
| |
| <li> |
| <b>Configure Impala to use LZO:</b> |
| <p> |
| Use <b>one</b> of the following sets of commands to refresh your package management system's |
| repository information, install the base LZO support for Hadoop, and install the LZO support for |
| Impala. |
| </p> |
| |
| <note rev="1.2.0"> |
| <p rev="1.2.0"> |
| <ph rev="upstream">The name of the Hadoop LZO package changed between CDH 4 and CDH 5. In CDH 4, the package name was |
| <codeph>hadoop-lzo-cdh4</codeph>. In CDH 5 and higher, the package name is <codeph>hadoop-lzo</codeph>.</ph> |
| </p> |
| </note> |
| |
| <p> |
| <b>For RHEL/CentOS systems:</b> |
| </p> |
| <codeblock>$ sudo yum update |
| $ sudo yum install hadoop-lzo |
| $ sudo yum install impala-lzo</codeblock> |
| <p> |
| <b>For SUSE systems:</b> |
| </p> |
| <codeblock rev="1.2">$ sudo apt-get update |
| $ sudo zypper install hadoop-lzo |
| $ sudo zypper install impala-lzo</codeblock> |
| <p> |
| <b>For Debian/Ubuntu systems:</b> |
| </p> |
          <codeblock>$ sudo apt-get update
$ sudo apt-get install hadoop-lzo
$ sudo apt-get install impala-lzo</codeblock>
| <note> |
| <p> |
                The level of the <codeph>impala-lzo</codeph> package is closely tied to the version of Impala
| you use. Any time you upgrade Impala, re-do the installation command for |
| <codeph>impala-lzo</codeph> on each applicable machine to make sure you have the appropriate |
| version of that package. |
| </p> |
| </note> |
| </li> |
| |
| <li> |
| For <codeph>core-site.xml</codeph> on the client <b>and</b> server (that is, in the configuration |
| directories for both Impala and Hadoop), append <codeph>com.hadoop.compression.lzo.LzopCodec</codeph> |
| to the comma-separated list of codecs. For example: |
| <codeblock><property> |
| <name>io.compression.codecs</name> |
| <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec, |
| org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec, |
| org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzopCodec</value> |
| </property></codeblock> |
| <note> |
| <p> |
| If this is the first time you have edited the Hadoop <filepath>core-site.xml</filepath> file, note |
| that the <filepath>/etc/hadoop/conf</filepath> directory is typically a symbolic link, so the |
| canonical <filepath>core-site.xml</filepath> might reside in a different directory: |
| </p> |
| <codeblock>$ ls -l /etc/hadoop |
| total 8 |
| lrwxrwxrwx. 1 root root 29 Feb 26 2013 conf -> /etc/alternatives/hadoop-conf |
| lrwxrwxrwx. 1 root root 10 Feb 26 2013 conf.dist -> conf.empty |
| drwxr-xr-x. 2 root root 4096 Feb 26 2013 conf.empty |
| drwxr-xr-x. 2 root root 4096 Oct 28 15:46 conf.pseudo</codeblock> |
| <p> |
| If the <codeph>io.compression.codecs</codeph> property is missing from |
| <filepath>core-site.xml</filepath>, only add <codeph>com.hadoop.compression.lzo.LzopCodec</codeph> |
| to the new property value, not all the names from the preceding example. |
| </p> |
| </note> |
| </li> |
| |
| <li> |
| <!-- To do: |
| Link to CM or other doc where that procedure is explained. |
| Run through the procedure in CM and cite the relevant safety valves to put the XML into. |
| --> |
| Restart the MapReduce and Impala services. |
| </li> |
| </ol> |
| |
| </section> |
| |
| <section id="lzo_create_table"> |
| |
| <title>Creating LZO Compressed Text Tables</title> |
| |
| <p> |
| A table containing LZO-compressed text files must be created in Hive with the following storage clause: |
| </p> |
| |
| <codeblock>STORED AS |
| INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' |
| OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'</codeblock> |
| |
| <!-- |
| <p> |
| In Hive, when writing LZO compressed text tables, you must include the following specification: |
| </p> |
| |
| <codeblock>hive> SET hive.exec.compress.output=true; |
| hive> SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;</codeblock> |
| --> |
| |
| <p> |
| Also, certain Hive settings need to be in effect. For example: |
| </p> |
| |
| <codeblock>hive> SET mapreduce.output.fileoutputformat.compress=true; |
| hive> SET hive.exec.compress.output=true; |
| hive> SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec; |
| hive> CREATE TABLE lzo_t (s string) STORED AS |
| > INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' |
| > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'; |
| hive> INSERT INTO TABLE lzo_t SELECT col1, col2 FROM uncompressed_text_table;</codeblock> |
| |
| <p> |
| Once you have created LZO-compressed text tables, you can convert data stored in other tables (regardless |
| of file format) by using the <codeph>INSERT ... SELECT</codeph> statement in Hive. |
| </p> |
| |
| <p> |
| Files in an LZO-compressed table must use the <codeph>.lzo</codeph> extension. Examine the files in the |
| HDFS data directory after doing the <codeph>INSERT</codeph> in Hive, to make sure the files have the |
| right extension. If the required settings are not in place, you end up with regular uncompressed files, |
| and Impala cannot access the table because it finds data files with the wrong (uncompressed) format. |
| </p> |
| |
| <p> |
| After loading data into an LZO-compressed text table, index the files so that they can be split. You |
| index the files by running a Java class, |
| <codeph>com.hadoop.compression.lzo.DistributedLzoIndexer</codeph>, through the Linux command line. This |
| Java class is included in the <codeph>hadoop-lzo</codeph> package. |
| </p> |
| |
| <p> |
| Run the indexer using a command like the following: |
| </p> |
| |
<codeblock>$ hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer /hdfs_location_of_table/</codeblock>
| |
| <note> |
| If the path of the JAR file in the preceding example is not recognized, do a <cmdname>find</cmdname> |
| command to locate <filepath>hadoop-lzo-*-gplextras.jar</filepath> and use that path. |
| </note> |
| |
| <p> |
| Indexed files have the same name as the file they index, with the <codeph>.index</codeph> extension. If |
| the data files are not indexed, Impala queries still work, but the queries read the data from remote |
| DataNodes, which is very inefficient. |
| </p> |
| |
| <!-- To do: |
| Here is the place to put some end-to-end examples once I have it |
| all working. Or at least the final step with Impala queries. |
| Have never actually gotten this part working yet due to mismatches |
| between the levels of Impala and LZO packages. |
| --> |
| |
| <p> |
| Once the LZO-compressed tables are created, and data is loaded and indexed, you can query them through |
| Impala. As always, the first time you start <cmdname>impala-shell</cmdname> after creating a table in |
| Hive, issue an <codeph>INVALIDATE METADATA</codeph> statement so that Impala recognizes the new table. |
| (In Impala 1.2 and higher, you only have to run <codeph>INVALIDATE METADATA</codeph> on one node, rather |
| than on all the Impala nodes.) |
| </p> |
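
      <p>
        For example, after the Hive-side <codeph>CREATE TABLE</codeph> and <codeph>INSERT</codeph> for the
        <codeph>lzo_t</codeph> table shown above, you might run the following in
        <cmdname>impala-shell</cmdname>:
      </p>

<codeblock>INVALIDATE METADATA lzo_t;
SELECT COUNT(*) FROM lzo_t;</codeblock>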
| |
| </section> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept rev="2.0.0" id="gzip"> |
| |
| <title>Using gzip, bzip2, or Snappy-Compressed Text Files</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Snappy"/> |
| <data name="Category" value="Gzip"/> |
| <data name="Category" value="Compression"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="Cloudera">gzip support in Impala</indexterm> |
| |
| <indexterm audience="Cloudera">bzip2 support in Impala</indexterm> |
| |
| <indexterm audience="Cloudera">Snappy support in Impala</indexterm> |
| |
| <indexterm audience="Cloudera">compression</indexterm> |
| In Impala 2.0 and later, Impala supports using text data files that employ gzip, bzip2, or Snappy |
| compression. These compression types are primarily for convenience within an existing ETL pipeline rather |
| than maximum performance. Although it requires less I/O to read compressed text than the equivalent |
| uncompressed text, files compressed by these codecs are not <q>splittable</q> and therefore cannot take |
| full advantage of the Impala parallel query capability. |
| </p> |
| |
| <p> |
| As each bzip2- or Snappy-compressed text file is processed, the node doing the work reads the entire file |
| into memory and then decompresses it. Therefore, the node must have enough memory to hold both the |
| compressed and uncompressed data from the text file. The memory required to hold the uncompressed data is |
| difficult to estimate in advance, potentially causing problems on systems with low memory limits or with |
| resource management enabled. <ph rev="2.1.0">In Impala 2.1 and higher, this memory overhead is reduced for |
| gzip-compressed text files. The gzipped data is decompressed as it is read, rather than all at once.</ph> |
| </p> |
| |
| <!-- |
| <p> |
| Impala can work with LZO-compressed text files but not GZip-compressed text. |
| LZO-compressed files are <q>splittable</q>, meaning that different portions of a file |
| can be uncompressed and processed independently by different nodes. GZip-compressed |
| files are not splittable, making them unsuitable for Impala-style distributed queries. |
| </p> |
| --> |
| |
| <p> |
| To create a table to hold gzip, bzip2, or Snappy-compressed text, create a text table with no special |
| compression options. Specify the delimiter and escape character if required, using the <codeph>ROW |
| FORMAT</codeph> clause. |
| </p> |
| |
| <p> |
| Because Impala can query compressed text files but currently cannot write them, produce the compressed text |
      files outside Impala and use the <codeph>LOAD DATA</codeph> statement or manual HDFS commands to move them to
| the appropriate Impala data directory. (Or, you can use <codeph>CREATE EXTERNAL TABLE</codeph> and point |
| the <codeph>LOCATION</codeph> attribute at a directory containing existing compressed text files.) |
| </p> |
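
    <p>
      For example, the following sketch points an external text table at a directory that already contains
      gzipped CSV files (the path is illustrative only):
    </p>

<codeblock>create external table csv_gz (a string, b string, c string)
row format delimited fields terminated by ','
location '/user/etl/compressed_csv';</codeblock>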
| |
| <p> |
| For Impala to recognize the compressed text files, they must have the appropriate file extension |
| corresponding to the compression codec, either <codeph>.gz</codeph>, <codeph>.bz2</codeph>, or |
| <codeph>.snappy</codeph>. The extensions can be in uppercase or lowercase. |
| </p> |
| |
| <p> |
      The following example shows how you can create a regular text table and put different kinds of compressed
      and uncompressed files into it; Impala automatically recognizes and decompresses each one based on its
      file extension:
| </p> |
| |
| <codeblock>create table csv_compressed (a string, b string, c string) |
| row format delimited fields terminated by ","; |
| |
| insert into csv_compressed values |
| ('one - uncompressed', 'two - uncompressed', 'three - uncompressed'), |
| ('abc - uncompressed', 'xyz - uncompressed', '123 - uncompressed'); |
| ...make equivalent .gz, .bz2, and .snappy files and load them into same table directory... |
| |
| select * from csv_compressed; |
| +--------------------+--------------------+----------------------+ |
| | a | b | c | |
| +--------------------+--------------------+----------------------+ |
| | one - snappy | two - snappy | three - snappy | |
| | one - uncompressed | two - uncompressed | three - uncompressed | |
| | abc - uncompressed | xyz - uncompressed | 123 - uncompressed | |
| | one - bz2 | two - bz2 | three - bz2 | |
| | abc - bz2 | xyz - bz2 | 123 - bz2 | |
| | one - gzip | two - gzip | three - gzip | |
| | abc - gzip | xyz - gzip | 123 - gzip | |
| +--------------------+--------------------+----------------------+ |
| |
| $ hdfs dfs -ls 'hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/'; |
| ...truncated for readability... |
| 75 hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/csv_compressed.snappy |
| 79 hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/csv_compressed_bz2.csv.bz2 |
| 80 hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/csv_compressed_gzip.csv.gz |
| 116 hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/dd414df64d67d49b_data.0. |
| </codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept audience="Cloudera" id="txtfile_data_types"> |
| |
| <title>Data Type Considerations for Text Tables</title> |
| |
| <conbody> |
| |
| <p></p> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |