| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="orc"> |
| |
| <title>Using the ORC File Format with Impala Tables</title> |
| <titlealts audience="PDF"><navtitle>ORC Data Files</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <!-- <data name="Category" value="ORC"/> --> |
| <data name="Category" value="File Formats"/> |
| <data name="Category" value="Tables"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Data Analysts"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="hidden">ORC support in Impala</indexterm> Impala supports using ORC data |
| files. By default, ORC reads are enabled in Impala 3.4.0 and higher.</p> |
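  <p>
    ORC scanning is controlled cluster-wide by the <codeph>enable_orc_scanner</codeph> startup flag for
    the <codeph>impalad</codeph> daemons. If you ever need to turn ORC reads off, for example while
    troubleshooting, a minimal sketch (assuming you can restart the daemons with additional flags) is:
  </p>

  <codeblock>impalad --enable_orc_scanner=false <varname>other_startup_flags</varname></codeblock>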
| |
| <table> |
| <title>ORC Format Support in Impala</title> |
| <tgroup cols="5"> |
| <colspec colname="1" colwidth="10*"/> |
| <colspec colname="2" colwidth="10*"/> |
| <colspec colname="3" colwidth="20*"/> |
| <colspec colname="4" colwidth="30*"/> |
| <colspec colname="5" colwidth="30*"/> |
| <thead> |
| <row> |
| <entry> |
| File Type |
| </entry> |
| <entry> |
| Format |
| </entry> |
| <entry> |
| Compression Codecs |
| </entry> |
| <entry> |
| Impala Can CREATE? |
| </entry> |
| <entry> |
| Impala Can INSERT? |
| </entry> |
| </row> |
| </thead> |
| <tbody> |
| <row conref="impala_file_formats.xml#file_formats/orc_support"> |
| <entry/> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| |
| <p outputclass="toc inpage"/> |
| </conbody> |
| |
| <concept id="orc_create"> |
| |
| <title>Creating ORC Tables and Loading Data</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="ETL"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| If you do not have an existing data file to use, begin by creating one in the appropriate format. |
| </p> |
| |
| <p> |
| <b>To create an ORC table:</b> |
| </p> |
| |
| <p> |
| In the <codeph>impala-shell</codeph> interpreter, issue a command similar to: |
| </p> |
| |
| <codeblock>CREATE TABLE orc_table (<varname>column_specs</varname>) STORED AS ORC;</codeblock> |
| |
| <p> |
| Because Impala can query some kinds of tables that it cannot currently write to, after creating tables of |
| certain file formats, you might use the Hive shell to load the data. See |
| <xref href="impala_file_formats.xml#file_formats"/> for details. After loading data into a table through |
      Hive or another mechanism outside of Impala, issue a <codeph>REFRESH <varname>table_name</varname></codeph>
| statement the next time you connect to the Impala node, before querying the table, to make Impala recognize |
| the new data. |
| </p> |
| |
| <p> |
| For example, here is how you might create some ORC tables in Impala (by specifying the columns |
| explicitly, or cloning the structure of another table), load data through Hive, and query them through |
| Impala: |
| </p> |
| |
| <codeblock>$ impala-shell -i localhost |
| [localhost:21000] default> CREATE TABLE orc_table (x INT) STORED AS ORC; |
| [localhost:21000] default> CREATE TABLE orc_clone LIKE some_other_table STORED AS ORC; |
| [localhost:21000] default> quit; |
| |
| $ hive |
| hive> INSERT INTO TABLE orc_table SELECT x FROM some_other_table; |
| 3 Rows loaded to orc_table |
| Time taken: 4.169 seconds |
| hive> quit; |
| |
| $ impala-shell -i localhost |
| [localhost:21000] default> SELECT * FROM orc_table; |
| Fetched 0 row(s) in 0.11s |
| [localhost:21000] default> -- Make Impala recognize the data loaded through Hive; |
| [localhost:21000] default> REFRESH orc_table; |
| [localhost:21000] default> SELECT * FROM orc_table; |
| +---+ |
| | x | |
| +---+ |
| | 1 | |
| | 2 | |
| | 3 | |
| +---+ |
| Fetched 3 row(s) in 0.11s</codeblock> |
| |
| </conbody> |
| </concept> |
| |
| <concept id="orc_compression"> |
| |
| <title>Enabling Compression for ORC Tables</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Snappy"/> |
| <data name="Category" value="Compression"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="hidden">compression</indexterm> |
      By default, ORC tables are compressed with zlib (referred to as Deflate in Impala). You might
      instead use Snappy or LZO compression on existing tables to strike a different balance between
      compression ratio and decompression speed. As of Hive 1.1.0, the supported compression codecs
      for ORC tables are NONE, ZLIB, SNAPPY, and LZO.
      For example, to enable Snappy compression, you would specify
      the following additional settings when loading data through the Hive shell:
| </p> |
| |
| <codeblock>hive> SET hive.exec.compress.output=true; |
| hive> SET orc.compress=SNAPPY; |
| hive> INSERT OVERWRITE TABLE <varname>new_table</varname> SELECT * FROM <varname>old_table</varname>;</codeblock> |
| |
| <p> |
| If you are converting partitioned tables, you must complete additional steps. In such a case, specify |
| additional settings similar to the following: |
| </p> |
| |
| <codeblock>hive> CREATE TABLE <varname>new_table</varname> (<varname>your_cols</varname>) PARTITIONED BY (<varname>partition_cols</varname>) STORED AS <varname>new_format</varname>; |
| hive> SET hive.exec.dynamic.partition.mode=nonstrict; |
| hive> SET hive.exec.dynamic.partition=true; |
| hive> INSERT OVERWRITE TABLE <varname>new_table</varname> PARTITION(<varname>comma_separated_partition_cols</varname>) SELECT * FROM <varname>old_table</varname>;</codeblock> |
| |
| <p> |
      Remember that Hive does not require you to specify the format of the source table. Consider the
      case of converting an existing table to a Snappy-compressed ORC table. Combining the settings
      outlined previously, you would issue commands similar to the following:
| </p> |
| |
| <codeblock>hive> CREATE TABLE tbl_orc (int_col INT, string_col STRING) STORED AS ORC; |
| hive> SET hive.exec.compress.output=true; |
| hive> SET orc.compress=SNAPPY; |
| hive> SET hive.exec.dynamic.partition.mode=nonstrict; |
| hive> SET hive.exec.dynamic.partition=true; |
| hive> INSERT OVERWRITE TABLE tbl_orc SELECT * FROM tbl;</codeblock> |
| |
| <p> |
| To complete a similar process for a table that includes partitions, you would specify settings similar to |
| the following: |
| </p> |
| |
| <codeblock>hive> CREATE TABLE tbl_orc (int_col INT, string_col STRING) PARTITIONED BY (year INT) STORED AS ORC; |
| hive> SET hive.exec.compress.output=true; |
| hive> SET orc.compress=SNAPPY; |
| hive> SET hive.exec.dynamic.partition.mode=nonstrict; |
| hive> SET hive.exec.dynamic.partition=true; |
| hive> INSERT OVERWRITE TABLE tbl_orc PARTITION(year) SELECT * FROM tbl;</codeblock> |
| |
| <note> |
| <p> |
| The compression type is specified in the following command: |
| </p> |
| <codeblock>SET orc.compress=SNAPPY;</codeblock> |
| <p> |
        You could specify an alternative codec such as <codeph>NONE</codeph>, <codeph>ZLIB</codeph>, or <codeph>LZO</codeph> here.
| </p> |
| </note> |
| </conbody> |
| </concept> |
| |
| <concept id="rcfile_performance"> |
| |
| <title>Query Performance for Impala ORC Tables</title> |
| |
| <conbody> |
| |
| <p> |
      In general, expect query performance with ORC tables to be
      faster than with tables using text data, but slower than with
      Parquet tables, because Impala includes a number of optimizations specific to Parquet.
| See <xref href="impala_parquet.xml#parquet"/> |
| for information about using the Parquet file format for |
| high-performance analytic queries. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/s3_block_splitting"/> |
| |
| </conbody> |
| </concept> |
| |
| <concept id="orc_data_types"> |
| |
| <title>Data Type Considerations for ORC Tables</title> |
| |
| <conbody> |
| |
| <p> |
| The ORC format defines a set of data types whose names differ from the names of the corresponding |
| Impala data types. If you are preparing ORC files using other Hadoop components such as Pig or |
      MapReduce, you might need to work with the type names defined by ORC. The following list shows the
      ORC-defined types and the equivalent types in Impala.
| </p> |
| |
| <p> |
| <b>Primitive types:</b> |
| </p> |
| |
| <codeblock>BINARY -> STRING |
| BOOLEAN -> BOOLEAN |
| DOUBLE -> DOUBLE |
| FLOAT -> FLOAT |
| TINYINT -> TINYINT |
| SMALLINT -> SMALLINT |
| INT -> INT |
| BIGINT -> BIGINT |
| TIMESTAMP -> TIMESTAMP |
| DATE (not supported) |
| </codeblock> |
| |
| <p> |
| <b>Complex types:</b> |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/complex_types_short_intro"/> |
| |
| </conbody> |
| </concept> |
| </concept> |