| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="impala_hbase"> |
| |
| <title id="hbase">Using Impala to Query HBase Tables</title> |
| <titlealts audience="PDF"><navtitle>HBase Tables</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="HBase"/> |
| <data name="Category" value="Querying"/> |
| <data name="Category" value="Data Analysts"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Tables"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="hidden">HBase</indexterm> |
| You can use Impala to query HBase tables. This is useful for accessing any of |
| your existing HBase tables via SQL and performing analytics over them. HDFS |
| and Kudu tables are preferred over HBase for analytic workloads and offer |
| superior performance. Kudu supports efficient inserts, updates and deletes |
| of small numbers of rows and can replace HBase for most analytics-oriented use |
| cases. See <xref href="impala_kudu.xml#impala_kudu"/> for information on using |
| Impala with Kudu. |
| </p> |
| |
| <p> |
| From the perspective of an Impala user coming from an RDBMS background, HBase is a kind of key-value store |
| where the value consists of multiple fields. The key is mapped to one column in the Impala table, and the |
| various fields of the value are mapped to the other columns in the Impala table. |
| </p> |
| |
| <p> |
| For background information on HBase, see <xref keyref="upstream_hbase_docs"/>. |
| </p> |
| |
| <p outputclass="toc inpage"/> |
| </conbody> |
| |
| <concept id="hbase_using"> |
| |
| <title>Overview of Using HBase with Impala</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Concepts"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| When you use Impala with HBase: |
| </p> |
| |
| <ul> |
| <li> |
| You create the tables on the Impala side using the Hive shell, because the Impala <codeph>CREATE |
| TABLE</codeph> statement currently does not support custom SerDes and some other syntax needed for these |
| tables: |
| <ul> |
| <li> |
| You designate it as an HBase table using the <codeph>STORED BY |
| 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'</codeph> clause on the Hive <codeph>CREATE |
| TABLE</codeph> statement. |
| </li> |
| |
| <li> |
| You map these specially created tables to corresponding tables that exist in HBase, with the clause |
| <codeph>TBLPROPERTIES("hbase.table.name" = "<varname>table_name_in_hbase</varname>")</codeph> on the |
| Hive <codeph>CREATE TABLE</codeph> statement. |
| </li> |
| |
| <li> |
| See <xref href="#hbase_queries"/> for a full example. |
| </li> |
| </ul> |
| </li> |
| |
| <li> |
| You define the column corresponding to the HBase row key as a string with the <codeph>#string</codeph> |
| keyword, or map it to a <codeph>STRING</codeph> column. |
| </li> |
| |
| <li> |
| Because Impala and Hive share the same metastore database, once you create the table in Hive, you can |
| query or insert into it through Impala. (After creating a new table through Hive, issue the |
| <codeph>INVALIDATE METADATA</codeph> statement in <cmdname>impala-shell</cmdname> to make Impala aware of |
| the new table.) |
| </li> |
| |
| <li> You issue queries against the Impala tables. For efficient queries, |
| use the <codeph>WHERE</codeph> clause to find a single key value or a |
| range of key values wherever practical, by testing the Impala column |
| corresponding to the HBase row key. Avoid queries that do full-table |
| scans, which are efficient for regular Impala tables but inefficient |
| in HBase. </li> |
| </ul> |
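| <p> |
| As a minimal sketch of the steps above, the following statements show the general shape of the Hive-side |
| table definition and the Impala-side refresh. The table name <codeph>hbase_mapped_example</codeph>, its |
| columns, and the column family <codeph>cf</codeph> are hypothetical, chosen only for illustration; |
| <codeph>table_name_in_hbase</codeph> stands for an HBase table that already exists. |
| </p> |
| <codeblock>-- In the Hive shell. All table, column, and column family names are illustrative assumptions. |
| -- The id column is mapped to the HBase row key; val is mapped to column "val" in family "cf". |
| CREATE EXTERNAL TABLE hbase_mapped_example ( |
| id STRING, |
| val STRING) |
| STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' |
| WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val") |
| TBLPROPERTIES ("hbase.table.name" = "table_name_in_hbase"); |
| |
| -- In impala-shell, make Impala aware of the table created through Hive. |
| INVALIDATE METADATA hbase_mapped_example; |
| |
| -- Look up a single row through the column mapped to the row key. |
| SELECT val FROM hbase_mapped_example WHERE id = 'some_key'; |
| </codeblock> |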
| |
| <p> |
| To work with an HBase table from Impala, ensure that the <codeph>impala</codeph> user has read/write |
| privileges for the HBase table, using the <codeph>GRANT</codeph> command in the HBase shell. For details |
| about HBase security, see <xref keyref="upstream_hbase_security_docs"/>. |
| </p> |
| </conbody> |
| </concept> |
| |
| <concept id="hbase_config"> |
| |
| <title>Configuring HBase for Use with Impala</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Configuring"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| HBase works out of the box with Impala. There is no mandatory configuration needed to use these two |
| components together. |
| </p> |
| |
| <p> |
| To avoid delays if HBase is unavailable during Impala startup or after an <codeph>INVALIDATE |
| METADATA</codeph> statement, set timeout values similar to the following in |
| <filepath>/etc/impala/conf/hbase-site.xml</filepath>: |
| </p> |
| |
| <codeblock><property> |
| <name>hbase.client.retries.number</name> |
| <value>3</value> |
| </property> |
| <property> |
| <name>hbase.rpc.timeout</name> |
| <value>3000</value> |
| </property> |
| </codeblock> |
| |
| </conbody> |
| </concept> |
| |
| <concept id="hbase_types"> |
| |
| <title>Supported Data Types for HBase Columns</title> |
| |
| <conbody> |
| |
| <p> |
| To understand how Impala column data types are mapped to fields in HBase, you should have some background |
| knowledge about HBase first. You set up the mapping by running the <codeph>CREATE TABLE</codeph> statement |
| in the Hive shell. See |
| <xref href="https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration" scope="external" format="html">the |
| Hive wiki</xref> for a starting point, and <xref href="#hbase_queries"/> for examples. |
| </p> |
| |
| <p> |
| HBase works as a kind of <q>bit bucket</q>, in the sense that HBase does not enforce any typing for the |
| key or value fields. All the type enforcement is done on the Impala side. |
| </p> |
| |
| <p> For best performance of Impala queries against HBase tables, most |
| queries should perform comparisons in the <codeph>WHERE</codeph> clause |
| against the column that corresponds to the HBase row key. When creating |
| the table through the Hive shell, use the <codeph>STRING</codeph> data |
| type for the column that corresponds to the HBase row key. Impala can |
| translate predicates (through operators such as <codeph>=</codeph>, |
| <codeph><</codeph>, and <codeph>BETWEEN</codeph>) against this |
| column into fast lookups in HBase, but this optimization (<q>predicate |
| pushdown</q>) only works when that column is defined as |
| <codeph>STRING</codeph>. </p> |
| |
| <p> |
| Starting in Impala 1.1, Impala also supports reading and writing to columns that are defined in the Hive |
| <codeph>CREATE TABLE</codeph> statement using binary data types, represented in the Hive table definition |
| using the <codeph>#binary</codeph> keyword, often abbreviated as <codeph>#b</codeph>. Defining numeric |
| columns as binary can reduce the overall data volume in the HBase tables. You should still define the |
| column that corresponds to the HBase row key as a <codeph>STRING</codeph>, to allow fast lookups using |
| that column. |
| </p> |
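| <p> |
| The following sketch shows how such a binary mapping might look in the Hive table definition. The table, |
| column, and column family names are hypothetical. The <codeph>#b</codeph> suffix in the |
| <codeph>hbase.columns.mapping</codeph> property marks the numeric columns as binary, while the row key |
| remains a <codeph>STRING</codeph>. |
| </p> |
| <codeblock>-- In the Hive shell. Names are illustrative assumptions, not part of any shipped schema. |
| CREATE EXTERNAL TABLE hbase_binary_example ( |
| id STRING, |
| page_views BIGINT, |
| score DOUBLE) |
| STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' |
| WITH SERDEPROPERTIES ( |
| "hbase.columns.mapping" = ":key,stats:page_views#b,stats:score#b") |
| TBLPROPERTIES ("hbase.table.name" = "hbase_binary_example"); |
| </codeblock> |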
| </conbody> |
| </concept> |
| |
| <concept id="hbase_performance"> |
| |
| <title>Performance Considerations for the Impala-HBase Integration</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Performance"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| To understand the performance characteristics of SQL queries against data stored in HBase, you should have |
| some background knowledge about how HBase interacts with SQL-oriented systems first. See |
| <xref href="https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration" scope="external" format="html">the |
| Hive wiki</xref> for a starting point; because Impala shares the same metastore database as Hive, the |
| information about mapping columns from Hive tables to HBase tables is generally applicable to Impala too. |
| </p> |
| |
| <p> |
| Impala uses the HBase client API via Java Native Interface (JNI) to query data stored in HBase. This |
| querying does not read HFiles directly. The extra communication overhead makes it important to choose |
| carefully which data to store in HBase and which in HDFS, and to construct queries that retrieve the |
| HBase data efficiently: |
| </p> |
| |
| <ul> |
| <li> |
| Use HBase tables for queries that return a single row or a small range of rows, |
| not for queries that scan the entire table. (If a query references |
| an HBase table and has no <codeph>WHERE</codeph> clause referencing that table, |
| that is a strong indicator that it is an inefficient query for an HBase table.) |
| </li> |
| |
| <li> |
| HBase may offer acceptable performance for storing small dimension tables where |
| the table is small enough that executing a full table scan for every query is |
| efficient enough. However, Kudu is almost always a superior alternative for |
| storing dimension tables. HDFS tables are also appropriate for dimension |
| tables that do not need to support update queries, delete queries or insert |
| queries with small numbers of rows. |
| </li> |
| </ul> |
| |
| <p> |
| Query predicates are applied to row keys as start and stop keys, thereby limiting the scope of a particular |
| lookup. If row keys are not mapped to string columns, then ordering is typically incorrect and comparison |
| operations do not work. For example, if row keys are not mapped to string columns, evaluating for greater |
| than (>) or less than (<) cannot be completed. |
| </p> |
| |
| <p> |
| Predicates on non-key columns can be sent to HBase to scan as <codeph>SingleColumnValueFilters</codeph>, |
| providing some performance gains. In such a case, HBase returns fewer rows than if those same predicates |
| were applied using Impala. While there is some improvement, it is not as great as when start and stop rows are |
| used. This is because the number of rows that HBase must examine is not limited as it is when start and |
| stop rows are used. As long as the row key predicate only applies to a single row, HBase will locate and |
| return that row. Conversely, if a non-key predicate is used, even if it only applies to a single row, HBase |
| must still scan the entire table to find the correct result. |
| </p> |
| |
| <example> |
| |
| <title>Interpreting EXPLAIN Output for HBase Queries</title> |
| |
| <p> |
| For example, here are some queries against the following Impala table, which is mapped to an HBase table. |
| The examples show excerpts from the output of the <codeph>EXPLAIN</codeph> statement, demonstrating what |
| things to look for to indicate an efficient or inefficient query against an HBase table. |
| </p> |
| |
| <p> |
| The first column (<codeph>cust_id</codeph>) was specified as the key column in the <codeph>CREATE |
| EXTERNAL TABLE</codeph> statement; for performance, it is important to declare this column as |
| <codeph>STRING</codeph>. Other columns, such as <codeph>BIRTH_YEAR</codeph> and |
| <codeph>NEVER_LOGGED_ON</codeph>, are also declared as <codeph>STRING</codeph>, rather than their |
| <q>natural</q> types of <codeph>INT</codeph> or <codeph>BOOLEAN</codeph>, because Impala can push |
| predicates on <codeph>STRING</codeph> columns down to HBase. For comparison, we leave one column, |
| <codeph>YEAR_REGISTERED</codeph>, as <codeph>INT</codeph> to show that filtering on this column is |
| inefficient. |
| </p> |
| |
| <codeblock>describe hbase_table; |
| Query: describe hbase_table |
| +-----------------------+--------+---------+ |
| | name | type | comment | |
| +-----------------------+--------+---------+ |
| | cust_id | <b>string</b> | | |
| | birth_year | <b>string</b> | | |
| | never_logged_on | <b>string</b> | | |
| | private_email_address | string | | |
| | year_registered | <b>int</b> | | |
| +-----------------------+--------+---------+ |
| </codeblock> |
| |
| <p> |
| The best case for performance involves a single row lookup using an equality comparison on the column |
| defined as the row key: |
| </p> |
| |
| <codeblock>explain select count(*) from hbase_table where cust_id = 'some_user@example.com'; |
| +------------------------------------------------------------------------------------+ |
| | Explain String | |
| +------------------------------------------------------------------------------------+ |
| | Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | |
| | WARNING: The following tables are missing relevant table and/or column statistics. | |
| | hbase.hbase_table | |
| | | |
| | 03:AGGREGATE [MERGE FINALIZE] | |
| | | output: sum(count(*)) | |
| | | | |
| | 02:EXCHANGE [PARTITION=UNPARTITIONED] | |
| | | | |
| | 01:AGGREGATE | |
| | | output: count(*) | |
| | | | |
| <b>| 00:SCAN HBASE [hbase.hbase_table] |</b> |
| <b>| start key: some_user@example.com |</b> |
| <b>| stop key: some_user@example.com\0 |</b> |
| +------------------------------------------------------------------------------------+ |
| </codeblock> |
| |
| <p> |
| Another type of efficient query involves a range lookup on the row key column, using SQL operators such |
| as greater than (or equal), less than (or equal), or <codeph>BETWEEN</codeph>. This example also includes |
| an equality test on a non-key column; because that column is a <codeph>STRING</codeph>, Impala can let |
| HBase perform that test, indicated by the <codeph>hbase filters:</codeph> line in the |
| <codeph>EXPLAIN</codeph> output. Doing the filtering within HBase is more efficient than transmitting all |
| the data to Impala and doing the filtering on the Impala side. |
| </p> |
| |
| <codeblock>explain select count(*) from hbase_table where cust_id between 'a' and 'b' |
| and never_logged_on = 'true'; |
| +------------------------------------------------------------------------------------+ |
| | Explain String | |
| +------------------------------------------------------------------------------------+ |
| ... |
| <!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | |
| | WARNING: The following tables are missing relevant table and/or column statistics. | |
| | hbase.hbase_table | |
| | | |
| | 03:AGGREGATE [MERGE FINALIZE] | |
| | | output: sum(count(*)) | |
| | | | |
| | 02:EXCHANGE [PARTITION=UNPARTITIONED] | |
| | | |--> |
| | 01:AGGREGATE | |
| | | output: count(*) | |
| | | | |
| <b>| 00:SCAN HBASE [hbase.hbase_table] |</b> |
| <b>| start key: a |</b> |
| <b>| stop key: b\0 |</b> |
| <b>| hbase filters: cols:never_logged_on EQUAL 'true' |</b> |
| +------------------------------------------------------------------------------------+ |
| </codeblock> |
| |
| <p> |
| The query is less efficient if Impala has to evaluate any of the predicates, because Impala must scan the |
| entire HBase table. Impala can only push down predicates to HBase for columns declared as |
| <codeph>STRING</codeph>. This example tests a column declared as <codeph>INT</codeph>, and the |
| <codeph>predicates:</codeph> line in the <codeph>EXPLAIN</codeph> output indicates that the test is |
| performed after the data is transmitted to Impala. |
| </p> |
| |
| <codeblock>explain select count(*) from hbase_table where year_registered = 2010; |
| +------------------------------------------------------------------------------------+ |
| | Explain String | |
| +------------------------------------------------------------------------------------+ |
| ... |
| <!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | |
| | WARNING: The following tables are missing relevant table and/or column statistics. | |
| | hbase.hbase_table | |
| | | |
| | 03:AGGREGATE [MERGE FINALIZE] | |
| | | output: sum(count(*)) | |
| | | | |
| | 02:EXCHANGE [PARTITION=UNPARTITIONED] | |
| | | |--> |
| | 01:AGGREGATE | |
| | | output: count(*) | |
| | | | |
| <b>| 00:SCAN HBASE [hbase.hbase_table] |</b> |
| <b>| predicates: year_registered = 2010 |</b> |
| +------------------------------------------------------------------------------------+ |
| </codeblock> |
| |
| <p> |
| The same inefficiency applies if the key column is compared to any non-constant value. Here, even though |
| the key column is a <codeph>STRING</codeph>, and is tested using an equality operator, Impala must scan |
| the entire HBase table because the key column is compared to another column value rather than a constant. |
| </p> |
| |
| <codeblock>explain select count(*) from hbase_table where cust_id = private_email_address; |
| +------------------------------------------------------------------------------------+ |
| | Explain String | |
| +------------------------------------------------------------------------------------+ |
| ... |
| <!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | |
| | WARNING: The following tables are missing relevant table and/or column statistics. | |
| | hbase.hbase_table | |
| | | |
| | 03:AGGREGATE [MERGE FINALIZE] | |
| | | output: sum(count(*)) | |
| | | | |
| | 02:EXCHANGE [PARTITION=UNPARTITIONED] | |
| | | |--> |
| | 01:AGGREGATE | |
| | | output: count(*) | |
| | | | |
| <b>| 00:SCAN HBASE [hbase.hbase_table] |</b> |
| <b>| predicates: cust_id = private_email_address |</b> |
| +------------------------------------------------------------------------------------+ |
| </codeblock> |
| |
| <p> |
| Currently, tests on the row key using <codeph>OR</codeph> or <codeph>IN</codeph> clauses are not |
| optimized into direct lookups either. Such limitations might be lifted in the future, so always check the |
| <codeph>EXPLAIN</codeph> output to be sure whether a particular SQL construct results in an efficient |
| query or not for HBase tables. |
| </p> |
| |
| <codeblock>explain select count(*) from hbase_table where |
| cust_id = 'some_user@example.com' or cust_id = 'other_user@example.com'; |
| +----------------------------------------------------------------------------------------+ |
| | Explain String | |
| +----------------------------------------------------------------------------------------+ |
| ... |
| <!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | |
| | WARNING: The following tables are missing relevant table and/or column statistics. | |
| | hbase.hbase_table | |
| | | |
| | 03:AGGREGATE [MERGE FINALIZE] | |
| | | output: sum(count(*)) | |
| | | | |
| | 02:EXCHANGE [PARTITION=UNPARTITIONED] | |
| | | |--> |
| | 01:AGGREGATE | |
| | | output: count(*) | |
| | | | |
| <b>| 00:SCAN HBASE [hbase.hbase_table] |</b> |
| <b>| predicates: cust_id = 'some_user@example.com' OR cust_id = 'other_user@example.com' |</b> |
| +----------------------------------------------------------------------------------------+ |
| |
| explain select count(*) from hbase_table where |
| cust_id in ('some_user@example.com', 'other_user@example.com'); |
| +------------------------------------------------------------------------------------+ |
| | Explain String | |
| +------------------------------------------------------------------------------------+ |
| ... |
| <!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | |
| | WARNING: The following tables are missing relevant table and/or column statistics. | |
| | hbase.hbase_table | |
| | | |
| | 03:AGGREGATE [MERGE FINALIZE] | |
| | | output: sum(count(*)) | |
| | | | |
| | 02:EXCHANGE [PARTITION=UNPARTITIONED] | |
| | | |--> |
| | 01:AGGREGATE | |
| | | output: count(*) | |
| | | | |
| <b>| 00:SCAN HBASE [hbase.hbase_table] |</b> |
| <b>| predicates: cust_id IN ('some_user@example.com', 'other_user@example.com') |</b> |
| +------------------------------------------------------------------------------------+ |
| </codeblock> |
| |
| <p> |
| Either rewrite such a query into separate queries for each value and combine the results in the application, or |
| combine the single-row queries using <codeph>UNION ALL</codeph>: |
| </p> |
| |
| <codeblock>select count(*) from hbase_table where cust_id = 'some_user@example.com'; |
| select count(*) from hbase_table where cust_id = 'other_user@example.com'; |
| |
| explain |
| select count(*) from hbase_table where cust_id = 'some_user@example.com' |
| union all |
| select count(*) from hbase_table where cust_id = 'other_user@example.com'; |
| +------------------------------------------------------------------------------------+ |
| | Explain String | |
| +------------------------------------------------------------------------------------+ |
| ... |
| <!--| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 | |
| | WARNING: The following tables are missing relevant table and/or column statistics. | |
| | hbase.hbase_table | |
| | | |
| | 09:EXCHANGE [PARTITION=UNPARTITIONED] | |
| | | | |
| | |−−11:MERGE | |
| | | | | |
| | | 08:AGGREGATE [MERGE FINALIZE] | |
| | | | output: sum(count(*)) | |
| | | | | |
| | | 07:EXCHANGE [PARTITION=UNPARTITIONED] | |
| | | | |--> |
| | | 04:AGGREGATE | |
| | | | output: count(*) | |
| | | | | |
| <b>| | 03:SCAN HBASE [hbase.hbase_table] |</b> |
| <b>| | start key: other_user@example.com |</b> |
| <b>| | stop key: other_user@example.com\0 |</b> |
| | | | |
| | 10:MERGE | |
| ... |
| <!--| | | |
| | 06:AGGREGATE [MERGE FINALIZE] | |
| | | output: sum(count(*)) | |
| | | | |
| | 05:EXCHANGE [PARTITION=UNPARTITIONED] | |
| | | |--> |
| | 02:AGGREGATE | |
| | | output: count(*) | |
| | | | |
| <b>| 01:SCAN HBASE [hbase.hbase_table] |</b> |
| <b>| start key: some_user@example.com |</b> |
| <b>| stop key: some_user@example.com\0 |</b> |
| +------------------------------------------------------------------------------------+ |
| </codeblock> |
| |
| </example> |
| |
| <example> |
| |
| <title>Configuration Options for Java HBase Applications</title> |
| |
| <p> If you have an HBase Java application that calls the |
| <codeph>setCacheBlocks</codeph> or <codeph>setCaching</codeph> |
| methods of the class <xref |
| href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html" |
| scope="external" format="html" |
| >org.apache.hadoop.hbase.client.Scan</xref>, you can set these same |
| caching behaviors through Impala query options, to control the memory |
| pressure on the HBase RegionServer. For example, when doing queries in |
| HBase that result in full-table scans (which by default are |
| inefficient for HBase), you can reduce memory usage and speed up the |
| queries by turning off the <codeph>HBASE_CACHE_BLOCKS</codeph> setting |
| and specifying a large number for the <codeph>HBASE_CACHING</codeph> |
| setting. |
| </p> |
| |
| <p> |
| To set these options, issue commands like the following in <cmdname>impala-shell</cmdname>: |
| </p> |
| |
| <codeblock>-- Same as calling setCacheBlocks(true) or setCacheBlocks(false). |
| set hbase_cache_blocks=true; |
| set hbase_cache_blocks=false; |
| |
| -- Same as calling setCaching(rows). |
| set hbase_caching=1000; |
| </codeblock> |
| |
| <p> |
| Or update the <cmdname>impalad</cmdname> defaults file <filepath>/etc/default/impala</filepath> and |
| include settings for <codeph>HBASE_CACHE_BLOCKS</codeph> and/or <codeph>HBASE_CACHING</codeph> in the |
| <codeph>-default_query_options</codeph> setting for <codeph>IMPALA_SERVER_ARGS</codeph>. See |
| <xref href="impala_config_options.xml#config_options"/> for details. |
| </p> |
| |
| <note> |
| In Impala 2.0 and later, these options are settable through the JDBC or ODBC interfaces using the |
| <codeph>SET</codeph> statement. |
| </note> |
| |
| </example> |
| </conbody> |
| </concept> |
| |
| <concept id="hbase_scenarios"> |
| |
| <title>Use Cases for Querying HBase through Impala</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Use Cases"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| The following are representative use cases for using Impala to query HBase tables: |
| </p> |
| |
| <ul> |
| <li> |
| Using HBase to store rapidly incrementing counters, such as how many times a web page has been viewed, or |
| on a social network, how many connections a user has or how many votes a post received. HBase is |
| efficient for capturing such changeable data: the append-only storage mechanism is efficient for writing |
| each change to disk, and a query always returns the latest value. An application could query specific |
| totals like these from HBase, and combine the results with a broader set of data queried from Impala. |
| </li> |
| |
| <li> |
| <p> |
| Storing very wide tables in HBase. Wide tables have many columns, possibly thousands, typically |
| recording many attributes for an important subject such as a user of an online service. These tables |
| are also often sparse, that is, most of the column values are <codeph>NULL</codeph>, 0, |
| <codeph>false</codeph>, empty string, or other blank or placeholder value. (For example, any particular |
| web site user might never have used some site feature, filled in a certain field in their profile, |
| or visited a particular part of the site.) A typical query against this kind of table is to |
| look up a single row to retrieve all the information about a specific subject, rather than summing, |
| averaging, or filtering millions of rows as in typical Impala-managed tables. |
| </p> |
| </li> |
| </ul> |
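| <p> |
| As an illustrative sketch of the first use case, a query might look up one counter by row key and join it |
| with descriptive data from an HDFS-backed table. Both tables and all column names here are hypothetical. |
| Check the <codeph>EXPLAIN</codeph> output to confirm that the HBase scan shows <codeph>start key</codeph> |
| and <codeph>stop key</codeph> lines rather than a full-table scan. |
| </p> |
| <codeblock>-- hbase_page_counters: hypothetical HBase-backed table; page_id is the STRING row key. |
| -- page_dim: hypothetical HDFS-backed dimension table. |
| select d.page_title, c.view_count |
| from hbase_page_counters c |
| join page_dim d on c.page_id = d.page_id |
| where c.page_id = 'home'; |
| </codeblock> |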
| </conbody> |
| </concept> |
| |
| <concept audience="hidden" id="hbase_create_new"> |
| |
| <title>Creating a New HBase Table for Impala to Use</title> |
| |
| <conbody> |
| |
| <p> |
| You can create an HBase-backed table through a <codeph>CREATE TABLE</codeph> statement in the Hive shell, |
| without going into the HBase shell at all: |
| </p> |
| |
| <!-- To do: |
| Add example. (Not critical because this subtopic is currently hidden.) |
| --> |
| </conbody> |
| </concept> |
| |
| <concept audience="hidden" id="hbase_reuse_existing"> |
| |
| <title>Associate Impala with an Existing HBase Table</title> |
| |
| <conbody> |
| |
| <p> |
| If you already have some HBase tables created through the HBase shell, you can make them accessible to |
| Impala through a <codeph>CREATE TABLE</codeph> statement in the Hive shell: |
| </p> |
| |
| <!-- To do: |
| Add example. (Not critical because this subtopic is currently hidden.) |
| --> |
| </conbody> |
| </concept> |
| |
| <concept audience="hidden" id="hbase_column_families"> |
| |
| <title>Map HBase Columns and Column Families to Impala Columns</title> |
| |
| <conbody> |
| |
| <p/> |
| </conbody> |
| </concept> |
| |
| <concept id="hbase_loading"> |
| |
| <title>Loading Data into an HBase Table</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="ETL"/> |
| <data name="Category" value="Ingest"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| The Impala <codeph>INSERT</codeph> statement works for HBase tables. The <codeph>INSERT ... VALUES</codeph> |
| syntax is ideally suited to HBase tables, because inserting a single row is an efficient operation for an |
| HBase table. (For regular Impala tables, with data files in HDFS, the tiny data files produced by |
| <codeph>INSERT ... VALUES</codeph> are extremely inefficient, so you would not use that technique with |
| tables containing any significant data volume.) |
| </p> |
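| <p> |
| For instance, assuming a hypothetical HBase-backed table <codeph>hbase_page_counters</codeph> with two |
| columns, a <codeph>STRING</codeph> column <codeph>page_id</codeph> mapped to the row key and a |
| <codeph>BIGINT</codeph> column <codeph>view_count</codeph>, a single-row insert might look like this: |
| </p> |
| <codeblock>-- Table and column names are illustrative assumptions. |
| insert into hbase_page_counters values ('home', 100); |
| </codeblock> |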
| |
| <!-- To do: |
| Add examples throughout this section. |
| --> |
| |
| <p> |
| When you use the <codeph>INSERT ... SELECT</codeph> syntax, the result in the HBase table could be fewer |
| rows than you expect. HBase only stores the most recent version of each unique row key, so if an |
| <codeph>INSERT ... SELECT</codeph> statement copies over multiple rows containing the same value for the |
| key column, subsequent queries will only return one row with each key column value: |
| </p> |
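| <p> |
| A sketch of this behavior, assuming a hypothetical HBase-backed table <codeph>hbase_page_counters</codeph> |
| (row key <codeph>page_id</codeph>) that starts out empty, and a hypothetical HDFS-backed table |
| <codeph>staging_counters</codeph> that contains several rows for some <codeph>page_id</codeph> values: |
| </p> |
| <codeblock>-- All table and column names are illustrative assumptions. |
| insert into hbase_page_counters |
| select page_id, view_count from staging_counters; |
| |
| -- The first two counts match only if page_id is unique in the source table; |
| -- otherwise the HBase table is expected to hold one row per distinct page_id. |
| select count(*) from staging_counters; |
| select count(*) from hbase_page_counters; |
| select count(distinct page_id) from staging_counters; |
| </codeblock> |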
| |
| <p> |
| Although Impala does not have an <codeph>UPDATE</codeph> statement, you can achieve the same effect by |
| doing successive <codeph>INSERT</codeph> statements using the same value for the key column each time: |
| </p> |
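| <p> |
| A minimal sketch, assuming a hypothetical HBase-backed table <codeph>hbase_page_counters</codeph> with a |
| <codeph>STRING</codeph> column <codeph>page_id</codeph> mapped to the row key and a numeric column |
| <codeph>view_count</codeph>: |
| </p> |
| <codeblock>-- Table and column names are illustrative assumptions. |
| insert into hbase_page_counters values ('home', 100); |
| -- Inserting again with the same key replaces the earlier value, acting like an update. |
| insert into hbase_page_counters values ('home', 101); |
| |
| -- Expected to return a single row reflecting the most recent insert. |
| select page_id, view_count from hbase_page_counters where page_id = 'home'; |
| </codeblock> |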
| |
| </conbody> |
| </concept> |
| |
| <concept id="hbase_limitations"> |
| |
| <title>Limitations and Restrictions of the Impala and HBase Integration</title> |
| |
| <conbody> |
| |
| <p> |
| The Impala integration with HBase has the following limitations and restrictions, some inherited from the |
| integration between HBase and Hive, and some unique to Impala: |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| If you issue a <codeph>DROP TABLE</codeph> for an internal (Impala-managed) table that is mapped to an |
| HBase table, the underlying table is not removed in HBase. By contrast, the Hive <codeph>DROP TABLE</codeph> |
| statement does remove the HBase table in this case. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| The <codeph>INSERT OVERWRITE</codeph> statement is not available for HBase tables. You can insert new |
| data, or modify an existing row by inserting a new row with the same key value, but not replace the |
| entire contents of the table. You can do an <codeph>INSERT OVERWRITE</codeph> in Hive if you need this |
| capability. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| If you issue a <codeph>CREATE TABLE LIKE</codeph> statement for a table mapped to an HBase table, the |
| new table is also an HBase table, but inherits the same underlying HBase table name as the original. |
| The new table is effectively an alias for the old one, not a new table with identical column structure. |
| Avoid using <codeph>CREATE TABLE LIKE</codeph> for HBase tables, to avoid any confusion. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Copying data into an HBase table using the Impala <codeph>INSERT ... SELECT</codeph> syntax might |
| produce fewer new rows than are in the query result set. If the result set contains multiple rows with |
| the same value for the key column, each row supersedes any previous rows with the same key value. |
| Because the order of the inserted rows is unpredictable, you cannot rely on this technique to preserve |
| the <q>latest</q> version of a particular key value. |
| </p> |
| </li> |
| <li rev="2.3.0"> |
| <p> |
| Because the complex data types (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>) |
| available in <keyword keyref="impala23_full"/> and higher are currently only supported in Parquet tables, you cannot |
| use these types in HBase tables that are queried through Impala. |
| </p> |
| </li> |
| <li> |
| <p conref="../shared/impala_common.xml#common/hbase_no_load_data"/> |
| </li> |
| <li> |
| <p conref="../shared/impala_common.xml#common/tablesample_caveat"/> |
| </li> |
| </ul> |
| </conbody> |
| </concept> |
| |
| <concept id="hbase_queries"> |
| |
| <title>Examples of Querying HBase Tables from Impala</title> |
| |
| <conbody> |
| |
| <p> |
| The following examples create an HBase table with four column families, |
| create a corresponding table through Hive, |
| then insert and query the table through Impala. |
| </p> |
| <p> |
| In the HBase shell, the table |
| name is quoted in <codeph>CREATE</codeph> and <codeph>DROP</codeph> statements. Tables created in HBase |
| begin in the <q>enabled</q> state; before dropping them through the HBase shell, you must issue a |
| <codeph>disable '<varname>table_name</varname>'</codeph> statement. |
| </p> |
| |
| <codeblock>$ hbase shell |
| 15/02/10 16:07:45 |
| HBase Shell; enter 'help<RETURN>' for list of supported commands. |
| Type "exit<RETURN>" to leave the HBase Shell |
| ... |
| |
| hbase(main):001:0> create 'hbasealltypessmall', 'boolsCF', 'intsCF', 'floatsCF', 'stringsCF' |
| 0 row(s) in 4.6520 seconds |
| |
| => Hbase::Table - hbasealltypessmall |
| hbase(main):006:0> quit |
| </codeblock> |
| |
| <p> |
| Issue the following <codeph>CREATE TABLE</codeph> statement in the Hive shell. (The Impala <codeph>CREATE |
| TABLE</codeph> statement currently does not support the <codeph>STORED BY</codeph> clause, so you switch into Hive to |
| create the table, then back to Impala and the <cmdname>impala-shell</cmdname> interpreter to issue the |
| queries.) |
| </p> |
| |
| <p> |
| This example creates an external table mapped to the HBase table, usable by both Impala and Hive. Because it is |
| defined as an external table, dropping it through Impala or Hive removes only the table definition and leaves the original HBase table intact. |
| </p> |
| |
| <p> |
| The <codeph>WITH SERDEPROPERTIES</codeph> clause |
| specifies that the first column (<codeph>ID</codeph>) represents the row key, and maps the remaining |
| columns of the SQL table to HBase column families. The mapping relies on the ordinal order of the |
| columns in the table, not the column names in the <codeph>CREATE TABLE</codeph> statement. |
| The first column is defined to be the lookup key; the |
| <codeph>STRING</codeph> data type produces the fastest key-based lookups for HBase tables. |
| </p> |
| |
| <note> |
| For Impala queries against HBase tables, the most important factor in achieving good performance is using a |
| <codeph>STRING</codeph> column as the row key, as shown in this example. |
| </note> |
| |
| <codeblock>$ hive |
| ... |
| hive> use hbase; |
| OK |
| Time taken: 4.095 seconds |
| hive> CREATE EXTERNAL TABLE hbasestringids ( |
| > id string, |
| > bool_col boolean, |
| > tinyint_col tinyint, |
| > smallint_col smallint, |
| > int_col int, |
| > bigint_col bigint, |
| > float_col float, |
| > double_col double, |
| > date_string_col string, |
| > string_col string, |
| > timestamp_col timestamp) |
| > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' |
| > WITH SERDEPROPERTIES ( |
| > "hbase.columns.mapping" = |
| > ":key,boolsCF:bool_col,intsCF:tinyint_col,intsCF:smallint_col,intsCF:int_col,intsCF:\ |
| > bigint_col,floatsCF:float_col,floatsCF:double_col,stringsCF:date_string_col,\ |
| > stringsCF:string_col,stringsCF:timestamp_col" |
| > ) |
| > TBLPROPERTIES("hbase.table.name" = "hbasealltypessmall"); |
| OK |
| Time taken: 2.879 seconds |
| hive> quit; |
| </codeblock> |
| |
| <p> |
| Once you have established the mapping to an HBase table, you can issue DML statements and queries |
| from Impala. The following example begins with an |
| <codeph>INVALIDATE METADATA <varname>table_name</varname></codeph> statement, |
| which makes the table created through Hive visible to Impala. |
| It then shows a <codeph>DESCRIBE</codeph> statement and a series of <codeph>INSERT</codeph> |
| statements, followed by a query. |
| From a performance standpoint, the ideal kind of query |
| retrieves a row from the table based on a row key mapped to a string column. |
| </p> |
| |
| <codeblock>$ impala-shell -i localhost -d hbase |
| Starting Impala Shell without Kerberos authentication |
| Connected to localhost:21000 |
| ... |
| Query: use `hbase` |
| [localhost:21000] > invalidate metadata hbasestringids; |
| Fetched 0 row(s) in 0.09s |
| [localhost:21000] > desc hbasestringids; |
| +-----------------+-----------+---------+ |
| | name | type | comment | |
| +-----------------+-----------+---------+ |
| | id | string | | |
| | bool_col | boolean | | |
| | double_col | double | | |
| | float_col | float | | |
| | bigint_col | bigint | | |
| | int_col | int | | |
| | smallint_col | smallint | | |
| | tinyint_col | tinyint | | |
| | date_string_col | string | | |
| | string_col | string | | |
| | timestamp_col | timestamp | | |
| +-----------------+-----------+---------+ |
| Fetched 11 row(s) in 0.02s |
| [localhost:21000] > insert into hbasestringids values ('0001',true,3.141,9.94,1234567,32768,4000,76,'2014-12-31','Hello world',now()); |
| Inserted 1 row(s) in 0.26s |
| [localhost:21000] > insert into hbasestringids values ('0002',false,2.004,6.196,1500,8000,129,127,'2014-01-01','Foo bar',now()); |
| Inserted 1 row(s) in 0.12s |
| [localhost:21000] > select * from hbasestringids where id = '0001'; |
| +------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+ |
| | id | bool_col | double_col | float_col | bigint_col | int_col | smallint_col | tinyint_col | date_string_col | string_col | timestamp_col | |
| +------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+ |
| | 0001 | true | 3.141 | 9.939999580383301 | 1234567 | 32768 | 4000 | 76 | 2014-12-31 | Hello world | 2015-02-10 16:36:59.764838000 | |
| +------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+ |
| Fetched 1 row(s) in 0.54s |
| </codeblock> |
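| |
| <p> |
| In the preceding output, the <codeph>DESCRIBE</codeph> statement reports the columns in a different order |
| than the Hive <codeph>CREATE TABLE</codeph> statement, and the <codeph>INSERT</codeph> statements supply |
| their values in the <codeph>DESCRIBE</codeph> order. As a hedged alternative, naming the columns in the |
| <codeph>INSERT</codeph> statement removes the dependence on that order. The row with key |
| <codeph>'0003'</codeph> and all of its values are made up for illustration. |
| </p> |
| |
| <codeblock>-- Sketch only: the key '0003' and the values are illustrative. |
| -- The column list pairs each value with its column by name, |
| -- regardless of the order that DESCRIBE reports. |
| insert into hbasestringids |
| (id, bool_col, tinyint_col, smallint_col, int_col, bigint_col, |
| float_col, double_col, date_string_col, string_col, timestamp_col) |
| values |
| ('0003', true, 5, 300, 70000, 9000000000, |
| 1.5, 2.25, '2015-01-15', 'Baz', now()); |
| </codeblock> |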
| |
| <note conref="../shared/impala_common.xml#common/invalidate_metadata_hbase"/> |
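| |
| <p> |
| The following statements sketch how you might compare query shapes against the |
| <codeph>hbasestringids</codeph> table from the preceding example. They assume the two rows inserted above |
| are present. The remarks about relative cost are expectations rather than measurements; use |
| <codeph>EXPLAIN</codeph> on your own cluster to confirm how each query is planned. |
| </p> |
| |
| <codeblock>-- Single-row lookup on the STRING row key: the access pattern |
| -- described above as ideal for HBase tables. |
| explain select * from hbasestringids where id = '0001'; |
| |
| -- Range restriction on the row key: still limited by key values. |
| select id, string_col from hbasestringids where id between '0001' and '0002'; |
| |
| -- Predicate on a non-key column only: nothing restricts the row key, |
| -- so expect this form to read more of the table than the lookups above. |
| select count(*) from hbasestringids where string_col = 'Hello world'; |
| </codeblock> |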
| </conbody> |
| </concept> |
| </concept> |