---
title: Accessing Hive Data
---
This topic describes how to access Hive data using PXF. You have two options for querying data stored in Hive: you can create PXF external tables and query those tables, or you can query Hive tables directly by using HAWQ and PXF's integration with HCatalog. With HCatalog integration, HAWQ accesses Hive table metadata stored in HCatalog.
- **[Prerequisites](../pxf/HivePXF.html#installingthepxfhiveplugin)**
- **[Hive Command Line](../pxf/HivePXF.html#hivecommandline)**
- **[Using PXF Tables to Query Hive](../pxf/HivePXF.html#topic_p2s_lvl_25)**
- **[Using PXF and HCatalog to Query Hive](../pxf/HivePXF.html#hcatalog)**
- **[Partition Filtering](../pxf/HivePXF.html#partitionfiltering)**
- **[Using PXF with Hive Default Partitions](../pxf/HivePXF.html#topic_fdm_zrh_1s)**
- **[Accessing Hive Tables in Parquet Format](../pxf/HivePXF.html#topic_dbb_nz3_ts)**
## Prerequisites<a id="installingthepxfhiveplugin"></a>
Check the following before using PXF to access Hive:
- The PXF HDFS plugin is installed on all cluster nodes.
- The PXF Hive plugin is installed on all cluster nodes.
- The Hive JAR files and conf directory are installed on all cluster nodes.
- You have tested PXF on HDFS before connecting to Hive or HBase.
- You are running the Hive Metastore service on a machine in your cluster. 
- You have set the `hive.metastore.uris` property in the `hive-site.xml` file on the NameNode (see the example below).
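For reference, a minimal sketch of the relevant `hive-site.xml` entry. The hostname is a placeholder, and 9083 is the conventional default port for the Hive Metastore Thrift service:
``` pre
<property>
    <name>hive.metastore.uris</name>
    <!-- replace metastore-host with the machine running the Hive Metastore service -->
    <value>thrift://metastore-host:9083</value>
</property>
```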
## Hive Command Line<a id="hivecommandline"></a>
To start the Hive command line and work directly on a Hive table:
``` pre
$ ${HIVE_HOME}/bin/hive
```
Here is an example of how to create and load data into a sample Hive table from an existing file.
``` pre
hive> CREATE TABLE test (name string, type string, supplier_key int, full_price double) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH '/local/path/data.txt' INTO TABLE test;
```
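Because the `test` table is comma-delimited, a hypothetical `/local/path/data.txt` would hold one comma-separated record per line, matching the column order `name, type, supplier_key, full_price`:
``` pre
widget,consumer,1001,5.25
gadget,industrial,1002,19.99
```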
## Using PXF Tables to Query Hive<a id="topic_p2s_lvl_25"></a>
PXF defines Hive tables in a uniform way, regardless of the underlying file storage format. The PXF Hive plugin automatically detects source tables in the following formats:
- Text based
- SequenceFile
- RCFile
- ORCFile
- Parquet
- Avro
The source table can also be a combination of these types. The PXF Hive plugin uses this information to query the data at runtime.
- **[Syntax Example](../pxf/HivePXF.html#syntax2)**
- **[Hive Complex Types](../pxf/HivePXF.html#topic_b4v_g3n_25)**
### Syntax Example<a id="syntax2"></a>
The following PXF table definition is valid for any Hive file storage type.
``` pre
CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name
    ( column_name data_type [, ...] | LIKE other_table )
LOCATION ('pxf://namenode[:port]/hive-table-name?<pxf parameters>[&custom-option=value...]')
FORMAT 'CUSTOM' (formatter='pxfwritable_import')
```
where `<pxf parameters>` is:
``` pre
FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class
| PROFILE=profile-name
```
**Note:** The port is the connection port for the PXF service. If the port is omitted, PXF assumes that High Availability (HA) is enabled and connects to the HA name service port, 51200 by default. The HA name service port can be changed by setting the `pxf_service_port` configuration parameter.
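Instead of a profile, you can name the plugin classes explicitly. The following is a sketch only; the class names shown are assumptions based on the PXF Hive plugin package layout, so verify them against your PXF distribution:
``` pre
-- Explicit fragmenter/accessor/resolver classes instead of PROFILE=Hive
CREATE EXTERNAL TABLE hivetest_explicit(id int, newid int)
LOCATION ('pxf://namenode:51200/hive-table-name?FRAGMENTER=org.apache.hawq.pxf.plugins.hive.HiveDataFragmenter&ACCESSOR=org.apache.hawq.pxf.plugins.hive.HiveAccessor&RESOLVER=org.apache.hawq.pxf.plugins.hive.HiveResolver')
FORMAT 'custom' (formatter='pxfwritable_import');
```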
PXF has three built-in profiles for Hive tables:
- Hive
- HiveRC
- HiveText
The Hive profile works with any Hive storage type. Use HiveRC and HiveText to query RC and Text formats respectively. The HiveRC and HiveText profiles are faster than the generic Hive profile. When using the HiveRC and HiveText profiles, you must specify a DELIMITER option in the LOCATION clause. See [Using Profiles to Read and Write Data](ReadWritePXF.html#readingandwritingdatawithpxf) for more information on profiles.
The following example creates a readable external table for a Hive table named `/user/eddie/test` using the PXF Hive profile:
``` pre
CREATE EXTERNAL TABLE hivetest(id int, newid int) 
LOCATION ('pxf://namenode:51200/user/eddie/test?PROFILE=Hive')
FORMAT 'custom' (formatter='pxfwritable_import');
```
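When a profile requires a delimiter, it appears in both the LOCATION and FORMAT clauses. A minimal HiveText sketch, assuming a comma-delimited Hive table named `hive_text_table` (`%2C` is the URL-encoded comma):
``` pre
CREATE EXTERNAL TABLE hivetexttest(id int, name text)
LOCATION ('pxf://namenode:51200/hive_text_table?PROFILE=HiveText&DELIMITER=%2C')
FORMAT 'TEXT' (delimiter=E',');
```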
### Hive Complex Types<a id="topic_b4v_g3n_25"></a>
PXF tables support Hive data types that are not primitive types. The supported Hive complex data types are array, struct, map, and union. This Hive `CREATE TABLE` statement, for example, creates a table with each of these complex data types:
``` pre
CREATE TABLE sales_collections (
item STRING,
price FLOAT,
properties ARRAY<STRING>,
hash MAP<STRING,FLOAT>,
delivery_address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>,
others UNIONTYPE<FLOAT, BOOLEAN, STRING>
)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\001' COLLECTION ITEMS TERMINATED BY '\002' MAP KEYS TERMINATED BY '\003' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/local/path/<some data file>' INTO TABLE sales_collections;
```
You can use HAWQ functions or application code to extract the components of the complex data columns as needed.
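Complex columns arrive in HAWQ as text, so one approach is a sketch like the following, which maps the complex columns of `sales_collections` to `TEXT` and extracts components with ordinary string functions (the column mapping and the parsing shown are illustrative assumptions):
``` pre
CREATE EXTERNAL TABLE pxf_sales_collections (
    item TEXT,
    price FLOAT8,
    properties TEXT,
    hash TEXT,
    delivery_address TEXT,
    others TEXT
)
LOCATION ('pxf://namenode:51200/sales_collections?PROFILE=Hive')
FORMAT 'custom' (formatter='pxfwritable_import');

-- Pull one component out of the serialized struct text.
SELECT item, split_part(delivery_address, ',', 2) AS city
FROM pxf_sales_collections;
```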
## Using PXF and HCatalog to Query Hive<a id="hcatalog"></a>
Hive tables can be queried directly through HCatalog integration with HAWQ and PXF, regardless of the underlying file storage format.
Previously, in order to query Hive tables using HAWQ and PXF, you needed to create an external table in PXF that described the target table's Hive metadata. Since HAWQ is now integrated with HCatalog, HAWQ can use metadata stored in HCatalog instead of external tables created for PXF. HCatalog is built on top of the Hive metastore and incorporates Hive's DDL. This provides several advantages:
- You do not need to know the table schema of your Hive tables.
- You do not need to manually enter information about Hive table location or format.
- If Hive table metadata changes, HCatalog provides updated metadata. This is in contrast to the use of static external PXF tables to define Hive table metadata for HAWQ.
The following diagram depicts how HAWQ integrates with HCatalog to query Hive tables:
<img src="../images/hawq_hcatalog.png" id="hcatalog__image_ukw_h2v_c5" class="image" width="672" />
1. HAWQ retrieves table metadata from HCatalog using PXF.
2. HAWQ creates in-memory catalog tables from the retrieved metadata. If a table is referenced multiple times in a transaction, HAWQ uses its in-memory metadata to reduce external calls to HCatalog.
3. PXF queries Hive using table metadata that is stored in the HAWQ in-memory catalog tables. Table metadata is dropped at the end of the transaction.
- **[Usage](../pxf/HivePXF.html#topic_j1l_y55_c5)**
- **[Limitations](../pxf/HivePXF.html#topic_r5k_pst_25)**
### Usage<a id="topic_j1l_y55_c5"></a>
To enable HCatalog query integration in HAWQ, perform the following steps:
1. Make sure your deployment meets the requirements listed in [Prerequisites](#installingthepxfhiveplugin).
2. If necessary, set the `pxf_service_address` global configuration property to a hostname or IP address and port where you have installed the PXF Hive plugin. By default, the value is set to `localhost:51200`.
``` pre
SET pxf_service_address TO "hivenode:51200";
```
3. To query a Hive table with HCatalog integration, simply query HCatalog directly. The query syntax is:
``` pre
SELECT * FROM hcatalog.hive-db-name.hive-table-name;
```
For example:
``` pre
SELECT * FROM hcatalog.default.sales;
```
4. To obtain a description of a Hive table with HCatalog integration, you can use the `psql` client interface.
- Within HAWQ, use either the `\d hcatalog.hive-db-name.hive-table-name` or `\d+ hcatalog.hive-db-name.hive-table-name` command to describe a single table. For example, from the `psql` client interface:
``` pre
$ psql -d postgres
postgres=# \d hcatalog.default.test
PXF Hive Table "default.test"
Column | Type
--------------+--------
name | text
type | text
supplier_key | int4
full_price | float8
```
- Use `\d hcatalog.hive-db-name.*` to describe the whole database schema. For example:
``` pre
postgres=# \d hcatalog.default.*
PXF Hive Table "default.test"
Column | Type
--------------+--------
type | text
name | text
supplier_key | int4
full_price | float8
PXF Hive Table "default.testabc"
Column | Type
--------+------
type | text
name | text
```
- Use `\d hcatalog.*.*` to describe the whole schema:
``` pre
postgres=# \d hcatalog.*.*
PXF Hive Table "default.test"
Column | Type
--------------+--------
type | text
name | text
supplier_key | int4
full_price | float8
PXF Hive Table "default.testabc"
Column | Type
--------+------
type | text
name | text
PXF Hive Table "userdb.test"
Column | Type
----------+------
address | text
username | text
```
**Note:** When using `\d` or `\d+` commands in the `psql` HAWQ client, `hcatalog` will not be listed as a database. If you use other `psql` compatible clients, `hcatalog` will be listed as a database with a size value of `-1` since `hcatalog` is not a real database in HAWQ.
5. Alternatively, you can use the **pxf\_get\_item\_fields** user-defined function (UDF) to obtain Hive table descriptions from other client interfaces or third-party applications. The UDF takes a PXF profile and a table pattern string as its input parameters.
**Note:** Currently the only supported input profile is `'Hive'`.
For example, the following statement returns a description of a specific table. The description includes path, itemname (table), fieldname, and fieldtype.
``` pre
postgres=# select * from pxf_get_item_fields('Hive','default.test');
path | itemname | fieldname | fieldtype
---------+----------+--------------+-----------
default | test | name | text
default | test | type | text
default | test | supplier_key | int4
default | test | full_price | float8
(4 rows)
```
The following statement returns table descriptions from the default database.
``` pre
postgres=# select * from pxf_get_item_fields('Hive','default.*');
path | itemname | fieldname | fieldtype
---------+----------+--------------+-----------
default | test | name | text
default | test | type | text
default | test | supplier_key | int4
default | test | full_price | float8
default | testabc | name | text
default | testabc | type | text
(6 rows)
```
The following statement returns a description of the entire schema.
``` pre
postgres=# select * from pxf_get_item_fields('Hive', '*.*');
path | itemname | fieldname | fieldtype
---------+----------+--------------+-----------
default | test | name | text
default | test | type | text
default | test | supplier_key | int4
default | test | full_price | float8
default | testabc | name | text
default | testabc | type | text
userdb | test | username | text
userdb | test | address | text
(8 rows)
```
### Limitations<a id="topic_r5k_pst_25"></a>
HCatalog integration has the following limitations:
- HCatalog integration queries and describe commands do not support complex types; only primitive types are supported. Use PXF external tables to query complex types in Hive instead. (See [Hive Complex Types](#topic_b4v_g3n_25) for an example.)
- Even for primitive types, HCatalog metadata descriptions produced by `\d` and `\d+` are converted to HAWQ types. For example, the Hive type `tinyint` is converted to HAWQ type `int2`. (See [Data Types](../reference/HAWQDataTypes.html) for a list of data types in HAWQ.)
- HAWQ reserves the database name "hcatalog" for system use. You cannot connect to or alter the system "hcatalog" database.
## Partition Filtering<a id="partitionfiltering"></a>
The PXF Hive plugin uses the Hive partitioning feature and directory structure. This enables partition exclusion on HDFS files that contain the Hive table. To use the partition filtering feature to reduce network traffic and I/O, run a PXF query using a WHERE clause that refers to a specific partition in the partitioned Hive table.
To take advantage of PXF partition filtering push-down, name the partition fields in the external table. These names must match the names stored in the Hive table. Otherwise, PXF ignores partition filtering and the filtering is performed on the HAWQ side, which impacts performance.
**Note:** The Hive plugin only filters on partition columns, not on other table attributes.
### Example<a id="example2"></a>
Create a Hive table `sales_part` with two partition columns, `delivery_state` and `delivery_city`:
``` pre
CREATE TABLE sales_part (name STRING, type STRING, supplier_key INT, price DOUBLE)
PARTITIONED BY (delivery_state STRING, delivery_city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```
Load data into this Hive table and add some partitions:
``` pre
LOAD DATA LOCAL INPATH '/local/path/data1.txt' INTO TABLE sales_part PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'San Francisco');
LOAD DATA LOCAL INPATH '/local/path/data2.txt' INTO TABLE sales_part PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Sacramento');
LOAD DATA LOCAL INPATH '/local/path/data3.txt' INTO TABLE sales_part PARTITION(delivery_state = 'NEVADA' , delivery_city = 'Reno');
LOAD DATA LOCAL INPATH '/local/path/data4.txt' INTO TABLE sales_part PARTITION(delivery_state = 'NEVADA' , delivery_city = 'Las Vegas');
```
The Hive storage directory appears as follows:
``` pre
/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=San Francisco/data1.txt
/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=Sacramento/data2.txt
/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=Reno/data3.txt
/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=Las Vegas/data4.txt
```
To define a PXF table that will read this Hive table and take advantage of partition filter push-down, define the fields corresponding to the Hive partition fields at the end of the attribute list. In HiveQL, a `select *` statement on a partitioned table shows the partition fields at the end of the record.
``` pre
CREATE EXTERNAL TABLE pxf_sales_part(
item_name TEXT,
item_type TEXT,
supplier_key INTEGER,
item_price DOUBLE PRECISION,
delivery_state TEXT,
delivery_city TEXT
)
LOCATION ('pxf://namenode_host:51200/sales_part?Profile=Hive')
FORMAT 'custom' (FORMATTER='pxfwritable_import');
SELECT * FROM pxf_sales_part;
```
### Example<a id="example3"></a>
In the following example, the HAWQ query filters on the `delivery_city` partition value `Sacramento`. The filter on `item_name` is not pushed down, since it is not a partition column; it is applied on the HAWQ side after all the data for `Sacramento` is transferred for processing.
``` pre
SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name = 'shirt';
```
### Example<a id="example4"></a>
The following HAWQ query reads all the data under `delivery_state` partition `CALIFORNIA`, regardless of the city.
``` pre
SELECT * FROM pxf_sales_part WHERE delivery_state = 'CALIFORNIA';
```
## Using PXF with Hive Default Partitions<a id="topic_fdm_zrh_1s"></a>
This topic describes a difference in query results between Hive and PXF queries when Hive tables use a default partition. When dynamic partitioning is enabled in Hive, a partitioned table may store data in a default partition. Hive creates a default partition when the value of a partitioning column does not match the defined type of the column (for example, when a NULL value is used for any partitioning column). In Hive, any query that includes a filter on a partition column *excludes* any data that is stored in the table's default partition.
Similar to Hive, PXF represents a table's partitioning columns as columns that are appended to the end of the table. However, PXF translates any column value in a default partition to a NULL value. This means that a HAWQ query that includes an IS NULL filter on a partitioning column can return different results than the same Hive query.
### Example<a id="topic_g4r_4wh_1s"></a>
Consider a Hive partitioned table that is created with the statement:
``` pre
CREATE TABLE sales (order_id bigint, order_amount float) PARTITIONED BY (date date);
```
The table is loaded with five rows that contain the following data:
``` pre
1 1.0 1900-01-01
2 2.2 1994-04-14
3 3.3 2011-03-31
4 4.5 NULL
5 5.0 2013-12-06
```
In this case, the insertion of row 4 creates a Hive default partition, because the partition column "date" contains a null value.
In Hive, any query that filters on the partition column omits data in the default partition. For example, the following query returns no rows:
``` pre
hive> select * from sales where date is null;
```
However, if you map this table as a PXF external table in HAWQ, all default partition values are translated into actual NULL values. In HAWQ, executing the same query against the PXF table returns row 4 as the result, because the filter matches the NULL value.
Keep this behavior in mind when executing IS NULL queries on Hive partitioned tables.
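As a sketch, assuming the Hive table above is mapped to a PXF external table named `pxf_sales` (with the partition column last), the following HAWQ query would return row 4, unlike its Hive counterpart:
``` pre
-- Matches the default-partition row because PXF surfaces it as NULL
SELECT * FROM pxf_sales WHERE date IS NULL;
```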
## Accessing Hive Tables in Parquet Format<a id="topic_dbb_nz3_ts"></a>
The PXF Hive profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types. For example, in Hive if a table is created using:
``` pre
hive> create table hive_parquet_table (fname string, lname string, custid int, acctbalance double)
      stored as parquet;
```
Then you would define the HAWQ external table using:
``` pre
create external table pxf_parquet_table (fname text, lname text, custid int, acctbalance double precision)
location ('pxf://localhost:51200/hive_parquet_table?profile=Hive')
format 'custom' (formatter='pxfwritable_import');
```
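You can then query the external table like any other HAWQ table; for example:
``` pre
SELECT fname, lname FROM pxf_parquet_table WHERE acctbalance > 100.0;
```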