---
title: Accessing Hive Data
---
This topic describes how to access Hive data using PXF. You have two options for querying data stored in Hive: you can create PXF external tables and query those tables, or you can query Hive tables directly by using HAWQ and PXF's integration with HCatalog. With HCatalog integration, HAWQ accesses Hive table metadata stored in HCatalog.
- **[Prerequisites](../pxf/HivePXF.html#installingthepxfhiveplugin)**
- **[Hive Command Line](../pxf/HivePXF.html#hivecommandline)**
- **[Using PXF Tables to Query Hive](../pxf/HivePXF.html#topic_p2s_lvl_25)**
- **[Using PXF and HCatalog to Query Hive](../pxf/HivePXF.html#hcatalog)**
- **[Partition Filtering](../pxf/HivePXF.html#partitionfiltering)**
- **[Using PXF with Hive Default Partitions](../pxf/HivePXF.html#topic_fdm_zrh_1s)**
- **[Accessing Hive Tables in Parquet Format](../pxf/HivePXF.html#topic_dbb_nz3_ts)**
## Prerequisites<a id="installingthepxfhiveplugin"></a>
Check the following before using PXF to access Hive:
- The PXF HDFS plugin is installed on all cluster nodes.
- The PXF Hive plugin is installed on all cluster nodes.
- The Hive JAR files and conf directory are installed on all cluster nodes.
- You have tested PXF on HDFS before connecting to Hive or HBase.
- You are running the Hive Metastore service on a machine in your cluster. 
- You have set the `hive.metastore.uris` property in the `hive-site.xml` file on the NameNode (see the example below).
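For reference, a minimal sketch of the relevant `hive-site.xml` entry. The hostname is a placeholder, and 9083 is the conventional default port for the Hive Metastore Thrift service:
``` pre
<property>
    <name>hive.metastore.uris</name>
    <!-- replace metastore-host with the machine running the Hive Metastore service -->
    <value>thrift://metastore-host:9083</value>
</property>
```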
## Hive Command Line<a id="hivecommandline"></a>
To start the Hive command line and work directly on a Hive table:
``` pre
$ ${HIVE_HOME}/bin/hive
```
Here is an example of how to create and load data into a sample Hive table from an existing file.
``` pre
hive> CREATE TABLE test (name string, type string, supplier_key int, full_price double) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH '/local/path/data.txt' INTO TABLE test;
```
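Because the `test` table is comma-delimited, a hypothetical `/local/path/data.txt` would hold one comma-separated record per line, matching the column order `name, type, supplier_key, full_price`:
``` pre
widget,consumer,1001,5.25
gadget,industrial,1002,19.99
```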
## Using PXF Tables to Query Hive<a id="topic_p2s_lvl_25"></a>
PXF defines Hive tables in a uniform way, regardless of the underlying file storage format. The PXF Hive plugin automatically detects source tables in the following formats:
- Text based
- SequenceFile
- RCFile
- ORCFile
- Parquet
- Avro
The source table can also be a combination of these types. The PXF Hive plugin uses this information to query the data at runtime.
- **[Syntax Example](../pxf/HivePXF.html#syntax2)**
- **[Hive Complex Types](../pxf/HivePXF.html#topic_b4v_g3n_25)**
### Syntax Example<a id="syntax2"></a>
The following PXF table definition is valid for any Hive file storage type.
``` pre
CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name
    ( column_name data_type [, ...] | LIKE other_table )
LOCATION ('pxf://namenode[:port]/hive-table-name?<pxf parameters>[&custom-option=value...]')
FORMAT 'CUSTOM' (formatter='pxfwritable_import')
```
where `<pxf parameters>` is:
``` pre
FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class
| PROFILE=profile-name
```
**Note:** The port is the connection port for the PXF service. If the port is omitted, PXF assumes that High Availability (HA) is enabled and connects to the HA name service port, 51200 by default. The HA name service port can be changed by setting the `pxf_service_port` configuration parameter.
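Instead of a profile, you can name the plugin classes explicitly. The following is a sketch only; the class names shown are assumptions based on the PXF Hive plugin package layout, so verify them against your PXF distribution:
``` pre
-- Explicit fragmenter/accessor/resolver classes instead of PROFILE=Hive
CREATE EXTERNAL TABLE hivetest_explicit(id int, newid int)
LOCATION ('pxf://namenode:51200/hive-table-name?FRAGMENTER=org.apache.hawq.pxf.plugins.hive.HiveDataFragmenter&ACCESSOR=org.apache.hawq.pxf.plugins.hive.HiveAccessor&RESOLVER=org.apache.hawq.pxf.plugins.hive.HiveResolver')
FORMAT 'custom' (formatter='pxfwritable_import');
```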
PXF has three built-in profiles for Hive tables:
- Hive
- HiveRC
- HiveText
The Hive profile works with any Hive storage type. Use HiveRC and HiveText to query RC and Text formats respectively. The HiveRC and HiveText profiles are faster than the generic Hive profile. When using the HiveRC and HiveText profiles, you must specify a DELIMITER option in the LOCATION clause. See [Using Profiles to Read and Write Data](ReadWritePXF.html#readingandwritingdatawithpxf) for more information on profiles.
The following example creates a readable external table for a Hive table named `/user/eddie/test` using the PXF Hive profile:
``` pre
CREATE EXTERNAL TABLE hivetest(id int, newid int) 
LOCATION ('pxf://namenode:51200/user/eddie/test?PROFILE=Hive')
FORMAT 'custom' (formatter='pxfwritable_import');
```
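When a profile requires a delimiter, it appears in both the LOCATION and FORMAT clauses. A minimal HiveText sketch, assuming a comma-delimited Hive table named `hive_text_table` (`%2C` is the URL-encoded comma):
``` pre
CREATE EXTERNAL TABLE hivetexttest(id int, name text)
LOCATION ('pxf://namenode:51200/hive_text_table?PROFILE=HiveText&DELIMITER=%2C')
FORMAT 'TEXT' (delimiter=E',');
```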
### Hive Complex Types<a id="topic_b4v_g3n_25"></a>
PXF tables support Hive data types that are not primitive types. The supported Hive complex data types are array, struct, map, and union. This Hive `CREATE TABLE` statement, for example, creates a table with each of these complex data types:
``` pre
CREATE TABLE sales_collections (
item STRING,
price FLOAT,
properties ARRAY<STRING>,
hash MAP<STRING,FLOAT>,
delivery_address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>,
others UNIONTYPE<FLOAT, BOOLEAN, STRING>
)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\001' COLLECTION ITEMS TERMINATED BY '\002' MAP KEYS TERMINATED BY '\003' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/local/path/<some data file>' INTO TABLE sales_collections;
```
You can use HAWQ functions or application code to extract the components of the complex data columns as needed.
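Complex columns arrive in HAWQ as text, so one approach is a sketch like the following, which maps the complex columns of `sales_collections` to `TEXT` and extracts components with ordinary string functions (the column mapping and the parsing shown are illustrative assumptions):
``` pre
CREATE EXTERNAL TABLE pxf_sales_collections (
    item TEXT,
    price FLOAT8,
    properties TEXT,
    hash TEXT,
    delivery_address TEXT,
    others TEXT
)
LOCATION ('pxf://namenode:51200/sales_collections?PROFILE=Hive')
FORMAT 'custom' (formatter='pxfwritable_import');

-- Pull one component out of the serialized struct text.
SELECT item, split_part(delivery_address, ',', 2) AS city
FROM pxf_sales_collections;
```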
## Using PXF and HCatalog to Query Hive<a id="hcatalog"></a>
Hive tables can be queried directly through HCatalog integration with HAWQ and PXF, regardless of the underlying file storage format.
Previously, in order to query Hive tables using HAWQ and PXF, you needed to create an external table in PXF that described the target table's Hive metadata. Since HAWQ is now integrated with HCatalog, HAWQ can use metadata stored in HCatalog instead of external tables created for PXF. HCatalog is built on top of the Hive metastore and incorporates Hive's DDL. This provides several advantages:
- You do not need to know the table schema of your Hive tables.
- You do not need to manually enter information about Hive table location or format.
- If Hive table metadata changes, HCatalog provides updated metadata. This is in contrast to the use of static external PXF tables to define Hive table metadata for HAWQ.
The following diagram depicts how HAWQ integrates with HCatalog to query Hive tables:
<img src="../images/hawq_hcatalog.png" id="hcatalog__image_ukw_h2v_c5" class="image" width="672" />
1. HAWQ retrieves table metadata from HCatalog using PXF.
2. HAWQ creates in-memory catalog tables from the retrieved metadata. If a table is referenced multiple times in a transaction, HAWQ uses its in-memory metadata to reduce external calls to HCatalog.
3. PXF queries Hive using table metadata that is stored in the HAWQ in-memory catalog tables. Table metadata is dropped at the end of the transaction.
- **[Usage](../pxf/HivePXF.html#topic_j1l_y55_c5)**
- **[Limitations](../pxf/HivePXF.html#topic_r5k_pst_25)**
### Usage<a id="topic_j1l_y55_c5"></a>
To enable HCatalog query integration in HAWQ, perform the following steps:
1. Make sure your deployment meets the requirements listed in [Prerequisites](#installingthepxfhiveplugin).
2. If necessary, set the `pxf_service_address` global configuration property to a hostname or IP address and port where you have installed the PXF Hive plugin. By default, the value is set to `localhost:51200`.
``` pre
SET pxf_service_address TO "hivenode:51200";
```
3. To query a Hive table with HCatalog integration, simply query HCatalog directly. The query syntax is:
``` pre
SELECT * FROM hcatalog.hive-db-name.hive-table-name;
```
For example:
``` pre
SELECT * FROM hcatalog.default.sales;
```
4. To obtain a description of a Hive table with HCatalog integration, you can use the `psql` client interface.
- Within HAWQ, use either the `\d hcatalog.hive-db-name.hive-table-name` or `\d+ hcatalog.hive-db-name.hive-table-name` command to describe a single table. For example, from the `psql` client interface:
``` pre
$ psql -d postgres
postgres=# \d hcatalog.default.test
PXF Hive Table "default.test"
Column | Type
--------------+--------
name | text
type | text
supplier_key | int4
full_price | float8
```
- Use `\d hcatalog.hive-db-name.*` to describe the whole database schema. For example:
``` pre
postgres=# \d hcatalog.default.*
PXF Hive Table "default.test"
Column | Type
--------------+--------
type | text
name | text
supplier_key | int4
full_price | float8
PXF Hive Table "default.testabc"
Column | Type
--------+------
type | text
name | text
```
- Use `\d hcatalog.*.*` to describe the whole schema:
``` pre
postgres=# \d hcatalog.*.*
PXF Hive Table "default.test"
Column | Type
--------------+--------
type | text
name | text
supplier_key | int4
full_price | float8
PXF Hive Table "default.testabc"
Column | Type
--------+------
type | text
name | text
PXF Hive Table "userdb.test"
Column | Type
----------+------
address | text
username | text
```
**Note:** When using `\d` or `\d+` commands in the `psql` HAWQ client, `hcatalog` will not be listed as a database. If you use other `psql` compatible clients, `hcatalog` will be listed as a database with a size value of `-1` since `hcatalog` is not a real database in HAWQ.
5. Alternatively, you can use the **pxf\_get\_item\_fields** user-defined function (UDF) to obtain Hive table descriptions from other client interfaces or third-party applications. The UDF takes a PXF profile and a table pattern string as its input parameters.
**Note:** Currently the only supported input profile is `'Hive'`.
For example, the following statement returns a description of a specific table. The description includes path, itemname (table), fieldname, and fieldtype.
``` pre
postgres=# select * from pxf_get_item_fields('Hive','default.test');
path | itemname | fieldname | fieldtype
---------+----------+--------------+-----------
default | test | name | text
default | test | type | text
default | test | supplier_key | int4
default | test | full_price | float8
(4 rows)
```
The following statement returns table descriptions from the default database.
``` pre
postgres=# select * from pxf_get_item_fields('Hive','default.*');
path | itemname | fieldname | fieldtype
---------+----------+--------------+-----------
default | test | name | text
default | test | type | text
default | test | supplier_key | int4
default | test | full_price | float8
default | testabc | name | text
default | testabc | type | text
(6 rows)
```
The following statement returns a description of the entire schema.
``` pre
postgres=# select * from pxf_get_item_fields('Hive', '*.*');
path | itemname | fieldname | fieldtype
---------+----------+--------------+-----------
default | test | name | text
default | test | type | text
default | test | supplier_key | int4
default | test | full_price | float8
default | testabc | name | text
default | testabc | type | text
userdb | test | username | text
userdb | test | address | text
(8 rows)
```
### Limitations<a id="topic_r5k_pst_25"></a>
HCatalog integration has the following limitations:
- HCatalog integration queries and describe commands do not support complex types; only primitive types are supported. Use PXF external tables to query complex types in Hive instead. (See [Hive Complex Types](#topic_b4v_g3n_25) for an example.)
- Even for primitive types, HCatalog metadata descriptions produced by `\d` and `\d+` are converted to HAWQ types. For example, the Hive type `tinyint` is converted to HAWQ type `int2`. (See [Data Types](../reference/HAWQDataTypes.html) for a list of data types in HAWQ.)
- HAWQ reserves the database name "hcatalog" for system use. You cannot connect to or alter the system "hcatalog" database.
## Partition Filtering<a id="partitionfiltering"></a>
The PXF Hive plugin uses the Hive partitioning feature and directory structure. This enables partition exclusion on HDFS files that contain the Hive table. To use the partition filtering feature to reduce network traffic and I/O, run a PXF query using a WHERE clause that refers to a specific partition in the partitioned Hive table.
To take advantage of PXF partition filtering push-down, name the partition fields in the external table. These names must match the names stored in the Hive table. Otherwise, PXF ignores partition filtering and the filtering is performed on the HAWQ side, which impacts performance.
**Note:** The Hive plugin only filters on partition columns, not on other table attributes.
### Example<a id="example2"></a>
Create a Hive table `sales_part` with two partition columns, `delivery_state` and `delivery_city`:
``` pre
CREATE TABLE sales_part (name STRING, type STRING, supplier_key INT, price DOUBLE)
PARTITIONED BY (delivery_state STRING, delivery_city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```
Load data into this Hive table and add some partitions:
``` pre
LOAD DATA LOCAL INPATH '/local/path/data1.txt' INTO TABLE sales_part PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'San Francisco');
LOAD DATA LOCAL INPATH '/local/path/data2.txt' INTO TABLE sales_part PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Sacramento');
LOAD DATA LOCAL INPATH '/local/path/data3.txt' INTO TABLE sales_part PARTITION(delivery_state = 'NEVADA' , delivery_city = 'Reno');
LOAD DATA LOCAL INPATH '/local/path/data4.txt' INTO TABLE sales_part PARTITION(delivery_state = 'NEVADA' , delivery_city = 'Las Vegas');
```
The Hive storage directory appears as follows:
``` pre
/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=San Francisco/data1.txt
/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=Sacramento/data2.txt
/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=Reno/data3.txt
/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=Las Vegas/data4.txt
```
To define a PXF table that will read this Hive table and take advantage of partition filter push-down, define the fields corresponding to the Hive partition fields at the end of the attribute list. In HiveQL, a `select *` statement on a partitioned table shows the partition fields at the end of the record.
``` pre
CREATE EXTERNAL TABLE pxf_sales_part(
item_name TEXT,
item_type TEXT,
supplier_key INTEGER,
item_price DOUBLE PRECISION,
delivery_state TEXT,
delivery_city TEXT
)
LOCATION ('pxf://namenode_host:51200/sales_part?Profile=Hive')
FORMAT 'custom' (FORMATTER='pxfwritable_import');
SELECT * FROM pxf_sales_part;
```
### Example<a id="example3"></a>
In the following example, the HAWQ query filters on the `delivery_city` partition value `Sacramento`. The filter on `item_name` is not pushed down, since it is not a partition column; it is applied on the HAWQ side after all the data for `Sacramento` is transferred for processing.
``` pre
SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name = 'shirt';
```
### Example<a id="example4"></a>
The following HAWQ query reads all the data under `delivery_state` partition `CALIFORNIA`, regardless of the city.
``` pre
SELECT * FROM pxf_sales_part WHERE delivery_state = 'CALIFORNIA';
```
## Using PXF with Hive Default Partitions<a id="topic_fdm_zrh_1s"></a>
This topic describes a difference in query results between Hive and PXF queries when Hive tables use a default partition. When dynamic partitioning is enabled in Hive, a partitioned table may store data in a default partition. Hive creates a default partition when the value of a partitioning column does not match the defined type of the column (for example, when a NULL value is used for any partitioning column). In Hive, any query that includes a filter on a partition column *excludes* any data that is stored in the table's default partition.
Similar to Hive, PXF represents a table's partitioning columns as columns that are appended to the end of the table. However, PXF translates any column value in a default partition to a NULL value. This means that a HAWQ query that includes an IS NULL filter on a partitioning column can return different results than the same Hive query.
### Example<a id="topic_g4r_4wh_1s"></a>
Consider a Hive partitioned table that is created with the statement:
``` pre
CREATE TABLE sales (order_id bigint, order_amount float) PARTITIONED BY (date date);
```
The table is loaded with five rows that contain the following data:
``` pre
1 1.0 1900-01-01
2 2.2 1994-04-14
3 3.3 2011-03-31
4 4.5 NULL
5 5.0 2013-12-06
```
In this case, the insertion of row 4 creates a Hive default partition, because the partition column "date" contains a null value.
In Hive, any query that filters on the partition column omits data in the default partition. For example, the following query returns no rows:
``` pre
hive> select * from sales where date is null;
```
However, if you map this table as a PXF external table in HAWQ, all default partition values are translated into actual NULL values. In HAWQ, executing the same query against the PXF table returns row 4 as the result, because the filter matches the NULL value.
Keep this behavior in mind when executing IS NULL queries on Hive partitioned tables.
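As a sketch, assuming the Hive table above is mapped to a PXF external table named `pxf_sales` (with the partition column last), the following HAWQ query would return row 4, unlike its Hive counterpart:
``` pre
-- Matches the default-partition row because PXF surfaces it as NULL
SELECT * FROM pxf_sales WHERE date IS NULL;
```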
## Accessing Hive Tables in Parquet Format<a id="topic_dbb_nz3_ts"></a>
The PXF Hive profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types. For example, in Hive if a table is created using:
``` pre
hive> create table hive_parquet_table (fname string, lname string, custid int, acctbalance double)
      stored as parquet;
```
Then you would define the HAWQ external table using:
``` pre
create external table pxf_parquet_table (fname text, lname text, custid int, acctbalance double precision)
location ('pxf://localhost:51200/hive_parquet_table?profile=Hive')
format 'custom' (formatter='pxfwritable_import');
```
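You can then query the external table like any other HAWQ table; for example:
``` pre
SELECT fname, lname FROM pxf_parquet_table WHERE acctbalance > 100.0;
```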