blob: f5237c678cdd9cb0b4d12e7a2ce2c211a33ed683 [file]
---
title: Introduction to PXF
---
## <a id="tut_intropxfdisc"></a>Discussion
Many deployments do not store data directly in HAWQ internal tables, but instead reference files from external data sources. PXF provides access to this external data via built-in connectors called plug-ins. These plug-ins facilitate mapping the data source to a HAWQ external table definition. PXF includes HDFS, Hive, HBase, and JSON plug-ins. For additional information about PXF, refer to [Working with PXF and External Data](../../pxf/HawqExtensionFrameworkPXF.html).
In this exercise, you will load the sample Retail data set into HDFS. You will create HAWQ external table definitions using the PXF HDFS plug-in. After the external tables are created, you will perform queries on the HDFS data via HAWQ.
## <a id="tut_intropxfprereq"></a>Prerequisites
Ensure that you have a running HAWQ installation and have downloaded the tutorial work files, including the Retail example data set. Also make sure you have set up your HAWQ environment by logging in as the `gpadmin` user and source'ing `/usr/local/hawq/greenplum_path.sh`.
## <a id="tut_excreatepxftblsteps"></a>Steps
Perform the following steps to load the Retail data set into HDFS and create HAWQ external table definitions to access the HDFS data.
1. In a terminal window, navigate to the tutorial work files sample data directory.
``` shell
$ cd $HAWQTUTBASE/sample_data
```
2. Run the `load_data_to_HDFS.sh` script to load the sample data set files into HDFS. The data will be loaded to a `/retail_demo` directory in HDFS:
``` shell
$ ./load_data_to_HDFS.sh
```
`load_to_HDFS.sh` loads the sample data `.tsv.gz` files directly into HDFS. Each file is loaded to it's respective `/retail_demo/tablename/tablename.tsv.gz` path.
3. Run the PostgreSQL interpreter `psql`:
``` shell
$ psql
```
You will enter the `psql` interactive :
``` sql
gpadmin=#
```
4. Create the `retail_demo` schema if it has not already been created:
``` sql
gpadmin=# \dn
gpadmin=# create schema retail_demo;
```
5. Create a HAWQ external table definition to represent the Retail customer data and then exit `psql`:
``` sql
gpadmin=# CREATE EXTERNAL TABLE retail_demo.customers_dim_pxf
(
customer_id TEXT,
first_name TEXT,
last_name TEXT,
gender TEXT
)
LOCATION ('pxf://hostname:51200/retail_demo/customers_dim/customers_dim.tsv.gz?profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER = E'\t');
gpadmin=# \q
```
Substitute your sandbox VM IP address for the `hostname` in the `LOCATION` clause.
?? DESCRIBE THE MANY COMPONENTS OF THIS COMMAND AND WHAT IT DOES HERE ??
6. Create HAWQ external table definitions for the remainder of the Retail data set using the provided sql scripts:
``` shell
$ cd ../hawq/pxf_tables
$ psql
```
``` sql
gpadmin=# \i drop_pxf_tables.sql
gpadmin=# \i create_pxf_tables.sql
gpadmin=# \q
```
We drop the tables before creating. If this is your first time performing this exercise, ignore the `psql` "table does not exist, skipping" messages.
6. Verify the external table definitions were created successfully:
``` shell
$ ./verify_load_pxf_tables.sh
```
The output of the script should match the following:
```
Table Name | Count
-------------------------------+------------------------
customers\_dim\_pxf | 401430
categories\_dim\_pxf | 56
customer\_addresses\_dim\_pxf | 1130639
email\_addresses\_dim\_pxf | 401430
order\_lineitems\_pxf | 1024158
orders\_pxf | 512071
payment\_methods\_pxf | 5
products\_dim\_pxf | 698911
-------------------------------+------------------------
```
8. Determine the top ten postal codes by order revenue by running the following query on the `retail_demo.orders_hawq_pxf` table:
``` shell
$ psql
```
``` sql
gpadmin=# select billing_address_postal_code,
sum(total_paid_amount::float8) as total,
sum(total_tax_amount::float8) as tax
from retail_demo.orders_pxf
group by billing_address_postal_code
order by total desc limit 10;
```
The output should match the following:
``` shell
billing_address_postal_code | total | tax
-----------------------------+-----------+-----------
48001 | 111868.32 | 6712.0992
15329 | 107958.24 | 6477.4944
42714 | 103244.58 | 6194.6748
41030 | 101365.5 | 6081.93
50223 | 100511.64 | 6030.6984
03106 | 83566.41 | 0
57104 | 77383.63 | 3095.3452
23002 | 73673.66 | 3683.683
25703 | 68282.12 | 4096.9272
26178 | 66836.4 | 4010.184
(10 rows)
```
?? ADD SOME SUMMARY/WRAPUP INFO HERE ??