title: Load Data Using PXF

Load Data Using the Platform Extension Framework (PXF)

The Apache Cloudberry Platform Extension Framework (PXF) is an extensible framework that allows Apache Cloudberry to query external data files whose metadata is not managed by the database. You can use its pxf:// protocol to access data residing in object store systems (Azure, Google Cloud Storage, Minio, S3), external Hadoop systems (HDFS, Hive, HBase), and SQL databases.

PXF includes built-in connectors for accessing data that exists inside HDFS files, Hive tables, HBase tables, and JDBC-accessible databases. Users can also create their own connectors to other data storage or processing engines.

The version of PXF used in Apache Cloudberry is forked from the Greenplum PXF project and has been specifically adapted for it. The source code for this adapted version is hosted by the Apache Software Foundation at apache/cloudberry-pxf.

The PXF pxf protocol is packaged as an Apache Cloudberry extension that supports both reading from and writing to external data stores. The framework includes a C-language extension and a Java service. After you configure and initialize PXF, you start a single PXF JVM process on each Apache Cloudberry segment host. This long-running process concurrently serves multiple query requests.

Before using the pxf protocol, you must explicitly initialize and start the PXF service. You must also enable PXF in each database where it will be used and grant permissions on the pxf protocol to the relevant Apache Cloudberry users.

Install and Build PXF

PXF is a component that must be built and installed separately. The source code and detailed instructions for building, installing, and developing PXF are available in the official apache/cloudberry-pxf repository. Before proceeding with the configuration, follow the development guide in the repository to set up PXF.

Build and Installation Guide: apache/cloudberry-pxf README

Configure PXF

Before you can use PXF to access external data, you must configure and initialize the service.

Initialize and Start PXF

To use PXF, you must first initialize it. This process creates the necessary configuration directory structure and files on each Apache Cloudberry segment host. After initialization, you need to start the PXF service.

Add the PXF binary directory to your PATH. This step ensures that you can run pxf commands from any location.
```
export PATH=/usr/local/pxf/bin:$PATH
```
You can add this line to your .bashrc or .zshrc file to make it permanent.
Initialize PXF. The prepare command sets up the PXF configuration. This command only needs to be run once.
```
pxf prepare
```
Start the PXF service. This command starts a Java process on each segment host that acts as the external data coordinator.
```
pxf start
```

Enable PXF in a database

You must enable PXF in each database in which you want to use the pxf protocol to access external data.

To enable PXF in a database, you must create the PXF extension in the database:

CREATE EXTENSION pxf;

The pxf protocol is packaged as an extension with Apache Cloudberry, and you must explicitly enable it in each database in which you plan to use it.

Grant access to PXF

To allow non-superuser Apache Cloudberry roles to create external tables using the pxf protocol, you must grant SELECT privileges on the pxf protocol to each role:

GRANT SELECT ON PROTOCOL pxf TO <role_name>;

To allow non-superuser Apache Cloudberry roles to create writable external tables using the pxf protocol, you must grant INSERT privileges on the pxf protocol to each role:

GRANT INSERT ON PROTOCOL pxf TO <role_name>;

PXF configuration overview

PXF configuration includes server configuration and connector configuration. A PXF server is a named configuration that provides access credentials and other information required to access an external data source. A PXF connector is the interface between PXF and the external data source.

Example: configure S3 connector

Create a PXF configuration for accessing S3 data:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.s3a.access.key</name>
        <value>YOUR_ACCESS_KEY</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>YOUR_SECRET_KEY</value>
    </property>
    <property>
        <name>fs.s3a.endpoint</name>
        <value>s3.amazonaws.com</value>
    </property>
</configuration>

Example: configure JDBC connector

For accessing SQL databases:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>jdbc.driver</name>
        <value>org.postgresql.Driver</value>
    </property>
    <property>
        <name>jdbc.url</name>
        <value>jdbc:postgresql://hostname:5432/database</value>
    </property>
    <property>
        <name>jdbc.user</name>
        <value>username</value>
    </property>
    <property>
        <name>jdbc.password</name>
        <value>password</value>
    </property>
</configuration>

Step 3. Create PXF external tables

PXF external tables use the pxf:// protocol in the LOCATION clause. The URL format varies depending on the data source and connector.

Read from HDFS

Access text files stored in HDFS:

CREATE EXTERNAL TABLE sales_hdfs (
    transaction_id int,
    product_name text,
    sale_date date,
    amount decimal(10,2)
)
LOCATION ('pxf://hdfs-namenode:8020/data/sales/sales.txt?PROFILE=hdfs:text')
FORMAT 'TEXT' (DELIMITER '|');

Read from Hive tables

Access Hive tables directly:

CREATE EXTERNAL TABLE hive_sales (
    transaction_id int,
    product_name text,
    sale_date date,
    amount decimal(10,2)
)
LOCATION ('pxf://hive-metastore:9083/sales_db.sales_table?PROFILE=hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

Read from Parquet files

Access Parquet files in object storage:

CREATE EXTERNAL TABLE parquet_data (
    id bigint,
    name text,
    created_date date
)
LOCATION ('pxf://s3a://my-bucket/data/events.parquet?PROFILE=s3:parquet&SERVER=s3-server')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

Read from SQL databases

Query external SQL databases:

CREATE EXTERNAL TABLE external_customers (
    customer_id int,
    customer_name text,
    email text,
    registration_date date
)
LOCATION ('pxf://postgresql-server/customers?PROFILE=jdbc&SERVER=postgres-server')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

Write data with PXF

Create writable external tables to export data:

CREATE WRITABLE EXTERNAL TABLE export_to_hdfs (
    transaction_id int,
    product_name text,
    sale_date date,
    amount decimal(10,2)
)
LOCATION ('pxf://hdfs-namenode:8020/exports/sales_export?PROFILE=hdfs:text')
FORMAT 'TEXT' (DELIMITER '|')
DISTRIBUTED BY (transaction_id);

PXF URL format

The PXF protocol URL follows this syntax:

pxf://<host>[:<port>]/<path-to-data>?PROFILE=<profile_name>[&<custom-option>=<value>][&SERVER=<server_name>]

Where:

host:port: Location of the external data source
path-to-data: Path to the specific data (file, directory, table, etc.)
PROFILE: PXF connector profile (e.g., hdfs:text, hive, s3:parquet, jdbc)
SERVER: Named server configuration (optional)
custom-option: Additional connector-specific options

Common PXF profiles

Profile	Data Source	Format	Use Case
`hdfs:text`	HDFS	Text files	Delimited text data
`hdfs:avro`	HDFS	Avro files	Schema evolution support
`hdfs:parquet`	HDFS	Parquet files	Columnar analytics
`hdfs:orc`	HDFS	ORC files	Optimized row columnar
`hive`	Hive	Various	Hive table access
`hbase`	HBase	HBase	NoSQL data access
`s3:text`	S3	Text files	Cloud object storage
`s3:parquet`	S3	Parquet files	Cloud analytics
`jdbc`	SQL Database	Various	External database access

Performance optimization

Partition pruning

PXF supports partition pruning for Hive tables:

SELECT * FROM hive_sales 
WHERE sale_date >= '2024-01-01' AND sale_date < '2024-02-01';

Predicate pushdown

Enable predicate pushdown for better performance:

-- This filter can be pushed down to the external source
SELECT * FROM external_customers 
WHERE registration_date > '2024-01-01';

Parallel processing

Leverage multiple files for better parallelism:

CREATE EXTERNAL TABLE multi_file_data (
    id bigint,
    data text
)
LOCATION ('pxf://hdfs-namenode:8020/data/partitioned/*?PROFILE=hdfs:text')
FORMAT 'TEXT' (DELIMITER ',');

Error handling

Enable error logging for data quality monitoring:

CREATE EXTERNAL TABLE sales_with_errors (
    transaction_id int,
    product_name text,
    sale_date date,
    amount decimal(10,2)
)
LOCATION ('pxf://hdfs-namenode:8020/data/sales/?PROFILE=hdfs:text')
FORMAT 'TEXT' (DELIMITER '|')
LOG ERRORS SEGMENT REJECT LIMIT 50;

Best practices

Select the most specific PXF profile that matches your data format and storage system to ensure optimal performance and feature support.
Create reusable named server configurations to centralize connection parameters and credentials, making it easier to manage multiple external data sources.
Structure your external data using file sizes between 100MB and 1GB to achieve optimal parallel processing across Apache Cloudberry segments.
Leverage compressed storage formats such as Parquet or ORC to reduce I/O overhead and improve query performance.
Implement comprehensive error logging and monitoring in production environments to quickly identify and resolve data access issues.
Establish proper authentication mechanisms and enable encryption for all connections to sensitive external data sources.

Limitations

PXF requires the external data source to be accessible from all segment hosts.
Some data sources may have specific version compatibility requirements.
Write operations depend on the connector's capabilities.
Performance can be affected by network latency to external sources.

Troubleshooting

Check PXF service status

pxf cluster status

View PXF logs

pxf cluster logs

Test connectivity

SELECT * FROM pxf_external_table LIMIT 5;

Learn More

For more details about the Apache Cloudberry Platform Extension Framework (PXF), please refer to the official PXF project repository, which contains the source code, documentation, and contribution guidelines.

Apache Cloudberry PXF Project