versioned_docs/version-2.x/data-loading/load-data-using-pxf.md - cloudberry-site - Git at Google

 ---
 title: Load Data Using PXF
 ---

 # Load Data Using the Platform Extension Framework (PXF)

 The Apache Cloudberry Platform Extension Framework (PXF) is an extensible framework that allows Apache Cloudberry to query external data files whose metadata is not managed by the database. You can use its `pxf://` protocol to access data residing in object store systems (Azure, Google Cloud Storage, Minio, S3), external Hadoop systems (HDFS, Hive, HBase), and SQL databases.

 PXF includes built-in connectors for accessing data that exists inside HDFS files, Hive tables, HBase tables, and JDBC-accessible databases. Users can also create their own connectors to other data storage or processing engines.

 The version of PXF used in Apache Cloudberry is forked from the Greenplum PXF project and has been specifically adapted for it. The source code for this adapted version is hosted by the Apache Software Foundation at [apache/cloudberry-pxf](https://github.com/apache/cloudberry-pxf).

 The PXF `pxf` protocol is packaged as an Apache Cloudberry extension that supports both reading from and writing to external data stores. The framework includes a C-language extension and a Java service. After you configure and initialize PXF, you start a single PXF JVM process on each Apache Cloudberry segment host. This long-running process concurrently serves multiple query requests.

 Before using the `pxf` protocol, you must explicitly initialize and start the PXF service. You must also enable PXF in each database where it will be used and grant permissions on the `pxf` protocol to the relevant Apache Cloudberry users.

 ## Install and Build PXF

 PXF is a component that must be built and installed separately. The source code and detailed instructions for building, installing, and developing PXF are available in the official `apache/cloudberry-pxf` repository. Before proceeding with the configuration, follow the development guide in the repository to set up PXF.

 **Build and Installation Guide**: [apache/cloudberry-pxf README](https://github.com/apache/cloudberry-pxf/blob/main/README.md)

 ## Configure PXF

 Before you can use PXF to access external data, you must configure and initialize the service.

 ### Initialize and Start PXF

 To use PXF, you must first initialize it. This process creates the necessary configuration directory structure and files on each Apache Cloudberry segment host. After initialization, you need to start the PXF service.

 1.  Add the PXF binary directory to your `PATH`. This step ensures that you can run `pxf` commands from any location.

     ```shell
     export PATH=/usr/local/pxf/bin:$PATH
     ```

     You can add this line to your `.bashrc` or `.zshrc` file to make it permanent.

 2.  Initialize PXF. The `prepare` command sets up the PXF configuration. This command only needs to be run once.

     ```shell
     pxf prepare
     ```

 3.  Start the PXF service. This command starts a Java process on each segment host that acts as the external data coordinator.

     ```shell
     pxf start
     ```

 ### Enable PXF in a database

 You must enable PXF in each database in which you want to use the `pxf` protocol to access external data.

 To enable PXF in a database, you must create the PXF extension in the database:

 ```sql
 CREATE EXTENSION pxf;
 ```

 The `pxf` protocol is packaged as an extension with Apache Cloudberry, and you must explicitly enable it in each database in which you plan to use it.

 ### Grant access to PXF

 To allow non-superuser Apache Cloudberry roles to create external tables using the `pxf` protocol, you must grant `SELECT` privileges on the `pxf` protocol to each role:

 ```sql
 GRANT SELECT ON PROTOCOL pxf TO <role_name>;
 ```

 To allow non-superuser Apache Cloudberry roles to create writable external tables using the `pxf` protocol, you must grant `INSERT` privileges on the `pxf` protocol to each role:

 ```sql
 GRANT INSERT ON PROTOCOL pxf TO <role_name>;
 ```

 ## PXF configuration overview

 PXF configuration includes server configuration and connector configuration. A PXF *server* is a named configuration that provides access credentials and other information required to access an external data source. A PXF *connector* is the interface between PXF and the external data source.

 ### Example: configure S3 connector

 Create a PXF configuration for accessing S3 data:

 ```xml
 <?xml version="1.0" encoding="UTF-8"?>
 <configuration>
     <property>
         <name>fs.s3a.access.key</name>
         <value>YOUR_ACCESS_KEY</value>
     </property>
     <property>
         <name>fs.s3a.secret.key</name>
         <value>YOUR_SECRET_KEY</value>
     </property>
     <property>
         <name>fs.s3a.endpoint</name>
         <value>s3.amazonaws.com</value>
     </property>
 </configuration>
 ```

 ### Example: configure JDBC connector

 For accessing SQL databases:

 ```xml
 <?xml version="1.0" encoding="UTF-8"?>
 <configuration>
     <property>
         <name>jdbc.driver</name>
         <value>org.postgresql.Driver</value>
     </property>
     <property>
         <name>jdbc.url</name>
         <value>jdbc:postgresql://hostname:5432/database</value>
     </property>
     <property>
         <name>jdbc.user</name>
         <value>username</value>
     </property>
     <property>
         <name>jdbc.password</name>
         <value>password</value>
     </property>
 </configuration>
 ```

 ## Step 3. Create PXF external tables

 PXF external tables use the `pxf://` protocol in the `LOCATION` clause. The URL format varies depending on the data source and connector.

 ### Read from HDFS

 Access text files stored in HDFS:

 ```sql
 CREATE EXTERNAL TABLE sales_hdfs (
     transaction_id int,
     product_name text,
     sale_date date,
     amount decimal(10,2)
 )
 LOCATION ('pxf://hdfs-namenode:8020/data/sales/sales.txt?PROFILE=hdfs:text')
 FORMAT 'TEXT' (DELIMITER '|');
 ```

 ### Read from Hive tables

 Access Hive tables directly:

 ```sql
 CREATE EXTERNAL TABLE hive_sales (
     transaction_id int,
     product_name text,
     sale_date date,
     amount decimal(10,2)
 )
 LOCATION ('pxf://hive-metastore:9083/sales_db.sales_table?PROFILE=hive')
 FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
 ```

 ### Read from Parquet files

 Access Parquet files in object storage:

 ```sql
 CREATE EXTERNAL TABLE parquet_data (
     id bigint,
     name text,
     created_date date
 )
 LOCATION ('pxf://s3a://my-bucket/data/events.parquet?PROFILE=s3:parquet&SERVER=s3-server')
 FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
 ```

 ### Read from SQL databases

 Query external SQL databases:

 ```sql
 CREATE EXTERNAL TABLE external_customers (
     customer_id int,
     customer_name text,
     email text,
     registration_date date
 )
 LOCATION ('pxf://postgresql-server/customers?PROFILE=jdbc&SERVER=postgres-server')
 FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
 ```

 ### Write data with PXF

 Create writable external tables to export data:

 ```sql
 CREATE WRITABLE EXTERNAL TABLE export_to_hdfs (
     transaction_id int,
     product_name text,
     sale_date date,
     amount decimal(10,2)
 )
 LOCATION ('pxf://hdfs-namenode:8020/exports/sales_export?PROFILE=hdfs:text')
 FORMAT 'TEXT' (DELIMITER '|')
 DISTRIBUTED BY (transaction_id);
 ```

 ## PXF URL format

 The PXF protocol URL follows this syntax:

 ```
 pxf://<host>[:<port>]/<path-to-data>?PROFILE=<profile_name>[&<custom-option>=<value>][&SERVER=<server_name>]
 ```

 Where:
 - `host:port`: Location of the external data source
 - `path-to-data`: Path to the specific data (file, directory, table, etc.)
 - `PROFILE`: PXF connector profile (e.g., `hdfs:text`, `hive`, `s3:parquet`, `jdbc`)
 - `SERVER`: Named server configuration (optional)
 - `custom-option`: Additional connector-specific options

 ## Common PXF profiles

 | Profile | Data Source | Format | Use Case |
 |---------|-------------|--------|----------|
 | `hdfs:text` | HDFS | Text files | Delimited text data |
 | `hdfs:avro` | HDFS | Avro files | Schema evolution support |
 | `hdfs:parquet` | HDFS | Parquet files | Columnar analytics |
 | `hdfs:orc` | HDFS | ORC files | Optimized row columnar |
 | `hive` | Hive | Various | Hive table access |
 | `hbase` | HBase | HBase | NoSQL data access |
 | `s3:text` | S3 | Text files | Cloud object storage |
 | `s3:parquet` | S3 | Parquet files | Cloud analytics |
 | `jdbc` | SQL Database | Various | External database access |

 ## Performance optimization

 ### Partition pruning

 PXF supports partition pruning for Hive tables:

 ```sql
 SELECT * FROM hive_sales
 WHERE sale_date >= '2024-01-01' AND sale_date < '2024-02-01';
 ```

 ### Predicate pushdown

 Enable predicate pushdown for better performance:

 ```sql
 -- This filter can be pushed down to the external source
 SELECT * FROM external_customers
 WHERE registration_date > '2024-01-01';
 ```

 ### Parallel processing

 Leverage multiple files for better parallelism:

 ```sql
 CREATE EXTERNAL TABLE multi_file_data (
     id bigint,
     data text
 )
 LOCATION ('pxf://hdfs-namenode:8020/data/partitioned/*?PROFILE=hdfs:text')
 FORMAT 'TEXT' (DELIMITER ',');
 ```

 ## Error handling

 Enable error logging for data quality monitoring:

 ```sql
 CREATE EXTERNAL TABLE sales_with_errors (
     transaction_id int,
     product_name text,
     sale_date date,
     amount decimal(10,2)
 )
 LOCATION ('pxf://hdfs-namenode:8020/data/sales/?PROFILE=hdfs:text')
 FORMAT 'TEXT' (DELIMITER '|')
 LOG ERRORS SEGMENT REJECT LIMIT 50;
 ```

 ## Best practices

 - Select the most specific PXF profile that matches your data format and storage system to ensure optimal performance and feature support.
 - Create reusable named server configurations to centralize connection parameters and credentials, making it easier to manage multiple external data sources.
 - Structure your external data using file sizes between 100MB and 1GB to achieve optimal parallel processing across Apache Cloudberry segments.
 - Leverage compressed storage formats such as Parquet or ORC to reduce I/O overhead and improve query performance.
 - Implement comprehensive error logging and monitoring in production environments to quickly identify and resolve data access issues.
 - Establish proper authentication mechanisms and enable encryption for all connections to sensitive external data sources.

 ## Limitations

 - PXF requires the external data source to be accessible from all segment hosts.
 - Some data sources may have specific version compatibility requirements.
 - Write operations depend on the connector's capabilities.
 - Performance can be affected by network latency to external sources.

 ## Troubleshooting

 ### Check PXF service status

 ```shell
 pxf cluster status
 ```

 ### View PXF logs

 ```shell
 pxf cluster logs
 ```

 ### Test connectivity

 ```sql
 SELECT * FROM pxf_external_table LIMIT 5;
 ```

 ## Learn More

 For more details about the Apache Cloudberry Platform Extension Framework (PXF), please refer to the official PXF project repository, which contains the source code, documentation, and contribution guidelines.

 - [Apache Cloudberry PXF Project](https://github.com/apache/cloudberry-pxf)
	---
	title: Load Data Using PXF
	---

	# Load Data Using the Platform Extension Framework (PXF)

	The Apache Cloudberry Platform Extension Framework (PXF) is an extensible framework that allows Apache Cloudberry to query external data files whose metadata is not managed by the database. You can use its `pxf://` protocol to access data residing in object store systems (Azure, Google Cloud Storage, Minio, S3), external Hadoop systems (HDFS, Hive, HBase), and SQL databases.

	PXF includes built-in connectors for accessing data that exists inside HDFS files, Hive tables, HBase tables, and JDBC-accessible databases. Users can also create their own connectors to other data storage or processing engines.

	The version of PXF used in Apache Cloudberry is forked from the Greenplum PXF project and has been specifically adapted for it. The source code for this adapted version is hosted by the Apache Software Foundation at [apache/cloudberry-pxf](https://github.com/apache/cloudberry-pxf).

	The PXF `pxf` protocol is packaged as an Apache Cloudberry extension that supports both reading from and writing to external data stores. The framework includes a C-language extension and a Java service. After you configure and initialize PXF, you start a single PXF JVM process on each Apache Cloudberry segment host. This long-running process concurrently serves multiple query requests.

	Before using the `pxf` protocol, you must explicitly initialize and start the PXF service. You must also enable PXF in each database where it will be used and grant permissions on the `pxf` protocol to the relevant Apache Cloudberry users.

	## Install and Build PXF

	PXF is a component that must be built and installed separately. The source code and detailed instructions for building, installing, and developing PXF are available in the official `apache/cloudberry-pxf` repository. Before proceeding with the configuration, follow the development guide in the repository to set up PXF.

	Build and Installation Guide: [apache/cloudberry-pxf README](https://github.com/apache/cloudberry-pxf/blob/main/README.md)

	## Configure PXF

	Before you can use PXF to access external data, you must configure and initialize the service.

	### Initialize and Start PXF

	To use PXF, you must first initialize it. This process creates the necessary configuration directory structure and files on each Apache Cloudberry segment host. After initialization, you need to start the PXF service.

	1. Add the PXF binary directory to your `PATH`. This step ensures that you can run `pxf` commands from any location.

	```shell
	export PATH=/usr/local/pxf/bin:$PATH
	```

	You can add this line to your `.bashrc` or `.zshrc` file to make it permanent.

	2. Initialize PXF. The `prepare` command sets up the PXF configuration. This command only needs to be run once.

	```shell
	pxf prepare
	```

	3. Start the PXF service. This command starts a Java process on each segment host that acts as the external data coordinator.

	```shell
	pxf start
	```

	### Enable PXF in a database

	You must enable PXF in each database in which you want to use the `pxf` protocol to access external data.

	To enable PXF in a database, you must create the PXF extension in the database:

	```sql
	CREATE EXTENSION pxf;
	```

	The `pxf` protocol is packaged as an extension with Apache Cloudberry, and you must explicitly enable it in each database in which you plan to use it.

	### Grant access to PXF

	To allow non-superuser Apache Cloudberry roles to create external tables using the `pxf` protocol, you must grant `SELECT` privileges on the `pxf` protocol to each role:

	```sql
	GRANT SELECT ON PROTOCOL pxf TO <role_name>;
	```

	To allow non-superuser Apache Cloudberry roles to create writable external tables using the `pxf` protocol, you must grant `INSERT` privileges on the `pxf` protocol to each role:

	```sql
	GRANT INSERT ON PROTOCOL pxf TO <role_name>;
	```

	## PXF configuration overview

	PXF configuration includes server configuration and connector configuration. A PXF server is a named configuration that provides access credentials and other information required to access an external data source. A PXF connector is the interface between PXF and the external data source.

	### Example: configure S3 connector

	Create a PXF configuration for accessing S3 data:

	```xml
	<?xml version="1.0" encoding="UTF-8"?>
	<configuration>
	<property>
	<name>fs.s3a.access.key</name>
	<value>YOUR_ACCESS_KEY</value>
	</property>
	<property>
	<name>fs.s3a.secret.key</name>
	<value>YOUR_SECRET_KEY</value>
	</property>
	<property>
	<name>fs.s3a.endpoint</name>
	<value>s3.amazonaws.com</value>
	</property>
	</configuration>
	```

	### Example: configure JDBC connector

	For accessing SQL databases:

	```xml
	<?xml version="1.0" encoding="UTF-8"?>
	<configuration>
	<property>
	<name>jdbc.driver</name>
	<value>org.postgresql.Driver</value>
	</property>
	<property>
	<name>jdbc.url</name>
	<value>jdbc:postgresql://hostname:5432/database</value>
	</property>
	<property>
	<name>jdbc.user</name>
	<value>username</value>
	</property>
	<property>
	<name>jdbc.password</name>
	<value>password</value>
	</property>
	</configuration>
	```

	## Step 3. Create PXF external tables

	PXF external tables use the `pxf://` protocol in the `LOCATION` clause. The URL format varies depending on the data source and connector.

	### Read from HDFS

	Access text files stored in HDFS:

	```sql
	CREATE EXTERNAL TABLE sales_hdfs (
	transaction_id int,
	product_name text,
	sale_date date,
	amount decimal(10,2)
	)
	LOCATION ('pxf://hdfs-namenode:8020/data/sales/sales.txt?PROFILE=hdfs:text')
	FORMAT 'TEXT' (DELIMITER '\|');
	```

	### Read from Hive tables

	Access Hive tables directly:

	```sql
	CREATE EXTERNAL TABLE hive_sales (
	transaction_id int,
	product_name text,
	sale_date date,
	amount decimal(10,2)
	)
	LOCATION ('pxf://hive-metastore:9083/sales_db.sales_table?PROFILE=hive')
	FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
	```

	### Read from Parquet files

	Access Parquet files in object storage:

	```sql
	CREATE EXTERNAL TABLE parquet_data (
	id bigint,
	name text,
	created_date date
	)
	LOCATION ('pxf://s3a://my-bucket/data/events.parquet?PROFILE=s3:parquet&SERVER=s3-server')
	FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
	```

	### Read from SQL databases

	Query external SQL databases:

	```sql
	CREATE EXTERNAL TABLE external_customers (
	customer_id int,
	customer_name text,
	email text,
	registration_date date
	)
	LOCATION ('pxf://postgresql-server/customers?PROFILE=jdbc&SERVER=postgres-server')
	FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
	```

	### Write data with PXF

	Create writable external tables to export data:

	```sql
	CREATE WRITABLE EXTERNAL TABLE export_to_hdfs (
	transaction_id int,
	product_name text,
	sale_date date,
	amount decimal(10,2)
	)
	LOCATION ('pxf://hdfs-namenode:8020/exports/sales_export?PROFILE=hdfs:text')
	FORMAT 'TEXT' (DELIMITER '\|')
	DISTRIBUTED BY (transaction_id);
	```

	## PXF URL format

	The PXF protocol URL follows this syntax:

	```
	pxf://<host>[:<port>]/<path-to-data>?PROFILE=<profile_name>[&<custom-option>=<value>][&SERVER=<server_name>]
	```

	Where:
	- `host:port`: Location of the external data source
	- `path-to-data`: Path to the specific data (file, directory, table, etc.)
	- `PROFILE`: PXF connector profile (e.g., `hdfs:text`, `hive`, `s3:parquet`, `jdbc`)
	- `SERVER`: Named server configuration (optional)
	- `custom-option`: Additional connector-specific options

	## Common PXF profiles

	\| Profile \| Data Source \| Format \| Use Case \|
	\|---------\|-------------\|--------\|----------\|
	\| `hdfs:text` \| HDFS \| Text files \| Delimited text data \|
	\| `hdfs:avro` \| HDFS \| Avro files \| Schema evolution support \|
	\| `hdfs:parquet` \| HDFS \| Parquet files \| Columnar analytics \|
	\| `hdfs:orc` \| HDFS \| ORC files \| Optimized row columnar \|
	\| `hive` \| Hive \| Various \| Hive table access \|
	\| `hbase` \| HBase \| HBase \| NoSQL data access \|
	\| `s3:text` \| S3 \| Text files \| Cloud object storage \|
	\| `s3:parquet` \| S3 \| Parquet files \| Cloud analytics \|
	\| `jdbc` \| SQL Database \| Various \| External database access \|

	## Performance optimization

	### Partition pruning

	PXF supports partition pruning for Hive tables:

	```sql
	SELECT * FROM hive_sales
	WHERE sale_date >= '2024-01-01' AND sale_date < '2024-02-01';
	```

	### Predicate pushdown

	Enable predicate pushdown for better performance:

	```sql
	-- This filter can be pushed down to the external source
	SELECT * FROM external_customers
	WHERE registration_date > '2024-01-01';
	```

	### Parallel processing

	Leverage multiple files for better parallelism:

	```sql
	CREATE EXTERNAL TABLE multi_file_data (
	id bigint,
	data text
	)
	LOCATION ('pxf://hdfs-namenode:8020/data/partitioned/*?PROFILE=hdfs:text')
	FORMAT 'TEXT' (DELIMITER ',');
	```

	## Error handling

	Enable error logging for data quality monitoring:

	```sql
	CREATE EXTERNAL TABLE sales_with_errors (
	transaction_id int,
	product_name text,
	sale_date date,
	amount decimal(10,2)
	)
	LOCATION ('pxf://hdfs-namenode:8020/data/sales/?PROFILE=hdfs:text')
	FORMAT 'TEXT' (DELIMITER '\|')
	LOG ERRORS SEGMENT REJECT LIMIT 50;
	```

	## Best practices

	- Select the most specific PXF profile that matches your data format and storage system to ensure optimal performance and feature support.
	- Create reusable named server configurations to centralize connection parameters and credentials, making it easier to manage multiple external data sources.
	- Structure your external data using file sizes between 100MB and 1GB to achieve optimal parallel processing across Apache Cloudberry segments.
	- Leverage compressed storage formats such as Parquet or ORC to reduce I/O overhead and improve query performance.
	- Implement comprehensive error logging and monitoring in production environments to quickly identify and resolve data access issues.
	- Establish proper authentication mechanisms and enable encryption for all connections to sensitive external data sources.

	## Limitations

	- PXF requires the external data source to be accessible from all segment hosts.
	- Some data sources may have specific version compatibility requirements.
	- Write operations depend on the connector's capabilities.
	- Performance can be affected by network latency to external sources.

	## Troubleshooting

	### Check PXF service status

	```shell
	pxf cluster status
	```

	### View PXF logs

	```shell
	pxf cluster logs
	```

	### Test connectivity

	```sql
	SELECT * FROM pxf_external_table LIMIT 5;
	```

	## Learn More

	For more details about the Apache Cloudberry Platform Extension Framework (PXF), please refer to the official PXF project repository, which contains the source code, documentation, and contribution guidelines.

	- [Apache Cloudberry PXF Project](https://github.com/apache/cloudberry-pxf)