---
title: Load Data Using gpload
---
# Load Data into Apache Cloudberry Using `gpload`
The `gpload` utility loads data into Apache Cloudberry using readable external tables and the Apache Cloudberry parallel file server (`gpfdist`). It sets up the parallel file-based external tables for you and lets you configure the data format, external table definition, and `gpfdist` setup in a single configuration file.
:::tip
In `gpload`, `MERGE` and `UPDATE` operations are not supported if the target table column name is a reserved keyword, has capital letters, or includes any character that requires quotes `" "` to identify the column.
:::
## To use `gpload`
1. Ensure that your environment is set up to run `gpload`. It needs some files from your Apache Cloudberry installation, such as `gpfdist` and Python 3, and it needs network access to the Apache Cloudberry segment hosts. `gpload` also requires the following Python packages:
```shell
pip install psycopg2 pyyaml
```
2. Create your load control file. This is a YAML-formatted file that specifies the Apache Cloudberry connection information, gpfdist configuration information, external table options, and data format.
For example:
```yaml
---
VERSION: 1.0.0.1
DATABASE: ops
USER: gpadmin
HOST: cdw-1
PORT: 5432
GPLOAD:
  INPUT:
    - SOURCE:
        LOCAL_HOSTNAME:
          - etl1-1
          - etl1-2
          - etl1-3
          - etl1-4
        PORT: 8081
        FILE:
          - /var/load/data/*
    - COLUMNS:
        - name: text
        - amount: float4
        - category: text
        - descr: text
        - date: date
    - FORMAT: text
    - DELIMITER: '|'
    - ERROR_LIMIT: 25
    - LOG_ERRORS: true
  OUTPUT:
    - TABLE: payables.expenses
    - MODE: INSERT
  PRELOAD:
    - REUSE_TABLES: true
  # SQL:
  #   - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
  #   - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
```
3. Run `gpload`, passing in the load control file. For example:
```shell
gpload -f my_load.yml
```
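
The steps above can also be scripted. The following is a minimal Python sketch, not part of `gpload` itself. The helper names (`missing_prerequisites`, `build_gpload_command`, `run_load`) are hypothetical. The sketch assumes that `gpload` and `gpfdist` are on `PATH` once your Apache Cloudberry environment is set up, and it uses only the `-f` option shown in step 3.

```python
import importlib.util
import shutil
import subprocess
import sys


def missing_prerequisites():
    """Return descriptions of the step 1 prerequisites not found on this host."""
    missing = []
    # psycopg2 and PyYAML (imported as "yaml") are the pip packages from step 1.
    for module in ("psycopg2", "yaml"):
        if importlib.util.find_spec(module) is None:
            missing.append(f"python module: {module}")
    # Assumption: both executables are expected on PATH.
    for tool in ("gpfdist", "gpload"):
        if shutil.which(tool) is None:
            missing.append(f"executable: {tool}")
    return missing


def build_gpload_command(control_file):
    """Build the argument list equivalent to `gpload -f <control_file>`."""
    return ["gpload", "-f", str(control_file)]


def run_load(control_file):
    """Check prerequisites, then run gpload and return its exit code."""
    problems = missing_prerequisites()
    if problems:
        for problem in problems:
            print(f"missing {problem}", file=sys.stderr)
        return 1
    # check=False leaves interpretation of the exit status to the caller.
    completed = subprocess.run(build_gpload_command(control_file), check=False)
    return completed.returncode
```

Calling `run_load("my_load.yml")` reports anything missing on standard error and returns `1` without starting a load. Otherwise it returns whatever exit status `gpload` produced.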