
 ---
 title: Load Data Using gpload
 ---

 # Load Data into Apache Cloudberry Using `gpload`

 The `gpload` utility of Apache Cloudberry loads data using readable external tables and the Apache Cloudberry parallel file server (`gpfdist`). It sets up the parallel file-based external tables for you and lets you define the data format, the external table definition, and the `gpfdist` setup in a single configuration file.

 :::tip
 `gpload` does not support `MERGE` or `UPDATE` operations when a target table column name is a reserved keyword, contains capital letters, or includes any character that requires double quotes (`" "`) to identify the column.
 :::
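
Before choosing `MERGE` or `UPDATE` mode, it can help to scan the target column names for the cases listed in the tip. The following is a rough sketch, not part of `gpload`: the function name `needs_quoting`, the sample column names, and the short `RESERVED_SAMPLE` set are all illustrative assumptions. The check is deliberately conservative, and the authoritative reserved-word list is the one defined by the database.

```python
import re

# Illustrative subset only (assumption); the database defines the full list.
RESERVED_SAMPLE = {"select", "from", "where", "table", "user", "order", "group"}

# Lowercase letters, digits, and underscores, not starting with a digit.
PLAIN_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def needs_quoting(column_name):
    """Return True if the name would likely have to be double-quoted."""
    if PLAIN_IDENTIFIER.fullmatch(column_name) is None:
        # Capital letters, spaces, or other special characters.
        return True
    return column_name in RESERVED_SAMPLE


# Hypothetical column names used only to exercise the check.
columns = ["amount", "Category", "order", "unit price"]
flagged = [name for name in columns if needs_quoting(name)]
print(flagged)
```

Any name the sketch flags is worth renaming or checking by hand before you rely on `MERGE` or `UPDATE`.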

 ## To use gpload

 1. Ensure that your environment is set up to run `gpload`. You need some dependent files from your Apache Cloudberry installation, such as `gpfdist` and Python 3, as well as network access to the Apache Cloudberry segment hosts. `gpload` also requires the following Python packages:

     ```shell
     pip install psycopg2 pyyaml
     ```

 2. Create your load control file. This is a YAML-formatted file that specifies the Apache Cloudberry connection information, gpfdist configuration information, external table options, and data format.

     For example:

     ```yaml
     ---
     VERSION: 1.0.0.1
     DATABASE: ops
     USER: gpadmin
     HOST: cdw-1
     PORT: 5432
     GPLOAD:
        INPUT:
         - SOURCE:
              LOCAL_HOSTNAME:
                - etl1-1
                - etl1-2
                - etl1-3
                - etl1-4
              PORT: 8081
              FILE:
                - /var/load/data/*
         - COLUMNS:
                - name: text
                - amount: float4
                - category: text
                - descr: text
                - date: date
         - FORMAT: text
         - DELIMITER: '|'
         - ERROR_LIMIT: 25
         - LOG_ERRORS: true
        OUTPUT:
         - TABLE: payables.expenses
         - MODE: INSERT
        PRELOAD:
         - REUSE_TABLES: true
     # SQL:
     #   - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
     #   - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
     ```

 3.  Run `gpload`, passing in the load control file. For example:

     ```shell
     gpload -f my_load.yml
     ```
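
The environment requirements in step 1 can be checked with a short pre-flight script before you run a load. This is a minimal sketch that uses only the Python standard library. The helper name `find_missing` is hypothetical, and the script assumes that `gpload` and `gpfdist` are expected on `PATH` and that the two packages are imported as `psycopg2` and `yaml`.

```python
import importlib.util
import shutil


def find_missing(executables, modules):
    """Return (executables not on PATH, modules that cannot be imported)."""
    missing_exes = [exe for exe in executables if shutil.which(exe) is None]
    missing_mods = [mod for mod in modules if importlib.util.find_spec(mod) is None]
    return missing_exes, missing_mods


if __name__ == "__main__":
    # The pyyaml package is imported under the module name "yaml".
    exes, mods = find_missing(["gpload", "gpfdist"], ["psycopg2", "yaml"])
    for exe in exes:
        print(f"not on PATH: {exe}")
    for mod in mods:
        print(f"Python module not importable: {mod}")
    if not exes and not mods:
        print("environment looks ready; run: gpload -f my_load.yml")
```

The script does not test network access to the segment hosts, so that still has to be verified separately.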