blob: 4df834f47447a8cc00d1448e1cb50dcd699676ad [file] [view]
---
title: Load Data Using gpfdists (Secure)
---
# Load Data Securely Using `gpfdists`
The `gpfdists` protocol is a secure version of the `gpfdist` protocol that enables encrypted communication between Apache Cloudberry and the gpfdist file server. When you use `gpfdists`, all data transfer is encrypted using SSL, protecting against eavesdropping and man-in-the-middle attacks.
`gpfdists` provides the same high-performance parallel data loading capabilities as `gpfdist`, but with additional security features essential for production environments handling sensitive data.
## Security features
- All data transmitted between Apache Cloudberry segments and gpfdist servers is encrypted using SSL/TLS protocols, protecting against eavesdropping and data interception.
- Mutual authentication is enforced through client certificates, ensuring that both Apache Cloudberry and gpfdist servers verify each other's identities before establishing connections.
- The implementation uses TLSv1 protocol with AES_128_CBC_SHA encryption algorithm to provide strong cryptographic protection for data in transit.
- Secure server identification mechanisms prevent unauthorized systems from masquerading as legitimate gpfdist servers, protecting against man-in-the-middle attacks.
## Before you begin
To use `gpfdists`, make sure:
- SSL certificates configured on all segment hosts.
- gpfdist utility available on the file server host.
- Network connectivity between segment hosts and the gpfdist server.
- Appropriate SSL certificate files in the correct locations.
## Step 1. Set up SSL certificates
### Required certificate files
The following certificate files must be present in the `$PGDATA/gpfdists` directory on each Apache Cloudberry segment host:
#### For full SSL authentication (recommended):
- `client.key` - Client private key file
- `client.crt` - Client certificate file
- `root.crt` - Trusted certificate authorities file
#### Certificate requirements by configuration:
| verify_gpfdists_cert | --ssl_verify_peer | Required Certificate Files |
|---------------------|-------------------|---------------------------|
| on (default) | on (default) | `client.key`, `client.crt`, `root.crt` |
| on | off | `root.crt` |
| off | on | `client.key`, `client.crt` |
| off | off | None |
### Install certificates
1. Create the gpfdists directory on each segment host:
```shell
mkdir -p $PGDATA/gpfdists
```
2. Copy the certificate files to each segment host:
```shell
# Copy to all segment hosts
scp client.key client.crt root.crt gpadmin@segment-host:$PGDATA/gpfdists/
```
3. Set appropriate permissions:
```shell
chmod 600 $PGDATA/gpfdists/client.key
chmod 644 $PGDATA/gpfdists/client.crt
chmod 644 $PGDATA/gpfdists/root.crt
```
## Step 2. Start gpfdist with SSL
Start the gpfdist utility with the `--ssl` option to enable secure connections:
```shell
gpfdist -p 8081 -d /data/load_files --ssl /path/to/certificates &
```
### SSL options for gpfdist
- `--ssl <certificates_path>`: Enables SSL and specify certificate directory
- `--ssl_verify_peer on|off`: Controls peer verification (default: on)
### Example: Start multiple secure gpfdist instances
```shell
# Starts the first secure gpfdist instance.
gpfdist -d /var/load_files1 -p 8081 --ssl /home/gpadmin/certs \
--ssl_verify_peer on -l /home/gpadmin/log1 &
# Starts the second secure gpfdist instance.
gpfdist -d /var/load_files2 -p 8082 --ssl /home/gpadmin/certs \
--ssl_verify_peer on -l /home/gpadmin/log2 &
```
## Step 3. Create external tables with gpfdists
Use the `gpfdists://` protocol in the `LOCATION` clause to create secure external tables:
### Readable external table
```sql
CREATE EXTERNAL TABLE secure_sales_data (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2)
)
LOCATION ('gpfdists://etl-server1:8081/sales/*.txt',
'gpfdists://etl-server2:8082/sales/*.txt')
FORMAT 'TEXT' (DELIMITER '|' NULL ' ');
```
### Writable external table
```sql
CREATE WRITABLE EXTERNAL TABLE secure_export (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2)
)
LOCATION ('gpfdists://etl-server1:8081/exports/sales_data.txt')
FORMAT 'TEXT' (DELIMITER '|')
DISTRIBUTED BY (transaction_id);
```
### With error handling
```sql
CREATE EXTERNAL TABLE secure_data_with_errors (
id int,
name text,
value decimal(10,2)
)
LOCATION ('gpfdists://etl-server:8081/data/*.csv')
FORMAT 'CSV' (HEADER)
LOG ERRORS SEGMENT REJECT LIMIT 100;
```
## Configuration parameters
### Apache Cloudberry parameters
Configure these parameters in `postgresql.conf`:
```ini
# Enables/disables SSL certificate verification (default: on)
verify_gpfdists_cert = on
# Control segment parallelism
gp_external_max_segs = 64
```
### gpfdist SSL parameters
The gpfdist utility supports these SSL-related options:
| Parameter | Description | Default |
|-----------|-------------|---------|
| `--ssl` | Enable SSL and specify certificate path | Disabled |
| `--ssl_verify_peer` | Verify client certificates | on |
## Security best practices
### Certificate management
- Generate all certificates from a trusted certificate authority to ensure proper validation and trust chain establishment.
- Implement a regular certificate rotation schedule to enhance security and prevent issues from certificate expiration.
- Store private keys in secure locations with restricted access permissions, ensuring only authorized personnel can access them.
- Maintain secure backup copies of all certificate files to enable quick recovery in case of system failures or corruption.
### Network security
- Configure firewall rules to restrict access to gpfdists ports, allowing only authorized Apache Cloudberry segment hosts to connect.
- Use secure network connections such as VPNs or private networks to prevent unauthorized access to data transmission channels.
- Implement continuous monitoring of SSL connections and certificate expiration dates to proactively address security issues.
### Access control
- Apply the principle of least privilege by granting only the minimum permissions necessary for users and applications to perform their required functions.
- Implement robust authentication mechanisms including multi-factor authentication where appropriate to verify user identities.
- Enable comprehensive audit logging to track all access attempts, successful connections, and security-related events for compliance and security monitoring.
## Troubleshooting
### SSL connection errors
Check certificate configuration:
```shell
# Verifies certificate files exist.
ls -la $PGDATA/gpfdists/
# Checks certificate validity.
openssl x509 -in $PGDATA/gpfdists/client.crt -text -noout
```
### Troubleshoot certificate verification issues
1. Verify that the `root.crt` file contains the correct certificate authority chain and that all intermediate certificates are properly included for validation.
2. Check certificate expiration dates using tools like `openssl x509 -dates` to ensure that certificates have not expired and plan for renewal well in advance.
3. Validate that the private key file `client.key` corresponds exactly to the public certificate in `client.crt` using certificate validation tools.
### Common error messages
| Error | Cause | Solution |
|-------|-------|----------|
| "SSL certificate verify failed" | Invalid or expired certificate | Check certificate validity and CA |
| "SSL handshake failed" | SSL configuration mismatch | Verify SSL settings on both sides |
| "Permission denied" | Incorrect file permissions | Set proper permissions on certificate files |
### Debug SSL connections
Enable verbose logging:
```shell
gpfdist -d /data -p 8081 --ssl /certs -V --ssl_verify_peer on
```
## Performance considerations
- SSL encryption introduces computational overhead during data transmission, which may reduce overall throughput compared to unencrypted connections, especially for large data transfers.
- Certificate caching mechanisms help reduce the performance impact of SSL handshakes by reusing established secure connections across multiple data transfer operations.
- Deploy multiple gpfdists instances across different hosts or network interfaces to distribute the load and achieve better aggregate throughput for concurrent data operations.
- Ensure that your network infrastructure provides adequate bandwidth and low latency between Apache Cloudberry segments and gpfdist servers to minimize performance bottlenecks.
## Migration from gpfdist to gpfdists
To migrate existing gpfdist usage to gpfdists:
- Install and configure the required SSL certificates on all Apache Cloudberry segment hosts, ensuring proper permissions and certificate chain validation.
- Update all external table definitions to change protocol specifications from `gpfdist://` to `gpfdists://` in their LOCATION clauses.
- Restart your gpfdist file servers with the `--ssl` option enabled, specifying the appropriate certificate directories and SSL verification settings.
- Thoroughly test secure connections from all Apache Cloudberry segments to verify that data loading operations work correctly with the new encrypted protocol.
- Update any automation scripts, monitoring tools, and operational procedures to account for the new secure protocol requirements and certificate management tasks.
### Example migration
Before (insecure):
```sql
LOCATION ('gpfdist://etl-server:8081/data.txt')
```
After (secure):
```sql
LOCATION ('gpfdists://etl-server:8081/data.txt')
```
## Limitations
- Cannot mix `gpfdist://` and `gpfdists://` protocols in the same external table definition.
- All Apache Cloudberry segment hosts must have properly configured SSL certificates and trust relationships.
- SSL encryption introduces computational overhead that may reduce data transfer throughput compared to unencrypted gpfdist.
- Ongoing certificate lifecycle management is required, including renewal, rotation, and revocation processes.
## Learn more
- [Load Data Using gpfdist](/docs/data-loading/load-data-using-gpfdist.md)
- [SSL Certificate Management](https://www.sslshopper.com/ssl-converter.html)