| COMMAND NAME: gpfdist |
| |
| Serves data files to or writes data files out from Apache Cloudberry |
| segments. |
| |
| ***************************************************** |
| SYNOPSIS |
| ***************************************************** |
| |
| gpfdist [-d <directory>] [-p <http_port>] [-l <log_file>] [-t <timeout>] |
| [-S] [-w <time>] [-v | -V] [-m <max_length>] [--ssl <certificate_path>] |
| [--ssl_verify_peer <on/off>] |
| [--compress] [--compress-level <level>] |
| |
| gpfdist -? | --help | --version |
| |
| |
| ***************************************************** |
| DESCRIPTION |
| ***************************************************** |
| |
| gpfdist is Cloudberry's parallel file distribution program. It is used |
| by readable external tables and gpload to serve external table files to |
| all Apache Cloudberry segments in parallel. It is used by writable |
| external tables to accept output streams from Apache Cloudberry |
| segments in parallel and write them out to a file. |
| |
| In order for gpfdist to be used by an external table, the LOCATION |
| clause of the external table definition must specify the external table |
| data using the gpfdist:// protocol (see the Apache Cloudberry command |
| CREATE EXTERNAL TABLE). |
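| |
| As a minimal sketch of such a definition, the shell function below |
| prints a CREATE EXTERNAL TABLE statement whose LOCATION clause uses |
| the gpfdist:// protocol. The function name, table name, columns, |
| delimiter, host name, and database name are hypothetical placeholders. |
| |
```shell
#!/bin/sh
# Print a readable external table definition that points at a gpfdist
# instance. ext_expenses and its columns are illustrative assumptions.
ext_table_ddl() {
    host=$1
    port=$2
    pattern=$3
    cat <<EOF
CREATE EXTERNAL TABLE ext_expenses (name text, amount numeric)
LOCATION ('gpfdist://${host}:${port}/${pattern}')
FORMAT 'TEXT' (DELIMITER '|');
EOF
}

# Usage (assumes psql can connect to a database named "mydb"):
#   ext_table_ddl etlhost-1 8081 '*.txt' | psql -d mydb
```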
| |
| NOTE: If the --ssl option is specified to enable SSL security, create |
| the external table with the gpfdists:// protocol. |
| |
| The benefit of using gpfdist is that reads from and writes to external |
| tables run in parallel across all segments, which generally gives the |
| best load and unload performance and makes external tables easier to |
| administer. |
| |
| For readable external tables, gpfdist parses and serves data files |
| evenly to all the segment instances in the Apache Cloudberry system |
| when users SELECT from the external table. For writable external tables, |
| gpfdist accepts parallel output streams from the segments when users |
| INSERT into the external table, and writes to an output file. |
| |
| For readable external tables, if load files are compressed using gzip or |
| bzip2 (have a .gz or .bz2 file extension), gpfdist uncompresses the |
| files automatically before loading provided that gunzip or bunzip2 is in |
| your path. |
| |
| NOTE: Currently, readable external tables do not support compression on |
| Windows platforms, and writable external tables do not support |
| compression on any platform. |
| |
| Most likely, you will want to run gpfdist on your ETL machines rather |
| than the hosts where Apache Cloudberry is installed. To install gpfdist |
| on another host, simply copy the utility over to that host and add |
| gpfdist to your $PATH. |
| |
| NOTE: When using IPv6, always enclose the numeric IP address in |
| brackets. |
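| |
| The bracket rule can be sketched as a small helper that builds a |
| gpfdist:// URL and adds brackets only when the host looks like a bare |
| IPv6 literal. The function name and the sample addresses are |
| hypothetical. |
| |
```shell
#!/bin/sh
# Build a gpfdist:// URL, bracketing the host if it is a bare IPv6 literal.
gpfdist_url() {
    host=$1
    port=$2
    file=$3
    case $host in
        \[*\]) ;;                # already enclosed in brackets
        *:*) host="[$host]" ;;   # contains a colon: treat as IPv6
    esac
    printf 'gpfdist://%s:%s/%s\n' "$host" "$port" "$file"
}

# Usage:
#   gpfdist_url 2001:db8::10 8081 data.csv
#   gpfdist_url etlhost-1 8081 data.csv
```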
| |
| You can also run gpfdist as a Windows Service. See the section "Running |
| gpfdist as a Windows Service" for more details. |
| |
| |
| ***************************************************** |
| OPTIONS |
| ***************************************************** |
| |
| -d <directory> |
| |
| The directory from which gpfdist will serve files for readable external |
| tables or create output files for writable external tables. If not |
| specified, defaults to the current directory. |
| |
| |
| -l <log_file> |
| |
| The fully qualified path and log file name where standard output |
| messages are to be logged. |
| |
| |
| -p <http_port> |
| |
| The HTTP port on which gpfdist will serve files. Defaults to 8080. |
| |
| |
| -t <timeout> |
| |
| Sets the time allowed for Apache Cloudberry to establish a connection |
| to a gpfdist process. Default is 5 seconds. Allowed values are 2 to 600 |
| seconds. May need to be increased on systems with a lot of network |
| traffic. |
| |
| |
| -m <max_length> |
| |
| Sets the maximum allowed data row length in bytes. Default is 32768. |
| Should be used when user data includes very wide rows (or when a "line too |
| long" error message occurs). Should not be used otherwise as it increases |
| resource allocation. Valid range is 32K to 256MB. (The upper limit is |
| 1MB on Windows systems.) |
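| |
| As a rough sketch, a wrapper script could check a candidate -m value |
| against the range above before starting gpfdist. The function name is |
| hypothetical, and the byte limits assume that 32K, 1MB, and 256MB are |
| binary units (32768, 1048576, and 268435456 bytes). |
| |
```shell
#!/bin/sh
# Return success if the value is a valid -m setting for the platform.
# The second argument is "unix" (default) or "windows".
valid_max_length() {
    bytes=$1
    platform=${2:-unix}
    min=32768
    max=268435456
    [ "$platform" = windows ] && max=1048576
    # Reject empty or non-numeric input.
    case $bytes in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$bytes" -ge "$min" ] && [ "$bytes" -le "$max" ]
}

# Usage:
#   valid_max_length 1048576 && gpfdist -d /var/load_files -m 1048576 &
```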
| |
| |
| -S (use O_SYNC) |
| |
| Opens the file for synchronous I/O with the O_SYNC flag. Any writes to |
| the resulting file descriptor block gpfdist until the data is physically |
| written to the underlying hardware. |
| |
| |
| -w <time> |
| |
| Sets the number of seconds that Apache Cloudberry delays before closing |
| a target file such as a named pipe. The default value is 0, no delay. |
| The maximum value is 600 seconds, 10 minutes. |
| |
| For an Apache Cloudberry system with multiple segments, there might be |
| a delay between segments when writing data from different segments to |
| the file. You can specify a time to wait before Apache Cloudberry closes |
| the file to ensure all the data is written to the file. |
| |
| |
| --ssl <certificate_path> |
| |
| Adds SSL encryption to data transferred with gpfdist. After executing |
| gpfdist with the --ssl <certificate_path> option, the only way to load |
| data from this file server is with the gpfdists protocol. For |
| information on the gpfdists protocol, see the "Apache Cloudberry |
| Administrator Guide". |
| |
| The location specified in certificate_path must contain the following |
| files: |
| |
| * The server certificate file, server.crt |
| * The server private key file, server.key |
| * The trusted certificate authorities, root.crt |
| |
| The root directory (/) cannot be specified as certificate_path. |
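| |
| The requirements above can be checked before gpfdist is started. The |
| following is a minimal sketch; the function name and the example |
| certificate directory are hypothetical. |
| |
```shell
#!/bin/sh
# Verify that a directory is usable as <certificate_path>:
# it must not be / and must contain the three expected files.
check_ssl_dir() {
    dir=$1
    if [ "$dir" = / ]; then
        echo "certificate_path must not be the root directory" >&2
        return 1
    fi
    for f in server.crt server.key root.crt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing $dir/$f" >&2
            return 1
        fi
    done
}

# Usage:
#   check_ssl_dir /home/gpadmin/certs &&
#       gpfdist -d /var/load_files -p 8081 --ssl /home/gpadmin/certs &
```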
| |
| |
| --ssl_verify_peer <on/off> |
| |
| Controls whether gpfdist verifies the identity of clients that connect |
| with the gpfdists protocol, and therefore which SSL files must be |
| present in the certificate directory. The default value is 'on'. |
| |
| If the value is 'on', gpfdist checks the identity of clients, and the |
| CA certificate file (root.crt) must be present in the SSL certificate |
| directory. If the value is 'off', gpfdist does not check the identity |
| of clients, and the CA certificate file is not required on the gpfdist |
| side. |
| |
| |
| --compress |
| |
| Compresses data transferred by gpfdist with the Zstandard (zstd) |
| compression method. Compression lets gpfdist transmit larger amounts |
| of data while keeping network usage low. However, compressing data |
| takes time and may reduce transmission speed. |
| |
| |
| --compress-level <level> |
| |
| Specifies the Zstandard (zstd) compression level to use when the |
| --compress option is enabled. The <level> parameter is an integer that |
| controls the trade-off between compression ratio and speed. Higher |
| values typically give better compression ratios but may require more |
| CPU time. The valid range is 1 (fastest, least compression) to 9 |
| (slowest, best compression). The default level is 1. |
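| |
| As a small sketch, the helper below builds the two compression options |
| together and rejects a level outside 1 to 9. The function name is |
| hypothetical, and the level 3 in the usage line is only an example. |
| |
```shell
#!/bin/sh
# Print the compression options for a given level (default 1).
# Fails if the level is not a single digit from 1 to 9.
compress_args() {
    level=${1:-1}
    case $level in
        [1-9]) printf '%s\n' "--compress --compress-level $level" ;;
        *) echo "compression level must be between 1 and 9" >&2
           return 1 ;;
    esac
}

# Usage (the substitution is left unquoted so it splits into options):
#   gpfdist -d /var/load_files -p 8081 $(compress_args 3) &
```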
| |
| |
| -v (verbose) |
| |
| Verbose mode shows progress and status messages. |
| |
| |
| -V (very verbose) |
| |
| Very verbose mode shows all output messages generated by this utility. |
| |
| |
| -? (help) |
| |
| Displays the online help. |
| |
| |
| --version |
| |
| Displays the version of this utility. |
| |
| |
| ***************************************************** |
| RUNNING GPFDIST AS A WINDOWS SERVICE |
| ***************************************************** |
| |
| Cloudberry Loaders allow gpfdist to run as a Windows Service. |
| |
| Follow the instructions below to download, register and activate gpfdist |
| as a service: |
| |
| 1. Update your Cloudberry Loader package to the latest version. This |
| package is available from the EMC Download Center |
| (https://emc.subscribenet.com). |
| |
| 2. Register gpfdist as a Windows service: |
| |
| a. Open a Windows command window |
| |
| b. Run the following command: |
| |
| sc create gpfdist binpath= "path_to_gpfdist.exe -p 8081 |
| -d External\load\files\path -l Log\file\path" |
| |
| You can create multiple instances of gpfdist by running the same command |
| again, with a unique name and port number for each instance, for |
| example: |
| |
| sc create gpfdistN binpath= "path_to_gpfdist.exe -p 8082 |
| -d External\load\files\path -l Log\file\path" |
| |
| 3. Activate the gpfdist service: |
| |
| a. Open the Windows Control Panel and select |
| Administrative Tools>Services. |
| |
| b. Highlight the gpfdist service in the list of services, then |
| right-click it. |
| |
| c. Select Properties from the right-click menu. The Service Properties |
| window opens. |
| |
| Note that you can also stop this service from the Service Properties |
| window. |
| |
| d. Optional: Change the Startup Type to Automatic (after a system |
| restart, this service will be running), then under Service status, |
| click Start. |
| |
| e. Click OK. |
| |
| Repeat the above steps for each instance of gpfdist that you created. |
| |
| |
| ***************************************************** |
| EXAMPLES |
| ***************************************************** |
| |
| Serve files from a specified directory using port 8081 (and start |
| gpfdist in the background): |
| |
| gpfdist -d /var/load_files -p 8081 & |
| |
| |
| Start gpfdist in the background and redirect output and errors to a log |
| file: |
| |
| gpfdist -d /var/load_files -p 8081 -l /home/gpadmin/log & |
| |
| |
| To stop gpfdist when it is running in the background: |
| |
| --First find its process id: |
| |
| ps ax | grep gpfdist |
| |
| --Then kill the process, for example: |
| |
| kill 3456 |
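| |
| Searching the ps output by name can also match the grep process |
| itself. As an alternative sketch, the process ID can be recorded in a |
| file when gpfdist is started and read back to stop it. The function |
| names and the PID file path are hypothetical. |
| |
```shell
#!/bin/sh
# Start a command in the background and record its PID in a file.
start_bg() {
    pidfile=$1
    shift
    "$@" &
    echo "$!" > "$pidfile"
}

# Stop the process recorded in the PID file, then remove the file.
stop_bg() {
    pidfile=$1
    kill "$(cat "$pidfile")" && rm -f "$pidfile"
}

# Usage:
#   start_bg /tmp/gpfdist.pid gpfdist -d /var/load_files -p 8081
#   stop_bg /tmp/gpfdist.pid
```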
| |
| |
| ***************************************************** |
| SEE ALSO |
| ***************************************************** |
| |
| CREATE EXTERNAL TABLE, gpload |
| |
| See the "Apache Cloudberry Reference Guide" for information |
| about CREATE EXTERNAL TABLE. |
| |
| |