COMMAND NAME: gpfdist
Serves data files to or writes data files out from HAWQ segments.
*****************************************************
SYNOPSIS
*****************************************************
gpfdist [-d <directory>] [-p <http_port>] [-l <log_file>] [-t <timeout>] [-c <config_file>]
[-S] [-v | -V] [-m <max_length>] [--ssl certificate_path]
gpfdist [-? | --help] | --version
*****************************************************
DESCRIPTION
*****************************************************
gpfdist is HAWQ’s parallel file distribution program.
It is used by readable external tables and gpload to serve
external table files to all HAWQ segments in parallel.
It is used by writable external tables to accept output
streams from HAWQ segments in parallel and write them out to a file.
In order for gpfdist to be used by an external table, the
LOCATION clause of the external table definition must specify
the correct file location using the gpfdist:// protocol
(see CREATE EXTERNAL TABLE).
NOTE: If the --ssl option is specified to enable SSL security,
create the external table with the gpfdists:// protocol.
The benefit of using gpfdist is that you are guaranteed maximum
parallelism while reading from or writing to external tables,
thereby offering the best performance as well as easier
administration of external tables.
For readable external tables, gpfdist parses and serves data
files evenly to all the segment instances in the HAWQ
system when users SELECT from the external table. For writable
external tables, gpfdist accepts parallel output streams from
the segments when users INSERT into the external table, and
writes to an output file.
For readable external tables, if load files are compressed using
gzip or bzip2 (have a .gz or .bz2 file extension), gpfdist
uncompresses the files automatically before loading, provided
that gunzip or bunzip2 is in your path.
NOTE: Currently, readable external tables do not support
compression on Windows platforms, and writable external
tables do not support compression on any platforms.
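As a rough pre-flight illustration of the extension rule above, the
sketch below reports which decompressor a load file would need and
whether it is on the path. The function names decompressor_for and
can_serve are hypothetical and are not part of gpfdist.

```shell
#!/bin/sh
# Hypothetical helper (not part of gpfdist): map a load file name to
# the decompressor that must be on PATH for that file to be served.
decompressor_for() {
    case $1 in
        *.gz)  echo gunzip ;;
        *.bz2) echo bunzip2 ;;
        *)     echo none ;;
    esac
}

# Return 0 if the file is uncompressed or its decompressor is on PATH.
can_serve() {
    tool=$(decompressor_for "$1")
    [ "$tool" = none ] || command -v "$tool" >/dev/null 2>&1
}
```

For example, can_serve /var/load_files/sales.csv.gz succeeds only
when gunzip can be found on the path.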
Most likely, you will want to run gpfdist on your ETL machines
rather than the hosts where HAWQ is installed.
To install gpfdist on another host, simply copy the utility
over to that host and add gpfdist to your $PATH.
NOTE: When using IPv6, always enclose the numeric IP address
in brackets.
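The protocol and IPv6 notes above can be sketched as a small helper
that assembles the URL for a LOCATION clause. The function name
gpfdist_url, its argument order, and the sample host and address are
hypothetical; only the gpfdist:// and gpfdists:// protocols and the
bracket rule come from this page.

```shell
#!/bin/sh
# Hypothetical helper (not part of gpfdist): build a LOCATION URL.
# Usage: gpfdist_url <host> <port> <file> [ssl]
gpfdist_url() {
    host=$1
    port=$2
    file=$3
    scheme=gpfdist
    # Use the gpfdists:// protocol when gpfdist was started with --ssl.
    if [ "${4:-}" = "ssl" ]; then
        scheme=gpfdists
    fi
    case $host in
        "["*"]") ;;                # already bracketed, leave as is
        *:*) host="[$host]" ;;     # numeric IPv6 address: add brackets
    esac
    printf '%s://%s:%s/%s\n' "$scheme" "$host" "$port" "$file"
}
```

Calling gpfdist_url 2001:db8::1 8081 data.txt prints a URL with the
address enclosed in brackets, while a host name is passed through
unchanged.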
You can also run gpfdist as a Windows Service. See below for
details.
*****************************************************
OPTIONS
*****************************************************
-d <directory>
The directory from which gpfdist will serve files for
readable external tables or create output files for writable
external tables. If not specified, defaults to the current directory.
-l <log_file>
The fully qualified path and log file name where standard output
messages are to be logged.
-p <http_port>
The HTTP port on which gpfdist will serve files. Defaults to 8080.
-t <timeout>
Sets the time (in seconds) allowed for HAWQ to
establish a connection to a gpfdist process. Default is 5 seconds.
Valid values are 2 to 30 seconds. May need to be increased on
systems with a lot of network traffic.
-m <max_length>
Sets the maximum allowed data row length in bytes. Default is 32768.
Use this option when user data includes very wide rows, that is, when
the "line too long" error message is received. Do not use it otherwise,
because it increases resource allocation.
Valid range is 32K to 256MB. (The upper limit is 1MB on Windows systems.)
-S (use O_SYNC)
Opens the file for synchronous I/O with the O_SYNC flag. Any writes to
the resulting file descriptor block gpfdist until the data is
physically written to the underlying hardware.
--ssl certificate_path
Adds SSL encryption to data transferred with gpfdist. After executing
gpfdist with the --ssl certificate_path option, the only way
to load data from this file server is with the gpfdists protocol.
The location specified in certificate_path must
contain the following files:
- The server certificate file, server.crt
- The server private key file, server.key
- The trusted certificate authorities, root.crt
The root directory (/) cannot be specified as certificate_path.
-c <config_file>
Configuration file for transformations. The option config_file specifies
the location of the transformation configuration file, passed to gpload via -c.
The gpfdist configuration is expected to be a YAML file with the following format:
---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  transformname1:
    TYPE: input | output
    COMMAND: command1
    CONTENT: data | paths
    SAFE: posix-regex
  transformname2:
    TYPE: input | output
    COMMAND: command2
    ...
-v (verbose)
Verbose mode shows progress and status messages.
-V (very verbose)
Very verbose mode shows all output messages generated by this utility.
--version
Prints out the version of this utility.
-?
--help
Displays online help.
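The documented limits for -t, -m and --ssl can be checked before
gpfdist is launched. The sketch below is only an illustration: the
function names are hypothetical, 256MB is assumed to mean
256 * 1024 * 1024 bytes, and the lower 1MB limit on Windows is not
handled.

```shell
#!/bin/sh
# Hypothetical checks mirroring the limits listed under OPTIONS.
# They are not part of gpfdist.

# -t <timeout>: whole number of seconds from 2 to 30.
valid_timeout() {
    case $1 in ''|*[!0-9]*) return 1 ;; esac
    [ "$1" -ge 2 ] && [ "$1" -le 30 ]
}

# -m <max_length>: 32768 bytes (32K) up to 268435456 bytes
# (256MB, assuming 1MB = 1048576 bytes).
valid_max_length() {
    case $1 in ''|*[!0-9]*) return 1 ;; esac
    [ "$1" -ge 32768 ] && [ "$1" -le 268435456 ]
}

# --ssl certificate_path: must not be / and must contain
# server.crt, server.key and root.crt.
valid_cert_dir() {
    [ "$1" != / ] || return 1
    for f in server.crt server.key root.crt; do
        [ -f "$1/$f" ] || return 1
    done
}
```

A wrapper script could call these and refuse to start gpfdist when
any of them returns a non-zero status.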
*****************************************************
RUNNING GPFDIST AS A WINDOWS SERVICE
*****************************************************
HAWQ Loaders allow gpfdist to run as a Windows Service.
Follow the instructions below to download, register and
activate gpfdist as a service:
1. Update your HAWQ Loader package to the latest
version. This package is available from the
EMC Download Center (https://emc.subscribenet.com)
2. Register gpfdist as a Windows service:
* Open a Windows command window
* Run the following command:
sc create gpfdist binpath= "path_to_gpfdist.exe -p 8081 -d External\load\files\path -l Log\file\path"
You can create multiple instances of gpfdist by
running the same command again, with a unique
name and port number for each instance, for example:
sc create gpfdistN binpath= "path_to_gpfdist.exe -p 8082 -d External\load\files\path -l Log\file\path"
3. Activate the gpfdist service:
* Open the Windows Control Panel and select
Administrative Tools>Services.
* Highlight then right-click on the gpfdist
service in the list of services.
* Select Properties from the right-click menu.
The Service Properties window opens.
Note that you can also stop this service
from the Service Properties window.
* Optional: Change the Startup Type to
Automatic (after a system restart, this
service will be running), then under Service
status, click Start.
* Click OK.
Repeat the above steps for each instance of
gpfdist that you created.
*****************************************************
EXAMPLES
*****************************************************
Serve files from a specified directory using port 8081
(and start gpfdist in the background):
gpfdist -d /var/load_files -p 8081 &
Start gpfdist in the background and redirect output and
errors to a log file:
gpfdist -d /var/load_files -p 8081 -l /home/gpadmin/log &
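As a further sketch, a transformation configuration in the format
shown for the -c option can be written to a file and then passed to
gpfdist. Everything specific here is an assumption made for
illustration: the transformation name uppercase_input, the script
path, and the %filename% placeholder (taken from the Greenplum
transformation documentation, not from this page). The gpfdist line
is left commented out so the sketch runs on a host without gpfdist.

```shell
#!/bin/sh
# Write a sample transformation config. All names are hypothetical.
config=${TMPDIR:-/tmp}/gpfdist_transform.yaml

cat > "$config" <<'EOF'
---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  # hypothetical transformation name
  uppercase_input:
    TYPE: input
    # hypothetical script; %filename% is an assumed placeholder
    COMMAND: /bin/bash /var/load_files/to_upper.sh %filename%
    CONTENT: data
EOF

# Start gpfdist with the config (requires gpfdist on the PATH):
# gpfdist -d /var/load_files -p 8081 -c "$config" &
```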
To stop gpfdist when it is running in the background:
--First find its process id:
ps ax | grep gpfdist
OR on Solaris
ps -ef | grep gpfdist
--Then kill the process, for example:
kill 3456
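An alternative to searching the process list is to record the
process id when gpfdist is started. The sketch below uses two
hypothetical wrapper functions, start_bg and stop_bg, and a pid file
path of your choosing. It relies only on standard shell features
($!, kill, wait) and is not something gpfdist provides.

```shell
#!/bin/sh
# Hypothetical wrappers: run a command in the background and
# remember its process id in a pid file.
# Usage: start_bg <pidfile> <command> [args...]
start_bg() {
    pidfile=$1
    shift
    "$@" &
    echo "$!" > "$pidfile"
}

# Usage: stop_bg <pidfile>
stop_bg() {
    pid=$(cat "$1") || return 1
    kill "$pid" 2>/dev/null || true
    # Reap the process if it was started from this shell;
    # its exit status is ignored.
    wait "$pid" 2>/dev/null || true
    rm -f "$1"
}

# Intended use (requires gpfdist on the PATH):
#   start_bg /tmp/gpfdist.pid gpfdist -d /var/load_files -p 8081
#   stop_bg /tmp/gpfdist.pid
```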
*****************************************************
SEE ALSO
*****************************************************
CREATE EXTERNAL TABLE
gpload