COMMAND NAME: hawq load
Runs a load job as defined in a YAML formatted control file.
*****************************************************
SYNOPSIS
*****************************************************
hawq load -f <control_file> [-l <log_file>]
[-h <hostname>] [-p <port>] [-U <username>] [-d <database>] [-W]
[--gpfdist_timeout <seconds>] [[-v | -V] | [-q]] [-D]
hawq load -?
hawq load --version
*****************************************************
PREREQUISITES
*****************************************************
The client machine where hawq load is executed must have the following:
* Python 2.6.2 or later, pygresql (the Python interface to PostgreSQL),
and pyyaml. Note that Python and the required Python libraries are
included with the HAWQ server installation, so if you have HAWQ
installed on the machine where hawq load is running, you do not need
a separate Python installation.
Note: The HAWQ Loaders package for Windows supports only Python 2.5
(available from www.python.org).
* The gpfdist parallel file distribution program installed and
in your $PATH. This program is located in $GPHOME/bin of your
HAWQ server installation.
* Network access to and from all hosts in your HAWQ array
(master and segments).
* Network access to and from the hosts where the data to be loaded
resides (ETL servers).
*****************************************************
DESCRIPTION
*****************************************************
hawq load is a data loading utility that acts as an interface
to HAWQ’s external table parallel loading feature. Using a load
specification defined in a YAML formatted control file, hawq load
executes a load by invoking the HAWQ parallel file server (gpfdist),
creating an external table definition based on the source data
defined, and executing an INSERT, UPDATE or MERGE operation to
load the source data into the target table in the database.
*****************************************************
OPTIONS
*****************************************************
-f control_file
Required. A YAML file that contains the load specification details.
See “Control File Format” below.
--gpfdist_timeout seconds
Sets the timeout for the gpfdist parallel file distribution program
to send a response. Enter a value from 0 to 30 seconds; a value of 0
disables the timeout. You might need to increase this value when
operating on high-traffic networks.
-l log_file
Specifies where to write the log file. Defaults to
~/hawqAdminLogs/hawq_load_YYYYMMDD.
-v (verbose mode)
Show verbose output of the load steps as they are executed.
-V (very verbose mode)
Shows very verbose output.
-q (no screen output)
Run in quiet mode. Command output is not displayed on the screen,
but is still written to the log file.
-D (debug mode)
Check for error conditions, but do not execute the load.
-? (show help)
Show help, then exit.
--version
Show the version of this utility, then exit.
*********************
CONNECTION OPTIONS
*********************
-d database
The database to load into. If not specified, reads from the
load control file, the environment variable $PGDATABASE or
defaults to the current system user name.
-h hostname
Specifies the host name of the machine on which the HAWQ master
database server is running. If not specified, reads from the load
control file, the environment variable $PGHOST or defaults to localhost.
-p port
Specifies the TCP port on which the HAWQ master database server
is listening for connections. If not specified, reads from the load
control file, the environment variable $PGPORT or defaults to 5432.
-U username
The database role name to connect as. If not specified, reads
from the load control file, the environment variable $PGUSER or
defaults to the current system user name.
-W (force password prompt)
Force a password prompt. If not specified, reads the password
from the environment variable $PGPASSWORD or from a password file
specified by $PGPASSFILE or in ~/.pgpass. If these are not set, then
hawq load will prompt for a password even if -W is not supplied.
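For example, to run the load defined in my_load.yml against a specific
master host, port, database, and role (the values shown here are
placeholders), you might enter:
hawq load -f my_load.yml -h mdw-1 -p 5432 -U gpadmin -d ops -v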
*****************************************************
CONTROL FILE FORMAT
*****************************************************
The hawq load control file uses the YAML 1.1 document format and
implements its own schema for defining the various steps
of a HAWQ load operation. The control file must be a valid YAML document.
The hawq load program processes the control file document in order
and uses indentation (spaces) to determine the document hierarchy
and the relationships of the sections to one another. The use of white
space is significant. White space should not be used simply for formatting
purposes, and tabs should not be used at all.
The basic structure of a load control file is:
---
VERSION: 1.0.0.1
DATABASE: <db_name>
USER: <db_username>
HOST: <master_hostname>
PORT: <master_port>
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - <hostname_or_ip>
         PORT: <http_port>
       | PORT_RANGE: [<start_port_range>, <end_port_range>]
         FILE:
           - </path/to/input_file>
         SSL: true | false
         CERTIFICATES_PATH: /path/to/certificates
    - COLUMNS:
         - <field_name>: <data_type>
    - TRANSFORM: 'transformation'
    - TRANSFORM_CONFIG: 'configuration_file_path' | 'OFF'
    - MAX_LINE_LENGTH: integer
    - FORMAT: text | csv
    - DELIMITER: '<delimiter_character>'
    - ESCAPE: '<escape_character>' | 'OFF'
    - NULL_AS: '<null_string>'
    - FORCE_NOT_NULL: true | false
    - QUOTE: '<csv_quote_character>'
    - HEADER: true | false
    - ENCODING: <database_encoding>
    - ERROR_LIMIT: <integer>
    - ERROR_TABLE: <schema>.<table_name>
   OUTPUT:
    - TABLE: <schema>.<table_name>
    - MODE: insert | update | merge
    - MATCH_COLUMNS:
         - <target_column_name>
    - UPDATE_COLUMNS:
         - <target_column_name>
    - UPDATE_CONDITION: '<boolean_condition>'
    - MAPPING:
         <target_column_name>: <source_column_name> | '<expression>'
   PRELOAD:
    - TRUNCATE: true | false
    - REUSE_TABLES: true | false
   SQL:
    - BEFORE: "<sql_command>"
    - AFTER: "<sql_command>"
*****************************************************
CONTROL FILE SCHEMA ELEMENT DESCRIPTIONS
*****************************************************
VERSION - Optional. The version of the hawq load control file schema.
The current version is 1.0.0.1.
DATABASE - Optional. Specifies which database in HAWQ to connect to.
If not specified, defaults to $PGDATABASE if set or the current
system user name. You can also specify the database on the command
line using the -d option.
USER - Optional. Specifies which database role to use to connect.
If not specified, defaults to $PGUSER if set or to the current
system user name. You can also specify the database role on the command
line using the -U option. If the user running hawq load is not a
HAWQ superuser, then the server configuration parameter
'gp_external_grant_privileges' must be set to on in order
for the load to be processed.
HOST - Optional. Specifies the HAWQ master host name. If not
specified, defaults to $PGHOST if set or to localhost. You can
also specify the master host name on the command line using
the -h option.
PORT - Optional. Specifies the HAWQ master port. If not specified,
defaults to $PGPORT if set or to 5432. You can also specify the
master port on the command line using the -p option.
GPLOAD - Required. Begins the load specification section. A GPLOAD
specification must have an INPUT and an OUTPUT section
defined.
INPUT - Required. Defines the location and the format of the input
data to be loaded. hawq load will start one or more instances of
the gpfdist file distribution program on the current host and
create the required external table definition(s) in HAWQ
that point to the source data. Note that the host
from which you run hawq load must be accessible over the network
by all HAWQ hosts (master and segments).
SOURCE - Required. The SOURCE block of an INPUT specification defines
the location of a source file. An INPUT section can have more
than one SOURCE block defined. Each SOURCE block defined
corresponds to one instance of the gpfdist file distribution
program that will be started on the local machine. Each
SOURCE block defined must have a FILE specification.
LOCAL_HOSTNAME - Optional. Specifies the host name or IP address of
the local machine on which hawq load is running. If this
machine is configured with multiple network interface
cards(s), you can specify the host name or IP of each
individual NIC to allow network traffic to use all
NICs simultaneously. The default is to use the local
machine's primary host name or IP only.
PORT - Optional. Specifies the port number that the gpfdist
file distribution program should use. You can also supply a
PORT_RANGE to select an available port from the specified range.
If both PORT and PORT_RANGE are defined, then PORT takes
precedence. If neither PORT nor PORT_RANGE is defined, the
default is to select an available port between 8000 and 9000.
If multiple host names are declared in LOCAL_HOSTNAME, this port
number is used for all hosts. This configuration is useful when
you want to use all NICs to load the same file or set of files
in a given directory location.
PORT_RANGE - Optional. Can be used instead of PORT to supply a range
of port numbers from which hawq load can choose an available
port for this instance of the gpfdist file distribution
program.
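For example, a SOURCE block that lets gpfdist choose any free port between
8100 and 8110 might look like the following (the host name and file path
are illustrative only):
    - SOURCE:
         LOCAL_HOSTNAME:
           - etl1
         PORT_RANGE: [8100, 8110]
         FILE:
           - /var/load/data/*.dat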
FILE - Required. Specifies the location of a file, named pipe, or
directory location on the local file system that contains data
to be loaded. You can declare more than one file so long as the
data is of the same format in all files specified. If the files
are compressed using gzip or bzip2 (have a .gz or .bz2 file
extension), the files will be uncompressed automatically
(provided that gunzip or bunzip2 is in your path). When
specifying which source files to load, you can use the wildcard
character (*) or other C-style pattern matching to denote
multiple files. The files specified are assumed to be relative
to the current directory from which hawq load is executed (or you
can declare an absolute path).
SSL - Optional. Specifies whether to use SSL encryption.
CERTIFICATES_PATH - Required when SSL is true; cannot be specified when
SSL is false or unspecified. The location specified
in CERTIFICATES_PATH must contain the following
files:
- The server certificate file, server.crt
- The server private key file, server.key
- The trusted certificate authorities, root.crt
The root directory (/) cannot be specified
as CERTIFICATES_PATH.
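For example, a SOURCE block that serves its files over SSL might look like
the following (the certificate directory shown is a placeholder):
    - SOURCE:
         FILE:
           - /var/load/data/*
         SSL: true
         CERTIFICATES_PATH: /home/gpadmin/certs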
COLUMNS - Optional. Specifies the schema of the source data file(s)
in the format of <field_name>: <data_type>. The DELIMITER
character in the source file is what separates two data value
fields (columns). A row is determined by a line feed character
(0x0a). If the input COLUMNS are not specified, then the
schema of the output TABLE is implied, meaning that the source
data must have the same column order, number of columns, and
data format as the target table. The default source-to-target
mapping is based on a match of column names as defined in this
section and the column names in the target TABLE. This default
mapping can be overridden using the MAPPING section.
TRANSFORM - Optional. Specifies the name of the input XML transformation
passed to hawq load.
TRANSFORM_CONFIG - Optional. Specifies the location of the XML transformation
configuration file that is specified in the TRANSFORM
parameter, above.
MAX_LINE_LENGTH - Optional. An integer that specifies the maximum length
of a line in the XML transformation data passed to hawq load.
FORMAT - Optional. Specifies the format of the source data file(s);
either plain text (TEXT) or comma separated values (CSV) format.
Defaults to TEXT if not specified.
DELIMITER - Optional. Specifies a single ASCII character that separates
columns within each row (line) of data. The default is a tab
character in TEXT mode, a comma in CSV mode. You can also
specify a non-printable ASCII character via an escape sequence
using the decimal representation of the ASCII character.
For example, \014 represents the shift out character.
ESCAPE - Optional. Specifies the single character that is used for C
escape sequences (such as \n,\t,\100, and so on) and for
escaping data characters that might otherwise be taken as row
or column delimiters. Make sure to choose an escape character
that is not used anywhere in your actual column data. The
default escape character is a \ (backslash) for text-formatted
files and a " (double quote) for csv-formatted files; however, it
is possible to use another character to represent an escape. It
is also possible to disable escaping by specifying the value
'OFF' as the escape value. This is very useful for data such as
text-formatted web log data that has many embedded backslashes
that are not intended to be escapes.
NULL_AS - Optional. Specifies the string that represents a null value.
The default is \N (backslash-N) in TEXT mode, and an empty value
with no quotations in CSV mode. You might prefer an empty string
even in TEXT mode for cases where you do not want to distinguish
nulls from empty strings. Any source data item that matches this
string will be considered a null value.
FORCE_NOT_NULL - Optional. In CSV mode, processes each specified column as
though it were quoted and hence not a NULL value. For the
default null string in CSV mode (nothing between two
delimiters), this causes missing values to be evaluated
as zero-length strings.
QUOTE - Required when FORMAT is CSV. Specifies the quotation character
for CSV mode. The default is double-quote (").
HEADER - Optional. Specifies that the first line in the data file(s) is
a header row (contains the names of the columns) and should not
be included as data to be loaded. If using multiple data source
files, all files must have a header row. The default is to assume
that the input files do not have a header row.
ENCODING - Optional. Character set encoding of the source data.
Specify a string constant (such as 'SQL_ASCII'), an
integer encoding number, or 'DEFAULT' to use the default
client encoding.
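As an illustration, the format-related elements for a comma-delimited CSV
source with a header row might be specified as follows (the values shown
are examples only):
    - FORMAT: csv
    - DELIMITER: ','
    - QUOTE: '"'
    - NULL_AS: ''
    - HEADER: true
    - ENCODING: 'UTF8'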
ERROR_LIMIT - Optional. Enables single row error isolation mode for
this load operation. When enabled, input rows that have
format errors will be discarded provided that the error
limit count is not reached on any HAWQ segment instance
during input processing. If the error limit is not reached,
all good rows will be loaded and any error rows will either
be discarded or logged to the table specified in ERROR_TABLE.
The default is to abort the load operation on the first error
encountered. Note that single row error isolation only
applies to data rows with format errors; for example, extra
or missing attributes, attributes of a wrong data type, or
invalid client encoding sequences. Constraint errors, such
as primary key violations, will still cause the load
operation to abort if encountered.
ERROR_TABLE - Optional when ERROR_LIMIT is declared. Specifies an error
table where rows with formatting errors will be logged when
running in single row error isolation mode. You can then
examine this error table to see error rows that were not
loaded (if any). If the error_table specified already exists,
it will be used. If it does not exist, it will be
automatically generated.
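For example, to enable single row error isolation with an error limit of 25
and log malformed rows to an error table (the schema and table name are
illustrative):
    - ERROR_LIMIT: 25
    - ERROR_TABLE: payables.err_expenses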
OUTPUT - Required. Defines the target table and final data column values
that are to be loaded into the database.
TABLE - Required. The name of the target table to load into.
MODE - Optional. Defaults to INSERT if not specified. There are three
available load modes:
* INSERT - Loads data into the target table using the following
method: INSERT INTO <target_table> SELECT * FROM <input_data>;
* UPDATE - Updates the UPDATE_COLUMNS of the target table where
the rows have MATCH_COLUMNS attribute values equal to those
of the input data, and the optional UPDATE_CONDITION is true.
* MERGE - Inserts new rows and updates the UPDATE_COLUMNS of
existing rows where MATCH_COLUMNS attribute values are equal
to those of the input data, and the optional UPDATE_CONDITION
is true. New rows are identified when the MATCH_COLUMNS value
in the source data does not have a corresponding value in the
existing data of the target table. In those cases, the entire
row from the source file is inserted, not only the MATCH and
UPDATE columns. If there are multiple new MATCH_COLUMNS values
that are the same, only one new row for that value will be
inserted (use UPDATE_CONDITION to filter out the rows you want
to discard).
MATCH_COLUMNS - Required if MODE is UPDATE or MERGE. Specifies the
column(s) to use as the join condition for the update.
The attribute value in the specified target column(s)
must be equal to that of the corresponding source data
column(s) in order for the row to be updated in the
target table.
UPDATE_COLUMNS - Required if MODE is UPDATE or MERGE. Specifies
the column(s) to update for the rows that meet the
MATCH_COLUMNS criteria and the optional UPDATE_CONDITION.
UPDATE_CONDITION - Optional. Specifies a Boolean condition (similar to
what you would declare in a WHERE clause) that must
be met in order for a row in the target table to be
updated (or inserted in the case of a MERGE).
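For example, an OUTPUT section for a MERGE load that matches rows on name
and date and refreshes the amount column only when the new value is positive
might look like the following (table and column names are illustrative):
   OUTPUT:
    - TABLE: payables.expenses
    - MODE: merge
    - MATCH_COLUMNS:
         - name
         - date
    - UPDATE_COLUMNS:
         - amount
    - UPDATE_CONDITION: 'amount > 0'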
MAPPING - Optional. If a mapping is specified, it overrides the default
source-to-target column mapping. The default source-to-target
mapping is based on a match of column names as defined in the
source COLUMNS section and the column names of the target
TABLE.
A mapping is specified as either:
target_column_name: source_column_name
or
target_column_name: 'expression'
Where expression is any expression that you would specify in
the SELECT list of a query, such as a constant value, a column
reference, an operator invocation, a function call, and so on.
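For example, a MAPPING that loads two columns directly and derives a third
from an expression might be written as follows (the column names and the
expression are illustrative):
    - MAPPING:
         name: name
         amount: amount
         category: 'lower(category)'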
PRELOAD - Optional. Specifies operations to run prior to the load operation.
The available preload operations are TRUNCATE and REUSE_TABLES.
TRUNCATE - Optional. If set to true, hawq load will remove all rows in the
target table prior to loading it.
REUSE_TABLES - Optional. If set to true, hawq load will not drop the external
table objects and staging table objects it creates. These
objects will be reused for future load operations that use
the same load specifications. This improves performance of
trickle loads (ongoing small loads to the same target table).
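For example, a PRELOAD section that empties the target table before each
load and keeps the generated external and staging tables for reuse might
look like:
   PRELOAD:
    - TRUNCATE: true
    - REUSE_TABLES: true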
SQL - Optional. Defines SQL commands to run before and/or after the load
operation. Commands that contain spaces or special characters must
be enclosed in quotes. You can specify multiple BEFORE and/or
AFTER commands. List commands in the order of desired execution.
BEFORE - Optional. An SQL command to run before the load operation starts.
Enclose commands in quotes.
AFTER - Optional. An SQL command to run after the load operation completes.
Enclose commands in quotes.
*****************************************************
NOTES
*****************************************************
If your database object names were created using a double-quoted
identifier (delimited identifier), you must specify the delimited
name within single quotes in the hawq load control file. For example,
if you create a table as follows:
CREATE TABLE "MyTable" ("MyColumn" text);
Your YAML-formatted hawq load control file would refer to the above
table and column names as follows:
    - COLUMNS:
         - '"MyColumn"': text
   OUTPUT:
    - TABLE: public.'"MyTable"'
*****************************************************
LOG FILE FORMAT
*****************************************************
Log files output by hawq load have the following format:
<timestamp>|<level>|<message>
Where <timestamp> takes the form: YYYY-MM-DD HH:MM:SS,
<level> is one of DEBUG, LOG, INFO, ERROR, and <message> is
a normal text message. Some INFO messages that may be of
interest in the log files are (where # corresponds to the
actual number of seconds, units of data, or failed rows):
INFO|running time: #.## seconds
INFO|transferred #.# kB of #.# kB.
INFO|hawq load succeeded
INFO|hawq load succeeded with warnings
INFO|hawq load failed
INFO|1 bad row
INFO|# bad rows
*****************************************************
EXAMPLES
*****************************************************
Run a load job as defined in my_load.yml:
hawq load -f my_load.yml
Example load control file:
---
VERSION: 1.0.0.1
DATABASE: ops
USER: gpadmin
HOST: mdw-1
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - etl1-1
           - etl1-2
           - etl1-3
           - etl1-4
         PORT: 8081
         FILE:
           - /var/load/data/*
    - COLUMNS:
         - name: text
         - amount: float4
         - category: text
         - desc: text
         - date: date
    - FORMAT: text
    - DELIMITER: '|'
    - ERROR_LIMIT: 25
    - ERROR_TABLE: payables.err_expenses
   OUTPUT:
    - TABLE: payables.expenses
    - MODE: INSERT
   SQL:
    - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
    - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
*****************************************************
SEE ALSO
*****************************************************
gpfdist, CREATE EXTERNAL TABLE