COMMAND NAME: gpmapreduce
Runs Cloudberry MapReduce jobs as defined in a YAML specification document.
*****************************************************
SYNOPSIS
*****************************************************
gpmapreduce -f <yaml_file> [<dbname> [<username>]]
[-k <name>=<value> | --key <name>=<value>]
[-h <hostname> | --host <hostname>]
[-p <port> | --port <port>]
[-U <username> | --username <username>] [-W] [-v]
gpmapreduce -V | --version
gpmapreduce -? | --help
gpmapreduce -x | --explain
gpmapreduce -X | --explain-analyze
*****************************************************
PREREQUISITES
*****************************************************
The following are required prior to running this program:
* You must have your MapReduce job defined in a YAML file.
* You must be an Apache Cloudberry superuser to run MapReduce jobs
written in untrusted Perl or Python.
* You must be an Apache Cloudberry superuser to run MapReduce jobs
with EXEC and FILE inputs.
* Non-superuser roles must be granted external table permissions
using CREATE/ALTER ROLE in order to run MapReduce jobs.
*****************************************************
DESCRIPTION
*****************************************************
MapReduce is a programming model developed by Google for
processing and generating large data sets on an array of commodity
servers. Cloudberry MapReduce allows programmers who are familiar
with the MapReduce paradigm to write map and reduce functions and
submit them to the Apache Cloudberry parallel engine for processing.
To be processed, the map and reduce functions must be defined in a
YAML document, which is then passed to the Cloudberry MapReduce
program, gpmapreduce, for execution by the Apache Cloudberry
parallel engine. The Cloudberry system takes
care of the details of distributing the input data, executing the
program across a set of machines, handling machine failures,
and managing the required inter-machine communication.
*****************************************************
OPTIONS
*****************************************************
-f <yaml_file>
Required. The YAML file that contains the Cloudberry MapReduce
job definitions. See the Apache Cloudberry Administrator Guide
for more information about creating YAML documents.
-? | --help
Show help, then exit.
-V | --version
Show version information, then exit.
-v | --verbose
Show verbose output.
-x | --explain
Do not run MapReduce jobs, but produce explain plans.
-X | --explain-analyze
Run MapReduce jobs and produce explain-analyze plans.
-k <name>=<value> | --key <name>=<value>
Sets a YAML variable. A value is required. The variable name
defaults to "key" if none is specified.
-h <hostname> | --host <hostname>
Specifies the host name of the machine on which the Cloudberry
coordinator database server is running. If not specified, reads
from the environment variable PGHOST or defaults to localhost.
-p <port> | --port <port>
Specifies the TCP port on which the Cloudberry coordinator database
server is listening for connections. If not specified, reads
from the environment variable PGPORT or defaults to 5432.
-U <username> | --username <username>
The database role name to connect as. If not specified, reads
from the environment variable PGUSER or defaults to the
current system user name.
-W | --password
Force a password prompt.
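The fallback order described for -h, -p, and -U (explicit option,
then environment variable, then built-in default) can be sketched in
shell as follows. This is only an illustration of the documented
behavior, not gpmapreduce's own code: resolve_conn is a hypothetical
helper, and the host name cdw.example.com is a placeholder.

```shell
#!/bin/sh
# Sketch only: resolve_conn is a hypothetical helper, not part of gpmapreduce.
# It mirrors the documented fallback order for -h, -p and -U:
# explicit argument, then environment variable, then built-in default.
resolve_conn() {
    host=${1:-${PGHOST:-localhost}}
    port=${2:-${PGPORT:-5432}}
    user=${3:-${PGUSER:-$(id -un)}}
    printf 'host=%s port=%s user=%s\n' "$host" "$port" "$user"
}

# No arguments: values come from PGHOST, PGPORT and PGUSER when they are set.
resolve_conn

# Explicit host and port take precedence over the environment
# (cdw.example.com is a placeholder host name).
resolve_conn cdw.example.com 6000
```

Printing the resolved values this way can help confirm which
coordinator a job will connect to before it is run.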
*****************************************************
EXAMPLES
*****************************************************
Run a MapReduce job as defined in my_yaml.txt:
gpmapreduce -f my_yaml.txt
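A YAML variable can be supplied on the same command line with -k.
The sketch below prints the command line instead of running it, so
the arguments can be reviewed first. The helper gpmr_command and the
variable name threshold are hypothetical; the only rule it enforces
is the documented one that -k requires a value.

```shell
#!/bin/sh
# Sketch only: gpmr_command is a hypothetical helper, not part of gpmapreduce.
# It prints the command line for a job file and an optional -k argument,
# rejecting a -k argument that has an empty value (for example "name=").
gpmr_command() {
    yaml_file=$1
    key_value=${2:-}
    if [ -z "$key_value" ]; then
        printf 'gpmapreduce -f %s\n' "$yaml_file"
        return 0
    fi
    case $key_value in
        *=)
            printf 'missing value in -k argument: %s\n' "$key_value" >&2
            return 1
            ;;
        *)
            printf 'gpmapreduce -f %s -k %s\n' "$yaml_file" "$key_value"
            ;;
    esac
}

# Prints: gpmapreduce -f my_yaml.txt
gpmr_command my_yaml.txt

# Prints: gpmapreduce -f my_yaml.txt -k threshold=10
# ("threshold" is a hypothetical variable name used in the YAML file).
gpmr_command my_yaml.txt threshold=10
```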
*****************************************************
SEE ALSO
*****************************************************
"Cloudberry MapReduce YAML Specification" in the
Apache Cloudberry Administrator Guide