gpMgmt/doc/gpmapreduce_help - cloudberry - Git at Google

 COMMAND NAME: gpmapreduce

 Runs Cloudberry MapReduce jobs as defined in a YAML specification document.


 *****************************************************
 SYNOPSIS
 *****************************************************

 gpmapreduce -f <yaml_file> [<dbname> [<username>]]
             [-k <name>=<value> | --key <name>=<value>]
             [-h <hostname> | --host <hostname>]
             [-p <port>| --port <port>]
             [-U <username> | --username <username>] [-W] [-v]

 gpmapreduce -V | --version

 gpmapreduce -h | --help

 gpmapreduce -x | --explain

 gpmapreduce -X | --explain-analyze


 *****************************************************
 PREREQUISITES
 *****************************************************

 The following are required prior to running this program:

 * You must have your MapReduce job defined in a YAML file.

 * You must be a Apache Cloudberry superuser to run MapReduce jobs
   written in untrusted Perl or Python.

 * You must be a Apache Cloudberry superuser to run MapReduce jobs
   with EXEC and FILE inputs.

 * Non-superuser roles must be granted external table permissions
   using CREATE/ALTER ROLE in order to run MapReduce jobs.

 *****************************************************
 DESCRIPTION
 *****************************************************

 MapReduce is a programming model developed by Google for
 processing and generating large data sets on an array of commodity
 servers. Cloudberry MapReduce allows programmers who are familiar
 with the MapReduce paradigm to write map and reduce functions and
 submit them to the Apache Cloudberry parallel engine for processing.

 In order for Cloudberry to be able to process MapReduce functions,
 the functions need to be defined in a YAML document, which is then
 passed to the Cloudberry MapReduce program, gpmapreduce, for execution
 by the Apache Cloudberry parallel engine. The Cloudberry system takes
 care of the details of distributing the input data, executing the
 program across a set of machines, handling machine failures,
 and managing the required inter-machine communication.


 *****************************************************
 OPTIONS
 *****************************************************


 -f <yaml_file>

 Required. The YAML file that contains the Cloudberry MapReduce
 job definitions. See the Apache Cloudberry Administrator Guide
 for more information about creating YAML documents.


 -? | --help

 Show help, then exit.


 -V | --version

 Show version information, then exit.


 -v | --verbose

 Show verbose output.


 -x | --explain

 Do not run MapReduce jobs, but produce explain plans.


 -X | --explain-analyze

 Run MapReduce jobs and produce explain-analyze plans.


 -k | --key <name>=<value>

 Sets a YAML variable.  A value is required. Defaults to "key"
 if no variable name is specified.


 -h <host> | --host <host>

 Specifies the host name of the machine on which the Cloudberry
 coordinator database server is running. If not specified, reads
 from the environment variable PGHOST or defaults to localhost.


 -p <port> | --port <port>

 Specifies the TCP port on which the Cloudberry coordinator database
 server is listening for connections. If not specified, reads
 from the environment variable PGPORT or defaults to 5432.


 -U <username> | --username <username>

 The database role name to connect as. If not specified, reads
 from the environment variable PGUSER or defaults to the
 current system user name.


 -W | --password

 Force a password prompt.


 *****************************************************
 EXAMPLES
 *****************************************************

 Run a MapReduce job as defined in my_yaml.txt:

 gpmapreduce -f my_yaml.txt


 *****************************************************
 SEE ALSO
 *****************************************************

 "Cloudberry MapReduce YAML Specification" in the
 Apache Cloudberry Administrator Guide
	COMMAND NAME: gpmapreduce

	Runs Cloudberry MapReduce jobs as defined in a YAML specification document.


	*****************************************************
	SYNOPSIS
	*****************************************************

	gpmapreduce -f <yaml_file> [<dbname> [<username>]]
	[-k <name>=<value> \| --key <name>=<value>]
	[-h <hostname> \| --host <hostname>]
	[-p <port>\| --port <port>]
	[-U <username> \| --username <username>] [-W] [-v]

	gpmapreduce -V \| --version

	gpmapreduce -h \| --help

	gpmapreduce -x \| --explain

	gpmapreduce -X \| --explain-analyze


	*****************************************************
	PREREQUISITES
	*****************************************************

	The following are required prior to running this program:

	* You must have your MapReduce job defined in a YAML file.

	* You must be a Apache Cloudberry superuser to run MapReduce jobs
	written in untrusted Perl or Python.

	* You must be a Apache Cloudberry superuser to run MapReduce jobs
	with EXEC and FILE inputs.

	* Non-superuser roles must be granted external table permissions
	using CREATE/ALTER ROLE in order to run MapReduce jobs.

	*****************************************************
	DESCRIPTION
	*****************************************************

	MapReduce is a programming model developed by Google for
	processing and generating large data sets on an array of commodity
	servers. Cloudberry MapReduce allows programmers who are familiar
	with the MapReduce paradigm to write map and reduce functions and
	submit them to the Apache Cloudberry parallel engine for processing.

	In order for Cloudberry to be able to process MapReduce functions,
	the functions need to be defined in a YAML document, which is then
	passed to the Cloudberry MapReduce program, gpmapreduce, for execution
	by the Apache Cloudberry parallel engine. The Cloudberry system takes
	care of the details of distributing the input data, executing the
	program across a set of machines, handling machine failures,
	and managing the required inter-machine communication.


	*****************************************************
	OPTIONS
	*****************************************************


	-f <yaml_file>

	Required. The YAML file that contains the Cloudberry MapReduce
	job definitions. See the Apache Cloudberry Administrator Guide
	for more information about creating YAML documents.


	-? \| --help

	Show help, then exit.


	-V \| --version

	Show version information, then exit.


	-v \| --verbose

	Show verbose output.


	-x \| --explain

	Do not run MapReduce jobs, but produce explain plans.


	-X \| --explain-analyze

	Run MapReduce jobs and produce explain-analyze plans.


	-k \| --key <name>=<value>

	Sets a YAML variable. A value is required. Defaults to "key"
	if no variable name is specified.


	-h <host> \| --host <host>

	Specifies the host name of the machine on which the Cloudberry
	coordinator database server is running. If not specified, reads
	from the environment variable PGHOST or defaults to localhost.


	-p <port> \| --port <port>

	Specifies the TCP port on which the Cloudberry coordinator database
	server is listening for connections. If not specified, reads
	from the environment variable PGPORT or defaults to 5432.


	-U <username> \| --username <username>

	The database role name to connect as. If not specified, reads
	from the environment variable PGUSER or defaults to the
	current system user name.


	-W \| --password

	Force a password prompt.


	*****************************************************
	EXAMPLES
	*****************************************************

	Run a MapReduce job as defined in my_yaml.txt:

	gpmapreduce -f my_yaml.txt


	*****************************************************
	SEE ALSO
	*****************************************************

	"Cloudberry MapReduce YAML Specification" in the
	Apache Cloudberry Administrator Guide