
 How to use TEZ
 =======================

 Tez provides an ApplicationMaster that can run any arbitrary DAG of tasks. It also
 provides a translation layer to run MR jobs using the MR APIs. This translation
 layer is not fully feature-compatible, so if you see any issues running your
 existing MR jobs on Tez, please file a JIRA.

 Install/Deploy Instructions
 ===========================

 1) Deploy Apache Hadoop either using the 2.2.0 release or a compatible 2.x version.
 2) Build tez using "mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true"
    - If you prefer to run the unit tests, remove skipTests from the command above.
    - A tarball containing the libraries required to run tez will be created at tez-dist/target/tez-0.6.0-SNAPSHOT.tar.gz
 3) Copy the relevant tez tarball into HDFS, and configure tez-site.xml
    - A tez tarball containing tez and hadoop libraries will be found at tez-dist/target/tez-0.6.0-SNAPSHOT.tar.gz
    - Assuming that the tez jars are put in /apps/ on HDFS, the commands would be
      "hadoop fs -mkdir /apps/tez-0.6.0-SNAPSHOT"
      "hadoop fs -copyFromLocal tez-dist/target/tez-0.6.0-SNAPSHOT.tar.gz /apps/tez-0.6.0-SNAPSHOT/"
    - tez-site.xml configuration
      - Set tez.lib.uris to point to the tar.gz uploaded to HDFS. Assuming the steps mentioned so far were followed,
        set tez.lib.uris to "${fs.defaultFS}/apps/tez-0.6.0-SNAPSHOT/tez-0.6.0-SNAPSHOT.tar.gz"
      - Ensure tez.use.cluster.hadoop-libs is not set in tez-site.xml, or if it is set, the value should be false
 4) Optional: To run existing MapReduce jobs on Tez, modify mapred-site.xml to change the
    "mapreduce.framework.name" property from "yarn" (the usual value on a YARN cluster) to "yarn-tez".
 5) Configure the client node to include the tez-libraries in the hadoop classpath
    - Extract the tez tarball created in step 2 to a local directory (TEZ_JARS below is an existing directory into which the files will be decompressed for the next steps)
      "tar -xvzf tez-dist/target/tez-0.6.0-SNAPSHOT.tar.gz -C $TEZ_JARS"
    - Set HADOOP_CLASSPATH to include the tez-libraries
      - Set TEZ_CONF_DIR to the location of tez-site.xml
      - The command to set up the classpath should be something like:
        "export HADOOP_CLASSPATH=${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*"
      - Please note the "*" which is an important requirement when setting up classpaths for directories containing jar files.
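
 The tez-site.xml setting from step 3 and the classpath setup from step 5 can be sketched as a
 short shell script. This is a minimal sketch under stated assumptions, not an official setup
 script: it assumes the tarball was uploaded to /apps/tez-0.6.0-SNAPSHOT/ on HDFS as in step 3,
 and the default locations for TEZ_CONF_DIR and TEZ_JARS are throwaway placeholder directories
 chosen only for illustration.

```shell
#!/bin/sh
# Sketch of steps 3 and 5: write a minimal tez-site.xml and build HADOOP_CLASSPATH.
# The default directories below are placeholders; use permanent locations in practice.
TEZ_CONF_DIR="${TEZ_CONF_DIR:-${TMPDIR:-/tmp}/tez-demo/conf}"
TEZ_JARS="${TEZ_JARS:-${TMPDIR:-/tmp}/tez-demo/jars}"
TEZ_TARBALL="tez-dist/target/tez-0.6.0-SNAPSHOT.tar.gz"

# tar -C needs an existing directory, so create both directories first.
mkdir -p "$TEZ_CONF_DIR" "$TEZ_JARS"

# Quoting 'EOF' stops the shell from touching ${fs.defaultFS}; Hadoop
# substitutes it from the cluster configuration when the file is loaded.
cat > "$TEZ_CONF_DIR/tez-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>${fs.defaultFS}/apps/tez-0.6.0-SNAPSHOT/tez-0.6.0-SNAPSHOT.tar.gz</value>
  </property>
</configuration>
EOF

# Extract the client-side copy of the libraries if the build output is present.
if [ -f "$TEZ_TARBALL" ]; then
  tar -xzf "$TEZ_TARBALL" -C "$TEZ_JARS"
fi

# The double quotes keep each "*" literal; the Java launcher then expands a
# trailing "*" to all jar files in that directory.
export HADOOP_CLASSPATH="${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*"
echo "HADOOP_CLASSPATH=$HADOOP_CLASSPATH"
```

 To make the exported variable visible to later hadoop commands, source the script in the
 current shell rather than executing it as a child process.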

 6) There is a basic example of a Tez job in the tez-examples.jar. Refer to OrderedWordCount.java
 in the source code. To run this example:

 $HADOOP_PREFIX/bin/hadoop jar tez-examples.jar orderedwordcount <input> <output>

 This will use the TEZ DAG ApplicationMaster to run the ordered word count job. This job is similar
 to the word count example except that it also orders all words based on the frequency of
 occurrence.

 There are multiple variations of orderedwordcount. You can take a look at TestOrderedWordCount.java
 in tez-tests for these variations. You can use it to run multiple
 DAGs serially on different inputs/outputs. These DAGs could be run separately as
 different applications or serially within a single TEZ session.

 $HADOOP_PREFIX/bin/hadoop jar tez-tests.jar testorderedwordcount <input1> <output1> <input2> <output2> <input3> <output3> ...

 The above will run one DAG for each input-output pair. To use TEZ sessions,
 set -DUSE_TEZ_SESSION=true:

 $HADOOP_PREFIX/bin/hadoop jar tez-tests.jar testorderedwordcount -DUSE_TEZ_SESSION=true <input1> <output1> <input2> <output2>

 7) To test MR jobs you can submit an MR job as you normally would using something like:

 $HADOOP_PREFIX/bin/hadoop jar hadoop-mapreduce-client-jobclient-2.2.0-tests.jar sleep -mt 1 -rt 1 -m 1 -r 1

 This will use the TEZ DAG ApplicationMaster to run the MR job. This can be verified by looking at
 the AM's logs from the YARN ResourceManager UI. It requires mapred-site.xml to have "mapreduce.framework.name"
 set to "yarn-tez".
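
 Before submitting the MR job, it can help to confirm that mapred-site.xml really carries the
 "yarn-tez" setting. The following is a minimal sketch: uses_yarn_tez is a hypothetical helper
 name, the sample file exists only for the demonstration, and the check is a plain text match
 that does not parse XML (so it ignores comments and unusual formatting).

```shell
#!/bin/sh
# Returns 0 if the given mapred-site.xml sets mapreduce.framework.name to yarn-tez.
# Hypothetical helper: a naive text match, not a real XML parser.
uses_yarn_tez() {
  # Join the file into one line so <name> and <value> on separate lines still match.
  tr -d '\n' < "$1" |
    grep -q '<name>mapreduce\.framework\.name</name>[[:space:]]*<value>yarn-tez</value>'
}

# Write a sample mapred-site.xml to a throwaway location for the demonstration.
sample="${TMPDIR:-/tmp}/mapred-site-sample.xml"
cat > "$sample" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn-tez</value>
  </property>
</configuration>
EOF

if uses_yarn_tez "$sample"; then
  echo "MR jobs will be submitted through Tez"
else
  echo "mapreduce.framework.name is not set to yarn-tez"
fi
```

 On a real client node the helper would be pointed at the mapred-site.xml in the Hadoop
 configuration directory (commonly $HADOOP_CONF_DIR, which is an assumption about the local setup).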


 Hadoop Installation dependent Install/Deploy Instructions
 =========================================================
 The install instructions above use Tez with the Hadoop libraries pre-packaged in the tarball, which is the
 recommended method of installation. If Tez needs to use the existing cluster Hadoop libraries instead,
 follow this alternate mechanism to set up Tez to use the Hadoop libraries from the cluster.
 Step 3 above changes as follows. Subsequent steps also use tez-dist/target/tez-0.6.0-SNAPSHOT-minimal.tar.gz instead of tez-dist/target/tez-0.6.0-SNAPSHOT.tar.gz.
    - A tez build without Hadoop dependencies will be available at tez-dist/target/tez-0.6.0-SNAPSHOT-minimal.tar.gz
    - Assuming that the tez jars are put in /apps/ on HDFS, the commands would be
      "hadoop fs -mkdir /apps/tez-0.6.0-SNAPSHOT"
      "hadoop fs -copyFromLocal tez-dist/target/tez-0.6.0-SNAPSHOT-minimal.tar.gz /apps/tez-0.6.0-SNAPSHOT"
    - tez-site.xml configuration
      - Set tez.lib.uris to point to the tar.gz uploaded to HDFS. Assuming the steps mentioned so far were followed,
        set tez.lib.uris to "${fs.defaultFS}/apps/tez-0.6.0-SNAPSHOT/tez-0.6.0-SNAPSHOT-minimal.tar.gz"
      - Set tez.use.cluster.hadoop-libs to true
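
 For this alternate setup, the resulting tez-site.xml might look like the file written by the
 script below. This is a minimal sketch, not an official template: it assumes the minimal tarball
 was uploaded to /apps/tez-0.6.0-SNAPSHOT on HDFS as described, and the default TEZ_CONF_DIR is a
 throwaway placeholder directory chosen only for illustration.

```shell
#!/bin/sh
# Sketch: write a tez-site.xml that makes Tez use the cluster's Hadoop libraries.
# The default TEZ_CONF_DIR is a placeholder; use a permanent directory in practice.
TEZ_CONF_DIR="${TEZ_CONF_DIR:-${TMPDIR:-/tmp}/tez-demo/conf}"
mkdir -p "$TEZ_CONF_DIR"

# Quoting 'EOF' stops the shell from touching ${fs.defaultFS}; Hadoop
# substitutes it from the cluster configuration when the file is loaded.
cat > "$TEZ_CONF_DIR/tez-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>${fs.defaultFS}/apps/tez-0.6.0-SNAPSHOT/tez-0.6.0-SNAPSHOT-minimal.tar.gz</value>
  </property>
  <property>
    <name>tez.use.cluster.hadoop-libs</name>
    <value>true</value>
  </property>
</configuration>
EOF
echo "wrote $TEZ_CONF_DIR/tez-site.xml"
```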