| How to use TEZ |
| ======================= |
| |
| Tez provides an ApplicationMaster that can run an arbitrary DAG of tasks. It also |
| provides a translation layer to run MR jobs using the MR APIs. This translation |
| layer is not fully feature compatible, so if you see any issues running your |
| existing MR jobs on Tez, please file a JIRA. |
| |
| Install/Deploy Instructions |
| =========================== |
| |
| 1) Deploy Apache Hadoop using either the 2.2.0 release or a compatible 2.x version. |
| 2) Build tez using "mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true" |
| - If you prefer to run the unit tests, remove skipTests from the command above. |
| - A tarball containing the libraries required to run tez will be created at tez-dist/target/tez-0.6.0-SNAPSHOT.tar.gz |
| 3) Copy the relevant tez tarball into HDFS, and configure tez-site.xml |
| - A tez tarball containing tez and hadoop libraries will be found at tez-dist/target/tez-0.6.0-SNAPSHOT.tar.gz |
| - Assuming that the tez jars are put in /apps/ on HDFS, the commands would be |
| "hadoop fs -mkdir /apps/tez-0.6.0-SNAPSHOT" |
| "hadoop fs -copyFromLocal tez-dist/target/tez-0.6.0-SNAPSHOT.tar.gz /apps/tez-0.6.0-SNAPSHOT/" |
| - tez-site.xml configuration |
| - Set tez.lib.uris to point to the tar.gz uploaded to HDFS. Assuming the steps mentioned so far were followed, |
| set tez.lib.uris to "${fs.defaultFS}/apps/tez-0.6.0-SNAPSHOT/tez-0.6.0-SNAPSHOT.tar.gz" |
| - Ensure tez.use.cluster.hadoop-libs is not set in tez-site.xml, or if it is set, the value should be false |
| 4) Optional: If running existing MapReduce jobs on Tez, modify mapred-site.xml to change the |
| "mapreduce.framework.name" property from "yarn" to "yarn-tez" |
| 5) Configure the client node to include the tez-libraries in the hadoop classpath |
| - Extract the tez tarball created in step 2 to a local directory (the following steps assume TEZ_JARS is the directory the files are extracted to) |
| "tar -xvzf tez-dist/target/tez-0.6.0-SNAPSHOT.tar.gz -C $TEZ_JARS" |
| - set HADOOP_CLASSPATH to include the tez-libraries |
| - set TEZ_CONF_DIR to the location of tez-site.xml |
| - The command to set up the classpath should be something like: |
| "export HADOOP_CLASSPATH=${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*" |
| - Please note the trailing "*": it is required so that the jar files inside these directories, and not just the directories themselves, are added to the classpath. |
| |
| 6) There is a basic example of a Tez job in the tez-examples.jar. Refer to OrderedWordCount.java |
| in the source code. To run this example: |
| |
| $HADOOP_PREFIX/bin/hadoop jar tez-examples.jar orderedwordcount <input> <output> |
| |
| This will use the TEZ DAG ApplicationMaster to run the ordered word count job. This job is similar |
| to the word count example except that it also orders all words based on the frequency of |
| occurrence. |
| |
| There are multiple variations of orderedwordcount. You can take a look at TestOrderedWordCount.java |
| in tez-tests for these variations. You can use it to run multiple |
| DAGs serially on different inputs/outputs. These DAGs could be run separately as |
| different applications or serially within a single TEZ session. |
| |
| $HADOOP_PREFIX/bin/hadoop jar tez-tests.jar testorderedwordcount <input1> <output1> <input2> <output2> <input3> <output3> ... |
| |
| The above will run one DAG for each input-output pair. To use Tez sessions, |
| set -DUSE_TEZ_SESSION=true |
| |
| $HADOOP_PREFIX/bin/hadoop jar tez-tests.jar testorderedwordcount -DUSE_TEZ_SESSION=true <input1> <output1> <input2> <output2> |
| |
| 7) To test MR jobs you can submit an MR job as you normally would using something like: |
| |
| $HADOOP_PREFIX/bin/hadoop jar hadoop-mapreduce-client-jobclient-2.2.0-tests.jar sleep -mt 1 -rt 1 -m 1 -r 1 |
| |
| This will use the TEZ DAG ApplicationMaster to run the MR job. This can be verified by looking at |
| the AM's logs from the YARN ResourceManager UI. This needs mapred-site.xml to have "mapreduce.framework.name" |
| set to "yarn-tez" |
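| |
| As a rough illustration of the mapred-site.xml change described in steps 4 and 7, the sketch below |
| writes a minimal file containing only that property. SKETCH_CONF_DIR is a hypothetical scratch |
| directory introduced for this example; on a real cluster the property would be merged into the |
| existing mapred-site.xml rather than written to a new file. |
| |
```shell
set -eu

# SKETCH_CONF_DIR is a hypothetical scratch directory, not a Hadoop variable.
# On a real cluster, merge the property below into the existing
# mapred-site.xml instead of overwriting that file.
SKETCH_CONF_DIR="${SKETCH_CONF_DIR:-$PWD/mapred-conf-sketch}"
mkdir -p "$SKETCH_CONF_DIR"

cat > "$SKETCH_CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Run MR jobs through the Tez DAG ApplicationMaster -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn-tez</value>
  </property>
</configuration>
EOF

# Show the line that was written so the change can be checked by eye.
grep 'yarn-tez' "$SKETCH_CONF_DIR/mapred-site.xml"
```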
| |
| |
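| The tez-site.xml and classpath settings from steps 3 and 5 could be put together roughly as follows. |
| This is only a sketch: the default locations ($PWD/tez-conf and $PWD/tez-jars) are assumptions made |
| for the example, and the HDFS path is the one used in step 3. It does not upload or extract the tarball. |
| |
```shell
set -eu

# Hypothetical default locations; point these at your real directories.
TEZ_CONF_DIR="${TEZ_CONF_DIR:-$PWD/tez-conf}"
TEZ_JARS="${TEZ_JARS:-$PWD/tez-jars}"
mkdir -p "$TEZ_CONF_DIR" "$TEZ_JARS"

# The heredoc delimiter is quoted so that ${fs.defaultFS} reaches the file
# literally; Hadoop resolves it from its own configuration, not the shell.
cat > "$TEZ_CONF_DIR/tez-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>${fs.defaultFS}/apps/tez-0.6.0-SNAPSHOT/tez-0.6.0-SNAPSHOT.tar.gz</value>
  </property>
</configuration>
EOF

# The "*" entries are quoted so the shell leaves them alone; the JVM expands
# them to the jar files in each directory.
export TEZ_CONF_DIR TEZ_JARS
export HADOOP_CLASSPATH="${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*"
echo "HADOOP_CLASSPATH=${HADOOP_CLASSPATH}"
```
| |
| If HADOOP_CLASSPATH already has a value in your environment, you may want to append to it instead of replacing it. |
| |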
| Hadoop Installation dependent Install/Deploy Instructions |
| ========================================================= |
| The install instructions above use Tez with the pre-packaged Hadoop libraries included in the package, which is the |
| recommended method of installation. If Tez needs to use the existing cluster Hadoop libraries instead, |
| follow this alternate mechanism to set up Tez to use the Hadoop libraries from the cluster. |
| Step 3 above changes as follows. Subsequent steps would also use tez-dist/target/tez-0.6.0-SNAPSHOT-minimal.tar.gz instead of tez-dist/target/tez-0.6.0-SNAPSHOT.tar.gz |
| - A tez build without Hadoop dependencies will be available at tez-dist/target/tez-0.6.0-SNAPSHOT-minimal.tar.gz |
| - Assuming that the tez jars are put in /apps/ on HDFS, the commands would be |
| "hadoop fs -mkdir /apps/tez-0.6.0-SNAPSHOT" |
| "hadoop fs -copyFromLocal tez-dist/target/tez-0.6.0-SNAPSHOT-minimal.tar.gz /apps/tez-0.6.0-SNAPSHOT" |
| - tez-site.xml configuration |
| - Set tez.lib.uris to point to the path in HDFS containing the minimal tez tarball. Assuming the steps mentioned so far were followed, |
| set tez.lib.uris to "${fs.defaultFS}/apps/tez-0.6.0-SNAPSHOT/tez-0.6.0-SNAPSHOT-minimal.tar.gz" |
| - Set tez.use.cluster.hadoop-libs to true |
| |
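| For the cluster-libraries variant, the tez-site.xml described above might look like the file written by |
| this sketch. The default TEZ_CONF_DIR location ($PWD/tez-conf) is an assumption made for the example; |
| the two properties and their values are the ones listed in this section. |
| |
```shell
set -eu

# Hypothetical default location; point this at your real configuration directory.
TEZ_CONF_DIR="${TEZ_CONF_DIR:-$PWD/tez-conf}"
mkdir -p "$TEZ_CONF_DIR"

# Quoted delimiter: ${fs.defaultFS} must be written literally for Hadoop to resolve.
cat > "$TEZ_CONF_DIR/tez-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Minimal tarball: built without the Hadoop dependencies -->
  <property>
    <name>tez.lib.uris</name>
    <value>${fs.defaultFS}/apps/tez-0.6.0-SNAPSHOT/tez-0.6.0-SNAPSHOT-minimal.tar.gz</value>
  </property>
  <!-- Use the Hadoop libraries already installed on the cluster -->
  <property>
    <name>tez.use.cluster.hadoop-libs</name>
    <value>true</value>
  </property>
</configuration>
EOF

echo "wrote $TEZ_CONF_DIR/tez-site.xml"
```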