<!---
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
------------------------------------------------------------------------------
Apache Knox Gateway - Usage Examples
------------------------------------------------------------------------------
This guide provides detailed examples of how to perform some basic interactions
with Hadoop via the Apache Knox Gateway.
The first two examples submit a Java MapReduce job and an Oozie workflow using the
KnoxShell DSL.
* Example #1: WebHDFS & Templeton/WebHCat via KnoxShell DSL
* Example #2: WebHDFS & Oozie via KnoxShell DSL
The second two examples submit the same job and workflow but do so using only
the [cURL](http://curl.haxx.se/) command line HTTP client.
* Example #3: WebHDFS & Templeton/WebHCat via cURL
* Example #4: WebHDFS & Oozie via cURL
------------------------------------------------------------------------------
Assumptions
------------------------------------------------------------------------------
This document assumes a few things about your environment in order to
simplify the examples.
1. The JVM is executable as simply java.
2. The Apache Knox Gateway is installed and functional.
3. The example commands are executed with GATEWAY_HOME as the current
directory. The GATEWAY_HOME directory is the directory within the
Apache Knox Gateway installation that contains the README file and the bin,
conf and deployments directories.
4. A few examples optionally use commands from a standard Groovy installation.
To try these optional examples you will need Groovy [installed][gii].
[gii]: http://groovy.codehaus.org/Installing+Groovy
------------------------------------------------------------------------------
Customization
------------------------------------------------------------------------------
These examples may need to be tailored to the execution environment. In
particular, hostnames and ports may need to be changed to match your
environment. There are two example files in the distribution
that may need to be customized. Take a moment to review these files.
All of the values that may need to be customized can be found together at the
top of each file.
* samples/ExampleSubmitJob.groovy
* samples/ExampleSubmitWorkflow.groovy
If you are using the Sandbox VM for your Hadoop cluster you may want to
review [these configuration tips][sb].
[sb]: sandbox.html
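For reference, the values you may need to change sit together at the top of
each script. In samples/ExampleSubmitJob.groovy they look like the following;
adjust the gateway host, port, cluster name and credentials to match your
environment (samples/ExampleSubmitWorkflow.groovy additionally defines
jobTracker and nameNode values).
gateway = "https://localhost:8443/gateway/sample"   // gateway address and cluster name
username = "mapred"                                 // user the job is submitted as
password = "mapred-password"
dataFile = "LICENSE"                                // local file uploaded as job input
jarFile = "samples/hadoop-examples.jar"             // local copy of the examples jar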
------------------------------------------------------------------------------
Example #1: WebHDFS & Templeton/WebHCat via KnoxShell DSL
------------------------------------------------------------------------------
This example will submit the familiar WordCount Java MapReduce job to the
Hadoop cluster via the gateway using the KnoxShell DSL. There are several
ways to do this depending upon your preference.
You can use the "embedded" Groovy interpreter provided with the distribution.
java -jar bin/shell-${gateway-version}.jar samples/ExampleSubmitJob.groovy
You can load the KnoxShell DSL script into the standard Groovy Console.
groovyConsole -cp bin/shell-${gateway-version}.jar samples/ExampleSubmitJob.groovy
You can manually type the KnoxShell DSL script into the "embedded" Groovy
interpreter provided with the distribution.
java -jar bin/shell-${gateway-version}.jar
Each line from the file below will need to be typed or copied into the
interactive shell.
***samples/ExampleSubmitJob.groovy***
import com.jayway.jsonpath.JsonPath
import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import org.apache.hadoop.gateway.shell.job.Job
import static java.util.concurrent.TimeUnit.SECONDS
gateway = "https://localhost:8443/gateway/sample"
username = "mapred"
password = "mapred-password"
dataFile = "LICENSE"
jarFile = "samples/hadoop-examples.jar"
hadoop = Hadoop.login( gateway, username, password )
println "Delete /tmp/test " + Hdfs.rm(hadoop).file( "/tmp/test" ).recursive().now().statusCode
println "Create /tmp/test " + Hdfs.mkdir(hadoop).dir( "/tmp/test").now().statusCode
putData = Hdfs.put(hadoop).file( dataFile ).to( "/tmp/test/input/FILE" ).later() {
println "Put /tmp/test/input/FILE " + it.statusCode }
putJar = Hdfs.put(hadoop).file( jarFile ).to( "/tmp/test/hadoop-examples.jar" ).later() {
println "Put /tmp/test/hadoop-examples.jar " + it.statusCode }
hadoop.waitFor( putData, putJar )
jobId = Job.submitJava(hadoop) \
    .jar( "/tmp/test/hadoop-examples.jar" ) \
    .app( "wordcount" ) \
    .input( "/tmp/test/input" ) \
    .output( "/tmp/test/output" ) \
    .now().jobId
println "Submitted job " + jobId
done = false
count = 0
while( !done && count++ < 60 ) {
    sleep( 1000 )
    json = Job.queryStatus(hadoop).jobId(jobId).now().string
    done = JsonPath.read( json, "\$.status.jobComplete" )
}
println "Done " + done
println "Shutdown " + hadoop.shutdown( 10, SECONDS )
------------------------------------------------------------------------------
Example #2: WebHDFS & Oozie via KnoxShell DSL
------------------------------------------------------------------------------
This example will also submit the familiar WordCount Java MapReduce job to the
Hadoop cluster via the gateway using the KnoxShell DSL. However, in this case
the job will be submitted via an Oozie workflow. There are several ways to do
this depending upon your preference.
You can use the "embedded" Groovy interpreter provided with the distribution.
java -jar bin/shell-${gateway-version}.jar samples/ExampleSubmitWorkflow.groovy
You can load the KnoxShell DSL script into the standard Groovy Console.
groovyConsole -cp bin/shell-${gateway-version}.jar samples/ExampleSubmitWorkflow.groovy
You can manually type the KnoxShell DSL script into the "embedded" Groovy
interpreter provided with the distribution.
java -jar bin/shell-${gateway-version}.jar
Each line from the file below will need to be typed or copied into the
interactive shell.
***samples/ExampleSubmitWorkflow.groovy***
import com.jayway.jsonpath.JsonPath
import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import org.apache.hadoop.gateway.shell.workflow.Workflow
import static java.util.concurrent.TimeUnit.SECONDS
gateway = "https://localhost:8443/gateway/sample"
jobTracker = "sandbox:50300";
nameNode = "sandbox:8020";
username = "mapred"
password = "mapred-password"
inputFile = "LICENSE"
jarFile = "samples/hadoop-examples.jar"
definition = """\
<workflow-app xmlns="uri:oozie:workflow:0.2" name="wordcount-workflow">
    <start to="root-node"/>
    <action name="root-node">
        <java>
            <job-tracker>$jobTracker</job-tracker>
            <name-node>hdfs://$nameNode</name-node>
            <main-class>org.apache.hadoop.examples.WordCount</main-class>
            <arg>/tmp/test/input</arg>
            <arg>/tmp/test/output</arg>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Java failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
"""
configuration = """\
<configuration>
    <property>
        <name>user.name</name>
        <value>$username</value>
    </property>
    <property>
        <name>oozie.wf.application.path</name>
        <value>hdfs://$nameNode/tmp/test</value>
    </property>
</configuration>
"""
hadoop = Hadoop.login( gateway, username, password )
println "Delete /tmp/test " + Hdfs.rm(hadoop).file( "/tmp/test" ).recursive().now().statusCode
println "Mkdir /tmp/test " + Hdfs.mkdir(hadoop).dir( "/tmp/test").now().statusCode
putWorkflow = Hdfs.put(hadoop).text( definition ).to( "/tmp/test/workflow.xml" ).later() {
println "Put /tmp/test/workflow.xml " + it.statusCode }
putData = Hdfs.put(hadoop).file( inputFile ).to( "/tmp/test/input/FILE" ).later() {
println "Put /tmp/test/input/FILE " + it.statusCode }
putJar = Hdfs.put(hadoop).file( jarFile ).to( "/tmp/test/lib/hadoop-examples.jar" ).later() {
println "Put /tmp/test/lib/hadoop-examples.jar " + it.statusCode }
hadoop.waitFor( putWorkflow, putData, putJar )
jobId = Workflow.submit(hadoop).text( configuration ).now().jobId
println "Submitted job " + jobId
status = "UNKNOWN";
count = 0;
while( status != "SUCCEEDED" && count++ < 60 ) {
sleep( 1000 )
json = Workflow.status(hadoop).jobId( jobId ).now().string
status = JsonPath.read( json, "\$.status" )
}
println "Job status " + status;
println "Shutdown " + hadoop.shutdown( 10, SECONDS )
------------------------------------------------------------------------------
Example #3: WebHDFS & Templeton/WebHCat via cURL
------------------------------------------------------------------------------
The example below illustrates the sequence of curl commands that could be used
to run a "word count" MapReduce job. It utilizes the hadoop-examples.jar
from a Hadoop install for running a simple word count job. Take care to
follow the instructions below for steps 3/4 and 5/6 where the Location header
returned by the call to the NameNode is copied for use with the call to the
DataNode that follows it.
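As a point of comparison with Example #1, each create/upload pair below (steps
3/4 and 5/6) is collapsed into a single Hdfs.put call by the KnoxShell DSL. A
minimal sketch of the equivalent, assuming the same gateway URL and credentials
used throughout this guide:
import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import static java.util.concurrent.TimeUnit.SECONDS
hadoop = Hadoop.login( "https://localhost:8443/gateway/sample", "mapred", "mapred-password" )
// One DSL call replaces steps 3 and 4: create the inode, then upload the jar contents.
println Hdfs.put( hadoop ).file( "hadoop-examples.jar" ).to( "/tmp/test/hadoop-examples.jar" ).now().statusCode
println hadoop.shutdown( 10, SECONDS )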
# 0. Optionally cleanup the test directory in case a previous example was run without cleaning up.
curl -i -k -u mapred:mapred-password -X DELETE \
'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test?op=DELETE&recursive=true'
# 1. Create a test input directory /tmp/test/input
curl -i -k -u mapred:mapred-password -X PUT \
'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/input?op=MKDIRS'
# 2. Create a test output directory /tmp/test/output
curl -i -k -u mapred:mapred-password -X PUT \
'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/output?op=MKDIRS'
# 3. Create the inode for hadoop-examples.jar in /tmp/test
curl -i -k -u mapred:mapred-password -X PUT \
'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/hadoop-examples.jar?op=CREATE'
# 4. Upload hadoop-examples.jar to /tmp/test. Use a hadoop-examples.jar from a Hadoop install.
curl -i -k -u mapred:mapred-password -T hadoop-examples.jar -X PUT '{Value of Location header from command above}'
# 5. Create the inode for a sample file README in /tmp/test/input
curl -i -k -u mapred:mapred-password -X PUT \
'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/input/README?op=CREATE'
# 6. Upload README to /tmp/test/input. Use the README file found in {GATEWAY_HOME}.
curl -i -k -u mapred:mapred-password -T README -X PUT '{Value of Location header from command above}'
# 7. Submit the word count job via WebHCat/Templeton.
# Take note of the Job ID in the JSON response as this will be used in the next step.
curl -v -i -k -u mapred:mapred-password -X POST \
-d jar=/tmp/test/hadoop-examples.jar -d class=wordcount \
-d arg=/tmp/test/input -d arg=/tmp/test/output \
'https://localhost:8443/gateway/sample/templeton/api/v1/mapreduce/jar'
# 8. Look at the status of the job
curl -i -k -u mapred:mapred-password -X GET \
'https://localhost:8443/gateway/sample/templeton/api/v1/queue/{Job ID returned in JSON body from previous step}'
# 9. Look at the status of the job queue
curl -i -k -u mapred:mapred-password -X GET \
'https://localhost:8443/gateway/sample/templeton/api/v1/queue'
# 10. List the contents of the output directory /tmp/test/output
curl -i -k -u mapred:mapred-password -X GET \
'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/output?op=LISTSTATUS'
# 11. Optionally cleanup the test directory
curl -i -k -u mapred:mapred-password -X DELETE \
'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test?op=DELETE&recursive=true'
------------------------------------------------------------------------------
Example #4: WebHDFS & Oozie via cURL
------------------------------------------------------------------------------
The example below illustrates the sequence of curl commands that could be used
to run a "word count" MapReduce job via an Oozie workflow. It utilizes the
hadoop-examples.jar from a Hadoop install for running a simple word count job.
Take care to follow the instructions below where replacement values are
required. These replacement values are identified with { } markup.
# 0. Optionally cleanup the test directory in case a previous example was run without cleaning up.
curl -i -k -u mapred:mapred-password -X DELETE \
'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test?op=DELETE&recursive=true'
# 1. Create the inode for workflow definition file in /tmp/test
curl -i -k -u mapred:mapred-password -X PUT \
'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/workflow.xml?op=CREATE'
# 2. Upload the workflow definition file. This file can be found in {GATEWAY_HOME}/templates
curl -i -k -u mapred:mapred-password -T templates/workflow-definition.xml -X PUT \
'{Value of Location header from command above}'
# 3. Create the inode for hadoop-examples.jar in /tmp/test/lib
curl -i -k -u mapred:mapred-password -X PUT \
'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/lib/hadoop-examples.jar?op=CREATE'
# 4. Upload hadoop-examples.jar to /tmp/test/lib. Use a hadoop-examples.jar from a Hadoop install.
curl -i -k -u mapred:mapred-password -T hadoop-examples.jar -X PUT \
'{Value of Location header from command above}'
# 5. Create the inode for a sample input file README in /tmp/test/input.
curl -i -k -u mapred:mapred-password -X PUT \
'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/input/README?op=CREATE'
# 6. Upload README to /tmp/test/input. Use the README file found in {GATEWAY_HOME}.
curl -i -k -u mapred:mapred-password -T README -X PUT \
'{Value of Location header from command above}'
# 7. Create the job configuration file by replacing the {NameNode host:port} and {JobTracker host:port}
# in the command below to values that match your Hadoop configuration.
# NOTE: The hostnames must be resolvable by the Oozie daemon. The ports are the RPC ports not the HTTP ports.
# For example {NameNode host:port} might be sandbox:8020 and {JobTracker host:port} sandbox:50300
# The source workflow-configuration.xml file can be found in {GATEWAY_HOME}/templates
# Alternatively, this file can be copied and edited manually for environments without the sed utility.
sed -e s/REPLACE.NAMENODE.RPCHOSTPORT/{NameNode host:port}/ \
-e s/REPLACE.JOBTRACKER.RPCHOSTPORT/{JobTracker host:port}/ \
<templates/workflow-configuration.xml >workflow-configuration.xml
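If sed is not available but you have Groovy installed (see the Assumptions
section), the same substitution can be scripted. A small sketch using the
example Sandbox values from the comment above; substitute your own host:port
values:
// Hypothetical Groovy alternative to the sed command above.
text = new File( "templates/workflow-configuration.xml" ).text
text = text.replace( "REPLACE.NAMENODE.RPCHOSTPORT", "sandbox:8020" )
text = text.replace( "REPLACE.JOBTRACKER.RPCHOSTPORT", "sandbox:50300" )
new File( "workflow-configuration.xml" ).text = text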
# 8. Submit the job via Oozie
# Take note of the Job ID in the JSON response as this will be used in the next step.
curl -i -k -u mapred:mapred-password -T workflow-configuration.xml -H Content-Type:application/xml -X POST \
'https://localhost:8443/gateway/sample/oozie/api/v1/jobs?action=start'
# 9. Query the job status via Oozie.
curl -i -k -u mapred:mapred-password -X GET \
'https://localhost:8443/gateway/sample/oozie/api/v1/job/{Job ID returned in JSON body from previous step}'
# 10. List the contents of the output directory /tmp/test/output
curl -i -k -u mapred:mapred-password -X GET \
'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/output?op=LISTSTATUS'
# 11. Optionally cleanup the test directory
curl -i -k -u mapred:mapred-password -X DELETE \
'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test?op=DELETE&recursive=true'
------------------------------------------------------------------------------
Disclaimer
------------------------------------------------------------------------------
The Apache Knox Gateway is an effort undergoing incubation at the
Apache Software Foundation (ASF), sponsored by the Apache Incubator PMC.
Incubation is required of all newly accepted projects until a further review
indicates that the infrastructure, communications, and decision making process
have stabilized in a manner consistent with other successful ASF projects.
While incubation status is not necessarily a reflection of the completeness
or stability of the code, it does indicate that the project has yet to be
fully endorsed by the ASF.