WebHCat is a related but separate service from Hive. As such it is installed and configured independently. The WebHCat wiki pages describe this processes. In sandbox this configuration file for WebHCat is located at /etc/hadoop/hcatalog/webhcat-site.xml. Note the properties shown below as they are related to configuration required by the gateway.
<property> <name>templeton.port</name> <value>50111</value> </property>
Also important is the configuration of the JOBTRACKER RPC endpoint. For Hadoop 2 this can be found in the yarn-site.xml file. In Sandbox this file can be found at /etc/hadoop/conf/yarn-site.xml. The property yarn.resourcemanager.address within that file is relevant for the gateway's configuration.
<property> <name>yarn.resourcemanager.address</name> <value>sandbox.hortonworks.com:8050</value> </property>
See #[WebHDFS] for details about locating the Haddop configuration for the NAMENODE endpoint.
The gateway by default includes a sample topology descriptor file {GATEWAY_HOME}/deployments/sandbox.xml
. The values in this sample are configured to work with an installed Sandbox VM.
<service> <role>NAMENODE</role> <url>hdfs://localhost:8020</url> </service> <service> <role>JOBTRACKER</role> <url>rpc://localhost:8050</url> </service> <service> <role>WEBHCAT</role> <url>http://localhost:50111/templeton</url> </service>
The URLs provided for the role NAMENODE and JOBTRACKER do not result in an endpoint being exposed by the gateway. This information is only required so that other URLs can be rewritten that reference the appropriate RPC address for Hadoop services. This prevents clients from needed to be aware of the internal cluster details. Note that for Hadoop 2 the JOBTRACKER RPC endpoint is provided by the Resource Manager component.
By default the gateway is configured to use the HTTP endpoint for WebHCat in the Sandbox. This could alternatively be configured to use the HTTPS endpoint by provided the correct address.
For WebHCat URLs, the mapping of Knox Gateway accessible URLs to direct WebHCat URLs is simple.
| ------- | ------------------------------------------------------------------------------- | | Gateway | https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/templeton
| | Cluster | http://{webhcat-host}:{webhcat-port}/templeton}
|
This example will submit the familiar WordCount Java MapReduce job to the Hadoop cluster via the gateway using the KnoxShell DSL. There are several ways to do this depending upon your preference.
You can use the “embedded” Groovy interpreter provided with the distribution.
java -jar bin/shell.jar samples/ExampleWebHCatJob.groovy
You can manually type in the KnoxShell DSL script into the “embedded” Groovy interpreter provided with the distribution.
java -jar bin/shell.jar
Each line from the file samples/ExampleWebHCatJob.groovy
would then need to be typed or copied into the interactive shell.
Request
Response
Example
Job.submitJava(session) .jar(remoteJarName) .app(appName) .input(remoteInputDir) .output(remoteOutputDir) .now() .jobId
Job.submitPig(session).file(remotePigFileName).arg("-v").statusDir(remoteStatusDir).now()
Job.submitHive(session).file(remoteHiveFileName).arg("-v").statusDir(remoteStatusDir).now()
Job.queryQueue(session).now().string
Job.queryStatus(session).jobId(jobId).now().string