WebHCat (also called Templeton) is a related but separate service from HiveServer2, and as such it is installed and configured independently. The WebHCat wiki pages describe this process. In the Sandbox, the WebHCat configuration file is located at `/etc/hadoop/hcatalog/webhcat-site.xml`. Note the property shown below, as it relates to configuration required by the gateway.
    <property>
        <name>templeton.port</name>
        <value>50111</value>
    </property>
Also important is the configuration of the JOBTRACKER RPC endpoint. For Hadoop 2 this can be found in the `yarn-site.xml` file, which in the Sandbox is located at `/etc/hadoop/conf/yarn-site.xml`. The property `yarn.resourcemanager.address` within that file is relevant for the gateway's configuration.
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>sandbox.hortonworks.com:8050</value>
    </property>
See #[WebHDFS] for details about locating the Hadoop configuration for the NAMENODE endpoint.
The gateway by default includes a sample topology descriptor file, `{GATEWAY_HOME}/deployments/sandbox.xml`. The values in this sample are configured to work with an installed Sandbox VM.
    <service>
        <role>NAMENODE</role>
        <url>hdfs://localhost:8020</url>
    </service>
    <service>
        <role>JOBTRACKER</role>
        <url>rpc://localhost:8050</url>
    </service>
    <service>
        <role>WEBHCAT</role>
        <url>http://localhost:50111/templeton</url>
    </service>
The URLs provided for the NAMENODE and JOBTRACKER roles do not result in endpoints being exposed by the gateway. This information is only required so that other URLs referencing the cluster's internal RPC addresses can be rewritten, which prevents clients from needing to be aware of internal cluster details. Note that for Hadoop 2 the JOBTRACKER RPC endpoint is provided by the ResourceManager component.
By default the gateway is configured to use the HTTP endpoint for WebHCat in the Sandbox. It can alternatively be configured to use the HTTPS endpoint by providing the correct address.
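For example, a hypothetical topology entry for a TLS-enabled WebHCat endpoint might look like the following (the host and port shown are assumptions and must match your WebHCat SSL configuration):

    <service>
        <role>WEBHCAT</role>
        <url>https://localhost:50111/templeton</url>
    </service>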
For WebHCat URLs, the mapping of Knox Gateway accessible URLs to direct WebHCat URLs is simple.
| ------- | ------------------------------------------------------------------------------- |
| Gateway | `https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/templeton` |
| Cluster | `http://{webhcat-host}:{webhcat-port}/templeton` |
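For example, with the Sandbox topology shown above, the gateway URL `https://localhost:8443/gateway/sandbox/templeton/v1/status` maps to `http://localhost:50111/templeton/v1/status` on the cluster.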
Users can use cURL to invoke the REST APIs directly via the gateway. For the full list of available REST calls, see the WebHCat documentation. Here is a simple cURL command to test the connection:
    curl -i -k -u guest:guest-password 'https://localhost:8443/gateway/sandbox/templeton/v1/status'
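A healthy deployment should return an HTTP 200 response with a small JSON status body similar to the following:

    {"status":"ok","version":"v1"}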
This example will submit the familiar WordCount Java MapReduce job to the Hadoop cluster via the gateway using the KnoxShell DSL. There are several ways to do this depending upon your preference.
You can use the “embedded” Groovy interpreter provided with the distribution.
    java -jar bin/shell.jar samples/ExampleWebHCatJob.groovy
You can manually type in the KnoxShell DSL script into the “embedded” Groovy interpreter provided with the distribution.
    java -jar bin/shell.jar

Each line from the file `samples/ExampleWebHCatJob.groovy` would then need to be typed or copied into the interactive shell.
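If you are typing the script manually, the session setup at the top of that sample looks roughly like the sketch below. The import package prefix shown (`org.apache.hadoop.gateway`) matches older Knox releases; later releases use `org.apache.knox.gateway`, so check the sample scripts in your distribution:

    import org.apache.hadoop.gateway.shell.Hadoop
    import org.apache.hadoop.gateway.shell.job.Job

    gateway = "https://localhost:8443/gateway/sandbox"
    username = "guest"
    password = "guest-password"

    // Open an authenticated session against the gateway.
    session = Hadoop.login( gateway, username, password )

    // ... submit and monitor jobs here ...

    // Close the session when finished.
    session.shutdown()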
Example invocations of the WebHCat `Job` DSL are shown below; each builds a request and submits it with `now()`.

Submit a Java MapReduce job:

    Job.submitJava(session)
       .jar(remoteJarName)
       .app(appName)
       .input(remoteInputDir)
       .output(remoteOutputDir)
       .now()
       .jobId

Submit a Pig job:

    Job.submitPig(session).file(remotePigFileName).arg("-v").statusDir(remoteStatusDir).now()

Submit a Hive job:

    Job.submitHive(session).file(remoteHiveFileName).arg("-v").statusDir(remoteStatusDir).now()
Using the Knox DSL, you can also easily submit and monitor Apache Sqoop jobs. The WebHCat `Job` class supports the `submitSqoop` command.

    Job.submitSqoop(session)
       .command("import --connect jdbc:mysql://hostname:3306/dbname ... ")
       .statusDir(remoteStatusDir)
       .now()
       .jobId
The `submitSqoop` command supports the following arguments:

* command (String) - The Sqoop command string to execute.
* files (String) - Comma separated files to be copied to the map reduce cluster.
* optionsfile (String) - The remote file which contains the Sqoop command to run.
* libdir (String) - The remote directory containing the JDBC jar to include with the Sqoop lib.
* statusDir (String) - The remote directory to store status output.
A complete example is available here: https://cwiki.apache.org/confluence/display/KNOX/2016/11/08/Running+SQOOP+job+via+KNOX+Shell+DSL
Return the list of job IDs registered to the user:

    Job.queryQueue(session).now().string

Check the status of a submitted job:

    Job.queryStatus(session).jobId(jobId).now().string
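Because job submission is asynchronous, a script typically polls `queryStatus` until the job completes. Below is a minimal sketch following the pattern used in `samples/ExampleWebHCatJob.groovy` (it assumes `jobId` was captured from a prior submission and that the bundled JsonPath library is available):

    import com.jayway.jsonpath.JsonPath

    // Poll the job status until WebHCat reports completion, for at most 60 seconds.
    done = false
    count = 0
    while( !done && count++ < 60 ) {
        sleep( 1000 )
        json = Job.queryStatus(session).jobId(jobId).now().string
        done = JsonPath.read( json, "\$.status.jobComplete" )
    }
    println "Job complete: " + done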
For information on configuring WebHCat for high availability, please look at #[Default Service HA support].