<!---
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements. See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License. You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
--->

### WebHCat ###

WebHCat is a related but separate service from Hive.
As such it is installed and configured independently.
The [WebHCat wiki pages](https://cwiki.apache.org/confluence/display/Hive/WebHCat) describe this process.
In the Sandbox, the configuration file for WebHCat is located at `/etc/hadoop/hcatalog/webhcat-site.xml`.
Note the properties shown below, as they relate to configuration required by the gateway.

    <property>
        <name>templeton.port</name>
        <value>50111</value>
    </property>

Also important is the configuration of the JOBTRACKER RPC endpoint.
For Hadoop 2 this can be found in the `yarn-site.xml` file.
In the Sandbox this file can be found at `/etc/hadoop/conf/yarn-site.xml`.
The property `yarn.resourcemanager.address` within that file is relevant for the gateway's configuration.

    <property>
        <name>yarn.resourcemanager.address</name>
        <value>sandbox.hortonworks.com:8050</value>
    </property>

See #[WebHDFS] for details about locating the Hadoop configuration for the NAMENODE endpoint.

The gateway by default includes a sample topology descriptor file `{GATEWAY_HOME}/deployments/sandbox.xml`.
The values in this sample are configured to work with an installed Sandbox VM.

    <service>
        <role>NAMENODE</role>
        <url>hdfs://localhost:8020</url>
    </service>
    <service>
        <role>JOBTRACKER</role>
        <url>rpc://localhost:8050</url>
    </service>
    <service>
        <role>WEBHCAT</role>
        <url>http://localhost:50111/templeton</url>
    </service>

The URLs provided for the roles NAMENODE and JOBTRACKER do not result in endpoints being exposed by the gateway.
This information is only required so that the gateway can rewrite URLs that reference the RPC addresses of these Hadoop services.
This prevents clients from needing to be aware of the internal cluster details.
Note that for Hadoop 2 the JOBTRACKER RPC endpoint is provided by the Resource Manager component.

By default the gateway is configured to use the HTTP endpoint for WebHCat in the Sandbox.
This could alternatively be configured to use the HTTPS endpoint by providing the correct address.

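A minimal sketch of such an alternative service entry is shown below; the `{webhcat-host}` and `{webhcat-https-port}` values are illustrative placeholders, not defaults, and must match the actual HTTPS address of your WebHCat service:

    <service>
        <role>WEBHCAT</role>
        <url>https://{webhcat-host}:{webhcat-https-port}/templeton</url>
    </service>
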
#### WebHCat URL Mapping ####

For WebHCat URLs, the mapping of Knox Gateway accessible URLs to direct WebHCat URLs is simple.

| ------- | ------------------------------------------------------------------------------- |
| Gateway | `https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/templeton` |
| Cluster | `http://{webhcat-host}:{webhcat-port}/templeton` |

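For example, assuming the default gateway address and path (`localhost:8443`, `gateway`) and the `sandbox` cluster topology described above, a WebHCat status check would map as follows:

    Gateway: https://localhost:8443/gateway/sandbox/templeton/v1/status
    Cluster: http://localhost:50111/templeton/v1/status
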
#### WebHCat Example ####

This example will submit the familiar WordCount Java MapReduce job to the Hadoop cluster via the gateway using the KnoxShell DSL.
There are several ways to do this depending upon your preference.

You can use the "embedded" Groovy interpreter provided with the distribution.

    java -jar bin/shell.jar samples/ExampleWebHCatJob.groovy

You can manually type the KnoxShell DSL script into the "embedded" Groovy interpreter provided with the distribution.

    java -jar bin/shell.jar

Each line from the file `samples/ExampleWebHCatJob.groovy` would then need to be typed or copied into the interactive shell.

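The following is a condensed sketch of what such a script might look like, assuming the `org.apache.hadoop.gateway.shell` client classes and illustrative credentials and HDFS paths; see the actual `samples/ExampleWebHCatJob.groovy` for the authoritative version.

    import org.apache.hadoop.gateway.shell.Hadoop
    import org.apache.hadoop.gateway.shell.hdfs.Hdfs
    import org.apache.hadoop.gateway.shell.job.Job

    gateway = "https://localhost:8443/gateway/sandbox"

    // Open an authenticated session with the gateway.
    session = Hadoop.login( gateway, "guest", "guest-password" )

    // Stage the job JAR and an input file in HDFS via the gateway.
    Hdfs.put( session ).file( "samples/hadoop-examples.jar" ).to( "/user/guest/example/hadoop-examples.jar" ).now()
    Hdfs.put( session ).file( "README" ).to( "/user/guest/example/input/README" ).now()

    // Submit the WordCount job and capture its job ID.
    jobId = Job.submitJava( session ).
        jar( "/user/guest/example/hadoop-examples.jar" ).
        app( "wordcount" ).
        input( "/user/guest/example/input" ).
        output( "/user/guest/example/output" ).
        now().jobId

    println "Submitted job: " + jobId
    session.shutdown()
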
#### WebHCat Client DSL ####

##### submitJava() - Submit a Java MapReduce job.

* Request
    * jar (String) - The remote file name of the JAR containing the app to execute.
    * app (String) - The app name to execute. This is `wordcount` for example, not the class name.
    * input (String) - The remote directory name to use as input for the job.
    * output (String) - The remote directory name to store output from the job.
* Response
    * jobId : String - The job ID of the submitted job. Consumes body.
* Example

        Job.submitJava(session)
            .jar(remoteJarName)
            .app(appName)
            .input(remoteInputDir)
            .output(remoteOutputDir)
            .now()
            .jobId

##### submitPig() - Submit a Pig job.

* Request
    * file (String) - The remote file name of the Pig script.
    * arg (String) - An argument to pass to the script.
    * statusDir (String) - The remote directory to store status output.
* Response
    * jobId : String - The job ID of the submitted job. Consumes body.
* Example
    * `Job.submitPig(session).file(remotePigFileName).arg("-v").statusDir(remoteStatusDir).now()`

##### submitHive() - Submit a Hive job.

* Request
    * file (String) - The remote file name of the Hive script.
    * arg (String) - An argument to pass to the script.
    * statusDir (String) - The remote directory to store status output.
* Response
    * jobId : String - The job ID of the submitted job. Consumes body.
* Example
    * `Job.submitHive(session).file(remoteHiveFileName).arg("-v").statusDir(remoteStatusDir).now()`

##### queryQueue() - Return a list of all job IDs registered to the user.

* Request
    * No request parameters.
* Response
    * BasicResponse
* Example
    * `Job.queryQueue(session).now().string`

##### queryStatus() - Check the status of a job and get related job information given its job ID.

* Request
    * jobId (String) - The job ID to check. This is the ID received when the job was created.
* Response
    * BasicResponse
* Example
    * `Job.queryStatus(session).jobId(jobId).now().string`
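
Since job submission is asynchronous, a common pattern is to poll queryStatus() until the job reports completion.
The following is a minimal sketch of such a loop, assuming a `session` and `jobId` obtained as in the earlier examples and a status response body containing WebHCat's `status.jobComplete` JSON field; the one second polling interval is arbitrary.

    import groovy.json.JsonSlurper
    import org.apache.hadoop.gateway.shell.job.Job

    // Poll the job status until WebHCat reports the job complete.
    slurper = new JsonSlurper()
    done = false
    while( !done ) {
        json = slurper.parseText( Job.queryStatus( session ).jobId( jobId ).now().string )
        done = json.status.jobComplete
        if( !done ) sleep( 1000 )
    }
    println "Job " + jobId + " is complete."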