| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| https://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <!--- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| https://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| ---> |
| |
| ### WebHCat ### |
| |
WebHCat (also called _Templeton_) is a service related to, but separate from, HiveServer2.
As such, it is installed and configured independently.
The [WebHCat wiki pages](https://cwiki.apache.org/confluence/display/Hive/WebHCat) describe this process.
In the Sandbox, the configuration file for WebHCat is located at `/etc/hadoop/hcatalog/webhcat-site.xml`.
Note the property shown below, as it relates to configuration required by the gateway.
| |
| <property> |
| <name>templeton.port</name> |
| <value>50111</value> |
| </property> |
| |
| Also important is the configuration of the JOBTRACKER RPC endpoint. |
| For Hadoop 2 this can be found in the `yarn-site.xml` file. |
In the Sandbox, this file can be found at `/etc/hadoop/conf/yarn-site.xml`.
| The property `yarn.resourcemanager.address` within that file is relevant for the gateway's configuration. |
| |
| <property> |
| <name>yarn.resourcemanager.address</name> |
| <value>sandbox.hortonworks.com:8050</value> |
| </property> |
| |
| See #[WebHDFS] for details about locating the Hadoop configuration for the NAMENODE endpoint. |
| |
| The gateway by default includes a sample topology descriptor file `{GATEWAY_HOME}/deployments/sandbox.xml`. |
| The values in this sample are configured to work with an installed Sandbox VM. |
| |
| <service> |
| <role>NAMENODE</role> |
| <url>hdfs://localhost:8020</url> |
| </service> |
| <service> |
| <role>JOBTRACKER</role> |
| <url>rpc://localhost:8050</url> |
| </service> |
| <service> |
| <role>WEBHCAT</role> |
| <url>http://localhost:50111/templeton</url> |
| </service> |
| |
The URLs provided for the roles NAMENODE and JOBTRACKER do not result in endpoints being exposed by the gateway.
This information is only required so that other URLs referencing these Hadoop RPC addresses can be rewritten.
| This prevents clients from needing to be aware of the internal cluster details. |
| Note that for Hadoop 2 the JOBTRACKER RPC endpoint is provided by the Resource Manager component. |
| |
| By default the gateway is configured to use the HTTP endpoint for WebHCat in the Sandbox. |
| This could alternatively be configured to use the HTTPS endpoint by providing the correct address. |
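
For example, assuming WebHCat were listening for HTTPS on the same port (a hypothetical address; substitute your cluster's actual TLS endpoint), the service entry would simply use an `https` URL:

    <service>
        <role>WEBHCAT</role>
        <url>https://localhost:50111/templeton</url>
    </service>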
| |
| #### WebHCat URL Mapping #### |
| |
| For WebHCat URLs, the mapping of Knox Gateway accessible URLs to direct WebHCat URLs is simple. |
| |
| | ------- | ------------------------------------------------------------------------------- | |
| | Gateway | `https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/templeton` | |
| Cluster | `http://{webhcat-host}:{webhcat-port}/templeton` |
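
For example, with the Sandbox topology shown above, the gateway URL `https://localhost:8443/gateway/sandbox/templeton` maps to the direct WebHCat URL `http://localhost:50111/templeton`.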
| |
| |
#### WebHCat via cURL ####
| |
Users can use cURL to directly invoke the REST APIs via the gateway. For the full list of available REST calls, see the WebHCat documentation. The following simple cURL command can be used to test the connection:
| |
| curl -i -k -u guest:guest-password 'https://localhost:8443/gateway/sandbox/templeton/v1/status' |
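
A healthy WebHCat instance should answer with an HTTP 200 response and a small JSON status body, along the lines of:

    HTTP/1.1 200 OK
    Content-Type: application/json

    {"status":"ok","version":"v1"}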
| |
| |
| #### WebHCat Example #### |
| |
| This example will submit the familiar WordCount Java MapReduce job to the Hadoop cluster via the gateway using the KnoxShell DSL. |
| There are several ways to do this depending upon your preference. |
| |
| You can use the "embedded" Groovy interpreter provided with the distribution. |
| |
| java -jar bin/shell.jar samples/ExampleWebHCatJob.groovy |
| |
You can manually type the KnoxShell DSL script into the "embedded" Groovy interpreter provided with the distribution.
| |
| java -jar bin/shell.jar |
| |
| Each line from the file `samples/ExampleWebHCatJob.groovy` would then need to be typed or copied into the interactive shell. |
| |
| #### WebHCat Client DSL #### |
| |
| ##### submitJava() - Submit a Java MapReduce job. |
| |
| * Request |
| * jar (String) - The remote file name of the JAR containing the app to execute. |
* app (String) - The app name to execute. For example, this is _wordcount_, not the class name.
| * input (String) - The remote directory name to use as input for the job. |
| * output (String) - The remote directory name to store output from the job. |
| * Response |
| * jobId : String - The job ID of the submitted job. Consumes body. |
| * Example |
| |
| |
| Job.submitJava(session) |
| .jar(remoteJarName) |
| .app(appName) |
| .input(remoteInputDir) |
| .output(remoteOutputDir) |
| .now() |
| .jobId |
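
For context, below is a minimal end-to-end sketch in the spirit of `samples/ExampleWebHCatJob.groovy`. It assumes the JAR and input directory have already been staged in HDFS (the remote paths shown are hypothetical), and that the shell classes are in the `org.apache.knox.gateway.shell` package (older releases used `org.apache.hadoop.gateway.shell`).

    import org.apache.knox.gateway.shell.Hadoop
    import org.apache.knox.gateway.shell.job.Job

    // Gateway URL and credentials matching the Sandbox topology.
    gateway = "https://localhost:8443/gateway/sandbox"
    session = Hadoop.login( gateway, "guest", "guest-password" )

    // The remote HDFS paths below are placeholders; adjust to your cluster.
    jobId = Job.submitJava(session) \
        .jar( "/user/guest/example/hadoop-examples.jar" ) \
        .app( "wordcount" ) \
        .input( "/user/guest/example/input" ) \
        .output( "/user/guest/example/output" ) \
        .now().jobId

    println "Submitted job: " + jobId
    session.shutdown()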
| |
| ##### submitPig() - Submit a Pig job. |
| |
| * Request |
| * file (String) - The remote file name of the pig script. |
| * arg (String) - An argument to pass to the script. |
| * statusDir (String) - The remote directory to store status output. |
| * Response |
| * jobId : String - The job ID of the submitted job. Consumes body. |
| * Example |
| * `Job.submitPig(session).file(remotePigFileName).arg("-v").statusDir(remoteStatusDir).now()` |
| |
| ##### submitHive() - Submit a Hive job. |
| |
| * Request |
| * file (String) - The remote file name of the hive script. |
| * arg (String) - An argument to pass to the script. |
| * statusDir (String) - The remote directory to store status output. |
| * Response |
| * jobId : String - The job ID of the submitted job. Consumes body. |
| * Example |
| * `Job.submitHive(session).file(remoteHiveFileName).arg("-v").statusDir(remoteStatusDir).now()` |
| |
##### submitSqoop() - Submit a Sqoop job.
Using the Knox DSL, you can easily submit and monitor [Apache Sqoop](https://sqoop.apache.org) jobs. The WebHCat Job class supports the `submitSqoop` command.
| |
| Job.submitSqoop(session) |
| .command("import --connect jdbc:mysql://hostname:3306/dbname ... ") |
| .statusDir(remoteStatusDir) |
| .now().jobId |
| |
| The `submitSqoop` command supports the following arguments: |
| |
| * command (String) - The sqoop command string to execute. |
* files (String) - Comma-separated list of files to be copied to the templeton controller job.
* optionsfile (String) - The remote file containing the Sqoop command to run.
* libdir (String) - The remote directory containing the JDBC JAR to include with the Sqoop lib.
| * statusDir (String) - The remote directory to store status output. |
| |
| A complete example is available here: https://cwiki.apache.org/confluence/display/KNOX/2016/11/08/Running+SQOOP+job+via+KNOX+Shell+DSL |
| |
| |
| ##### queryQueue() - Return a list of all job IDs registered to the user. |
| |
| * Request |
| * No request parameters. |
| * Response |
| * BasicResponse |
| * Example |
| * `Job.queryQueue(session).now().string` |
| |
| ##### queryStatus() - Check the status of a job and get related job information given its job ID. |
| |
| * Request |
| * jobId (String) - The job ID to check. This is the ID received when the job was created. |
| * Response |
| * BasicResponse |
| * Example |
| * `Job.queryStatus(session).jobId(jobId).now().string` |
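
For example, a client can poll `queryStatus()` until WebHCat reports the job complete. The sketch below follows the pattern used in `samples/ExampleWebHCatJob.groovy` and assumes `session` and `jobId` from a prior submission; it parses the JSON response with the JsonPath library bundled with the shell distribution (an assumption; add the dependency if it is not on the classpath).

    import com.jayway.jsonpath.JsonPath

    // Poll the job status roughly once per second for up to a minute.
    done = false
    count = 0
    while( !done && count++ < 60 ) {
        sleep( 1000 )
        json = Job.queryStatus(session).jobId(jobId).now().string
        done = JsonPath.read( json, "\$.status.jobComplete" )
    }
    println done ? "Job complete." : "Timed out waiting for job completion."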
| |
| ### WebHCat HA ### |
| |
Please see #[Default Service HA support].