WebHDFS

REST API access to HDFS in a Hadoop cluster is provided by WebHDFS. The WebHDFS REST API documentation is available online. WebHDFS must be enabled in the hdfs-site.xml configuration file. In the Sandbox this configuration file is located at /etc/hadoop/conf/hdfs-site.xml. Note the properties shown below, as they relate to configuration required by the gateway. Some of these represent default values and may not actually be present in hdfs-site.xml.

<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.namenode.rpc-address</name>
    <value>sandbox.hortonworks.com:8020</value>
</property>
<property>
    <name>dfs.namenode.http-address</name>
    <value>sandbox.hortonworks.com:50070</value>
</property>
<property>
    <name>dfs.namenode.https-address</name>
    <value>sandbox.hortonworks.com:50470</value>
</property>

The values above need to be reflected in each topology descriptor file deployed to the gateway. By default the gateway includes a sample topology descriptor file, {GATEWAY_HOME}/deployments/sandbox.xml. The values in this sample are configured to work with an installed Sandbox VM.

<service>
    <role>NAMENODE</role>
    <url>hdfs://localhost:8020</url>
</service>
<service>
    <role>WEBHDFS</role>
    <url>http://localhost:50070/webhdfs</url>
</service>

The URL provided for the role NAMENODE does not result in an endpoint being exposed by the gateway. This information is only required so that other URLs that reference the Name Node's RPC address can be rewritten. This prevents clients from needing to be aware of the internal cluster details.

By default the gateway is configured to use the HTTP endpoint for WebHDFS in the Sandbox. This could alternatively be configured to use the HTTPS endpoint by providing the correct address.
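
For example, assuming the NameNode HTTPS address shown earlier (50470 is the Sandbox default; adjust the host and port for your cluster), the WEBHDFS service entry in the topology descriptor might look like this:

<service>
    <role>WEBHDFS</role>
    <url>https://localhost:50470/webhdfs</url>
</service>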

WebHDFS URL Mapping

For Name Node URLs, the mapping of Knox Gateway accessible WebHDFS URLs to direct WebHDFS URLs is simple.

| ------- | ---------------------------------------------------------------------------- |
| Gateway | https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/webhdfs   |
| Cluster | http://{webhdfs-host}:50070/webhdfs                                           |
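
For example, using the Sandbox topology descriptor above and the default gateway address used in the examples below (https://localhost:8443), the mapping would be:

| ------- | ---------------------------------------------- |
| Gateway | https://localhost:8443/gateway/sandbox/webhdfs |
| Cluster | http://localhost:50070/webhdfs                 |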

However, there is a subtle difference in the URLs that WebHDFS returns in the Location header of many requests. Direct WebHDFS requests may return Location headers that contain the address of a particular Data Node. The gateway will rewrite these URLs to ensure that subsequent requests come back through the gateway and that internal cluster details are protected.

A WebHDFS request to the Name Node to retrieve a file will return a URL of the form below in the Location header.

http://{datanode-host}:{datanode-port}/webhdfs/v1/{path}?...

Note that this URL contains the network location of a Data Node. The gateway will rewrite this URL to look like the URL below.

https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/webhdfs/data/v1/{path}?_={encrypted-query-parameters}

The {encrypted-query-parameters} will contain the {datanode-host} and {datanode-port} information. This information, along with the original query parameters, is encrypted so that the internal Hadoop details are protected.
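
To see this in practice, request a file through the gateway with cURL (as in the examples below) and inspect the redirect. The Location value shown here illustrates the rewritten form; it is not literal output:

curl -i -k -u guest:guest-password -X GET \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example/README?op=OPEN'

Location: https://localhost:8443/gateway/sandbox/webhdfs/data/v1/user/guest/example/README?_={encrypted-query-parameters}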

WebHDFS Examples

The examples below upload a file, download the file and list the contents of the directory.

WebHDFS via client DSL

You can use the Groovy example scripts and interpreter provided with the distribution.

java -jar bin/shell.jar samples/ExampleWebHdfsPutGet.groovy
java -jar bin/shell.jar samples/ExampleWebHdfsLs.groovy

You can manually type the client DSL script into the KnoxShell interactive Groovy interpreter provided with the distribution. The command below starts the KnoxShell in interactive mode.

java -jar bin/shell.jar

Each line below could be typed or copied into the interactive shell and executed. This is provided as an example to illustrate the use of the client DSL.

// Import the client DSL and a useful utility for working with JSON.
import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import groovy.json.JsonSlurper

// Setup some basic config.
gateway = "https://localhost:8443/gateway/sandbox"
username = "guest"
password = "guest-password"

// Start the session.
session = Hadoop.login( gateway, username, password )

// Cleanup anything leftover from a previous run.
Hdfs.rm( session ).file( "/user/guest/example" ).recursive().now()

// Upload the README to HDFS.
Hdfs.put( session ).file( "README" ).to( "/user/guest/example/README" ).now()

// Download the README from HDFS.
text = Hdfs.get( session ).from( "/user/guest/example/README" ).now().string
println text

// List the contents of the directory.
text = Hdfs.ls( session ).dir( "/user/guest/example" ).now().string
json = (new JsonSlurper()).parseText( text )
println json.FileStatuses.FileStatus.pathSuffix

// Cleanup the directory.
Hdfs.rm( session ).file( "/user/guest/example" ).recursive().now()

// Close the session.
session.shutdown()

WebHDFS via cURL

You can use cURL to directly invoke the REST APIs via the gateway.

Optionally clean up the sample directory in case a previous example was run without cleaning up.
curl -i -k -u guest:guest-password -X DELETE \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example?op=DELETE&recursive=true'
Register the name for a sample file README in /user/guest/example. The Location header in the response provides the URL for the upload in the next step.
curl -i -k -u guest:guest-password -X PUT \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example/README?op=CREATE'
Upload README to /user/guest/example. Use the README in {GATEWAY_HOME}.
curl -i -k -u guest:guest-password -T README -X PUT \
    '{Value of Location header from command above}'
List the contents of the directory /user/guest/example.
curl -i -k -u guest:guest-password -X GET \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example?op=LISTSTATUS'
Request the content of the README file in /user/guest/example.
curl -i -k -u guest:guest-password -X GET \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example/README?op=OPEN'
Read the content of the file.
curl -i -k -u guest:guest-password -X GET \
    '{Value of Location header from command above}'
Optionally clean up the example directory.
curl -i -k -u guest:guest-password -X DELETE \
    'https://localhost:8443/gateway/sandbox/webhdfs/v1/user/guest/example?op=DELETE&recursive=true'

WebHDFS client DSL
get() - Get a file from HDFS (OPEN).
  • Request
    • from( String name ) - The full name of the file in HDFS.
    • file( String name ) - The name of a local file to create with the content. If this isn't specified the file content must be read from the response.
  • Response
    • BasicResponse
    • If the file parameter is specified the content will be streamed to that file.
  • Example
    • Hdfs.get( session ).from( "/user/guest/example/README" ).now().string
ls() - Query the contents of a directory (LISTSTATUS)
  • Request
    • dir( String name ) - The full name of the directory in HDFS.
  • Response
    • BasicResponse
  • Example
    • Hdfs.ls( session ).dir( "/user/guest/example" ).now().string
mkdir() - Create a directory in HDFS (MKDIRS)
  • Request
    • dir( String name ) - The full name of the directory to create in HDFS.
    • perm( String perm ) - The permissions for the directory (e.g. 644). Optional: default="777"
  • Response
    • EmptyResponse - Implicit close().
  • Example
    • Hdfs.mkdir( session ).dir( "/user/guest/example" ).now()
put() - Write a file into HDFS (CREATE)
  • Request
    • text( String text ) - Text to upload to HDFS. Takes precedence over file if both are present.
    • file( String name ) - The name of a local file to upload to HDFS.
    • to( String name ) - The fully qualified name to create in HDFS.
  • Response
    • EmptyResponse - Implicit close().
  • Example
    • Hdfs.put( session ).file( "README" ).to( "/user/guest/example/README" ).now()
rm() - Delete a file or directory (DELETE)
  • Request
    • file( String name ) - The fully qualified file or directory name in HDFS.
    • recursive( Boolean recursive ) - Delete directory and all of its contents if True. Optional: default=False
  • Response
    • BasicResponse - Implicit close().
  • Example
    • Hdfs.rm( session ).file( "/user/guest/example" ).recursive().now()
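
As a quick end-to-end illustration of the request parameters described above, the sketch below combines mkdir(), put() with text() and get() with file(). It assumes the same Sandbox gateway and guest credentials used in the earlier example; the file names are hypothetical.

import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.hadoop.gateway.shell.hdfs.Hdfs

// Open a session against the Sandbox topology.
session = Hadoop.login( "https://localhost:8443/gateway/sandbox", "guest", "guest-password" )

// Create the target directory (MKDIRS).
Hdfs.mkdir( session ).dir( "/user/guest/example" ).now()

// Upload inline text instead of a local file (CREATE).
Hdfs.put( session ).text( "Hello WebHDFS" ).to( "/user/guest/example/hello.txt" ).now()

// Download straight to a local file instead of reading the response body (OPEN).
Hdfs.get( session ).from( "/user/guest/example/hello.txt" ).file( "hello.txt" ).now()

// Remove the directory and all of its contents (DELETE).
Hdfs.rm( session ).file( "/user/guest/example" ).recursive().now()

// Close the session.
session.shutdown()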