| <!--- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| ------------------------------------------------------------------------------ |
| Apache Knox Gateway - Getting Started |
| ------------------------------------------------------------------------------ |
| This guide describes the steps required to install, deploy and validate the |
| Apache Knox Gateway. |
| |
| ------------------------------------------------------------------------------ |
| Requirements |
| ------------------------------------------------------------------------------ |
| The following prerequisites must be installed to successfully complete the |
| steps described in this guide. |
| |
| ${HHH} Java |
| Java 1.6 or later |
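
You can verify the installed version with:

    java -version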
| |
| ${HHH} Hadoop |
A local installation of a Hadoop cluster is required at this time. Hadoop
EC2 cluster and Sandbox installations are currently difficult to access
remotely via the Gateway because their Hadoop services run with internal
IP addresses. For the Gateway to work in these cases it must, at this
time, be deployed on the EC2 cluster or Sandbox itself.
| |
| The instructions that follow assume that the Gateway is *not* collocated |
| with the Hadoop clusters themselves and (most importantly) that the |
| hostnames and IP addresses of the cluster services are accessible by the |
Gateway wherever it happens to be running.
| |
Ensure that the Hadoop cluster has WebHDFS, WebHCat (i.e. Templeton) and
Oozie configured, deployed and running, as shown in the checks below.
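
As a quick sanity check that these services are reachable from the host
where the Gateway will run, you can hit each service's REST endpoint
directly (hosts are placeholders; 50070, 50111 and 11000 are the default
ports, as noted later in this document):

    curl -s 'http://{namenode-host}:50070/webhdfs/v1/?op=LISTSTATUS&user.name=hdfs'
    curl -s 'http://{templeton-host}:50111/templeton/v1/status'
    curl -s 'http://{oozie-host}:11000/oozie/v1/admin/status'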
| |
| This release of the Apache Knox Gateway has been tested against the |
| [Hortonworks Sandbox 1.2][hsb] with [these changes][sb]. |
| |
| [hsb]: http://hortonworks.com/products/hortonworks-sandbox/ |
| [sb]: sandbox.html |
| |
| ------------------------------------------------------------------------------ |
| Installation |
| ------------------------------------------------------------------------------ |
| ${HHH} 1. Extract the distribution ZIP |
| |
Download and extract the gateway-${gateway-version}.zip file into the
installation directory that will contain your `{GATEWAY_HOME}`.
| |
| jar xf gateway-${gateway-version}.zip |
| |
| This will create a directory `gateway-${gateway-version}` in your current |
| directory. |
| |
| ${HHH} 2. Enter the `{GATEWAY_HOME}` directory |
| |
| cd gateway-${gateway-version} |
| |
| The fully qualified name of this directory will be referenced as |
| `{GATEWAY_HOME}` throughout the remainder of this document. |
| |
| ${HHH} 3. Start the demo LDAP server (ApacheDS) |
| |
First, understand that the LDAP server provided here is for demonstration
purposes only. To customize which LDAP instance is used, configure the
LDAP specifics within the topology descriptor for the cluster as described
in step 5 below. The assumption is that most users will leverage the demo
LDAP server while evaluating this release and should therefore continue
with the instructions here in step 3.
| |
| Edit `{GATEWAY_HOME}/conf/users.ldif` if required and add your users and |
| groups to the file. A number of normal Hadoop users |
| (e.g. hdfs, mapred, hcat, hive) have already been included. Note that |
| the passwords in this file are "fictitious" and have nothing to do with |
| the actual accounts on the Hadoop cluster you are using. There is also |
| a copy of this file in the templates directory that you can use to start |
| over if necessary. |
| |
Start the LDAP server, pointing it at the conf directory where it will
find the users.ldif file.
| |
| java -jar bin/ldap-${gateway-version}.jar conf & |
| |
| There are a number of log messages of the form `Created null.` that can |
| safely be ignored. Take note of the port on which it was started as this |
needs to match later configuration. This will create a directory named
`org.apache.hadoop.gateway.security.EmbeddedApacheDirectoryServer` that
can safely be ignored.
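
To verify that the demo LDAP server is up, a simple port check can be
used (this assumes the default port of 33389 described in step 5):

    nc -z localhost 33389 && echo 'LDAP server is listening'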
| |
| ${HHH} 4. Start the Gateway server |
| |
| java -jar bin/gateway-server-${gateway-version}.jar |
| |
| Take note of the port identified in the logging output as you will need this |
| for accessing the gateway. |
| |
The server will prompt you for the master secret (password). This secret is
used to protect the security artifacts used by the gateway server for things
like SSL and credential/password aliasing. It will have to be entered at
startup unless you choose to persist it. Remember this secret and keep it
safe; it represents the keys to the kingdom. See the Persisting the Master
Secret section for more information.
| |
${HHH} 5. Configure the Gateway with the topology of your Hadoop cluster

Edit the file `{GATEWAY_HOME}/deployments/sample.xml`
| |
| Change the host and port in the urls of the `<service>` elements for |
| NAMENODE, TEMPLETON and OOZIE services to match your Hadoop cluster |
| deployment. |
| |
The default configuration contains the LDAP URL for an LDAP server. Out
of the box, the file is configured to access the demo ApacheDS based LDAP
server using its default configuration, in which the server listens on
port 33389. Optionally, you can change the LDAP URL for the LDAP server
to be used for authentication. This is set via the
main.ldapRealm.contextFactory.url property in the
`<gateway><provider><authentication>` section.
| |
Save the file. The Gateway server monitors the `{GATEWAY_HOME}/deployments`
directory and reacts to the discovery of a new or changed cluster topology
descriptor by provisioning the endpoints and required filter chains to
serve the needs of each cluster as described by the topology file. Note
that the name of the file, excluding the extension, is also used as the
path for that cluster in the URL. So, for example, the sample.xml file
will result in Gateway URLs of the form
`http://{gateway-host}:{gateway-port}/gateway/sample/namenode/api/v1`
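
For orientation, the relevant fragments of the descriptor look roughly
like the sketch below. The host and port values are placeholders, and the
provided sample.xml is the authoritative reference for the exact
structure:

    <service>
        <role>NAMENODE</role>
        <url>http://{namenode-host}:50070/webhdfs/v1</url>
    </service>

and, within the `<gateway><provider><authentication>` section:

    <param>
        <name>main.ldapRealm.contextFactory.url</name>
        <value>ldap://localhost:33389</value>
    </param>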
| |
${HHH} 6. Test the installation and configuration of your Gateway

Invoke the LISTSTATUS operation on HDFS represented by your configured
| NAMENODE by using your web browser or curl: |
| |
| curl -i -k -u hdfs:hdfs-password -X GET \ |
| 'https://localhost:8443/gateway/sample/namenode/api/v1/?op=LISTSTATUS' |
| |
The above command should produce output along the lines of the example
below. The exact information returned depends upon the content within
HDFS in your Hadoop cluster.
| |
| HTTP/1.1 200 OK |
| Content-Type: application/json |
| Content-Length: 760 |
| Server: Jetty(6.1.26) |
| |
| {"FileStatuses":{"FileStatus":[ |
| {"accessTime":0,"blockSize":0,"group":"hdfs","length":0,"modificationTime":1350595859762,"owner":"hdfs","pathSuffix":"apps","permission":"755","replication":0,"type":"DIRECTORY"}, |
| {"accessTime":0,"blockSize":0,"group":"mapred","length":0,"modificationTime":1350595874024,"owner":"mapred","pathSuffix":"mapred","permission":"755","replication":0,"type":"DIRECTORY"}, |
| {"accessTime":0,"blockSize":0,"group":"hdfs","length":0,"modificationTime":1350596040075,"owner":"hdfs","pathSuffix":"tmp","permission":"777","replication":0,"type":"DIRECTORY"}, |
| {"accessTime":0,"blockSize":0,"group":"hdfs","length":0,"modificationTime":1350595857178,"owner":"hdfs","pathSuffix":"user","permission":"755","replication":0,"type":"DIRECTORY"} |
| ]}} |
| |
| For additional information on WebHDFS, Templeton/WebHCat and Oozie |
| REST APIs, see the following URLs respectively: |
| |
| * WebHDFS - http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html |
| * Templeton/WebHCat - http://people.apache.org/~thejas/templeton_doc_v1/ |
| * Oozie - http://oozie.apache.org/docs/3.3.1/WebServicesAPI.html |
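
For example, the WebHCat status operation can be invoked through the
Gateway in the same fashion (a sketch, assuming the demo credentials and
the templeton path shown in the URL mapping section below):

    curl -i -k -u hdfs:hdfs-password \
      'https://localhost:8443/gateway/sample/templeton/api/v1/status'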
| |
| ------------------------------------------------------------------------------ |
| Examples |
| ------------------------------------------------------------------------------ |
| More examples can be found [here](/examples.html). |
| |
| ------------------------------------------------------------------------------ |
| Persisting the Master Secret |
| ------------------------------------------------------------------------------ |
| The master secret is required to start the server. This secret is used to |
| access secured artifacts by the gateway instance. Keystore, trust stores and |
| credential stores are all protected with the master secret. |
| |
You may persist the master secret by supplying the *-persist-master* switch at
startup. This will result in a warning indicating that persisting the secret
is less secure than providing it at startup. We do make some provisions to
protect the persisted password: it is encrypted with AES 128 bit encryption
and, where possible, the file permissions are set so that the file is only
accessible by the user that the gateway is running as.
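
For example, to start the server and persist the master secret (the jar
name follows step 4 of the installation instructions):

    java -jar bin/gateway-server-${gateway-version}.jar -persist-master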
| |
After persisting the secret, ensure that the file at config/security/master
has the appropriate permissions set for your environment. This is probably
the most important layer of defense for the master secret. Do not assume
that the encryption is sufficient protection.

A dedicated user should be created to run the gateway; this will help
protect a persisted master file.
| |
| ------------------------------------------------------------------------------ |
| Management of Security Artifacts |
| ------------------------------------------------------------------------------ |
| There are a number of artifacts that are used by the gateway in ensuring the |
| security of wire level communications, access to protected resources and the |
| encryption of sensitive data. These artifacts can be managed from outside of |
| the gateway instances or generated and populated by the gateway instance |
| itself. |
| |
The following is a description of how this is coordinated for both
standalone (development, demo, etc.) gateway instances and for instances
that are part of a cluster of gateways.
| |
| Upon start of the gateway server we: |
| |
| 1. Look for an identity store at conf/security/keystores/gateway.jks. The |
| identity store contains the certificate and private key used to represent the |
| identity of the server for SSL connections and signature creation. |
| |
| * If there is no identity store we create one and generate a self-signed |
| certificate for use in standalone/demo mode. The certificate is stored |
| with an alias of gateway-identity. |
* If an identity store is found then we ensure that it can be loaded
  using the provided master secret and that there is an alias called
  gateway-identity.
| |
| 2. Look for a credential store at |
| `conf/security/keystores/__gateway-credentials.jceks`. This credential |
| store is used to store secrets/passwords that are used by the gateway. |
| For instance, this is where the pass-phrase for accessing the |
| gateway-identity certificate is kept. |
| |
| * If there is no credential store found then we create one and populate it |
| with a generated pass-phrase for the alias `gateway-identity-passphrase`. |
| This is coordinated with the population of the self-signed cert into the |
| identity-store. |
| * If a credential store is found then we ensure that it can be loaded using |
| the provided master secret and that the expected aliases have been |
| populated with secrets. |
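
Once created, these stores can be inspected with the standard JDK keytool
(a sketch; the stores are protected with the master secret as described
above, so supply it when prompted for a password):

    keytool -list -keystore conf/security/keystores/gateway.jks
    keytool -list -storetype JCEKS \
        -keystore conf/security/keystores/__gateway-credentials.jceks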
| |
| Upon deployment of a Hadoop cluster topology within the gateway we: |
| |
| 1. Look for a credential store for the topology. For instance, we have a |
| sample topology that gets deployed out of the box. We look for |
| `conf/security/keystores/sample-credentials.jceks`. This topology specific |
| credential store is used for storing secrets/passwords that are used for |
| encrypting sensitive data with topology specific keys. |
| |
* If no credential store is found for the topology being deployed then
  one is created for it. Population of the aliases is delegated to the
  configured providers within the system that will require the use of a
  secret for a particular task. They may programmatically set the value
  of the secret or choose to have the value for the specified alias
  generated through the AliasService.
* If a credential store is found then we ensure that it can be loaded
  with the provided master secret, and the configured providers are given
  the opportunity to ensure that the aliases are populated and, if not,
  to populate them.
| |
By leveraging the algorithm described above, we provide a window of
opportunity for managing these artifacts in a number of ways.
| |
1. Using a single gateway instance as a master instance, the artifacts
   can be generated or placed into the expected location and then
   replicated across all of the slave instances before startup, as
   sketched below.
| 2. Using an NFS mount as a central location for the artifacts would provide |
| a single source of truth without the need to replicate them over the |
| network. Of course, NFS mounts have their own challenges. |
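
A minimal sketch of the first approach, assuming hypothetical slave
hostnames and that the master instance has already generated its
artifacts:

    # Replicate security artifacts from the master gateway to each slave
    # instance before the slaves are started. Hostnames are examples only;
    # {GATEWAY_HOME} is the placeholder used throughout this document.
    for host in gateway-2.example.com gateway-3.example.com; do
        scp -r conf/security "$host:{GATEWAY_HOME}/conf/"
    done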
| |
| Summary of Secrets to be Managed: |
| |
| 1. Master secret - the same for all gateway instances in a cluster of gateways |
| 2. All security related artifacts are protected with the master secret |
| 3. Secrets used by the gateway itself are stored within the gateway credential |
| store and are the same across all gateway instances in the cluster of |
| gateways |
4. Secrets used by providers within cluster topologies are stored in
   topology specific credential stores and are the same for the same
   topology across the cluster of gateway instances. However, they are
   specific to the topology, so secrets for one Hadoop cluster are
   different from those of another. This allows for fail-over from one
   gateway instance to another even when encryption is being used, while
   not allowing the compromise of one encryption key to expose the data
   for all clusters.
| |
| NOTE: the SSL certificate will need special consideration depending on the |
| type of certificate. Wildcard certs may be able to be shared across all |
| gateway instances in a cluster. When certs are dedicated to specific machines |
| the gateway identity store will not be able to be blindly replicated as |
| hostname verification problems will ensue. Obviously, trust-stores will need |
| to be taken into account as well. |
| |
| ------------------------------------------------------------------------------ |
| Mapping Gateway URLs to Hadoop cluster URLs |
| ------------------------------------------------------------------------------ |
| The Gateway functions much like a reverse proxy. As such it maintains a |
| mapping of URLs that are exposed externally by the Gateway to URLs that are |
| provided by the Hadoop cluster. Examples of mappings for the NameNode and |
Templeton are shown below. These mappings are generated from the combination
of the Gateway configuration file (i.e. `{GATEWAY_HOME}/gateway-site.xml`)
and the cluster topology descriptors
(e.g. `{GATEWAY_HOME}/deployments/{cluster-name}.xml`).
| |
| * HDFS (NameNode) |
| * Gateway: `http://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/namenode/api/v1` |
| * Cluster: `http://{namenode-host}:50070/webhdfs/v1` |
| * WebHCat (Templeton) |
| * Gateway: `http://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/templeton/api/v1` |
| * Cluster: `http://{templeton-host}:50111/templeton/v1` |
| * Oozie |
| * Gateway: `http://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/oozie/api/v1` |
    * Cluster: `http://{oozie-host}:11000/oozie/v1`
| |
| The values for `{gateway-host}`, `{gateway-port}`, `{gateway-path}` are |
| provided via the Gateway configuration file |
| (i.e. `{GATEWAY_HOME}/gateway-site.xml`). |
| |
| The value for `{cluster-name}` is derived from the name of the cluster |
| topology descriptor (e.g. `{GATEWAY_HOME}/deployments/{cluster-name}.xml`). |
| |
The values for `{namenode-host}`, `{templeton-host}` and `{oozie-host}`
are provided via the cluster topology descriptor
(e.g. `{GATEWAY_HOME}/deployments/{cluster-name}.xml`).
| |
| Note: The ports 50070, 50111 and 11000 are the defaults for NameNode, |
| Templeton and Oozie respectively. Their values can also be provided via |
| the cluster topology descriptor if your Hadoop cluster uses different |
| ports. |
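
To make the mapping concrete, the following pair shows an external request
and the cluster URL it maps to (an illustration using the placeholders
above):

    # External request via the Gateway:
    curl -i -k -u hdfs:hdfs-password \
      'https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/namenode/api/v1/tmp?op=LISTSTATUS'

    # Corresponding cluster-side URL the Gateway dispatches to:
    #   http://{namenode-host}:50070/webhdfs/v1/tmp?op=LISTSTATUS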
| |
| ------------------------------------------------------------------------------ |
| Enabling logging |
| ------------------------------------------------------------------------------ |
If necessary you can enable additional logging by editing the
`log4j.properties` file in the `conf` directory. Changing the rootLogger
value from `ERROR` to `DEBUG` will generate a large amount of debug logging.
A number of useful, finer-grained loggers are also provided in the file.
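
For example, assuming a standard log4j rootLogger line in that file, the
change can be made with a one-liner (keep the backup, since DEBUG output
is voluminous):

    sed -i.bak 's/^log4j.rootLogger=ERROR/log4j.rootLogger=DEBUG/' \
        conf/log4j.properties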
| |
| ------------------------------------------------------------------------------ |
| Filing bugs |
| ------------------------------------------------------------------------------ |
File bugs at hortonworks.jira.com under Project "Hadoop Gateway Development".
Include the results of
| |
    java -jar bin/gateway-server-${gateway-version}.jar -version
| |
| in the Environment section. Also include the version of Hadoop being used. |
| |
| ------------------------------------------------------------------------------ |
| Disclaimer |
| ------------------------------------------------------------------------------ |
| The Apache Knox Gateway is an effort undergoing incubation at the |
| Apache Software Foundation (ASF), sponsored by the Apache Incubator PMC. |
| |
| Incubation is required of all newly accepted projects until a further review |
| indicates that the infrastructure, communications, and decision making process |
| have stabilized in a manner consistent with other successful ASF projects. |
| |
| While incubation status is not necessarily a reflection of the completeness |
| or stability of the code, it does indicate that the project has yet to be |
| fully endorsed by the ASF. |
| |