| ------------------------------------------------------------------------------ |
| Requirements |
| ------------------------------------------------------------------------------ |
| Java: |
| Java 1.6 or later |
| |
| Hadoop Cluster: |
|    A local installation of a Hadoop cluster is currently required. Hadoop
|    EC2 and/or Sandbox installations are difficult to access remotely via
|    the Gateway because their Hadoop services run on internal IP addresses.
|    For the Gateway to work in these cases it currently needs to be
|    deployed on the EC2 cluster or Sandbox itself.
| |
| The instructions that follow assume that the Gateway is *not* collocated |
| with the Hadoop clusters themselves and (most importantly) that the |
| hostnames and IP addresses of the cluster services are accessible by the |
|    gateway wherever it happens to be running.
| |
|    Ensure that the Hadoop cluster has WebHDFS, WebHCat (i.e. Templeton)
|    and Oozie configured, deployed and running.
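|
|    As a quick sanity check, you can verify that these services are
|    reachable from the machine that will host the Gateway. The hostname
|    below is a placeholder and 50070 is simply the default WebHDFS port
|    noted later in this document:
|
|       curl -i 'http://{webhdfs-host}:50070/webhdfs/v1/?op=LISTSTATUS'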
| |
| ------------------------------------------------------------------------------ |
| Installation and Deployment Instructions |
| ------------------------------------------------------------------------------ |
| 1. Install |
| Download and extract the knox-{VERSION}.zip file into the |
| installation directory that will contain your GATEWAY_HOME |
| jar xf knox-{VERSION}.zip |
| This will create a directory 'gateway' in your current directory. |
| |
| 2. Enter Gateway Home directory |
| cd gateway |
| The fully qualified name of this directory will be referenced as |
| {GATEWAY_HOME} throughout the remainder of this document. |
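|
|    For convenience, the commands for steps 1 and 2 can be combined as
|    shown below. The exported GATEWAY_HOME variable is merely a convenience
|    for your shell session and is not required by the Gateway itself.
|
|       jar xf knox-{VERSION}.zip
|       cd gateway
|       export GATEWAY_HOME=`pwd`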
| |
| 3. Start the demo LDAP server (ApacheDS) |
| a. First, understand that the LDAP server provided here is for demonstration |
| purposes. You may configure the LDAP specifics within the topology |
| descriptor for the cluster as described in step 5 below, in order to |
| customize what LDAP instance to use. The assumption is that most users |
| will leverage the demo LDAP server while evaluating this release and |
| should therefore continue with the instructions here in step 3. |
| b. Edit {GATEWAY_HOME}/conf/users.ldif if required and add your users and |
| groups to the file. A number of normal Hadoop users |
| (e.g. hdfs, mapred, hcat, hive) have already been included. Note that |
| the passwords in this file are "fictitious" and have nothing to do with |
| the actual accounts on the Hadoop cluster you are using. There is also |
| a copy of this file in the templates directory that you can use to start |
| over if necessary. |
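|       For illustration only, a user entry in users.ldif generally looks
|       like the sketch below; the DN suffix and object classes shown are
|       assumptions and must match those already present in your copy of
|       the file:
|
|          dn: uid=guest,ou=people,dc=hadoop,dc=apache,dc=org
|          objectclass: top
|          objectclass: person
|          objectclass: organizationalPerson
|          objectclass: inetOrgPerson
|          cn: Guest
|          sn: Guest
|          uid: guest
|          userPassword: guest-password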
|    c. Start the LDAP server, pointing it at the conf directory where it
|       will find the users.ldif file.
| java -jar bin/ldap.jar conf & |
| There are a number of log messages of the form "Created null." that can |
| safely be ignored. Take note of the port on which it was started as this |
| needs to match later configuration. This will create a directory named |
| 'org.apache.hadoop.gateway.security.EmbeddedApacheDirectoryServer' that |
| can safely be ignored. |
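|       To confirm that the server is listening, assuming it kept the
|       default port of 33389 mentioned in step 5, something like the
|       following can be used:
|          netstat -an | grep 33389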
| |
| 4. Start the Gateway server |
| java -jar bin/server.jar |
| a. Take note of the port identified in the logging output as you will need this for |
| accessing the gateway. |
|    b. The server will prompt you for the master secret (password). This
|       secret is used to protect the security artifacts used by the gateway
|       server for things like SSL and credential/password aliasing. It will
|       have to be entered at startup unless you choose to persist it.
|       Remember this secret and keep it safe; it represents the keys to the
|       kingdom. See the Persisting the Master section for more information.
| |
| 5. Configure the Gateway with the topology of your Hadoop cluster |
|    a. Edit the file {GATEWAY_HOME}/deployments/sandbox.xml
|    b. Change the host and port in the urls of the <service> elements for
|       the WEBHDFS, WEBHCAT and OOZIE services to match your Hadoop cluster
|       deployment. A sketch of such an element appears after this step.
|    c. The default configuration contains the URL of an LDAP server. Out of
|       the box it is configured to access the demo ApacheDS based LDAP
|       server in its default configuration, which listens on port 33389.
|       Optionally, you can change this URL to point at the LDAP server to
|       be used for authentication. It is set via the
|       main.ldapRealm.contextFactory.url property in the
|       <gateway><provider><authentication> section.
|    d. Save the file. The directory {GATEWAY_HOME}/deployments is monitored
|       by the Gateway server, which reacts to the discovery of a new or changed
| cluster topology descriptor by provisioning the endpoints and required |
| filter chains to serve the needs of each cluster as described by the |
|       topology file. Note that the name of the file excluding the extension
|       is also used as the path for that cluster in the URL. So for example
|       the sandbox.xml file will result in Gateway URLs of the form
|       http://{gateway-host}:{gateway-port}/gateway/sandbox/webhdfs/v1
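|
|       For illustration, a single <service> element typically looks like
|       the sketch below; the <role> and <url> child element names are
|       assumptions based on the descriptor shipped with the release, and
|       the host and port are placeholders for your cluster:
|
|          <service>
|             <role>WEBHDFS</role>
|             <url>http://{webhdfs-host}:50070/webhdfs</url>
|          </service>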
| |
| 6. Test the installation and configuration of your Gateway |
|    Invoke the LISTSTATUS operation on HDFS via your configured WEBHDFS
|    service using your web browser or curl:
| |
| curl -i -k -u guest:guest-password -X GET \ |
| 'https://localhost:8443/gateway/sandbox/webhdfs/v1/?op=LISTSTATUS' |
| |
|    The above command should produce output along the lines of the example
|    below. The exact information returned will depend on the content of
|    HDFS in your Hadoop cluster.
| |
| HTTP/1.1 200 OK |
| Content-Type: application/json |
| Content-Length: 760 |
| Server: Jetty(6.1.26) |
| |
| {"FileStatuses":{"FileStatus":[ |
| {"accessTime":0,"blockSize":0,"group":"hdfs","length":0,"modificationTime":1350595859762,"owner":"hdfs","pathSuffix":"apps","permission":"755","replication":0,"type":"DIRECTORY"}, |
| {"accessTime":0,"blockSize":0,"group":"mapred","length":0,"modificationTime":1350595874024,"owner":"mapred","pathSuffix":"mapred","permission":"755","replication":0,"type":"DIRECTORY"}, |
| {"accessTime":0,"blockSize":0,"group":"hdfs","length":0,"modificationTime":1350596040075,"owner":"hdfs","pathSuffix":"tmp","permission":"777","replication":0,"type":"DIRECTORY"}, |
| {"accessTime":0,"blockSize":0,"group":"hdfs","length":0,"modificationTime":1350595857178,"owner":"hdfs","pathSuffix":"user","permission":"755","replication":0,"type":"DIRECTORY"} |
| ]}} |
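|
|    A similar smoke test can be run against WebHCat through the Gateway.
|    The status call below is part of the WebHCat REST API; the URL assumes
|    the same sandbox topology and demo credentials as the example above:
|
|    curl -i -k -u guest:guest-password -X GET \
|        'https://localhost:8443/gateway/sandbox/webhcat/v1/status'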
| |
| For additional information on WebHDFS, WebHCat/Templeton and Oozie |
| REST APIs, see the following URLs respectively: |
| |
| http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html |
| http://hive.apache.org/docs/hcat_r0.5.0/rest.html |
| http://oozie.apache.org/docs/4.0.0/WebServicesAPI.html |
| |
| ------------------------------------------------------------------------------ |
| Persisting the Master |
| ------------------------------------------------------------------------------ |
| The master secret is required to start the server. This secret is used by the gateway instance to access secured
| artifacts. Keystores, trust stores and credential stores are all protected with the master secret.
| |
| You may persist the master secret by supplying the *-persist-master* switch at startup. This will result in a |
| warning indicating that persisting the secret is less secure than providing it at startup. We do make some provisions in |
| order to protect the persisted password. |
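|
| For example, to start the server and persist the master secret:
|
|     java -jar bin/server.jar -persist-master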
| |
| The persisted secret is encrypted with 128-bit AES and, where possible, the file permissions are set so that it is
| accessible only by the user that the gateway runs as.
| |
| After persisting the secret, ensure that the file at config/security/master has the appropriate permissions set for
| your environment. This is probably the most important layer of defense for the master secret. Do not assume that the
| encryption is sufficient protection.
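|
| For example, on a POSIX system the persisted master file can be restricted to the gateway user with something like
| the following; the path matches the one given above:
|
|     chmod 600 {GATEWAY_HOME}/config/security/master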
| |
| A dedicated user should be created to run the gateway; this helps protect a persisted master file.
| |
| ------------------------------------------------------------------------------ |
| Management of Security Artifacts |
| ------------------------------------------------------------------------------ |
| There are a number of artifacts that are used by the gateway in ensuring the security of wire level communications, |
| access to protected resources and the encryption of sensitive data. These artifacts can be managed from outside of |
| the gateway instances or generated and populated by the gateway instance itself. |
| |
| The following describes how this is coordinated, with both standalone (development, demo, etc.) gateway instances
| and clusters of gateway instances in mind.
| |
| Upon start of the gateway server we: |
| |
| 1. Look for an identity store at conf/security/keystores/gateway.jks. The identity store contains the certificate
|    and private key used to represent the identity of the server for SSL connections and signature creation.
|    a. If there is no identity store we create one and generate a self-signed certificate for use in standalone/demo
|       mode. The certificate is stored with an alias of gateway-identity.
|    b. If an identity store is found then we ensure that it can be loaded using the provided master secret and
|       that it contains an alias called gateway-identity.
| 2. Look for a credential store at conf/security/keystores/__gateway-credentials.jceks. This credential store is used |
| to store secrets/passwords that are used by the gateway. For instance, this is where the passphrase for accessing |
| the gateway-identity certificate is kept. |
| a. If there is no credential store found then we create one and populate it with a generated passphrase for the alias |
| gateway-identity-passphrase. This is coordinated with the population of the self-signed cert into the identity-store. |
| b. If a credential store is found then we ensure that it can be loaded using the provided master secret and that the |
| expected aliases have been populated with secrets. |
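|
| For reference, the generated identity store can be inspected with the JDK keytool; when prompted, the keystore
| password is the master secret:
|
|     keytool -list -keystore conf/security/keystores/gateway.jks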
| |
| Upon deployment of a Hadoop cluster topology within the gateway we: |
| |
| 1. Look for a credential store for the topology. For instance, for the sandbox topology that gets deployed out of the
|    box we look for conf/security/keystores/sandbox-credentials.jceks. This topology specific credential store is used
|    for storing secrets/passwords that are used for encrypting sensitive data with topology specific keys.
|    a. If no credential store is found for the topology being deployed then one is created for it. Population of the
|       aliases is delegated to the configured providers within the system that will require the use of a secret for a
|       particular task. They may programmatically set the value of the secret or choose to have the value for the
|       specified alias generated through the AliasService.
|    b. If a credential store is found then we ensure that it can be loaded with the provided master secret and that
|       the configured providers have the opportunity to ensure that the aliases are populated, and to populate them
|       if not.
| |
| By leveraging the algorithm described above, these artifacts can be managed in a number of ways.
| |
| 1. Using a single gateway instance as a master instance, the artifacts can be generated or placed into the expected
|    location and then replicated across all of the slave instances before startup. An example is sketched below.
| 2. Using an NFS mount as a central location for the artifacts would provide a single source of truth without the need to |
| replicate them over the network. Of course, NFS mounts have their own challenges. |
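|
| As an illustration of option 1 above, the security artifacts could be replicated from a master instance with
| something like the following; the slave hostname is a placeholder:
|
|     scp -r {GATEWAY_HOME}/conf/security/keystores gateway-slave:{GATEWAY_HOME}/conf/security/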
| |
| Summary of Secrets to be Managed: |
| |
| 1. Master secret - the same for all gateway instances in a cluster of gateways |
| 2. All security related artifacts are protected with the master secret |
| 3. Secrets used by the gateway itself are stored within the gateway credential store and are the same across all gateway |
| instances in the cluster of gateways |
| 4. Secrets used by providers within cluster topologies are stored in topology specific credential stores and are the same |
|    for the same topology across the cluster of gateway instances. However, they are specific to the topology - so
|    secrets for one Hadoop cluster are different from those of another. This allows for failover from one gateway
|    instance to another even when encryption is in use, while ensuring that the compromise of one encryption key does
|    not expose the data for all clusters.
| |
| NOTE: the SSL certificate will need special consideration depending on the type of certificate. Wildcard certs may be able |
| to be shared across all gateway instances in a cluster. When certs are dedicated to specific machines the gateway identity |
| store will not be able to be blindly replicated as hostname verification problems will ensue. Obviously, truststores will |
| need to be taken into account as well. |
| |
| ------------------------------------------------------------------------------ |
| Mapping Gateway URLs to Hadoop cluster URLs |
| ------------------------------------------------------------------------------ |
| The Gateway functions much like a reverse proxy. As such it maintains a |
| mapping of URLs that are exposed externally by the Gateway to URLs that are |
| provided by the Hadoop cluster. Examples of the mappings for WebHDFS,
| WebHCat and Oozie are shown below. These mappings are generated from the
| combination of the Gateway configuration file
| (i.e. {GATEWAY_HOME}/gateway-site.xml) and the cluster topology descriptors
| (e.g. {GATEWAY_HOME}/deployments/<cluster-name>.xml).
| |
| WebHDFS |
| Gateway: http://<gateway-host>:<gateway-port>/<gateway-path>/<cluster-name>/webhdfs |
| Cluster: http://<webhdfs-host>:50070/webhdfs |
| WebHCat (Templeton) |
| Gateway: http://<gateway-host>:<gateway-port>/<gateway-path>/<cluster-name>/webhcat |
| Cluster: http://<webhcat-host>:50111/templeton |
| Oozie |
| Gateway: http://<gateway-host>:<gateway-port>/<gateway-path>/<cluster-name>/oozie |
| Cluster: http://<oozie-host>:11000/oozie |
| |
| The values for <gateway-host>, <gateway-port>, <gateway-path> are provided via |
| the Gateway configuration file (i.e. {GATEWAY_HOME}/gateway-site.xml). |
| |
| The value for <cluster-name> is derived from the name of the cluster topology |
| descriptor (e.g. {GATEWAY_HOME}/deployments/<cluster-name>.xml). |
| |
| The values for <webhdfs-host>, <webhcat-host> and <oozie-host> are provided
| via the cluster topology descriptor
| (e.g. {GATEWAY_HOME}/deployments/<cluster-name>.xml).
| |
| Note: The ports 50070, 50111 and 11000 are the defaults for WebHDFS, |
| WebHCat and Oozie respectively. Their values can also be provided via |
| the cluster topology descriptor if your Hadoop cluster uses different |
| ports. |
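|
| For example, using the values from the curl example earlier in this
| document (gateway-host=localhost, gateway-port=8443, gateway-path=gateway
| and cluster-name=sandbox) the WebHDFS mapping becomes:
|
|    Gateway: https://localhost:8443/gateway/sandbox/webhdfs
|    Cluster: http://{webhdfs-host}:50070/webhdfs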
| |
| ------------------------------------------------------------------------------ |
| Usage Examples |
| ------------------------------------------------------------------------------ |
| Please see the Apache Knox Gateway website for detailed examples. |
| http://knox.incubator.apache.org/examples.html |
| |
| ------------------------------------------------------------------------------ |
| Enabling logging |
| ------------------------------------------------------------------------------ |
| If necessary you can enable additional logging by editing the log4j.properties |
| file in the conf directory. Changing the rootLogger value from ERROR to DEBUG |
| will generate a large amount of debug logging. A number of useful, more
| fine-grained loggers are also provided in the file.
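|
| For example, the change described above amounts to a one line edit of
| conf/log4j.properties; the appender name shown after the level is a
| placeholder for whatever the file already declares:
|
|    log4j.rootLogger=DEBUG, drfa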
| |