| ~~ Licensed under the Apache License, Version 2.0 (the "License"); |
| ~~ you may not use this file except in compliance with the License. |
| ~~ You may obtain a copy of the License at |
| ~~ |
| ~~ http://www.apache.org/licenses/LICENSE-2.0 |
| ~~ |
| ~~ Unless required by applicable law or agreed to in writing, software |
| ~~ distributed under the License is distributed on an "AS IS" BASIS, |
| ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| ~~ See the License for the specific language governing permissions and |
| ~~ limitations under the License. See accompanying LICENSE file. |
| |
| --- |
| Hadoop Distributed File System-${project.version} - Transparent Encryption in HDFS |
| --- |
| --- |
| ${maven.build.timestamp} |
| |
| Transparent Encryption in HDFS |
| |
| %{toc|section=1|fromDepth=2|toDepth=3} |
| |
| * {Overview} |
| |
| HDFS implements <transparent>, <end-to-end> encryption. |
| Once configured, data read from and written to HDFS is <transparently> encrypted and decrypted without requiring changes to user application code. |
| This encryption is also <end-to-end>, which means the data can only be encrypted and decrypted by the client. |
| HDFS never stores or has access to unencrypted data or data encryption keys. |
| This satisfies two typical requirements for encryption: <at-rest encryption> (meaning data on persistent media, such as a disk) as well as <in-transit encryption> (e.g. when data is travelling over the network). |
| |
| * {Use Cases} |
| |
| Data encryption is required by a number of different government, financial, and regulatory entities. |
| For example, the health-care industry has HIPAA regulations, the card payment industry has PCI DSS regulations, and the US government has FISMA regulations. |
| Having transparent encryption built into HDFS makes it easier for organizations to comply with these regulations. |
| |
| Encryption can also be performed at the application-level, but by integrating it into HDFS, existing applications can operate on encrypted data without changes. |
| This integrated architecture implies stronger encrypted file semantics and better coordination with other HDFS functions. |
| |
| * {Architecture} |
| |
| ** {Key Management Server, KeyProvider, EDEKs} |
| |
| A new cluster service is required to store, manage, and access encryption keys: the Hadoop <Key Management Server (KMS)>. |
| The KMS is a proxy that interfaces with a backing key store on behalf of HDFS daemons and clients. |
| Both the backing key store and the KMS implement the Hadoop KeyProvider client API. |
| See the {{{../../hadoop-kms/index.html}KMS documentation}} for more information. |
| |
| In the KeyProvider API, each encryption key has a unique <key name>. |
| Because keys can be rolled, a key can have multiple <key versions>, where each key version has its own <key material> (the actual secret bytes used during encryption and decryption). |
| An encryption key can be fetched by either its key name, returning the latest version of the key, or by a specific key version. |
| |
| The KMS implements additional functionality which enables creation and decryption of <encrypted encryption keys (EEKs)>. |
| Creation and decryption of EEKs happens entirely on the KMS. |
| Importantly, the client requesting creation or decryption of an EEK never handles the EEK's encryption key. |
| To create a new EEK, the KMS generates a new random key, encrypts it with the specified key, and returns the EEK to the client. |
| To decrypt an EEK, the KMS checks that the user has access to the encryption key, uses it to decrypt the EEK, and returns the decrypted encryption key. |
| |
| In the context of HDFS encryption, EEKs are <encrypted data encryption keys (EDEKs)>, where a <data encryption key (DEK)> is what is used to encrypt and decrypt file data. |
| Typically, the key store is configured to only allow end users access to the keys used to encrypt DEKs. |
| This means that EDEKs can be safely stored and handled by HDFS, since the HDFS user will not have access to EDEK encryption keys. |
| |
| ** {Encryption zones} |
| |
| For transparent encryption, we introduce a new abstraction to HDFS: the <encryption zone>. |
| An encryption zone is a special directory whose contents will be transparently encrypted upon write and transparently decrypted upon read. |
| Each encryption zone is associated with a single <encryption zone key> which is specified when the zone is created. |
| Each file within an encryption zone has its own unique EDEK. |
| |
| When creating a new file in an encryption zone, the NameNode asks the KMS to generate a new EDEK encrypted with the encryption zone's key. |
| The EDEK is then stored persistently as part of the file's metadata on the NameNode. |
| |
| When reading a file within an encryption zone, the NameNode provides the client with the file's EDEK and the encryption zone key version used to encrypt the EDEK. |
| The client then asks the KMS to decrypt the EDEK, which involves checking that the client has permission to access the encryption zone key version. |
| Assuming that is successful, the client uses the DEK to decrypt the file's contents. |
| |
| All of the above steps for the read and write path happen automatically through interactions between the DFSClient, the NameNode, and the KMS. |
| |
| Access to encrypted file data and metadata is controlled by normal HDFS filesystem permissions. |
| This means that if HDFS is compromised (for example, by gaining unauthorized access to an HDFS superuser account), a malicious user only gains access to ciphertext and encrypted keys. |
| However, since access to encryption zone keys is controlled by a separate set of permissions on the KMS and key store, this does not pose a security threat. |
| |
| * {Configuration} |
| |
| A necessary prerequisite is an instance of the KMS, as well as a backing key store for the KMS. |
| See the {{{../../hadoop-kms/index.html}KMS documentation}} for more information. |
| |
| ** Selecting an encryption algorithm and codec |
| |
| *** hadoop.security.crypto.codec.classes.EXAMPLECIPHERSUITE |
| |
| The prefix for a given crypto codec, contains a comma-separated list of implementation classes for a given crypto codec (eg EXAMPLECIPHERSUITE). |
| The first implementation will be used if available, others are fallbacks. |
| |
| *** hadoop.security.crypto.codec.classes.aes.ctr.nopadding |
| |
| Default: <<<org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec,org.apache.hadoop.crypto.JceAesCtrCryptoCodec>>> |
| |
| Comma-separated list of crypto codec implementations for AES/CTR/NoPadding. |
| The first implementation will be used if available, others are fallbacks. |
| |
| *** hadoop.security.crypto.cipher.suite |
| |
| Default: <<<AES/CTR/NoPadding>>> |
| |
| Cipher suite for crypto codec. |
| |
| *** hadoop.security.crypto.jce.provider |
| |
| Default: None |
| |
| The JCE provider name used in CryptoCodec. |
| |
| *** hadoop.security.crypto.buffer.size |
| |
| Default: <<<8192>>> |
| |
| The buffer size used by CryptoInputStream and CryptoOutputStream. |
| |
| ** Namenode configuration |
| |
| *** dfs.namenode.list.encryption.zones.num.responses |
| |
| Default: <<<100>>> |
| |
| When listing encryption zones, the maximum number of zones that will be returned in a batch. |
| Fetching the list incrementally in batches improves namenode performance. |
| |
| * {<<<crypto>>> command-line interface} |
| |
| ** {createZone} |
| |
| Usage: <<<[-createZone -keyName <keyName> -path <path>]>>> |
| |
| Create a new encryption zone. |
| |
| *--+--+ |
| <path> | The path of the encryption zone to create. It must be an empty directory. |
| *--+--+ |
| <keyName> | Name of the key to use for the encryption zone. |
| *--+--+ |
| |
| ** {listZones} |
| |
| Usage: <<<[-listZones]>>> |
| |
| List all encryption zones. Requires superuser permissions. |
| |
| * {Attack vectors} |
| |
| ** {Hardware access exploits} |
| |
| These exploits assume that attacker has gained physical access to hard drives from cluster machines, i.e. datanodes and namenodes. |
| |
| [[1]] Access to swap files of processes containing data encryption keys. |
| |
| * By itself, this does not expose cleartext, as it also requires access to encrypted block files. |
| |
| * This can be mitigated by disabling swap, using encrypted swap, or using mlock to prevent keys from being swapped out. |
| |
| [[1]] Access to encrypted block files. |
| |
| * By itself, this does not expose cleartext, as it also requires access to DEKs. |
| |
| ** {Root access exploits} |
| |
| These exploits assume that attacker has gained root shell access to cluster machines, i.e. datanodes and namenodes. |
| Many of these exploits cannot be addressed in HDFS, since a malicious root user has access to the in-memory state of processes holding encryption keys and cleartext. |
| For these exploits, the only mitigation technique is carefully restricting and monitoring root shell access. |
| |
| [[1]] Access to encrypted block files. |
| |
| * By itself, this does not expose cleartext, as it also requires access to encryption keys. |
| |
| [[1]] Dump memory of client processes to obtain DEKs, delegation tokens, cleartext. |
| |
| * No mitigation. |
| |
| [[1]] Recording network traffic to sniff encryption keys and encrypted data in transit. |
| |
| * By itself, insufficient to read cleartext without the EDEK encryption key. |
| |
| [[1]] Dump memory of datanode process to obtain encrypted block data. |
| |
| * By itself, insufficient to read cleartext without the DEK. |
| |
| [[1]] Dump memory of namenode process to obtain encrypted data encryption keys. |
| |
| * By itself, insufficient to read cleartext without the EDEK's encryption key and encrypted block files. |
| |
| ** {HDFS admin exploits} |
| |
| These exploits assume that the attacker has compromised HDFS, but does not have root or <<<hdfs>>> user shell access. |
| |
| [[1]] Access to encrypted block files. |
| |
| * By itself, insufficient to read cleartext without the EDEK and EDEK encryption key. |
| |
| [[1]] Access to encryption zone and encrypted file metadata (including encrypted data encryption keys), via -fetchImage. |
| |
| * By itself, insufficient to read cleartext without EDEK encryption keys. |
| |
| ** {Rogue user exploits} |
| |
| A rogue user can collect keys to which they have access, and use them later to decrypt encrypted data. |
| This can be mitigated through periodic key rolling policies. |