 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership.  The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License.  You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.

 [[security]]
 = Apache Kudu Security

 :author: Kudu Team
 :imagesdir: ./images
 :icons: font
 :toc: left
 :toclevels: 3
 :doctype: book
 :backend: html5
 :sectlinks:
 :experimental:

 Kudu includes security features which allow Kudu clusters to be hardened against
 access from unauthorized users. This guide describes the security features
 provided by Kudu. <<configuration>> lists essential configuration options when
 deploying a secure Kudu cluster. <<known-limitations>> contains a list of
 known deficiencies in Kudu's security capabilities.

 == Authentication

 Kudu can be configured to enforce secure authentication among servers, and
 between clients and servers. Authentication prevents untrusted actors from
 gaining access to Kudu, and securely identifies the connecting user or service
 for authorization checks. Authentication in Kudu is designed to interoperate
 with other secure Hadoop components by utilizing Kerberos.

 Authentication can be configured on Kudu servers using the
 `--rpc_authentication` flag, which can be set to `required`, `optional`, or
 `disabled`. By default, the flag is set to `optional`. When `required`, Kudu
 rejects connections from clients and servers that lack authentication
 credentials. When `optional`, Kudu attempts to use strong authentication.
 When the flag is `disabled`, or when strong authentication fails under
 `optional`, Kudu by default allows unauthenticated connections only from
 trusted subnets: the private networks (127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12,
 192.168.0.0/16, 169.254.0.0/16) and the local subnets of all local network
 interfaces. Unauthenticated connections from publicly routable IPs are rejected.

 The trusted subnets can be configured using the `--trusted_subnets` flag,
 which can be set to a comma-separated list of IP blocks in CIDR notation. Set it to
 `0.0.0.0/0` to allow unauthenticated connections from all remote IP addresses.
 However, if network access is not otherwise restricted by a firewall,
 malicious users may be able to gain unauthorized access. This can be mitigated
 if authentication is configured to be required.
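
 The interaction between `--rpc_authentication` and `--trusted_subnets` described
 above can be sketched as follows. This is a minimal illustration, not Kudu's
 implementation: the function names are hypothetical, and the local subnets of
 the server's own network interfaces are not modeled.

```python
import ipaddress

# Default trusted subnets as listed in the text above. The local subnets of the
# server's own network interfaces are NOT modeled in this sketch.
DEFAULT_TRUSTED_SUBNETS = (
    "127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,169.254.0.0/16"
)


def parse_trusted_subnets(flag_value):
    """Parse a comma-separated list of CIDR blocks (the --trusted_subnets format)."""
    return [
        ipaddress.ip_network(block.strip())
        for block in flag_value.split(",")
        if block.strip()
    ]


def accepts_connection(rpc_authentication, authenticated, remote_ip, subnets):
    """Hypothetical helper: would a connection be accepted under these settings?

    rpc_authentication: 'required', 'optional', or 'disabled'
    authenticated:      True if strong authentication succeeded
    """
    if rpc_authentication != "disabled" and authenticated:
        return True
    if rpc_authentication == "required":
        # Unauthenticated peers are always rejected.
        return False
    # 'disabled', or 'optional' with failed authentication: trusted subnets only.
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in subnets)
```

 Under these assumptions, an unauthenticated peer at `10.1.2.3` is accepted in
 `optional` mode, while one at `8.8.8.8` is rejected unless the trusted subnets
 are widened to `0.0.0.0/0`.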

 WARNING: When the `--rpc_authentication` flag is set to `optional`,
 the cluster does not prevent access from unauthenticated users. To secure a
 cluster, use `--rpc_authentication=required`.

 === Internal PKI

 Kudu uses an internal PKI system to issue X.509 certificates to servers in
 the cluster. Connections between peers who have both obtained certificates will
 use TLS for authentication, which doesn't require contacting the Kerberos KDC.
 These certificates are _only_ used for internal communication among Kudu
 servers, and between Kudu clients and servers. The certificates are never
 presented in a public facing protocol.

 By using internally-issued certificates, Kudu offers strong authentication which
 scales to huge clusters, and allows TLS encryption to be used without requiring
 you to manually deploy certificates on every node.

 === Authentication Tokens

 After authenticating to a secure cluster, the Kudu client will automatically
 request an authentication token from the Kudu master. An authentication token
 encapsulates the identity of the authenticated user and carries the master's
 RSA signature so that its authenticity can be verified.

 This token will be used to authenticate subsequent connections. By default,
 authentication tokens are only valid for seven days, so that even if a token
 were compromised, it could not be used indefinitely. For the most part,
 authentication tokens should be completely transparent to users. By using
 authentication tokens, Kudu takes advantage of strong authentication without
 paying the scalability cost of communicating with a central authority for every
 connection.

 When used with distributed compute frameworks such as Spark, authentication
 tokens can simplify configuration and improve security. For example, the Kudu
 Spark connector will automatically retrieve an authentication token during the
 planning stage, and distribute the token to tasks. This allows Spark to work
 against a secured Kudu cluster where only the planner node has Kerberos
 credentials.

 === Client Authentication to Secure Kudu Clusters

 Users running client Kudu applications must first run the `kinit` command to
 obtain a Kerberos ticket-granting ticket. For example:

 [source,bash]
 ----
 $ kinit admin@EXAMPLE-REALM.COM
 ----

 Once authenticated, you use the same client code to read from and write to Kudu
 servers with and without Kerberos configuration.

 === Scalability

 Kudu authentication is designed to scale to thousands of nodes, which requires
 avoiding unnecessary coordination with a central authentication authority (such
 as the Kerberos KDC). Instead, Kudu servers and clients will use Kerberos to
 establish initial trust with the Kudu master, and then use alternate credentials
 for subsequent connections. In particular, the master will issue internal
 X.509 certificates to servers, and temporary authentication tokens to clients.

 == Coarse-Grained Authorization

 Kudu supports coarse-grained authorization of client requests based on the
 authenticated client Kerberos principal (i.e. user or service). The two levels
 of access which can be configured are:

 * *Superuser* - principals authorized as a superuser are able to perform
 certain administrative functionality such as using the `kudu` command line tool
 to diagnose or repair cluster issues.

 * *User* - principals authorized as a user are able to access and modify all
 data in the Kudu cluster. This includes the ability to create, drop, and alter
 tables as well as read, insert, update, and delete data.

 NOTE: Internally, Kudu has a third access level for the daemons themselves.
 This ensures that users cannot connect to the cluster and pose as tablet
 servers.

 Access levels are granted using whitelist-style Access Control Lists (ACLs), one
 for each of the two levels. Each access control list either specifies a
 comma-separated list of users, or may be set to `*` to indicate that all
 authenticated users are able to gain access at the specified level. See
 <<configuration>> below for examples.

 NOTE: The default value for the User ACL is `*`, which allows all users access
 to the cluster. However, if authentication is enabled, this still restricts access
 to only those users who are able to successfully authenticate via Kerberos.
 Unauthenticated users on the same network as the Kudu servers will be unable
 to access the cluster.
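
 As an illustration of the ACL format, the following sketch evaluates a
 comma-separated ACL value or the `*` wildcard. The helper names are
 hypothetical, and the principal is assumed to have already been authenticated
 and reduced to a short username.

```python
def parse_acl(flag_value):
    """Return None for the '*' wildcard, else the set of listed principals."""
    value = flag_value.strip()
    if value == "*":
        return None
    return {name.strip() for name in value.split(",") if name.strip()}


def acl_allows(flag_value, principal):
    """True if an already-authenticated principal is admitted by the ACL value."""
    allowed = parse_acl(flag_value)
    return allowed is None or principal in allowed
```

 For example, `acl_allows("impala,nightly_etl_service_account", "impala")` is
 true, while `acl_allows("hadoopadmin", "impala")` is false.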

 [[fine_grained_authz]]
 == Fine-Grained Authorization

 As of Kudu 1.12.0, Kudu can be configured to enforce fine-grained authorization
 across servers. This ensures that users can see only the data they are
 explicitly authorized to see. Kudu supports this by leveraging policies defined
 in Apache Ranger 2.1 and later.

 WARNING: Fine-grained authorization policies are not enforced when accessing
 the web UI. User data may appear on various pages of the web UI (e.g. in logs,
 metrics, scans, etc.). As such, it is recommended to either limit access to the
 web UI ports, or redact or disable the web UI entirely, as desired. See the
 <<web-ui,instructions for securing the web UI>> for more details.

 === Apache Ranger

 Apache Ranger models tabular objects stored in a Kudu cluster in the following
 hierarchy:

 * *Database* - Kudu does not have the concept of a database. Therefore, a database
 is indicated as a prefix of table names with the format `<database>.<table>`.
 Since Kudu's only restriction on table names is that they be valid UTF-8 encoded
 strings, Kudu considers special characters to be valid parts of database or table
 names. For example, if a managed Kudu table created from Impala (see Kudu Impala
 integration <<kudu_impala_integration.adoc#managed_tables,documentation>>) is named
 `impala::bar.foo`, its database will be `impala::bar`.

 * *Table* - a single Kudu table.

 * *Column* - a column within a Kudu table.

 NOTE: Ranger allows you to add separate service repositories to manage privileges
 for different Kudu clusters. The value of the `ranger.plugin.kudu.service.name`
 configuration in the Ranger client determines which service repository Kudu
 connects to. For more details about Ranger service repositories, see the Apache Ranger
 link:https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=57901344[documentation].

 In Ranger, privileges are also associated with specific actions. Access to Kudu
 tables may rely on privileges on the following actions:

 * `ALTER`
 * `CREATE`
 * `DELETE`
 * `DROP`
 * `INSERT`
 * `UPDATE`
 * `SELECT`

 There are two additional access types:

 * `ALL`
 * `METADATA`

 If a user has the `ALL` privilege on a resource, they implicitly have privileges
 to perform any action on that resource (except those that require users to be a
 delegated admin, see below). Also, a user who has been granted any privilege is
 able to perform actions requiring `METADATA` (e.g. opening the table) without
 being explicitly granted the `METADATA` privilege.

 Finally, Ranger supports a `delegate admin` flag which is independent of the
 action types (it's not implied by `ALL` and doesn't imply `METADATA`). This is
 similar to the `GRANT OPTION` part of `ALL WITH GRANT OPTION` in SQL as it is
 required to modify privileges in Ranger and change the owner of a Kudu table.

 WARNING: A user with the `delegate admin` privilege on a resource can grant any
 privilege to themselves and others.
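
 The implication rules among `ALL`, `METADATA`, and the `delegate admin` flag
 can be summarized in a short sketch. The `Grant` type and function names below
 are hypothetical and only model the rules stated in this section, not Ranger's
 actual policy engine.

```python
from dataclasses import dataclass, field


@dataclass
class Grant:
    """Hypothetical model of what a user holds on one resource."""
    actions: set = field(default_factory=set)  # e.g. {"SELECT", "INSERT"} or {"ALL"}
    delegate_admin: bool = False               # independent of the action types


def permits(grant, required):
    """True if the granted actions cover the required access type."""
    if required == "METADATA":
        # Any granted privilege implies METADATA; delegate admin alone does not.
        return bool(grant.actions)
    # ALL implies every action; otherwise the action must be granted explicitly.
    return "ALL" in grant.actions or required in grant.actions


def may_modify_privileges(grant):
    """Modifying privileges needs the flag itself; ALL does not imply it."""
    return grant.delegate_admin
```

 With this model, `Grant({"SELECT"})` permits `METADATA` but not `INSERT`, and
 `Grant({"ALL"})` permits `DROP` but may not modify privileges.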

 While the action types are hierarchical, in terms of privilege evaluation,
 Ranger doesn't have the concept of hierarchy. For instance, if a user has
 `SELECT` privilege on a database, it does not imply that the user has `SELECT`
 privileges on every table belonging to that database. On the other hand, Ranger
 supports privilege wildcard matching. For example, `db=a->table=\*` matches all
 the tables that belong to database `a`. Therefore, in Ranger users actually need
 the `SELECT` privilege granted on `db=a->table=*->column=*` to allow `SELECT` on
 every table and every column in database `a`.
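
 The difference between hierarchy and wildcard matching can be illustrated with
 a small sketch. It is a simplification under stated assumptions: resources and
 policies are modeled as tuples of `(database, table, column)` prefixes, only
 the `*` wildcard is handled, and a policy is assumed to apply only to
 resources specified at the same level. Ranger's real matcher is more general.

```python
import re


def _matches(pattern, value):
    """Match value against a pattern in which '*' stands for any run of characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Policy matching is treated as case-insensitive in this sketch.
    return re.fullmatch(regex, value, flags=re.IGNORECASE | re.DOTALL) is not None


def policy_covers(policy, resource):
    """policy/resource are tuples: (db,), (db, table) or (db, table, column).

    Assumption: no privilege is inherited from a higher level, so the policy
    must name every level that the resource names.
    """
    if len(policy) != len(resource):
        return False
    return all(_matches(p, r) for p, r in zip(policy, resource))
```

 Here `policy_covers(("a",), ("a", "orders", "id"))` is false, because a
 database-level policy is not inherited, whereas
 `policy_covers(("a", "*", "*"), ("a", "orders", "id"))` is true.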

 With Ranger integration, when a Kudu master receives a request, it consults
 Ranger to determine what privileges the user has. The required policies
 documented in the <<security.adoc#policy-for-kudu-masters, policy section>>
 are then enforced to determine whether the user is authorized to perform the
 requested action.

 NOTE: Even though Kudu table names remain case sensitive with Ranger integration,
 policy authorization is considered case-insensitive.

 In addition to granting privileges to a user by username, privileges can also be
 granted to table owners using the special `{OWNER}` username. These policies are
 evaluated only when a user tries to perform an action on a table that they own.
 For example, a policy can be defined for the `{OWNER}` user and `db=*->table=*`
 resource, and it will automatically be applied when any table is accessed by its
 owner. This way administrators don't need to choose between creating policies
 one by one for each table, and granting access to a wide range of users.

 WARNING: If a user has `ALL` and `delegate admin` privileges on a table only via
 ownership and no privileges by username, they can effectively lock themselves
 out by giving away ownership.

 === Authorization Tokens

 Rather than having every tablet server communicate directly with the underlying
 authorization service (Ranger), privileges are propagated and checked via
 *authorization tokens*. These tokens encapsulate what privileges a user has on a
 given table. Tokens are generated by the master and returned to Kudu clients
 upon opening a Kudu table. Kudu clients automatically attach authorization
 tokens when sending requests to tablet servers.

 NOTE: Authorization tokens are a means of limiting the number of nodes directly
 accessing the authorization service to retrieve privileges. As such, since the
 expected number of tablet servers in a cluster is much higher than the number of
 Kudu masters, they are only used to authorize requests sent to tablet servers.
 Kudu masters fetch privileges directly from the authorization service or cache.
 See <<privilege-caching>> for more details of Kudu's privilege cache.

 Similar to the validity interval for authentication tokens, to limit the
 window of potential unwanted access if a token becomes compromised,
 authorization tokens are valid for five minutes by default. The acquisition and
 renewal of a token is hidden from the user, as Kudu clients automatically
 retrieve new tokens when existing tokens expire.

 When a tablet server that has been configured to enforce fine-grained access
 control receives a request, it checks the privileges in the attached token,
 rejecting it if the privileges are not sufficient to perform the requested
 operation, or if it is invalid (e.g. expired).
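
 The tablet server's check can be pictured roughly as follows. This is a
 minimal sketch, not Kudu's token format: the `AuthzToken` fields and the
 `check_request` helper are hypothetical, and verification of the master's
 signature on the token is omitted.

```python
import time
from dataclasses import dataclass

# Default validity interval stated above: five minutes.
DEFAULT_AUTHZ_TOKEN_LIFETIME_SECS = 5 * 60


@dataclass(frozen=True)
class AuthzToken:
    """Hypothetical token: privileges a user holds on a single table."""
    table_id: str
    privileges: frozenset
    expires_at: float  # seconds since the epoch


def check_request(token, table_id, required, now=None):
    """Return 'ok', 'expired', or 'rejected' for a request carrying this token."""
    now = time.time() if now is None else now
    if now >= token.expires_at:
        # The client is expected to fetch a fresh token and retry.
        return "expired"
    if token.table_id != table_id or required not in token.privileges:
        return "rejected"
    return "ok"
```

 A token issued at time `t` with the default lifetime would be accepted until
 `t + DEFAULT_AUTHZ_TOKEN_LIFETIME_SECS`, after which the client transparently
 requests a new one.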

 [[trusted-users]]
 === Trusted Users

 It may be desirable to allow certain users to view and modify any data stored
 in Kudu. Such users can be specified via the `--trusted_user_acl` master
 configuration. Trusted users can perform any operation that would otherwise
 require fine-grained privileges, without Kudu consulting the authorization service.

 Additionally, some services that interact with Kudu may authorize requests on
 behalf of their end users. For example, Apache Impala authorizes queries on
 behalf of its users, and sends requests to Kudu as the Impala service user,
 commonly "impala". Since Impala authorizes requests on its own, to avoid
 extraneous communication between the authorization service and Kudu, the
 Impala service user should be listed as a trusted user.

 NOTE: When accessing Kudu through Impala, Impala enforces its own fine-grained
 authorization policy. This policy is similar to Kudu's and can be found in
 Impala's
 link:https://impala.apache.org/docs/build/html/topics/impala_authorization.html#authorization[authorization
 documentation].

 [[ranger-configuration]]
 === Configuring the Integration with Apache Ranger

 NOTE: Ranger is often configured with Kerberos authentication. See
 <<configuration>> for how to configure Kudu to authenticate via Kerberos.

 * After building Kudu from source, find the `kudu-subprocess.jar` under the build
 directory (e.g. `build/release/bin`). Note its path: this JAR file contains the
 Ranger subprocess, which houses the Ranger client that Kudu will use to
 communicate with the Ranger server.

 * Use the `kudu table list` tool to find any table names in the cluster that are
 not Ranger-compatible, which are names that begin or end with a period. Also check
 that there are no two table names that only differ by case, since authorization
 is case-insensitive. For those tables that don't comply with the requirements,
 use the `kudu table rename_table` tool to rename the tables.

 * Create the Ranger client configuration file `ranger-kudu-security.xml`, and note
 the directory containing this file.

 ```xml
 <property>
   <name>ranger.plugin.kudu.policy.cache.dir</name>
   <value>policycache</value>
   <description>Directory where Ranger policies are cached after successful retrieval from the Ranger service</description>
 </property>
 <property>
   <name>ranger.plugin.kudu.service.name</name>
   <value>kudu</value>
   <description>Name of the Ranger service repository storing policies for this Kudu cluster</description>
 </property>
 <property>
   <name>ranger.plugin.kudu.policy.rest.url</name>
   <value>http://host:port</value>
   <description>Ranger Admin URL</description>
 </property>
 <property>
   <name>ranger.plugin.kudu.policy.source.impl</name>
   <value>org.apache.ranger.admin.client.RangerAdminRESTClient</value>
   <description>Ranger client implementation to retrieve policies from the Ranger service</description>
 </property>
 <property>
   <name>ranger.plugin.kudu.policy.rest.ssl.config.file</name>
   <value>ranger-kudu-policymgr-ssl.xml</value>
   <description>Path to the file containing SSL details to connect to Ranger Admin</description>
 </property>
 <property>
   <name>ranger.plugin.kudu.policy.pollIntervalMs</name>
   <value>30000</value>
   <description>Ranger client policy polling interval</description>
 </property>
 ```

 * When Secure Socket Layer (SSL) is enabled for Ranger Admin, add a
 `ranger-kudu-policymgr-ssl.xml` file to the Ranger client configuration
 directory with the following configurations:

 ```xml
 <property>
   <name>xasecure.policymgr.clientssl.keystore</name>
   <value>[/path/to/keystore].jks</value>
   <description>Java keystore file</description>
 </property>
 <property>
   <name>xasecure.policymgr.clientssl.keystore.credential.file</name>
   <value>jceks://file/[path/to/credentials].jceks</value>
   <description>Java keystore credential file</description>
 </property>
 <property>
   <name>xasecure.policymgr.clientssl.truststore</name>
   <value>[/path/to/truststore].jks</value>
   <description>Java truststore file</description>
 </property>
 <property>
   <name>xasecure.policymgr.clientssl.truststore.credential.file</name>
   <value>jceks://file/[path/to/credentials].jceks</value>
   <description>Java truststore credential file</description>
 </property>
 ```

 * Set the following configurations on the Kudu master:

 ```
 # The path to directory containing Ranger client configuration. This example
 # assumes the path is '/kudu/ranger-config'.
 --ranger_config_path=/kudu/ranger-config

 # The path where the Java binary was installed. This example assumes
 # '$JAVA_HOME=/usr/local'
 --ranger_java_path=/usr/local/bin/java

 # The path to the JAR file containing the Ranger subprocess. This example
 # assumes '$KUDU_HOME=/kudu'
 --ranger_jar_path=/kudu/build/release/bin/kudu-subprocess.jar

 # This example ACL setup allows the 'impala' user to access all data stored in
 # Kudu, assuming Impala will authorize requests on its own. The 'kudu' user is
 # also granted access to all Kudu data, which may facilitate testing and
 # debugging (such as running the 'kudu cluster ksck' tool).
 --trusted_user_acl=impala,kudu
 ```

 * Set the following configurations on the tablet servers:

 ```
 --tserver_enforce_access_control=true
 ```

 * Add a Kudu service repository with the following configurations via the Ranger
 Admin web UI:

 ```xml
 <!-- This example setup configures the Kudu service user as a privileged user
      able to retrieve authorization policies stored in Ranger. -->

 <property>
   <name>policy.download.auth.users</name>
   <value>kudu</value>
 </property>
 ```
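
 The table-name check from the second step above can be automated along these
 lines. This is a sketch only: the function name is hypothetical, the list of
 names is assumed to come from the output of `kudu table list` (one name per
 line), and lowercasing is used here as an approximation of case-insensitive
 comparison.

```python
from collections import defaultdict


def find_incompatible_names(table_names):
    """Return (names starting/ending with '.', groups differing only by case)."""
    bad_periods = [
        name for name in table_names
        if name.startswith(".") or name.endswith(".")
    ]

    # Group names by their lowercased form to spot case-only collisions.
    by_folded = defaultdict(list)
    for name in table_names:
        by_folded[name.lower()].append(name)
    case_collisions = [
        sorted(group) for group in by_folded.values() if len(group) > 1
    ]
    return bad_periods, case_collisions
```

 Any name reported by either list would need to be renamed with
 `kudu table rename_table` before enabling the integration.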

 [[privilege-caching]]
 === Ranger Client Caching
 With Ranger integration, the privilege cache in the Kudu master is disabled,
 since the Ranger client provides its own client-side cache of user privileges
 and periodically polls the privilege store for changes. When a change is
 detected, the cache is automatically updated.

 NOTE: Update the `ranger.plugin.kudu.policy.pollIntervalMs` property specified in
 `ranger-kudu-security.xml` to set how often the Ranger client cache refreshes
 the privileges from the Ranger service.

 [[policy-for-kudu-masters]]
 === Policy for Kudu Masters

 The following authorization policy is enforced by Kudu masters.

 .Authorization Policy for Masters
 [options="header"]
 |===
 | Operation | Required Privilege
 | `CreateTable` | `CREATE ON DATABASE`
 | `CreateTable` with an owner different than the logged in user | `ALL ON
 DATABASE` and `delegate admin`
 | `DeleteTable` | `DROP ON TABLE`
 | `AlterTable` (with no rename) | `ALTER ON TABLE`
 | `AlterTable` (with rename) | `ALL ON TABLE <old-table>` and `CREATE ON DATABASE <new-database>`
 | `AlterTable` (with owner change) | `ALL ON TABLE` and `delegate admin`
 | `IsCreateTableDone` | `METADATA ON TABLE`
 | `IsAlterTableDone` | `METADATA ON TABLE`
 | `ListTables` | `METADATA ON TABLE`
 | `GetTableLocations` | `METADATA ON TABLE`
 | `GetTableSchema` | `METADATA ON TABLE`
 | `GetTabletLocations` | `METADATA ON TABLE`
 |===

 === Policy for Kudu Tablet Servers

 The following authorization policy is enforced by Kudu tablet servers.

 .Authorization Policy for Tablet Servers
 [options="header"]
 |===
 | Operation | Required Privilege
 | `Scan` | `SELECT ON TABLE`, or

 `METADATA ON TABLE` and `SELECT ON COLUMN` for each projected column and each predicate column
 | `Scan` (no projected columns, equivalent to `COUNT(*)`) | `SELECT ON TABLE`, or

 `SELECT ON COLUMN` for each column in the table
 | `Scan` (with virtual columns) | `SELECT ON TABLE`, or

 `SELECT ON COLUMN` for each column in the table
 | `Scan` (in `ORDERED` mode) | `<privileges required for a Scan>` and `SELECT ON COLUMN` for each primary key column
 | `Insert` | `INSERT ON TABLE`
 | `Update` | `UPDATE ON TABLE`
 | `Upsert` | `INSERT ON TABLE` and `UPDATE ON TABLE`
 | `Delete` | `DELETE ON TABLE`
 | `SplitKeyRange` | `SELECT ON COLUMN` for each primary key column and `SELECT ON COLUMN` for each projected column
 | `Checksum` | User must be configured in `--superuser_acl`
 | `ListTablets` | User must be configured in `--superuser_acl`
 |===

 NOTE: Unlike Impala, Kudu only supports all-or-nothing access to a table's
 schema, rather than showing only authorized columns.
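
 The `Scan` rows of the table above can be read as the following decision
 procedure. This is a simplified sketch with a hypothetical function name: it
 assumes the `METADATA ON TABLE` requirement is already satisfied and does not
 model virtual columns.

```python
def can_scan(table_privs, select_columns, all_columns, projected,
             predicates=(), key_columns=(), ordered=False):
    """Return True if a scan is permitted under the rules in the table above.

    table_privs:    table-level privileges held, e.g. {"SELECT"}
    select_columns: columns on which the user holds SELECT ON COLUMN
    all_columns:    every column in the table
    projected:      projected columns (empty means a COUNT(*)-style scan)
    """
    if "SELECT" in table_privs:
        return True
    if projected:
        needed = set(projected) | set(predicates)
    else:
        # No projection: SELECT is needed on every column in the table.
        needed = set(all_columns)
    if ordered:
        # ORDERED mode additionally needs SELECT on each primary key column.
        needed |= set(key_columns)
    return needed <= set(select_columns)
```

 For instance, a user holding `SELECT ON COLUMN` for `id` and `name` can scan
 `name` with a predicate on `id`, but cannot run a scan with no projected
 columns on a table that also contains an `ssn` column.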

 == Encryption

 Kudu allows all communications among servers and between clients and servers
 to be encrypted with TLS.

 Encryption can be configured on Kudu servers using the `--rpc_encryption` flag,
 which can be set to `required`, `optional`, or `disabled`. By default, the flag
 is set to `optional`. When `required`, Kudu will reject unencrypted connections.
 When `optional`, Kudu will attempt to use encryption. As with authentication,
 when `disabled` or encryption fails for `optional`, Kudu will only allow
 unencrypted connections from trusted subnets and reject any unencrypted connections
 from publicly routable IPs. To secure a cluster, use `--rpc_encryption=required`.

 NOTE: Kudu will automatically turn off encryption on local loopback connections,
 since traffic from these connections is never exposed externally. This allows
 locality-aware compute frameworks like Spark and Impala to avoid encryption
 overhead, while still ensuring data confidentiality.

 [[web-ui]]
 == Web UI Encryption

 The Kudu web UI can be configured to use secure HTTPS encryption by providing
 each server with TLS certificates. See <<configuration>> for more information on
 web UI HTTPS configuration.

 == Web UI Redaction

 To prevent sensitive data from being exposed in the web UI, all row data is
 redacted. Table metadata, such as table names, column names, and partitioning
 information is not redacted. The web UI can be completely disabled by setting
 the `--webserver_enabled=false` flag on Kudu servers.

 WARNING: Disabling the web UI will also disable REST endpoints such as
 `/metrics`. Monitoring systems rely on these endpoints to gather metrics data.

 [[logs]]
 == Log Security

 To prevent sensitive data from being included in Kudu server logs, all row data
 is redacted by default. Setting the `--redact=log` flag disables redaction
 in the web UI but retains it for server logs. Alternatively, `--redact=none`
 can be used to disable redaction completely.
 // TODO(dan): add link to configuration reference.

 [[configuration]]
 == Configuring a Secure Kudu Cluster

 The following configuration parameters should be set on all servers (master and
 tablet server) in order to ensure that a Kudu cluster is secure:

 ```
 # Connection Security
 #--------------------
 --rpc_authentication=required
 --rpc_encryption=required
 --keytab_file=<path-to-kerberos-keytab>

 # Web UI Security
 #--------------------
 --webserver_certificate_file=<path-to-cert-pem>
 --webserver_private_key_file=<path-to-key-pem>
 # optional
 --webserver_private_key_password_cmd=<password-cmd>

 # If you prefer to disable the web UI entirely:
 --webserver_enabled=false

 # Coarse-grained authorization
 #--------------------------------

 # This example ACL setup allows the 'impala' user as well as the
 # 'nightly_etl_service_account' principal access to all data in the
 # Kudu cluster. The 'hadoopadmin' user is allowed to use administrative
 # tooling. Note that, by granting access to 'impala', other users
 # may access data in Kudu via the Impala service subject to its own
 # authorization rules.
 --user_acl=impala,nightly_etl_service_account
 --superuser_acl=hadoopadmin
 ```
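
 A flag file can be checked against the connection-security settings above
 with a short script. The sketch below is illustrative only: the function names
 are hypothetical, and it assumes one `--flag=value` entry per line with `#`
 comments, as in the example above.

```python
# Connection-security flags from the example above and their expected values.
REQUIRED_VALUES = {
    "rpc_authentication": "required",
    "rpc_encryption": "required",
}
REQUIRED_PRESENT = ("keytab_file",)


def parse_flagfile(text):
    """Parse '--name=value' lines, skipping blank lines and '#' comments."""
    flags = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.lstrip("-").partition("=")
        flags[name] = value
    return flags


def security_findings(flags):
    """Return human-readable findings; an empty list means all checks passed."""
    findings = []
    for name, expected in REQUIRED_VALUES.items():
        if flags.get(name) != expected:
            findings.append(f"--{name} should be '{expected}'")
    for name in REQUIRED_PRESENT:
        if not flags.get(name):
            findings.append(f"--{name} is not set")
    return findings
```

 Running `security_findings(parse_flagfile(open(path).read()))` on each
 server's flag file, where `path` is the file's location, gives a quick list
 of settings that deviate from the recommendations in this section.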

 See <<ranger-configuration>> for an example of how to enable fine-grained
 authorization via Apache Ranger.

 Further information about these flags can be found in the configuration
 flag reference.
 // TODO(todd) add a link


 [[known-limitations]]
 == Known Limitations

 Kudu has a few known security limitations:

 // TODO(danburkert): add JIRA links for each of these.

 Custom Kerberos Principal:: Kudu does not support setting a custom service
 principal for Kudu processes. The principal must be 'kudu'.

 External PKI:: Kudu does not support externally-issued certificates for internal
 wire encryption (server to server and client to server).

 On-disk Encryption:: Kudu does not have built-in on-disk encryption. However,
 Kudu can be used with whole-disk encryption tools such as dm-crypt.