<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.1" id="authorization">
<title>Enabling Sentry Authorization for Impala</title>
<prolog>
<metadata>
<data name="Category" value="Security"/>
<data name="Category" value="Sentry"/>
<data name="Category" value="Impala"/>
<data name="Category" value="Configuring"/>
<data name="Category" value="Starting and Stopping"/>
<data name="Category" value="Users"/>
<data name="Category" value="Groups"/>
<data name="Category" value="Administrators"/>
</metadata>
</prolog>
<conbody id="sentry">
<p>
Authorization determines which users are allowed to access which resources, and what operations they are
allowed to perform. In Impala 1.1 and higher, you use Apache Sentry for
authorization. Sentry adds a fine-grained authorization framework for Hadoop. By default (when authorization
is not enabled), Impala does all read and write operations with the privileges of the <codeph>impala</codeph>
user, which is suitable for a development/test environment but not for a secure production environment. When
authorization is enabled, Impala uses the OS user ID of the user who runs <cmdname>impala-shell</cmdname> or
other client program, and associates various privileges with each user.
</p>
<note>
Sentry is typically used in conjunction with Kerberos authentication, which defines which hosts are allowed
to connect to each server. Using the combination of Sentry and Kerberos prevents malicious users from being
able to connect by creating a named account on an untrusted machine. See
<xref href="impala_kerberos.xml#kerberos"/> for details about Kerberos authentication.
</note>
<p audience="PDF" outputclass="toc inpage">
See the following sections for details about using the Impala authorization features:
</p>
</conbody>
<concept id="sentry_priv_model">
<title>The Sentry Privilege Model</title>
<conbody>
<p>
Privileges can be granted on different objects in the schema. Any privilege that can be
granted is associated with a level in the object hierarchy. If a privilege is granted on
a parent object in the hierarchy, the child object automatically inherits it. This is
the same privilege model as Hive and other database systems.
</p>
<p>
The objects in the Impala schema hierarchy are:
</p>
<codeblock>Server
URI
Database
Table
Column
</codeblock>
<p rev="2.3.0 collevelauth"> The table-level privileges apply to views as
well. Anywhere you specify a table name, you can specify a view name
instead. </p>
<p rev="2.3.0 collevelauth"> In <keyword keyref="impala23_full"/> and
higher, you can specify privileges for individual columns. </p>
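<p> As a minimal sketch of how the hierarchy plays out, the following
statements grant privileges at the database, table, and column levels.
All role, database, table, and column names here
(<codeph>analyst</codeph>, <codeph>sales</codeph>,
<codeph>customers.accounts</codeph>, <codeph>region</codeph>) are
hypothetical, and the statements assume that the Sentry service is
enabled so that <codeph>GRANT</codeph> is available: </p>
<codeblock>-- Hypothetical names; adjust to your own schema.
CREATE ROLE analyst;

-- Database-level grant: every table and view in sales inherits SELECT.
GRANT SELECT ON DATABASE sales TO ROLE analyst;

-- Table-level grant: applies to this one table (a view name also works here).
GRANT INSERT ON TABLE sales.orders TO ROLE analyst;

-- Column-level grant: SELECT on a single column of another table.
GRANT SELECT(region) ON TABLE customers.accounts TO ROLE analyst;
</codeblock>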
<p conref="../shared/impala_common.xml#common/sentry_privileges_objects"/>
<p> Originally, privileges were encoded in a policy file, stored in HDFS.
This mode of operation is still an option, but the emphasis of privilege
management is moving towards being SQL-based. The mode of operation with
<codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements instead
of the policy file requires that a special Sentry service be enabled;
this service stores, retrieves, and manipulates privilege information
stored inside the metastore database. </p>
<note>
<p>
Although this document refers to the <codeph>ALL</codeph> privilege, currently if you
use the policy file mode, you do not use the actual keyword <codeph>ALL</codeph> in
the policy file. When you code role entries in the policy file:
</p>
<ul>
<li>
To specify the <codeph>ALL</codeph> privilege for a server, use a role like
<codeph>server=<varname>server_name</varname></codeph>.
</li>
<li>
To specify the <codeph>ALL</codeph> privilege for a database, use a role like
<codeph>server=<varname>server_name</varname>-&gt;db=<varname>database_name</varname></codeph>.
</li>
<li>
To specify the <codeph>ALL</codeph> privilege for a table, use a role like
<codeph>server=<varname>server_name</varname>-&gt;db=<varname>database_name</varname>-&gt;table=<varname>table_name</varname>-&gt;action=*</codeph>.
</li>
</ul>
</note>
<p> If you change privileges in Sentry outside of Impala (for example, by
adding a user, removing a user, or modifying privileges), you must clear
the Impala Catalog server cache by running the
<codeph>INVALIDATE METADATA</codeph> statement.
<codeph>INVALIDATE METADATA</codeph> is not required if you make the
changes to privileges within Impala. </p>
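<p> For instance, after privileges are changed outside of Impala, a session
along the following lines clears the cache and then checks the result.
This is a sketch only; the role name <codeph>analyst</codeph> is
hypothetical: </p>
<codeblock>-- Clear the Catalog server cache so that the changed privileges are picked up.
INVALIDATE METADATA;

-- Check which roles exist and what the hypothetical analyst role can do.
SHOW ROLES;
SHOW GRANT ROLE analyst;
</codeblock>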
</conbody>
</concept>
<concept id="secure_startup">
<title>Starting the impalad Daemon with Sentry Authorization Enabled</title>
<prolog>
<metadata>
<data name="Category" value="Starting and Stopping"/>
</metadata>
</prolog>
<conbody>
<p>
To run the <cmdname>impalad</cmdname> daemon with authorization enabled, you add one or more options to the
<codeph>IMPALA_SERVER_ARGS</codeph> declaration in the <filepath>/etc/default/impala</filepath>
configuration file:
</p>
<ul>
<li>
<codeph>-server_name</codeph>: Turns on Sentry authorization for
Impala. The authorization rules refer to a symbolic server name, and
you specify the same name to use as the argument to the
<codeph>-server_name</codeph> option for all
<cmdname>impalad</cmdname> nodes in the cluster. <p> In
Impala 1.4.0 and higher, if you specify just
<codeph>-server_name</codeph> without
<codeph>-authorization_policy_file</codeph>, Impala uses the
Sentry service for authorization. </p>
</li>
<li>
<codeph>-sentry_config</codeph>: Specifies the local path to the
<codeph>sentry-site.xml</codeph> configuration file. This setting is
required to enable authorization. </li>
<li>
<codeph>-authorization_policy_file</codeph>: Specifies the HDFS path
to the policy file that defines the privileges on schema objects.
Prior to Impala 1.4.0, or if you want to continue storing privilege
rules in the policy file, specify the
<codeph>-authorization_policy_file</codeph> option to make Impala
read privilege information from a policy file, rather than from the
metastore database. </li>
</ul>
<p rev="1.4.0">
For example, you might adapt your <filepath>/etc/default/impala</filepath> configuration to contain lines
like the following. To use the Sentry service rather than the policy file:
</p>
<codeblock rev="1.4.0">IMPALA_SERVER_ARGS=" \
-server_name=server1 \
...
</codeblock>
<p>
Or to use the policy file, as in releases prior to Impala 1.4:
</p>
<codeblock>IMPALA_SERVER_ARGS=" \
-authorization_policy_file=/user/hive/warehouse/auth-policy.ini \
-server_name=server1 \
...
</codeblock>
<p>
The preceding examples set up a symbolic name of <codeph>server1</codeph> to refer to
the current instance of Impala. Specify the symbolic name for the
<codeph>sentry.hive.server</codeph> property in the <filepath>sentry-site.xml</filepath>
configuration file for Hive, as well as in the <codeph>-server_name</codeph> option for
<cmdname>impalad</cmdname>.
</p>
<p> Now restart the <cmdname>impalad</cmdname> daemons on all the nodes. </p>
</conbody>
</concept>
<concept id="sentry_service">
<title>Using Impala with the Sentry Service</title>
<conbody>
<p> When you use the Sentry service, you set up privileges through the
<codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements in
either Impala or Hive. Then both components use those same privileges
automatically. (Impala added the <codeph>GRANT</codeph> and
<codeph>REVOKE</codeph> statements in <keyword keyref="impala20_full"
/>.) </p>
<p> For information about using the Impala <codeph>GRANT</codeph> and
<codeph>REVOKE</codeph> statements, see <xref
href="impala_grant.xml#grant"/> and <xref
href="impala_revoke.xml#revoke"/>. </p>
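<p> A short sketch of the typical cycle follows: create a role, attach it
to a group, grant a privilege, and later withdraw it. The role
<codeph>report_reader</codeph>, the group <codeph>analysts</codeph>, and
the table <codeph>reporting.daily_totals</codeph> are hypothetical names
used only for illustration: </p>
<codeblock>-- Set up a role and associate it with an OS group.
CREATE ROLE report_reader;
GRANT ROLE report_reader TO GROUP analysts;
GRANT SELECT ON TABLE reporting.daily_totals TO ROLE report_reader;

-- Later, withdraw the privilege, detach the role from the group, and remove it.
REVOKE SELECT ON TABLE reporting.daily_totals FROM ROLE report_reader;
REVOKE ROLE report_reader FROM GROUP analysts;
DROP ROLE report_reader;
</codeblock>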
<p> URIs represent the file paths you specify as part of statements such
as <codeph>CREATE EXTERNAL TABLE</codeph> and <codeph>LOAD
DATA</codeph>. Typically, you specify what look like UNIX paths, but
these locations can also be prefixed with <codeph>hdfs://</codeph> to
make clear that they are really URIs. To set privileges for a URI,
specify the name of a directory, and the privilege applies to all the
files in that directory and any directories underneath it. </p>
<p> URIs must start with <codeph>hdfs://</codeph>,
<codeph>s3a://</codeph>, <codeph>adl://</codeph>, or
<codeph>file://</codeph>. If a URI starts with an absolute path, the
path will be appended to the default filesystem prefix. For example, if
you specify: <codeblock>
GRANT ALL ON URI '/tmp';
</codeblock> When the default filesystem is HDFS, the above
statement is effectively treated as the following:
<codeblock>
GRANT ALL ON URI 'hdfs://localhost:20500/tmp';
</codeblock>
</p>
<p> When defining URIs for HDFS, you must also specify the NameNode. For
example: <codeblock>GRANT ALL ON URI file:///path/to/dir TO &lt;role>
GRANT ALL ON URI hdfs://namenode:port/path/to/dir TO &lt;role></codeblock>
<note type="warning">
<p> Because the NameNode host and port must be specified, it is
strongly recommended that you use High Availability (HA). This
ensures that the URI will remain constant even if the NameNode
changes. For example: </p>
<codeblock>GRANT ALL ON URI hdfs://ha-nn-uri/path/to/dir TO &lt;role></codeblock>
</note>
</p>
</conbody>
</concept>
<concept id="concept_k45_lbm_f2b">
<title>Examples of Setting up Authorization for Security Scenarios</title>
<conbody>
<p> The following examples show how to set up authorization to deal with
various scenarios. </p>
<example>
<title>A User with No Privileges</title>
<p> If a user has no privileges at all, that user cannot access any
schema objects in the system. The error messages do not disclose the
names or existence of objects that the user is not authorized to read. </p>
<p> This is the experience you want a user to have if they somehow log
into a system where they are not an authorized Impala user. Or in a
real deployment, a user might have no privileges because they are not
a member of any of the authorized groups. </p>
</example>
<example>
<title>Examples of Privileges for Administrative Users</title>
<p> In this example, the SQL statements grant the
<codeph>entire_server</codeph> role all privileges on both the
databases and URIs within the server. </p>
<codeblock>CREATE ROLE entire_server;
GRANT ROLE entire_server TO GROUP admin_group;
GRANT ALL ON SERVER server1 TO ROLE entire_server;
</codeblock>
</example>
<example>
<title>A User with Privileges for Specific Databases and Tables</title>
<p> If a user has privileges for specific tables in specific databases,
the user can access those objects but nothing else. They can see the
tables and their parent databases in the output of <codeph>SHOW
TABLES</codeph> and <codeph>SHOW DATABASES</codeph>,
<codeph>USE</codeph> the appropriate databases, and perform the
relevant actions (<codeph>SELECT</codeph> and/or
<codeph>INSERT</codeph>) based on the table privileges. To actually
create a table requires the <codeph>ALL</codeph> privilege at the
database level, so you might define separate roles for the user that
sets up a schema and other users or applications that perform
day-to-day operations on the tables. </p>
<codeblock>
CREATE ROLE one_database;
GRANT ROLE one_database TO GROUP admin_group;
GRANT ALL ON DATABASE db1 TO ROLE one_database;
CREATE ROLE instructor;
GRANT ROLE instructor TO GROUP trainers;
GRANT ALL ON TABLE db1.lesson TO ROLE instructor;
-- This particular course is all about queries, so the students can SELECT but not INSERT or CREATE/DROP.
CREATE ROLE student;
GRANT ROLE student TO GROUP visitors;
GRANT SELECT ON TABLE db1.training TO ROLE student;</codeblock>
</example>
<example>
<title>Privileges for Working with External Data Files</title>
<p> When data is being inserted through the <codeph>LOAD DATA</codeph>
statement, or is referenced from an HDFS location outside the normal
Impala database directories, the user also needs appropriate
permissions on the URIs corresponding to those HDFS locations. </p>
<p> In this example: </p>
<ul>
<li> The <codeph>external_table</codeph> role can insert into and
query the Impala table, <codeph>external_table.sample</codeph>. </li>
<li> The <codeph>staging_dir</codeph> role can specify the HDFS path
<filepath>/user/impala-user/external_data</filepath> with the
<codeph>LOAD DATA</codeph> statement. When Impala queries or loads
data files, it operates on all the files in that directory, not just
a single file, so any Impala <codeph>LOCATION</codeph> parameters
refer to a directory rather than an individual file. </li>
</ul>
<codeblock>CREATE ROLE external_table;
GRANT ROLE external_table TO GROUP impala_users;
GRANT ALL ON TABLE external_table.sample TO ROLE external_table;
CREATE ROLE staging_dir;
GRANT ROLE staging_dir TO GROUP impala_users;
GRANT ALL ON URI 'hdfs://127.0.0.1:8020/user/impala-user/external_data' TO ROLE staging_dir;</codeblock>
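<p> With privileges on both the table and the URI, a member of the
<codeph>impala_users</codeph> group could then move the staged files into
the table. The following is a sketch that assumes data files in a format
matching the table already exist in that directory: </p>
<codeblock>-- Needs the URI privilege on the source directory and
-- write access (here, ALL) on the destination table.
LOAD DATA INPATH 'hdfs://127.0.0.1:8020/user/impala-user/external_data'
  INTO TABLE external_table.sample;
</codeblock>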
</example>
<example>
<title>Separating Administrator Responsibility from Read and Write
Privileges</title>
<p> To create a database, you need the full privilege on that database,
while day-to-day operations on tables within that database can be
performed with lower levels of privilege on specific tables. Thus, you
might set up separate roles for each database or application: an
administrative one that could create or drop the database, and a
user-level one that can access only the relevant tables. </p>
<p> In this example, the responsibilities are divided between users in 3
different groups: </p>
<ul>
<li> Members of the <codeph>supergroup</codeph> group have the
<codeph>training_sysadmin</codeph> role and so can set up a
database named <codeph>training1</codeph>. </li>
<li> Members of the <codeph>impala_users</codeph> group have the
<codeph>instructor</codeph> role and so can insert into and query
the <codeph>course1</codeph> table in the <codeph>training1</codeph>
database, but cannot create or drop the database itself. </li>
<li> Members of the <codeph>visitor</codeph> group have the
<codeph>student</codeph> role and so can query that table in the
<codeph>training1</codeph> database. </li>
</ul>
<codeblock>CREATE ROLE training_sysadmin;
GRANT ROLE training_sysadmin TO GROUP supergroup;
GRANT ALL ON DATABASE training1 TO ROLE training_sysadmin;
CREATE ROLE instructor;
GRANT ROLE instructor TO GROUP impala_users;
GRANT ALL ON TABLE training1.course1 TO ROLE instructor;
CREATE ROLE student;
GRANT ROLE student TO GROUP visitor;
GRANT SELECT ON TABLE training1.course1 TO ROLE student;</codeblock>
</example>
</conbody>
</concept>
<concept id="security_policy_file">
<title>Using Impala with the Sentry Policy File</title>
<conbody>
<p> The policy file is a file that you put in a designated location in
HDFS, and is read during the startup of the <cmdname>impalad</cmdname>
daemon when you specify both the <codeph>-server_name</codeph> and
<codeph>-authorization_policy_file</codeph> startup options. It
controls which objects (databases, tables, and HDFS directory paths) can
be accessed by the user who connects to <cmdname>impalad</cmdname>, and
what operations that user can perform on the objects. </p>
<note rev="1.4.0"> In <ph rev="upstream">CDH 5</ph> and higher, <ph
rev="upstream">Cloudera</ph> recommends managing privileges through
SQL statements, as described in <xref
href="impala_authorization.xml#sentry_service"/>. If you are still
using policy files, plan to migrate to the new approach some time in the
future. </note>
<p> The location of the policy file is listed in the
<filepath>auth-site.xml</filepath> configuration file. </p>
<p> When authorization is enabled, Impala uses the policy file as a
<i>whitelist</i>, representing every privilege available to any user
on any object. That is, only operations specified for the appropriate
combination of object, role, group, and user are allowed. All other
operations are not allowed. If a group or role is defined multiple times
in the policy file, the last definition takes precedence. </p>
<p> To understand the notion of whitelisting, set up a minimal policy file
that does not provide any privileges for any object. When you connect to
an Impala node where this policy file is in effect, you get no results
for <codeph>SHOW DATABASES</codeph>, and an error when you issue
<codeph>SHOW TABLES</codeph>, <codeph>USE
<varname>database_name</varname></codeph>, <codeph>DESCRIBE
<varname>table_name</varname></codeph>, <codeph>SELECT</codeph>, or
any other statement that expects to access databases or tables, even if
the corresponding databases and tables exist. </p>
<p> The contents of the policy file are cached, to avoid a performance
penalty for each query. The policy file is re-checked by each
<cmdname>impalad</cmdname> node every 5 minutes. When you make a
non-time-sensitive change such as adding new privileges or new users,
you can let the change take effect automatically a few minutes later. If
you remove or reduce privileges, and want the change to take effect
immediately, restart the <cmdname>impalad</cmdname> daemon on all nodes,
again specifying the <codeph>-server_name</codeph> and
<codeph>-authorization_policy_file</codeph> options so that the rules
from the updated policy file are applied. </p>
</conbody>
<concept id="security_policy_file_details">
<title>Policy File Format</title>
<conbody>
<p> The policy file uses the familiar <codeph>.ini</codeph> format,
divided into the major sections <codeph>[groups]</codeph> and
<codeph>[roles]</codeph>. </p>
<p> There is also an optional <codeph>[databases]</codeph> section,
which allows you to specify a specific policy file for a particular
database, as explained in <xref href="#security_multiple_policy_files"
/>. </p>
<p> Another optional section, <codeph>[users]</codeph>, allows you to
override the OS-level mapping of users to groups; that is an advanced
technique primarily for testing and debugging, and is beyond the scope
of this document. </p>
<p> In the <codeph>[groups]</codeph> section, you define various
categories of users and select which roles are associated with each
category. The group and usernames correspond to Linux groups and users
on the server where the <cmdname>impalad</cmdname> daemon runs. </p>
<p> The group and usernames in the <codeph>[groups]</codeph> section
correspond to Hadoop groups and users on the server where the
<cmdname>impalad</cmdname> daemon runs. When you access Impala
through the <cmdname>impala-shell</cmdname> interpreter, for purposes of
authorization, the user is the logged-in Linux user and the groups are
the Linux groups that user is a member of. When you access Impala
through the ODBC or JDBC interfaces, the user and password specified
through the connection string are used as login credentials for the
Linux server, and authorization is based on that username and the
associated Linux group membership. </p>
<p> In the <codeph>[roles]</codeph> section, you define a set of roles. For
each role, you specify precisely the set of privileges that is available:
which objects users with that role can access, and what
operations they can perform on those objects. This is the lowest-level
category of security information; the other sections in the policy
file map the privileges to higher-level divisions of groups and users.
In the <codeph>[groups]</codeph> section, you specify which roles are
associated with which groups. The privileges are specified
using patterns like:
<codeblock>server=<varname>server_name</varname>-&gt;db=<varname>database_name</varname>-&gt;table=<varname>table_name</varname>-&gt;action=SELECT
server=<varname>server_name</varname>-&gt;db=<varname>database_name</varname>-&gt;table=<varname>table_name</varname>-&gt;action=CREATE
server=<varname>server_name</varname>-&gt;db=<varname>database_name</varname>-&gt;table=<varname>table_name</varname>-&gt;action=ALL
</codeblock>
For the <varname>server_name</varname> value, substitute the same
symbolic name you specify with the <cmdname>impalad</cmdname>
<codeph>-server_name</codeph> option. You can use <codeph>*</codeph>
wildcard characters at each level of the privilege specification to
allow access to all such objects. For example:
<codeblock>server=impala-host.example.com-&gt;db=default-&gt;table=t1-&gt;action=SELECT
server=impala-host.example.com-&gt;db=*-&gt;table=*-&gt;action=CREATE
server=impala-host.example.com-&gt;db=*-&gt;table=audit_log-&gt;action=SELECT
server=impala-host.example.com-&gt;db=default-&gt;table=t1-&gt;action=*
</codeblock>
</p>
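<p> Putting the two sections together, a minimal policy file might look
like the following sketch. The group names, role names, database and
table names, and the server name <codeph>server1</codeph> are all
hypothetical; the server name must match whatever you pass to the
<codeph>-server_name</codeph> option: </p>
<codeblock>[groups]
# Each entry maps a group to one or more roles (hypothetical names).
analysts = read_only_role
etl_users = read_only_role, load_role

[roles]
# Each entry maps a role to a privilege pattern.
read_only_role = server=server1-&gt;db=sales-&gt;table=*-&gt;action=SELECT
load_role = server=server1-&gt;db=sales-&gt;table=orders-&gt;action=INSERT
</codeblock>
<p> In this sketch, members of <codeph>analysts</codeph> can only query
tables in the <codeph>sales</codeph> database, while members of
<codeph>etl_users</codeph> can additionally insert into
<codeph>sales.orders</codeph>. Any operation not listed is denied. </p>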
</conbody>
</concept>
<concept id="security_multiple_policy_files">
<title>Using Multiple Policy Files for Different Databases</title>
<conbody>
<p> For an Impala cluster with many databases being accessed by many
users and applications, it might be cumbersome to update the security
policy file for each privilege change or each new database, table, or
view. You can allow security to be managed separately for individual
databases, by setting up a separate policy file for each database: </p>
<ul>
<li> Add the optional <codeph>[databases]</codeph> section to the main
policy file. </li>
<li> Add entries in the <codeph>[databases]</codeph> section for each
database that has its own policy file. </li>
<li> For each listed database, specify the HDFS path of the
appropriate policy file. </li>
</ul>
<p> For example: </p>
<codeblock>[databases]
# Defines the location of the per-DB policy files for the 'customers' and 'sales' databases.
customers = hdfs://ha-nn-uri/etc/access/customers.ini
sales = hdfs://ha-nn-uri/etc/access/sales.ini
</codeblock>
<p> To enable URIs in per-DB policy files, the Java configuration option
<codeph>sentry.allow.uri.db.policyfile</codeph> must be set to
<codeph>true</codeph>. For example: </p>
<codeblock>JAVA_TOOL_OPTIONS="-Dsentry.allow.uri.db.policyfile=true"
</codeblock>
<note type="important"> Enabling URIs in per-DB policy files introduces
a security risk by allowing the owner of the db-level policy file to
grant themselves load privileges to anything the
<codeph>impala</codeph> user has read permissions for in HDFS
(including data in other databases controlled by different db-level
policy files). </note>
</conbody>
</concept>
</concept>
<concept id="security_schema">
<title>Setting Up Schema Objects for a Secure Impala Deployment</title>
<conbody>
<p> In your role definitions, you must specify privileges at the level of
individual databases and tables, or all databases or all tables within a
database. To simplify the structure of these rules, plan ahead of time
how to name your schema objects so that data with different
authorization requirements is divided into separate databases. </p>
<p> If you are adding security on top of an existing Impala deployment,
you can rename tables or even move them between databases using the
<codeph>ALTER TABLE</codeph> statement. </p>
</conbody>
</concept>
<concept id="sentry_debug">
<title><ph conref="../shared/impala_common.xml#common/title_sentry_debug"
/></title>
<conbody>
<p conref="../shared/impala_common.xml#common/sentry_debug"/>
</conbody>
</concept>
<concept id="sec_ex_default">
<title>The DEFAULT Database in a Secure Deployment</title>
<conbody>
<p> Because of the extra emphasis on granular access controls in a secure
deployment, you should move any important or sensitive information out
of the <codeph>DEFAULT</codeph> database into a named database whose
privileges are specified in the policy file. Sometimes you might need to
give privileges on the <codeph>DEFAULT</codeph> database for
administrative reasons; for example, as a place you can reliably specify
with a <codeph>USE</codeph> statement when preparing to drop a database.
</p>
</conbody>
</concept>
</concept>