blob: 33363a3772e4b41f9458d1ad3f863d08ca92c6b1 [file] [log] [blame]
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.1" id="authorization">
<title>Enabling Sentry Authorization for Impala</title>
<prolog>
<metadata>
<data name="Category" value="Security"/>
<data name="Category" value="Sentry"/>
<data name="Category" value="Impala"/>
<data name="Category" value="Configuring"/>
<data name="Category" value="Starting and Stopping"/>
<data name="Category" value="Users"/>
<data name="Category" value="Groups"/>
<data name="Category" value="Administrators"/>
</metadata>
</prolog>
<conbody id="sentry">
<p>
Authorization determines which users are allowed to access which resources, and what
operations they are allowed to perform. In Impala 1.1 and higher, you use Apache Sentry
for authorization. Sentry adds a fine-grained authorization framework for Hadoop. By
default (when authorization is not enabled), Impala does all read and write operations
with the privileges of the <codeph>impala</codeph> user, which is suitable for a
development/test environment but not for a secure production environment. When
authorization is enabled, Impala uses the OS user ID of the user who runs
<cmdname>impala-shell</cmdname> or other client program, and associates various privileges
with each user.
</p>
<note>
Sentry is typically used in conjunction with Kerberos authentication, which defines which
hosts are allowed to connect to each server. Using the combination of Sentry and Kerberos
prevents malicious users from being able to connect by creating a named account on an
untrusted machine. See <xref href="impala_kerberos.xml#kerberos"/> for details about
Kerberos authentication.
</note>
<p audience="PDF" outputclass="toc inpage">
See the following sections for details about using the Impala authorization features:
</p>
</conbody>
<concept id="sentry_priv_model">
<title>The Sentry Privilege Model</title>
<conbody>
<p>
Privileges can be granted on different objects in the schema. Any privilege that can be
granted is associated with a level in the object hierarchy. If a privilege is granted on
a parent object in the hierarchy, the child object automatically inherits it. This is
the same privilege model as Hive and other database systems.
</p>
<p>
The objects in the Impala schema hierarchy are:
</p>
<codeblock>Server
URI
Database
Table
Column
</codeblock>
<p rev="2.3.0 collevelauth">
The table-level privileges apply to views as well. Anywhere you specify a table name,
you can specify a view name instead.
</p>
<p rev="2.3.0 collevelauth">
In <keyword keyref="impala23_full"/> and higher, you can specify privileges for
individual columns.
</p>
<p conref="../shared/impala_common.xml#common/sentry_privileges_objects"/>
<p>
Originally, privileges were encoded in a policy file, stored in HDFS. This mode of
operation is still an option, but the emphasis of privilege management is moving towards
being SQL-based. The mode of operation with <codeph>GRANT</codeph> and
<codeph>REVOKE</codeph> statements instead of the policy file requires that a special
Sentry service be enabled; this service stores, retrieves, and manipulates privilege
information stored inside the metastore database.
</p>
<note>
<p>
Although this document refers to the <codeph>ALL</codeph> privilege, currently if you
use the policy file mode, you do not use the actual keyword <codeph>ALL</codeph> in
the policy file. When you code role entries in the policy file:
</p>
<ul>
<li>
To specify the <codeph>ALL</codeph> privilege for a server, use a role like
<codeph>server=<varname>server_name</varname></codeph>.
</li>
<li>
To specify the <codeph>ALL</codeph> privilege for a database, use a role like
<codeph>server=<varname>server_name</varname>-&gt;db=<varname>database_name</varname></codeph>.
</li>
<li>
To specify the <codeph>ALL</codeph> privilege for a table, use a role like
<codeph>server=<varname>server_name</varname>-&gt;db=<varname>database_name</varname>-&gt;table=<varname>table_name</varname>-&gt;action=*</codeph>.
</li>
</ul>
</note>
<p>
If you change privileges in Sentry, e.g. adding a user, removing a user, modifying
privileges, you must clear the Impala Catalog server cache by running the
<codeph>INVALIDATE METADATA</codeph> statement. <codeph>INVALIDATE METADATA</codeph> is
not required if you make the changes to privileges within Impala.
</p>
</conbody>
</concept>
<concept id="secure_startup">
<title>Starting the impalad Daemon with Sentry Authorization Enabled</title>
<prolog>
<metadata>
<data name="Category" value="Starting and Stopping"/>
</metadata>
</prolog>
<conbody>
<p>
To run the <cmdname>impalad</cmdname> daemon with authorization enabled, you add one or
more options to the <codeph>IMPALA_SERVER_ARGS</codeph> declaration in the
<filepath>/etc/default/impala</filepath> configuration file:
</p>
<ul>
<li>
<codeph>-server_name</codeph>: Turns on Sentry authorization for Impala. The
authorization rules refer to a symbolic server name, and you specify the same name to
use as the argument to the <codeph>-server_name</codeph> option for all
<cmdname>impalad</cmdname> nodes in the cluster.
<p>
Starting in Impala 1.4.0 and higher, if you specify just
<codeph>-server_name</codeph> without <codeph>-authorization_policy_file</codeph>,
Impala uses the Sentry service for authorization.
</p>
</li>
<li>
<codeph>-sentry_config</codeph>: Specifies the local path to the
<codeph>sentry-site.xml</codeph> configuration file. This setting is required to
enable authorization.
</li>
<li>
<codeph>-authorization_policy_file</codeph>: Specifies the HDFS path to the policy
file that defines the privileges on schema objects. Prior to Impala 1.4.0, or if you
want to continue storing privilege rules in the policy file, specify the
<codeph>-authorization_policy_file</codeph> option to make Impala read privilege
information from a policy file, rather than from the metastore database.
</li>
</ul>
<p rev="1.4.0">
For example, you might adapt your <filepath>/etc/default/impala</filepath> configuration
to contain lines like the following. To use the Sentry service rather than the policy
file:
</p>
<codeblock rev="1.4.0">IMPALA_SERVER_ARGS=" \
-server_name=server1 \
...
</codeblock>
<p>
Or to use the policy file, as in releases prior to Impala 1.4:
</p>
<codeblock>IMPALA_SERVER_ARGS=" \
-authorization_policy_file=/user/hive/warehouse/auth-policy.ini \
-server_name=server1 \
...
</codeblock>
<p>
The preceding examples set up a symbolic name of <codeph>server1</codeph> to refer to
the current instance of Impala. Specify the symbolic name for the
<codeph>sentry.hive.server</codeph> property in the <filepath>sentry-site.xml</filepath>
configuration file for Hive, as well as in the <codeph>-server_name</codeph> option for
<cmdname>impalad</cmdname>.
</p>
<p>
Now restart the <cmdname>impalad</cmdname> daemons on all the nodes.
</p>
</conbody>
</concept>
<concept id="sentry_service">
<title>Using Impala with the Sentry Service</title>
<conbody>
<p>
When you use the Sentry service, you set up privileges through the
<codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements in either Impala or Hive.
Then both components use those same privileges automatically. (Impala added the
<codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements in
<keyword keyref="impala20_full"
/>.)
</p>
<p>
For information about using the Impala <codeph>GRANT</codeph> and
<codeph>REVOKE</codeph> statements, see <xref
href="impala_grant.xml#grant"/>
and <xref
href="impala_revoke.xml#revoke"/>.
</p>
</conbody>
<concept id="changing_privileges">
<title>Changing Privileges</title>
<conbody>
<p>
If you make a change to privileges in Sentry from outside of Impala, e.g. adding a
user, removing a user, modifying privileges, there are two options to propagate the
change:
</p>
<ul>
<li>
Use the <codeph>catalogd</codeph> flag,
<codeph>--sentry_catalog_polling_frequency_s</codeph> to specify how often to do a
Sentry refresh. The flag is set to 60 seconds by default.
</li>
<li>
Run the <codeph>INVALIDATE METADATA</codeph> statement to force a Sentry refresh.
<codeph>INVALIDATE METADATA</codeph> forces a Sentry refresh regardless of the
<codeph>--sentry_catalog_polling_fequency_s</codeph> flag.
</li>
</ul>
<p>
If you make a change to privileges within Impala, <codeph>INVALIDATE METADATA</codeph>
is not required.
</p>
<note type="warning">
As <codeph>INVALIDATE METADATA</codeph> is an expensive operation, you should use it
judiciously.
</note>
</conbody>
</concept>
<concept id="granting_on_uri">
<title>Granting Privileges on URI</title>
<conbody>
<p>
URIs represent the file paths you specify as part of statements such as <codeph>CREATE
EXTERNAL TABLE</codeph> and <codeph>LOAD DATA</codeph>. Typically, you specify what
look like UNIX paths, but these locations can also be prefixed with
<codeph>hdfs://</codeph> to make clear that they are really URIs. To set privileges
for a URI, specify the name of a directory, and the privilege applies to all the files
in that directory and any directories underneath it.
</p>
<p>
URIs must start with <codeph>hdfs://</codeph>, <codeph>s3a://</codeph>,
<codeph>adl://</codeph>, or <codeph>file://</codeph>. If a URI starts with an absolute
path, the path will be appended to the default filesystem prefix. For example, if you
specify:
<codeblock>
GRANT ALL ON URI '/tmp';
</codeblock>
The above statement effectively becomes the following where the default filesystem is
HDFS.
<codeblock>
GRANT ALL ON URI 'hdfs://localhost:20500/tmp';
</codeblock>
</p>
<p>
When defining URIs for HDFS, you must also specify the NameNode. For example:
<codeblock>GRANT ALL ON URI file:///path/to/dir TO &lt;role>
GRANT ALL ON URI hdfs://namenode:port/path/to/dir TO &lt;role></codeblock>
<note type="warning">
Because the NameNode host and port must be specified, it is strongly recommended
that you use High Availability (HA). This ensures that the URI will remain constant
even if the NameNode changes. For example:
<codeblock>GRANT ALL ON URI hdfs://ha-nn-uri/path/to/dir TO &lt;role></codeblock>
</note>
</p>
</conbody>
</concept>
<concept id="concept_k45_lbm_f2b">
<title>Examples of Setting up Authorization for Security Scenarios</title>
<conbody>
<p>
The following examples show how to set up authorization to deal with various
scenarios.
</p>
<example>
<title>A User with No Privileges</title>
<p>
If a user has no privileges at all, that user cannot access any schema objects in
the system. The error messages do not disclose the names or existence of objects
that the user is not authorized to read.
</p>
<p>
This is the experience you want a user to have if they somehow log into a system
where they are not an authorized Impala user. Or in a real deployment, a user might
have no privileges because they are not a member of any of the authorized groups.
</p>
</example>
<example>
<title>Examples of Privileges for Administrative Users</title>
<p>
In this example, the SQL statements grant the <codeph>entire_server</codeph> role
all privileges on both the databases and URIs within the server.
</p>
<codeblock>CREATE ROLE entire_server;
GRANT ROLE entire_server TO GROUP admin_group;
GRANT ALL ON SERVER server1 TO ROLE entire_server;
</codeblock>
</example>
<example>
<title>A User with Privileges for Specific Databases and Tables</title>
<p>
If a user has privileges for specific tables in specific databases, the user can
access those things but nothing else. They can see the tables and their parent
databases in the output of <codeph>SHOW TABLES</codeph> and <codeph>SHOW
DATABASES</codeph>, <codeph>USE</codeph> the appropriate databases, and perform the
relevant actions (<codeph>SELECT</codeph> and/or <codeph>INSERT</codeph>) based on
the table privileges. To actually create a table requires the <codeph>ALL</codeph>
privilege at the database level, so you might define separate roles for the user
that sets up a schema and other users or applications that perform day-to-day
operations on the tables.
</p>
<codeblock>
CREATE ROLE one_database;
GRANT ROLE one_database TO GROUP admin_group;
GRANT ALL ON DATABASE db1 TO ROLE one_database;
CREATE ROLE instructor;
GRANT ROLE instructor TO GROUP trainers;
GRANT ALL ON TABLE db1.lesson TO ROLE instructor;
# This particular course is all about queries, so the students can SELECT but not INSERT or CREATE/DROP.
CREATE ROLE student;
GRANT ROLE student TO GROUP visitors;
GRANT SELECT ON TABLE db1.training TO ROLE student;</codeblock>
</example>
<example>
<title>Privileges for Working with External Data Files</title>
<p>
When data is being inserted through the <codeph>LOAD DATA</codeph> statement, or is
referenced from an HDFS location outside the normal Impala database directories, the
user also needs appropriate permissions on the URIs corresponding to those HDFS
locations.
</p>
<p>
In this example:
</p>
<ul>
<li>
The <codeph>external_table</codeph> role can insert into and query the Impala
table, <codeph>external_table.sample</codeph>.
</li>
<li>
The <codeph>staging_dir</codeph> role can specify the HDFS path
<filepath>/user/impala-user/external_data</filepath> with the <codeph>LOAD
DATA</codeph> statement. When Impala queries or loads data files, it operates on
all the files in that directory, not just a single file, so any Impala
<codeph>LOCATION</codeph> parameters refer to a directory rather than an
individual file.
</li>
</ul>
<codeblock>CREATE ROLE external_table;
GRANT ROLE external_table TO GROUP impala_users;
GRANT ALL ON TABLE external_table.sample TO ROLE external_table;
CREATE ROLE staging_dir;
GRANT ROLE staging TO GROUP impala_users;
GRANT ALL ON URI 'hdfs://127.0.0.1:8020/user/impala-user/external_data' TO ROLE staging_dir;</codeblock>
</example>
<example>
<title>Separating Administrator Responsibility from Read and Write Privileges</title>
<p>
To create a database, you need the full privilege on that database while day-to-day
operations on tables within that database can be performed with lower levels of
privilege on specific table. Thus, you might set up separate roles for each database
or application: an administrative one that could create or drop the database, and a
user-level one that can access only the relevant tables.
</p>
<p>
In this example, the responsibilities are divided between users in 3 different
groups:
</p>
<ul>
<li>
Members of the <codeph>supergroup</codeph> group have the
<codeph>training_sysadmin</codeph> role and so can set up a database named
<codeph>training</codeph>.
</li>
<li>
Members of the <codeph>impala_users</codeph> group have the
<codeph>instructor</codeph> role and so can create, insert into, and query any
tables in the <codeph>training</codeph> database, but cannot create or drop the
database itself.
</li>
<li>
Members of the <codeph>visitor</codeph> group have the <codeph>student</codeph>
role and so can query those tables in the <codeph>training</codeph> database.
</li>
</ul>
<codeblock>CREATE ROLE training_sysadmin;
GRANT ROLE training_sysadmin TO GROUP supergroup;
GRANT ALL ON DATABASE training1 TO ROLE training_sysadmin;
CREATE ROLE instructor;
GRANT ROLE instructor TO GROUP impala_users;
GRANT ALL ON TABLE training1.course1 TO ROLE instructor;
CREATE ROLE visitor;
GRANT ROLE student TO GROUP visitor;
GRANT SELECT ON TABLE training1.course1 TO ROLE student;</codeblock>
</example>
</conbody>
</concept>
</concept>
<concept id="security_policy_file">
<title>Using Impala with the Sentry Policy File</title>
<conbody>
<p>
The policy file is a file that you put in a designated location in HDFS, and is read
during the startup of the <cmdname>impalad</cmdname> daemon when you specify both the
<codeph>-server_name</codeph> and <codeph>-authorization_policy_file</codeph> startup
options. It controls which objects (databases, tables, and HDFS directory paths) can be
accessed by the user who connects to <cmdname>impalad</cmdname>, and what operations
that user can perform on the objects.
</p>
<note rev="1.4.0">
The policy-file based authorization was deprecated in <keyword keyref="impala26"/>. We
recommend managing privileges through SQL statements as described in
<xref
href="impala_authorization.xml#sentry_service"/>. If you are still using
policy files, plan to migrate to the new approach some time in the future.
</note>
<p>
The location of the policy file is listed in the <filepath>auth-site.xml</filepath>
configuration file.
</p>
<p>
When authorization is enabled, Impala uses the policy file as a <i>whitelist</i>,
representing every privilege available to any user on any object. That is, only
operations specified for the appropriate combination of object, role, group, and user
are allowed. All other operations are not allowed. If a group or role is defined
multiple times in the policy file, the last definition takes precedence.
</p>
<p>
To understand the notion of whitelisting, set up a minimal policy file that does not
provide any privileges for any object. When you connect to an Impala node where this
policy file is in effect, you get no results for <codeph>SHOW DATABASES</codeph>, and an
error when you issue any <codeph>SHOW TABLES</codeph>, <codeph>USE
<varname>database_name</varname></codeph>, <codeph>DESCRIBE
<varname>table_name</varname></codeph>, <codeph>SELECT</codeph>, and or other statements
that expect to access databases or tables, even if the corresponding databases and
tables exist.
</p>
<p>
The contents of the policy file are cached, to avoid a performance penalty for each
query. The policy file is re-checked by each <cmdname>impalad</cmdname> node every 5
minutes. When you make a non-time-sensitive change such as adding new privileges or new
users, you can let the change take effect automatically a few minutes later. If you
remove or reduce privileges, and want the change to take effect immediately, restart the
<cmdname>impalad</cmdname> daemon on all nodes, again specifying the
<codeph>-server_name</codeph> and <codeph>-authorization_policy_file</codeph> options so
that the rules from the updated policy file are applied.
</p>
</conbody>
<concept id="security_policy_file_details">
<title>Policy File Format</title>
<conbody>
<p>
The policy file uses the familiar <codeph>.ini</codeph> format, divided into the major
sections <codeph>[groups]</codeph> and <codeph>[roles]</codeph>.
</p>
<p>
There is also an optional <codeph>[databases]</codeph> section, which allows you to
specify a specific policy file for a particular database, as explained in
<xref href="#security_multiple_policy_files"
/>.
</p>
<p>
Another optional section, <codeph>[users]</codeph>, allows you to override the
OS-level mapping of users to groups; that is an advanced technique primarily for
testing and debugging, and is beyond the scope of this document.
</p>
<p>
In the <codeph>[groups]</codeph> section, you define various categories of users and
select which roles are associated with each category. The group and usernames
correspond to Linux groups and users on the server where the
<cmdname>impalad</cmdname> daemon runs.
</p>
<p>
The group and usernames in the <codeph>[groups]</codeph> section correspond to Hadoop
groups and users on the server where the <cmdname>impalad</cmdname> daemon runs. When
you access Impala through the <cmdname>impalad</cmdname> interpreter, for purposes of
authorization, the user is the logged-in Linux user and the groups are the Linux
groups that user is a member of. When you access Impala through the ODBC or JDBC
interfaces, the user and password specified through the connection string are used as
login credentials for the Linux server, and authorization is based on that username
and the associated Linux group membership.
</p>
<p>
In the <codeph>[roles]</codeph> section, you a set of roles. For each role, you
specify precisely the set of privileges is available. That is, which objects users
with that role can access, and what operations they can perform on those objects. This
is the lowest-level category of security information; the other sections in the policy
file map the privileges to higher-level divisions of groups and users. In the
<codeph>[groups]</codeph> section, you specify which roles are associated with which
groups. The group and usernames correspond to Linux groups and users on the server
where the <cmdname>impalad</cmdname> daemon runs. The privileges are specified using
patterns like:
<codeblock>server=<varname>server_name</varname>-&gt;db=<varname>database_name</varname>-&gt;table=<varname>table_name</varname>-&gt;action=SELECT
server=<varname>server_name</varname>->db=<varname>database_name</varname>->table=t<varname>able_name</varname>->action=CREATE
server=<varname>server_name</varname>-&gt;db=<varname>database_name</varname>-&gt;table=<varname>table_name</varname>-&gt;action=ALL
</codeblock>
For the <varname>server_name</varname> value, substitute the same symbolic name you
specify with the <cmdname>impalad</cmdname> <codeph>-server_name</codeph> option. You
can use <codeph>*</codeph> wildcard characters at each level of the privilege
specification to allow access to all such objects. For example:
<codeblock>server=impala-host.example.com-&gt;db=default-&gt;table=t1-&gt;action=SELECT
server=impala-host.example.com->db=*->table=*->action=CREATE
server=impala-host.example.com-&gt;db=*-&gt;table=audit_log-&gt;action=SELECT
server=impala-host.example.com-&gt;db=default-&gt;table=t1-&gt;action=*
</codeblock>
</p>
</conbody>
</concept>
<concept id="security_multiple_policy_files">
<title>Using Multiple Policy Files for Different Databases</title>
<conbody>
<p>
For an Impala cluster with many databases being accessed by many users and
applications, it might be cumbersome to update the security policy file for each
privilege change or each new database, table, or view. You can allow security to be
managed separately for individual databases, by setting up a separate policy file for
each database:
</p>
<ul>
<li>
Add the optional <codeph>[databases]</codeph> section to the main policy file.
</li>
<li>
Add entries in the <codeph>[databases]</codeph> section for each database that has
its own policy file.
</li>
<li>
For each listed database, specify the HDFS path of the appropriate policy file.
</li>
</ul>
<p>
For example:
</p>
<codeblock>[databases]
# Defines the location of the per-DB policy files for the 'customers' and 'sales' databases.
customers = hdfs://ha-nn-uri/etc/access/customers.ini
sales = hdfs://ha-nn-uri/etc/access/sales.ini
</codeblock>
<p>
To enable URIs in per-DB policy files, the Java configuration option
<codeph>sentry.allow.uri.db.policyfile</codeph> must be set to <codeph>true</codeph>.
For example:
</p>
<codeblock>JAVA_TOOL_OPTIONS="-Dsentry.allow.uri.db.policyfile=true"
</codeblock>
<note type="important">
Enabling URIs in per-DB policy files introduces a security risk by allowing the owner
of the db-level policy file to grant himself/herself load privileges to anything the
<codeph>impala</codeph> user has read permissions for in HDFS (including data in other
databases controlled by different db-level policy files).
</note>
</conbody>
</concept>
</concept>
<concept id="security_schema">
<title>Setting Up Schema Objects for a Secure Impala Deployment</title>
<conbody>
<p>
In your role definitions, you must specify privileges at the level of individual
databases and tables, or all databases or all tables within a database. To simplify the
structure of these rules, plan ahead of time how to name your schema objects so that
data with different authorization requirements is divided into separate databases.
</p>
<p>
If you are adding security on top of an existing Impala deployment, you can rename
tables or even move them between databases using the <codeph>ALTER TABLE</codeph>
statement.
</p>
</conbody>
</concept>
<concept id="sentry_debug">
<title><ph conref="../shared/impala_common.xml#common/title_sentry_debug"
/></title>
<conbody>
<p conref="../shared/impala_common.xml#common/sentry_debug"/>
</conbody>
</concept>
<concept id="sec_ex_default">
<title>The DEFAULT Database in a Secure Deployment</title>
<conbody>
<p>
Because of the extra emphasis on granular access controls in a secure deployment, you
should move any important or sensitive information out of the <codeph>DEFAULT</codeph>
database into a named database whose privileges are specified in the policy file.
Sometimes you might need to give privileges on the <codeph>DEFAULT</codeph> database for
administrative reasons; for example, as a place you can reliably specify with a
<codeph>USE</codeph> statement when preparing to drop a database.
</p>
</conbody>
</concept>
</concept>