docs/topics/impala_authorization.xml - impala - Git at Google

 <?xml version="1.0" encoding="UTF-8"?>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept rev="1.1" id="authorization">

   <title>Enabling Sentry Authorization for Impala</title>
   <prolog>
     <metadata>
       <data name="Category" value="Security"/>
       <data name="Category" value="Sentry"/>
       <data name="Category" value="Impala"/>
       <data name="Category" value="Configuring"/>
       <data name="Category" value="Starting and Stopping"/>
       <data name="Category" value="Users"/>
       <data name="Category" value="Groups"/>
       <data name="Category" value="Administrators"/>
     </metadata>
   </prolog>

   <conbody id="sentry">

     <p>
       Authorization determines which users are allowed to access which resources, and what operations they are
       allowed to perform. In Impala 1.1 and higher, you use Apache Sentry for
       authorization. Sentry adds a fine-grained authorization framework for Hadoop. By default (when authorization
       is not enabled), Impala does all read and write operations with the privileges of the <codeph>impala</codeph>
       user, which is suitable for a development/test environment but not for a secure production environment. When
       authorization is enabled, Impala uses the OS user ID of the user who runs <cmdname>impala-shell</cmdname> or
       other client program, and associates various privileges with each user.
     </p>

     <note>
       Sentry is typically used in conjunction with Kerberos authentication, which defines which hosts are allowed
       to connect to each server. Using the combination of Sentry and Kerberos prevents malicious users from being
       able to connect by creating a named account on an untrusted machine. See
       <xref href="impala_kerberos.xml#kerberos"/> for details about Kerberos authentication.
     </note>

     <p audience="PDF" outputclass="toc inpage">
       See the following sections for details about using the Impala authorization features:
     </p>
   </conbody>

   <concept id="sentry_priv_model">

     <title>The Sentry Privilege Model</title>

     <conbody>

       <p>
         Privileges can be granted on different objects in the schema. Any privilege that can be
         granted is associated with a level in the object hierarchy. If a privilege is granted on
         a parent object in the hierarchy, the child object automatically inherits it. This is
         the same privilege model as Hive and other database systems.
       </p>

       <p>
         The objects in the Impala schema hierarchy are:
       </p>

 <codeblock>Server
     URI
     Database
         Table
             Column
 </codeblock>
       <p rev="2.3.0 collevelauth"> The table-level privileges apply to views as
         well. Anywhere you specify a table name, you can specify a view name
         instead. </p>
       <p rev="2.3.0 collevelauth"> In <keyword keyref="impala23_full"/> and
         higher, you can specify privileges for individual columns. </p>

       <p conref="../shared/impala_common.xml#common/sentry_privileges_objects"/>
       <p> Originally, privileges were encoded in a policy file, stored in HDFS.
         This mode of operation is still an option, but the emphasis of privilege
         management is moving towards being SQL-based. The mode of operation with
           <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements instead
         of the policy file requires that a special Sentry service be enabled;
         this service stores, retrieves, and manipulates privilege information
         stored inside the metastore database. </p>

       <note>
         <p>
           Although this document refers to the <codeph>ALL</codeph> privilege, currently if you
           use the policy file mode, you do not use the actual keyword <codeph>ALL</codeph> in
           the policy file. When you code role entries in the policy file:
         </p>
         <ul>
           <li>
             To specify the <codeph>ALL</codeph> privilege for a server, use a role like
             <codeph>server=<varname>server_name</varname></codeph>.
           </li>

           <li>
             To specify the <codeph>ALL</codeph> privilege for a database, use a role like
             <codeph>server=<varname>server_name</varname>-&gt;db=<varname>database_name</varname></codeph>.
           </li>

           <li>
             To specify the <codeph>ALL</codeph> privilege for a table, use a role like
             <codeph>server=<varname>server_name</varname>-&gt;db=<varname>database_name</varname>-&gt;table=<varname>table_name</varname>-&gt;action=*</codeph>.
           </li>
         </ul>
       </note>
       <p> If you change privileges in Sentry, e.g. adding a user, removing a
         user, modifying privileges, you must clear the Impala Catalog server
         cache by running the <codeph>INVALIDATE METADATA</codeph> statement.
           <codeph>INVALIDATE METADATA</codeph> is not required if you make the
         changes to privileges within Impala. </p>
     </conbody>
   </concept>

   <concept id="secure_startup">

     <title>Starting the impalad Daemon with Sentry Authorization Enabled</title>
   <prolog>
     <metadata>
       <data name="Category" value="Starting and Stopping"/>
     </metadata>
   </prolog>

     <conbody>

       <p>
         To run the <cmdname>impalad</cmdname> daemon with authorization enabled, you add one or more options to the
         <codeph>IMPALA_SERVER_ARGS</codeph> declaration in the <filepath>/etc/default/impala</filepath>
         configuration file:
       </p>
       <ul>
         <li>
           <codeph>-server_name</codeph>: Turns on Sentry authorization for
           Impala. The authorization rules refer to a symbolic server name, and
           you specify the same name to use as the argument to the
             <codeph>-server_name</codeph> option for all
             <cmdname>impalad</cmdname> nodes in the cluster. <p> Starting in
             Impala 1.4.0 and higher, if you specify just
               <codeph>-server_name</codeph> without
               <codeph>-authorization_policy_file</codeph>, Impala uses the
             Sentry service for authorization. </p>
         </li>
         <li>
           <codeph>-sentry_config</codeph>: Specifies the local path to the
             <codeph>sentry-site.xml</codeph> configuration file. This setting is
           required to enable authorization. </li>
         <li>
           <codeph>-authorization_policy_file</codeph>: Specifies the HDFS path
           to the policy file that defines the privileges on schema objects.
           Prior to Impala 1.4.0, or if you want to continue storing privilege
           rules in the policy file, specify the
             <codeph>-authorization_policy_file</codeph> option to make Impala
           read privilege information from a policy file, rather than from the
           metastore database. </li>
       </ul>

       <p rev="1.4.0">
         For example, you might adapt your <filepath>/etc/default/impala</filepath> configuration to contain lines
         like the following. To use the Sentry service rather than the policy file:
       </p>

 <codeblock rev="1.4.0">IMPALA_SERVER_ARGS=" \
 -server_name=server1 \
 ...
 </codeblock>

       <p>
         Or to use the policy file, as in releases prior to Impala 1.4:
       </p>

 <codeblock>IMPALA_SERVER_ARGS=" \
 -authorization_policy_file=/user/hive/warehouse/auth-policy.ini \
 -server_name=server1 \
 ...
 </codeblock>

       <p>
         The preceding examples set up a symbolic name of <codeph>server1</codeph> to refer to
         the current instance of Impala. Specify the symbolic name for the
         <codeph>sentry.hive.server</codeph> property in the <filepath>sentry-site.xml</filepath>
         configuration file for Hive, as well as in the <codeph>-server_name</codeph> option for
         <cmdname>impalad</cmdname>.
       </p>
       <p> Now restart the <cmdname>impalad</cmdname> daemons on all the nodes. </p>
     </conbody>
   </concept>

   <concept id="sentry_service">

     <title>Using Impala with the Sentry Service</title>
     <conbody>
       <p> When you use the Sentry service, you set up privileges through the
           <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements in
         either Impala or Hive. Then both components use those same privileges
         automatically. (Impala added the <codeph>GRANT</codeph> and
           <codeph>REVOKE</codeph> statements in <keyword keyref="impala20_full"
         />.) </p>
       <p> For information about using the Impala <codeph>GRANT</codeph> and
           <codeph>REVOKE</codeph> statements, see <xref
           href="impala_grant.xml#grant"/> and <xref
           href="impala_revoke.xml#revoke"/>. </p>
       <p> URIs represent the file paths you specify as part of statements such
         as <codeph>CREATE EXTERNAL TABLE</codeph> and <codeph>LOAD
         DATA</codeph>. Typically, you specify what look like UNIX paths, but
         these locations can also be prefixed with <codeph>hdfs://</codeph> to
         make clear that they are really URIs. To set privileges for a URI,
         specify the name of a directory, and the privilege applies to all the
         files in that directory and any directories underneath it. </p>
       <p> URIs must start with <codeph>hdfs://</codeph>,
         <codeph>s3a://</codeph>, <codeph>adl://</codeph>, or
           <codeph>file://</codeph>. If a URI starts with an absolute path, the
         path will be appended to the default filesystem prefix. For example, if
         you specify: <codeblock>
 GRANT ALL ON URI '/tmp';
 </codeblock> The above
         statement effectively becomes the following where the default filesystem
         is HDFS.
         <codeblock>
 GRANT ALL ON URI 'hdfs://localhost:20500/tmp';
 </codeblock>
       </p>
       <p> When defining URIs for HDFS, you must also specify the NameNode. For
         example: <codeblock>GRANT ALL ON URI file:///path/to/dir TO &lt;role>
 GRANT ALL ON URI hdfs://namenode:port/path/to/dir TO &lt;role></codeblock>
         <note type="warning">
           <p> Because the NameNode host and port must be specified, it is
             strongly recommended that you use High Availability (HA). This
             ensures that the URI will remain constant even if the NameNode
             changes. For example: </p>
           <codeblock>GRANT ALL ON URI hdfs://ha-nn-uri/path/to/dir TO &lt;role></codeblock>
         </note>
       </p>
     </conbody>
   </concept>
   <concept id="concept_k45_lbm_f2b">
     <title>Examples of Setting up Authorization for Security Scenarios</title>
     <conbody>
       <p> The following examples show how to set up authorization to deal with
         various scenarios. </p>
       <example>
         <title>A User with No Privileges</title>
         <p> If a user has no privileges at all, that user cannot access any
           schema objects in the system. The error messages do not disclose the
           names or existence of objects that the user is not authorized to read. </p>
         <p> This is the experience you want a user to have if they somehow log
           into a system where they are not an authorized Impala user. Or in a
           real deployment, a user might have no privileges because they are not
           a member of any of the authorized groups. </p>
       </example>
       <example>
         <title>Examples of Privileges for Administrative Users</title>
         <p> In this example, the SQL statements grant the
             <codeph>entire_server</codeph> role all privileges on both the
           databases and URIs within the server. </p>
         <codeblock>CREATE ROLE entire_server;
 GRANT ROLE entire_server TO GROUP admin_group;
 GRANT ALL ON SERVER server1 TO ROLE entire_server;
 </codeblock>
       </example>
       <example>
         <title>A User with Privileges for Specific Databases and Tables</title>
         <p> If a user has privileges for specific tables in specific databases,
           the user can access those things but nothing else. They can see the
           tables and their parent databases in the output of <codeph>SHOW
             TABLES</codeph> and <codeph>SHOW DATABASES</codeph>,
             <codeph>USE</codeph> the appropriate databases, and perform the
           relevant actions (<codeph>SELECT</codeph> and/or
             <codeph>INSERT</codeph>) based on the table privileges. To actually
           create a table requires the <codeph>ALL</codeph> privilege at the
           database level, so you might define separate roles for the user that
           sets up a schema and other users or applications that perform
           day-to-day operations on the tables. </p>
         <codeblock>
 CREATE ROLE one_database;
 GRANT ROLE one_database TO GROUP admin_group;
 GRANT ALL ON DATABASE db1 TO ROLE one_database;

 CREATE ROLE instructor;
 GRANT ROLE instructor TO GROUP trainers;
 GRANT ALL ON TABLE db1.lesson TO ROLE instructor;

 # This particular course is all about queries, so the students can SELECT but not INSERT or CREATE/DROP.
 CREATE ROLE student;
 GRANT ROLE student TO GROUP visitors;
 GRANT SELECT ON TABLE db1.training TO ROLE student;</codeblock>
       </example>
       <example>
         <title>Privileges for Working with External Data Files</title>
         <p> When data is being inserted through the <codeph>LOAD DATA</codeph>
           statement, or is referenced from an HDFS location outside the normal
           Impala database directories, the user also needs appropriate
           permissions on the URIs corresponding to those HDFS locations. </p>
         <p> In this example: </p>
         <ul>
           <li> The <codeph>external_table</codeph> role can insert into and
             query the Impala table, <codeph>external_table.sample</codeph>. </li>
           <li> The <codeph>staging_dir</codeph> role can specify the HDFS path
               <filepath>/user/impala-user/external_data</filepath> with the
               <codeph>LOAD DATA</codeph> statement. When Impala queries or loads
             data files, it operates on all the files in that directory, not just
             a single file, so any Impala <codeph>LOCATION</codeph> parameters
             refer to a directory rather than an individual file. </li>
         </ul>
         <codeblock>CREATE ROLE external_table;
 GRANT ROLE external_table TO GROUP impala_users;
 GRANT ALL ON TABLE external_table.sample TO ROLE external_table;

 CREATE ROLE staging_dir;
 GRANT ROLE staging TO GROUP impala_users;
 GRANT ALL ON URI 'hdfs://127.0.0.1:8020/user/impala-user/external_data' TO ROLE staging_dir;</codeblock>
       </example>
       <example>
         <title>Separating Administrator Responsibility from Read and Write
           Privileges</title>
         <p> To create a database, you need the full privilege on that database
           while day-to-day operations on tables within that database can be
           performed with lower levels of privilege on specific table. Thus, you
           might set up separate roles for each database or application: an
           administrative one that could create or drop the database, and a
           user-level one that can access only the relevant tables. </p>
         <p> In this example, the responsibilities are divided between users in 3
           different groups: </p>
         <ul>
           <li> Members of the <codeph>supergroup</codeph> group have the
               <codeph>training_sysadmin</codeph> role and so can set up a
             database named <codeph>training</codeph>. </li>
           <li> Members of the <codeph>impala_users</codeph> group have the
               <codeph>instructor</codeph> role and so can create, insert into,
             and query any tables in the <codeph>training</codeph> database, but
             cannot create or drop the database itself. </li>
           <li> Members of the <codeph>visitor</codeph> group have the
               <codeph>student</codeph> role and so can query those tables in the
               <codeph>training</codeph> database. </li>
         </ul>
         <codeblock>CREATE ROLE training_sysadmin;
 GRANT ROLE training_sysadmin TO GROUP supergroup;
 GRANT ALL ON DATABASE training1 TO ROLE training_sysadmin;

 CREATE ROLE instructor;
 GRANT ROLE instructor TO GROUP impala_users;
 GRANT ALL ON TABLE training1.course1 TO ROLE instructor;

 CREATE ROLE visitor;
 GRANT ROLE student TO GROUP visitor;
 GRANT SELECT ON TABLE training1.course1 TO ROLE student;</codeblock>
       </example>
     </conbody>
   </concept>
   <concept id="security_policy_file">
     <title>Using Impala with the Sentry Policy File</title>
     <conbody>
       <p> The policy file is a file that you put in a designated location in
         HDFS, and is read during the startup of the <cmdname>impalad</cmdname>
         daemon when you specify both the <codeph>-server_name</codeph> and
           <codeph>-authorization_policy_file</codeph> startup options. It
         controls which objects (databases, tables, and HDFS directory paths) can
         be accessed by the user who connects to <cmdname>impalad</cmdname>, and
         what operations that user can perform on the objects. </p>
       <note rev="1.4.0"> In <ph rev="upstream">CDH 5</ph> and higher, <ph
           rev="upstream">Cloudera</ph> recommends managing privileges through
         SQL statements, as described in <xref
           href="impala_authorization.xml#sentry_service"/>. If you are still
         using policy files, plan to migrate to the new approach some time in the
         future. </note>
       <p> The location of the policy file is listed in the
           <filepath>auth-site.xml</filepath> configuration file. </p>
       <p> When authorization is enabled, Impala uses the policy file as a
           <i>whitelist</i>, representing every privilege available to any user
         on any object. That is, only operations specified for the appropriate
         combination of object, role, group, and user are allowed. All other
         operations are not allowed. If a group or role is defined multiple times
         in the policy file, the last definition takes precedence. </p>
       <p> To understand the notion of whitelisting, set up a minimal policy file
         that does not provide any privileges for any object. When you connect to
         an Impala node where this policy file is in effect, you get no results
         for <codeph>SHOW DATABASES</codeph>, and an error when you issue any
           <codeph>SHOW TABLES</codeph>, <codeph>USE
             <varname>database_name</varname></codeph>, <codeph>DESCRIBE
             <varname>table_name</varname></codeph>, <codeph>SELECT</codeph>, and
         or other statements that expect to access databases or tables, even if
         the corresponding databases and tables exist. </p>
       <p> The contents of the policy file are cached, to avoid a performance
         penalty for each query. The policy file is re-checked by each
           <cmdname>impalad</cmdname> node every 5 minutes. When you make a
         non-time-sensitive change such as adding new privileges or new users,
         you can let the change take effect automatically a few minutes later. If
         you remove or reduce privileges, and want the change to take effect
         immediately, restart the <cmdname>impalad</cmdname> daemon on all nodes,
         again specifying the <codeph>-server_name</codeph> and
           <codeph>-authorization_policy_file</codeph> options so that the rules
         from the updated policy file are applied. </p>
     </conbody>
     <concept id="security_policy_file_details">
       <title>Policy File Format</title>
       <conbody>
         <p> The policy file uses the familiar <codeph>.ini</codeph> format,
           divided into the major sections <codeph>[groups]</codeph> and
             <codeph>[roles]</codeph>. </p>
         <p> There is also an optional <codeph>[databases]</codeph> section,
           which allows you to specify a specific policy file for a particular
           database, as explained in <xref href="#security_multiple_policy_files"
           />. </p>
         <p> Another optional section, <codeph>[users]</codeph>, allows you to
           override the OS-level mapping of users to groups; that is an advanced
           technique primarily for testing and debugging, and is beyond the scope
           of this document. </p>
         <p> In the <codeph>[groups]</codeph> section, you define various
           categories of users and select which roles are associated with each
           category. The group and usernames correspond to Linux groups and users
           on the server where the <cmdname>impalad</cmdname> daemon runs. </p>
         <p> The group and usernames in the <codeph>[groups]</codeph> section
           correspond to Hadoop groups and users on the server where the
             <cmdname>impalad</cmdname> daemon runs. When you access Impala
           through the <cmdname>impalad</cmdname> interpreter, for purposes of
           authorization, the user is the logged-in Linux user and the groups are
           the Linux groups that user is a member of. When you access Impala
           through the ODBC or JDBC interfaces, the user and password specified
           through the connection string are used as login credentials for the
           Linux server, and authorization is based on that username and the
           associated Linux group membership. </p>
         <p> In the <codeph>[roles]</codeph> section, you a set of roles. For
           each role, you specify precisely the set of privileges is available.
           That is, which objects users with that role can access, and what
           operations they can perform on those objects. This is the lowest-level
           category of security information; the other sections in the policy
           file map the privileges to higher-level divisions of groups and users.
           In the <codeph>[groups]</codeph> section, you specify which roles are
           associated with which groups. The group and usernames correspond to
           Linux groups and users on the server where the
             <cmdname>impalad</cmdname> daemon runs. The privileges are specified
           using patterns like:
           <codeblock>server=<varname>server_name</varname>-&gt;db=<varname>database_name</varname>-&gt;table=<varname>table_name</varname>-&gt;action=SELECT
 server=<varname>server_name</varname>->db=<varname>database_name</varname>->table=t<varname>able_name</varname>->action=CREATE
 server=<varname>server_name</varname>-&gt;db=<varname>database_name</varname>-&gt;table=<varname>table_name</varname>-&gt;action=ALL
 </codeblock>
           For the <varname>server_name</varname> value, substitute the same
           symbolic name you specify with the <cmdname>impalad</cmdname>
           <codeph>-server_name</codeph> option. You can use <codeph>*</codeph>
           wildcard characters at each level of the privilege specification to
           allow access to all such objects. For example:
           <codeblock>server=impala-host.example.com-&gt;db=default-&gt;table=t1-&gt;action=SELECT
 server=impala-host.example.com->db=*->table=*->action=CREATE
 server=impala-host.example.com-&gt;db=*-&gt;table=audit_log-&gt;action=SELECT
 server=impala-host.example.com-&gt;db=default-&gt;table=t1-&gt;action=*
 </codeblock>
         </p>
       </conbody>
     </concept>
     <concept id="security_multiple_policy_files">
       <title>Using Multiple Policy Files for Different Databases</title>
       <conbody>
         <p> For an Impala cluster with many databases being accessed by many
           users and applications, it might be cumbersome to update the security
           policy file for each privilege change or each new database, table, or
           view. You can allow security to be managed separately for individual
           databases, by setting up a separate policy file for each database: </p>
         <ul>
           <li> Add the optional <codeph>[databases]</codeph> section to the main
             policy file. </li>
           <li> Add entries in the <codeph>[databases]</codeph> section for each
             database that has its own policy file. </li>
           <li> For each listed database, specify the HDFS path of the
             appropriate policy file. </li>
         </ul>
         <p> For example: </p>
         <codeblock>[databases]
 # Defines the location of the per-DB policy files for the 'customers' and 'sales' databases.
 customers = hdfs://ha-nn-uri/etc/access/customers.ini
 sales = hdfs://ha-nn-uri/etc/access/sales.ini
 </codeblock>
         <p> To enable URIs in per-DB policy files, the Java configuration option
             <codeph>sentry.allow.uri.db.policyfile</codeph> must be set to
             <codeph>true</codeph>. For example: </p>
         <codeblock>JAVA_TOOL_OPTIONS="-Dsentry.allow.uri.db.policyfile=true"
 </codeblock>
         <note type="important"> Enabling URIs in per-DB policy files introduces
           a security risk by allowing the owner of the db-level policy file to
           grant himself/herself load privileges to anything the
             <codeph>impala</codeph> user has read permissions for in HDFS
           (including data in other databases controlled by different db-level
           policy files). </note>
       </conbody>
     </concept>
   </concept>
   <concept id="security_schema">
     <title>Setting Up Schema Objects for a Secure Impala Deployment</title>
     <conbody>
       <p> In your role definitions, you must specify privileges at the level of
         individual databases and tables, or all databases or all tables within a
         database. To simplify the structure of these rules, plan ahead of time
         how to name your schema objects so that data with different
         authorization requirements is divided into separate databases. </p>
       <p> If you are adding security on top of an existing Impala deployment,
         you can rename tables or even move them between databases using the
           <codeph>ALTER TABLE</codeph> statement. </p>
     </conbody>
   </concept>
   <concept id="sentry_debug">
     <title><ph conref="../shared/impala_common.xml#common/title_sentry_debug"
       /></title>
     <conbody>
       <p conref="../shared/impala_common.xml#common/sentry_debug"/>
     </conbody>
   </concept>
   <concept id="sec_ex_default">
     <title>The DEFAULT Database in a Secure Deployment</title>
     <conbody>
       <p> Because of the extra emphasis on granular access controls in a secure
         deployment, you should move any important or sensitive information out
         of the <codeph>DEFAULT</codeph> database into a named database whose
         privileges are specified in the policy file. Sometimes you might need to
         give privileges on the <codeph>DEFAULT</codeph> database for
         administrative reasons; for example, as a place you can reliably specify
         with a <codeph>USE</codeph> statement when preparing to drop a database.
       </p>
     </conbody>
   </concept>
 </concept>