| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="adls" rev="2.9.0"> |
| |
| <title>Using Impala with the Azure Data Lake Store (ADLS)</title> |
| <titlealts audience="PDF"><navtitle>ADLS Tables</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="ADLS"/> |
| <data name="Category" value="Data Analysts"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Querying"/> |
| <data name="Category" value="Preview Features"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| You can use Impala to query data residing on the Azure Data Lake Store |
| (ADLS) filesystem. This capability allows convenient access to a storage |
| system that is remotely managed, accessible from anywhere, and integrated |
| with various cloud-based services. Impala can query files in any supported |
| file format from ADLS. The ADLS storage location can be for an entire |
| table or individual partitions in a partitioned table. |
| </p> |
| |
| <p> |
| The default Impala tables use data files stored on HDFS, which are ideal for bulk loads and queries using |
| full-table scans. In contrast, queries against ADLS data are less performant, making ADLS suitable for holding |
| <q>cold</q> data that is only queried occasionally, while more frequently accessed <q>hot</q> data resides in |
| HDFS. In a partitioned table, you can set the <codeph>LOCATION</codeph> attribute for individual partitions |
| to put some partitions on HDFS and others on ADLS, typically depending on the age of the data. |
| </p> |
| <p>Starting in <keyword keyref="impala31"/>, Impala supports ADLS Gen2 |
| filesystem, Azure Blob File System (ABFS).</p> |
| |
| <p outputclass="toc inpage"/> |
| |
| </conbody> |
| |
| <concept id="prereqs"> |
| <title>Prerequisites</title> |
| <conbody> |
| <p> |
| These procedures presume that you have already set up an Azure account, |
| configured an ADLS store, and configured your Hadoop cluster with appropriate |
| credentials to be able to access ADLS. See the following resources for information: |
| </p> |
| <ul> |
| <li> |
| <p> |
| <xref href="https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-get-started-portal" scope="external" format="html">Get started with Azure Data Lake Store using the Azure Portal</xref> |
| </p> |
| </li> |
| <li><xref |
| href="https://docs.microsoft.com/en-us/azure/storage/data-lake-storage/quickstart-create-account" |
| format="html" scope="external">Azure Data Lake Storage Gen2</xref></li> |
| <li> |
| <p> |
| <xref href="https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html" scope="external" format="html">Hadoop Azure Data Lake Support</xref> |
| </p> |
| </li> |
| </ul> |
| </conbody> |
| </concept> |
| |
| <concept id="sql"> |
| <title>How Impala SQL Statements Work with ADLS</title> |
| <conbody> |
| <p> Impala SQL statements work with data on ADLS as follows. </p> |
| <ul> |
| <li><p> The <xref href="impala_create_table.xml#create_table"/> or <xref |
| href="impala_alter_table.xml#alter_table"/> statements can specify |
| that a table resides on the ADLS filesystem by using one of the |
| following ADLS prefixes in the <codeph>LOCATION</codeph> property.<ul> |
| <li>For ADLS Gen1: <codeph>adl://</codeph></li> |
| <li>For ADLS Gen2: <codeph>abfs://</codeph> or |
| <codeph>abfss://</codeph></li> |
| </ul></p><p><codeph>ALTER TABLE</codeph> can also set the |
| <codeph>LOCATION</codeph> property for an individual partition, so |
| that some data in a table resides on ADLS and other data in the same |
| table resides on HDFS. </p> See <xref href="impala_adls.xml#ddl"/> |
| for usage information.</li> |
| <li> |
| <p> |
| Once a table or partition is designated as residing on ADLS, the <xref href="impala_select.xml#select"/> |
| statement transparently accesses the data files from the appropriate storage layer. |
| </p> |
| </li> |
| <li> |
| <p> |
| If the ADLS table is an internal table, the <xref href="impala_drop_table.xml#drop_table"/> statement |
| removes the corresponding data files from ADLS when the table is dropped. |
| </p> |
| </li> |
| <li> |
| <p> |
| The <xref href="impala_truncate_table.xml#truncate_table"/> statement always removes the corresponding |
| data files from ADLS when the table is truncated. |
| </p> |
| </li> |
| <li> |
| <p> |
| The <xref href="impala_load_data.xml#load_data"/> can move data files residing in HDFS into |
| an ADLS table. |
| </p> |
| </li> |
| <li> |
| <p> |
| The <xref href="impala_insert.xml#insert"/>, or the <codeph>CREATE TABLE AS SELECT</codeph> |
| form of the <codeph>CREATE TABLE</codeph> statement, can copy data from an HDFS table or another ADLS |
| table into an ADLS table. |
| </p> |
| </li> |
| </ul> |
| <p> For usage information about Impala SQL statements with ADLS tables, |
| see <xref href="impala_adls.xml#dml"/>. </p> |
| </conbody> |
| </concept> |
| |
| <concept id="creds"> |
| |
| <title>Specifying Impala Credentials to Access Data in ADLS</title> |
| |
| <conbody> |
| |
| <p> To allow Impala to access data in ADLS, specify values for the |
| following configuration settings in your |
| <filepath>core-site.xml</filepath> file.</p> |
| <p>For ADLS Gen1:</p> |
| |
| <codeblock><property> |
| <name>dfs.adls.oauth2.access.token.provider.type</name> |
| <value>ClientCredential</value> |
| </property> |
| <property> |
| <name>dfs.adls.oauth2.client.id</name> |
| <value><varname>your_client_id</varname></value> |
| </property> |
| <property> |
| <name>dfs.adls.oauth2.credential</name> |
| <value><varname>your_client_secret</varname></value> |
| </property> |
| <property> |
| <name>dfs.adls.oauth2.refresh.url</name> |
| <value>https://login.windows.net/<varname>your_azure_tenant_id</varname>/oauth2/token</value> |
| </property> |
| |
| </codeblock> |
| <p>For ADLS Gen2:</p> |
| <codeblock> <property> |
| <name>fs.azure.account.auth.type</name> |
| <value>OAuth</value> |
| </property> |
| |
| <property> |
| <name>fs.azure.account.oauth.provider.type</name> |
| <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value> |
| </property> |
| |
| <property> |
| <name>fs.azure.account.oauth2.client.id</name> |
| <value><varname>your_client_id</varname></value> |
| </property> |
| |
| <property> |
| <name>fs.azure.account.oauth2.client.secret</name> |
| <value><varname>your_client_secret</varname></value> |
| </property> |
| |
| <property> |
| <name>fs.azure.account.oauth2.client.endpoint</name> |
| <value>https://login.microsoftonline.com/<varname>your_azure_tenant_id</varname>/oauth2/token</value> |
| </property></codeblock> |
| |
| <note> |
| <p> |
| Check if your Hadoop distribution or cluster management tool includes support for |
| filling in and distributing credentials across the cluster in an automated way. |
| </p> |
| </note> |
| |
| <p> After specifying the credentials, restart both the Impala and Hive |
| services. Restarting Hive is required because certain Impala queries, |
| such as <codeph>CREATE TABLE</codeph> statements, go through the Hive |
| metastore.</p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="etl"> |
| |
| <title>Loading Data into ADLS for Impala Queries</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="ETL"/> |
| <data name="Category" value="Ingest"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| If your ETL pipeline involves moving data into ADLS and then querying through Impala, |
| you can either use Impala DML statements to create, move, or copy the data, or |
| use the same data loading techniques as you would for non-Impala data. |
| </p> |
| |
| </conbody> |
| |
| <concept id="dml"> |
| <title>Using Impala DML Statements for ADLS Data</title> |
| <conbody> |
| <p conref="../shared/impala_common.xml#common/adls_dml" |
| conrefend="../shared/impala_common.xml#common/adls_dml_end"/> |
| </conbody> |
| </concept> |
| |
| <concept id="manual_etl"> |
| <title>Manually Loading Data into Impala Tables on ADLS</title> |
| <conbody> |
| <p> |
| As an alternative, you can use the Microsoft-provided methods to bring data files |
| into ADLS for querying through Impala. See |
| <xref href="https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-copy-data-azure-storage-blob" scope="external" format="html">the Microsoft ADLS documentation</xref> |
| for details. |
| </p> |
| |
| <p> |
| After you upload data files to a location already mapped to an Impala table or partition, or if you delete |
| files in ADLS from such a location, issue the <codeph>REFRESH <varname>table_name</varname></codeph> |
| statement to make Impala aware of the new set of data files. |
| </p> |
| |
| </conbody> |
| </concept> |
| |
| </concept> |
| |
| <concept id="ddl"> |
| |
| <title>Creating Impala Databases, Tables, and Partitions for Data Stored on ADLS</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Databases"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| Impala reads data for a table or partition from ADLS based on the |
| <codeph>LOCATION</codeph> attribute for the table or partition. |
| Specify the ADLS details in the <codeph>LOCATION</codeph> clause of a |
| <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> |
| statement. The syntax for the <codeph>LOCATION</codeph> clause is: |
| <ul> |
| <li> |
| For ADLS Gen1: |
| <codeblock>adl://<varname>account</varname>.azuredatalakestore.net/<varname>path/file</varname></codeblock></li> |
| <li> |
| For ADLS Gen2: |
| <codeblock>abfs://<varname>container</varname>@<varname>account</varname>.dfs.core.windows.net/<varname>path</varname>/<varname>file</varname></codeblock> |
| <p> |
| or |
| </p> |
| <codeblock>abfss://<varname>container</varname>@<varname>account</varname>.dfs.core.windows.net/<varname>path</varname>/<varname>file</varname></codeblock> |
| </li> |
| </ul> |
| </p> |
| <p> |
| <codeph><varname>container</varname></codeph> denotes the parent |
| location that holds the files and folders, which is the Containers in |
| the Azure Storage Blobs service. |
| </p> |
| <p> |
| <codeph><varname>account</varname></codeph> is the name given for your |
| storage account. |
| </p> |
| <note> |
| <p> By default, TLS is enabled both with <codeph>abfs://</codeph> and |
| <codeph>abfss://</codeph>. </p> |
| <p> |
| When you set the <codeph>fs.azure.always.use.https=false</codeph> |
| property, TLS is disabled with <codeph>abfs://</codeph>, and TLS is |
| enabled with <codeph>abfss://</codeph> |
| </p> |
| </note> |
| |
| <p> |
| For a partitioned table, either specify a separate <codeph>LOCATION</codeph> clause for each new partition, |
| or specify a base <codeph>LOCATION</codeph> for the table and set up a directory structure in ADLS to mirror |
| the way Impala partitioned tables are structured in HDFS. Although, strictly speaking, ADLS filenames do not |
| have directory paths, Impala treats ADLS filenames with <codeph>/</codeph> characters the same as HDFS |
| pathnames that include directories. |
| </p> |
| |
| <p> |
| To point a nonpartitioned table or an individual partition at ADLS, specify a single directory |
| path in ADLS, which could be any arbitrary directory. To replicate the structure of an entire Impala |
| partitioned table or database in ADLS requires more care, with directories and subdirectories nested and |
| named to match the equivalent directory tree in HDFS. Consider setting up an empty staging area if |
| necessary in HDFS, and recording the complete directory structure so that you can replicate it in ADLS. |
| </p> |
| |
| <p> |
| For example, the following session creates a partitioned table where only a single partition resides on ADLS. |
| The partitions for years 2013 and 2014 are located on HDFS. The partition for year 2015 includes a |
| <codeph>LOCATION</codeph> attribute with an <codeph>adl://</codeph> URL, and so refers to data residing on |
| ADLS, under a specific path underneath the store <codeph>impalademo</codeph>. |
| </p> |
| |
| <codeblock>[localhost:21000] > create database db_on_hdfs; |
| [localhost:21000] > use db_on_hdfs; |
| [localhost:21000] > create table mostly_on_hdfs (x int) partitioned by (year int); |
| [localhost:21000] > alter table mostly_on_hdfs add partition (year=2013); |
| [localhost:21000] > alter table mostly_on_hdfs add partition (year=2014); |
| [localhost:21000] > alter table mostly_on_hdfs add partition (year=2015) |
| > location 'adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3/t1'; |
| </codeblock> |
| |
| <p> For convenience when working with multiple tables with data files |
| stored in ADLS, you can create a database with a |
| <codeph>LOCATION</codeph> attribute pointing to an ADLS path. Specify |
| a URL of the form as shown above. Any tables created inside that |
| database automatically create directories underneath the one specified |
| by the database <codeph>LOCATION</codeph> attribute. </p> |
| |
| <p> |
| The following session creates a database and two partitioned tables residing entirely on ADLS, one |
| partitioned by a single column and the other partitioned by multiple columns. Because a |
| <codeph>LOCATION</codeph> attribute with an <codeph>adl://</codeph> URL is specified for the database, the |
| tables inside that database are automatically created on ADLS underneath the database directory. To see the |
| names of the associated subdirectories, including the partition key values, we use an ADLS client tool to |
| examine how the directory structure is organized on ADLS. For example, Impala partition directories such as |
| <codeph>month=1</codeph> do not include leading zeroes, which sometimes appear in partition directories created |
| through Hive. |
| </p> |
| |
| <codeblock>[localhost:21000] > create database db_on_adls location 'adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3'; |
| [localhost:21000] > use db_on_adls; |
| |
| [localhost:21000] > create table partitioned_on_adls (x int) partitioned by (year int); |
| [localhost:21000] > alter table partitioned_on_adls add partition (year=2013); |
| [localhost:21000] > alter table partitioned_on_adls add partition (year=2014); |
| [localhost:21000] > alter table partitioned_on_adls add partition (year=2015); |
| |
| [localhost:21000] > ! hadoop fs -ls adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3 --recursive; |
| 2015-03-17 13:56:34 0 dir1/dir2/dir3/ |
| 2015-03-17 16:43:28 0 dir1/dir2/dir3/partitioned_on_adls/ |
| 2015-03-17 16:43:49 0 dir1/dir2/dir3/partitioned_on_adls/year=2013/ |
| 2015-03-17 16:43:53 0 dir1/dir2/dir3/partitioned_on_adls/year=2014/ |
| 2015-03-17 16:43:58 0 dir1/dir2/dir3/partitioned_on_adls/year=2015/ |
| |
| [localhost:21000] > create table partitioned_multiple_keys (x int) |
| > partitioned by (year smallint, month tinyint, day tinyint); |
| [localhost:21000] > alter table partitioned_multiple_keys |
| > add partition (year=2015,month=1,day=1); |
| [localhost:21000] > alter table partitioned_multiple_keys |
| > add partition (year=2015,month=1,day=31); |
| [localhost:21000] > alter table partitioned_multiple_keys |
| > add partition (year=2015,month=2,day=28); |
| |
| [localhost:21000] > ! hadoop fs -ls adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3 --recursive; |
| 2015-03-17 13:56:34 0 dir1/dir2/dir3/ |
| 2015-03-17 16:47:13 0 dir1/dir2/dir3/partitioned_multiple_keys/ |
| 2015-03-17 16:47:44 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=1/ |
| 2015-03-17 16:47:50 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=31/ |
| 2015-03-17 16:47:57 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=2/day=28/ |
| 2015-03-17 16:43:28 0 dir1/dir2/dir3/partitioned_on_adls/ |
| 2015-03-17 16:43:49 0 dir1/dir2/dir3/partitioned_on_adls/year=2013/ |
| 2015-03-17 16:43:53 0 dir1/dir2/dir3/partitioned_on_adls/year=2014/ |
| 2015-03-17 16:43:58 0 dir1/dir2/dir3/partitioned_on_adls/year=2015/ |
| </codeblock> |
| |
| <p> |
| The <codeph>CREATE DATABASE</codeph> and <codeph>CREATE TABLE</codeph> statements create the associated |
| directory paths if they do not already exist. You can specify multiple levels of directories, and the |
| <codeph>CREATE</codeph> statement creates all appropriate levels, similar to using <codeph>mkdir |
| -p</codeph>. |
| </p> |
| |
| <p> |
| Use the standard ADLS file upload methods to actually put the data files into the right locations. You can |
| also put the directory paths and data files in place before creating the associated Impala databases or |
| tables, and Impala automatically uses the data from the appropriate location after the associated databases |
| and tables are created. |
| </p> |
| |
| <p> You can switch whether an existing table or partition points to data |
| in HDFS or ADLS. For example, if you have an Impala table or partition |
| pointing to data files in HDFS or ADLS, and you later transfer those |
| data files to the other filesystem, use an <codeph>ALTER TABLE</codeph> |
| statement to adjust the <codeph>LOCATION</codeph> attribute of the |
| corresponding table or partition to reflect that change. This |
| location-switching technique is not practical for entire databases that |
| have a custom <codeph>LOCATION</codeph> attribute. </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="internal_external"> |
| |
| <title>Internal and External Tables Located on ADLS</title> |
| |
| <conbody> |
| |
| <p> |
| Just as with tables located on HDFS storage, you can designate ADLS-based tables as either internal (managed |
| by Impala) or external, by using the syntax <codeph>CREATE TABLE</codeph> or <codeph>CREATE EXTERNAL |
| TABLE</codeph> respectively. When you drop an internal table, the files associated with the table are |
| removed, even if they are on ADLS storage. When you drop an external table, the files associated with the |
| table are left alone, and are still available for access by other tools or components. See |
| <xref href="impala_tables.xml#tables"/> for details. |
| </p> |
| |
| <p> |
| If the data on ADLS is intended to be long-lived and accessed by other tools in addition to Impala, create |
| any associated ADLS tables with the <codeph>CREATE EXTERNAL TABLE</codeph> syntax, so that the files are not |
| deleted from ADLS when the table is dropped. |
| </p> |
| |
| <p> |
| If the data on ADLS is only needed for querying by Impala and can be safely discarded once the Impala |
| workflow is complete, create the associated ADLS tables using the <codeph>CREATE TABLE</codeph> syntax, so |
| that dropping the table also deletes the corresponding data files on ADLS. |
| </p> |
| |
| <p> |
| For example, this session creates a table in ADLS with the same column layout as a table in HDFS, then |
| examines the ADLS table and queries some data from it. The table in ADLS works the same as a table in HDFS as |
| far as the expected file format of the data, table and column statistics, and other table properties. The |
| only indication that it is not an HDFS table is the <codeph>adl://</codeph> URL in the |
| <codeph>LOCATION</codeph> property. Many data files can reside in the ADLS directory, and their combined |
| contents form the table data. Because the data in this example is uploaded after the table is created, a |
| <codeph>REFRESH</codeph> statement prompts Impala to update its cached information about the data files. |
| </p> |
| |
| <codeblock>[localhost:21000] > create table usa_cities_adls like usa_cities location 'adl://impalademo.azuredatalakestore.net/usa_cities'; |
| [localhost:21000] > desc usa_cities_adls; |
| +-------+----------+---------+ |
| | name | type | comment | |
| +-------+----------+---------+ |
| | id | smallint | | |
| | city | string | | |
| | state | string | | |
| +-------+----------+---------+ |
| |
| -- Now from a web browser, upload the same data file(s) to ADLS as in the HDFS table, |
| -- under the relevant store and path. If you already have the data in ADLS, you would |
| -- point the table LOCATION at an existing path. |
| |
| [localhost:21000] > refresh usa_cities_adls; |
| [localhost:21000] > select count(*) from usa_cities_adls; |
| +----------+ |
| | count(*) | |
| +----------+ |
| | 289 | |
| +----------+ |
| [localhost:21000] > select distinct state from sample_data_adls limit 5; |
| +----------------------+ |
| | state | |
| +----------------------+ |
| | Louisiana | |
| | Minnesota | |
| | Georgia | |
| | Alaska | |
| | Ohio | |
| +----------------------+ |
| [localhost:21000] > desc formatted usa_cities_adls; |
| +------------------------------+----------------------------------------------------+---------+ |
| | name | type | comment | |
| +------------------------------+----------------------------------------------------+---------+ |
| | # col_name | data_type | comment | |
| | | NULL | NULL | |
| | id | smallint | NULL | |
| | city | string | NULL | |
| | state | string | NULL | |
| | | NULL | NULL | |
| | # Detailed Table Information | NULL | NULL | |
| | Database: | adls_testing | NULL | |
| | Owner: | jrussell | NULL | |
| | CreateTime: | Mon Mar 16 11:36:25 PDT 2017 | NULL | |
| | LastAccessTime: | UNKNOWN | NULL | |
| | Protect Mode: | None | NULL | |
| | Retention: | 0 | NULL | |
| | Location: | adl://impalademo.azuredatalakestore.net/usa_cities | NULL | |
| | Table Type: | MANAGED_TABLE | NULL | |
| ... |
| +------------------------------+----------------------------------------------------+---------+ |
| </codeblock> |
| |
| <p> |
| In this case, we have already uploaded a Parquet file with a million rows of data to the |
| <codeph>sample_data</codeph> directory underneath the <codeph>impalademo</codeph> store on ADLS. This |
| session creates a table with matching column settings pointing to the corresponding location in ADLS, then |
| queries the table. Because the data is already in place on ADLS when the table is created, no |
| <codeph>REFRESH</codeph> statement is required. |
| </p> |
| |
| <codeblock>[localhost:21000] > create table sample_data_adls |
| > (id int, id bigint, val int, zerofill string, |
| > name string, assertion boolean, city string, state string) |
| > stored as parquet location 'adl://impalademo.azuredatalakestore.net/sample_data'; |
| [localhost:21000] > select count(*) from sample_data_adls; |
| +----------+ |
| | count(*) | |
| +----------+ |
| | 1000000 | |
| +----------+ |
| [localhost:21000] > select count(*) howmany, assertion from sample_data_adls group by assertion; |
| +---------+-----------+ |
| | howmany | assertion | |
| +---------+-----------+ |
| | 667149 | true | |
| | 332851 | false | |
| +---------+-----------+ |
| </codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="queries"> |
| |
| <title>Running and Tuning Impala Queries for Data Stored on ADLS</title> |
| |
| <conbody> |
| |
| <p> |
| Once the appropriate <codeph>LOCATION</codeph> attributes are set up at the table or partition level, you |
| query data stored in ADLS exactly the same as data stored on HDFS or in HBase: |
| </p> |
| |
| <ul> |
| <li> |
| Queries against ADLS data support all the same file formats as for HDFS data. |
| </li> |
| |
| <li> |
| Tables can be unpartitioned or partitioned. For partitioned tables, either manually construct paths in ADLS |
| corresponding to the HDFS directories representing partition key values, or use <codeph>ALTER TABLE ... |
| ADD PARTITION</codeph> to set up the appropriate paths in ADLS. |
| </li> |
| |
| <li> |
| HDFS, Kudu, and HBase tables can be joined to ADLS tables, or ADLS tables can be joined with each other. |
| </li> |
| |
| <li> |
| Authorization using the Sentry framework to control access to databases, tables, or columns works the |
| same whether the data is in HDFS or in ADLS. |
| </li> |
| |
| <li> |
| The <cmdname>catalogd</cmdname> daemon caches metadata for both HDFS and ADLS tables. Use |
| <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> for ADLS tables in the same situations |
| where you would issue those statements for HDFS tables. |
| </li> |
| |
| <li> |
| Queries against ADLS tables are subject to the same kinds of admission control and resource management as |
| HDFS tables. |
| </li> |
| |
| <li> |
| Metadata about ADLS tables is stored in the same metastore database as for HDFS tables. |
| </li> |
| |
| <li> |
| You can set up views referring to ADLS tables, the same as for HDFS tables. |
| </li> |
| |
| <li> |
| The <codeph>COMPUTE STATS</codeph>, <codeph>SHOW TABLE STATS</codeph>, and <codeph>SHOW COLUMN |
| STATS</codeph> statements work for ADLS tables also. |
| </li> |
| </ul> |
| |
| </conbody> |
| |
| <concept id="performance"> |
| |
| <title>Understanding and Tuning Impala Query Performance for ADLS Data</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Performance"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| Although Impala queries for data stored in ADLS might be less performant than queries against the |
| equivalent data stored in HDFS, you can still do some tuning. Here are techniques you can use to |
| interpret explain plans and profiles for queries against ADLS data, and tips to achieve the best |
| performance possible for such queries. |
| </p> |
| |
| <p> |
| All else being equal, performance is expected to be lower for queries running against data on ADLS rather |
| than HDFS. The actual mechanics of the <codeph>SELECT</codeph> statement are somewhat different when the |
| data is in ADLS. Although the work is still distributed across the datanodes of the cluster, Impala might |
| parallelize the work for a distributed query differently for data on HDFS and ADLS. ADLS does not have the |
| same block notion as HDFS, so Impala uses heuristics to determine how to split up large ADLS files for |
| processing in parallel. Because all hosts can access any ADLS data file with equal efficiency, the |
| distribution of work might be different than for HDFS data, where the data blocks are physically read |
| using short-circuit local reads by hosts that contain the appropriate block replicas. Although the I/O to |
| read the ADLS data might be spread evenly across the hosts of the cluster, the fact that all data is |
| initially retrieved across the network means that the overall query performance is likely to be lower for |
| ADLS data than for HDFS data. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/adls_block_splitting"/> |
| |
| <p> |
| When optimizing aspects of for complex queries such as the join order, Impala treats tables on HDFS and |
| ADLS the same way. Therefore, follow all the same tuning recommendations for ADLS tables as for HDFS ones, |
| such as using the <codeph>COMPUTE STATS</codeph> statement to help Impala construct accurate estimates of |
| row counts and cardinality. See <xref href="impala_performance.xml#performance"/> for details. |
| </p> |
| |
| <p> |
| In query profile reports, the numbers for <codeph>BytesReadLocal</codeph>, |
| <codeph>BytesReadShortCircuit</codeph>, <codeph>BytesReadDataNodeCached</codeph>, and |
| <codeph>BytesReadRemoteUnexpected</codeph> are blank because those metrics come from HDFS. |
| If you do see any indications that a query against an ADLS table performed <q>remote read</q> |
| operations, do not be alarmed. That is expected because, by definition, all the I/O for ADLS tables involves |
| remote reads. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |
| |
| <concept id="restrictions"> |
| |
| <title>Restrictions on Impala Support for ADLS</title> |
| |
| <conbody> |
| |
| <p> |
| Impala requires that the default filesystem for the cluster be HDFS. You cannot use ADLS as the only |
| filesystem in the cluster. |
| </p> |
| |
| <p> |
| Although ADLS is often used to store JSON-formatted data, the current Impala support for ADLS does not include |
| directly querying JSON data. For Impala queries, use data files in one of the file formats listed in |
| <xref href="impala_file_formats.xml#file_formats"/>. If you have data in JSON format, you can prepare a |
| flattened version of that data for querying by Impala as part of your ETL cycle. |
| </p> |
| |
| <p> |
| You cannot use the <codeph>ALTER TABLE ... SET CACHED</codeph> statement for tables or partitions that are |
| located in ADLS. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="best_practices"> |
| <title>Best Practices for Using Impala with ADLS</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Guidelines"/> |
| <data name="Category" value="Best Practices"/> |
| </metadata> |
| </prolog> |
| <conbody> |
| <p> |
| The following guidelines represent best practices derived from testing and real-world experience with Impala on ADLS: |
| </p> |
| <ul> |
| <li> |
| <p> |
| Any reference to an ADLS location must be fully qualified. (This rule applies when |
| ADLS is not designated as the default filesystem.) |
| </p> |
| </li> |
| <li> |
| <p> |
| Set any appropriate configuration settings for <cmdname>impalad</cmdname>. |
| </p> |
| </li> |
| </ul> |
| |
| </conbody> |
| </concept> |
| |
| </concept> |