| <?xml version="1.0" encoding="UTF-8"?><!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="tutorial"> |
| |
| <title id="tutorials">Impala Tutorials</title> |
| <titlealts audience="PDF"><navtitle>Tutorials</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Tutorials"/> |
| <data name="Category" value="Getting Started"/> |
| <data name="Category" value="Querying"/> |
| <data name="Category" value="Tables"/> |
| <data name="Category" value="SQL"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Data Analysts"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| This section includes tutorial scenarios that demonstrate how to begin using Impala once the software is |
| installed. It focuses on techniques for loading data, because once you have some data in tables and can query |
| that data, you can quickly progress to more advanced Impala features. |
| </p> |
| |
| <note> |
| <p> |
| Where practical, the tutorials take you from <q>ground zero</q> to having the desired Impala tables and |
| data. In some cases, you might need to download additional files from outside sources, set up additional |
| software components, modify commands or scripts to fit your own configuration, or substitute your own |
| sample data. |
| </p> |
| </note> |
| |
| <p> |
| Before trying these tutorial lessons, install Impala using one of these procedures: |
| </p> |
| |
| <ul> |
| <li> |
| If you already have some <keyword keyref="hadoop_distro"/> environment set up and just need to add Impala to it, |
| follow the installation process described in <xref href="impala_install.xml#install"/>. Make sure to also install the Hive |
| metastore service if you do not already have Hive configured. |
| </li> |
| |
| </ul> |
| |
| <p outputclass="toc inpage"/> |
| </conbody> |
| |
| <concept id="tut_beginner"> |
| |
| <title>Tutorials for Getting Started</title> |
| |
| <conbody> |
| |
| <p> |
| These tutorials demonstrate the basics of using Impala. They are intended for first-time users, and for |
| trying out Impala on any new cluster to make sure the major components are working correctly. |
| </p> |
| |
| <p outputclass="toc inpage"/> |
| </conbody> |
| |
| <concept id="tutorial_explore"> |
| |
| <title>Explore a New Impala Instance</title> |
| |
| <conbody> |
| |
| <p> |
| This tutorial demonstrates techniques for finding your way around the tables and databases of an |
| unfamiliar (possibly empty) Impala instance. |
| </p> |
| |
| <p> |
| When you connect to an Impala instance for the first time, you use the <codeph>SHOW DATABASES</codeph> |
| and <codeph>SHOW TABLES</codeph> statements to view the most common types of objects. Also, call the |
| <codeph>version()</codeph> function to confirm which version of Impala you are running; the version |
| number is important when consulting documentation and dealing with support issues. |
| </p> |
| |
| <p> |
| A completely empty Impala instance contains no tables, but still has two databases: |
| </p> |
| |
| <ul> |
| <li> |
| <codeph>default</codeph>, where new tables are created when you do not specify any other database. |
| </li> |
| |
| <li> |
| <codeph>_impala_builtins</codeph>, a system database used to hold all the built-in functions. |
| </li> |
| </ul> |
| |
| <p> |
| The following example shows how to see the available databases, and the tables in each. If the list of |
| databases or tables is long, you can use wildcard notation to locate specific databases or tables based |
| on their names. |
| </p> |
| |
| <codeblock>$ impala-shell -i localhost --quiet |
| Starting Impala Shell without Kerberos authentication |
| Welcome to the Impala shell. Press TAB twice to see a list of available commands. |
| ... |
| <ph conref="../shared/ImpalaVariables.xml#impala_vars/ShellBanner"/> |
| [localhost:21000] > select version(); |
| +------------------------------------------- |
| | version() |
| +------------------------------------------- |
| | impalad version ... |
| | Built on ... |
| +------------------------------------------- |
| [localhost:21000] > show databases; |
| +--------------------------+ |
| | name | |
| +--------------------------+ |
| | _impala_builtins | |
| | ctas | |
| | d1 | |
| | d2 | |
| | d3 | |
| | default | |
| | explain_plans | |
| | external_table | |
| | file_formats | |
| | tpc | |
| +--------------------------+ |
| [localhost:21000] > select current_database(); |
| +--------------------+ |
| | current_database() | |
| +--------------------+ |
| | default | |
| +--------------------+ |
| [localhost:21000] > show tables; |
| +-------+ |
| | name | |
| +-------+ |
| | ex_t | |
| | t1 | |
| +-------+ |
| [localhost:21000] > show tables in d3; |
| |
| [localhost:21000] > show tables in tpc; |
| +------------------------+ |
| | name | |
| +------------------------+ |
| | city | |
| | customer | |
| | customer_address | |
| | customer_demographics | |
| | household_demographics | |
| | item | |
| | promotion | |
| | store | |
| | store2 | |
| | store_sales | |
| | ticket_view | |
| | time_dim | |
| | tpc_tables | |
| +------------------------+ |
| [localhost:21000] > show tables in tpc like 'customer*'; |
| +-----------------------+ |
| | name | |
| +-----------------------+ |
| | customer | |
| | customer_address | |
| | customer_demographics | |
| +-----------------------+ |
| </codeblock> |
| |
| <p> |
| Once you know what tables and databases are available, you descend into a database with the |
| <codeph>USE</codeph> statement. To understand the structure of each table, you use the |
| <codeph>DESCRIBE</codeph> command. Once inside a database, you can issue statements such as |
| <codeph>INSERT</codeph> and <codeph>SELECT</codeph> that operate on particular tables. |
| </p> |
| |
| <p> |
| The following example explores a database named <codeph>TPC</codeph> whose name we learned in the |
| previous example. It shows how to filter the table names within a database based on a search string, |
| examine the columns of a table, and run queries to examine the characteristics of the table data. For |
| example, for an unfamiliar table you might want to know the number of rows, the number of different |
| values for a column, and other properties such as whether the column contains any <codeph>NULL</codeph> |
| values. When sampling the actual data values from a table, use a <codeph>LIMIT</codeph> clause to avoid |
| excessive output if the table contains more rows or distinct values than you expect. |
| </p> |
| |
| <codeblock>[localhost:21000] > use tpc; |
| [localhost:21000] > show tables like '*view*'; |
| +-------------+ |
| | name | |
| +-------------+ |
| | ticket_view | |
| +-------------+ |
| [localhost:21000] > describe city; |
| +-------------+--------+---------+ |
| | name | type | comment | |
| +-------------+--------+---------+ |
| | id | int | | |
| | name | string | | |
| | countrycode | string | | |
| | district | string | | |
| | population | int | | |
| +-------------+--------+---------+ |
| [localhost:21000] > select count(*) from city; |
| +----------+ |
| | count(*) | |
| +----------+ |
| | 0 | |
| +----------+ |
| [localhost:21000] > desc customer; |
| +------------------------+--------+---------+ |
| | name | type | comment | |
| +------------------------+--------+---------+ |
| | c_customer_sk | int | | |
| | c_customer_id | string | | |
| | c_current_cdemo_sk | int | | |
| | c_current_hdemo_sk | int | | |
| | c_current_addr_sk | int | | |
| | c_first_shipto_date_sk | int | | |
| | c_first_sales_date_sk | int | | |
| | c_salutation | string | | |
| | c_first_name | string | | |
| | c_last_name | string | | |
| | c_preferred_cust_flag | string | | |
| | c_birth_day | int | | |
| | c_birth_month | int | | |
| | c_birth_year | int | | |
| | c_birth_country | string | | |
| | c_login | string | | |
| | c_email_address | string | | |
| | c_last_review_date | string | | |
| +------------------------+--------+---------+ |
| [localhost:21000] > select count(*) from customer; |
| +----------+ |
| | count(*) | |
| +----------+ |
| | 100000 | |
| +----------+ |
| [localhost:21000] > select count(distinct c_birth_month) from customer; |
| +-------------------------------+ |
| | count(distinct c_birth_month) | |
| +-------------------------------+ |
| | 12 | |
| +-------------------------------+ |
| [localhost:21000] > select count(*) from customer where c_email_address is null; |
| +----------+ |
| | count(*) | |
| +----------+ |
| | 0 | |
| +----------+ |
| [localhost:21000] > select distinct c_salutation from customer limit 10; |
| +--------------+ |
| | c_salutation | |
| +--------------+ |
| | Mr. | |
| | Ms. | |
| | Dr. | |
| | | |
| | Miss | |
| | Sir | |
| | Mrs. | |
| +--------------+ |
| </codeblock> |
| |
| <p> |
| When you graduate from read-only exploration, you use statements such as <codeph>CREATE DATABASE</codeph> |
| and <codeph>CREATE TABLE</codeph> to set up your own database objects. |
| </p> |
| |
| <p> |
| The following example demonstrates creating a new database holding a new table. Although the last example |
| ended inside the <codeph>TPC</codeph> database, the new <codeph>EXPERIMENTS</codeph> database is not |
| nested inside <codeph>TPC</codeph>; all databases are arranged in a single top-level list. |
| </p> |
| |
| <codeblock>[localhost:21000] > create database experiments; |
| [localhost:21000] > show databases; |
| +--------------------------+ |
| | name | |
| +--------------------------+ |
| | _impala_builtins | |
| | ctas | |
| | d1 | |
| | d2 | |
| | d3 | |
| | default | |
| | experiments | |
| | explain_plans | |
| | external_table | |
| | file_formats | |
| | tpc | |
| +--------------------------+ |
| [localhost:21000] > show databases like 'exp*'; |
| +---------------+ |
| | name | |
| +---------------+ |
| | experiments | |
| | explain_plans | |
| +---------------+ |
| </codeblock> |
| |
| <p> |
| The following example creates a new table, <codeph>T1</codeph>. To illustrate a common mistake, it creates this table inside |
| the wrong database, the <codeph>TPC</codeph> database where the previous example ended. The <codeph>ALTER |
| TABLE</codeph> statement lets you move the table to the intended database, <codeph>EXPERIMENTS</codeph>, as part of a rename operation. |
| The <codeph>USE</codeph> statement is always needed to switch to a new database, and the |
| <codeph>current_database()</codeph> function confirms which database the session is in, to avoid these |
| kinds of mistakes. |
| </p> |
| |
| <codeblock>[localhost:21000] > create table t1 (x int); |
| |
| [localhost:21000] > show tables; |
| +------------------------+ |
| | name | |
| +------------------------+ |
| | city | |
| | customer | |
| | customer_address | |
| | customer_demographics | |
| | household_demographics | |
| | item | |
| | promotion | |
| | store | |
| | store2 | |
| | store_sales | |
| | t1 | |
| | ticket_view | |
| | time_dim | |
| | tpc_tables | |
| +------------------------+ |
| [localhost:21000] > select current_database(); |
| +--------------------+ |
| | current_database() | |
| +--------------------+ |
| | tpc | |
| +--------------------+ |
| [localhost:21000] > alter table t1 rename to experiments.t1; |
| [localhost:21000] > use experiments; |
| [localhost:21000] > show tables; |
| +------+ |
| | name | |
| +------+ |
| | t1 | |
| +------+ |
| [localhost:21000] > select current_database(); |
| +--------------------+ |
| | current_database() | |
| +--------------------+ |
| | experiments | |
| +--------------------+ |
| </codeblock> |
| |
| <p> |
| For your initial experiments with tables, you can use ones with just a few columns and a few rows, and |
| text-format data files. |
| </p> |
| |
| <note> |
| As you graduate to more realistic scenarios, you will use more elaborate tables with many columns, |
| features such as partitioning, and file formats such as Parquet. When dealing with realistic data |
| volumes, you will bring in data using <codeph>LOAD DATA</codeph> or <codeph>INSERT ... SELECT</codeph> |
| statements to operate on millions or billions of rows at once. |
| </note> |
| |
| <p> |
| The following example sets up a couple of simple tables with a few rows, and performs queries involving |
| sorting, aggregate functions and joins. |
| </p> |
| |
| <codeblock>[localhost:21000] > insert into t1 values (1), (3), (2), (4); |
| [localhost:21000] > select x from t1 order by x desc; |
| +---+ |
| | x | |
| +---+ |
| | 4 | |
| | 3 | |
| | 2 | |
| | 1 | |
| +---+ |
| [localhost:21000] > select min(x), max(x), sum(x), avg(x) from t1; |
| +--------+--------+--------+--------+ |
| | min(x) | max(x) | sum(x) | avg(x) | |
| +--------+--------+--------+--------+ |
| | 1 | 4 | 10 | 2.5 | |
| +--------+--------+--------+--------+ |
| |
| [localhost:21000] > create table t2 (id int, word string); |
| [localhost:21000] > insert into t2 values (1, "one"), (3, "three"), (5, 'five'); |
| [localhost:21000] > select word from t1 join t2 on (t1.x = t2.id); |
| +-------+ |
| | word | |
| +-------+ |
| | one | |
| | three | |
| +-------+ |
| </codeblock> |
| |
| <p> |
| After completing this tutorial, you should now know: |
| </p> |
| |
| <ul> |
| <li> |
| How to tell which version of Impala is running on your system. |
| </li> |
| |
| <li> |
| How to find the names of databases in an Impala instance, either displaying the full list or |
| searching for specific names. |
| </li> |
| |
| <li> |
| How to find the names of tables in an Impala database, either displaying the full list or |
| searching for specific names. |
| </li> |
| |
| <li> |
| How to switch between databases and check which database you are currently in. |
| </li> |
| |
| <li> |
| How to learn the column names and types of a table. |
| </li> |
| |
| <li> |
| How to create databases and tables, insert small amounts of test data, and run simple queries. |
| </li> |
| </ul> |
| </conbody> |
| </concept> |
| |
| <concept audience="hidden" id="tutorial_hdfs"> |
| |
| <title>Exploring the HDFS Directory Structure</title> |
| |
| <conbody> |
| |
| <p> |
| This tutorial scenario illustrates the HDFS directory structures that correspond to various |
| Impala databases, tables, and partitions. It also shows how data directories are shared between |
| Impala and Hive, because of the shared metastore database. |
| </p> |
| </conbody> |
| </concept> |
| |
| <concept audience="hidden" id="tutorial_external_table"> |
| |
| <title/> |
| |
| <conbody> |
| |
| <p> |
| In this tutorial scenario, you create a simple text-format data file in HDFS and then define an Impala |
| table that refers to the data in its original location. |
| </p> |
| </conbody> |
| </concept> |
| |
| <concept id="tutorial_csv_setup"> |
| |
| <title>Load CSV Data from Local Files</title> |
| |
| <conbody> |
| |
| <p> |
| This scenario illustrates how to create some very small tables, suitable for first-time users to |
| experiment with Impala SQL features. <codeph>TAB1</codeph> and <codeph>TAB2</codeph> are loaded with data |
| from files in HDFS. A subset of data is copied from <codeph>TAB1</codeph> into <codeph>TAB3</codeph>. |
| </p> |
| |
| <p> |
| Populate HDFS with the data you want to query. To begin this process, create one or more new |
| subdirectories underneath your user directory in HDFS. The data for each table resides in a separate |
| subdirectory. Substitute your own username for <codeph>username</codeph> where appropriate. This example |
| uses the <codeph>-p</codeph> option with the <codeph>mkdir</codeph> operation to create any necessary |
| parent directories if they do not already exist. |
| </p> |
| |
| <codeblock>$ whoami |
| username |
| $ hdfs dfs -ls /user |
| Found 3 items |
| drwxr-xr-x - username username 0 2013-04-22 18:54 /user/username |
| drwxrwx--- - mapred mapred 0 2013-03-15 20:11 /user/history |
| drwxr-xr-x - hue supergroup 0 2013-03-15 20:10 /user/hive |
| <!-- $ hdfs dfs -mkdir -p /user/username/sample_data/tab1 --> |
| $ hdfs dfs -mkdir -p /user/username/sample_data/tab1 /user/username/sample_data/tab2</codeblock> |
| |
| <p> |
| Here is some sample data, for two tables named <codeph>TAB1</codeph> and <codeph>TAB2</codeph>. |
| </p> |
| |
| <p> |
| Copy the following content to <codeph>.csv</codeph> files in your local filesystem: |
| </p> |
| |
| <p> |
| <filepath>tab1.csv</filepath>: |
| </p> |
| |
| <codeblock>1,true,123.123,2012-10-24 08:55:00 |
| 2,false,1243.5,2012-10-25 13:40:00 |
| 3,false,24453.325,2008-08-22 09:33:21.123 |
| 4,false,243423.325,2007-05-12 22:32:21.33454 |
| 5,true,243.325,1953-04-22 09:11:33 |
| </codeblock> |
| |
| <p> |
| <filepath>tab2.csv</filepath>: |
| </p> |
| |
| <codeblock>1,true,12789.123 |
| 2,false,1243.5 |
| 3,false,24453.325 |
| 4,false,2423.3254 |
| 5,true,243.325 |
| 60,false,243565423.325 |
| 70,true,243.325 |
| 80,false,243423.325 |
| 90,true,243.325 |
| </codeblock> |
| |
| <p> |
| Put each <codeph>.csv</codeph> file into a separate HDFS directory using commands like the following, |
| which use paths available in the Impala Demo VM: |
| </p> |
| |
| <codeblock><!-- $ hdfs dfs -mkdir /user/username/tab1 -->$ hdfs dfs -put tab1.csv /user/username/sample_data/tab1 |
| $ hdfs dfs -ls /user/username/sample_data/tab1 |
| Found 1 items |
| -rw-r--r-- 1 username username 192 2013-04-02 20:08 /user/username/sample_data/tab1/tab1.csv |
| |
| <!-- $ hdfs dfs -mkdir /user/username/tab2 --> |
| $ hdfs dfs -put tab2.csv /user/username/sample_data/tab2 |
| $ hdfs dfs -ls /user/username/sample_data/tab2 |
| Found 1 items |
| -rw-r--r-- 1 username username 158 2013-04-02 20:09 /user/username/sample_data/tab2/tab2.csv |
| </codeblock> |
| |
| <p> |
| The name of each data file is not significant. In fact, when Impala examines the contents of the data |
| directory for the first time, it considers all files in the directory to make up the data of the table, |
| regardless of how many files there are or what the files are named. |
| </p> |
| |
| <p> |
| To understand what paths are available within your own HDFS filesystem and what the permissions are for |
| the various directories and files, issue <codeph>hdfs dfs -ls /</codeph> and work your way down the tree |
| doing <codeph>-ls</codeph> operations for the various directories. |
| </p> |
| |
| <p> |
| Use the <codeph>impala-shell</codeph> command to create tables, either interactively or through a SQL |
| script. |
| </p> |
| |
| <p> |
| The following example shows creating three tables. For each table, the example shows creating columns |
| with various attributes such as Boolean or integer types. The example also includes commands that provide |
| information about how the data is formatted, such as rows terminating with commas, which makes sense in |
| the case of importing data from a <codeph>.csv</codeph> file. Where we already have <codeph>.csv</codeph> |
| files containing data in the HDFS directory tree, we specify the location of the directory containing the |
| appropriate <codeph>.csv</codeph> file. Impala considers all the data from all the files in that |
| directory to represent the data for the table. |
| </p> |
| |
| <codeblock>DROP TABLE IF EXISTS tab1; |
| -- The EXTERNAL clause means the data is located outside the central location |
| -- for Impala data files and is preserved when the associated Impala table is dropped. |
| -- We expect the data to already exist in the directory specified by the LOCATION clause. |
| CREATE EXTERNAL TABLE tab1 |
| ( |
| id INT, |
| col_1 BOOLEAN, |
| col_2 DOUBLE, |
| col_3 TIMESTAMP |
| ) |
| ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' |
| LOCATION '/user/username/sample_data/tab1'; |
| |
| DROP TABLE IF EXISTS tab2; |
| -- TAB2 is an external table, similar to TAB1. |
| CREATE EXTERNAL TABLE tab2 |
| ( |
| id INT, |
| col_1 BOOLEAN, |
| col_2 DOUBLE |
| ) |
| ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' |
| LOCATION '/user/username/sample_data/tab2'; |
| |
| DROP TABLE IF EXISTS tab3; |
| -- Leaving out the EXTERNAL clause means the data will be managed |
| -- in the central Impala data directory tree. Rather than reading |
| -- existing data files when the table is created, we load the |
| -- data after creating the table. |
| CREATE TABLE tab3 |
| ( |
| id INT, |
| col_1 BOOLEAN, |
| col_2 DOUBLE, |
| month INT, |
| day INT |
| ) |
| ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; |
| </codeblock> |
| |
| <note> |
| Getting through these <codeph>CREATE TABLE</codeph> statements successfully is an important validation |
| step to confirm everything is configured correctly with the Hive metastore and HDFS permissions. If you |
| receive any errors during the <codeph>CREATE TABLE</codeph> statements: |
| <ul> |
| <li> |
| Make sure you followed the installation instructions closely, in |
| <xref href="impala_install.xml#install"/>. |
| </li> |
| |
| <li> |
| Make sure the <codeph>hive.metastore.warehouse.dir</codeph> property points to a directory that |
| Impala can write to. The ownership should be <codeph>hive:hive</codeph>, and the |
| <codeph>impala</codeph> user should also be a member of the <codeph>hive</codeph> group. |
| </li> |
| </ul> |
| </note> |
| </conbody> |
| </concept> |
| |
| <concept id="tutorial_create_table"> |
| |
| <title>Point an Impala Table at Existing Data Files</title> |
| |
| <conbody> |
| |
| <p> |
| A convenient way to set up data for Impala to access is to use an external table, where the data already |
| exists in a set of HDFS files and you just point the Impala table at the directory containing those |
| files. For example, you might run in <codeph>impala-shell</codeph> a <codeph>*.sql</codeph> file with |
| contents similar to the following, to create an Impala table that accesses an existing data file used by |
| Hive. |
| </p> |
| |
| <p> |
| The following examples set up 2 tables, referencing the paths and sample data from the sample TPC-DS kit for Impala. |
| For historical reasons, the data physically resides in an HDFS directory tree under |
| <filepath>/user/hive</filepath>, although this particular data is entirely managed by Impala rather than |
| Hive. When we create an external table, we specify the directory containing one or more data files, and |
| Impala queries the combined content of all the files inside that directory. Here is how we examine the |
| directories and files within the HDFS filesystem: |
| </p> |
| |
| <codeblock>$ cd ~/username/datasets |
| $ ./tpcds-setup.sh |
| ... Downloads and unzips the kit, builds the data and loads it into HDFS ... |
| $ hdfs dfs -ls /user/hive/tpcds/customer |
| Found 1 items |
| -rw-r--r-- 1 username supergroup 13209372 2013-03-22 18:09 /user/hive/tpcds/customer/customer.dat |
| $ hdfs dfs -cat /user/hive/tpcds/customer/customer.dat | more |
| 1|AAAAAAAABAAAAAAA|980124|7135|32946|2452238|2452208|Mr.|Javier|Lewis|Y|9|12|1936|CHILE||Javie |
| r.Lewis@VFAxlnZEvOx.org|2452508| |
| 2|AAAAAAAACAAAAAAA|819667|1461|31655|2452318|2452288|Dr.|Amy|Moses|Y|9|4|1966|TOGO||Amy.Moses@ |
| Ovk9KjHH.com|2452318| |
| 3|AAAAAAAADAAAAAAA|1473522|6247|48572|2449130|2449100|Miss|Latisha|Hamilton|N|18|9|1979|NIUE|| |
| Latisha.Hamilton@V.com|2452313| |
| 4|AAAAAAAAEAAAAAAA|1703214|3986|39558|2450030|2450000|Dr.|Michael|White|N|7|6|1983|MEXICO||Mic |
| hael.White@i.org|2452361| |
| 5|AAAAAAAAFAAAAAAA|953372|4470|36368|2449438|2449408|Sir|Robert|Moran|N|8|5|1956|FIJI||Robert. |
| Moran@Hh.edu|2452469| |
| ... |
| </codeblock> |
| |
| <p> |
| Here is a SQL script to set up Impala tables pointing to some of these data files in HDFS. |
| (The script in the VM sets up tables like this through Hive; ignore those tables |
| for purposes of this demonstration.) |
| Save the following as <filepath>customer_setup.sql</filepath>: |
| </p> |
| |
| <codeblock>-- |
| -- store_sales fact table and surrounding dimension tables only |
| -- |
| create database tpcds; |
| use tpcds; |
| |
| drop table if exists customer; |
| create external table customer |
| ( |
| c_customer_sk int, |
| c_customer_id string, |
| c_current_cdemo_sk int, |
| c_current_hdemo_sk int, |
| c_current_addr_sk int, |
| c_first_shipto_date_sk int, |
| c_first_sales_date_sk int, |
| c_salutation string, |
| c_first_name string, |
| c_last_name string, |
| c_preferred_cust_flag string, |
| c_birth_day int, |
| c_birth_month int, |
| c_birth_year int, |
| c_birth_country string, |
| c_login string, |
| c_email_address string, |
| c_last_review_date string |
| ) |
| row format delimited fields terminated by '|' |
| location '/user/hive/tpcds/customer'; |
| |
| drop table if exists customer_address; |
| create external table customer_address |
| ( |
| ca_address_sk int, |
| ca_address_id string, |
| ca_street_number string, |
| ca_street_name string, |
| ca_street_type string, |
| ca_suite_number string, |
| ca_city string, |
| ca_county string, |
| ca_state string, |
| ca_zip string, |
| ca_country string, |
| ca_gmt_offset float, |
| ca_location_type string |
| ) |
| row format delimited fields terminated by '|' |
| location '/user/hive/tpcds/customer_address'; |
| </codeblock> |
| |
| <p> |
| We would run this script with a command such as: |
| <codeblock>impala-shell -i localhost -f customer_setup.sql</codeblock> |
| </p> |
| |
| </conbody> |
| </concept> |
| |
| <concept id="tutorial_describe_impala"> |
| |
| <title>Describe the Impala Table</title> |
| |
| <conbody> |
| |
| <p> |
| Now that you have updated the database metadata that Impala caches, you can confirm that the expected |
| tables are accessible by Impala and examine the attributes of one of the tables. We created these tables |
| in the database named <codeph>default</codeph>. If the tables were in a database other than the default, |
| we would issue a command <codeph>use <varname>db_name</varname> </codeph> to switch to that database |
| before examining or querying its tables. We could also qualify the name of a table by prepending the |
| database name, for example <codeph>default.customer</codeph> and <codeph>default.customer_name</codeph>. |
| </p> |
| |
| <codeblock>[impala-host:21000] > show databases |
| Query finished, fetching results ... |
| default |
| Returned 1 row(s) in 0.00s |
| [impala-host:21000] > show tables |
| Query finished, fetching results ... |
| customer |
| customer_address |
| Returned 2 row(s) in 0.00s |
| [impala-host:21000] > describe customer_address |
| +------------------+--------+---------+ |
| | name | type | comment | |
| +------------------+--------+---------+ |
| | ca_address_sk | int | | |
| | ca_address_id | string | | |
| | ca_street_number | string | | |
| | ca_street_name | string | | |
| | ca_street_type | string | | |
| | ca_suite_number | string | | |
| | ca_city | string | | |
| | ca_county | string | | |
| | ca_state | string | | |
| | ca_zip | string | | |
| | ca_country | string | | |
| | ca_gmt_offset | float | | |
| | ca_location_type | string | | |
| +------------------+--------+---------+ |
| Returned 13 row(s) in 0.01 |
| </codeblock> |
| </conbody> |
| </concept> |
| |
| <concept id="tutorial_query_impala"> |
| |
| <title>Query the Impala Table</title> |
| |
| <conbody> |
| |
| <p> |
| You can query data contained in the tables. Impala coordinates the query execution across a single node |
| or multiple nodes depending on your configuration, without the overhead of running MapReduce jobs to |
| perform the intermediate processing. |
| </p> |
| |
| <p> |
| There are a variety of ways to execute queries on Impala: |
| </p> |
| |
| <ul> |
| <li> |
| Using the <codeph>impala-shell</codeph> command in interactive mode: |
| <codeblock>$ impala-shell -i impala-host |
| Connected to localhost:21000 |
| [impala-host:21000] > select count(*) from customer_address; |
| 50000 |
| Returned 1 row(s) in 0.37s |
| </codeblock> |
| </li> |
| |
| <li> |
| Passing a set of commands contained in a file: |
| <codeblock>$ impala-shell -i impala-host -f myquery.sql |
| Connected to localhost:21000 |
| 50000 |
| Returned 1 row(s) in 0.19s</codeblock> |
| </li> |
| |
| <li> |
| Passing a single command to the <codeph>impala-shell</codeph> command. The query is executed, the |
| results are returned, and the shell exits. Make sure to quote the command, preferably with single |
| quotation marks to avoid shell expansion of characters such as <codeph>*</codeph>. |
| <codeblock>$ impala-shell -i impala-host -q 'select count(*) from customer_address' |
| Connected to localhost:21000 |
| 50000 |
| Returned 1 row(s) in 0.29s</codeblock> |
| </li> |
| </ul> |
| </conbody> |
| </concept> |
| |
| <concept id="tutorial_etl"> |
| |
| <title>Data Loading and Querying Examples</title> |
| |
| <conbody> |
| |
| <p> |
| This section describes how to create some sample tables and load data into them. These tables can then be |
| queried using the Impala shell. |
| </p> |
| </conbody> |
| |
| <concept id="tutorial_loading"> |
| |
| <title>Loading Data</title> |
| |
| <conbody> |
| |
| <p> |
| Loading data involves: |
| </p> |
| |
| <ul> |
| <li> |
| Establishing a data set. The example below uses <codeph>.csv</codeph> files. |
| </li> |
| |
| <li> |
| Creating tables to which to load data. |
| </li> |
| |
| <li> |
| Loading the data into the tables you created. |
| </li> |
| </ul> |
| |
| <!-- |
| <section id="tut_hdfs_perms"> |
| |
| <title> |
| Modifying Directory Permissions |
| </title> |
| |
| <p> |
| Change permission settings so Hive and Impala are both able read and write to the Hive warehouse directory. |
| This process involves changing permissions for the root warehouse directory, then modifying the client |
| <codeph>hive-site.xml</codeph> file so newly created files inherit these permissions. |
| </p> |
| |
| <p> |
| Begin by modifying the file permissions. For example, if the warehouse directory is |
| <codeph>/user/hive/warehouse</codeph>, issue the following command: |
| </p> |
| |
| <codeblock>hdfs dfs -chmod -R 775 /user/hive/warehouse</codeblock> |
| <p> |
| Add the following property to the Hive client <codeph>hive-site.xml</codeph> file: |
| </p> |
| |
| <codeblock><property> |
| <name>hive.warehouse.subdir.inherit.perms</name> |
| <value>true</value> |
| <description>Set this to true if the table directories should inherit the |
| permission of the warehouse or database directory instead of being created |
| with the permissions derived from dfs umask</description> |
| </property></codeblock> |
| |
| </section> |
| --> |
| |
| <!-- To do: This is from old days when Impala didn't support LOAD DATA; re-do to show in impala-shell. |
| <section id="tut_loading_csv"> |
| |
| <title> |
| Loading .csv Data into Tables |
| </title> |
| |
| <p> |
| In the Hive shell, load data into TAB1 and TAB2: |
| </p> |
| |
| |
| <codeblock>LOAD DATA LOCAL INPATH 'tab1.csv' OVERWRITE INTO TABLE tab1; |
| LOAD DATA LOCAL INPATH 'tab2.csv' OVERWRITE INTO TABLE tab2;</codeblock> |
| |
| </section> |
| --> |
| |
| <!-- |
| <section id="tut_updating_metadata"> |
| |
| <title> |
| Updating Metadata |
| </title> |
| |
| <p> |
| While you have created new tables and loaded data, Impala does not have information about these new tables. To |
| enable Impala to work with the data in these new tables, connect to an Impala host and then refresh the Impala |
| metadata. For example, if you had an Impala host named <codeph>impala-host</codeph> using the default |
| port, you would issue the following command: |
| </p> |
| |
| <codeblock>$ impala-shell |
| [Not connected] > connect impala-host |
| [impala-host:21000] > refresh</codeblock> |
| |
| </section> |
| --> |
| </conbody> |
| </concept> |
| |
| <concept id="tutorial_queries"> |
| |
| <title>Sample Queries</title> |
| |
| <conbody> |
| |
| <p> |
| To run these sample queries, create a SQL query file <codeph>query.sql</codeph>, copy and paste each |
| query into the query file, and then run the query file using the shell. For example, to run |
| <codeph>query.sql</codeph> on <codeph>impala-host</codeph>, you might use the command: |
| </p> |
| |
| <codeblock>impala-shell.sh -i impala-host -f query.sql</codeblock> |
| |
| <p> |
| The examples and results below assume you have loaded the sample data into the tables as described |
| above. |
| </p> |
| |
| <example> |
| |
| <title>Example: Examining Contents of Tables</title> |
| |
| <p> |
| Let's start by verifying that the tables do contain the data we expect. Because Impala often deals |
| with tables containing millions or billions of rows, when examining tables of unknown size, include |
| the <codeph>LIMIT</codeph> clause to avoid huge amounts of unnecessary output, as in the final query. |
| (If your interactive query starts displaying an unexpected volume of data, press |
| <codeph>Ctrl-C</codeph> in <codeph>impala-shell</codeph> to cancel the query.) |
| </p> |
| |
| <codeblock>SELECT * FROM tab1; |
| SELECT * FROM tab2; |
| SELECT * FROM tab2 LIMIT 5;</codeblock> |
| |
| <p> |
| Results: |
| </p> |
| |
| <codeblock>+----+-------+------------+-------------------------------+ |
| | id | col_1 | col_2 | col_3 | |
| +----+-------+------------+-------------------------------+ |
| | 1 | true | 123.123 | 2012-10-24 08:55:00 | |
| | 2 | false | 1243.5 | 2012-10-25 13:40:00 | |
| | 3 | false | 24453.325 | 2008-08-22 09:33:21.123000000 | |
| | 4 | false | 243423.325 | 2007-05-12 22:32:21.334540000 | |
| | 5 | true | 243.325 | 1953-04-22 09:11:33 | |
| +----+-------+------------+-------------------------------+ |
| |
| +----+-------+---------------+ |
| | id | col_1 | col_2 | |
| +----+-------+---------------+ |
| | 1 | true | 12789.123 | |
| | 2 | false | 1243.5 | |
| | 3 | false | 24453.325 | |
| | 4 | false | 2423.3254 | |
| | 5 | true | 243.325 | |
| | 60 | false | 243565423.325 | |
| | 70 | true | 243.325 | |
| | 80 | false | 243423.325 | |
| | 90 | true | 243.325 | |
| +----+-------+---------------+ |
| |
| +----+-------+-----------+ |
| | id | col_1 | col_2 | |
| +----+-------+-----------+ |
| | 1 | true | 12789.123 | |
| | 2 | false | 1243.5 | |
| | 3 | false | 24453.325 | |
| | 4 | false | 2423.3254 | |
| | 5 | true | 243.325 | |
| +----+-------+-----------+</codeblock> |
| |
| </example> |
| |
| <example> |
| |
| <title>Example: Aggregate and Join</title> |
| |
| <codeblock>SELECT tab1.col_1, MAX(tab2.col_2), MIN(tab2.col_2) |
| FROM tab2 JOIN tab1 USING (id) |
| GROUP BY col_1 ORDER BY 1 LIMIT 5;</codeblock> |
| |
| <p> |
| Results: |
| </p> |
| |
| <codeblock>+-------+-----------------+-----------------+ |
| | col_1 | max(tab2.col_2) | min(tab2.col_2) | |
| +-------+-----------------+-----------------+ |
| | false | 24453.325 | 1243.5 | |
| | true | 12789.123 | 243.325 | |
| +-------+-----------------+-----------------+</codeblock> |
| |
| </example> |
| |
| <example> |
| |
| <title>Example: Subquery, Aggregate and Joins</title> |
| |
| <codeblock>SELECT tab2.* |
| FROM tab2, |
| (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2 |
| FROM tab2, tab1 |
| WHERE tab1.id = tab2.id |
| GROUP BY col_1) subquery1 |
| WHERE subquery1.max_col2 = tab2.col_2;</codeblock> |
| |
| <p> |
| Results: |
| </p> |
| |
| <codeblock>+----+-------+-----------+ |
| | id | col_1 | col_2 | |
| +----+-------+-----------+ |
| | 1 | true | 12789.123 | |
| | 3 | false | 24453.325 | |
| +----+-------+-----------+</codeblock> |
| |
| </example> |
| |
| <example> |
| |
| <title>Example: INSERT Query</title> |
| |
| <codeblock>INSERT OVERWRITE TABLE tab3 |
| SELECT id, col_1, col_2, MONTH(col_3), DAYOFMONTH(col_3) |
| FROM tab1 WHERE YEAR(col_3) = 2012;</codeblock> |
| |
| <p> |
| Query <codeph>TAB3</codeph> to check the result: |
| </p> |
| |
| <codeblock>SELECT * FROM tab3; |
| </codeblock> |
| |
| <p> |
| Results: |
| </p> |
| |
| <codeblock>+----+-------+---------+-------+-----+ |
| | id | col_1 | col_2 | month | day | |
| +----+-------+---------+-------+-----+ |
| | 1 | true | 123.123 | 10 | 24 | |
| | 2 | false | 1243.5 | 10 | 25 | |
| +----+-------+---------+-------+-----+</codeblock> |
| |
| </example> |
| </conbody> |
| </concept> |
| </concept> |
| </concept> |
| |
| <concept id="tut_advanced"> |
| |
| <title>Advanced Tutorials</title> |
| |
| <conbody> |
| |
| <p> |
| These tutorials walk you through advanced scenarios or specialized features. |
| </p> |
| |
| <p outputclass="toc inpage"/> |
| </conbody> |
| |
| <concept id="tut_external_partition_data"> |
| |
| <title>Attaching an External Partitioned Table to an HDFS Directory Structure</title> |
| |
| <conbody> |
| |
| <p> |
| This tutorial shows how you might set up a directory tree in HDFS, put data files into the lowest-level |
| subdirectories, and then use an Impala external table to query the data files from their original |
| locations. |
| </p> |
| |
| <p> |
| The tutorial uses a table with web log data, with separate subdirectories for the year, month, day, and |
| host. For simplicity, we use a tiny amount of CSV data, loading the same data into each partition. |
| </p> |
| |
| <p> |
| First, we make an Impala partitioned table for CSV data, and look at the underlying HDFS directory |
| structure to understand the directory structure to re-create elsewhere in HDFS. The columns |
| <codeph>field1</codeph>, <codeph>field2</codeph>, and <codeph>field3</codeph> correspond to the contents |
| of the CSV data files. The <codeph>year</codeph>, <codeph>month</codeph>, <codeph>day</codeph>, and |
| <codeph>host</codeph> columns are all represented as subdirectories within the table structure, and are |
| not part of the CSV files. We use <codeph>STRING</codeph> for each of these columns so that we can |
| produce consistent subdirectory names, with leading zeros for a consistent length. |
| </p> |
| |
| <codeblock>create database external_partitions; |
| use external_partitions; |
| create table logs (field1 string, field2 string, field3 string) |
| partitioned by (year string, month string , day string, host string) |
| row format delimited fields terminated by ','; |
| insert into logs partition (year="2013", month="07", day="28", host="host1") values ("foo","foo","foo"); |
| insert into logs partition (year="2013", month="07", day="28", host="host2") values ("foo","foo","foo"); |
| insert into logs partition (year="2013", month="07", day="29", host="host1") values ("foo","foo","foo"); |
| insert into logs partition (year="2013", month="07", day="29", host="host2") values ("foo","foo","foo"); |
| insert into logs partition (year="2013", month="08", day="01", host="host1") values ("foo","foo","foo"); |
| </codeblock> |
| |
| <p> |
| Back in the Linux shell, we examine the HDFS directory structure. (Your Impala data directory might be in |
| a different location; for historical reasons, it is sometimes under the HDFS path |
| <filepath>/user/hive/warehouse</filepath>.) We use the <codeph>hdfs dfs -ls</codeph> command to examine |
| the nested subdirectories corresponding to each partitioning column, with separate subdirectories at each |
| level (with <codeph>=</codeph> in their names) representing the different values for each partitioning |
| column. When we get to the lowest level of subdirectory, we use the <codeph>hdfs dfs -cat</codeph> |
| command to examine the data file and see CSV-formatted data produced by the <codeph>INSERT</codeph> |
| statement in Impala. |
| </p> |
| |
| <codeblock>$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db |
| Found 1 items |
| drwxrwxrwt - impala hive 0 2013-08-07 12:24 /user/impala/warehouse/external_partitions.db/logs |
| $ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs |
| Found 1 items |
| drwxr-xr-x - impala hive 0 2013-08-07 12:24 /user/impala/warehouse/external_partitions.db/logs/year=2013 |
| $ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013 |
| Found 2 items |
| drwxr-xr-x - impala hive 0 2013-08-07 12:23 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07 |
| drwxr-xr-x - impala hive 0 2013-08-07 12:24 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=08 |
| $ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07 |
| Found 2 items |
| drwxr-xr-x - impala hive 0 2013-08-07 12:22 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28 |
| drwxr-xr-x - impala hive 0 2013-08-07 12:23 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=29 |
| $ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28 |
| Found 2 items |
| drwxr-xr-x - impala hive 0 2013-08-07 12:21 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host1 |
| drwxr-xr-x - impala hive 0 2013-08-07 12:22 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host2 |
| $ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host1 |
| Found 1 items |
| -rw-r--r-- 3 impala hive 12 2013-08-07 12:21 /user/impala/warehouse/external_partiti |
| ons.db/logs/year=2013/month=07/day=28/host=host1/3981726974111751120--8907184999369517436_822630111_data.0 |
| $ hdfs dfs -cat /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/\ |
| host=host1/3981726974111751120--8 907184999369517436_822630111_data.0 |
| foo,foo,foo |
| </codeblock> |
| |
| <p> |
| Still in the Linux shell, we use <codeph>hdfs dfs -mkdir</codeph> to create several data directories |
| outside the HDFS directory tree that Impala controls (<filepath>/user/impala/warehouse</filepath> in this |
| example, maybe different in your case). Depending on your configuration, you might need to log in as a |
| user with permission to write into this HDFS directory tree; for example, the commands shown here were |
| run while logged in as the <codeph>hdfs</codeph> user. |
| </p> |
| |
| <codeblock>$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=28/host=host1 |
| $ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=28/host=host2 |
| $ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=28/host=host1 |
| $ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=29/host=host1 |
| $ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=08/day=01/host=host1 |
| </codeblock> |
| |
| <p> |
| We make a tiny CSV file, with values different than in the <codeph>INSERT</codeph> statements used |
| earlier, and put a copy within each subdirectory that we will use as an Impala partition. |
| </p> |
| |
| <codeblock>$ cat >dummy_log_data |
| bar,baz,bletch |
| $ hdfs dfs -mkdir -p /user/impala/data/external_partitions/year=2013/month=08/day=01/host=host1 |
| $ hdfs dfs -mkdir -p /user/impala/data/external_partitions/year=2013/month=07/day=28/host=host1 |
| $ hdfs dfs -mkdir -p /user/impala/data/external_partitions/year=2013/month=07/day=28/host=host2 |
| $ hdfs dfs -mkdir -p /user/impala/data/external_partitions/year=2013/month=07/day=29/host=host1 |
| $ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=28/host=host1 |
| $ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=28/host=host2 |
| $ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=29/host=host1 |
| $ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=08/day=01/host=host1 |
| </codeblock> |
| |
| <p> |
| Back in the <cmdname>impala-shell</cmdname> interpreter, we move the original Impala-managed table aside, |
| and create a new <i>external</i> table with a <codeph>LOCATION</codeph> clause pointing to the directory |
| under which we have set up all the partition subdirectories and data files. |
| </p> |
| |
| <codeblock>use external_partitions; |
| alter table logs rename to logs_original; |
| create external table logs (field1 string, field2 string, field3 string) |
| partitioned by (year string, month string, day string, host string) |
| row format delimited fields terminated by ',' |
| location '/user/impala/data/logs'; |
| </codeblock> |
| |
| <p> |
| Because partition subdirectories and data files come and go during the data lifecycle, you must identify |
| each of the partitions through an <codeph>ALTER TABLE</codeph> statement before Impala recognizes the |
| data files they contain. |
| </p> |
| |
| <codeblock>alter table logs add partition (year="2013",month="07",day="28",host="host1") |
| alter table log_type add partition (year="2013",month="07",day="28",host="host2"); |
| alter table log_type add partition (year="2013",month="07",day="29",host="host1"); |
| alter table log_type add partition (year="2013",month="08",day="01",host="host1"); |
| </codeblock> |
| |
| <p> |
| We issue a <codeph>REFRESH</codeph> statement for the table, always a safe practice when data files have |
| been manually added, removed, or changed. Then the data is ready to be queried. The <codeph>SELECT |
| *</codeph> statement illustrates that the data from our trivial CSV file was recognized in each of the |
| partitions where we copied it. Although in this case there are only a few rows, we include a |
| <codeph>LIMIT</codeph> clause on this test query just in case there is more data than we expect. |
| </p> |
| |
| <codeblock>refresh log_type; |
| select * from log_type limit 100; |
| +--------+--------+--------+------+-------+-----+-------+ |
| | field1 | field2 | field3 | year | month | day | host | |
| +--------+--------+--------+------+-------+-----+-------+ |
| | bar | baz | bletch | 2013 | 07 | 28 | host1 | |
| | bar | baz | bletch | 2013 | 08 | 01 | host1 | |
| | bar | baz | bletch | 2013 | 07 | 29 | host1 | |
| | bar | baz | bletch | 2013 | 07 | 28 | host2 | |
| +--------+--------+--------+------+-------+-----+-------+ |
| </codeblock> |
| </conbody> |
| </concept> |
| |
| <concept id="tutorial_impala_hive"> |
| |
| <title>Switching Back and Forth Between Impala and Hive</title> |
| |
| <conbody> |
| |
| <p> |
| Sometimes, you might find it convenient to switch to the Hive shell to perform some data loading or |
| transformation operation, particularly on file formats such as RCFile, SequenceFile, and Avro that Impala |
| currently can query but not write to. |
| </p> |
| |
| <p> |
| Whenever you create, drop, or alter a table or other kind of object through Hive, the next time you |
| switch back to the <cmdname>impala-shell</cmdname> interpreter, issue a one-time <codeph>INVALIDATE |
| METADATA</codeph> statement so that Impala recognizes the new or changed object. |
| </p> |
| |
| <p> |
| Whenever you load, insert, or change data in an existing table through Hive (or even through manual HDFS |
| operations such as the <cmdname>hdfs</cmdname> command), the next time you switch back to the |
| <cmdname>impala-shell</cmdname> interpreter, issue a one-time <codeph>REFRESH |
| <varname>table_name</varname></codeph> statement so that Impala recognizes the new or changed data. |
| </p> |
| |
| <p> |
| For examples showing how this process works for the <codeph>REFRESH</codeph> statement, look at the |
| examples of creating RCFile and SequenceFile tables in Impala, loading data through Hive, and then |
| querying the data through Impala. See <xref href="impala_rcfile.xml#rcfile"/> and |
| <xref href="impala_seqfile.xml#seqfile"/> for those examples. |
| </p> |
| |
| <p> |
| For examples showing how this process works for the <codeph>INVALIDATE METADATA</codeph> statement, look |
| at the example of creating and loading an Avro table in Hive, and then querying the data through Impala. |
| See <xref href="impala_avro.xml#avro"/> for that example. |
| </p> |
| |
| <note rev="1.2.0"> |
| <p rev="1.2.0"> |
| Originally, Impala did not support UDFs, but this feature is available in Impala starting in Impala |
| 1.2. Some <codeph>INSERT ... SELECT</codeph> transformations that you originally did through Hive can |
| now be done through Impala. See <xref href="impala_udf.xml#udfs"/> for details. |
| </p> |
| |
| <p rev="1.2.0"> |
| Prior to Impala 1.2, the <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> statements |
| needed to be issued on each Impala node to which you connected and issued queries. In Impala 1.2 and |
| higher, when you issue either of those statements on any Impala node, the results are broadcast to all |
| the Impala nodes in the cluster, making it truly a one-step operation after each round of DDL or ETL |
| operations in Hive. |
| </p> |
| </note> |
| </conbody> |
| </concept> |
| |
| <concept rev="1.2.2" id="tut_cross_join"> |
| |
| <title>Cross Joins and Cartesian Products with the CROSS JOIN Operator</title> |
| |
| <conbody> |
| |
| <p> |
| Originally, Impala restricted join queries so that they had to include at least one equality comparison |
| between the columns of the tables on each side of the join operator. With the huge tables typically |
| processed by Impala, any miscoded query that produced a full Cartesian product as a result set could |
| consume a huge amount of cluster resources. |
| </p> |
| |
| <p> |
| In Impala 1.2.2 and higher, this restriction is lifted when you use the <codeph>CROSS JOIN</codeph> |
| operator in the query. You still cannot remove all <codeph>WHERE</codeph> clauses from a query like |
| <codeph>SELECT * FROM t1 JOIN t2</codeph> to produce all combinations of rows from both tables. But you |
| can use the <codeph>CROSS JOIN</codeph> operator to explicitly request such a Cartesian product. |
| Typically, this operation is applicable for smaller tables, where the result set still fits within the |
| memory of a single Impala node. |
| </p> |
| |
| <p> |
| The following example sets up data for use in a series of comic books where characters battle each other. |
| At first, we use an equijoin query, which only allows characters from the same time period and the same |
| planet to meet. |
| </p> |
| |
| <codeblock>[localhost:21000] > create table heroes (name string, era string, planet string); |
| [localhost:21000] > create table villains (name string, era string, planet string); |
| [localhost:21000] > insert into heroes values |
| > ('Tesla','20th century','Earth'), |
| > ('Pythagoras','Antiquity','Earth'), |
| > ('Zopzar','Far Future','Mars'); |
| Inserted 3 rows in 2.28s |
| [localhost:21000] > insert into villains values |
| > ('Caligula','Antiquity','Earth'), |
| > ('John Dillinger','20th century','Earth'), |
| > ('Xibulor','Far Future','Venus'); |
| Inserted 3 rows in 1.93s |
| [localhost:21000] > select concat(heroes.name,' vs. ',villains.name) as battle |
| > from heroes join villains |
| > where heroes.era = villains.era and heroes.planet = villains.planet; |
| +--------------------------+ |
| | battle | |
| +--------------------------+ |
| | Tesla vs. John Dillinger | |
| | Pythagoras vs. Caligula | |
| +--------------------------+ |
| Returned 2 row(s) in 0.47s</codeblock> |
| |
| <p> |
| Readers demanded more action, so we added elements of time travel and space travel so that any hero could |
| face any villain. Prior to Impala 1.2.2, this type of query was impossible because all joins had to |
| reference matching values between the two tables: |
| </p> |
| |
| <codeblock>[localhost:21000] > -- Cartesian product not possible in Impala 1.1. |
| > select concat(heroes.name,' vs. ',villains.name) as battle from heroes join villains; |
| ERROR: NotImplementedException: Join between 'heroes' and 'villains' requires at least one conjunctive equality predicate between the two tables</codeblock> |
| |
| <p> |
| With Impala 1.2.2, we rewrite the query slightly to use <codeph>CROSS JOIN</codeph> rather than |
| <codeph>JOIN</codeph>, and now the result set includes all combinations: |
| </p> |
| |
| <codeblock>[localhost:21000] > -- Cartesian product available in Impala 1.2.2 with the CROSS JOIN syntax. |
| > select concat(heroes.name,' vs. ',villains.name) as battle from heroes cross join villains; |
| +-------------------------------+ |
| | battle | |
| +-------------------------------+ |
| | Tesla vs. Caligula | |
| | Tesla vs. John Dillinger | |
| | Tesla vs. Xibulor | |
| | Pythagoras vs. Caligula | |
| | Pythagoras vs. John Dillinger | |
| | Pythagoras vs. Xibulor | |
| | Zopzar vs. Caligula | |
| | Zopzar vs. John Dillinger | |
| | Zopzar vs. Xibulor | |
| +-------------------------------+ |
| Returned 9 row(s) in 0.33s</codeblock> |
| |
| <p> |
| The full combination of rows from both tables is known as the Cartesian product. This type of result set |
| is often used for creating grid data structures. You can also filter the result set by including |
| <codeph>WHERE</codeph> clauses that do not explicitly compare columns between the two tables. The |
| following example shows how you might produce a list of combinations of year and quarter for use in a |
| chart, and then a shorter list with only selected quarters. |
| </p> |
| |
| <codeblock>[localhost:21000] > create table x_axis (x int); |
| [localhost:21000] > create table y_axis (y int); |
| [localhost:21000] > insert into x_axis values (1),(2),(3),(4); |
| Inserted 4 rows in 2.14s |
| [localhost:21000] > insert into y_axis values (2010),(2011),(2012),(2013),(2014); |
| Inserted 5 rows in 1.32s |
| [localhost:21000] > select y as year, x as quarter from x_axis cross join y_axis; |
| +------+---------+ |
| | year | quarter | |
| +------+---------+ |
| | 2010 | 1 | |
| | 2011 | 1 | |
| | 2012 | 1 | |
| | 2013 | 1 | |
| | 2014 | 1 | |
| | 2010 | 2 | |
| | 2011 | 2 | |
| | 2012 | 2 | |
| | 2013 | 2 | |
| | 2014 | 2 | |
| | 2010 | 3 | |
| | 2011 | 3 | |
| | 2012 | 3 | |
| | 2013 | 3 | |
| | 2014 | 3 | |
| | 2010 | 4 | |
| | 2011 | 4 | |
| | 2012 | 4 | |
| | 2013 | 4 | |
| | 2014 | 4 | |
| +------+---------+ |
| Returned 20 row(s) in 0.38s |
| [localhost:21000] > select y as year, x as quarter from x_axis cross join y_axis where x in (1,3); |
| +------+---------+ |
| | year | quarter | |
| +------+---------+ |
| | 2010 | 1 | |
| | 2011 | 1 | |
| | 2012 | 1 | |
| | 2013 | 1 | |
| | 2014 | 1 | |
| | 2010 | 3 | |
| | 2011 | 3 | |
| | 2012 | 3 | |
| | 2013 | 3 | |
| | 2014 | 3 | |
| +------+---------+ |
| Returned 10 row(s) in 0.39s</codeblock> |
| </conbody> |
| </concept> |
| |
| <concept audience="hidden" id="tut_connect"> |
| |
| <title>Connecting to Impala through impala-shell</title> |
| |
| <conbody> |
| |
| <p> |
| This tutorial provides some tips for connecting to Impala through the <cmdname>impala-shell</cmdname> |
| interpreter, and gives usage tips for particular scenarios. |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| By default, <cmdname>impala-shell</cmdname> with no arguments connects to the current host, port |
| 21000. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| <codeph>-i localhost</codeph> option gives you a shorter command prompt. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| The <codeph>--quiet</codeph> option suppresses some informational messages, such as the |
| <codeph>Query:</codeph> line that echoes each command entered. |
| </p> |
| </li> |
| |
| <li> |
| <codeph>CONNECT <varname>hostname</varname></codeph> lets you switch to another host (keeping the same |
| default port 21000). |
| </li> |
| |
| <li> |
| <p> |
| You could run <cmdname>impala-shell</cmdname> from a completely different host to keep login accounts |
| off the actual Impala nodes. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| For load balancing, you would connect to different coordinator nodes for concurrent queries. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| For benchmarking, you would skip pretty printing with the <codeph>-B</codeph> option, or even skip |
| all screen output with the <codeph>-o</codeph> option or shell redirection. |
| </p> |
| </li> |
| </ul> |
| </conbody> |
| </concept> |
| </concept> |
| |
| <concept audience="hidden" id="tut_mem_limit"> |
| |
| <title>Effects of Memory Limits on Impala Queries</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Memory"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p></p> |
| </conbody> |
| </concept> |
| |
| <concept id="tut_parquet_schemaless"> |
| |
| <title>Dealing with Parquet Files with Unknown Schema</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Tutorials"/> |
| <data name="Category" value="Schemas"/> |
| <data name="Category" value="Parquet"/> |
| <data name="Category" value="File Formats"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| As data pipelines start to include more aspects such as NoSQL or loosely specified schemas, you might encounter |
| situations where you have data files (particularly in Parquet format) where you do not know the precise table definition. |
| This tutorial shows how you can build an Impala table around data that comes from non-Impala or even non-SQL sources, |
| where you do not have control of the table layout and might not be familiar with the characteristics of the data. |
| </p> |
| |
| <p> |
| The data used in this tutorial represents airline on-time arrival statistics, from October 1987 through April 2008. |
| See the details on the <xref href="http://stat-computing.org/dataexpo/2009/" scope="external" format="html">2009 ASA Data Expo web site</xref>. |
| You can also see the <xref href="http://stat-computing.org/dataexpo/2009/the-data.html" scope="external" format="html">explanations of the columns</xref>; |
| for purposes of this exercise, wait until after following the tutorial before examining the schema, to better simulate |
| a real-life situation where you cannot rely on assumptions and assertions about the ranges and representations of |
| data values. |
| </p> |
| |
| <p> |
| We will download Parquet files containing this data from the Ibis blog. |
| First, we download and unpack the data files. |
| There are 8 files totalling 1.4 GB. |
| Each file is less than 256 MB. |
| </p> |
| |
| <codeblock>$ wget -O airlines_parquet.tar.gz https://www.dropbox.com/s/ol9x51tqp6cv4yc/airlines_parquet.tar.gz?dl=0 |
| ... |
| Length: 1245204740 (1.2G) [application/octet-stream] |
| Saving to: “airlines_parquet.tar.gz” |
| |
| 2015-08-12 17:14:24 (23.6 MB/s) - “airlines_parquet.tar.gz” saved [1245204740/1245204740] |
| |
| $ tar xvzf airlines_parquet.tar.gz |
| airlines_parquet/ |
| airlines_parquet/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq |
| airlines_parquet/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq |
| airlines_parquet/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq |
| airlines_parquet/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq |
| airlines_parquet/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq |
| airlines_parquet/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq |
| airlines_parquet/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq |
| airlines_parquet/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq |
| $ cd airlines_parquet/ |
| $ du -kch *.parq |
| 253M 93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq |
| 65M 93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq |
| 156M 93459d994898a9ba-77674173b331fa97_574780876_data.0.parq |
| 240M 93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq |
| 253M 93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq |
| 16M 93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq |
| 177M 93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq |
| 213M 93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq |
| 1.4G total |
| </codeblock> |
| |
| <p> |
| Next, we put the Parquet data files in HDFS, all together in a single directory, |
| with permissions on the directory and the files so that the <codeph>impala</codeph> |
| user will be able to read them. |
| </p> |
| |
| <note> |
| After unpacking, we saw the largest Parquet file was 253 MB. |
| When copying Parquet files into HDFS for Impala to use, |
| for maximum query performance, make sure that each file resides in a single HDFS data block. |
| Therefore, we pick a size larger than any single file and specify that as the block size, using the argument |
| <codeph>-Ddfs.block.size=256m</codeph> on the <codeph>hdfs dfs -put</codeph> command. |
| </note> |
| |
| <codeblock>$ hdfs dfs -mkdir -p hdfs://demo_host.example.com:8020/user/impala/staging/airlines |
| $ hdfs dfs -Ddfs.block.size=256m -put *.parq /user/impala/staging/airlines |
| $ hdfs dfs -ls /user/impala/staging |
| Found 1 items |
| drwxrwxrwx - hdfs supergroup 0 2015-08-12 13:52 /user/impala/staging/airlines |
| $ hdfs dfs -ls hdfs://demo_host.example.com:8020/user/impala/staging/airlines |
| Found 8 items |
| -rw-r--r-- 3 jrussell supergroup 265107489 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq |
| -rw-r--r-- 3 jrussell supergroup 67544715 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq |
| -rw-r--r-- 3 jrussell supergroup 162556490 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq |
| -rw-r--r-- 3 jrussell supergroup 251603518 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq |
| -rw-r--r-- 3 jrussell supergroup 265186603 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq |
| -rw-r--r-- 3 jrussell supergroup 16663754 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq |
| -rw-r--r-- 3 jrussell supergroup 185511677 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq |
| -rw-r--r-- 3 jrussell supergroup 222794621 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq |
| </codeblock> |
| |
| <p> |
| With the files in an accessible location in HDFS, we create a database table that uses the data in those files. |
| The <codeph>CREATE EXTERNAL</codeph> syntax and the <codeph>LOCATION</codeph> attribute point Impala at the appropriate HDFS directory. |
| The <codeph>LIKE PARQUET '<varname>path_to_any_parquet_file</varname>'</codeph> clause means we skip the list of column names and types; |
| Impala automatically gets the column names and data types straight from the data files. |
| (Currently, this technique only works for Parquet files.) |
| We ignore the warning about lack of <codeph>READ_WRITE</codeph> access to the files in HDFS; |
| the <codeph>impala</codeph> user can read the files, which will be sufficient for us to experiment with |
| queries and perform some copy and transform operations into other tables. |
| </p> |
| |
| <codeblock>$ impala-shell -i localhost |
| Starting Impala Shell without Kerberos authentication |
| |
| Connected to localhost:21000 |
| <ph conref="../shared/ImpalaVariables.xml#impala_vars/ImpaladBanner"/> |
| Welcome to the Impala shell. Press TAB twice to see a list of available commands. |
| ... |
| <ph conref="../shared/ImpalaVariables.xml#impala_vars/ShellBanner"/> |
| [localhost:21000] > create database airline_data; |
| [localhost:21000] > use airline_data; |
| [localhost:21000] > create external table airlines_external |
| > like parquet 'hdfs://demo_host.example.com:8020/user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq' |
| > stored as parquet location 'hdfs://demo_host.example.com:8020/user/impala/staging/airlines'; |
| WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://demo_host.example.com:8020/user/impala/staging' |
| </codeblock> |
| |
| <p> |
| With the table created, we examine its physical and logical characteristics to confirm that the data is really |
| there and in a format and shape that we can work with. |
| The <codeph>SHOW TABLE STATS</codeph> statement gives a very high-level summary of the table, |
| showing how many files and how much total data it contains. |
| Also, it confirms that the table is expecting all the associated data files to be in Parquet format. |
| (The ability to work with all kinds of HDFS data files in different formats means that it is |
| possible to have a mismatch between the format of the data files, and the format |
| that the table expects the data files to be in.) |
| The <codeph>SHOW FILES</codeph> statement confirms that the data in the table has the expected number, |
| names, and sizes of the original Parquet files. |
| The <codeph>DESCRIBE</codeph> statement (or its abbreviation <codeph>DESC</codeph>) confirms the names and types |
| of the columns that Impala automatically created after reading that metadata from the Parquet file. |
| The <codeph>DESCRIBE FORMATTED</codeph> statement prints out some extra detail along with the column definitions; |
| the pieces we care about for this exercise are the containing database for the table, |
| the location of the associated data files in HDFS, the fact that it's an external table so Impala will not |
| delete the HDFS files when we finish the experiments and drop the table, and the fact that the |
| table is set up to work exclusively with files in the Parquet format. |
| </p> |
| |
| <codeblock>[localhost:21000] > show table stats airlines_external; |
| +-------+--------+--------+--------------+-------------------+---------+-------------------+ |
| | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | |
| +-------+--------+--------+--------------+-------------------+---------+-------------------+ |
| | -1 | 8 | 1.34GB | NOT CACHED | NOT CACHED | PARQUET | false | |
| +-------+--------+--------+--------------+-------------------+---------+-------------------+ |
| [localhost:21000] > show files in airlines_external; |
| +----------------------------------------------------------------------------------------+----------+-----------+ |
| | path | size | partition | |
| +----------------------------------------------------------------------------------------+----------+-----------+ |
| | /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq | 252.83MB | | |
| | /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq | 64.42MB | | |
| | /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq | 155.03MB | | |
| | /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq | 239.95MB | | |
| | /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq | 252.90MB | | |
| | /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq | 15.89MB | | |
| | /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq | 176.92MB | | |
| | /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq | 212.47MB | | |
| +----------------------------------------------------------------------------------------+----------+-----------+ |
| [localhost:21000] > describe airlines_external; |
| +---------------------+--------+---------------------------------------------------+ |
| | name | type | comment | |
| +---------------------+--------+---------------------------------------------------+ |
| | year | int | inferred from: optional int32 year | |
| | month | int | inferred from: optional int32 month | |
| | day | int | inferred from: optional int32 day | |
| | dayofweek | int | inferred from: optional int32 dayofweek | |
| | dep_time | int | inferred from: optional int32 dep_time | |
| | crs_dep_time | int | inferred from: optional int32 crs_dep_time | |
| | arr_time | int | inferred from: optional int32 arr_time | |
| | crs_arr_time | int | inferred from: optional int32 crs_arr_time | |
| | carrier | string | inferred from: optional binary carrier | |
| | flight_num | int | inferred from: optional int32 flight_num | |
| | tail_num | int | inferred from: optional int32 tail_num | |
| | actual_elapsed_time | int | inferred from: optional int32 actual_elapsed_time | |
| | crs_elapsed_time | int | inferred from: optional int32 crs_elapsed_time | |
| | airtime | int | inferred from: optional int32 airtime | |
| | arrdelay | int | inferred from: optional int32 arrdelay | |
| | depdelay | int | inferred from: optional int32 depdelay | |
| | origin | string | inferred from: optional binary origin | |
| | dest | string | inferred from: optional binary dest | |
| | distance | int | inferred from: optional int32 distance | |
| | taxi_in | int | inferred from: optional int32 taxi_in | |
| | taxi_out | int | inferred from: optional int32 taxi_out | |
| | cancelled | int | inferred from: optional int32 cancelled | |
| | cancellation_code | string | inferred from: optional binary cancellation_code | |
| | diverted | int | inferred from: optional int32 diverted | |
| | carrier_delay | int | inferred from: optional int32 carrier_delay | |
| | weather_delay | int | inferred from: optional int32 weather_delay | |
| | nas_delay | int | inferred from: optional int32 nas_delay | |
| | security_delay | int | inferred from: optional int32 security_delay | |
| | late_aircraft_delay | int | inferred from: optional int32 late_aircraft_delay | |
| +---------------------+--------+---------------------------------------------------+ |
| [localhost:21000] > desc formatted airlines_external; |
| +------------------------------+------------------------------- |
| | name | type |
| +------------------------------+------------------------------- |
| ... |
| | # Detailed Table Information | NULL |
| | Database: | airline_data |
| | Owner: | jrussell |
| ... |
| | Location: | /user/impala/staging/airlines |
| | Table Type: | EXTERNAL_TABLE |
| ... |
| | # Storage Information | NULL |
| | SerDe Library: | parquet.hive.serde.ParquetHiveSerDe |
| | InputFormat: | parquet.hive.DeprecatedParquetInputFormat |
| | OutputFormat: | parquet.hive.DeprecatedParquetOutputFormat |
| ... |
| </codeblock> |
| |
| <p> |
| Now that we are confident that the connections are solid between the Impala table and the |
| underlying Parquet files, we run some initial queries to understand the characteristics |
| of the data: the overall number of rows, and the ranges and how many |
| different values are in certain columns. |
| For convenience in understanding the magnitude of the <codeph>COUNT(*)</codeph> |
| result, we run another query dividing the number of rows by 1 million, demonstrating that there are 123 million rows in the table. |
| </p> |
| |
| <!-- I think this is the very longest line that could possibly fit without wrapping in the PDF output; 87 characters wide. |
| [localhost:21000] > select count(*) / 1e6 as 'millions of rows' from airlines_external; |
| --> |
| |
| <codeblock>[localhost:21000] > select count(*) from airlines_external; |
| +-----------+ |
| | count(*) | |
| +-----------+ |
| | 123534969 | |
| +-----------+ |
| Fetched 1 row(s) in 1.32s |
| [localhost:21000] > select count(*) / 1e6 as 'millions of rows' from airlines_external; |
| +------------------+ |
| | millions of rows | |
| +------------------+ |
| | 123.534969 | |
| +------------------+ |
| Fetched 1 row(s) in 1.24s |
| </codeblock> |
| |
| <p> The <codeph>NDV()</codeph> function stands for <q>number of distinct |
| values</q>, which for performance reasons is an estimate when there |
| are lots of different values in the column, but is precise when the |
| cardinality is less than 16 K. Use <codeph>NDV()</codeph> calls for this |
| kind of exploration rather than <codeph>COUNT(DISTINCT |
| <varname>colname</varname>)</codeph>, because Impala can evaluate |
| multiple <codeph>NDV()</codeph> functions in a single query, but only a |
| single instance of <codeph>COUNT DISTINCT</codeph>. Here we see that |
| there are modest numbers of different airlines, flight numbers, and |
| origin and destination airports. Two things jump out from this query: |
| the number of <codeph>tail_num</codeph> values is much smaller than we |
| might have expected, and there are more destination airports than origin |
| airports. Let's dig further. What we find is that most |
| <codeph>tail_num</codeph> values are <codeph>NULL</codeph>. It looks |
| like this was an experimental column that wasn't filled in accurately. |
| We make a mental note that if we use this data as a starting point, |
| we'll ignore this column. We also find that certain airports are |
| represented in the <codeph>ORIGIN</codeph> column but not the |
| <codeph>DEST</codeph> column; now we know that we cannot rely on the |
| assumption that those sets of airport codes are identical. </p> |
| |
| <note> |
| A slight digression for some performance tuning. Notice how the first |
| <codeph>SELECT DISTINCT DEST</codeph> query takes almost 40 seconds. |
| We expect all queries on such a small data set, less than 2 GB, to |
| take a few seconds at most. The reason is because the expression |
| <codeph>NOT IN (SELECT origin FROM airlines_external)</codeph> |
| produces an intermediate result set of 123 million rows, then |
| runs 123 million comparisons on each data node against the tiny set of destination airports. |
| The way the <codeph>NOT IN</codeph> operator works internally means that |
| this intermediate result set with 123 million rows might be transmitted |
| across the network to each data node in the cluster. |
| Applying another <codeph>DISTINCT</codeph> inside the <codeph>NOT IN</codeph> |
| subquery means that the intermediate result set is only 340 items, |
| resulting in much less network traffic and fewer comparison operations. |
| The more efficient query with the added <codeph>DISTINCT</codeph> is approximately 7 times as fast. |
| </note> |
| |
| <codeblock>[localhost:21000] > select ndv(carrier), ndv(flight_num), ndv(tail_num), |
| > ndv(origin), ndv(dest) from airlines_external; |
| +--------------+-----------------+---------------+-------------+-----------+ |
| | ndv(carrier) | ndv(flight_num) | ndv(tail_num) | ndv(origin) | ndv(dest) | |
| +--------------+-----------------+---------------+-------------+-----------+ |
| | 29 | 9086 | 3 | 340 | 347 | |
| +--------------+-----------------+---------------+-------------+-----------+ |
| [localhost:21000] > select tail_num, count(*) as howmany from airlines_external |
| > group by tail_num; |
| +----------+-----------+ |
| | tail_num | howmany | |
| +----------+-----------+ |
| | 715 | 1 | |
| | 0 | 406405 | |
| | 112 | 6562 | |
| | NULL | 123122001 | |
| +----------+-----------+ |
| Fetched 1 row(s) in 5.18s |
| [localhost:21000] > select distinct dest from airlines_external |
| > where dest not in (select origin from airlines_external); |
| +------+ |
| | dest | |
| +------+ |
| | LBF | |
| | CBM | |
| | RCA | |
| | SKA | |
| | LAR | |
| +------+ |
| Fetched 5 row(s) in 39.64s |
| [localhost:21000] > select distinct dest from airlines_external |
| > where dest not in (select distinct origin from airlines_external); |
| +------+ |
| | dest | |
| +------+ |
| | LBF | |
| | RCA | |
| | CBM | |
| | SKA | |
| | LAR | |
| +------+ |
| Fetched 5 row(s) in 5.59s |
| [localhost:21000] > select distinct origin from airlines_external |
| > where origin not in (select distinct dest from airlines_external); |
| Fetched 0 row(s) in 5.37s |
| </codeblock> |
| |
| <p> Next, we try doing a simple calculation, with results broken down by year. |
| This reveals that some years have no data in the |
| <codeph>AIRTIME</codeph> column. That means we might be able to use |
| that column in queries involving certain date ranges, but we cannot |
| count on it to always be reliable. The question of whether a column |
| contains any <codeph>NULL</codeph> values, and if so what is their |
| number, proportion, and distribution, comes up again and again when |
| doing initial exploration of a data set. </p> |
| |
| <codeblock>[localhost:21000] > select year, sum(airtime) from airlines_external |
| > group by year order by year desc; |
| +------+--------------+ |
| | year | sum(airtime) | |
| +------+--------------+ |
| | 2008 | 713050445 | |
| | 2007 | 748015545 | |
| | 2006 | 720372850 | |
| | 2005 | 708204026 | |
| | 2004 | 714276973 | |
| | 2003 | 665706940 | |
| | 2002 | 549761849 | |
| | 2001 | 590867745 | |
| | 2000 | 583537683 | |
| | 1999 | 561219227 | |
| | 1998 | 538050663 | |
| | 1997 | 536991229 | |
| | 1996 | 519440044 | |
| | 1995 | 513364265 | |
| | 1994 | NULL | |
| | 1993 | NULL | |
| | 1992 | NULL | |
| | 1991 | NULL | |
| | 1990 | NULL | |
| | 1989 | NULL | |
| | 1988 | NULL | |
| | 1987 | NULL | |
| +------+--------------+ |
| </codeblock> |
| |
| <p> |
| With the notion of <codeph>NULL</codeph> values in mind, let's come back to the <codeph>TAILNUM</codeph> |
| column that we discovered had a lot of <codeph>NULL</codeph>s. |
| Let's quantify the <codeph>NULL</codeph> and non-<codeph>NULL</codeph> values in that column for better understanding. |
| First, we just count the overall number of rows versus the non-<codeph>NULL</codeph> values in that column. |
| That initial result gives the appearance of relatively few non-<codeph>NULL</codeph> values, but we can break |
| it down more clearly in a single query. |
| Once we have the <codeph>COUNT(*)</codeph> and the <codeph>COUNT(<varname>colname</varname>)</codeph> numbers, |
| we can encode that initial query in a <codeph>WITH</codeph> clause, then run a followon query that performs |
| multiple arithmetic operations on those values. |
| Seeing that only one-third of one percent of all rows have non-<codeph>NULL</codeph> values for the |
| <codeph>TAILNUM</codeph> column clearly illustrates that that column is not of much use. |
| </p> |
| |
| <codeblock>[localhost:21000] > select count(*) as 'rows', count(tail_num) as 'non-null tail numbers' |
| > from airlines_external; |
| +-----------+-----------------------+ |
| | rows | non-null tail numbers | |
| +-----------+-----------------------+ |
| | 123534969 | 412968 | |
| +-----------+-----------------------+ |
| Fetched 1 row(s) in 1.51s |
| [localhost:21000] > with t1 as |
| > (select count(*) as 'rows', count(tail_num) as 'nonnull' |
| > from airlines_external) |
| > select `rows`, `nonnull`, `rows` - `nonnull` as 'nulls', |
| > (`nonnull` / `rows`) * 100 as 'percentage non-null' |
| > from t1; |
| +-----------+---------+-----------+---------------------+ |
| | rows | nonnull | nulls | percentage non-null | |
| +-----------+---------+-----------+---------------------+ |
| | 123534969 | 412968 | 123122001 | 0.3342923897119365 | |
| +-----------+---------+-----------+---------------------+ |
| </codeblock> |
| |
| <p> |
| By examining other columns using these techniques, we can form a mental picture of the way data is distributed |
| throughout the table, and which columns are most significant for query purposes. For this tutorial, we focus mostly on |
| the fields likely to hold discrete values, rather than columns such as <codeph>ACTUAL_ELAPSED_TIME</codeph> |
| whose names suggest they hold measurements. We would dig deeper into those columns once we had a clear picture |
| of which questions were worthwhile to ask, and what kinds of trends we might look for. |
| For the final piece of initial exploration, let's look at the <codeph>YEAR</codeph> column. |
| A simple <codeph>GROUP BY</codeph> query shows that it has a well-defined range, a manageable number of |
| distinct values, and relatively even distribution of rows across the different years. |
| </p> |
| |
| <codeblock>[localhost:21000] > select min(year), max(year), ndv(year) from airlines_external; |
| +-----------+-----------+-----------+ |
| | min(year) | max(year) | ndv(year) | |
| +-----------+-----------+-----------+ |
| | 1987 | 2008 | 22 | |
| +-----------+-----------+-----------+ |
| Fetched 1 row(s) in 2.03s |
| [localhost:21000] > select year, count(*) howmany from airlines_external |
| > group by year order by year desc; |
| +------+---------+ |
| | year | howmany | |
| +------+---------+ |
| | 2008 | 7009728 | |
| | 2007 | 7453215 | |
| | 2006 | 7141922 | |
| | 2005 | 7140596 | |
| | 2004 | 7129270 | |
| | 2003 | 6488540 | |
| | 2002 | 5271359 | |
| | 2001 | 5967780 | |
| | 2000 | 5683047 | |
| | 1999 | 5527884 | |
| | 1998 | 5384721 | |
| | 1997 | 5411843 | |
| | 1996 | 5351983 | |
| | 1995 | 5327435 | |
| | 1994 | 5180048 | |
| | 1993 | 5070501 | |
| | 1992 | 5092157 | |
| | 1991 | 5076925 | |
| | 1990 | 5270893 | |
| | 1989 | 5041200 | |
| | 1988 | 5202096 | |
| | 1987 | 1311826 | |
| +------+---------+ |
| Fetched 22 row(s) in 2.13s |
| </codeblock> |
| |
| <p> |
| We could go quite far with the data in this initial raw format, just as we downloaded it from the web. |
| If the data set proved to be useful and worth persisting in Impala for extensive queries, |
| we might want to copy it to an internal table, letting Impala manage the data files and perhaps |
| reorganizing a little for higher efficiency. |
| In this next stage of the tutorial, we copy the original data into a partitioned table, still in Parquet format. |
| Partitioning based on the <codeph>YEAR</codeph> column lets us run queries with clauses such as <codeph>WHERE year = 2001</codeph> |
| or <codeph>WHERE year BETWEEN 1989 AND 1999</codeph>, which can dramatically cut down on I/O by |
| ignoring all the data from years outside the desired range. |
| Rather than reading all the data and then deciding which rows are in the matching years, Impala can |
| zero in on only the data files from specific <codeph>YEAR</codeph> partitions. |
| To do this, Impala physically reorganizes the data files, putting the rows from each year into |
| data files in a separate HDFS directory for each <codeph>YEAR</codeph> value. |
| Along the way, we'll also get rid of the <codeph>TAIL_NUM</codeph> column that proved to be almost entirely <codeph>NULL</codeph>. |
| </p> |
| |
| <p> |
| The first step is to create a new table with a layout very similar to the original <codeph>AIRLINES_EXTERNAL</codeph> table. |
| We'll do that by reverse-engineering a <codeph>CREATE TABLE</codeph> statement for the first table, |
| then tweaking it slightly to include a <codeph>PARTITION BY</codeph> clause for <codeph>YEAR</codeph>, |
| and excluding the <codeph>TAIL_NUM</codeph> column. |
| The <codeph>SHOW CREATE TABLE</codeph> statement gives us the starting point. |
| </p> |
| |
| <codeblock>[localhost:21000] > show create table airlines_external; |
| +------------------------------------------------------------------------------------- |
| | result |
| +------------------------------------------------------------------------------------- |
| | CREATE EXTERNAL TABLE airline_data.airlines_external ( |
| | year INT COMMENT 'inferred from: optional int32 year', |
| | month INT COMMENT 'inferred from: optional int32 month', |
| | day INT COMMENT 'inferred from: optional int32 day', |
| | dayofweek INT COMMENT 'inferred from: optional int32 dayofweek', |
| | dep_time INT COMMENT 'inferred from: optional int32 dep_time', |
| | crs_dep_time INT COMMENT 'inferred from: optional int32 crs_dep_time', |
| | arr_time INT COMMENT 'inferred from: optional int32 arr_time', |
| | crs_arr_time INT COMMENT 'inferred from: optional int32 crs_arr_time', |
| | carrier STRING COMMENT 'inferred from: optional binary carrier', |
| | flight_num INT COMMENT 'inferred from: optional int32 flight_num', |
| | tail_num INT COMMENT 'inferred from: optional int32 tail_num', |
| | actual_elapsed_time INT COMMENT 'inferred from: optional int32 actual_elapsed_time', |
| | crs_elapsed_time INT COMMENT 'inferred from: optional int32 crs_elapsed_time', |
| | airtime INT COMMENT 'inferred from: optional int32 airtime', |
| | arrdelay INT COMMENT 'inferred from: optional int32 arrdelay', |
| | depdelay INT COMMENT 'inferred from: optional int32 depdelay', |
| | origin STRING COMMENT 'inferred from: optional binary origin', |
| | dest STRING COMMENT 'inferred from: optional binary dest', |
| | distance INT COMMENT 'inferred from: optional int32 distance', |
| | taxi_in INT COMMENT 'inferred from: optional int32 taxi_in', |
| | taxi_out INT COMMENT 'inferred from: optional int32 taxi_out', |
| | cancelled INT COMMENT 'inferred from: optional int32 cancelled', |
| | cancellation_code STRING COMMENT 'inferred from: optional binary cancellation_code', |
| | diverted INT COMMENT 'inferred from: optional int32 diverted', |
| | carrier_delay INT COMMENT 'inferred from: optional int32 carrier_delay', |
| | weather_delay INT COMMENT 'inferred from: optional int32 weather_delay', |
| | nas_delay INT COMMENT 'inferred from: optional int32 nas_delay', |
| | security_delay INT COMMENT 'inferred from: optional int32 security_delay', |
| | late_aircraft_delay INT COMMENT 'inferred from: optional int32 late_aircraft_delay' |
| | ) |
| | STORED AS PARQUET |
| | LOCATION 'hdfs://a1730.example.com:8020/user/impala/staging/airlines' |
| | TBLPROPERTIES ('numFiles'='0', 'COLUMN_STATS_ACCURATE'='false', |
| | 'transient_lastDdlTime'='1439425228', 'numRows'='-1', 'totalSize'='0', |
| | 'rawDataSize'='-1') |
| +------------------------------------------------------------------------------------- |
| Fetched 1 row(s) in 0.03s |
| [localhost:21000] > quit; |
| </codeblock> |
| |
| <p> |
| Although we could edit that output into a new SQL statement, all the ASCII box characters |
| make such editing inconvenient. To get a more stripped-down <codeph>CREATE TABLE</codeph> |
| to start with, we restart the <cmdname>impala-shell</cmdname> command with the |
| <codeph>-B</codeph> option, which turns off the box-drawing behavior. |
| </p> |
| |
| <codeblock>[localhost:21000] > quit; |
| Goodbye jrussell |
| $ impala-shell -i localhost -B -d airline_data; |
| Starting Impala Shell without Kerberos authentication |
| Connected to localhost:21000 |
| <ph conref="../shared/ImpalaVariables.xml#impala_vars/ImpaladBanner"/> |
| Welcome to the Impala shell. Press TAB twice to see a list of available commands. |
| ... |
| <ph conref="../shared/ImpalaVariables.xml#impala_vars/ShellBanner"/> |
| [localhost:21000] > show create table airlines_external; |
| "CREATE EXTERNAL TABLE airline_data.airlines_external ( |
| year INT COMMENT 'inferred from: optional int32 year', |
| month INT COMMENT 'inferred from: optional int32 month', |
| day INT COMMENT 'inferred from: optional int32 day', |
| dayofweek INT COMMENT 'inferred from: optional int32 dayofweek', |
| dep_time INT COMMENT 'inferred from: optional int32 dep_time', |
| crs_dep_time INT COMMENT 'inferred from: optional int32 crs_dep_time', |
| arr_time INT COMMENT 'inferred from: optional int32 arr_time', |
| crs_arr_time INT COMMENT 'inferred from: optional int32 crs_arr_time', |
| carrier STRING COMMENT 'inferred from: optional binary carrier', |
| flight_num INT COMMENT 'inferred from: optional int32 flight_num', |
| tail_num INT COMMENT 'inferred from: optional int32 tail_num', |
| actual_elapsed_time INT COMMENT 'inferred from: optional int32 actual_elapsed_time', |
| crs_elapsed_time INT COMMENT 'inferred from: optional int32 crs_elapsed_time', |
| airtime INT COMMENT 'inferred from: optional int32 airtime', |
| arrdelay INT COMMENT 'inferred from: optional int32 arrdelay', |
| depdelay INT COMMENT 'inferred from: optional int32 depdelay', |
| origin STRING COMMENT 'inferred from: optional binary origin', |
| dest STRING COMMENT 'inferred from: optional binary dest', |
| distance INT COMMENT 'inferred from: optional int32 distance', |
| taxi_in INT COMMENT 'inferred from: optional int32 taxi_in', |
| taxi_out INT COMMENT 'inferred from: optional int32 taxi_out', |
| cancelled INT COMMENT 'inferred from: optional int32 cancelled', |
| cancellation_code STRING COMMENT 'inferred from: optional binary cancellation_code', |
| diverted INT COMMENT 'inferred from: optional int32 diverted', |
| carrier_delay INT COMMENT 'inferred from: optional int32 carrier_delay', |
| weather_delay INT COMMENT 'inferred from: optional int32 weather_delay', |
| nas_delay INT COMMENT 'inferred from: optional int32 nas_delay', |
| security_delay INT COMMENT 'inferred from: optional int32 security_delay', |
| late_aircraft_delay INT COMMENT 'inferred from: optional int32 late_aircraft_delay' |
| ) |
| STORED AS PARQUET |
| LOCATION 'hdfs://a1730.example.com:8020/user/impala/staging/airlines' |
| TBLPROPERTIES ('numFiles'='0', 'COLUMN_STATS_ACCURATE'='false', |
| 'transient_lastDdlTime'='1439425228', 'numRows'='-1', 'totalSize'='0', |
| 'rawDataSize'='-1')" |
| Fetched 1 row(s) in 0.01s |
| </codeblock> |
| |
| <p> |
| After copying and pasting the <codeph>CREATE TABLE</codeph> statement into a text editor for fine-tuning, we quit and restart <cmdname>impala-shell</cmdname> |
| without the <codeph>-B</codeph> option, to switch back to regular |
| output. |
| </p> |
| <p> |
| Next we run the <codeph>CREATE TABLE</codeph> statement that we adapted from the <codeph>SHOW CREATE TABLE</codeph> output. |
| We kept the <codeph>STORED AS PARQUET</codeph> clause because |
| we want to rearrange the data somewhat but still keep it in the high-performance |
| Parquet format. The <codeph>LOCATION</codeph> and <codeph>TBLPROPERTIES</codeph> |
| clauses are not relevant for this new table, so we edit those out. |
| Because we are going to partition the new table based on the <codeph>YEAR</codeph> |
| column, we move that column name (and its type) into a new <codeph>PARTITIONED BY</codeph> clause. |
| </p> |
| |
| <codeblock>[localhost:21000] > CREATE TABLE airline_data.airlines |
| > ( |
| > month INT, |
| > day INT, |
| > dayofweek INT, |
| > dep_time INT, |
| > crs_dep_time INT, |
| > arr_time INT, |
| > crs_arr_time INT, |
| > carrier STRING, |
| > flight_num INT, |
| > actual_elapsed_time INT, |
| > crs_elapsed_time INT, |
| > airtime INT, |
| > arrdelay INT, |
| > depdelay INT, |
| > origin STRING, |
| > dest STRING, |
| > distance INT, |
| > taxi_in INT, |
| > taxi_out INT, |
| > cancelled INT, |
| > cancellation_code STRING, |
| > diverted INT, |
| > carrier_delay INT, |
| > weather_delay INT, |
| > nas_delay INT, |
| > security_delay INT, |
| > late_aircraft_delay INT |
| > ) |
| > STORED AS PARQUET |
| > PARTITIONED BY (year INT); |
| Fetched 0 row(s) in 0.10s |
| </codeblock> |
| |
| <p> |
| Next, we copy all the rows from the original table into this new one with |
| an <codeph>INSERT</codeph> statement. (We edited the <codeph>CREATE TABLE</codeph> |
| statement to make an <codeph>INSERT</codeph> statement with the column names |
| in the same order.) The only change is to add a <codeph>PARTITION(year)</codeph> |
| clause, and move the <codeph>YEAR</codeph> column to the very end of the |
| <codeph>SELECT</codeph> list of the <codeph>INSERT</codeph> statement. |
| Specifying <codeph>PARTITION(year)</codeph>, rather than a fixed value |
| such as <codeph>PARTITION(year=2000)</codeph>, means that Impala figures |
| out the partition value for each row based on the value of the very last column |
| in the <codeph>SELECT</codeph> list. |
| This is the first SQL statement that legitimately takes any substantial time, |
| because the rows from different years are shuffled around the cluster; |
| the rows that go into each partition are collected on one node, before being |
| written to one or more new data files. |
| </p> |
| |
| <codeblock>[localhost:21000] > INSERT INTO airline_data.airlines |
| > PARTITION (year) |
| > SELECT |
| > month, |
| > day, |
| > dayofweek, |
| > dep_time, |
| > crs_dep_time, |
| > arr_time, |
| > crs_arr_time, |
| > carrier, |
| > flight_num, |
| > actual_elapsed_time, |
| > crs_elapsed_time, |
| > airtime, |
| > arrdelay, |
| > depdelay, |
| > origin, |
| > dest, |
| > distance, |
| > taxi_in, |
| > taxi_out, |
| > cancelled, |
| > cancellation_code, |
| > diverted, |
| > carrier_delay, |
| > weather_delay, |
| > nas_delay, |
| > security_delay, |
| > late_aircraft_delay, |
| > year |
| > FROM airline_data.airlines_external; |
| Inserted 123534969 row(s) in 202.70s |
| </codeblock> |
| |
| <p> |
| Once partitioning or join queries come into play, it's important to have statistics |
| that Impala can use to optimize queries on the corresponding tables. |
| The <codeph>COMPUTE INCREMENTAL STATS</codeph> statement is the way to collect |
| statistics for partitioned tables. |
| Then the <codeph>SHOW TABLE STATS</codeph> statement confirms that the statistics |
| are in place for each partition, and also illustrates how many files and how much raw data |
| is in each partition. |
| </p> |
| |
| <codeblock>[localhost:21000] > compute incremental stats airlines; |
| +-------------------------------------------+ |
| | summary | |
| +-------------------------------------------+ |
| | Updated 22 partition(s) and 27 column(s). | |
| +-------------------------------------------+ |
| [localhost:21000] > show table stats airlines; |
| +-------+-----------+--------+----------+--------------+------------+---------+-------------------+ |
| | year | #Rows | #Files | Size | Bytes Cached | Cache Repl | Format | Incremental stats | |
| +-------+-----------+--------+----------+--------------+------------+---------+----- |
| | 1987 | 1311826 | 1 | 9.32MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 1988 | 5202096 | 1 | 37.04MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 1989 | 5041200 | 1 | 36.25MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 1990 | 5270893 | 1 | 38.39MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 1991 | 5076925 | 1 | 37.23MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 1992 | 5092157 | 1 | 36.85MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 1993 | 5070501 | 1 | 37.16MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 1994 | 5180048 | 1 | 38.31MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 1995 | 5327435 | 1 | 53.14MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 1996 | 5351983 | 1 | 53.64MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 1997 | 5411843 | 1 | 54.41MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 1998 | 5384721 | 1 | 54.01MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 1999 | 5527884 | 1 | 56.32MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 2000 | 5683047 | 1 | 58.15MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 2001 | 5967780 | 1 | 60.65MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 2002 | 5271359 | 1 | 57.99MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 2003 | 6488540 | 1 | 81.33MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 2004 | 7129270 | 1 | 103.19MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 2005 | 7140596 | 1 | 102.61MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 2006 | 7141922 | 1 | 106.03MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 2007 | 7453215 | 1 | 112.15MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | 2008 | 7009728 | 1 | 105.76MB | NOT CACHED | NOT CACHED | PARQUET | true |
| | Total | 123534969 | 22 | 1.30GB | 0B | | | |
| +-------+-----------+--------+----------+--------------+------------+---------+----- |
| </codeblock> |
| |
| <p> |
| At this point, we go through a quick thought process to sanity check the partitioning we did. |
| All the partitions have exactly one file, which is on the low side. A query that includes a |
| clause <codeph>WHERE year=2004</codeph> will only read a single data block; that data block |
| will be read and processed by a single data node; therefore, for a query targeting a single year, |
| all the other nodes in the cluster will sit idle while all the work happens on a single machine. |
| It's even possible that by chance (depending on HDFS replication factor and the way data blocks are distributed |
| across the cluster), that multiple year partitions selected by a filter such as |
| <codeph>WHERE year BETWEEN 1999 AND 2001</codeph> could all be read and processed by the same data node. |
| The more data files each partition has, the more parallelism you can get and the less probability of <q>hotspots</q> occurring |
| on particular nodes, therefore a bigger performance boost by having a big cluster. |
| </p> |
| |
| <p> |
| However, the more data files, the less data goes in each one. The overhead of dividing the work in a |
| parallel query might not be worth it if each node is only reading a few megabytes. 50 or 100 megabytes |
| is a decent size for a Parquet data block; 9 or 37 megabytes is on the small side. |
| Which is to say, the data distribution we ended up with based on this partitioning scheme is |
| on the borderline between sensible (reasonably large files) and suboptimal (few files in each partition). |
| The way to see how well it works in practice is to run the same queries against the original |
| flat table and the new partitioned table, and compare times. |
| </p> |
| |
| <p> |
| Spoiler: in this case, with my particular 4-node cluster with its specific distribution of data blocks and my |
| particular exploratory queries, queries against the partitioned table do consistently run faster than the same queries |
| against the unpartitioned table. But I could not be sure that would be the case without some real measurements. |
| Here are some queries I ran to draw that conclusion, first against <codeph>AIRLINES_EXTERNAL</codeph> (no partitioning), |
| then against <codeph>AIRLINES</codeph> (partitioned by year). The <codeph>AIRLINES</codeph> queries are consistently faster. |
| Changing the volume of data, changing the size of the cluster, running queries that did or didn't refer to the partition key |
| columns, or other factors could change the results to favor one table layout or the other. |
| </p> |
| |
| <note> |
| If you find the volume of each partition is only in the low tens of megabytes, consider lowering the granularity |
| of partitioning. For example, instead of partitioning by year, month, and day, partition by year and month |
| or even just by year. The ideal layout to distribute work efficiently in a parallel query is many tens or |
| even hundreds of megabytes per Parquet file, and the number of Parquet files in each partition somewhat |
| higher than the number of data nodes. |
| </note> |
| |
| <codeblock>[localhost:21000] > select sum(airtime) from airlines_external; |
| +--------------+ |
| | sum(airtime) | |
| +--------------+ |
| | 8662859484 | |
| +--------------+ |
| Fetched 1 row(s) in 2.02s |
| [localhost:21000] > select sum(airtime) from airlines; |
| +--------------+ |
| | sum(airtime) | |
| +--------------+ |
| | 8662859484 | |
| +--------------+ |
| Fetched 1 row(s) in 1.21s |
| |
| [localhost:21000] > select sum(airtime) from airlines_external where year = 2005; |
| +--------------+ |
| | sum(airtime) | |
| +--------------+ |
| | 708204026 | |
| +--------------+ |
| Fetched 1 row(s) in 2.61s |
| [localhost:21000] > select sum(airtime) from airlines where year = 2005; |
| +--------------+ |
| | sum(airtime) | |
| +--------------+ |
| | 708204026 | |
| +--------------+ |
| Fetched 1 row(s) in 1.19s |
| |
| [localhost:21000] > select sum(airtime) from airlines_external where dayofweek = 1; |
| +--------------+ |
| | sum(airtime) | |
| +--------------+ |
| | 1264945051 | |
| +--------------+ |
| Fetched 1 row(s) in 2.82s |
| [localhost:21000] > select sum(airtime) from airlines where dayofweek = 1; |
| +--------------+ |
| | sum(airtime) | |
| +--------------+ |
| | 1264945051 | |
| +--------------+ |
| Fetched 1 row(s) in 1.61s |
| </codeblock> |
| |
| <p> |
| Now we can finally do some serious analysis with this data set that, remember, |
| a few minutes ago all we had were some raw data files and we didn't even know |
| what columns they contained. |
| Let's see whether the <q>air time</q> of a flight tends to be different depending on the |
| day of the week. We can see that the average is a little higher on day number 6; |
| perhaps Saturday is a busy flying day and planes have to circle for longer at the |
| destination airport before landing. |
| </p> |
| |
| <codeblock>[localhost:21000] > select dayofweek, avg(airtime) from airlines |
| > group by dayofweek order by dayofweek; |
| +-----------+-------------------+ |
| | dayofweek | avg(airtime) | |
| +-----------+-------------------+ |
| | 1 | 102.1560425016671 | |
| | 2 | 102.1582931538807 | |
| | 3 | 102.2170009256653 | |
| | 4 | 102.37477661846 | |
| | 5 | 102.2697358763511 | |
| | 6 | 105.3627448363705 | |
| | 7 | 103.4144351202054 | |
| +-----------+-------------------+ |
| Fetched 7 row(s) in 2.25s |
| </codeblock> |
| |
| <p> |
| To see if the apparent trend holds up over time, let's do the same breakdown by day of week, but also |
| split up by year. Now we can see that day number 6 consistently has a higher average air time in each |
| year. We can also see that the average air time increased over time across the board. And the presence |
| of <codeph>NULL</codeph> for this column in years 1987 to 1994 shows that queries involving this column |
| need to be restricted to a date range of 1995 and higher. |
| </p> |
| |
| <codeblock>[localhost:21000] > select year, dayofweek, avg(airtime) from airlines |
| > group by year, dayofweek order by year desc, dayofweek; |
| +------+-----------+-------------------+ |
| | year | dayofweek | avg(airtime) | |
| +------+-----------+-------------------+ |
| | 2008 | 1 | 103.1821651651355 | |
| | 2008 | 2 | 103.2149301386094 | |
| | 2008 | 3 | 103.0585076622796 | |
| | 2008 | 4 | 103.4671383539038 | |
| | 2008 | 5 | 103.5575385182659 | |
| | 2008 | 6 | 107.4006306562128 | |
| | 2008 | 7 | 104.8648851041755 | |
| | 2007 | 1 | 102.2196114337825 | |
| | 2007 | 2 | 101.9317791906348 | |
| | 2007 | 3 | 102.0964767689043 | |
| | 2007 | 4 | 102.6215927201686 | |
| | 2007 | 5 | 102.4289399000661 | |
| | 2007 | 6 | 105.1477448215756 | |
| | 2007 | 7 | 103.6305945644095 | |
| ... |
| | 1996 | 1 | 99.33860750862108 | |
| | 1996 | 2 | 99.54225446396656 | |
| | 1996 | 3 | 99.41129336113134 | |
| | 1996 | 4 | 99.5110373340348 | |
| | 1996 | 5 | 99.22120745027595 | |
| | 1996 | 6 | 101.1717447111921 | |
| | 1996 | 7 | 99.95410136133704 | |
| | 1995 | 1 | 96.93779698300494 | |
| | 1995 | 2 | 96.93458674589712 | |
| | 1995 | 3 | 97.00972311337051 | |
| | 1995 | 4 | 96.90843832024412 | |
| | 1995 | 5 | 96.78382115425562 | |
| | 1995 | 6 | 98.70872826057003 | |
| | 1995 | 7 | 97.85570478374616 | |
| | 1994 | 1 | NULL | |
| | 1994 | 2 | NULL | |
| | 1994 | 3 | NULL | |
| ... |
| | 1987 | 5 | NULL | |
| | 1987 | 6 | NULL | |
| | 1987 | 7 | NULL | |
| +------+-----------+-------------------+ |
| </codeblock> |
| |
| </conbody> |
| </concept> |
| |
| <concept audience="hidden" id="tut_synthesizing"> |
| |
| <title>Synthesizing Data</title> |
| |
| <conbody> |
| |
| <!-- Scenarios: |
| INSERT VALUES 1 row or a few rows, then duplicate the values with slight alterations. |
| - Use row_number() and rank() to assign IDs to values from narrow tables. |
| - Use sum() in analytic context, with sliding window, to derive other arbitrary numbers. |
| Grab database, table, column, etc. names from metastore and play with those. |
| Maybe take results of a 'find' and play with those. |
| - Invent fields like 'username' based on /home or whatever subdirectories. |
| - Perhaps also capture permissions, timestamps, etc. via 'ls -l'. |
| - Illustrate narrow table with 'extension' field, wide table with 'jpg', 'Office', 'picture', 'XML', etc. fields |
| based on case() of extension. |
| Decompose some string values using regexp_extract. |
| Set up views to get the logical layout right before making new physical copies of data. |
| --> |
| |
| <p></p> |
| </conbody> |
| </concept> |
| </concept> |