| <!DOCTYPE html> |
| <html lang="en"> |
| <head> |
| <meta charset="utf-8" /> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge" /> |
| <meta name="viewport" content="width=device-width, initial-scale=1" /> |
| <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags --> |
| <meta name="description" content="A new open source Apache Hadoop ecosystem project, Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data" /> |
| <meta name="author" content="Cloudera" /> |
| <title>Apache Kudu - Transparent Hierarchical Storage Management with Apache Kudu and Impala</title> |
| <!-- Bootstrap core CSS --> |
| <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css" |
| integrity="sha384-1q8mTJOASx8j1Au+a5WDVnPi2lkFfwwEAa8hDDdjZlpLegxhjVME1fgjWPGmkzs7" |
| crossorigin="anonymous"> |
| |
| <!-- Custom styles for this template --> |
| <link href="/css/kudu.css" rel="stylesheet"/> |
| <link href="/css/asciidoc.css" rel="stylesheet"/> |
| <link rel="shortcut icon" href="/img/logo-favicon.ico" /> |
| <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.1/css/font-awesome.min.css" /> |
| |
| |
| <link rel="alternate" type="application/atom+xml" |
| title="RSS Feed for Apache Kudu blog" |
| href="/feed.xml" /> |
| |
| |
| <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries --> |
| <!--[if lt IE 9]> |
| <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script> |
| <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script> |
| <![endif]--> |
| </head> |
| <body> |
| <div class="kudu-site container-fluid"> |
| <!-- Static navbar --> |
| <nav class="navbar navbar-default"> |
| <div class="container-fluid"> |
| <div class="navbar-header"> |
| <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar"> |
| <span class="sr-only">Toggle navigation</span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| |
| <a class="logo" href="/"><img |
| src="//d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_80px.png" |
| srcset="//d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_80px.png 1x, //d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_160px.png 2x" |
| alt="Apache Kudu"/></a> |
| |
| </div> |
| <div id="navbar" class="collapse navbar-collapse"> |
| <ul class="nav navbar-nav navbar-right"> |
| <li > |
| <a href="/">Home</a> |
| </li> |
| <li > |
| <a href="/overview.html">Overview</a> |
| </li> |
| <li > |
| <a href="/docs/">Documentation</a> |
| </li> |
| <li > |
| <a href="/releases/">Releases</a> |
| </li> |
| <li class="active"> |
| <a href="/blog/">Blog</a> |
| </li> |
| <!-- NOTE: this dropdown menu does not appear on Mobile, so don't add anything here |
| that doesn't also appear elsewhere on the site. --> |
| <li class="dropdown"> |
| <a href="/community.html" role="button" aria-haspopup="true" aria-expanded="false">Community <span class="caret"></span></a> |
| <ul class="dropdown-menu"> |
| <li class="dropdown-header">GET IN TOUCH</li> |
| <li><a class="icon email" href="/community.html">Mailing Lists</a></li> |
| <li><a class="icon slack" href="https://join.slack.com/t/getkudu/shared_invite/zt-244b4zvki-hB1q9IbAk6CqHNMZHvUALA">Slack Channel</a></li> |
| <li role="separator" class="divider"></li> |
| <li><a href="/community.html#meetups-user-groups-and-conference-presentations">Events and Meetups</a></li> |
| <li><a href="/committers.html">Project Committers</a></li> |
| <li><a href="/ecosystem.html">Ecosystem</a></li> |
| <!--<li><a href="/roadmap.html">Roadmap</a></li>--> |
| <li><a href="/community.html#contributions">How to Contribute</a></li> |
| <li role="separator" class="divider"></li> |
| <li class="dropdown-header">DEVELOPER RESOURCES</li> |
| <li><a class="icon github" href="https://github.com/apache/incubator-kudu">GitHub</a></li> |
| <li><a class="icon gerrit" href="http://gerrit.cloudera.org:8080/#/q/status:open+project:kudu">Gerrit Code Review</a></li> |
| <li><a class="icon jira" href="https://issues.apache.org/jira/browse/KUDU">JIRA Issue Tracker</a></li> |
| <li role="separator" class="divider"></li> |
| <li class="dropdown-header">SOCIAL MEDIA</li> |
| <li><a class="icon twitter" href="https://twitter.com/ApacheKudu">Twitter</a></li> |
| <li><a href="https://www.reddit.com/r/kudu/">Reddit</a></li> |
| <li role="separator" class="divider"></li> |
| <li class="dropdown-header">APACHE SOFTWARE FOUNDATION</li> |
| <li><a href="https://www.apache.org/security/" target="_blank">Security</a></li> |
| <li><a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a></li> |
| <li><a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a></li> |
| <li><a href="https://www.apache.org/licenses/" target="_blank">License</a></li> |
| </ul> |
| </li> |
| <li > |
| <a href="/faq.html">FAQ</a> |
| </li> |
| </ul><!-- /.nav --> |
| </div><!-- /#navbar --> |
| </div><!-- /.container-fluid --> |
| </nav> |
| |
| <div class="row header"> |
| <div class="col-lg-12"> |
| <h2><a href="/blog">Apache Kudu Blog</a></h2> |
| </div> |
| </div> |
| |
| <div class="row-fluid"> |
| <div class="col-lg-9"> |
| <article> |
| <header> |
| <h1 class="entry-title">Transparent Hierarchical Storage Management with Apache Kudu and Impala</h1> |
| <p class="meta">Posted 05 Mar 2019 by Grant Henke</p> |
| </header> |
| <div class="entry-content"> |
| <p>Note: This is a cross-post from the Cloudera Engineering Blog |
| <a href="https://blog.cloudera.com/blog/2019/03/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/">Transparent Hierarchical Storage Management with Apache Kudu and Impala</a></p> |
| |
| <p>When picking a storage option for an application, it is common to choose a single |
| option whose features best match your use case. For mutability |
| and real-time analytics workloads you may want to use Apache Kudu, but for massive |
| scalability at a low cost you may want to use HDFS. For that reason, there is a need |
| for a solution that allows you to leverage the best features of multiple storage |
| options. This post describes the sliding window pattern using Apache Impala with data |
| stored in Apache Kudu and Apache HDFS. With this pattern you get all of the benefits |
| of multiple storage layers in a way that is transparent to users.</p> |
| |
| <!--more--> |
| |
| <p>Apache Kudu is designed for fast analytics on rapidly changing data. Kudu provides a |
| combination of fast inserts/updates and efficient columnar scans to enable multiple |
| real-time analytic workloads across a single storage layer. For that reason, Kudu fits |
| well into a data pipeline as the place to store real-time data that needs to be |
| queryable immediately. Additionally, Kudu supports updating and deleting rows in |
| real time, which allows for late-arriving data and data corrections.</p> |
| |
| <p>Apache HDFS is designed to allow for limitless scalability at a low cost. It is |
| optimized for batch-oriented use cases where data is immutable. When paired with the |
| Apache Parquet file format, structured data can be accessed with extremely high |
| throughput and efficiency.</p> |
| |
| <p>For situations in which the data is small and ever-changing, like dimension tables, |
| it is common to keep all of the data in Kudu. It is even common to keep large tables |
| in Kudu when the data fits within Kudu’s |
| <a href="https://kudu.apache.org/docs/known_issues.html#_scale">scaling limits</a> and can benefit |
| from Kudu’s unique features. In cases where the data is massive, batch-oriented, and |
| unlikely to change, storing the data in HDFS using the Parquet format is preferred. |
| When you need the benefits of both storage layers, the sliding window pattern is a |
| useful solution.</p> |
| |
| <h2 id="the-sliding-window-pattern">The Sliding Window Pattern</h2> |
| |
| <p>In this pattern, matching Kudu and Parquet formatted HDFS tables are created in Impala. |
| These tables are partitioned by a unit of time based on how frequently the data is |
| moved between the Kudu and HDFS table. It is common to use daily, monthly, or yearly |
| partitions. A unified view is created and a <code class="language-plaintext highlighter-rouge">WHERE</code> clause is used to define a boundary |
| that separates which data is read from the Kudu table and which is read from the HDFS |
| table. The defined boundary is important so that you can move data between Kudu and |
| HDFS without exposing duplicate records to the view. Once the data is moved, an atomic |
| <code class="language-plaintext highlighter-rouge">ALTER VIEW</code> statement can be used to move the boundary forward.</p> |
| |
| <p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/sliding-window-pattern.png" alt="png" class="img-responsive" /></p> |
| |
| <p>Note: This pattern works best with somewhat sequential data organized into range |
| partitions, because sliding the time window forward and dropping entire partitions is |
| very efficient.</p> |
| |
| <p>This pattern results in a sliding window of time where mutable data is stored in Kudu |
| and immutable data is stored in the Parquet format on HDFS. Leveraging both Kudu and |
| HDFS via Impala provides the benefits of both storage systems:</p> |
| |
| <ul> |
| <li>Streaming data is immediately queryable</li> |
| <li>Updates for late-arriving data or manual corrections can be made</li> |
| <li>Data stored in HDFS is optimally sized, increasing performance and preventing small files</li> |
| <li>Reduced cost</li> |
| </ul> |
| |
| <p>Impala also supports cloud storage options such as |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_s3.html">S3</a> and |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_adls.html">ADLS</a>. |
| This capability allows convenient access to a storage system that is remotely managed, |
| accessible from anywhere, and integrated with various cloud-based services. Because |
| this data is remote, queries against S3 data are less performant, making S3 suitable |
| for holding “cold” data that is only queried occasionally. This pattern can be |
| extended to use cloud storage for cold data by creating a third matching table and |
| adding another boundary to the unified view.</p> |
| |
| <p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/sliding-window-pattern-cold.png" alt="png" class="img-responsive" /></p> |
| |
| <p>Note: For simplicity only Kudu and HDFS are illustrated in the examples below.</p> |
| |
| <p>The process for moving data from Kudu to HDFS is broken into two phases. The first |
| phase is the data migration, and the second phase is the metadata change. These |
| ongoing steps should be scheduled to run automatically on a regular basis.</p> |
| |
| <p>In the first phase, the now-immutable data is copied from Kudu to HDFS. Even though |
| data is duplicated from Kudu into HDFS, the boundary defined in the view will prevent |
| duplicate data from being shown to users. This step can include any validation and |
| retries as needed to ensure the data offload is successful.</p> |
| |
| <p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/phase-1.png" alt="png" class="img-responsive" /></p> |
| |
| <p>In the second phase, now that the data is safely copied to HDFS, the metadata is |
| changed to adjust how the offloaded partition is exposed. This includes shifting |
| the boundary forward, adding a new Kudu partition for the next period, and dropping |
| the old Kudu partition.</p> |
| |
| <p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/phase-2.png" alt="png" class="img-responsive" /></p> |
| |
| <h2 id="building-blocks">Building Blocks</h2> |
| |
| <p>In order to implement the sliding window pattern, a few Impala fundamentals are |
| required. Each fundamental building block of the pattern is described below.</p> |
| |
| <h3 id="moving-data">Moving Data</h3> |
| |
| <p>Moving data among storage systems via Impala is straightforward provided you have |
| matching tables defined using each of the storage formats. To keep this post brief, |
| not all of the options available when creating an Impala table are described. |
| However, Impala’s |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_create_table.html">CREATE TABLE documentation</a> |
| can be referenced to find the correct syntax for Kudu, HDFS, and cloud storage tables. |
| A few examples are shown further below where the sliding window pattern is illustrated.</p> |
| |
| <p>Once the tables are created, moving the data is as simple as an |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_insert.html">INSERT…SELECT</a> statement:</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">table_foo</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">table_bar</span><span class="p">;</span></code></pre></figure> |
| |
| <p>All of the features of the |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_select.html">SELECT</a> |
| statement can be used to select the specific data you would like to move.</p> |
| |
| <p>Note: If moving data to Kudu, an <code class="language-plaintext highlighter-rouge">UPSERT INTO</code> statement can be used to handle |
| duplicate keys.</p> |
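| |
| <p>For example, assuming <code class="language-plaintext highlighter-rouge">table_foo</code> is a Kudu table, a sketch of such a move could be:</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Rows with matching primary keys are updated in place instead of failing |
| UPSERT INTO table_foo |
| SELECT * FROM table_bar;</code></pre></figure> |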
| |
| <h3 id="unified-querying">Unified Querying</h3> |
| |
| <p>Querying data from multiple tables and data sources in Impala is also straightforward. |
| For the sake of brevity, not all of the options available when creating an Impala |
| view are described. However, see Impala’s |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_create_view.html">CREATE VIEW documentation</a> |
| for more in-depth details.</p> |
| |
| <p>Creating a view for unified querying is as simple as a <code class="language-plaintext highlighter-rouge">CREATE VIEW</code> statement using |
| two <code class="language-plaintext highlighter-rouge">SELECT</code> clauses combined with a <code class="language-plaintext highlighter-rouge">UNION ALL</code>:</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">VIEW</span> <span class="n">foo_view</span> <span class="k">AS</span> |
| <span class="k">SELECT</span> <span class="n">col1</span><span class="p">,</span> <span class="n">col2</span><span class="p">,</span> <span class="n">col3</span> <span class="k">FROM</span> <span class="n">foo_parquet</span> |
| <span class="k">UNION</span> <span class="k">ALL</span> |
| <span class="k">SELECT</span> <span class="n">col1</span><span class="p">,</span> <span class="n">col2</span><span class="p">,</span> <span class="n">col3</span> <span class="k">FROM</span> <span class="n">foo_kudu</span><span class="p">;</span></code></pre></figure> |
| |
| <p>WARNING: Be sure to use <code class="language-plaintext highlighter-rouge">UNION ALL</code> and not <code class="language-plaintext highlighter-rouge">UNION</code>. The <code class="language-plaintext highlighter-rouge">UNION</code> keyword by itself |
| is the same as <code class="language-plaintext highlighter-rouge">UNION DISTINCT</code> and can have significant performance impact. |
| More information can be found in the Impala |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_union.html">UNION documentation</a>.</p> |
| |
| <p>All of the features of the |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_select.html">SELECT</a> |
| statement can be used to expose the correct data and columns from each of the |
| underlying tables. It is important to use the <code class="language-plaintext highlighter-rouge">WHERE</code> clause to pass through and |
| push down any predicates that need special handling or transformations. More examples |
| will follow below in the discussion of the sliding window pattern.</p> |
| |
| <p>Additionally, views can be altered via the |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_alter_view.html">ALTER VIEW</a> |
| statement. This is useful when combined with the <code class="language-plaintext highlighter-rouge">SELECT</code> statement because it can be |
| used to atomically update what data is being accessed by the view.</p> |
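| |
| <p>As a sketch, building on the earlier <code class="language-plaintext highlighter-rouge">foo_view</code> example and assuming <code class="language-plaintext highlighter-rouge">col1</code> carries the boundary value, the boundary could be shifted atomically like this:</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Replaces the view definition in a single atomic metadata operation |
| ALTER VIEW foo_view AS |
| SELECT col1, col2, col3 FROM foo_parquet |
| WHERE col1 < 100 |
| UNION ALL |
| SELECT col1, col2, col3 FROM foo_kudu |
| WHERE col1 >= 100;</code></pre></figure> |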
| |
| <h2 id="an-example-implementation">An Example Implementation</h2> |
| |
| <p>Below are sample steps to implement the sliding window pattern using a monthly period |
| with three months of active mutable data. Data older than three months will be |
| offloaded to HDFS using the Parquet format.</p> |
| |
| <h3 id="create-the-kudu-table">Create the Kudu Table</h3> |
| |
| <p>First, create a Kudu table which will hold three months of active mutable data. |
| The table is range partitioned by the time column with each range containing one |
| period of data. It is important to have partitions that match the period because |
| dropping Kudu partitions is much more efficient than removing the data via the |
| <code class="language-plaintext highlighter-rouge">DELETE</code> statement. The table is also hash partitioned by the other key column to ensure |
| that all of the data is not written to a single partition.</p> |
| |
| <p>Note: Your schema design should vary based on your data and read/write performance |
| considerations. This example schema is intended for demonstration purposes and not as |
| an “optimal” schema. See the |
| <a href="https://kudu.apache.org/docs/schema_design.html">Kudu schema design documentation</a> |
| for more guidance on choosing your schema. For example, you may not need any hash |
| partitioning if your |
| data input rate is low. Alternatively, you may need more hash buckets if your data |
| input rate is very high.</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">my_table_kudu</span> |
| <span class="p">(</span> |
| <span class="n">name</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="nb">time</span> <span class="nb">TIMESTAMP</span><span class="p">,</span> |
| <span class="n">message</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="nb">time</span><span class="p">)</span> |
| <span class="p">)</span> |
| <span class="n">PARTITION</span> <span class="k">BY</span> |
| <span class="n">HASH</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="n">PARTITIONS</span> <span class="mi">4</span><span class="p">,</span> |
| <span class="n">RANGE</span><span class="p">(</span><span class="nb">time</span><span class="p">)</span> <span class="p">(</span> |
| <span class="n">PARTITION</span> <span class="s1">'2018-01-01'</span> <span class="o"><=</span> <span class="k">VALUES</span> <span class="o"><</span> <span class="s1">'2018-02-01'</span><span class="p">,</span> <span class="c1">--January</span> |
| <span class="n">PARTITION</span> <span class="s1">'2018-02-01'</span> <span class="o"><=</span> <span class="k">VALUES</span> <span class="o"><</span> <span class="s1">'2018-03-01'</span><span class="p">,</span> <span class="c1">--February</span> |
| <span class="n">PARTITION</span> <span class="s1">'2018-03-01'</span> <span class="o"><=</span> <span class="k">VALUES</span> <span class="o"><</span> <span class="s1">'2018-04-01'</span><span class="p">,</span> <span class="c1">--March</span> |
| <span class="n">PARTITION</span> <span class="s1">'2018-04-01'</span> <span class="o"><=</span> <span class="k">VALUES</span> <span class="o"><</span> <span class="s1">'2018-05-01'</span> <span class="c1">--April</span> |
| <span class="p">)</span> |
| <span class="n">STORED</span> <span class="k">AS</span> <span class="n">KUDU</span><span class="p">;</span></code></pre></figure> |
| |
| <p>Note: There is an extra month partition to provide a buffer of time for the data to |
| be moved into the immutable table.</p> |
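| |
| <p>When the window later advances, the buffer can be extended by adding the next month’s range partition; for example (a sketch with literal dates):</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Adds a May partition so new data keeps landing while April awaits offload |
| ALTER TABLE my_table_kudu |
| ADD RANGE PARTITION '2018-05-01' <= VALUES < '2018-06-01'; --May</code></pre></figure> |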
| |
| <h3 id="create-the-hdfs-table">Create the HDFS Table</h3> |
| |
| <p>Create the matching Parquet formatted HDFS table which will hold the older immutable |
| data. This table is partitioned by year, month, and day for efficient access even |
| though you can’t partition by the time column itself. This is addressed further in |
| the view step below. See Impala’s |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_partitioning.html">partitioning documentation</a> |
| for more details.</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">my_table_parquet</span> |
| <span class="p">(</span> |
| <span class="n">name</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="nb">time</span> <span class="nb">TIMESTAMP</span><span class="p">,</span> |
| <span class="n">message</span> <span class="n">STRING</span> |
| <span class="p">)</span> |
| <span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="p">(</span><span class="nb">year</span> <span class="nb">int</span><span class="p">,</span> <span class="k">month</span> <span class="nb">int</span><span class="p">,</span> <span class="k">day</span> <span class="nb">int</span><span class="p">)</span> |
| <span class="n">STORED</span> <span class="k">AS</span> <span class="n">PARQUET</span><span class="p">;</span></code></pre></figure> |
| |
| <h3 id="create-the-unified-view">Create the Unified View</h3> |
| |
| <p>Now create the unified view which will be used to query all of the data seamlessly:</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">VIEW</span> <span class="n">my_table_view</span> <span class="k">AS</span> |
| <span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="nb">time</span><span class="p">,</span> <span class="n">message</span> |
| <span class="k">FROM</span> <span class="n">my_table_kudu</span> |
| <span class="k">WHERE</span> <span class="nb">time</span> <span class="o">>=</span> <span class="nv">"2018-01-01"</span> |
| <span class="k">UNION</span> <span class="k">ALL</span> |
| <span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="nb">time</span><span class="p">,</span> <span class="n">message</span> |
| <span class="k">FROM</span> <span class="n">my_table_parquet</span> |
| <span class="k">WHERE</span> <span class="nb">time</span> <span class="o"><</span> <span class="nv">"2018-01-01"</span> |
| <span class="k">AND</span> <span class="nb">year</span> <span class="o">=</span> <span class="nb">year</span><span class="p">(</span><span class="nb">time</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="k">month</span> <span class="o">=</span> <span class="k">month</span><span class="p">(</span><span class="nb">time</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="k">day</span> <span class="o">=</span> <span class="k">day</span><span class="p">(</span><span class="nb">time</span><span class="p">);</span></code></pre></figure> |
| |
| <p>Each <code class="language-plaintext highlighter-rouge">SELECT</code> clause explicitly lists all of the columns to expose. This ensures that |
| the year, month, and day columns that are unique to the Parquet table are not exposed. |
| If needed, it also allows any necessary column or type mapping to be handled.</p> |
| |
| <p>The initial <code class="language-plaintext highlighter-rouge">WHERE</code> clauses applied to both <code class="language-plaintext highlighter-rouge">my_table_kudu</code> and <code class="language-plaintext highlighter-rouge">my_table_parquet</code> define |
| the boundary between Kudu and HDFS to ensure duplicate data is not read while in the |
| process of offloading data.</p> |
| |
| <p>The additional <code class="language-plaintext highlighter-rouge">AND</code> clauses applied to <code class="language-plaintext highlighter-rouge">my_table_parquet</code> are used to ensure good |
| predicate pushdown on the individual year, month, and day columns.</p> |
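| |
| <p>Users can then query the view as if it were a single table. For example, this hypothetical query reads from both storage layers transparently:</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Rows before 2018-01-01 come from HDFS, later rows from Kudu |
| SELECT name, count(*) AS message_count |
| FROM my_table_view |
| WHERE time >= "2017-12-01" AND time < "2018-02-01" |
| GROUP BY name;</code></pre></figure> |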
| |
| <p>WARNING: As stated earlier, be sure to use <code class="language-plaintext highlighter-rouge">UNION ALL</code> and not <code class="language-plaintext highlighter-rouge">UNION</code>. The <code class="language-plaintext highlighter-rouge">UNION</code> |
| keyword by itself is the same as <code class="language-plaintext highlighter-rouge">UNION DISTINCT</code> and can have significant performance |
| impact. More information can be found in the Impala |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_union.html"><code class="language-plaintext highlighter-rouge">UNION</code> documentation</a>.</p> |
| |
| <h3 id="ongoing-steps">Ongoing Steps</h3> |
| |
| <p>Now that the base tables and view are created, prepare the ongoing steps to maintain |
| the sliding window. Because these ongoing steps should be scheduled to run on a |
| regular basis, the examples below are shown using <code class="language-plaintext highlighter-rouge">.sql</code> files that take variables |
| which can be passed from your scripts and scheduling tool of choice.</p> |
| |
| <p>Create the <code class="language-plaintext highlighter-rouge">window_data_move.sql</code> file to move the data from the oldest partition to HDFS:</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">INSERT</span> <span class="k">INTO</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">hdfs_table</span><span class="err">}</span> <span class="n">PARTITION</span> <span class="p">(</span><span class="nb">year</span><span class="p">,</span> <span class="k">month</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span> |
| <span class="k">SELECT</span> <span class="o">*</span><span class="p">,</span> <span class="nb">year</span><span class="p">(</span><span class="nb">time</span><span class="p">),</span> <span class="k">month</span><span class="p">(</span><span class="nb">time</span><span class="p">),</span> <span class="k">day</span><span class="p">(</span><span class="nb">time</span><span class="p">)</span> |
| <span class="k">FROM</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span> |
| <span class="k">WHERE</span> <span class="nb">time</span> <span class="o">>=</span> <span class="n">add_months</span><span class="p">(</span><span class="nv">"${var:new_boundary_time}"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="nb">time</span> <span class="o"><</span> <span class="nv">"${var:new_boundary_time}"</span><span class="p">;</span> |
| <span class="n">COMPUTE</span> <span class="n">INCREMENTAL</span> <span class="n">STATS</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">hdfs_table</span><span class="err">}</span><span class="p">;</span></code></pre></figure> |
| |
| <p>Note: The |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html">COMPUTE INCREMENTAL STATS</a> |
| statement is not required, but it helps Impala optimize queries.</p> |
| |
| <p>To run the SQL statement, use the Impala shell and pass the required variables. |
| Below is an example:</p> |
| |
| <figure class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell <span class="nt">-i</span> <impalad:port> <span class="nt">-f</span> window_data_move.sql |
| <span class="nt">--var</span><span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| <span class="nt">--var</span><span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet |
| <span class="nt">--var</span><span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">"2018-02-01"</span></code></pre></figure> |
| |
| <p>Note: You can adjust the <code class="language-plaintext highlighter-rouge">WHERE</code> clause to match the given period and cadence of your |
| offload. Here the <code class="language-plaintext highlighter-rouge">add_months</code> function is called with an argument of -1 to move the single |
| month of data immediately preceding the new boundary time.</p> |
| |
| <p>Create the <code class="language-plaintext highlighter-rouge">window_view_alter.sql</code> file to shift the time boundary forward by altering |
| the unified view:</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">ALTER</span> <span class="k">VIEW</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">view_name</span><span class="err">}</span> <span class="k">AS</span> |
| <span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="nb">time</span><span class="p">,</span> <span class="n">message</span> |
| <span class="k">FROM</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span> |
| <span class="k">WHERE</span> <span class="nb">time</span> <span class="o">>=</span> <span class="nv">"${var:new_boundary_time}"</span> |
| <span class="k">UNION</span> <span class="k">ALL</span> |
| <span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="nb">time</span><span class="p">,</span> <span class="n">message</span> |
| <span class="k">FROM</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">hdfs_table</span><span class="err">}</span> |
| <span class="k">WHERE</span> <span class="nb">time</span> <span class="o"><</span> <span class="nv">"${var:new_boundary_time}"</span> |
| <span class="k">AND</span> <span class="nb">year</span> <span class="o">=</span> <span class="nb">year</span><span class="p">(</span><span class="nb">time</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="k">month</span> <span class="o">=</span> <span class="k">month</span><span class="p">(</span><span class="nb">time</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="k">day</span> <span class="o">=</span> <span class="k">day</span><span class="p">(</span><span class="nb">time</span><span class="p">);</span></code></pre></figure> |
| |
| <p>To run the SQL statement, use the Impala shell and pass the required variables. |
| Below is an example:</p> |
| |
| <figure class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell <span class="nt">-i</span> <impalad:port> <span class="nt">-f</span> window_view_alter.sql |
| <span class="nt">--var</span><span class="o">=</span><span class="nv">view_name</span><span class="o">=</span>my_table_view |
| <span class="nt">--var</span><span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| <span class="nt">--var</span><span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet |
| <span class="nt">--var</span><span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">"2018-02-01"</span></code></pre></figure> |
| |
| <p>Create the <code class="language-plaintext highlighter-rouge">window_partition_shift.sql</code> file to shift the Kudu partitions forward:</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">ALTER</span> <span class="k">TABLE</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span> |
| |
| <span class="k">ADD</span> <span class="n">RANGE</span> <span class="n">PARTITION</span> <span class="n">add_months</span><span class="p">(</span><span class="nv">"${var:new_boundary_time}"</span><span class="p">,</span> |
| <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">window_length</span><span class="err">}</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="n">add_months</span><span class="p">(</span><span class="nv">"${var:new_boundary_time}"</span><span class="p">,</span>
| <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">window_length</span><span class="err">}</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span> |
| |
| <span class="k">ALTER</span> <span class="k">TABLE</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span> |
| |
| <span class="k">DROP</span> <span class="n">RANGE</span> <span class="n">PARTITION</span> <span class="n">add_months</span><span class="p">(</span><span class="nv">"${var:new_boundary_time}"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> |
| <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="nv">"${var:new_boundary_time}"</span><span class="p">;</span></code></pre></figure>
| |
| <p>To run the SQL statement, use the Impala shell and pass the required variables. |
| Below is an example:</p> |
| |
| <figure class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell <span class="nt">-i</span> &lt;impalad:port&gt; <span class="nt">-f</span> window_partition_shift.sql
| <span class="nt">--var</span><span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| <span class="nt">--var</span><span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">"2018-02-01"</span> |
| <span class="nt">--var</span><span class="o">=</span><span class="nv">window_length</span><span class="o">=</span>3</code></pre></figure> |
| |
| <p>Note: You should periodically run |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html">COMPUTE STATS</a> |
| on your Kudu table to ensure Impala’s query performance is optimal.</p> |
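| 
| <p>For example, statistics can be refreshed after each data move (the table names here
| match the example tables used above; the same statement also applies to the
| Parquet-backed table):</p>
| 
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql">COMPUTE STATS my_table_kudu;
| COMPUTE STATS my_table_parquet;</code></pre></figure>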
| |
| <h3 id="experimentation">Experimentation</h3> |
| |
| <p>Now that you have created the tables, view, and scripts to leverage the sliding |
| window pattern, you can experiment with them by inserting data for different time |
| ranges and running the scripts to move the window forward through time.</p> |
| |
| <p>Insert some sample values into the Kudu table:</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">my_table_kudu</span> <span class="k">VALUES</span> |
| <span class="p">(</span><span class="s1">'joey'</span><span class="p">,</span> <span class="s1">'2018-01-01'</span><span class="p">,</span> <span class="s1">'hello'</span><span class="p">),</span> |
| <span class="p">(</span><span class="s1">'ross'</span><span class="p">,</span> <span class="s1">'2018-02-01'</span><span class="p">,</span> <span class="s1">'goodbye'</span><span class="p">),</span> |
| <span class="p">(</span><span class="s1">'rachel'</span><span class="p">,</span> <span class="s1">'2018-03-01'</span><span class="p">,</span> <span class="s1">'hi'</span><span class="p">);</span></code></pre></figure> |
| |
| <p>Show the data in each table/view:</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></figure> |
| |
| <p>Move the January data into HDFS:</p> |
| |
| <figure class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell <span class="nt">-i</span> &lt;impalad:port&gt; <span class="nt">-f</span> window_data_move.sql
| <span class="nt">--var</span><span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| <span class="nt">--var</span><span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet |
| <span class="nt">--var</span><span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">"2018-02-01"</span></code></pre></figure> |
| |
| <p>Confirm the data is in both places, but not duplicated in the view:</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></figure> |
| |
| <p>Alter the view to shift the time boundary forward to February:</p> |
| |
| <figure class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell <span class="nt">-i</span> &lt;impalad:port&gt; <span class="nt">-f</span> window_view_alter.sql
| <span class="nt">--var</span><span class="o">=</span><span class="nv">view_name</span><span class="o">=</span>my_table_view |
| <span class="nt">--var</span><span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| <span class="nt">--var</span><span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet |
| <span class="nt">--var</span><span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">"2018-02-01"</span></code></pre></figure> |
| |
| <p>Confirm the data is still in both places, but not duplicated in the view:</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></figure> |
| |
| <p>Shift the Kudu partitions forward:</p> |
| |
| <figure class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell <span class="nt">-i</span> &lt;impalad:port&gt; <span class="nt">-f</span> window_partition_shift.sql
| <span class="nt">--var</span><span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| <span class="nt">--var</span><span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">"2018-02-01"</span> |
| <span class="nt">--var</span><span class="o">=</span><span class="nv">window_length</span><span class="o">=</span>3</code></pre></figure> |
| |
| <p>Confirm the January data is now only in HDFS:</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></figure> |
| |
| <p>Confirm predicate push down with Impala’s EXPLAIN statement:</p> |
| |
| <figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span> |
| <span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span> <span class="k">WHERE</span> <span class="nb">time</span> <span class="o">&lt;</span> <span class="nv">"2018-02-01"</span><span class="p">;</span>
| <span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span> <span class="k">WHERE</span> <span class="nb">time</span> <span class="o">&gt;</span> <span class="nv">"2018-02-01"</span><span class="p">;</span></code></pre></figure>
| |
| <p>In the EXPLAIN output, the “SCAN KUDU” section should show “kudu predicates” that
| include the time column filters, and the “SCAN HDFS” section should show “predicates”
| that include the time, day, month, and year columns.</p>
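| 
| <p>As a rough sketch only (the exact plan text and operator numbering vary across
| Impala versions), the filtered queries should produce plan sections along these
| lines:</p>
| 
| <figure class="highlight"><pre><code class="language-plaintext" data-lang="plaintext">SCAN KUDU [my_table_kudu]
|    kudu predicates: time &lt; '2018-02-01'
| ...
| SCAN HDFS [my_table_parquet]
|    predicates: time &lt; '2018-02-01', year = year(time), month = month(time), day = day(time)</code></pre></figure>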
| |
| </div> |
| </article> |
| |
| |
| </div> |
| <div class="col-lg-3 recent-posts"> |
| <h3>Recent posts</h3> |
| <ul> |
| |
| <li> <a href="/2024/03/07/introducing-auto-incrementing-column.html">Introducing Auto-incrementing Column in Kudu</a> </li> |
| |
| <li> <a href="/2023/09/07/apache-kudu-1-17-0-released.html">Apache Kudu 1.17.0 Released</a> </li> |
| |
| <li> <a href="/2022/06/17/apache-kudu-1-16-0-released.html">Apache Kudu 1.16.0 Released</a> </li> |
| |
| <li> <a href="/2021/06/22/apache-kudu-1-15-0-released.html">Apache Kudu 1.15.0 Released</a> </li> |
| |
| <li> <a href="/2021/01/28/apache-kudu-1-14-0-release.html">Apache Kudu 1.14.0 Released</a> </li> |
| |
| <li> <a href="/2021/01/15/bloom-filter-predicate.html">Optimized joins & filtering with Bloom filter predicate in Kudu</a> </li> |
| |
| <li> <a href="/2020/09/21/apache-kudu-1-13-0-release.html">Apache Kudu 1.13.0 released</a> </li> |
| |
| <li> <a href="/2020/08/11/fine-grained-authz-ranger.html">Fine-Grained Authorization with Apache Kudu and Apache Ranger</a> </li> |
| |
| <li> <a href="/2020/07/30/building-near-real-time-big-data-lake.html">Building Near Real-time Big Data Lake</a> </li> |
| |
| <li> <a href="/2020/05/18/apache-kudu-1-12-0-release.html">Apache Kudu 1.12.0 released</a> </li> |
| |
| <li> <a href="/2019/11/20/apache-kudu-1-11-1-release.html">Apache Kudu 1.11.1 released</a> </li> |
| |
| <li> <a href="/2019/11/20/apache-kudu-1-10-1-release.html">Apache Kudu 1.10.1 released</a> </li> |
| |
| <li> <a href="/2019/07/09/apache-kudu-1-10-0-release.html">Apache Kudu 1.10.0 Released</a> </li> |
| |
| <li> <a href="/2019/04/30/location-awareness.html">Location Awareness in Kudu</a> </li> |
| |
| <li> <a href="/2019/04/22/fine-grained-authorization-with-apache-kudu-and-impala.html">Fine-Grained Authorization with Apache Kudu and Impala</a> </li> |
| |
| </ul> |
| </div> |
| </div> |
| |
| <footer class="footer"> |
| <div class="row"> |
| <div class="col-md-9"> |
| <p class="small"> |
| Copyright © 2023 The Apache Software Foundation. |
| </p> |
| <p class="small"> |
| Apache Kudu, Kudu, Apache, the Apache feather logo, and the Apache Kudu |
| project logo are either registered trademarks or trademarks of The |
| Apache Software Foundation in the United States and other countries. |
| </p> |
| </div> |
| <div class="col-md-3"> |
| <a class="pull-right" href="https://www.apache.org/events/current-event.html"> |
| <img src="https://www.apache.org/events/current-event-234x60.png"/> |
| </a> |
| </div> |
| </div> |
| </footer> |
| </div> |
| <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script> |
| <script> |
| // Try to detect touch-screen devices. Note: Many laptops have touch screens. |
| $(document).ready(function() { |
| if ("ontouchstart" in document.documentElement) { |
| $(document.documentElement).addClass("touch"); |
| } else { |
| $(document.documentElement).addClass("no-touch"); |
| } |
| }); |
| </script> |
| <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js" |
| integrity="sha384-0mSbJDEHialfmuBBQP6A4Qrprq5OVfW37PRR3j5ELqxss1yVqOtnepnHVP9aJ7xS" |
| crossorigin="anonymous"></script> |
| <script> |
| (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ |
| (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), |
| m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) |
| })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); |
| |
| ga('create', 'UA-68448017-1', 'auto'); |
| ga('send', 'pageview'); |
| </script> |
| <script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/3.1.0/anchor.js"></script> |
| <script> |
| anchors.options = { |
| placement: 'right', |
| visible: 'touch', |
| }; |
| anchors.add(); |
| </script> |
| </body> |
| </html> |
| |