<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="A new open source Apache Hadoop ecosystem project, Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data" />
<meta name="author" content="Cloudera" />
<title>Apache Kudu - Transparent Hierarchical Storage Management with Apache Kudu and Impala</title>
<!-- Bootstrap core CSS -->
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css"
integrity="sha384-1q8mTJOASx8j1Au+a5WDVnPi2lkFfwwEAa8hDDdjZlpLegxhjVME1fgjWPGmkzs7"
crossorigin="anonymous">
<!-- Custom styles for this template -->
<link href="/css/kudu.css" rel="stylesheet"/>
<link href="/css/asciidoc.css" rel="stylesheet"/>
<link rel="shortcut icon" href="/img/logo-favicon.ico" />
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.1/css/font-awesome.min.css" />
<link rel="alternate" type="application/atom+xml"
title="RSS Feed for Apache Kudu blog"
href="/feed.xml" />
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<div class="kudu-site container-fluid">
<!-- Static navbar -->
<nav class="navbar navbar-default">
<div class="container-fluid">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="logo" href="/"><img
src="//d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_80px.png"
srcset="//d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_80px.png 1x, //d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_160px.png 2x"
alt="Apache Kudu"/></a>
</div>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav navbar-right">
<li >
<a href="/">Home</a>
</li>
<li >
<a href="/overview.html">Overview</a>
</li>
<li >
<a href="/docs/">Documentation</a>
</li>
<li >
<a href="/releases/">Releases</a>
</li>
<li class="active">
<a href="/blog/">Blog</a>
</li>
<!-- NOTE: this dropdown menu does not appear on Mobile, so don't add anything here
that doesn't also appear elsewhere on the site. -->
<li class="dropdown">
<a href="/community.html" role="button" aria-haspopup="true" aria-expanded="false">Community <span class="caret"></span></a>
<ul class="dropdown-menu">
<li class="dropdown-header">GET IN TOUCH</li>
<li><a class="icon email" href="/community.html">Mailing Lists</a></li>
<li><a class="icon slack" href="https://getkudu-slack.herokuapp.com/">Slack Channel</a></li>
<li role="separator" class="divider"></li>
<li><a href="/community.html#meetups-user-groups-and-conference-presentations">Events and Meetups</a></li>
<li><a href="/committers.html">Project Committers</a></li>
<li><a href="/ecosystem.html">Ecosystem</a></li>
<!--<li><a href="/roadmap.html">Roadmap</a></li>-->
<li><a href="/community.html#contributions">How to Contribute</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">DEVELOPER RESOURCES</li>
<li><a class="icon github" href="https://github.com/apache/incubator-kudu">GitHub</a></li>
<li><a class="icon gerrit" href="http://gerrit.cloudera.org:8080/#/q/status:open+project:kudu">Gerrit Code Review</a></li>
<li><a class="icon jira" href="https://issues.apache.org/jira/browse/KUDU">JIRA Issue Tracker</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">SOCIAL MEDIA</li>
<li><a class="icon twitter" href="https://twitter.com/ApacheKudu">Twitter</a></li>
<li><a href="https://www.reddit.com/r/kudu/">Reddit</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">APACHE SOFTWARE FOUNDATION</li>
<li><a href="https://www.apache.org/security/" target="_blank">Security</a></li>
<li><a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a></li>
<li><a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a></li>
<li><a href="https://www.apache.org/licenses/" target="_blank">License</a></li>
</ul>
</li>
<li >
<a href="/faq.html">FAQ</a>
</li>
</ul><!-- /.nav -->
</div><!-- /#navbar -->
</div><!-- /.container-fluid -->
</nav>
<div class="row header">
<div class="col-lg-12">
<h2><a href="/blog">Apache Kudu Blog</a></h2>
</div>
</div>
<div class="row-fluid">
<div class="col-lg-9">
<article>
<header>
<h1 class="entry-title">Transparent Hierarchical Storage Management with Apache Kudu and Impala</h1>
<p class="meta">Posted 05 Mar 2019 by Grant Henke</p>
</header>
<div class="entry-content">
<p>Note: This is a cross-post from the Cloudera Engineering Blog
<a href="https://blog.cloudera.com/blog/2019/03/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/">Transparent Hierarchical Storage Management with Apache Kudu and Impala</a></p>
<p>When picking storage for an application, it is common to choose the single option
whose features best fit your use case. For mutability and real-time analytics
workloads you may want to use Apache Kudu, but for massive scalability at a low cost
you may want to use HDFS. Hence there is a need for a solution that allows you to
leverage the best features of multiple storage options. This post describes the
sliding window pattern using Apache Impala with data stored in Apache Kudu and Apache
HDFS. With this pattern you get all of the benefits of multiple storage layers in a
way that is transparent to users.</p>
<!--more-->
<p>Apache Kudu is designed for fast analytics on rapidly changing data. Kudu provides a
combination of fast inserts/updates and efficient columnar scans to enable multiple
real-time analytic workloads across a single storage layer. For that reason, Kudu fits
well into a data pipeline as the place to store real-time data that needs to be
queryable immediately. Additionally, Kudu supports updating and deleting rows in
real time, which accommodates late-arriving data and data corrections.</p>
<p>Apache HDFS is designed to allow for limitless scalability at a low cost. It is
optimized for batch-oriented use cases where data is immutable. When paired with the
Apache Parquet file format, structured data can be accessed with extremely high
throughput and efficiency.</p>
<p>For situations in which the data is small and ever-changing, like dimension tables,
it is common to keep all of the data in Kudu. It is even common to keep large tables
in Kudu when the data fits within Kudu’s
<a href="https://kudu.apache.org/docs/known_issues.html#_scale">scaling limits</a> and can benefit
from Kudu’s unique features. In cases where the data is massive, batch-oriented, and
unlikely to change, storing the data in HDFS using the Parquet format is preferred.
When you need the benefits of both storage layers, the sliding window pattern is a
useful solution.</p>
<h2 id="the-sliding-window-pattern">The Sliding Window Pattern</h2>
<p>In this pattern, matching Kudu and Parquet formatted HDFS tables are created in Impala.
These tables are partitioned by a unit of time based on how frequently the data is
moved between the Kudu and HDFS tables. It is common to use daily, monthly, or yearly
partitions. A unified view is created and a <code class="language-plaintext highlighter-rouge">WHERE</code> clause is used to define a boundary
that separates which data is read from the Kudu table and which is read from the HDFS
table. The defined boundary is important so that you can move data between Kudu and
HDFS without exposing duplicate records to the view. Once the data is moved, an atomic
<code class="language-plaintext highlighter-rouge">ALTER VIEW</code> statement can be used to move the boundary forward.</p>
<p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/sliding-window-pattern.png" alt="png" class="img-responsive" /></p>
<p>Note: This pattern works best with somewhat sequential data organized into range
partitions, because having a sliding window of time and dropping partitions is very
efficient.</p>
<p>This pattern results in a sliding window of time where mutable data is stored in Kudu
and immutable data is stored in the Parquet format on HDFS. Leveraging both Kudu and
HDFS via Impala provides the benefits of both storage systems:</p>
<ul>
<li>Streaming data is immediately queryable</li>
<li>Updates for late arriving data or manual corrections can be made</li>
<li>Data stored in HDFS is optimally sized, increasing performance and preventing small files</li>
<li>Reduced cost</li>
</ul>
<p>Impala also supports cloud storage options such as
<a href="https://impala.apache.org/docs/build/html/topics/impala_s3.html">S3</a> and
<a href="https://impala.apache.org/docs/build/html/topics/impala_adls.html">ADLS</a>.
This capability allows convenient access to a storage system that is remotely managed,
accessible from anywhere, and integrated with various cloud-based services. Because
this data is remote, queries against S3 data are less performant, making S3 suitable
for holding “cold” data that is only queried occasionally. This pattern can be
extended to use cloud storage for cold data by creating a third matching table and
adding another boundary to the unified view.</p>
<p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/sliding-window-pattern-cold.png" alt="png" class="img-responsive" /></p>
<p>Note: For simplicity only Kudu and HDFS are illustrated in the examples below.</p>
<p>The process for moving data from Kudu to HDFS is broken into two phases. The first
phase is the data migration, and the second phase is the metadata change. These
ongoing steps should be scheduled to run automatically on a regular basis.</p>
<p>In the first phase, the now immutable data is copied from Kudu to HDFS. Even though
data is duplicated from Kudu into HDFS, the boundary defined in the view will prevent
duplicate data from being shown to users. This step can include any validation and
retries as needed to ensure the data offload is successful.</p>
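<p>As one hedged sketch of such a validation, row counts on either side of the copy can
be compared. This assumes the <code class="language-plaintext highlighter-rouge">my_table_kudu</code> and
<code class="language-plaintext highlighter-rouge">my_table_parquet</code> tables defined later in this
post; matching counts for the offloaded month suggest (but do not prove) a successful copy:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Count the January rows in each table; the totals should match after the copy.
SELECT COUNT(*) FROM my_table_kudu
WHERE time &gt;= '2018-01-01' AND time &lt; '2018-02-01';

SELECT COUNT(*) FROM my_table_parquet
WHERE time &gt;= '2018-01-01' AND time &lt; '2018-02-01';</code></pre></figure>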
<p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/phase-1.png" alt="png" class="img-responsive" /></p>
<p>In the second phase, now that the data is safely copied to HDFS, the metadata is
changed to adjust how the offloaded partition is exposed. This includes shifting
the boundary forward, adding a new Kudu partition for the next period, and dropping
the old Kudu partition.</p>
<p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/phase-2.png" alt="png" class="img-responsive" /></p>
<h2 id="building-blocks">Building Blocks</h2>
<p>In order to implement the sliding window pattern, a few Impala fundamentals are
required. Each fundamental building block of the sliding window pattern is described
below.</p>
<h3 id="moving-data">Moving Data</h3>
<p>Moving data among storage systems via Impala is straightforward provided you have
matching tables defined using each of the storage formats. To keep this post
brief, not all of the options available when creating an Impala table are described.
However, Impala’s
<a href="https://impala.apache.org/docs/build/html/topics/impala_create_table.html">CREATE TABLE documentation</a>
can be referenced to find the correct syntax for Kudu, HDFS, and cloud storage tables.
A few examples are shown further below where the sliding window pattern is illustrated.</p>
<p>Once the tables are created, moving the data is as simple as an
<a href="https://impala.apache.org/docs/build/html/topics/impala_insert.html">INSERT…SELECT</a> statement:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">table_foo</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">table_bar</span><span class="p">;</span></code></pre></figure>
<p>All of the features of the
<a href="https://impala.apache.org/docs/build/html/topics/impala_select.html">SELECT</a>
statement can be used to select the specific data you would like to move.</p>
<p>Note: If moving data to Kudu, an <code class="language-plaintext highlighter-rouge">UPSERT INTO</code> statement can be used to handle
duplicate keys.</p>
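<p>For example, a minimal sketch of moving data back into a Kudu table while handling
keys that may already exist (reusing the placeholder table names from above):</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Rows whose primary keys already exist are updated; new keys are inserted.
UPSERT INTO table_foo
SELECT * FROM table_bar;</code></pre></figure>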
<h3 id="unified-querying">Unified Querying</h3>
<p>Querying data from multiple tables and data sources in Impala is also straightforward.
For the sake of brevity, not all of the options available when creating an Impala view
are described. However, see Impala’s
<a href="https://impala.apache.org/docs/build/html/topics/impala_create_view.html">CREATE VIEW documentation</a>
for more in-depth details.</p>
<p>Creating a view for unified querying is as simple as a <code class="language-plaintext highlighter-rouge">CREATE VIEW</code> statement using
two <code class="language-plaintext highlighter-rouge">SELECT</code> clauses combined with a <code class="language-plaintext highlighter-rouge">UNION ALL</code>:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">VIEW</span> <span class="n">foo_view</span> <span class="k">AS</span>
<span class="k">SELECT</span> <span class="n">col1</span><span class="p">,</span> <span class="n">col2</span><span class="p">,</span> <span class="n">col3</span> <span class="k">FROM</span> <span class="n">foo_parquet</span>
<span class="k">UNION</span> <span class="k">ALL</span>
<span class="k">SELECT</span> <span class="n">col1</span><span class="p">,</span> <span class="n">col2</span><span class="p">,</span> <span class="n">col3</span> <span class="k">FROM</span> <span class="n">foo_kudu</span><span class="p">;</span></code></pre></figure>
<p>WARNING: Be sure to use <code class="language-plaintext highlighter-rouge">UNION ALL</code> and not <code class="language-plaintext highlighter-rouge">UNION</code>. The <code class="language-plaintext highlighter-rouge">UNION</code> keyword by itself
is the same as <code class="language-plaintext highlighter-rouge">UNION DISTINCT</code> and can have significant performance impact.
More information can be found in the Impala
<a href="https://impala.apache.org/docs/build/html/topics/impala_union.html">UNION documentation</a>.</p>
<p>All of the features of the
<a href="https://impala.apache.org/docs/build/html/topics/impala_select.html">SELECT</a>
statement can be used to expose the correct data and columns from each of the
underlying tables. It is important to use the <code class="language-plaintext highlighter-rouge">WHERE</code> clause to pass through and
pushdown any predicates that need special handling or transformations. More examples
will follow below in the discussion of the sliding window pattern.</p>
<p>Additionally, views can be altered via the
<a href="https://impala.apache.org/docs/build/html/topics/impala_alter_view.html">ALTER VIEW</a>
statement. This is useful when combined with the <code class="language-plaintext highlighter-rouge">SELECT</code> statement because it can be
used to atomically update what data is being accessed by the view.</p>
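<p>As a minimal sketch, assuming the <code class="language-plaintext highlighter-rouge">foo_view</code>
created above, the view could later be redefined in a single atomic statement (the
<code class="language-plaintext highlighter-rouge">WHERE</code> predicates here are purely
illustrative boundary conditions):</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Redefine the view atomically; queries see either the old or new definition,
-- never a partial state.
ALTER VIEW foo_view AS
SELECT col1, col2, col3 FROM foo_parquet
WHERE col3 &lt; 100
UNION ALL
SELECT col1, col2, col3 FROM foo_kudu
WHERE col3 &gt;= 100;</code></pre></figure>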
<h2 id="an-example-implementation">An Example Implementation</h2>
<p>Below are sample steps to implement the sliding window pattern using a monthly period
with three months of active mutable data. Data older than three months will be
offloaded to HDFS using the Parquet format.</p>
<h3 id="create-the-kudu-table">Create the Kudu Table</h3>
<p>First, create a Kudu table which will hold three months of active mutable data.
The table is range partitioned by the time column with each range containing one
period of data. It is important to have partitions that match the period because
dropping Kudu partitions is much more efficient than removing the data via the
<code class="language-plaintext highlighter-rouge">DELETE</code> statement. The table is also hash partitioned by the other key column to ensure
that all of the data is not written to a single partition.</p>
<p>Note: Your schema design should vary based on your data and read/write performance
considerations. This example schema is intended for demonstration purposes and not as
an “optimal” schema. See the
<a href="https://kudu.apache.org/docs/schema_design.html">Kudu schema design documentation</a>
for more guidance on choosing your schema. For example, you may not need any hash
partitioning if your
data input rate is low. Alternatively, you may need more hash buckets if your data
input rate is very high.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">my_table_kudu</span>
<span class="p">(</span>
<span class="n">name</span> <span class="n">STRING</span><span class="p">,</span>
<span class="nb">time</span> <span class="nb">TIMESTAMP</span><span class="p">,</span>
<span class="n">message</span> <span class="n">STRING</span><span class="p">,</span>
<span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="nb">time</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">PARTITION</span> <span class="k">BY</span>
<span class="n">HASH</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="n">PARTITIONS</span> <span class="mi">4</span><span class="p">,</span>
<span class="n">RANGE</span><span class="p">(</span><span class="nb">time</span><span class="p">)</span> <span class="p">(</span>
<span class="n">PARTITION</span> <span class="s1">'2018-01-01'</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="s1">'2018-02-01'</span><span class="p">,</span> <span class="c1">--January</span>
<span class="n">PARTITION</span> <span class="s1">'2018-02-01'</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="s1">'2018-03-01'</span><span class="p">,</span> <span class="c1">--February</span>
<span class="n">PARTITION</span> <span class="s1">'2018-03-01'</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="s1">'2018-04-01'</span><span class="p">,</span> <span class="c1">--March</span>
<span class="n">PARTITION</span> <span class="s1">'2018-04-01'</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="s1">'2018-05-01'</span> <span class="c1">--April</span>
<span class="p">)</span>
<span class="n">STORED</span> <span class="k">AS</span> <span class="n">KUDU</span><span class="p">;</span></code></pre></figure>
<p>Note: There is an extra month partition to provide a buffer of time for the data to
be moved into the immutable table.</p>
<h3 id="create-the-hdfs-table">Create the HDFS Table</h3>
<p>Create the matching Parquet formatted HDFS table which will hold the older immutable
data. This table is partitioned by year, month, and day for efficient access even
though you can’t partition by the time column itself. This is addressed further in
the view step below. See Impala’s
<a href="https://impala.apache.org/docs/build/html/topics/impala_partitioning.html">partitioning documentation</a>
for more details.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">my_table_parquet</span>
<span class="p">(</span>
<span class="n">name</span> <span class="n">STRING</span><span class="p">,</span>
<span class="nb">time</span> <span class="nb">TIMESTAMP</span><span class="p">,</span>
<span class="n">message</span> <span class="n">STRING</span>
<span class="p">)</span>
<span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="p">(</span><span class="nb">year</span> <span class="nb">int</span><span class="p">,</span> <span class="k">month</span> <span class="nb">int</span><span class="p">,</span> <span class="k">day</span> <span class="nb">int</span><span class="p">)</span>
<span class="n">STORED</span> <span class="k">AS</span> <span class="n">PARQUET</span><span class="p">;</span></code></pre></figure>
<h3 id="create-the-unified-view">Create the Unified View</h3>
<p>Now create the unified view which will be used to query all of the data seamlessly:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">VIEW</span> <span class="n">my_table_view</span> <span class="k">AS</span>
<span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="nb">time</span><span class="p">,</span> <span class="n">message</span>
<span class="k">FROM</span> <span class="n">my_table_kudu</span>
<span class="k">WHERE</span> <span class="nb">time</span> <span class="o">&gt;=</span> <span class="nv">"2018-01-01"</span>
<span class="k">UNION</span> <span class="k">ALL</span>
<span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="nb">time</span><span class="p">,</span> <span class="n">message</span>
<span class="k">FROM</span> <span class="n">my_table_parquet</span>
<span class="k">WHERE</span> <span class="nb">time</span> <span class="o">&lt;</span> <span class="nv">"2018-01-01"</span>
<span class="k">AND</span> <span class="nb">year</span> <span class="o">=</span> <span class="nb">year</span><span class="p">(</span><span class="nb">time</span><span class="p">)</span>
<span class="k">AND</span> <span class="k">month</span> <span class="o">=</span> <span class="k">month</span><span class="p">(</span><span class="nb">time</span><span class="p">)</span>
<span class="k">AND</span> <span class="k">day</span> <span class="o">=</span> <span class="k">day</span><span class="p">(</span><span class="nb">time</span><span class="p">);</span></code></pre></figure>
<p>Each <code class="language-plaintext highlighter-rouge">SELECT</code> clause explicitly lists all of the columns to expose. This ensures that
the year, month, and day columns that are unique to the Parquet table are not exposed.
If needed, it also allows any necessary column or type mapping to be handled.</p>
<p>The initial <code class="language-plaintext highlighter-rouge">WHERE</code> clauses applied to both my_table_kudu and my_table_parquet define
the boundary between Kudu and HDFS to ensure duplicate data is not read while in the
process of offloading data.</p>
<p>The additional <code class="language-plaintext highlighter-rouge">AND</code> clauses applied to my_table_parquet are used to ensure good
predicate pushdown on the individual year, month, and day columns.</p>
<p>WARNING: As stated earlier, be sure to use <code class="language-plaintext highlighter-rouge">UNION ALL</code> and not <code class="language-plaintext highlighter-rouge">UNION</code>. The <code class="language-plaintext highlighter-rouge">UNION</code>
keyword by itself is the same as <code class="language-plaintext highlighter-rouge">UNION DISTINCT</code> and can have significant performance
impact. More information can be found in the Impala
<a href="https://impala.apache.org/docs/build/html/topics/impala_union.html"><code class="language-plaintext highlighter-rouge">UNION</code> documentation</a>.</p>
<h3 id="ongoing-steps">Ongoing Steps</h3>
<p>Now that the base tables and view are created, prepare the ongoing steps to maintain
the sliding window. Because these ongoing steps should be scheduled to run on a
regular basis, the examples below are shown using <code class="language-plaintext highlighter-rouge">.sql</code> files that take variables
which can be passed from your scripts and scheduling tool of choice.</p>
<p>Create the <code class="language-plaintext highlighter-rouge">window_data_move.sql</code> file to move the data from the oldest partition to HDFS:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">INSERT</span> <span class="k">INTO</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">hdfs_table</span><span class="err">}</span> <span class="n">PARTITION</span> <span class="p">(</span><span class="nb">year</span><span class="p">,</span> <span class="k">month</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span>
<span class="k">SELECT</span> <span class="o">*</span><span class="p">,</span> <span class="nb">year</span><span class="p">(</span><span class="nb">time</span><span class="p">),</span> <span class="k">month</span><span class="p">(</span><span class="nb">time</span><span class="p">),</span> <span class="k">day</span><span class="p">(</span><span class="nb">time</span><span class="p">)</span>
<span class="k">FROM</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span>
<span class="k">WHERE</span> <span class="nb">time</span> <span class="o">&gt;=</span> <span class="n">add_months</span><span class="p">(</span><span class="nv">"${var:new_boundary_time}"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">AND</span> <span class="nb">time</span> <span class="o">&lt;</span> <span class="nv">"${var:new_boundary_time}"</span><span class="p">;</span>
<span class="n">COMPUTE</span> <span class="n">INCREMENTAL</span> <span class="n">STATS</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">hdfs_table</span><span class="err">}</span><span class="p">;</span></code></pre></figure>
<p>Note: The
<a href="https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html">COMPUTE INCREMENTAL STATS</a>
statement is not required but helps Impala to optimize queries.</p>
<p>To run the SQL statement, use the Impala shell and pass the required variables.
Below is an example:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell <span class="nt">-i</span> &lt;impalad:port&gt; <span class="nt">-f</span> window_data_move.sql
<span class="nt">--var</span><span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu
<span class="nt">--var</span><span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet
<span class="nt">--var</span><span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">"2018-02-01"</span></code></pre></figure>
<p>Note: You can adjust the <code class="language-plaintext highlighter-rouge">WHERE</code> clause to match the given period and cadence of your
offload. Here the add_months function is used with an argument of -1 to move the one
month of data immediately preceding the new boundary time.</p>
<p>Create the <code class="language-plaintext highlighter-rouge">window_view_alter.sql</code> file to shift the time boundary forward by altering
the unified view:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">ALTER</span> <span class="k">VIEW</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">view_name</span><span class="err">}</span> <span class="k">AS</span>
<span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="nb">time</span><span class="p">,</span> <span class="n">message</span>
<span class="k">FROM</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span>
<span class="k">WHERE</span> <span class="nb">time</span> <span class="o">&gt;=</span> <span class="nv">"${var:new_boundary_time}"</span>
<span class="k">UNION</span> <span class="k">ALL</span>
<span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="nb">time</span><span class="p">,</span> <span class="n">message</span>
<span class="k">FROM</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">hdfs_table</span><span class="err">}</span>
<span class="k">WHERE</span> <span class="nb">time</span> <span class="o">&lt;</span> <span class="nv">"${var:new_boundary_time}"</span>
<span class="k">AND</span> <span class="nb">year</span> <span class="o">=</span> <span class="nb">year</span><span class="p">(</span><span class="nb">time</span><span class="p">)</span>
<span class="k">AND</span> <span class="k">month</span> <span class="o">=</span> <span class="k">month</span><span class="p">(</span><span class="nb">time</span><span class="p">)</span>
<span class="k">AND</span> <span class="k">day</span> <span class="o">=</span> <span class="k">day</span><span class="p">(</span><span class="nb">time</span><span class="p">);</span></code></pre></figure>
<p>To run the SQL statement, use the Impala shell and pass the required variables.
Below is an example:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell <span class="nt">-i</span> &lt;impalad:port&gt; <span class="nt">-f</span> window_view_alter.sql
<span class="nt">--var</span><span class="o">=</span><span class="nv">view_name</span><span class="o">=</span>my_table_view
<span class="nt">--var</span><span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu
<span class="nt">--var</span><span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet
<span class="nt">--var</span><span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">"2018-02-01"</span></code></pre></figure>
<p>Create the <code class="language-plaintext highlighter-rouge">window_partition_shift.sql</code> file to shift the Kudu partitions forward:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">ALTER</span> <span class="k">TABLE</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span>
<span class="k">ADD</span> <span class="n">RANGE</span> <span class="n">PARTITION</span> <span class="n">add_months</span><span class="p">(</span><span class="nv">"${var:new_boundary_time}"</span><span class="p">,</span>
<span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">window_length</span><span class="err">}</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="n">add_months</span><span class="p">(</span><span class="nv">"${var:new_boundary_time}"</span><span class="p">,</span>
<span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">window_length</span><span class="err">}</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
<span class="k">ALTER</span> <span class="k">TABLE</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span>
<span class="k">DROP</span> <span class="n">RANGE</span> <span class="n">PARTITION</span> <span class="n">add_months</span><span class="p">(</span><span class="nv">"${var:new_boundary_time}"</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="nv">"${var:new_boundary_time}"</span><span class="p">;</span></code></pre></figure>
<p>To run the SQL statements, use the Impala shell and pass the required variables.
Below is an example:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell <span class="nt">-i</span> &lt;impalad:port&gt; <span class="nt">-f</span> window_partition_shift.sql
<span class="nt">--var</span><span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu
<span class="nt">--var</span><span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">"2018-02-01"</span>
<span class="nt">--var</span><span class="o">=</span><span class="nv">window_length</span><span class="o">=</span>3</code></pre></figure>
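<p>To make the effect of the variables concrete, here is what the two parameterized statements
above expand to with the example values (<code class="language-plaintext highlighter-rouge">new_boundary_time="2018-02-01"</code>,
<code class="language-plaintext highlighter-rouge">window_length=3</code>): a new partition is added three months
ahead of the boundary, and the partition just behind the boundary is dropped.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Adds the partition covering May 2018:
ALTER TABLE my_table_kudu
ADD RANGE PARTITION add_months("2018-02-01", 3) &lt;= VALUES &lt; add_months("2018-02-01", 4);
-- Drops the partition covering January 2018:
ALTER TABLE my_table_kudu
DROP RANGE PARTITION add_months("2018-02-01", -1) &lt;= VALUES &lt; "2018-02-01";</code></pre></figure>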
<p>Note: You should periodically run
<a href="https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html">COMPUTE STATS</a>
on your Kudu table to ensure Impala’s query performance is optimal.</p>
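<p>For example, using the table names from this post:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Gather table and column statistics so the Impala planner
-- can choose better join orders and scan strategies.
COMPUTE STATS my_table_kudu;
COMPUTE STATS my_table_parquet;</code></pre></figure>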
<h3 id="experimentation">Experimentation</h3>
<p>Now that you have created the tables, view, and scripts to leverage the sliding
window pattern, you can experiment with them by inserting data for different time
ranges and running the scripts to move the window forward through time.</p>
<p>Insert some sample values into the Kudu table:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">my_table_kudu</span> <span class="k">VALUES</span>
<span class="p">(</span><span class="s1">'joey'</span><span class="p">,</span> <span class="s1">'2018-01-01'</span><span class="p">,</span> <span class="s1">'hello'</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'ross'</span><span class="p">,</span> <span class="s1">'2018-02-01'</span><span class="p">,</span> <span class="s1">'goodbye'</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'rachel'</span><span class="p">,</span> <span class="s1">'2018-03-01'</span><span class="p">,</span> <span class="s1">'hi'</span><span class="p">);</span></code></pre></figure>
<p>Show the data in each table/view:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></figure>
<p>Move the January data into HDFS:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell <span class="nt">-i</span> &lt;impalad:port&gt; <span class="nt">-f</span> window_data_move.sql
<span class="nt">--var</span><span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu
<span class="nt">--var</span><span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet
<span class="nt">--var</span><span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">"2018-02-01"</span></code></pre></figure>
<p>Confirm the data is in both places, but not duplicated in the view:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></figure>
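<p>If you want an explicit duplicate check rather than inspecting the output by eye, a simple
aggregate query works (a sketch assuming the <code class="language-plaintext highlighter-rouge">name</code>
and <code class="language-plaintext highlighter-rouge">time</code> columns used in this post's schema):</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- Any row returned here would indicate a (name, time) pair
-- visible more than once through the view.
SELECT name, time, COUNT(*) AS cnt
FROM my_table_view
GROUP BY name, time
HAVING COUNT(*) &gt; 1;</code></pre></figure>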
<p>Alter the view to shift the time boundary forward to February:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell <span class="nt">-i</span> &lt;impalad:port&gt; <span class="nt">-f</span> window_view_alter.sql
<span class="nt">--var</span><span class="o">=</span><span class="nv">view_name</span><span class="o">=</span>my_table_view
<span class="nt">--var</span><span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu
<span class="nt">--var</span><span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet
<span class="nt">--var</span><span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">"2018-02-01"</span></code></pre></figure>
<p>Confirm the data is still in both places, but not duplicated in the view:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></figure>
<p>Shift the Kudu partitions forward:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell <span class="nt">-i</span> &lt;impalad:port&gt; <span class="nt">-f</span> window_partition_shift.sql
<span class="nt">--var</span><span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu
<span class="nt">--var</span><span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">"2018-02-01"</span>
<span class="nt">--var</span><span class="o">=</span><span class="nv">window_length</span><span class="o">=</span>3</code></pre></figure>
<p>Confirm the January data is now only in HDFS:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></figure>
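<p>You can also verify the partition change directly, since Impala can list a Kudu table's
range partitions:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql">-- The January partition should be gone, and a new partition
-- further in the future should appear at the end of the list.
SHOW RANGE PARTITIONS my_table_kudu;</code></pre></figure>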
<p>Confirm predicate pushdown with Impala’s EXPLAIN statement:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span>
<span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span> <span class="k">WHERE</span> <span class="nb">time</span> <span class="o">&lt;</span> <span class="nv">"2018-02-01"</span><span class="p">;</span>
<span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span> <span class="k">WHERE</span> <span class="nb">time</span> <span class="o">&gt;</span> <span class="nv">"2018-02-01"</span><span class="p">;</span></code></pre></figure>
<p>In the EXPLAIN output, the “SCAN KUDU” section should show “kudu predicates” containing
the time column filters, while the “SCAN HDFS” section should show “predicates” containing
the time, day, month, and year column filters.</p>
</div>
</article>
</div>
<div class="col-lg-3 recent-posts">
<h3>Recent posts</h3>
<ul>
<li> <a href="/2021/06/22/apache-kudu-1-15-0-released.html">Apache Kudu 1.15.0 Released</a> </li>
<li> <a href="/2021/01/28/apache-kudu-1-14-0-release.html">Apache Kudu 1.14.0 Released</a> </li>
<li> <a href="/2021/01/15/bloom-filter-predicate.html">Optimized joins & filtering with Bloom filter predicate in Kudu</a> </li>
<li> <a href="/2020/09/21/apache-kudu-1-13-0-release.html">Apache Kudu 1.13.0 released</a> </li>
<li> <a href="/2020/08/11/fine-grained-authz-ranger.html">Fine-Grained Authorization with Apache Kudu and Apache Ranger</a> </li>
<li> <a href="/2020/07/30/building-near-real-time-big-data-lake.html">Building Near Real-time Big Data Lake</a> </li>
<li> <a href="/2020/05/18/apache-kudu-1-12-0-release.html">Apache Kudu 1.12.0 released</a> </li>
<li> <a href="/2019/11/20/apache-kudu-1-11-1-release.html">Apache Kudu 1.11.1 released</a> </li>
<li> <a href="/2019/11/20/apache-kudu-1-10-1-release.html">Apache Kudu 1.10.1 released</a> </li>
<li> <a href="/2019/07/09/apache-kudu-1-10-0-release.html">Apache Kudu 1.10.0 Released</a> </li>
<li> <a href="/2019/04/30/location-awareness.html">Location Awareness in Kudu</a> </li>
<li> <a href="/2019/04/22/fine-grained-authorization-with-apache-kudu-and-impala.html">Fine-Grained Authorization with Apache Kudu and Impala</a> </li>
<li> <a href="/2019/03/19/testing-apache-kudu-applications-on-the-jvm.html">Testing Apache Kudu Applications on the JVM</a> </li>
<li> <a href="/2019/03/15/apache-kudu-1-9-0-release.html">Apache Kudu 1.9.0 Released</a> </li>
<li> <a href="/2019/03/05/transparent-hierarchical-storage-management-with-apache-kudu-and-impala.html">Transparent Hierarchical Storage Management with Apache Kudu and Impala</a> </li>
</ul>
</div>
</div>
<footer class="footer">
<div class="row">
<div class="col-md-9">
<p class="small">
Copyright &copy; 2020 The Apache Software Foundation.
</p>
<p class="small">
Apache Kudu, Kudu, Apache, the Apache feather logo, and the Apache Kudu
project logo are either registered trademarks or trademarks of The
Apache Software Foundation in the United States and other countries.
</p>
</div>
<div class="col-md-3">
<a class="pull-right" href="https://www.apache.org/events/current-event.html">
<img src="https://www.apache.org/events/current-event-234x60.png"/>
</a>
</div>
</div>
</footer>
</div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<script>
// Try to detect touch-screen devices. Note: Many laptops have touch screens.
$(document).ready(function() {
if ("ontouchstart" in document.documentElement) {
$(document.documentElement).addClass("touch");
} else {
$(document.documentElement).addClass("no-touch");
}
});
</script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js"
integrity="sha384-0mSbJDEHialfmuBBQP6A4Qrprq5OVfW37PRR3j5ELqxss1yVqOtnepnHVP9aJ7xS"
crossorigin="anonymous"></script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-68448017-1', 'auto');
ga('send', 'pageview');
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/3.1.0/anchor.js"></script>
<script>
anchors.options = {
placement: 'right',
visible: 'touch',
};
anchors.add();
</script>
</body>
</html>