| <?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom"><generator uri="http://jekyllrb.com" version="2.5.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2019-04-22T12:52:58-05:00</updated><id>/</id><entry><title>Fine-Grained Authorization with Apache Kudu and Impala</title><link href="/2019/04/22/fine-grained-authorization-with-apache-kudu-and-impala.html" rel="alternate" type="text/html" title="Fine-Grained Authorization with Apache Kudu and Impala" /><published>2019-04-22T00:00:00-05:00</published><updated>2019-04-22T00:00:00-05:00</updated><id>/2019/04/22/fine-grained-authorization-with-apache-kudu-and-impala</id><content type="html" xml:base="/2019/04/22/fine-grained-authorization-with-apache-kudu-and-impala.html"><p>Note: This is a cross-post from the Cloudera Engineering Blog |
| <a href="https://blog.cloudera.com/blog/2019/04/fine-grained-authorization-with-apache-kudu-and-impala/">Fine-Grained Authorization with Apache Kudu and Impala</a></p> |
| |
| <p>Apache Impala supports fine-grained authorization via Apache Sentry on all of the tables it |
| manages, including Apache Kudu tables. Given that Impala is a very common way to access the data stored |
| in Kudu, this capability allows users deploying Impala and Kudu to fully secure the Kudu data in |
| multi-tenant clusters even though Kudu does not yet have native fine-grained authorization of its |
| own. This solution works because Kudu natively supports coarse-grained (all or nothing) |
| authorization which enables blocking all access to Kudu directly except for the impala user and |
| an optional whitelist of other trusted users. This post will describe how to use Apache Impala’s |
| fine-grained authorization support along with Apache Kudu’s coarse-grained authorization to |
| achieve a secure multi-tenant deployment.</p> |
| |
| <!--more--> |
| |
| <h2 id="sample-workflow">Sample Workflow</h2> |
| |
| <p>The examples in this post enable a workflow that uses Apache Spark to ingest data directly into |
| Kudu and Impala to run analytic queries on that data. The Spark job, run as the <code>etl_service</code> user, |
| is permitted to access the Kudu data via coarse-grained authorization. Even though this gives |
| access to all the data in Kudu, the <code>etl_service</code> user is only used for scheduled jobs or by an |
| administrator. All queries on the data, from a wide array of users, will use Impala and leverage |
| Impala’s fine-grained authorization. Impala’s |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_grant.html"><code>GRANT</code> statements</a> |
| allow you to flexibly control the privileges on the Kudu storage tables. Impala’s fine-grained |
| privileges along with support for |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_select.html"><code>SELECT</code></a>, |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_insert.html"><code>INSERT</code></a>, |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_update.html"><code>UPDATE</code></a>, |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_upsert.html"><code>UPSERT</code></a>, |
| and <a href="https://impala.apache.org/docs/build/html/topics/impala_delete.html"><code>DELETE</code></a> |
| statements, allow you to finely control who can read and write data to your Kudu tables while |
| using Impala. Below is a diagram showing the workflow described:</p> |
| |
| <p><img src="/img/fine-grained-authorization-with-apache-kudu.png" alt="png" class="img-responsive" /></p> |
| |
| <p><em>Note</em>: The examples below assume that authorization has already been configured for Kudu, Impala, |
| and Spark. For help configuring authorization see the Cloudera |
| <a href="https://www.cloudera.com/documentation/enterprise/latest/topics/sg_auth_overview.html">authorization documentation</a>.</p> |
| |
| <h2 id="configuring-kudus-coarse-grained-authorization">Configuring Kudu’s Coarse-Grained Authorization</h2> |
| |
| <p>Kudu supports coarse-grained authorization of client requests based on the authenticated client |
| Kerberos principal. The two levels of access which can be configured are:</p> |
| |
| <ul> |
| <li><em>Superuser</em> – principals authorized as a superuser are able to perform certain administrative |
| functionality such as using the kudu command line tool to diagnose or repair cluster issues.</li> |
| <li><em>User</em> – principals authorized as a user are able to access and modify all data in the Kudu |
| cluster. This includes the ability to create, drop, and alter tables as well as read, insert, |
| update, and delete data.</li> |
| </ul> |
| |
| <p>Access levels are granted using whitelist-style Access Control Lists (ACLs), one for each of the |
| two levels. Each access control list either specifies a comma-separated list of users, or may be |
| set to <code>*</code> to indicate that all authenticated users are able to gain access at the specified level.</p> |
| |
| <p><em>Note</em>: The default value for the User ACL is <code>*</code>, which allows all users access to the cluster.</p> |
| |
| <h3 id="example-configuration">Example Configuration</h3> |
| |
| <p>The first and most important step is to remove the default ACL of <code>*</code> from Kudu’s |
| <a href="https://kudu.apache.org/docs/configuration_reference.html#kudu-master_user_acl"><code>--user_acl</code> configuration</a>. |
| This will ensure only the users you list will have access to the Kudu cluster. Then, to allow the |
| Impala service to access all of the data in Kudu, the Impala service user, usually impala, should |
| be added to the Kudu <code>--user_acl</code> configuration. Any user that is not using Impala will also need |
| to be added to this list. For example, an Apache Spark job might be used to load data directly |
| into Kudu. Generally, a single user is used to run scheduled jobs of applications that do not |
| support fine-grained authorization on their own. For this example, that user is <code>etl_service</code>. The |
| full <code>--user_acl</code> configuration is:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">--user_acl<span class="o">=</span>impala,etl_service</code></pre></div> |
| |
| <p>For more details see the Kudu |
| <a href="https://kudu.apache.org/docs/security.html#_coarse_grained_authorization">authorization documentation</a>.</p> |
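The superuser level described earlier can be restricted in the same whitelist style. As an illustrative sketch (the `admin` principal here is a hypothetical example, not from the original configuration), the two ACL flags might be set together as:

```bash
# Illustrative only: restrict administrative access to a single trusted
# principal, and data access to the Impala service and the ETL job user.
--superuser_acl=admin
--user_acl=impala,etl_service
```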
| |
| <h2 id="using-impalas-fine-grained-authorization">Using Impala’s Fine-Grained Authorization</h2> |
| |
| <p>Follow Impala’s |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_authorization.html">authorization documentation</a> |
| to configure fine-grained authorization. Once configured, you can use Impala’s |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_grant.html"><code>GRANT</code> statements</a> |
| to control the privileges of Kudu tables. These fine-grained privileges can be set at the database, |
| table, and column level. Additionally, you can individually control <code>SELECT</code>, <code>INSERT</code>, <code>CREATE</code>, |
| <code>ALTER</code>, and <code>DROP</code> privileges.</p> |
| |
| <p><em>Note</em>: A user needs the <code>ALL</code> privilege in order to run <code>DELETE</code>, <code>UPDATE</code>, or <code>UPSERT</code> |
| statements against a Kudu table.</p> |
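As a sketch of what column-level control can look like, the statement below grants read access to just two columns of a table. The role name is illustrative (Sentry grants are typically made to roles), and is not part of the original example:

```sql
-- Illustrative only: allow a hypothetical analyst role to read only the
-- name and time columns of the messages table.
GRANT SELECT(name, time) ON TABLE messages TO ROLE analyst_role;
```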
| |
| <p>Below is a brief example with a couple tables stored in Kudu:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">messages</span> |
| <span class="p">(</span> |
| <span class="n">name</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="n">time</span> <span class="k">TIMESTAMP</span><span class="p">,</span> |
| <span class="n">message</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">time</span><span class="p">)</span> |
| <span class="p">)</span> |
| <span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">HASH</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="n">PARTITIONS</span> <span class="mi">4</span> |
| <span class="n">STORED</span> <span class="k">AS</span> <span class="n">KUDU</span><span class="p">;</span> |
| <span class="k">GRANT</span> <span class="k">ALL</span> <span class="k">ON</span> <span class="k">TABLE</span> <span class="n">messages</span> <span class="k">TO</span> <span class="n">userA</span><span class="p">;</span> |
| |
| <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">metrics</span> |
| <span class="p">(</span> |
| <span class="k">host</span> <span class="n">STRING</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span> |
| <span class="n">metric</span> <span class="n">STRING</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span> |
| <span class="n">time</span> <span class="n">INT64</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span> |
| <span class="n">value</span> <span class="n">DOUBLE</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span> |
| <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="p">(</span><span class="k">host</span><span class="p">,</span> <span class="n">metric</span><span class="p">,</span> <span class="n">time</span><span class="p">)</span> |
| <span class="p">)</span> |
| <span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">HASH</span><span class="p">(</span><span class="k">host</span><span class="p">)</span> <span class="n">PARTITIONS</span> <span class="mi">4</span> |
| <span class="n">STORED</span> <span class="k">AS</span> <span class="n">KUDU</span><span class="p">;</span> |
| <span class="k">GRANT</span> <span class="k">ALL</span> <span class="k">ON</span> <span class="k">TABLE</span> <span class="n">metrics</span> <span class="k">TO</span> <span class="n">userB</span><span class="p">;</span></code></pre></div> |
| |
| <h2 id="conclusion">Conclusion</h2> |
| |
| <p>This brief example that combines Kudu’s coarse-grained authorization and Impala’s fine-grained |
| authorization should enable you to meet the security needs of your data workflow today. The |
| pattern described here can be applied to other services and workflows using Kudu as well. For |
| greater authorization flexibility, you can look forward to the near future when Kudu supports |
| native fine-grained authorization on its own. The Apache Kudu contributors understand the |
| importance of native fine-grained authorization and they are working on integrations with |
| Apache Sentry and Apache Ranger.</p></content><author><name>Grant Henke</name></author><summary>Note: This is a cross-post from the Cloudera Engineering Blog |
| Fine-Grained Authorization with Apache Kudu and Impala |
| |
| Apache Impala supports fine-grained authorization via Apache Sentry on all of the tables it |
| manages, including Apache Kudu tables. Given that Impala is a very common way to access the data stored |
| in Kudu, this capability allows users deploying Impala and Kudu to fully secure the Kudu data in |
| multi-tenant clusters even though Kudu does not yet have native fine-grained authorization of its |
| own. This solution works because Kudu natively supports coarse-grained (all or nothing) |
| authorization which enables blocking all access to Kudu directly except for the impala user and |
| an optional whitelist of other trusted users. This post will describe how to use Apache Impala’s |
| fine-grained authorization support along with Apache Kudu’s coarse-grained authorization to |
| achieve a secure multi-tenant deployment.</summary></entry><entry><title>Testing Apache Kudu Applications on the JVM</title><link href="/2019/03/19/testing-apache-kudu-applications-on-the-jvm.html" rel="alternate" type="text/html" title="Testing Apache Kudu Applications on the JVM" /><published>2019-03-19T00:00:00-05:00</published><updated>2019-03-19T00:00:00-05:00</updated><id>/2019/03/19/testing-apache-kudu-applications-on-the-jvm</id><content type="html" xml:base="/2019/03/19/testing-apache-kudu-applications-on-the-jvm.html"><p>Note: This is a cross-post from the Cloudera Engineering Blog |
| <a href="https://blog.cloudera.com/blog/2019/03/testing-apache-kudu-applications-on-the-jvm/">Testing Apache Kudu Applications on the JVM</a></p> |
| |
| <p>Although the Kudu server is written in C++ for performance and efficiency, developers can write |
| client applications in C++, Java, or Python. To make it easier for Java developers to create |
| reliable client applications, we’ve added new utilities in Kudu 1.9.0 that allow you to write tests |
| using a Kudu cluster without needing to build Kudu yourself, without any knowledge of C++, |
| and without any complicated coordination around starting and stopping Kudu clusters for each test. |
| This post describes how the new testing utilities work and how you can use them in your application |
| tests.</p> |
| |
| <!--more--> |
| |
| <h2 id="user-guide">User Guide</h2> |
| |
| <p>Note: It is possible this blog post could become outdated – for the latest documentation on using |
| the JVM testing utilities see the |
| <a href="https://kudu.apache.org/docs/developing.html#_jvm_based_integration_testing">Kudu documentation</a>.</p> |
| |
| <h3 id="requirements">Requirements</h3> |
| |
| <p>In order to use the new testing utilities, the following requirements must be met:</p> |
| |
| <ul> |
| <li>OS |
| <ul> |
| <li>macOS El Capitan (10.11) or later</li> |
| <li>CentOS 6.6+, Ubuntu 14.04+, or another recent distribution of Linux |
| <a href="https://kudu.apache.org/docs/installation.html#_prerequisites_and_requirements">supported by Kudu</a></li> |
| </ul> |
| </li> |
| <li>JVM |
| <ul> |
| <li>Java 8+</li> |
| <li>Note: Java 7 is deprecated, but still supported</li> |
| </ul> |
| </li> |
| <li>Build Tool |
| <ul> |
| <li>Maven 3.1 or later, required to support the |
| <a href="https://github.com/trustin/os-maven-plugin">os-maven-plugin</a></li> |
| <li>Gradle 2.1 or later, to support the |
| <a href="https://github.com/google/osdetector-gradle-plugin">osdetector-gradle-plugin</a></li> |
| <li>Any other build tool that can download the correct jar from Maven</li> |
| </ul> |
| </li> |
| </ul> |
| |
| <h3 id="build-configuration">Build Configuration</h3> |
| |
| <p>In order to use the Kudu testing utilities, add two dependencies to your classpath:</p> |
| |
| <ul> |
| <li>The <code>kudu-test-utils</code> dependency</li> |
| <li>The <code>kudu-binary</code> dependency</li> |
| </ul> |
| |
| <p>The <code>kudu-test-utils</code> dependency has useful utilities for testing applications that use Kudu. |
| Primarily, it provides the |
| <a href="https://github.com/apache/kudu/blob/master/java/kudu-test-utils/src/main/java/org/apache/kudu/test/KuduTestHarness.java">KuduTestHarness class</a> |
| to manage the lifecycle of a Kudu cluster for each test. The <code>KuduTestHarness</code> is a |
| <a href="https://junit.org/junit4/javadoc/4.12/org/junit/rules/TestRule.html">JUnit TestRule</a> |
| that not only starts and stops a Kudu cluster for each test, but also has methods to manage the |
| cluster and get pre-configured <code>KuduClient</code> instances for use while testing.</p> |
| |
| <p>The <code>kudu-binary</code> dependency contains the native Kudu (server and command-line tool) binaries for |
| the specified operating system. In order to download the right artifact for the running operating |
| system, it is easiest to use a plugin, such as the |
| <a href="https://github.com/trustin/os-maven-plugin">os-maven-plugin</a> or |
| <a href="https://github.com/google/osdetector-gradle-plugin">osdetector-gradle-plugin</a>, to detect the |
| current runtime environment. The <code>KuduTestHarness</code> will automatically find and use the <code>kudu-binary</code> |
| jar on the classpath.</p> |
| |
| <p>WARNING: The <code>kudu-binary</code> module should only be used to run Kudu for integration testing purposes. |
| It should never be used to run an actual Kudu service, in production or development, because the |
| <code>kudu-binary</code> module includes native security-related dependencies that have been copied from the |
| build system and will not be patched when the operating system on the runtime host is patched.</p> |
| |
| <h4 id="maven-configuration">Maven Configuration</h4> |
| |
| <p>If you are using Maven to build your project, add the following entries to your project’s |
| <code>pom.xml</code> file:</p> |
| |
| <div class="highlight"><pre><code class="language-xml" data-lang="xml"><span class="nt">&lt;build&gt;</span> |
| <span class="nt">&lt;extensions&gt;</span> |
| <span class="c">&lt;!-- Used to find the right kudu-binary artifact with the Maven</span> |
| <span class="c"> property ${os.detected.classifier} --&gt;</span> |
| <span class="nt">&lt;extension&gt;</span> |
| <span class="nt">&lt;groupId&gt;</span>kr.motd.maven<span class="nt">&lt;/groupId&gt;</span> |
| <span class="nt">&lt;artifactId&gt;</span>os-maven-plugin<span class="nt">&lt;/artifactId&gt;</span> |
| <span class="nt">&lt;version&gt;</span>1.6.2<span class="nt">&lt;/version&gt;</span> |
| <span class="nt">&lt;/extension&gt;</span> |
| <span class="nt">&lt;/extensions&gt;</span> |
| <span class="nt">&lt;/build&gt;</span> |
| |
| <span class="nt">&lt;dependencies&gt;</span> |
| <span class="nt">&lt;dependency&gt;</span> |
| <span class="nt">&lt;groupId&gt;</span>org.apache.kudu<span class="nt">&lt;/groupId&gt;</span> |
| <span class="nt">&lt;artifactId&gt;</span>kudu-test-utils<span class="nt">&lt;/artifactId&gt;</span> |
| <span class="nt">&lt;version&gt;</span>1.9.0<span class="nt">&lt;/version&gt;</span> |
| <span class="nt">&lt;scope&gt;</span>test<span class="nt">&lt;/scope&gt;</span> |
| <span class="nt">&lt;/dependency&gt;</span> |
| <span class="nt">&lt;dependency&gt;</span> |
| <span class="nt">&lt;groupId&gt;</span>org.apache.kudu<span class="nt">&lt;/groupId&gt;</span> |
| <span class="nt">&lt;artifactId&gt;</span>kudu-binary<span class="nt">&lt;/artifactId&gt;</span> |
| <span class="nt">&lt;version&gt;</span>1.9.0<span class="nt">&lt;/version&gt;</span> |
| <span class="nt">&lt;classifier&gt;</span>${os.detected.classifier}<span class="nt">&lt;/classifier&gt;</span> |
| <span class="nt">&lt;scope&gt;</span>test<span class="nt">&lt;/scope&gt;</span> |
| <span class="nt">&lt;/dependency&gt;</span> |
| <span class="nt">&lt;/dependencies&gt;</span></code></pre></div> |
| |
| <h4 id="gradle-configuration">Gradle Configuration</h4> |
| |
| <p>If you are using Gradle to build your project, add the following entries to your project’s |
| <code>build.gradle</code> file:</p> |
| |
| <div class="highlight"><pre><code class="language-groovy" data-lang="groovy"><span class="n">plugins</span> <span class="o">{</span> |
| <span class="c1">// Used to find the right kudu-binary artifact with the Gradle</span> |
| <span class="c1">// property ${osdetector.classifier}</span> |
| <span class="n">id</span> <span class="s2">&quot;com.google.osdetector&quot;</span> <span class="n">version</span> <span class="s2">&quot;1.6.2&quot;</span> |
| <span class="o">}</span> |
| |
| <span class="n">dependencies</span> <span class="o">{</span> |
| <span class="n">testCompile</span> <span class="s2">&quot;org.apache.kudu:kudu-test-utils:1.9.0&quot;</span> |
| <span class="n">testCompile</span> <span class="s2">&quot;org.apache.kudu:kudu-binary:1.9.0:${osdetector.classifier}&quot;</span> |
| <span class="o">}</span></code></pre></div> |
| |
| <h2 id="test-setup">Test Setup</h2> |
| |
| <p>Once your project is configured correctly, you can start writing tests using the <code>kudu-test-utils</code> |
| and <code>kudu-binary</code> artifacts. One line of code will ensure that each test automatically starts and |
| stops a real Kudu cluster and that cluster logging is output through <code>slf4j</code>:</p> |
| |
| <div class="highlight"><pre><code class="language-java" data-lang="java"><span class="nd">@Rule</span> <span class="kd">public</span> <span class="n">KuduTestHarness</span> <span class="n">harness</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">KuduTestHarness</span><span class="o">();</span></code></pre></div> |
| |
| <p>The <a href="https://github.com/apache/kudu/blob/master/java/kudu-test-utils/src/main/java/org/apache/kudu/test/KuduTestHarness.java">KuduTestHarness</a> |
| has methods to get pre-configured clients, start and stop servers, and more. Below is an example |
| test to showcase some of the capabilities:</p> |
| |
| <div class="highlight"><pre><code class="language-java" data-lang="java"><span class="kn">import</span> <span class="nn">org.apache.kudu.*</span><span class="o">;</span> |
| <span class="kn">import</span> <span class="nn">org.apache.kudu.client.*</span><span class="o">;</span> |
| <span class="kn">import</span> <span class="nn">org.apache.kudu.test.KuduTestHarness</span><span class="o">;</span> |
| <span class="kn">import</span> <span class="nn">org.junit.*</span><span class="o">;</span> |
| |
| <span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span> |
| <span class="kn">import</span> <span class="nn">java.util.Collections</span><span class="o">;</span> |
| <span class="kn">import</span> <span class="nn">java.util.List</span><span class="o">;</span> |
| |
| <span class="kd">public</span> <span class="kd">class</span> <span class="nc">MyKuduTest</span> <span class="o">{</span> |
| |
| <span class="nd">@Rule</span> |
| <span class="kd">public</span> <span class="n">KuduTestHarness</span> <span class="n">harness</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">KuduTestHarness</span><span class="o">();</span> |
| |
| <span class="nd">@Test</span> |
| <span class="kd">public</span> <span class="kt">void</span> <span class="nf">test</span><span class="o">()</span> <span class="kd">throws</span> <span class="n">Exception</span> <span class="o">{</span> |
| <span class="c1">// Get a KuduClient configured to talk to the running mini cluster.</span> |
| <span class="n">KuduClient</span> <span class="n">client</span> <span class="o">=</span> <span class="n">harness</span><span class="o">.</span><span class="na">getClient</span><span class="o">();</span> |
| |
| <span class="c1">// Some of the other most common KuduTestHarness methods include:</span> |
| <span class="n">AsyncKuduClient</span> <span class="n">asyncClient</span> <span class="o">=</span> <span class="n">harness</span><span class="o">.</span><span class="na">getAsyncClient</span><span class="o">();</span> |
| <span class="n">String</span> <span class="n">masterAddresses</span> <span class="o">=</span> <span class="n">harness</span><span class="o">.</span><span class="na">getMasterAddressesAsString</span><span class="o">();</span> |
| <span class="n">List</span><span class="o">&lt;</span><span class="n">HostAndPort</span><span class="o">&gt;</span> <span class="n">masterServers</span> <span class="o">=</span> <span class="n">harness</span><span class="o">.</span><span class="na">getMasterServers</span><span class="o">();</span> |
| <span class="n">List</span><span class="o">&lt;</span><span class="n">HostAndPort</span><span class="o">&gt;</span> <span class="n">tabletServers</span> <span class="o">=</span> <span class="n">harness</span><span class="o">.</span><span class="na">getTabletServers</span><span class="o">();</span> |
| <span class="n">harness</span><span class="o">.</span><span class="na">killLeaderMasterServer</span><span class="o">();</span> |
| <span class="n">harness</span><span class="o">.</span><span class="na">killAllMasterServers</span><span class="o">();</span> |
| <span class="n">harness</span><span class="o">.</span><span class="na">startAllMasterServers</span><span class="o">();</span> |
| <span class="n">harness</span><span class="o">.</span><span class="na">killAllTabletServers</span><span class="o">();</span> |
| <span class="n">harness</span><span class="o">.</span><span class="na">startAllTabletServers</span><span class="o">();</span> |
| |
| <span class="c1">// Create a new Kudu table.</span> |
| <span class="n">String</span> <span class="n">tableName</span> <span class="o">=</span> <span class="s">&quot;myTable&quot;</span><span class="o">;</span> |
| <span class="n">Schema</span> <span class="n">schema</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">Schema</span><span class="o">(</span><span class="n">Arrays</span><span class="o">.</span><span class="na">asList</span><span class="o">(</span> |
| <span class="k">new</span> <span class="n">ColumnSchema</span><span class="o">.</span><span class="na">ColumnSchemaBuilder</span><span class="o">(</span><span class="s">&quot;key&quot;</span><span class="o">,</span> <span class="n">Type</span><span class="o">.</span><span class="na">INT32</span><span class="o">).</span><span class="na">key</span><span class="o">(</span><span class="kc">true</span><span class="o">).</span><span class="na">build</span><span class="o">(),</span> |
| <span class="k">new</span> <span class="n">ColumnSchema</span><span class="o">.</span><span class="na">ColumnSchemaBuilder</span><span class="o">(</span><span class="s">&quot;value&quot;</span><span class="o">,</span> <span class="n">Type</span><span class="o">.</span><span class="na">STRING</span><span class="o">).</span><span class="na">key</span><span class="o">(</span><span class="kc">false</span><span class="o">).</span><span class="na">build</span><span class="o">()</span> |
| <span class="o">));</span> |
| <span class="n">CreateTableOptions</span> <span class="n">opts</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">CreateTableOptions</span><span class="o">()</span> |
| <span class="o">.</span><span class="na">setRangePartitionColumns</span><span class="o">(</span><span class="n">Collections</span><span class="o">.</span><span class="na">singletonList</span><span class="o">(</span><span class="s">&quot;key&quot;</span><span class="o">));</span> |
| <span class="n">client</span><span class="o">.</span><span class="na">createTable</span><span class="o">(</span><span class="n">tableName</span><span class="o">,</span> <span class="n">schema</span><span class="o">,</span> <span class="n">opts</span><span class="o">);</span> |
| <span class="n">KuduTable</span> <span class="n">table</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="na">openTable</span><span class="o">(</span><span class="n">tableName</span><span class="o">);</span> |
| |
| <span class="c1">// Write a few rows to the table</span> |
| <span class="n">KuduSession</span> <span class="n">session</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="na">newSession</span><span class="o">();</span> |
| <span class="k">for</span><span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span> |
| <span class="n">Insert</span> <span class="n">insert</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="na">newInsert</span><span class="o">();</span> |
| <span class="n">PartialRow</span> <span class="n">row</span> <span class="o">=</span> <span class="n">insert</span><span class="o">.</span><span class="na">getRow</span><span class="o">();</span> |
| <span class="n">row</span><span class="o">.</span><span class="na">addInt</span><span class="o">(</span><span class="s">&quot;key&quot;</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span> |
| <span class="n">row</span><span class="o">.</span><span class="na">addString</span><span class="o">(</span><span class="s">&quot;value&quot;</span><span class="o">,</span> <span class="n">String</span><span class="o">.</span><span class="na">valueOf</span><span class="o">(</span><span class="n">i</span><span class="o">));</span> |
| <span class="n">session</span><span class="o">.</span><span class="na">apply</span><span class="o">(</span><span class="n">insert</span><span class="o">);</span> |
| <span class="o">}</span> |
| <span class="n">session</span><span class="o">.</span><span class="na">close</span><span class="o">();</span> |
| |
| <span class="c1">// ... Continue the test. Read and validate the rows, alter the table, etc.</span> |
| <span class="o">}</span> |
| <span class="o">}</span></code></pre></div> |
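One possible shape for the elided "read and validate" step is sketched below. It is a fragment that assumes the `client` and `table` variables from the test method above and a running mini cluster, so it cannot run standalone; the scanner calls are from the Kudu Java client API:

```java
// Sketch: read back and count the rows written above using a KuduScanner.
KuduScanner scanner = client.newScannerBuilder(table).build();
int rowCount = 0;
while (scanner.hasMoreRows()) {
  RowResultIterator results = scanner.nextRows();
  while (results.hasNext()) {
    RowResult result = results.next();
    rowCount++; // each RowResult exposes getInt("key"), getString("value"), etc.
  }
}
Assert.assertEquals(10, rowCount);
```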
| |
| <p>For a complete example of a project using the <code>KuduTestHarness</code>, see the |
| <a href="https://github.com/apache/kudu/tree/master/examples/java/java-example">java-example</a> project in |
| the Kudu source code repository. The Kudu project itself uses the <code>KuduTestHarness</code> for all of its |
| own integration tests. For more complex examples, you can explore the various |
| <a href="https://github.com/apache/kudu/tree/master/java/kudu-client/src/test/java/org/apache/kudu/client">Kudu integration</a> |
| tests in the Kudu source code repository.</p> |
| |
| <h2 id="feedback">Feedback</h2> |
| |
| <p>Kudu 1.9.0 is the first release to have these testing utilities available. Although these utilities |
| simplify testing of Kudu applications, there is always room for improvement. |
| Please report any issues, ideas, or feedback to the Kudu user mailing list, Jira, or Slack channel |
| and we will try to incorporate your feedback quickly. See the |
| <a href="https://kudu.apache.org/community.html">Kudu community page</a> for details.</p> |
| |
| <h2 id="thank-you">Thank You</h2> |
| |
| <p>We would like to give a special thank you to everyone who helped contribute to the <code>kudu-test-utils</code> |
| and <code>kudu-binary</code> artifacts. We would especially like to thank |
| <a href="https://www.linkedin.com/in/brian-mcdevitt-1136a08/">Brian McDevitt</a> at <a href="https://www.phdata.io/">phData</a> |
| and |
| <a href="https://twitter.com/timrobertson100">Tim Robertson</a> at <a href="https://www.gbif.org/">GBIF</a> who helped us |
| tremendously.</p></content><author><name>Grant Henke & Mike Percy</name></author><summary>Note: This is a cross-post from the Cloudera Engineering Blog |
| Testing Apache Kudu Applications on the JVM |
| |
| Although the Kudu server is written in C++ for performance and efficiency, developers can write |
| client applications in C++, Java, or Python. To make it easier for Java developers to create |
| reliable client applications, we’ve added new utilities in Kudu 1.9.0 that allow you to write tests |
| using a Kudu cluster without needing to build Kudu yourself, without any knowledge of C++, |
| and without any complicated coordination around starting and stopping Kudu clusters for each test. |
| This post describes how the new testing utilities work and how you can use them in your application |
| tests.</summary></entry><entry><title>Apache Kudu 1.9.0 Released</title><link href="/2019/03/15/apache-kudu-1-9-0-release.html" rel="alternate" type="text/html" title="Apache Kudu 1.9.0 Released" /><published>2019-03-15T00:00:00-05:00</published><updated>2019-03-15T00:00:00-05:00</updated><id>/2019/03/15/apache-kudu-1-9-0-release</id><content type="html" xml:base="/2019/03/15/apache-kudu-1-9-0-release.html"><p>The Apache Kudu team is happy to announce the release of Kudu 1.9.0!</p> |
| |
| <p>The new release adds several new features and improvements, including the |
| following:</p> |
| |
| <!--more--> |
| |
| <ul> |
| <li>Added support for location awareness for placement of tablet replicas.</li> |
| <li>Introduced docker scripts to facilitate building and running Kudu on various |
| operating systems.</li> |
| <li>Introduced an experimental feature to allow users to run tests against a Kudu |
| mini cluster without having to first locally build or install Kudu.</li> |
| <li>Updated the compaction policy to favor reducing the number of rowsets, which |
| can lead to significantly faster scans and bootup times in certain workloads.</li> |
| <li>Multiple tooling enhancements have been made to improve visibility into Kudu |
| tables.</li> |
| </ul> |
| |
<p>The above is just a list of the highlights. For a more complete list of new
features, improvements, and fixes, please refer to the <a href="/releases/1.9.0/docs/release_notes.html">release
notes</a>.</p>
| |
| <p>The Apache Kudu project only publishes source code releases. To build Kudu |
| 1.9.0, follow these steps:</p> |
| |
| <ul> |
| <li>Download the Kudu <a href="/releases/1.9.0">1.9.0 source release</a></li> |
| <li>Follow the instructions in the documentation to build Kudu <a href="/releases/1.9.0/docs/installation.html#build_from_source">1.9.0 from |
| source</a></li> |
| </ul> |
| |
| <p>For your convenience, binary JAR files for the Kudu Java client library, Spark |
| DataSource, Flume sink, and other Java integrations are published to the ASF |
| Maven repository and are <a href="https://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.kudu%22%20AND%20v%3A%221.9.0%22">now |
| available</a>.</p> |
| |
| <p>The Python client source is also available on |
| <a href="https://pypi.org/project/kudu-python/">PyPI</a>.</p></content><author><name>Andrew Wong</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 1.9.0! |
| |
| The new release adds several new features and improvements, including the |
| following:</summary></entry><entry><title>Transparent Hierarchical Storage Management with Apache Kudu and Impala</title><link href="/2019/03/05/transparent-hierarchical-storage-management-with-apache-kudu-and-impala.html" rel="alternate" type="text/html" title="Transparent Hierarchical Storage Management with Apache Kudu and Impala" /><published>2019-03-05T00:00:00-06:00</published><updated>2019-03-05T00:00:00-06:00</updated><id>/2019/03/05/transparent-hierarchical-storage-management-with-apache-kudu-and-impala</id><content type="html" xml:base="/2019/03/05/transparent-hierarchical-storage-management-with-apache-kudu-and-impala.html"><p>Note: This is a cross-post from the Cloudera Engineering Blog |
| <a href="https://blog.cloudera.com/blog/2019/03/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/">Transparent Hierarchical Storage Management with Apache Kudu and Impala</a></p> |
| |
<p>When picking a storage option for an application, it is common to choose the single
option whose features best fit your use case. For mutability and real-time analytics
workloads you may want to use Apache Kudu, while for massive scalability at a low cost
you may want to use HDFS. For that reason, there is a need for a solution that lets
you leverage the best features of multiple storage options. This post describes the
sliding window pattern using Apache Impala with data stored in Apache Kudu and Apache
HDFS. With this pattern you get the benefits of multiple storage layers in a way that
is transparent to users.</p>
| |
| <!--more--> |
| |
| <p>Apache Kudu is designed for fast analytics on rapidly changing data. Kudu provides a |
| combination of fast inserts/updates and efficient columnar scans to enable multiple |
| real-time analytic workloads across a single storage layer. For that reason, Kudu fits |
| well into a data pipeline as the place to store real-time data that needs to be |
queryable immediately. Additionally, Kudu supports updating and deleting rows in real
time, allowing for late-arriving data and data corrections.</p>
| |
<p>Apache HDFS is designed to allow for limitless scalability at a low cost. It is
optimized for batch-oriented use cases where data is immutable. When paired with the
| Apache Parquet file format, structured data can be accessed with extremely high |
| throughput and efficiency.</p> |
| |
| <p>For situations in which the data is small and ever-changing, like dimension tables, |
| it is common to keep all of the data in Kudu. It is even common to keep large tables |
| in Kudu when the data fits within Kudu’s |
| <a href="https://kudu.apache.org/docs/known_issues.html#_scale">scaling limits</a> and can benefit |
| from Kudu’s unique features. In cases where the data is massive, batch oriented, and |
| unlikely to change, storing the data in HDFS using the Parquet format is preferred. |
| When you need the benefits of both storage layers, the sliding window pattern is a |
| useful solution.</p> |
| |
| <h2 id="the-sliding-window-pattern">The Sliding Window Pattern</h2> |
| |
<p>In this pattern, matching Kudu and Parquet-formatted HDFS tables are created in Impala.
These tables are partitioned by a unit of time based on how frequently the data is
moved between the Kudu and HDFS tables. It is common to use daily, monthly, or yearly
| partitions. A unified view is created and a <code>WHERE</code> clause is used to define a boundary |
| that separates which data is read from the Kudu table and which is read from the HDFS |
| table. The defined boundary is important so that you can move data between Kudu and |
| HDFS without exposing duplicate records to the view. Once the data is moved, an atomic |
| <code>ALTER VIEW</code> statement can be used to move the boundary forward.</p> |
| |
| <p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/sliding-window-pattern.png" alt="png" class="img-responsive" /></p> |
| |
| <p>Note: This pattern works best with somewhat sequential data organized into range |
| partitions, because having a sliding window of time and dropping partitions is very |
| efficient.</p> |
| |
| <p>This pattern results in a sliding window of time where mutable data is stored in Kudu |
| and immutable data is stored in the Parquet format on HDFS. Leveraging both Kudu and |
| HDFS via Impala provides the benefits of both storage systems:</p> |
| |
| <ul> |
| <li>Streaming data is immediately queryable</li> |
| <li>Updates for late arriving data or manual corrections can be made</li> |
  <li>Data stored in HDFS is optimally sized, increasing performance and preventing small files</li>
| <li>Reduced cost</li> |
| </ul> |
| |
| <p>Impala also supports cloud storage options such as |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_s3.html">S3</a> and |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_adls.html">ADLS</a>. |
| This capability allows convenient access to a storage system that is remotely managed, |
| accessible from anywhere, and integrated with various cloud-based services. Because |
| this data is remote, queries against S3 data are less performant, making S3 suitable |
| for holding “cold” data that is only queried occasionally. This pattern can be |
| extended to use cloud storage for cold data by creating a third matching table and |
| adding another boundary to the unified view.</p> |
| |
| <p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/sliding-window-pattern-cold.png" alt="png" class="img-responsive" /></p> |
| |
| <p>Note: For simplicity only Kudu and HDFS are illustrated in the examples below.</p> |
| |
| <p>The process for moving data from Kudu to HDFS is broken into two phases. The first |
| phase is the data migration, and the second phase is the metadata change. These |
| ongoing steps should be scheduled to run automatically on a regular basis.</p> |
| |
| <p>In the first phase, the now immutable data is copied from Kudu to HDFS. Even though |
| data is duplicated from Kudu into HDFS, the boundary defined in the view will prevent |
| duplicate data from being shown to users. This step can include any validation and |
| retries as needed to ensure the data offload is successful.</p> |
| |
| <p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/phase-1.png" alt="png" class="img-responsive" /></p> |
| |
| <p>In the second phase, now that the data is safely copied to HDFS, the metadata is |
| changed to adjust how the offloaded partition is exposed. This includes shifting |
| the boundary forward, adding a new Kudu partition for the next period, and dropping |
| the old Kudu partition.</p> |
| |
| <p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/phase-2.png" alt="png" class="img-responsive" /></p> |
| |
| <h2 id="building-blocks">Building Blocks</h2> |
| |
<p>In order to implement the sliding window pattern, a few Impala fundamentals are
required. Each fundamental building block of the sliding window pattern is described
below.</p>
| |
| <h3 id="moving-data">Moving Data</h3> |
| |
<p>Moving data among storage systems via Impala is straightforward provided you have
matching tables defined using each of the storage formats. To keep this post brief,
not all of the options available when creating an Impala table are described.
| However, Impala’s |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_create_table.html">CREATE TABLE documentation</a> |
| can be referenced to find the correct syntax for Kudu, HDFS, and cloud storage tables. |
| A few examples are shown further below where the sliding window pattern is illustrated.</p> |
| |
| <p>Once the tables are created, moving the data is as simple as an |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_insert.html">INSERT…SELECT</a> statement:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">table_foo</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">table_bar</span><span class="p">;</span></code></pre></div> |
| |
| <p>All of the features of the |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_select.html">SELECT</a> |
| statement can be used to select the specific data you would like to move.</p> |
| |
| <p>Note: If moving data to Kudu, an <code>UPSERT INTO</code> statement can be used to handle |
| duplicate keys.</p> |
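
<p>As an illustrative sketch (not part of the original workflow), reusing the
hypothetical <code>table_foo</code> and <code>table_bar</code> names from the example above, a
duplicate-safe move into a Kudu table might look like:</p>

<div class="highlight"><pre><code class="language-sql" data-lang="sql">-- Rows whose primary keys already exist in table_foo are updated in place
UPSERT INTO table_foo
SELECT * FROM table_bar;</code></pre></div>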
| |
| <h3 id="unified-querying">Unified Querying</h3> |
| |
<p>Querying data from multiple tables and data sources in Impala is also straightforward.
For the sake of brevity, not all of the options available when creating an Impala view
are described. However, see Impala’s
| <a href="https://impala.apache.org/docs/build/html/topics/impala_create_view.html">CREATE VIEW documentation</a> |
| for more in-depth details.</p> |
| |
| <p>Creating a view for unified querying is as simple as a <code>CREATE VIEW</code> statement using |
| two <code>SELECT</code> clauses combined with a <code>UNION ALL</code>:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">VIEW</span> <span class="n">foo_view</span> <span class="k">AS</span> |
| <span class="k">SELECT</span> <span class="n">col1</span><span class="p">,</span> <span class="n">col2</span><span class="p">,</span> <span class="n">col3</span> <span class="k">FROM</span> <span class="n">foo_parquet</span> |
| <span class="k">UNION</span> <span class="k">ALL</span> |
| <span class="k">SELECT</span> <span class="n">col1</span><span class="p">,</span> <span class="n">col2</span><span class="p">,</span> <span class="n">col3</span> <span class="k">FROM</span> <span class="n">foo_kudu</span><span class="p">;</span></code></pre></div> |
| |
| <p>WARNING: Be sure to use <code>UNION ALL</code> and not <code>UNION</code>. The <code>UNION</code> keyword by itself |
| is the same as <code>UNION DISTINCT</code> and can have significant performance impact. |
| More information can be found in the Impala |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_union.html">UNION documentation</a>.</p> |
| |
| <p>All of the features of the |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_select.html">SELECT</a> |
| statement can be used to expose the correct data and columns from each of the |
underlying tables. It is important to use the <code>WHERE</code> clause to pass through and
push down any predicates that need special handling or transformations. More examples
| will follow below in the discussion of the sliding window pattern.</p> |
| |
| <p>Additionally, views can be altered via the |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_alter_view.html">ALTER VIEW</a> |
| statement. This is useful when combined with the <code>SELECT</code> statement because it can be |
| used to atomically update what data is being accessed by the view.</p> |
| |
| <h2 id="an-example-implementation">An Example Implementation</h2> |
| |
| <p>Below are sample steps to implement the sliding window pattern using a monthly period |
| with three months of active mutable data. Data older than three months will be |
| offloaded to HDFS using the Parquet format.</p> |
| |
| <h3 id="create-the-kudu-table">Create the Kudu Table</h3> |
| |
| <p>First, create a Kudu table which will hold three months of active mutable data. |
| The table is range partitioned by the time column with each range containing one |
period of data. It is important to have partitions that match the period because
dropping Kudu partitions is much more efficient than removing the data via a
<code>DELETE</code> statement. The table is also hash partitioned by the other key column to ensure
that the data is not all written to a single partition.</p>
| |
| <p>Note: Your schema design should vary based on your data and read/write performance |
| considerations. This example schema is intended for demonstration purposes and not as |
| an “optimal” schema. See the |
| <a href="https://kudu.apache.org/docs/schema_design.html">Kudu schema design documentation</a> |
| for more guidance on choosing your schema. For example, you may not need any hash |
| partitioning if your |
| data input rate is low. Alternatively, you may need more hash buckets if your data |
| input rate is very high.</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">my_table_kudu</span> |
| <span class="p">(</span> |
| <span class="n">name</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="n">time</span> <span class="k">TIMESTAMP</span><span class="p">,</span> |
| <span class="n">message</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">time</span><span class="p">)</span> |
| <span class="p">)</span> |
| <span class="n">PARTITION</span> <span class="k">BY</span> |
| <span class="n">HASH</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="n">PARTITIONS</span> <span class="mi">4</span><span class="p">,</span> |
| <span class="n">RANGE</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> <span class="p">(</span> |
| <span class="n">PARTITION</span> <span class="s1">&#39;2018-01-01&#39;</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="s1">&#39;2018-02-01&#39;</span><span class="p">,</span> <span class="c1">--January</span> |
| <span class="n">PARTITION</span> <span class="s1">&#39;2018-02-01&#39;</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="s1">&#39;2018-03-01&#39;</span><span class="p">,</span> <span class="c1">--February</span> |
| <span class="n">PARTITION</span> <span class="s1">&#39;2018-03-01&#39;</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="s1">&#39;2018-04-01&#39;</span><span class="p">,</span> <span class="c1">--March</span> |
| <span class="n">PARTITION</span> <span class="s1">&#39;2018-04-01&#39;</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="s1">&#39;2018-05-01&#39;</span> <span class="c1">--April</span> |
| <span class="p">)</span> |
| <span class="n">STORED</span> <span class="k">AS</span> <span class="n">KUDU</span><span class="p">;</span></code></pre></div> |
| |
| <p>Note: There is an extra month partition to provide a buffer of time for the data to |
| be moved into the immutable table.</p> |
| |
| <h3 id="create-the-hdfs-table">Create the HDFS Table</h3> |
| |
| <p>Create the matching Parquet formatted HDFS table which will hold the older immutable |
| data. This table is partitioned by year, month, and day for efficient access even |
| though you can’t partition by the time column itself. This is addressed further in |
| the view step below. See Impala’s |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_partitioning.html">partitioning documentation</a> |
| for more details.</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">my_table_parquet</span> |
| <span class="p">(</span> |
| <span class="n">name</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="n">time</span> <span class="k">TIMESTAMP</span><span class="p">,</span> |
| <span class="n">message</span> <span class="n">STRING</span> |
| <span class="p">)</span> |
| <span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="p">(</span><span class="k">year</span> <span class="nb">int</span><span class="p">,</span> <span class="k">month</span> <span class="nb">int</span><span class="p">,</span> <span class="k">day</span> <span class="nb">int</span><span class="p">)</span> |
| <span class="n">STORED</span> <span class="k">AS</span> <span class="n">PARQUET</span><span class="p">;</span></code></pre></div> |
| |
| <h3 id="create-the-unified-view">Create the Unified View</h3> |
| |
| <p>Now create the unified view which will be used to query all of the data seamlessly:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">VIEW</span> <span class="n">my_table_view</span> <span class="k">AS</span> |
| <span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="n">time</span><span class="p">,</span> <span class="n">message</span> |
| <span class="k">FROM</span> <span class="n">my_table_kudu</span> |
| <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&gt;=</span> <span class="ss">&quot;2018-01-01&quot;</span> |
| <span class="k">UNION</span> <span class="k">ALL</span> |
| <span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="n">time</span><span class="p">,</span> <span class="n">message</span> |
| <span class="k">FROM</span> <span class="n">my_table_parquet</span> |
| <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&lt;</span> <span class="ss">&quot;2018-01-01&quot;</span> |
| <span class="k">AND</span> <span class="k">year</span> <span class="o">=</span> <span class="k">year</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="k">month</span> <span class="o">=</span> <span class="k">month</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="k">day</span> <span class="o">=</span> <span class="k">day</span><span class="p">(</span><span class="n">time</span><span class="p">);</span></code></pre></div> |
| |
| <p>Each <code>SELECT</code> clause explicitly lists all of the columns to expose. This ensures that |
| the year, month, and day columns that are unique to the Parquet table are not exposed. |
| If needed, it also allows any necessary column or type mapping to be handled.</p> |
| |
<p>The initial <code>WHERE</code> clauses applied to both <code>my_table_kudu</code> and <code>my_table_parquet</code>
define the boundary between Kudu and HDFS to ensure duplicate data is not read while
in the process of offloading data.</p>
| |
<p>The additional <code>AND</code> clauses applied to <code>my_table_parquet</code> are used to ensure good
predicate pushdown on the individual year, month, and day columns.</p>
| |
| <p>WARNING: As stated earlier, be sure to use <code>UNION ALL</code> and not <code>UNION</code>. The <code>UNION</code> |
| keyword by itself is the same as <code>UNION DISTINCT</code> and can have significant performance |
| impact. More information can be found in the Impala |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_union.html"><code>UNION</code> documentation</a>.</p> |
| |
| <h3 id="ongoing-steps">Ongoing Steps</h3> |
| |
| <p>Now that the base tables and view are created, prepare the ongoing steps to maintain |
| the sliding window. Because these ongoing steps should be scheduled to run on a |
| regular basis, the examples below are shown using <code>.sql</code> files that take variables |
| which can be passed from your scripts and scheduling tool of choice.</p> |
| |
| <p>Create the <code>window_data_move.sql</code> file to move the data from the oldest partition to HDFS:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">INSERT</span> <span class="k">INTO</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">hdfs_table</span><span class="err">}</span> <span class="n">PARTITION</span> <span class="p">(</span><span class="k">year</span><span class="p">,</span> <span class="k">month</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span> |
| <span class="k">SELECT</span> <span class="o">*</span><span class="p">,</span> <span class="k">year</span><span class="p">(</span><span class="n">time</span><span class="p">),</span> <span class="k">month</span><span class="p">(</span><span class="n">time</span><span class="p">),</span> <span class="k">day</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> |
| <span class="k">FROM</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span> |
| <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&gt;=</span> <span class="n">add_months</span><span class="p">(</span><span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="n">time</span> <span class="o">&lt;</span> <span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">;</span> |
| <span class="n">COMPUTE</span> <span class="n">INCREMENTAL</span> <span class="n">STATS</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">hdfs_table</span><span class="err">}</span><span class="p">;</span></code></pre></div> |
| |
<p>Note: The
<a href="https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html">COMPUTE INCREMENTAL STATS</a>
statement is not required but helps Impala to optimize queries.</p>
| |
| <p>To run the SQL statement, use the Impala shell and pass the required variables. |
| Below is an example:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_data_move.sql |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span></code></pre></div> |
| |
<p>Note: You can adjust the <code>WHERE</code> clause to match the given period and cadence of your
offload. Here the <code>add_months</code> function is used with an argument of -1 to select the
month of data immediately preceding the new boundary time.</p>
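
<p>For instance, the boundary arithmetic can be checked directly with a standalone
query (an illustrative example, not part of the offload script):</p>

<div class="highlight"><pre><code class="language-sql" data-lang="sql">-- Shifts the timestamp back one month, i.e. to the start of January
-- for a February boundary
SELECT add_months(&quot;2018-02-01&quot;, -1);</code></pre></div>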
| |
| <p>Create the <code>window_view_alter.sql</code> file to shift the time boundary forward by altering |
| the unified view:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">ALTER</span> <span class="k">VIEW</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">view_name</span><span class="err">}</span> <span class="k">AS</span> |
| <span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="n">time</span><span class="p">,</span> <span class="n">message</span> |
| <span class="k">FROM</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span> |
| <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&gt;=</span> <span class="ss">&quot;${var:new_boundary_time}&quot;</span> |
| <span class="k">UNION</span> <span class="k">ALL</span> |
| <span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="n">time</span><span class="p">,</span> <span class="n">message</span> |
| <span class="k">FROM</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">hdfs_table</span><span class="err">}</span> |
| <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&lt;</span> <span class="ss">&quot;${var:new_boundary_time}&quot;</span> |
| <span class="k">AND</span> <span class="k">year</span> <span class="o">=</span> <span class="k">year</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="k">month</span> <span class="o">=</span> <span class="k">month</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="k">day</span> <span class="o">=</span> <span class="k">day</span><span class="p">(</span><span class="n">time</span><span class="p">);</span></code></pre></div> |
| |
| <p>To run the SQL statement, use the Impala shell and pass the required variables. |
| Below is an example:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_view_alter.sql |
| --var<span class="o">=</span><span class="nv">view_name</span><span class="o">=</span>my_table_view |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span></code></pre></div> |
| |
| <p>Create the <code>window_partition_shift.sql</code> file to shift the Kudu partitions forward:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">ALTER</span> <span class="k">TABLE</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span> |
| |
| <span class="k">ADD</span> <span class="n">RANGE</span> <span class="n">PARTITION</span> <span class="n">add_months</span><span class="p">(</span><span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">,</span> |
| <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">window_length</span><span class="err">}</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="n">add_months</span><span class="p">(</span><span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">,</span> |
| <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">window_length</span><span class="err">}</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span> |
| |
| <span class="k">ALTER</span> <span class="k">TABLE</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span> |
| |
| <span class="k">DROP</span> <span class="n">RANGE</span> <span class="n">PARTITION</span> <span class="n">add_months</span><span class="p">(</span><span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> |
| <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">;</span></code></pre></div> |
| |
| <p>To run the SQL statement, use the Impala shell and pass the required variables. |
| Below is an example:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_partition_shift.sql |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span> |
| --var<span class="o">=</span><span class="nv">window_length</span><span class="o">=</span>3</code></pre></div> |
| |
| <p>Note: You should periodically run |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html">COMPUTE STATS</a> |
| on your Kudu table to ensure Impala’s query performance is optimal.</p> |
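
<p>For example, using the <code>my_table_kudu</code> table from this post:</p>

<div class="highlight"><pre><code class="language-sql" data-lang="sql">-- Recompute the table and column statistics used by the Impala planner
COMPUTE STATS my_table_kudu;</code></pre></div>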
| |
| <h3 id="experimentation">Experimentation</h3> |
| |
| <p>Now that you have created the tables, view, and scripts to leverage the sliding |
| window pattern, you can experiment with them by inserting data for different time |
| ranges and running the scripts to move the window forward through time.</p> |
| |
| <p>Insert some sample values into the Kudu table:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">my_table_kudu</span> <span class="k">VALUES</span> |
| <span class="p">(</span><span class="s1">&#39;joey&#39;</span><span class="p">,</span> <span class="s1">&#39;2018-01-01&#39;</span><span class="p">,</span> <span class="s1">&#39;hello&#39;</span><span class="p">),</span> |
| <span class="p">(</span><span class="s1">&#39;ross&#39;</span><span class="p">,</span> <span class="s1">&#39;2018-02-01&#39;</span><span class="p">,</span> <span class="s1">&#39;goodbye&#39;</span><span class="p">),</span> |
| <span class="p">(</span><span class="s1">&#39;rachel&#39;</span><span class="p">,</span> <span class="s1">&#39;2018-03-01&#39;</span><span class="p">,</span> <span class="s1">&#39;hi&#39;</span><span class="p">);</span></code></pre></div> |
| |
| <p>Show the data in each table/view:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></div> |
| |
| <p>Move the January data into HDFS:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_data_move.sql |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span></code></pre></div> |
| |
| <p>Confirm the data is in both places, but not duplicated in the view:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></div> |
| |
| <p>Alter the view to shift the time boundary forward to February:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_view_alter.sql |
| --var<span class="o">=</span><span class="nv">view_name</span><span class="o">=</span>my_table_view |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span></code></pre></div> |
| |
| <p>Confirm the data is still in both places, but not duplicated in the view:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></div> |
| |
| <p>Shift the Kudu partitions forward:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_partition_shift.sql |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span> |
| --var<span class="o">=</span><span class="nv">window_length</span><span class="o">=</span>3</code></pre></div> |
| |
| <p>Confirm the January data is now only in HDFS:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></div> |
| |
| <p>Confirm predicate push down with Impala’s EXPLAIN statement:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span> |
| <span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span> <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&lt;</span> <span class="ss">&quot;2018-02-01&quot;</span><span class="p">;</span> |
| <span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span> <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&gt;</span> <span class="ss">&quot;2018-02-01&quot;</span><span class="p">;</span></code></pre></div> |
| |
| <p>In the explain output you should see “kudu predicates” which include the time column |
| filters in the “SCAN KUDU” section and “predicates” which include the time, day, month, |
| and year columns in the “SCAN HDFS” section.</p></content><author><name>Grant Henke</name></author><summary>Note: This is a cross-post from the Cloudera Engineering Blog |
| Transparent Hierarchical Storage Management with Apache Kudu and Impala |
| |
| When picking a storage option for an application it is common to pick a single |
| storage option which has the most applicable features to your use case. For mutability |
| and real-time analytics workloads you may want to use Apache Kudu, but for massive |
| scalability at a low cost you may want to use HDFS. For that reason, there is a need |
| for a solution that allows you to leverage the best features of multiple storage |
| options. This post describes the sliding window pattern using Apache Impala with data |
| stored in Apache Kudu and Apache HDFS. With this pattern you get all of the benefits |
| of multiple storage layers in a way that is transparent to users.</summary></entry><entry><title>Call for Posts</title><link href="/2018/12/11/call-for-posts.html" rel="alternate" type="text/html" title="Call for Posts" /><published>2018-12-11T00:00:00-06:00</published><updated>2018-12-11T00:00:00-06:00</updated><id>/2018/12/11/call-for-posts</id><content type="html" xml:base="/2018/12/11/call-for-posts.html"><p>Most of the posts in the Kudu blog have been written by the project’s |
| committers and are either technical or news-like in nature. We’d like to hear |
| how you’re using Kudu in production, in testing, or in your hobby project and |
| we’d like to share it with the world!</p> |
| |
| <!--more--> |
| |
| <p>If you’d like to tell the world about how you are using Kudu in your project, |
| now is the time.</p> |
| |
| <p>To learn how to submit posts, read our <a href="/docs/contributing.html#_blog_posts">contributing |
| documentation</a>. Alternatively, you can |
| draft your post in Google Docs and share it with us at |
| <a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#100;&#101;&#118;&#064;&#107;&#117;&#100;&#117;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">&#100;&#101;&#118;&#064;&#107;&#117;&#100;&#117;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;</a> and we’re happy to review it |
| and post it to the blog for you.</p></content><author><name>Attila Bukor</name></author><summary>Most of the posts in the Kudu blog have been written by the project’s |
| committers and are either technical or news-like in nature. We’d like to hear |
| how you’re using Kudu in production, in testing, or in your hobby project and |
| we’d like to share it with the world!</summary></entry><entry><title>Apache Kudu 1.8.0 Released</title><link href="/2018/10/26/apache-kudu-1-8-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.8.0 Released" /><published>2018-10-26T00:00:00-05:00</published><updated>2018-10-26T00:00:00-05:00</updated><id>/2018/10/26/apache-kudu-1-8-0-released</id><content type="html" xml:base="/2018/10/26/apache-kudu-1-8-0-released.html"><p>The Apache Kudu team is happy to announce the release of Kudu 1.8.0!</p> |
| |
| <p>The new release adds several new features and improvements, including the |
| following:</p> |
| |
| <!--more--> |
| |
| <ul> |
| <li>Introduced a manual data rebalancer tool which can be used to redistribute |
| table replicas among tablet servers</li> |
| <li>Added support for <code>IS NULL</code> and <code>IS NOT NULL</code> predicates to the Kudu Python |
| client</li> |
| <li>Multiple tooling improvements make diagnostics and troubleshooting simpler</li> |
| <li>The Kudu Spark connector now supports Spark Streaming DataFrames</li> |
| <li>Added Pandas support to the Python client</li> |
| </ul> |
| |
| <p>The above is just a list of the highlights. For a more complete list of new |
| features, improvements, and fixes, please refer to the <a href="/releases/1.8.0/docs/release_notes.html">release |
| notes</a>.</p> |
| |
| <p>The Apache Kudu project only publishes source code releases. To build Kudu |
| 1.8.0, follow these steps:</p> |
| |
| <ul> |
| <li>Download the Kudu <a href="/releases/1.8.0">1.8.0 source release</a></li> |
| <li>Follow the instructions in the documentation to build Kudu <a href="/releases/1.8.0/docs/installation.html#build_from_source">1.8.0 from |
| source</a></li> |
| </ul> |
| |
| <p>For your convenience, binary JAR files for the Kudu Java client library, Spark |
| DataSource, Flume sink, and other Java integrations are published to the ASF |
| Maven repository and are <a href="https://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.kudu%22%20AND%20v%3A%221.8.0%22">now |
| available</a>.</p> |
| |
| <p>The Python client source is also available on |
| <a href="https://pypi.org/project/kudu-python/">PyPI</a>.</p></content><author><name>Attila Bukor</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 1.8.0! |
| |
| The new release adds several new features and improvements, including the |
| following:</summary></entry><entry><title>Index Skip Scan Optimization in Kudu</title><link href="/2018/09/26/index-skip-scan-optimization-in-kudu.html" rel="alternate" type="text/html" title="Index Skip Scan Optimization in Kudu" /><published>2018-09-26T00:00:00-05:00</published><updated>2018-09-26T00:00:00-05:00</updated><id>/2018/09/26/index-skip-scan-optimization-in-kudu</id><content type="html" xml:base="/2018/09/26/index-skip-scan-optimization-in-kudu.html"><p>This summer I got the opportunity to intern with the Apache Kudu team at Cloudera. |
| My project was to optimize the Kudu scan path by implementing a technique called |
| index skip scan (a.k.a. scan-to-seek, see section 4.1 in [1]). I wanted to share |
| my experience and the progress we’ve made so far on the approach.</p> |
| |
| <!--more--> |
| |
| <p>Let’s begin with discussing the current query flow in Kudu. |
| Consider the following table:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">metrics</span> <span class="p">(</span> |
| <span class="k">host</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="n">tstamp</span> <span class="nb">INT</span><span class="p">,</span> |
| <span class="n">clusterid</span> <span class="nb">INT</span><span class="p">,</span> |
| <span class="k">role</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="p">(</span><span class="k">host</span><span class="p">,</span> <span class="n">tstamp</span><span class="p">,</span> <span class="n">clusterid</span><span class="p">)</span> |
| <span class="p">);</span></code></pre></div> |
| |
| <p><img src="/img/index-skip-scan/example-table.png" alt="png" class="img-responsive" /> |
| <em>Sample rows of table <code>metrics</code> (sorted by key columns).</em></p> |
| |
| <p>In this case, by default, Kudu internally builds a primary key index (implemented as a |
| <a href="https://en.wikipedia.org/wiki/B-tree">B-tree</a>) for the table <code>metrics</code>. |
| As shown in the table above, the index data is sorted by the composite of all key columns. |
| When the user query contains the first key column (<code>host</code>), Kudu uses the index (as the index data is |
| primarily sorted on the first key column).</p> |
| |
| <p>Now, what if the user query does not contain the first key column and instead only contains the <code>tstamp</code> column? |
| In the above case, the <code>tstamp</code> column values are sorted with respect to <code>host</code>, |
| but are not globally sorted, and as such, it’s non-trivial to use the index to filter rows. |
| Instead, a full tablet scan is done by default. Other databases may optimize such scans by building secondary indexes |
| (though it might be redundant to build one on a column that is already part of the primary key). However, this isn’t an option for Kudu, |
| given its lack of secondary index support.</p> |
| |
| <p>The question is, can Kudu do better than a full tablet scan here?</p> |
| |
| <p>The answer is yes! Let’s observe the column preceding the <code>tstamp</code> column. We will refer to it as the |
| “prefix column” and its specific value as the “prefix key”. In this example, <code>host</code> is the prefix column. |
| Note that the prefix keys are sorted in the index and that all rows of a given prefix key are also sorted by the |
| remaining key columns. Therefore, we can use the index to skip to the rows that have distinct prefix keys, |
| and also satisfy the predicate on the <code>tstamp</code> column. |
| For example, consider the query:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="n">clusterid</span> <span class="k">FROM</span> <span class="n">metrics</span> <span class="k">WHERE</span> <span class="n">tstamp</span> <span class="o">=</span> <span class="mi">100</span><span class="p">;</span></code></pre></div> |
| |
| <p><img src="/img/index-skip-scan/skip-scan-example-table.png" alt="png" class="img-responsive" /> |
| <em>Skip scan flow illustration. The rows in green are scanned and the rest are skipped.</em></p> |
| |
| <p>The tablet server can use the index to <strong>skip</strong> to the first row with a distinct prefix key (<code>host = helium</code>) that |
| matches the predicate (<code>tstamp = 100</code>) and then <strong>scan</strong> through the rows until the predicate no longer matches. At that |
| point we would know that no more rows with <code>host = helium</code> will satisfy the predicate, and we can skip to the next |
| prefix key. This holds true for all distinct keys of <code>host</code>. Hence, this method is popularly known as |
| <strong>skip scan optimization</strong> [2, 3].</p> |
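<p>As an illustration only (a Python sketch, not Kudu’s actual C++ implementation), the seek-and-scan loop over rows sorted by <code>(host, tstamp, clusterid)</code> can be expressed as:</p>

```python
from bisect import bisect_left
from math import inf

# Rows sorted by the composite primary key (host, tstamp, clusterid),
# mirroring the ordering of Kudu's primary key index.
rows = [
    ("helium", 100, 1), ("helium", 200, 2),
    ("neon",   100, 3), ("neon",   300, 4),
    ("xenon",  100, 5), ("xenon",  150, 6),
]

def skip_scan(rows, tstamp):
    """Evaluate `WHERE tstamp = <value>` without a full tablet scan."""
    out, i, n = [], 0, len(rows)
    while i < n:
        host = rows[i][0]
        # Seek: within this prefix key, jump to the first row whose
        # tstamp is >= the predicate value.
        j = bisect_left(rows, (host, tstamp), i)
        # Scan: collect rows while the predicate still matches.
        while j < n and rows[j][0] == host and rows[j][1] == tstamp:
            out.append(rows[j])
            j += 1
        # Skip: jump to the first row of the next distinct prefix key.
        i = bisect_left(rows, (host, inf), i)
    return out

print(skip_scan(rows, 100))
# -> [('helium', 100, 1), ('neon', 100, 3), ('xenon', 100, 5)]
```

<p>Each outer iteration costs one seek per distinct <code>host</code>, so the work grows with the prefix column cardinality rather than with the total row count.</p>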
| |
| <h1 id="performance">Performance</h1> |
| |
| <p>This optimization can speed up queries significantly, depending on the cardinality (number of distinct values) of the |
| prefix column. The lower the prefix column cardinality, the better the skip scan performance. In fact, when the |
| prefix column cardinality is high, skip scan is not a viable approach. The performance graph (obtained using the example |
| schema and query pattern mentioned earlier) is shown below.</p> |
| |
| <p>In our experiments on tablets of up to 10 million rows (as shown below), we found that skip scan performance |
| begins to fall below full tablet scan performance once the prefix column cardinality |
| exceeds sqrt(number_of_rows_in_tablet). |
| Therefore, to retain the benefits of skip scan where possible while keeping performance consistent in cases |
| of large prefix column cardinality, we have tentatively chosen to dynamically disable skip scan when the number of skips for |
| distinct prefix keys exceeds sqrt(number_of_rows_in_tablet). |
| It will be an interesting project to further explore sophisticated heuristics that decide when |
| to dynamically disable skip scan.</p> |
| |
| <p><img src="/img/index-skip-scan/skip-scan-performance-graph.png" alt="png" class="img-responsive" /></p> |
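<p>The tentative cutoff described above amounts to a one-line check (a sketch with hypothetical names; the actual implementation counts skips dynamically as the scan progresses rather than consulting a precomputed cardinality):</p>

```python
import math

def skip_scan_viable(num_rows_in_tablet, num_prefix_skips):
    # Tentative heuristic from the experiments above: skip scan pays off
    # only while the number of skips for distinct prefix keys stays at or
    # below sqrt(number_of_rows_in_tablet).
    return num_prefix_skips <= math.sqrt(num_rows_in_tablet)

print(skip_scan_viable(10_000_000, 1_000))   # True  (budget is ~3162 skips)
print(skip_scan_viable(10_000_000, 10_000))  # False (fall back to full scan)
```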
| |
| <h1 id="conclusion">Conclusion</h1> |
| |
| <p>Skip scan optimization in Kudu can lead to huge performance benefits that scale with the size of |
| data in Kudu tablets. This is a work-in-progress <a href="https://gerrit.cloudera.org/#/c/10983/">patch</a>. |
| The implementation in the patch works only for equality predicates on the non-first primary key |
| columns. Note that although the example above has a single prefix column (<code>host</code>), |
| this approach generalizes to work with any number of prefix columns.</p> |
| |
| <p>This work also lays the groundwork to leverage the skip scan approach and optimize query processing time in the |
| following use cases:</p> |
| |
| <ul> |
| <li>Range predicates</li> |
| <li>In-list predicates</li> |
| </ul> |
| |
| <p>This was my first time working on an open source project. I thoroughly enjoyed working on this challenging problem, |
| right from understanding the scan path in Kudu to working on a full-fledged implementation of |
| the skip scan optimization. I am very grateful to the Kudu team for guiding and supporting me throughout the |
| internship period.</p> |
| |
| <h1 id="references">References</h1> |
| |
| <p><a href="https://storage.googleapis.com/pub-tools-public-publication-data/pdf/42851.pdf">[1]</a>: Gupta, Ashish, et al. “Mesa: |
| Geo-replicated, near real-time, scalable data warehousing.” Proceedings of the VLDB Endowment 7.12 (2014): 1259-1270.</p> |
| |
| <p><a href="https://oracle-base.com/articles/9i/index-skip-scanning/">[2]</a>: Index Skip Scanning - Oracle Database</p> |
| |
| <p><a href="https://www.sqlite.org/optoverview.html#skipscan">[3]</a>: Skip Scan - SQLite</p></content><author><name>Anupama Gupta</name></author><summary>This summer I got the opportunity to intern with the Apache Kudu team at Cloudera. |
| My project was to optimize the Kudu scan path by implementing a technique called |
| index skip scan (a.k.a. scan-to-seek, see section 4.1 in [1]). I wanted to share |
| my experience and the progress we’ve made so far on the approach.</summary></entry><entry><title>Simplified Data Pipelines with Kudu</title><link href="/2018/09/11/simplified-pipelines-with-kudu.html" rel="alternate" type="text/html" title="Simplified Data Pipelines with Kudu" /><published>2018-09-11T00:00:00-05:00</published><updated>2018-09-11T00:00:00-05:00</updated><id>/2018/09/11/simplified-pipelines-with-kudu</id><content type="html" xml:base="/2018/09/11/simplified-pipelines-with-kudu.html"><p>I’ve been working with Hadoop now for over seven years and fortunately, or unfortunately, have run |
| across a lot of structured data use cases. What we, at <a href="https://phdata.io/">phData</a>, have found is |
| that end users are typically comfortable with tabular data and prefer to access their data in a |
| structured manner using tables. |
| <!--more--></p> |
| |
| <p>When working on new structured data projects, the first question we always get from non-Hadoop |
| followers is, <em>“how do I update or delete a record?”</em> The second question we get is, <em>“when adding |
| records, why don’t they show up in Impala right away?”</em> For those of us who have worked with HDFS |
| and Impala on HDFS for years, these are simple questions to answer, but hard ones to explain.</p> |
| |
| <p>The pre-Kudu years were filled with hundreds (or thousands) of self-join views (or materialization jobs) |
| and compaction jobs, along with scheduled jobs to periodically refresh the Impala cache so new records |
| would show up. And while doable, across tens of thousands of tables this became a distraction from solving |
| real business problems.</p> |
| |
| <p>With the introduction of Kudu, mixing record-level updates, deletes, and inserts while supporting |
| large scans is now something we can sustainably manage at scale. HBase is very good at |
| record-level updates, deletes, and inserts, but doesn’t scale well for analytic use cases that often do full |
| table scans. Moreover, for streaming use cases, changes are available in near real-time. End users, |
| accustomed to having to <em>“wait”</em> for their data, can now consume the data as it arrives in their |
| table.</p> |
| |
| <p>A common data ingest pattern where Kudu becomes necessary is change data capture (CDC). That is, |
| capturing inserts, updates, and hard deletes, and streaming them into Kudu where they can be applied |
| immediately. Pre-Kudu, this pipeline was very tedious to implement. Now, with tools like |
| <a href="https://streamsets.com/">StreamSets</a>, you can get up and running in a few hours.</p> |
| |
| <p>A second common workflow is near real-time analytics. We’ve streamed data off mining trucks, |
| oil wells, and manufacturing lines, and needed to make that data available to end users immediately. No |
| longer do we need to batch up writes, flush to HDFS and then refresh cache in Impala. As mentioned |
| before, with Kudu, the data is available as soon as it lands. This has been a significant |
| enhancement for end users, who previously had to <em>“wait”</em> for data.</p> |
| |
| <p>In summary, Kudu has made a tremendous impact in removing the operational distractions of merging in |
| changes, and refreshing the cache of downstream consumers. This now allows data engineers |
| and users to focus on solving business problems, rather than being bothered by the tediousness of |
| the backend.</p></content><author><name>Mac Noland</name></author><summary>I’ve been working with Hadoop now for over seven years and fortunately, or unfortunately, have run |
| across a lot of structured data use cases. What we, at phData, have found is |
| that end users are typically comfortable with tabular data and prefer to access their data in a |
| structured manner using tables.</summary></entry><entry><title>Getting Started with Kudu - an O’Reilly Title</title><link href="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html" rel="alternate" type="text/html" title="Getting Started with Kudu - an O'Reilly Title" /><published>2018-08-06T00:00:00-05:00</published><updated>2018-08-06T00:00:00-05:00</updated><id>/2018/08/06/getting-started-with-kudu-an-oreilly-title</id><content type="html" xml:base="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html"><p>The following article by Brock Noland was reposted from the |
| <a href="https://www.phdata.io/getting-started-with-kudu/">phData</a> |
| blog with their permission.</p> |
| |
| <p>Five years ago, enabling Data Science and Advanced Analytics on the |
| Hadoop platform was hard. Organizations required strong Software Engineering |
| capabilities to successfully implement complex Lambda architectures or even |
| simply implement continuous ingest. Updating or deleting data was simply a |
| nightmare. General Data Protection Regulation (GDPR) would have been an extreme |
| challenge at that time. |
| <!--more--> |
| In that context, on October 11th, 2012, Todd Lipcon performed Apache Kudu’s initial |
| commit. The commit message was:</p> |
| |
| <pre><code>Code for writing cfiles seems to basically work |
| Need to write code for reading cfiles, still |
| </code></pre> |
| |
| <p>And Kudu development was off and running. Around this same time Todd, on his |
| internal Wiki page, started listing out the papers he was reading to develop |
| the theoretical background for creating Kudu. I followed along, reading as many |
| as I could, understanding little, because I knew Todd was up to something |
| important. About a year after that initial commit, I got my |
| <a href="https://github.com/apache/kudu/commit/1d7e6864b4a31d3fe6897e4cb484dfcda6608d43">first Kudu commit</a>, |
| documenting the upper bound of a library. This is a small contribution of which I am still |
| proud.</p> |
| |
| <p>In the meantime, I was lucky enough to be a founder of a Hadoop Managed Services |
| and Consulting company known as <a href="http://phdata.io/">phData</a>. We found that a majority |
| of our customers had use cases which Kudu vastly simplified. Whether it’s Change Data |
| Capture (CDC) from thousands of source tables to Internet of Things (IoT) ingest, Kudu |
| makes life much easier as both an operator of a Hadoop cluster and a developer providing |
| business value on the platform.</p> |
| |
| <p>Through this work, I was lucky enough to be a co-author of |
| <a href="http://shop.oreilly.com/product/0636920065739.do">Getting Started with Kudu</a>. |
| The book is a summation of what my co-authors (Jean-Marc Spaggiari, Mladen |
| Kovacevic, and Ryan Bosshart) and I learned while cutting our teeth on early versions |
| of Kudu. Specifically, you will learn:</p> |
| |
| <ul> |
| <li>A theoretical understanding of Kudu concepts, explained in plain words and simple diagrams</li> |
| <li>Why, for many use cases, using Kudu is so much easier than other ecosystem storage technologies</li> |
| <li>How Kudu enables Hybrid Transactional/Analytical Processing (HTAP) use cases</li> |
| <li>How to design IoT, Predictive Modeling, and Mixed Platform Solutions using Kudu</li> |
| <li>How to design Kudu Schemas</li> |
| </ul> |
| |
| <p><img src="/img/2018-08-06-getting-started-with-kudu-an-oreilly-title.gif" alt="Getting Started with Kudu Cover" class="img-responsive" /></p> |
| |
| <p>Looking forward, I am excited to see Kudu gain additional features and adoption, |
| and eventually to see a second revision of this title. In the meantime, if you have |
| feedback or questions, please reach out on the <code>#getting-started-kudu</code> channel of |
| the <a href="https://getkudu-slack.herokuapp.com/">Kudu Slack</a> or if you prefer non-real-time |
| communication, please use the user@ mailing list!</p></content><author><name>Brock Noland</name></author><summary>The following article by Brock Noland was reposted from the |
| phData |
| blog with their permission. |
| |
| Five years ago, enabling Data Science and Advanced Analytics on the |
| Hadoop platform was hard. Organizations required strong Software Engineering |
| capabilities to successfully implement complex Lambda architectures or even |
| simply implement continuous ingest. Updating or deleting data, were simply a |
| nightmare. General Data Protection Regulation (GDPR) would have been an extreme |
| challenge at that time.</summary></entry><entry><title>Instrumentation in Apache Kudu</title><link href="/2018/07/10/instrumentation-in-kudu.html" rel="alternate" type="text/html" title="Instrumentation in Apache Kudu" /><published>2018-07-10T00:00:00-05:00</published><updated>2018-07-10T00:00:00-05:00</updated><id>/2018/07/10/instrumentation-in-kudu</id><content type="html" xml:base="/2018/07/10/instrumentation-in-kudu.html"><p>Last week, the <a href="http://opentracing.io/">OpenTracing</a> community invited me to |
| their monthly Google Hangout meetup to give an informal talk on tracing and |
| instrumentation in Apache Kudu.</p> |
| |
| <p>While Kudu doesn’t currently support distributed tracing using OpenTracing, |
| it does have quite a lot of other types of instrumentation, metrics, and |
| diagnostics logging. The OpenTracing team was interested to hear about some of |
| the approaches that Kudu has used, and so I gave a brief introduction to topics |
| including:</p> |
| <!--more--> |
| <ul> |
| <li>The Kudu <a href="/docs/administration.html#_diagnostics_logging">diagnostics log</a>, |
| which periodically logs metrics and stack traces.</li> |
| <li>The <a href="/docs/troubleshooting.html#kudu_tracing">process-wide tracing</a> support, |
| based on the open source tracing framework implemented by Google Chrome.</li> |
| <li>The <a href="/docs/troubleshooting.html#kudu_tracing">stack watchdog</a>, |
| which helps us find various latency outliers and issues in our libraries and |
| the Linux kernel.</li> |
| <li><a href="/docs/troubleshooting.html#heap_sampling">Heap sampling</a> support, |
| which helps us understand unexpected memory usage.</li> |
| </ul> |
| |
| <p>If you’re interested in learning about these topics and more, check out the video recording |
| below. My talk spans the first 34 minutes.</p> |
| |
| <iframe width="800" height="500" src="https://www.youtube.com/embed/qBXwKU6Ubjo?end=2058&amp;start=23"> |
| </iframe> |
| |
| <p>If you have any questions about this content or about Kudu in general, |
| <a href="http://kudu.apache.org/community.html">join the community</a>.</p></content><author><name>Todd Lipcon</name></author><summary>Last week, the OpenTracing community invited me to |
| their monthly Google Hangout meetup to give an informal talk on tracing and |
| instrumentation in Apache Kudu. |
| |
| While Kudu doesn’t currently support distributed tracing using OpenTracing, |
| it does have quite a lot of other types of instrumentation, metrics, and |
| diagnostics logging. The OpenTracing team was interested to hear about some of |
| the approaches that Kudu has used, and so I gave a brief introduction to topics |
| including:</summary></entry></feed> |