<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom"><generator uri="http://jekyllrb.com" version="2.5.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2019-11-20T13:28:24-08:00</updated><id>/</id><entry><title>Apache Kudu 1.10.0 Released</title><link href="/2019/07/09/apache-kudu-1-10-0-release.html" rel="alternate" type="text/html" title="Apache Kudu 1.10.0 Released" /><published>2019-07-09T00:00:00-07:00</published><updated>2019-07-09T00:00:00-07:00</updated><id>/2019/07/09/apache-kudu-1-10-0-release</id><content type="html" xml:base="/2019/07/09/apache-kudu-1-10-0-release.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.10.0!&lt;/p&gt;
&lt;p&gt;The new release adds several new features and improvements, including the
following:&lt;/p&gt;
&lt;!--more--&gt;
&lt;ul&gt;
&lt;li&gt;Kudu now supports both full and incremental table backups via a job
implemented using Apache Spark. Additionally, it supports restoring tables
from full and incremental backups via a restore job, also implemented using
Apache Spark. See the
&lt;a href=&quot;/releases/1.10.0/docs/administration.html#backup&quot;&gt;backup documentation&lt;/a&gt;
for more details.&lt;/li&gt;
&lt;li&gt;Kudu can now synchronize its internal catalog with the Apache Hive Metastore,
automatically updating Hive Metastore table entries upon table creation,
deletion, and alterations in Kudu. See the
&lt;a href=&quot;/releases/1.10.0/docs/hive_metastore.html#metadata_sync&quot;&gt;HMS synchronization documentation&lt;/a&gt;
for more details.&lt;/li&gt;
&lt;li&gt;Kudu now supports native fine-grained authorization via integration with
Apache Sentry. Kudu may now enforce access control policies defined for Kudu
tables and columns, as well as policies defined on Hive servers and databases
that may store Kudu tables. See the
&lt;a href=&quot;/releases/1.10.0/docs/security.html#fine_grained_authz&quot;&gt;authorization documentation&lt;/a&gt;
for more details.&lt;/li&gt;
&lt;li&gt;Kudu’s web UI now supports SPNEGO, a protocol for securing HTTP requests with
Kerberos by passing negotiation through HTTP headers. To enable, set the
&lt;code&gt;--webserver_require_spnego&lt;/code&gt; command line flag.&lt;/li&gt;
&lt;li&gt;Column comments can now be stored in Kudu tables, and can be updated using
the AlterTable API
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-1711&quot;&gt;KUDU-1711&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;The performance of mutations (i.e. UPDATE, DELETE, and re-INSERT) to
not-yet-flushed Kudu data has been significantly optimized
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2826&quot;&gt;KUDU-2826&lt;/a&gt; and
&lt;a href=&quot;https://github.com/apache/kudu/commit/f9f9526d3&quot;&gt;f9f9526d3&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Predicate performance for primitive columns, as well as for IS NULL and
IS NOT NULL predicates, has been optimized
(see &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2846&quot;&gt;KUDU-2846&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The above is just a list of the highlights. For a more complete list of new
features, improvements, and fixes, please refer to the &lt;a href=&quot;/releases/1.10.0/docs/release_notes.html&quot;&gt;release
notes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Apache Kudu project only publishes source code releases. To build Kudu
1.10.0, follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the Kudu &lt;a href=&quot;/releases/1.10.0&quot;&gt;1.10.0 source release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow the instructions in the documentation to build Kudu &lt;a href=&quot;/releases/1.10.0/docs/installation.html#build_from_source&quot;&gt;1.10.0 from
source&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For your convenience, binary JAR files for the Kudu Java client library, Spark
DataSource, Flume sink, and other Java integrations are published to the ASF
Maven repository and are &lt;a href=&quot;https://search.maven.org/search?q=g:org.apache.kudu%20AND%20v:1.10.0&quot;&gt;now
available&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Python client source is also available on
&lt;a href=&quot;https://pypi.org/project/kudu-python/&quot;&gt;PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, experimental Docker images are published to
&lt;a href=&quot;https://hub.docker.com/r/apache/kudu&quot;&gt;Docker Hub&lt;/a&gt;.&lt;/p&gt;</content><author><name>Grant Henke</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 1.10.0!
The new release adds several new features and improvements, including the
following:</summary></entry><entry><title>Location Awareness in Kudu</title><link href="/2019/04/30/location-awareness.html" rel="alternate" type="text/html" title="Location Awareness in Kudu" /><published>2019-04-30T00:00:00-07:00</published><updated>2019-04-30T00:00:00-07:00</updated><id>/2019/04/30/location-awareness</id><content type="html" xml:base="/2019/04/30/location-awareness.html">&lt;p&gt;This post is about location awareness in Kudu. It gives an overview
of the following:
- principles of the design
- restrictions of the current implementation
- potential future enhancements and extensions&lt;/p&gt;
&lt;!--more--&gt;
&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Kudu supports location awareness starting with the 1.9.0 release. The
initial implementation of location awareness in Kudu is built to satisfy the
following requirement:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In a Kudu cluster consisting of multiple servers spread over several racks,
place the replicas of a tablet in such a way that the tablet stays available
even if all the servers in a single rack become unavailable.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A rack failure can occur when a hardware component shared among servers in the
rack, such as a network switch or power supply, fails. More generally,
replace ‘rack’ with any other aggregation of nodes (e.g., chassis, site,
cloud availability zone, etc.) where some or all nodes in an aggregate become
unavailable in case of a failure. This can even apply to an entire datacenter
if the network latency between datacenters is low. This is why we call the feature
&lt;em&gt;location awareness&lt;/em&gt; and not &lt;em&gt;rack awareness&lt;/em&gt;.&lt;/p&gt;
&lt;h1 id=&quot;locations-in-kudu&quot;&gt;Locations in Kudu&lt;/h1&gt;
&lt;p&gt;In Kudu, a location is defined by a string that begins with a slash (&lt;code&gt;/&lt;/code&gt;) and
consists of slash-separated tokens each of which contains only characters from
the set &lt;code&gt;[a-zA-Z0-9_-.]&lt;/code&gt;. The components of the location string hierarchy
should correspond to the physical or cloud-defined hierarchy of the deployed
cluster, e.g. &lt;code&gt;/data-center-0/rack-09&lt;/code&gt; or &lt;code&gt;/region-0/availability-zone-01&lt;/code&gt;.&lt;/p&gt;
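&lt;p&gt;For illustration only, the format described above can be approximated with a
small shell check (a sketch; the &lt;code&gt;is_valid_location&lt;/code&gt; helper is hypothetical and
not part of Kudu, which performs its own validation of location strings):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/sh
# Hypothetical helper: succeeds if the argument looks like a Kudu location
# string: a leading slash followed by slash-separated tokens made of
# characters from the set [a-zA-Z0-9_.-].
is_valid_location() {
echo &quot;$1&quot; | grep -Eq &#39;^(/[a-zA-Z0-9_.-]+)+$&#39;
}
is_valid_location &quot;/data-center-0/rack-09&quot; &amp;&amp; echo &quot;valid&quot;
is_valid_location &quot;no-leading-slash&quot; || echo &quot;invalid&quot;
&lt;/code&gt;&lt;/pre&gt;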
&lt;p&gt;The design choice of using hierarchical paths for location strings is
partially influenced by HDFS. The intention was to make it possible to reuse
the same locations as for existing HDFS nodes, because it’s common to deploy
Kudu alongside HDFS. In addition, the hierarchical structure of location
strings allows for interpreting locations in terms of common ancestry and
relative proximity. As of now, Kudu does not exploit the hierarchical
structure of the location except for the client’s logic to find the closest
tablet server. However, we plan to leverage the hierarchical structure
in future releases.&lt;/p&gt;
&lt;h1 id=&quot;defining-and-assigning-locations&quot;&gt;Defining and assigning locations&lt;/h1&gt;
&lt;p&gt;Kudu masters assign locations to tablet servers and clients.&lt;/p&gt;
&lt;p&gt;Every Kudu master runs the location assignment procedure to assign a location
to a tablet server when it registers. To determine the location for a tablet
server, the master invokes an executable that takes the IP address or hostname
of the tablet server and outputs the corresponding location string. If the
executable exits with a non-zero status, that’s interpreted as an error, and
the master logs a corresponding error message. In the case of a tablet server
registration, such an outcome is deemed a registration failure, and the tablet
server is not added to the master’s registry. That renders the tablet
server unusable to Kudu clients, since non-registered tablet servers are not
discoverable to Kudu clients via the &lt;code&gt;GetTableLocations&lt;/code&gt; RPC.&lt;/p&gt;
&lt;p&gt;The master associates the produced location string with the registered tablet
server and keeps it until the tablet server re-registers, which only occurs
if the master or tablet server restarts. Masters use the assigned location
information internally to make replica placement decisions, trying to place
replicas evenly across locations and to keep tablets available in case all
tablet servers in a single location fail (see
&lt;a href=&quot;https://s.apache.org/location-awareness-design&quot;&gt;the design document&lt;/a&gt;
for details). In addition, masters provide connected clients with
the information on the client’s assigned location, so the clients can make
informed decisions when they attempt to read from the closest tablet server.
Kudu tablet servers themselves are location agnostic, at least for now,
so the assigned location is not reported back to a registered tablet server.&lt;/p&gt;
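&lt;p&gt;To verify which locations the masters have assigned, one can list the
registered tablet servers with the &lt;code&gt;kudu&lt;/code&gt; CLI tool. A sketch (the master RPC
endpoints below are placeholders, and the exact set of supported columns may
vary between Kudu versions; consult the CLI help for your release):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# List registered tablet servers together with their assigned locations.
kudu tserver list master-1:7051,master-2:7051,master-3:7051 \
-columns=uuid,rpc-addresses,location
&lt;/code&gt;&lt;/pre&gt;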
&lt;h1 id=&quot;the-location-aware-placement-policy-for-tablet-replicas-in-kudu&quot;&gt;The location-aware placement policy for tablet replicas in Kudu&lt;/h1&gt;
&lt;p&gt;While placing replicas of tablets in a location-aware cluster, Kudu uses a
best-effort approach to adhere to the following principle:
- Spread replicas across locations so that the failure of tablet servers
in one location does not make tablets unavailable.&lt;/p&gt;
&lt;p&gt;That’s referred to as the &lt;em&gt;replica placement policy&lt;/em&gt; or just &lt;em&gt;placement policy&lt;/em&gt;.
In Kudu, both the initial placement of tablet replicas and the automatic
re-replication are governed by that policy. As of now, that’s the only
replica placement policy available in Kudu. The placement policy isn’t
customizable and doesn’t have any configurable parameters.&lt;/p&gt;
&lt;h1 id=&quot;automatic-re-replication-and-placement-policy&quot;&gt;Automatic re-replication and placement policy&lt;/h1&gt;
&lt;p&gt;By design, keeping the target replication factor for tablets has higher
priority than conforming to the replica placement policy. In other words,
when bringing up tablet replicas to replace failed ones, Kudu uses a best-effort
approach with regard to conforming to the constraints of the placement policy.
Essentially, that means that if there isn’t a way to place a replica to conform
with the placement policy, the system places the replica anyway. The resulting
violation of the placement policy can be addressed later on when unreachable
tablet servers become available again or the misconfiguration is addressed.
As of now, to fix the resulting placement policy violations, it’s necessary
to run the CLI rebalancer tool manually (see below for details),
but in future releases that might be done &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2780&quot;&gt;automatically in the background&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&quot;an-example-of-location-aware-rebalancing&quot;&gt;An example of location-aware rebalancing&lt;/h1&gt;
&lt;p&gt;This section illustrates what happens during each phase of the location-aware
rebalancing process.&lt;/p&gt;
&lt;p&gt;In the diagrams below, the larger outer boxes denote locations, and the
smaller inner ones denote tablet servers. As for the real-world objects behind
locations in this example, one might think of server racks with a shared power
supply or a shared network switch. It’s assumed that no more than one tablet
server is run at each node (i.e. machine) in a rack.&lt;/p&gt;
&lt;p&gt;The first phase of the rebalancing process is about detecting violations and
reinstating the placement policy in the cluster. In the diagram below, there
are three locations defined: &lt;code&gt;/L0&lt;/code&gt;, &lt;code&gt;/L1&lt;/code&gt;, &lt;code&gt;/L2&lt;/code&gt;. Each location has two tablet
servers. Table &lt;code&gt;A&lt;/code&gt; has a replication factor of three (RF=3) and consists of
four tablets: &lt;code&gt;A0&lt;/code&gt;, &lt;code&gt;A1&lt;/code&gt;, &lt;code&gt;A2&lt;/code&gt;, &lt;code&gt;A3&lt;/code&gt;. Table &lt;code&gt;B&lt;/code&gt; has a replication factor of five
(RF=5) and consists of three tablets: &lt;code&gt;B0&lt;/code&gt;, &lt;code&gt;B1&lt;/code&gt;, &lt;code&gt;B2&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The distribution of the replicas for tablet &lt;code&gt;A0&lt;/code&gt; violates the placement policy.
Why? Because replicas &lt;code&gt;A0.0&lt;/code&gt; and &lt;code&gt;A0.1&lt;/code&gt; constitute the majority of replicas
(two out of three) and reside in the same location &lt;code&gt;/L0&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; /L0 /L1 /L2
+-------------------+ +-------------------+ +-------------------+
| TS0 TS1 | | TS2 TS3 | | TS4 TS5 |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
| | A0.0 | | A0.1 | | | | A0.2 | | | | | | | | | |
| | | | A1.0 | | | | A1.1 | | | | | | A1.2 | | | |
| | | | A2.0 | | | | A2.1 | | | | | | A2.2 | | | |
| | | | A3.0 | | | | A3.1 | | | | | | A3.2 | | | |
| | B0.0 | | B0.1 | | | | B0.2 | | B0.3 | | | | B0.4 | | | |
| | B1.0 | | B1.1 | | | | B1.2 | | B1.3 | | | | B1.4 | | | |
| | B2.0 | | B2.1 | | | | B2.2 | | B2.3 | | | | B2.4 | | | |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
+-------------------+ +-------------------+ +-------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The location-aware rebalancer should initiate movement of either &lt;code&gt;A0.0&lt;/code&gt; or
&lt;code&gt;A0.1&lt;/code&gt; from &lt;code&gt;/L0&lt;/code&gt; to another location, so the resulting replica distribution would
&lt;em&gt;not&lt;/em&gt; contain the majority of replicas in any single location. In addition to
that, the rebalancer tool tries to evenly spread the load across all locations
and tablet servers within each location. The latter narrows down the list
of the candidate replicas to move: &lt;code&gt;A0.1&lt;/code&gt; is the best candidate to move from
location &lt;code&gt;/L0&lt;/code&gt;, so that location &lt;code&gt;/L0&lt;/code&gt; would not contain the majority of replicas
for tablet &lt;code&gt;A0&lt;/code&gt;. The same principle dictates the target location and the target
tablet server to receive &lt;code&gt;A0.1&lt;/code&gt;: that should be tablet server &lt;code&gt;TS5&lt;/code&gt; in
location &lt;code&gt;/L2&lt;/code&gt;. The resulting distribution of the tablet replicas after the move
is represented in the diagram below.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; /L0 /L1 /L2
+-------------------+ +-------------------+ +-------------------+
| TS0 TS1 | | TS2 TS3 | | TS4 TS5 |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
| | A0.0 | | | | | | A0.2 | | | | | | | | A0.1 | |
| | | | A1.0 | | | | A1.1 | | | | | | A1.2 | | | |
| | | | A2.0 | | | | A2.1 | | | | | | A2.2 | | | |
| | | | A3.0 | | | | A3.1 | | | | | | A3.2 | | | |
| | B0.0 | | B0.1 | | | | B0.2 | | B0.3 | | | | B0.4 | | | |
| | B1.0 | | B1.1 | | | | B1.2 | | B1.3 | | | | B1.4 | | | |
| | B2.0 | | B2.1 | | | | B2.2 | | B2.3 | | | | B2.4 | | | |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
+-------------------+ +-------------------+ +-------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The second phase of the location-aware rebalancing is about moving tablet
replicas across locations to make the locations’ load more balanced. For the
number &lt;code&gt;S&lt;/code&gt; of tablet servers in a location and the total number &lt;code&gt;R&lt;/code&gt; of replicas
in the location, the &lt;em&gt;load of the location&lt;/em&gt; is defined as &lt;code&gt;R/S&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;At this stage all violations of the placement policy are already rectified. The
rebalancer tool doesn’t attempt to make any moves which would violate the
placement policy.&lt;/p&gt;
&lt;p&gt;The load of the locations in the diagram above:
- &lt;code&gt;/L0&lt;/code&gt;: 10/2 = 5
- &lt;code&gt;/L1&lt;/code&gt;: 10/2 = 5
- &lt;code&gt;/L2&lt;/code&gt;: 7/2 = 3.5&lt;/p&gt;
&lt;p&gt;A possible distribution of the tablet replicas after the second phase is
represented below. The resulting load of each location:
- &lt;code&gt;/L0&lt;/code&gt;: 9/2 = 4.5
- &lt;code&gt;/L1&lt;/code&gt;: 9/2 = 4.5
- &lt;code&gt;/L2&lt;/code&gt;: 9/2 = 4.5&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; /L0 /L1 /L2
+-------------------+ +-------------------+ +-------------------+
| TS0 TS1 | | TS2 TS3 | | TS4 TS5 |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
| | A0.0 | | | | | | A0.2 | | | | | | | | A0.1 | |
| | | | A1.0 | | | | A1.1 | | | | | | A1.2 | | | |
| | | | A2.0 | | | | A2.1 | | | | | | A2.2 | | | |
| | | | A3.0 | | | | A3.1 | | | | | | A3.2 | | | |
| | B0.0 | | | | | | B0.2 | | B0.3 | | | | B0.4 | | B0.1 | |
| | B1.0 | | B1.1 | | | | | | B1.3 | | | | B1.4 | | B1.2 | |
| | B2.0 | | B2.1 | | | | B2.2 | | B2.3 | | | | B2.4 | | | |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
+-------------------+ +-------------------+ +-------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The third phase of the location-aware rebalancing is about moving tablet
replicas within each location to make the distribution of replicas even,
both per-table and per-server.&lt;/p&gt;
&lt;p&gt;See below for a possible replica distribution in the example scenario
after the third phase of the location-aware rebalancing successfully completes.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; /L0 /L1 /L2
+-------------------+ +-------------------+ +-------------------+
| TS0 TS1 | | TS2 TS3 | | TS4 TS5 |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
| | A0.0 | | | | | | | | A0.2 | | | | | | A0.1 | |
| | | | A1.0 | | | | A1.1 | | | | | | A1.2 | | | |
| | | | A2.0 | | | | A2.1 | | | | | | A2.2 | | | |
| | | | A3.0 | | | | A3.1 | | | | | | A3.2 | | | |
| | B0.0 | | | | | | B0.2 | | B0.3 | | | | B0.4 | | B0.1 | |
| | B1.0 | | B1.1 | | | | | | B1.3 | | | | B1.4 | | B1.2 | |
| | B2.0 | | B2.1 | | | | B2.2 | | B2.3 | | | | | | B2.4 | |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
+-------------------+ +-------------------+ +-------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;h1 id=&quot;how-to-make-a-kudu-cluster-location-aware&quot;&gt;How to make a Kudu cluster location-aware&lt;/h1&gt;
&lt;p&gt;To make a Kudu cluster location-aware, it’s necessary to set the
&lt;code&gt;--location_mapping_cmd&lt;/code&gt; flag for Kudu master(s) and make the corresponding
executable (binary or a script) available at the nodes where Kudu masters run.
In the case of multiple masters, it’s important to make sure that the location
mappings stay the same regardless of the node where the location assignment
command runs.&lt;/p&gt;
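&lt;p&gt;For example, the relevant portion of the master configuration might look like
the following (the script path is a placeholder for an actual location
assignment executable deployed on the master nodes):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# In the Kudu master&#39;s flags (e.g. in the flag file used by the service):
--location_mapping_cmd=/etc/kudu/bin/assign_location.sh
&lt;/code&gt;&lt;/pre&gt;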
&lt;p&gt;It’s recommended to have at least three locations defined in a Kudu
cluster so that no location contains a majority of tablet replicas.
With two locations or less it’s not possible to spread replicas
of tablets with replication factor of three and higher such that no location
contains a majority of replicas.&lt;/p&gt;
&lt;p&gt;For example, when running a Kudu cluster in a single datacenter &lt;code&gt;dc0&lt;/code&gt;, assign
location &lt;code&gt;/dc0/rack0&lt;/code&gt; to tablet servers running at machines in the rack &lt;code&gt;rack0&lt;/code&gt;,
&lt;code&gt;/dc0/rack1&lt;/code&gt; to tablet servers running at machines in the rack &lt;code&gt;rack1&lt;/code&gt;,
and &lt;code&gt;/dc0/rack2&lt;/code&gt; to tablet servers running at machines in the rack &lt;code&gt;rack2&lt;/code&gt;.
In a similar way, when running in the cloud, assign location &lt;code&gt;/regionA/az0&lt;/code&gt;
to tablet servers running in availability zone &lt;code&gt;az0&lt;/code&gt; of region &lt;code&gt;regionA&lt;/code&gt;,
and &lt;code&gt;/regionA/az1&lt;/code&gt; to tablet servers running in zone &lt;code&gt;az1&lt;/code&gt; of the same region.&lt;/p&gt;
&lt;h1 id=&quot;an-example-of-location-assignment-script-for-kudu&quot;&gt;An example of location assignment script for Kudu&lt;/h1&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/sh
#
# It&#39;s assumed a Kudu cluster consists of nodes with IPv4 addresses in the
# private 192.168.100.0/24 subnet. The nodes are hosted in racks, where
# each rack can contain at most 32 nodes. This results in 8 locations,
# one location per rack.
#
# This example script maps IP addresses into locations assuming that RPC
# endpoints of tablet servers are specified via IPv4 addresses. If tablet
# servers&#39; RPC endpoints are specified using DNS hostnames (and that&#39;s how
# it&#39;s done by default), the script should consume DNS hostname instead of
# an IP address as an input parameter. Check the `--rpc_bind_addresses` and
# `--rpc_advertised_addresses` command line flags of kudu-tserver for details.
#
# DISCLAIMER:
# This is an example Bourne shell script for Kudu location assignment. Please
# note it&#39;s just a toy script created with illustrative-only purpose.
# The error handling and the input validation are minimalistic. Also, the
# network topology choice, supportability and capacity planning aspects of
# this script might be sub-optimal if applied as-is for real-world use cases.
set -e
if [ $# -ne 1 ]; then
echo &quot;usage: $0 &amp;lt;ip_address&amp;gt;&quot;
exit 1
fi
ip_address=$1
shift
suffix=${ip_address##192.168.100.}
if [ -z &quot;${suffix##*.*}&quot; ]; then
# An IP address from a non-controlled subnet: maps into the &#39;other&#39; location.
echo &quot;/other&quot;
exit 0
fi
# The mapping of the IP addresses
if [ -z &quot;$suffix&quot; -o $suffix -lt 0 -o $suffix -gt 255 ]; then
echo &quot;ERROR: &#39;$ip_address&#39; is not a valid IPv4 address&quot;
exit 2
fi
if [ $suffix -eq 0 -o $suffix -eq 255 ]; then
echo &quot;ERROR: &#39;$ip_address&#39; is a subnet or broadcast address&quot;
exit 3
fi
if [ $suffix -lt 32 ]; then
echo &quot;/dc0/rack00&quot;
elif [ $suffix -ge 32 -a $suffix -lt 64 ]; then
echo &quot;/dc0/rack01&quot;
elif [ $suffix -ge 64 -a $suffix -lt 96 ]; then
echo &quot;/dc0/rack02&quot;
elif [ $suffix -ge 96 -a $suffix -lt 128 ]; then
echo &quot;/dc0/rack03&quot;
elif [ $suffix -ge 128 -a $suffix -lt 160 ]; then
echo &quot;/dc0/rack04&quot;
elif [ $suffix -ge 160 -a $suffix -lt 192 ]; then
echo &quot;/dc0/rack05&quot;
elif [ $suffix -ge 192 -a $suffix -lt 224 ]; then
echo &quot;/dc0/rack06&quot;
else
echo &quot;/dc0/rack07&quot;
fi
&lt;/code&gt;&lt;/pre&gt;
&lt;h1 id=&quot;reinstating-the-placement-policy-in-a-location-aware-kudu-cluster&quot;&gt;Reinstating the placement policy in a location-aware Kudu cluster&lt;/h1&gt;
&lt;p&gt;As explained earlier, even if the initial placement of tablet replicas conforms
to the placement policy, the cluster might get to a point where there are not
enough tablet servers to place a new or a replacement replica. Ideally, such
situations should be handled automatically: once there are enough tablet servers
in the cluster or the misconfiguration is fixed, the placement policy should
be reinstated. Currently, it’s possible to reinstate the placement policy using
the &lt;code&gt;kudu&lt;/code&gt; CLI tool:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;sudo -u kudu kudu cluster rebalance &amp;lt;master_rpc_endpoints&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;In the first phase, the location-aware rebalancing process tries to
reestablish the placement policy. If that’s not possible, the tool
terminates. Use the &lt;code&gt;--disable_policy_fixer&lt;/code&gt; flag to skip this phase and
continue to the cross-location rebalancing phase.&lt;/p&gt;
&lt;p&gt;The second phase is cross-location rebalancing, i.e. moving tablet replicas
between different locations in an attempt to spread tablet replicas among
locations evenly, equalizing the load of the locations throughout the cluster.
If the benefits of spreading the load among locations do not justify the cost
of the cross-location replica movement, the tool can be instructed to skip the
second phase of the location-aware rebalancing. Use the
&lt;code&gt;--disable_cross_location_rebalancing&lt;/code&gt; command line flag for that.&lt;/p&gt;
&lt;p&gt;The third phase is intra-location rebalancing, i.e. balancing the distribution
of tablet replicas within each location as if each location is a cluster on its
own. Use the &lt;code&gt;--disable_intra_location_rebalancing&lt;/code&gt; flag to skip this phase.&lt;/p&gt;
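&lt;p&gt;For instance, to run only the policy-reinstating and intra-location phases,
skipping the cross-location moves, invoke the tool as follows (the master RPC
endpoints below are placeholders for an actual deployment):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sudo -u kudu kudu cluster rebalance \
master-1:7051,master-2:7051,master-3:7051 \
--disable_cross_location_rebalancing
&lt;/code&gt;&lt;/pre&gt;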
&lt;h1 id=&quot;future-work&quot;&gt;Future work&lt;/h1&gt;
&lt;p&gt;Having a CLI tool to reinstate placement policy is nice, but it would be great
to run the location-aware rebalancing in background, automatically reinstating
the placement policy and making tablet replica distribution even
across a Kudu cluster.&lt;/p&gt;
&lt;p&gt;In addition to that, there is an idea to make it possible to have
multiple customizable placement policies in the system. As of now, there is
a request to implement so-called ‘table pinning’, i.e. making it possible
to specify a placement policy where replicas of tablets of particular tables
are placed only at nodes within the specified locations. The table pinning
request is tracked by KUDU-2604 in Apache JIRA, see
&lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2604&quot;&gt;KUDU-2604&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&quot;references&quot;&gt;References&lt;/h1&gt;
&lt;p&gt;[1] Location awareness in Kudu: &lt;a href=&quot;https://github.com/apache/kudu/blob/master/docs/design-docs/location-awareness.md&quot;&gt;design document&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;[2] A proposal for Kudu tablet server labeling: &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2604&quot;&gt;KUDU-2604&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;[3] Further improvement: &lt;a href=&quot;https://issues.apache.org/jira/browse/KUDU-2780&quot;&gt;automatic cluster rebalancing&lt;/a&gt;.&lt;/p&gt;</content><author><name>Alexey Serbin</name></author><summary>This post is about location awareness in Kudu. It gives an overview
of the following:
- principles of the design
- restrictions of the current implementation
- potential future enhancements and extensions</summary></entry><entry><title>Fine-Grained Authorization with Apache Kudu and Impala</title><link href="/2019/04/22/fine-grained-authorization-with-apache-kudu-and-impala.html" rel="alternate" type="text/html" title="Fine-Grained Authorization with Apache Kudu and Impala" /><published>2019-04-22T00:00:00-07:00</published><updated>2019-04-22T00:00:00-07:00</updated><id>/2019/04/22/fine-grained-authorization-with-apache-kudu-and-impala</id><content type="html" xml:base="/2019/04/22/fine-grained-authorization-with-apache-kudu-and-impala.html">&lt;p&gt;Note: This is a cross-post from the Cloudera Engineering Blog
&lt;a href=&quot;https://blog.cloudera.com/blog/2019/04/fine-grained-authorization-with-apache-kudu-and-impala/&quot;&gt;Fine-Grained Authorization with Apache Kudu and Impala&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Apache Impala supports fine-grained authorization via Apache Sentry on all of the tables it
manages, including Apache Kudu tables. Given that Impala is a very common way to access the data stored
in Kudu, this capability allows users deploying Impala and Kudu to fully secure the Kudu data in
multi-tenant clusters even though Kudu does not yet have native fine-grained authorization of its
own. This solution works because Kudu natively supports coarse-grained (all or nothing)
authorization which enables blocking all access to Kudu directly except for the impala user and
an optional whitelist of other trusted users. This post will describe how to use Apache Impala’s
fine-grained authorization support along with Apache Kudu’s coarse-grained authorization to
achieve a secure multi-tenant deployment.&lt;/p&gt;
&lt;!--more--&gt;
&lt;h2 id=&quot;sample-workflow&quot;&gt;Sample Workflow&lt;/h2&gt;
&lt;p&gt;The examples in this post enable a workflow that uses Apache Spark to ingest data directly into
Kudu and Impala to run analytic queries on that data. The Spark job, run as the &lt;code&gt;etl_service&lt;/code&gt; user,
is permitted to access the Kudu data via coarse-grained authorization. Even though this gives
access to all the data in Kudu, the &lt;code&gt;etl_service&lt;/code&gt; user is only used for scheduled jobs or by an
administrator. All queries on the data, from a wide array of users, will use Impala and leverage
Impala’s fine-grained authorization. Impala’s
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_grant.html&quot;&gt;&lt;code&gt;GRANT&lt;/code&gt; statements&lt;/a&gt;
allow you to flexibly control the privileges on the Kudu storage tables. Impala’s fine-grained
privileges along with support for
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_select.html&quot;&gt;&lt;code&gt;SELECT&lt;/code&gt;&lt;/a&gt;,
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_insert.html&quot;&gt;&lt;code&gt;INSERT&lt;/code&gt;&lt;/a&gt;,
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_update.html&quot;&gt;&lt;code&gt;UPDATE&lt;/code&gt;&lt;/a&gt;,
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_upsert.html&quot;&gt;&lt;code&gt;UPSERT&lt;/code&gt;&lt;/a&gt;,
and &lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_delete.html&quot;&gt;&lt;code&gt;DELETE&lt;/code&gt;&lt;/a&gt;
statements, allow you to finely control who can read and write data to your Kudu tables while
using Impala. Below is a diagram showing the workflow described:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/fine-grained-authorization-with-apache-kudu.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: The examples below assume that Authorization has already been configured for Kudu, Impala,
and Spark. For help configuring authorization see the Cloudera
&lt;a href=&quot;https://www.cloudera.com/documentation/enterprise/latest/topics/sg_auth_overview.html&quot;&gt;authorization documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;configuring-kudus-coarse-grained-authorization&quot;&gt;Configuring Kudu’s Coarse-Grained Authorization&lt;/h2&gt;
&lt;p&gt;Kudu supports coarse-grained authorization of client requests based on the authenticated client
Kerberos principal. The two levels of access which can be configured are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Superuser&lt;/em&gt; – principals authorized as a superuser are able to perform certain administrative
functionality such as using the kudu command line tool to diagnose or repair cluster issues.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;User&lt;/em&gt; – principals authorized as a user are able to access and modify all data in the Kudu
cluster. This includes the ability to create, drop, and alter tables as well as read, insert,
update, and delete data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Access levels are granted using whitelist-style Access Control Lists (ACLs), one for each of the
two levels. Each access control list either specifies a comma-separated list of users, or may be
set to &lt;code&gt;*&lt;/code&gt; to indicate that all authenticated users are able to gain access at the specified level.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: The default value for the User ACL is &lt;code&gt;*&lt;/code&gt;, which allows all users access to the cluster.&lt;/p&gt;
&lt;h3 id=&quot;example-configuration&quot;&gt;Example Configuration&lt;/h3&gt;
&lt;p&gt;The first and most important step is to remove the default ACL of &lt;code&gt;*&lt;/code&gt; from Kudu’s
&lt;a href=&quot;https://kudu.apache.org/docs/configuration_reference.html#kudu-master_user_acl&quot;&gt;&lt;code&gt;--user_acl&lt;/code&gt; configuration&lt;/a&gt;.
This will ensure only the users you list will have access to the Kudu cluster. Then, to allow the
Impala service to access all of the data in Kudu, the Impala service user, usually impala, should
be added to the Kudu &lt;code&gt;–user_acl&lt;/code&gt; configuration. Any user that is not using Impala will also need
to be added to this list. For example, an Apache Spark job might be used to load data directly
into Kudu. Generally, a single user is used to run scheduled jobs of applications that do not
support fine-grained authorization on their own. For this example, that user is &lt;code&gt;etl_service&lt;/code&gt;. The
full &lt;code&gt;–user_acl&lt;/code&gt; configuration is:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;--user_acl&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;impala,etl_service&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For more details see the Kudu
&lt;a href=&quot;https://kudu.apache.org/docs/security.html#_coarse_grained_authorization&quot;&gt;authorization documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;using-impalas-fine-grained-authorization&quot;&gt;Using Impala’s Fine-Grained Authorization&lt;/h2&gt;
&lt;p&gt;Follow Impala’s
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_authorization.html&quot;&gt;authorization documentation&lt;/a&gt;
to configure fine-grained authorization. Once configured, you can use Impala’s
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_grant.html&quot;&gt;&lt;code&gt;GRANT&lt;/code&gt; statements&lt;/a&gt;
to control privileges on Kudu tables. These fine-grained privileges can be set at the database,
table, and column level. Additionally, you can individually control the &lt;code&gt;SELECT&lt;/code&gt;, &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;CREATE&lt;/code&gt;,
&lt;code&gt;ALTER&lt;/code&gt;, and &lt;code&gt;DROP&lt;/code&gt; privileges.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: A user needs the &lt;code&gt;ALL&lt;/code&gt; privilege in order to run &lt;code&gt;DELETE&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, or &lt;code&gt;UPSERT&lt;/code&gt;
statements against a Kudu table.&lt;/p&gt;
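&lt;p&gt;The note above can be summarized as a simple mapping from statement type to required privilege. The sketch below is an illustrative encoding of that rule, not Impala’s implementation:&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative mapping of the privilege each SQL statement on a Kudu table
// requires: read-modify-write statements (DELETE, UPDATE, UPSERT) need ALL.
public class RequiredPrivilege {
    static final Map<String, String> REQUIRED = new HashMap<>();
    static {
        for (String p : new String[] {"SELECT", "INSERT", "CREATE", "ALTER", "DROP"}) {
            REQUIRED.put(p, p);  // each statement needs its same-named privilege
        }
        for (String p : new String[] {"DELETE", "UPDATE", "UPSERT"}) {
            REQUIRED.put(p, "ALL");  // read-modify-write statements need ALL
        }
    }

    public static void main(String[] args) {
        System.out.println(REQUIRED.get("UPSERT"));  // prints: ALL
        System.out.println(REQUIRED.get("SELECT"));  // prints: SELECT
    }
}
```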
&lt;p&gt;Below is a brief example with a couple of tables stored in Kudu:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;messages&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STRING&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TIMESTAMP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;message&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STRING&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;KEY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HASH&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PARTITIONS&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;KUDU&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;GRANT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ALL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;messages&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TO&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;userA&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;host&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STRING&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;metric&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STRING&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;INT64&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DOUBLE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;KEY&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;host&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metric&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HASH&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;host&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PARTITIONS&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;KUDU&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;GRANT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ALL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TO&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;userB&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This brief example, combining Kudu’s coarse-grained authorization with Impala’s fine-grained
authorization, should enable you to meet the security needs of your data workflow today. The
pattern described here can be applied to other services and workflows that use Kudu as well. For
greater authorization flexibility, look forward to the near future, when Kudu will support
native fine-grained authorization of its own: the Apache Kudu contributors understand the
importance of native fine-grained authorization and are working on integrations with
Apache Sentry and Apache Ranger.&lt;/p&gt;</content><author><name>Grant Henke</name></author><summary>Note: This is a cross-post from the Cloudera Engineering Blog
Fine-Grained Authorization with Apache Kudu and Impala
Apache Impala supports fine-grained authorization via Apache Sentry on all of the tables it
manages including Apache Kudu tables. Given Impala is a very common way to access the data stored
in Kudu, this capability allows users deploying Impala and Kudu to fully secure the Kudu data in
multi-tenant clusters even though Kudu does not yet have native fine-grained authorization of its
own. This solution works because Kudu natively supports coarse-grained (all or nothing)
authorization which enables blocking all access to Kudu directly except for the impala user and
an optional whitelist of other trusted users. This post will describe how to use Apache Impala’s
fine-grained authorization support along with Apache Kudu’s coarse-grained authorization to
achieve a secure multi-tenant deployment.</summary></entry><entry><title>Testing Apache Kudu Applications on the JVM</title><link href="/2019/03/19/testing-apache-kudu-applications-on-the-jvm.html" rel="alternate" type="text/html" title="Testing Apache Kudu Applications on the JVM" /><published>2019-03-19T00:00:00-07:00</published><updated>2019-03-19T00:00:00-07:00</updated><id>/2019/03/19/testing-apache-kudu-applications-on-the-jvm</id><content type="html" xml:base="/2019/03/19/testing-apache-kudu-applications-on-the-jvm.html">&lt;p&gt;Note: This is a cross-post from the Cloudera Engineering Blog
&lt;a href=&quot;https://blog.cloudera.com/blog/2019/03/testing-apache-kudu-applications-on-the-jvm/&quot;&gt;Testing Apache Kudu Applications on the JVM&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Although the Kudu server is written in C++ for performance and efficiency, developers can write
client applications in C++, Java, or Python. To make it easier for Java developers to create
reliable client applications, we’ve added new utilities in Kudu 1.9.0 that allow you to write tests
using a Kudu cluster without needing to build Kudu yourself, without any knowledge of C++,
and without any complicated coordination around starting and stopping Kudu clusters for each test.
This post describes how the new testing utilities work and how you can use them in your application
tests.&lt;/p&gt;
&lt;!--more--&gt;
&lt;h2 id=&quot;user-guide&quot;&gt;User Guide&lt;/h2&gt;
&lt;p&gt;Note: This blog post may become outdated; for the latest documentation on using
the JVM testing utilities, see the
&lt;a href=&quot;https://kudu.apache.org/docs/developing.html#_jvm_based_integration_testing&quot;&gt;Kudu documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;requirements&quot;&gt;Requirements&lt;/h3&gt;
&lt;p&gt;In order to use the new testing utilities, the following requirements must be met:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;OS
&lt;ul&gt;
&lt;li&gt;macOS El Capitan (10.11) or later&lt;/li&gt;
&lt;li&gt;CentOS 6.6+, Ubuntu 14.04+, or another recent distribution of Linux
&lt;a href=&quot;https://kudu.apache.org/docs/installation.html#_prerequisites_and_requirements&quot;&gt;supported by Kudu&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;JVM
&lt;ul&gt;
&lt;li&gt;Java 8+&lt;/li&gt;
&lt;li&gt;Note: Java 7 is deprecated, but still supported&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Build Tool
&lt;ul&gt;
&lt;li&gt;Maven 3.1 or later, required to support the
&lt;a href=&quot;https://github.com/trustin/os-maven-plugin&quot;&gt;os-maven-plugin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Gradle 2.1 or later, required to support the
&lt;a href=&quot;https://github.com/google/osdetector-gradle-plugin&quot;&gt;osdetector-gradle-plugin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Any other build tool that can download the correct jar from Maven&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;build-configuration&quot;&gt;Build Configuration&lt;/h3&gt;
&lt;p&gt;In order to use the Kudu testing utilities, add two dependencies to your classpath:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;kudu-test-utils&lt;/code&gt; dependency&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;kudu-binary&lt;/code&gt; dependency&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;kudu-test-utils&lt;/code&gt; dependency has useful utilities for testing applications that use Kudu.
Primarily, it provides the
&lt;a href=&quot;https://github.com/apache/kudu/blob/master/java/kudu-test-utils/src/main/java/org/apache/kudu/test/KuduTestHarness.java&quot;&gt;KuduTestHarness class&lt;/a&gt;
to manage the lifecycle of a Kudu cluster for each test. The &lt;code&gt;KuduTestHarness&lt;/code&gt; is a
&lt;a href=&quot;https://junit.org/junit4/javadoc/4.12/org/junit/rules/TestRule.html&quot;&gt;JUnit TestRule&lt;/a&gt;
that not only starts and stops a Kudu cluster for each test, but also has methods to manage the
cluster and get pre-configured &lt;code&gt;KuduClient&lt;/code&gt; instances for use while testing.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;kudu-binary&lt;/code&gt; dependency contains the native Kudu (server and command-line tool) binaries for
the specified operating system. To download the right artifact for the running operating
system, it is easiest to use a plugin, such as the
&lt;a href=&quot;https://github.com/trustin/os-maven-plugin&quot;&gt;os-maven-plugin&lt;/a&gt; or
&lt;a href=&quot;https://github.com/google/osdetector-gradle-plugin&quot;&gt;osdetector-gradle-plugin&lt;/a&gt;, to detect the
current runtime environment. The &lt;code&gt;KuduTestHarness&lt;/code&gt; will automatically find and use the &lt;code&gt;kudu-binary&lt;/code&gt;
jar on the classpath.&lt;/p&gt;
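&lt;p&gt;As a rough illustration of what these detector plugins do, the sketch below derives a platform classifier such as &lt;code&gt;linux-x86_64&lt;/code&gt; from the JVM’s &lt;code&gt;os.name&lt;/code&gt; and &lt;code&gt;os.arch&lt;/code&gt; system properties. This is an approximation for explanatory purposes; the real plugins normalize many more operating system and architecture variants.&lt;/p&gt;

```java
// Rough approximation of how plugins like os-maven-plugin derive a platform
// classifier (e.g. "linux-x86_64") from JVM system properties.
public class Classifier {
    static String classifier(String osName, String osArch) {
        String os = osName.toLowerCase();
        String normalizedOs = os.contains("mac") ? "osx"
                : os.contains("win") ? "windows"
                : "linux";
        // The JVM reports 64-bit x86 as "amd64" on Linux; normalize it.
        String normalizedArch = osArch.equals("amd64") ? "x86_64" : osArch;
        return normalizedOs + "-" + normalizedArch;
    }

    public static void main(String[] args) {
        System.out.println(classifier(System.getProperty("os.name"),
                                      System.getProperty("os.arch")));
    }
}
```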
&lt;p&gt;WARNING: The &lt;code&gt;kudu-binary&lt;/code&gt; module should only be used to run Kudu for integration testing purposes.
It should never be used to run an actual Kudu service, in production or development, because the
&lt;code&gt;kudu-binary&lt;/code&gt; module includes native security-related dependencies that have been copied from the
build system and will not be patched when the operating system on the runtime host is patched.&lt;/p&gt;
&lt;h4 id=&quot;maven-configuration&quot;&gt;Maven Configuration&lt;/h4&gt;
&lt;p&gt;If you are using Maven to build your project, add the following entries to your project’s
&lt;code&gt;pom.xml&lt;/code&gt; file:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-xml&quot; data-lang=&quot;xml&quot;&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;build&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;extensions&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;&amp;lt;!-- Used to find the right kudu-binary artifact with the Maven&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt; property ${os.detected.classifier} --&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;extension&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;kr.motd.maven&lt;span class=&quot;nt&quot;&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;os-maven-plugin&lt;span class=&quot;nt&quot;&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.6.2&lt;span class=&quot;nt&quot;&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/extension&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/extensions&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/build&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;dependencies&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.apache.kudu&lt;span class=&quot;nt&quot;&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;kudu-test-utils&lt;span class=&quot;nt&quot;&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.9.0&lt;span class=&quot;nt&quot;&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;scope&amp;gt;&lt;/span&gt;test&lt;span class=&quot;nt&quot;&gt;&amp;lt;/scope&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.apache.kudu&lt;span class=&quot;nt&quot;&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;kudu-binary&lt;span class=&quot;nt&quot;&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.9.0&lt;span class=&quot;nt&quot;&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;classifier&amp;gt;&lt;/span&gt;${os.detected.classifier}&lt;span class=&quot;nt&quot;&gt;&amp;lt;/classifier&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;scope&amp;gt;&lt;/span&gt;test&lt;span class=&quot;nt&quot;&gt;&amp;lt;/scope&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/dependencies&amp;gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id=&quot;gradle-configuration&quot;&gt;Gradle Configuration&lt;/h4&gt;
&lt;p&gt;If you are using Gradle to build your project, add the following entries to your project’s
&lt;code&gt;build.gradle&lt;/code&gt; file:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-groovy&quot; data-lang=&quot;groovy&quot;&gt;&lt;span class=&quot;n&quot;&gt;plugins&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Used to find the right kudu-binary artifact with the Gradle&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// property ${osdetector.classifier}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&amp;quot;com.google.osdetector&amp;quot;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;version&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&amp;quot;1.6.2&amp;quot;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dependencies&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;testCompile&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&amp;quot;org.apache.kudu:kudu-test-utils:1.9.0&amp;quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;testCompile&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&amp;quot;org.apache.kudu:kudu-binary:1.9.0:${osdetector.classifier}&amp;quot;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id=&quot;test-setup&quot;&gt;Test Setup&lt;/h2&gt;
&lt;p&gt;Once your project is configured correctly, you can start writing tests using the &lt;code&gt;kudu-test-utils&lt;/code&gt;
and &lt;code&gt;kudu-binary&lt;/code&gt; artifacts. One line of code will ensure that each test automatically starts and
stops a real Kudu cluster and that cluster logging is output through &lt;code&gt;slf4j&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;nd&quot;&gt;@Rule&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;KuduTestHarness&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;harness&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;KuduTestHarness&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/apache/kudu/blob/master/java/kudu-test-utils/src/main/java/org/apache/kudu/test/KuduTestHarness.java&quot;&gt;KuduTestHarness&lt;/a&gt;
has methods to get pre-configured clients, start and stop servers, and more. Below is an example
test to showcase some of the capabilities:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.kudu.*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.kudu.client.*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.kudu.test.KuduTestHarness&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.junit.*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.util.Arrays&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.util.Collections&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.util.List&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MyKuduTest&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Rule&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;KuduTestHarness&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;harness&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;KuduTestHarness&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Test&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;test&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Get a KuduClient configured to talk to the running mini cluster.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;KuduClient&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;client&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;harness&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getClient&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Some of the other most common KuduTestHarness methods include:&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;AsyncKuduClient&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;asyncClient&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;harness&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getAsyncClient&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;masterAddresses&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;harness&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getMasterAddressesAsString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;List&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;HostAndPort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;masterServers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;harness&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getMasterServers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;List&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;HostAndPort&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tabletServers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;harness&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getTabletServers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;harness&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;killLeaderMasterServer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;harness&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;killAllMasterServers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;harness&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;startAllMasterServers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;harness&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;killAllTabletServers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;harness&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;startAllTabletServers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Create a new Kudu table.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tableName&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;myTable&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Schema&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;schema&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Schema&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Arrays&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;asList&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ColumnSchema&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;ColumnSchemaBuilder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;key&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;INT32&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(),&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ColumnSchema&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;ColumnSchemaBuilder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;value&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;STRING&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;CreateTableOptions&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;opts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;CreateTableOptions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;setRangePartitionColumns&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Collections&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;singletonList&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;key&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;client&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;createTable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tableName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;schema&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;opts&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;KuduTable&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;table&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;client&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;openTable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tableName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Write a few rows to the table&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;KuduSession&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;session&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;client&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;newSession&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Insert&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;insert&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;newInsert&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;PartialRow&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;row&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;insert&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getRow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;row&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;addInt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;key&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;row&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;addString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;value&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;valueOf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;session&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;insert&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;session&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;close&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// ... Continue the test. Read and validate the rows, alter the table, etc.&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For a complete example of a project using the &lt;code&gt;KuduTestHarness&lt;/code&gt;, see the
&lt;a href=&quot;https://github.com/apache/kudu/tree/master/examples/java/java-example&quot;&gt;java-example&lt;/a&gt; project in
the Kudu source code repository. The Kudu project itself uses the &lt;code&gt;KuduTestHarness&lt;/code&gt; for all of its
own integration tests. For more complex examples, you can explore the various
&lt;a href=&quot;https://github.com/apache/kudu/tree/master/java/kudu-client/src/test/java/org/apache/kudu/client&quot;&gt;Kudu integration&lt;/a&gt;
tests in the Kudu source code repository.&lt;/p&gt;
&lt;h2 id=&quot;feedback&quot;&gt;Feedback&lt;/h2&gt;
&lt;p&gt;Kudu 1.9.0 is the first release to have these testing utilities available. Although these utilities
simplify testing of Kudu applications, there is always room for improvement.
Please report any issues, ideas, or feedback to the Kudu user mailing list, Jira, or Slack channel,
and we will try to incorporate it quickly. See the
&lt;a href=&quot;https://kudu.apache.org/community.html&quot;&gt;Kudu community page&lt;/a&gt; for details.&lt;/p&gt;
&lt;h2 id=&quot;thank-you&quot;&gt;Thank You&lt;/h2&gt;
&lt;p&gt;We would like to give a special thank you to everyone who helped contribute to the &lt;code&gt;kudu-test-utils&lt;/code&gt;
and &lt;code&gt;kudu-binary&lt;/code&gt; artifacts. We would especially like to thank
&lt;a href=&quot;https://www.linkedin.com/in/brian-mcdevitt-1136a08/&quot;&gt;Brian McDevitt&lt;/a&gt; at &lt;a href=&quot;https://www.phdata.io/&quot;&gt;phData&lt;/a&gt;
and
&lt;a href=&quot;https://twitter.com/timrobertson100&quot;&gt;Tim Robertson&lt;/a&gt; at &lt;a href=&quot;https://www.gbif.org/&quot;&gt;GBIF&lt;/a&gt; who helped us
tremendously.&lt;/p&gt;</content><author><name>Grant Henke &amp; Mike Percy</name></author><summary>Note: This is a cross-post from the Cloudera Engineering Blog
Testing Apache Kudu Applications on the JVM
Although the Kudu server is written in C++ for performance and efficiency, developers can write
client applications in C++, Java, or Python. To make it easier for Java developers to create
reliable client applications, we’ve added new utilities in Kudu 1.9.0 that allow you to write tests
using a Kudu cluster without needing to build Kudu yourself, without any knowledge of C++,
and without any complicated coordination around starting and stopping Kudu clusters for each test.
This post describes how the new testing utilities work and how you can use them in your application
tests.</summary></entry><entry><title>Apache Kudu 1.9.0 Released</title><link href="/2019/03/15/apache-kudu-1-9-0-release.html" rel="alternate" type="text/html" title="Apache Kudu 1.9.0 Released" /><published>2019-03-15T00:00:00-07:00</published><updated>2019-03-15T00:00:00-07:00</updated><id>/2019/03/15/apache-kudu-1-9-0-release</id><content type="html" xml:base="/2019/03/15/apache-kudu-1-9-0-release.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.9.0!&lt;/p&gt;
&lt;p&gt;The new release adds several new features and improvements, including the
following:&lt;/p&gt;
&lt;!--more--&gt;
&lt;ul&gt;
&lt;li&gt;Added support for location awareness for placement of tablet replicas.&lt;/li&gt;
&lt;li&gt;Introduced docker scripts to facilitate building and running Kudu on various
operating systems.&lt;/li&gt;
&lt;li&gt;Introduced an experimental feature to allow users to run tests against a Kudu
mini cluster without having to first locally build or install Kudu.&lt;/li&gt;
&lt;li&gt;Updated the compaction policy to favor reducing the number of rowsets, which
can lead to significantly faster scans and bootup times in certain workloads.&lt;/li&gt;
&lt;li&gt;Multiple tooling enhancements have been made to improve visibility into Kudu
tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The above is just a list of the highlights. For a more complete list of new
features, improvements, and fixes, please refer to the &lt;a href=&quot;/releases/1.9.0/docs/release_notes.html&quot;&gt;release
notes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Apache Kudu project only publishes source code releases. To build Kudu
1.9.0, follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the Kudu &lt;a href=&quot;/releases/1.9.0&quot;&gt;1.9.0 source release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow the instructions in the documentation to build Kudu &lt;a href=&quot;/releases/1.9.0/docs/installation.html#build_from_source&quot;&gt;1.9.0 from
source&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For your convenience, binary JAR files for the Kudu Java client library, Spark
DataSource, Flume sink, and other Java integrations are published to the ASF
Maven repository and are &lt;a href=&quot;https://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.kudu%22%20AND%20v%3A%221.9.0%22&quot;&gt;now
available&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Python client source is also available on
&lt;a href=&quot;https://pypi.org/project/kudu-python/&quot;&gt;PyPI&lt;/a&gt;.&lt;/p&gt;</content><author><name>Andrew Wong</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 1.9.0!
The new release adds several new features and improvements, including the
following:</summary></entry><entry><title>Transparent Hierarchical Storage Management with Apache Kudu and Impala</title><link href="/2019/03/05/transparent-hierarchical-storage-management-with-apache-kudu-and-impala.html" rel="alternate" type="text/html" title="Transparent Hierarchical Storage Management with Apache Kudu and Impala" /><published>2019-03-05T00:00:00-08:00</published><updated>2019-03-05T00:00:00-08:00</updated><id>/2019/03/05/transparent-hierarchical-storage-management-with-apache-kudu-and-impala</id><content type="html" xml:base="/2019/03/05/transparent-hierarchical-storage-management-with-apache-kudu-and-impala.html">&lt;p&gt;Note: This is a cross-post from the Cloudera Engineering Blog
&lt;a href=&quot;https://blog.cloudera.com/blog/2019/03/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/&quot;&gt;Transparent Hierarchical Storage Management with Apache Kudu and Impala&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When picking a storage option for an application, it is common to choose the single
option whose features best fit your use case. For mutability and real-time analytics
workloads you may want to use Apache Kudu, but for massive scalability at a low cost
you may want to use HDFS. For that reason, there is a need
for a solution that allows you to leverage the best features of multiple storage
options. This post describes the sliding window pattern using Apache Impala with data
stored in Apache Kudu and Apache HDFS. With this pattern you get all of the benefits
of multiple storage layers in a way that is transparent to users.&lt;/p&gt;
&lt;!--more--&gt;
&lt;p&gt;Apache Kudu is designed for fast analytics on rapidly changing data. Kudu provides a
combination of fast inserts/updates and efficient columnar scans to enable multiple
real-time analytic workloads across a single storage layer. For that reason, Kudu fits
well into a data pipeline as the place to store real-time data that needs to be
queryable immediately. Additionally, Kudu supports updating and deleting rows in
real-time allowing support for late arriving data and data correction.&lt;/p&gt;
&lt;p&gt;Apache HDFS is designed to allow for limitless scalability at a low cost. It is
optimized for batch oriented use cases where data is immutable. When paired with the
Apache Parquet file format, structured data can be accessed with extremely high
throughput and efficiency.&lt;/p&gt;
&lt;p&gt;For situations in which the data is small and ever-changing, like dimension tables,
it is common to keep all of the data in Kudu. It is even common to keep large tables
in Kudu when the data fits within Kudu’s
&lt;a href=&quot;https://kudu.apache.org/docs/known_issues.html#_scale&quot;&gt;scaling limits&lt;/a&gt; and can benefit
from Kudu’s unique features. In cases where the data is massive, batch oriented, and
unlikely to change, storing the data in HDFS using the Parquet format is preferred.
When you need the benefits of both storage layers, the sliding window pattern is a
useful solution.&lt;/p&gt;
&lt;h2 id=&quot;the-sliding-window-pattern&quot;&gt;The Sliding Window Pattern&lt;/h2&gt;
&lt;p&gt;In this pattern, matching Kudu and Parquet formatted HDFS tables are created in Impala.
These tables are partitioned by a unit of time based on how frequently the data is
moved between the Kudu and HDFS table. It is common to use daily, monthly, or yearly
partitions. A unified view is created and a &lt;code&gt;WHERE&lt;/code&gt; clause is used to define a boundary
that separates which data is read from the Kudu table and which is read from the HDFS
table. The defined boundary is important so that you can move data between Kudu and
HDFS without exposing duplicate records to the view. Once the data is moved, an atomic
&lt;code&gt;ALTER VIEW&lt;/code&gt; statement can be used to move the boundary forward.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/sliding-window-pattern.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Note: This pattern works best with somewhat sequential data organized into range
partitions, because having a sliding window of time and dropping partitions is very
efficient.&lt;/p&gt;
&lt;p&gt;This pattern results in a sliding window of time where mutable data is stored in Kudu
and immutable data is stored in the Parquet format on HDFS. Leveraging both Kudu and
HDFS via Impala provides the benefits of both storage systems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Streaming data is immediately queryable&lt;/li&gt;
&lt;li&gt;Updates for late arriving data or manual corrections can be made&lt;/li&gt;
&lt;li&gt;Data stored in HDFS is optimally sized, increasing performance and preventing small files&lt;/li&gt;
&lt;li&gt;Reduced cost&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Impala also supports cloud storage options such as
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_s3.html&quot;&gt;S3&lt;/a&gt; and
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_adls.html&quot;&gt;ADLS&lt;/a&gt;.
This capability allows convenient access to a storage system that is remotely managed,
accessible from anywhere, and integrated with various cloud-based services. Because
this data is remote, queries against S3 data are less performant, making S3 suitable
for holding “cold” data that is only queried occasionally. This pattern can be
extended to use cloud storage for cold data by creating a third matching table and
adding another boundary to the unified view.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/sliding-window-pattern-cold.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Note: For simplicity only Kudu and HDFS are illustrated in the examples below.&lt;/p&gt;
&lt;p&gt;The process for moving data from Kudu to HDFS is broken into two phases. The first
phase is the data migration, and the second phase is the metadata change. These
ongoing steps should be scheduled to run automatically on a regular basis.&lt;/p&gt;
&lt;p&gt;In the first phase, the now immutable data is copied from Kudu to HDFS. Even though
data is duplicated from Kudu into HDFS, the boundary defined in the view will prevent
duplicate data from being shown to users. This step can include any validation and
retries as needed to ensure the data offload is successful.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/phase-1.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In the second phase, now that the data is safely copied to HDFS, the metadata is
changed to adjust how the offloaded partition is exposed. This includes shifting
the boundary forward, adding a new Kudu partition for the next period, and dropping
the old Kudu partition.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/phase-2.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;building-blocks&quot;&gt;Building Blocks&lt;/h2&gt;
&lt;p&gt;To implement the sliding window pattern, a few Impala fundamentals are required.
Each building block of the pattern is described below.&lt;/p&gt;
&lt;h3 id=&quot;moving-data&quot;&gt;Moving Data&lt;/h3&gt;
&lt;p&gt;Moving data among storage systems via Impala is straightforward, provided you have
matching tables defined using each of the storage formats. To keep this post brief,
not all of the options available when creating an Impala table are described.
However, Impala’s
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_create_table.html&quot;&gt;CREATE TABLE documentation&lt;/a&gt;
can be referenced to find the correct syntax for Kudu, HDFS, and cloud storage tables.
A few examples are shown further below where the sliding window pattern is illustrated.&lt;/p&gt;
&lt;p&gt;Once the tables are created, moving the data is as simple as an
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_insert.html&quot;&gt;INSERT…SELECT&lt;/a&gt; statement:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INTO&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;table_foo&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;table_bar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;All of the features of the
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_select.html&quot;&gt;SELECT&lt;/a&gt;
statement can be used to select the specific data you would like to move.&lt;/p&gt;
&lt;p&gt;Note: If moving data to Kudu, an &lt;code&gt;UPSERT INTO&lt;/code&gt; statement can be used to handle
duplicate keys.&lt;/p&gt;
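&lt;p&gt;For example, a hypothetical backfill into the &lt;code&gt;table_foo&lt;/code&gt; Kudu table above that
overwrites any rows whose keys already exist could use &lt;code&gt;UPSERT INTO&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;-- Rows whose primary keys already exist in table_foo are updated in place;
-- new keys are inserted.
UPSERT INTO table_foo
SELECT * FROM table_bar;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;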
&lt;h3 id=&quot;unified-querying&quot;&gt;Unified Querying&lt;/h3&gt;
&lt;p&gt;Querying data from multiple tables and data sources in Impala is also straightforward.
For the sake of brevity, not all of the options available when creating an Impala view
are described. However, see Impala’s
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_create_view.html&quot;&gt;CREATE VIEW documentation&lt;/a&gt;
for more in-depth details.&lt;/p&gt;
&lt;p&gt;Creating a view for unified querying is as simple as a &lt;code&gt;CREATE VIEW&lt;/code&gt; statement using
two &lt;code&gt;SELECT&lt;/code&gt; clauses combined with a &lt;code&gt;UNION ALL&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VIEW&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;foo_view&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;col1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;col2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;col3&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;foo_parquet&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;UNION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ALL&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;col1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;col2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;col3&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;foo_kudu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;WARNING: Be sure to use &lt;code&gt;UNION ALL&lt;/code&gt; and not &lt;code&gt;UNION&lt;/code&gt;. The &lt;code&gt;UNION&lt;/code&gt; keyword by itself
is the same as &lt;code&gt;UNION DISTINCT&lt;/code&gt; and can have significant performance impact.
More information can be found in the Impala
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_union.html&quot;&gt;UNION documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;All of the features of the
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_select.html&quot;&gt;SELECT&lt;/a&gt;
statement can be used to expose the correct data and columns from each of the
underlying tables. It is important to use the &lt;code&gt;WHERE&lt;/code&gt; clause to pass through and
pushdown any predicates that need special handling or transformations. More examples
will follow below in the discussion of the sliding window pattern.&lt;/p&gt;
&lt;p&gt;Additionally, views can be altered via the
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_alter_view.html&quot;&gt;ALTER VIEW&lt;/a&gt;
statement. This is useful when combined with the &lt;code&gt;SELECT&lt;/code&gt; statement because it can be
used to atomically update what data is being accessed by the view.&lt;/p&gt;
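&lt;p&gt;As a sketch, assuming the &lt;code&gt;foo_*&lt;/code&gt; tables above also had a time column named
&lt;code&gt;ts&lt;/code&gt; (hypothetical here), moving a view boundary forward could look like:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;-- The view is redefined in a single atomic statement, so readers never see
-- duplicate or missing rows while the boundary moves.
ALTER VIEW foo_view AS
SELECT col1, col2, col3 FROM foo_kudu
WHERE ts &amp;gt;= &amp;quot;2018-02-01&amp;quot;
UNION ALL
SELECT col1, col2, col3 FROM foo_parquet
WHERE ts &amp;lt; &amp;quot;2018-02-01&amp;quot;;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;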
&lt;h2 id=&quot;an-example-implementation&quot;&gt;An Example Implementation&lt;/h2&gt;
&lt;p&gt;Below are sample steps to implement the sliding window pattern using a monthly period
with three months of active mutable data. Data older than three months will be
offloaded to HDFS using the Parquet format.&lt;/p&gt;
&lt;h3 id=&quot;create-the-kudu-table&quot;&gt;Create the Kudu Table&lt;/h3&gt;
&lt;p&gt;First, create a Kudu table which will hold three months of active mutable data.
The table is range partitioned by the time column with each range containing one
period of data. It is important to have partitions that match the period because
dropping Kudu partitions is much more efficient than removing the data via a
&lt;code&gt;DELETE&lt;/code&gt; statement. The table is also hash partitioned by the other key column to ensure
that all of the data is not written to a single partition.&lt;/p&gt;
&lt;p&gt;Note: Your schema design should vary based on your data and read/write performance
considerations. This example schema is intended for demonstration purposes and not as
an “optimal” schema. See the
&lt;a href=&quot;https://kudu.apache.org/docs/schema_design.html&quot;&gt;Kudu schema design documentation&lt;/a&gt;
for more guidance on choosing your schema. For example, you may not need any hash
partitioning if your
data input rate is low. Alternatively, you may need more hash buckets if your data
input rate is very high.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_kudu&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STRING&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TIMESTAMP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;message&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STRING&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;KEY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;HASH&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PARTITIONS&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;RANGE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&amp;#39;2018-01-01&amp;#39;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VALUES&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&amp;#39;2018-02-01&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;--January&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&amp;#39;2018-02-01&amp;#39;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VALUES&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&amp;#39;2018-03-01&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;--February&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&amp;#39;2018-03-01&amp;#39;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VALUES&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&amp;#39;2018-04-01&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;--March&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&amp;#39;2018-04-01&amp;#39;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VALUES&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&amp;#39;2018-05-01&amp;#39;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;--April&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;KUDU&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Note: There is an extra month partition to provide a buffer of time for the data to
be moved into the immutable table.&lt;/p&gt;
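&lt;p&gt;As a sketch, sliding the window forward by one month would later add a new range
partition and drop the oldest one; the dates here are illustrative:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;-- Dropping a Kudu range partition discards its data far more efficiently
-- than a row-by-row DELETE.
ALTER TABLE my_table_kudu ADD RANGE PARTITION &amp;#39;2018-05-01&amp;#39; &amp;lt;= VALUES &amp;lt; &amp;#39;2018-06-01&amp;#39;; -- May
ALTER TABLE my_table_kudu DROP RANGE PARTITION &amp;#39;2018-01-01&amp;#39; &amp;lt;= VALUES &amp;lt; &amp;#39;2018-02-01&amp;#39;; -- January&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;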
&lt;h3 id=&quot;create-the-hdfs-table&quot;&gt;Create the HDFS Table&lt;/h3&gt;
&lt;p&gt;Create the matching Parquet formatted HDFS table which will hold the older immutable
data. This table is partitioned by year, month, and day for efficient access even
though you can’t partition by the time column itself. This is addressed further in
the view step below. See Impala’s
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_partitioning.html&quot;&gt;partitioning documentation&lt;/a&gt;
for more details.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_parquet&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STRING&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TIMESTAMP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;message&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STRING&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;PARTITIONED&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;year&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;month&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;day&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PARQUET&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id=&quot;create-the-unified-view&quot;&gt;Create the Unified View&lt;/h3&gt;
&lt;p&gt;Now create the unified view which will be used to query all of the data seamlessly:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VIEW&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_view&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;message&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_kudu&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;&amp;quot;2018-01-01&amp;quot;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;UNION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ALL&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;message&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_parquet&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;&amp;quot;2018-01-01&amp;quot;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;year&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;month&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;month&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;day&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;day&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Each &lt;code&gt;SELECT&lt;/code&gt; clause explicitly lists all of the columns to expose. This ensures that
the year, month, and day columns that are unique to the Parquet table are not exposed.
If needed, it also allows any necessary column or type mapping to be handled.&lt;/p&gt;
&lt;p&gt;The initial &lt;code&gt;WHERE&lt;/code&gt; clauses applied to both &lt;code&gt;my_table_kudu&lt;/code&gt; and
&lt;code&gt;my_table_parquet&lt;/code&gt; define the boundary between Kudu and HDFS, ensuring that
duplicate data is not read while data is being offloaded.&lt;/p&gt;
&lt;p&gt;The additional &lt;code&gt;AND&lt;/code&gt; clauses applied to &lt;code&gt;my_table_parquet&lt;/code&gt; ensure good
predicate pushdown on the individual year, month, and day columns.&lt;/p&gt;
&lt;p&gt;WARNING: As stated earlier, be sure to use &lt;code&gt;UNION ALL&lt;/code&gt; and not &lt;code&gt;UNION&lt;/code&gt;. The &lt;code&gt;UNION&lt;/code&gt;
keyword by itself is the same as &lt;code&gt;UNION DISTINCT&lt;/code&gt; and can have significant performance
impact. More information can be found in the Impala
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_union.html&quot;&gt;&lt;code&gt;UNION&lt;/code&gt; documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;ongoing-steps&quot;&gt;Ongoing Steps&lt;/h3&gt;
&lt;p&gt;Now that the base tables and view are created, prepare the ongoing steps to maintain
the sliding window. Because these ongoing steps should be scheduled to run on a
regular basis, the examples below are shown using &lt;code&gt;.sql&lt;/code&gt; files that take variables
which can be passed from your scripts and scheduling tool of choice.&lt;/p&gt;
&lt;p&gt;Create the &lt;code&gt;window_data_move.sql&lt;/code&gt; file to move the data from the oldest partition to HDFS:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INTO&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hdfs_table&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;month&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;day&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;month&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;day&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kudu_table&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;add_months&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;ss&quot;&gt;&amp;quot;${var:new_boundary_time}&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;&amp;quot;${var:new_boundary_time}&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;COMPUTE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;INCREMENTAL&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STATS&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hdfs_table&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Note: The
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html&quot;&gt;COMPUTE INCREMENTAL STATS&lt;/a&gt;
statement is not required, but it helps Impala optimize queries.&lt;/p&gt;
&lt;p&gt;To run the SQL statement, use the Impala shell and pass the required variables.
Below is an example:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;impala-shell -i &amp;lt;impalad:port&amp;gt; -f window_data_move.sql
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;kudu_table&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;my_table_kudu
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;hdfs_table&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;my_table_parquet
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;new_boundary_time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&amp;quot;2018-02-01&amp;quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Note: You can adjust the &lt;code&gt;WHERE&lt;/code&gt; clause to match the period and cadence of your
offload. Here the &lt;code&gt;add_months&lt;/code&gt; function is called with an argument of -1 so that the month of
data immediately preceding the new boundary time is moved.&lt;/p&gt;
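&lt;p&gt;As a quick sanity check, you can evaluate the boundary arithmetic on its own in the
Impala shell, using the same example boundary date as above:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;-- Subtract one month from the boundary; 2018-02-01 becomes 2018-01-01
SELECT add_months(&amp;quot;2018-02-01&amp;quot;, -1);&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;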
&lt;p&gt;Create the &lt;code&gt;window_view_alter.sql&lt;/code&gt; file to shift the time boundary forward by altering
the unified view:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;ALTER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VIEW&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;view_name&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;message&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kudu_table&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;&amp;quot;${var:new_boundary_time}&amp;quot;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;UNION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ALL&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;message&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hdfs_table&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;&amp;quot;${var:new_boundary_time}&amp;quot;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;year&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;month&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;month&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;day&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;day&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;To run the SQL statement, use the Impala shell and pass the required variables.
Below is an example:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;impala-shell -i &amp;lt;impalad:port&amp;gt; -f window_view_alter.sql
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;view_name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;my_table_view
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;kudu_table&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;my_table_kudu
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;hdfs_table&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;my_table_parquet
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;new_boundary_time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&amp;quot;2018-02-01&amp;quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Create the &lt;code&gt;window_partition_shift.sql&lt;/code&gt; file to shift the Kudu partitions forward:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;ALTER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kudu_table&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ADD&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RANGE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;add_months&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;ss&quot;&gt;&amp;quot;${var:new_boundary_time}&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;err&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;window_length&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VALUES&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;add_months&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;ss&quot;&gt;&amp;quot;${var:new_boundary_time}&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;err&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;window_length&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ALTER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kudu_table&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;DROP&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RANGE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;add_months&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;ss&quot;&gt;&amp;quot;${var:new_boundary_time}&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VALUES&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;&amp;quot;${var:new_boundary_time}&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;To run the SQL statement, use the Impala shell and pass the required variables.
Below is an example:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;impala-shell -i &amp;lt;impalad:port&amp;gt; -f window_partition_shift.sql
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;kudu_table&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;my_table_kudu
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;new_boundary_time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&amp;quot;2018-02-01&amp;quot;&lt;/span&gt;
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;window_length&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;3&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Note: You should periodically run
&lt;a href=&quot;https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html&quot;&gt;COMPUTE STATS&lt;/a&gt;
on your Kudu table to ensure Impala’s query performance is optimal.&lt;/p&gt;
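&lt;p&gt;For example, a minimal statement to refresh the statistics on the Kudu table used
throughout this post would be:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;-- Recompute table and column statistics for the Impala planner
COMPUTE STATS my_table_kudu;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;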
&lt;h3 id=&quot;experimentation&quot;&gt;Experimentation&lt;/h3&gt;
&lt;p&gt;Now that you have created the tables, view, and scripts to leverage the sliding
window pattern, you can experiment with them by inserting data for different time
ranges and running the scripts to move the window forward through time.&lt;/p&gt;
&lt;p&gt;Insert some sample values into the Kudu table:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INTO&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_kudu&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VALUES&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&amp;#39;joey&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&amp;#39;2018-01-01&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&amp;#39;hello&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&amp;#39;ross&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&amp;#39;2018-02-01&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&amp;#39;goodbye&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&amp;#39;rachel&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&amp;#39;2018-03-01&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&amp;#39;hi&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Show the data in each table/view:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_kudu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_parquet&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Move the January data into HDFS:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;impala-shell -i &amp;lt;impalad:port&amp;gt; -f window_data_move.sql
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;kudu_table&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;my_table_kudu
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;hdfs_table&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;my_table_parquet
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;new_boundary_time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&amp;quot;2018-02-01&amp;quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Confirm the data is in both places, but not duplicated in the view:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_kudu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_parquet&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Alter the view to shift the time boundary forward to February:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;impala-shell -i &amp;lt;impalad:port&amp;gt; -f window_view_alter.sql
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;view_name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;my_table_view
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;kudu_table&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;my_table_kudu
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;hdfs_table&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;my_table_parquet
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;new_boundary_time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&amp;quot;2018-02-01&amp;quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Confirm the data is still in both places, but not duplicated in the view:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_kudu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_parquet&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Shift the Kudu partitions forward:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;impala-shell -i &amp;lt;impalad:port&amp;gt; -f window_partition_shift.sql
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;kudu_table&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;my_table_kudu
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;new_boundary_time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&amp;quot;2018-02-01&amp;quot;&lt;/span&gt;
--var&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;window_length&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;3&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Confirm the January data is now only in HDFS:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_kudu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_parquet&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Confirm predicate pushdown with Impala’s &lt;code&gt;EXPLAIN&lt;/code&gt; statement:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_view&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;&amp;quot;2018-02-01&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPLAIN&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_table_view&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;&amp;quot;2018-02-01&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In the &lt;code&gt;EXPLAIN&lt;/code&gt; output you should see “kudu predicates”, which include the time column
filters, in the “SCAN KUDU” section, and “predicates”, which include the time, day, month,
and year columns, in the “SCAN HDFS” section.&lt;/p&gt;</content><author><name>Grant Henke</name></author><summary>Note: This is a cross-post from the Cloudera Engineering Blog
Transparent Hierarchical Storage Management with Apache Kudu and Impala
When picking a storage option for an application it is common to pick a single
storage option which has the most applicable features to your use case. For mutability
and real-time analytics workloads you may want to use Apache Kudu, but for massive
scalability at a low cost you may want to use HDFS. For that reason, there is a need
for a solution that allows you to leverage the best features of multiple storage
options. This post describes the sliding window pattern using Apache Impala with data
stored in Apache Kudu and Apache HDFS. With this pattern you get all of the benefits
of multiple storage layers in a way that is transparent to users.</summary></entry><entry><title>Call for Posts</title><link href="/2018/12/11/call-for-posts.html" rel="alternate" type="text/html" title="Call for Posts" /><published>2018-12-11T00:00:00-08:00</published><updated>2018-12-11T00:00:00-08:00</updated><id>/2018/12/11/call-for-posts</id><content type="html" xml:base="/2018/12/11/call-for-posts.html">&lt;p&gt;Most of the posts in the Kudu blog have been written by the project’s
committers and are either technical or news-like in nature. We’d like to hear
how you’re using Kudu in production, in testing, or in your hobby project and
we’d like to share it with the world!&lt;/p&gt;
&lt;!--more--&gt;
&lt;p&gt;If you’d like to tell the world about how you are using Kudu in your project,
now is the time.&lt;/p&gt;
&lt;p&gt;To learn how to submit posts, read our &lt;a href=&quot;/docs/contributing.html#_blog_posts&quot;&gt;contributing
documentation&lt;/a&gt;. Alternatively, you can
draft your post to Google Docs and share it with us on
&lt;a href=&quot;&amp;#109;&amp;#097;&amp;#105;&amp;#108;&amp;#116;&amp;#111;:&amp;#100;&amp;#101;&amp;#118;&amp;#064;&amp;#107;&amp;#117;&amp;#100;&amp;#117;&amp;#046;&amp;#097;&amp;#112;&amp;#097;&amp;#099;&amp;#104;&amp;#101;&amp;#046;&amp;#111;&amp;#114;&amp;#103;&quot;&gt;&amp;#100;&amp;#101;&amp;#118;&amp;#064;&amp;#107;&amp;#117;&amp;#100;&amp;#117;&amp;#046;&amp;#097;&amp;#112;&amp;#097;&amp;#099;&amp;#104;&amp;#101;&amp;#046;&amp;#111;&amp;#114;&amp;#103;&lt;/a&gt; and we’re happy to review it
and post it to the blog for you.&lt;/p&gt;</content><author><name>Attila Bukor</name></author><summary>Most of the posts in the Kudu blog have been written by the project’s
committers and are either technical or news-like in nature. We’d like to hear
how you’re using Kudu in production, in testing, or in your hobby project and
we’d like to share it with the world!</summary></entry><entry><title>Apache Kudu 1.8.0 Released</title><link href="/2018/10/26/apache-kudu-1-8-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.8.0 Released" /><published>2018-10-26T00:00:00-07:00</published><updated>2018-10-26T00:00:00-07:00</updated><id>/2018/10/26/apache-kudu-1-8-0-released</id><content type="html" xml:base="/2018/10/26/apache-kudu-1-8-0-released.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.8.0!&lt;/p&gt;
&lt;p&gt;The new release adds several new features and improvements, including the
following:&lt;/p&gt;
&lt;!--more--&gt;
&lt;ul&gt;
&lt;li&gt;Introduced manual data rebalancer tool which can be used to redistribute
table replicas among tablet servers&lt;/li&gt;
&lt;li&gt;Added support for &lt;code&gt;IS NULL&lt;/code&gt; and &lt;code&gt;IS NOT NULL&lt;/code&gt; predicates to the Kudu Python
client&lt;/li&gt;
&lt;li&gt;Multiple tooling improvements make diagnostics and troubleshooting simpler&lt;/li&gt;
&lt;li&gt;The Kudu Spark connector now supports Spark Streaming DataFrames&lt;/li&gt;
&lt;li&gt;Added Pandas support to the Python client&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The above is just a list of the highlights; for a more complete list of new
features, improvements and fixes please refer to the &lt;a href=&quot;/releases/1.8.0/docs/release_notes.html&quot;&gt;release
notes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Apache Kudu project only publishes source code releases. To build Kudu
1.8.0, follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the Kudu &lt;a href=&quot;/releases/1.8.0&quot;&gt;1.8.0 source release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow the instructions in the documentation to build Kudu &lt;a href=&quot;/releases/1.8.0/docs/installation.html#build_from_source&quot;&gt;1.8.0 from
source&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For your convenience, binary JAR files for the Kudu Java client library, Spark
DataSource, Flume sink, and other Java integrations are published to the ASF
Maven repository and are &lt;a href=&quot;https://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.kudu%22%20AND%20v%3A%221.8.0%22&quot;&gt;now
available&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Python client source is also available on
&lt;a href=&quot;https://pypi.org/project/kudu-python/&quot;&gt;PyPI&lt;/a&gt;.&lt;/p&gt;</content><author><name>Attila Bukor</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 1.8.0!
The new release adds several new features and improvements, including the
following:</summary></entry><entry><title>Index Skip Scan Optimization in Kudu</title><link href="/2018/09/26/index-skip-scan-optimization-in-kudu.html" rel="alternate" type="text/html" title="Index Skip Scan Optimization in Kudu" /><published>2018-09-26T00:00:00-07:00</published><updated>2018-09-26T00:00:00-07:00</updated><id>/2018/09/26/index-skip-scan-optimization-in-kudu</id><content type="html" xml:base="/2018/09/26/index-skip-scan-optimization-in-kudu.html">&lt;p&gt;This summer I got the opportunity to intern with the Apache Kudu team at Cloudera.
My project was to optimize the Kudu scan path by implementing a technique called
index skip scan (a.k.a. scan-to-seek, see section 4.1 in [1]). I wanted to share
my experience and the progress we’ve made so far on the approach.&lt;/p&gt;
&lt;!--more--&gt;
&lt;p&gt;Let’s begin with discussing the current query flow in Kudu.
Consider the following table:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;host&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STRING&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tstamp&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;clusterid&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;role&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STRING&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;KEY&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;host&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tstamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clusterid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;/img/index-skip-scan/example-table.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;
&lt;em&gt;Sample rows of table &lt;code&gt;metrics&lt;/code&gt; (sorted by key columns).&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In this case, by default, Kudu internally builds a primary key index (implemented as a
&lt;a href=&quot;https://en.wikipedia.org/wiki/B-tree&quot;&gt;B-tree&lt;/a&gt;) for the table &lt;code&gt;metrics&lt;/code&gt;.
As shown in the table above, the index data is sorted by the composite of all key columns.
When the user query contains the first key column (&lt;code&gt;host&lt;/code&gt;), Kudu uses the index (as the index data is
primarily sorted on the first key column).&lt;/p&gt;
&lt;p&gt;Now, what if the user query does not contain the first key column and instead only contains the &lt;code&gt;tstamp&lt;/code&gt; column?
In the above case, the &lt;code&gt;tstamp&lt;/code&gt; column values are sorted with respect to &lt;code&gt;host&lt;/code&gt;,
but are not globally sorted, and as such, it’s non-trivial to use the index to filter rows.
Instead, a full tablet scan is done by default. Other databases may optimize such scans by building secondary indexes
(though building a secondary index on a primary key column would be somewhat redundant). However, this isn’t an option for Kudu,
given its lack of secondary index support.&lt;/p&gt;
&lt;p&gt;The question is, can Kudu do better than a full tablet scan here?&lt;/p&gt;
&lt;p&gt;The answer is yes! Let’s observe the column preceding the &lt;code&gt;tstamp&lt;/code&gt; column. We will refer to it as the
“prefix column” and its specific value as the “prefix key”. In this example, &lt;code&gt;host&lt;/code&gt; is the prefix column.
Note that the prefix keys are sorted in the index and that all rows of a given prefix key are also sorted by the
remaining key columns. Therefore, we can use the index to skip to the rows that have distinct prefix keys,
and also satisfy the predicate on the &lt;code&gt;tstamp&lt;/code&gt; column.
For example, consider the query:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clusterid&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tstamp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;/img/index-skip-scan/skip-scan-example-table.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;
&lt;em&gt;Skip scan flow illustration. The rows in green are scanned and the rest are skipped.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The tablet server can use the index to &lt;strong&gt;skip&lt;/strong&gt; to the first row with a distinct prefix key (&lt;code&gt;host = helium&lt;/code&gt;) that
matches the predicate (&lt;code&gt;tstamp = 100&lt;/code&gt;) and then &lt;strong&gt;scan&lt;/strong&gt; through the rows until the predicate no longer matches. At that
point we would know that no more rows with &lt;code&gt;host = helium&lt;/code&gt; will satisfy the predicate, and we can skip to the next
prefix key. This holds true for all distinct keys of &lt;code&gt;host&lt;/code&gt;. Hence, this method is popularly known as
&lt;strong&gt;skip scan optimization&lt;/strong&gt;[2, 3].&lt;/p&gt;
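&lt;p&gt;As a rough illustration, the seek-and-scan loop can be sketched in Python over an in-memory
sorted list standing in for the primary key index (the sample data, function name, and sentinel
values are illustrative only, not part of Kudu’s actual C++ implementation):&lt;/p&gt;

```python
from bisect import bisect_left, bisect_right

# Rows sorted by the composite primary key (host, tstamp, clusterid),
# standing in for the B-tree index of the `metrics` table.
rows = sorted([
    ("argon",  100, 1), ("argon",  200, 1),
    ("helium", 100, 2), ("helium", 100, 3), ("helium", 300, 2),
    ("xenon",  100, 1), ("xenon",  250, 4),
])

def skip_scan(rows, tstamp):
    """Evaluate `SELECT clusterid FROM metrics WHERE tstamp = ?` by
    seeking to each distinct prefix key (host) and scanning only the
    rows that match the tstamp predicate."""
    out, i, n = [], 0, len(rows)
    inf = float("inf")  # sentinel that sorts after any real value
    while i != n:
        host = rows[i][0]  # current distinct prefix key
        # Seek: jump to the first row of this host matching the predicate.
        lo = bisect_left(rows, (host, tstamp), i)
        # Scan: collect rows until the predicate no longer matches.
        hi = bisect_right(rows, (host, tstamp, inf), lo)
        out.extend(r[2] for r in rows[lo:hi])
        # Skip: jump to the first row of the next distinct host.
        i = bisect_right(rows, (host, inf), i)
    return out

print(skip_scan(rows, 100))  # [1, 2, 3, 1]
```

&lt;p&gt;Each distinct &lt;code&gt;host&lt;/code&gt; costs at most a couple of seeks plus a scan of the matching rows,
rather than a read of the entire table.&lt;/p&gt;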
&lt;h1 id=&quot;performance&quot;&gt;Performance&lt;/h1&gt;
&lt;p&gt;This optimization can speed up queries significantly, depending on the cardinality (number of distinct values) of the
prefix column. The lower the prefix column cardinality, the better the skip scan performance. In fact, when the
prefix column cardinality is high, skip scan is not a viable approach. The performance graph (obtained using the example
schema and query pattern mentioned earlier) is shown below.&lt;/p&gt;
&lt;p&gt;Based on our experiments on up to 10 million rows per tablet (as shown below), we found that skip scan performance
begins to fall behind full tablet scan performance when the prefix column cardinality
exceeds sqrt(number_of_rows_in_tablet).
Therefore, to take advantage of skip scan when possible while maintaining consistent performance at
large prefix column cardinalities, we have tentatively chosen to dynamically disable skip scan when the number of seeks for
distinct prefix keys exceeds sqrt(number_of_rows_in_tablet).
It will be an interesting project to further explore sophisticated heuristics to decide when
to dynamically disable skip scan.&lt;/p&gt;
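&lt;p&gt;A minimal sketch of that cutoff heuristic (the function and parameter names here are hypothetical,
not identifiers from the Kudu codebase):&lt;/p&gt;

```python
import math
import operator

def keep_skip_scanning(seeks_so_far, rows_in_tablet):
    # Continue skip scanning only while the number of seeks to distinct
    # prefix keys stays within sqrt(number_of_rows_in_tablet); beyond
    # that, fall back to a plain full tablet scan.
    return operator.le(seeks_so_far, math.sqrt(rows_in_tablet))
```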
&lt;p&gt;&lt;img src=&quot;/img/index-skip-scan/skip-scan-performance-graph.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Skip scan optimization in Kudu can lead to huge performance benefits that scale with the size of
data in Kudu tablets. This is a work-in-progress &lt;a href=&quot;https://gerrit.cloudera.org/#/c/10983/&quot;&gt;patch&lt;/a&gt;.
The implementation in the patch works only for equality predicates on the non-first primary key
columns. Note that although this specific example has only one prefix
column (&lt;code&gt;host&lt;/code&gt;), the approach generalizes to any number of prefix columns.&lt;/p&gt;
&lt;p&gt;This work also lays the groundwork to leverage the skip scan approach and optimize query processing time in the
following use cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Range predicates&lt;/li&gt;
&lt;li&gt;In-list predicates&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This was my first time working on an open source project. I thoroughly enjoyed working on this challenging problem,
right from understanding the scan path in Kudu to working on a full-fledged implementation of
the skip scan optimization. I am very grateful to the Kudu team for guiding and supporting me throughout the
internship period.&lt;/p&gt;
&lt;h1 id=&quot;references&quot;&gt;References&lt;/h1&gt;
&lt;p&gt;&lt;a href=&quot;https://storage.googleapis.com/pub-tools-public-publication-data/pdf/42851.pdf&quot;&gt;[1]&lt;/a&gt;: Gupta, Ashish, et al. “Mesa:
Geo-replicated, near real-time, scalable data warehousing.” Proceedings of the VLDB Endowment 7.12 (2014): 1259-1270.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://oracle-base.com/articles/9i/index-skip-scanning/&quot;&gt;[2]&lt;/a&gt;: Index Skip Scanning - Oracle Database&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.sqlite.org/optoverview.html#skipscan&quot;&gt;[3]&lt;/a&gt;: Skip Scan - SQLite&lt;/p&gt;</content><author><name>Anupama Gupta</name></author><summary>This summer I got the opportunity to intern with the Apache Kudu team at Cloudera.
My project was to optimize the Kudu scan path by implementing a technique called
index skip scan (a.k.a. scan-to-seek, see section 4.1 in [1]). I wanted to share
my experience and the progress we’ve made so far on the approach.</summary></entry><entry><title>Simplified Data Pipelines with Kudu</title><link href="/2018/09/11/simplified-pipelines-with-kudu.html" rel="alternate" type="text/html" title="Simplified Data Pipelines with Kudu" /><published>2018-09-11T00:00:00-07:00</published><updated>2018-09-11T00:00:00-07:00</updated><id>/2018/09/11/simplified-pipelines-with-kudu</id><content type="html" xml:base="/2018/09/11/simplified-pipelines-with-kudu.html">&lt;p&gt;I’ve been working with Hadoop now for over seven years and fortunately, or unfortunately, have run
across a lot of structured data use cases. What we, at &lt;a href=&quot;https://phdata.io/&quot;&gt;phData&lt;/a&gt;, have found is
that end users are typically comfortable with tabular data and prefer to access their data in a
structured manner using tables.
&lt;!--more--&gt;&lt;/p&gt;
&lt;p&gt;When working on new structured data projects, the first question we always get from non-Hadoop
followers is, &lt;em&gt;“how do I update or delete a record?”&lt;/em&gt; The second question we get is, &lt;em&gt;“when adding
records, why don’t they show up in Impala right away?”&lt;/em&gt; For those of us who have worked with HDFS
and Impala on HDFS for years, these are simple questions to answer, but hard ones to explain.&lt;/p&gt;
&lt;p&gt;The pre-Kudu years were filled with hundreds (or thousands) of self-join views (or materialization jobs)
and compaction jobs, along with scheduled jobs to refresh the Impala cache periodically so new records
would show up. And while doable, for tens of thousands of tables, this basically became a distraction from solving
real business problems.&lt;/p&gt;
&lt;p&gt;With the introduction of Kudu, mixing record-level updates, deletes, and inserts while supporting
large scans is now something we can sustainably manage at scale. HBase is very good at record-level
updates, deletes, and inserts, but doesn’t scale well for analytic use cases that often do full
table scans. Moreover, for streaming use cases, changes are available in near real-time. End users,
accustomed to having to &lt;em&gt;“wait”&lt;/em&gt; for their data, can now consume the data as it arrives in their
table.&lt;/p&gt;
&lt;p&gt;A common data ingest pattern where Kudu becomes necessary is change data capture (CDC). That is,
capturing the inserts, updates, hard deletes, and streaming them into Kudu where they can be applied
immediately. Pre-Kudu this pipeline was very tedious to implement. Now with tools like
&lt;a href=&quot;https://streamsets.com/&quot;&gt;StreamSets&lt;/a&gt;, you can get up and running in a few hours.&lt;/p&gt;
&lt;p&gt;A second common workflow is near real-time analytics. We’ve streamed data off mining trucks,
oil wells, manufacturing lines, and needed to make that data available to end users immediately. No
longer do we need to batch up writes, flush to HDFS and then refresh cache in Impala. As mentioned
before, with Kudu, the data is available as soon as it lands. This has been a significant
enhancement for end users, who previously had to &lt;em&gt;“wait”&lt;/em&gt; for data.&lt;/p&gt;
&lt;p&gt;In summary, Kudu has made a tremendous impact in removing the operational distractions of merging in
changes, and refreshing the cache of downstream consumers. This now allows data engineers
and users to focus on solving business problems, rather than being bothered by the tediousness of
the backend.&lt;/p&gt;</content><author><name>Mac Noland</name></author><summary>I’ve been working with Hadoop now for over seven years and fortunately, or unfortunately, have run
across a lot of structured data use cases. What we, at phData, have found is
that end users are typically comfortable with tabular data and prefer to access their data in a
structured manner using tables.</summary></entry></feed>