| <?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom"><generator uri="http://jekyllrb.com" version="2.5.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2019-03-15T10:35:54-07:00</updated><id>/</id><entry><title>Transparent Hierarchical Storage Management with Apache Kudu and Impala</title><link href="/2019/03/05/transparent-hierarchical-storage-management-with-apache-kudu-and-impala.html" rel="alternate" type="text/html" title="Transparent Hierarchical Storage Management with Apache Kudu and Impala" /><published>2019-03-05T00:00:00-08:00</published><updated>2019-03-05T00:00:00-08:00</updated><id>/2019/03/05/transparent-hierarchical-storage-management-with-apache-kudu-and-impala</id><content type="html" xml:base="/2019/03/05/transparent-hierarchical-storage-management-with-apache-kudu-and-impala.html"><p>Note: This is a cross-post from the Cloudera Engineering Blog |
| <a href="https://blog.cloudera.com/blog/2019/03/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/">Transparent Hierarchical Storage Management with Apache Kudu and Impala</a></p> |
| |
| <p>When choosing a storage option for an application, it is common to select the |
| single option whose features best fit your use case. For mutability and real-time |
| analytics workloads you may want to use Apache Kudu, but for massive scalability at |
| a low cost you may want to use HDFS. For that reason, there is a need for a solution |
| that allows you to leverage the best features of multiple storage options. This post |
| describes the sliding window pattern using Apache Impala with data stored in Apache |
| Kudu and Apache HDFS. With this pattern you get all of the benefits of multiple |
| storage layers in a way that is transparent to users.</p> |
| |
| <!--more--> |
| |
| <p>Apache Kudu is designed for fast analytics on rapidly changing data. Kudu provides a |
| combination of fast inserts/updates and efficient columnar scans to enable multiple |
| real-time analytic workloads across a single storage layer. For that reason, Kudu fits |
| well into a data pipeline as the place to store real-time data that needs to be |
| queryable immediately. Additionally, Kudu supports updating and deleting rows in |
| real time, allowing for late-arriving data and data correction.</p> |
| |
| <p>Apache HDFS is designed to allow for limitless scalability at a low cost. It is |
| optimized for batch-oriented use cases where data is immutable. When paired with the |
| Apache Parquet file format, structured data can be accessed with extremely high |
| throughput and efficiency.</p> |
| |
| <p>For situations in which the data is small and ever-changing, like dimension tables, |
| it is common to keep all of the data in Kudu. It is even common to keep large tables |
| in Kudu when the data fits within Kudu’s |
| <a href="https://kudu.apache.org/docs/known_issues.html#_scale">scaling limits</a> and can benefit |
| from Kudu’s unique features. In cases where the data is massive, batch-oriented, and |
| unlikely to change, storing the data in HDFS using the Parquet format is preferred. |
| When you need the benefits of both storage layers, the sliding window pattern is a |
| useful solution.</p> |
| |
| <h2 id="the-sliding-window-pattern">The Sliding Window Pattern</h2> |
| |
| <p>In this pattern, matching Kudu and Parquet formatted HDFS tables are created in Impala. |
| These tables are partitioned by a unit of time based on how frequently the data is |
| moved between the Kudu and HDFS tables. It is common to use daily, monthly, or yearly |
| partitions. A unified view is created and a <code>WHERE</code> clause is used to define a boundary |
| that separates which data is read from the Kudu table and which is read from the HDFS |
| table. The defined boundary is important so that you can move data between Kudu and |
| HDFS without exposing duplicate records to the view. Once the data is moved, an atomic |
| <code>ALTER VIEW</code> statement can be used to move the boundary forward.</p> |
| |
| <p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/sliding-window-pattern.png" alt="png" class="img-responsive" /></p> |
| |
| <p>Note: This pattern works best with somewhat sequential data organized into range |
| partitions, because sliding the window forward by adding and dropping range |
| partitions is very efficient.</p> |
| |
| <p>This pattern results in a sliding window of time where mutable data is stored in Kudu |
| and immutable data is stored in the Parquet format on HDFS. Leveraging both Kudu and |
| HDFS via Impala provides the benefits of both storage systems:</p> |
| |
| <ul> |
| <li>Streaming data is immediately queryable</li> |
| <li>Updates for late arriving data or manual corrections can be made</li> |
| <li>Data stored in HDFS is optimally sized, increasing performance and preventing small files</li> |
| <li>Reduced cost</li> |
| </ul> |
| |
| <p>Impala also supports cloud storage options such as |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_s3.html">S3</a> and |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_adls.html">ADLS</a>. |
| This capability allows convenient access to a storage system that is remotely managed, |
| accessible from anywhere, and integrated with various cloud-based services. Because |
| this data is remote, queries against S3 data are less performant, making S3 suitable |
| for holding “cold” data that is only queried occasionally. This pattern can be |
| extended to use cloud storage for cold data by creating a third matching table and |
| adding another boundary to the unified view.</p> |
| |
| <p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/sliding-window-pattern-cold.png" alt="png" class="img-responsive" /></p> |
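| <p>For example, the unified view could gain a third <code>SELECT</code> over an S3-backed table. In |
| the sketch below, the table names, columns, and boundary dates are all hypothetical:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql">CREATE VIEW my_view AS |
| -- hot, mutable data in Kudu |
| SELECT col1, col2 FROM my_kudu_table WHERE time &gt;= &#39;2018-01-01&#39; |
| UNION ALL |
| -- warm, immutable data in Parquet on HDFS |
| SELECT col1, col2 FROM my_hdfs_table |
| WHERE time &gt;= &#39;2017-01-01&#39; AND time &lt; &#39;2018-01-01&#39; |
| UNION ALL |
| -- cold data on S3, queried only occasionally |
| SELECT col1, col2 FROM my_s3_table WHERE time &lt; &#39;2017-01-01&#39;;</code></pre></div> |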
| |
| <p>Note: For simplicity only Kudu and HDFS are illustrated in the examples below.</p> |
| |
| <p>The process for moving data from Kudu to HDFS is broken into two phases. The first |
| phase is the data migration, and the second phase is the metadata change. These |
| ongoing steps should be scheduled to run automatically on a regular basis.</p> |
| |
| <p>In the first phase, the now immutable data is copied from Kudu to HDFS. Even though |
| data is duplicated from Kudu into HDFS, the boundary defined in the view will prevent |
| duplicate data from being shown to users. This step can include any validation and |
| retries as needed to ensure the data offload is successful.</p> |
| |
| <p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/phase-1.png" alt="png" class="img-responsive" /></p> |
| |
| <p>In the second phase, now that the data is safely copied to HDFS, the metadata is |
| changed to adjust how the offloaded partition is exposed. This includes shifting |
| the boundary forward, adding a new Kudu partition for the next period, and dropping |
| the old Kudu partition.</p> |
| |
| <p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/phase-2.png" alt="png" class="img-responsive" /></p> |
| |
| <h2 id="building-blocks">Building Blocks</h2> |
| |
| <p>In order to implement the sliding window pattern, a few Impala fundamentals are |
| required. Each fundamental building block of the sliding window pattern is described |
| below.</p> |
| |
| <h3 id="moving-data">Moving Data</h3> |
| |
| <p>Moving data among storage systems via Impala is straightforward provided you have |
| matching tables defined using each of the storage formats. To keep this post brief, |
| not all of the options available when creating an Impala table are described. |
| However, Impala’s |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_create_table.html">CREATE TABLE documentation</a> |
| can be referenced to find the correct syntax for Kudu, HDFS, and cloud storage tables. |
| A few examples are shown further below where the sliding window pattern is illustrated.</p> |
| |
| <p>Once the tables are created, moving the data is as simple as an |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_insert.html">INSERT…SELECT</a> statement:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">table_foo</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">table_bar</span><span class="p">;</span></code></pre></div> |
| |
| <p>All of the features of the |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_select.html">SELECT</a> |
| statement can be used to select the specific data you would like to move.</p> |
| |
| <p>Note: If moving data to Kudu, an <code>UPSERT INTO</code> statement can be used to handle |
| duplicate keys.</p> |
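| <p>For example, when copying rows back into Kudu, replacing <code>INSERT</code> with <code>UPSERT</code> |
| overwrites any rows whose primary keys already exist (here <code>table_foo</code> is assumed to be |
| the Kudu table):</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql">UPSERT INTO table_foo |
| SELECT * FROM table_bar;</code></pre></div> |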
| |
| <h3 id="unified-querying">Unified Querying</h3> |
| |
| <p>Querying data from multiple tables and data sources in Impala is also straightforward. |
| For the sake of brevity, not all of the options available when creating an Impala |
| view are described. However, see Impala’s |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_create_view.html">CREATE VIEW documentation</a> |
| for more in-depth details.</p> |
| |
| <p>Creating a view for unified querying is as simple as a <code>CREATE VIEW</code> statement using |
| two <code>SELECT</code> clauses combined with a <code>UNION ALL</code>:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">VIEW</span> <span class="n">foo_view</span> <span class="k">AS</span> |
| <span class="k">SELECT</span> <span class="n">col1</span><span class="p">,</span> <span class="n">col2</span><span class="p">,</span> <span class="n">col3</span> <span class="k">FROM</span> <span class="n">foo_parquet</span> |
| <span class="k">UNION</span> <span class="k">ALL</span> |
| <span class="k">SELECT</span> <span class="n">col1</span><span class="p">,</span> <span class="n">col2</span><span class="p">,</span> <span class="n">col3</span> <span class="k">FROM</span> <span class="n">foo_kudu</span><span class="p">;</span></code></pre></div> |
| |
| <p>WARNING: Be sure to use <code>UNION ALL</code> and not <code>UNION</code>. The <code>UNION</code> keyword by itself |
| is the same as <code>UNION DISTINCT</code> and can have significant performance impact. |
| More information can be found in the Impala |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_union.html">UNION documentation</a>.</p> |
| |
| <p>All of the features of the |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_select.html">SELECT</a> |
| statement can be used to expose the correct data and columns from each of the |
| underlying tables. It is important to use the <code>WHERE</code> clause to pass through and |
| pushdown any predicates that need special handling or transformations. More examples |
| will follow below in the discussion of the sliding window pattern.</p> |
| |
| <p>Additionally, views can be altered via the |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_alter_view.html">ALTER VIEW</a> |
| statement. This is useful when combined with the <code>SELECT</code> statement because it can be |
| used to atomically update what data is being accessed by the view.</p> |
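| <p>For example, the boundary of the <code>foo_view</code> defined above could be moved in a single |
| atomic statement (the boundary predicate on <code>col2</code> is hypothetical):</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql">ALTER VIEW foo_view AS |
| SELECT col1, col2, col3 FROM foo_parquet |
| WHERE col2 &lt; &#39;2018-02-01&#39; |
| UNION ALL |
| SELECT col1, col2, col3 FROM foo_kudu |
| WHERE col2 &gt;= &#39;2018-02-01&#39;;</code></pre></div> |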
| |
| <h2 id="an-example-implementation">An Example Implementation</h2> |
| |
| <p>Below are sample steps to implement the sliding window pattern using a monthly period |
| with three months of active mutable data. Data older than three months will be |
| offloaded to HDFS using the Parquet format.</p> |
| |
| <h3 id="create-the-kudu-table">Create the Kudu Table</h3> |
| |
| <p>First, create a Kudu table which will hold three months of active mutable data. |
| The table is range partitioned by the time column with each range containing one |
| period of data. It is important to have partitions that match the period because |
| dropping Kudu partitions is much more efficient than removing the data via a |
| <code>DELETE</code> statement. The table is also hash partitioned by the other key column to ensure |
| that all of the data is not written to a single partition.</p> |
| |
| <p>Note: Your schema design should vary based on your data and read/write performance |
| considerations. This example schema is intended for demonstration purposes and not as |
| an “optimal” schema. See the |
| <a href="https://kudu.apache.org/docs/schema_design.html">Kudu schema design documentation</a> |
| for more guidance on choosing your schema. For example, you may not need any hash |
| partitioning if your |
| data input rate is low. Alternatively, you may need more hash buckets if your data |
| input rate is very high.</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">my_table_kudu</span> |
| <span class="p">(</span> |
| <span class="n">name</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="n">time</span> <span class="k">TIMESTAMP</span><span class="p">,</span> |
| <span class="n">message</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">time</span><span class="p">)</span> |
| <span class="p">)</span> |
| <span class="n">PARTITION</span> <span class="k">BY</span> |
| <span class="n">HASH</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="n">PARTITIONS</span> <span class="mi">4</span><span class="p">,</span> |
| <span class="n">RANGE</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> <span class="p">(</span> |
| <span class="n">PARTITION</span> <span class="s1">&#39;2018-01-01&#39;</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="s1">&#39;2018-02-01&#39;</span><span class="p">,</span> <span class="c1">--January</span> |
| <span class="n">PARTITION</span> <span class="s1">&#39;2018-02-01&#39;</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="s1">&#39;2018-03-01&#39;</span><span class="p">,</span> <span class="c1">--February</span> |
| <span class="n">PARTITION</span> <span class="s1">&#39;2018-03-01&#39;</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="s1">&#39;2018-04-01&#39;</span><span class="p">,</span> <span class="c1">--March</span> |
| <span class="n">PARTITION</span> <span class="s1">&#39;2018-04-01&#39;</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="s1">&#39;2018-05-01&#39;</span> <span class="c1">--April</span> |
| <span class="p">)</span> |
| <span class="n">STORED</span> <span class="k">AS</span> <span class="n">KUDU</span><span class="p">;</span></code></pre></div> |
| |
| <p>Note: There is an extra month partition to provide a buffer of time for the data to |
| be moved into the immutable table.</p> |
| |
| <h3 id="create-the-hdfs-table">Create the HDFS Table</h3> |
| |
| <p>Create the matching Parquet formatted HDFS table which will hold the older immutable |
| data. This table is partitioned by year, month, and day for efficient access even |
| though you can’t partition by the time column itself. This is addressed further in |
| the view step below. See Impala’s |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_partitioning.html">partitioning documentation</a> |
| for more details.</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">my_table_parquet</span> |
| <span class="p">(</span> |
| <span class="n">name</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="n">time</span> <span class="k">TIMESTAMP</span><span class="p">,</span> |
| <span class="n">message</span> <span class="n">STRING</span> |
| <span class="p">)</span> |
| <span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="p">(</span><span class="k">year</span> <span class="nb">int</span><span class="p">,</span> <span class="k">month</span> <span class="nb">int</span><span class="p">,</span> <span class="k">day</span> <span class="nb">int</span><span class="p">)</span> |
| <span class="n">STORED</span> <span class="k">AS</span> <span class="n">PARQUET</span><span class="p">;</span></code></pre></div> |
| |
| <h3 id="create-the-unified-view">Create the Unified View</h3> |
| |
| <p>Now create the unified view which will be used to query all of the data seamlessly:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">VIEW</span> <span class="n">my_table_view</span> <span class="k">AS</span> |
| <span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="n">time</span><span class="p">,</span> <span class="n">message</span> |
| <span class="k">FROM</span> <span class="n">my_table_kudu</span> |
| <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&gt;=</span> <span class="ss">&quot;2018-01-01&quot;</span> |
| <span class="k">UNION</span> <span class="k">ALL</span> |
| <span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="n">time</span><span class="p">,</span> <span class="n">message</span> |
| <span class="k">FROM</span> <span class="n">my_table_parquet</span> |
| <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&lt;</span> <span class="ss">&quot;2018-01-01&quot;</span> |
| <span class="k">AND</span> <span class="k">year</span> <span class="o">=</span> <span class="k">year</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="k">month</span> <span class="o">=</span> <span class="k">month</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="k">day</span> <span class="o">=</span> <span class="k">day</span><span class="p">(</span><span class="n">time</span><span class="p">);</span></code></pre></div> |
| |
| <p>Each <code>SELECT</code> clause explicitly lists all of the columns to expose. This ensures that |
| the year, month, and day columns that are unique to the Parquet table are not exposed. |
| If needed, it also allows any necessary column or type mapping to be handled.</p> |
| |
| <p>The initial <code>WHERE</code> clauses applied to both my_table_kudu and my_table_parquet define |
| the boundary between Kudu and HDFS to ensure duplicate data is not read while in the |
| process of offloading data.</p> |
| |
| <p>The additional predicates applied to my_table_parquet ensure good predicate |
| pushdown on the individual year, month, and day columns.</p> |
| |
| <p>WARNING: As stated earlier, be sure to use <code>UNION ALL</code> and not <code>UNION</code>. The <code>UNION</code> |
| keyword by itself is the same as <code>UNION DISTINCT</code> and can have significant performance |
| impact. More information can be found in the Impala |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_union.html"><code>UNION</code> documentation</a>.</p> |
| |
| <h3 id="ongoing-steps">Ongoing Steps</h3> |
| |
| <p>Now that the base tables and view are created, prepare the ongoing steps to maintain |
| the sliding window. Because these ongoing steps should be scheduled to run on a |
| regular basis, the examples below are shown using <code>.sql</code> files that take variables |
| which can be passed from your scripts and scheduling tool of choice.</p> |
| |
| <p>Create the <code>window_data_move.sql</code> file to move the data from the oldest partition to HDFS:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">INSERT</span> <span class="k">INTO</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">hdfs_table</span><span class="err">}</span> <span class="n">PARTITION</span> <span class="p">(</span><span class="k">year</span><span class="p">,</span> <span class="k">month</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span> |
| <span class="k">SELECT</span> <span class="o">*</span><span class="p">,</span> <span class="k">year</span><span class="p">(</span><span class="n">time</span><span class="p">),</span> <span class="k">month</span><span class="p">(</span><span class="n">time</span><span class="p">),</span> <span class="k">day</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> |
| <span class="k">FROM</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span> |
| <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&gt;=</span> <span class="n">add_months</span><span class="p">(</span><span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="n">time</span> <span class="o">&lt;</span> <span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">;</span> |
| <span class="n">COMPUTE</span> <span class="n">INCREMENTAL</span> <span class="n">STATS</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">hdfs_table</span><span class="err">}</span><span class="p">;</span></code></pre></div> |
| |
| <p>Note: The |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html">COMPUTE INCREMENTAL STATS</a> |
| statement is not required but helps Impala to optimize queries.</p> |
| |
| <p>To run the SQL statement, use the Impala shell and pass the required variables. |
| Below is an example:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_data_move.sql |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span></code></pre></div> |
| |
| <p>Note: You can adjust the <code>WHERE</code> clause to match the given period and cadence of your |
| offload. Here the <code>add_months</code> function is called with an argument of -1 so that the |
| month of data preceding the new boundary time is moved.</p> |
| |
| <p>Create the <code>window_view_alter.sql</code> file to shift the time boundary forward by altering |
| the unified view:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">ALTER</span> <span class="k">VIEW</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">view_name</span><span class="err">}</span> <span class="k">AS</span> |
| <span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="n">time</span><span class="p">,</span> <span class="n">message</span> |
| <span class="k">FROM</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span> |
| <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&gt;=</span> <span class="ss">&quot;${var:new_boundary_time}&quot;</span> |
| <span class="k">UNION</span> <span class="k">ALL</span> |
| <span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="n">time</span><span class="p">,</span> <span class="n">message</span> |
| <span class="k">FROM</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">hdfs_table</span><span class="err">}</span> |
| <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&lt;</span> <span class="ss">&quot;${var:new_boundary_time}&quot;</span> |
| <span class="k">AND</span> <span class="k">year</span> <span class="o">=</span> <span class="k">year</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="k">month</span> <span class="o">=</span> <span class="k">month</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="k">day</span> <span class="o">=</span> <span class="k">day</span><span class="p">(</span><span class="n">time</span><span class="p">);</span></code></pre></div> |
| |
| <p>To run the SQL statement, use the Impala shell and pass the required variables. |
| Below is an example:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_view_alter.sql |
| --var<span class="o">=</span><span class="nv">view_name</span><span class="o">=</span>my_table_view |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span></code></pre></div> |
| |
| <p>Create the <code>window_partition_shift.sql</code> file to shift the Kudu partitions forward:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">ALTER</span> <span class="k">TABLE</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span> |
| |
| <span class="k">ADD</span> <span class="n">RANGE</span> <span class="n">PARTITION</span> <span class="n">add_months</span><span class="p">(</span><span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">,</span> |
| <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">window_length</span><span class="err">}</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="n">add_months</span><span class="p">(</span><span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">,</span> |
| <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">window_length</span><span class="err">}</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span> |
| |
| <span class="k">ALTER</span> <span class="k">TABLE</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span> |
| |
| <span class="k">DROP</span> <span class="n">RANGE</span> <span class="n">PARTITION</span> <span class="n">add_months</span><span class="p">(</span><span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> |
| <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">;</span></code></pre></div> |
| |
| <p>To run the SQL statement, use the Impala shell and pass the required variables. |
| Below is an example:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_partition_shift.sql |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span> |
| --var<span class="o">=</span><span class="nv">window_length</span><span class="o">=</span>3</code></pre></div> |
| |
| <p>Note: You should periodically run |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html">COMPUTE STATS</a> |
| on your Kudu table to ensure Impala’s query performance is optimal.</p> |
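| <p>For example, a scheduled maintenance job could run:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql">COMPUTE STATS my_table_kudu;</code></pre></div> |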
| |
| <h3 id="experimentation">Experimentation</h3> |
| |
| <p>Now that you have created the tables, view, and scripts to leverage the sliding |
| window pattern, you can experiment with them by inserting data for different time |
| ranges and running the scripts to move the window forward through time.</p> |
| |
| <p>Insert some sample values into the Kudu table:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">my_table_kudu</span> <span class="k">VALUES</span> |
| <span class="p">(</span><span class="s1">&#39;joey&#39;</span><span class="p">,</span> <span class="s1">&#39;2018-01-01&#39;</span><span class="p">,</span> <span class="s1">&#39;hello&#39;</span><span class="p">),</span> |
| <span class="p">(</span><span class="s1">&#39;ross&#39;</span><span class="p">,</span> <span class="s1">&#39;2018-02-01&#39;</span><span class="p">,</span> <span class="s1">&#39;goodbye&#39;</span><span class="p">),</span> |
| <span class="p">(</span><span class="s1">&#39;rachel&#39;</span><span class="p">,</span> <span class="s1">&#39;2018-03-01&#39;</span><span class="p">,</span> <span class="s1">&#39;hi&#39;</span><span class="p">);</span></code></pre></div> |
| |
| <p>Show the data in each table/view:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></div> |
| |
| <p>Move the January data into HDFS:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_data_move.sql |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span></code></pre></div> |
| |
| <p>Confirm the data is in both places, but not duplicated in the view:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></div> |
| |
| <p>Alter the view to shift the time boundary forward to February:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_view_alter.sql |
| --var<span class="o">=</span><span class="nv">view_name</span><span class="o">=</span>my_table_view |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span></code></pre></div> |
| |
| <p>Confirm the data is still in both places, but not duplicated in the view:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></div> |
| |
| <p>Shift the Kudu partitions forward:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_partition_shift.sql |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span> |
| --var<span class="o">=</span><span class="nv">window_length</span><span class="o">=</span>3</code></pre></div> |
| |
| <p>Confirm the January data is now only in HDFS:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></div> |
| |
| <p>Confirm predicate push down with Impala’s EXPLAIN statement:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span> |
| <span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span> <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&lt;</span> <span class="ss">&quot;2018-02-01&quot;</span><span class="p">;</span> |
| <span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span> <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&gt;</span> <span class="ss">&quot;2018-02-01&quot;</span><span class="p">;</span></code></pre></div> |
| |
| <p>In the explain output you should see “kudu predicates” which include the time column |
| filters in the “SCAN KUDU” section and “predicates” which include the time, day, month, |
| and year columns in the “SCAN HDFS” section.</p></content><author><name>Grant Henke</name></author><summary>Note: This is a cross-post from the Cloudera Engineering Blog |
| Transparent Hierarchical Storage Management with Apache Kudu and Impala |
| |
| When picking a storage option for an application it is common to pick a single |
| storage option which has the most applicable features to your use case. For mutability |
| and real-time analytics workloads you may want to use Apache Kudu, but for massive |
| scalability at a low cost you may want to use HDFS. For that reason, there is a need |
| for a solution that allows you to leverage the best features of multiple storage |
| options. This post describes the sliding window pattern using Apache Impala with data |
| stored in Apache Kudu and Apache HDFS. With this pattern you get all of the benefits |
| of multiple storage layers in a way that is transparent to users.</summary></entry><entry><title>Call for Posts</title><link href="/2018/12/11/call-for-posts.html" rel="alternate" type="text/html" title="Call for Posts" /><published>2018-12-11T00:00:00-08:00</published><updated>2018-12-11T00:00:00-08:00</updated><id>/2018/12/11/call-for-posts</id><content type="html" xml:base="/2018/12/11/call-for-posts.html"><p>Most of the posts in the Kudu blog have been written by the project’s |
| committers and are either technical or news-like in nature. We’d like to hear |
| how you’re using Kudu in production, in testing, or in your hobby project and |
| we’d like to share it with the world!</p> |
| |
| <!--more--> |
| |
| <p>If you’d like to tell the world about how you are using Kudu in your project, |
| now is the time.</p> |
| |
| <p>To learn how to submit posts, read our <a href="/docs/contributing.html#_blog_posts">contributing |
| documentation</a>. Alternatively, you can |
| draft your post in Google Docs and share it with us at |
| <a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#100;&#101;&#118;&#064;&#107;&#117;&#100;&#117;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">&#100;&#101;&#118;&#064;&#107;&#117;&#100;&#117;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;</a> and we’re happy to review it |
| and post it to the blog for you.</p></content><author><name>Attila Bukor</name></author><summary>Most of the posts in the Kudu blog have been written by the project’s |
| committers and are either technical or news-like in nature. We’d like to hear |
| how you’re using Kudu in production, in testing, or in your hobby project and |
| we’d like to share it with the world!</summary></entry><entry><title>Apache Kudu 1.8.0 Released</title><link href="/2018/10/26/apache-kudu-1-8-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.8.0 Released" /><published>2018-10-26T00:00:00-07:00</published><updated>2018-10-26T00:00:00-07:00</updated><id>/2018/10/26/apache-kudu-1-8-0-released</id><content type="html" xml:base="/2018/10/26/apache-kudu-1-8-0-released.html"><p>The Apache Kudu team is happy to announce the release of Kudu 1.8.0!</p> |
| |
| <p>The new release adds several new features and improvements, including the |
| following:</p> |
| |
| <!--more--> |
| |
| <ul> |
|   <li>Introduced a manual data rebalancer tool which can be used to redistribute |
| table replicas among tablet servers</li> |
| <li>Added support for <code>IS NULL</code> and <code>IS NOT NULL</code> predicates to the Kudu Python |
| client</li> |
| <li>Multiple tooling improvements make diagnostics and troubleshooting simpler</li> |
| <li>The Kudu Spark connector now supports Spark Streaming DataFrames</li> |
| <li>Added Pandas support to the Python client</li> |
| </ul> |
| |
| <p>The above is just a list of the highlights; for a more complete list of new |
| features, improvements, and fixes, please refer to the <a href="/releases/1.8.0/docs/release_notes.html">release |
| notes</a>.</p> |
| |
| <p>The Apache Kudu project only publishes source code releases. To build Kudu |
| 1.8.0, follow these steps:</p> |
| |
| <ul> |
| <li>Download the Kudu <a href="/releases/1.8.0">1.8.0 source release</a></li> |
| <li>Follow the instructions in the documentation to build Kudu <a href="/releases/1.8.0/docs/installation.html#build_from_source">1.8.0 from |
| source</a></li> |
| </ul> |
| |
| <p>For your convenience, binary JAR files for the Kudu Java client library, Spark |
| DataSource, Flume sink, and other Java integrations are published to the ASF |
| Maven repository and are <a href="https://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.kudu%22%20AND%20v%3A%221.8.0%22">now |
| available</a>.</p> |
| |
| <p>The Python client source is also available on |
| <a href="https://pypi.org/project/kudu-python/">PyPI</a>.</p></content><author><name>Attila Bukor</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 1.8.0! |
| |
| The new release adds several new features and improvements, including the |
| following:</summary></entry><entry><title>Index Skip Scan Optimization in Kudu</title><link href="/2018/09/26/index-skip-scan-optimization-in-kudu.html" rel="alternate" type="text/html" title="Index Skip Scan Optimization in Kudu" /><published>2018-09-26T00:00:00-07:00</published><updated>2018-09-26T00:00:00-07:00</updated><id>/2018/09/26/index-skip-scan-optimization-in-kudu</id><content type="html" xml:base="/2018/09/26/index-skip-scan-optimization-in-kudu.html"><p>This summer I got the opportunity to intern with the Apache Kudu team at Cloudera. |
| My project was to optimize the Kudu scan path by implementing a technique called |
| index skip scan (a.k.a. scan-to-seek, see section 4.1 in [1]). I wanted to share |
| my experience and the progress we’ve made so far on the approach.</p> |
| |
| <!--more--> |
| |
| <p>Let’s begin by discussing the current query flow in Kudu. |
| Consider the following table:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">metrics</span> <span class="p">(</span> |
| <span class="k">host</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="n">tstamp</span> <span class="nb">INT</span><span class="p">,</span> |
| <span class="n">clusterid</span> <span class="nb">INT</span><span class="p">,</span> |
| <span class="k">role</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="p">(</span><span class="k">host</span><span class="p">,</span> <span class="n">tstamp</span><span class="p">,</span> <span class="n">clusterid</span><span class="p">)</span> |
| <span class="p">);</span></code></pre></div> |
| |
| <p><img src="/img/index-skip-scan/example-table.png" alt="png" class="img-responsive" /> |
| <em>Sample rows of table <code>metrics</code> (sorted by key columns).</em></p> |
| |
| <p>In this case, by default, Kudu internally builds a primary key index (implemented as a |
| <a href="https://en.wikipedia.org/wiki/B-tree">B-tree</a>) for the table <code>metrics</code>. |
| As shown in the table above, the index data is sorted by the composite of all key columns. |
| When the user query contains the first key column (<code>host</code>), Kudu uses the index (as the index data is |
| primarily sorted on the first key column).</p> |
| |
| <p>Now, what if the user query does not contain the first key column and instead only contains the <code>tstamp</code> column? |
| In the above case, the <code>tstamp</code> column values are sorted with respect to <code>host</code>, |
| but are not globally sorted, and as such, it’s non-trivial to use the index to filter rows. |
| Instead, a full tablet scan is done by default. Other databases may optimize such scans by building secondary indexes |
| (though it might be redundant to build one on one of the primary keys). However, this isn’t an option for Kudu, |
| given its lack of secondary index support.</p> |
| |
| <p>The question is, can Kudu do better than a full tablet scan here?</p> |
| |
| <p>The answer is yes! Let’s observe the column preceding the <code>tstamp</code> column. We will refer to it as the |
| “prefix column” and its specific value as the “prefix key”. In this example, <code>host</code> is the prefix column. |
| Note that the prefix keys are sorted in the index and that all rows of a given prefix key are also sorted by the |
| remaining key columns. Therefore, we can use the index to skip to the rows that have distinct prefix keys, |
| and also satisfy the predicate on the <code>tstamp</code> column. |
| For example, consider the query:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="n">clusterid</span> <span class="k">FROM</span> <span class="n">metrics</span> <span class="k">WHERE</span> <span class="n">tstamp</span> <span class="o">=</span> <span class="mi">100</span><span class="p">;</span></code></pre></div> |
| |
| <p><img src="/img/index-skip-scan/skip-scan-example-table.png" alt="png" class="img-responsive" /> |
| <em>Skip scan flow illustration. The rows in green are scanned and the rest are skipped.</em></p> |
| |
| <p>The tablet server can use the index to <strong>skip</strong> to the first row with a distinct prefix key (<code>host = helium</code>) that |
| matches the predicate (<code>tstamp = 100</code>) and then <strong>scan</strong> through the rows until the predicate no longer matches. At that |
| point we would know that no more rows with <code>host = helium</code> will satisfy the predicate, and we can skip to the next |
| prefix key. This holds true for all distinct keys of <code>host</code>. Hence, this method is popularly known as |
| <strong>skip scan optimization</strong>[2, 3].</p> |
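| |
| <p>One way to picture the optimization: the scan behaves as if the query were |
| rewritten into one index-backed seek per distinct prefix key. For a table with |
| just two hosts (<code>ubuntu</code> is a hypothetical value alongside |
| <code>helium</code> from the illustration above), that would be:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql">-- Each branch seeks via the (host, tstamp, clusterid) primary key index. |
| SELECT clusterid FROM metrics WHERE host = &quot;helium&quot; AND tstamp = 100 |
| UNION ALL |
| SELECT clusterid FROM metrics WHERE host = &quot;ubuntu&quot; AND tstamp = 100;</code></pre></div> |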
| |
| <h1 id="performance">Performance</h1> |
| |
| <p>This optimization can speed up queries significantly, depending on the cardinality (number of distinct values) of the |
| prefix column. The lower the prefix column cardinality, the better the skip scan performance. In fact, when the |
| prefix column cardinality is high, skip scan is not a viable approach. The performance graph (obtained using the example |
| schema and query pattern mentioned earlier) is shown below.</p> |
| |
| <p>Based on our experiments on tablets of up to 10 million rows (as shown below), we found that skip scan |
| performance begins to fall below full tablet scan performance when the prefix column cardinality |
| exceeds sqrt(number_of_rows_in_tablet). |
| Therefore, to capture the benefits of skip scan when possible while maintaining consistent performance in cases |
| of large prefix column cardinality, we have tentatively chosen to dynamically disable skip scan when the number of skips for |
| distinct prefix keys exceeds sqrt(number_of_rows_in_tablet). |
| It will be an interesting project to further explore sophisticated heuristics to decide when |
| to dynamically disable skip scan.</p> |
| |
| <p><img src="/img/index-skip-scan/skip-scan-performance-graph.png" alt="png" class="img-responsive" /></p> |
| |
| <h1 id="conclusion">Conclusion</h1> |
| |
| <p>Skip scan optimization in Kudu can lead to huge performance benefits that scale with the size of |
| data in Kudu tablets. This is a work-in-progress <a href="https://gerrit.cloudera.org/#/c/10983/">patch</a>. |
| The implementation in the patch works only for equality predicates on the non-first primary key |
| columns. Note that although the specific example above has just one prefix |
| column (<code>host</code>), the approach generalizes to any number of prefix columns.</p> |
| |
| <p>This work also lays the groundwork to leverage the skip scan approach and optimize query processing time in the |
| following use cases:</p> |
| |
| <ul> |
| <li>Range predicates</li> |
| <li>In-list predicates</li> |
| </ul> |
| |
| <p>This was my first time working on an open source project. I thoroughly enjoyed working on this challenging problem, |
| right from understanding the scan path in Kudu to working on a full-fledged implementation of |
| the skip scan optimization. I am very grateful to the Kudu team for guiding and supporting me throughout the |
| internship period.</p> |
| |
| <h1 id="references">References</h1> |
| |
| <p><a href="https://storage.googleapis.com/pub-tools-public-publication-data/pdf/42851.pdf">[1]</a>: Gupta, Ashish, et al. “Mesa: |
| Geo-replicated, near real-time, scalable data warehousing.” Proceedings of the VLDB Endowment 7.12 (2014): 1259-1270.</p> |
| |
| <p><a href="https://oracle-base.com/articles/9i/index-skip-scanning/">[2]</a>: Index Skip Scanning - Oracle Database</p> |
| |
| <p><a href="https://www.sqlite.org/optoverview.html#skipscan">[3]</a>: Skip Scan - SQLite</p></content><author><name>Anupama Gupta</name></author><summary>This summer I got the opportunity to intern with the Apache Kudu team at Cloudera. |
| My project was to optimize the Kudu scan path by implementing a technique called |
| index skip scan (a.k.a. scan-to-seek, see section 4.1 in [1]). I wanted to share |
| my experience and the progress we’ve made so far on the approach.</summary></entry><entry><title>Simplified Data Pipelines with Kudu</title><link href="/2018/09/11/simplified-pipelines-with-kudu.html" rel="alternate" type="text/html" title="Simplified Data Pipelines with Kudu" /><published>2018-09-11T00:00:00-07:00</published><updated>2018-09-11T00:00:00-07:00</updated><id>/2018/09/11/simplified-pipelines-with-kudu</id><content type="html" xml:base="/2018/09/11/simplified-pipelines-with-kudu.html"><p>I’ve been working with Hadoop now for over seven years and fortunately, or unfortunately, have run |
| across a lot of structured data use cases. What we, at <a href="https://phdata.io/">phData</a>, have found is |
| that end users are typically comfortable with tabular data and prefer to access their data in a |
| structured manner using tables. |
| <!--more--></p> |
| |
| <p>When working on new structured data projects, the first question we always get from non-Hadoop |
| followers is, <em>“how do I update or delete a record?”</em> The second question we get is, <em>“when adding |
| records, why don’t they show up in Impala right away?”</em> For those of us who have worked with HDFS |
| and Impala on HDFS for years, these are simple questions to answer, but hard ones to explain.</p> |
| |
| <p>The pre-Kudu years were filled with 100’s (or 1000’s) of self-join views (or materialization jobs) |
| and compaction jobs, along with scheduled jobs to refresh Impala cache periodically so new records |
| show up. And while doable, for 10,000’s of tables, this basically became a distraction from solving |
| real business problems.</p> |
| |
| <p>With the introduction of Kudu, mixing record level updates, deletes, and inserts, while supporting |
| large scans, are now something we can sustainably manage at scale. HBase is very good at record |
| level updates, deletes and inserts, but doesn’t scale well for analytic use cases that often do full |
| table scans. Moreover, for streaming use cases, changes are available in near real-time. End users, |
| accustomed to having to <em>”wait”</em> for their data, can now consume the data as it arrives in their |
| table.</p> |
| |
| <p>A common data ingest pattern where Kudu becomes necessary is change data capture (CDC). That is, |
| capturing the inserts, updates, hard deletes, and streaming them into Kudu where they can be applied |
| immediately. Pre-Kudu this pipeline was very tedious to implement. Now with tools like |
| <a href="https://streamsets.com/">StreamSets</a>, you can get up and running in a few hours.</p> |
| |
| <p>A second common workflow is near real-time analytics. We’ve streamed data off mining trucks, |
| oil wells, manufacturing lines, and needed to make that data available to end users immediately. No |
| longer do we need to batch up writes, flush to HDFS and then refresh cache in Impala. As mentioned |
| before, with Kudu, the data is available as soon as it lands. This has been a significant |
| enhancement for end users, who previously had to <em>”wait”</em> for data.</p> |
| |
| <p>In summary, Kudu has made a tremendous impact in removing the operational distractions of merging in |
| changes, and refreshing the cache of downstream consumers. This now allows data engineers |
| and users to focus on solving business problems, rather than being bothered by the tediousness of |
| the backend.</p></content><author><name>Mac Noland</name></author><summary>I’ve been working with Hadoop now for over seven years and fortunately, or unfortunately, have run |
| across a lot of structured data use cases. What we, at phData, have found is |
| that end users are typically comfortable with tabular data and prefer to access their data in a |
| structured manner using tables.</summary></entry><entry><title>Getting Started with Kudu - an O’Reilly Title</title><link href="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html" rel="alternate" type="text/html" title="Getting Started with Kudu - an O'Reilly Title" /><published>2018-08-06T00:00:00-07:00</published><updated>2018-08-06T00:00:00-07:00</updated><id>/2018/08/06/getting-started-with-kudu-an-oreilly-title</id><content type="html" xml:base="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html"><p>The following article by Brock Noland was reposted from the |
| <a href="https://www.phdata.io/getting-started-with-kudu/">phData</a> |
| blog with their permission.</p> |
| |
| <p>Five years ago, enabling Data Science and Advanced Analytics on the |
| Hadoop platform was hard. Organizations required strong Software Engineering |
| capabilities to successfully implement complex Lambda architectures or even |
| simply implement continuous ingest. Updating or deleting data was simply a |
| nightmare. General Data Protection Regulation (GDPR) would have been an extreme |
| challenge at that time. |
| <!--more--> |
| In that context, on October 11th, 2012, Todd Lipcon performed Apache Kudu’s initial |
| commit. The commit message was:</p> |
| |
| <pre><code>Code for writing cfiles seems to basically work |
| Need to write code for reading cfiles, still |
| </code></pre> |
| |
| <p>And Kudu development was off and running. Around this same time Todd, on his |
| internal Wiki page, started listing out the papers he was reading to develop |
| the theoretical background for creating Kudu. I followed along, reading as many |
| as I could, understanding little, because I knew Todd was up to something |
| important. About a year after that initial commit, I got my |
| <a href="https://github.com/apache/kudu/commit/1d7e6864b4a31d3fe6897e4cb484dfcda6608d43">first Kudu commit</a>, |
| documenting the upper bound of a library. This is a small contribution of which I am still |
| proud.</p> |
| |
| <p>In the meantime, I was lucky enough to be a founder of a Hadoop Managed Services |
| and Consulting company known as <a href="http://phdata.io/">phData</a>. We found that a majority |
| of our customers had use cases which Kudu vastly simplified. Whether it’s Change Data |
| Capture (CDC) from thousands of source tables to Internet of Things (IoT) ingest, Kudu |
| makes life much easier as both an operator of a Hadoop cluster and a developer providing |
| business value on the platform.</p> |
| |
| <p>Through this work, I was lucky enough to be a co-author of |
| <a href="http://shop.oreilly.com/product/0636920065739.do">Getting Started with Kudu</a>. |
| The book is a summation of what my co-authors, Jean-Marc Spaggiari, Mladen |
| Kovacevic, and Ryan Bosshart, and I learned while cutting our teeth on early versions |
| of Kudu. Specifically, you will learn:</p> |
| |
| <ul> |
|   <li>A theoretical understanding of Kudu concepts, explained in plain words and simple diagrams</li> |
| <li>Why, for many use cases, using Kudu is so much easier than other ecosystem storage technologies</li> |
| <li>How Kudu enables Hybrid Transactional/Analytical Processing (HTAP) use cases</li> |
| <li>How to design IoT, Predictive Modeling, and Mixed Platform Solutions using Kudu</li> |
| <li>How to design Kudu Schemas</li> |
| </ul> |
| |
| <p><img src="/img/2018-08-06-getting-started-with-kudu-an-oreilly-title.gif" alt="Getting Started with Kudu Cover" class="img-responsive" /></p> |
| |
| <p>Looking forward, I am excited to see Kudu gain additional features and adoption |
| and eventually the second revision of this title. In the meantime, if you have |
| feedback or questions, please reach out on the <code>#getting-started-kudu</code> channel of |
| the <a href="https://getkudu-slack.herokuapp.com/">Kudu Slack</a> or if you prefer non-real-time |
| communication, please use the user@ mailing list!</p></content><author><name>Brock Noland</name></author><summary>The following article by Brock Noland was reposted from the |
| phData |
| blog with their permission. |
| |
| Five years ago, enabling Data Science and Advanced Analytics on the |
| Hadoop platform was hard. Organizations required strong Software Engineering |
| capabilities to successfully implement complex Lambda architectures or even |
| simply implement continuous ingest. Updating or deleting data was simply a |
| nightmare. General Data Protection Regulation (GDPR) would have been an extreme |
| challenge at that time.</summary></entry><entry><title>Instrumentation in Apache Kudu</title><link href="/2018/07/10/instrumentation-in-kudu.html" rel="alternate" type="text/html" title="Instrumentation in Apache Kudu" /><published>2018-07-10T00:00:00-07:00</published><updated>2018-07-10T00:00:00-07:00</updated><id>/2018/07/10/instrumentation-in-kudu</id><content type="html" xml:base="/2018/07/10/instrumentation-in-kudu.html"><p>Last week, the <a href="http://opentracing.io/">OpenTracing</a> community invited me to |
| their monthly Google Hangout meetup to give an informal talk on tracing and |
| instrumentation in Apache Kudu.</p> |
| |
| <p>While Kudu doesn’t currently support distributed tracing using OpenTracing, |
| it does have quite a lot of other types of instrumentation, metrics, and |
| diagnostics logging. The OpenTracing team was interested to hear about some of |
| the approaches that Kudu has used, and so I gave a brief introduction to topics |
| including: |
| <!--more--> |
| - The Kudu <a href="/docs/administration.html#_diagnostics_logging">diagnostics log</a> |
| which periodically logs metrics and stack traces. |
| - The <a href="/docs/troubleshooting.html#kudu_tracing">process-wide tracing</a> |
| support based on the open source tracing framework implemented by Google Chrome. |
| - The <a href="/docs/troubleshooting.html#kudu_tracing">stack watchdog</a> |
| which helps us find various latency outliers and issues in our libraries and |
| the Linux kernel. |
| - <a href="/docs/troubleshooting.html#heap_sampling">Heap sampling</a> support |
| which helps us understand unexpected memory usage.</p> |
| |
| <p>If you’re interested in learning about these topics and more, check out the video recording |
| below. My talk spans the first 34 minutes.</p> |
| |
| <iframe width="800" height="500" src="https://www.youtube.com/embed/qBXwKU6Ubjo?end=2058&amp;start=23"> |
| </iframe> |
| |
| <p>If you have any questions about this content or about Kudu in general, |
| <a href="http://kudu.apache.org/community.html">join the community</a>.</p></content><author><name>Todd Lipcon</name></author><summary>Last week, the OpenTracing community invited me to |
| their monthly Google Hangout meetup to give an informal talk on tracing and |
| instrumentation in Apache Kudu. |
| |
| While Kudu doesn’t currently support distributed tracing using OpenTracing, |
| it does have quite a lot of other types of instrumentation, metrics, and |
| diagnostics logging. The OpenTracing team was interested to hear about some of |
| the approaches that Kudu has used, and so I gave a brief introduction to topics |
| including:</summary></entry><entry><title>Apache Kudu 1.7.0 released</title><link href="/2018/03/23/apache-kudu-1-7-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.7.0 released" /><published>2018-03-23T00:00:00-07:00</published><updated>2018-03-23T00:00:00-07:00</updated><id>/2018/03/23/apache-kudu-1-7-0-released</id><content type="html" xml:base="/2018/03/23/apache-kudu-1-7-0-released.html"><p>The Apache Kudu team is happy to announce the release of Kudu 1.7.0!</p> |
| |
| <p>Apache Kudu 1.7.0 is a minor release that offers new features, performance |
| optimizations, incremental improvements, and bug fixes.</p> |
| |
| <p>Release highlights:</p> |
| |
| <!--more--> |
| |
| <ol> |
| <li>Kudu now supports the decimal column type. The decimal type is a numeric |
| data type with fixed scale and precision suitable for financial and other |
| arithmetic calculations where the imprecise representation and rounding |
| behavior of float and double make those types impractical. The decimal type |
| is also useful for integers larger than int64 and cases with fractional values |
| in a primary key. See <a href="/releases/1.7.0/docs/schema_design.html#decimal">Decimal Type</a> |
| for more details.</li> |
| <li>The strategy Kudu uses for automatically healing tablets which have lost a |
| replica due to server or disk failures has been improved. The new re-replication |
| strategy, or replica management scheme, first adds a replacement tablet replica |
| before evicting the failed one.</li> |
|   <li>A new scan read mode, READ_YOUR_WRITES, has been added. Users can specify it when |
| creating a new scanner in the C++, Java, and Python clients. If this mode is used, |
| the client will perform a read such that it follows all previously known writes |
| and reads from this client. Reads in this mode ensure read-your-writes and |
| read-your-reads session guarantees, while minimizing latency caused by waiting |
| for outstanding write transactions to complete. Note that this is still an |
| experimental feature which may be stabilized in future releases.</li> |
| <li>The tablet server web UI scans dashboard (/scans) has been improved with several |
| new features, including: showing the most recently completed scans, a pseudo-SQL |
| scan descriptor that concisely shows the selected columns and applied predicates, |
| and more complete and better documented scan statistics.</li> |
| <li>Kudu daemons now expose a web page /stacks which dumps the current stack trace of |
| every thread running in the server. This information can be helpful when diagnosing |
| performance issues.</li> |
| <li>By default, each tablet replica will now stripe data blocks across 3 data directories |
| instead of all data directories. This decreases the likelihood that any given tablet |
| will be affected in the event of a single disk failure.</li> |
| <li>The Java client now uses a predefined prioritized list of TLS ciphers when |
| establishing an encrypted connection to Kudu servers. This cipher list matches the |
| list of ciphers preferred for server-to-server communication and ensures that the |
| most efficient and secure ciphers are preferred. When the Kudu client is running on |
| Java 8 or newer, this provides a substantial speed-up to read and write performance.</li> |
| <li>The performance of inserting rows containing many string or binary columns has been |
| improved, especially in the case of highly concurrent write workloads.</li> |
| <li>The Java client will now automatically attempt to re-acquire Kerberos credentials |
| from the ticket cache when the prior credentials are about to expire. This allows |
| client instances to persist longer than the expiration time of a single Kerberos |
| ticket so long as some other process renews the credentials in the ticket cache.</li> |
| </ol> |
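| |
| <p>As a quick illustration of the first highlight, a decimal column can now be |
| used in an Impala CREATE TABLE statement for a Kudu table, including in the |
| primary key (the table below is a hypothetical example, not taken from the |
| release notes):</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql">-- DECIMAL(8, 2) stores exact values such as currency amounts. |
| CREATE TABLE prices ( |
|   id BIGINT, |
|   price DECIMAL(8, 2), |
|   PRIMARY KEY (id) |
| ) |
| PARTITION BY HASH (id) PARTITIONS 4 |
| STORED AS KUDU;</code></pre></div> |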
| |
| <p>For more details, and the complete list of changes in Kudu 1.7.0, please see |
| the <a href="/releases/1.7.0/docs/release_notes.html">Kudu 1.7.0 release notes</a>.</p> |
| |
| <p>The Apache Kudu project only publishes source code releases. To build Kudu |
| 1.7.0, follow these steps:</p> |
| |
| <ol> |
| <li>Download the <a href="/releases/1.7.0/">Kudu 1.7.0 source release</a>.</li> |
| <li>Follow the instructions in the documentation to |
| <a href="/releases/1.7.0/docs/installation.html#build_from_source">build Kudu 1.7.0 from source</a>.</li> |
| </ol> |
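| <p>If you would rather not build the Java client yourself, the prebuilt |
| artifacts described below can be pulled in as a Maven dependency instead. A |
| minimal sketch for the core client library follows; swap in the artifact for |
| the integration you need (for example, the Spark DataSource or Flume sink):</p> |

```xml
<!-- Kudu Java client, as published to the ASF Maven repository -->
<dependency>
  <groupId>org.apache.kudu</groupId>
  <artifactId>kudu-client</artifactId>
  <version>1.7.0</version>
</dependency>
```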
| |
| <p>For your convenience, binary JAR files for the Kudu Java client library, Spark |
| DataSource, Flume sink, and other Java integrations are published to the ASF |
| Maven repository and are |
| <a href="https://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.kudu%22%20AND%20v%3A%221.7.0%22">now available</a>.</p></content><author><name>Grant Henke</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 1.7.0! |
| |
| Apache Kudu 1.7.0 is a minor release that offers new features, performance |
| optimizations, incremental improvements, and bug fixes. |
| |
| Release highlights:</summary></entry><entry><title>Apache Kudu 1.6.0 released</title><link href="/2017/12/08/apache-kudu-1-6-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.6.0 released" /><published>2017-12-08T00:00:00-08:00</published><updated>2017-12-08T00:00:00-08:00</updated><id>/2017/12/08/apache-kudu-1-6-0-released</id><content type="html" xml:base="/2017/12/08/apache-kudu-1-6-0-released.html"><p>The Apache Kudu team is happy to announce the release of Kudu 1.6.0!</p> |
| |
| <p>Apache Kudu 1.6.0 is a minor release that offers new features, performance |
| optimizations, incremental improvements, and bug fixes.</p> |
| |
| <p>Release highlights:</p> |
| |
| <!--more--> |
| |
| <ol> |
| <li>Kudu servers can now tolerate short interruptions in NTP clock |
| synchronization. NTP synchronization is still required when any Kudu daemon |
| starts up.</li> |
| <li>Tablet servers will no longer crash when a disk containing data blocks |
| fails, unless that disk also stores WAL segments or tablet metadata. Instead |
| of crashing, the tablet server will shut down any tablets that may have lost |
| data locally and Kudu will re-replicate the affected tablets to another |
| tablet server. More information can be found in the documentation under |
| <a href="/releases/1.6.0/docs/administration.html#disk_failure_recovery">Recovering from Disk Failure</a>.</li> |
| <li>Tablet server startup time has been improved significantly on servers |
| containing large numbers of blocks.</li> |
| <li>The Spark DataSource integration can now take advantage of scan locality for |
| better scan performance. The scan will take place at the closest replica |
| instead of going to the leader.</li> |
| <li>Support for Spark 1 has been removed in Kudu 1.6.0; only Spark 2 is now |
| supported. Spark 1 support was deprecated in Kudu 1.5.0.</li> |
| <li>HybridTime timestamp propagation now works in the Java client when using |
| scan tokens.</li> |
| <li>Tablet servers now consider the health of all replicas of a tablet before |
| deciding to evict one. This can improve the stability of the Kudu cluster |
| when multiple servers temporarily go down at the same time.</li> |
| <li>A bug in the C++ client was fixed that could cause tablets to be erroneously |
| pruned, or skipped, during certain scans, resulting in fewer results than |
| expected being returned from queries. The bug only affected tables whose |
| range partition columns are a proper prefix of the primary key. |
| See <a href="https://issues.apache.org/jira/browse/KUDU-2173">KUDU-2173</a> for more |
| information.</li> |
| </ol> |
| |
| <p>For more details, and the complete list of changes in Kudu 1.6.0, please see |
| the <a href="/releases/1.6.0/docs/release_notes.html">Kudu 1.6.0 release notes</a>.</p> |
| |
| <p>The Apache Kudu project only publishes source code releases. To build Kudu |
| 1.6.0, follow these steps:</p> |
| |
| <ol> |
| <li>Download the <a href="/releases/1.6.0/">Kudu 1.6.0 source release</a>.</li> |
| <li>Follow the instructions in the documentation to |
| <a href="/releases/1.6.0/docs/installation.html#build_from_source">build Kudu 1.6.0 from source</a>.</li> |
| </ol> |
| |
| <p>For your convenience, binary JAR files for the Kudu Java client library, Spark |
| DataSource, Flume sink, and other Java integrations are published to the ASF |
| Maven repository and are |
| <a href="https://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.kudu%22%20AND%20v%3A%221.6.0%22">now available</a>.</p></content><author><name>Mike Percy</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 1.6.0! |
| |
| Apache Kudu 1.6.0 is a minor release that offers new features, performance |
| optimizations, incremental improvements, and bug fixes. |
| |
| Release highlights:</summary></entry><entry><title>Slides: A brave new world in mutable big data: Relational storage</title><link href="/2017/10/23/nosql-kudu-spanner-slides.html" rel="alternate" type="text/html" title="Slides: A brave new world in mutable big data: Relational storage" /><published>2017-10-23T00:00:00-07:00</published><updated>2017-10-23T00:00:00-07:00</updated><id>/2017/10/23/nosql-kudu-spanner-slides</id><content type="html" xml:base="/2017/10/23/nosql-kudu-spanner-slides.html"><p>Since the Apache Kudu project made its debut in 2015, there have been |
| a few common questions that kept coming up at every presentation:</p> |
| |
| <ul> |
| <li>Is Kudu an open source version of Google’s Spanner system?</li> |
| <li>Is Kudu NoSQL or SQL?</li> |
| <li>Why does Kudu have a relational data model? Isn’t SQL dead?</li> |
| </ul> |
| |
| <!--more--> |
| |
| <p>A few of these questions are addressed in the |
| <a href="https://kudu.apache.org/faq.html">Kudu FAQ</a>, but I thought they were |
| interesting enough that I decided to give a talk on these subjects |
| at <a href="https://conferences.oreilly.com/strata/strata-ny">Strata Data Conference NYC 2017</a>.</p> |
| |
| <p>Preparing this talk was particularly interesting, since Google recently released |
| Spanner to the public in SaaS form as <a href="https://cloud.google.com/spanner/">Google Cloud Spanner</a>. |
| This meant that I was able to compare Kudu and Spanner not just qualitatively, |
| based on academic papers, but quantitatively as well.</p> |
| |
| <p>To summarize the key points of the presentation:</p> |
| |
| <ul> |
| <li> |
| <p>Despite the growing popularity of “NoSQL” from 2009 through 2013, SQL has |
| once again become the access mechanism of choice for the majority of |
| analytic applications. NoSQL has become “Not Only SQL”.</p> |
| </li> |
| <li> |
| <p>Spanner and Kudu share a lot of common features. However:</p> |
| |
| <ul> |
| <li> |
| <p>Spanner offers a superior feature set and performance for Online |
| Transactional Processing (OLTP) workloads, including ACID transactions and |
| secondary indexing.</p> |
| </li> |
| <li> |
| <p>Kudu offers a superior feature set and performance for Online |
| Analytical Processing (OLAP) and Hybrid Transactional/Analytic Processing |
| (HTAP) workloads, including more complete SQL support and orders of |
| magnitude better performance on large queries.</p> |
| </li> |
| </ul> |
| </li> |
| </ul> |
| |
| <p>For more details and for the full benchmark numbers, check out the slide deck |
| below:</p> |
| |
| <iframe src="//www.slideshare.net/slideshow/embed_code/key/loQpO2vzlwGGgz" width="595" height="485" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe> |
| <div style="margin-bottom:15px"> <strong> <a href="//www.slideshare.net/ToddLipcon/a-brave-new-world-in-mutable-big-data-relational-storage-strata-nyc-2017" title="A brave new world in mutable big data relational storage (Strata NYC 2017)" target="_blank">A brave new world in mutable big data relational storage (Strata NYC 2017)</a> </strong> from <strong><a href="https://www.slideshare.net/ToddLipcon" target="_blank">Todd Lipcon</a></strong> </div> |
| |
| <p>Questions or comments? Join the <a href="/community.html">Apache Kudu Community</a> to discuss.</p></content><author><name>Todd Lipcon</name></author><summary>Since the Apache Kudu project made its debut in 2015, there have been |
| a few common questions that kept coming up at every presentation: |
| |
| |
| Is Kudu an open source version of Google’s Spanner system? |
| Is Kudu NoSQL or SQL? |
| Why does Kudu have a relational data model? Isn’t SQL dead?</summary></entry></feed> |