| <?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom"><generator uri="http://jekyllrb.com" version="2.5.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2019-03-15T10:35:54-07:00</updated><id>/</id><entry><title>Transparent Hierarchical Storage Management with Apache Kudu and Impala</title><link href="/2019/03/05/transparent-hierarchical-storage-management-with-apache-kudu-and-impala.html" rel="alternate" type="text/html" title="Transparent Hierarchical Storage Management with Apache Kudu and Impala" /><published>2019-03-05T00:00:00-08:00</published><updated>2019-03-05T00:00:00-08:00</updated><id>/2019/03/05/transparent-hierarchical-storage-management-with-apache-kudu-and-impala</id><content type="html" xml:base="/2019/03/05/transparent-hierarchical-storage-management-with-apache-kudu-and-impala.html"><p>Note: This is a cross-post from the Cloudera Engineering Blog |
| <a href="https://blog.cloudera.com/blog/2019/03/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/">Transparent Hierarchical Storage Management with Apache Kudu and Impala</a></p> |
| |
| <p>When choosing a storage option for an application, it is common to select the |
| single option whose features best fit your use case. For mutability and real-time |
| analytics workloads you may want to use Apache Kudu, but for massive scalability at |
| a low cost you may want to use HDFS. For that reason, there is a need for a solution |
| that allows you to leverage the best features of multiple storage options. This post |
| describes the sliding window pattern using Apache Impala with data stored in Apache |
| Kudu and Apache HDFS. With this pattern you get all of the benefits of multiple |
| storage layers in a way that is transparent to users.</p> |
| |
| <!--more--> |
| |
| <p>Apache Kudu is designed for fast analytics on rapidly changing data. Kudu provides a |
| combination of fast inserts/updates and efficient columnar scans to enable multiple |
| real-time analytic workloads across a single storage layer. For that reason, Kudu fits |
| well into a data pipeline as the place to store real-time data that needs to be |
| queryable immediately. Additionally, Kudu supports updating and deleting rows in |
| real time, allowing for late-arriving data and data correction.</p> |
| |
| <p>Apache HDFS is designed to allow for limitless scalability at a low cost. It is |
| optimized for batch-oriented use cases where data is immutable. When paired with the |
| Apache Parquet file format, structured data can be accessed with extremely high |
| throughput and efficiency.</p> |
| |
| <p>For situations in which the data is small and ever-changing, like dimension tables, |
| it is common to keep all of the data in Kudu. It is even common to keep large tables |
| in Kudu when the data fits within Kudu’s |
| <a href="https://kudu.apache.org/docs/known_issues.html#_scale">scaling limits</a> and can benefit |
| from Kudu’s unique features. In cases where the data is massive, batch-oriented, and |
| unlikely to change, storing the data in HDFS using the Parquet format is preferred. |
| When you need the benefits of both storage layers, the sliding window pattern is a |
| useful solution.</p> |
| |
| <h2 id="the-sliding-window-pattern">The Sliding Window Pattern</h2> |
| |
| <p>In this pattern, matching Kudu and Parquet formatted HDFS tables are created in Impala. |
| These tables are partitioned by a unit of time based on how frequently the data is |
| moved between the Kudu and HDFS tables. It is common to use daily, monthly, or yearly |
| partitions. A unified view is created and a <code>WHERE</code> clause is used to define a boundary |
| that separates which data is read from the Kudu table and which is read from the HDFS |
| table. The defined boundary is important so that you can move data between Kudu and |
| HDFS without exposing duplicate records to the view. Once the data is moved, an atomic |
| <code>ALTER VIEW</code> statement can be used to move the boundary forward.</p> |
| |
| <p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/sliding-window-pattern.png" alt="png" class="img-responsive" /></p> |
| |
| <p>Note: This pattern works best with somewhat sequential data organized into range |
| partitions, because sliding the window forward by adding and dropping range |
| partitions is very efficient.</p> |
| |
| <p>This pattern results in a sliding window of time where mutable data is stored in Kudu |
| and immutable data is stored in the Parquet format on HDFS. Leveraging both Kudu and |
| HDFS via Impala provides the benefits of both storage systems:</p> |
| |
| <ul> |
| <li>Streaming data is immediately queryable</li> |
| <li>Updates for late arriving data or manual corrections can be made</li> |
| <li>Data stored in HDFS is optimally sized, increasing performance and preventing small files</li> |
| <li>Reduced cost</li> |
| </ul> |
| |
| <p>Impala also supports cloud storage options such as |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_s3.html">S3</a> and |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_adls.html">ADLS</a>. |
| This capability allows convenient access to a storage system that is remotely managed, |
| accessible from anywhere, and integrated with various cloud-based services. Because |
| this data is remote, queries against S3 data are less performant, making S3 suitable |
| for holding “cold” data that is only queried occasionally. This pattern can be |
| extended to use cloud storage for cold data by creating a third matching table and |
| adding another boundary to the unified view.</p> |
| |
| <p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/sliding-window-pattern-cold.png" alt="png" class="img-responsive" /></p> |
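| <p>For example, the unified view could gain a third <code>SELECT</code> over an S3-backed table. In |
| the sketch below, the table names, columns, and boundary dates are all hypothetical:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql">CREATE VIEW my_view AS |
| -- hot, mutable data in Kudu |
| SELECT col1, col2 FROM my_kudu_table WHERE time &gt;= &#39;2018-01-01&#39; |
| UNION ALL |
| -- warm, immutable data in Parquet on HDFS |
| SELECT col1, col2 FROM my_hdfs_table |
| WHERE time &gt;= &#39;2017-01-01&#39; AND time &lt; &#39;2018-01-01&#39; |
| UNION ALL |
| -- cold data on S3, queried only occasionally |
| SELECT col1, col2 FROM my_s3_table WHERE time &lt; &#39;2017-01-01&#39;;</code></pre></div> |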
| |
| <p>Note: For simplicity only Kudu and HDFS are illustrated in the examples below.</p> |
| |
| <p>The process for moving data from Kudu to HDFS is broken into two phases. The first |
| phase is the data migration, and the second phase is the metadata change. These |
| ongoing steps should be scheduled to run automatically on a regular basis.</p> |
| |
| <p>In the first phase, the now immutable data is copied from Kudu to HDFS. Even though |
| data is duplicated from Kudu into HDFS, the boundary defined in the view will prevent |
| duplicate data from being shown to users. This step can include any validation and |
| retries as needed to ensure the data offload is successful.</p> |
| |
| <p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/phase-1.png" alt="png" class="img-responsive" /></p> |
| |
| <p>In the second phase, now that the data is safely copied to HDFS, the metadata is |
| changed to adjust how the offloaded partition is exposed. This includes shifting |
| the boundary forward, adding a new Kudu partition for the next period, and dropping |
| the old Kudu partition.</p> |
| |
| <p><img src="/img/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/phase-2.png" alt="png" class="img-responsive" /></p> |
| |
| <h2 id="building-blocks">Building Blocks</h2> |
| |
| <p>In order to implement the sliding window pattern, a few Impala fundamentals are |
| required. Each fundamental building block of the sliding window pattern is described |
| below.</p> |
| |
| <h3 id="moving-data">Moving Data</h3> |
| |
| <p>Moving data among storage systems via Impala is straightforward provided you have |
| matching tables defined using each of the storage formats. To keep this post brief, |
| not all of the options available when creating an Impala table are described. |
| However, Impala’s |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_create_table.html">CREATE TABLE documentation</a> |
| can be referenced to find the correct syntax for Kudu, HDFS, and cloud storage tables. |
| A few examples are shown further below where the sliding window pattern is illustrated.</p> |
| |
| <p>Once the tables are created, moving the data is as simple as an |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_insert.html">INSERT…SELECT</a> statement:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">table_foo</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">table_bar</span><span class="p">;</span></code></pre></div> |
| |
| <p>All of the features of the |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_select.html">SELECT</a> |
| statement can be used to select the specific data you would like to move.</p> |
| |
| <p>Note: If moving data to Kudu, an <code>UPSERT INTO</code> statement can be used to handle |
| duplicate keys.</p> |
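| <p>For example, when copying rows back into Kudu, replacing <code>INSERT</code> with <code>UPSERT</code> |
| overwrites any rows whose primary keys already exist (here <code>table_foo</code> is assumed to be |
| the Kudu table):</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql">UPSERT INTO table_foo |
| SELECT * FROM table_bar;</code></pre></div> |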
| |
| <h3 id="unified-querying">Unified Querying</h3> |
| |
| <p>Querying data from multiple tables and data sources in Impala is also straightforward. |
| For the sake of brevity, not all of the options available when creating an Impala |
| view are described. However, see Impala’s |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_create_view.html">CREATE VIEW documentation</a> |
| for more in-depth details.</p> |
| |
| <p>Creating a view for unified querying is as simple as a <code>CREATE VIEW</code> statement using |
| two <code>SELECT</code> clauses combined with a <code>UNION ALL</code>:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">VIEW</span> <span class="n">foo_view</span> <span class="k">AS</span> |
| <span class="k">SELECT</span> <span class="n">col1</span><span class="p">,</span> <span class="n">col2</span><span class="p">,</span> <span class="n">col3</span> <span class="k">FROM</span> <span class="n">foo_parquet</span> |
| <span class="k">UNION</span> <span class="k">ALL</span> |
| <span class="k">SELECT</span> <span class="n">col1</span><span class="p">,</span> <span class="n">col2</span><span class="p">,</span> <span class="n">col3</span> <span class="k">FROM</span> <span class="n">foo_kudu</span><span class="p">;</span></code></pre></div> |
| |
| <p>WARNING: Be sure to use <code>UNION ALL</code> and not <code>UNION</code>. The <code>UNION</code> keyword by itself |
| is the same as <code>UNION DISTINCT</code> and can have significant performance impact. |
| More information can be found in the Impala |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_union.html">UNION documentation</a>.</p> |
| |
| <p>All of the features of the |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_select.html">SELECT</a> |
| statement can be used to expose the correct data and columns from each of the |
| underlying tables. It is important to use the <code>WHERE</code> clause to pass through and |
| pushdown any predicates that need special handling or transformations. More examples |
| will follow below in the discussion of the sliding window pattern.</p> |
| |
| <p>Additionally, views can be altered via the |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_alter_view.html">ALTER VIEW</a> |
| statement. This is useful when combined with the <code>SELECT</code> statement because it can be |
| used to atomically update what data is being accessed by the view.</p> |
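| <p>For example, the boundary of the <code>foo_view</code> defined above could be moved in a single |
| atomic statement (the boundary predicate on <code>col2</code> is hypothetical):</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql">ALTER VIEW foo_view AS |
| SELECT col1, col2, col3 FROM foo_parquet |
| WHERE col2 &lt; &#39;2018-02-01&#39; |
| UNION ALL |
| SELECT col1, col2, col3 FROM foo_kudu |
| WHERE col2 &gt;= &#39;2018-02-01&#39;;</code></pre></div> |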
| |
| <h2 id="an-example-implementation">An Example Implementation</h2> |
| |
| <p>Below are sample steps to implement the sliding window pattern using a monthly period |
| with three months of active mutable data. Data older than three months will be |
| offloaded to HDFS using the Parquet format.</p> |
| |
| <h3 id="create-the-kudu-table">Create the Kudu Table</h3> |
| |
| <p>First, create a Kudu table which will hold three months of active mutable data. |
| The table is range partitioned by the time column with each range containing one |
| period of data. It is important to have partitions that match the period because |
| dropping Kudu partitions is much more efficient than removing the data via a |
| <code>DELETE</code> statement. The table is also hash partitioned by the other key column to ensure |
| that all of the data is not written to a single partition.</p> |
| |
| <p>Note: Your schema design should vary based on your data and read/write performance |
| considerations. This example schema is intended for demonstration purposes and not as |
| an “optimal” schema. See the |
| <a href="https://kudu.apache.org/docs/schema_design.html">Kudu schema design documentation</a> |
| for more guidance on choosing your schema. For example, you may not need any hash |
| partitioning if your |
| data input rate is low. Alternatively, you may need more hash buckets if your data |
| input rate is very high.</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">my_table_kudu</span> |
| <span class="p">(</span> |
| <span class="n">name</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="n">time</span> <span class="k">TIMESTAMP</span><span class="p">,</span> |
| <span class="n">message</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">time</span><span class="p">)</span> |
| <span class="p">)</span> |
| <span class="n">PARTITION</span> <span class="k">BY</span> |
| <span class="n">HASH</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="n">PARTITIONS</span> <span class="mi">4</span><span class="p">,</span> |
| <span class="n">RANGE</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> <span class="p">(</span> |
| <span class="n">PARTITION</span> <span class="s1">&#39;2018-01-01&#39;</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="s1">&#39;2018-02-01&#39;</span><span class="p">,</span> <span class="c1">--January</span> |
| <span class="n">PARTITION</span> <span class="s1">&#39;2018-02-01&#39;</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="s1">&#39;2018-03-01&#39;</span><span class="p">,</span> <span class="c1">--February</span> |
| <span class="n">PARTITION</span> <span class="s1">&#39;2018-03-01&#39;</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="s1">&#39;2018-04-01&#39;</span><span class="p">,</span> <span class="c1">--March</span> |
| <span class="n">PARTITION</span> <span class="s1">&#39;2018-04-01&#39;</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="s1">&#39;2018-05-01&#39;</span> <span class="c1">--April</span> |
| <span class="p">)</span> |
| <span class="n">STORED</span> <span class="k">AS</span> <span class="n">KUDU</span><span class="p">;</span></code></pre></div> |
| |
| <p>Note: There is an extra month partition to provide a buffer of time for the data to |
| be moved into the immutable table.</p> |
| |
| <h3 id="create-the-hdfs-table">Create the HDFS Table</h3> |
| |
| <p>Create the matching Parquet formatted HDFS table which will hold the older immutable |
| data. This table is partitioned by year, month, and day for efficient access even |
| though you can’t partition by the time column itself. This is addressed further in |
| the view step below. See Impala’s |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_partitioning.html">partitioning documentation</a> |
| for more details.</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">my_table_parquet</span> |
| <span class="p">(</span> |
| <span class="n">name</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="n">time</span> <span class="k">TIMESTAMP</span><span class="p">,</span> |
| <span class="n">message</span> <span class="n">STRING</span> |
| <span class="p">)</span> |
| <span class="n">PARTITIONED</span> <span class="k">BY</span> <span class="p">(</span><span class="k">year</span> <span class="nb">int</span><span class="p">,</span> <span class="k">month</span> <span class="nb">int</span><span class="p">,</span> <span class="k">day</span> <span class="nb">int</span><span class="p">)</span> |
| <span class="n">STORED</span> <span class="k">AS</span> <span class="n">PARQUET</span><span class="p">;</span></code></pre></div> |
| |
| <h3 id="create-the-unified-view">Create the Unified View</h3> |
| |
| <p>Now create the unified view which will be used to query all of the data seamlessly:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">VIEW</span> <span class="n">my_table_view</span> <span class="k">AS</span> |
| <span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="n">time</span><span class="p">,</span> <span class="n">message</span> |
| <span class="k">FROM</span> <span class="n">my_table_kudu</span> |
| <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&gt;=</span> <span class="ss">&quot;2018-01-01&quot;</span> |
| <span class="k">UNION</span> <span class="k">ALL</span> |
| <span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="n">time</span><span class="p">,</span> <span class="n">message</span> |
| <span class="k">FROM</span> <span class="n">my_table_parquet</span> |
| <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&lt;</span> <span class="ss">&quot;2018-01-01&quot;</span> |
| <span class="k">AND</span> <span class="k">year</span> <span class="o">=</span> <span class="k">year</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="k">month</span> <span class="o">=</span> <span class="k">month</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="k">day</span> <span class="o">=</span> <span class="k">day</span><span class="p">(</span><span class="n">time</span><span class="p">);</span></code></pre></div> |
| |
| <p>Each <code>SELECT</code> clause explicitly lists all of the columns to expose. This ensures that |
| the year, month, and day columns that are unique to the Parquet table are not exposed. |
| If needed, it also allows any necessary column or type mapping to be handled.</p> |
| |
| <p>The initial <code>WHERE</code> clauses applied to both my_table_kudu and my_table_parquet define |
| the boundary between Kudu and HDFS to ensure duplicate data is not read while in the |
| process of offloading data.</p> |
| |
| <p>The additional predicates applied to my_table_parquet ensure good predicate |
| pushdown on the individual year, month, and day columns.</p> |
| |
| <p>WARNING: As stated earlier, be sure to use <code>UNION ALL</code> and not <code>UNION</code>. The <code>UNION</code> |
| keyword by itself is the same as <code>UNION DISTINCT</code> and can have significant performance |
| impact. More information can be found in the Impala |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_union.html"><code>UNION</code> documentation</a>.</p> |
| |
| <h3 id="ongoing-steps">Ongoing Steps</h3> |
| |
| <p>Now that the base tables and view are created, prepare the ongoing steps to maintain |
| the sliding window. Because these ongoing steps should be scheduled to run on a |
| regular basis, the examples below are shown using <code>.sql</code> files that take variables |
| which can be passed from your scripts and scheduling tool of choice.</p> |
| |
| <p>Create the <code>window_data_move.sql</code> file to move the data from the oldest partition to HDFS:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">INSERT</span> <span class="k">INTO</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">hdfs_table</span><span class="err">}</span> <span class="n">PARTITION</span> <span class="p">(</span><span class="k">year</span><span class="p">,</span> <span class="k">month</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span> |
| <span class="k">SELECT</span> <span class="o">*</span><span class="p">,</span> <span class="k">year</span><span class="p">(</span><span class="n">time</span><span class="p">),</span> <span class="k">month</span><span class="p">(</span><span class="n">time</span><span class="p">),</span> <span class="k">day</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> |
| <span class="k">FROM</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span> |
| <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&gt;=</span> <span class="n">add_months</span><span class="p">(</span><span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="n">time</span> <span class="o">&lt;</span> <span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">;</span> |
| <span class="n">COMPUTE</span> <span class="n">INCREMENTAL</span> <span class="n">STATS</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">hdfs_table</span><span class="err">}</span><span class="p">;</span></code></pre></div> |
| |
| <p>Note: The |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html">COMPUTE INCREMENTAL STATS</a> |
| statement is not required but helps Impala to optimize queries.</p> |
| |
| <p>To run the SQL statement, use the Impala shell and pass the required variables. |
| Below is an example:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_data_move.sql |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span></code></pre></div> |
| |
| <p>Note: You can adjust the <code>WHERE</code> clause to match the given period and cadence of your |
| offload. Here the <code>add_months</code> function is called with an argument of -1 so that the |
| month of data preceding the new boundary time is moved.</p> |
| |
| <p>Create the <code>window_view_alter.sql</code> file to shift the time boundary forward by altering |
| the unified view:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">ALTER</span> <span class="k">VIEW</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">view_name</span><span class="err">}</span> <span class="k">AS</span> |
| <span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="n">time</span><span class="p">,</span> <span class="n">message</span> |
| <span class="k">FROM</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span> |
| <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&gt;=</span> <span class="ss">&quot;${var:new_boundary_time}&quot;</span> |
| <span class="k">UNION</span> <span class="k">ALL</span> |
| <span class="k">SELECT</span> <span class="n">name</span><span class="p">,</span> <span class="n">time</span><span class="p">,</span> <span class="n">message</span> |
| <span class="k">FROM</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">hdfs_table</span><span class="err">}</span> |
| <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&lt;</span> <span class="ss">&quot;${var:new_boundary_time}&quot;</span> |
| <span class="k">AND</span> <span class="k">year</span> <span class="o">=</span> <span class="k">year</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="k">month</span> <span class="o">=</span> <span class="k">month</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> |
| <span class="k">AND</span> <span class="k">day</span> <span class="o">=</span> <span class="k">day</span><span class="p">(</span><span class="n">time</span><span class="p">);</span></code></pre></div> |
| |
| <p>To run the SQL statement, use the Impala shell and pass the required variables. |
| Below is an example:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_view_alter.sql |
| --var<span class="o">=</span><span class="nv">view_name</span><span class="o">=</span>my_table_view |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span></code></pre></div> |
| |
| <p>Create the <code>window_partition_shift.sql</code> file to shift the Kudu partitions forward:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">ALTER</span> <span class="k">TABLE</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span> |
| |
| <span class="k">ADD</span> <span class="n">RANGE</span> <span class="n">PARTITION</span> <span class="n">add_months</span><span class="p">(</span><span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">,</span> |
| <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">window_length</span><span class="err">}</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="n">add_months</span><span class="p">(</span><span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">,</span> |
| <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">window_length</span><span class="err">}</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span> |
| |
| <span class="k">ALTER</span> <span class="k">TABLE</span> <span class="err">${</span><span class="n">var</span><span class="p">:</span><span class="n">kudu_table</span><span class="err">}</span> |
| |
| <span class="k">DROP</span> <span class="n">RANGE</span> <span class="n">PARTITION</span> <span class="n">add_months</span><span class="p">(</span><span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> |
| <span class="o">&lt;=</span> <span class="k">VALUES</span> <span class="o">&lt;</span> <span class="ss">&quot;${var:new_boundary_time}&quot;</span><span class="p">;</span></code></pre></div> |
| |
| <p>To run the SQL statement, use the Impala shell and pass the required variables. |
| Below is an example:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_partition_shift.sql |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span> |
| --var<span class="o">=</span><span class="nv">window_length</span><span class="o">=</span>3</code></pre></div> |
| |
| <p>Note: You should periodically run |
| <a href="https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html">COMPUTE STATS</a> |
| on your Kudu table to ensure Impala’s query performance is optimal.</p> |
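| <p>For example, a scheduled maintenance job could run:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql">COMPUTE STATS my_table_kudu;</code></pre></div> |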
| |
| <h3 id="experimentation">Experimentation</h3> |
| |
| <p>Now that you have created the tables, view, and scripts to leverage the sliding |
| window pattern, you can experiment with them by inserting data for different time |
| ranges and running the scripts to move the window forward through time.</p> |
| |
| <p>Insert some sample values into the Kudu table:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">my_table_kudu</span> <span class="k">VALUES</span> |
| <span class="p">(</span><span class="s1">&#39;joey&#39;</span><span class="p">,</span> <span class="s1">&#39;2018-01-01&#39;</span><span class="p">,</span> <span class="s1">&#39;hello&#39;</span><span class="p">),</span> |
| <span class="p">(</span><span class="s1">&#39;ross&#39;</span><span class="p">,</span> <span class="s1">&#39;2018-02-01&#39;</span><span class="p">,</span> <span class="s1">&#39;goodbye&#39;</span><span class="p">),</span> |
| <span class="p">(</span><span class="s1">&#39;rachel&#39;</span><span class="p">,</span> <span class="s1">&#39;2018-03-01&#39;</span><span class="p">,</span> <span class="s1">&#39;hi&#39;</span><span class="p">);</span></code></pre></div> |
| |
| <p>Show the data in each table/view:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></div> |
| |
| <p>Move the January data into HDFS:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_data_move.sql |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span></code></pre></div> |
| |
| <p>Confirm the data is in both places, but not duplicated in the view:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></div> |
| |
| <p>Alter the view to shift the time boundary forward to February:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_view_alter.sql |
| --var<span class="o">=</span><span class="nv">view_name</span><span class="o">=</span>my_table_view |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">hdfs_table</span><span class="o">=</span>my_table_parquet |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span></code></pre></div> |
| |
| <p>Confirm the data is still in both places, but not duplicated in the view:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></div> |
| |
| <p>Shift the Kudu partitions forward:</p> |
| |
| <div class="highlight"><pre><code class="language-bash" data-lang="bash">impala-shell -i &lt;impalad:port&gt; -f window_partition_shift.sql |
| --var<span class="o">=</span><span class="nv">kudu_table</span><span class="o">=</span>my_table_kudu |
| --var<span class="o">=</span><span class="nv">new_boundary_time</span><span class="o">=</span><span class="s2">&quot;2018-02-01&quot;</span> |
| --var<span class="o">=</span><span class="nv">window_length</span><span class="o">=</span>3</code></pre></div> |
| |
| <p>Confirm the January data is now only in HDFS:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_kudu</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_parquet</span><span class="p">;</span> |
| <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span></code></pre></div> |
| |
| <p>Confirm predicate push down with Impala’s EXPLAIN statement:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span><span class="p">;</span> |
| <span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span> <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&lt;</span> <span class="ss">&quot;2018-02-01&quot;</span><span class="p">;</span> |
| <span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table_view</span> <span class="k">WHERE</span> <span class="n">time</span> <span class="o">&gt;</span> <span class="ss">&quot;2018-02-01&quot;</span><span class="p">;</span></code></pre></div> |
| |
| <p>In the explain output you should see “kudu predicates” which include the time column |
| filters in the “SCAN KUDU” section and “predicates” which include the time, day, month, |
| and year columns in the “SCAN HDFS” section.</p></content><author><name>Grant Henke</name></author><summary>Note: This is a cross-post from the Cloudera Engineering Blog |
| Transparent Hierarchical Storage Management with Apache Kudu and Impala |
| |
| When picking a storage option for an application it is common to pick a single |
| storage option which has the most applicable features to your use case. For mutability |
| and real-time analytics workloads you may want to use Apache Kudu, but for massive |
| scalability at a low cost you may want to use HDFS. For that reason, there is a need |
| for a solution that allows you to leverage the best features of multiple storage |
| options. This post describes the sliding window pattern using Apache Impala with data |
| stored in Apache Kudu and Apache HDFS. With this pattern you get all of the benefits |
| of multiple storage layers in a way that is transparent to users.</summary></entry><entry><title>Call for Posts</title><link href="/2018/12/11/call-for-posts.html" rel="alternate" type="text/html" title="Call for Posts" /><published>2018-12-11T00:00:00-08:00</published><updated>2018-12-11T00:00:00-08:00</updated><id>/2018/12/11/call-for-posts</id><content type="html" xml:base="/2018/12/11/call-for-posts.html"><p>Most of the posts in the Kudu blog have been written by the project’s |
| committers and are either technical or news-like in nature. We’d like to hear |
| how you’re using Kudu in production, in testing, or in your hobby project and |
| we’d like to share it with the world!</p> |
| |
| <!--more--> |
| |
| <p>If you’d like to tell the world about how you are using Kudu in your project, |
| now is the time.</p> |
| |
| <p>To learn how to submit posts, read our <a href="/docs/contributing.html#_blog_posts">contributing |
| documentation</a>. Alternatively, you can |
| draft your post in Google Docs and share it with us at |
| <a href="&#109;&#097;&#105;&#108;&#116;&#111;:&#100;&#101;&#118;&#064;&#107;&#117;&#100;&#117;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;">&#100;&#101;&#118;&#064;&#107;&#117;&#100;&#117;&#046;&#097;&#112;&#097;&#099;&#104;&#101;&#046;&#111;&#114;&#103;</a> and we’re happy to review it |
| and post it to the blog for you.</p></content><author><name>Attila Bukor</name></author><summary>Most of the posts in the Kudu blog have been written by the project’s |
| committers and are either technical or news-like in nature. We’d like to hear |
| how you’re using Kudu in production, in testing, or in your hobby project and |
| we’d like to share it with the world!</summary></entry><entry><title>Apache Kudu 1.8.0 Released</title><link href="/2018/10/26/apache-kudu-1-8-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.8.0 Released" /><published>2018-10-26T00:00:00-07:00</published><updated>2018-10-26T00:00:00-07:00</updated><id>/2018/10/26/apache-kudu-1-8-0-released</id><content type="html" xml:base="/2018/10/26/apache-kudu-1-8-0-released.html"><p>The Apache Kudu team is happy to announce the release of Kudu 1.8.0!</p> |
| |
| <p>The new release adds several new features and improvements, including the |
| following:</p> |
| |
| <!--more--> |
| |
| <ul> |
|   <li>Introduced a manual data rebalancer tool which can be used to redistribute |
| table replicas among tablet servers</li> |
| <li>Added support for <code>IS NULL</code> and <code>IS NOT NULL</code> predicates to the Kudu Python |
| client</li> |
| <li>Multiple tooling improvements make diagnostics and troubleshooting simpler</li> |
| <li>The Kudu Spark connector now supports Spark Streaming DataFrames</li> |
| <li>Added Pandas support to the Python client</li> |
| </ul> |
| |
| <p>The above is just a list of the highlights; for a more complete list of new |
| features, improvements, and fixes, please refer to the <a href="/releases/1.8.0/docs/release_notes.html">release |
| notes</a>.</p> |
| |
| <p>The Apache Kudu project only publishes source code releases. To build Kudu |
| 1.8.0, follow these steps:</p> |
| |
| <ul> |
| <li>Download the Kudu <a href="/releases/1.8.0">1.8.0 source release</a></li> |
| <li>Follow the instructions in the documentation to build Kudu <a href="/releases/1.8.0/docs/installation.html#build_from_source">1.8.0 from |
| source</a></li> |
| </ul> |
| |
| <p>For your convenience, binary JAR files for the Kudu Java client library, Spark |
| DataSource, Flume sink, and other Java integrations are published to the ASF |
| Maven repository and are <a href="https://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.kudu%22%20AND%20v%3A%221.8.0%22">now |
| available</a>.</p> |
| |
| <p>The Python client source is also available on |
| <a href="https://pypi.org/project/kudu-python/">PyPI</a>.</p></content><author><name>Attila Bukor</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 1.8.0! |
| |
| The new release adds several new features and improvements, including the |
| following:</summary></entry><entry><title>Index Skip Scan Optimization in Kudu</title><link href="/2018/09/26/index-skip-scan-optimization-in-kudu.html" rel="alternate" type="text/html" title="Index Skip Scan Optimization in Kudu" /><published>2018-09-26T00:00:00-07:00</published><updated>2018-09-26T00:00:00-07:00</updated><id>/2018/09/26/index-skip-scan-optimization-in-kudu</id><content type="html" xml:base="/2018/09/26/index-skip-scan-optimization-in-kudu.html"><p>This summer I got the opportunity to intern with the Apache Kudu team at Cloudera. |
| My project was to optimize the Kudu scan path by implementing a technique called |
| index skip scan (a.k.a. scan-to-seek, see section 4.1 in [1]). I wanted to share |
| my experience and the progress we’ve made so far on the approach.</p> |
| |
| <!--more--> |
| |
| <p>Let’s begin by discussing the current query flow in Kudu. |
| Consider the following table:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">metrics</span> <span class="p">(</span> |
| <span class="k">host</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="n">tstamp</span> <span class="nb">INT</span><span class="p">,</span> |
| <span class="n">clusterid</span> <span class="nb">INT</span><span class="p">,</span> |
| <span class="k">role</span> <span class="n">STRING</span><span class="p">,</span> |
| <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="p">(</span><span class="k">host</span><span class="p">,</span> <span class="n">tstamp</span><span class="p">,</span> <span class="n">clusterid</span><span class="p">)</span> |
| <span class="p">);</span></code></pre></div> |
| |
| <p><img src="/img/index-skip-scan/example-table.png" alt="png" class="img-responsive" /> |
| <em>Sample rows of table <code>metrics</code> (sorted by key columns).</em></p> |
| |
| <p>In this case, by default, Kudu internally builds a primary key index (implemented as a |
| <a href="https://en.wikipedia.org/wiki/B-tree">B-tree</a>) for the table <code>metrics</code>. |
| As shown in the table above, the index data is sorted by the composite of all key columns. |
| When the user query contains the first key column (<code>host</code>), Kudu uses the index (as the index data is |
| primarily sorted on the first key column).</p> |
| |
| <p>Now, what if the user query does not contain the first key column and instead only contains the <code>tstamp</code> column? |
| In the above case, the <code>tstamp</code> column values are sorted with respect to <code>host</code>, |
| but are not globally sorted, and as such, it’s non-trivial to use the index to filter rows. |
| Instead, a full tablet scan is done by default. Other databases may optimize such scans by building secondary indexes |
| (though it might be redundant to build one on one of the primary keys). However, this isn’t an option for Kudu, |
| given its lack of secondary index support.</p> |
| |
| <p>The question is, can Kudu do better than a full tablet scan here?</p> |
| |
| <p>The answer is yes! Let’s observe the column preceding the <code>tstamp</code> column. We will refer to it as the |
| “prefix column” and its specific value as the “prefix key”. In this example, <code>host</code> is the prefix column. |
| Note that the prefix keys are sorted in the index and that all rows of a given prefix key are also sorted by the |
| remaining key columns. Therefore, we can use the index to skip to the rows that have distinct prefix keys, |
| and also satisfy the predicate on the <code>tstamp</code> column. |
| For example, consider the query:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span> <span class="n">clusterid</span> <span class="k">FROM</span> <span class="n">metrics</span> <span class="k">WHERE</span> <span class="n">tstamp</span> <span class="o">=</span> <span class="mi">100</span><span class="p">;</span></code></pre></div> |
| |
| <p><img src="/img/index-skip-scan/skip-scan-example-table.png" alt="png" class="img-responsive" /> |
| <em>Skip scan flow illustration. The rows in green are scanned and the rest are skipped.</em></p> |
| |
| <p>The tablet server can use the index to <strong>skip</strong> to the first row with a distinct prefix key (<code>host = helium</code>) that |
| matches the predicate (<code>tstamp = 100</code>) and then <strong>scan</strong> through the rows until the predicate no longer matches. At that |
| point we would know that no more rows with <code>host = helium</code> will satisfy the predicate, and we can skip to the next |
| prefix key. This holds true for all distinct keys of <code>host</code>. Hence, this method is popularly known as |
| <strong>skip scan optimization</strong>[2, 3].</p> |
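| |
| <p>One way to picture the optimization: the scan behaves as if the query were |
| rewritten into one index-backed seek per distinct prefix key. For a table with |
| just two hosts (<code>ubuntu</code> is a hypothetical value alongside |
| <code>helium</code> from the illustration above), that would be:</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql">-- Each branch seeks via the (host, tstamp, clusterid) primary key index. |
| SELECT clusterid FROM metrics WHERE host = &quot;helium&quot; AND tstamp = 100 |
| UNION ALL |
| SELECT clusterid FROM metrics WHERE host = &quot;ubuntu&quot; AND tstamp = 100;</code></pre></div> |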
| |
| <h1 id="performance">Performance</h1> |
| |
| <p>This optimization can speed up queries significantly, depending on the cardinality (number of distinct values) of the |
| prefix column. The lower the prefix column cardinality, the better the skip scan performance. In fact, when the |
| prefix column cardinality is high, skip scan is not a viable approach. The performance graph (obtained using the example |
| schema and query pattern mentioned earlier) is shown below.</p> |
| |
| <p>Based on our experiments on tablets of up to 10 million rows (as shown below), we found that skip scan |
| performance begins to fall below full tablet scan performance when the prefix column cardinality |
| exceeds sqrt(number_of_rows_in_tablet). |
| Therefore, to capture the benefits of skip scan when possible while maintaining consistent performance in cases |
| of large prefix column cardinality, we have tentatively chosen to dynamically disable skip scan when the number of skips for |
| distinct prefix keys exceeds sqrt(number_of_rows_in_tablet). |
| It will be an interesting project to further explore sophisticated heuristics to decide when |
| to dynamically disable skip scan.</p> |
| |
| <p><img src="/img/index-skip-scan/skip-scan-performance-graph.png" alt="png" class="img-responsive" /></p> |
| |
| <h1 id="conclusion">Conclusion</h1> |
| |
| <p>Skip scan optimization in Kudu can lead to huge performance benefits that scale with the size of |
| data in Kudu tablets. This is a work-in-progress <a href="https://gerrit.cloudera.org/#/c/10983/">patch</a>. |
| The implementation in the patch works only for equality predicates on the non-first primary key |
| columns. Note that although the specific example above has just one prefix |
| column (<code>host</code>), the approach generalizes to any number of prefix columns.</p> |
| |
| <p>This work also lays the groundwork to leverage the skip scan approach and optimize query processing time in the |
| following use cases:</p> |
| |
| <ul> |
| <li>Range predicates</li> |
| <li>In-list predicates</li> |
| </ul> |
| |
| <p>This was my first time working on an open source project. I thoroughly enjoyed working on this challenging problem, |
| right from understanding the scan path in Kudu to working on a full-fledged implementation of |
| the skip scan optimization. I am very grateful to the Kudu team for guiding and supporting me throughout the |
| internship period.</p> |
| |
| <h1 id="references">References</h1> |
| |
| <p><a href="https://storage.googleapis.com/pub-tools-public-publication-data/pdf/42851.pdf">[1]</a>: Gupta, Ashish, et al. “Mesa: |
| Geo-replicated, near real-time, scalable data warehousing.” Proceedings of the VLDB Endowment 7.12 (2014): 1259-1270.</p> |
| |
| <p><a href="https://oracle-base.com/articles/9i/index-skip-scanning/">[2]</a>: Index Skip Scanning - Oracle Database</p> |
| |
| <p><a href="https://www.sqlite.org/optoverview.html#skipscan">[3]</a>: Skip Scan - SQLite</p></content><author><name>Anupama Gupta</name></author><summary>This summer I got the opportunity to intern with the Apache Kudu team at Cloudera. |
| My project was to optimize the Kudu scan path by implementing a technique called |
| index skip scan (a.k.a. scan-to-seek, see section 4.1 in [1]). I wanted to share |
| my experience and the progress we’ve made so far on the approach.</summary></entry><entry><title>Simplified Data Pipelines with Kudu</title><link href="/2018/09/11/simplified-pipelines-with-kudu.html" rel="alternate" type="text/html" title="Simplified Data Pipelines with Kudu" /><published>2018-09-11T00:00:00-07:00</published><updated>2018-09-11T00:00:00-07:00</updated><id>/2018/09/11/simplified-pipelines-with-kudu</id><content type="html" xml:base="/2018/09/11/simplified-pipelines-with-kudu.html"><p>I’ve been working with Hadoop now for over seven years and fortunately, or unfortunately, have run |
| across a lot of structured data use cases. What we, at <a href="https://phdata.io/">phData</a>, have found is |
| that end users are typically comfortable with tabular data and prefer to access their data in a |
| structured manner using tables. |
| <!--more--></p> |
| |
| <p>When working on new structured data projects, the first question we always get from non-Hadoop |
| followers is, <em>“how do I update or delete a record?”</em> The second question we get is, <em>“when adding |
| records, why don’t they show up in Impala right away?”</em> For those of us who have worked with HDFS |
| and Impala on HDFS for years, these are simple questions to answer, but hard ones to explain.</p> |
| |
| <p>The pre-Kudu years were filled with 100’s (or 1000’s) of self-join views (or materialization jobs) |
| and compaction jobs, along with scheduled jobs to refresh Impala cache periodically so new records |
| show up. And while doable, for 10,000’s of tables, this basically became a distraction from solving |
| real business problems.</p> |
| |
| <p>With the introduction of Kudu, mixing record level updates, deletes, and inserts, while supporting |
| large scans, are now something we can sustainably manage at scale. HBase is very good at record |
| level updates, deletes and inserts, but doesn’t scale well for analytic use cases that often do full |
| table scans. Moreover, for streaming use cases, changes are available in near real-time. End users, |
| accustomed to having to <em>”wait”</em> for their data, can now consume the data as it arrives in their |
| table.</p> |
| |
| <p>A common data ingest pattern where Kudu becomes necessary is change data capture (CDC). That is, |
| capturing the inserts, updates, hard deletes, and streaming them into Kudu where they can be applied |
| immediately. Pre-Kudu this pipeline was very tedious to implement. Now with tools like |
| <a href="https://streamsets.com/">StreamSets</a>, you can get up and running in a few hours.</p> |
| |
| <p>A second common workflow is near real-time analytics. We’ve streamed data off mining trucks, |
| oil wells, manufacturing lines, and needed to make that data available to end users immediately. No |
| longer do we need to batch up writes, flush to HDFS and then refresh cache in Impala. As mentioned |
| before, with Kudu, the data is available as soon as it lands. This has been a significant |
| enhancement for end users, who previously had to <em>”wait”</em> for data.</p> |
| |
| <p>In summary, Kudu has made a tremendous impact in removing the operational distractions of merging in |
| changes, and refreshing the cache of downstream consumers. This now allows data engineers |
| and users to focus on solving business problems, rather than being bothered by the tediousness of |
| the backend.</p></content><author><name>Mac Noland</name></author><summary>I’ve been working with Hadoop now for over seven years and fortunately, or unfortunately, have run |
| across a lot of structured data use cases. What we, at phData, have found is |
| that end users are typically comfortable with tabular data and prefer to access their data in a |
| structured manner using tables.</summary></entry><entry><title>Getting Started with Kudu - an O’Reilly Title</title><link href="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html" rel="alternate" type="text/html" title="Getting Started with Kudu - an O'Reilly Title" /><published>2018-08-06T00:00:00-07:00</published><updated>2018-08-06T00:00:00-07:00</updated><id>/2018/08/06/getting-started-with-kudu-an-oreilly-title</id><content type="html" xml:base="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html"><p>The following article by Brock Noland was reposted from the |
| <a href="https://www.phdata.io/getting-started-with-kudu/">phData</a> |
| blog with their permission.</p> |
| |
| <p>Five years ago, enabling Data Science and Advanced Analytics on the |
| Hadoop platform was hard. Organizations required strong Software Engineering |
| capabilities to successfully implement complex Lambda architectures or even |
| simply implement continuous ingest. Updating or deleting data was simply a |
| nightmare. General Data Protection Regulation (GDPR) would have been an extreme |
| challenge at that time. |
| <!--more--> |
| In that context, on October 11th, 2012, Todd Lipcon performed Apache Kudu’s initial |
| commit. The commit message was:</p> |
| |
| <pre><code>Code for writing cfiles seems to basically work |
| Need to write code for reading cfiles, still |
| </code></pre> |
| |
| <p>And Kudu development was off and running. Around this same time Todd, on his |
| internal Wiki page, started listing out the papers he was reading to develop |
| the theoretical background for creating Kudu. I followed along, reading as many |
| as I could, understanding little, because I knew Todd was up to something |
| important. About a year after that initial commit, I got my |
| <a href="https://github.com/apache/kudu/commit/1d7e6864b4a31d3fe6897e4cb484dfcda6608d43">first Kudu commit</a>, |
| documenting the upper bound of a library. This is a small contribution of which I am still |
| proud.</p> |
| |
| <p>In the meantime, I was lucky enough to be a founder of a Hadoop Managed Services |
| and Consulting company known as <a href="http://phdata.io/">phData</a>. We found that a majority |
| of our customers had use cases which Kudu vastly simplified. Whether it’s Change Data |
| Capture (CDC) from thousands of source tables to Internet of Things (IoT) ingest, Kudu |
| makes life much easier as both an operator of a Hadoop cluster and a developer providing |
| business value on the platform.</p> |
| |
| <p>Through this work, I was lucky enough to be a co-author of |
| <a href="http://shop.oreilly.com/product/0636920065739.do">Getting Started with Kudu</a>. |
| The book is a summation of what my co-authors, Jean-Marc Spaggiari, Mladen |
| Kovacevic, and Ryan Bosshart, and I learned while cutting our teeth on early versions |
| of Kudu. Specifically, you will learn:</p> |
| |
| <ul> |
|   <li>A theoretical understanding of Kudu concepts, explained in plain words and simple diagrams</li> |
| <li>Why, for many use cases, using Kudu is so much easier than other ecosystem storage technologies</li> |
| <li>How Kudu enables Hybrid Transactional/Analytical Processing (HTAP) use cases</li> |
| <li>How to design IoT, Predictive Modeling, and Mixed Platform Solutions using Kudu</li> |
| <li>How to design Kudu Schemas</li> |
| </ul> |
| |
| <p><img src="/img/2018-08-06-getting-started-with-kudu-an-oreilly-title.gif" alt="Getting Started with Kudu Cover" class="img-responsive" /></p> |
| |
| <p>Looking forward, I am excited to see Kudu gain additional features and adoption |
| and eventually the second revision of this title. In the meantime, if you have |
| feedback or questions, please reach out on the <code>#getting-started-kudu</code> channel of |
| the <a href="https://getkudu-slack.herokuapp.com/">Kudu Slack</a> or if you prefer non-real-time |
| communication, please use the user@ mailing list!</p></content><author><name>Brock Noland</name></author><summary>The following article by Brock Noland was reposted from the |
| phData |
| blog with their permission. |
| |
| Five years ago, enabling Data Science and Advanced Analytics on the |
| Hadoop platform was hard. Organizations required strong Software Engineering |
| capabilities to successfully implement complex Lambda architectures or even |
| simply implement continuous ingest. Updating or deleting data was simply a |
| nightmare. General Data Protection Regulation (GDPR) would have been an extreme |
| challenge at that time.</summary></entry><entry><title>Instrumentation in Apache Kudu</title><link href="/2018/07/10/instrumentation-in-kudu.html" rel="alternate" type="text/html" title="Instrumentation in Apache Kudu" /><published>2018-07-10T00:00:00-07:00</published><updated>2018-07-10T00:00:00-07:00</updated><id>/2018/07/10/instrumentation-in-kudu</id><content type="html" xml:base="/2018/07/10/instrumentation-in-kudu.html"><p>Last week, the <a href="http://opentracing.io/">OpenTracing</a> community invited me to |
| their monthly Google Hangout meetup to give an informal talk on tracing and |
| instrumentation in Apache Kudu.</p> |
| |
| <p>While Kudu doesn’t currently support distributed tracing using OpenTracing, |
| it does have quite a lot of other types of instrumentation, metrics, and |
| diagnostics logging. The OpenTracing team was interested to hear about some of |
| the approaches that Kudu has used, and so I gave a brief introduction to topics |
| including: |
| <!--more--> |
| - The Kudu <a href="/docs/administration.html#_diagnostics_logging">diagnostics log</a> |
| which periodically logs metrics and stack traces. |
| - The <a href="/docs/troubleshooting.html#kudu_tracing">process-wide tracing</a> |
| support based on the open source tracing framework implemented by Google Chrome. |
| - The <a href="/docs/troubleshooting.html#kudu_tracing">stack watchdog</a> |
| which helps us find various latency outliers and issues in our libraries and |
| the Linux kernel. |
| - <a href="/docs/troubleshooting.html#heap_sampling">Heap sampling</a> support |
| which helps us understand unexpected memory usage.</p> |
| |
| <p>If you’re interested in learning about these topics and more, check out the video recording |
| below. My talk spans the first 34 minutes.</p> |
| |
| <iframe width="800" height="500" src="https://www.youtube.com/embed/qBXwKU6Ubjo?end=2058&amp;start=23"> |
| </iframe> |
| |
| <p>If you have any questions about this content or about Kudu in general, |
| <a href="http://kudu.apache.org/community.html">join the community</a>.</p></content><author><name>Todd Lipcon</name></author><summary>Last week, the OpenTracing community invited me to |
| their monthly Google Hangout meetup to give an informal talk on tracing and |
| instrumentation in Apache Kudu. |
| |
| While Kudu doesn’t currently support distributed tracing using OpenTracing, |
| it does have quite a lot of other types of instrumentation, metrics, and |
| diagnostics logging. The OpenTracing team was interested to hear about some of |
| the approaches that Kudu has used, and so I gave a brief introduction to topics |
| including:</summary></entry><entry><title>Apache Kudu 1.7.0 released</title><link href="/2018/03/23/apache-kudu-1-7-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.7.0 released" /><published>2018-03-23T00:00:00-07:00</published><updated>2018-03-23T00:00:00-07:00</updated><id>/2018/03/23/apache-kudu-1-7-0-released</id><content type="html" xml:base="/2018/03/23/apache-kudu-1-7-0-released.html"><p>The Apache Kudu team is happy to announce the release of Kudu 1.7.0!</p> |
| |
| <p>Apache Kudu 1.7.0 is a minor release that offers new features, performance |
| optimizations, incremental improvements, and bug fixes.</p> |
| |
| <p>Release highlights:</p> |
| |
| <!--more--> |
| |
| <ol> |
| <li>Kudu now supports the decimal column type. The decimal type is a numeric |
| data type with fixed scale and precision suitable for financial and other |
| arithmetic calculations where the imprecise representation and rounding |
| behavior of float and double make those types impractical. The decimal type |
| is also useful for integers larger than int64 and cases with fractional values |
| in a primary key. See <a href="/releases/1.7.0/docs/schema_design.html#decimal">Decimal Type</a> |
| for more details.</li> |
| <li>The strategy Kudu uses for automatically healing tablets which have lost a |
| replica due to server or disk failures has been improved. The new re-replication |
| strategy, or replica management scheme, first adds a replacement tablet replica |
| before evicting the failed one.</li> |
|   <li>A new scan read mode, READ_YOUR_WRITES, has been added. Users can specify it when |
| creating a new scanner in the C++, Java, and Python clients. If this mode is used, |
| the client will perform a read such that it follows all previously known writes |
| and reads from this client. Reads in this mode ensure read-your-writes and |
| read-your-reads session guarantees, while minimizing latency caused by waiting |
| for outstanding write transactions to complete. Note that this is still an |
| experimental feature which may be stabilized in future releases.</li> |
| <li>The tablet server web UI scans dashboard (/scans) has been improved with several |
| new features, including: showing the most recently completed scans, a pseudo-SQL |
| scan descriptor that concisely shows the selected columns and applied predicates, |
| and more complete and better documented scan statistics.</li> |
| <li>Kudu daemons now expose a web page /stacks which dumps the current stack trace of |
| every thread running in the server. This information can be helpful when diagnosing |
| performance issues.</li> |
| <li>By default, each tablet replica will now stripe data blocks across 3 data directories |
| instead of all data directories. This decreases the likelihood that any given tablet |
| will be affected in the event of a single disk failure.</li> |
| <li>The Java client now uses a predefined prioritized list of TLS ciphers when |
| establishing an encrypted connection to Kudu servers. This cipher list matches the |
| list of ciphers preferred for server-to-server communication and ensures that the |
| most efficient and secure ciphers are preferred. When the Kudu client is running on |
| Java 8 or newer, this provides a substantial speed-up to read and write performance.</li> |
| <li>The performance of inserting rows containing many string or binary columns has been |
| improved, especially in the case of highly concurrent write workloads.</li> |
| <li>The Java client will now automatically attempt to re-acquire Kerberos credentials |
| from the ticket cache when the prior credentials are about to expire. This allows |
| client instances to persist longer than the expiration time of a single Kerberos |
| ticket so long as some other process renews the credentials in the ticket cache.</li> |
| </ol> |
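| |
| <p>As a quick illustration of the first highlight, a decimal column can now be |
| used in an Impala CREATE TABLE statement for a Kudu table, including in the |
| primary key (the table below is a hypothetical example, not taken from the |
| release notes):</p> |
| |
| <div class="highlight"><pre><code class="language-sql" data-lang="sql">-- DECIMAL(8, 2) stores exact values such as currency amounts. |
| CREATE TABLE prices ( |
|   id BIGINT, |
|   price DECIMAL(8, 2), |
|   PRIMARY KEY (id) |
| ) |
| PARTITION BY HASH (id) PARTITIONS 4 |
| STORED AS KUDU;</code></pre></div> |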
| |
| <p>For more details, and the complete list of changes in Kudu 1.7.0, please see |
| the <a href="/releases/1.7.0/docs/release_notes.html">Kudu 1.7.0 release notes</a>.</p> |
| |
| <p>The Apache Kudu project only publishes source code releases. To build Kudu |
| 1.7.0, follow these steps:</p> |
| |
| <ol> |
| <li>Download the <a href="/releases/1.7.0/">Kudu 1.7.0 source release</a>.</li> |
| <li>Follow the instructions in the documentation to |
| <a href="/releases/1.7.0/docs/installation.html#build_from_source">build Kudu 1.7.0 from source</a>.</li> |
| </ol> |
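| <p>If you would rather not build the Java client yourself, the prebuilt |
| artifacts described below can be pulled in as a Maven dependency instead. A |
| minimal sketch for the core client library follows; swap in the artifact for |
| the integration you need (for example, the Spark DataSource or Flume sink):</p> |

```xml
<!-- Kudu Java client, as published to the ASF Maven repository -->
<dependency>
  <groupId>org.apache.kudu</groupId>
  <artifactId>kudu-client</artifactId>
  <version>1.7.0</version>
</dependency>
```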
| |
| <p>For your convenience, binary JAR files for the Kudu Java client library, Spark |
| DataSource, Flume sink, and other Java integrations are published to the ASF |
| Maven repository and are |
| <a href="https://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.kudu%22%20AND%20v%3A%221.7.0%22">now available</a>.</p></content><author><name>Grant Henke</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 1.7.0! |
| |
| Apache Kudu 1.7.0 is a minor release that offers new features, performance |
| optimizations, incremental improvements, and bug fixes. |
| |
| Release highlights:</summary></entry><entry><title>Apache Kudu 1.6.0 released</title><link href="/2017/12/08/apache-kudu-1-6-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.6.0 released" /><published>2017-12-08T00:00:00-08:00</published><updated>2017-12-08T00:00:00-08:00</updated><id>/2017/12/08/apache-kudu-1-6-0-released</id><content type="html" xml:base="/2017/12/08/apache-kudu-1-6-0-released.html"><p>The Apache Kudu team is happy to announce the release of Kudu 1.6.0!</p> |
| |
| <p>Apache Kudu 1.6.0 is a minor release that offers new features, performance |
| optimizations, incremental improvements, and bug fixes.</p> |
| |
| <p>Release highlights:</p> |
| |
| <!--more--> |
| |
| <ol> |
| <li>Kudu servers can now tolerate short interruptions in NTP clock |
| synchronization. NTP synchronization is still required when any Kudu daemon |
| starts up.</li> |
| <li>Tablet servers will no longer crash when a disk containing data blocks |
| fails, unless that disk also stores WAL segments or tablet metadata. Instead |
| of crashing, the tablet server will shut down any tablets that may have lost |
| data locally and Kudu will re-replicate the affected tablets to another |
| tablet server. More information can be found in the documentation under |
| <a href="/releases/1.6.0/docs/administration.html#disk_failure_recovery">Recovering from Disk Failure</a>.</li> |
| <li>Tablet server startup time has been improved significantly on servers |
| containing large numbers of blocks.</li> |
| <li>The Spark DataSource integration can now take advantage of scan locality for |
| better scan performance. The scan will take place at the closest replica |
| instead of going to the leader.</li> |
| <li>Support for Spark 1 has been removed in Kudu 1.6.0; only Spark 2 is now |
| supported. Spark 1 support was deprecated in Kudu 1.5.0.</li> |
| <li>HybridTime timestamp propagation now works in the Java client when using |
| scan tokens.</li> |
| <li>Tablet servers now consider the health of all replicas of a tablet before |
| deciding to evict one. This can improve the stability of the Kudu cluster |
| when multiple servers temporarily go down at the same time.</li> |
| <li>A bug in the C++ client was fixed that could cause tablets to be erroneously |
| pruned, or skipped, during certain scans, resulting in fewer results than |
| expected being returned from queries. The bug only affected tables whose |
| range partition columns are a proper prefix of the primary key. |
| See <a href="https://issues.apache.org/jira/browse/KUDU-2173">KUDU-2173</a> for more |
| information.</li> |
| </ol> |
| |
| <p>For more details, and the complete list of changes in Kudu 1.6.0, please see |
| the <a href="/releases/1.6.0/docs/release_notes.html">Kudu 1.6.0 release notes</a>.</p> |
| |
| <p>The Apache Kudu project only publishes source code releases. To build Kudu |
| 1.6.0, follow these steps:</p> |
| |
| <ol> |
| <li>Download the <a href="/releases/1.6.0/">Kudu 1.6.0 source release</a>.</li> |
| <li>Follow the instructions in the documentation to |
| <a href="/releases/1.6.0/docs/installation.html#build_from_source">build Kudu 1.6.0 from source</a>.</li> |
| </ol> |
| |
| <p>For your convenience, binary JAR files for the Kudu Java client library, Spark |
| DataSource, Flume sink, and other Java integrations are published to the ASF |
| Maven repository and are |
| <a href="https://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.kudu%22%20AND%20v%3A%221.6.0%22">now available</a>.</p></content><author><name>Mike Percy</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 1.6.0! |
| |
| Apache Kudu 1.6.0 is a minor release that offers new features, performance |
| optimizations, incremental improvements, and bug fixes. |
| |
| Release highlights:</summary></entry><entry><title>Slides: A brave new world in mutable big data: Relational storage</title><link href="/2017/10/23/nosql-kudu-spanner-slides.html" rel="alternate" type="text/html" title="Slides: A brave new world in mutable big data: Relational storage" /><published>2017-10-23T00:00:00-07:00</published><updated>2017-10-23T00:00:00-07:00</updated><id>/2017/10/23/nosql-kudu-spanner-slides</id><content type="html" xml:base="/2017/10/23/nosql-kudu-spanner-slides.html"><p>Since the Apache Kudu project made its debut in 2015, there have been |
| a few common questions that kept coming up at every presentation:</p> |
| |
| <ul> |
| <li>Is Kudu an open source version of Google’s Spanner system?</li> |
| <li>Is Kudu NoSQL or SQL?</li> |
| <li>Why does Kudu have a relational data model? Isn’t SQL dead?</li> |
| </ul> |
| |
| <!--more--> |
| |
| <p>A few of these questions are addressed in the |
| <a href="https://kudu.apache.org/faq.html">Kudu FAQ</a>, but I thought they were |
| interesting enough that I decided to give a talk on these subjects |
| at <a href="https://conferences.oreilly.com/strata/strata-ny">Strata Data Conference NYC 2017</a>.</p> |
| |
| <p>Preparing this talk was particularly interesting, since Google recently released |
| Spanner to the public in SaaS form as <a href="https://cloud.google.com/spanner/">Google Cloud Spanner</a>. |
| This meant that I was able to compare Kudu and Spanner not just qualitatively, |
| based on academic papers, but quantitatively as well.</p> |
| |
| <p>To summarize the key points of the presentation:</p> |
| |
| <ul> |
| <li> |
| <p>Despite the growing popularity of “NoSQL” from 2009 through 2013, SQL has |
| once again become the access mechanism of choice for the majority of |
| analytic applications. NoSQL has become “Not Only SQL”.</p> |
| </li> |
| <li> |
| <p>Spanner and Kudu share a lot of common features. However:</p> |
| |
| <ul> |
| <li> |
| <p>Spanner offers a superior feature set and performance for Online |
| Transactional Processing (OLTP) workloads, including ACID transactions and |
| secondary indexing.</p> |
| </li> |
| <li> |
| <p>Kudu offers a superior feature set and performance for Online |
| Analytical Processing (OLAP) and Hybrid Transactional/Analytic Processing |
| (HTAP) workloads, including more complete SQL support and orders of |
| magnitude better performance on large queries.</p> |
| </li> |
| </ul> |
| </li> |
| </ul> |
| |
| <p>For more details and for the full benchmark numbers, check out the slide deck |
| below:</p> |
| |
| <iframe src="//www.slideshare.net/slideshow/embed_code/key/loQpO2vzlwGGgz" width="595" height="485" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe> |
| <div style="margin-bottom:15px"> <strong> <a href="//www.slideshare.net/ToddLipcon/a-brave-new-world-in-mutable-big-data-relational-storage-strata-nyc-2017" title="A brave new world in mutable big data relational storage (Strata NYC 2017)" target="_blank">A brave new world in mutable big data relational storage (Strata NYC 2017)</a> </strong> from <strong><a href="https://www.slideshare.net/ToddLipcon" target="_blank">Todd Lipcon</a></strong> </div> |
| |
| <p>Questions or comments? Join the <a href="/community.html">Apache Kudu Community</a> to discuss.</p></content><author><name>Todd Lipcon</name></author><summary>Since the Apache Kudu project made its debut in 2015, there have been |
| a few common questions that kept coming up at every presentation: |
| |
| |
| Is Kudu an open source version of Google’s Spanner system? |
| Is Kudu NoSQL or SQL? |
| Why does Kudu have a relational data model? Isn’t SQL dead?</summary></entry></feed> |