<!DOCTYPE html>
<html lang="en">
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="A new open source Apache Hadoop ecosystem project, Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data" />
<meta name="author" content="Cloudera" />
<title>Apache Kudu - Simplified Data Pipelines with Kudu</title>
<!-- Bootstrap core CSS -->
<link rel="stylesheet" href="" />
<!-- Custom styles for this template -->
<link href="/css/kudu.css" rel="stylesheet"/>
<link href="/css/asciidoc.css" rel="stylesheet"/>
<link rel="shortcut icon" href="/img/logo-favicon.ico" />
<link rel="stylesheet" href="" />
<link rel="alternate" type="application/atom+xml"
title="RSS Feed for Apache Kudu blog"
href="/feed.xml" />
<div class="kudu-site container-fluid">
<!-- Static navbar -->
<nav class="navbar navbar-default">
<div class="container-fluid">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<a class="logo" href="/"><img
srcset="// 1x, // 2x"
alt="Apache Kudu"/></a>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav navbar-right">
<li >
<a href="/">Home</a>
<li >
<a href="/overview.html">Overview</a>
<li >
<a href="/docs/">Documentation</a>
<li >
<a href="/releases/">Releases</a>
<li class="active">
<a href="/blog/">Blog</a>
<!-- NOTE: this dropdown menu does not appear on Mobile, so don't add anything here
that doesn't also appear elsewhere on the site. -->
<li class="dropdown">
<a href="/community.html" role="button" aria-haspopup="true" aria-expanded="false">Community <span class="caret"></span></a>
<ul class="dropdown-menu">
<li class="dropdown-header">GET IN TOUCH</li>
<li><a class="icon email" href="/community.html">Mailing Lists</a></li>
<li><a class="icon slack" href="">Slack Channel</a></li>
<li role="separator" class="divider"></li>
<li><a href="/community.html#meetups-user-groups-and-conference-presentations">Events and Meetups</a></li>
<li><a href="/committers.html">Project Committers</a></li>
<!--<li><a href="/roadmap.html">Roadmap</a></li>-->
<li><a href="/community.html#contributions">How to Contribute</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">DEVELOPER RESOURCES</li>
<li><a class="icon github" href="">GitHub</a></li>
<li><a class="icon gerrit" href="">Gerrit Code Review</a></li>
<li><a class="icon jira" href="">JIRA Issue Tracker</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">SOCIAL MEDIA</li>
<li><a class="icon twitter" href="">Twitter</a></li>
<li><a href="">Reddit</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">APACHE SOFTWARE FOUNDATION</li>
<li><a href="" target="_blank">Security</a></li>
<li><a href="" target="_blank">Sponsorship</a></li>
<li><a href="" target="_blank">Thanks</a></li>
<li><a href="" target="_blank">License</a></li>
<li >
<a href="/faq.html">FAQ</a>
</ul><!-- /.nav -->
</div><!-- /#navbar -->
</div><!-- /.container-fluid -->
<div class="row header">
<div class="col-lg-12">
<h2><a href="/blog">Apache Kudu Blog</a></h2>
<div class="row-fluid">
<div class="col-lg-9">
<h1 class="entry-title">Simplified Data Pipelines with Kudu</h1>
<p class="meta">Posted 11 Sep 2018 by Mac Noland</p>
<div class="entry-content">
<p>I’ve been working with Hadoop for over seven years and, fortunately or unfortunately, have run
across a lot of structured data use cases. What we at <a href="">phData</a> have found is
that end users are typically comfortable with tabular data and prefer to access their data in a
structured manner using tables.</p>
<p>When working on new structured data projects, the first question we always get from non-Hadoop
followers is, <em>“how do I update or delete a record?”</em> The second question we get is, <em>“when adding
records, why don’t they show up in Impala right away?”</em> For those of us who have worked with HDFS
and Impala on HDFS for years, these are simple questions to answer, but hard ones to explain.</p>
<p>The pre-Kudu years were filled with hundreds (or thousands) of self-join views (or materialization jobs)
and compaction jobs, along with scheduled jobs to refresh the Impala cache periodically so that new records
would show up. While doable, for tens of thousands of tables this became a distraction from solving
real business problems.</p>
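<p>To make the cost of that pattern concrete, here is a minimal sketch of what a self-join view and a
compaction job had to compute over an append-only change log. It is plain in-memory Python, not Impala
or HDFS code, and the <code>ChangeRecord</code> layout and function names are hypothetical, chosen only
for illustration.</p>

```python
from typing import Any, Dict, List, NamedTuple, Optional


class ChangeRecord(NamedTuple):
    """One row of a hypothetical append-only change log (not a real schema)."""
    key: int
    version: int  # increases with every change to the same key
    deleted: bool  # True marks a delete (a tombstone)
    payload: Optional[Dict[str, Any]]


def newest_per_key(log: List[ChangeRecord]) -> Dict[int, ChangeRecord]:
    """Pick the highest-version record for each key, as a self-join view would."""
    newest: Dict[int, ChangeRecord] = {}
    for rec in log:
        current = newest.get(rec.key)
        if current is None or rec.version > current.version:
            newest[rec.key] = rec
    return newest


def latest_view(log: List[ChangeRecord]) -> Dict[int, Optional[Dict[str, Any]]]:
    """Current state of the table: newest record per key, with deletes hidden."""
    return {k: r.payload for k, r in newest_per_key(log).items() if not r.deleted}


def compact(log: List[ChangeRecord]) -> List[ChangeRecord]:
    """Rewrite the log so it holds only the live, newest record per key."""
    live = [r for r in newest_per_key(log).values() if not r.deleted]
    return sorted(live, key=lambda r: r.key)


log = [
    ChangeRecord(1, 1, False, {"status": "new"}),
    ChangeRecord(2, 1, False, {"status": "new"}),
    ChangeRecord(1, 2, False, {"status": "shipped"}),
    ChangeRecord(2, 2, True, None),
]
print(latest_view(log))
print(len(log), "->", len(compact(log)))
```

<p>Every table needed its own version of this logic, plus a schedule to run it, which is where the
operational burden came from.</p>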
<p>With the introduction of Kudu, mixing record-level updates, deletes, and inserts while supporting
large scans is now something we can sustainably manage at scale. HBase is very good at record-level
updates, deletes, and inserts, but doesn’t scale well for analytic use cases that often do full
table scans. Moreover, for streaming use cases, changes are available in near real time. End users,
accustomed to having to <em>“wait”</em> for their data, can now consume the data as it arrives.</p>
<p>A common data ingest pattern where Kudu becomes necessary is change data capture (CDC): capturing
inserts, updates, and hard deletes and streaming them into Kudu, where they can be applied
immediately. Pre-Kudu, this pipeline was very tedious to implement. Now, with tools like
<a href="">StreamSets</a>, you can get up and running in a few hours.</p>
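<p>As a rough illustration of what “applied immediately” means for a CDC stream, the sketch below applies
change events to a table keyed by primary key. It is an in-memory stand-in written in plain Python; it
does not use the Kudu client or StreamSets, and the event layout and operation names are assumptions
made for this example.</p>

```python
from typing import Any, Dict, Iterable, Tuple

# A change event as (operation, primary_key, changed_columns). This tuple
# layout is an assumption for the sketch, not a StreamSets or Kudu format.
ChangeEvent = Tuple[str, int, Dict[str, Any]]


def apply_changes(table: Dict[int, Dict[str, Any]],
                  events: Iterable[ChangeEvent]) -> None:
    """Apply CDC events in arrival order to a table keyed by primary key."""
    for op, key, row in events:
        if op in ("insert", "update"):
            # Treat both as an upsert so that replayed events are harmless.
            table[key] = {**table.get(key, {}), **row}
        elif op == "delete":
            # Hard delete; a key that is already gone is not an error.
            table.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")


table: Dict[int, Dict[str, Any]] = {}
apply_changes(table, [
    ("insert", 1, {"name": "truck-1", "load": 10}),
    ("insert", 2, {"name": "truck-2", "load": 7}),
    ("update", 1, {"load": 12}),
    ("delete", 2, {}),
])
print(table)  # {1: {'name': 'truck-1', 'load': 12}}
```

<p>Because each event changes the keyed row in place, a reader sees the current state after every event,
with no separate merge or compaction step.</p>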
<p>A second common workflow is near real-time analytics. We’ve streamed data off mining trucks,
oil wells, and manufacturing lines, and needed to make that data available to end users immediately. No
longer do we need to batch up writes, flush to HDFS, and then refresh the cache in Impala. As mentioned
before, with Kudu, the data is available as soon as it lands. This has been a significant
enhancement for end users, who previously had to <em>“wait”</em> for data.</p>
<p>In summary, Kudu has made a tremendous impact in removing the operational distractions of merging in
changes and refreshing the cache of downstream consumers. This allows data engineers
and users to focus on solving business problems, rather than being bothered by the tediousness of
the backend.</p>
<div class="col-lg-3 recent-posts">
<h3>Recent posts</h3>
<li> <a href="/2020/07/30/building-near-real-time-big-data-lake.html">Building Near Real-time Big Data Lake</a> </li>
<li> <a href="/2020/05/18/apache-kudu-1-12-0-release.html">Apache Kudu 1.12.0 released</a> </li>
<li> <a href="/2019/11/20/apache-kudu-1-11-1-release.html">Apache Kudu 1.11.1 released</a> </li>
<li> <a href="/2019/11/20/apache-kudu-1-10-1-release.html">Apache Kudu 1.10.1 released</a> </li>
<li> <a href="/2019/07/09/apache-kudu-1-10-0-release.html">Apache Kudu 1.10.0 Released</a> </li>
<li> <a href="/2019/04/30/location-awareness.html">Location Awareness in Kudu</a> </li>
<li> <a href="/2019/04/22/fine-grained-authorization-with-apache-kudu-and-impala.html">Fine-Grained Authorization with Apache Kudu and Impala</a> </li>
<li> <a href="/2019/03/19/testing-apache-kudu-applications-on-the-jvm.html">Testing Apache Kudu Applications on the JVM</a> </li>
<li> <a href="/2019/03/15/apache-kudu-1-9-0-release.html">Apache Kudu 1.9.0 Released</a> </li>
<li> <a href="/2019/03/05/transparent-hierarchical-storage-management-with-apache-kudu-and-impala.html">Transparent Hierarchical Storage Management with Apache Kudu and Impala</a> </li>
<li> <a href="/2018/12/11/call-for-posts.html">Call for Posts</a> </li>
<li> <a href="/2018/10/26/apache-kudu-1-8-0-released.html">Apache Kudu 1.8.0 Released</a> </li>
<li> <a href="/2018/09/26/index-skip-scan-optimization-in-kudu.html">Index Skip Scan Optimization in Kudu</a> </li>
<li> <a href="/2018/09/11/simplified-pipelines-with-kudu.html">Simplified Data Pipelines with Kudu</a> </li>
<li> <a href="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html">Getting Started with Kudu - an O'Reilly Title</a> </li>
<footer class="footer">
<div class="row">
<div class="col-md-9">
<p class="small">
Copyright &copy; 2019 The Apache Software Foundation.
<p class="small">
Apache Kudu, Kudu, Apache, the Apache feather logo, and the Apache Kudu
project logo are either registered trademarks or trademarks of The
Apache Software Foundation in the United States and other countries.
<div class="col-md-3">
<a class="pull-right" href="">
<img src=""/>