<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="A new open source Apache Hadoop ecosystem project, Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data" />
<meta name="author" content="Cloudera" />
<title>Apache Kudu - Apache Kudu Quickstart</title>
<!-- Bootstrap core CSS -->
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css"
integrity="sha384-1q8mTJOASx8j1Au+a5WDVnPi2lkFfwwEAa8hDDdjZlpLegxhjVME1fgjWPGmkzs7"
crossorigin="anonymous">
<!-- Custom styles for this template -->
<link href="/css/kudu.css" rel="stylesheet"/>
<link href="/css/asciidoc.css" rel="stylesheet"/>
<link rel="shortcut icon" href="/img/logo-favicon.ico" />
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.1/css/font-awesome.min.css" />
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<div class="kudu-site container-fluid">
<!-- Static navbar -->
<nav class="navbar navbar-default">
<div class="container-fluid">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="logo" href="/"><img
src="//d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_80px.png"
srcset="//d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_80px.png 1x, //d3dr9sfxru4sde.cloudfront.net/i/k/apachekudu_logo_0716_160px.png 2x"
alt="Apache Kudu"/></a>
</div>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav navbar-right">
<li >
<a href="/">Home</a>
</li>
<li >
<a href="/overview.html">Overview</a>
</li>
<li class="active">
<a href="/docs/">Documentation</a>
</li>
<li >
<a href="/releases/">Releases</a>
</li>
<li >
<a href="/blog/">Blog</a>
</li>
<!-- NOTE: this dropdown menu does not appear on Mobile, so don't add anything here
that doesn't also appear elsewhere on the site. -->
<li class="dropdown">
<a href="/community.html" role="button" aria-haspopup="true" aria-expanded="false">Community <span class="caret"></span></a>
<ul class="dropdown-menu">
<li class="dropdown-header">GET IN TOUCH</li>
<li><a class="icon email" href="/community.html">Mailing Lists</a></li>
<li><a class="icon slack" href="https://getkudu-slack.herokuapp.com/">Slack Channel</a></li>
<li role="separator" class="divider"></li>
<li><a href="/community.html#meetups-user-groups-and-conference-presentations">Events and Meetups</a></li>
<li><a href="/committers.html">Project Committers</a></li>
<li><a href="/ecosystem.html">Ecosystem</a></li>
<!--<li><a href="/roadmap.html">Roadmap</a></li>-->
<li><a href="/community.html#contributions">How to Contribute</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">DEVELOPER RESOURCES</li>
<li><a class="icon github" href="https://github.com/apache/incubator-kudu">GitHub</a></li>
<li><a class="icon gerrit" href="http://gerrit.cloudera.org:8080/#/q/status:open+project:kudu">Gerrit Code Review</a></li>
<li><a class="icon jira" href="https://issues.apache.org/jira/browse/KUDU">JIRA Issue Tracker</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">SOCIAL MEDIA</li>
<li><a class="icon twitter" href="https://twitter.com/ApacheKudu">Twitter</a></li>
<li><a href="https://www.reddit.com/r/kudu/">Reddit</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">APACHE SOFTWARE FOUNDATION</li>
<li><a href="https://www.apache.org/security/" target="_blank">Security</a></li>
<li><a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a></li>
<li><a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a></li>
<li><a href="https://www.apache.org/licenses/" target="_blank">License</a></li>
</ul>
</li>
<li >
<a href="/faq.html">FAQ</a>
</li>
</ul><!-- /.nav -->
</div><!-- /#navbar -->
</div><!-- /.container-fluid -->
</nav>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<div class="container">
<div class="row">
<div class="col-md-9">
<h1>Apache Kudu Quickstart</h1>
<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>Follow these instructions to set up and run the Kudu VM, and get started with Kudu, Kudu-enabled Impala,
and CDH in minutes.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="quickstart_vm"><a class="link" href="#quickstart_vm">Get The Kudu Quickstart VM</a></h2>
<div class="sectionbody">
<div class="sect2">
<h3 id="_prerequisites"><a class="link" href="#_prerequisites">Prerequisites</a></h3>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>Install <a href="https://www.virtualbox.org/">Oracle VirtualBox</a>. The VM has been tested
with VirtualBox version 4.3 on Ubuntu 14.04 and VirtualBox version 5 on OS X
10.9. VirtualBox is also available through most package managers, such as apt-get and brew.</p>
</li>
<li>
<p>After the installation, make sure that <code>VBoxManage</code> is in your <code>PATH</code> by using the
<code>which VBoxManage</code> command.</p>
</li>
</ol>
</div>
</div>
<div class="sect2">
<h3 id="_installation"><a class="link" href="#_installation">Installation</a></h3>
<div class="paragraph">
<p>To download and start the VM, execute the following command in a terminal window.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-bash" data-lang="bash">$ curl -s https://raw.githubusercontent.com/cloudera/kudu-examples/master/demo-vm-setup/bootstrap.sh | bash</code></pre>
</div>
</div>
<div class="paragraph">
<p>This command downloads a shell script which clones the <code>kudu-examples</code> Git repository and
then downloads a VM image of about 1.2 GB into the current working
directory.<sup class="footnote">[<a id="_footnoteref_1" class="footnote" href="#_footnote_1" title="View footnote.">1</a>]</sup> To examine the script instead of running it, remove
the <code>| bash</code> component of the command above. Once the setup is complete, you can verify
that everything works by connecting to the guest via SSH:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-bash" data-lang="bash">$ ssh demo@quickstart.cloudera</code></pre>
</div>
</div>
<div class="paragraph">
<p>The username and password for the demo account are both <code>demo</code>. In addition, the <code>demo</code>
user has password-less <code>sudo</code> privileges so that you can install additional software or
manage the guest OS. You can also access the <code>kudu-examples</code> repository as a shared folder in
<code>/home/demo/kudu-examples/</code> on the guest or from your VirtualBox shared folder location on
the host. This is a quick way to make scripts or data visible to the guest.</p>
</div>
<div class="paragraph">
<p>You can quickly verify that Kudu and Impala are running by executing the following commands:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-bash" data-lang="bash">$ ps aux | grep kudu
$ ps aux | grep impalad</code></pre>
</div>
</div>
<div class="paragraph">
<p>If you have issues connecting to the VM or one of the processes is not running, make sure
to consult the <a href="#trouble">Troubleshooting</a> section.</p>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_load_data"><a class="link" href="#_load_data">Load Data</a></h2>
<div class="sectionbody">
<div class="paragraph">
<p>To practice some typical operations with Kudu and Impala, we&#8217;ll use the
<a href="https://data.sfgov.org/Transportation/Raw-AVL-GPS-data/5fk7-ivit/data">San Francisco MTA
GPS dataset</a>. This dataset contains raw location data transmitted periodically from
sensors installed on the buses in the SF MTA&#8217;s fleet.</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>Download the sample data and load it into HDFS</p>
<div class="paragraph">
<p>First we&#8217;ll download the sample dataset, prepare it, and upload it into the HDFS
cluster.</p>
</div>
<div class="paragraph">
<p>The SF MTA&#8217;s site is often a bit slow, so we&#8217;ve mirrored a sample CSV file from the
dataset at <a href="http://kudu-sample-data.s3.amazonaws.com/sfmtaAVLRawData01012013.csv.gz" class="bare">http://kudu-sample-data.s3.amazonaws.com/sfmtaAVLRawData01012013.csv.gz</a></p>
</div>
<div class="paragraph">
<p>The original dataset uses DOS-style line endings, so we&#8217;ll convert it to
UNIX-style during the upload process using <code>tr</code>.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-bash" data-lang="bash">$ wget http://kudu-sample-data.s3.amazonaws.com/sfmtaAVLRawData01012013.csv.gz
$ hdfs dfs -mkdir /sfmta
$ zcat sfmtaAVLRawData01012013.csv.gz | tr -d '\r' | hadoop fs -put - /sfmta/data.csv</code></pre>
</div>
</div>
</li>
<li>
<p>Create a new external Impala table to access the plain text data. To connect to Impala
in the virtual machine, issue the following command:</p>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-bash" data-lang="bash">$ ssh demo@quickstart.cloudera -t impala-shell</code></pre>
</div>
</div>
<div class="paragraph">
<p>Now, you can execute the following commands:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-sql" data-lang="sql">CREATE EXTERNAL TABLE sfmta_raw (
revision int,
report_time string,
vehicle_tag int,
longitude float,
latitude float,
speed float,
heading float
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/sfmta/'
TBLPROPERTIES ('skip.header.line.count'='1');</code></pre>
</div>
</div>
</li>
<li>
<p>To validate that the data was actually loaded, run the following command:</p>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-sql" data-lang="sql">SELECT count(*) FROM sfmta_raw;
+----------+
| count(*) |
+----------+
| 859086 |
+----------+</code></pre>
</div>
</div>
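<div class="paragraph">
<p>Beyond the row count, it can help to look at a few rows to confirm that the columns were
parsed as expected, in particular the format of the <code>report_time</code> strings. The following
is a minimal sketch; the choice of columns and the limit of 5 rows are arbitrary.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-sql" data-lang="sql">-- Peek at a handful of raw rows; report_time is still a string at this point.
SELECT report_time, vehicle_tag, longitude, latitude, speed, heading
FROM sfmta_raw
LIMIT 5;</code></pre>
</div>
</div>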
</li>
<li>
<p>Next, we&#8217;ll create a Kudu table and load the data. Note that we convert
the string <code>report_time</code> field into a Unix-style timestamp for more efficient
storage.</p>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-sql" data-lang="sql">CREATE TABLE sfmta
PRIMARY KEY (report_time, vehicle_tag)
PARTITION BY HASH(report_time) PARTITIONS 8
STORED AS KUDU
AS SELECT
UNIX_TIMESTAMP(report_time, 'MM/dd/yyyy HH:mm:ss') AS report_time,
vehicle_tag,
longitude,
latitude,
speed,
heading
FROM sfmta_raw;
+------------------------+
| summary |
+------------------------+
| Inserted 859086 row(s) |
+------------------------+
Fetched 1 row(s) in 5.75s</code></pre>
</div>
</div>
<div class="paragraph">
<p>The created table uses a composite primary key. See
<a href="kudu_impala_integration.html#kudu_impala">Kudu Impala Integration</a> for a more detailed
introduction to the extended SQL syntax for Impala.</p>
</div>
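<div class="paragraph">
<p>As a quick sanity check on the timestamp conversion, a query along the following lines
should render the stored integers as readable timestamps again. This is a sketch, not part of
the original walkthrough; the alias <code>report_time_readable</code> is an illustrative name.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-sql" data-lang="sql">-- Inspect the column types of the new Kudu table.
DESCRIBE sfmta;

-- Convert the stored Unix-style timestamp back to a readable string.
SELECT report_time,
       from_unixtime(report_time) AS report_time_readable,
       vehicle_tag,
       speed
FROM sfmta
LIMIT 5;</code></pre>
</div>
</div>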
</li>
</ol>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_read_and_modify_data"><a class="link" href="#_read_and_modify_data">Read and Modify Data</a></h2>
<div class="sectionbody">
<div class="paragraph">
<p>Now that the data is stored in Kudu, you can run queries against it. The following query
finds the data point containing the highest recorded vehicle speed.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-sql" data-lang="sql">SELECT * FROM sfmta ORDER BY speed DESC LIMIT 1;
+-------------+-------------+--------------------+-------------------+-------------------+---------+
| report_time | vehicle_tag | longitude | latitude | speed | heading |
+-------------+-------------+--------------------+-------------------+-------------------+---------+
| 1357022342 | 5411 | -122.3968811035156 | 37.76665878295898 | 68.33300018310547 | 82 |
+-------------+-------------+--------------------+-------------------+-------------------+---------+</code></pre>
</div>
</div>
<div class="paragraph">
<p>With a quick <a href="https://www.google.com/search?q=122.3968811035156W+37.76665878295898N">Google search</a>
we can see that this bus was traveling east on 16th Street at 68 MPH.
At first glance, this seems unlikely to be true. Suppose we do some research,
find that this bus&#8217;s sensor equipment was broken, and decide to
remove the data. With Kudu, this is easy to do using standard
SQL:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-sql" data-lang="sql">DELETE FROM sfmta WHERE vehicle_tag = 5411;
-- Modified 1169 row(s), 0 row error(s) in 0.25s</code></pre>
</div>
</div>
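<div class="paragraph">
<p>To confirm that the rows are gone, a follow-up count for the same vehicle should return 0,
and re-running the earlier query shows the next-highest recorded speed. This is a minimal
sketch of such a check.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-sql" data-lang="sql">-- No rows should remain for the vehicle with the broken sensor.
SELECT count(*) FROM sfmta WHERE vehicle_tag = 5411;

-- The highest remaining speed now comes from a different vehicle.
SELECT * FROM sfmta ORDER BY speed DESC LIMIT 1;</code></pre>
</div>
</div>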
</div>
</div>
<div class="sect1">
<h2 id="_next_steps"><a class="link" href="#_next_steps">Next steps</a></h2>
<div class="sectionbody">
<div class="paragraph">
<p>The above example showed how to load, query, and mutate a static dataset with Impala
and Kudu. The real power of Kudu, however, is the ability to ingest and mutate data
in a streaming fashion.</p>
</div>
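<div class="paragraph">
<p>From <code>impala-shell</code>, row-at-a-time ingest can be approximated with <code>UPSERT</code>, which
inserts a row or overwrites the existing row that has the same primary key. The following is a
minimal sketch, assuming an Impala release that supports <code>UPSERT</code> on Kudu tables; the
reading itself is invented for illustration and is not part of the dataset.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-sql" data-lang="sql">-- Hypothetical reading: every value below is made up for illustration.
-- Explicit casts avoid relying on implicit conversion to FLOAT.
UPSERT INTO sfmta (report_time, vehicle_tag, longitude, latitude, speed, heading)
VALUES (1357022400, 1001,
        CAST(-122.41 AS FLOAT), CAST(37.77 AS FLOAT),
        CAST(12.5 AS FLOAT), CAST(90 AS FLOAT));

-- Running the same statement with a different speed replaces the row,
-- because (report_time, vehicle_tag) is the primary key.
UPSERT INTO sfmta (report_time, vehicle_tag, longitude, latitude, speed, heading)
VALUES (1357022400, 1001,
        CAST(-122.41 AS FLOAT), CAST(37.77 AS FLOAT),
        CAST(14.0 AS FLOAT), CAST(90 AS FLOAT));</code></pre>
</div>
</div>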
<div class="paragraph">
<p>As an exercise to learn the Kudu programmatic APIs, try implementing a program
that uses the <a href="http://www.nextbus.com/xmlFeedDocs/NextBusXMLFeed.pdf">SFMTA
XML data feed</a> to ingest this same dataset in real time into the Kudu table.</p>
</div>
<div class="sect2">
<h3 id="trouble"><a class="link" href="#trouble">Troubleshooting</a></h3>
<div class="sect3">
<h4 id="_problems_accessing_the_vm_via_ssh"><a class="link" href="#_problems_accessing_the_vm_via_ssh">Problems accessing the VM via SSH</a></h4>
<div class="ulist">
<ul>
<li>
<p>Make sure the host has an SSH client installed.</p>
</li>
<li>
<p>Make sure the VM is running by executing the following command and checking for a VM called <code>kudu-demo</code>:</p>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-bash" data-lang="bash">$ VBoxManage list runningvms</code></pre>
</div>
</div>
</li>
<li>
<p>Verify that the VM&#8217;s IP address is included in the host&#8217;s <code>/etc/hosts</code> file. You should
see a line that includes an IP address followed by the hostname
<code>quickstart.cloudera</code>. To check the running VM&#8217;s IP address, use the <code>VBoxManage</code>
command below.</p>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-bash" data-lang="bash">$ VBoxManage guestproperty get kudu-demo /VirtualBox/GuestInfo/Net/0/V4/IP
Value: 192.168.56.100</code></pre>
</div>
</div>
</li>
<li>
<p>If you&#8217;ve used a Cloudera Quickstart VM before, your <code>.ssh/known_hosts</code> file may
contain references to the previous VM&#8217;s SSH credentials. Remove any references to
<code>quickstart.cloudera</code> from this file.</p>
</li>
</ul>
</div>
</div>
<div class="sect3">
<h4 id="_failing_with_lack_of_sse4_2_support_when_running_inside_virtualbox"><a class="link" href="#_failing_with_lack_of_sse4_2_support_when_running_inside_virtualbox">Failing with lack of SSE4.2 support when running inside VirtualBox</a></h4>
<div class="ulist">
<ul>
<li>
<p>Running Kudu currently requires a CPU that supports SSE4.2 (Nehalem or later for Intel). To pass through SSE4.2 support into the guest VM, refer to the <a href="https://www.virtualbox.org/manual/ch09.html#sse412passthrough">VirtualBox documentation</a>.</p>
</li>
</ul>
</div>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_next_steps_2"><a class="link" href="#_next_steps_2">Next Steps</a></h2>
<div class="sectionbody">
<div class="ulist">
<ul>
<li>
<p><a href="installation.html">Installing Kudu</a></p>
</li>
<li>
<p><a href="configuration.html">Configuring Kudu</a></p>
</li>
</ul>
</div>
</div>
</div>
</div>
<div class="col-md-3">
<div id="toc" data-spy="affix" data-offset-top="70">
<ul>
<li>
<a href="index.html">Introducing Kudu</a>
</li>
<li>
<a href="release_notes.html">Kudu Release Notes</a>
</li>
<li>
<span class="active-toc">Getting Started with Kudu</span>
<ul class="sectlevel1">
<li><a href="#quickstart_vm">Get The Kudu Quickstart VM</a>
<ul class="sectlevel2">
<li><a href="#_prerequisites">Prerequisites</a></li>
<li><a href="#_installation">Installation</a></li>
</ul>
</li>
<li><a href="#_load_data">Load Data</a></li>
<li><a href="#_read_and_modify_data">Read and Modify Data</a></li>
<li><a href="#_next_steps">Next steps</a>
<ul class="sectlevel2">
<li><a href="#trouble">Troubleshooting</a></li>
</ul>
</li>
<li><a href="#_next_steps_2">Next Steps</a></li>
</ul>
</li>
<li>
<a href="installation.html">Installation Guide</a>
</li>
<li>
<a href="configuration.html">Configuring Kudu</a>
</li>
<li>
<a href="kudu_impala_integration.html">Using Impala with Kudu</a>
</li>
<li>
<a href="administration.html">Administering Kudu</a>
</li>
<li>
<a href="troubleshooting.html">Troubleshooting Kudu</a>
</li>
<li>
<a href="developing.html">Developing Applications with Kudu</a>
</li>
<li>
<a href="schema_design.html">Kudu Schema Design</a>
</li>
<li>
<a href="transaction_semantics.html">Kudu Transaction Semantics</a>
</li>
<li>
<a href="contributing.html">Contributing to Kudu</a>
</li>
<li>
<a href="style_guide.html">Kudu Documentation Style Guide</a>
</li>
<li>
<a href="configuration_reference.html">Kudu Configuration Reference</a>
</li>
<li>
<a href="known_issues.html">Known Issues and Limitations</a>
</li>
<li>
<a href="export_control.html">Export Control Notice</a>
</li>
</ul>
</div>
</div>
</div>
</div>
<div id="footnotes">
<hr>
<div class="footnote" id="_footnote_1">
<a href="#_footnoteref_1">1</a>. In addition, the script will create a host-only network between host and guest and set up an entry in the <code>/etc/hosts</code> file with the name <code>quickstart.cloudera</code> and the guest&#8217;s IP address.
</div>
</div>
<footer class="footer">
<div class="row">
<div class="col-md-9">
<p class="small">
Copyright &copy; 2020 The Apache Software Foundation. Last updated 2017-03-01 12:43:33 PST
</p>
<p class="small">
Apache Kudu, Kudu, Apache, the Apache feather logo, and the Apache Kudu
project logo are either registered trademarks or trademarks of The
Apache Software Foundation in the United States and other countries.
</p>
</div>
<div class="col-md-3">
<a class="pull-right" href="https://www.apache.org/events/current-event.html">
<img src="https://www.apache.org/events/current-event-234x60.png"/>
</a>
</div>
</div>
</footer>
</div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<script>
// Try to detect touch-screen devices. Note: Many laptops have touch screens.
$(document).ready(function() {
if ("ontouchstart" in document.documentElement) {
$(document.documentElement).addClass("touch");
} else {
$(document.documentElement).addClass("no-touch");
}
});
</script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js"
integrity="sha384-0mSbJDEHialfmuBBQP6A4Qrprq5OVfW37PRR3j5ELqxss1yVqOtnepnHVP9aJ7xS"
crossorigin="anonymous"></script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-68448017-1', 'auto');
ga('send', 'pageview');
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/3.1.0/anchor.js"></script>
<script>
anchors.options = {
placement: 'right',
visible: 'touch',
};
anchors.add();
</script>
</body>
</html>