| --- |
| title: Apache Kudu Quickstart |
| layout: default |
| active_nav: docs |
| last_updated: 'Last updated 2017-08-14 13:47:38 PDT' |
| --- |
| <!-- |
| |
| Licensed under the Apache License, Version 2.0 (the "License"); |
| you may not use this file except in compliance with the License. |
| You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| |
| <div class="container"> |
| <div class="row"> |
| <div class="col-md-9"> |
| |
| <h1>Apache Kudu Quickstart</h1> |
| <div id="preamble"> |
| <div class="sectionbody"> |
| <div class="paragraph"> |
| <p>Follow these instructions to set up and run the Kudu VM, and get started with Kudu, Kudu_Impala, |
| and CDH in minutes.</p> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="quickstart_vm"><a class="link" href="#quickstart_vm">Get The Kudu Quickstart VM</a></h2> |
| <div class="sectionbody"> |
| <div class="sect2"> |
| <h3 id="_prerequisites"><a class="link" href="#_prerequisites">Prerequisites</a></h3> |
| <div class="olist arabic"> |
| <ol class="arabic"> |
| <li> |
| <p>Install <a href="https://www.virtualbox.org/">Oracle VirtualBox</a>. The VM has been tested to work |
| with VirtualBox version 4.3 on Ubuntu 14.04 and VirtualBox version 5 on OS X |
| 10.9. VirtualBox is also available through most package managers, such as apt-get and brew.</p> |
| </li> |
| <li> |
| <p>After the installation, make sure that <code>VBoxManage</code> is in your <code>PATH</code> by using the |
| <code>which VBoxManage</code> command.</p> |
| </li> |
| </ol> |
| </div> |
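| <div class="paragraph"> |
| <p>As a minimal sketch of the check in step 2, the snippet below uses the POSIX <code>command -v</code> builtin, which serves the same purpose as <code>which</code> here. The <code>require_tool</code> helper is a hypothetical name introduced for illustration and is not part of the quickstart scripts.</p> |
| </div> |

```shell
# Hypothetical helper: report whether a tool can be found on the PATH.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at $(command -v "$1")"
  else
    echo "$1 not found in PATH" >&2
    return 1
  fi
}

# Check for VirtualBox's command-line tool; print a hint if it is missing.
require_tool VBoxManage || echo "Install VirtualBox or add it to your PATH."
```

| <div class="paragraph"> |
| <p>Returning a non-zero status on failure lets a setup script stop early instead of failing later when the VM is started.</p> |
| </div> |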
| </div> |
| <div class="sect2"> |
| <h3 id="_installation"><a class="link" href="#_installation">Installation</a></h3> |
| <div class="paragraph"> |
| <p>To download and start the VM, execute the following command in a terminal window.</p> |
| </div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre class="highlight"><code class="language-bash" data-lang="bash">$ curl -s https://raw.githubusercontent.com/cloudera/kudu-examples/master/demo-vm-setup/bootstrap.sh | bash</code></pre> |
| </div> |
| </div> |
| <div class="paragraph"> |
| <p>This command downloads a shell script which clones the <code>kudu-examples</code> Git repository and |
| then downloads a VM image of about 1.2 GB into the current working |
| directory.<sup class="footnote">[<a id="_footnoteref_1" class="footnote" href="#_footnote_1" title="View footnote.">1</a>]</sup> To examine the script instead of running it, remove |
| the <code>| bash</code> component of the command above. Once the setup is complete, you can verify |
| that everything works by connecting to the guest via SSH:</p> |
| </div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre class="highlight"><code class="language-bash" data-lang="bash">$ ssh demo@quickstart.cloudera</code></pre> |
| </div> |
| </div> |
| <div class="paragraph"> |
| <p>The username and password for the demo account are both <code>demo</code>. In addition, the <code>demo</code> |
| user has password-less <code>sudo</code> privileges so that you can install additional software or |
| manage the guest OS. You can also access the <code>kudu-examples</code> repository as a shared folder in |
| <code>/home/demo/kudu-examples/</code> on the guest or from your VirtualBox shared folder location on |
| the host. This is a quick way to make scripts or data visible to the guest.</p> |
| </div> |
| <div class="paragraph"> |
| <p>You can quickly verify that Kudu and Impala are running by executing the following commands:</p> |
| </div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre class="highlight"><code class="language-bash" data-lang="bash">$ ps aux | grep kudu |
| $ ps aux | grep impalad</code></pre> |
| </div> |
| </div> |
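| <div class="paragraph"> |
| <p>One caveat with this kind of check is that <code>grep</code> can match its own entry in the process list. The sketch below illustrates the common bracket-pattern workaround on made-up <code>ps</code>-style lines; the process paths and PIDs are assumptions chosen for the example, not output from the VM.</p> |
| </div> |

```shell
# Made-up `ps aux`-style lines for illustration; real output has more columns.
sample_ps='kudu 1201 /usr/sbin/kudu-master
kudu 1202 /usr/sbin/kudu-tserver
demo 4242 grep kudu'

# A plain pattern also counts the line for the grep process itself.
printf '%s\n' "$sample_ps" | grep -c 'kudu'

# With the bracket trick the grep command line reads "grep [k]udu", and the
# text "[k]udu" is not matched by the pattern [k]udu, so that line drops out.
sample_ps_bracket='kudu 1201 /usr/sbin/kudu-master
kudu 1202 /usr/sbin/kudu-tserver
demo 4242 grep [k]udu'
printf '%s\n' "$sample_ps_bracket" | grep -c '[k]udu'
```

| <div class="paragraph"> |
| <p>On the guest, the equivalent command would be <code>ps aux | grep '[k]udu'</code>; any line it prints belongs to a process other than <code>grep</code>.</p> |
| </div> |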
| <div class="paragraph"> |
| <p>If you have issues connecting to the VM or one of the processes is not running, make sure |
| to consult the <a href="#trouble">Troubleshooting</a> section.</p> |
| </div> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_load_data"><a class="link" href="#_load_data">Load Data</a></h2> |
| <div class="sectionbody"> |
| <div class="paragraph"> |
| <p>To practice some typical operations with Kudu and Impala, we’ll use the |
| <a href="https://data.sfgov.org/Transportation/Raw-AVL-GPS-data/5fk7-ivit/data">San Francisco MTA |
| GPS dataset</a>. This dataset contains raw location data transmitted periodically from |
| sensors installed on the buses in the SF MTA’s fleet.</p> |
| </div> |
| <div class="olist arabic"> |
| <ol class="arabic"> |
| <li> |
| <p>Download the sample data and load it into HDFS</p> |
| <div class="paragraph"> |
| <p>First we’ll download the sample dataset, prepare it, and upload it into the HDFS |
| cluster.</p> |
| </div> |
| <div class="paragraph"> |
| <p>The SF MTA’s site is often a bit slow, so we’ve mirrored a sample CSV file from the |
| dataset at <a href="http://kudu-sample-data.s3.amazonaws.com/sfmtaAVLRawData01012013.csv.gz" class="bare">http://kudu-sample-data.s3.amazonaws.com/sfmtaAVLRawData01012013.csv.gz</a></p> |
| </div> |
| <div class="paragraph"> |
| <p>The original dataset uses DOS-style line endings, so we’ll convert it to |
| UNIX-style during the upload process using <code>tr</code>.</p> |
| </div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre class="highlight"><code class="language-bash" data-lang="bash">$ wget http://kudu-sample-data.s3.amazonaws.com/sfmtaAVLRawData01012013.csv.gz |
| $ hdfs dfs -mkdir /sfmta |
| $ zcat sfmtaAVLRawData01012013.csv.gz | tr -d '\r' | hadoop fs -put - /sfmta/data.csv</code></pre> |
| </div> |
| </div> |
| </li> |
| <li> |
| <p>Create a new external Impala table to access the plain text data. To connect to Impala |
| in the virtual machine, issue the following command:</p> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre class="highlight"><code class="language-bash" data-lang="bash">ssh demo@quickstart.cloudera -t impala-shell</code></pre> |
| </div> |
| </div> |
| <div class="paragraph"> |
| <p>Now, you can execute the following commands:</p> |
| </div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre class="highlight"><code class="language-sql" data-lang="sql">CREATE EXTERNAL TABLE sfmta_raw ( |
| revision int, |
| report_time string, |
| vehicle_tag int, |
| longitude float, |
| latitude float, |
| speed float, |
| heading float |
| ) |
| ROW FORMAT DELIMITED |
| FIELDS TERMINATED BY ',' |
| LOCATION '/sfmta/' |
| TBLPROPERTIES ('skip.header.line.count'='1');</code></pre> |
| </div> |
| </div> |
| </li> |
| <li> |
| <p>To validate that the data was actually loaded, run the following command:</p> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre class="highlight"><code class="language-sql" data-lang="sql">SELECT count(*) FROM sfmta_raw; |
| |
| +----------+ |
| | count(*) | |
| +----------+ |
| | 859086 | |
| +----------+</code></pre> |
| </div> |
| </div> |
| </li> |
| <li> |
| <p>Next we’ll create a Kudu table and load the data. Note that we convert |
| the string <code>report_time</code> field into a UNIX-style timestamp for more efficient |
| storage.</p> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre class="highlight"><code class="language-sql" data-lang="sql">CREATE TABLE sfmta |
| PRIMARY KEY (report_time, vehicle_tag) |
| PARTITION BY HASH(report_time) PARTITIONS 8 |
| STORED AS KUDU |
| AS SELECT |
| UNIX_TIMESTAMP(report_time, 'MM/dd/yyyy HH:mm:ss') AS report_time, |
| vehicle_tag, |
| longitude, |
| latitude, |
| speed, |
| heading |
| FROM sfmta_raw; |
| |
| +------------------------+ |
| | summary | |
| +------------------------+ |
| | Inserted 859086 row(s) | |
| +------------------------+ |
| Fetched 1 row(s) in 5.75s</code></pre> |
| </div> |
| </div> |
| <div class="paragraph"> |
| <p>The created table uses a composite primary key. See |
| <a href="kudu_impala_integration.html#kudu_impala">Kudu Impala Integration</a> for a more detailed |
| introduction to the extended SQL syntax for Impala.</p> |
| </div> |
| </li> |
| </ol> |
| </div> |
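| <div class="paragraph"> |
| <p>To see what the <code>tr -d '\r'</code> stage in step 1 does, the following sketch applies it to a tiny made-up CSV file. The file names and the sample header are assumptions for illustration only and do not reproduce the real dataset.</p> |
| </div> |

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)

# A made-up two-line CSV sample with DOS-style (CRLF) line endings.
printf 'revision,report_time\r\n1,01/01/2013 06:39:02\r\n' > "$workdir/sample_dos.csv"

# Delete every carriage return, as the upload pipeline does.
tr -d '\r' < "$workdir/sample_dos.csv" > "$workdir/sample_unix.csv"

# Compare sizes: the converted file is one byte smaller per line.
wc -c < "$workdir/sample_dos.csv"
wc -c < "$workdir/sample_unix.csv"
```

| <div class="paragraph"> |
| <p>Without this step, the carriage return would remain attached to the last field of every row when the file is read as comma-delimited text.</p> |
| </div> |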
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_read_and_modify_data"><a class="link" href="#_read_and_modify_data">Read and Modify Data</a></h2> |
| <div class="sectionbody"> |
| <div class="paragraph"> |
| <p>Now that the data is stored in Kudu, you can run queries against it. The following query |
| finds the data point containing the highest recorded vehicle speed.</p> |
| </div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre class="highlight"><code class="language-sql" data-lang="sql">SELECT * FROM sfmta ORDER BY speed DESC LIMIT 1; |
| |
| +-------------+-------------+--------------------+-------------------+-------------------+---------+ |
| | report_time | vehicle_tag | longitude | latitude | speed | heading | |
| +-------------+-------------+--------------------+-------------------+-------------------+---------+ |
| | 1357022342 | 5411 | -122.3968811035156 | 37.76665878295898 | 68.33300018310547 | 82 | |
| +-------------+-------------+--------------------+-------------------+-------------------+---------+</code></pre> |
| </div> |
| </div> |
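| <div class="paragraph"> |
| <p>Because <code>report_time</code> is now stored as seconds since the epoch, it can help to turn a value back into a readable date. The sketch below does this with GNU <code>date</code>, which is an assumption about the environment (BSD and macOS <code>date</code> use different options), and it assumes the value should be read as UTC.</p> |
| </div> |

```shell
# report_time value taken from the query result above.
report_time=1357022342

# GNU date (assumed): "-d @N" reads N as seconds since the epoch and
# "-u" prints the result in UTC, using the same layout as the source data.
date -u -d "@${report_time}" '+%m/%d/%Y %H:%M:%S'
# prints 01/01/2013 06:39:02
```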
| <div class="paragraph"> |
| <p>With a quick <a href="https://www.google.com/search?q=122.3968811035156W+37.76665878295898N">Google search</a> |
| we can see that this bus was traveling east on 16th Street at 68 mph. |
| At first glance, this seems unlikely to be true. Suppose that, after some research, |
| we find that this bus’s sensor equipment was broken and decide to |
| remove its data. With Kudu, this is easy to do using standard |
| SQL:</p> |
| </div> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre class="highlight"><code class="language-sql" data-lang="sql">DELETE FROM sfmta WHERE vehicle_tag = 5411; |
| |
| -- Modified 1169 row(s), 0 row error(s) in 0.25s</code></pre> |
| </div> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_next_steps"><a class="link" href="#_next_steps">Next steps</a></h2> |
| <div class="sectionbody"> |
| <div class="paragraph"> |
| <p>The above example showed how to load, query, and mutate a static dataset with Impala |
| and Kudu. The real power of Kudu, however, is the ability to ingest and mutate data |
| in a streaming fashion.</p> |
| </div> |
| <div class="paragraph"> |
| <p>As an exercise to learn the Kudu programmatic APIs, try implementing a program |
| that uses the <a href="http://www.nextbus.com/xmlFeedDocs/NextBusXMLFeed.pdf">SFMTA |
| XML data feed</a> to ingest this same dataset in real time into the Kudu table.</p> |
| </div> |
| <div class="sect2"> |
| <h3 id="trouble"><a class="link" href="#trouble">Troubleshooting</a></h3> |
| <div class="sect3"> |
| <h4 id="_problems_accessing_the_vm_via_ssh"><a class="link" href="#_problems_accessing_the_vm_via_ssh">Problems accessing the VM via SSH</a></h4> |
| <div class="ulist"> |
| <ul> |
| <li> |
| <p>Make sure the host has an SSH client installed.</p> |
| </li> |
| <li> |
| <p>Make sure the VM is running by executing the following command and checking for a VM called <code>kudu-demo</code>:</p> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre class="highlight"><code class="language-bash" data-lang="bash">$ VBoxManage list runningvms</code></pre> |
| </div> |
| </div> |
| </li> |
| <li> |
| <p>Verify that the VM’s IP address is included in the host’s <code>/etc/hosts</code> file. You should |
| see a line that includes an IP address followed by the hostname |
| <code>quickstart.cloudera</code>. To check the running VM’s IP address, use the <code>VBoxManage</code> |
| command below.</p> |
| <div class="listingblock"> |
| <div class="content"> |
| <pre class="highlight"><code class="language-bash" data-lang="bash">$ VBoxManage guestproperty get kudu-demo /VirtualBox/GuestInfo/Net/0/V4/IP |
| Value: 192.168.56.100</code></pre> |
| </div> |
| </div> |
| </li> |
| <li> |
| <p>If you’ve used a Cloudera Quickstart VM before, your <code>.ssh/known_hosts</code> file may |
| contain references to the previous VM’s SSH credentials. Remove any references to |
| <code>quickstart.cloudera</code> from this file.</p> |
| </li> |
| </ul> |
| </div> |
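| <div class="paragraph"> |
| <p>The <code>/etc/hosts</code> check described above can be scripted. The following is a minimal sketch; <code>lookup_host_ip</code> is a hypothetical helper name, and the script assumes the usual hosts file layout of an IP address followed by one or more hostnames.</p> |
| </div> |

```shell
# Hypothetical helper: print the IP address mapped to a hostname in a hosts file.
lookup_host_ip() {
  # $1 = hosts file, $2 = hostname; text after '#' is treated as a comment.
  awk -v host="$2" '
    { sub(/#.*/, "") }
    { for (i = 2; i <= NF; i++) if ($i == host) { print $1; exit } }
  ' "$1"
}

# Print the address recorded for the quickstart guest, if there is one.
lookup_host_ip /etc/hosts quickstart.cloudera
```

| <div class="paragraph"> |
| <p>The printed address should match the value reported by <code>VBoxManage guestproperty get</code>. For stale entries in <code>known_hosts</code>, <code>ssh-keygen -R quickstart.cloudera</code> removes the keys stored for that hostname.</p> |
| </div> |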
| </div> |
| <div class="sect3"> |
| <h4 id="_failing_with_lack_of_sse4_2_support_when_running_inside_virtualbox"><a class="link" href="#_failing_with_lack_of_sse4_2_support_when_running_inside_virtualbox">Failing with lack of SSE4.2 support when running inside VirtualBox</a></h4> |
| <div class="ulist"> |
| <ul> |
| <li> |
| <p>Running Kudu currently requires a CPU that supports SSE4.2 (Nehalem or later for Intel). To pass through SSE4.2 support into the guest VM, refer to the <a href="https://www.virtualbox.org/manual/ch09.html#sse412passthrough">VirtualBox documentation</a>.</p> |
| </li> |
| </ul> |
| </div> |
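| <div class="paragraph"> |
| <p>On a Linux host or guest, one way to check for SSE4.2 is to look for the <code>sse4_2</code> flag in <code>/proc/cpuinfo</code>. The sketch below assumes Linux; <code>has_sse42</code> is a hypothetical helper name, and other operating systems expose CPU flags differently.</p> |
| </div> |

```shell
# Hypothetical helper: succeed if a cpuinfo-style file lists the sse4_2 flag.
# Defaults to /proc/cpuinfo, which exists on Linux only.
has_sse42() {
  grep -qw 'sse4_2' "${1:-/proc/cpuinfo}" 2>/dev/null
}

if has_sse42; then
  echo "SSE4.2 is available"
else
  echo "SSE4.2 is not reported on this machine"
fi
```

| <div class="paragraph"> |
| <p>Running the check on the host and again inside the guest shows whether the flag is being passed through by VirtualBox.</p> |
| </div> |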
| </div> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_next_steps_2"><a class="link" href="#_next_steps_2">Next Steps</a></h2> |
| <div class="sectionbody"> |
| <div class="ulist"> |
| <ul> |
| <li> |
| <p><a href="installation.html">Installing Kudu</a></p> |
| </li> |
| <li> |
| <p><a href="configuration.html">Configuring Kudu</a></p> |
| </li> |
| </ul> |
| </div> |
| </div> |
| </div> |
| </div> |
| <div class="col-md-3"> |
| |
| <div id="toc" data-spy="affix" data-offset-top="70"> |
| <ul> |
| |
| <li> |
| |
| <a href="index.html">Introducing Kudu</a> |
| </li> |
| <li> |
| |
| <a href="release_notes.html">Kudu Release Notes</a> |
| </li> |
| <li> |
| <span class="active-toc">Getting Started with Kudu</span> |
| <ul class="sectlevel1"> |
| <li><a href="#quickstart_vm">Get The Kudu Quickstart VM</a> |
| <ul class="sectlevel2"> |
| <li><a href="#_prerequisites">Prerequisites</a></li> |
| <li><a href="#_installation">Installation</a></li> |
| </ul> |
| </li> |
| <li><a href="#_load_data">Load Data</a></li> |
| <li><a href="#_read_and_modify_data">Read and Modify Data</a></li> |
| <li><a href="#_next_steps">Next steps</a> |
| <ul class="sectlevel2"> |
| <li><a href="#trouble">Troubleshooting</a></li> |
| </ul> |
| </li> |
| <li><a href="#_next_steps_2">Next Steps</a></li> |
| </ul> |
| </li> |
| <li> |
| |
| <a href="installation.html">Installation Guide</a> |
| </li> |
| <li> |
| |
| <a href="configuration.html">Configuring Kudu</a> |
| </li> |
| <li> |
| |
| <a href="kudu_impala_integration.html">Using Impala with Kudu</a> |
| </li> |
| <li> |
| |
| <a href="administration.html">Administering Kudu</a> |
| </li> |
| <li> |
| |
| <a href="troubleshooting.html">Troubleshooting Kudu</a> |
| </li> |
| <li> |
| |
| <a href="developing.html">Developing Applications with Kudu</a> |
| </li> |
| <li> |
| |
| <a href="schema_design.html">Kudu Schema Design</a> |
| </li> |
| <li> |
| |
| <a href="security.html">Kudu Security</a> |
| </li> |
| <li> |
| |
| <a href="transaction_semantics.html">Kudu Transaction Semantics</a> |
| </li> |
| <li> |
| |
| <a href="background_tasks.html">Background Maintenance Tasks</a> |
| </li> |
| <li> |
| |
| <a href="configuration_reference.html">Kudu Configuration Reference</a> |
| </li> |
| <li> |
| |
| <a href="command_line_tools_reference.html">Kudu Command Line Tools Reference</a> |
| </li> |
| <li> |
| |
| <a href="known_issues.html">Known Issues and Limitations</a> |
| </li> |
| <li> |
| |
| <a href="contributing.html">Contributing to Kudu</a> |
| </li> |
| <li> |
| |
| <a href="export_control.html">Export Control Notice</a> |
| </li> |
| </ul> |
| </div> |
| </div> |
| </div> |
| </div> |
| |
| |
| <div id="footnotes"> |
| <hr> |
| <div class="footnote" id="_footnote_1"> |
| <a href="#_footnoteref_1">1</a>. In addition, the script will create a host-only network between host and guest and set up an entry in the <code>/etc/hosts</code> file with the name <code>quickstart.cloudera</code> and the guest’s IP address. |
| </div> |
| </div> |