| // Licensed to the Apache Software Foundation (ASF) under one |
| // or more contributor license agreements. See the NOTICE file |
| // distributed with this work for additional information |
| // regarding copyright ownership. The ASF licenses this file |
| // to you under the Apache License, Version 2.0 (the |
| // "License"); you may not use this file except in compliance |
| // with the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, |
| // software distributed under the License is distributed on an |
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| // KIND, either express or implied. See the License for the |
| // specific language governing permissions and limitations |
| // under the License. |
| |
| [[quickstart]] |
| = Apache Kudu Quickstart |
| :author: Kudu Team |
| :imagesdir: ./images |
| :icons: font |
| :toc: left |
| :toclevels: 2 |
| :doctype: book |
| :backend: html5 |
| :sectlinks: |
| :experimental: |
| |
| Follow these instructions to set up and run the Kudu VM, and start with Kudu, Kudu_Impala, |
| and CDH in minutes. |
| |
| |
| [[quickstart_vm]] |
| == Get The Kudu Quickstart VM |
| |
| === Prerequisites |
| |
| 1. Install https://www.virtualbox.org/[Oracle Virtualbox]. The VM has been tested to work |
| with VirtualBox version 4.3 on Ubuntu 14.04 and VirtualBox version 5 on OSX |
| 10.9. VirtualBox is also included in most package managers: apt-get, brew, etc. |
| |
| 2. After the installation, make sure that `VBoxManage` is in your `PATH` by using the |
| `which VBoxManage` command. |
| |
| === Installation |
| |
| To download and start the VM, execute the following command in a terminal window. |
| |
| [source,bash] |
| ---- |
| $ curl -s https://raw.githubusercontent.com/cloudera/kudu-examples/master/demo-vm-setup/bootstrap.sh | bash |
| ---- |
| |
| This command downloads a shell script which clones the `kudu-examples` Git repository and |
| then downloads a VM image of about 1.2GB size into the current working |
| directory.footnote:[In addition, the script will create a host-only network between host |
| and guest and setup an entry in the `/etc/hosts` file with the name `quickstart.cloudera` |
| and the guest's IP address.] You can examine the script after downloading it by removing |
| the `| bash` component of the command above. Once the setup is complete, you can verify |
| that everything works by connecting to the guest via SSH: |
| |
| [source,bash] |
| ---- |
| $ ssh demo@quickstart.cloudera |
| ---- |
| |
| The username and password for the demo account are both `demo`. In addition, the `demo` |
| user has password-less `sudo` privileges so that you can install additional software or |
| manage the guest OS. You can also access the `kudu-examples` as a shared folder in |
| `/home/demo/kudu-examples/` on the guest or from your VirtualBox shared folder location on |
| the host. This is a quick way to make scripts or data visible to the guest. |
| |
| You can quickly verify if Kudu and Impala are running by executing the following commands: |
| |
| [source,bash] |
| ---- |
| $ ps aux | grep kudu |
| $ ps aux | grep impalad |
| ---- |
| |
| If you have issues connecting to the VM or one of the processes is not running, make sure |
| to consult the <<trouble, Troubleshooting>> section. |
| |
| == Load Data |
| |
| To practice some typical operations with Kudu and Impala, we'll use the |
| https://data.sfgov.org/Transportation/Raw-AVL-GPS-data/5fk7-ivit/data[San Francisco MTA |
| GPS dataset]. This dataset contains raw location data transmitted periodically from |
| sensors installed on the buses in the SF MTA's fleet. |
| |
| 1. Download the sample data and load it into HDFS |
| + |
| First we'll download the sample dataset, prepare it, and upload it into the HDFS |
| cluster. |
| + |
| The SF MTA's site is often a bit slow, so we've mirrored a sample CSV file from the |
| dataset at http://kudu-sample-data.s3.amazonaws.com/sfmtaAVLRawData01012013.csv.gz |
| + |
| The original dataset uses DOS-style line endings, so we'll convert it to |
| UNIX-style during the upload process using `tr`. |
| + |
| [source,bash] |
| ---- |
| $ wget http://kudu-sample-data.s3.amazonaws.com/sfmtaAVLRawData01012013.csv.gz |
| $ hdfs dfs -mkdir /sfmta |
| $ zcat sfmtaAVLRawData01012013.csv.gz | tr -d '\r' | hadoop fs -put - /sfmta/data.csv |
| ---- |
| |
| 2. Create a new external Impala table to access the plain text data. To connect to Impala |
| in the virtual machine issue the following command: |
| + |
| [source,bash] |
| ---- |
| ssh demo@quickstart.cloudera -t impala-shell |
| ---- |
| + |
| Now, you can execute the following commands: |
| + |
| [source,sql] |
| ---- |
| CREATE EXTERNAL TABLE sfmta_raw ( |
| revision int, |
| report_time string, |
| vehicle_tag int, |
| longitude float, |
| latitude float, |
| speed float, |
| heading float |
| ) |
| ROW FORMAT DELIMITED |
| FIELDS TERMINATED BY ',' |
| LOCATION '/sfmta/' |
| TBLPROPERTIES ('skip.header.line.count'='1'); |
| ---- |
| + |
| 3. Validate if the data was actually loaded run the following command: |
| + |
| [source,sql] |
| ---- |
| SELECT count(*) FROM sfmta_raw; |
| |
| +----------+ |
| | count(*) | |
| +----------+ |
| | 859086 | |
| +----------+ |
| ---- |
| + |
| 4. Next we'll create a Kudu table and load the data. Note that we convert |
| the string `report_time` field into a unix-style timestamp for more efficient |
| storage. |
| + |
| [source,sql] |
| ---- |
| CREATE TABLE sfmta |
| PRIMARY KEY (report_time, vehicle_tag) |
| PARTITION BY HASH(report_time) PARTITIONS 8 |
| STORED AS KUDU |
| AS SELECT |
| UNIX_TIMESTAMP(report_time, 'MM/dd/yyyy HH:mm:ss') AS report_time, |
| vehicle_tag, |
| longitude, |
| latitude, |
| speed, |
| heading |
| FROM sfmta_raw; |
| |
| +------------------------+ |
| | summary | |
| +------------------------+ |
| | Inserted 859086 row(s) | |
| +------------------------+ |
| Fetched 1 row(s) in 5.75s |
| ---- |
| + |
| The created table uses a composite primary key. See |
| <<kudu_impala_integration.adoc#kudu_impala,Kudu Impala Integration>> for a more detailed |
| introduction to the extended SQL syntax for Impala. |
| |
| == Read and Modify Data |
| |
| Now that the data is stored in Kudu, you can run queries against it. The following query |
| finds the data point containing the highest recorded vehicle speed. |
| |
| [source,sql] |
| ---- |
| SELECT * FROM sfmta ORDER BY speed DESC LIMIT 1; |
| |
| +-------------+-------------+--------------------+-------------------+-------------------+---------+ |
| | report_time | vehicle_tag | longitude | latitude | speed | heading | |
| +-------------+-------------+--------------------+-------------------+-------------------+---------+ |
| | 1357022342 | 5411 | -122.3968811035156 | 37.76665878295898 | 68.33300018310547 | 82 | |
| +-------------+-------------+--------------------+-------------------+-------------------+---------+ |
| ---- |
| |
| With a quick link:https://www.google.com/search?q=122.3968811035156W+37.76665878295898N[Google search] |
| we can see that this bus was traveling east on 16th street at 68MPH. |
| At first glance, this seems unlikely to be true. Perhaps we do some research |
| and find that this bus's sensor equipment was broken and we decide to |
| remove the data. With Kudu this is very easy to correct using standard |
| SQL: |
| |
| [source,sql] |
| ---- |
| DELETE FROM sfmta WHERE vehicle_tag = '5411'; |
| |
| -- Modified 1169 row(s), 0 row error(s) in 0.25s |
| ---- |
| |
| == Next steps |
| |
| The above example showed how to load, query, and mutate a static dataset with Impala |
| and Kudu. The real power of Kudu, however, is the ability to ingest and mutate data |
| in a streaming fashion. |
| |
| As an exercise to learn the Kudu programmatic APIs, try implementing a program |
| that uses the link:http://www.nextbus.com/xmlFeedDocs/NextBusXMLFeed.pdf[SFMTA |
| XML data feed] to ingest this same dataset in real time into the Kudu table. |
| |
| [[trouble]] |
| === Troubleshooting |
| |
| ==== Problems accessing the VM via SSH |
| |
| * Make sure the host has a SSH client installed. |
| * Make sure the VM is running, by running the following command and checking for a VM called `kudu-demo`: |
| + |
| [source,bash] |
| ---- |
| $ VBoxManage list runningvms |
| ---- |
| * Verify that the VM's IP address is included in the host's `/etc/hosts` file. You should |
| see a line that includes an IP address followed by the hostname |
| `quickstart.cloudera`. To check the running VM's IP address, use the `VBoxManage` |
| command below. |
| + |
| [source,bash] |
| ---- |
| $ VBoxManage guestproperty get kudu-demo /VirtualBox/GuestInfo/Net/0/V4/IP |
| Value: 192.168.56.100 |
| ---- |
| * If you've used a Cloudera Quickstart VM before, your `.ssh/known_hosts` file may |
| contain references to the previous VM's SSH credentials. Remove any references to |
| `quickstart.cloudera` from this file. |
| |
| ==== Failing with lack of SSE4.2 support when running inside VirtualBox |
| |
| * Running Kudu currently requires a CPU that supports SSE4.2 (Nehalem or later for Intel). To pass through SSE4.2 support into the guest VM, refer to the link:https://www.virtualbox.org/manual/ch09.html#sse412passthrough[VirtualBox documentation] |
| |
| == Next Steps |
| - link:installation.html[Installing Kudu] |
| - link:configuration.html[Configuring Kudu] |