---
title: Machine Learning Analytics with Zeppelin
---
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
[Apache Zeppelin](http://zeppelin-project.org/) is an interactive computational
environment built on Apache Spark, similar to the IPython Notebook. With [Apache
PredictionIO](http://predictionio.apache.org) and [Spark
SQL](https://spark.apache.org/sql/), you can easily analyze your collected
events while you are developing or tuning your engine.
## Prerequisites
The following instructions assume that the `sbt` command is available on your
shell's search path. Alternatively, you can use the `sbt` command that comes
with Apache PredictionIO at `$PIO_HOME/sbt/sbt`.
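
If you are unsure which of the two applies to your setup, the check can be scripted. The following is a minimal sketch, not part of Apache PredictionIO: `find_sbt` is a hypothetical helper name, and it assumes `PIO_HOME` is set whenever the bundled copy is meant to be used.

```shell
# Hypothetical helper: print the sbt executable to use.
# Prefers an sbt found on the search path, otherwise falls back to the
# copy bundled with Apache PredictionIO at $PIO_HOME/sbt/sbt.
find_sbt() {
  if command -v sbt >/dev/null 2>&1; then
    # An sbt is reachable through PATH; print its location.
    command -v sbt
  elif [ -n "${PIO_HOME:-}" ] && [ -x "$PIO_HOME/sbt/sbt" ]; then
    # No sbt on PATH, but the bundled copy exists and is executable.
    printf '%s\n' "$PIO_HOME/sbt/sbt"
  else
    echo "sbt not found on PATH or at \$PIO_HOME/sbt/sbt" >&2
    return 1
  fi
}
```

For example, `SBT="$(find_sbt)"` stores the resolved command so that later steps can invoke `"$SBT"` regardless of where it was found.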
<%= partial 'shared/datacollection/parquet' %>
## Building Zeppelin for Apache Spark 1.2+
Start by cloning Zeppelin.
```
$ git clone https://github.com/apache/zeppelin.git
```
Build Zeppelin with the Hadoop 2.4 and Spark 1.2 profiles.
```
$ cd zeppelin
$ mvn clean package -Pspark-1.2 -Dhadoop.version=2.4.0 -Phadoop-2.4 -DskipTests
```
Now you should have working Zeppelin binaries.
## Preparing Zeppelin
First, start Zeppelin.
```
$ bin/zeppelin-daemon.sh start
```
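
The daemon may need a few moments before its web interface responds. Below is a minimal sketch for waiting on it, assuming `curl` is installed and that Zeppelin listens on its default address, `http://localhost:8080`. `wait_for_url` is a hypothetical helper, not something shipped with Zeppelin.

```shell
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# Usage: wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_url() {
  url=$1
  attempts=${2:-30}   # number of tries before giving up
  delay=${3:-1}       # seconds to sleep after each failed try
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: treat HTTP errors as failure, -s: silent, --max-time: per-try limit
    if curl -fs -o /dev/null --max-time 2 "$url" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

A call such as `wait_for_url http://localhost:8080 && echo "Zeppelin is up"` returns once the page is reachable, or fails after roughly 30 attempts with the defaults above.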
By default, you should be able to access Zeppelin in a web browser at
http://localhost:8080. Create a new notebook and put the following in the first
cell.
```scala
sqlc.parquetFile("/tmp/movies").registerTempTable("events")
```
![Preparing Zeppelin](/images/datacollection/zeppelin-01.png)
## Performing Analysis with Zeppelin
If all steps above ran successfully, you should have a ready-to-use analytics
environment by now. Let's try a few examples to see if everything is functional.
In the second cell, put in this piece of code and run it.
```
%sql
SELECT entityType, event, targetEntityType, COUNT(*) AS c FROM events
GROUP BY entityType, event, targetEntityType
```
![Summary of Events](/images/datacollection/zeppelin-02.png)
We can also easily plot the number of events of each type as a pie chart.
```
%sql
SELECT event, COUNT(*) AS c FROM events GROUP BY event
```
![Summary of Event in Pie Chart](/images/datacollection/zeppelin-03.png)
We can also see a breakdown of rating values.
```
%sql
SELECT properties.rating AS r, COUNT(*) AS c FROM events
WHERE properties.rating IS NOT NULL GROUP BY properties.rating ORDER BY r
```
![Breakdown of Rating Values](/images/datacollection/zeppelin-04.png)
Happy analyzing!