blob: 3aa8e18697555a9d0f9d78b0ec13b2953a8241c2 [file] [log] [blame]
---
title: Machine Learning Analytics with Zeppelin
---
[Apache Zeppelin](http://zeppelin-project.org/) is an interactive computational
environment built on Apache Spark like the IPython Notebook. With
[PredictionIO](https://prediction.io) and [Spark
SQL](https://spark.apache.org/sql/), you can easily analyze your collected
events when you are developing or tuning your engine.
## Prerequisites
The following instructions assume that you have the command `sbt` accessible in
your shell's search path. Alternatively, you can use the `sbt` command that
comes with PredictionIO at `$PIO_HOME/sbt/sbt`.
<%= partial 'shared/datacollection/parquet' %>
## Building Zeppelin for Apache Spark 1.2+
Start by cloning Zeppelin.
```
$ git clone https://github.com/NFLabs/zeppelin.git
```
Build Zeppelin with Hadoop 2.4 and Spark 1.2 profiles.
```
$ cd zeppelin
$ mvn clean package -Pspark-1.2 -Dhadoop.version=2.4.0 -Phadoop-2.4 -DskipTests
```
Now you should have working Zeppelin binaries.
## Preparing Zeppelin
First, start Zeppelin.
```
$ bin/zeppelin-daemon.sh start
```
By default, you should be able to access Zeppelin via web browser at
http://localhost:8080. Create a new notebook and put the following in the first
cell.
```scala
sqlc.parquetFile("/tmp/movies").registerTempTable("events")
```
![Preparing Zeppelin](/images/datacollection/zeppelin-01.png)
## Performing Analysis with Zeppelin
If all steps above ran successfully, you should have a ready-to-use analytics
environment by now. Let's try a few examples to see if everything is functional.
In the second cell, put in this piece of code and run it.
```
%sql
SELECT entityType, event, targetEntityType, COUNT(*) AS c FROM events
GROUP BY entityType, event, targetEntityType
```
![Summary of Events](/images/datacollection/zeppelin-02.png)
We can also easily plot a pie chart.
```
%sql
SELECT event, COUNT(*) AS c FROM events GROUP BY event
```
![Summary of Event in Pie Chart](/images/datacollection/zeppelin-03.png)
And see a breakdown of rating values.
```
%sql
SELECT properties.rating AS r, COUNT(*) AS c FROM events
WHERE properties.rating IS NOT NULL GROUP BY properties.rating ORDER BY r
```
![Breakdown of Rating Values](/images/datacollection/zeppelin-04.png)
Happy analyzing!