| --- |
| title: Machine Learning Analytics with Zeppelin |
| --- |
| |
| [Apache Zeppelin](http://zeppelin-project.org/) is an interactive computational |
| environment built on Apache Spark like the IPython Notebook. With |
| [PredictionIO](https://prediction.io) and [Spark |
| SQL](https://spark.apache.org/sql/), you can easily analyze your collected |
| events when you are developing or tuning your engine. |
| |
| ## Prerequisites |
| |
| The following instructions assume that you have the command `sbt` accessible in |
| your shell's search path. Alternatively, you can use the `sbt` command that |
| comes with PredictionIO at `$PIO_HOME/sbt/sbt`. |
| |
| <%= partial 'shared/datacollection/parquet' %> |
| |
| ## Building Zeppelin for Apache Spark 1.2+ |
| |
| Start by cloning Zeppelin. |
| |
| ``` |
| $ git clone https://github.com/NFLabs/zeppelin.git |
| ``` |
| |
| Build Zeppelin with Hadoop 2.4 and Spark 1.2 profiles. |
| |
| ``` |
| $ cd zeppelin |
| $ mvn clean package -Pspark-1.2 -Dhadoop.version=2.4.0 -Phadoop-2.4 -DskipTests |
| ``` |
| |
| Now you should have working Zeppelin binaries. |
| |
| ## Preparing Zeppelin |
| |
| First, start Zeppelin. |
| |
| ``` |
| $ bin/zeppelin-daemon.sh start |
| ``` |
| |
| By default, you should be able to access Zeppelin via web browser at |
| http://localhost:8080. Create a new notebook and put the following in the first |
| cell. |
| |
| ```scala |
| sqlc.parquetFile("/tmp/movies").registerTempTable("events") |
| ``` |
| |
|  |
| |
| ## Performing Analysis with Zeppelin |
| |
| If all steps above ran successfully, you should have a ready-to-use analytics |
| environment by now. Let's try a few examples to see if everything is functional. |
| |
| In the second cell, put in this piece of code and run it. |
| |
| ``` |
| %sql |
| SELECT entityType, event, targetEntityType, COUNT(*) AS c FROM events |
| GROUP BY entityType, event, targetEntityType |
| ``` |
| |
|  |
| |
| We can also easily plot a pie chart. |
| |
| ``` |
| %sql |
| SELECT event, COUNT(*) AS c FROM events GROUP BY event |
| ``` |
| |
|  |
| |
| And see a breakdown of rating values. |
| |
| ``` |
| %sql |
| SELECT properties.rating AS r, COUNT(*) AS c FROM events |
| WHERE properties.rating IS NOT NULL GROUP BY properties.rating ORDER BY r |
| ``` |
| |
|  |
| |
| Happy analyzing! |