| --- |
| title: Machine Learning Analytics with Zeppelin |
| --- |
| |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
[Apache Zeppelin](http://zeppelin-project.org/) is an interactive, web-based
notebook environment for Apache Spark, similar to the IPython Notebook. With
[Apache PredictionIO](http://predictionio.apache.org) and [Spark
SQL](https://spark.apache.org/sql/), you can easily analyze your collected
events while you are developing or tuning your engine.
| |
| ## Prerequisites |
| |
| The following instructions assume that you have the command `sbt` accessible in |
| your shell's search path. Alternatively, you can use the `sbt` command that |
| comes with Apache PredictionIO at `$PIO_HOME/sbt/sbt`. |
| |
| <%= partial 'shared/datacollection/parquet' %> |
| |
| ## Building Zeppelin for Apache Spark 1.2+ |
| |
| Start by cloning Zeppelin. |
| |
| ``` |
| $ git clone https://github.com/apache/zeppelin.git |
| ``` |
| |
Build Zeppelin with the Hadoop 2.4 and Spark 1.2 profiles.
| |
| ``` |
| $ cd zeppelin |
| $ mvn clean package -Pspark-1.2 -Dhadoop.version=2.4.0 -Phadoop-2.4 -DskipTests |
| ``` |
| |
| Now you should have working Zeppelin binaries. |
| |
| ## Preparing Zeppelin |
| |
| First, start Zeppelin. |
| |
| ``` |
| $ bin/zeppelin-daemon.sh start |
| ``` |
| |
By default, you should be able to access Zeppelin with a web browser at
http://localhost:8080. Create a new notebook and put the following in the first
cell.
| |
| ```scala |
| sqlc.parquetFile("/tmp/movies").registerTempTable("events") |
| ``` |
| |
|  |
| |
| ## Performing Analysis with Zeppelin |
| |
If all the steps above ran successfully, you should now have a ready-to-use
analytics environment. Let's try a few examples to see if everything is
functional.
| |
Put the following piece of code in the second cell and run it.
| |
| ``` |
| %sql |
| SELECT entityType, event, targetEntityType, COUNT(*) AS c FROM events |
| GROUP BY entityType, event, targetEntityType |
| ``` |
| |
|  |
| |
| We can also easily plot a pie chart. |
| |
| ``` |
| %sql |
| SELECT event, COUNT(*) AS c FROM events GROUP BY event |
| ``` |
| |
|  |
| |
We can also see a breakdown of rating values.
| |
| ``` |
| %sql |
| SELECT properties.rating AS r, COUNT(*) AS c FROM events |
| WHERE properties.rating IS NOT NULL GROUP BY properties.rating ORDER BY r |
| ``` |
| |
|  |
| |
| Happy analyzing! |