docs/manual/source/datacollection/analytics-zeppelin.html.md.erb - predictionio - Git at Google

 ---
 title: Machine Learning Analytics with Zeppelin
 ---

 <!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 -->

 [Apache Zeppelin](http://zeppelin-project.org/) is an interactive computational
 environment built on Apache Spark like the IPython Notebook. With [Apache
 PredictionIO](http://predictionio.apache.org) and [Spark
 SQL](https://spark.apache.org/sql/), you can easily analyze your collected
 events when you are developing or tuning your engine.

 ## Prerequisites

 The following instructions assume that you have the command `sbt` accessible in
 your shell's search path. Alternatively, you can use the `sbt` command that
 comes with Apache PredictionIO at `$PIO_HOME/sbt/sbt`.

 <%= partial 'shared/datacollection/parquet' %>

 ## Building Zeppelin for Apache Spark 1.2+

 Start by cloning Zeppelin.

 ```
 $ git clone https://github.com/apache/zeppelin.git
 ```

 Build Zeppelin with Hadoop 2.4 and Spark 1.2 profiles.

 ```
 $ cd zeppelin
 $ mvn clean package -Pspark-1.2 -Dhadoop.version=2.4.0 -Phadoop-2.4 -DskipTests
 ```

 Now you should have working Zeppelin binaries.

 ## Preparing Zeppelin

 First, start Zeppelin.

 ```
 $ bin/zeppelin-daemon.sh start
 ```

 By default, you should be able to access Zeppelin via web browser at
 http://localhost:8080. Create a new notebook and put the following in the first
 cell.

 ```scala
 sqlc.parquetFile("/tmp/movies").registerTempTable("events")
 ```

 ![Preparing Zeppelin](/images/datacollection/zeppelin-01.png)

 ## Performing Analysis with Zeppelin

 If all steps above ran successfully, you should have a ready-to-use analytics
 environment by now. Let's try a few examples to see if everything is functional.

 In the second cell, put in this piece of code and run it.

 ```
 %sql
 SELECT entityType, event, targetEntityType, COUNT(*) AS c FROM events
 GROUP BY entityType, event, targetEntityType
 ```

 ![Summary of Events](/images/datacollection/zeppelin-02.png)

 We can also easily plot a pie chart.

 ```
 %sql
 SELECT event, COUNT(*) AS c FROM events GROUP BY event
 ```

 ![Summary of Event in Pie Chart](/images/datacollection/zeppelin-03.png)

 And see a breakdown of rating values.

 ```
 %sql
 SELECT properties.rating AS r, COUNT(*) AS c FROM events
 WHERE properties.rating IS NOT NULL GROUP BY properties.rating ORDER BY r
 ```

 ![Breakdown of Rating Values](/images/datacollection/zeppelin-04.png)

 Happy analyzing!
	---
	title: Machine Learning Analytics with Zeppelin
	---

	<!--
	Licensed to the Apache Software Foundation (ASF) under one or more
	contributor license agreements. See the NOTICE file distributed with
	this work for additional information regarding copyright ownership.
	The ASF licenses this file to You under the Apache License, Version 2.0
	(the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	-->

	[Apache Zeppelin](http://zeppelin-project.org/) is an interactive computational
	environment built on Apache Spark like the IPython Notebook. With [Apache
	PredictionIO](http://predictionio.apache.org) and [Spark
	SQL](https://spark.apache.org/sql/), you can easily analyze your collected
	events when you are developing or tuning your engine.

	## Prerequisites

	The following instructions assume that you have the command `sbt` accessible in
	your shell's search path. Alternatively, you can use the `sbt` command that
	comes with Apache PredictionIO at `$PIO_HOME/sbt/sbt`.

	<%= partial 'shared/datacollection/parquet' %>

	## Building Zeppelin for Apache Spark 1.2+

	Start by cloning Zeppelin.

	```
	$ git clone https://github.com/apache/zeppelin.git
	```

	Build Zeppelin with Hadoop 2.4 and Spark 1.2 profiles.

	```
	$ cd zeppelin
	$ mvn clean package -Pspark-1.2 -Dhadoop.version=2.4.0 -Phadoop-2.4 -DskipTests
	```

	Now you should have working Zeppelin binaries.

	## Preparing Zeppelin

	First, start Zeppelin.

	```
	$ bin/zeppelin-daemon.sh start
	```

	By default, you should be able to access Zeppelin via web browser at
	http://localhost:8080. Create a new notebook and put the following in the first
	cell.

	```scala
	sqlc.parquetFile("/tmp/movies").registerTempTable("events")
	```

	![Preparing Zeppelin](/images/datacollection/zeppelin-01.png)

	## Performing Analysis with Zeppelin

	If all steps above ran successfully, you should have a ready-to-use analytics
	environment by now. Let's try a few examples to see if everything is functional.

	In the second cell, put in this piece of code and run it.

	```
	%sql
	SELECT entityType, event, targetEntityType, COUNT(*) AS c FROM events
	GROUP BY entityType, event, targetEntityType
	```

	![Summary of Events](/images/datacollection/zeppelin-02.png)

	We can also easily plot a pie chart.

	```
	%sql
	SELECT event, COUNT(*) AS c FROM events GROUP BY event
	```

	![Summary of Event in Pie Chart](/images/datacollection/zeppelin-03.png)

	And see a breakdown of rating values.

	```
	%sql
	SELECT properties.rating AS r, COUNT(*) AS c FROM events
	WHERE properties.rating IS NOT NULL GROUP BY properties.rating ORDER BY r
	```

	![Breakdown of Rating Values](/images/datacollection/zeppelin-04.png)

	Happy analyzing!