---
title: Machine Learning Analytics with Tableau
---
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
With Spark SQL, it is possible to connect Tableau to Apache PredictionIO
Event Server for interactive analysis of event data.
## Prerequisites
- Tableau Desktop 8.3+ with a proper license key that supports Spark SQL;
- Spark ODBC Driver from Databricks
  (https://databricks.com/spark/odbc-driver-download);
- Apache Hadoop 2.4+;
- Apache Hive 0.13.1+.
INFO: In this article, we will assume that you have a working HDFS, and that
the environment variable `HADOOP_HOME` has been properly set. This is
essential for Apache Hive to function properly. In addition, `HADOOP_CONF_DIR`
in `$PIO_HOME/conf/pio-env.sh` must also be properly set for the `pio export`
command to write to HDFS instead of the local filesystem.
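As a sketch, the relevant entries in `$PIO_HOME/conf/pio-env.sh` might look like
the following; the paths shown are assumptions, so substitute your own
installation directories.

```
# In $PIO_HOME/conf/pio-env.sh -- example values only; adjust to your setup.
# HADOOP_CONF_DIR points `pio export` at your Hadoop client configuration so
# that output goes to HDFS instead of the local filesystem.
HADOOP_HOME=/usr/local/hadoop
HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
```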
<%= partial 'shared/datacollection/parquet' %>
## Creating Hive Tables
Before you can use Spark SQL's Thrift JDBC/ODBC Server, you will need to create
the table schema in Hive first. Make sure to replace `path_of_hive` below with
the actual path of your Apache Hive installation.
```
$ cd path_of_hive
$ bin/hive
hive> CREATE EXTERNAL TABLE events (
    >   event STRING,
    >   entityType STRING,
    >   entityId STRING,
    >   targetEntityType STRING,
    >   targetEntityId STRING,
    >   properties STRUCT<rating:DOUBLE>)
    > STORED AS parquet LOCATION '/tmp/movies';
hive> exit;
```
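To sanity-check that the external table is correctly wired to the exported
Parquet files (assuming the export above wrote to `/tmp/movies`), you can query
it from the same Hive shell before moving on.

```
hive> SELECT COUNT(*) FROM events;
```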
## Launch Spark SQL's Thrift JDBC/ODBC Server
Once you have created your Hive tables, create a Hive configuration in your
Spark installation. If you have a custom `hive-site.xml`, simply copy or link
it to `$SPARK_HOME/conf`. Otherwise, Hive will have created a local Derby
database, and you will need to let Spark know about it. Create
`$SPARK_HOME/conf/hive-site.xml` from scratch with the following template.
WARNING: You must change `/opt/apache-hive-0.13.1-bin` below to a real Hive
path.
```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/opt/apache-hive-0.13.1-bin/metastore_db;create=true</value>
</property>
</configuration>
```
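The `databaseName` above must point to where Derby actually created the
metastore. With Hive's default configuration, Derby writes `metastore_db` into
the working directory of the Hive CLI; since we launched `bin/hive` from the
Hive root earlier, it should sit directly under your Hive installation. A quick
way to check, reusing the `path_of_hive` placeholder from above:

```
$ ls path_of_hive/metastore_db
```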
Launch Spark SQL's Thrift JDBC/ODBC Server by running
```
$ $SPARK_HOME/sbin/start-thriftserver.sh
```
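By default, the server listens on port 10000 of localhost. If you need a
different host or port, the script accepts `--hiveconf` overrides (shown below
with illustrative values) in addition to the usual `spark-submit` options.

```
$ $SPARK_HOME/sbin/start-thriftserver.sh \
  --hiveconf hive.server2.thrift.port=10000 \
  --hiveconf hive.server2.thrift.bind.host=0.0.0.0
```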
You can test the server using the included Beeline client.
```
$ $SPARK_HOME/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000
(Use empty username and password when prompted)
0: jdbc:hive2://localhost:10000> select * from events limit 10;
+--------+-------------+-----------+-------------------+-----------------+------------------+
| event | entitytype | entityid | targetentitytype | targetentityid | properties |
+--------+-------------+-----------+-------------------+-----------------+------------------+
| buy | user | 3 | item | 0 | {"rating":null} |
| buy | user | 3 | item | 1 | {"rating":null} |
| rate | user | 3 | item | 2 | {"rating":1.0} |
| buy | user | 3 | item | 7 | {"rating":null} |
| buy | user | 3 | item | 8 | {"rating":null} |
| buy | user | 3 | item | 9 | {"rating":null} |
| rate | user | 3 | item | 14 | {"rating":1.0} |
| buy | user | 3 | item | 15 | {"rating":null} |
| buy | user | 3 | item | 16 | {"rating":null} |
| buy | user | 3 | item | 18 | {"rating":null} |
+--------+-------------+-----------+-------------------+-----------------+------------------+
10 rows selected (0.515 seconds)
0: jdbc:hive2://localhost:10000>
```
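Before switching to Tableau, you can preview the kind of aggregation you will
later build there. For example, a breakdown of rating values (an illustrative
query, run straight from the same Beeline session):

```
0: jdbc:hive2://localhost:10000> SELECT properties.rating, COUNT(*) FROM events GROUP BY properties.rating;
```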
Now you are ready to use Tableau!
## Performing Analysis with Tableau
Launch Tableau and Connect to Data. Click on **Spark SQL (Beta)** and enter
Spark SQL's Thrift JDBC/ODBC Server information. Make sure to pick **User Name**
as **Authentication**. Click **Connect**.
![Tableau and Spark SQL](/images/datacollection/tableau-01.png)
On the next page, pick **default** under **Schema**.
INFO: You may not see any choices when you click on Schema. Simply press Enter
and Tableau will try to list all schemas.
Once you see a list of tables that includes **events**, click **New Custom
SQL**, then enter the following.
```sql
SELECT event, entityType, entityId, targetEntityType, targetEntityId, properties.rating FROM events
```
Click **Update Now**. You should now see the following screen, indicating that
the data has loaded successfully. Using custom SQL allows you to extract
arbitrary fields from within `properties`.
![Setting up Tableau](/images/datacollection/tableau-02.png)
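If you only want explicit ratings in your worksheet, a variation of the custom
SQL (an illustrative sketch, not required by this tutorial) filters out events
that carry no rating:

```sql
SELECT entityId, targetEntityId, properties.rating AS rating
FROM events
WHERE event = 'rate'
```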
Click **Go to Worksheet** and start analyzing. The following shows an example of
breaking down different rating values.
![Rating Values Breakdown](/images/datacollection/tableau-03.png)
The following shows a summary of interactions.
![Interactions](/images/datacollection/tableau-04.png)
Happy analyzing!