Apache Hivemall provides a batch learning scheme that builds prediction models on Apache Hive. The learning process itself is a batch process; however, an online/real-time prediction can be achieved by carrying a prediction on a transactional relational DBMS.

In this article, we explain how to run a real-time prediction using a relational DBMS. We assume that you have already run the a9a binary classification task.

Prerequisites

  • MySQL

    Put mysql-connector-java.jar (JDBC driver) on $SQOOP_HOME/lib.

  • Sqoop

    Sqoop 1.4.5 does not support Hadoop v2.6.0. So, you need to build packages for Hadoop 2.6. To do that you need to edit build.xml and ivy.xml as shown in this patch.

Preparing Model Tables on MySQL

create database a9a;
use a9a;

create user sqoop identified by 'sqoop';
grant all privileges on a9a.* to 'sqoop'@'%' identified by 'sqoop';
flush privileges;

create table a9a_model1 (
  feature int, 
  weight double
);

Do not forget to edit bind_address in the MySQL configuration file (/etc/mysql/my.conf) accessible from master and slave nodes of Hadoop.

Exporting Hive table to MySQL

Check the connectivity to MySQL server using Sqoop.

export MYSQL_HOST=dm01

export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop/
export HADOOP_COMMON_HOME=${HADOOP_HOME}

bin/sqoop list-tables --connect jdbc:mysql://${MYSQL_HOST}/a9a --username sqoop --password sqoop

Create TSV table because Sqoop cannot directory read Hive tables.

create table a9a_model1_tsv 
  ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY "\t"
    LINES TERMINATED BY "\n"
  STORED AS TEXTFILE
AS
select * from a9a_model1;

Check the location of ‘a9a_model1_tsv’ as follows:

desc extended a9a_model1_tsv;
> location:hdfs://dm01:9000/user/hive/warehouse/a9a.db/a9a_model1_tsv
bin/sqoop export \
--connect jdbc:mysql://${MYSQL_HOST}/a9a \
--username sqoop --password sqoop \
--table a9a_model1 \
--export-dir /user/hive/warehouse/a9a.db/a9a_model1_tsv \
--input-fields-terminated-by '\t' --input-lines-terminated-by '\n' \
--batch

When the exporting successfully finishes, you can find entries in the model table in MySQL.

mysql> select * from a9a_model1 limit 3;
+---------+---------------------+
| feature | weight              |
+---------+---------------------+
|       0 | -0.5761121511459351 |
|       1 | -1.5259535312652588 |
|      10 | 0.21053194999694824 |
+---------+---------------------+
3 rows in set (0.00 sec)

We recommend to create an index of model tables to boost lookups in online prediction.

CREATE UNIQUE INDEX a9a_model1_feature_index on a9a_model1 (feature);
-- USING BTREE;

Exporting test data from Hadoop to MySQL (optional step)

Prepare a testing data table in Hive which is being exported.

create table a9atest_exploded_tsv
  ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY "\t"
    LINES TERMINATED BY "\n"
  STORED AS TEXTFILE
AS
select
  rowid, 
  -- label, 
  extract_feature(feature) as feature,
  extract_weight(feature) as value
from
  a9atest LATERAL VIEW explode(add_bias(features)) t AS feature;

desc extended a9atest_exploded_tsv;
> location:hdfs://dm01:9000/user/hive/warehouse/a9a.db/a9atest_exploded_tsv,

Prepare a test table, importing data from Hadoop.

use a9a;

create table a9atest_exploded (
  rowid bigint,
  feature int, 
  value double
);

Then, run Sqoop to export data from HDFS to MySQL.

export MYSQL_HOST=dm01

bin/sqoop export \
--connect jdbc:mysql://${MYSQL_HOST}/a9a \
--username sqoop --password sqoop \
--table a9atest_exploded \
--export-dir /user/hive/warehouse/a9a.db/a9atest_exploded_tsv \
--input-fields-terminated-by '\t' --input-lines-terminated-by '\n' \
--batch

Better to add an index to the rowid column to boost selection by rowids.

CREATE INDEX a9atest_exploded_rowid_index on a9atest_exploded (rowid) USING BTREE;

When the exporting successfully finishes, you can find entries in the test table in MySQL.

mysql> select * from a9atest_exploded limit 10;
+-------+---------+-------+
| rowid | feature | value |
+-------+---------+-------+
| 12427 |      67 |     1 |
| 12427 |      73 |     1 |
| 12427 |      74 |     1 |
| 12427 |      76 |     1 |
| 12427 |      82 |     1 |
| 12427 |      83 |     1 |
| 12427 |       0 |     1 |
| 12428 |       5 |     1 |
| 12428 |       7 |     1 |
| 12428 |      16 |     1 |
+-------+---------+-------+
10 rows in set (0.00 sec)

Online/realtime prediction on MySQL

Define sigmoid function used for a prediction of logistic regression as follows:

DROP FUNCTION IF EXISTS sigmoid;
DELIMITER //
CREATE FUNCTION sigmoid(x DOUBLE)
  RETURNS DOUBLE
  LANGUAGE SQL
BEGIN
  RETURN 1.0 / (1.0 + EXP(-x));
END;
//
DELIMITER ;

We assume here that doing prediction for a ‘features’ having (0,1,10) and each of them is a categorical feature (i.e., the weight is 1.0). Then, you can get the probability by logistic regression simply as follows:

select
  sigmoid(sum(m.weight)) as prob
from
  a9a_model1 m
where
  m.feature in (0,1,10);
+--------------------+
| prob               |
+--------------------+
| 0.1310696931351625 |
+--------------------+
1 row in set (0.00 sec)

Similar to the way in Hive, you can run prediction as follows:

select
  sigmoid(sum(t.value * m.weight)) as prob, 
  if(sigmoid(sum(t.value * m.weight)) > 0.5, 1.0, 0.0) as predicted
from
  a9atest_exploded t LEFT OUTER JOIN
  a9a_model1 m ON (t.feature = m.feature)
where
  t.rowid = 12427; -- prediction on a particular id

Alternatively, you can use SQL views for testing target ‘t’ in the above query.

+---------------------+-----------+
| prob                | predicted |
+---------------------+-----------+
| 0.05595205126313402 |       0.0 |
+---------------------+-----------+
1 row in set (0.00 sec)