blob: 91aff87bb4aa75843de06488326f6da5a4bcc0f6 [file] [log] [blame] [view]
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
In this page, we will explain how to use model mixing on Hivemall. The model mixing is useful for a better prediction performance and faster convergence in training classifiers.
You can find a brief explanation of the internal design of MIX protocol in [this slide](http://www.slideshare.net/myui/hivemall-mix-internal).
<!-- toc -->
Prerequisite
============
* Hivemall v0.3 or later
We recommend to use Mixing in a cluster with fast networking. The current standard GbE is enough though.
Running Mix Server
===================
First, put the following files on server(s) that are accessible from Hadoop worker nodes:
* [target/hivemall-mixserv.jar](https://github.com/myui/hivemall/releases)
* [bin/run_mixserv.sh](https://github.com/myui/hivemall/raw/master/bin/run_mixserv.sh)
_Caution: hivemall-mixserv.jar is large in size and thus only used for Mix servers._
```sh
# run a Mix Server
./run_mixserv.sh
```
We assume in this example that Mix servers are running on host01, host03 and host03.
The default port used by Mix server is 11212 and the port is configurable through "-port" option of run_mixserv.sh.
See [MixServer.java](https://github.com/myui/hivemall/blob/master/mixserv/src/main/java/hivemall/mix/server/MixServer.java#L90) to get detail of the Mix server options.
We recommended to use multiple MIX servers to get better MIX throughput (3-5 or so would be enough for normal cluster size). The MIX protocol of Hivemall is *horizontally scalable* by adding MIX server nodes.
Using Mix Protocol through Hivemall
===================================
[Install Hivemall](../getting_started/installation.html) on Hive.
_Make sure that [hivemall-with-dependencies.jar](https://github.com/myui/hivemall/raw/master/target/hivemall-with-dependencies.jar) is used for installation. The jar contains minimum requirement jars (netty,jsr305) for running Hivemall on Hive._
Now, we explain that how to use mixing in [an example using KDD2010a dataset](../binaryclass/kdd2010a_dataset.html).
Enabling the mixing on Hivemall is simple as follows:
```sql
use kdd2010;
create table kdd10a_pa1_model1 as
select
feature,
cast(voted_avg(weight) as float) as weight
from
(select
train_pa1(add_bias(features),label,"-mix host01,host02,host03") as (feature,weight)
from
kdd10a_train_x3
) t
group by feature;
```
All you have to do is just adding "*-mix*" training option as seen in the above query.
The effect of model mixing
===========================
In my experience, the MIX improved the prediction accuracy of the above KDD2010a PA1 training on a 32 nodes cluster from 0.844835019263103 (w/o mix) to 0.8678096499719774 (w/ mix).
The overhead of using the MIX protocol is *almost negligible* because the MIX communication is efficiently handled using asynchronous non-blocking I/O. Furthermore, the training time could be improved on certain settings because of the faster convergence due to mixing.