 <!--
   Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements.  See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership.  The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied.  See the License for the
   specific language governing permissions and limitations
   under the License.
 -->

 This page explains how to use model mixing in Hivemall. Model mixing helps improve prediction performance and speeds up convergence when training classifiers.
 A brief explanation of the internal design of the MIX protocol is available in [these slides](http://www.slideshare.net/myui/hivemall-mix-internal).

 <!-- toc -->

 Prerequisite
 ============

 * Hivemall v0.3 or later

     We recommend using mixing in a cluster with fast networking, although standard Gigabit Ethernet (GbE) is sufficient.

 Running Mix Server
 ===================

 First, put the following files on server(s) that are accessible from Hadoop worker nodes:
 * [target/hivemall-mixserv.jar](https://github.com/myui/hivemall/releases)
 * [bin/run_mixserv.sh](https://github.com/myui/hivemall/raw/master/bin/run_mixserv.sh)

 _Caution: hivemall-mixserv.jar is large, so it should be deployed only on Mix servers._

 ```sh
 # run a Mix Server
 ./run_mixserv.sh
 ```

 This example assumes that Mix servers are running on host01, host02, and host03.
 The Mix server listens on port 11212 by default; the port can be changed with the "-port" option of run_mixserv.sh.
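
To bring up the example servers, a loop over the hosts can start each one over ssh. This is only a sketch: it assumes passwordless ssh, that run_mixserv.sh sits in the remote user's home directory, and that the script stays in the foreground (hence `nohup` and `&`). The helper name `start_cmd`, the `DRY_RUN` switch, and the log file name are hypothetical; only the "-port" option comes from the text above.

```sh
#!/bin/sh
# Start one Mix server per host given on the command line.
# Assumptions (not from the Hivemall docs): passwordless ssh and
# run_mixserv.sh located in the remote home directory.
MIX_PORT="${MIX_PORT:-11212}"   # 11212 is the default Mix server port

# Build the command line that is executed on each remote host.
# start_cmd is a hypothetical helper; mixserv.log is an arbitrary log name.
start_cmd() {
  printf 'nohup ./run_mixserv.sh -port %s > mixserv.log 2>&1 < /dev/null &\n' "$1"
}

for h in "$@"; do
  if [ -n "${DRY_RUN:-}" ]; then
    # Only show what would be run on each host
    printf '%s: %s\n' "$h" "$(start_cmd "$MIX_PORT")"
  else
    ssh "$h" "$(start_cmd "$MIX_PORT")"
  fi
done
```

Saved as, say, `start_mixservs.sh`, it could be tried with `DRY_RUN=1 sh start_mixservs.sh host01 host02 host03` to print the commands without executing them.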

 See [MixServer.java](https://github.com/myui/hivemall/blob/master/mixserv/src/main/java/hivemall/mix/server/MixServer.java#L90) for details of the Mix server options.

 We recommend running multiple MIX servers for better MIX throughput (3 to 5 is usually enough for a normal cluster size). The MIX protocol of Hivemall is *horizontally scalable*: it scales as MIX server nodes are added.
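
Because every Mix server has to be reachable from the Hadoop worker nodes, it may be worth probing each host before training. The following is a minimal sketch, assuming `bash` (for its `/dev/tcp` redirection) and the coreutils `timeout` command are available on the worker node; `check_mix_server` is a hypothetical helper, not part of Hivemall.

```sh
#!/bin/sh
# Probe the TCP port of a Mix server from a worker node.
# check_mix_server is a hypothetical helper name.
check_mix_server() {
  host="$1"
  port="${2:-11212}"   # 11212 is the default Mix server port
  # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT;
  # give up after 3 seconds if the host does not answer.
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Check every host passed on the command line
for h in "$@"; do
  if check_mix_server "$h"; then
    echo "$h: reachable"
  else
    echo "$h: NOT reachable"
  fi
done
```

Run on a worker node with the server names as arguments, for example `sh check_mix.sh host01 host02 host03` (the script name is arbitrary).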

 Using Mix Protocol through Hivemall
 ===================================

 [Install Hivemall](../getting_started/installation.html) on Hive.

 _Make sure that [hivemall-with-dependencies.jar](https://github.com/myui/hivemall/raw/master/target/hivemall-with-dependencies.jar) is used for installation. This jar bundles the minimum required dependencies (netty, jsr305) for running Hivemall on Hive._

 The following shows how to use mixing, based on [an example using the KDD2010a dataset](../binaryclass/kdd2010a_dataset.html).

 Enabling mixing in Hivemall is as simple as the following:
 ```sql
 use kdd2010;

 create table kdd10a_pa1_model1 as
 select
  feature,
  cast(voted_avg(weight) as float) as weight
 from
  (select
      train_pa1(add_bias(features),label,"-mix host01,host02,host03") as (feature,weight)
   from
      kdd10a_train_x3
  ) t
 group by feature;
 ```

 All you have to do is add the "*-mix*" training option, as shown in the above query.
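
When the list of Mix servers changes between environments, the option string can be generated rather than hard-coded. A small sketch of that idea follows; `mix_option` is a hypothetical helper name, and the host names are the example hosts used on this page.

```sh
#!/bin/sh
# Join a space-separated host list into the comma-separated value
# expected after "-mix". mix_option is a hypothetical helper name.
mix_option() {
  hosts=$(printf '%s' "$1" | tr -s ' ' ',')
  printf '%s %s\n' '-mix' "$hosts"
}

MIX_OPT=$(mix_option "host01 host02 host03")
printf '%s\n' "$MIX_OPT"   # prints: -mix host01,host02,host03
```

One possible way to use the result, assuming the training query is stored in a file (here the hypothetical `train_pa1_mix.hql`), is Hive's variable substitution: run `hive --hivevar mix_opt="$MIX_OPT" -f train_pa1_mix.hql` and write `"${hivevar:mix_opt}"` in the query in place of the literal option string.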

 The effect of model mixing
 ===========================

 In my experience, MIX improved the prediction accuracy of the above KDD2010a PA1 training on a 32-node cluster from 0.844835019263103 (without mix) to 0.8678096499719774 (with mix).

 The overhead of using the MIX protocol is *almost negligible* because MIX communication is handled efficiently using asynchronous non-blocking I/O. Furthermore, training time can improve in certain settings because mixing leads to faster convergence.