<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!-- toc -->
# Prerequisites
Please refer to the following guides for Hadoop tuning:
* http://hadoopbook.com/
* http://www.slideshare.net/cloudera/mr-perf
---
# Mapper-side configuration
_Mapper configuration is important for Hivemall when training runs on mappers (e.g., when using `rand_amplify()`)._
```
mapreduce.map.java.opts="-Xmx2048m -XX:+PrintGCDetails" (YARN)
mapred.map.child.java.opts="-Xmx2048m -XX:+PrintGCDetails" (MR v1)
mapreduce.task.io.sort.mb=1024 (YARN)
io.sort.mb=1024 (MR v1)
```
Hivemall can use at most 1024MB in the above case:

> mapreduce.map.java.opts - mapreduce.task.io.sort.mb = 2048MB - 1024MB = 1024MB

Moreover, other Hadoop components consume memory, so in practice only about half of that (1024MB * 0.5 ≈ 512MB) is available to Hivemall. We recommend setting at least -Xmx2048m for a mapper.
So, make `mapreduce.map.java.opts - mapreduce.task.io.sort.mb` as large as possible.
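If you cannot change the cluster-wide defaults, these options can usually be overridden per query. A minimal sketch from a Hive session, assuming your cluster permits per-job overrides (the values mirror the YARN settings above):
```sql
-- override mapper-side memory settings for the current session only
set mapreduce.map.java.opts=-Xmx2048m -XX:+PrintGCDetails;
set mapreduce.task.io.sort.mb=1024;
```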
# Reducer-side configuration
_Reducer configuration is important for Hivemall when training runs on reducers (e.g., when using `amplify()`)._
```
mapreduce.reduce.java.opts="-Xmx2048m -XX:+PrintGCDetails" (YARN)
mapred.reduce.child.java.opts="-Xmx2048m -XX:+PrintGCDetails" (MR v1)
mapreduce.reduce.shuffle.input.buffer.percent=0.6 (YARN)
mapred.job.shuffle.input.buffer.percent=0.6 (MR v1)
-- mapreduce.reduce.input.buffer.percent=0.2 (YARN)
-- mapred.job.reduce.input.buffer.percent=0.2 (MR v1)
```
Hivemall can use at most 820MB in the above case:

> mapreduce.reduce.java.opts * (1 - mapreduce.reduce.shuffle.input.buffer.percent) = 2048MB * (1 - 0.6) ≈ 820MB

Moreover, other Hadoop components consume memory, so in practice only about half of that (820MB * 0.5 ≈ 410MB) is available to Hivemall. We recommend setting at least -Xmx2048m for a reducer.
So, make `mapreduce.reduce.java.opts * (1 - mapreduce.reduce.shuffle.input.buffer.percent)` as large as possible.
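As with the mapper settings, these can usually be overridden per session; a minimal sketch under the same assumption:
```sql
-- override reducer-side memory settings for the current session only
set mapreduce.reduce.java.opts=-Xmx2048m -XX:+PrintGCDetails;
set mapreduce.reduce.shuffle.input.buffer.percent=0.6;
```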
---
# Formula to estimate consumed memory in Hivemall
For a dense model, the memory consumed by Hivemall can be estimated as follows:
```
feature_dimensions (2^24 by default) * 4 bytes (float) * 2 (if covariance is calculated) * 1.2 (heuristic factor)
```
> 2^24 * 4 bytes * 2 * 1.2 ≈ 161MB
When [SpaceEfficientDenseModel](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/io/SpaceEfficientDenseModel.java) is used, the formula changes as follows:
```
feature_dimensions (assume here 2^25) * 2 bytes (short) * 2 (if covariance is calculated) * 1.2 (heuristic factor)
```
> 2^25 * 2 bytes * 2 * 1.2 ≈ 161MB
Note: Hivemall uses a [sparse representation](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/io/SparseModel.java) of the prediction model (backed by a hash table) by default. Use the "[-densemodel](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/LearnerBaseUDTF.java#L87)" option to switch to a dense model.
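For illustration, here is a hedged sketch of enabling the dense model when invoking a trainer. `train_logregr` is a Hivemall function, but the `training` table and its columns are assumptions for this example:
```sql
-- hypothetical example: the "training" table (array<string> features, float label)
-- is assumed; '-densemodel' switches from the default sparse (hash table) model
select
  train_logregr(features, label, '-densemodel') as (feature, weight)
from
  training;
```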
# Execution Engine of Hive
We recommend using Apache Tez as the Hive execution engine for Hivemall queries.
```sql
set mapreduce.framework.name=yarn-tez;
set hive.execution.engine=tez;
```
You can fall back to plain old MapReduce with the following settings:
```sql
set mapreduce.framework.name=yarn;
set hive.execution.engine=mr;
```