<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!-- toc -->
# Prerequisites
Please refer to the following guides for Hadoop tuning:
* http://hadoopbook.com/
* http://www.slideshare.net/cloudera/mr-perf
---
# Mapper-side configuration
_Mapper configuration is important for Hivemall when training runs on mappers (e.g., when using `rand_amplify()`)._
```
mapreduce.map.java.opts="-Xmx2048m -XX:+PrintGCDetails" (YARN)
mapred.map.child.java.opts="-Xmx2048m -XX:+PrintGCDetails" (MR v1)
mapreduce.task.io.sort.mb=1024 (YARN)
io.sort.mb=1024 (MR v1)
```
Hivemall can use at most 1024MB in the above case:

> mapreduce.map.java.opts - mapreduce.task.io.sort.mb = 2048MB - 1024MB = 1024MB

Moreover, other Hadoop components consume memory, so in practice only about half of that (1024MB * 0.5 ≈ 512MB) is available to Hivemall. We recommend setting at least -Xmx2048m for a mapper.
So, make `mapreduce.map.java.opts - mapreduce.task.io.sort.mb` as large as possible.
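If you cannot change the cluster-wide defaults, these options can usually be overridden per query. A minimal sketch from a Hive session, assuming your cluster permits per-job overrides (the values mirror the YARN settings above):
```sql
-- override mapper-side memory settings for the current session only
set mapreduce.map.java.opts=-Xmx2048m -XX:+PrintGCDetails;
set mapreduce.task.io.sort.mb=1024;
```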
# Reducer-side configuration
_Reducer configuration is important for Hivemall when training runs on reducers (e.g., when using `amplify()`)._
```
mapreduce.reduce.java.opts="-Xmx2048m -XX:+PrintGCDetails" (YARN)
mapred.reduce.child.java.opts="-Xmx2048m -XX:+PrintGCDetails" (MR v1)
mapreduce.reduce.shuffle.input.buffer.percent=0.6 (YARN)
mapred.job.shuffle.input.buffer.percent=0.6 (MR v1)
-- mapreduce.reduce.input.buffer.percent=0.2 (YARN)
-- mapred.job.reduce.input.buffer.percent=0.2 (MR v1)
```
Hivemall can use at most 820MB in the above case:

> mapreduce.reduce.java.opts * (1 - mapreduce.reduce.shuffle.input.buffer.percent) = 2048MB * (1 - 0.6) ≈ 820MB

Moreover, other Hadoop components consume memory, so in practice only about half of that (820MB * 0.5 ≈ 410MB) is available to Hivemall. We recommend setting at least -Xmx2048m for a reducer.
So, make `mapreduce.reduce.java.opts * (1 - mapreduce.reduce.shuffle.input.buffer.percent)` as large as possible.
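As with the mapper settings, these can usually be overridden per session; a minimal sketch under the same assumption:
```sql
-- override reducer-side memory settings for the current session only
set mapreduce.reduce.java.opts=-Xmx2048m -XX:+PrintGCDetails;
set mapreduce.reduce.shuffle.input.buffer.percent=0.6;
```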
---
# Formula to estimate consumed memory in Hivemall
For a dense model, the memory consumed by Hivemall can be estimated as follows:
```
feature_dimensions (2^24 by default) * 4 bytes (float) * 2 (if covariance is calculated) * 1.2 (heuristic factor)
```
> 2^24 * 4 bytes * 2 * 1.2 ≈ 161MB
When [SpaceEfficientDenseModel](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/io/SpaceEfficientDenseModel.java) is used, the formula changes as follows:
```
feature_dimensions (assume here 2^25) * 2 bytes (short) * 2 (if covariance is calculated) * 1.2 (heuristic factor)
```
> 2^25 * 2 bytes * 2 * 1.2 ≈ 161MB
Note: Hivemall uses a [sparse representation](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/io/SparseModel.java) of the prediction model (backed by a hash table) by default. Use the "[-densemodel](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/LearnerBaseUDTF.java#L87)" option to switch to a dense model.
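For illustration, here is a hedged sketch of enabling the dense model when invoking a trainer. `train_logregr` is a Hivemall function, but the `training` table and its columns are assumptions for this example:
```sql
-- hypothetical example: the "training" table (array<string> features, float label)
-- is assumed; '-densemodel' switches from the default sparse (hash table) model
select
  train_logregr(features, label, '-densemodel') as (feature, weight)
from
  training;
```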
# Execution Engine of Hive
We recommend using Apache Tez as the Hive execution engine for Hivemall queries.
```sql
set mapreduce.framework.name=yarn-tez;
set hive.execution.engine=tez;
```
You can fall back to plain old MapReduce with the following settings:
```sql
set mapreduce.framework.name=yarn;
set hive.execution.engine=mr;
```