# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
### "Gridmix" Benchmark ###
Contents:
0 Overview
1 Getting Started
  1.0 Build
  1.1 Configure environment variables
  1.2 Configure the job mixture
  1.3 Generate test data
2 Running
* 0 Overview
The scripts in this package model a cluster workload. The workload is
simulated by generating random data and submitting map/reduce jobs that
mimic observed data-access patterns in user jobs. The full benchmark
generates approximately 2.5TB of (often compressed) input data operated on
by the following simulated jobs:
1) Three stage map/reduce job
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5 words, 100 words)
               hadoop-env: FIXCOMPSEQ
   Compute1:   keep 10% map, 40% reduce
   Compute2:   keep 100% map, 77% reduce
               Input from Compute1
   Compute3:   keep 116% map, 91% reduce
               Input from Compute2
   Motivation: Many user workloads are implemented as pipelined map/reduce
               jobs, including Pig workloads.
2) Large sort of variable key/value size
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5-10 words, 100-10000 words)
               hadoop-env: VARCOMPSEQ
   Compute:    keep 100% map, 100% reduce
   Motivation: Processing large, compressed datasets is common.
3) Reference select
   Input:      500GB compressed (2TB uncompressed) SequenceFile
               (k,v) = (5-10 words, 100-10000 words)
               hadoop-env: VARCOMPSEQ
   Compute:    keep 0.2% map, 5% reduce
               1 Reducer
   Motivation: Sampling from a large, reference dataset is common.
4) API text sort (java, streaming)
   Input:      500GB uncompressed Text
               (k,v) = (1-10 words, 0-200 words)
               hadoop-env: VARINFLTEXT
   Compute:    keep 100% map, 100% reduce
   Motivation: This benchmark should exercise each of the APIs to
               map/reduce.
5) Jobs with combiner (word count jobs)
A benchmark load is a mix of different numbers of small, medium, and large
jobs of the above types. The exact mix is specified in an XML file
(gridmix_config.xml). A Java program constructs these jobs from the XML file
and places them under the control of a JobControl object. The JobControl
object then submits the jobs to the cluster and monitors their progress
until all jobs complete.
Note (jobs 1-3): Since the input data are compressed, each mapper outputs
many more bytes than it reads in, typically causing map output spills.
* 1 Getting Started
1.0 Build
In the src/benchmarks/gridmix dir, type "ant".
gridmix.jar will be created in the build subdir.
Copy gridmix.jar to the gridmix dir.
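A minimal sketch of these steps as shell commands, assuming the source
layout above (run from the top of the Hadoop source tree):

  # Build gridmix.jar and copy it next to the benchmark scripts.
  cd src/benchmarks/gridmix
  ant
  cp build/gridmix.jar .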
1.1 Configure environment variables
One must modify gridmix-env-2 to set the following variables:
  HADOOP_HOME      The hadoop install location
  HADOOP_VERSION   The exact hadoop version to be used, e.g. hadoop-0.18.2-dev
  HADOOP_CONF_DIR  The dir containing the hadoop-site.xml for the cluster to be used
  USE_REAL_DATA    If set to true, a large data set will be created and used by the benchmark
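A sketch of the corresponding lines in gridmix-env-2; the paths below are
placeholders to substitute with your own installation:

  # Example settings; adjust paths and version to your installation.
  export HADOOP_HOME=/usr/local/hadoop
  export HADOOP_VERSION=hadoop-0.18.2-dev
  export HADOOP_CONF_DIR=${HADOOP_HOME}/conf
  # Set to true to generate and use the full ~2.5TB data set.
  export USE_REAL_DATA=true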
1.2 Configure the job mixture
A default gridmix_config.xml file is provided.
One may make appropriate changes as necessary to the number of jobs of
various types and sizes, change the number of reducers for each job, and
specify whether to compress the output data of a map/reduce job.
Note that one can specify a comma-separated list of values in the
numOfJobs and numOfReduces fields, like:
  <property>
    <name>javaSort.smallJobs.numOfJobs</name>
    <value>8,2</value>
    <description></description>
  </property>

  <property>
    <name>javaSort.smallJobs.numOfReduces</name>
    <value>15,70</value>
    <description></description>
  </property>
The above spec means that we will have 8 small java sort jobs with 15
reducers each and 2 small java sort jobs with 70 reducers each.
1.3 Generate test data
Test data is generated using the generateGridmix2Data.sh script.
./generateGridmix2Data.sh
One may modify the structure and size of the data generated here.
It is sufficient to run the script without modification, though it may
require up to 4TB of free space in the default filesystem. Changing the size
of the input data (COMPRESSED_DATA_BYTES, UNCOMPRESSED_DATA_BYTES,
INDIRECT_DATA_BYTES) is safe. A 4x compression ratio for generated, block
compressed data is typical.
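For a smaller trial run, one might shrink these sizes before invoking the
script; a sketch, assuming the variables are set near the top of
generateGridmix2Data.sh (the values below are illustrative, not the
defaults):

  # Illustrative reduced sizes for a trial run; edit these in
  # generateGridmix2Data.sh before running it.
  COMPRESSED_DATA_BYTES=$((5 * 1024 * 1024 * 1024))     # ~5GB
  UNCOMPRESSED_DATA_BYTES=$((20 * 1024 * 1024 * 1024))  # ~20GB (4x ratio)
  INDIRECT_DATA_BYTES=$((5 * 1024 * 1024 * 1024))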
* 2 Running
You need to set HADOOP_CONF_DIR to the directory containing the
hadoop-site.xml for your cluster. Then you just need to type
./rungridmix_2
It will create start.out to record the start time and, at the end, end.out
to record the end time.
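A minimal end-to-end launch, assuming the configuration steps above are
done (the conf path is an example to substitute):

  # Point at the cluster config containing hadoop-site.xml.
  export HADOOP_CONF_DIR=/usr/local/hadoop/conf
  # Launch the benchmark; it submits and monitors all jobs via JobControl.
  ./rungridmix_2
  # start.out and end.out hold the run's start and end timestamps.
  cat start.out end.out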