<html>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<body>
This package consists of 3 map/reduce applications for Hadoop to
compete in the annual <a
href="http://www.hpl.hp.com/hosted/sortbenchmark" target="_top">terabyte sort</a>
competition.
<ul>
<li><b>TeraGen</b> is a map/reduce program to generate the data.
<li><b>TeraSort</b> samples the input data and uses map/reduce to
sort the data into a total order.
<li><b>TeraValidate</b> is a map/reduce program that validates the
output is sorted.
</ul>
<p>
<b>TeraGen</b> generates output data that is byte-for-byte
equivalent to the C version, including the newlines and specific
keys. It divides the desired number of rows by the desired number of
tasks and assigns a range of rows to each map. Each map jumps the random
number generator ahead to the value for its first row and then generates
the following rows.
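<p>
As a rough illustration of the row-range bookkeeping, the sketch below splits
the total row count across the map tasks; the class and method names are
illustrative and not the actual TeraGen code.
<pre>
// Hypothetical sketch: divide the total number of rows across the map tasks so
// that each map owns a contiguous range, mirroring the scheme described above.
public class RowSplitSketch {
  /** Returns the pair (first row, row count) for map task 'taskId' of 'numTasks'. */
  static long[] rowsForTask(long totalRows, int numTasks, int taskId) {
    long rowsPerTask = totalRows / numTasks;
    long firstRow = rowsPerTask * taskId;
    // The last task picks up any remainder so that every row is generated exactly once.
    long rowCount = (taskId == numTasks - 1) ? totalRows - firstRow : rowsPerTask;
    return new long[] { firstRow, rowCount };
  }

  public static void main(String[] args) {
    long[] split = rowsForTask(10000000000L, 1800, 0);
    System.out.println("first row = " + split[0] + ", rows = " + split[1]);
    // Each map would then skip its random number generator ahead to its first
    // row before emitting rows, as described above.
  }
}
</pre>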
<p>
<b>TeraSort</b> is a standard map/reduce sort, except for a custom
partitioner that uses a sorted list of <i>N-1</i> sampled keys to define
the key range for each reduce. In particular, all keys such that
<i>sample[i-1] &lt;= key &lt; sample[i]</i> are sent to reduce
<i>i</i>. This guarantees that every key output by reduce <i>i</i> is
less than every key output by reduce <i>i+1</i>. To speed up the
partitioning, the partitioner builds a two-level trie that quickly
indexes into the list of sample keys based on the first two bytes of
the key. TeraSort generates the sample keys by sampling the input
before the job is submitted and writing the list of keys into HDFS.
The input and output formats, which are used by all 3 applications,
read and write the text files in the correct format. The output of the
reduce has replication set to 1, instead of the default 3, because the
contest does not require the output data to be replicated onto multiple
nodes.
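<p>
The partitioning rule can be expressed compactly. The sketch below uses a plain
binary search over the sampled keys instead of the two-level trie, and Strings
instead of raw 10-byte keys; it is a simplified illustration, not the actual
TeraSort partitioner.
<pre>
import java.util.Arrays;

// Simplified sketch of the partitioning rule: N-1 sorted split keys define N
// key ranges, and a key is sent to the reduce whose range contains it.
public class PartitionSketch {
  static int partitionFor(String key, String[] splitPoints) {
    int pos = Arrays.binarySearch(splitPoints, key);
    // binarySearch returns (-(insertion point) - 1) when the key is absent;
    // either way the result maps the key to a reduce number.
    return pos >= 0 ? pos + 1 : -(pos + 1);
  }

  public static void main(String[] args) {
    String[] splits = { "d", "m", "t" };   // sampled split keys for 4 reduces
    for (String key : new String[] { "apple", "dog", "moose", "zebra" }) {
      System.out.println(key + " goes to reduce " + partitionFor(key, splits));
    }
  }
}
</pre>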
<p>
<b>TeraValidate</b> ensures that the output is globally sorted. It
creates one map per file in the output directory, and each map ensures that
each key is greater than or equal to the previous one. The map also generates
records with the first and last keys of the file, and the reduce
ensures that the first key of file <i>i</i> is greater than the last key of
file <i>i-1</i>. Any problems are reported as output of the reduce, with the
keys that are out of order.
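<p>
The per-file check is a single in-order pass. The sketch below shows that pass
over String keys; the names are illustrative, and the real TeraValidate is a
map/reduce job over Hadoop's raw key types.
<pre>
// Simplified sketch of the per-file check: scan the keys of one output file in
// order and flag any key that is smaller than its predecessor.
public class ValidateSketch {
  /** Returns null if the keys are non-decreasing, otherwise an error message. */
  static String checkFile(String fileName, String[] keys) {
    String previous = null;
    for (String key : keys) {
      if (previous != null) {
        if (previous.compareTo(key) > 0) {
          return fileName + ": key " + key + " is out of order after " + previous;
        }
      }
      previous = key;
    }
    // In the real job each map also emits the file's first and last keys, and the
    // reduce checks that each file's first key is greater than the previous
    // file's last key, which makes the ordering check global.
    return null;
  }

  public static void main(String[] args) {
    System.out.println(checkFile("part-00000", new String[] { "a", "c", "b" }));
  }
}
</pre>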
<p>
In May 2008, Owen O'Malley ran this code on a 910 node cluster and
sorted the 10 billion records (1 TB) in 209 seconds (3.48 minutes) to
win the annual general purpose (daytona)
<a href="http://www.hpl.hp.com/hosted/sortbenchmark/">terabyte sort
benchmark</a>.
<p>
The cluster statistics were:
<ul>
<li> 910 nodes
<li> 4 dual-core Xeons @ 2.0 GHz per node
<li> 4 SATA disks per node
<li> 8 GB RAM per node
<li> 1 gigabit Ethernet on each node
<li> 40 nodes per rack
<li> 8 gigabit Ethernet uplinks from each rack to the core
<li> Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18)
<li> Sun Java JDK 1.6.0_05-b13
</ul>
<p>
The test was on Hadoop trunk (pre-0.18) patched with <a
href="http://issues.apache.org/jira/browse/HADOOP-3443">HADOOP-3443</a>
and <a
href="http://issues.apache.org/jira/browse/HADOOP-3446">HADOOP-3446</a>,
which were required to remove intermediate writes to disk.
TeraGen used
1800 tasks to generate a total of 10 billion rows in HDFS, with a
block size of 1024 MB.
TeraSort was configured with 1800 maps and 1800 reduces, and
<i>mapreduce.task.io.sort.mb</i>,
<i>mapreduce.task.io.sort.factor</i>,
<i>fs.inmemory.size.mb</i>, and the task heap size were set large
enough that transient data was never spilled to disk, other than at the
end of the map. The sampler looked at 100,000 keys to determine the
reduce boundaries, which led to imperfect balancing, with reduce
outputs ranging from 337 MB to 872 MB.
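<p>
As a rough sketch, the properties named above can be set on the job
configuration before submission; the values below are placeholders and not the
settings used in the 2008 run.
<pre>
import org.apache.hadoop.conf.Configuration;

// Hedged sketch: setting the tuning knobs mentioned above on a Hadoop
// configuration object. The numeric values are illustrative placeholders.
public class TuningSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.task.io.sort.mb", 512);      // map-side sort buffer, in MB
    conf.setInt("mapreduce.task.io.sort.factor", 100);  // streams merged at once
    conf.setInt("fs.inmemory.size.mb", 200);            // in-memory file system size, in MB
    // The task heap size is set separately through the task JVM options so that
    // these buffers fit in memory, keeping intermediate data off disk except for
    // the final write at the end of each map.
  }
}
</pre>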
</body>
</html>