blob: 71438c5da64bcdd6ad36fcddd570d817d1dce2c9 [file] [log] [blame]
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<!-- NewPage -->
<html lang="en">
<head>
<!-- Generated by javadoc (1.8.0_292) on Tue Jun 15 06:00:57 GMT 2021 -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>org.apache.hadoop.examples.terasort (Apache Hadoop Main 3.3.1 API)</title>
<meta name="date" content="2021-06-15">
<link rel="stylesheet" type="text/css" href="../../../../../stylesheet.css" title="Style">
<script type="text/javascript" src="../../../../../script.js"></script>
</head>
<body>
<script type="text/javascript"><!--
try {
if (location.href.indexOf('is-external=true') == -1) {
parent.document.title="org.apache.hadoop.examples.terasort (Apache Hadoop Main 3.3.1 API)";
}
}
catch(err) {
}
//-->
</script>
<noscript>
<div>JavaScript is disabled on your browser.</div>
</noscript>
<!-- ========= START OF TOP NAVBAR ======= -->
<div class="topNav"><a name="navbar.top">
<!-- -->
</a>
<div class="skipNav"><a href="#skip.navbar.top" title="Skip navigation links">Skip navigation links</a></div>
<a name="navbar.top.firstrow">
<!-- -->
</a>
<ul class="navList" title="Navigation">
<li><a href="../../../../../overview-summary.html">Overview</a></li>
<li class="navBarCell1Rev">Package</li>
<li>Class</li>
<li><a href="package-use.html">Use</a></li>
<li><a href="package-tree.html">Tree</a></li>
<li><a href="../../../../../deprecated-list.html">Deprecated</a></li>
<li><a href="../../../../../index-all.html">Index</a></li>
<li><a href="../../../../../help-doc.html">Help</a></li>
</ul>
</div>
<div class="subNav">
<ul class="navList">
<li><a href="../../../../../org/apache/hadoop/examples/pi/math/package-summary.html">Prev&nbsp;Package</a></li>
<li><a href="../../../../../org/apache/hadoop/filecache/package-summary.html">Next&nbsp;Package</a></li>
</ul>
<ul class="navList">
<li><a href="../../../../../index.html?org/apache/hadoop/examples/terasort/package-summary.html" target="_top">Frames</a></li>
<li><a href="package-summary.html" target="_top">No&nbsp;Frames</a></li>
</ul>
<ul class="navList" id="allclasses_navbar_top">
<li><a href="../../../../../allclasses-noframe.html">All&nbsp;Classes</a></li>
</ul>
<div>
<script type="text/javascript"><!--
allClassesLink = document.getElementById("allclasses_navbar_top");
if(window==top) {
allClassesLink.style.display = "block";
}
else {
allClassesLink.style.display = "none";
}
//-->
</script>
</div>
<a name="skip.navbar.top">
<!-- -->
</a></div>
<!-- ========= END OF TOP NAVBAR ========= -->
<div class="header">
<h1 title="Package" class="title">Package&nbsp;org.apache.hadoop.examples.terasort</h1>
<div class="docSummary">
<div class="block">This package consists of 3 map/reduce applications for Hadoop to
compete in the annual <a
href="http://www.hpl.hp.com/hosted/sortbenchmark" target="_top">terabyte sort</a>
competition.</div>
</div>
<p>See:&nbsp;<a href="#package.description">Description</a></p>
</div>
<div class="contentContainer"><a name="package.description">
<!-- -->
</a>
<h2 title="Package org.apache.hadoop.examples.terasort Description">Package org.apache.hadoop.examples.terasort Description</h2>
<div class="block">This package consists of 3 map/reduce applications for Hadoop to
compete in the annual <a
href="http://www.hpl.hp.com/hosted/sortbenchmark" target="_top">terabyte sort</a>
competition.
<ul>
<li><b>TeraGen</b> is a map/reduce program to generate the data.
<li><b>TeraSort</b> samples the input data and uses map/reduce to
sort the data into a total order.
<li><b>TeraValidate</b> is a map/reduce program that validates the
output is sorted.
</ul>
<p>
<b>TeraGen</b> generates output data that is byte for byte
equivalent to the C version including the newlines and specific
keys. It divides the desired number of rows by the desired number of
tasks and assigns ranges of rows to each map. The map jumps the random
number generator to the correct value for the first row and generates
the following rows.
<p>
<b>TeraSort</b> is a standard map/reduce sort, except for a custom
partitioner that uses a sorted list of <i>N-1</i> sampled keys that define
the key range for each reduce. In particular, all keys such that
<i>sample[i-1] &lt;= key &lt; sample[i]</i> are sent to reduce
<i>i</i>. This guarantees that the output of reduce <i>i</i> are all
less than the output of reduce <i>i+1</i>. To speed up the
partitioning, the partitioner builds a two level trie that quickly
indexes into the list of sample keys based on the first two bytes of
the key. TeraSort generates the sample keys by sampling the input
before the job is submitted and writing the list of keys into HDFS.
The input and output format, which are used by all 3 applications,
read and write the text files in the right format. The output of the
reduce has replication set to 1, instead of the default 3, because the
contest does not require the output data be replicated on to multiple
nodes.
<p>
<b>TeraValidate</b> ensures that the output is globally sorted. It
creates one map per a file in the output directory and each map ensures that
each key is less than or equal to the previous one. The map also generates
records with the first and last keys of the file and the reduce
ensures that the first key of file <i>i</i> is greater that the last key of
file <i>i-1</i>. Any problems are reported as output of the reduce with the
keys that are out of order.
<p>
In May 2008, Owen O'Malley ran this code on a 910 node cluster and
sorted the 10 billion records (1 TB) in 209 seconds (3.48 minutes) to
win the annual general purpose (daytona)
<a href="http://www.hpl.hp.com/hosted/sortbenchmark/">terabyte sort
benchmark</a>.
<p>
The cluster statistics were:
<ul>
<li> 910 nodes
<li> 4 dual core Xeons @ 2.0ghz per a node
<li> 4 SATA disks per a node
<li> 8G RAM per a node
<li> 1 gigabit ethernet on each node
<li> 40 nodes per a rack
<li> 8 gigabit ethernet uplinks from each rack to the core
<li> Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18)
<li> Sun Java JDK 1.6.0_05-b13
</ul>
<p>
The test was on Hadoop trunk (pre-0.18) patched with <a
href="http://issues.apache.org/jira/browse/HADOOP-3443">HADOOP-3443</a>
and <a
href="http://issues.apache.org/jira/browse/HADOOP-3446">HADOOP-3446</a>,
which were required to remove intermediate writes to disk.
TeraGen used
1800 tasks to generate a total of 10 billion rows in HDFS, with a
block size of 1024 MB.
TeraSort was configured with 1800 maps and 1800 reduces, and
<i>mapreduce.task.io.sort.mb</i>,
<i>mapreduce.task.io.sort.factor</i>,
<i>fs.inmemory.size.mb</i>, and task heap size
sufficient that transient data was never spilled to disk, other at the
end of the map. The sampler looked at 100,000 keys to determine the
reduce boundaries, which lead to imperfect balancing with reduce
outputs ranging from 337 MB to 872 MB.</div>
</div>
<!-- ======= START OF BOTTOM NAVBAR ====== -->
<div class="bottomNav"><a name="navbar.bottom">
<!-- -->
</a>
<div class="skipNav"><a href="#skip.navbar.bottom" title="Skip navigation links">Skip navigation links</a></div>
<a name="navbar.bottom.firstrow">
<!-- -->
</a>
<ul class="navList" title="Navigation">
<li><a href="../../../../../overview-summary.html">Overview</a></li>
<li class="navBarCell1Rev">Package</li>
<li>Class</li>
<li><a href="package-use.html">Use</a></li>
<li><a href="package-tree.html">Tree</a></li>
<li><a href="../../../../../deprecated-list.html">Deprecated</a></li>
<li><a href="../../../../../index-all.html">Index</a></li>
<li><a href="../../../../../help-doc.html">Help</a></li>
</ul>
</div>
<div class="subNav">
<ul class="navList">
<li><a href="../../../../../org/apache/hadoop/examples/pi/math/package-summary.html">Prev&nbsp;Package</a></li>
<li><a href="../../../../../org/apache/hadoop/filecache/package-summary.html">Next&nbsp;Package</a></li>
</ul>
<ul class="navList">
<li><a href="../../../../../index.html?org/apache/hadoop/examples/terasort/package-summary.html" target="_top">Frames</a></li>
<li><a href="package-summary.html" target="_top">No&nbsp;Frames</a></li>
</ul>
<ul class="navList" id="allclasses_navbar_bottom">
<li><a href="../../../../../allclasses-noframe.html">All&nbsp;Classes</a></li>
</ul>
<div>
<script type="text/javascript"><!--
allClassesLink = document.getElementById("allclasses_navbar_bottom");
if(window==top) {
allClassesLink.style.display = "block";
}
else {
allClassesLink.style.display = "none";
}
//-->
</script>
</div>
<a name="skip.navbar.bottom">
<!-- -->
</a></div>
<!-- ======== END OF BOTTOM NAVBAR ======= -->
<p class="legalCopy"><small>Copyright &#169; 2021 <a href="https://www.apache.org">Apache Software Foundation</a>. All rights reserved.</small></p>
</body>
</html>