| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> |
| <!-- NewPage --> |
| <html lang="en"> |
| <head> |
| <!-- Generated by javadoc (1.8.0_292) on Tue Jun 15 06:00:57 GMT 2021 --> |
| <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> |
| <title>org.apache.hadoop.examples.terasort (Apache Hadoop Main 3.3.1 API)</title> |
| <meta name="date" content="2021-06-15"> |
| <link rel="stylesheet" type="text/css" href="../../../../../stylesheet.css" title="Style"> |
| <script type="text/javascript" src="../../../../../script.js"></script> |
| </head> |
| <body> |
| <script type="text/javascript"><!-- |
| try { |
| if (location.href.indexOf('is-external=true') == -1) { |
| parent.document.title="org.apache.hadoop.examples.terasort (Apache Hadoop Main 3.3.1 API)"; |
| } |
| } |
| catch(err) { |
| } |
| //--> |
| </script> |
| <noscript> |
| <div>JavaScript is disabled on your browser.</div> |
| </noscript> |
| <!-- ========= START OF TOP NAVBAR ======= --> |
| <div class="topNav"><a name="navbar.top"> |
| <!-- --> |
| </a> |
| <div class="skipNav"><a href="#skip.navbar.top" title="Skip navigation links">Skip navigation links</a></div> |
| <a name="navbar.top.firstrow"> |
| <!-- --> |
| </a> |
| <ul class="navList" title="Navigation"> |
| <li><a href="../../../../../overview-summary.html">Overview</a></li> |
| <li class="navBarCell1Rev">Package</li> |
| <li>Class</li> |
| <li><a href="package-use.html">Use</a></li> |
| <li><a href="package-tree.html">Tree</a></li> |
| <li><a href="../../../../../deprecated-list.html">Deprecated</a></li> |
| <li><a href="../../../../../index-all.html">Index</a></li> |
| <li><a href="../../../../../help-doc.html">Help</a></li> |
| </ul> |
| </div> |
| <div class="subNav"> |
| <ul class="navList"> |
| <li><a href="../../../../../org/apache/hadoop/examples/pi/math/package-summary.html">Prev Package</a></li> |
| <li><a href="../../../../../org/apache/hadoop/filecache/package-summary.html">Next Package</a></li> |
| </ul> |
| <ul class="navList"> |
| <li><a href="../../../../../index.html?org/apache/hadoop/examples/terasort/package-summary.html" target="_top">Frames</a></li> |
| <li><a href="package-summary.html" target="_top">No Frames</a></li> |
| </ul> |
| <ul class="navList" id="allclasses_navbar_top"> |
| <li><a href="../../../../../allclasses-noframe.html">All Classes</a></li> |
| </ul> |
| <div> |
| <script type="text/javascript"><!-- |
| allClassesLink = document.getElementById("allclasses_navbar_top"); |
| if(window==top) { |
| allClassesLink.style.display = "block"; |
| } |
| else { |
| allClassesLink.style.display = "none"; |
| } |
| //--> |
| </script> |
| </div> |
| <a name="skip.navbar.top"> |
| <!-- --> |
| </a></div> |
| <!-- ========= END OF TOP NAVBAR ========= --> |
| <div class="header"> |
| <h1 title="Package" class="title">Package org.apache.hadoop.examples.terasort</h1> |
| <div class="docSummary"> |
| <div class="block">This package consists of 3 map/reduce applications for Hadoop to |
| compete in the annual <a |
| href="http://www.hpl.hp.com/hosted/sortbenchmark" target="_top">terabyte sort</a> |
| competition.</div> |
| </div> |
| <p>See: <a href="#package.description">Description</a></p> |
| </div> |
| <div class="contentContainer"><a name="package.description"> |
| <!-- --> |
| </a> |
| <h2 title="Package org.apache.hadoop.examples.terasort Description">Package org.apache.hadoop.examples.terasort Description</h2> |
| <div class="block">This package consists of 3 map/reduce applications for Hadoop to |
| compete in the annual <a |
| href="http://www.hpl.hp.com/hosted/sortbenchmark" target="_top">terabyte sort</a> |
| competition. |
| |
| <ul> |
| <li><b>TeraGen</b> is a map/reduce program to generate the data. |
| <li><b>TeraSort</b> samples the input data and uses map/reduce to |
| sort the data into a total order. |
| <li><b>TeraValidate</b> is a map/reduce program that validates the |
| output is sorted. |
| </ul> |
| |
| <p> |
| |
| <b>TeraGen</b> generates output data that is byte for byte |
| equivalent to the C version including the newlines and specific |
| keys. It divides the desired number of rows by the desired number of |
| tasks and assigns ranges of rows to each map. The map jumps the random |
| number generator to the correct value for the first row and generates |
| the following rows. |
| |
| <p> |
| |
| <b>TeraSort</b> is a standard map/reduce sort, except for a custom |
| partitioner that uses a sorted list of <i>N-1</i> sampled keys that define |
| the key range for each reduce. In particular, all keys such that |
| <i>sample[i-1] <= key < sample[i]</i> are sent to reduce |
| <i>i</i>. This guarantees that the output of reduce <i>i</i> are all |
| less than the output of reduce <i>i+1</i>. To speed up the |
| partitioning, the partitioner builds a two level trie that quickly |
| indexes into the list of sample keys based on the first two bytes of |
| the key. TeraSort generates the sample keys by sampling the input |
| before the job is submitted and writing the list of keys into HDFS. |
| The input and output format, which are used by all 3 applications, |
| read and write the text files in the right format. The output of the |
| reduce has replication set to 1, instead of the default 3, because the |
| contest does not require the output data be replicated on to multiple |
| nodes. |
| |
| <p> |
| |
| <b>TeraValidate</b> ensures that the output is globally sorted. It |
| creates one map per a file in the output directory and each map ensures that |
| each key is less than or equal to the previous one. The map also generates |
| records with the first and last keys of the file and the reduce |
| ensures that the first key of file <i>i</i> is greater that the last key of |
| file <i>i-1</i>. Any problems are reported as output of the reduce with the |
| keys that are out of order. |
| |
| <p> |
| |
| In May 2008, Owen O'Malley ran this code on a 910 node cluster and |
| sorted the 10 billion records (1 TB) in 209 seconds (3.48 minutes) to |
| win the annual general purpose (daytona) |
| <a href="http://www.hpl.hp.com/hosted/sortbenchmark/">terabyte sort |
| benchmark</a>. |
| |
| <p> |
| |
| The cluster statistics were: |
| <ul> |
| <li> 910 nodes |
| <li> 4 dual core Xeons @ 2.0ghz per a node |
| <li> 4 SATA disks per a node |
| <li> 8G RAM per a node |
| <li> 1 gigabit ethernet on each node |
| <li> 40 nodes per a rack |
| <li> 8 gigabit ethernet uplinks from each rack to the core |
| <li> Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18) |
| <li> Sun Java JDK 1.6.0_05-b13 |
| </ul> |
| |
| <p> |
| |
| The test was on Hadoop trunk (pre-0.18) patched with <a |
| href="http://issues.apache.org/jira/browse/HADOOP-3443">HADOOP-3443</a> |
| and <a |
| href="http://issues.apache.org/jira/browse/HADOOP-3446">HADOOP-3446</a>, |
| which were required to remove intermediate writes to disk. |
| TeraGen used |
| 1800 tasks to generate a total of 10 billion rows in HDFS, with a |
| block size of 1024 MB. |
| TeraSort was configured with 1800 maps and 1800 reduces, and |
| <i>mapreduce.task.io.sort.mb</i>, |
| <i>mapreduce.task.io.sort.factor</i>, |
| <i>fs.inmemory.size.mb</i>, and task heap size |
| sufficient that transient data was never spilled to disk, other at the |
| end of the map. The sampler looked at 100,000 keys to determine the |
| reduce boundaries, which lead to imperfect balancing with reduce |
| outputs ranging from 337 MB to 872 MB.</div> |
| </div> |
| <!-- ======= START OF BOTTOM NAVBAR ====== --> |
| <div class="bottomNav"><a name="navbar.bottom"> |
| <!-- --> |
| </a> |
| <div class="skipNav"><a href="#skip.navbar.bottom" title="Skip navigation links">Skip navigation links</a></div> |
| <a name="navbar.bottom.firstrow"> |
| <!-- --> |
| </a> |
| <ul class="navList" title="Navigation"> |
| <li><a href="../../../../../overview-summary.html">Overview</a></li> |
| <li class="navBarCell1Rev">Package</li> |
| <li>Class</li> |
| <li><a href="package-use.html">Use</a></li> |
| <li><a href="package-tree.html">Tree</a></li> |
| <li><a href="../../../../../deprecated-list.html">Deprecated</a></li> |
| <li><a href="../../../../../index-all.html">Index</a></li> |
| <li><a href="../../../../../help-doc.html">Help</a></li> |
| </ul> |
| </div> |
| <div class="subNav"> |
| <ul class="navList"> |
| <li><a href="../../../../../org/apache/hadoop/examples/pi/math/package-summary.html">Prev Package</a></li> |
| <li><a href="../../../../../org/apache/hadoop/filecache/package-summary.html">Next Package</a></li> |
| </ul> |
| <ul class="navList"> |
| <li><a href="../../../../../index.html?org/apache/hadoop/examples/terasort/package-summary.html" target="_top">Frames</a></li> |
| <li><a href="package-summary.html" target="_top">No Frames</a></li> |
| </ul> |
| <ul class="navList" id="allclasses_navbar_bottom"> |
| <li><a href="../../../../../allclasses-noframe.html">All Classes</a></li> |
| </ul> |
| <div> |
| <script type="text/javascript"><!-- |
| allClassesLink = document.getElementById("allclasses_navbar_bottom"); |
| if(window==top) { |
| allClassesLink.style.display = "block"; |
| } |
| else { |
| allClassesLink.style.display = "none"; |
| } |
| //--> |
| </script> |
| </div> |
| <a name="skip.navbar.bottom"> |
| <!-- --> |
| </a></div> |
| <!-- ======== END OF BOTTOM NAVBAR ======= --> |
| <p class="legalCopy"><small>Copyright © 2021 <a href="https://www.apache.org">Apache Software Foundation</a>. All rights reserved.</small></p> |
| </body> |
| </html> |