| % Licensed to the Apache Software Foundation (ASF) under one |
| % or more contributor license agreements. See the NOTICE file |
| % distributed with this work for additional information |
| % regarding copyright ownership. The ASF licenses this file |
| % to you under the Apache License, Version 2.0 (the |
| % "License"); you may not use this file except in compliance |
| % with the License. You may obtain a copy of the License at |
| % |
| % http://www.apache.org/licenses/LICENSE-2.0 |
| % |
| % Unless required by applicable law or agreed to in writing, software |
| % distributed under the License is distributed on an "AS IS" BASIS, |
| % WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| % See the License for the specific language governing permissions and |
| % limitations under the License. |
| \documentclass{article} |
| \usepackage[pdftex]{hyperref} |
| \usepackage[pdftex]{graphicx} |
| |
| \title{Winning a 60 Second Dash with a Yellow Elephant} |
| \author{\href{http://people.apache.org/~omalley}{Owen O'Malley} and |
| \href{http://people.apache.org/~acmurthy}{Arun C. Murthy}\\ |
| \href{http://www.yahoo.com/}{Yahoo!}\\ |
| owen@yahoo-inc.com and acm@yahoo-inc.com} |
| \date{April 2009} |
| \begin{document} |
| \maketitle |
\href{http://hadoop.apache.org/core}{Apache Hadoop} is an open source
software framework that dramatically simplifies writing distributed
data-intensive applications. It provides a distributed file system,
which is modeled after the Google File System\cite{gfs}, and a
map/reduce\cite{mapreduce} implementation that manages distributed
computation. Jim Gray defined a benchmark to compare large sorting
programs. Since the core of map/reduce is a distributed sort, most of
the custom code needed for the benchmark is glue to get the desired
behavior.
| |
| \section{Benchmark Rules} |
| |
Jim Gray's sort benchmark consists of a set of many related
benchmarks, each with its own rules. All of the sort benchmarks
measure the time to sort different numbers of 100 byte records. The
first 10 bytes of each record are the key and the remaining 90 bytes
are the value (a small sketch of this layout follows the rule list
below). The \textbf{minute sort} must finish end to end in less than a
minute. The \textbf{Gray sort} must sort more than 100 terabytes and
must run for at least an hour.
| |
| \begin{itemize} |
| \item The input data must precisely match the data generated by the C |
| data generator. |
\item The input must not be in the operating system's file
cache when the job starts. Under Linux, this requires using the
memory for something else between sorting runs.
| \item The input and output data must not be compressed. |
| \item The output must not overwrite the input. |
| \item The output must be synced to disk. |
\item The 128 bit sum of the CRC32s of each key/value pair must be
calculated for the input and output. Naturally, they must be
identical.
| \item The output may be divided into multiple output files, but it |
| must be totally ordered (simply concatenating the output files must |
| produce the completely sorted output). |
| \item Starting and distributing the application to the cluster must be |
| included in the execution time. |
| \item Any sampling must be included in the execution time. |
| \end{itemize} |
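To make the record layout concrete, here is a minimal Java sketch
(ours, purely for illustration; it is not part of the benchmark code)
that splits a 100 byte record into its 10 byte key and 90 byte value:

\begin{verbatim}
import java.util.Arrays;

/** Illustrative sketch of the benchmark record layout: each record is
 *  100 bytes, the first 10 bytes are the key and the rest is the value. */
public class RecordLayout {
  static final int RECORD_LENGTH = 100;
  static final int KEY_LENGTH = 10;

  static byte[] key(byte[] record) {
    return Arrays.copyOfRange(record, 0, KEY_LENGTH);
  }

  static byte[] value(byte[] record) {
    return Arrays.copyOfRange(record, KEY_LENGTH, RECORD_LENGTH);
  }
}
\end{verbatim}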
| |
| \section{Hadoop implementation} |
| |
| We extended the programs from last year to create and manipulate the |
| new binary format and match the new rules. There are now 4 Hadoop |
| map/reduce applications to support the benchmark: |
| \begin{enumerate} |
| \item \textbf{TeraGen} is a map/reduce program to generate the data. |
| \item \textbf{TeraSort} samples the input data and uses map/reduce to |
| sort the data into a total order. |
\item \textbf{TeraSum} is a map/reduce program that computes the 128 bit
sum of the CRC32 of each key/value pair.
\item \textbf{TeraValidate} is a map/reduce program that validates that
the output is sorted and computes the sum of the checksums in the same
manner as TeraSum.
| \end{enumerate} |
| The update to the terasort programs will be checked in as |
| \href{http://issues.apache.org/jira/browse/HADOOP-5716}{HADOOP-5716}. |
| |
| \textbf{TeraGen} generates input data for the sort that is byte for byte |
| equivalent to the C version that was released in March of 2009, |
| including specific keys and values. It divides the desired number of |
| rows by the desired number of tasks and assigns ranges of rows to each |
| map. The map jumps the random number generator to the correct value |
| for the first row and generates the following rows. |
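As an illustration of the row range assignment, the sketch below shows
how each map could compute its starting row and row count. The class
and method names are ours and the actual TeraGen code differs; this is
a sketch of the idea only.

\begin{verbatim}
/** Sketch of TeraGen-style row range assignment. Illustrative only. */
public class RowRangeSketch {
  /** Returns {firstRow, rowCount} for the given map. */
  static long[] rangeForMap(long totalRows, int numMaps, int mapId) {
    long rowsPerMap = totalRows / numMaps;
    long firstRow = rowsPerMap * mapId;
    // the last map also generates any leftover rows
    long rowCount = (mapId == numMaps - 1) ? totalRows - firstRow
                                           : rowsPerMap;
    return new long[] { firstRow, rowCount };
  }

  public static void main(String[] args) {
    // 10 billion 100 byte rows is the terabyte dataset
    long[] range = rangeForMap(10000000000L, 8000, 42);
    System.out.println("map 42: first row " + range[0]
        + ", " + range[1] + " rows");
    // A real generator would then jump its random number generator to
    // the state for the first row, so maps run independently.
  }
}
\end{verbatim}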
| |
| \textbf{TeraSort} is a standard map/reduce sort, except for a custom |
| partitioner that ensures that all of the keys in reduce $N$ are after |
| all of the keys in reduce $N-1$. This is a requirement of the contest |
| so that the output of the sort is totally ordered, even if it is |
| divided up by reduce. |
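As a rough illustration of that requirement, a partitioner can binary
search a sorted list of sampled split points so that every key sent to
reduce $N$ sorts after every key sent to reduce $N-1$. The sketch below
uses our own names and is not the actual TeraSort partitioner, which is
more heavily optimized.

\begin{verbatim}
/** Sketch of a total-order partitioner: sorted split points divide the
 *  key space so keys in reduce N sort after keys in reduce N-1. */
public class TotalOrderSketch {
  private final byte[][] splitPoints;  // sorted; length = numReduces - 1

  TotalOrderSketch(byte[][] splitPoints) {
    this.splitPoints = splitPoints;
  }

  /** Binary search for the reduce that should receive this key. */
  int getPartition(byte[] key) {
    int lo = 0, hi = splitPoints.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (compareUnsigned(key, splitPoints[mid]) < 0) {
        hi = mid;
      } else {
        lo = mid + 1;
      }
    }
    return lo;
  }

  /** Lexicographic comparison of keys as unsigned bytes. */
  private static int compareUnsigned(byte[] a, byte[] b) {
    for (int i = 0; i < Math.min(a.length, b.length); i++) {
      int diff = (a[i] & 0xff) - (b[i] & 0xff);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }
}
\end{verbatim}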
| |
| We wrote an input and output format, used by all 4 applications to |
| read and write the files in the new format. |
| |
| \textbf{TeraSum} computes the 128 bit sum of the CRC32 of each |
| key/value pair. Each map computes the sum of its input and emits a |
| single 128 bit sum. There is a single reduce that adds the sums from |
| each map. We used this program on the input directory to calculate the |
| sum of the checksums of each key/value pair to check the correctness |
| of the output of the sort. We also used TeraSum on a distinct dataset |
| that was larger than the total RAM in the cluster to flush the Linux |
| file cache between runs of the small (500 GB and 1TB) sorts. |
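The checksum arithmetic can be sketched as follows. This is our own
illustrative code built on the standard \texttt{java.util.zip.CRC32},
not the actual TeraSum implementation: each map folds the CRC32 of
every record into a 128 bit accumulator, and the single reduce merges
the partial sums.

\begin{verbatim}
import java.math.BigInteger;
import java.util.zip.CRC32;

/** Sketch of the TeraSum arithmetic: a 128 bit sum (modulo 2^128) of
 *  the CRC32 of each key/value pair. */
public class CrcSumSketch {
  private static final BigInteger MASK =
      BigInteger.ONE.shiftLeft(128).subtract(BigInteger.ONE);
  private BigInteger sum = BigInteger.ZERO;

  /** Fold in the CRC32 of one record (key bytes then value bytes). */
  void add(byte[] key, byte[] value) {
    CRC32 crc = new CRC32();
    crc.update(key);
    crc.update(value);
    sum = sum.add(BigInteger.valueOf(crc.getValue())).and(MASK);
  }

  /** Merge a partial sum from another map, as the single reduce does. */
  void merge(BigInteger partial) {
    sum = sum.add(partial).and(MASK);
  }

  BigInteger get() {
    return sum;
  }
}
\end{verbatim}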
| |
\textbf{TeraValidate} ensures that the output is globally sorted. It
creates one map per file in the output directory and each map
ensures that each key is greater than or equal to the previous one. The
map also generates records with the first and last keys of the file
and the reduce ensures that the first key of file $i$ is greater than
the last key of file $i-1$. Any problems are reported as output of the
reduce with the keys that are out of order. Additionally, TeraValidate
calculates the sum of the checksums of the output directory.
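The ordering checks can be sketched like this (our own illustrative
code and names, not the actual TeraValidate implementation):

\begin{verbatim}
/** Sketch of the TeraValidate ordering checks. Illustrative only. */
public class ValidateSketch {
  /** Within one file, every key must be >= the previous key. Returns
   *  -1 if the keys are in order, else the index of the first bad key. */
  static int checkFile(byte[][] keys) {
    for (int i = 1; i < keys.length; i++) {
      if (compare(keys[i - 1], keys[i]) > 0) {
        return i;
      }
    }
    return -1;
  }

  /** Across files, the first key of file i must be greater than the
   *  last key of file i-1. */
  static boolean boundaryOk(byte[] lastOfPrevFile, byte[] firstOfNextFile) {
    return compare(firstOfNextFile, lastOfPrevFile) > 0;
  }

  /** Lexicographic comparison of keys as unsigned bytes. */
  private static int compare(byte[] a, byte[] b) {
    for (int i = 0; i < Math.min(a.length, b.length); i++) {
      int diff = (a[i] & 0xff) - (b[i] & 0xff);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }
}
\end{verbatim}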
| |
| \section{Hardware and Operating System} |
| |
| We ran our benchmarks on Yahoo's Hammer cluster. Hammer's hardware is |
| very similar to the hardware that we used in last year's terabyte |
| sort. The hardware and operating system details are: |
| |
\begin{itemize}
\item approximately 3800 nodes (in such a large cluster, nodes are
always down)
\item 2 quad core Xeons @ 2.5 GHz per node
\item 4 SATA disks per node
\item 8 GB RAM per node (upgraded to 16 GB before the petabyte sort)
\item 1 gigabit Ethernet on each node
\item 40 nodes per rack
\item 8 gigabit Ethernet uplinks from each rack to the core
\item Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18)
\item Sun Java JDK (1.6.0\_05-b13 and 1.6.0\_13-b03) (32 and 64 bit)
\end{itemize}
| |
| We hit a JVM bug in 1.6.0\_05-b13 on the larger sorts (100TB and 1PB) |
| and switched over to the later JVM, which resolved the issue. For the |
| larger sorts, we used 64 bit JVMs for the Name Node and Job Tracker. |
| |
| \section{Software and Configuration} |
| |
| The version of Hadoop we used was a private branch of trunk that was |
| started in January 2009, which is after the 0.20 branch was feature |
| frozen. We used git to manage our branch and it allowed us to easily |
| coordinate our work, track our changes, and resynchronize with the |
| current Hadoop trunk. |
| |
| The changes include: |
| |
| \begin{enumerate} |
| |
| \item Updated the terasort example in the Hadoop code base to match |
| the dataset defined by the rule changes in the benchmark from March |
| of 2009. |
| (\href{http://issues.apache.org/jira/browse/HADOOP-5716}{HADOOP-5716}) |
| |
| \item We reimplemented the reducer side of Hadoop's shuffle. The |
| redesign improved the performance of the shuffle and removed |
| bottlenecks and over-throttling. It also made the code more |
| maintainable and understandable by breaking a 3000 line Java file |
| into multiple classes with a clean set of interfaces. |
| (\href{http://issues.apache.org/jira/browse/HADOOP-5223}{HADOOP-5223}) |
| |
| \item The new shuffle also fetches multiple map outputs from the same |
| node over each connection rather than one at a time. Fetching |
| multiple map outputs at the same time avoids connection setup costs |
| and also avoids the round trip while the server responds to the request. |
| (\href{http://issues.apache.org/jira/browse/HADOOP-1338}{HADOOP-1338}) |
| |
\item We made the timeouts on the shuffle connections configurable and
shortened them for the small sorts. We observed cases where the
connections for the shuffle would hang until the timeout, which made
low latency jobs impossibly long.
(\href{http://issues.apache.org/jira/browse/HADOOP-5789}{HADOOP-5789})
| |
\item We set TCP no-delay and used more frequent pings between the Task
and the Task Tracker to reduce the latency of detecting problems.
(\href{http://issues.apache.org/jira/browse/HADOOP-5788}{HADOOP-5788})
| |
\item We added some protection code to detect incorrect data being
transmitted in the shuffle and keep it from causing the reduce to
fail. The underlying corruption appears to be either a JVM NIO bug or
a Jetty bug that likely affects 0.20 and trunk under heavy load.
(\href{http://issues.apache.org/jira/browse/HADOOP-5783}{HADOOP-5783})
| |
| \item We used LZO compression on the map outputs. On the new dataset, LZO |
| compresses down to 45\% of the original size. By comparison, the |
| dataset from last year compresses to 20\% of the original size. Last |
| year, the shuffle would run out of direct buffers if we used |
| compression on the map outputs. |
| |
| \item We implemented memory to memory merges in the reduce during the |
| shuffle to combine the map outputs in memory before we finish the |
| shuffle, thereby reducing the work needed when the reduce is |
| running. |
| |
\item We multi-threaded the sampling code that read the input set to
find the partition points between the reduces. We also wrote a
simple partitioner that assumes the keys are evenly distributed (a
small sketch of this idea appears after this list). Since the new
dataset does not require sampling, the simple partitioner produces
very even partitions.
(\href{http://issues.apache.org/jira/browse/HADOOP-4946}{HADOOP-4946})
| |
\item On the smaller clusters, we configured the system with faster
heartbeat cycles from the Task Trackers to the Job Tracker (it
defaults to 10 seconds / 1000 nodes, but we made it configurable and
brought it down to 2 seconds / 1000 nodes to provide lower latency).
(\href{http://issues.apache.org/jira/browse/HADOOP-5784}{HADOOP-5784})
| |
\item Typically the Job Tracker assigns tasks to Task Trackers on a
first come, first served basis. This greedy assignment of tasks did
not lead to good data locality. However, by taking a global view and
placing all of the map tasks at once, the system achieves much better
locality. Rather than implement global scheduling for all of Hadoop,
which would be much harder, we implemented a global scheduler for
the terasort example in the input format. Basically, the input
format computes the splits and assigns work to the nodes that have
the fewest blocks first. For a node that has more blocks
than map slots, it picks the blocks that have the fewest remaining
locations. This greedy global algorithm seems to get very good
locality. The input format would schedule the maps and then change
the input split descriptions to only have a single location instead
of the original 3. This increased task locality by 40\% or so over
the greedy scheduler.
| |
| \item Hadoop 0.20 added setup and cleanup tasks. Since they are not |
| required for the sort benchmarks, we allow them to be disabled to |
| reduce the latency of starting and stopping the job. |
| (\href{http://issues.apache.org/jira/browse/HADOOP-5785}{HADOOP-5785}) |
| |
| \item We discovered a performance problem where in some contexts the |
| cost of using the JNI-based CRC32 was very high. By implementing it |
| in pure Java, the average case is a little slower, but the worst |
| case is much better. |
| (\href{http://issues.apache.org/jira/browse/HADOOP-5598}{HADOOP-5598}) |
| |
| \item We found and removed some hard-coded wait loops from the |
| framework that don't matter for large jobs, but can seriously slow |
| down low latency jobs. |
| |
\item We allowed setting the logging level for the tasks, so that we
could cut down on logging. When running for ``real'' we configured the
logging level to WARN instead of the default INFO. Reducing the
amount of logging has a huge impact on the performance of the
system, but obviously makes debugging and analysis much harder.
(\href{http://issues.apache.org/jira/browse/HADOOP-5786}{HADOOP-5786})
| |
| \item One optimization that we didn't finish is to optimize the job |
| planning code. Currently, it uses an RPC to the Name Node for each |
| input file, which we have observed taking a substantial amount of |
| time. For the terabyte sort, our investigations show that we |
| could save about 4 seconds out of the 8 that were spent on setting |
| up the job. |
| (\href{http://issues.apache.org/jira/browse/HADOOP-5795}{HADOOP-5795}) |
| |
| \end{enumerate} |
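As a sketch of the simple partitioner idea mentioned above (our own
illustration, not the actual Hadoop code): when the keys are uniformly
distributed, the leading bytes of a key can be scaled directly into a
reduce number, with no sampling at all.

\begin{verbatim}
/** Sketch of a partitioner for uniformly distributed keys: interpret
 *  the first four key bytes as an unsigned integer and scale it into
 *  the number of reduces. Illustrative only. */
public class UniformPartitionSketch {
  static int getPartition(byte[] key, int numReduces) {
    long prefix = 0;
    for (int i = 0; i < 4; i++) {        // first 4 bytes, big endian
      prefix = (prefix << 8) | (i < key.length ? (key[i] & 0xff) : 0);
    }
    // prefix is in [0, 2^32); scale it into [0, numReduces)
    return (int) ((prefix * numReduces) >>> 32);
  }
}
\end{verbatim}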
| |
| \section{Results} |
| |
| Hadoop has made a lot of progress in the last year and we were able to |
| run much lower latency jobs as well as much larger jobs. Note that in |
| any large cluster and distributed application, there are a lot of |
| moving pieces and thus we have seen a wide variation in execution |
| times. As Hadoop evolves and becomes more graceful in the presence of |
| hardware degradation and failure, this variation should smooth |
| out. The best times for each of the listed sort sizes were: |
| \\ |
| |
| \begin{tabular}{| c | c | c | c | c | c |} |
| \hline |
| Bytes & Nodes & Maps & Reduces & Replication & Time \\ |
| \hline |
| $5*10^{11}$ & 1406 & 8000 & 2600 & 1 & 59 seconds \\ |
| $10^{12}$ & 1460 & 8000 & 2700 & 1 & 62 seconds \\ |
| $10^{14}$ & 3452 & 190,000 & 10,000 & 2 & 173 minutes \\ |
| $10^{15}$ & 3658 & 80,000 & 20,000 & 2 & 975 minutes \\ |
| \hline |
| \end{tabular}\\ |
| |
Within the rules for the 2009 Gray sort, our 500 GB sort set a new
record for the minute sort and the 1 PB sort set a new record of 1.03
TB/minute. The 62 second terabyte sort would have set a new record,
but the terabyte benchmark that we won last year has been
retired. (Clearly the minute sort and terabyte sort are rapidly
converging, and thus it is not a loss.) One piece of trivia is that
only the petabyte dataset had any duplicate keys (40 of them).
| |
Because the smaller sorts needed lower latency and a faster network, we
only used part of the cluster for those runs. In particular, instead
of our normal 5:1 oversubscription between racks, we limited it to 16
nodes in each rack for a 2:1 oversubscription. The smaller runs can
also use an output replication of 1: because they only take minutes to
run and use smaller clusters, the likelihood of a node failing during
the run is fairly low. On the larger runs, failure is expected and thus
a replication of 2 is required. HDFS protects against data loss during
a rack failure by writing the second replica on a different rack, and
thus writing the second replica is relatively slow.
| |
We've included the timelines for the jobs counting from the job
submission at the Job Tracker. The diagrams show the number of tasks
running at each point in time. While maps only have a single phase,
the reduces have three: \textbf{shuffle}, \textbf{merge}, and
\textbf{reduce}. The shuffle is the transfer of the data from the
maps. Merge doesn't happen in these benchmarks, because none of the
reduces need multiple levels of merges. Finally, the reduce phase is
where the final merge and writing to HDFS happens. We've also included
a category named \textbf{waste} that represents task attempts that
were running but ended up either failing or being killed (often as
speculatively executed task attempts). The job logs and configuration
for the four runs, which are the raw data for the charts, are
available at
\href{http://people.apache.org/~omalley/tera-2009/}{http://people.apache.org/\textasciitilde omalley/tera-2009/}.
| |
If you compare this year's charts to last year's, you'll notice that
tasks are launching much faster now. Last year we only launched one
task per heartbeat, so it took 40 seconds to get all of the tasks
launched. Now, Hadoop will fill up a Task Tracker in a single
heartbeat. Reducing that job launch overhead is very important
for getting runs under a minute.
| |
As with last year, we ran with significantly larger tasks than the
defaults for Hadoop. Even with the new more aggressive shuffle,
minimizing the number of transfers (maps * reduces) is very important
to the performance of the job. Notice that in the petabyte sort, each
map is processing 15 GB instead of the default 128 MB and each reduce
is handling 50 GB. When we ran the petabyte sort with a more typical
value of 1.5 GB per map, it took 40 hours to finish. Therefore, to
increase throughput, it makes sense to consider increasing the default
block size, which translates into the default map size, to at least
1 GB.
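To make the scale of the shuffle concrete, the petabyte run used
80,000 maps and 20,000 reduces, so the shuffle has to move up to
$80{,}000 \times 20{,}000 = 1.6 \times 10^{9}$ map output segments,
each averaging only about $10^{15} / (1.6 \times 10^{9}) \approx 625$
KB of uncompressed data; larger maps and reduces mean fewer, larger,
and therefore cheaper transfers.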
| |
| \section{Comments on the Rule Changes} |
| |
The group that runs the Gray Sort Benchmark made very substantial
changes to the rules this year. The changes were not announced, but
rather appeared on the website in March. We feel that it was too late
to make rule changes and that the changes should have been deferred
until next year. We'd also like to point out that while most of the
changes to the data generator were positive, it was a poor choice to
remove the skewed distribution of the keys. The previously skewed
distribution required sampling of the input to pick good partition
points between the reduces. The keys in the current dataset are so
completely random that sampling is counterproductive and yields less
even distributions between the reduces.
| |
| \bibliographystyle{abbrv} |
| \bibliography{tera} |
| |
| \begin{figure}[!p] |
| \includegraphics[width=4.21in]{500GBTaskTime.png} |
| \caption{500 GB sort tasks across time}\label{500GbTimeline} |
| \end{figure} |
| |
| \begin{figure}[!p] |
| \includegraphics[width=4.5in]{1TBTaskTime.png} |
| \caption{1 TB sort tasks across time}\label{1TbTimeline} |
| \end{figure} |
| |
| \begin{figure}[!p] |
| \includegraphics[width=4.5in]{100TBTaskTime.png} |
| \caption{100 TB sort tasks across time}\label{100TbTimeline} |
| \end{figure} |
| |
| \begin{figure}[!p] |
| \includegraphics[width=4.5in]{1PBTaskTime.png} |
| \caption{1 PB sort tasks across time}\label{1PbTimeline} |
| \end{figure} |
| |
| \end{document} |