| <?xml version="1.0"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd"> |
| |
| <document> |
| |
| <header> |
| <title>Rumen</title> |
| </header> |
| |
| <body> |
| <!-- |
| Overview [What is Rumen and why is it needed?] |
| --> |
| <section id="overview"> |
| <title>Overview</title> |
| |
| <p><em>Rumen</em> is a data extraction and analysis tool built for |
| <em>Apache Hadoop</em>. <em>Rumen</em> mines <em>JobHistory</em> logs to |
| extract meaningful data and stores it in an easily-parsed, condensed |
| format or <em>digest</em>. The raw trace data from MapReduce logs are |
| often insufficient for simulation, emulation, and benchmarking, as these |
| tools often attempt to measure conditions that did not occur in the |
| source data. For example, if a task ran locally in the raw trace data |
| but a simulation of the scheduler elects to run that task on a remote |
| rack, the simulator requires a runtime its input cannot provide. |
| To fill in these gaps, Rumen performs a statistical analysis of the |
| digest to estimate the variables the trace doesn't supply. Rumen traces |
| drive both Gridmix (a benchmark of Hadoop MapReduce clusters) and Mumak |
| (a simulator for the JobTracker). |
| </p> |
| |
| <!-- |
| Why is Rumen needed? |
| --> |
| <section> |
| <title>Motivation</title> |
| |
| <ul> |
| <li>Extracting meaningful data from <em>JobHistory</em> logs is a common |
| task for any tool built to work on <em>MapReduce</em>. It |
| is tedious to write a custom tool that is so tightly coupled with |
| the <em>MapReduce</em> framework. Hence there is a need for a |
| built-in tool that performs the framework-level tasks of log parsing and |
| analysis. Such a tool insulates external systems that depend on |
| job history from changes made to the job history format. |
| </li> |
| <li>Performing statistical analysis of various attributes of a |
| <em>MapReduce Job</em>, such as <em>task runtimes, task failures, |
| etc.</em>, is another common task that benchmarking |
| and simulation tools might need. <em>Rumen</em> generates |
| <a href="http://en.wikipedia.org/wiki/Cumulative_distribution_function"> |
| <em>Cumulative Distribution Functions (CDF)</em> |
| </a> for the Map/Reduce task runtimes. |
| A runtime CDF can be used to extrapolate the task runtime of |
| incomplete, missing, and synthetic tasks. Similarly, a CDF is also |
| computed for the total number of successful tasks for every attempt. |
| |
| </li> |
| </ul> |
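| <p>The extrapolation described above can be illustrated with a minimal |
| Python sketch. It is not Rumen's implementation: the function names and |
| the sample runtimes are hypothetical. It only shows how an empirical CDF |
| built from observed task runtimes can supply a plausible runtime for a |
| task whose own runtime is missing.</p> |

```python
import bisect


def empirical_cdf(samples):
    """Return the sorted samples and their cumulative probabilities."""
    ordered = sorted(samples)
    n = len(ordered)
    probs = [(i + 1) / n for i in range(n)]
    return ordered, probs


def runtime_at_quantile(ordered, probs, q):
    """Smallest observed runtime whose cumulative probability is >= q."""
    idx = bisect.bisect_left(probs, q)
    return ordered[min(idx, len(ordered) - 1)]


# Hypothetical map-task runtimes in milliseconds (illustrative, not real data).
map_runtimes_ms = [1200, 800, 1500, 950, 2000]
values, probs = empirical_cdf(map_runtimes_ms)

# A task with no recorded runtime could be assigned, say, the median.
median_runtime = runtime_at_quantile(values, probs, 0.5)
```

| <p>A simulator could draw the quantile at random instead of fixing it at |
| 0.5, so that synthetic tasks follow the observed distribution.</p> |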
| </section> |
| |
| <!-- |
| Basic high level view of components |
| --> |
| <section> |
| <title>Components</title> |
| |
| <p><em>Rumen</em> consists of two components:</p> |
| |
| <ul> |
| <li><em>Trace Builder</em>: |
| Converts <em>JobHistory</em> logs into an easily-parsed format. |
| Currently <code>TraceBuilder</code> outputs the trace in |
| <a href="http://www.json.org/"><em>JSON</em></a> |
| format. |
| </li> |
| <li><em>Folder</em>: |
| A utility to scale the input trace. A trace obtained from |
| <em>TraceBuilder</em> simply summarizes the jobs in the |
| input folders and files. The time-span within which all the jobs in |
| a given trace finish can be considered as the trace runtime. |
| <em>Folder</em> can be used to scale the runtime of a trace. |
| Decreasing the trace runtime might involve dropping some jobs from |
| the input trace and scaling down the runtime of remaining jobs. |
| Increasing the trace runtime might involve adding some dummy jobs to |
| the resulting trace and scaling up the runtime of individual jobs. |
| </li> |
| |
| </ul> |
| </section> |
| </section> |
| |
| <!-- |
| Usage [How to run Rumen? What are the various configuration parameters?] |
| --> |
| <section id="usage"> |
| <title>How to use <em>Rumen</em>?</title> |
| |
| <p>Converting <em>JobHistory</em> logs into a desired job-trace consists of |
| two steps:</p> |
| <ol> |
| <li>Extracting information into an intermediate format</li> |
| <li>Adjusting the job-trace obtained from the intermediate trace to |
| have the desired properties.</li> |
| </ol> |
| |
| <note>Extracting information from <em>JobHistory</em> logs is a one-time |
| operation. The resulting trace, called the <em>Gold Trace</em>, can be reused to |
| generate traces with desired values of properties such as |
| <code>output-duration</code>, <code>concentration</code>, etc. |
| </note> |
| |
| <p><em>Rumen</em> provides two basic commands:</p> |
| <ul> |
| <li><code>TraceBuilder</code></li> |
| <li><code>Folder</code></li> |
| </ul> |
| |
| <p>Firstly, we need to generate the <em>Gold Trace</em>. Hence the first |
| step is to run <code>TraceBuilder</code> on a job-history folder. |
| The output of the <code>TraceBuilder</code> is a job-trace file (and an |
| optional cluster-topology file). In case we want to scale the output, we |
| can use the <code>Folder</code> utility to fold the current trace to the |
| desired length. The remaining part of this section explains these |
| utilities in detail. |
| </p> |
| |
| <note>Examples in this section assume that certain libraries are present |
| in the Java CLASSPATH. See <em>Section-3.2</em> for more details. |
| </note> |
| <!-- |
| TraceBuilder command |
| --> |
| <section> |
| <title>Trace Builder</title> |
| |
| <p><code>Command:</code></p> |
| <source>java org.apache.hadoop.tools.rumen.TraceBuilder [options] &lt;jobtrace-output&gt; &lt;topology-output&gt; &lt;inputs&gt;</source> |
| |
| <p>This command invokes the <code>TraceBuilder</code> utility of |
| <em>Rumen</em>. It converts the JobHistory files into a series of JSON |
| objects and writes them into the <code>&lt;jobtrace-output&gt;</code> |
| file. It also extracts the cluster layout (topology) and writes it to |
| the <code>&lt;topology-output&gt;</code> file. |
| <code>&lt;inputs&gt;</code> represents a space-separated list of |
| JobHistory files and folders. |
| </p> |
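| <p>Because the job-trace file holds a series of JSON objects rather than |
| a single JSON document, a generic JSON loader may reject it. The Python |
| sketch below shows one way to read such a series, using the standard |
| library's <code>json.JSONDecoder.raw_decode</code>. The sample text and |
| its field names (<code>jobID</code>, <code>submitTime</code>) are |
| assumptions made for illustration, not a description of the actual |
| trace schema.</p> |

```python
import json


def iter_json_objects(text):
    """Yield each object from a string holding back-to-back JSON objects."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos != end and text[pos].isspace():
            pos += 1
        if pos == end:
            return
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical two-job trace; the field names are assumed for illustration.
sample = (
    '{"jobID": "job_1", "submitTime": 1000}\n'
    '{"jobID": "job_2", "submitTime": 2500}\n'
)
jobs = list(iter_json_objects(sample))
```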
| |
| <note>1) Inputs and outputs of <code>TraceBuilder</code> are expected to |
| be fully qualified FileSystem paths. Use '<em>file://</em>' |
| to specify files on the <code>local</code> FileSystem and |
| '<em>hdfs://</em>' to specify files on HDFS. Because input files and |
| folders are FileSystem paths, they can be globbed. |
| This is useful for specifying multiple file paths with |
| glob patterns. |
| </note> |
| <note> |
| 2) By default, TraceBuilder does not recursively scan the input |
| folder for job history files. Only the files placed directly |
| under the input folder are considered for generating |
| the trace. To include all the files under the input directory by |
| scanning it recursively, use the <code>-recursive</code> |
| option. |
| </note> |
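| <p>The two notes above can be combined in a small Python sketch that |
| qualifies local paths with <code>file://</code> and assembles a |
| <code>TraceBuilder</code> argument list with options first, followed by |
| the outputs and inputs. The helper names are hypothetical. The sketch |
| only builds the argument list and does not run the command.</p> |

```python
from urllib.parse import quote

TRACE_BUILDER = "org.apache.hadoop.tools.rumen.TraceBuilder"


def local_uri(path):
    """Turn an absolute local path into a fully qualified file:// URI."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path: " + path)
    return "file://" + quote(path)


def trace_builder_command(trace_out, topology_out, inputs, recursive=False):
    """Assemble the argument list: options, then outputs, then inputs."""
    args = ["java", TRACE_BUILDER]
    if recursive:
        args.append("-recursive")
    args += [local_uri(trace_out), local_uri(topology_out)]
    args += [local_uri(p) for p in inputs]
    return args


cmd = trace_builder_command(
    "/home/user/job-trace.json",
    "/home/user/topology.output",
    ["/home/user/logs/history/done"],
    recursive=True,
)
```

| <p>The resulting list could be handed to a process launcher such as |
| <code>subprocess.run</code>, provided the required libraries are on the |
| CLASSPATH.</p> |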
| |
| <p>Cluster topology is used as follows:</p> |
| <ul> |
| <li>To reconstruct the splits and make sure that the |
| distances/latencies seen in the actual run are modeled correctly. |
| </li> |
| <li>To extrapolate splits information for tasks with missing splits |
| details or synthetically generated tasks. |
| </li> |
| </ul> |
| |
| <p><code>Options :</code></p> |
| <table> |
| <tr> |
| <th> Parameter</th> |
| <th> Description</th> |
| <th> Notes </th> |
| </tr> |
| <tr> |
| <td><code>-demuxer</code></td> |
| <td>Used to read the jobhistory files. The default is |
| <code>DefaultInputDemuxer</code>.</td> |
| <td>The demuxer decides how an input file maps to jobhistory file(s). |
| Job history logs and job configuration files are typically small |
| files and can be stored more efficiently when embedded in a |
| container file format such as SequenceFile or TFile. To support such |
| use cases, one can specify a customized Demuxer class that can |
| extract individual job history logs and job configuration files |
| from the source files. |
| </td> |
| </tr> |
| <tr> |
| <td><code>-recursive</code></td> |
| <td>Recursively traverse input paths for job history logs.</td> |
| <td>Use this option to tell the TraceBuilder to |
| recursively scan the input paths and process all the files under them. |
| Note that, by default, only the history logs that are directly under |
| the input folder are considered for generating the trace. |
| </td> |
| </tr> |
| </table> |
| |
| <section> |
| <title>Example</title> |
| <source>java org.apache.hadoop.tools.rumen.TraceBuilder file:///home/user/job-trace.json file:///home/user/topology.output file:///home/user/logs/history/done</source> |
| <p></p> |
| <p>This will analyze all the jobs in |
| <code>/home/user/logs/history/done</code> stored on the |
| <code>local</code> FileSystem and output the jobtraces in |
| <code>/home/user/job-trace.json</code> along with topology |
| information in <code>/home/user/topology.output</code>. |
| </p> |
| </section> |
| </section> |
| |
| <!-- |
| Folder command |
| --> |
| <section> |
| <title>Folder</title> |
| |
| <p><code>Command</code>:</p> |
| <source>java org.apache.hadoop.tools.rumen.Folder [options] [input] [output]</source> |
| |
| <note>Inputs and outputs of <code>Folder</code> are expected to be fully |
| qualified FileSystem paths. Use '<em>file://</em>' to specify |
| files on the <code>local</code> FileSystem and '<em>hdfs://</em>' to |
| specify files on HDFS. |
| </note> |
| |
| <p>This command invokes the <code>Folder</code> utility of |
| <em>Rumen</em>. Folding essentially means that the output duration of |
| the resulting trace is fixed and job timelines are adjusted |
| to respect the final output duration. |
| </p> |
| |
| <p></p> |
| <p><code>Options :</code></p> |
| <table> |
| <tr> |
| <th> Parameter</th> |
| <th> Description</th> |
| <th> Notes </th> |
| </tr> |
| <tr> |
| <td><code>-input-cycle</code></td> |
| <td>Defines the basic unit of time for the folding operation. There is |
| no default value for <code>input-cycle</code>. |
| <strong>Input cycle must be provided</strong>. |
| </td> |
| <td>'<code>-input-cycle 10m</code>' |
| implies that the whole trace run is sliced into 10-minute |
| intervals. Basic operations are performed on these 10-minute chunks. Note |
| that <em>Rumen</em> understands various time units such as |
| <em>m (minutes), h (hours), d (days), etc.</em> |
| </td> |
| </tr> |
| <tr> |
| <td><code>-output-duration</code></td> |
| <td>This parameter defines the final runtime of the trace. |
| The default value is <strong>1 hour</strong>. |
| </td> |
| <td>'<code>-output-duration 30m</code>' |
| implies that the resulting trace will have a max runtime of |
| 30mins. All the jobs in the input trace file will be folded and |
| scaled to fit this window. |
| </td> |
| </tr> |
| <tr> |
| <td><code>-concentration</code></td> |
| <td>Set the concentration of the resulting trace. Default value is |
| <strong>1</strong>. |
| </td> |
| <td>If the total runtime of the resulting trace is less than the total |
| runtime of the input trace, then the resulting trace contains |
| fewer jobs than the input trace. This |
| essentially means that the output is diluted. To increase the |
| density of jobs, set the concentration to a higher value.</td> |
| </tr> |
| <tr> |
| <td><code>-debug</code></td> |
| <td>Run the Folder in debug mode. By default it is set to |
| <strong>false</strong>.</td> |
| <td>In debug mode, the Folder will print additional statements for |
| debugging. Also the intermediate files generated in the scratch |
| directory will not be cleaned up. |
| </td> |
| </tr> |
| <tr> |
| <td><code>-seed</code></td> |
| <td>Initial seed to the Random Number Generator. By default, a Random |
| Number Generator is used to generate a seed and the seed value is |
| reported back to the user for future use. |
| </td> |
| <td>If an initial seed is passed, then the <code>Random Number |
| Generator</code> generates the random numbers in the same |
| sequence, i.e., the sequence of random numbers remains the same if the |
| same seed is used. Folder uses the Random Number Generator to decide |
| whether or not to emit a job. |
| </td> |
| </tr> |
| <tr> |
| <td><code>-starts-after</code></td> |
| <td>Specify the time (in milliseconds) relative to the start of |
| the trace, after which this utility should consider the |
| jobs from input trace. |
| </td> |
| <td>If this value is specified as 10000, Folder would ignore |
| first 10000ms worth of jobs in the trace and |
| start considering the rest of the jobs in the trace for folding. |
| </td> |
| </tr> |
| <tr> |
| <td><code>-temp-directory</code></td> |
| <td>Temporary directory for the Folder. By default the <strong>output |
| folder's parent directory</strong> is used as the scratch space. |
| </td> |
| <td>This is the scratch space used by Folder. All the |
| temporary files are cleaned up in the end unless the Folder is run |
| in <code>debug</code> mode.</td> |
| </tr> |
| <tr> |
| <td><code>-skew-buffer-length</code></td> |
| <td>Enables <em>Folder</em> to tolerate skewed jobs. |
| The default buffer length is <strong>0</strong>.</td> |
| <td>'<code>-skew-buffer-length 100</code>' |
| indicates that if the jobs appear out of order within a window |
| size of 100, then they will be emitted in-order by the folder. |
| If a job appears out-of-order outside this window, then the Folder |
| will bail out provided <code>-allow-missorting</code> is not set. |
| <em>Folder</em> reports the maximum skew size seen in the |
| input trace for future use. |
| </td> |
| </tr> |
| <tr> |
| <td><code>-allow-missorting</code></td> |
| <td>Enables <em>Folder</em> to tolerate out-of-order jobs. By default |
| mis-sorting is not allowed. |
| </td> |
| <td>If mis-sorting is allowed, then the <em>Folder</em> will ignore |
| out-of-order jobs that cannot be deskewed using a skew buffer of |
| size specified using <code>-skew-buffer-length</code>. If |
| mis-sorting is not allowed, then the Folder will bail out if the |
| skew buffer is incapable of tolerating the skew. |
| </td> |
| </tr> |
| </table> |
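| <p>Values such as <code>10m</code> or <code>1h</code> in the table above |
| combine a number with a unit suffix. The Python sketch below shows how |
| such values might be converted to milliseconds. It is an assumption-laden |
| illustration, not Rumen's parser: it covers only the <em>m</em>, |
| <em>h</em>, and <em>d</em> units named above, and the function name is |
| hypothetical.</p> |

```python
import re

# Only the units named in the options table are covered in this sketch.
UNIT_MS = {
    "m": 60 * 1000,
    "h": 60 * 60 * 1000,
    "d": 24 * 60 * 60 * 1000,
}


def duration_to_ms(text):
    """Convert a value such as '10m' or '1h' to milliseconds."""
    match = re.fullmatch(r"(\d+)([mhd])", text.strip())
    if match is None:
        raise ValueError("unsupported duration: " + repr(text))
    return int(match.group(1)) * UNIT_MS[match.group(2)]


input_cycle_ms = duration_to_ms("20m")
output_duration_ms = duration_to_ms("1h")

# Number of whole 20-minute input cycles that fit in a 1-hour output window.
cycles_per_output = output_duration_ms // input_cycle_ms
```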
| |
| <section> |
| <title>Examples</title> |
| <section> |
| <title>Folding an input trace with 10 hours of total runtime to |
| generate an output trace with 1 hour of total runtime</title> |
| <source>java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json</source> |
| <p></p> |
| <p>If the folded jobs are out of order then the command |
| will bail out. |
| </p> |
| </section> |
| |
| <section> |
| <title>Folding an input trace with 10 hours of total runtime to |
| generate an output trace with 1 hour of total runtime and |
| tolerate some skewness |
| </title> |
| <source>java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -allow-missorting -skew-buffer-length 100 file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json</source> |
| <p></p> |
| <p>If the folded jobs are out of order, then at most |
| 100 jobs will be de-skewed. If a job is <em>out-of-order</em> beyond |
| this window, it is ignored rather than causing the command to bail out, |
| because <code>-allow-missorting</code> is set. |
| </p> |
| </section> |
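| <p>The interaction between <code>-skew-buffer-length</code> and |
| <code>-allow-missorting</code> described in the options table can be |
| sketched in Python with a bounded min-heap. This is an illustrative |
| model under stated assumptions, not the Folder's actual algorithm: jobs |
| are reduced to their submit times, and the function name is |
| hypothetical.</p> |

```python
import heapq


def deskew(submit_times, buffer_length, allow_missorting=False):
    """Emit submit times in order, using a bounded buffer to absorb skew."""
    heap = []
    emitted = []
    last = None

    def emit(t):
        nonlocal last
        if last is not None and last > t:
            # The job arrived too late for the buffer to reorder it.
            if allow_missorting:
                return  # ignore the out-of-order job
            raise ValueError("job out of order beyond the skew buffer")
        emitted.append(t)
        last = t

    for t in submit_times:
        heapq.heappush(heap, t)
        # Hold back at most buffer_length jobs; release the earliest one.
        if len(heap) > buffer_length:
            emit(heapq.heappop(heap))
    while heap:
        emit(heapq.heappop(heap))
    return emitted


in_order = deskew([1, 3, 2, 4], buffer_length=1)
tolerant = deskew([5, 6, 7, 1], buffer_length=1, allow_missorting=True)
```

| <p>With a buffer length of 0, every job is emitted as soon as it is |
| read, so any out-of-order job either aborts the run or is dropped.</p> |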
| <section> |
| <title>Folding an input trace with 10 hours of total runtime to |
| generate an output trace with 1 hour of total runtime in debug |
| mode |
| </title> |
| <source>java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -debug -temp-directory file:///tmp/debug file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json</source> |
| <p></p> |
| <p>This will fold the 10hr job-trace file |
| <code>file:///home/user/job-trace.json</code> to finish within 1hr |
| and use <code>file:///tmp/debug</code> as the temporary directory. |
| The intermediate files in the temporary directory will not be cleaned |
| up. |
| </p> |
| </section> |
| |
| <section> |
| <title>Folding an input trace with 10 hours of total runtime to |
| generate an output trace with 1 hour of total runtime with custom |
| concentration. |
| </title> |
| <source>java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -concentration 2 file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json</source> |
| <p></p> |
| <p>This will fold the 10hr job-trace file |
| <code>file:///home/user/job-trace.json</code> to finish within 1hr |
| with a concentration of 2. <code>Example-2.3.2</code> will retain 10% |
| of the jobs. With <em>concentration</em> set to 2, 20% of the total input |
| jobs will be retained. |
| </p> |
| </section> |
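| <p>The 10% and 20% figures quoted above follow from simple arithmetic, |
| sketched here in Python. The formula is inferred from those two figures |
| and is assumed to apply only when the output trace is shorter than the |
| input trace. The function name is hypothetical.</p> |

```python
def retained_fraction(input_hours, output_hours, concentration=1.0):
    """Approximate share of input jobs kept when shrinking a trace."""
    return concentration * output_hours / input_hours


# Folding 10 hours into 1 hour with the default concentration of 1.
default_share = retained_fraction(10, 1)

# The same fold with -concentration 2 keeps twice as many jobs.
doubled_share = retained_fraction(10, 1, concentration=2)
```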
| </section> |
| </section> |
| </section> |
| |
| <!-- |
| Appendix [Resources i.e ppts, jiras, definition etc] |
| --> |
| <section> |
| <title>Appendix</title> |
| |
| <section> |
| <title>Resources</title> |
| <p><a href="https://issues.apache.org/jira/browse/MAPREDUCE-751">MAPREDUCE-751</a> is the main JIRA that introduced <em>Rumen</em> to <em>MapReduce</em>. |
| Look at the MapReduce <a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313617">rumen-component</a> for further details.</p> |
| </section> |
| |
| <section> |
| <title>Dependencies</title> |
| <p><em>Rumen</em> expects certain library <em>JARs</em> to be present in |
| the <em>CLASSPATH</em>. |
| The required libraries are:</p> |
| <ul> |
| <li><code>Hadoop MapReduce Tools</code> (<code>hadoop-mapred-tools-{hadoop-version}.jar</code>)</li> |
| <li><code>Hadoop Common</code> (<code>hadoop-common-{hadoop-version}.jar</code>)</li> |
| <li><code>Apache Commons Logging</code> (<code>commons-logging-1.1.1.jar</code>)</li> |
| <li><code>Apache Commons CLI</code> (<code>commons-cli-1.2.jar</code>)</li> |
| <li><code>Jackson Mapper</code> (<code>jackson-mapper-asl-1.4.2.jar</code>)</li> |
| <li><code>Jackson Core</code> (<code>jackson-core-asl-1.4.2.jar</code>)</li> |
| </ul> |
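| <p>As a convenience, the JARs listed above can be joined into a |
| CLASSPATH value programmatically. The Python sketch below assumes that |
| all six JARs sit in one directory. The directory |
| <code>/opt/hadoop/lib</code> and the version string <code>X.Y.Z</code> |
| are placeholders, not real values, and the function name is |
| hypothetical.</p> |

```python
import os


def rumen_classpath(lib_dir, hadoop_version):
    """Join the JARs Rumen expects into a single CLASSPATH string."""
    jars = [
        "hadoop-mapred-tools-" + hadoop_version + ".jar",
        "hadoop-common-" + hadoop_version + ".jar",
        "commons-logging-1.1.1.jar",
        "commons-cli-1.2.jar",
        "jackson-mapper-asl-1.4.2.jar",
        "jackson-core-asl-1.4.2.jar",
    ]
    return os.pathsep.join(os.path.join(lib_dir, jar) for jar in jars)


# Placeholder directory and version; substitute the values for your install.
classpath = rumen_classpath("/opt/hadoop/lib", "X.Y.Z")
```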
| |
| <note>One simple way to run Rumen is to use the |
| '$HADOOP_PREFIX/bin/hadoop jar' command. |
| </note> |
| </section> |
| </section> |
| </body> |
| </document> |