src/docs/src/documentation/content/xdocs/gridmix.xml - hadoop-mapreduce - Git at Google

 <?xml version="1.0"?>
 <!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
 -->
 <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
 <document>
   <header>
     <title>GridMix</title>
   </header>
   <body>
     <section id="overview">
       <title>Overview</title>
       <p>GridMix is a benchmark for Hadoop clusters. It submits a mix of
       synthetic jobs, modeling a profile mined from production loads.</p>
       <p>There exist three versions of the GridMix tool. This document
       discusses the third (checked into <code>src/contrib</code>), distinct
       from the two checked into the <code>src/benchmarks</code> sub-directory.
       While the first two versions of the tool included stripped-down versions
       of common jobs, both were principally saturation tools for stressing the
       framework at scale. In support of a broader range of deployments and
       finer-tuned job mixes, this version of the tool will attempt to model
       the resource profiles of production jobs to identify bottlenecks, guide
       development, and serve as a replacement for the existing GridMix
       benchmarks.</p>
       <p>To run GridMix, you need a MapReduce job trace describing the job mix
       for a given cluster. Such traces are typically generated by Rumen (see
       Rumen documentation). GridMix also requires input data from which the
       synthetic jobs will be reading bytes. The input data need not be in any
       particular format, as the synthetic jobs are currently binary readers.
       If you are running on a new cluster, an optional step generating input
       data may precede the run.</p>
       <p>In order to emulate the load of production jobs from a given cluster
       on the same or another cluster, follow these steps:</p>
       <ol>
 	<li>Locate the job history files on the production cluster. This
 	location is specified by the
 	<code>mapreduce.jobtracker.jobhistory.completed.location</code>
 	configuration property of the cluster.</li>
 	<li>Run Rumen to build a job trace in JSON format for all or select
 	jobs.</li>
 	<li><em>(Optional)</em> Use Rumen to fold this job trace to scale
 	the load.</li>
 	<li>Use GridMix with the job trace on the benchmark cluster.</li>
       </ol>
       <p>Jobs submitted by GridMix have names of the form
       &quot;<code>GRIDMIXnnnnn</code>&quot;, where
       &quot;<code>nnnnn</code>&quot; is a sequence number padded with leading
       zeroes.</p>
     </section>
     <section id="usage">
       <title>Usage</title>
       <p>Basic command-line usage without configuration parameters:</p>
       <source>
 org.apache.hadoop.mapred.gridmix.Gridmix [-generate &lt;size&gt;] [-users &lt;users-list&gt;] &lt;iopath&gt; &lt;trace&gt;
       </source>
       <p>Basic command-line usage with configuration parameters:</p>
       <source>
 org.apache.hadoop.mapred.gridmix.Gridmix \
   -Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
   [-generate &lt;size&gt;] [-users &lt;users-list&gt;] &lt;iopath&gt; &lt;trace&gt;
       </source>
       <note>
 	Configuration parameters like
 	<code>-Dgridmix.client.submit.threads=10</code> and
 	<code>-Dgridmix.output.directory=foo</code> as given above should
 	be used <em>before</em> other GridMix parameters.
       </note>
       <p>The <code>-generate</code> option is used to generate input data
       for the synthetic jobs. It accepts size suffixes, e.g.
       <code>100g</code> will generate 100 * 2<sup>30</sup> bytes.</p>
       <p>The <code>-users</code> option is used to point to a users-list
       file (see <a href="#usersqueues">Emulating Users and Queues</a>).</p>
       <p>The <code>&lt;iopath&gt;</code> parameter is the destination
       directory for generated output and/or the directory from which input
       data will be read. Note that this can either be on the local file-system
       or on HDFS, but it is highly recommended that it be the same as that for
       the original job mix so that GridMix puts the same load on the local
       file-system and HDFS respectively.</p>
       <p>The <code>&lt;trace&gt;</code> parameter is a path to a job trace
       generated by Rumen. This trace can be compressed (it must be readable
       using one of the compression codecs supported by the cluster) or
       uncompressed. Use &quot;-&quot; as the value of this parameter if you
       want to pass an <em>uncompressed</em> trace via the standard
       input-stream of GridMix.</p>
       <p>The class <code>org.apache.hadoop.mapred.gridmix.Gridmix</code> can
       be found in the JAR
       <code>contrib/gridmix/hadoop-$VERSION-gridmix.jar</code> inside your
       Hadoop installation, where <code>$VERSION</code> corresponds to the
       version of Hadoop installed. A simple way of ensuring that this class
       and all its dependencies are loaded correctly is to use the
       <code>hadoop</code> wrapper script in Hadoop:</p>
       <source>
 hadoop jar &lt;gridmix-jar&gt; org.apache.hadoop.mapred.gridmix.Gridmix \
   [-generate &lt;size&gt;] [-users &lt;users-list&gt;] &lt;iopath&gt; &lt;trace&gt;
       </source>
       <p>The supported configuration parameters are explained in the
       following sections.</p>
     </section>
     <section id="cfgparams">
       <title>General Configuration Parameters</title>
       <p/>
       <table>
         <tr>
           <th>Parameter</th>
           <th>Description</th>
         </tr>
         <tr>
           <td>
             <code>gridmix.output.directory</code>
           </td>
           <td>The directory into which output will be written. If specified,
 	  <code>iopath</code> will be relative to this parameter. The
 	  submitting user must have read/write access to this directory. The
 	  user should also be mindful of any quota issues that may arise
 	  during a run. The default is &quot;<code>gridmix</code>&quot;.</td>
         </tr>
         <tr>
           <td>
             <code>gridmix.client.submit.threads</code>
           </td>
           <td>The number of threads submitting jobs to the cluster. This
 	  also controls how many splits will be loaded into memory at a given
 	  time, pending the submit time in the trace. Splits are pre-generated
 	  to hit submission deadlines, so particularly dense traces may want
 	  more submitting threads. However, storing splits in memory is
 	  reasonably expensive, so you should raise this cautiously. The
 	  default is 1 for the SERIAL job-submission policy (see
 	  <a href="#policies">Job Submission Policies</a>) and one more than
 	  the number of processors on the client machine for the other
 	  policies.</td>
         </tr>
 	<tr>
 	  <td>
 	    <code>gridmix.submit.multiplier</code>
 	  </td>
 	  <td>The multiplier to accelerate or decelerate the submission of
 	  jobs. The time separating two jobs is multiplied by this factor.
 	  The default value is 1.0. This is a crude mechanism to size
 	  a job trace to a cluster.</td>
 	</tr>
         <tr>
           <td>
             <code>gridmix.client.pending.queue.depth</code>
           </td>
           <td>The depth of the queue of job descriptions awaiting split
 	  generation. The jobs read from the trace occupy a queue of this
 	  depth before being processed by the submission threads. It is
 	  unusual to configure this. The default is 5.</td>
         </tr>
 	<tr>
 	  <td>
 	    <code>gridmix.gen.blocksize</code>
 	  </td>
 	  <td>The block-size of generated data. The default value is 256
 	  MiB.</td>
 	</tr>
 	<tr>
 	  <td>
 	    <code>gridmix.gen.bytes.per.file</code>
 	  </td>
 	  <td>The maximum bytes written per file. The default value is 1
 	  GiB.</td>
 	</tr>
         <tr>
           <td>
             <code>gridmix.min.file.size</code>
           </td>
           <td>The minimum size of the input files. The default limit is 128
 	  MiB. Tweak this parameter if you see an error-message like
 	  &quot;Found no satisfactory file&quot; while testing GridMix with
 	  a relatively-small input data-set.</td>
         </tr>
         <tr>
           <td>
             <code>gridmix.max.total.scan</code>
           </td>
           <td>The maximum size of the input files. The default limit is 100
 	  TiB.</td>
         </tr>
       </table>
     </section>
     <section id="jobtypes">
       <title>Job Types</title>
       <p>GridMix takes as input a job trace, essentially a stream of
       JSON-encoded job descriptions. For each job description, the submission
       client obtains the original job submission time and for each task in
       that job, the byte and record counts read and written. Given this data,
       it constructs a synthetic job with the same byte and record patterns as
       recorded in the trace. It constructs jobs of two types:</p>
       <table>
         <tr>
           <th>Job Type</th>
           <th>Description</th>
         </tr>
         <tr>
           <td>
             <code>LOADJOB</code>
           </td>
           <td>A synthetic job that emulates the workload mentioned in Rumen
 	  trace. In the current version we are supporting I/O. It reproduces
 	  the I/O workload on the benchmark cluster. It does so by embedding
 	  the detailed I/O information for every map and reduce task, such as
 	  the number of bytes and records read and written, into each
 	  job's input splits. The map tasks further relay the I/O patterns of
 	  reduce tasks through the intermediate map output data.</td>
         </tr>
         <tr>
           <td>
             <code>SLEEPJOB</code>
           </td>
 	  <td>A synthetic job where each task does <em>nothing</em> but sleep
 	  for a certain duration as observed in the production trace. The
 	  scalability of the Job Tracker is often limited by how many
 	  heartbeats it can handle every second. (Heartbeats are periodic
 	  messages sent from Task Trackers to update their status and grab new
 	  tasks from the Job Tracker.) Since a benchmark cluster is typically
 	  a fraction in size of a production cluster, the heartbeat traffic
 	  generated by the slave nodes is well below the level of the
 	  production cluster. One possible solution is to run multiple Task
 	  Trackers on each slave node. This leads to the obvious problem that
 	  the I/O workload generated by the synthetic jobs would thrash the
 	  slave nodes. Hence the need for such a job.</td>
         </tr>
       </table>
       <p>The following configuration parameters affect the job type:</p>
       <table>
         <tr>
           <th>Parameter</th>
           <th>Description</th>
         </tr>
         <tr>
           <td>
             <code>gridmix.job.type</code>
           </td>
           <td>The value for this key can be one of LOADJOB or SLEEPJOB. The
 	  default value is LOADJOB.</td>
         </tr>
         <tr>
           <td>
             <code>gridmix.key.fraction</code>
           </td>
           <td>For a LOADJOB type of job, the fraction of a record used for
 	  the data for the key. The default value is 0.1.</td>
         </tr>
         <tr>
           <td>
             <code>gridmix.sleep.maptask-only</code>
           </td>
           <td>For a SLEEPJOB type of job, whether to ignore the reduce
 	  tasks for the job. The default is <code>false</code>.</td>
         </tr>
         <tr>
           <td>
             <code>gridmix.sleep.fake-locations</code>
           </td>
           <td>For a SLEEPJOB type of job, the number of fake locations
 	  for map tasks for the job. The default is 0.</td>
         </tr>
         <tr>
           <td>
             <code>gridmix.sleep.max-map-time</code>
           </td>
           <td>For a SLEEPJOB type of job, the maximum runtime for map
 	  tasks for the job in milliseconds. The default is unlimited.</td>
         </tr>
         <tr>
           <td>
             <code>gridmix.sleep.max-reduce-time</code>
           </td>
           <td>For a SLEEPJOB type of job, the maximum runtime for reduce
 	  tasks for the job in milliseconds. The default is unlimited.</td>
         </tr>
       </table>
     </section>
     <section id="policies">
       <title>Job Submission Policies</title>
       <p>GridMix controls the rate of job submission. This control can be
       based on the trace information or can be based on statistics it gathers
       from the Job Tracker. Based on the submission policies users define,
       GridMix uses the respective algorithm to control the job submission.
       There are currently three types of policies:</p>
       <table>
         <tr>
           <th>Job Submission Policy</th>
           <th>Description</th>
         </tr>
         <tr>
           <td>
             <code>STRESS</code>
           </td>
           <td>Keep submitting jobs so that the cluster remains under stress.
 	  In this mode we control the rate of job submission by monitoring
 	  the real-time load of the cluster so that we can maintain a stable
 	  stress level of workload on the cluster. Based on the statistics we
 	  gather we define if a cluster is <em>underloaded</em> or
 	  <em>overloaded</em>. We consider a cluster <em>underloaded</em> if
 	  and only if the following three conditions are true:
 	  <ol>
 	    <li>the number of pending and running jobs are under a threshold
 	    TJ</li>
 	    <li>the number of pending and running maps are under threshold
 	    TM</li>
 	    <li>the number of pending and running reduces are under threshold
 	    TR</li>
 	  </ol>
           The thresholds TJ, TM and TR are proportional to the size of the
 	  cluster and map, reduce slots capacities respectively. In case of a
 	  cluster being <em>overloaded</em>, we throttle the job submission.
 	  In the actual calculation we also weigh each running task with its
 	  remaining work - namely, a 90% complete task is only counted as 0.1
 	  in calculation. Finally, to avoid a very large job blocking other
 	  jobs, we limit the number of pending/waiting tasks each job can
 	  contribute.</td>
         </tr>
         <tr>
           <td>
             <code>REPLAY</code>
           </td>
           <td>In this mode we replay the job traces faithfully. This mode
 	  exactly follows the time-intervals given in the actual job
 	  trace.</td>
         </tr>
         <tr>
           <td>
             <code>SERIAL</code>
           </td>
           <td>In this mode we submit the next job only once the job submitted
 	  earlier is completed.</td>
         </tr>
       </table>
       <p>The following configuration parameters affect the job submission
       policy:</p>
       <table>
         <tr>
           <th>Parameter</th>
           <th>Description</th>
         </tr>
         <tr>
           <td>
             <code>gridmix.job-submission.policy</code>
           </td>
           <td>The value for this key would one of the three: STRESS, REPLAY or
 	  SERIAL. In most of the cases the value of key would be STRESS or
 	  REPLAY. The default value is STRESS.</td>
         </tr>
         <tr>
           <td>
             <code>gridmix.throttle.jobs-to-tracker-ratio</code>
           </td>
           <td>In STRESS mode, the minimum ratio of running jobs to Task
 	  Trackers in a cluster for the cluster to be considered
 	  <em>overloaded</em>. This is the threshold TJ referred to earlier.
 	  The default is 1.0.</td>
         </tr>
         <tr>
           <td>
             <code>gridmix.throttle.maps.task-to-slot-ratio</code>
           </td>
           <td>In STRESS mode, the minimum ratio of pending and running map
 	  tasks (i.e. incomplete map tasks) to the number of map slots for
 	  a cluster for the cluster to be considered <em>overloaded</em>.
 	  This is the threshold TM referred to earlier. Running map tasks are
 	  counted partially. For example, a 40% complete map task is counted
 	  as 0.6 map tasks. The default is 2.0.</td>
         </tr>
         <tr>
           <td>
             <code>gridmix.throttle.reduces.task-to-slot-ratio</code>
           </td>
           <td>In STRESS mode, the minimum ratio of pending and running reduce
 	  tasks (i.e. incomplete reduce tasks) to the number of reduce slots
 	  for a cluster for the cluster to be considered <em>overloaded</em>.
 	  This is the threshold TR referred to earlier. Running reduce tasks
 	  are counted partially. For example, a 30% complete reduce task is
 	  counted as 0.7 reduce tasks. The default is 2.5.</td>
         </tr>
         <tr>
           <td>
             <code>gridmix.throttle.maps.max-slot-share-per-job</code>
           </td>
           <td>In STRESS mode, the maximum share of a cluster's map-slots
 	  capacity that can be counted toward a job's incomplete map tasks in
 	  overload calculation. The default is 0.1.</td>
         </tr>
         <tr>
           <td>
             <code>gridmix.throttle.reducess.max-slot-share-per-job</code>
           </td>
           <td>In STRESS mode, the maximum share of a cluster's reduce-slots
 	  capacity that can be counted toward a job's incomplete reduce tasks
 	  in overload calculation. The default is 0.1.</td>
         </tr>
       </table>
     </section>
     <section id="usersqueues">
       <title>Emulating Users and Queues</title>
       <p>Typical production clusters are often shared with different users and
       the cluster capacity is divided among different departments through job
       queues. Ensuring fairness among jobs from all users, honoring queue
       capacity allocation policies and avoiding an ill-behaving job from
       taking over the cluster adds significant complexity in Hadoop software.
       To be able to sufficiently test and discover bugs in these areas,
       GridMix must emulate the contentions of jobs from different users and/or
       submitted to different queues.</p>
       <p>Emulating multiple queues is easy - we simply set up the benchmark
       cluster with the same queue configuration as the production cluster and
       we configure synthetic jobs so that they get submitted to the same queue
       as recorded in the trace. However, not all users shown in the trace have
       accounts on the benchmark cluster. Instead, we set up a number of testing
       user accounts and associate each unique user in the trace to testing
       users in a round-robin fashion.</p>
       <p>The following configuration parameters affect the emulation of users
       and queues:</p>
       <table>
         <tr>
           <th>Parameter</th>
           <th>Description</th>
         </tr>
         <tr>
           <td>
             <code>gridmix.job-submission.use-queue-in-trace</code>
           </td>
           <td>When set to <code>true</code> it uses exactly the same set of
 	  queues as those mentioned in the trace. The default value is
 	  <code>false</code>.</td>
         </tr>
         <tr>
           <td>
             <code>gridmix.job-submission.default-queue</code>
           </td>
           <td>Specifies the default queue to which all the jobs would be
 	  submitted. If this parameter is not specified, GridMix uses the
 	  default queue defined for the submitting user on the cluster.</td>
         </tr>
         <tr>
           <td>
             <code>gridmix.user.resolve.class</code>
           </td>
           <td>Specifies which <code>UserResolver</code> implementation to use.
 	  We currently have three implementations:
 	  <ol>
 	    <li><code>org.apache.hadoop.mapred.gridmix.EchoUserResolver</code>
 	    - submits a job as the user who submitted the original job. All
 	    the users of the production cluster identified in the job trace
 	    must also have accounts on the benchmark cluster in this case.</li>
 	    <li><code>org.apache.hadoop.mapred.gridmix.SubmitterUserResolver</code>
 	    - submits all the jobs as current GridMix user. In this case we
 	    simply map all the users in the trace to the current GridMix user
 	    and submit the job.</li>
 	    <li><code>org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver</code>
 	    - maps trace users to test users in a round-robin fashion. In
 	    this case we set up a number of testing user accounts and
 	    associate each unique user in the trace to testing users in a
 	    round-robin fashion.</li>
 	  </ol>
 	  The default is
 	  <code>org.apache.hadoop.mapred.gridmix.SubmitterUserResolver</code>.</td>
         </tr>
       </table>
       <p>If the parameter <code>gridmix.user.resolve.class</code> is set to
       <code>org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver</code>,
       we need to define a users-list file with a list of test users.
       This is specified using the <code>-users</code> option to GridMix.</p>
       <note>
       Specifying a users-list file using the <code>-users</code> option is
       mandatory when using the round-robin user-resolver. Other user-resolvers
       ignore this option.
       </note>
       <p>A users-list file has one user per line, each line of the format:</p>
       <source>
       &lt;username&gt;
       </source>
       <p>For example:</p>
       <source>
       user1
       user2
       user3
       </source>
       <p>In the above example we have defined three users <code>user1</code>,
       <code>user2</code> and <code>user3</code>.
       Now we would associate each unique user in the trace to the above users
       defined in round-robin fashion. For example, if trace's users are
       <code>tuser1</code>, <code>tuser2</code>, <code>tuser3</code>,
       <code>tuser4</code> and <code>tuser5</code>, then the mappings would
       be:</p>
       <source>
       tuser1 -&gt; user1
       tuser2 -&gt; user2
       tuser3 -&gt; user3
       tuser4 -&gt; user1
       tuser5 -&gt; user2
       </source>
       <p>For backward compatibility reasons, each line of users-list file can
       contain username followed by groupnames in the form username[,group]*.
       The groupnames will be ignored by Gridmix.
       </p>
     </section>
     <section id="assumptions">
       <title>Simplifying Assumptions</title>
       <p>GridMix will be developed in stages, incorporating feedback and
       patches from the community. Currently its intent is to evaluate
       MapReduce and HDFS performance and not the layers on top of them (i.e.
       the extensive lib and sub-project space). Given these two limitations,
       the following characteristics of job load are not currently captured in
       job traces and cannot be accurately reproduced in GridMix:</p>
       <ul>
 	<li><em>CPU Usage</em> - We have no data for per-task CPU usage, so we
 	cannot even attempt an approximation. GridMix tasks are never
 	CPU-bound independent of I/O, though this surely happens in
 	practice.</li>
 	<li><em>Filesystem Properties</em> - No attempt is made to match block
 	sizes, namespace hierarchies, or any property of input, intermediate
 	or output data other than the bytes/records consumed and emitted from
 	a given task. This implies that some of the most heavily-used parts of
 	the system - the compression libraries, text processing, streaming,
 	etc. - cannot be meaningfully tested with the current
 	implementation.</li>
 	<li><em>I/O Rates</em> - The rate at which records are
 	consumed/emitted is assumed to be limited only by the speed of the
 	reader/writer and constant throughout the task.</li>
 	<li><em>Memory Profile</em> - No data on tasks' memory usage over time
 	is available, though the max heap-size is retained.</li>
 	<li><em>Skew</em> - The records consumed and emitted to/from a given
 	task are assumed to follow observed averages, i.e. records will be
 	more regular than may be seen in the wild. Each map also generates
 	a proportional percentage of data for each reduce, so a job with
 	unbalanced input will be flattened.</li>
 	<li><em>Job Failure</em> - User code is assumed to be correct.</li>
 	<li><em>Job Independence</em> - The output or outcome of one job does
 	not affect when or whether a subsequent job will run.</li>
       </ul>
     </section>
     <section id="appendix">
       <title>Appendix</title>
       <p>Issues tracking the original implementations of <a
       href="https://issues.apache.org/jira/browse/HADOOP-2369">GridMix1</a>,
       <a href="https://issues.apache.org/jira/browse/HADOOP-3770">GridMix2</a>,
       and <a
       href="https://issues.apache.org/jira/browse/MAPREDUCE-776">GridMix3</a>
       can be found on the Apache Hadoop MapReduce JIRA. Other issues tracking
       the current development of GridMix can be found by searching <a
       href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313086">the
       Apache Hadoop MapReduce JIRA</a></p>
     </section>
   </body>
 </document>