<?xml version="1.0"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
<document>
  <header>
    <title>GridMix</title>
  </header>
  <body>
    <section id="overview">
      <title>Overview</title>
      <p>GridMix is a benchmark for Hadoop clusters. It submits a mix of
      synthetic jobs, modeling a profile mined from production loads.</p>
      <p>There exist three versions of the GridMix tool. This document
      discusses the third (checked into <code>src/contrib</code>), distinct
      from the two checked into the <code>src/benchmarks</code> sub-directory.
      While the first two versions of the tool included stripped-down versions
      of common jobs, both were principally saturation tools for stressing the
      framework at scale. In support of a broader range of deployments and
      finer-tuned job mixes, this version of the tool will attempt to model
      the resource profiles of production jobs to identify bottlenecks, guide
      development, and serve as a replacement for the existing GridMix
      benchmarks.</p>
      <p>To run GridMix, you need a MapReduce job trace describing the job mix
      for a given cluster. Such traces are typically generated by Rumen (see
      Rumen documentation). GridMix also requires input data from which the
      synthetic jobs will be reading bytes. The input data need not be in any
      particular format, as the synthetic jobs are currently binary readers.
      If you are running on a new cluster, an optional step generating input
      data may precede the run.</p>
      <p>In order to emulate the load of production jobs from a given cluster
      on the same or another cluster, follow these steps:</p>
      <ol>
	<li>Locate the job history files on the production cluster. This
	location is specified by the
	<code>mapreduce.jobtracker.jobhistory.completed.location</code>
	configuration property of the cluster.</li>
	<li>Run Rumen to build a job trace in JSON format for all or select
	jobs.</li>
	<li><em>(Optional)</em> Use Rumen to fold this job trace to scale
	the load.</li>
	<li>Use GridMix with the job trace on the benchmark cluster.</li>
      </ol>
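      <p>As a rough sketch of steps 2 and 4, the commands might look like the
      following. The paths are hypothetical, and the Rumen
      <code>TraceBuilder</code> class name and argument order are recalled
      from the Rumen documentation rather than stated in this document, so
      verify them against your Hadoop version and make sure the Rumen classes
      are on the Hadoop classpath:</p>
      <source>
# Step 2 (assumed Rumen usage): build a JSON job trace and a topology file
# from the job history directory located in step 1.
hadoop org.apache.hadoop.tools.rumen.TraceBuilder \
  file:///tmp/job-trace.json file:///tmp/topology.json \
  hdfs:///path/to/completed/jobhistory

# Step 4: run GridMix with the trace on the benchmark cluster.
hadoop jar &lt;gridmix-jar&gt; org.apache.hadoop.mapred.gridmix.Gridmix \
  &lt;iopath&gt; file:///tmp/job-trace.json
      </source>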
      <p>Jobs submitted by GridMix have names of the form
      &quot;<code>GRIDMIXnnnnnn</code>&quot;, where
      &quot;<code>nnnnnn</code>&quot; is a sequence number padded with leading
      zeroes.</p>
    </section>
    <section id="usage">
      <title>Usage</title>
      <p>Basic command-line usage without configuration parameters:</p>
      <source>
org.apache.hadoop.mapred.gridmix.Gridmix [-generate &lt;size&gt;] [-users &lt;users-list&gt;] &lt;iopath&gt; &lt;trace&gt;
      </source>
      <p>Basic command-line usage with configuration parameters:</p>
      <source>
org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
  [-generate &lt;size&gt;] [-users &lt;users-list&gt;] &lt;iopath&gt; &lt;trace&gt;
      </source>
      <note>
	Configuration parameters like
	<code>-Dgridmix.client.submit.threads=10</code> and
	<code>-Dgridmix.output.directory=foo</code> as given above should
	be used <em>before</em> other GridMix parameters.
      </note>
      <p>The <code>&lt;iopath&gt;</code> parameter is the working directory for
      GridMix. Note that this can either be on the local file-system
      or on HDFS, but it is highly recommended that it be the same as that for
      the original job mix so that GridMix puts the same load on the local
      file-system and HDFS respectively.</p>
      <p>The <code>-generate</code> option is used to generate input data and
      Distributed Cache files for the synthetic jobs. It accepts standard
      size-unit suffixes, e.g. <code>100g</code> will generate
      100 * 2<sup>30</sup> bytes as input data.
      <code>&lt;iopath&gt;/input</code> is the destination directory for
      generated input data and/or the directory from which input data will be
      read. HDFS-based Distributed Cache files are generated under the
      distributed cache directory <code>&lt;iopath&gt;/distributedCache</code>.
      If some of the needed Distributed Cache files already exist in the
      distributed cache directory, then only the missing files are generated
      when the <code>-generate</code> option is specified.</p>
      <p>The <code>-users</code> option is used to point to a users-list
      file (see <a href="#usersqueues">Emulating Users and Queues</a>).</p>
      <p>The <code>&lt;trace&gt;</code> parameter is a path to a job trace
      generated by Rumen. This trace can be compressed (it must be readable
      using one of the compression codecs supported by the cluster) or
      uncompressed. Use &quot;-&quot; as the value of this parameter if you
      want to pass an <em>uncompressed</em> trace via the standard
      input-stream of GridMix.</p>
      <p>The class <code>org.apache.hadoop.mapred.gridmix.Gridmix</code> can
      be found in the JAR
      <code>contrib/gridmix/hadoop-$VERSION-gridmix.jar</code> inside your
      Hadoop installation, where <code>$VERSION</code> corresponds to the
      version of Hadoop installed. A simple way of ensuring that this class
      and all its dependencies are loaded correctly is to use the
      <code>hadoop</code> wrapper script in Hadoop:</p>
      <source>
hadoop jar &lt;gridmix-jar&gt; org.apache.hadoop.mapred.gridmix.Gridmix \
  [-generate &lt;size&gt;] [-users &lt;users-list&gt;] &lt;iopath&gt; &lt;trace&gt;
      </source>
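      <p>For illustration, a first run on a new cluster that generates 100
      GiB of input data, and a later run that reads an uncompressed trace
      from standard input, could look like this sketch. The directory and
      trace paths are hypothetical placeholders. Note that reading the trace
      from standard input disables Distributed Cache emulation (see
      <a href="#distributedcacheload">Emulation of Distributed Cache
      Load</a>):</p>
      <source>
# Generate input data under /user/bench/gridmix/input, then run the trace.
hadoop jar &lt;gridmix-jar&gt; org.apache.hadoop.mapred.gridmix.Gridmix \
  -generate 100g /user/bench/gridmix /user/bench/traces/job-trace.json

# Reuse the existing input data and pass the trace on standard input.
gunzip -c job-trace.json.gz | \
  hadoop jar &lt;gridmix-jar&gt; org.apache.hadoop.mapred.gridmix.Gridmix \
  /user/bench/gridmix -
      </source>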
      <p>The supported configuration parameters are explained in the
      following sections.</p>
    </section>
    <section id="cfgparams">
      <title>General Configuration Parameters</title>
      <p/>
      <table>
        <tr>
          <th>Parameter</th>
          <th>Description</th>
        </tr>
        <tr>
          <td>
            <code>gridmix.output.directory</code>
          </td>
          <td>The directory into which output will be written. If specified,
	  <code>iopath</code> will be relative to this parameter. The
	  submitting user must have read/write access to this directory. The
	  user should also be mindful of any quota issues that may arise
	  during a run. The default is &quot;<code>gridmix</code>&quot;.</td>
        </tr>
        <tr>
          <td>
            <code>gridmix.client.submit.threads</code>
          </td>
          <td>The number of threads submitting jobs to the cluster. This
	  also controls how many splits will be loaded into memory at a given
	  time, pending the submit time in the trace. Splits are pre-generated
	  to hit submission deadlines, so particularly dense traces may want
	  more submitting threads. However, storing splits in memory is
	  reasonably expensive, so you should raise this cautiously. The
	  default is 1 for the SERIAL job-submission policy (see
	  <a href="#policies">Job Submission Policies</a>) and one more than
	  the number of processors on the client machine for the other
	  policies.</td>
        </tr>
	<tr>
	  <td>
	    <code>gridmix.submit.multiplier</code>
	  </td>
	  <td>The multiplier to accelerate or decelerate the submission of
	  jobs. The time separating two jobs is multiplied by this factor.
	  The default value is 1.0. This is a crude mechanism to size
	  a job trace to a cluster.</td>
	</tr>
        <tr>
          <td>
            <code>gridmix.client.pending.queue.depth</code>
          </td>
          <td>The depth of the queue of job descriptions awaiting split
	  generation. The jobs read from the trace occupy a queue of this
	  depth before being processed by the submission threads. It is
	  unusual to configure this. The default is 5.</td>
        </tr>
	<tr>
	  <td>
	    <code>gridmix.gen.blocksize</code>
	  </td>
	  <td>The block-size of generated data. The default value is 256
	  MiB.</td>
	</tr>
	<tr>
	  <td>
	    <code>gridmix.gen.bytes.per.file</code>
	  </td>
	  <td>The maximum bytes written per file. The default value is 1
	  GiB.</td>
	</tr>
        <tr>
          <td>
            <code>gridmix.min.file.size</code>
          </td>
          <td>The minimum size of the input files. The default limit is 128
	  MiB. Tweak this parameter if you see an error-message like
	  &quot;Found no satisfactory file&quot; while testing GridMix with
	  a relatively-small input data-set.</td>
        </tr>
        <tr>
          <td>
            <code>gridmix.max.total.scan</code>
          </td>
          <td>The maximum size of the input files. The default limit is 100
	  TiB.</td>
        </tr>
      </table>
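      <p>As an example, a small test run might lower the minimum input file
      size and halve the gaps between submissions. This is a sketch only:
      the values are arbitrary, and it assumes that
      <code>gridmix.min.file.size</code> is given in bytes (the table above
      states only the default of 128 MiB):</p>
      <source>
# 16777216 bytes = 16 MiB (assumed unit: bytes).
# With a multiplier of 0.5, two jobs submitted 60 seconds apart in the
# trace are submitted 30 seconds apart.
hadoop jar &lt;gridmix-jar&gt; org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.min.file.size=16777216 \
  -Dgridmix.submit.multiplier=0.5 \
  -generate 2g &lt;iopath&gt; &lt;trace&gt;
      </source>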
    </section>
    <section id="jobtypes">
      <title>Job Types</title>
      <p>GridMix takes as input a job trace, essentially a stream of
      JSON-encoded job descriptions. For each job description, the submission
      client obtains the original job submission time and for each task in
      that job, the byte and record counts read and written. Given this data,
      it constructs a synthetic job with the same byte and record patterns as
      recorded in the trace. It constructs jobs of two types:</p>
      <table>
        <tr>
          <th>Job Type</th>
          <th>Description</th>
        </tr>
        <tr>
          <td>
            <code>LOADJOB</code>
          </td>
          <td>A synthetic job that emulates the workload described in the
	  Rumen trace. The current version emulates only I/O: it reproduces
	  the I/O workload on the benchmark cluster by embedding
	  the detailed I/O information for every map and reduce task, such as
	  the number of bytes and records read and written, into each
	  job's input splits. The map tasks further relay the I/O patterns of
	  reduce tasks through the intermediate map output data.</td>
        </tr>
        <tr>
          <td>
            <code>SLEEPJOB</code>
          </td>
	  <td>A synthetic job where each task does <em>nothing</em> but sleep
	  for a certain duration as observed in the production trace. The
	  scalability of the Job Tracker is often limited by how many
	  heartbeats it can handle every second. (Heartbeats are periodic
	  messages sent from Task Trackers to update their status and grab new
	  tasks from the Job Tracker.) Since a benchmark cluster is typically
	  a fraction of the size of a production cluster, the heartbeat traffic
	  generated by its slave nodes is well below the level of the
	  production cluster. One possible solution is to run multiple Task
	  Trackers on each slave node, but the I/O workload generated by
	  LOADJOB-style synthetic jobs would then thrash the slave nodes.
	  SLEEPJOB avoids this because its tasks only sleep.</td>
        </tr>
      </table>
      <p>The following configuration parameters affect the job type:</p>
      <table>
        <tr>
          <th>Parameter</th>
          <th>Description</th>
        </tr>
        <tr>
          <td>
            <code>gridmix.job.type</code>
          </td>
          <td>The value for this key can be one of LOADJOB or SLEEPJOB. The
	  default value is LOADJOB.</td>
        </tr>
        <tr>
          <td>
            <code>gridmix.key.fraction</code>
          </td>
          <td>For a LOADJOB type of job, the fraction of a record used for
	  the data for the key. The default value is 0.1.</td>
        </tr>
        <tr>
          <td>
            <code>gridmix.sleep.maptask-only</code>
          </td>
          <td>For a SLEEPJOB type of job, whether to ignore the reduce
	  tasks for the job. The default is <code>false</code>.</td>
        </tr>
        <tr>
          <td>
            <code>gridmix.sleep.fake-locations</code>
          </td>
          <td>For a SLEEPJOB type of job, the number of fake locations
	  for map tasks for the job. The default is 0.</td>
        </tr>
        <tr>
          <td>
            <code>gridmix.sleep.max-map-time</code>
          </td>
          <td>For a SLEEPJOB type of job, the maximum runtime for map
	  tasks for the job in milliseconds. The default is unlimited.</td>
        </tr>
        <tr>
          <td>
            <code>gridmix.sleep.max-reduce-time</code>
          </td>
          <td>For a SLEEPJOB type of job, the maximum runtime for reduce
	  tasks for the job in milliseconds. The default is unlimited.</td>
        </tr>
      </table>
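      <p>For instance, to stress the Job Tracker with map-only sleep jobs
      whose map tasks are capped at one minute, the parameters above could be
      combined as in this sketch (the cap of 60000 ms is an arbitrary
      choice):</p>
      <source>
hadoop jar &lt;gridmix-jar&gt; org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.job.type=SLEEPJOB \
  -Dgridmix.sleep.maptask-only=true \
  -Dgridmix.sleep.max-map-time=60000 \
  &lt;iopath&gt; &lt;trace&gt;
      </source>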
    </section>
    <section id="policies">
      <title>Job Submission Policies</title>
      <p>GridMix controls the rate of job submission. This control can be
      based on the trace information or on statistics it gathers
      from the Job Tracker. GridMix applies the algorithm that corresponds
      to the job submission policy the user selects.
      There are currently three types of policies:</p>
      <table>
        <tr>
          <th>Job Submission Policy</th>
          <th>Description</th>
        </tr>
        <tr>
          <td>
            <code>STRESS</code>
          </td>
          <td>Keep submitting jobs so that the cluster remains under stress.
	  In this mode we control the rate of job submission by monitoring
	  the real-time load of the cluster so that we can maintain a stable
	  stress level of workload on the cluster. Based on the statistics we
	  gather, we determine whether a cluster is <em>underloaded</em> or
	  <em>overloaded</em>. We consider a cluster <em>underloaded</em> if
	  and only if the following three conditions are true:
	  <ol>
	    <li>the number of pending and running jobs is under a threshold
	    TJ</li>
	    <li>the number of pending and running maps is under a threshold
	    TM</li>
	    <li>the number of pending and running reduces is under a threshold
	    TR</li>
	  </ol>
          The thresholds TJ, TM and TR are proportional to the size of the
	  cluster, its map-slot capacity and its reduce-slot capacity,
	  respectively. When a cluster is <em>overloaded</em>, we throttle
	  job submission. In the actual calculation we also weight each
	  running task by its remaining work: a 90% complete task is counted
	  as only 0.1 of a task. Finally, to prevent a very large job from
	  blocking other jobs, we limit the number of pending/waiting tasks
	  each job can contribute.</td>
        </tr>
        <tr>
          <td>
            <code>REPLAY</code>
          </td>
          <td>In this mode we replay the job traces faithfully. This mode
	  exactly follows the time-intervals given in the actual job
	  trace.</td>
        </tr>
        <tr>
          <td>
            <code>SERIAL</code>
          </td>
          <td>In this mode we submit the next job only once the job submitted
	  earlier is completed.</td>
        </tr>
      </table>
      <p>The following configuration parameters affect the job submission
      policy:</p>
      <table>
        <tr>
          <th>Parameter</th>
          <th>Description</th>
        </tr>
        <tr>
          <td>
            <code>gridmix.job-submission.policy</code>
          </td>
          <td>The value for this key must be one of STRESS, REPLAY or
	  SERIAL. In most cases the value will be STRESS or
	  REPLAY. The default value is STRESS.</td>
        </tr>
        <tr>
          <td>
            <code>gridmix.throttle.jobs-to-tracker-ratio</code>
          </td>
          <td>In STRESS mode, the minimum ratio of running jobs to Task
	  Trackers in a cluster for the cluster to be considered
	  <em>overloaded</em>. This is the threshold TJ referred to earlier.
	  The default is 1.0.</td>
        </tr>
        <tr>
          <td>
            <code>gridmix.throttle.maps.task-to-slot-ratio</code>
          </td>
          <td>In STRESS mode, the minimum ratio of pending and running map
	  tasks (i.e. incomplete map tasks) to the number of map slots in
	  a cluster for the cluster to be considered <em>overloaded</em>.
	  This is the threshold TM referred to earlier. Running map tasks are
	  counted partially. For example, a 40% complete map task is counted
	  as 0.6 map tasks. The default is 2.0.</td>
        </tr>
        <tr>
          <td>
            <code>gridmix.throttle.reduces.task-to-slot-ratio</code>
          </td>
          <td>In STRESS mode, the minimum ratio of pending and running reduce
	  tasks (i.e. incomplete reduce tasks) to the number of reduce slots
	  in a cluster for the cluster to be considered <em>overloaded</em>.
	  This is the threshold TR referred to earlier. Running reduce tasks
	  are counted partially. For example, a 30% complete reduce task is
	  counted as 0.7 reduce tasks. The default is 2.5.</td>
        </tr>
        <tr>
          <td>
            <code>gridmix.throttle.maps.max-slot-share-per-job</code>
          </td>
          <td>In STRESS mode, the maximum share of a cluster's map-slots
	  capacity that can be counted toward a job's incomplete map tasks in
	  overload calculation. The default is 0.1.</td>
        </tr>
        <tr>
          <td>
            <code>gridmix.throttle.reducess.max-slot-share-per-job</code>
          </td>
          <td>In STRESS mode, the maximum share of a cluster's reduce-slots
	  capacity that can be counted toward a job's incomplete reduce tasks
	  in overload calculation. The default is 0.1.</td>
        </tr>
      </table>
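      <p>To make the STRESS thresholds concrete, consider a hypothetical
      benchmark cluster with 100 Task Trackers, 400 map slots and 200 reduce
      slots, running with the default ratios. Assuming each threshold is
      simply the ratio multiplied by the corresponding capacity (the
      description above says only that they are proportional), the numbers
      work out as follows:</p>
      <source>
TJ = 1.0 * 100 Task Trackers = 100 jobs
TM = 2.0 * 400 map slots     = 800 incomplete map tasks
TR = 2.5 * 200 reduce slots  = 500 incomplete reduce tasks

Per-job cap on counted map tasks    = 0.1 * 400 = 40
Per-job cap on counted reduce tasks = 0.1 * 200 = 20
      </source>
      <p>Under these assumptions, a job with 1000 pending map tasks would
      contribute at most 40 to the map load, and GridMix would keep
      submitting jobs as long as all three loads stay below their
      thresholds.</p>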
    </section>
    <section id="usersqueues">
      <title>Emulating Users and Queues</title>
      <p>Production clusters are typically shared among many users, and
      the cluster capacity is divided among different departments through job
      queues. Ensuring fairness among jobs from all users, honoring queue
      capacity allocation policies and preventing an ill-behaved job from
      taking over the cluster add significant complexity to the Hadoop
      software. To test these areas adequately and discover bugs in them,
      GridMix must emulate the contention among jobs from different users
      and/or submitted to different queues.</p>
      <p>Emulating multiple queues is easy: we simply set up the benchmark
      cluster with the same queue configuration as the production cluster and
      configure synthetic jobs so that they get submitted to the same queue
      as recorded in the trace. Emulating users is harder, because not all
      users shown in the trace have accounts on the benchmark cluster.
      Instead, we set up a number of testing user accounts and map each
      unique user in the trace to a testing user in a round-robin
      fashion.</p>
      <p>The following configuration parameters affect the emulation of users
      and queues:</p>
      <table>
        <tr>
          <th>Parameter</th>
          <th>Description</th>
        </tr>
        <tr>
          <td>
            <code>gridmix.job-submission.use-queue-in-trace</code>
          </td>
          <td>When set to <code>true</code> it uses exactly the same set of
	  queues as those mentioned in the trace. The default value is
	  <code>false</code>.</td>
        </tr>
        <tr>
          <td>
            <code>gridmix.job-submission.default-queue</code>
          </td>
          <td>Specifies the default queue to which all the jobs would be
	  submitted. If this parameter is not specified, GridMix uses the
	  default queue defined for the submitting user on the cluster.</td>
        </tr>
        <tr>
          <td>
            <code>gridmix.user.resolve.class</code>
          </td>
          <td>Specifies which <code>UserResolver</code> implementation to use.
	  We currently have three implementations:
	  <ol>
	    <li><code>org.apache.hadoop.mapred.gridmix.EchoUserResolver</code>
	    - submits a job as the user who submitted the original job. All
	    the users of the production cluster identified in the job trace
	    must also have accounts on the benchmark cluster in this case.</li>
	    <li><code>org.apache.hadoop.mapred.gridmix.SubmitterUserResolver</code>
	    - submits all the jobs as the current GridMix user. In this case we
	    simply map all the users in the trace to the current GridMix user
	    and submit the job.</li>
	    <li><code>org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver</code>
	    - maps trace users to test users in a round-robin fashion. In
	    this case we set up a number of testing user accounts and
	    map each unique user in the trace to a testing user in a
	    round-robin fashion.</li>
	  </ol>
	  The default is
	  <code>org.apache.hadoop.mapred.gridmix.SubmitterUserResolver</code>.</td>
        </tr>
      </table>
      <p>If the parameter <code>gridmix.user.resolve.class</code> is set to
      <code>org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver</code>,
      we need to define a users-list file with a list of test users.
      This is specified using the <code>-users</code> option to GridMix.</p>
      <note>
      Specifying a users-list file using the <code>-users</code> option is
      mandatory when using the round-robin user-resolver. Other user-resolvers
      ignore this option.
      </note>
      <p>A users-list file has one user per line, each line of the format:</p>
      <source>
      &lt;username&gt;
      </source>
      <p>For example:</p>
      <source>
      user1
      user2
      user3
      </source>
      <p>The above example defines three users: <code>user1</code>,
      <code>user2</code> and <code>user3</code>.
      Each unique user in the trace is mapped to one of these users
      in round-robin fashion. For example, if the trace's users are
      <code>tuser1</code>, <code>tuser2</code>, <code>tuser3</code>,
      <code>tuser4</code> and <code>tuser5</code>, then the mappings would
      be:</p>
      <source>
      tuser1 -&gt; user1
      tuser2 -&gt; user2
      tuser3 -&gt; user3
      tuser4 -&gt; user1
      tuser5 -&gt; user2
      </source>
      <p>For backward compatibility, each line of the users-list file can
      contain a username followed by group names, in the form
      <code>username[,group]*</code>.
      GridMix ignores the group names.
      </p>
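      <p>Putting these options together, a run that maps trace users onto
      test accounts and keeps the queues recorded in the trace might be
      launched as in this sketch, where <code>&lt;users-list&gt;</code>
      points to a users-list file in the format described above:</p>
      <source>
hadoop jar &lt;gridmix-jar&gt; org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.user.resolve.class=org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver \
  -Dgridmix.job-submission.use-queue-in-trace=true \
  -users &lt;users-list&gt; &lt;iopath&gt; &lt;trace&gt;
      </source>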
    </section>

  <section id="distributedcacheload">
  <title>Emulation of Distributed Cache Load</title>
    <p>GridMix emulates Distributed Cache load by default for jobs of type
    LOADJOB. It does so by pre-creating the Distributed Cache files needed by
    all the simulated jobs in a separate MapReduce job.</p>
    <p>Emulation of Distributed Cache load in GridMix simulated jobs can be
    disabled by setting the property
    <code>gridmix.distributed-cache-emulation.enable</code> to
    <code>false</code>.
    Generation of Distributed Cache data, however, is driven by the
    <code>-generate</code> option and is independent of this configuration
    property.</p>
    <p>Both generation of Distributed Cache files and emulation of
    Distributed Cache load are disabled if:</p>
    <ul>
    <li>the input trace comes from the standard input-stream instead of a file, or</li>
    <li>the <code>&lt;iopath&gt;</code> specified is on the local file-system, or</li>
    <li>any of the ancestor directories of the distributed cache directory
    <code>&lt;iopath&gt;/distributedCache</code> (including the distributed
    cache directory itself) lacks execute permission for others.</li>
    </ul>
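    <p>For example, provided none of the conditions listed above applies,
    the following sketch generates the Distributed Cache files but runs the
    simulated jobs without emulating Distributed Cache load (the size passed
    to <code>-generate</code> is arbitrary):</p>
    <source>
hadoop jar &lt;gridmix-jar&gt; org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.distributed-cache-emulation.enable=false \
  -generate 10g &lt;iopath&gt; &lt;trace&gt;
    </source>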
  </section>

    <section id="simulatedjobconf">
      <title>Configuration of Simulated Jobs</title>
      <p>GridMix3 sets some configuration properties in the simulated jobs
      it submits so that each can be mapped back to the corresponding job
      in the input job trace. These configuration parameters include:
      </p>
      <table>
        <tr>
          <th>Parameter</th>
          <th>Description</th>
        </tr>
        <tr>
          <td>
            <code>gridmix.job.original-job-id</code>
          </td>
          <td> The job id of the original cluster's job corresponding to this
          simulated job.
          </td>
        </tr>
        <tr>
          <td>
            <code>gridmix.job.original-job-name</code>
          </td>
          <td> The job name of the original cluster's job corresponding to this
          simulated job.
          </td>
        </tr>
      </table>
    </section>

    <section id="assumptions">
      <title>Simplifying Assumptions</title>
      <p>GridMix will be developed in stages, incorporating feedback and
      patches from the community. Currently its intent is to evaluate
      MapReduce and HDFS performance and not the layers on top of them (i.e.
      the extensive lib and sub-project space). Given these two limitations,
      the following characteristics of job load are not currently captured in
      job traces and cannot be accurately reproduced in GridMix:</p>
      <ul>
	<li><em>CPU Usage</em> - We have no data for per-task CPU usage, so we
	cannot even attempt an approximation. GridMix tasks are never
	CPU-bound independent of I/O, though this surely happens in
	practice.</li>
	<li><em>Filesystem Properties</em> - No attempt is made to match block
	sizes, namespace hierarchies, or any property of input, intermediate
	or output data other than the bytes/records consumed and emitted from
	a given task. This implies that some of the most heavily-used parts of
	the system - the compression libraries, text processing, streaming,
	etc. - cannot be meaningfully tested with the current
	implementation.</li>
	<li><em>I/O Rates</em> - The rate at which records are
	consumed/emitted is assumed to be limited only by the speed of the
	reader/writer and constant throughout the task.</li>
	<li><em>Memory Profile</em> - No data on tasks' memory usage over time
	is available, though the max heap-size is retained.</li>
	<li><em>Skew</em> - The records consumed and emitted to/from a given
	task are assumed to follow observed averages, i.e. records will be
	more regular than may be seen in the wild. Each map also generates
	a proportional percentage of data for each reduce, so a job with
	unbalanced input will be flattened.</li>
	<li><em>Job Failure</em> - User code is assumed to be correct.</li>
	<li><em>Job Independence</em> - The output or outcome of one job does
	not affect when or whether a subsequent job will run.</li>
      </ul>
    </section>
    <section id="appendix">
      <title>Appendix</title>
      <p>Issues tracking the original implementations of <a
      href="https://issues.apache.org/jira/browse/HADOOP-2369">GridMix1</a>,
      <a href="https://issues.apache.org/jira/browse/HADOOP-3770">GridMix2</a>,
      and <a
      href="https://issues.apache.org/jira/browse/MAPREDUCE-776">GridMix3</a>
      can be found on the Apache Hadoop MapReduce JIRA. Other issues tracking
      the current development of GridMix can be found by searching <a
      href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313086">the
      Apache Hadoop MapReduce JIRA</a>.</p>
    </section>
  </body>
</document>
