
 <!DOCTYPE html>
 <html lang="en">
   <head>
     <meta charset="utf-8">
     <meta http-equiv="X-UA-Compatible" content="IE=edge">
     <meta name="viewport" content="width=device-width, initial-scale=1">
     <meta name="description" content="">
     <meta name="author" content="">
     <link rel="icon" href="../../favicon.ico">

     <title>Joshua Documentation | The Joshua Pipeline</title>

     <!-- Bootstrap core CSS -->
     <link href="/dist/css/bootstrap.min.css" rel="stylesheet">

     <!-- Custom styles for this template -->
     <link href="/joshua6.css" rel="stylesheet">
   </head>

   <body>

     <div class="blog-masthead">
       <div class="container">
         <nav class="blog-nav">
           <!-- <a class="blog-nav-item active" href="#">Joshua</a> -->
           <a class="blog-nav-item" href="/">Joshua</a>
           <!-- <a class="blog-nav-item" href="/6.0/whats-new.html">New features</a> -->
           <a class="blog-nav-item" href="/language-packs/">Language packs</a>
           <a class="blog-nav-item" href="/data/">Datasets</a>
           <a class="blog-nav-item" href="/support/">Support</a>
           <a class="blog-nav-item" href="/contributors.html">Contributors</a>
         </nav>
       </div>
     </div>

     <div class="container">

       <div class="row">

         <div class="col-sm-2">
           <div class="sidebar-module">
             <!-- <h4>About</h4> -->
             <center>
             <img src="/images/joshua-logo-small.png" />
             <p>Joshua machine translation toolkit</p>
             </center>
           </div>
           <hr>
           <center>
             <a href="/releases/current/" target="_blank"><button class="button">Download Joshua 6.0.5</button></a>
             <br />
             <a href="/releases/runtime/" target="_blank"><button class="button">Runtime only version</button></a>
             <p>Released November 5, 2015</p>
           </center>
           <hr>
           <!-- <div class="sidebar-module"> -->
           <!--   <span id="download"> -->
           <!--     <a href="http://joshua-decoder.org/downloads/joshua-6.0.tgz">Download</a> -->
           <!--   </span> -->
           <!-- </div> -->
           <div class="sidebar-module">
             <h4>Using Joshua</h4>
             <ol class="list-unstyled">
               <li><a href="/6.0/install.html">Installation</a></li>
               <li><a href="/6.0/quick-start.html">Quick Start</a></li>
             </ol>
           </div>
           <hr>
           <div class="sidebar-module">
             <h4>Building new models</h4>
             <ol class="list-unstyled">
               <li><a href="/6.0/pipeline.html">Pipeline</a></li>
               <li><a href="/6.0/tutorial.html">Tutorial</a></li>
               <li><a href="/6.0/faq.html">FAQ</a></li>
             </ol>
           </div>
 <!--
           <div class="sidebar-module">
             <h4>Phrase-based</h4>
             <ol class="list-unstyled">
               <li><a href="/6.0/phrase.html">Training</a></li>
             </ol>
           </div>
 -->
           <hr>
           <div class="sidebar-module">
             <h4>Advanced</h4>
             <ol class="list-unstyled">
               <li><a href="/6.0/bundle.html">Building language packs</a></li>
               <li><a href="/6.0/decoder.html">Decoder options</a></li>
               <li><a href="/6.0/file-formats.html">File formats</a></li>
               <li><a href="/6.0/packing.html">Packing TMs</a></li>
               <li><a href="/6.0/large-lms.html">Building large LMs</a></li>
             </ol>
           </div>

           <hr>
           <div class="sidebar-module">
             <h4>Developer</h4>
             <ol class="list-unstyled">
 		<li><a href="https://github.com/joshua-decoder/joshua">Github</a></li>
 		<li><a href="http://cs.jhu.edu/~post/joshua-docs">Javadoc</a></li>
 		<li><a href="https://groups.google.com/forum/?fromgroups#!forum/joshua_developers">Mailing list</a></li>
             </ol>
           </div>

         </div><!-- /.blog-sidebar -->


         <div class="col-sm-8 blog-main">


           <div class="blog-title">
             <h2>The Joshua Pipeline</h2>
           </div>

           <div class="blog-post">

             <p><em>Please note that Joshua 6.0.3 included some big changes to the directory organization of the
  pipeline’s files.</em></p>

 <p>This page describes the Joshua pipeline script, which manages the complexity of training and
 evaluating machine translation systems.  The pipeline eases the pain of two related tasks in
 statistical machine translation (SMT) research:</p>

 <ul>
   <li>
     <p>Training SMT systems involves a complicated process of interacting steps that are
 time-consuming and prone to failure.</p>
   </li>
   <li>
     <p>Developing and testing new techniques requires varying parameters at different points in the
 pipeline. Earlier results (which are often expensive) need not be recomputed.</p>
   </li>
 </ul>

 <p>To facilitate these tasks, the pipeline script:</p>

 <ul>
   <li>
     <p>Runs the complete SMT pipeline, from corpus normalization and tokenization, through alignment,
 model building, tuning, test-set decoding, and evaluation.</p>
   </li>
   <li>
     <p>Caches the results of intermediate steps (using robust SHA-1 checksums on dependencies), so that the
 pipeline can be debugged or shared across similar runs without spending time
 recomputing expensive steps.</p>
   </li>
   <li>
     <p>Allows you to jump into and out of the pipeline at a set of predefined places (e.g., the alignment
 stage), so long as you provide the missing dependencies.</p>
   </li>
 </ul>

 <p>The Joshua pipeline script is designed in the spirit of Moses’ <code class="highlighter-rouge">train-model.perl</code>, and shares
 (and has borrowed) many of its features.  It is not as extensive as Moses’
 <a href="http://www.statmt.org/moses/?n=FactoredTraining.EMS">Experiment Management System</a>, which allows
 the user to define arbitrary execution dependency graphs. However, it is significantly simpler to
 use, allowing many systems to be built with a single command (that may run for days or weeks).</p>

 <h2 id="dependencies">Dependencies</h2>

 <p>The pipeline has no <em>required</em> external dependencies.  However, it has support for a number of
 external packages, some of which are included with Joshua.</p>

 <ul>
   <li>
     <p><a href="http://code.google.com/p/giza-pp/">GIZA++</a> (included)</p>

     <p>GIZA++ is the default aligner.  It is included with Joshua, and should compile successfully when
 you type <code class="highlighter-rouge">ant</code> from the Joshua root directory.  It is not required because you can use the
 (included) Berkeley aligner (<code class="highlighter-rouge">--aligner berkeley</code>). We have recently also provided support
 for the <a href="http://code.google.com/p/jacana-xy/wiki/JacanaXY">Jacana-XY aligner</a> (<code class="highlighter-rouge">--aligner
 jacana</code>). </p>
   </li>
   <li>
     <p><a href="http://hadoop.apache.org/">Hadoop</a> (included)</p>

     <p>The pipeline uses the <a href="thrax.html">Thrax grammar extractor</a>, which is built on Hadoop.  If you
 have a Hadoop installation, simply ensure that the <code class="highlighter-rouge">$HADOOP</code> environment variable is defined, and
 the pipeline will use it automatically at the grammar extraction step.  If you are going to
 attempt to extract very large grammars, it is best to have a good-sized Hadoop installation.</p>

     <p>(If you do not have a Hadoop installation, you might consider setting one up.  Hadoop can be
 installed in a
 <a href="http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed">“pseudo-distributed”</a>
 mode that allows it to use just a few machines or a number of processors on a single machine.
 The main issue is to ensure that there are a lot of independent physical disks, since in our
 experience Hadoop starts to exhibit lots of hard-to-trace problems if there is too much demand on
 the disks.)</p>

     <p>If you don’t have a Hadoop installation, don’t worry.  If <code class="highlighter-rouge">$HADOOP</code> is undefined, the pipeline will unroll a
 standalone installation and use it to extract your grammar.</p>
   </li>
   <li>
     <p><a href="http://statmt.org/moses/">Moses</a> (not included). Moses is needed
 if you wish to use its ‘kbmira’ tuner (<code class="highlighter-rouge">--tuner kbmira</code>), or if you
 wish to build phrase-based models.</p>
   </li>
   <li>
     <p><a href="http://www.speech.sri.com/projects/srilm/">SRILM</a> (not included; not needed; not recommended)</p>

     <p>By default, the pipeline uses the included <a href="https://kheafield.com/code/kenlm/">KenLM</a> for
 building (and also querying) language models. Joshua also includes a Java program from the
 <a href="http://code.google.com/p/berkeleylm/">Berkeley LM</a> package that contains code for constructing a
 Kneser-Ney-smoothed language model in ARPA format from the target side of your training data.<br />
 There is no need to use SRILM, but if you do wish to use it, you need to do the following:</p>

     <ol>
       <li>Install SRILM and set the <code class="highlighter-rouge">$SRILM</code> environment variable to point to its installed location.</li>
       <li>Add the <code class="highlighter-rouge">--lm-gen srilm</code> flag to your pipeline invocation.</li>
     </ol>

     <p>More information on this is available in the <a href="#lm">LM building section of the pipeline</a>.  SRILM
 is not used for representing language models during decoding; in fact, it is not supported for that purpose,
 having been supplanted by <a href="http://kheafield.com/code/kenlm/">KenLM</a> (the default) and
 BerkeleyLM.</p>
   </li>
 </ul>
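
 <p>As a minimal sketch of how these optional dependencies are selected, the following assumes an existing Hadoop
 installation under <code class="highlighter-rouge">/opt/hadoop</code> (a hypothetical path, and it is an assumption that <code class="highlighter-rouge">$HADOOP</code> should point at
 the installation root) together with input files laid out as in the basic run described below.  The language codes
 “ur” and “en” are illustrative; every flag shown is one documented on this page.</p>

 <div class="highlighter-rouge"><pre class="highlight"><code># Hypothetical location of an existing Hadoop installation.  Leave $HADOOP
 # undefined to let the pipeline unroll its own standalone copy instead.
 export HADOOP=/opt/hadoop

 # Use the included Berkeley aligner in place of the default GIZA++.
 $JOSHUA/bin/pipeline.pl  \
   --rundir .             \
   --type hiero           \
   --aligner berkeley     \
   --corpus input/train   \
   --tune input/tune      \
   --test input/test      \
   --source ur            \
   --target en
 </code></pre>
 </div>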

 <p>After installing any dependencies, follow the brief instructions on
 the <a href="install.html">installation page</a>, and then you are ready to build
 models. </p>

 <h2 id="a-basic-pipeline-run">A basic pipeline run</h2>

 <p>The pipeline takes a set of inputs (training, tuning, and test data), and creates a set of
 intermediate files in the <em>run directory</em>.  By default, the run directory is the current directory,
 but it can be changed with the <code class="highlighter-rouge">--rundir</code> parameter.</p>

 <p>For this quick start, we will be working with the example that can be found in
 <code class="highlighter-rouge">$JOSHUA/examples/training</code>.  This example contains 1,000 sentences of Urdu-English data (the full
 dataset is available as part of the
 <a href="/indian-parallel-corpora/">Indian languages parallel corpora</a>), along with
 100-sentence tuning and test sets with four references each.</p>

 <p>Running the pipeline requires two main steps: data preparation and invocation.</p>

 <ol>
   <li>
     <p>Prepare your data.  The pipeline script needs to be told where to find the raw training, tuning,
 and test data.  A good convention is to place these files in an <code class="highlighter-rouge">input/</code> subdirectory of your run’s
 working directory (NOTE: do not use <code class="highlighter-rouge">data/</code>, since a directory of that name is created and used
 by the pipeline itself for storing processed files).  The expected format (for each of training,
 tuning, and test) is a pair of files that share a common path prefix and are distinguished by
 their extension, e.g.,</p>

     <div class="highlighter-rouge"><pre class="highlight"><code>input/
       train.SOURCE
       train.TARGET
       tune.SOURCE
       tune.TARGET
       test.SOURCE
       test.TARGET
 </code></pre>
     </div>

     <p>These files should be parallel at the sentence level (with one sentence per line), should be in
 UTF-8, and should be untokenized (tokenization occurs in the pipeline).  SOURCE and TARGET denote
 variables that should be replaced with the actual source and target language abbreviations (e.g.,
 “ur” and “en”).</p>
   </li>
   <li>
     <p>Run the pipeline.  The following is the minimal invocation to run the complete pipeline:</p>

     <div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/pipeline.pl  \
   --rundir .             \
   --type hiero           \
   --corpus input/train   \
   --tune input/tune      \
   --test input/test      \
   --source SOURCE        \
   --target TARGET
 </code></pre>
     </div>

     <p>The <code class="highlighter-rouge">--corpus</code>, <code class="highlighter-rouge">--tune</code>, and <code class="highlighter-rouge">--test</code> flags define file prefixes that are concatenated with the
 language extensions given by <code class="highlighter-rouge">--source</code> and <code class="highlighter-rouge">--target</code> (with a “.” in between).  Note the
 correspondences with the files defined in the first step above.  The prefixes can be either
 absolute or relative pathnames.  With SOURCE replaced by “ur” and TARGET by “en”, for example, this invocation assumes that a subdirectory <code class="highlighter-rouge">input/</code>
 exists in the current directory, that you are translating from a language identified by the “ur”
 extension to a language identified by the “en” extension, that the training data can be found at
 <code class="highlighter-rouge">input/train.ur</code> and <code class="highlighter-rouge">input/train.en</code>, and so on.</p>
   </li>
 </ol>

 <p><em>Don’t</em> run the pipeline directly from <code class="highlighter-rouge">$JOSHUA</code>, or, for that matter, in any directory with lots of other files.
 This can cause problems because the pipeline creates lots of files under <code class="highlighter-rouge">--rundir</code> that can clobber existing files.
 You should run experiments in a clean directory.
 For example, if you have Joshua installed in <code class="highlighter-rouge">$HOME/code/joshua</code>, manage your runs in a different location, such as <code class="highlighter-rouge">$HOME/expts/joshua</code>.</p>

 <p>Assuming no problems arise, this command will run the complete pipeline in about 20 minutes,
 producing BLEU scores at the end.  As it runs, you will see output that looks like the following:</p>

 <div class="highlighter-rouge"><pre class="highlight"><code>[train-copy-en] rebuilding...
   dep=/Users/post/code/joshua/test/pipeline/input/train.en
   dep=data/train/train.en.gz [NOT FOUND]
   cmd=cat /Users/post/code/joshua/test/pipeline/input/train.en | gzip -9n &gt; data/train/train.en.gz
   took 0 seconds (0s)
 [train-copy-ur] rebuilding...
   dep=/Users/post/code/joshua/test/pipeline/input/train.ur
   dep=data/train/train.ur.gz [NOT FOUND]
   cmd=cat /Users/post/code/joshua/test/pipeline/input/train.ur | gzip -9n &gt; data/train/train.ur.gz
   took 0 seconds (0s)
 ...
 </code></pre>
 </div>

 <p>In the run directory (here, the current directory), you will see the following files (among
 others, including intermediate files
 generated by the individual sub-steps):</p>

 <div class="highlighter-rouge"><pre class="highlight"><code>data/
     train/
         corpus.ur
         corpus.en
         thrax-input-file
     tune/
         corpus.ur -&gt; tune.tok.lc.ur
         corpus.en -&gt; tune.tok.lc.en
         grammar.filtered.gz
         grammar.glue
     test/
         corpus.ur -&gt; test.tok.lc.ur
         corpus.en -&gt; test.tok.lc.en
         grammar.filtered.gz
         grammar.glue
 alignments/
     0/
         [giza/berkeley aligner output files]
     1/
     ...
     training.align
 thrax-hiero.conf
 thrax.log
 grammar.gz
 lm.gz
 tune/
      decoder_command
      model/
            [model files]
      params.txt
      joshua.log
      mert.log
      joshua.config.final
      final-bleu
 test/
      model/
            [model files]
      output
      final-bleu
 </code></pre>
 </div>

 <p>These files will be described in more detail in subsequent sections of this tutorial.</p>

 <p>As noted above, the <code class="highlighter-rouge">--rundir DIR</code> flag makes the pipeline chdir() to the specified directory before
 running.  By default the rundir is the current directory.  Changing it can be useful
 for organizing related pipeline runs.  In fact, we highly recommend
 that you organize your runs using consecutive integers, also taking a
 minute to pass a short note with the <code class="highlighter-rouge">--readme</code> flag, which allows you
 to quickly generate reports on <a href="#managing">groups of related experiments</a>.
 Relative paths specified to other flags (e.g., to <code class="highlighter-rouge">--corpus</code>
 or <code class="highlighter-rouge">--lmfile</code>) are relative to the directory the pipeline was called <em>from</em>, not the rundir itself
 (unless they happen to be the same, of course).</p>
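
 <p>A sketch of this convention, assuming the command is issued from an experiment directory such as
 <code class="highlighter-rouge">$HOME/expts/joshua</code> that contains the <code class="highlighter-rouge">input/</code> files from the basic run.  The README text and the
 language codes are illustrative.</p>

 <div class="highlighter-rouge"><pre class="highlight"><code># Run number 1 of a group of experiments.  The input/ prefixes are resolved
 # relative to the directory this command is called from, not relative to 1/.
 $JOSHUA/bin/pipeline.pl            \
   --rundir 1                       \
   --readme "Baseline Hiero system" \
   --type hiero                     \
   --corpus input/train             \
   --tune input/tune                \
   --test input/test                \
   --source ur                      \
   --target en
 </code></pre>
 </div>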

 <p>The complete pipeline comprises many tens of small steps, which can be grouped together into a set
 of traditional pipeline tasks:</p>

 <ol>
   <li><a href="#prep">Data preparation</a></li>
   <li><a href="#alignment">Alignment</a></li>
   <li><a href="#parsing">Parsing</a> (syntax-based grammars only)</li>
   <li><a href="#tm">Grammar extraction</a></li>
   <li><a href="#lm">Language model building</a></li>
   <li><a href="#tuning">Tuning</a></li>
   <li><a href="#testing">Testing</a></li>
   <li><a href="#analysis">Analysis</a></li>
 </ol>

 <p>These steps are discussed below, after a few intervening sections about high-level details of the
 pipeline.</p>

 <h2 id="a-idmanaging--managing-groups-of-experiments"><a id="managing"></a> Managing groups of experiments</h2>

 <p>The real utility of the pipeline comes when you use it to manage groups of experiments. Typically,
 there is a held-out test set, and we want to vary a number of training parameters to determine what
 effect this has on BLEU scores or some other metric. Joshua comes with a script
 <code class="highlighter-rouge">$JOSHUA/scripts/training/summarize.pl</code> that collects information from a group of runs and reports
 them to you. This script works so long as you organize your runs as follows:</p>

 <ol>
   <li>
     <p>Your runs should be grouped together in a root directory, which I’ll call <code class="highlighter-rouge">$EXPDIR</code>.</p>
   </li>
   <li>
     <p>For comparison purposes, the runs should all be evaluated on the same test set.</p>
   </li>
   <li>
     <p>Each run in the run group should be in its own numbered directory, shown with the files used by
 the summarize script:</p>

     <div class="highlighter-rouge"><pre class="highlight"><code>$EXPDIR/
     1/
         README.txt
         test/
             final-bleu
             final-times
         [other files]
     2/
         README.txt
         test/
             final-bleu
             final-times
         [other files]
         ...
 </code></pre>
     </div>
   </li>
 </ol>

 <p>You can get such directories using the <code class="highlighter-rouge">--rundir N</code> flag to the pipeline. </p>

 <p>Run directories can build off each other. For example, <code class="highlighter-rouge">1/</code> might contain a complete baseline
 run. If you want to change only the tuner, you don’t need to rerun the aligner and model builder,
 so you can reuse the results by supplying the second run with the information it needs that was
 computed in run 1:</p>

 <div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/pipeline.pl \
   --first-step tune \
   --grammar 1/grammar.gz \
   ...
 </code></pre>
 </div>

 <p>More details are below.</p>

 <h2 id="grammar-options">Grammar options</h2>

 <p>Hierarchical Joshua can extract three types of grammars: Hiero
 grammars, GHKM, and SAMT grammars.  As described on the
 <a href="file-formats.html">file formats page</a>, all of them are encoded into
 the same file format, but they differ in terms of the richness of
 their nonterminal sets.</p>

 <p>Hiero grammars make use of a single nonterminal, and are extracted by computing phrases from
 word-based alignments and then subtracting out phrase differences.  More detail can be found in
 <a href="http://www.mitpressjournals.org/doi/abs/10.1162/coli.2007.33.2.201">Chiang (2007) [PDF]</a>.
 <a href="http://www.isi.edu/%7Emarcu/papers/cr_ghkm_naacl04.pdf">GHKM</a> (new with 5.0) and
 <a href="http://www.cs.cmu.edu/~zollmann/samt/">SAMT</a> grammars make use of a source- or target-side parse
 tree on the training data, differing in the way they extract rules using these trees: GHKM extracts
 synchronous tree substitution grammar rules rooted in a subset of the tree constituents, whereas
 SAMT projects constituent labels down onto phrases.  SAMT grammars are usually many times larger and
 are much slower to decode with, but sometimes increase BLEU score.  Hiero and SAMT grammars are
 extracted with the <a href="thrax.html">Thrax software</a>.</p>

 <p>By default, the Joshua pipeline extracts a Hiero grammar, but this can be altered with the <code class="highlighter-rouge">--type
 (ghkm|samt)</code> flag. For GHKM grammars, the default is to use
 <a href="http://www-nlp.stanford.edu/~mgalley/software/stanford-ghkm-latest.tar.gz">Michel Galley’s extractor</a>,
 but you can also use Moses’ extractor with <code class="highlighter-rouge">--ghkm-extractor moses</code>. Galley’s extractor only outputs
 two features, so the scores tend to be significantly lower than those of Moses’ extractor.</p>

 <p>Joshua (new in version 6) also includes an unlexicalized phrase-based
 decoder. Building a phrase-based model requires you to have Moses
 installed, since its <code class="highlighter-rouge">train-model.perl</code> script is used to extract the
 phrase table. You can enable this by defining the <code class="highlighter-rouge">$MOSES</code> environment
 variable and then specifying <code class="highlighter-rouge">--type phrase</code>.</p>
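
 <p>The grammar type is selected per run.  The sketch below assumes the same <code class="highlighter-rouge">input/</code> layout as the basic run;
 the run numbers, README notes, language codes, and the Moses path are all illustrative.</p>

 <div class="highlighter-rouge"><pre class="highlight"><code># SAMT grammar: the (English) target side is parsed by the pipeline.
 $JOSHUA/bin/pipeline.pl --rundir 2 --readme "SAMT grammar" --type samt \
   --corpus input/train --tune input/tune --test input/test \
   --source ur --target en

 # Phrase-based model: requires Moses.  The path below is hypothetical.
 export MOSES=$HOME/code/mosesdecoder
 $JOSHUA/bin/pipeline.pl --rundir 3 --readme "Phrase-based" --type phrase \
   --corpus input/train --tune input/tune --test input/test \
   --source ur --target en
 </code></pre>
 </div>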

 <h2 id="other-high-level-options">Other high-level options</h2>

 <p>The following command-line arguments control run-time behavior of multiple steps:</p>

 <ul>
   <li>
     <p><code class="highlighter-rouge">--threads N</code> (1)</p>

     <p>This enables multithreaded operation for a number of steps: alignment (with GIZA, max two
 threads), parsing, and decoding (any number of threads).</p>
   </li>
   <li>
     <p><code class="highlighter-rouge">--jobs N</code> (1)</p>

     <p>This enables parallel operation over a cluster using the qsub command.  This feature is not
 well-documented at this point, but you will likely want to edit the file
 <code class="highlighter-rouge">$JOSHUA/scripts/training/parallelize/LocalConfig.pm</code> to set up your qsub environment, and may also
 want to pass specific qsub arguments via the <code class="highlighter-rouge">--qsub-args "ARGS"</code>
 flag. We suggest you stick to the standard Joshua model that
 tries to use as many cores as are available with the <code class="highlighter-rouge">--threads N</code> option.</p>
   </li>
 </ul>
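
 <p>For a single multicore machine, a run might look like the following sketch.  The thread count of 8 is an arbitrary
 illustrative value, and the language codes are again assumptions.</p>

 <div class="highlighter-rouge"><pre class="highlight"><code># Up to 8 threads for parsing and decoding; GIZA++ alignment will still
 # use at most two of them.
 $JOSHUA/bin/pipeline.pl --rundir . --type hiero --threads 8 \
   --corpus input/train --tune input/tune --test input/test \
   --source ur --target en
 </code></pre>
 </div>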

 <h2 id="restarting-failed-runs">Restarting failed runs</h2>

 <p>If the pipeline dies, you can restart it with the same command you used the first time.  If you
 rerun the pipeline with the exact same invocation as the previous run (or an overlapping
 configuration – one that causes the same set of behaviors), you will see slightly different
 output compared to what we saw above:</p>

 <div class="highlighter-rouge"><pre class="highlight"><code>[train-copy-en] cached, skipping...
 [train-copy-ur] cached, skipping...
 ...
 </code></pre>
 </div>

 <p>This indicates that the caching module has discovered that the step was already computed and thus
 did not need to be rerun.  This feature is quite useful for restarting pipeline runs that have
 crashed due to bugs, memory limitations, hardware failures, and the myriad other problems that
 plague MT researchers across the world.</p>

 <p>Often, a command will die because it was parameterized incorrectly.  For example, perhaps the
 decoder ran out of memory.  Caching allows you to adjust the parameter (e.g., <code class="highlighter-rouge">--joshua-mem</code>) and rerun
 the script without repeating the steps that already succeeded.  Of course, if you change one of the parameters a step depends on, it will trigger a
 rerun, which in turn might trigger further downstream reruns.</p>
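
 <p>For instance, a run that died in the decoder could be restarted with the original command plus a larger memory
 setting.  In this sketch the value <code class="highlighter-rouge">16g</code> is an assumption, modeled on the <code class="highlighter-rouge">10g</code> format given for
 <code class="highlighter-rouge">--aligner-mem</code> elsewhere on this page; check the accepted format for your version.</p>

 <div class="highlighter-rouge"><pre class="highlight"><code># Same invocation as the failed run, with more decoder memory.  Steps whose
 # dependencies are unchanged should be reported as "cached, skipping...".
 $JOSHUA/bin/pipeline.pl --rundir . --type hiero --joshua-mem 16g \
   --corpus input/train --tune input/tune --test input/test \
   --source ur --target en
 </code></pre>
 </div>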

 <h2 id="a-idsteps--skipping-steps-quitting-early"><a id="steps"></a> Skipping steps, quitting early</h2>

 <p>You will also find it useful to start the pipeline somewhere other than data preparation (for
 example, if you have already-processed data and an alignment, and want to begin with building a
 grammar) or to end it prematurely (if, say, you don’t have a test set and just want to tune a
 model).  This can be accomplished with the <code class="highlighter-rouge">--first-step</code> and <code class="highlighter-rouge">--last-step</code> flags, which take as
 argument a case-insensitive version of the following steps:</p>

 <ul>
   <li>
     <p><em>FIRST</em>: Data preparation.  Everything begins with data preparation.  This is the default first
  step, so there is no need to be explicit about it.</p>
   </li>
   <li>
     <p><em>ALIGN</em>: Alignment.  You might want to start here if you want to skip data preprocessing.</p>
   </li>
   <li>
     <p><em>PARSE</em>: Parsing.  This is only relevant for building SAMT and GHKM grammars (<code class="highlighter-rouge">--type samt</code> or <code class="highlighter-rouge">--type ghkm</code>), in which case
  the target side (<code class="highlighter-rouge">--target</code>) of the training data (<code class="highlighter-rouge">--corpus</code>) is parsed before building a
  grammar.</p>
   </li>
   <li>
     <p><em>THRAX</em>: Grammar extraction <a href="thrax.html">with Thrax</a>.  If you jump to this step, you’ll need to
  provide an aligned corpus (<code class="highlighter-rouge">--alignment</code>) along with your parallel data.  </p>
   </li>
   <li>
     <p><em>TUNE</em>: Tuning.  The exact tuning method is determined with <code class="highlighter-rouge">--tuner {mert,mira,pro}</code>.  With this
  option, you need to specify a grammar (<code class="highlighter-rouge">--grammar</code>) or separate tune (<code class="highlighter-rouge">--tune-grammar</code>) and test
  (<code class="highlighter-rouge">--test-grammar</code>) grammars.  A full grammar (<code class="highlighter-rouge">--grammar</code>) will be filtered against the relevant
  tuning or test set unless you specify <code class="highlighter-rouge">--no-filter-tm</code>.  If you want a language model built from
  the target side of your training data, you’ll also need to pass in the training corpus
  (<code class="highlighter-rouge">--corpus</code>).  You can also specify an arbitrary number of additional language models with one or
  more <code class="highlighter-rouge">--lmfile</code> flags.</p>
   </li>
   <li>
     <p><em>TEST</em>: Testing.  If you have a tuned model file, you can test new corpora by passing in a test
  corpus with references (<code class="highlighter-rouge">--test</code>).  You’ll need to provide a run name (<code class="highlighter-rouge">--name</code>) to store the
  results of this run, which will be placed under <code class="highlighter-rouge">test/NAME</code>.  You’ll also need to provide a
  Joshua configuration file (<code class="highlighter-rouge">--joshua-config</code>), one or more language models (<code class="highlighter-rouge">--lmfile</code>), and a
  grammar (<code class="highlighter-rouge">--grammar</code>).  The grammar will be filtered to the test data unless you specify
  <code class="highlighter-rouge">--no-filter-tm</code> or directly provide a filtered test grammar (<code class="highlighter-rouge">--test-grammar</code>).</p>
   </li>
   <li>
     <p><em>LAST</em>: The last step.  This is the default target of <code class="highlighter-rouge">--last-step</code>.</p>
   </li>
 </ul>
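
 <p>Two sketches of these flags in use.  The file locations under <code class="highlighter-rouge">1/</code> follow the run-directory listing shown
 earlier; the test prefix <code class="highlighter-rouge">input/newtest</code>, the run name, and the language codes are hypothetical.</p>

 <div class="highlighter-rouge"><pre class="highlight"><code># Stop after tuning (no test set is supplied).
 $JOSHUA/bin/pipeline.pl --rundir . --type hiero --last-step tune \
   --corpus input/train --tune input/tune \
   --source ur --target en

 # Decode a new test set with the model tuned in run 1.  Results are placed
 # under test/newtest in the run directory.
 $JOSHUA/bin/pipeline.pl --rundir 1 --first-step test --name newtest \
   --test input/newtest \
   --joshua-config 1/tune/joshua.config.final \
   --lmfile 1/lm.gz \
   --grammar 1/grammar.gz \
   --source ur --target en
 </code></pre>
 </div>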

 <p>We now discuss these steps in more detail.</p>

 <h3 id="a-idprep--1-data-preparation"><a id="prep"></a> 1. DATA PREPARATION</h3>

 <p>Data preparation involves applying the steps below to each of the training data (<code class="highlighter-rouge">--corpus</code>), tuning data
 (<code class="highlighter-rouge">--tune</code>), and testing data (<code class="highlighter-rouge">--test</code>).  Each of these values is an absolute or relative path
 prefix.  To each of these prefixes, a “.” is appended, followed by each of SOURCE (<code class="highlighter-rouge">--source</code>) and
 TARGET (<code class="highlighter-rouge">--target</code>), which are file extensions identifying the languages.  The SOURCE and TARGET
 files must have the same number of lines.  </p>

 <p>For tuning and test data, multiple references are handled automatically.  A single reference will
 have the format TUNE.TARGET, while multiple references will have the format TUNE.TARGET.NUM, where
 NUM starts at 0 and increments for as many references as there are.</p>
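
 <p>Since mismatched line counts are an easy mistake to make, it can be worth checking the files before starting a
 run.  The helper below is a hypothetical convenience function, not part of Joshua; it simply compares the line
 counts of two files.</p>

 <div class="highlighter-rouge"><pre class="highlight"><code># Compare the line counts of two files that should be sentence-parallel.
 # Prints "OK N" when both have N lines, "MISMATCH A B" otherwise.
 check_parallel() {
   src_lines=$(( $(cat "$1" | wc -l) ))
   tgt_lines=$(( $(cat "$2" | wc -l) ))
   if [ "$src_lines" -eq "$tgt_lines" ]; then
     echo "OK $src_lines"
   else
     echo "MISMATCH $src_lines $tgt_lines"
   fi
 }

 # Example: check_parallel input/train.ur input/train.en
 </code></pre>
 </div>

 <p>With multiple references, the same check would be repeated for each TUNE.TARGET.NUM file against the source
 side.</p>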

 <p>The following processing steps are applied to each file.</p>

 <ol>
   <li>
     <p><strong>Copying</strong> the files into <code class="highlighter-rouge">$RUNDIR/data/TYPE</code>, where TYPE is one of “train”, “tune”, or “test”.
 Multiple <code class="highlighter-rouge">--corpus</code> files are concatenated in the order they are specified.  Multiple <code class="highlighter-rouge">--tune</code>
 and <code class="highlighter-rouge">--test</code> flags are not currently allowed.</p>
   </li>
   <li>
     <p><strong>Normalizing</strong> punctuation and text (e.g., removing extra spaces, converting special
 quotations).  There are a few language-specific options that depend on the file extension
 matching the <a href="http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes">two-letter ISO 639-1</a>
 designation.</p>
   </li>
   <li>
     <p><strong>Tokenizing</strong> the data (e.g., separating out punctuation, converting brackets).  Again, there
 are language-specific tokenizations for a few languages (English, German, and Greek).</p>
   </li>
   <li>
     <p>(Training only) <strong>Removing</strong> all parallel sentences with more than <code class="highlighter-rouge">--maxlen</code> tokens on either
 side.  By default, MAXLEN is 50.  To turn this off, specify <code class="highlighter-rouge">--maxlen 0</code>.</p>
   </li>
   <li>
     <p><strong>Lowercasing</strong>.</p>
   </li>
 </ol>

 <p>This creates a series of intermediate files which are saved for posterity but compressed.  For
 example, you might see</p>

 <div class="highlighter-rouge"><pre class="highlight"><code>data/
     train/
         train.en.gz
         train.tok.en.gz
         train.tok.50.en.gz
         train.tok.50.lc.en
         corpus.en -&gt; train.tok.50.lc.en
 </code></pre>
 </div>

 <p>The file “corpus.LANG” is a symbolic link to the last file in the chain.  </p>

 <h3 id="alignment-a-idalignment-">2. ALIGNMENT <a id="alignment"></a></h3>

 <p>Alignments are computed between the parallel corpora at <code class="highlighter-rouge">$RUNDIR/data/train/corpus.{SOURCE,TARGET}</code>.  To
 prevent the alignment tables from getting too big, the parallel corpora are split into blocks of no
 more than ALIGNER_CHUNK_SIZE sentence pairs (controlled with a parameter below).  The last block is folded
 into the penultimate block if it is too small.  These chunked files are all created in a
 subdirectory of <code class="highlighter-rouge">$RUNDIR/data/train/splits</code>, named <code class="highlighter-rouge">corpus.LANG.0</code>, <code class="highlighter-rouge">corpus.LANG.1</code>, and so on.</p>

 <p>The pipeline parameters affecting alignment are:</p>

 <ul>
   <li>
     <p><code class="highlighter-rouge">--aligner ALIGNER</code> {giza (default), berkeley, jacana}</p>

     <p>Which aligner to use.  The default is <a href="http://code.google.com/p/giza-pp/">GIZA++</a>, but
 <a href="http://code.google.com/p/berkeleyaligner/">the Berkeley aligner</a> can be used instead.  When
 using the Berkeley aligner, you’ll want to pay attention to how much memory you allocate to it
 with <code class="highlighter-rouge">--aligner-mem</code> (the default is 10g).</p>
   </li>
   <li>
     <p><code class="highlighter-rouge">--aligner-chunk-size SIZE</code> (1,000,000)</p>

     <p>The number of sentence pairs to compute alignments over. The training data is split into blocks
 of this size, aligned separately, and then concatenated.</p>
   </li>
   <li>
     <p><code class="highlighter-rouge">--alignment FILE</code></p>

     <p>If you have an already-computed alignment, you can pass that to the script using this flag.
 Note that, in this case, you will want to skip data preparation and alignment using
 <code class="highlighter-rouge">--first-step thrax</code> (the first step after alignment) and also to specify <code class="highlighter-rouge">--no-prepare</code> so
 as not to retokenize the data and mess with your alignments.</p>

     <p>The alignment file format is the standard format where 0-indexed many-many alignment pairs for a
 sentence are provided on a line, source language first, e.g.,</p>

     <p>0-0 0-1 1-2 1-7 …</p>

     <p>This value is required if you start at the grammar extraction step.</p>
   </li>
 </ul>
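
 <p>As a minimal sketch of the alignment format described above, the following Python reads one
 line of 0-indexed <code class="highlighter-rouge">source-target</code> pairs and checks that every link falls inside both
 sentences.  The function names are hypothetical and are not part of Joshua.</p>

```python
def parse_alignment_line(line):
    """Parse a line such as "0-0 0-1 1-2 1-7" into (source, target) index pairs."""
    pairs = []
    for token in line.split():
        source_index, target_index = token.split("-")
        pairs.append((int(source_index), int(target_index)))
    return pairs


def alignment_fits(pairs, source_tokens, target_tokens):
    """Return True if every 0-indexed link points inside both sentences."""
    return all(
        0 <= s < len(source_tokens) and 0 <= t < len(target_tokens)
        for s, t in pairs
    )


if __name__ == "__main__":
    links = parse_alignment_line("0-0 0-1 1-2 1-7")
    print(links)
```

 <p>A check like this can be useful before passing a precomputed file with <code class="highlighter-rouge">--alignment</code>,
 because retokenizing the corpus after alignment changes the token indices.</p>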

 <p>When alignment is complete, the alignment file can be found at <code class="highlighter-rouge">$RUNDIR/alignments/training.align</code>.
 It is parallel to the training corpora.  There are many files in the <code class="highlighter-rouge">alignments/</code> subdirectory that
 contain the output of intermediate steps.</p>

 <h2 id="a-idparsing--3-parsing"><a id="parsing"></a> 3. PARSING</h2>

 <p>To build SAMT and GHKM grammars (<code class="highlighter-rouge">--type samt</code> and <code class="highlighter-rouge">--type ghkm</code>), the target side of the
 training data must be parsed. The pipeline assumes your target side will be English, and will parse
 it for you using <a href="http://code.google.com/p/berkeleyparser/">the Berkeley parser</a>, which is included.
 If English is not your target language, the target side of your training
 data (found at CORPUS.TARGET) must already be parsed in PTB format.  The pipeline will notice that
 it is parsed and will not reparse it.</p>

 <p>Parsing is affected by both the <code class="highlighter-rouge">--threads N</code> and <code class="highlighter-rouge">--jobs N</code> options.  The former runs the parser in
 multithreaded mode, while the latter distributes the runs across a cluster (and requires some
 configuration, not yet documented).  The options are mutually exclusive.</p>

 <p>Once the parsing is complete, there will be two parsed files:</p>

 <ul>
   <li><code class="highlighter-rouge">$RUNDIR/data/train/corpus.en.parsed</code>: this is the mixed-case file that was parsed.</li>
   <li><code class="highlighter-rouge">$RUNDIR/data/train/corpus.parsed.en</code>: this is a leaf-lowercased version of the above file used for
 grammar extraction.</li>
 </ul>
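
 <p>To make "leaf-lowercased" concrete, here is a small Python sketch that lowercases only the
 words of a PTB-style tree and leaves the nonterminal labels alone.  It assumes each leaf appears
 as <code class="highlighter-rouge">(TAG word)</code> with a single space, and it is an illustration rather than the code the
 pipeline runs.  Note that it would also lowercase escaped tokens such as <code class="highlighter-rouge">-LRB-</code>.</p>

```python
import re

# A preterminal looks like "(TAG word)": a tag, one space, and a token that
# contains no parentheses or whitespace.
_PRETERMINAL = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def lowercase_leaves(tree):
    """Lowercase the words of a PTB-style tree, keeping the labels unchanged."""
    return _PRETERMINAL.sub(
        lambda match: "({} {})".format(match.group(1), match.group(2).lower()),
        tree,
    )


if __name__ == "__main__":
    print(lowercase_leaves("(S (NP (DT The) (NN Cat)) (VP (VBZ Sleeps)))"))
```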

 <h2 id="thrax-grammar-extraction-a-idtm-">4. THRAX (grammar extraction) <a id="tm"></a></h2>

 <p>The grammar extraction step takes three pieces of data: (1) the source-language training corpus, (2)
 the target-language training corpus (parsed, if an SAMT grammar is being extracted), and (3) the
 alignment file.  From these, it computes a synchronous context-free grammar.  If you already have a
 grammar and wish to skip this step, you can do so by passing the grammar with the <code class="highlighter-rouge">--grammar
 /path/to/grammar</code> flag.</p>

 <p>The main variable in grammar extraction is Hadoop.  If you have a Hadoop installation, simply ensure
 that the environment variable <code class="highlighter-rouge">$HADOOP</code> is defined, and Thrax will seamlessly use it.  If you <em>do
 not</em> have a Hadoop installation, the pipeline will roll one out for you, running Hadoop in
 standalone mode (this mode is triggered when <code class="highlighter-rouge">$HADOOP</code> is undefined).  Theoretically, any grammar
 extractable on a full Hadoop cluster should be extractable in standalone mode, if you are patient
 enough; in practice, you probably are not patient enough, and will be limited to smaller
 datasets. You may also run into problems with disk space; Hadoop uses a lot (use <code class="highlighter-rouge">--tmp
 /path/to/tmp</code> to specify an alternate place for temporary data; we suggest you use a local disk
 partition with tens or hundreds of gigabytes free, and not an NFS partition).  Setting up your own
 Hadoop cluster is not too difficult a chore; in particular, you may find it helpful to install a
 <a href="http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html">pseudo-distributed version of Hadoop</a>.
 In our experience, this works fine, but you should note the following caveats:</p>

 <ul>
   <li>It is of crucial importance that you have enough physical disks.  We have found that having too
 few disks, or disks that are too slow, results in a whole host of seemingly unrelated issues that are hard to
 resolve, such as timeouts.  </li>
   <li>NFS filesystems can cause lots of problems.  You should really try to install physical disks that
 are dedicated to Hadoop scratch space.</li>
 </ul>

 <p>Here are some flags relevant to Hadoop and grammar extraction with Thrax:</p>

 <ul>
   <li>
     <p><code class="highlighter-rouge">--hadoop /path/to/hadoop</code></p>

     <p>This sets the location of Hadoop (overriding the environment variable <code class="highlighter-rouge">$HADOOP</code>).</p>
   </li>
   <li>
     <p><code class="highlighter-rouge">--hadoop-mem MEM</code> (2g)</p>

     <p>This alters the amount of memory available to Hadoop mappers (passed via the
 <code class="highlighter-rouge">mapred.child.java.opts</code> option).</p>
   </li>
   <li>
     <p><code class="highlighter-rouge">--thrax-conf FILE</code></p>

     <p>Use the provided Thrax configuration file instead of the (grammar-specific) default.  The Thrax
  templates are located at <code class="highlighter-rouge">$JOSHUA/scripts/training/templates/thrax-TYPE.conf</code>, where TYPE is one
  of “hiero” or “samt”.</p>
   </li>
 </ul>

 <p>When the grammar is extracted, it is compressed and placed at <code class="highlighter-rouge">$RUNDIR/grammar.gz</code>.</p>

 <h2 id="a-idlm--5-language-model"><a id="lm"></a> 5. LANGUAGE MODEL</h2>

 <p>Before tuning can take place, a language model is needed.  A language model is always built from the
 target side of the training corpus unless <code class="highlighter-rouge">--no-corpus-lm</code> is specified.  In addition, you can
 provide other language models (any number of them) with the <code class="highlighter-rouge">--lmfile FILE</code> argument.  Other
 arguments are as follows.</p>

 <ul>
   <li>
     <p><code class="highlighter-rouge">--lm</code> {kenlm (default), berkeleylm}</p>

     <p>This determines the language model code that will be used when decoding.  These implementations
 are described in their respective papers (PDFs:
 <a href="http://kheafield.com/professional/avenue/kenlm.pdf">KenLM</a>,
 <a href="http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf">BerkeleyLM</a>). KenLM is written in
 C++ and requires a pass through the JNI, but is recommended because it supports left-state minimization.</p>
   </li>
   <li>
     <p><code class="highlighter-rouge">--lmfile FILE</code></p>

     <p>Specifies a pre-built language model to use when decoding.  This language model can be in ARPA
 format, in KenLM format when using KenLM, or in BerkeleyLM format when using BerkeleyLM.</p>
   </li>
   <li>
     <p><code class="highlighter-rouge">--lm-gen</code> {kenlm (default), srilm, berkeleylm}, <code class="highlighter-rouge">--buildlm-mem MEM</code>, <code class="highlighter-rouge">--witten-bell</code></p>

     <p>At the tuning step, an LM is built from the target side of the training data (unless
 <code class="highlighter-rouge">--no-corpus-lm</code> is specified).  This controls which code is used to build it.  The default is
 KenLM’s <a href="http://kheafield.com/code/kenlm/estimation/">lmplz</a>, which is strongly recommended.</p>

     <p>If SRILM is used, it is called with the following arguments:</p>

     <div class="highlighter-rouge"><pre class="highlight"><code>  $SRILM/bin/i686-m64/ngram-count -interpolate SMOOTHING -order 5 -text TRAINING-DATA -unk -lm lm.gz
 </code></pre>
     </div>

     <p>Where SMOOTHING is <code class="highlighter-rouge">-kndiscount</code>, or <code class="highlighter-rouge">-wbdiscount</code> if <code class="highlighter-rouge">--witten-bell</code> is passed to the pipeline.</p>

     <p>A <a href="http://code.google.com/p/berkeleylm/source/browse/trunk/src/edu/berkeley/nlp/lm/io/MakeKneserNeyArpaFromText.java">BerkeleyLM Java class</a>
 is also available. It computes a Kneser-Ney LM with constant discounting (0.75) and no count
 thresholding.  The flag <code class="highlighter-rouge">--buildlm-mem</code> can be used to control how much memory is allocated to the
 Java process.  The default is “2g”, but you will want to increase it for larger language models.</p>

     <p>A language model built from the target side of the training data is placed at <code class="highlighter-rouge">$RUNDIR/lm.gz</code>.  </p>
   </li>
 </ul>
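
 <p>The SRILM invocation described above can be sketched as a small Python helper that picks the
 smoothing flag and assembles the argument list.  The helper name and its parameters are
 hypothetical; only the command-line arguments themselves come from the text above, and the
 <code class="highlighter-rouge">i686-m64</code> directory may differ on your machine.</p>

```python
def srilm_command(training_data, witten_bell=False, srilm_root="$SRILM"):
    """Assemble the ngram-count argument list quoted in the documentation.

    witten_bell mirrors the pipeline's --witten-bell flag: it swaps
    -kndiscount for -wbdiscount.
    """
    smoothing = "-wbdiscount" if witten_bell else "-kndiscount"
    return [
        srilm_root + "/bin/i686-m64/ngram-count",
        "-interpolate", smoothing,
        "-order", "5",
        "-text", training_data,
        "-unk",
        "-lm", "lm.gz",
    ]


if __name__ == "__main__":
    print(" ".join(srilm_command("TRAINING-DATA", witten_bell=True)))
```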

 <h2 id="interlude-decoder-arguments">Interlude: decoder arguments</h2>

 <p>Running the decoder is done in both the tuning stage and the testing stage.  A critical point is
 that you have to give the decoder enough memory to run.  Joshua can be very memory-intensive, in
 particular when decoding with large grammars and large language models.  The default amount of
 memory is 3100m, which is likely not enough (especially if you are decoding with an SAMT grammar).  You
 can alter the amount of memory for Joshua using the <code class="highlighter-rouge">--joshua-mem MEM</code> argument, where MEM is a Java
 memory specification (passed to its <code class="highlighter-rouge">-Xmx</code> flag).</p>

 <h2 id="a-idtuning--6-tuning"><a id="tuning"></a> 6. TUNING</h2>

 <p>Two optimizers are provided with Joshua: MERT and PRO (<code class="highlighter-rouge">--tuner {mert,pro}</code>).  If Moses is
 installed, you can also use Cherry &amp; Foster’s k-best batch MIRA (<code class="highlighter-rouge">--tuner mira</code>, recommended).
 Tuning is run till convergence in the <code class="highlighter-rouge">$RUNDIR/tune</code> directory.</p>

 <p>When tuning is finished, the final configuration file can be found at</p>

 <div class="highlighter-rouge"><pre class="highlight"><code>$RUNDIR/tune/joshua.config.final
 </code></pre>
 </div>

 <h2 id="a-idtesting--7-testing"><a id="testing"></a> 7. TESTING</h2>

 <p>For each of the tuner runs, Joshua takes the tuner output file and decodes the test set.  If you
 like, you can also apply minimum Bayes-risk decoding to the decoder output with <code class="highlighter-rouge">--mbr</code>.  This
 usually yields a gain of about 0.3 to 0.5 BLEU points, but is time-consuming.</p>

 <p>After decoding the test set with each set of tuned weights, Joshua computes the mean BLEU score,
 writes it to <code class="highlighter-rouge">$RUNDIR/test/final-bleu</code>, and prints it to the terminal. It also writes a file
 <code class="highlighter-rouge">$RUNDIR/test/final-times</code> containing a summary of runtime information. That’s the end of the pipeline!</p>

 <p>Joshua also supports decoding further test sets.  This is enabled by rerunning the pipeline with a
 number of arguments:</p>

 <ul>
   <li>
     <p><code class="highlighter-rouge">--first-step TEST</code></p>

     <p>This tells the pipeline to start at the test step.</p>
   </li>
   <li>
     <p><code class="highlighter-rouge">--joshua-config CONFIG</code></p>

     <p>A tuned parameter file is required.  This file will be the output of some prior tuning run.
 Necessary pathnames and so on will be adjusted.</p>
   </li>
 </ul>

 <h2 id="a-idanalysis-8-analysis"><a id="analysis"></a> 8. ANALYSIS</h2>

 <p>If you have used the suggested layout, with a number of related runs all contained in a common
 directory with sequential numbers, you can use the script <code class="highlighter-rouge">$JOSHUA/scripts/training/summarize.pl</code> to
 display a summary of the mean BLEU scores from all runs, along with the text you placed in the run
 README file (using the pipeline’s <code class="highlighter-rouge">--readme TEXT</code> flag).</p>

 <h2 id="common-use-cases-and-pitfalls">COMMON USE CASES AND PITFALLS</h2>

 <ul>
   <li>
     <p>If the pipeline dies at the “thrax-run” stage with an error like the following:</p>

     <div class="highlighter-rouge"><pre class="highlight"><code>JOB FAILED (return code 1)
 hadoop/bin/hadoop: line 47:
 /some/path/to/a/directory/hadoop/bin/hadoop-config.sh: No such file or directory
 Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FsShell
 Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FsShell
 </code></pre>
     </div>

     <p>This occurs if the <code class="highlighter-rouge">$HADOOP</code> environment variable is set but does not point to a working
 Hadoop installation.  To fix it, make sure to unset the variable:</p>

     <div class="highlighter-rouge"><pre class="highlight"><code># in bash
 unset HADOOP
 </code></pre>
     </div>

     <p>and then rerun the pipeline with the same invocation.</p>
   </li>
   <li>
     <p>Memory usage is a major consideration in decoding with Joshua and hierarchical grammars.  In
 particular, SAMT grammars often require a large amount of memory.  Many steps have been taken to
 reduce memory usage, including beam settings and test-set- and sentence-level filtering of
 grammars.  However, memory usage can still be in the tens of gigabytes.</p>

     <p>To accommodate this kind of variation, the pipeline script allows you to specify both (a) the
 amount of memory used by the Joshua decoder instance and (b) the amount of memory required of
 nodes obtained by the qsub command.  These are set with the <code class="highlighter-rouge">--joshua-mem MEM</code> and
 <code class="highlighter-rouge">--qsub-args ARGS</code> options.  For example,</p>

     <div class="highlighter-rouge"><pre class="highlight"><code>pipeline.pl --joshua-mem 32g --qsub-args "-l pvmem=32g -q himem.q" ...
 </code></pre>
     </div>

     <p>Also, should Thrax fail, it might be due to a memory restriction. By default, Thrax requests 2 GB
 from the Hadoop server. If more memory is needed, set the memory requirement with the
 <code class="highlighter-rouge">--hadoop-mem</code> option, in the same way as the <code class="highlighter-rouge">--joshua-mem</code> option is used.</p>
   </li>
   <li>
     <p>Other pitfalls and advice will be added as they are discovered.</p>
   </li>
 </ul>
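
 <p>For the first pitfall above, a quick pre-flight check can tell you whether <code class="highlighter-rouge">$HADOOP</code> is
 unset (so the pipeline would use standalone mode) or set to a directory that does not look like a
 Hadoop installation.  The Python sketch below is a hypothetical helper, not part of Joshua, and it
 assumes the launcher lives at <code class="highlighter-rouge">$HADOOP/bin/hadoop</code>, as the error message above suggests.</p>

```python
import os


def hadoop_status(env=None):
    """Report on the HADOOP environment variable.

    Returns None if HADOOP is unset or empty, True if $HADOOP/bin/hadoop is an
    executable file, and False otherwise.  The bin/hadoop layout is an
    assumption based on the error message shown in the pitfall.
    """
    env = os.environ if env is None else env
    hadoop_home = env.get("HADOOP")
    if not hadoop_home:
        return None
    launcher = os.path.join(hadoop_home, "bin", "hadoop")
    return os.path.isfile(launcher) and os.access(launcher, os.X_OK)


if __name__ == "__main__":
    status = hadoop_status()
    if status is None:
        print("HADOOP is not set")
    elif status:
        print("HADOOP points to a directory with an executable bin/hadoop")
    else:
        print("HADOOP is set but bin/hadoop was not found; consider unsetting it")
```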

 <h2 id="feedback">FEEDBACK</h2>

 <p>Please email joshua_support@googlegroups.com with problems or suggestions.</p>



         </div>

       </div><!-- /.row -->


     </div><!-- /.container -->

     <!-- Bootstrap core JavaScript
     ================================================== -->
     <!-- Placed at the end of the document so the pages load faster -->
     <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
     <script src="../../dist/js/bootstrap.min.js"></script>
     <!-- <script src="../../assets/js/docs.min.js"></script> -->
     <!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
     <!-- <script src="../../assets/js/ie10-viewport-bug-workaround.js"></script>
     -->

     <!-- Start of StatCounter Code for Default Guide -->
     <script type="text/javascript">
       var sc_project=8264132;
       var sc_invisible=1;
       var sc_security="4b97fe2d";
     </script>
     <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script>
     <noscript>
       <div class="statcounter">
         <a title="hit counter joomla"
            href="http://statcounter.com/joomla/"
            target="_blank">
           <img class="statcounter"
                src="http://c.statcounter.com/8264132/0/4b97fe2d/1/"
                alt="hit counter joomla" />
         </a>
       </div>
     </noscript>
     <!-- End of StatCounter Code for Default Guide -->
   </body>
 </html>