<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
<meta name="author" content="">
<link rel="icon" href="../../favicon.ico">
<title>Joshua Documentation | Pipeline tutorial</title>
<!-- Bootstrap core CSS -->
<link href="/dist/css/bootstrap.min.css" rel="stylesheet">
<!-- Custom styles for this template -->
<link href="/joshua6.css" rel="stylesheet">
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="blog-nav">
<!-- <a class="blog-nav-item active" href="#">Joshua</a> -->
<a class="blog-nav-item" href="/">Joshua</a>
<!-- <a class="blog-nav-item" href="/6.0/whats-new.html">New features</a> -->
<a class="blog-nav-item" href="/language-packs/">Language packs</a>
<a class="blog-nav-item" href="/data/">Datasets</a>
<a class="blog-nav-item" href="/support/">Support</a>
<a class="blog-nav-item" href="/contributors.html">Contributors</a>
</nav>
</div>
</div>
<div class="container">
<div class="row">
<div class="col-sm-2">
<div class="sidebar-module">
<!-- <h4>About</h4> -->
<center>
<img src="/images/joshua-logo-small.png" />
<p>Joshua machine translation toolkit</p>
</center>
</div>
<hr>
<center>
<a href="/releases/current/" target="_blank"><button class="button">Download Joshua 6.0.5</button></a>
<br />
<a href="/releases/runtime/" target="_blank"><button class="button">Runtime only version</button></a>
<p>Released November 5, 2015</p>
</center>
<hr>
<!-- <div class="sidebar-module"> -->
<!-- <span id="download"> -->
<!-- <a href="http://joshua-decoder.org/downloads/joshua-6.0.tgz">Download</a> -->
<!-- </span> -->
<!-- </div> -->
<div class="sidebar-module">
<h4>Using Joshua</h4>
<ol class="list-unstyled">
<li><a href="/6.0/install.html">Installation</a></li>
<li><a href="/6.0/quick-start.html">Quick Start</a></li>
</ol>
</div>
<hr>
<div class="sidebar-module">
<h4>Building new models</h4>
<ol class="list-unstyled">
<li><a href="/6.0/pipeline.html">Pipeline</a></li>
<li><a href="/6.0/tutorial.html">Tutorial</a></li>
<li><a href="/6.0/faq.html">FAQ</a></li>
</ol>
</div>
<!--
<div class="sidebar-module">
<h4>Phrase-based</h4>
<ol class="list-unstyled">
<li><a href="/6.0/phrase.html">Training</a></li>
</ol>
</div>
-->
<hr>
<div class="sidebar-module">
<h4>Advanced</h4>
<ol class="list-unstyled">
<li><a href="/6.0/bundle.html">Building language packs</a></li>
<li><a href="/6.0/decoder.html">Decoder options</a></li>
<li><a href="/6.0/file-formats.html">File formats</a></li>
<li><a href="/6.0/packing.html">Packing TMs</a></li>
<li><a href="/6.0/large-lms.html">Building large LMs</a></li>
</ol>
</div>
<hr>
<div class="sidebar-module">
<h4>Developer</h4>
<ol class="list-unstyled">
<li><a href="https://github.com/joshua-decoder/joshua">Github</a></li>
<li><a href="http://cs.jhu.edu/~post/joshua-docs">Javadoc</a></li>
<li><a href="https://groups.google.com/forum/?fromgroups#!forum/joshua_developers">Mailing list</a></li>
</ol>
</div>
</div><!-- /.blog-sidebar -->
<div class="col-sm-8 blog-main">
<div class="blog-title">
<h2>Pipeline tutorial</h2>
</div>
<div class="blog-post">
<p>This document will walk you through using the pipeline in a variety of scenarios. Once you’ve gained a
sense for how the pipeline works, you can consult the <a href="pipeline.html">pipeline page</a> for a number of
other options available in the pipeline.</p>
<h2 id="download-and-setup">Download and Setup</h2>
<p>Download and install Joshua as described on the <a href="index.html">quick start page</a>, installing it under
<code class="highlighter-rouge">~/code/</code>. Once you’ve done that, make sure you have the following environment variables set:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>export JOSHUA=$HOME/code/joshua-v6.0.5
export JAVA_HOME=/usr/java/default
</code></pre>
</div>
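<p>As a quick sanity check, the sketch below reports whether each variable is set and points to an existing directory. The helper name <code class="highlighter-rouge">check_env_dir</code> is hypothetical and not part of Joshua; treat it as one possible way to catch a mistyped path early.</p>

```shell
# Report whether the named environment variable is set and points to a directory.
# check_env_dir is a hypothetical helper for illustration, not part of Joshua.
check_env_dir() {
    name=$1
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "$name is not set" >&2
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "$name points to a missing directory: $value" >&2
        return 1
    fi
    echo "$name=$value"
}

# Check both variables; problems are reported but do not abort the shell.
for var in JOSHUA JAVA_HOME; do
    check_env_dir "$var" || true
done
```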
<p>If you have a Hadoop installation, make sure you’ve set <code class="highlighter-rouge">$HADOOP</code> to point to it. For example, if the <code class="highlighter-rouge">hadoop</code> command is in <code class="highlighter-rouge">/usr/bin</code>,
you should type</p>
<div class="highlighter-rouge"><pre class="highlight"><code>export HADOOP=/usr
</code></pre>
</div>
<p>Joshua will find the binary and use it to submit jobs to your Hadoop cluster. If you don’t have one, just
make sure that <code class="highlighter-rouge">$HADOOP</code> is unset, and Joshua will roll one out for you and run it in
<a href="https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html">standalone mode</a>.</p>
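<p>To see which of these cases applies to your shell, a small check along the following lines may help. It assumes, following the example above, that the binary lives at <code class="highlighter-rouge">$HADOOP/bin/hadoop</code>. The function name <code class="highlighter-rouge">hadoop_mode</code> and the three labels it prints are illustrative only and are not something Joshua itself reports.</p>

```shell
# Classify the current $HADOOP setting.
# hadoop_mode is a hypothetical helper; the labels are for illustration only.
hadoop_mode() {
    if [ -z "${HADOOP:-}" ]; then
        # Unset: the tutorial says Joshua will roll out its own standalone Hadoop.
        echo "standalone"
    elif [ -x "$HADOOP/bin/hadoop" ]; then
        # Set, and the hadoop binary is where the tutorial's example implies.
        echo "cluster"
    else
        # Set, but no executable found at $HADOOP/bin/hadoop.
        echo "misconfigured"
    fi
}

echo "HADOOP setting looks like: $(hadoop_mode)"
```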
<h2 id="a-basic-pipeline-run">A basic pipeline run</h2>
<p>For today’s experiments, we’ll be building a Spanish–English system using data included in the
<a href="/data/fisher-callhome-corpus/">Fisher and CALLHOME translation corpus</a>. This
data was collected by translating transcribed speech from previous LDC releases.</p>
<p>Download the data and install it somewhere:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>mkdir -p ~/data
cd ~/data
wget --no-check-certificate -O fisher-callhome-corpus.zip https://github.com/joshua-decoder/fisher-callhome-corpus/archive/master.zip
unzip fisher-callhome-corpus.zip
</code></pre>
</div>
<p>Then define the environment variable <code class="highlighter-rouge">$FISHER</code> to point to it:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>cd ~/data/fisher-callhome-corpus-master
export FISHER=$(pwd)
</code></pre>
</div>
<h3 id="preparing-the-data">Preparing the data</h3>
<p>Inside the archive is the Fisher and CALLHOME Spanish–English data, which includes Kaldi-provided
ASR output and English translations of the Fisher and CALLHOME dataset transcriptions. Because of
licensing restrictions, we cannot distribute the Spanish transcripts, but if you have an LDC site
license, a script is provided to build them. You can type:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>./bin/build_fisher.sh /export/common/data/corpora/LDC/LDC2010T04
</code></pre>
</div>
<p>The first argument is the path to your LDC data release. This will create the files in <code class="highlighter-rouge">corpus/ldc</code>.</p>
<p>In <code class="highlighter-rouge">$FISHER/corpus</code>, there is a set of parallel directories for LDC transcripts (<code class="highlighter-rouge">ldc</code>), ASR output
(<code class="highlighter-rouge">asr</code>), oracle ASR output (<code class="highlighter-rouge">oracle</code>), and ASR lattice output (<code class="highlighter-rouge">plf</code>). The files look like this:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>$ ls corpus/ldc
callhome_devtest.en fisher_dev2.en.2 fisher_dev.en.2 fisher_test.en.2
callhome_evltest.en fisher_dev2.en.3 fisher_dev.en.3 fisher_test.en.3
callhome_train.en fisher_dev2.es fisher_dev.es fisher_test.es
fisher_dev2.en.0 fisher_dev.en.0 fisher_test.en.0 fisher_train.en
fisher_dev2.en.1 fisher_dev.en.1 fisher_test.en.1 fisher_train.es
</code></pre>
</div>
<p>If you don’t have the LDC transcripts, you can use the data in <code class="highlighter-rouge">corpus/asr</code> instead. We will now use
this data to build our own Spanish–English model using Joshua’s pipeline.</p>
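<p>Before starting a long run, it can be worth confirming that a corpus prefix has matching source and target files. The sketch below checks that both files exist and have the same number of lines. The helper <code class="highlighter-rouge">check_parallel</code> is hypothetical, and the fallback to a <code class="highlighter-rouge">.0</code> suffix is an assumption based on the multi-reference file names in the listing above.</p>

```shell
# Check that a corpus prefix has a source file and a target file with the
# same number of lines. check_parallel is a hypothetical helper, not part of Joshua.
check_parallel() {
    prefix=$1
    src=$2
    tgt=$3
    src_file="$prefix.$src"
    # Multi-reference sets in the listing use numbered suffixes (.en.0, .en.1, ...),
    # so fall back to the first reference when there is no plain target file.
    if [ -f "$prefix.$tgt" ]; then
        tgt_file="$prefix.$tgt"
    else
        tgt_file="$prefix.$tgt.0"
    fi
    if [ ! -f "$src_file" ] || [ ! -f "$tgt_file" ]; then
        echo "missing: $src_file or $tgt_file" >&2
        return 1
    fi
    src_lines=$(wc -l < "$src_file" | tr -d ' ')
    tgt_lines=$(wc -l < "$tgt_file" | tr -d ' ')
    if [ "$src_lines" -ne "$tgt_lines" ]; then
        echo "line count mismatch: $src_lines vs $tgt_lines" >&2
        return 1
    fi
    echo "ok: $prefix ($src_lines lines)"
}

# Only meaningful once $FISHER is set and the chosen directory has been built.
if [ -n "${FISHER:-}" ]; then
    check_parallel "$FISHER/corpus/ldc/fisher_train" es en || true
fi
```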
<h3 id="run-the-pipeline">Run the pipeline</h3>
<p>Create an experiments directory to hold your first experiment. <em>Note: it’s important that
this <strong>not</strong> be inside your <code class="highlighter-rouge">$JOSHUA</code> directory</em>.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>mkdir -p ~/expts/joshua
cd ~/expts/joshua
</code></pre>
</div>
<p>We will now create the baseline run, using a particular directory structure for experiments that
will allow us to take advantage of scripts provided with Joshua for displaying the results of many
related experiments. Because this can take quite some time to run, we are going to reduce the size of
the model considerably by restricting the training data: Joshua will only use sentence pairs in the
training set whose Spanish and English sides each contain ten or fewer words:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>cd ~/expts/joshua
$JOSHUA/bin/pipeline.pl \
--rundir 1 \
--readme "Baseline Hiero run" \
--source es \
--target en \
--type hiero \
--corpus $FISHER/corpus/ldc/fisher_train \
--tune $FISHER/corpus/ldc/fisher_dev \
--test $FISHER/corpus/ldc/fisher_dev2 \
--maxlen 10 \
--lm-order 3
</code></pre>
</div>
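<p>This will start the pipeline building a Spanish–English translation system constructed from the
training data (<code class="highlighter-rouge">fisher_train</code>), tuned against <code class="highlighter-rouge">fisher_dev</code>, and tested against <code class="highlighter-rouge">fisher_dev2</code>. It will use the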
default values for most of the pipeline: <a href="https://code.google.com/p/giza-pp/">GIZA++</a> for alignment,
KenLM’s <code class="highlighter-rouge">lmplz</code> for building the language model, Z-MERT for tuning, KenLM with left-state
minimization for representing LM state in the decoder, and so on. We change the order of the n-gram
model to 3 (from its default of 5) because there is not enough data to build a 5-gram LM.</p>
<p>A few notes:</p>
<ul>
<li>
<p>This will likely take many hours to run, especially if you don’t have a Hadoop cluster.</p>
</li>
<li>
<p>If you are running on Mac OS X, KenLM’s <code class="highlighter-rouge">lmplz</code> will not build due to the absence of static
libraries. In that case, you should add the flag <code class="highlighter-rouge">--lm-gen srilm</code> (recommended, if SRILM is
installed) or <code class="highlighter-rouge">--lm-gen berkeleylm</code>.</p>
</li>
</ul>
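<p>To get a feel for how much data the <code class="highlighter-rouge">--maxlen 10</code> restriction in the baseline run leaves, the effect can be approximated with standard tools. This is only an illustration of the idea, assuming whitespace-separated tokens; it is not Joshua’s own implementation, and <code class="highlighter-rouge">filter_by_length</code> is a hypothetical name.</p>

```shell
# Print (tab-separated) only those sentence pairs where both sides have at
# most MAX whitespace-separated tokens.
# filter_by_length is a hypothetical helper; it is not Joshua's implementation.
filter_by_length() {
    src_file=$1
    tgt_file=$2
    max_len=$3
    # paste joins line i of both files with a tab; awk counts tokens on each side.
    paste "$src_file" "$tgt_file" |
        awk -F '\t' -v max="$max_len" '{
            n_src = split($1, a, " ")
            n_tgt = split($2, b, " ")
            if (n_src <= max && n_tgt <= max) print
        }'
}

# Example (run from $FISHER/corpus/ldc once the files exist):
#   filter_by_length fisher_train.es fisher_train.en 10 | wc -l
```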
<h3 id="variations">Variations</h3>
<p>Once that is finished, you will have a baseline model. From there, you might wish to try variations
of the baseline model. Here are some examples of what you could vary:</p>
<ul>
<li>
<p>Build an SAMT model (<code class="highlighter-rouge">--type samt</code>), GHKM model (<code class="highlighter-rouge">--type ghkm</code>), or phrasal ITG model (<code class="highlighter-rouge">--type phrasal</code>)</p>
</li>
<li>
<p>Use the Berkeley aligner instead of GIZA++ (<code class="highlighter-rouge">--aligner berkeley</code>)</p>
</li>
<li>
<p>Build the language model with BerkeleyLM (<code class="highlighter-rouge">--lm-gen berkeleylm</code>) instead of KenLM (the default)</p>
</li>
<li>
<p>Change the order of the LM from the default of 5 (<code class="highlighter-rouge">--lm-order 4</code>)</p>
</li>
<li>
<p>Tune with MIRA instead of MERT (<code class="highlighter-rouge">--tuner mira</code>). This requires that Moses is installed.</p>
</li>
<li>
<p>Decode with a wider beam (<code class="highlighter-rouge">--joshua-args '-pop-limit 200'</code>) (the default is 100)</p>
</li>
<li>
<p>Add more parallel data to the training data (add another <code class="highlighter-rouge">--corpus</code> line giving the file prefix, to which the <code class="highlighter-rouge">.es</code> and <code class="highlighter-rouge">.en</code> extensions are appended)</p>
</li>
</ul>
<p>To do this, we will create new runs that partially reuse the results of previous runs. This is
possible by doing three things: (1) incrementing the run directory and providing an updated README
note; (2) telling the pipeline which of the many steps of the pipeline to begin at; and (3)
providing the needed dependencies.</p>
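<p>Step (1) is easy to get wrong once several runs exist. One possible way to pick the next run number is sketched below, assuming runs live in numbered subdirectories (1, 2, 3, ...) of the experiment directory as in this tutorial. The helper <code class="highlighter-rouge">next_rundir</code> is hypothetical and not provided by Joshua.</p>

```shell
# Print the next unused run number: one more than the largest numbered
# subdirectory of the given experiment directory (1 if there are none).
# next_rundir is a hypothetical helper, not part of Joshua.
next_rundir() {
    expdir=${1:-.}
    highest=0
    for d in "$expdir"/*/; do
        name=$(basename "$d")
        case $name in
            ''|0*|*[!0-9]*) continue ;;   # skip names that are not plain positive integers
        esac
        if [ "$name" -gt "$highest" ]; then
            highest=$name
        fi
    done
    echo $((highest + 1))
}

echo "next run directory: $(next_rundir .)"
```

<p>The printed number can then be passed to <code class="highlighter-rouge">--rundir</code> when launching the next experiment.</p>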
<h2 id="a-second-run">A second run</h2>
<p>Let’s begin by changing the tuner, to see what effect that has. To do so, we change the run
directory, tell the pipeline to start at the tuning step, and provide the needed dependencies:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/pipeline.pl \
--rundir 2 \
--readme "Tuning with MIRA" \
--source es \
--target en \
--corpus $FISHER/corpus/ldc/fisher_train \
--tune $FISHER/corpus/ldc/fisher_dev \
--test $FISHER/corpus/ldc/fisher_dev2 \
--first-step tune \
--tuner mira \
--grammar 1/grammar.gz \
--no-corpus-lm \
--lmfile 1/lm.gz
</code></pre>
</div>
<p>Here, we have essentially the same invocation, but we have told the pipeline to use a different
tuner (MIRA), to start with tuning, and have provided it with the language model file and grammar it needs
to execute the tuning step.</p>
<p>Note that we have also told it not to build a language model. This is necessary because the
pipeline always builds an LM on the target side of the training data, if provided, but we are
supplying the language model that was already built. We could equivalently have removed the
<code class="highlighter-rouge">--corpus</code> line.</p>
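<p>Since the second run depends on files produced by the first, it may be worth confirming they exist before launching it. The following is a minimal sketch that uses the two paths from the command above; <code class="highlighter-rouge">require_files</code> is a hypothetical helper, not part of Joshua.</p>

```shell
# Return non-zero (naming each missing path) unless every given file exists.
# require_files is a hypothetical helper, not part of Joshua.
require_files() {
    missing=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing dependency: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Run from the experiment directory that contains run 1.
if require_files 1/grammar.gz 1/lm.gz; then
    echo "run 1 outputs found; ready to start at the tuning step"
else
    echo "finish run 1 before starting run 2" >&2
fi
```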
<h2 id="changing-the-model-type">Changing the model type</h2>
<p>Let’s compare the Hiero model we’ve already built to an SAMT model. We have to re-extract the
grammar, but we can reuse the alignments and the language model:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/pipeline.pl \
--rundir 3 \
--readme "Baseline SAMT model" \
--source es \
--target en \
--type samt \
--corpus $FISHER/corpus/ldc/fisher_train \
--tune $FISHER/corpus/ldc/fisher_dev \
--test $FISHER/corpus/ldc/fisher_dev2 \
--alignment 1/alignments/training.align \
--first-step parse \
--no-corpus-lm \
--lmfile 1/lm.gz
</code></pre>
</div>
<p>See <a href="pipeline.html#steps">the pipeline script page</a> for a list of all the steps.</p>
<h2 id="analyzing-the-results">Analyzing the results</h2>
<p>We now have three runs, in subdirectories 1, 2, and 3. We can display summary results from them
using the <code class="highlighter-rouge">$JOSHUA/scripts/training/summarize.pl</code> script.</p>
</div><!-- /.blog-post -->
</div>
</div><!-- /.row -->
</div><!-- /.container -->
<!-- Bootstrap core JavaScript
================================================== -->
<!-- Placed at the end of the document so the pages load faster -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script src="../../dist/js/bootstrap.min.js"></script>
<!-- <script src="../../assets/js/docs.min.js"></script> -->
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<!-- <script src="../../assets/js/ie10-viewport-bug-workaround.js"></script>
-->
<!-- Start of StatCounter Code for Default Guide -->
<script type="text/javascript">
var sc_project=8264132;
var sc_invisible=1;
var sc_security="4b97fe2d";
</script>
<script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script>
<noscript>
<div class="statcounter">
<a title="hit counter joomla"
href="http://statcounter.com/joomla/"
target="_blank">
<img class="statcounter"
src="http://c.statcounter.com/8264132/0/4b97fe2d/1/"
alt="hit counter joomla" />
</a>
</div>
</noscript>
<!-- End of StatCounter Code for Default Guide -->
</body>
</html>