<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
<meta name="author" content="">
<link rel="icon" href="../../favicon.ico">
<title>Joshua Documentation | Building a language pack</title>
<!-- Bootstrap core CSS -->
<link href="/dist/css/bootstrap.min.css" rel="stylesheet">
<!-- Custom styles for this template -->
<link href="/joshua6.css" rel="stylesheet">
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="blog-nav">
<!-- <a class="blog-nav-item active" href="#">Joshua</a> -->
<a class="blog-nav-item" href="/">Joshua</a>
<!-- <a class="blog-nav-item" href="/6.0/whats-new.html">New features</a> -->
<a class="blog-nav-item" href="/language-packs/">Language packs</a>
<a class="blog-nav-item" href="/data/">Datasets</a>
<a class="blog-nav-item" href="/support/">Support</a>
<a class="blog-nav-item" href="/contributors.html">Contributors</a>
</nav>
</div>
</div>
<div class="container">
<div class="row">
<div class="col-sm-2">
<div class="sidebar-module">
<!-- <h4>About</h4> -->
<center>
<img src="/images/joshua-logo-small.png" />
<p>Joshua machine translation toolkit</p>
</center>
</div>
<hr>
<center>
<a href="/releases/current/" target="_blank"><button class="button">Download Joshua 6.0.5</button></a>
<br />
<a href="/releases/runtime/" target="_blank"><button class="button">Runtime only version</button></a>
<p>Released November 5, 2015</p>
</center>
<hr>
<!-- <div class="sidebar-module"> -->
<!-- <span id="download"> -->
<!-- <a href="http://joshua-decoder.org/downloads/joshua-6.0.tgz">Download</a> -->
<!-- </span> -->
<!-- </div> -->
<div class="sidebar-module">
<h4>Using Joshua</h4>
<ol class="list-unstyled">
<li><a href="/6.0/install.html">Installation</a></li>
<li><a href="/6.0/quick-start.html">Quick Start</a></li>
</ol>
</div>
<hr>
<div class="sidebar-module">
<h4>Building new models</h4>
<ol class="list-unstyled">
<li><a href="/6.0/pipeline.html">Pipeline</a></li>
<li><a href="/6.0/tutorial.html">Tutorial</a></li>
<li><a href="/6.0/faq.html">FAQ</a></li>
</ol>
</div>
<!--
<div class="sidebar-module">
<h4>Phrase-based</h4>
<ol class="list-unstyled">
<li><a href="/6.0/phrase.html">Training</a></li>
</ol>
</div>
-->
<hr>
<div class="sidebar-module">
<h4>Advanced</h4>
<ol class="list-unstyled">
<li><a href="/6.0/bundle.html">Building language packs</a></li>
<li><a href="/6.0/decoder.html">Decoder options</a></li>
<li><a href="/6.0/file-formats.html">File formats</a></li>
<li><a href="/6.0/packing.html">Packing TMs</a></li>
<li><a href="/6.0/large-lms.html">Building large LMs</a></li>
</ol>
</div>
<hr>
<div class="sidebar-module">
<h4>Developer</h4>
<ol class="list-unstyled">
<li><a href="https://github.com/joshua-decoder/joshua">Github</a></li>
<li><a href="http://cs.jhu.edu/~post/joshua-docs">Javadoc</a></li>
<li><a href="https://groups.google.com/forum/?fromgroups#!forum/joshua_developers">Mailing list</a></li>
</ol>
</div>
</div><!-- /.blog-sidebar -->
<div class="col-sm-8 blog-main">
<div class="blog-title">
<h2>Building a language pack</h2>
</div>
<div class="blog-post">
<p><em>The information on this page applies to Joshua 6.0.3 and later</em>.</p>
<p>Joshua distributes <a href="/language-packs">language packs</a>, which are models
that have been trained and tuned for particular language pairs. Once you
have trained and tuned a model of your own, you can create a language pack
from it with the provided
<code class="highlighter-rouge">$JOSHUA/scripts/support/run-bundler.py</code> script, which gathers files
from a pipeline training directory and bundles them together for easy
distribution and release.</p>
<p>The script takes just two mandatory arguments in the following order:</p>
<ol>
<li>The path to the Joshua configuration file to base the bundle
on. This file should contain the tuned weights from the tuning run, so
you can use either the final tuned file from the tuning run
(<code class="highlighter-rouge">tune/joshua.config.final</code>) or from the test run
(<code class="highlighter-rouge">test/model/joshua.config</code>).</li>
<li>The directory to place the language pack in. If this directory
already exists, the script will die, unless you also pass <code class="highlighter-rouge">--force</code>.</li>
</ol>
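<p>As a minimal sketch of this two-argument form, assuming the pipeline was
run in the placeholder directory <code class="highlighter-rouge">/path/to/rundir</code> and that the
output directory <code class="highlighter-rouge">language-pack-YYYY-MM-DD</code> does not exist yet:</p>
<div class="highlighter-rouge"><pre class="highlight"><code># Placeholder paths: substitute your own run directory and pack name.
$JOSHUA/scripts/support/run-bundler.py \
/path/to/rundir/tune/joshua.config.final \
language-pack-YYYY-MM-DD
</code></pre>
</div>
<p>Called this way, the bundle is built from whatever models the config file
lists and with the default config options.</p>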
<p>In addition, there are a number of other arguments that may be important.</p>
<ul>
<li>
<p><code class="highlighter-rouge">--root /path/to/root</code>. If file paths in the Joshua config file are
not absolute, you need to provide the root directory they are relative to. If you specify a
tuned pipeline file (such as <code class="highlighter-rouge">tune/joshua.config.final</code> above), the
paths should all be absolute. If you instead provide a config file
from a previous run bundle (e.g., <code class="highlighter-rouge">test/model/joshua.config</code>), the
root is the directory containing that bundle.</p>
</li>
<li>
<p><code class="highlighter-rouge">--copy-config-options</code>. The config file options that are used in the
pipeline are likely not the ones you want when you release a model. For
example, the tuning configuration file contains options that tell Joshua
to output 300 translation candidates for each sentence
(<code class="highlighter-rouge">-top-n 300</code>) and to include lots of detail about each translation
(<code class="highlighter-rouge">-output-format '%i ||| %s ||| %f ||| %c'</code>). Because of this, you
will want to tell the run bundler to change some of the config file
options so that the output is more human-readable. The default
copy-config options are
<code class="highlighter-rouge">-top-n 0 -output-format %S -mark-oovs false</code>, which accomplish
exactly this.</p>
</li>
<li>
<p>A very important issue has to do with the translation model (the
“TM”, also sometimes called the grammar or phrase table). The
translation model can be very large, so that it takes a long time to
load and to <a href="packing.html">pack</a>. To reduce this time during model
training, the translation model is filtered against the tuning and
testing data in the pipeline, and these filtered models will be what
is listed in the source config files. However, when exporting a
model for use as a language pack, you need to export the full model
instead of the filtered one so as to maximize your coverage on new
test data. The <code class="highlighter-rouge">--tm</code> parameter is used to accomplish this; it takes
an argument specifying the path to the full model. If you would
additionally like the large model to be <a href="packing.html">packed</a> (this
is recommended; it reformats the TM so that it can be quickly loaded
at run time), you can use <code class="highlighter-rouge">--pack-tm</code> instead. You can only pack one
TM (but typically there is only one TM anyway). Multiple <code class="highlighter-rouge">--tm</code>
parameters can be passed; they will replace TMs found in the config
file in the order they are found.</p>
</li>
</ul>
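<p>To illustrate the <code class="highlighter-rouge">--tm</code> form described above, here is a hedged
sketch that substitutes two unpacked TMs. Both grammar paths are hypothetical
placeholders, and the second stands in for any additional TM your config
file happens to list:</p>
<div class="highlighter-rouge"><pre class="highlight"><code># Hypothetical paths. The first --tm replaces the first TM found in the
# config file, and the second --tm replaces the second.
./run-bundler.py \
/path/to/rundir/tune/joshua.config.final \
language-pack-YYYY-MM-DD \
--tm /path/to/rundir/grammar.gz \
--tm /path/to/rundir/second-grammar.gz
</code></pre>
</div>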
<p>Here is an example invocation for packing a hierarchical model using
the final tuned Joshua config file:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>./run-bundler.py \
--force --verbose \
/path/to/rundir/tune/joshua.config.final \
language-pack-YYYY-MM-DD \
--root /path/to/rundir \
--pack-tm /path/to/rundir/grammar.gz \
--copy-config-options \
'-top-n 0 -output-format %S -mark-oovs false' \
--server-port 5674
</code></pre>
</div>
<p>The copy config options tell the decoder to present just the
single-best (<code class="highlighter-rouge">-top-n 0</code>) translated output string that has been
heuristically capitalized (<code class="highlighter-rouge">-output-format %S</code>) and to not append <code class="highlighter-rouge">_OOV</code>
to OOVs (<code class="highlighter-rouge">-mark-oovs false</code>). The <code class="highlighter-rouge">--pack-tm</code> argument tells the bundler to use
<code class="highlighter-rouge">/path/to/rundir/grammar.gz</code> as the main translation model, packing it
before placing it in the bundle. Note that these arguments to
<code class="highlighter-rouge">--copy-config-options</code> are the default, so you could leave this option off entirely.
See <a href="decoder.html">this page</a> for a longer list of decoder options.</p>
<p>The following command is a slight variation for phrase-based models. It
takes the test-set Joshua config instead, but the result is the same:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>./run-bundler.py \
--force --verbose \
/path/to/rundir/test/model/joshua.config \
--root /path/to/rundir/test/model \
language-pack-YYYY-MM-DD \
--pack-tm /path/to/rundir/model/phrase-table.gz \
--server-port 5674
</code></pre>
</div>
<p>In both cases, a new directory <code class="highlighter-rouge">language-pack-YYYY-MM-DD</code> will be
created along with a README and a number of support files.</p>
</div>
</div><!-- /.row -->
</div><!-- /.container -->
<!-- Bootstrap core JavaScript
================================================== -->
<!-- Placed at the end of the document so the pages load faster -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script src="../../dist/js/bootstrap.min.js"></script>
<!-- <script src="../../assets/js/docs.min.js"></script> -->
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<!-- <script src="../../assets/js/ie10-viewport-bug-workaround.js"></script>
-->
<!-- Start of StatCounter Code for Default Guide -->
<script type="text/javascript">
var sc_project=8264132;
var sc_invisible=1;
var sc_security="4b97fe2d";
</script>
<script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script>
<noscript>
<div class="statcounter">
<a title="hit counter joomla"
href="http://statcounter.com/joomla/"
target="_blank">
<img class="statcounter"
src="http://c.statcounter.com/8264132/0/4b97fe2d/1/"
alt="hit counter joomla" />
</a>
</div>
</noscript>
<!-- End of StatCounter Code for Default Guide -->
</body>
</html>