---
layout: default
category: links
title: Installing and running the Joshua Decoder
---
<!-- begin header -->
<h2><a href="http://cs.jhu.edu/~ccb/">by Chris Callison-Burch</a> <br/>(Released: January 17, 2012)</h2>
<div class="warning">
<p>
Note: this walkthrough describes an older version of Joshua that is different in some ways from
the current version. Some of these differences are: SRILM support has been removed, the
Berkeley aligner is now included in <code>$JOSHUA/lib</code> (and therefore doesn't need to be
installed separately), and there is no need for you to download the developer version of Joshua
if you only want to use the software and not contribute to it. Please refer to
the <a href="pipeline.html">pipeline documentation for the 4.0 release</a>. The pipeline
automates these steps.
</p>
</div>
<p>This web page gives instructions on how to install and use the Joshua decoder. Joshua is an open-source decoder for parsing-based machine translation. Joshua uses the synchronous context-free grammar (SCFG) formalism in its approach to statistical machine translation, and the software implements the algorithms that underlie the approach.</p>
<a name="steps" />
<p>These instructions will tell you how to:
<ol>
<li> <a href="#step1">Install the software</a></li>
<li> <a href="#step2">Prepare your data</a></li>
<li> <a href="#step3">Create word alignments</a> </li>
<li> <a href="#step4">Train a language model</a> </li>
<li> <a href="#step5">Extract a translation grammar</a> </li>
<li> <a href="#step6">Run minimum error rate training</a> </li>
<li> <a href="#step7">Decode a test set</a></li>
<li> <a href="#step8">Recase the translations</a></li>
<li> <a href="#step9">Score the translations</a></li>
</ol>
</p>
<p>If you use Joshua in your work, please cite this paper:</p>
<p>Jonathan Weese, Juri Ganitkevitch, Chris Callison-Burch, Matt Post and Adam Lopez, 2011. <a href="publications/joshua-3.0.pdf">Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor</a>. In Proceedings of the Workshop on Statistical Machine Translation (WMT11). <a href="publications/joshua-3.0.pdf">[pdf]</a> <a href ="joshua.bib">[bib]</a>
</p>
<p>
These instructions apply to <a href = "https://github.com/joshua-decoder/joshua/tags">Release 3.1 of Joshua</a>, which is described in our WMT11 paper. You can also get the latest version of the Joshua software from the repository with the command:
</p>
<pre>
git clone https://github.com/joshua-decoder/joshua.git
</pre>
<a name="step1">
<h1>Step 1: Install the software</h1>
<h3>Prerequisites</h3>
<p>The Joshua decoder is written in Java. You'll need to install a few software development tools before you install it:</p>
<ul>
<li> <a href="http://git-scm.com/download">git</a> - git is the version control system that we use for managing the Joshua codebase. </li>
<li> <a href="http://ant.apache.org/bindownload.cgi">Apache Ant</a> - ant is a tool for compiling Java code which has similar functionality to make. </li>
</ul>
<p>
Before installing these, you can check whether they're already on your system by typing <code>which git</code> and <code>which ant</code>.
</p>
<p>In addition to these software development tools, you will also need to download: </p>
<ul>
<li> <a href = "http://www.speech.sri.com/projects/srilm/download.html">The SRI language modeling toolkit</a> - srilm is a widely used toolkit for building n-gram language models, which are an important component in the translation process.</li>
<li> <a href = "http://code.google.com/p/berkeleyaligner/">The Berkeley Aligner</a> - this software is used to align words across sentence pairs in a bilingual parallel corpus. Word alignment takes place before extracting an SCFG.
</ul>
<p>
After you have downloaded the srilm tar file, type the following commands to install it:
</p>
<pre>
mkdir srilm
mv srilm.tgz srilm/
cd srilm/
tar xfz srilm.tgz
make
</pre>
<p>If the build fails, please follow the instructions in SRILM's INSTALL file. For instance, if SRILM's Makefile does not detect that you're running 64-bit Linux, you might have to run "make MACHINE_TYPE=i686-m64 World".</p>
<p>After you successfully compile SRILM, Joshua will need to know what directory it is in. You can type <code>pwd</code>
to get the absolute path to the <code>srilm/</code> directory that you created. Once you've figured out the path, set an <code>SRILM</code> environment variable by typing:</p>
<pre>
export SRILM="<b>/path/to/srilm</b>"
</pre>
<p>Where "<b>/path/to/srilm</b>" is replaced with your path. You'll also need to set a <code>JAVA_HOME</code> environment variable. For Mac OS X this usually is done by typing:</p>
<pre>
export JAVA_HOME="<b>/Library/Java/Home</b>"
</pre>
<p>These variables will need to be set every time you use Joshua, so it's useful to add them to your .bashrc, .bash_profile or .profile file.</p>
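<p>For example, you might add lines like the following to your <code>.bashrc</code> (the paths below are placeholders; substitute the actual locations on your machine):</p>
<pre>
# example ~/.bashrc entries for Joshua (adjust the paths for your machine)
export SRILM="$HOME/tools/srilm"
export JAVA_HOME="/Library/Java/Home"   # on Linux this will be your JDK directory instead
</pre>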
<h3>Download and Install Joshua</h3>
<p>First, download the <a href = "https://github.com/joshua-decoder/joshua/tarball/v3.1.1">Joshua release 3.1.1 tar file</a>. Next, type the following commands to untar the file and compile the Java classes: </p>
<pre>
tar xfz joshua-decoder-joshua-v3.1.1-0-g1a0e6b6.tar.gz
cd joshua-decoder-joshua-1a0e6b6
ant
</pre>
<p>Running <code>ant</code> will compile the Java classes and link in srilm. If everything works properly, you should see the message <b>BUILD SUCCESSFUL</b>. If you get a BUILD FAILED message, it may be because you have not properly set the paths to SRILM and JAVA_HOME, or because srilm was not compiled properly, as described above.</p>
<p>For the examples in this document, you will need to set a <code>JOSHUA</code> environment variable:
<pre>
export JOSHUA="<b>/path/to/joshua</b>"
</pre>
<h3>Run the example model</h3>
<p>
To make sure that the decoder is installed properly, we'll translate 5 sentences using a small translation model that loads quickly. The sentences that we will translate are contained in <code>example/example.test.in</code>:
</p>
<pre>
科学家 为 攸关 初期 失智症 的 染色体 完成 定序<br>
( 法新社 巴黎 二日 电 ) 国际 间 的 一 群 科学家 表示 , 他们 已 为 人类 第十四 对 染色体 完成 定序 , 这 对 染色体 与 许多 疾病 有关 , 包括 三十几 岁 者 可能 罹患 的 初期 阿耳滋海默氏症 。<br>
这 是 到 目前 为止 完成 定序 的 第四 对 染色体 , 它 由 八千七百多万 对 去氧 核糖核酸 ( dna ) 组成 。<br>
英国 自然 科学 周刊 发表 的 这 项 研究 显示 , 第十四 对 染色体 排序 由 一千零五十 个 基因 和 基因 片段 构成 。<br>
基因 科学家 的 目标 是 , 提供 诊断 工具 以 发现 致病 的 缺陷 基因 , 终而 提供 可 阻止 这些 基因 产生 障碍 的 疗法 。
</pre>
<p>
The small translation grammar contains 15,939 rules -- you can count the rules by running <code>gunzip -c example/example.hiero.tm.gz | wc -l</code>, or you can see the first few translation rules with <code>gunzip -c example/example.hiero.tm.gz | head</code>.
</p>
<pre>
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientists to [X,2] ||| 2.17609119 0.333095818 1.53173875
[X] ||| [X,1] 科学家 [X,2] ||| [X,2] of the [X,1] scientists ||| 2.47712135 0.333095818 2.17681264
[X] ||| [X,1] 科学家 [X,2] ||| [X,2] of [X,1] scientists ||| 2.47712135 0.333095818 1.13837981
[X] ||| [X,1] 科学家 [X,2] ||| [X,2] [X,1] scientists ||| 2.47712135 0.333095818 0.218843221
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientists [X,2] ||| 1.01472330 0.333095818 0.218843221
[X] ||| [X,1] 科学家 [X,2] ||| [X,2] of scientists of [X,1] ||| 2.47712135 0.333095818 2.05791640
[X] ||| [X,1] 科学家 [X,2] ||| scientists [X,1] for [X,2] ||| 2.47712135 0.333095818 2.05956721
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientist [X,2] ||| 1.63202321 0.303409695 0.977472364
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientists , [X,2] ||| 2.47712135 0.333095818 1.68990576
[X] ||| [X,1] 科学家 [X,2] ||| scientists [X,2] [X,1] ||| 2.47712135 0.333095818 0.218843221
</pre>
<p>The different parts of the rules are separated by the <code>|||</code> delimiter. The first part of the rule is the left-hand side non-terminal. The second and third parts are the source and target sides of the right-hand side. The three numbers listed after each translation rule are negative log probabilities that signify, in order:
<ul>
<li> prob(e|f) - the probability of the English phrase given the foreign phrase </li>
<li> lexprob(e|f) - the lexical translation probabilities of the English words given the foreign words </li>
<li> lexprob(f|e) - the lexical translation probabilities of the foreign words given the English words</li>
</ul>
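<p>Since the scores are negative log probabilities (the example grammar appears to use base-10 logs), you can convert one back into a probability to get a feel for the numbers. A quick, illustrative check on the fifth rule above:</p>
<pre>
# turn a -log10 score back into a probability (illustrative)
echo "1.01472330" | awk '{ printf "%.4f\n", 10 ^ (-$1) }'
# prints 0.0967, i.e. prob(e|f) for that rule is roughly 0.1
</pre>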
<p>You can use the grammar to translate the test set by running </p>
<pre>java -Xmx1g -cp $JOSHUA/bin \
-Djava.library.path=$JOSHUA/lib \
-Dfile.encoding=utf8 joshua.decoder.JoshuaDecoder \
example/example.config.srilm \
example/example.test.in \
example/example.nbest.srilm.out
</pre>
<p>
For those of you who aren't very familiar with Java, the arguments are the following:
<ul>
<li><code>-Xmx1g</code> -- this tells Java to use 1 GB of memory. </li>
<li><code>-cp $JOSHUA/bin</code> -- this specifies the directory that contains the Java class files.</li>
<li><code>-Djava.library.path=$JOSHUA/lib</code> -- this specifies the directory that contains the libraries that link in C++ code </li>
<li><code>-Dfile.encoding=utf8</code> -- this tells Java to use UTF-8 as the default file encoding.</li>
<li><code>joshua.decoder.JoshuaDecoder </code> -- This is the class that is run. If you want to look at the source code for this class, you can find it in <code>src/joshua/decoder/JoshuaDecoder.java</code></li>
<li><code>example/example.config.srilm </code> -- This is the configuration file used by Joshua.</li>
<li><code>example/example.test.in</code> -- This is the input file containing the sentences to translate.</li>
<li><code>example/example.nbest.srilm.out</code> -- This is the output file that the n-best translations will be written to.</li>
</ul>
<p>You can inspect the output file by typing <code>head example/example.nbest.srilm.out</code></p>
<pre>
0 ||| scientists to vital early 失智症 the chromosome completed has ||| -127.759 -6.353 -11.577 -5.325 -3.909 ||| -135.267
0 ||| scientists for vital early 失智症 the chromosome completed has ||| -128.239 -6.419 -11.179 -5.390 -3.909 ||| -135.556
0 ||| scientists to related early 失智症 the chromosome completed has ||| -126.942 -6.450 -12.716 -5.764 -3.909 ||| -135.670
0 ||| scientists to vital early 失智症 the chromosomes completed has ||| -128.354 -6.353 -11.396 -5.305 -3.909 ||| -135.714
0 ||| scientists to death early 失智症 the chromosome completed has ||| -127.879 -6.575 -11.845 -5.287 -3.909 ||| -135.803
0 ||| scientists as vital early 失智症 the chromosome completed has ||| -128.537 -6.000 -11.384 -5.828 -3.909 ||| -135.820
0 ||| scientists for related early 失智症 the chromosome completed has ||| -127.422 -6.516 -12.319 -5.829 -3.909 ||| -135.959
0 ||| scientists for vital early 失智症 the chromosomes completed has ||| -128.834 -6.419 -10.998 -5.370 -3.909 ||| -136.003
0 ||| scientists to vital early 失智症 completed the chromosome has ||| -127.423 -7.364 -11.577 -5.325 -3.909 ||| -136.009
0 ||| scientists to vital early 失智症 of chromosomes completed has ||| -127.427 -7.136 -11.612 -5.816 -3.909 ||| -136.086
</pre>
<p>This file contains the n-best translations under the model. The first 10 lines that you see above are the 10 best translations of the first sentence. Each line contains 4 fields. The first field is the index of the sentence (index 0 for the first sentence), the second field is the translation, the third field contains each of the individual feature function scores for the translation (language model, rule translation probability, lexical translation probability, reverse lexical translation probability, and word penalty), and the final field is the overall score.
</p>
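<p>Because the fields are separated by the <code>|||</code> delimiter, the file is easy to pull apart with standard Unix tools. For instance, this (illustrative) command counts how many n-best candidates the decoder produced for each input sentence:</p>
<pre>
# field 1 is the sentence index; count how many candidates each sentence received
cut -d'|' -f1 example/example.nbest.srilm.out | uniq -c
</pre>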
<p>
To get the 1-best translations for each sentence in the test set without all of the extra information, you can run the following command:
</p>
<pre>
java -Xmx1g -cp $JOSHUA/bin \
-Dfile.encoding=utf8 joshua.util.ExtractTopCand \
example/example.nbest.srilm.out \
example/example.nbest.srilm.out.1best
</pre>
<p>You can then look at the 1-best output file by typing <code>cat example/example.nbest.srilm.out.1best</code>:</p>
<pre>
scientists to vital early 失智症 the chromosome completed has
( , paris 2 ) international a group of scientists said that they completed to human to chromosome 14 has , the chromosome with many diseases , including more years , may with the early 阿耳滋海默氏症 .
this is to now completed has in the fourth chromosome , which 八千七百多万 to carry when ( dna ) .
the weekly british science the study showed that the chromosome 14 are by 一千零五十 genes and gene fragments .
the goal of gene scientists is to provide diagnostic tools to found of the flawed genes , are still provide a to stop these genes treatments .
</pre>
<p>If your translations are identical to the ones above then Joshua is installed correctly. With this small model, there are many untranslated words, and the quality of the translations is very low. In the next steps, we'll show you how to train a model for a new language pair, using a larger training corpus that will result in higher quality translations.</p>
<a name="step2" />
<h1>Step 2: Prepare your data</h1>
<p>To create a new statistical translation model with Joshua, you will need several data sets:
<ul>
<li> A large sentence-aligned bilingual parallel corpus. We refer to this set as the <b>training data</b>, since it will be used to train the translation model. The question of how much data is necessary always arises. The short answer is that more is better. Our parallel corpora typically contain tens of millions of words, and we have used as many as 250 million words.</li>
<li> A larger monolingual corpus. We need data in the target language to train the language model. You could simply use the target side of the parallel corpus, but it is better to assemble large amounts of monolingual text, since it will help improve the fluency of your translations.</li>
<li> A small sentence-aligned bilingual corpus to use as a <b>development set</b> (somewhere around 1000 sentence pairs ought to be sufficient). This data should be disjoint from your training data. It will be used to optimize the parameters of your model in minimum error rate training (MERT). It may be useful to have multiple reference translations for your dev set, although this is not strictly necessary. </li>
<li> A small sentence-aligned bilingual corpus to use as a <b>test set</b> to evaluate the translation quality of your system and any modifications that you make to it. The test set should be disjoint from the dev and training sets. Again, it may be useful to have multiple reference translations if you are evaluating using the Bleu metric.</li>
</ul>
</p>
<p>
There are several sources for training data. A good source of free parallel corpora for European languages is the Europarl corpus that is distributed as part of the <a href="http://statmt.org/wmt12/translation-task.html">Workshop on Statistical Machine Translation</a>. If you sign up to participate in the annual <a href="http://www.itl.nist.gov/iad/mig//tests/mt/">NIST Open Machine Translation Evaluation</a>, you can get access to large Arabic-English and Chinese-English parallel corpora, and a small Urdu-English parallel corpus.
</p>
<p>Once you've gathered your data, you will need to do several preprocessing steps: sentence alignment, tokenization, normalization, and subsampling. </p>
<h3>Sentence alignment</h3>
<p>In this exercise, we'll start with an existing sentence-aligned parallel corpus. Download this tarball, which contains a Spanish-English parallel corpus, along with a dev and a test set: <a href="http://cs.jhu.edu/~ccb/joshua/data.tar.gz">data.tar.gz</a> </p>
<p> The data tarball contains two training directories: <code>training/</code>, which includes a subset of the corpus, and <code>full-training/</code>, which includes the full corpus. I strongly recommend starting with the smaller set and building an end-to-end system with it, since many steps take a very long time on the full data set. You should debug on the smaller set to avoid wasting time.</p>
<p>
If you start with your own data set, you will need to sentence align it. We recommend Bob Moore's <a href="http://research.microsoft.com/en-us/downloads/aafd5dcf-4dcc-49b2-8a22-f7055113e656/">bilingual sentence aligner</a>.
</p>
<h3>Tokenization</h3>
<p>Joshua uses whitespace to delineate words. For many languages, tokenization can be as simple as separating punctuation off as its own token. For languages like Chinese, which don't put spaces around words, tokenization can be trickier. </p>
<p>For this example we'll use the simple tokenizer that is released as part of the WMT. It's located in the tarball under the scripts directory. To use it type the following commands:</p>
<pre>
tar xfz data.tar.gz
cd data/
gunzip -c es-en/full-training/europarl-v4.es-en.es.gz \
| perl scripts/tokenizer.perl -l es \
> es-en/full-training/training.es.tok
gunzip -c es-en/full-training/europarl-v4.es-en.en.gz \
| perl scripts/tokenizer.perl -l en \
> es-en/full-training/training.en.tok
</pre>
<h3>Normalization</h3>
<p>After tokenization, we recommend that you normalize your data by lowercasing it. The system treats words with variant capitalization as distinct, which can lead to worse probability estimates for their translation, since the counts are fragmented. For other languages you might want to normalize the text in other ways.</p>
<p>You can lowercase your tokenized data with the following script:</p>
<pre>
cat es-en/full-training/training.en.tok \
| perl scripts/lowercase.perl \
> es-en/full-training/training.en.tok.lc
cat es-en/full-training/training.es.tok \
| perl scripts/lowercase.perl \
> es-en/full-training/training.es.tok.lc
</pre>
<p>The untokenized file looks like this (<code>gunzip -c es-en/full-training/europarl-v4.es-en.en.gz | head -3</code>):</p>
<p>
Resumption of the session<br>
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.<br>
Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.
</p>
<p>After tokenization and lowercasing, the file looks like this (<code>head -3 es-en/full-training/training.en.tok.lc</code>):</p>
<p>
resumption of the session<br>
i declare resumed the session of the european parliament adjourned on friday 17 december 1999 , and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .<br>
although , as you will have seen , the dreaded ' millennium bug ' failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful .
</p>
<p>You must preprocess your dev and test sets in the same way you preprocess your training data. Run the following commands on the data that you downloaded:</p>
<p>
<pre>
cat es-en/dev/news-dev2009.es \
| perl scripts/tokenizer.perl -l es \
| perl scripts/lowercase.perl \
> es-en/dev/news-dev2009.es.tok.lc
cat es-en/dev/news-dev2009.en \
| perl scripts/tokenizer.perl -l en \
| perl scripts/lowercase.perl \
> es-en/dev/news-dev2009.en.tok.lc
cat es-en/test/newstest2009.es \
| perl scripts/tokenizer.perl -l es \
| perl scripts/lowercase.perl \
> es-en/test/newstest2009.es.tok.lc
cat es-en/test/newstest2009.en \
| perl scripts/tokenizer.perl -l en \
| perl scripts/lowercase.perl \
> es-en/test/newstest2009.en.tok.lc
</pre>
<h3>Subsampling (optional)</h3>
<p>
Sometimes the amount of training data is so large that it makes creating word alignments extremely time-consuming and memory-intensive. We therefore provide a facility for subsampling the training corpus to select sentences that are relevant for a test set.
</p>
<p>
<pre>
mkdir es-en/full-training/subsampled
echo "training" > es-en/full-training/subsampled/manifest
cat es-en/dev/news-dev2009.es.tok.lc es-en/test/newstest2009.es.tok.lc > es-en/full-training/subsampled/test-data
java -Xmx1000m -Dfile.encoding=utf8 -cp "$JOSHUA/bin:$JOSHUA/lib/commons-cli-2.0-SNAPSHOT.jar" \
joshua.subsample.Subsampler \
-e en.tok.lc \
-f es.tok.lc \
-epath es-en/full-training/ \
-fpath es-en/full-training/ \
-output es-en/full-training/subsampled/subsample \
-ratio 1.04 \
-test es-en/full-training/subsampled/test-data \
-training es-en/full-training/subsampled/manifest
</pre>
</p>
<p>You can see how much the subsampling step reduces the training data by typing <code> wc -lw es-en/full-training/training.??.tok.lc es-en/full-training/subsampled/subsample.??.tok.lc</code>:
</p>
<pre>
1411589 39411018 training/training.en.tok.lc
1411589 41042110 training/training.es.tok.lc
671429 16721564 training/subsampled/subsample.en.tok.lc
671429 17670846 training/subsampled/subsample.es.tok.lc
</pre>
<a name="step3" />
<h1>Step 3: Create word alignments</h1>
<p>
Before extracting a translation grammar, we first need to create word alignments for our parallel corpus. In this example, we show you how to use the Berkeley aligner. You may also use Giza++ to create the alignments, although that program is a little unwieldy to install.
</p>
<p>
To run the Berkeley aligner you first need to set up a configuration file, which defines the models that are used to align the data, how the program runs, and which files are to be aligned. Here is an example configuration file (you should create your own version of this file and save it as <code>training/word-align.conf</code>):
</p>
<pre>
## word-align.conf
## ----------------------
## This is an example training script for the Berkeley
## word aligner. In this configuration it uses two HMM
## alignment models trained jointly and then decoded
## using the competitive thresholding heuristic.
##########################################
# Training: Defines the training regimen
##########################################
forwardModels MODEL1 HMM
reverseModels MODEL1 HMM
mode JOINT JOINT
iters 5 5
###############################################
# Execution: Controls output and program flow
###############################################
execDir alignments
create
saveParams true
numThreads 1
msPerLine 10000
alignTraining
#################
# Language/Data
#################
<b>foreignSuffix es.tok.lc</b>
<b>englishSuffix en.tok.lc</b>
# Choose the training sources, which can either be directories or files that list files/directories
<b>trainSources subsampled/</b>
sentences MAX
#################
# 1-best output
#################
competitiveThresholding
</pre>
<p>To run the Berkeley aligner, first set an environment variable saying where the aligner's jar file is located (this environment variable is just used for convenience in this document, and is not necessary for running the aligner in general):
</p>
<pre>
export BERKELEYALIGNER="<b>/path/to/berkeleyaligner/dir</b>"
</pre>
<p>
You'll need to create an empty directory called <code>example/test</code>. This is because the Berkeley aligner generally expects to test against a set of manually word-aligned data:
</p>
<pre>
cd es-en/full-training/
mkdir -p example/test
</pre>
<p>
After you've created the <code>word-align.conf</code> file, you can run the aligner with this command:
</p>
<pre>
nohup java -d64 -Xmx10g -jar $BERKELEYALIGNER/berkeleyaligner.jar ++word-align.conf &
</pre>
<p>
If the program finishes right away, then it probably terminated with an error. You can read the <code>nohup.out</code> file to see what went wrong. Common problems include a missing <code>example/test</code> directory, or a file not found exception. When you re-run the program, you will need to manually remove the <code>alignments/</code> directory.
</p>
<p>When you are aligning tens of millions of words worth of data, the word alignment process will take several hours to complete. While it is running, you can skip ahead and complete step 4, but not step 5.</p>
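<p>Once the run completes, it is worth eyeballing the output before moving on. The sketch below assumes the 1-best alignments were written into the <code>alignments/</code> directory you named with <code>execDir</code>; the exact file names depend on your configuration, so list the directory first:</p>
<pre>
ls alignments/
# the alignment file has one line per sentence pair, in Pharaoh format,
# where each "i-j" entry links word position i in one sentence to position j in the other
head -3 alignments/training.align     # illustrative file name
</pre>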
<!-- ccb - todo - show the output here, and the different subdirectories -->
<p>
After you get comfortable using the aligner and after you've run through the whole Joshua training sequence, you can try experimenting with the amount of training data, the number of training iterations, and different alignment models (the Berkeley aligner supports Model 1, a Hidden Markov Model, and a syntactic HMM).
</p>
<a name="step4" />
<h1>Step 4: Train a language model</h1>
<p>Most translation models also make use of an n-gram language model as a way of assigning higher probability to hypothesis translations that look like fluent examples of the target language. Joshua provides support for n-gram language models, either through a built-in data structure or through external calls to the SRI language modeling toolkit (srilm). To use large language models, we recommend srilm. </p>
<p>If you successfully installed srilm in <a href="#step1">Step 1</a>, then you should be able to train a language model with the following command:</p>
<pre>
mkdir -p model/lm
$SRILM/bin/macosx64/ngram-count \
-order 3 \
-unk \
-kndiscount1 -kndiscount2 -kndiscount3 \
-text training/training.en.tok.lc \
-lm model/lm/europarl.en.trigram.lm
</pre>
<p>(Note: the above assumes that you are on a 64-bit machine running Mac OS X. If that's not the case, your path to ngram-count will be slightly different.)</p>
<p>This will train a trigram language model on the English side of the parallel corpus. We use the <code>.tok.lc</code> file because it is important to have the input to the LM training be tokenized and normalized in the same way as the input data for word alignment and translation grammar extraction.</p>
<p>
The <code>-order 3</code> option tells srilm to produce a trigram language model. You can set this to a higher value, and srilm will happily output 4-gram, 5-gram or even higher order language models. Joshua supports arbitrary order n-gram language models, but as the order increases, the amount of memory they require rapidly grows and the amount of evidence used to estimate the probabilities shrinks, so there are diminishing returns for increasing n. It's common to use n-gram models up to order 5, but in practice people rarely go much beyond that.
</p>
<p>The <code>-kndiscount</code> options tell SRILM to use modified Kneser-Ney discounting as its smoothing scheme. Other smoothing schemes that are implemented in SRILM include Good-Turing and Witten-Bell. </p>
<p>Given that the English side of the parallel corpus is a relatively small amount of data in terms of language modeling, it only takes a few minutes to output the LM. The uncompressed LM is about 144 megabytes (<code>du -h model/lm/europarl.en.trigram.lm</code>).
</p>
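<p>As a quick sanity check on the language model (and a convenient way to compare different orders or smoothing settings), you can compute its perplexity on held-out text with SRILM's <code>ngram</code> tool; lower perplexity is better. A sketch, again assuming the 64-bit Mac OS X binary directory and the tokenized, lowercased dev set from Step 2:</p>
<pre>
$SRILM/bin/macosx64/ngram \
  -order 3 \
  -unk \
  -lm model/lm/europarl.en.trigram.lm \
  -ppl dev/news-dev2009.en.tok.lc
</pre>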
<a name="step5" />
<h1>Step 5: Extract a translation grammar</h1>
<p>We'll use the word alignments to create a translation grammar similar to the Chinese one shown in <a href="#step1">Step 1</a>. The translation grammar is created by looking for where the foreign language phrases from the test set occur in the training set, and then using the word alignments to figure out which English phrases they are aligned to. </p>
<h3>Create a suffix array index</h3>
<p>
To find the foreign phrases in the test set, we first create an easily searchable index, called a suffix array, for the training data.
</p>
<pre>
java -Xmx500m -cp $JOSHUA/bin/ \
joshua.corpus.suffix_array.Compile \
training/subsampled/subsample.es.tok.lc \
training/subsampled/subsample.en.tok.lc \
training/subsampled/training.en.tok.lc-es.tok.lc.align \
model
</pre>
<p>This compiles the index that Joshua will use for its rule extraction, and puts it into a directory named <code>model</code>.
</p>
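<p>You can confirm that the compilation succeeded by listing the <code>model/</code> directory, which should now contain the binary files for the compiled corpora, vocabularies, and alignments (the exact file names are an implementation detail):</p>
<pre>
ls -lh model/
</pre>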
<!-- ccb - todo - add this back in when the GridViewer is fixed.
Joshua has some tools that let you manipulate the data in this directory. For example, you can visualize the word alignments with this command:
</p>
<pre>
java -cp $JOSHUA/bin joshua.ui.alignment.GridViewer model 1
</pre>
//!-->
<h3>Extract grammar rules for the dev set</h3>
<p>The following command will extract a translation grammar from the suffix array index of your word-aligned parallel corpus, where the grammar rules apply to the foreign phrases in the dev set <code>dev/news-dev2009.es.tok.lc</code>:
</p>
<pre>
mkdir mert
java -Dfile.encoding=UTF8 -Xmx1g -cp $JOSHUA/bin \
joshua.prefix_tree.ExtractRules \
./model \
mert/news-dev2009.es.tok.lc.grammar.raw \
dev/news-dev2009.es.tok.lc &
</pre>
<p>
Next, sort the grammar rules and remove the redundancies with the following Unix command:
</p>
<pre>
sort -u mert/news-dev2009.es.tok.lc.grammar.raw \
-o mert/news-dev2009.es.tok.lc.grammar
</pre>
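<p>Before moving on, it's a good idea to confirm that the extracted grammar is non-trivial. For example (illustrative checks):</p>
<pre>
wc -l mert/news-dev2009.es.tok.lc.grammar     # number of rules extracted for the dev set
head -3 mert/news-dev2009.es.tok.lc.grammar   # rules use the same ||| format as in Step 1
</pre>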
<p>You will also need to create a small "glue grammar" in a file called <code>model/hiero.glue</code>, containing the rules that allow hiero-style grammars to reach the goal state:</p>
<pre>
[S] ||| [X,1] ||| [X,1] ||| 0 0 0
[S] ||| [S,1] [X,2] ||| [S,1] [X,2] ||| 0.434294482 0 0
</pre>
<!-- ccb todo - show the Spanish grammar here -->
<a name="step6" />
<h1>Step 6: Run minimum error rate training</h1>
<p>
After we've extracted the grammar for the dev set, we can run minimum error rate training (MERT). MERT is a method for setting the weights of the different feature functions in the translation model so as to maximize the translation quality on the dev set. Translation quality is calculated according to an automatic metric, such as Bleu. Our implementation of MERT allows you to easily implement some other metric and optimize your parameters to that. There's even a YouTube tutorial to show you how. </p>
<p>To run MERT you will first need to create a few files:
<ul>
<li> A MERT configuration file </li>
<li> A separate file with the list of the feature functions used in your model, along with their possible ranges</li>
<li> An executable file containing the command to use to run the decoder</li>
<li> A Joshua configuration file</li>
</ul>
</p>
<p>Create a MERT configuration file. In this example we name the file <code>mert/mert.config</code>. Its contents are: </p>
<pre>
### MERT parameters
# target sentences file name (in this case, file name prefix)
-r dev/news-dev2009.en.tok.lc
-rps 1 # references per sentence
-p mert/params.txt # parameter file
-m BLEU 4 closest # evaluation metric and its options
-maxIt 10 # maximum MERT iterations
-ipi 20 # number of intermediate initial points per iteration
-cmd mert/decoder_command # file containing commands to run decoder
-decOut mert/news-dev2009.output.nbest # file produced by decoder
-dcfg mert/joshua.config # decoder config file
-N 300 # size of N-best list
-v 1 # verbosity level (0-2; higher value => more verbose)
-seed 12341234 # random number generator seed
</pre>
<p>You can see a list of the other parameters available in our MERT implementation by running this command:</p>
<pre>java -cp $JOSHUA/bin joshua.zmert.ZMERT -h </pre>
<p>Next, create a file called <code>mert/params.txt</code> that specifies what feature functions you are using in your model. In our baseline model, this file should contain the following information:</p>
<pre>
lm ||| 1.000000 Opt 0.1 +Inf +0.5 +1.5
phrasemodel pt 0 ||| 1.066893 Opt -Inf +Inf -1 +1
phrasemodel pt 1 ||| 0.752247 Opt -Inf +Inf -1 +1
phrasemodel pt 2 ||| 0.589793 Opt -Inf +Inf -1 +1
wordpenalty ||| -2.844814 Opt -Inf +Inf -5 0
normalization = absval 1 lm
</pre>
<p>Next, create a file called <code>mert/decoder_command</code> that contains the following command:</p>
<pre>
java -Xmx1g -cp $JOSHUA/bin/ -Djava.library.path=$JOSHUA/lib -Dfile.encoding=utf8 \
joshua.decoder.JoshuaDecoder \
mert/joshua.config \
dev/news-dev2009.es.tok.lc \
mert/news-dev2009.output.nbest
</pre>
<p>Next, create a configuration file for joshua at <code>mert/joshua.config</code> that contains the following:</p>
<pre>
<b>lm_file=model/lm/europarl.en.trigram.lm</b>
<b>tm_file=mert/news-dev2009.es.tok.lc.grammar</b>
tm_format=hiero
glue_file=model/hiero.glue
glue_format=hiero
#lm config
use_srilm=true
lm_ceiling_cost=100
use_left_equivalent_state=false
use_right_equivalent_state=false
order=3
#tm config
span_limit=10
phrase_owner=pt
mono_owner=mono
begin_mono_owner=begin_mono
default_non_terminal=X
goalSymbol=S
#pruning config
fuzz1=0.1
fuzz2=0.1
max_n_items=30
relative_threshold=10.0
max_n_rules=50
rule_relative_threshold=10.0
#nbest config
use_unique_nbest=true
use_tree_nbest=false
add_combined_cost=true
top_n=300
#remote lm server config, we should first prepare remote_symbol_tbl before starting any jobs
use_remote_lm_server=false
remote_symbol_tbl=./voc.remote.sym
num_remote_lm_servers=4
f_remote_server_list=./remote.lm.server.list
remote_lm_server_port=9000
#parallel decoder: it cannot be used together with remote lm
num_parallel_decoders=1
parallel_files_prefix=/tmp/
###### model weights
#lm order weight
lm 1.0
#phrasemodel owner column(0-indexed) weight
phrasemodel pt 0 1.4037585111897322
phrasemodel pt 1 0.38379188013385945
phrasemodel pt 2 0.47752204361625605
#arityphrasepenalty owner start_arity end_arity weight
#arityphrasepenalty pt 0 0 1.0
#arityphrasepenalty pt 1 2 -1.0
#phrasemodel mono 0 0.5
#wordpenalty weight
wordpenalty -2.721711092619053
</pre>
<p>Finally, run the command to start MERT:</p>
<pre>
nohup java -cp $JOSHUA/bin \
joshua.zmert.ZMERT \
-maxMem 1500 mert/mert.config &
</pre>
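<p>MERT typically takes several hours, since every iteration re-decodes the entire dev set. Because the job was started with <code>nohup</code>, you can keep an eye on it by tailing the log and listing the files it writes into <code>mert/</code> (a sketch):</p>
<pre>
tail -f nohup.out      # follow Z-MERT's progress (Ctrl-C stops tailing, not the job)
ls -lt mert/ | head    # see which intermediate files have been produced so far
</pre>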
<p>While MERT is running, you can skip ahead to the first part of the next step and extract the grammar for the test set.</p>
<a name="step7" />
<h1>Step 7: Decode a test set</h1>
<p>When MERT finishes, it will output a file <code>mert/joshua.config.ZMERT.final</code> that contains the new weights for the different feature functions. You can copy this config file and use it to decode the test set. </p>
<h3>Extract grammar rules for the test set</h3>
<p>Before decoding the test set, you'll need to extract a translation grammar for the foreign phrases in the test set <code>test/newstest2009.es.tok.lc</code>:
</p>
<pre>
java -Dfile.encoding=UTF8 -Xmx1g -cp $JOSHUA/bin \
joshua.prefix_tree.ExtractRules \
./model \
test/newstest2009.es.tok.lc.grammar.raw \
test/newstest2009.es.tok.lc &
</pre>
<p>
Next, sort the grammar rules and remove the redundancies with the following Unix command:
</p>
<pre>
sort -u test/newstest2009.es.tok.lc.grammar.raw \
-o test/newstest2009.es.tok.lc.grammar
</pre>
<p>Once the grammar extraction has completed, you can edit the <code>joshua.config</code> file for the test set.</p>
<pre>
cp mert/joshua.config.ZMERT.final test/joshua.config
</pre>
<p>You'll need to edit the config file to replace <b><code>tm_file=mert/news-dev2009.es.tok.lc.grammar</code></b> with <b><code>tm_file=test/newstest2009.es.tok.lc.grammar</code></b>.</p>
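<p>If you prefer not to edit the file by hand, a one-line substitution does the same thing (a sketch assuming GNU <code>sed</code>; on Mac OS X, use <code>sed -i ''</code> or redirect to a new file instead):</p>
<pre>
sed -i 's|^tm_file=.*|tm_file=test/newstest2009.es.tok.lc.grammar|' test/joshua.config
</pre>
<p>After the config file points at the test-set grammar, you can decode the test set with the following command:</p>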
<pre>
java -Xmx1g -cp $JOSHUA/bin/ -Djava.library.path=$JOSHUA/lib -Dfile.encoding=utf8 \
joshua.decoder.JoshuaDecoder \
test/joshua.config \
test/newstest2009.es.tok.lc \
test/newstest2009.output.nbest
</pre>
<p>After the decoder has finished, you can extract the 1-best translations from the n-best list using the following command:</p>
<pre>
java -cp $JOSHUA/bin -Dfile.encoding=utf8 \
joshua.util.ExtractTopCand \
test/newstest2009.output.nbest \
test/newstest2009.output.1best
</pre>
<!-- ccb - todo - show the output -->
<a name="step8" />
<h1>Step 8: Recase and detokenize</h1>
<p>You'll notice that your output is all lowercased and has the punctuation split off as separate tokens. In order to make the output more readable to human beings (remember us?), it'd be good to fix these problems by restoring the capitalization and rejoining the punctuation. These steps are called recasing and detokenization, respectively. We can do recasing using SRILM, and detokenization with a perl script. </p>
<p>To build a recasing model first train a language model on true cased English text:</p>
<pre>
$SRILM/bin/macosx64/ngram-count \
-unk \
-order 5 \
-kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4 -kndiscount5 \
-text training/training.en.tok \
-lm model/lm/training.TrueCase.5gram.lm
</pre>
<p>Next, you'll need to create a list of all of the alternative ways that each word can be capitalized. This will be stored in a map file that lists a lowercased word as the key and associates it with all of the variant capitalizations of that word. Here's an example perl script to create the map:</p>
<pre>
#!/usr/bin/perl
#
# truecase-map.perl
# -----------------
# This script outputs alternate capitalizations
%map = ();
while($line = <>) {
@words = split(/\s+/, $line);
foreach $word (@words) {
$key = lc($word);
$map{$key}{$word} = 1;
}
}
foreach $key (sort keys %map) {
@words = keys %{$map{$key}};
if(scalar(@words) > 1 || !($words[0] eq $key)) {
print $key;
foreach $word (sort @words) {
print " $word";
}
print "\n";
}
}
</pre>
<pre>
cat training/training.en.tok | perl truecase-map.perl > model/lm/true-case.map
</pre>
<p>Finally, recase the lowercased 1-best translations by running the SRILM <code>disambig</code> program, which takes the map of alternative capitalizations, creates a confusion network, and uses the truecased LM to find the best path through it:</p>
<pre>
$SRILM/bin/macosx64/disambig \
-lm model/lm/training.TrueCase.5gram.lm \
-keep-unk \
-order 5 \
-map model/lm/true-case.map \
-text test/newstest2009.output.1best \
| perl strip-sent-tags.perl \
> test/newstest2009.output.1best.recased
</pre>
<p>Where <code>strip-sent-tags.perl</code> is:</p>
<pre>
while($line = <>) {
$line =~ s/^\s*&lt;s&gt;\s*//g;
$line =~ s/\s*&lt;\/s&gt;\s*$//g;
print $line . "\n";
}
</pre>
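<p>The recased output is still tokenized, with punctuation split off as separate tokens. To detokenize it, you can use the standard WMT/Moses <code>detokenizer.perl</code> script; the sketch below assumes you have placed a copy of that script in the <code>scripts/</code> directory (it is not guaranteed to be part of the data tarball used here, but it is freely available from the WMT site):</p>
<pre>
cat test/newstest2009.output.1best.recased \
  | perl scripts/detokenizer.perl -l en \
  > test/newstest2009.output.1best.recased.detok
</pre>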
<!-- ccb - todo - show the recased output -->
<a name="step9" />
<h1>Step 9: Score the translations</h1>
<p>
The quality of machine translation is commonly measured using the BLEU metric, which automatically compares a system's output against reference human translations. You can score your output using the JoshuaEval class, Joshua's built-in scorer:
</p>
<pre>
java -cp $JOSHUA/bin -Djava.library.path=$JOSHUA/lib -Xmx1000m -Xms1000m \
	-Djava.util.logging.config.file=logging.properties \
	joshua.util.JoshuaEval \
	-cand test/newstest2009.output.1best \
	-ref test/newstest2009.en.tok.lc \
	-m BLEU 4 closest
</pre>
<!-- ccb - todo - update these numbers with the actual numbers
The output will be something like:
BLEU_precision(1) = 797 / 2169 = 0.3675
BLEU_precision(2) = 346 / 2119 = 0.1633
BLEU_precision(3) = 174 / 2069 = 0.0841
BLEU_precision(4) = 86 / 2019 = 0.0426
BLEU_precision = 0.1211
Length of candidate corpus = 2169
Effective length of reference corpus = 1360
BLEU_BP = 1.0000
BLEU = 0.1211
Your Bleu score in this case would be 0.1211.
-->
<!-- <div id="news"> -->
<!-- <div id="contentcenter"><h2>Made Possible By</h2></div> -->
<!-- <p><a href="http://hltcoe.jhu.edu/"> -->
<!-- <img src="images/sponsors/hltcoe-logo1.jpg" width="100" border="0" /><br /> -->
<!-- The Human Language Technology Center of Excellence (HLTCOE) </a></p> -->
<!-- <p><a href="http://www.darpa.mil/Our_Work/I2O/Programs/Global_Autonomous_Language_Exploitation_(GALE).aspx"> -->
<!-- <img src="images/sponsors/darpa-logo.jpg" width="100" border="0" /><br /> -->
<!-- Global Autonomous Language Exploitation (GALE) </a></p> -->
<!-- <p><a href="http://nsf.gov/awardsearch/showAward.do?AwardNumber=0713448"> -->
<!-- <img src="images/sponsors/NSF-logo.jpg" width="100" border="0" /><br /> -->
<!-- Multi-level modeling of language and translation </a></p> -->
<!-- <br> -->
<!-- <p> -->
<!-- <div xmlns:cc="http://creativecommons.org/ns#" about="http://www.flickr.com/photos/marcusfrieze/422897640">Logo photo by <a rel="cc:attributionURL" href="http://www.flickr.com/photos/marcusfrieze/">Marcus Frieze</a> used under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/2.0/">Creative Commons License</a>.</div> -->
<!-- </p> -->
<!-- </div> -->