<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
<meta name="author" content="">
<link rel="icon" href="../../favicon.ico">
<title>Joshua Documentation | Alignment with Jacana</title>
<!-- Bootstrap core CSS -->
<link href="/dist/css/bootstrap.min.css" rel="stylesheet">
<!-- Custom styles for this template -->
<link href="/joshua6.css" rel="stylesheet">
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="blog-nav">
<!-- <a class="blog-nav-item active" href="#">Joshua</a> -->
<a class="blog-nav-item" href="/">Joshua</a>
<!-- <a class="blog-nav-item" href="/6.0/whats-new.html">New features</a> -->
<a class="blog-nav-item" href="/language-packs/">Language packs</a>
<a class="blog-nav-item" href="/data/">Datasets</a>
<a class="blog-nav-item" href="/support/">Support</a>
<a class="blog-nav-item" href="/contributors.html">Contributors</a>
</nav>
</div>
</div>
<div class="container">
<div class="row">
<div class="col-sm-2">
<div class="sidebar-module">
<!-- <h4>About</h4> -->
<center>
<img src="/images/joshua-logo-small.png" />
<p>Joshua machine translation toolkit</p>
</center>
</div>
<hr>
<center>
<a href="/releases/current/" target="_blank"><button class="button">Download Joshua 6.0.5</button></a>
<br />
<a href="/releases/runtime/" target="_blank"><button class="button">Runtime only version</button></a>
<p>Released November 5, 2015</p>
</center>
<hr>
<!-- <div class="sidebar-module"> -->
<!-- <span id="download"> -->
<!-- <a href="http://joshua-decoder.org/downloads/joshua-6.0.tgz">Download</a> -->
<!-- </span> -->
<!-- </div> -->
<div class="sidebar-module">
<h4>Using Joshua</h4>
<ol class="list-unstyled">
<li><a href="/6.0/install.html">Installation</a></li>
<li><a href="/6.0/quick-start.html">Quick Start</a></li>
</ol>
</div>
<hr>
<div class="sidebar-module">
<h4>Building new models</h4>
<ol class="list-unstyled">
<li><a href="/6.0/pipeline.html">Pipeline</a></li>
<li><a href="/6.0/tutorial.html">Tutorial</a></li>
<li><a href="/6.0/faq.html">FAQ</a></li>
</ol>
</div>
<!--
<div class="sidebar-module">
<h4>Phrase-based</h4>
<ol class="list-unstyled">
<li><a href="/6.0/phrase.html">Training</a></li>
</ol>
</div>
-->
<hr>
<div class="sidebar-module">
<h4>Advanced</h4>
<ol class="list-unstyled">
<li><a href="/6.0/bundle.html">Building language packs</a></li>
<li><a href="/6.0/decoder.html">Decoder options</a></li>
<li><a href="/6.0/file-formats.html">File formats</a></li>
<li><a href="/6.0/packing.html">Packing TMs</a></li>
<li><a href="/6.0/large-lms.html">Building large LMs</a></li>
</ol>
</div>
<hr>
<div class="sidebar-module">
<h4>Developer</h4>
<ol class="list-unstyled">
<li><a href="https://github.com/joshua-decoder/joshua">GitHub</a></li>
<li><a href="http://cs.jhu.edu/~post/joshua-docs">Javadoc</a></li>
<li><a href="https://groups.google.com/forum/?fromgroups#!forum/joshua_developers">Mailing list</a></li>
</ol>
</div>
</div><!-- /.blog-sidebar -->
<div class="col-sm-8 blog-main">
<div class="blog-title">
<h2>Alignment with Jacana</h2>
</div>
<div class="blog-post">
<h2 id="introduction">Introduction</h2>
<p>jacana-xy is a token-based word aligner for machine translation, adapted from the original
English-English word aligner jacana-align described in the following paper:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>A Lightweight and High Performance Monolingual Word Aligner. Xuchen Yao, Benjamin Van Durme,
Chris Callison-Burch and Peter Clark. Proceedings of ACL 2013, short papers.
</code></pre>
</div>
<p>It currently supports only French-to-English alignment with a very limited feature set, the result of a
one-week hack at the <a href="http://statmt.org/mtm13">Eighth MT Marathon 2013</a>. Please feel free to check
out the code, read to the bottom of this page, and
<a href="http://www.cs.jhu.edu/~xuchen/">send the author an email</a> if you want to add more language pairs to
it.</p>
<h2 id="build">Build</h2>
<p>jacana-xy is written in a mixture of Java and Scala. If you build with ant, you have to set the
environment variables <code class="highlighter-rouge">JAVA_HOME</code> and <code class="highlighter-rouge">SCALA_HOME</code>. On my system, I have:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.26
export SCALA_HOME=/home/xuchen/Downloads/scala-2.10.2
</code></pre>
</div>
<p>Then type:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>ant
</code></pre>
</div>
<p>This builds <code class="highlighter-rouge">build/lib/jacana-xy.jar</code>.</p>
<p>If you build from Eclipse, first install scala-ide, then import the whole jacana folder as a Scala project. Eclipse should find the <code class="highlighter-rouge">.project</code> file and set up the project automatically.</p>
<h2 id="demo">Demo</h2>
<p><code class="highlighter-rouge">scripts-align/runDemoServer.sh</code> starts the web demo. Point your browser to http://localhost:8080/ and you should be able to align some sentences.</p>
<p>Note: to tell jacana-xy where to look for resource files, pass the property <code class="highlighter-rouge">JACANA_HOME</code> to Java when you run it:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>java -DJACANA_HOME=/path/to/jacana -cp jacana-xy.jar ……
</code></pre>
</div>
<h2 id="browser">Browser</h2>
<p>You can also browse one or two alignment files (*.json) by opening <code class="highlighter-rouge">src/web/AlignmentBrowser.html</code> in Firefox.</p>
<p>Note 1: due to strict security settings for accessing local files, Chrome and IE won’t work.</p>
<p>Note 2: the input *.json files have to be in the same folder as AlignmentBrowser.html.</p>
<h2 id="align">Align</h2>
<p><code class="highlighter-rouge">scripts-align/alignFile.sh</code> aligns tab-separated sentence files and writes the result to a .json file that the browser accepts:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -src fr -tgt en -m fr-en.model -a s.txt -o s.json
</code></pre>
</div>
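<p>The exact column layout of the tab-separated input is not spelled out here. The following Python sketch assumes one sentence pair per line, with the source sentence first and the target sentence second. The helper name <code class="highlighter-rouge">to_tab_separated</code> is hypothetical and not part of jacana-xy.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>def to_tab_separated(src_sentences, tgt_sentences):
    """Pair up parallel sentences as "source TAB target" lines.

    Assumption: one sentence pair per line, source sentence first.
    """
    if len(src_sentences) != len(tgt_sentences):
        raise ValueError("source and target must have the same number of sentences")
    lines = []
    for src, tgt in zip(src_sentences, tgt_sentences):
        # A tab inside a sentence would break the two-column layout.
        if "\t" in src or "\t" in tgt:
            raise ValueError("sentences must not contain tab characters")
        lines.append(src.strip() + "\t" + tgt.strip())
    return lines


pairs = to_tab_separated(["bravo !"], ["hear , hear !"])
print("\n".join(pairs))
</code></pre>
</div>
<p>Writing the returned lines to a file such as s.txt would give a candidate input for the <code class="highlighter-rouge">-a</code> option, provided the assumed column order matches what the aligner expects.</p>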
<p><code class="highlighter-rouge">scripts-align/alignFile.sh</code> also takes GIZA++-style input files (one file containing the source sentences and another containing the target sentences) and writes a single .align file with dashed alignment indices (e.g. "1-2 0-4"):</p>
<div class="highlighter-rouge"><pre class="highlight"><code>java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -m fr-en.model -src fr -tgt en -a s1.txt -b s2.txt -o s.align
</code></pre>
</div>
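<p>To inspect the dashed output, each pair can be split into two token indices. The Python sketch below assumes the indices are 0-based and ordered source-target, as the sureAlign fields of the JSON examples further down this page suggest, and that sentences are tokenized on whitespace. <code class="highlighter-rouge">parse_alignment</code> and <code class="highlighter-rouge">aligned_tokens</code> are hypothetical helper names.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>def parse_alignment(line):
    """Turn "1-2 0-4" into [(1, 2), (0, 4)]."""
    pairs = []
    for item in line.split():
        src_index, tgt_index = item.split("-")
        pairs.append((int(src_index), int(tgt_index)))
    return pairs


def aligned_tokens(src_sentence, tgt_sentence, line):
    """Look up the tokens behind each index pair (whitespace tokenization assumed)."""
    src_tokens = src_sentence.split()
    tgt_tokens = tgt_sentence.split()
    return [(src_tokens[i], tgt_tokens[j]) for i, j in parse_alignment(line)]


print(parse_alignment("1-2 0-4"))
print(aligned_tokens("bravo !", "hear , hear !", "1-3"))
</code></pre>
</div>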
<h2 id="training">Training</h2>
<div class="highlighter-rouge"><pre class="highlight"><code>java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -r train.json -d dev.json -t test.json -m /tmp/align.model
</code></pre>
</div>
<p>The aligner trains on train.json and reports F1 on dev.json every 10 iterations. Once the stopping criterion is reached, it tests on test.json.</p>
<p>Every 10 iterations, a model file is also saved to (in this example) /tmp/align.model.iter_XX.F1_XX.X. Normally I select the one with the best F1 on dev.json, then run a final test on test.json:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -t test.json -m /tmp/align.model.iter_XX.F1_XX.X
</code></pre>
</div>
<p>Since no training data is given in this case, the aligner assumes it is a test job, reads the model file named by the <code class="highlighter-rouge">-m</code> option, and tests on test.json.</p>
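<p>Picking the checkpoint with the best dev F1 can be automated by reading the score back out of the file name. This is a Python sketch only: it assumes the saved files follow the iter_XX.F1_XX.X pattern described above, the exact number formatting is a guess, and the file names in the example are made up.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>import re

# Assumed file-name shape: PREFIX.iter_NN.F1_NN.N (number formatting is a guess).
MODEL_NAME = re.compile(r"\.iter_(\d+)\.F1_(\d+(?:\.\d+)?)$")


def best_model(paths):
    """Return the path with the highest F1 encoded in its name, or None."""
    best_path, best_f1 = None, -1.0
    for path in paths:
        match = MODEL_NAME.search(path)
        if match is None:
            continue  # not a saved checkpoint
        f1 = float(match.group(2))
        if f1 > best_f1:
            best_path, best_f1 = path, f1
    return best_path


# Made-up file names for illustration only.
candidates = [
    "/tmp/align.model.iter_10.F1_71.3",
    "/tmp/align.model.iter_20.F1_78.9",
    "/tmp/align.model.iter_30.F1_77.5",
]
print(best_model(candidates))
</code></pre>
</div>
<p>In practice the candidate list could come from <code class="highlighter-rouge">glob.glob("/tmp/align.model.iter_*")</code>, and the selected path would then be passed to the <code class="highlighter-rouge">-m</code> option.</p>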
<p>All the JSON files are in a format like the following (also accepted by the browser for display):</p>
<div class="highlighter-rouge"><pre class="highlight"><code>[
    {
        "id": "0008",
        "name": "Hansards.french-english.0008",
        "possibleAlign": "0-0 0-1 0-2",
        "source": "bravo !",
        "sureAlign": "1-3",
        "target": "hear , hear !"
    },
    {
        "id": "0009",
        "name": "Hansards.french-english.0009",
        "possibleAlign": "1-1 6-5 7-5 6-6 7-6 13-10 13-11",
        "source": "monsieur le Orateur , ma question se adresse à le ministre chargé de les transports .",
        "sureAlign": "0-0 2-1 3-2 4-3 5-4 8-7 9-8 10-9 12-10 14-11 15-12",
        "target": "Mr. Speaker , my question is directed to the Minister of Transport ."
    }
]
</code></pre>
</div>
<p>The <code class="highlighter-rouge">possibleAlign</code> field is not used.</p>
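<p>Because these files are plain JSON, they are easy to post-process. The Python sketch below reads a record in the format above and scores a predicted alignment against sureAlign with set-based precision, recall and F1. Whether jacana-xy computes its reported F1 in exactly this way is not stated here, so treat the scoring as an assumption; the predicted alignment is made up and the helper names are hypothetical.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>import json

# One record in the format shown above, trimmed to the fields used here.
RECORDS = json.loads("""
[
    {
        "id": "0008",
        "source": "bravo !",
        "target": "hear , hear !",
        "sureAlign": "1-3"
    }
]
""")


def links(align_string):
    """Parse "0-0 2-1" into a set of (source_index, target_index) pairs."""
    return {tuple(int(n) for n in item.split("-")) for item in align_string.split()}


def f1_score(predicted, gold):
    """Set-based F1 between predicted and gold links (0.0 when nothing overlaps)."""
    correct = len(predicted.intersection(gold))
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = links(RECORDS[0]["sureAlign"])
predicted = links("0-0 1-3")  # made-up aligner output for illustration
print(f1_score(predicted, gold))
</code></pre>
</div>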
<p>Training stops after 300 iterations or when the objective changes by less than 0.001 between two iterations, whichever happens first. Both values are currently hard-coded. If you need more flexibility here, send me an email!</p>
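<p>The stopping rule can be pictured as a simple loop. This Python sketch is not the aligner's actual training code: <code class="highlighter-rouge">step</code> is a hypothetical callable that runs one iteration and returns the objective value, and the toy objective exists only to make the example runnable.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>MAX_ITERATIONS = 300  # the two hard-coded limits described above
TOLERANCE = 0.001


def train(step, max_iterations=MAX_ITERATIONS, tolerance=TOLERANCE):
    """Call step() until the objective stops changing or the cap is hit.

    Returns the number of iterations that were run.
    """
    previous = None
    for iteration in range(1, max_iterations + 1):
        objective = step()
        if previous is not None and tolerance > abs(objective - previous):
            return iteration  # objective moved by less than the tolerance
        previous = objective
    return max_iterations


# Toy objective that halves its distance to 1.0 on every call.
state = {"value": 0.0}


def toy_step():
    state["value"] += (1.0 - state["value"]) / 2
    return state["value"]


print(train(toy_step))
</code></pre>
</div>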
<h2 id="support-more-languages">Support More Languages</h2>
<p>To add support for more languages, you need to:</p>
<ol>
<li>collect labelled word alignments (the download already contains French-English under alignment-data/fr-en; I also have Chinese-English and Arabic-English; let me know if you have more). Usually 100 labelled sentence pairs are enough.</li>
<li>implement some feature functions for the language pair.</li>
</ol>
<p>To add more features, you need to implement the following interface:</p>
<p><code class="highlighter-rouge">edu.jhu.jacana.align.feature.AlignFeature</code></p>
<p>and override the following function:</p>
<p><code class="highlighter-rouge">addPhraseBasedFeature</code></p>
<p>For instance, a simple feature for the French-English alignment task that checks whether two words are translations of each other in Wiktionary implements the function as:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>def addPhraseBasedFeature(pair: AlignPair, ins: AlignFeatureVector, i: Int, srcSpan: Int, j: Int, tgtSpan: Int,
    currState: Int, featureAlphabet: Alphabet) {
  if (j == -1) {
    // nothing is added when j is -1
  } else {
    // join the tokens covered by each span into a single lookup string
    val srcTokens = pair.srcTokens.slice(i, i+srcSpan).mkString(" ")
    val tgtTokens = pair.tgtTokens.slice(j, j+tgtSpan).mkString(" ")

    if (WiktionaryMultilingual.exists(srcTokens, tgtTokens)) {
      ins.addFeature("InWiktionary", NONE_STATE, currState, 1.0, srcSpan, featureAlphabet)
    }
  }
}
</code></pre>
</div>
<p>This is a more general function that also handles phrase alignment. However, I suggest implementing it for token alignment only, since the phrase alignment part is currently very slow to train (60x slower than token alignment).</p>
<p>Some other language-independent and English-only features are implemented under the package <code class="highlighter-rouge">edu.jhu.jacana.align.feature</code>, for instance:</p>
<ul>
<li><code class="highlighter-rouge">StringSimilarityAlignFeature</code>: various string similarity measures</li>
<li><code class="highlighter-rouge">PositionalAlignFeature</code>: features based on relative sentence positions</li>
<li><code class="highlighter-rouge">DistortionAlignFeature</code>: Markovian (state transition) features</li>
</ul>
<p>When you add features for more languages, just create a new package like the one for French-English:</p>
<p><code class="highlighter-rouge">edu.jhu.jacana.align.feature.fr_en</code></p>
<p>and start coding!</p>
</div>
</div><!-- /.row -->
</div><!-- /.container -->
<!-- Bootstrap core JavaScript
================================================== -->
<!-- Placed at the end of the document so the pages load faster -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script src="../../dist/js/bootstrap.min.js"></script>
<!-- <script src="../../assets/js/docs.min.js"></script> -->
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<!-- <script src="../../assets/js/ie10-viewport-bug-workaround.js"></script>
-->
<!-- Start of StatCounter Code for Default Guide -->
<script type="text/javascript">
var sc_project=8264132;
var sc_invisible=1;
var sc_security="4b97fe2d";
</script>
<script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script>
<noscript>
<div class="statcounter">
<a title="hit counter joomla"
href="http://statcounter.com/joomla/"
target="_blank">
<img class="statcounter"
src="http://c.statcounter.com/8264132/0/4b97fe2d/1/"
alt="hit counter joomla" />
</a>
</div>
</noscript>
<!-- End of StatCounter Code for Default Guide -->
</body>
</html>