commit | ed0b93d39b22a069a75365dcf387c826e8a79674 | |
---|---|---|
author | Tobias Domhan <domhant@amazon.com> | Thu Dec 01 10:21:52 2016 +0100 |
committer | Tobias Domhan <domhant@amazon.com> | Thu Dec 01 10:21:52 2016 +0100 |
tree | 1a1e24de3c8fd0684ed31550b62b2c811e5aa3d1 | |
parent | fe0666431b17e7c174c4a51a85e95e7ae390f4ba | |
parent | 15221f77a67921098325416b4673b167fc8ddd90 | |
Merge branch 'master' of https://github.com/apache/incubator-joshua
Joshua is a statistical machine translation toolkit for both phrase-based (new in version 6.0) and syntax-based decoding. It can be run with pre-built language packs available for download, and can also be used to build models for new language pairs. Among the many features of Joshua are:
The latest release of Joshua is always linked directly from the Home Page.
Joshua 6.X includes the following new features:
Joshua requires a Java JDK, version 1.8 or later.
Running the decoder in any form requires setting a few basic environment variables: $JAVA_HOME, $JOSHUA, and, for certain (optional) portions of the model-training pipeline, potentially $MOSES.
export JAVA_HOME=/path/to/java  # maybe /usr/java/home
export JOSHUA=/path/to/joshua
You might also find it helpful to set these:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
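Before building, it can help to confirm that the required variables are actually set. The following is a minimal sketch, not part of Joshua: `check_joshua_env` is a hypothetical helper name, and it only checks that $JAVA_HOME and $JOSHUA are non-empty and point to existing directories.

```shell
# Hypothetical helper (not shipped with Joshua): checks the two required
# environment variables described above.
check_joshua_env() {
    status=0
    for name in JAVA_HOME JOSHUA; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value" >&2
            status=1
        fi
    done
    return $status
}
```

Calling `check_joshua_env` returns a non-zero status and prints a message for each variable that fails the check, so it can guard a build script with `check_joshua_env || exit 1`.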
Then, compile Joshua by typing:
cd $JOSHUA
mvn clean package
You also need to download and compile KenLM and Thrax:
bash ./download-deps.sh
The basic method for invoking the decoder looks like this:
cat SOURCE | $JOSHUA/bin/joshua-decoder -m MEM -c CONFIG OPTIONS > OUTPUT
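As an illustration of how the placeholders in that template map to concrete arguments, here is a small wrapper sketch. The function name `translate_file` and its argument order are assumptions made for this example and are not part of Joshua; only the `-m` and `-c` options come from the invocation shown above, and any extra arguments are passed through unchanged as OPTIONS.

```shell
# Hypothetical wrapper around the invocation pattern shown above.
# Usage: translate_file SOURCE MEM CONFIG OUTPUT [OPTIONS...]
translate_file() {
    if [ "$#" -lt 4 ]; then
        echo "usage: translate_file SOURCE MEM CONFIG OUTPUT [OPTIONS...]" >&2
        return 2
    fi
    if [ -z "${JOSHUA:-}" ]; then
        echo "JOSHUA is not set" >&2
        return 1
    fi
    source_file=$1
    mem=$2
    config=$3
    output=$4
    shift 4
    # Feed SOURCE on standard input and write the decoder's output to OUTPUT;
    # the remaining arguments ("$@") stand in for OPTIONS.
    "$JOSHUA/bin/joshua-decoder" -m "$mem" -c "$config" "$@" \
        < "$source_file" > "$output"
}
```

A call might look like `translate_file input.txt 4g joshua.config output.txt`, where the file names and the memory value are placeholders to replace with your own.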
Some example usage scenarios and scripts can be found in the examples/ directory.
If you plan to work on the decoder itself, we suggest using Eclipse. To generate the Eclipse project files, type:
mvn eclipse:eclipse
Joshua includes a number of “language packs”, which are pre-built models that allow you to use the translation system as a black box, without worrying too much about how machine translation works. You can browse the models available for download on the Joshua website.
Joshua includes a pipeline script that allows you to build new models, provided you have training data. This pipeline can be run (more or less) by invoking a single command, which handles data preparation, alignment, phrase-table or grammar construction, and tuning of the model parameters. See the documentation for a walkthrough and more information about the many available options.
Joshua is licensed and released under the permissive Apache License v2.0, a copy of which ships with the Joshua source code.