##################################
## Joshua Getting Started Guide ##
##################################
----------------------------
| Getting the Joshua Decoder |
----------------------------
The best way to obtain a copy of the Joshua source code is to check
out from the main SVN repository. For this, you need Subversion, which
is likely already present on your system. If not, it can be obtained
freely at http://subversion.tigris.org/. If you already have a copy of
Joshua, please skip to the Prerequisites or Installing & Compiling
sections of this document.
To set up your own copy of Joshua, you start by checking out the main
development branch:
$ svn co https://joshua.svn.sourceforge.net/svnroot/joshua/trunk joshua
This creates a subdirectory with your local copy of the source
code. This directory is under version control, which means that your
local copy is tied to the main repository. This facilitates keeping
your code up-to-date as well as contributing your changes back to the
repository.
To fetch the latest changes, fixes or improvements to the decoder, you
simply run:
$ cd joshua
$ svn up
A more detailed introduction to SVN can be found at
http://subversion.tigris.org/.
---------------
| Prerequisites |
---------------
[ Java ]
Joshua is written in Java, and thus requires a Java SDK to be
installed. Please make sure you use a recent version of Java. Make
sure you have $JAVA_HOME set to the SDK directory.
On Mac OS X this is usually done by adding
export JAVA_HOME="/Library/Java/Home"
to your .bashrc, .bash_profile or .profile file.
[ Build Tools ]
For building as well as for the actual decoding, Joshua also requires
a few additional software packages:
- Apache Ant -
is a Java build tool with functionality similar to the
make tool. It can be found at
http://ant.apache.org/
- Swig -
is an inter-language wrapper and can be obtained at
http://www.swig.org/
- SRILM -
is a widely used language modeling toolkit, available for
download at
http://www.speech.sri.com/projects/srilm/
Make sure the $SRILM variable is set to the directory in which you
installed SRILM, e.g.
$ export SRILM=/path/to/srilm
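The prerequisites above can be checked in one go before building. The
following is a minimal sketch, not part of Joshua: the helper names
have_tool and have_dir_var are made up for illustration, and the script
only reports what it finds without changing anything.

```shell
# Hypothetical helper: succeed if a command is reachable from the command line.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: succeed if the named environment variable is set
# and points to an existing directory.
have_dir_var() {
    eval "value=\${$1:-}"
    [ -n "$value" ] && [ -d "$value" ]
}

# Tools named in this section.
for tool in java ant swig; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done

# Environment variables named in this section.
for var in JAVA_HOME SRILM; do
    if have_dir_var "$var"; then
        echo "set:     $var"
    else
        echo "not set: $var (or not a directory)"
    fi
done
```

Any line starting with "missing" or "not set" points to a prerequisite
that still needs attention.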
-------------------------------
| Installing & Compiling Joshua |
-------------------------------
First of all, make sure Ant and Swig are installed and properly set up
(i.e. accessible from the command line). SRILM should be built prior
to this step. For convenience, you may wish to set the JOSHUA_HOME
environment variable to the directory you installed Joshua in.
To build Joshua, it is sufficient to change into its install directory
and run Ant's compile target:
$ cd $JOSHUA_HOME
$ ant compile
This builds the Java source code, as well as the SRILM wrapper. Similarly,
if you have made changes to the code, you can rebuild the decoder using the
same command.
For a full rebuild of the decoder, simply run
$ ant clean
before building. This command will remove any previously compiled code.
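Ant accepts several targets on one command line and runs them in the
order given, so the clean and compile steps can be combined. The sketch
below wraps this in a shell function; the name rebuild_joshua is made up
for illustration, and it assumes JOSHUA_HOME is set as suggested above.

```shell
# Hypothetical helper: full rebuild of the decoder from a clean state.
rebuild_joshua() {
    if [ -z "${JOSHUA_HOME:-}" ]; then
        echo "JOSHUA_HOME is not set" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    # Ant runs the targets in order: clean first, then compile.
    ( cd "$JOSHUA_HOME" && ant clean compile )
}
```

Calling rebuild_joshua from any directory is then equivalent to running
the two ant commands above inside $JOSHUA_HOME.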
----------------
| Testing Joshua |
----------------
To run the Joshua unit tests:
$ ant test
To run the example:
$ ./example/decode_example_javalm.sh
or
$ ./example/decode_example_srilm.sh
-----------------------------
| Packaging Joshua (optional) |
-----------------------------
To package the decoder into a JAR archive containing either the
compiled classes or the source code, run
$ ant jar
or
$ ant source-jar
--------------------------
| Extract a sample grammar |
--------------------------
(TODO: Lane, please review this section to make sure it's got enough details. --O.Z.)
To extract a grammar, you need to provide a parallel training corpus,
as well as alignment data (src-tgt) for the training sentences, to
joshua.prefix_tree.ExtractRules. ExtractRules has over 25 flags, but usually
you only need to account for a subset of those flags. Here is a command
you would run to extract grammar rules from the small 100-sentence
Spanish-English dataset in the data/ folder:
$ java -cp bin joshua.prefix_tree.ExtractRules \
--source=data/europarl.es.small.100 \
--target=data/europarl.en.small.100 \
--alignments=data/es_en_europarl_alignments.txt.small.100 \
--test=data/europarl.es.small.1 \
--output=es-en.grammar.small.unsorted \
--maxPhraseLength=5 \
--print-rules=false
(TODO: Lane, are the two --sentence-initial/final-X flags important? I've been using them in experiments but their default value is false.)
Once this is finished, you will notice a newly created file,
es-en.grammar.small.unsorted, where each line corresponds to a grammar
rule. Before you can use this grammar file, the lines need to be sorted
(and duplicates need to be eliminated), so run:
$ sort -u es-en.grammar.small.unsorted > es-en.grammar.small
Finally, gzip the grammar file, and you'll have a file that the decoder
can use to translate sentences:
$ gzip es-en.grammar.small
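The sort and gzip steps can also be combined into a single pipeline that
skips the intermediate sorted file. This is a minimal sketch: the
function name finalize_grammar is made up for illustration, and the toy
input lines are placeholders, not real grammar rules.

```shell
# Hypothetical helper: sort, deduplicate and compress an unsorted grammar.
# Usage: finalize_grammar UNSORTED_FILE OUTPUT_GZ
finalize_grammar() {
    sort -u "$1" | gzip > "$2"
}

# Toy demonstration on placeholder lines (not real grammar rules).
demo_dir=$(mktemp -d)
printf 'rule b\nrule a\nrule b\n' > "$demo_dir/toy.unsorted"
finalize_grammar "$demo_dir/toy.unsorted" "$demo_dir/toy.gz"
# Prints the two distinct lines in sorted order: "rule a", then "rule b".
gzip -dc "$demo_dir/toy.gz"
rm -r "$demo_dir"
```

For the example above, the call would be
finalize_grammar es-en.grammar.small.unsorted es-en.grammar.small.gz,
which yields the same file name that the gzip command produces.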
------------------------
| Extract large grammars |
------------------------
(TODO: Lane, please review this section to make sure it's got enough details and is actually correct! --O.Z.)
If you wish to extract grammar rules from a large training corpus,
ExtractRules could easily require several gigabytes of RAM. There is
a somewhat different usage of ExtractRules that would allow you to get
by with much less memory.
To do so, you would need to create binary files for the corpus, suffixes,
and vocabulary (for each of the two sides of the training corpus), as well
as for the alignment data itself. You do so by running the main
methods of SuffixArray and AlignmentGrids before running ExtractRules:
1) Create binary files for source side of the training corpus:
$ java -cp bin joshua.corpus.suffix_array.SuffixArray data/europarl.es.small.100 vocab.es.bin corpus.es.bin suffixes.es.bin
(this creates 3 .bin files)
2) Create binary files for target side of the training corpus:
$ java -cp bin joshua.corpus.suffix_array.SuffixArray data/europarl.en.small.100 vocab.en.bin corpus.en.bin suffixes.en.bin
(this creates 3 .bin files)
3) Create binary file for alignments file:
$ java -cp bin joshua.corpus.alignment.AlignmentGrids data/es_en_europarl_alignments.txt.small.100 alignments.bin
(this creates alignments.bin)
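Steps 1 to 3 should leave seven binary files behind. The sketch below
checks that they all exist and are non-empty; the function name
check_binary_files is made up for illustration, and the file names are
the ones used in the commands above.

```shell
# Hypothetical helper: verify that the binary files from steps 1-3 exist
# and are non-empty in the given directory (default: current directory).
check_binary_files() {
    dir="${1:-.}"
    missing=0
    for f in vocab.es.bin corpus.es.bin suffixes.es.bin \
             vocab.en.bin corpus.en.bin suffixes.en.bin \
             alignments.bin; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A non-zero exit status means at least one of the steps above needs to be
rerun.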
And now, you can run ExtractRules as before, but using a different subset
of its flags:
$ java -cp bin joshua.prefix_tree.ExtractRules \
--binary-source=true \
--binary-target=true \
--source=corpus.es.bin \
--target=corpus.en.bin \
--source-vocab=vocab.es.bin \
--target-vocab=vocab.en.bin \
--source-suffixes=suffixes.es.bin \
--target-suffixes=suffixes.en.bin \
--alignmentsType=MemoryMappedAlignmentGrids \
--alignments=alignments.bin \
--test=data/europarl.es.small.1 \
--output=es-en.grammar.small.unsorted \
--maxPhraseLength=5 \
--print-rules=false
Follow that with "sort -u" and gzip, as in the previous section.
Of course, once you start dealing with a large corpus, you will have to
use -Xmx (and -Xms) to provide ExtractRules with more memory than
the default amount allocated to java (64 MB). Even with this alternative
memory-efficient method, you may need up to 1-2 GB of RAM.
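The heap options go directly after the java command, before -cp. The
sketch below wraps this in a shell function; the name
extract_rules_with_heap is made up for illustration, and suitable heap
sizes depend on your corpus and machine.

```shell
# Hypothetical helper: run ExtractRules with explicit JVM heap settings.
# Usage: extract_rules_with_heap INITIAL_HEAP MAX_HEAP [ExtractRules flags]
extract_rules_with_heap() {
    initial_heap="$1"
    max_heap="$2"
    shift 2
    # -Xms sets the initial heap size, -Xmx the maximum heap size.
    java -Xms"$initial_heap" -Xmx"$max_heap" \
        -cp bin joshua.prefix_tree.ExtractRules "$@"
}
```

For example, extract_rules_with_heap 512m 2048m followed by the
ExtractRules flags shown above starts the JVM with a 512 MB initial heap
and a 2 GB limit. Both sizes are placeholders.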
########################################################################
For any further questions or help, please turn to the Joshua support
mailing list at
joshua-support@lists.sourceforge.net