##################################
## Joshua Getting Started Guide ##
##################################

------------------------------
| Getting the Joshua Decoder |
------------------------------

The best way to obtain a copy of the Joshua source code is to check
out from the main SVN repository. For this, you need Subversion, which
is likely already present on your system. If not, it can be obtained
freely at http://subversion.tigris.org/. If you already have a copy of
Joshua, please skip to the Prerequisites or Installing & Compiling
sections of this document.

To set up your own copy of Joshua, you start by checking out the main
development branch:

    $ svn co https://joshua.svn.sourceforge.net/svnroot/joshua/trunk joshua

This creates a subdirectory with your local copy of the source
code. This directory is under version control, which means that your
local copy is tied to the main repository. This facilitates keeping
your code up-to-date as well as contributing your changes back to the
repository.

To fetch the latest changes, fixes or improvements to the decoder, you
simply run:

    $ cd joshua
    $ svn up

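Before updating, you can see which files would be affected without
touching your working copy (this is standard Subversion usage, not
specific to Joshua):

    $ svn status -u
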
A more detailed introduction to SVN can be found at
http://subversion.tigris.org/.



-----------------
| Prerequisites |
-----------------

[ Java ]

Joshua is written in Java, and thus requires a Java SDK to be
installed. Please make sure you use a recent version of Java, and
that you have $JAVA_HOME set to the SDK directory.

On Mac OS X this is usually done by adding

    export JAVA_HOME="/Library/Java/Home"

to your .bashrc, .bash_profile or .profile file.
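
You can do a quick sanity check of your Java setup from the command
line (the version string will vary with your installation; the second
command assumes a standard SDK layout):

    $ java -version
    $ ls $JAVA_HOME/bin/javac

The second command should find the Java compiler; if it does not,
$JAVA_HOME is probably not pointing at an SDK directory.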


[ Build Tools ]

For building as well as for the actual decoding, Joshua also requires
a few additional software packages:

  - Apache Ant -
      is a Java build tool with functionality similar to the
      make tool. It can be found at
      http://ant.apache.org/

  - Swig -
      is an inter-language wrapper generator and can be obtained at
      http://www.swig.org/

  - SRILM -
      is a widely used language modeling toolkit, available for
      download at
      http://www.speech.sri.com/projects/srilm/

Make sure you have the $SRILM variable set to the directory you
installed SRILM in, e.g.

    $ export SRILM=/path/to/srilm
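
SRILM itself needs to be compiled before Joshua's SRILM wrapper can be
built against it. A typical build, roughly following SRILM's own INSTALL
instructions, looks like this (consult SRILM's documentation for details
such as the MACHINE_TYPE setting on your platform):

    $ cd $SRILM
    $ make SRILM=$PWD World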


---------------------------------
| Installing & Compiling Joshua |
---------------------------------

First of all, make sure Ant and Swig are installed and properly set up
(i.e. accessible from the command line). SRILM should be built prior
to this step. For convenience, you may wish to set the JOSHUA_HOME
environment variable to the directory you installed Joshua in.

To build Joshua, it is sufficient to change into its install directory
and run Ant:

    $ cd $JOSHUA_HOME
    $ ant compile

This compiles the Java source code, as well as the SRILM wrapper. If you
make changes to the code, you can rebuild the decoder using the same
command.

For a full rebuild of the decoder, simply run

    $ ant clean

before building. This command removes any previously compiled code.
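
If you prefer a single command, the two targets can be chained; Ant runs
them in the order given, so this always starts from a clean state:

    $ ant clean compile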


------------------
| Testing Joshua |
------------------

To run the Joshua unit tests:

    $ ant test

To run the example:

    $ ./example/decode_example_javalm.sh

or

    $ ./example/decode_example_srilm.sh



-------------------------------
| Packaging Joshua (optional) |
-------------------------------

To pack the decoder into a JAR archive, either in compiled form or as
source code, run

    $ ant jar

or

    $ ant source-jar

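The resulting archive can then be used on the Java classpath in place of
the bin/ directory. To confirm what was packaged, you can list the
archive's contents (the file name joshua.jar is only an assumption here;
check the ant output for the actual name and location):

    $ jar tf joshua.jar | grep ExtractRules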


----------------------------
| Extract a sample grammar |
----------------------------

(TODO: Lane, please review this section to make sure it's got enough details. --O.Z.)

To extract a grammar, you need to provide a parallel training corpus,
as well as alignment data (src-tgt) for the training sentences, to
joshua.prefix_tree.ExtractRules. ExtractRules accepts over 25 flags,
but usually you only need a small subset of them. Here is the command
you would run to extract grammar rules from the small 100-sentence
Spanish-English dataset in the data/ folder:

    $ java -cp bin joshua.prefix_tree.ExtractRules \
        --source=data/europarl.es.small.100 \
        --target=data/europarl.en.small.100 \
        --alignments=data/es_en_europarl_alignments.txt.small.100 \
        --test=data/europarl.es.small.1 \
        --output=es-en.grammar.small.unsorted \
        --maxPhraseLength=5 \
        --print-rules=false

(TODO: Lane, are the two --sentence-initial/final-X flags important? I've been using them in experiments but their default value is false.)

Once this is finished, you will find a newly created file,
es-en.grammar.small.unsorted, in which each line corresponds to a grammar
rule. Before you can use this grammar file, the lines need to be sorted
and duplicates eliminated, so execute:

    $ sort -u es-en.grammar.small.unsorted > es-en.grammar.small

Finally, gzip the grammar file, and you'll have a file that the decoder
can use to translate sentences:

    $ gzip es-en.grammar.small

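Putting these steps together, the whole extraction for the sample data
can be run as a small shell script (paths exactly as above; adjust them
for your own data):

    #!/bin/bash
    # Extract rules for the 100-sentence sample, then sort, deduplicate,
    # and compress the resulting grammar.
    java -cp bin joshua.prefix_tree.ExtractRules \
        --source=data/europarl.es.small.100 \
        --target=data/europarl.en.small.100 \
        --alignments=data/es_en_europarl_alignments.txt.small.100 \
        --test=data/europarl.es.small.1 \
        --output=es-en.grammar.small.unsorted \
        --maxPhraseLength=5 \
        --print-rules=false
    sort -u es-en.grammar.small.unsorted > es-en.grammar.small
    gzip es-en.grammar.small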

--------------------------
| Extract large grammars |
--------------------------

(TODO: Lane, please review this section to make sure it's got enough details and is actually correct! --O.Z.)

If you wish to extract grammar rules from a large training corpus,
ExtractRules could easily require several gigabytes of RAM. There is
a somewhat different usage of ExtractRules that allows you to get by
with much less memory.

To do so, you need to create binary files for the corpus, suffixes,
and vocabulary (for each of the two sides of the training corpus), as
well as for the alignment data itself. You do this by running the main
methods of SuffixArray and AlignmentGrids before running ExtractRules:

1) Create binary files for the source side of the training corpus
   (this creates three .bin files):

    $ java -cp bin joshua.corpus.suffix_array.SuffixArray \
        data/europarl.es.small.100 vocab.es.bin corpus.es.bin suffixes.es.bin

2) Create binary files for the target side of the training corpus
   (this creates three .bin files):

    $ java -cp bin joshua.corpus.suffix_array.SuffixArray \
        data/europarl.en.small.100 vocab.en.bin corpus.en.bin suffixes.en.bin

3) Create a binary file for the alignments (this creates alignments.bin):

    $ java -cp bin joshua.corpus.alignment.AlignmentGrids \
        data/es_en_europarl_alignments.txt.small.100 alignments.bin

Now you can run ExtractRules as before, but using a different subset
of its flags:

    $ java -cp bin joshua.prefix_tree.ExtractRules \
        --binary-source=true \
        --binary-target=true \
        --source=corpus.es.bin \
        --target=corpus.en.bin \
        --source-vocab=vocab.es.bin \
        --target-vocab=vocab.en.bin \
        --source-suffixes=suffixes.es.bin \
        --target-suffixes=suffixes.en.bin \
        --alignmentsType=MemoryMappedAlignmentGrids \
        --alignments=alignments.bin \
        --test=data/europarl.es.small.1 \
        --output=es-en.grammar.small.unsorted \
        --maxPhraseLength=5 \
        --print-rules=false

Follow that with "sort -u" and gzip, as in the previous section.

Once you start dealing with a large corpus, you will also have to use
-Xmx (and -Xms) to give ExtractRules more memory than the default
amount allocated to Java (64 MB). Even with this memory-efficient
method, you may need up to 1-2 GB of RAM.
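
For example, to rerun the command above with a 2 GB heap (size the
-Xmx/-Xms values to your own corpus):

    $ java -Xmx2g -Xms2g -cp bin joshua.prefix_tree.ExtractRules \
        --binary-source=true \
        --binary-target=true \
        --source=corpus.es.bin \
        --target=corpus.en.bin \
        --source-vocab=vocab.es.bin \
        --target-vocab=vocab.en.bin \
        --source-suffixes=suffixes.es.bin \
        --target-suffixes=suffixes.en.bin \
        --alignmentsType=MemoryMappedAlignmentGrids \
        --alignments=alignments.bin \
        --test=data/europarl.es.small.1 \
        --output=es-en.grammar.small.unsorted \
        --maxPhraseLength=5 \
        --print-rules=false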


########################################################################

For any further questions or help, please turn to the Joshua support
mailing list at

    joshua-support@lists.sourceforge.net