 ##################################
 ## Joshua Getting Started Guide ##
 ##################################

  ----------------------------
 | Getting the Joshua Decoder |
  ----------------------------

 The best way to obtain a copy of the Joshua source code is to check it
 out from the main SVN repository. For this, you need Subversion, which
 is likely already present on your system. If not, it can be obtained
 freely at http://subversion.tigris.org/. If you already have a copy of
 Joshua, please skip to the Prerequisites or Installing & Compiling
 sections of this document.

 To set up your own copy of Joshua, you start by checking out the main
 development branch:

 $ svn co https://joshua.svn.sourceforge.net/svnroot/joshua/trunk joshua

 This creates a subdirectory with your local copy of the source
 code. This directory is under version control, which means that your
 local copy is tied to the main repository. This facilitates keeping
 your code up-to-date as well as contributing your changes back to the
 repository.

 To fetch the latest changes, fixes or improvements to the decoder, you
 simply run:

 $ cd joshua
 $ svn up

 A more detailed introduction to SVN can be found at
 http://subversion.tigris.org/.


  ---------------
 | Prerequisites |
  ---------------

 [ Java ]

 Joshua is written in Java, and thus requires a Java SDK to be
 installed. Please make sure you use a recent version of Java. Make
 sure you have $JAVA_HOME set to the SDK directory.

 On Mac OS X this is usually done by adding

 export JAVA_HOME="/Library/Java/Home"

 to your .bashrc, .bash_profile or .profile file.
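
 A quick sanity check can catch a wrong JAVA_HOME before the build
 fails. The sketch below assumes a POSIX-style shell and that the SDK
 keeps its compiler at $JAVA_HOME/bin/javac; check_java_home is a
 hypothetical helper name, not something shipped with Joshua:

```shell
# Sketch: report whether JAVA_HOME points at a usable Java SDK.
# check_java_home is a hypothetical helper, not part of Joshua.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    # An SDK (unlike a bare runtime) ships the javac compiler.
    if [ ! -x "$JAVA_HOME/bin/javac" ]; then
        echo "no javac found under $JAVA_HOME/bin"
        return 1
    fi
    echo "JAVA_HOME looks usable: $JAVA_HOME"
}

check_java_home || echo "fix JAVA_HOME before building"
```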


 [ Build Tools ]

 For building as well as for the actual decoding, Joshua also requires
 a few additional software packages:

 - Apache Ant -
 is a Java build tool with functionality similar to the
 make tool. It can be found at
       http://ant.apache.org/

 - Swig -
 is an inter-language wrapper and can be obtained at
       http://www.swig.org/

 - SRILM -
 is a widely used language modeling toolkit, available for
 download at
       http://www.speech.sri.com/projects/srilm/

 Make sure you have the $SRILM variable set to the directory you
 installed SRILM in, for example:

 $ export SRILM=/path/to/srilm
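
 Before building, it can help to confirm that the tools above are
 reachable. This is a minimal sketch, assuming the Ant and Swig
 executables are named ant and swig on your system; missing_prereqs is
 a hypothetical helper, not part of Joshua:

```shell
# Sketch: list which build prerequisites are missing.
# missing_prereqs is a hypothetical helper, not part of Joshua.
missing_prereqs() {
    # Each tool must be reachable through PATH.
    for tool in ant swig; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool: not found on PATH"
    done
    # SRILM must name an existing directory.
    if [ -z "${SRILM:-}" ]; then
        echo "SRILM: variable not set"
    elif [ ! -d "$SRILM" ]; then
        echo "SRILM: $SRILM is not a directory"
    fi
}

missing_prereqs
```

 If the function prints nothing, all three checks passed.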


  -------------------------------
 | Installing & Compiling Joshua |
  -------------------------------

 First of all, make sure Ant and Swig are installed and properly set up
 (i.e. accessible from the command line). SRILM should be built prior
 to this step. For convenience, you may wish to set the JOSHUA_HOME
 environment variable to the directory you installed Joshua in.

 To build Joshua, it is sufficient to change into its install directory
 and run Ant:

 $ cd $JOSHUA_HOME
 $ ant compile

 This builds the Java source code, as well as the SRILM wrapper. Similarly,
 if you have changed the code, you can rebuild the decoder using the same
 command.

 For a full rebuild of the decoder, simply run

 $ ant clean

 before building. This command will remove any previously compiled code.
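
 The clean and compile steps can be combined into one call. The sketch
 below assumes JOSHUA_HOME is set as suggested above; rebuild_joshua is
 a hypothetical wrapper name, not a script that ships with Joshua:

```shell
# Sketch: full rebuild using only the two Ant targets described above.
# rebuild_joshua is a hypothetical wrapper, not shipped with Joshua.
rebuild_joshua() {
    if [ -z "${JOSHUA_HOME:-}" ]; then
        echo "JOSHUA_HOME is not set" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is unchanged;
    # compile only runs if clean succeeded.
    ( cd "$JOSHUA_HOME" && ant clean && ant compile )
}

# Usage: rebuild_joshua
```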


  -------------------------------
 | Testing Joshua               |
  -------------------------------

 To run the Joshua unit tests:

 $ ant test


 To run the example:

 $ ./example/decode_example_javalm.sh

 or

 $ ./example/decode_example_srilm.sh


  -------------------------------
 | Packaging Joshua (optional)  |
  -------------------------------

 To package either the compiled decoder or its source code into a JAR
 archive, run

 $ ant jar

 or

 $ ant source-jar


  --------------------------
 | Extract a sample grammar |
  --------------------------

 (TODO: Lane, please review this section to make sure it's got enough details.  --O.Z.)

 To extract a grammar, you need to provide a parallel training corpus,
 as well as alignment data (src-tgt) for the training sentences, to
 joshua.prefix_tree.ExtractRules. ExtractRules has over 25 flags, but usually
 you only need to account for a subset of those flags. Here is a command
 you would run to extract grammar rules from the small 100-sentence
 Spanish-English dataset in the data/ folder:

 $ java -cp bin joshua.prefix_tree.ExtractRules \
     --source=data/europarl.es.small.100 \
     --target=data/europarl.en.small.100 \
     --alignments=data/es_en_europarl_alignments.txt.small.100 \
     --test=data/europarl.es.small.1 \
     --output=es-en.grammar.small.unsorted \
     --maxPhraseLength=5 \
     --print-rules=false

 (TODO: Lane, are the two --sentence-initial/final-X flags important? I've been using them in experiments but their default value is false.)

 Once this is finished, you will notice a newly created file,
 es-en.grammar.small.unsorted, where each line corresponds to a grammar
 rule. Before you can use this grammar file, its lines must be sorted
 and duplicates removed:

 $ sort -u es-en.grammar.small.unsorted > es-en.grammar.small

 Finally, gzip the grammar file, and you'll have a file that the decoder
 can use to translate sentences:

 $ gzip es-en.grammar.small
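
 To see what the two post-processing steps do, here they are applied to
 a tiny stand-in file. The name demo.grammar and its three lines are
 made up for illustration; they are not real Joshua grammar rules:

```shell
# Sketch: the sort/deduplicate/gzip steps on a tiny stand-in file.
# The lines written below are placeholders, not real grammar rules.
workdir=$(mktemp -d)
printf 'rule B\nrule A\nrule B\n' > "$workdir/demo.grammar.unsorted"

# Sort the lines and drop exact duplicates.
sort -u "$workdir/demo.grammar.unsorted" > "$workdir/demo.grammar"

# Compress; gzip replaces demo.grammar with demo.grammar.gz.
gzip "$workdir/demo.grammar"

# Inspect the result without unpacking it to disk.
gzip -dc "$workdir/demo.grammar.gz"
```

 Note that gzip removes the uncompressed file, so only the .gz file
 remains afterwards.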


  ------------------------
 | Extract large grammars |
  ------------------------

 (TODO: Lane, please review this section to make sure it's got enough details and is actually correct!  --O.Z.)

 If you wish to extract grammar rules from a large training corpus,
 ExtractRules could easily require several gigabytes of RAM. There is
 a somewhat different usage of ExtractRules that would allow you to get
 by with much less memory.

 To do so, you need to create binary files for the corpus, suffixes,
 and vocabulary (for each of the two sides of the training corpus), as well
 as for the alignment data itself. You do so by running the main
 methods of SuffixArray and AlignmentGrids before running ExtractRules:

 1) Create binary files for source side of the training corpus:
 $ java -cp bin joshua.corpus.suffix_array.SuffixArray data/europarl.es.small.100 vocab.es.bin corpus.es.bin suffixes.es.bin
   (this creates 3 .bin files)

 2) Create binary files for target side of the training corpus:
 $ java -cp bin joshua.corpus.suffix_array.SuffixArray data/europarl.en.small.100 vocab.en.bin corpus.en.bin suffixes.en.bin
   (this creates 3 .bin files)

 3) Create binary file for alignments file:
 $ java -cp bin joshua.corpus.alignment.AlignmentGrids data/es_en_europarl_alignments.txt.small.100 alignments.bin
   (this creates alignments.bin)

 And now, you can run ExtractRules as before, but using a different subset
 of its flags:

 $ java -cp bin joshua.prefix_tree.ExtractRules \
     --binary-source=true \
     --binary-target=true \
     --source=corpus.es.bin \
     --target=corpus.en.bin \
     --source-vocab=vocab.es.bin \
     --target-vocab=vocab.en.bin \
     --source-suffixes=suffixes.es.bin \
     --target-suffixes=suffixes.en.bin \
     --alignmentsType=MemoryMappedAlignmentGrids \
     --alignments=alignments.bin \
     --test=data/europarl.es.small.1 \
     --output=es-en.grammar.small.unsorted \
     --maxPhraseLength=5 \
     --print-rules=false

 Then run "sort -u" and gzip on the output, as in the previous section.

 Of course, once you start dealing with a large corpus, you will have to
 use -Xmx (and -Xms) to provide ExtractRules with more memory than
 the default amount allocated to java (64 MB). Even with this
 memory-efficient method, you may need up to 1-2 GB of RAM.
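
 The heap options go before -cp on the java command line. The sketch
 below only prints the resulting command so it can be reviewed first;
 extract_rules_cmd is a hypothetical helper, and the 512m and 2g values
 are placeholders rather than recommendations:

```shell
# Sketch: build an ExtractRules command line with explicit heap sizes.
# extract_rules_cmd is a hypothetical helper, not part of Joshua.
extract_rules_cmd() {
    min_heap=$1    # initial heap, passed to -Xms
    max_heap=$2    # maximum heap, passed to -Xmx
    shift 2
    # Remaining arguments are passed through as ExtractRules flags.
    echo java "-Xms$min_heap" "-Xmx$max_heap" -cp bin \
        joshua.prefix_tree.ExtractRules "$@"
}

# Print the command for review; run the printed line to extract rules.
extract_rules_cmd 512m 2g --alignments=alignments.bin --print-rules=false
```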


 ########################################################################

 For any further questions or help, please turn to the Joshua support
 mailing list at
                 joshua-support@lists.sourceforge.net