Solutions to common problems will be posted here as we become aware of them. If you need help with something, please check our support group for a solution, or post a new question.
This occurs when KenLM fails to build, which can happen for a number of reasons:
Boost isn't installed. Boost is available through most package management tools, so try that first. You can also build it from source.
Boost is installed, but not in your path. The easiest solution is to add the Boost library directory to your $LD_LIBRARY_PATH environment variable. You can also edit the file $JOSHUA/src/joshua/decoder/ff/lm/kenlm/Makefile and define BOOST_ROOT to point to your Boost location. Then rebuild KenLM with the command
ant -f $JOSHUA/build.xml kenlm
You have run into Boost's naming of multi-threaded libraries. Boost libraries sometimes have a -mt suffix applied when they are built with multi-threaded support. This causes the linker to fail, since it is looking for, e.g., -lboost_system instead of -lboost_system-mt. Edit the same Makefile as above and uncomment the BOOST_MT = -mt line, then try to compile again with
ant -f $JOSHUA/build.xml kenlm
You may find the following reference URLs useful:
https://groups.google.com/forum/#!topic/joshua_support/SiGO41tkpsw
http://stackoverflow.com/questions/12583080/c-library-in-using-boost-library
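To decide whether the BOOST_MT line needs uncommenting, it can help to look at how the Boost libraries on your machine are actually named. The following is a minimal sketch, not part of Joshua: the function name boost_mt_suffix is hypothetical, and the directories it scans are only assumed common install locations that may differ on your system.

```shell
#!/bin/sh
# Report how libboost_system is named in a given directory.
# Prints "-mt" if libboost_system-mt.* exists, an empty line if only the
# unsuffixed libboost_system.* exists, and returns 1 if neither is found.
# (boost_mt_suffix is a hypothetical helper, not a Joshua or Boost tool.)
boost_mt_suffix() {
    dir=$1
    for f in "$dir"/libboost_system-mt.*; do
        if [ -e "$f" ]; then echo "-mt"; return 0; fi
    done
    for f in "$dir"/libboost_system.*; do
        if [ -e "$f" ]; then echo ""; return 0; fi
    done
    return 1
}

# Scan a few assumed library locations and print the matching setting.
for d in /usr/lib /usr/local/lib /opt/local/lib; do
    if suffix=$(boost_mt_suffix "$d"); then
        echo "$d: BOOST_MT = $suffix"
    fi
done
```

If a directory reports "BOOST_MT = -mt", uncommenting that line in the Makefile is probably what you need. If nothing is printed, Boost is likely installed somewhere else or not at all.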
One way is to add a larger language model. Build one on Gigaword, news crawl data, etc. KenLM's lmplz makes such a model easy to build, and the result is efficient to represent (especially if you compress it with build_binary). To include it in Joshua, there are two ways:
Pipeline. By default, Joshua's pipeline builds a language model on the target side of your parallel training data. But Joshua can decode with any number of additional language models as well. So you can build a language model separately, presumably on much more data (since you won't be constrained only to one side of parallel data, which is much more scarce than monolingual data). Once you've built extra language models and compiled them with KenLM's build_binary script, you can tell the pipeline to use them with any number of --lmfile /path/to/lm/file flags.
Joshua (directly). This file documents the Joshua configuration file format.
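The pipeline route above can be sketched as follows. This is an illustration under assumptions rather than a prescribed recipe: it assumes KenLM's lmplz and build_binary are on your PATH, the helper names build_lm and lmfile_flags are hypothetical, and the file paths are placeholders.

```shell
#!/bin/sh
# Build an ARPA model with lmplz, then compile it with build_binary.
# Usage: build_lm CORPUS OUTPUT_PREFIX [ORDER]   (ORDER defaults to 5)
# Assumes lmplz and build_binary are on PATH; not called automatically here.
build_lm() {
    corpus=$1
    prefix=$2
    order=${3:-5}
    lmplz -o "$order" < "$corpus" > "$prefix.arpa" &&
        build_binary "$prefix.arpa" "$prefix.kenlm"
}

# Print one "--lmfile PATH" pair per line for each model file given.
# Expand the result unquoted so the shell splits it into separate arguments
# (this simple approach assumes the paths contain no whitespace).
lmfile_flags() {
    for lm in "$@"; do
        printf '%s %s\n' --lmfile "$lm"
    done
}

# Show the flags that would be appended to a pipeline invocation.
echo "extra pipeline flags:" $(lmfile_flags /path/to/big.kenlm /path/to/news.kenlm)
```

After something like `build_lm /path/to/monolingual.en big`, the resulting big.kenlm can be handed to the pipeline with `--lmfile`.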
You would need to do this if, for example, you added a language model, or changed some other parameter (e.g., an improvement to the decoder). To do this, follow these steps:
1. Create a new run directory with --rundir N+1 (where N is the last run, and N+1 is a new, non-existent directory).
2. Tell the pipeline to start at the tuning step with --first-step TUNE
3. Add the new language models with --lmfile /path/to/lm lines. You also have to tell it where the main language model is, which is usually --lmfile N/lm.kenlm (paths are relative to the directory above the run directory).
4. Point it to the grammar with --grammar N/grammar.gz. If the tuning and test data haven't changed, you can also point it to the filtered and packed versions to save a little time using --tune-grammar N/data/tune/grammar.packed and --test-grammar N/data/test/grammar.packed, where N here again is the previous run (or some other run; it can be anywhere).
Here's an example. Let's say you ran a full pipeline as run 1, and now added a new language model and want to see how it affects the decoder. Your first run might have been invoked like this:
$JOSHUA/scripts/training/pipeline.pl \
    --rundir 1 \
    --readme "Baseline French--English Europarl hiero system" \
    --corpus /path/to/europarl \
    --tune /path/to/europarl/tune \
    --test /path/to/europarl/test \
    --source fr \
    --target en \
    --threads 8 \
    --joshua-mem 30g \
    --tuner mira \
    --type hiero \
    --aligner berkeley
Your new run will look like this:
$JOSHUA/scripts/training/pipeline.pl \
    --rundir 2 \
    --readme "Adding in a huge language model" \
    --tune /path/to/europarl/tune \
    --test /path/to/europarl/test \
    --source fr \
    --target en \
    --threads 8 \
    --joshua-mem 30g \
    --tuner mira \
    --type hiero \
    --aligner berkeley \
    --first-step TUNE \
    --lmfile 1/lm.kenlm \
    --lmfile /path/to/huge/new/lm \
    --tune-grammar 1/data/tune/grammar.packed \
    --test-grammar 1/data/test/grammar.packed
Notice the changes: we removed the --corpus flag (though it would have been fine to leave it; it would just have been skipped), specified the first step, changed the run directory and README comments, and pointed to the grammars and both language model files.
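If you rerun often, working out N and N+1 by hand gets tedious. Here is a small sketch that finds the next unused run number, assuming run directories are named with plain integers (1, 2, ...) without leading zeros. The helper next_rundir is hypothetical and not part of Joshua.

```shell
#!/bin/sh
# Print the next unused run number among the numbered subdirectories of a
# base directory (defaults to the current directory). Names that are not
# plain integers are ignored. Assumes no leading zeros in directory names.
next_rundir() {
    base=${1:-.}
    max=0
    for d in "$base"/*/; do
        name=$(basename "$d")
        case $name in
            ''|*[!0-9]*) continue ;;   # skip anything that is not all digits
        esac
        if [ "$name" -gt "$max" ]; then max=$name; fi
    done
    echo $((max + 1))
}

# Print the flags that change between the previous run and the new one.
new=$(next_rundir .)
prev=$((new - 1))
if [ "$prev" -ge 1 ]; then
    echo "--rundir $new --first-step TUNE --lmfile $prev/lm.kenlm"
else
    echo "no previous run found; a full pipeline run would use --rundir $new"
fi
```

The printed flags are meant to be pasted into a pipeline invocation like the one above, alongside the unchanged options and any new --lmfile entries.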
How can I enable specific feature functions?
Let's say you created a new feature function, OracleFeature, and you want to enable it. You can do this in two ways. Through the pipeline, simply pass it the argument --joshua-args "list of joshua args". These will then be passed to the decoder when it is invoked. You can then enable your feature function using something like
$JOSHUA/bin/pipeline.pl --joshua-args '-feature-function OracleFeature'
If you call the decoder directly, you can just put that line in the configuration file, e.g.,
feature-function = OracleFeature
or you can pass it directly to Joshua on the command line using the standard notation, e.g.,
$JOSHUA/bin/joshua-decoder -feature-function OracleFeature
These could be stacked, e.g.,
$JOSHUA/bin/joshua-decoder -feature-function OracleFeature \
    -feature-function MagicFeature \
    -feature-function MTSolverFeature \
    ...
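When the list of feature functions grows, it may be easier to generate the stacked flags than to type them. The sketch below uses a hypothetical helper, feature_args, to join feature names into one argument string. Passing several feature functions through --joshua-args is an assumption based on the description above that those arguments are handed to the decoder unchanged.

```shell
#!/bin/sh
# Join feature function names into a single line of stacked
# "-feature-function NAME" flags, separated by single spaces.
# (feature_args is a hypothetical helper, not part of Joshua.)
feature_args() {
    out=""
    for name in "$@"; do
        out="${out:+$out }-feature-function $name"
    done
    printf '%s\n' "$out"
}

# Show how the result could be passed to the pipeline as one quoted string.
echo "--joshua-args '$(feature_args OracleFeature MagicFeature)'"
```

For a direct decoder call, the same output can be expanded unquoted after $JOSHUA/bin/joshua-decoder so that each flag and name becomes a separate argument.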