Apache Joshua Language Pack
===========================
Thanks for downloading the Apache Joshua <SOURCE>--<TARGET>
language pack. This language pack provides a machine translation
system for automatically translating sentences from <SOURCE> to
sentences in <TARGET>. Joshua language packs have no external
dependencies, and can be run straight from the provided JAR file.
Please see the "Quick Start" section below to learn how to run the
language pack. Additional runtime options can be found in the "Runtime
Options" section. For complete documentation, please visit the Joshua
website, https://joshua.apache.org/
For information on the data used to construct this language pack,
please see the CREDITS file, and to see its performance on a range of
different publicly available test sets, see BENCHMARKS.
A small number of example sentences are provided in 'example.<SRC>', along
with a human reference translation for each in 'example.<TRG>'.
If you have comments, questions, or concerns, please email Joshua user
support: dev@joshua.apache.org.
This language pack was released on <DATE>.
Quick Start
-----------
To run the language pack, invoke the command:
    joshua [OPTIONS ...]
The Joshua decoder will start running, accepting input from STDIN and writing to
STDOUT. Joshua expects its input in the form of a single sentence per line. Each
sentence should first be piped through `prepare.sh`, which normalizes and
tokenizes the input for the language pack's source language.
    cat example.<SRC> | prepare.sh | joshua > output.<TRG>
Loading all of the models into memory takes some time (sometimes as much as a
minute), so there is high latency between startup and the first translation. To
avoid paying this cost on every invocation, Joshua can also be run in server
mode, which exposes either a direct TCP/IP interface or a Google Translate-style
RESTful API. To run Joshua as a TCP/IP server, add the `-server-port` option:
    joshua -server-port 5674
You can then connect directly to the socket using nc or telnet:
    cat example.<SRC> | prepare.sh | nc localhost 5674 > output.<TRG>
To use the RESTful interface instead, also pass `-server-type http`:
    joshua -server-port 5674 -server-type http
The RESTful interface is used when running the browser demo (see web/index.html)
or when using the Joshua Translation Engine.
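Because model loading can take up to a minute, a script that starts the TCP
server and immediately sends input may find the port not yet open. The following
is a minimal sketch of a wait loop and is not part of the language pack:
`wait_for_port` is a hypothetical helper, it assumes your `nc` supports the `-z`
(connect without sending data) flag, and it assumes the server only starts
accepting connections once it is ready to translate.

```shell
# Hypothetical helper, not shipped with the language pack.
# Poll until HOST:PORT accepts TCP connections, or give up after TIMEOUT seconds.
# Returns 0 once the port is open, 1 on timeout.
wait_for_port() {
    host=$1
    port=$2
    timeout=${3:-120}   # default of 120 seconds is an arbitrary choice
    elapsed=0
    # Assumes an nc variant that supports -z (probe the port, send nothing).
    until nc -z "$host" "$port" 2>/dev/null; do
        elapsed=$((elapsed + 1))
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
    done
    return 0
}

# Possible use (commented out so the snippet can be sourced without Joshua):
# joshua -server-port 5674 &
# wait_for_port localhost 5674 120 \
#     && cat example.<SRC> | prepare.sh | nc localhost 5674 > output.<TRG>
```

If the wait times out, the function returns a non-zero status, so the
translation command after `&&` is skipped rather than failing on a closed port.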
Web demo
--------
See the web/ directory for a web-based AJAX interface to Joshua. One feature of this
approach is that it permits you to add custom words and phrases to Joshua, which then
become translation options for the decoder. To start the web demo:
1. Start Joshua in HTTP server mode, as described above:
       ./joshua -server-port 5674 -server-type http
   This starts Joshua running on port 5674.
2. Load the web interface, passing it the server and port you are using:
       firefox "web/index.html?server=localhost&port=5674"
   Any browser will do.
You can then translate text in the main box, entering one sentence per line. You can
also experiment with adding words and phrases. These are saved to a custom grammar,
and the resulting rules can be managed on the "Rules" tab.
Runtime options
---------------
By default, the language pack runs with a single thread and with options set to
balance speed and accuracy. These and many other runtime options can be changed
with the following arguments and parameters to the Joshua invocation
demonstrated above.
- `-v 1`
Be more verbose in output.
- `-threads N`
N is the number of simultaneous decoding threads to launch. If this option is
given neither on the command line nor in the configuration file, Joshua uses
the default of 1 thread.
Decoded outputs are assembled in input order, and Joshua has to hold on to each
complete target hypergraph until it is ready to be processed for output. With
many simultaneous threads, a single long sentence can therefore cause many
finished sentences to queue up behind it, which increases memory usage. We have
run Joshua with as many as 48 threads without any problems of this kind, but it
is worth keeping in mind.
- `-pop-limit N`
This controls how many hypotheses Joshua explores. Reducing N makes Joshua
faster but less accurate; increasing N makes it more accurate but slower. We
suggest values of N between 5 and 1000. The default is 100.
- `-output-format "formatting string"`
Specify the output-format variable, which is interpolated for the following
variables:
    %i : the 0-indexed input sentence number
    %s : the best translation, lower-cased and tokenized
    %c : the model cost
    %f : the values of the features of the best translation
    %S : the best translation, denormalized and re-cased
The default value is "%S".
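The options above can be combined in a single invocation. As a rough sketch,
the snippet below requests the sentence number, the re-cased translation, and
the model cost, then converts each output line to tab-separated fields. The
`|||` delimiter, the `FORMAT` variable, the `to_tsv` helper, the output file
name, and the particular thread and pop-limit values are all choices made for
this example, not Joshua defaults.

```shell
# Output format chosen for this example (the " ||| " delimiter is an assumption,
# picked because it is unlikely to occur inside a translation).
FORMAT='%i ||| %S ||| %c'

# Hypothetical helper: convert "id ||| translation ||| cost" lines on STDIN
# into tab-separated fields on STDOUT.
to_tsv() {
    awk -F ' [|][|][|] ' 'BEGIN { OFS = "\t" } { print $1, $2, $3 }'
}

# Possible use (commented out so the snippet can be sourced without Joshua):
# cat example.<SRC> | prepare.sh \
#     | joshua -threads 4 -pop-limit 50 -output-format "$FORMAT" \
#     | to_tsv > output.tsv
```

Lowering `-pop-limit` below the default of 100, as in this sketch, trades some
accuracy for speed; raise it if translation quality matters more than latency.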