layout: default6
title: Language packs

The simplest way to use Joshua is with the provided “language packs”: pre-built models that enable translation for particular language pairs. You can download and unpack a model and then run the included script to translate new sentences.

It is important to note the assumptions underlying the translation engine:

Joshua takes input on STDIN and outputs translations to STDOUT.
Joshua expects plain-text, UTF-8-encoded input with one sentence per line, using UNIX line endings. If you are translating documents, you must perform sentence segmentation yourself.
Additionally, the input must be tokenized. To tokenize your data, you can use the script provided in each language pack.
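
The input assumptions above can be illustrated with a minimal Python sketch. It is not part of Joshua or of any language pack: the function names `to_unix_lines` and `run_pipeline` are hypothetical, and the command passed to `run_pipeline` is a placeholder for whichever script your language pack actually provides. The sketch assumes the text has already been split into one sentence per line and tokenized; it only normalizes line endings, drops blank lines, encodes the result as UTF-8, and pipes it through a command via STDIN and STDOUT.

```python
import subprocess
import sys


def to_unix_lines(text):
    """Return UTF-8 bytes with one non-empty line per sentence and LF endings.

    Assumes `text` is already one sentence per line and already tokenized.
    """
    # splitlines() accepts \r\n, \r, and \n, so Windows and old Mac files work too.
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line]  # drop blank lines
    if not lines:
        return b""
    return ("\n".join(lines) + "\n").encode("utf-8")


def run_pipeline(data, command):
    """Send `data` to `command` on STDIN and return its STDOUT as text.

    `command` is a placeholder: pass the argument list for the script
    shipped in your language pack.
    """
    result = subprocess.run(command, input=data, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8")


if __name__ == "__main__":
    # Stand-in command that echoes STDIN back unchanged. Replace it with the
    # language pack's own script to obtain translations.
    echo = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
    prepared = to_unix_lines("hola , mundo .\r\n\r\nbuenos días .\r\n")
    sys.stdout.write(run_pipeline(prepared, echo))
```

Keeping normalization separate from the subprocess call makes it easy to inspect exactly what bytes are sent on STDIN before a real model is involved.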

Available language packs

Spanish--English phrase-based model [1.9 GB], built on Europarl and the Fisher and CALLHOME parallel dataset.
Arabic--English phrase-based model [2.4 GB], built from the LDC Arabic-Dialect/English parallel text, the ISI Arabic--English automatically extracted parallel text, and translations of the Arabic CALLHOME transcripts, with an English Gigaword language model.

Have a request? Please email Matt Post.