Updated file format information

commit: 485f34c5c0d53463357dde202d0185bd57b8edf7
author: Matt Post <post@cs.jhu.edu> Thu Aug 15 13:04:50 2013 -0400
committer: Matt Post <post@cs.jhu.edu> Thu Aug 15 13:04:50 2013 -0400
tree: 9c2939484756d7d628041be5f20f77a833f9cf04
parent: 0d8c77587ecf230534136208ecd24c2f966c61d4
diff --git a/5.0/file-formats.md b/5.0/file-formats.md
index 68af2b0..980da2e 100644
--- a/5.0/file-formats.md
+++ b/5.0/file-formats.md

@@ -7,53 +7,52 @@
 
 ## Translation models (grammars)
 
-Joshua supports three grammar file formats.
+Joshua supports two grammar file formats: a text-based version (also used by Hiero, shared by
+[cdec](), and supported by [hierarchical Moses]()), and an efficient
+[packed representation](packing.html) developed by [Juri Ganitkevich](http://cs.jhu.edu/~juri).
 
-1. Thrax / Hiero
-1. SAMT [deprecated]
-1. packed
-
-The *Hiero* format is not restricted to Hiero grammars, but simply means *the format that David
-Chiang developed for Hiero*.  It can support a much broader class of SCFGs containing an arbitrary
-set of nonterminals.  Similarly, the *SAMT* format is not restricted to SAMT grammars but instead
-simply denotes *the grammar format that Zollmann and Venugopal developed for their decoder*.  To
-remove this source of confusion, "thrax" is the preferred format designation, and is in fact the
-default.
-
-The packed grammar format is the efficient grammar representation developed by
-[Juri Ganitkevich](http://cs.jhu.edu/~juri) [is described in detail elsewhere](packing.html).
-
-Grammar rules in the Thrax format follow this format:
+Grammar rules follow this format.
 
     [LHS] ||| SOURCE-SIDE ||| TARGET-SIDE ||| FEATURES
     
-Here are some two examples, one for a Hiero grammar, and the other for an SAMT grammar:
+The source and target sides contain a mixture of terminals and nonterminals. The nonterminals are
+linked across sides by indices. There is no limit to the number of paired nonterminals in the rule
+or on the nonterminal labels (Joshua supports decoding with SAMT and GHKM grammars).
 
-    [X] ||| el chico [X] ||| the boy [X] ||| -3.14 0 2 17
-    [S] ||| el chico [VP] ||| the boy [VP] ||| -3.14 0 2 17
+    [X] ||| el chico [X,1] ||| the boy [X,1] ||| -3.14 0 2 17
+    [S] ||| el chico [VP,1] ||| the boy [VP,1] ||| -3.14 0 2 17
+    [VP] ||| [NP,1] [IN,2] [VB,3] ||| [VB,3] [IN,2] [NP,1] ||| 0.0019026637 0.81322956
+
     
 The feature values can have optional labels, e.g.:
 
-    [X] ||| el chico [X] ||| the boy [X] ||| lexprob=-3.14 abstract=0 numwords=2 count=17
+    [X] ||| el chico [X,1] ||| the boy [X,1] ||| lexprob=-3.14 abstract=0 numwords=2 count=17
     
-These feature names are made up.  For an actual list of feature names, please
-[see the Thrax documentation](thrax.html).
+One file common to decoding is the glue grammar, which for Hiero grammars is defined as follows:
 
-The SAMT grammar format is deprecated and undocumented.
+    [GOAL] ||| <s> ||| <s> ||| 0
+    [GOAL] ||| [GOAL,1] [X,2] ||| [GOAL,1] [X,2] ||| -1
+    [GOAL] ||| [GOAL,1] </s> ||| [GOAL,1] </s> ||| 0
+
+Joshua's [pipeline](pipeline.html) supports extraction of Hiero and SAMT grammars via
+[Thrax](thrax.html) or GHKM grammars using [Michel Galley](http://www-nlp.stanford.edu/~mgalley/)'s
+GHKM extractor (included) or Moses' GHKM extractor (if Moses is installed).
 
 ## Language Model
 
-Joshua has three language model implementations: [KenLM](), [BerkeleyLM](), and an (unrecommended)
-dummy Java implementation.  All language model implementations support the standard ARPA format
-output by [SRILM]().  In addition, KenLM and BerkeleyLM support compiled formats that can be loaded
-more quickly and efficiently.
+Joshua has two language model implementations: [KenLM](http://kheafield.com/code/kenlm/) and
+[BerkeleyLM](http://berkeleylm.googlecode.com).  All language model implementations support the
+standard ARPA format output by [SRILM](http://www.speech.sri.com/projects/srilm/).  In addition,
+KenLM and BerkeleyLM support compiled formats that can be loaded more quickly and efficiently. KenLM
+is written in C++ and is supported via a JNI bridge, while BerkeleyLM is written in Java. KenLM is
+the default because of its support for left-state minimization.
 
 ### Compiling for KenLM
 
 To compile an ARPA grammar for KenLM, use the (provided) `build-binary` command, located deep within
 the Joshua source code:
 
-    $JOSHUA/src/joshua/decoder/ff/lm/kenlm/build_binary lm.arpa lm.kenlm
+    $JOSHUA/bin/build_binary lm.arpa lm.kenlm
     
 This script takes the `lm.arpa` file and produces the compiled version in `lm.kenlm`.
 
@@ -65,14 +64,10 @@
 
 The `lm.berkeleylm` file can then be listed directly in the [Joshua configuration file](decoder.html).
 
-## Joshua configuration
+## Joshua configuration file
 
-See [the decoder page](decoder.html).
-
-## Pipeline configuration
-
-See [the pipeline page](pipeline.html).
+The [decoder page](decoder.html) documents decoder command-line and config file options.
 
 ## Thrax configuration
 
-See [the thrax page](thrax.html).
+See [the thrax page](thrax.html) for more information about the Thrax configuration file.
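
The text grammar format documented in this change (`[LHS] ||| SOURCE-SIDE ||| TARGET-SIDE ||| FEATURES`, with nonterminals linked across sides by indices such as `[X,1]`) can be read with a few lines of Python. The following is a minimal sketch, not Joshua's own parser: the names `Rule`, `parse_rule`, and `linked_indices` are hypothetical, and keying unlabeled feature values by their position is an assumption made here for illustration.

```python
import re
from typing import Dict, List, NamedTuple

# Matches a bracketed nonterminal such as [X], [VP], or [NP,1].
# Group 1 is the label; group 2 is the optional link index.
NONTERMINAL = re.compile(r"^\[([^\],]+)(?:,(\d+))?\]$")


class Rule(NamedTuple):
    lhs: str
    source: List[str]
    target: List[str]
    features: Dict[str, float]


def parse_rule(line: str) -> Rule:
    """Split one text-format grammar rule into its four fields."""
    fields = [field.strip() for field in line.split("|||")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    lhs_field, source, target, feature_field = fields

    match = NONTERMINAL.match(lhs_field)
    if match is None:
        raise ValueError(f"left-hand side is not a nonterminal: {lhs_field!r}")

    features: Dict[str, float] = {}
    for position, token in enumerate(feature_field.split()):
        # Labeled features look like name=value; unlabeled ones are bare numbers.
        name, separator, value = token.rpartition("=")
        # Assumption for this sketch: unlabeled values are keyed by position.
        key = name if separator else str(position)
        features[key] = float(value)

    return Rule(match.group(1), source.split(), target.split(), features)


def linked_indices(side: List[str]) -> List[int]:
    """Return the sorted link indices of the nonterminals on one side of a rule."""
    indices = []
    for token in side:
        match = NONTERMINAL.match(token)
        if match is not None and match.group(2) is not None:
            indices.append(int(match.group(2)))
    return sorted(indices)


if __name__ == "__main__":
    rule = parse_rule(
        "[VP] ||| [NP,1] [IN,2] [VB,3] ||| [VB,3] [IN,2] [NP,1] ||| 0.0019026637 0.81322956"
    )
    # Both sides of a well-formed rule should carry the same set of link indices.
    assert linked_indices(rule.source) == linked_indices(rule.target)
    print(rule.lhs, rule.source, rule.target, rule.features)
```

Comparing `linked_indices` for the source and target sides is a cheap sanity check when inspecting a grammar file by hand, since every indexed nonterminal on one side should have a partner on the other.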