Updated documentation for 5.0 release
diff --git a/5.0/advanced.md b/5.0/advanced.md
new file mode 100644
index 0000000..174041e
--- /dev/null
+++ b/5.0/advanced.md
@@ -0,0 +1,7 @@
+---
+layout: default
+category: links
+title: Advanced features
+---
+
+
diff --git a/5.0/pipeline.md b/5.0/pipeline.md
index 53d26c0..c6988d4 100644
--- a/5.0/pipeline.md
+++ b/5.0/pipeline.md
@@ -8,11 +8,11 @@
 evaluating machine translation systems.  The pipeline eases the pain of two related tasks in
 statistical machine translation (SMT) research:
 
-1. Training SMT systems involves a complicated process of interacting steps that are time-consuming
-and prone to failure.
+- Training SMT systems involves a complicated process of interacting steps that are
+  time-consuming and prone to failure.
 
-1. Developing and testing new techniques requires varying parameters at different points in the
-pipeline.  Earlier results (which are often expensive) need not be recomputed.
+- Developing and testing new techniques requires varying parameters at different points in the
+  pipeline. Earlier results (which are often expensive) need not be recomputed.
 
 To facilitate these tasks, the pipeline script:
 
@@ -28,7 +28,8 @@
 
 The Joshua pipeline script is designed in the spirit of Moses' `train-model.pl`, and shares many of
 its features.  It is not as extensive, however, as Moses'
-[Experiment Management System](http://www.statmt.org/moses/?n=FactoredTraining.EMS).
+[Experiment Management System](http://www.statmt.org/moses/?n=FactoredTraining.EMS), which allows
+the user to define arbitrary execution dependency graphs.
 
 ## Installation
 
@@ -88,7 +89,7 @@
 For this quick start, we will be working with the example that can be found in
 `$JOSHUA/examples/pipeline`.  This example contains 1,000 sentences of Urdu-English data (the full
 dataset is available as part of the
-[Indian languages parallel corpora](http://joshua-decoder.org/indian-parallel-corpora/) with
+[Indian languages parallel corpora](/indian-parallel-corpora/)), along with
 100-sentence tuning and test sets with four references each.
 
 Running the pipeline requires two main steps: data preparation and invocation.
@@ -96,8 +97,9 @@
 1. Prepare your data.  The pipeline script needs to be told where to find the raw training, tuning,
    and test data.  A good convention is to place these files in an input/ subdirectory of your run's
    working directory (NOTE: do not use `data/`, since a directory of that name is created and used
-   by the pipeline itself).  The expected format (for each of training, tuning, and test) is a pair
-   of files that share a common path prefix and are distinguished by their extension:
+   by the pipeline itself for storing processed files).  The expected format (for each of training,
+   tuning, and test) is a pair of files that share a common path prefix and are distinguished by
+   their extension, e.g.,
 
        input/
              train.SOURCE
@@ -122,17 +124,17 @@
          --target TARGET
 
    The `--corpus`, `--tune`, and `--test` flags define file prefixes that are concatenated with the
-   language extensions given by `--target` and `--source` (with a "." in betwee).  Note the
+   language extensions given by `--target` and `--source` (with a "." in between).  Note the
    correspondences with the files defined in the first step above.  The prefixes can be either
    absolute or relative pathnames.  This particular invocation assumes that a subdirectory `input/`
    exists in the current directory, that you are translating from a language identified by the "ur"
    extension to a language identified by the "en" extension, that the training data can be found at
    `input/train.en` and `input/train.ur`, and so on.
 
-*Don't* run the pipeline directly from `$JOSHUA`. I recommend creating a `run/` directory to contain
- all of your experiments in some other location. The advantage to this (apart from not clobbering
- part of the Joshua install) is that Joshua provides support scripts for visualizing the results of
- a series of experiments that only work if you
+*Don't* run the pipeline directly from `$JOSHUA`. We recommend creating a run directory somewhere
+ else to contain all of your experiments. The advantage to this (apart from not clobbering part of
+ the Joshua install) is that Joshua provides support scripts for visualizing the results of a
+ series of experiments, and these only work if you organize your runs in sequentially numbered
+ directories (see "Managing groups of experiments" below).
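+
+A minimal sketch of such a layout (the directory name and run labels here are hypothetical):
+
+    ~/expts/urdu-english/runs/
+        1/    # baseline
+        2/    # larger language model
+        3/    # SAMT grammar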
 
 Assuming no problems arise, this command will run the complete pipeline in about 20 minutes,
 producing BLEU scores at the end.  As it runs, you will see output that looks like the following:
@@ -203,6 +205,7 @@
 1. [Language model building](#lm)
 1. [Tuning](#tuning)
 1. [Testing](#testing)
+1. [Analysis](#analysis)
 
 These steps are discussed below, after a few intervening sections about high-level details of the
 pipeline.
@@ -210,7 +213,7 @@
 ## Managing groups of experiments
 
 The real utility of the pipeline comes when you use it to manage groups of experiments. Typically,
-we have a held-out test set, and want to vary a number of training parameters to determine what
+there is a held-out test set, and we want to vary a number of training parameters to determine what
 effect this has on BLEU scores or some other metric. Joshua comes with a script
 `$JOSHUA/scripts/training/summarize.pl` that collects information from a group of runs and reports
 them to you. This script works so long as you organize your runs as follows:
@@ -307,7 +310,7 @@
 the script.  Of course, if you change one of the parameters a step depends on, it will trigger a
 rerun, which in turn might trigger further downstream reruns.
    
-## Skipping steps, quitting early
+## <a id="steps" /> Skipping steps, quitting early
 
 You will also find it useful to start the pipeline somewhere other than data preparation (for
 example, if you have already-processed data and an alignment, and want to begin with building a
@@ -327,7 +330,7 @@
 - *THRAX*: Grammar extraction [with Thrax](thrax.html).  If you jump to this step, you'll need to
    provide an aligned corpus (`--alignment`) along with your parallel data.  
 
-- *TUNE*: Tuning.  The exact tuning method is determined with `--tuner {mert,pro}`.  With this
+- *TUNE*: Tuning.  The exact tuning method is determined with `--tuner {mert,mira,pro}`.  With this
    option, you need to specify a grammar (`--grammar`) or separate tune (`--tune-grammar`) and test
    (`--test-grammar`) grammars.  A full grammar (`--grammar`) will be filtered against the relevant
    tuning or test set unless you specify `--no-filter-tm`.  If you want a language model built from
@@ -346,8 +349,7 @@
 
 We now discuss these steps in more detail.
 
-<a name="prep" />
-## 1. DATA PREPARATION
+### <a id="prep" /> 1. DATA PREPARATION
 
 Data preparation involves doing the following to each of the training data (`--corpus`), tuning data
 (`--tune`), and testing data (`--test`).  Each of these values is an absolute or relative path
@@ -361,7 +363,7 @@
 
 The following processing steps are applied to each file.
 
-1.  **Copying** the files into `RUNDIR/data/TYPE`, where TYPE is one of "train", "tune", or "test".
+1.  **Copying** the files into `$RUNDIR/data/TYPE`, where TYPE is one of "train", "tune", or "test".
     Multiple `--corpora` files are concatenated in the order they are specified.  Multiple `--tune`
     and `--test` flags are not currently allowed.
     
@@ -391,27 +393,27 @@
 
 The file "corpus.LANG" is a symbolic link to the last file in the chain.  
 
-<a name="alignment" />
-## 2. ALIGNMENT
+## <a id="alignment" /> 2. ALIGNMENT
 
-Alignments are between the parallel corpora at `RUNDIR/data/train/corpus.{SOURCE,TARGET}`.  To
+Alignments are computed over the parallel corpora at `$RUNDIR/data/train/corpus.{SOURCE,TARGET}`.  To
 prevent the alignment tables from getting too big, the parallel corpora are grouped into files of no
 more than ALIGNER\_CHUNK\_SIZE sentence pairs (controlled with a parameter below).  The last block is folded
 into the penultimate block if it is too small.  These chunked files are all created in a
-subdirectory of `RUNDIR/data/train/splits`, named `corpus.LANG.0`, `corpus.LANG.1`, and so on.
+subdirectory of `$RUNDIR/data/train/splits`, named `corpus.LANG.0`, `corpus.LANG.1`, and so on.
 
 The pipeline parameters affecting alignment are:
 
--   `aligner ALIGNER` {giza (default), berkeley}
+-   `--aligner ALIGNER` {giza (default), berkeley}
 
     Which aligner to use.  The default is [GIZA++](http://code.google.com/p/giza-pp/), but
     [the Berkeley aligner](http://code.google.com/p/berkeleyaligner/) can be used instead.  When
     using the Berkeley aligner, you'll want to pay attention to how much memory you allocate to it
     with `--aligner-mem` (the default is 10g).
 
--   `aligner-chunk-size SIZE` (1,000,000)
+-   `--aligner-chunk-size SIZE` (1,000,000)
 
-    The number of sentence pairs to compute alignments over.
+    The number of sentence pairs to compute alignments over. The training data is split into blocks
+    of this size, aligned separately, and then concatenated.
     
 -   `--alignment FILE`
 
@@ -427,19 +429,18 @@
 
     This value is required if you start at the grammar extraction step.
 
-When alignment is complete, the alignment file can be found at `RUNDIR/alignments/training.align`.
+When alignment is complete, the alignment file can be found at `$RUNDIR/alignments/training.align`.
 It is parallel to the training corpora.  There are many files in the `alignments/` subdirectory that
 contain the output of intermediate steps.
 
-<a name="parsing" />
-## 3. PARSING
+### <a id="parsing" /> 3. PARSING
 
-When SAMT grammars are being built (`--type samt`), the target side of the training data must be
-parsed.  The pipeline assumes your target side will be English, and will parse it for you using
-[the Berkeley parser](http://code.google.com/p/berkeleyparser/), which is included.  If it is not
-the case that English is your target-side language, the target side of your training data (found at
-CORPUS.TARGET) must already be parsed in PTB format.  The pipeline will notice that it is parsed and
-will not reparse it.
+To build SAMT and GHKM grammars (`--type samt` and `--type ghkm`), the target side of the
+training data must be parsed. The pipeline assumes your target side will be English, and will parse
+it for you using [the Berkeley parser](http://code.google.com/p/berkeleyparser/), which is included.
+If English is not your target-side language, the target side of your training data (found at
+CORPUS.TARGET) must already be parsed in PTB format.  The pipeline will notice that
+it is parsed and will not reparse it.
 
 Parsing is affected by both the `--threads N` and `--jobs N` options.  The former runs the parser in
 multithreaded mode, while the latter distributes the runs across a cluster (and requires some
@@ -447,34 +448,36 @@
 
 Once the parsing is complete, there will be two parsed files:
 
-- `RUNDIR/data/train/corpus.en.parsed`: this is the mixed-case file that was parsed.
-- `RUNDIR/data/train/corpus.parsed.en`: this is a leaf-lowercased version of the above file used for
+- `$RUNDIR/data/train/corpus.en.parsed`: this is the mixed-case file that was parsed.
+- `$RUNDIR/data/train/corpus.parsed.en`: this is a leaf-lowercased version of the above file used for
   grammar extraction.
 
-<a name="tm" />
-## 4. THRAX (grammar extraction)
+## <a id="tm" /> 4. THRAX (grammar extraction)
 
 The grammar extraction step takes three pieces of data: (1) the source-language training corpus, (2)
 the target-language training corpus (parsed, if an SAMT grammar is being extracted), and (3) the
 alignment file.  From these, it computes a synchronous context-free grammar.  If you already have a
-grammar and wish to skip this step, you can do so passing the grammar with the `--grammar GRAMMAR`
-flag. 
+grammar and wish to skip this step, you can do so by passing the grammar with the `--grammar
+/path/to/grammar` flag.
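+
+For example (a sketch: `...` stands for the usual data and language flags, and the path assumes the
+grammar was extracted by an earlier run in directory `1`):
+
+    $JOSHUA/bin/pipeline.pl ... --grammar 1/grammar.gz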
 
 The main variable in grammar extraction is Hadoop.  If you have a Hadoop installation, simply ensure
 that the environment variable `$HADOOP` is defined, and Thrax will seamlessly use it.  If you *do
 not* have a Hadoop installation, the pipeline will roll one out for you, running Hadoop in
-standalone mode.  (This mode is triggered when `$HADOOP` is undefined).  Theoretically, any grammar extractable on a full Hadoop cluster should be
-extractable in standalone mode, if you are patient enough; in practice, you probably are not patient
-enough, and will be limited to smaller datasets.  Setting up your own Hadoop cluster is not too
-difficult a chore; in particular, you may find it helpful to install a
+standalone mode (this mode is triggered when `$HADOOP` is undefined).  Theoretically, any grammar
+extractable on a full Hadoop cluster should be extractable in standalone mode, if you are patient
+enough; in practice, you probably are not patient enough, and will be limited to smaller
+datasets. You may also run into problems with disk space, since Hadoop uses a lot of it; use `--tmp
+/path/to/tmp` to specify an alternate location for temporary data (we suggest a local disk
+partition with tens or hundreds of gigabytes free, not an NFS partition).  Setting up your own
+Hadoop cluster is not too difficult a chore; in particular, you may find it helpful to install a
 [pseudo-distributed version of Hadoop](http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html).
 In our experience, this works fine, but you should note the following caveats:
 
 - It is of crucial importance that you have enough physical disks.  We have found that having too
  few disks, or disks that are too slow, results in a whole host of seemingly unrelated issues that are hard to
   resolve, such as timeouts.  
-- NFS filesystems can exacerbate this.  You should really try to install physical disks that are
-  dedicated to Hadoop scratch space.
+- NFS filesystems can cause lots of problems.  You should really try to install physical disks that
+  are dedicated to Hadoop scratch space.
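+
+For example (a sketch; the paths are hypothetical), pointing the pipeline at an existing Hadoop
+installation and keeping temporary data on a local disk rather than NFS might look like:
+
+    export HADOOP=/opt/hadoop
+    $JOSHUA/bin/pipeline.pl ... --tmp /local/scratch/joshua-tmp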
 
 Here are some flags relevant to Hadoop and grammar extraction with Thrax:
 
@@ -493,10 +496,9 @@
    templates are located at `$JOSHUA/scripts/training/templates/thrax-TYPE.conf`, where TYPE is one
    of "hiero" or "samt".
   
-When the grammar is extracted, it is compressed and placed at `RUNDIR/grammar.gz`.
+When the grammar is extracted, it is compressed and placed at `$RUNDIR/grammar.gz`.
 
-<a name="lm" />
-## 5. Language model
+## <a id="lm" /> 5. Language model
 
 Before tuning can take place, a language model is needed.  A language model is always built from the
 target side of the training corpus unless `--no-corpus-lm` is specified.  In addition, you can
@@ -508,21 +510,19 @@
    This determines the language model code that will be used when decoding.  These implementations
    are described in their respective papers (PDFs:
    [KenLM](http://kheafield.com/professional/avenue/kenlm.pdf),
-   [BerkeleyLM](http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf)).
+   [BerkeleyLM](http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf)). KenLM is written in
+   C++ and is accessed via JNI, but is recommended because it supports left-state minimization.
    
 - `--lmfile FILE`
 
   Specifies a pre-built language model to use when decoding.  This language model can be in ARPA
  format, or in KenLM format (when using KenLM) or BerkeleyLM format (when using BerkeleyLM).
 
-- `--lm-gen` {berkeleylm (default), srilm}, `--buildlm-mem MEM`, `--witten-bell`
+- `--lm-gen` {kenlm (default), srilm, berkeleylm}, `--buildlm-mem MEM`, `--witten-bell`
 
   At the tuning step, an LM is built from the target side of the training data (unless
   `--no-corpus-lm` is specified).  This controls which code is used to build it.  The default is a
-  [BerkeleyLM java class](http://code.google.com/p/berkeleylm/source/browse/trunk/src/edu/berkeley/nlp/lm/io/MakeKneserNeyArpaFromText.java)
-  that computes a Kneser-Ney LM with a constant discounting and no count thresholding.  The flag
-  `--buildlm-mem` can be used to control how much memory is allocated to the Java process.  The
-  default is "2g", but you will want to increase it for larger language models.
+  call to KenLM's [lmplz](http://kheafield.com/code/kenlm/estimation/), which is strongly recommended.
   
   If SRILM is used, it is called with the following arguments:
   
@@ -530,8 +530,12 @@
         
   Where SMOOTHING is `-kndiscount`, or `-wbdiscount` if `--witten-bell` is passed to the pipeline.
   
-A language model built from the target side of the training data is placed at `RUNDIR/lm.gz`.  
-
+  A [BerkeleyLM java class](http://code.google.com/p/berkeleylm/source/browse/trunk/src/edu/berkeley/nlp/lm/io/MakeKneserNeyArpaFromText.java)
+  is also available. It computes a Kneser-Ney LM with constant discounting (0.75) and no count
+  thresholding.  The flag `--buildlm-mem` can be used to control how much memory is allocated to the
+  Java process.  The default is "2g", but you will want to increase it for larger language models.
+  
+  A language model built from the target side of the training data is placed at `$RUNDIR/lm.gz`.  
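+
+For example (a sketch; the LM path is hypothetical), to decode with an existing ARPA-format language
+model instead of building one from the training data:
+
+    $JOSHUA/bin/pipeline.pl ... --no-corpus-lm --lmfile /path/to/big.lm.gz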
 
 ## Interlude: decoder arguments
 
@@ -542,34 +546,31 @@
 can alter the amount of memory for Joshua using the `--joshua-mem MEM` argument, where MEM is a Java
 memory specification (passed to its `-Xmx` flag).
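+
+For example (a sketch), to give the decoder 16 GB of memory during tuning and testing:
+
+    $JOSHUA/bin/pipeline.pl ... --joshua-mem 16g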
 
-<a name="tuning" />
-## 6. TUNING
+## <a id="tuning" /> 6. TUNING
 
-Two optimizers are implemented for Joshua: MERT and PRO (`--tuner {mert,pro}`).  Tuning is run till
-convergence in the `RUNDIR/tune` directory.  By default, tuning is run just once, but the pipeline
-supports running the optimizer an arbitrary number of times due to
-[recent work](http://www.youtube.com/watch?v=BOa3XDkgf0Y) pointing out the variance of tuning
-procedures in machine translation, in particular MERT.  This can be activated with `--optimizer-runs
-N`.  Each run can be found in a directory `RUNDIR/tune/N`.
+Two optimizers are provided with Joshua: MERT and PRO (`--tuner {mert,pro}`).  If Moses is
+installed, you can also use Cherry & Foster's k-best batch MIRA (`--tuner mira`, recommended).
+Tuning is run until convergence in the `$RUNDIR/tune/N` directory, where N is the tuning instance.
+By default, tuning is run just once, but the pipeline supports running the optimizer an arbitrary
+number of times due to [recent work](http://www.youtube.com/watch?v=BOa3XDkgf0Y) pointing out the
+variance of tuning procedures in machine translation, in particular MERT.  This can be activated
+with `--optimizer-runs N`.
 
-When
-tuning is finished, each final configuration file can be found at either
+When tuning is finished, each final configuration file can be found at
 
-    RUNDIR/tune/N/joshua.config.ZMERT.final
-    RUNDIR/tune/N/joshua.config.PRO.final
+    $RUNDIR/tune/N/joshua.config.final
 
 where N varies from 1..`--optimizer-runs`.
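+
+For example (a sketch), to tune with PRO and repeat the optimization three times:
+
+    $JOSHUA/bin/pipeline.pl ... --tuner pro --optimizer-runs 3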
 
-<a name="testing" />
-## 7. Testing
+## <a id="testing" /> 7. Testing 
 
-For each of the tuner runs, Joshua takes the tuner output file and decodes the test set.
-Afterwards, by default, minimum Bayes-risk decoding is run on the 300-best output.  This step
-usually yields about 0.3 - 0.5 BLEU points but is time-consuming, and can be turned off with the
-`--no-mbr` flag. 
+For each of the tuner runs, Joshua takes the tuner output file and decodes the test set.  If you
+like, you can also apply minimum Bayes-risk decoding to the decoder output with `--mbr`.  This
+usually yields about 0.3 - 0.5 BLEU points, but is time-consuming.
 
 After decoding the test set with each set of tuned weights, Joshua computes the mean BLEU score,
-writes it to `RUNDIR/test/final-bleu`, and cats it.  That's the end of the pipeline!
+writes it to `$RUNDIR/test/final-bleu`, and cats it. It also writes a file
+`$RUNDIR/test/final-times` containing a summary of runtime information. That's the end of the pipeline!
 
 Joshua also supports decoding further test sets.  This is enabled by rerunning the pipeline with a
 number of arguments:
@@ -581,12 +582,19 @@
 -   `--name NAME`
 
     A name is needed to distinguish this test set from the previous ones.  Output for this test run
-    will be stored at `RUNDIR/test/NAME`.
+    will be stored at `$RUNDIR/test/NAME`.
     
 -   `--joshua-config CONFIG`
 
     A tuned parameter file is required.  This file will be the output of some prior tuning run.
     Necessary pathnames and so on will be adjusted.
+    
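+For example (a sketch; the test-set name and run number are hypothetical), decoding a new test set
+with the weights tuned in run 1 might look like:
+
+    $JOSHUA/bin/pipeline.pl ... --name newstest --test input/newstest \
+      --joshua-config 1/tune/1/joshua.config.final
+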
+## <a id="analysis"> 8. ANALYSIS
+
+If you have used the suggested layout, with a number of related runs all contained in a common
+directory with sequential numbers, you can use the script `$JOSHUA/scripts/training/summarize.pl` to
+display a summary of the mean BLEU scores from all runs, along with the text you placed in the run
+README file (using the pipeline's `--readme TEXT` flag).
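+
+A typical invocation might look like this (a sketch; we assume the script is run from the directory
+containing the numbered runs):
+
+    cd ~/expts/urdu-english/runs
+    $JOSHUA/scripts/training/summarize.pl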
 
 ## COMMON USE CASES AND PITFALLS 
 
diff --git a/5.0/scale2013-tutorial.md b/5.0/scale2013-tutorial.md
deleted file mode 100644
index eb3a598..0000000
--- a/5.0/scale2013-tutorial.md
+++ /dev/null
@@ -1,115 +0,0 @@
----
-layout: default
-category: links
-title: SCALE 2013 Joshua tutorial
----
-
-If you're running Joshua on the HLTCOE file servers, you're in luck, because if anything the Joshua
-setup is overfit to that environment. This page contains some notes tailored to a
-[SCALE 2013](http://hltcoe.jhu.edu/research/scale-workshops/) tutorial that took place on June 5,
-2013.
-
-This page is designed to supplement the [pipeline walkthrough](pipeline.html), so I recommend you
-keep both open in separate tabs.
-
-## Download and Setup
-
-You can copy Joshua 5.0rc2 from `~lorland/workspace/mt/joshua/release/joshua-v5.0rc2.tgz` instead of
-downloading it directly. I recommend you install it under `~/code/`. Assuming you do so, you will
-have
-
-    export JOSHUA=$HOME/code/joshua-v5.0rc2
-    export JAVA_HOME=/usr/java/default
-
-You should set the following environment variables. Add these to your ~/.bashrc:
-
-    export HADOOP=/opt/apache/hadoop
-    export HADOOP_CONF_DIR=/opt/apache/hadoop/conf/apache-mr/mapreduce
-    export MOSES=/home/hltcoe/mpost/code/mosesdecoder
-    export SRILM=/home/hltcoe/mpost/code/srilm
-    
-Then load them:
-
-    source ~/.bashrc
-    
-Now compile:
-
-    cd joshua-v5.0rc2
-    export JOSHUA=$(pwd)
-    ant
-
-## Installation
-
-You don't need to install any external software since it is already installed. The environment
-variable exports above will take care of this for you.
-
-## A basic pipeline run
-
-For today's experiments, we'll be translating with a Spanish-English translated dataset collected by
-[Chris Callison-Burch](http://cs.jhu.edu/~ccb/), [Adam Lopez](http://cs.jhu.edu/~alopez/), and
-[myself](http://cs.jhu.edu/~post/). This dataset contains translations of the LDC transcriptions of
-the Fisher Spanish and CALLHOME Spanish datasets. These translations were collected using Amazon's
-Mechanical Turk.
-
-We plan to publicly release this dataset later this year, and in the meantime *we ask that you do
- not distribute this data or remove it from HLTCOE servers*.
-
-The dataset is located at `/export/common/SCALE13/Text/fishcall`. Please set the following environment
-variable for convenience:
-
-    export FISHCALL=/export/common/SCALE13/Text/fishcall
-
-If you're at CLSP, the data can be found instead at
-
-    export FISHCALL=/home/mpost/data/fishcall
-    
-### Preparing the data
-
-The data has already been split into training, development, held-out test, and test sets, both for
-Fisher Spanish and CALLHOME. The prefixes are as follows:
-
-    $FISHCALL/fisher_train
-    $FISHCALL/fisher_dev
-    $FISHCALL/fisher_dev2
-    $FISHCALL/fisher_test
-    
-    $FISHCALL/callhome_train
-    $FISHCALL/callhome_devtest
-    $FISHCALL/callhome_evltest
-    
-### Run the pipeline
-
-I'll assume here a run directory of `$HOME/expts/scale13/joshua-tutorial/runs/`. To run the complete
-pipeline and output results for the Fisher held-out test set, type:
-
-    cd $HOME/expts/scale13/joshua-tutorial/runs/
-    $JOSHUA/bin/pipeline.pl           \
-      --readme "Baseline run"         \
-      --rundir 1                      \
-      --corpus $FISHCALL/fisher_train \
-      --tune $FISHCALL/fisher_dev     \
-      --test $FISHCALL/fisher_dev2    \
-      --source es                     \
-      --target en
-      
-This will start the pipeline building a translation system trained on (Spanish transcript, English
-translation) pairs, and evaluate on other Spanish transcripts. It will use the defaults for all
-pieces of the pipeline: [GIZA++](https://code.google.com/p/giza-pp/) for alignment, BerkeleyLM for
-building the language model, batch MIRA for tuning, KenLM for representing LM state in the decoder,
-and so on.
-
-### Variations
-
-You can try different variations:
-
-   - Build an SAMT model (`--type samt`), GKHM model (`--type ghkm`), or phrasal model (`--type phrasal`) 
-   
-   - Use the Berkeley aligner instead of GIZA++ (`--aligner berkeley`)
-   
-   - Build the language model with SRILM (recommended) instead of BerkeleyLM (`--lm-gen srilm`)
-
-   - Tune with MIRA instead of MERT (`--tuner mert`)
-   
-   - Decode with a wider beam (`--joshua-args '-pop-limit 200'`) (the default is 100)
-
-   - Add training data (add another `--corpus` line, e.g., `--corpus $FISHCALL/callhome_train`)
diff --git a/5.0/tutorial.md b/5.0/tutorial.md
index 7e800e5..6dfe448 100644
--- a/5.0/tutorial.md
+++ b/5.0/tutorial.md
@@ -50,11 +50,11 @@
 In `$INDIAN/bn-en/tok`, you should see the following files:
 
     $ ls $INDIAN/bn-en/tok
-    dev.bn-en.bn		devtest.bn-en.bn	dict.bn-en.bn		test.bn-en.en.2
-    dev.bn-en.en.0		devtest.bn-en.en.0	dict.bn-en.en		test.bn-en.en.3
-    dev.bn-en.en.1		devtest.bn-en.en.1	test.bn-en.bn		training.bn-en.bn
-    dev.bn-en.en.2		devtest.bn-en.en.2	test.bn-en.en.0		training.bn-en.en
-    dev.bn-en.en.3		devtest.bn-en.en.3	test.bn-en.en.1
+    dev.bn-en.bn     devtest.bn-en.bn     dict.bn-en.bn     test.bn-en.en.2
+    dev.bn-en.en.0   devtest.bn-en.en.0   dict.bn-en.en     test.bn-en.en.3
+    dev.bn-en.en.1   devtest.bn-en.en.1   test.bn-en.bn     training.bn-en.bn
+    dev.bn-en.en.2   devtest.bn-en.en.2   test.bn-en.en.0   training.bn-en.en
+    dev.bn-en.en.3   devtest.bn-en.en.3   test.bn-en.en.1
 
 We will now use this data to test the complete pipeline with a single command.
     
@@ -71,8 +71,8 @@
 
     cd ~/expts/joshua
     $JOSHUA/bin/pipeline.pl           \
-      --readme "Baseline Hiero run"   \
       --rundir 1                      \
+      --readme "Baseline Hiero run"   \
       --source bn                     \
       --target en                     \
       --corpus $INDIAN/bn-en/tok/training.bn-en \
@@ -145,7 +145,7 @@
  pipeline always builds an LM on the target side of the training data, if provided, but we are
  supplying the language model that was already built. We could equivalently have removed the
  `--corpus` line.
-
+ 
 ## Changing the model type
 
 Let's compare the Hiero model we've already built to an SAMT model. We have to reextract the
@@ -164,6 +164,8 @@
       --no-build-lm \
       --lmfile 1/lm.gz
 
+See [the pipeline script page](pipeline.html#steps) for a list of all the steps.
+
 ## Analyzing the results
 
 We now have three runs, in subdirectories 1, 2, and 3. We can display summary results from them
diff --git a/_layouts/default.html b/_layouts/default.html
index c1f3c37..691d3bb 100644
--- a/_layouts/default.html
+++ b/_layouts/default.html
@@ -86,12 +86,14 @@
           <a class="brand" href="/">Joshua</a>
           <div class="nav-collapse collapse">
             <ul class="nav">
-              <li class="active"><a href="/">Home</a></li>
+              <li><a href="index.html">Documentation</a></li>
               <li><a href="pipeline.html">Pipeline</a></li>
+              <li><a href="tutorial.html">Tutorial</a></li>
               <li><a href="decoder.html">Decoder</a></li>
               <li><a href="thrax.html">Thrax</a></li>
               <li><a href="file-formats.html">File formats</a></li>
-              <li><a href="advanced.html">Advanced</a></li>
+              <!-- <li><a href="advanced.html">Advanced</a></li> -->
+              <li><a href="faq.html">FAQ</a></li>
             </ul>
           </div><!--/.nav-collapse -->
         </div>
diff --git a/joshua.css b/joshua.css
index 728b6ee..77b01a2 100644
--- a/joshua.css
+++ b/joshua.css
@@ -39,6 +39,6 @@
 }
 
 img.sponsor {
-    height: 120px;
+    width: 120px;
     margin: 5px;
 }