========================================================================
GIZA++ is an extension of the program GIZA.
It is a program for learning statistical translation models from
bitext. It is an implementation of the models described in
(Brown et al., 1993), (Vogel et al., 1996), (Och and Ney, 2000a),
(Och and Ney, 2000b).
========================================================================
CONTENTS of this README file:
Part 0: What is GIZA++
Part I: GIZA++ Package Programs
Part II: How To Compile GIZA++
Part III: How To Run GIZA++
Part IV: Input File Formats
A. VOCABULARY FILES
B. Bitext Files
C. Dictionary File (optional)
Part V: Output File Formats:
A. PROBABILITY TABLES
1. T TABLE (translation table)
2. N TABLE (Fertility table)
3. P0 TABLE
4. A TABLE
5. D3 TABLE
6. D4 TABLE
7. D5 TABLE
8. HMM TABLE
B. ALIGNMENT FILE
C. Perplexity File
D. Revised Vocabulary Files
E. Final Parameter File
Part VI: Literature
Part VII: New features
HISTORY of this README file:
GIZA++:
edited: 11 Jan. 2000, Franz Josef Och
GIZA:
edited: 16 Aug. 1999, Dan Melamed
edited: 13 Aug. 1999, Yaser Al-Onaizan
edited: 20 July 1999, Yaser Al-Onaizan
edited: 15 July 1999, Yaser Al-Onaizan
edited: 13 July 1999, Noah Smith
========================================================================
Part 0: What is GIZA++
GIZA++ is an extension of the program GIZA (part of the SMT toolkit
EGYPT - http://www.clsp.jhu.edu/ws99/projects/mt/toolkit/ ), which was
developed by the Statistical Machine Translation team during the 1999
summer workshop at the Center for Language and Speech Processing at
Johns Hopkins University (CLSP/JHU). GIZA++ adds a number of features
to GIZA. The extensions of GIZA++ were designed and written by Franz
Josef Och.
Features of GIZA++ not in GIZA:
- Implements the full IBM-4 alignment model with dependence on word
classes, as described in (Brown et al. 1993)
- Implements IBM-5: dependence on word classes, smoothing, ...
- Implements the HMM alignment model: Baum-Welch training,
Forward-Backward algorithm, empty word, dependence on word classes,
transfer to fertility models, ...
- Implements a variant of the IBM-3 and IBM-4 models
(-deficientDistortionModel 1) which allows training of the p0
parameter
- Smoothing of fertility and distortion/alignment parameters
- Significantly more efficient training of the fertility models
- Correct implementation of pegging as described in (Brown et
al. 1993), with a series of heuristics that make pegging sufficiently
efficient
- Completely new parameter mechanism that makes it easy to add
additional parameters
- Improved perplexity calculation for the IBM-1, IBM-2 and HMM models
(the parameter of the Poisson distribution of the sentence lengths is
computed automatically from the training corpus)
========================================================================
Part I: GIZA++ Package Programs
GIZA++: GIZA++ itself
plain2snt.out: simple tool to transform plain text into GIZA text
format
snt2plain.out: simple tool to transform GIZA text format into plain
text
trainGIZA++.sh: Shell script to perform standard training given a
corpus in GIZA text format
========================================================================
Part II: How To Compile GIZA++
In order to compile GIZA++ you need:
- a recent version of the GNU compiler (2.95 or higher)
- a recent assembler and linker without restrictions on the length of
symbol names
There is a makefile in the src directory that takes care of the
compilation. The most important targets are:
GIZA++: generates an optimized version
GIZA++.dbg: generates the debug version
depend: regenerates the "dependencies" file (run this whenever you add
source or header files to the package)
========================================================================
Part III: How To Run GIZA++
It's simple:
GIZA++ [config-file] [options]
All options that expect a parameter can also be set in the config
file. For example, the command line options
GIZA++ -S S.vcb -T T.vcb -C ST.snt
corresponds to the config file:
S: S.vcb
T: T.vcb
C: ST.snt
If you call GIZA++ without any parameters, you get a list of all the
options. The option names from GIZA are normally still valid. The
default parameter values were optimized with respect to the corpora I
use and typically give good results; nevertheless, these parameters
should be re-optimized for every new task.
==========================================================================
Part IV: Input File Formats
A. VOCABULARY FILES
Each entry is stored on one line as follows:
uniq_id1 string1 no_occurrences1
uniq_id2 string2 no_occurrences2
uniq_id3 string3 no_occurrences3
....
Here is a sample from an English vocabulary file:
627 abandon 10
628 abandoned 17
629 abandoning 2
630 abandonment 12
631 abatement 8
632 abbotsford 2
uniq_ids are sequential positive integer numbers. 0 is reserved for
the special token NULL.
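To make the format concrete, here is a minimal Python sketch that
reads such a vocabulary file into a dictionary (the function name is
illustrative; "S.vcb" is the source vocabulary file from the example
in Part III):

    # Read a GIZA++ vocabulary file: each line is "uniq_id token frequency".
    def read_vocab(path):
        vocab = {0: ("NULL", 0)}  # id 0 is reserved for the special NULL token
        with open(path, encoding="utf-8") as f:
            for line in f:
                uniq_id, token, freq = line.split()
                vocab[int(uniq_id)] = (token, int(freq))
        return vocab

    vocab = read_vocab("S.vcb")
    # vocab[627] == ("abandon", 10) for the sample above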
B. Bitext Files
Each sentence pair is stored in three lines. The first line
is the number of times this sentence pair occurred. The second line is
the source sentence where each token is replaced by its unique integer
id from the vocabulary file and the third is the target sentence in
the same format.
Here is a sample of three sentence pairs from an English/French corpus:
1
1 1 226 5008 621 6492 226 6377 6813 226 9505 5100 6824 226 5100 5222 0 614 10243 613
2769 155 7989 585 1 578 6503 585 8242 578 8142 8541 578 12328 6595 8550 578 6595 6710 1
1
1 1 226 6260 11856 11806 1293
11 1 1 11 155 14888 2649 11447 9457 8488 4168
1
1 1 226 7652 1 226 5337 226 6940 12089 5582 8076 12050
1 1 155 4140 6812 153 1 154 155 14668 15616 10524 9954 1392
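A minimal Python sketch of reading this format (the function name is
illustrative; "ST.snt" is the bitext file from the example in Part
III):

    # Iterate over a GIZA++ bitext file: each sentence pair occupies
    # three lines (occurrence count, source token ids, target token ids).
    def read_bitext(path):
        with open(path, encoding="utf-8") as f:
            lines = [line.split() for line in f if line.strip()]
        for k in range(0, len(lines), 3):
            count = int(lines[k][0])
            source = [int(t) for t in lines[k + 1]]
            target = [int(t) for t in lines[k + 2]]
            yield count, source, target

    for count, source, target in read_bitext("ST.snt"):
        print(count, len(source), len(target))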
C. Dictionary File (optional)
If you provide a dictionary and list it in the configuration file,
GIZA++ changes the cooccurrence counting in the first iteration of
Model 1 to honor the so-called "Dictionary Constraint":
In parallel sentences "e1 ... en" and "f1 ... fm", ei and fj are
counted as a cooccurrence pair if one of two conditions is met:
1.) ei and fj occur as an entry in the dictionary, or 2.) ei does not
occur in the dictionary with any fk (1 <= k <= m) and fj does not
occur in the dictionary with any ek (1 <= k <= n).
The dictionary must be a list of pairs, one per line:
F E
where F is the integer id of a target token and E is the integer id of
a source token. F may be listed with several Es, and vice versa.
Important: The dictionary must be sorted by the F integers!
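The following Python sketch illustrates the counting rule for a single
sentence pair; it is an illustration of the constraint as described
above, not GIZA++'s actual implementation:

    # dictionary: a set of (F, E) id pairs as read from the dictionary file;
    # source/target: lists of token ids for one sentence pair.
    def constrained_pairs(source, target, dictionary):
        # Does a word occur in the dictionary with any word of the other sentence?
        e_hit = {e: any((f, e) in dictionary for f in target) for e in source}
        f_hit = {f: any((f, e) in dictionary for e in source) for f in target}
        pairs = []
        for e in source:
            for f in target:
                if (f, e) in dictionary:             # condition 1: dictionary entry
                    pairs.append((e, f))
                elif not e_hit[e] and not f_hit[f]:  # condition 2: neither word constrained
                    pairs.append((e, f))
        return pairs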
==========================================================================
Part V: Output File Formats:
For file names, we will use the prefix "prob_table". This can be
changed using the -o switch. The default is a combination of user id
and time stamp.
A. PROBABILITY TABLES
Normally, Model1 is trained first, and the result is used to start
Model2 training. Then Model2 is transferred to Model3, and Model3
Viterbi training follows. This sequence can be adjusted by various
options, either on the command line or in a config file.
1. T TABLE ( *.t3.* )
(translation table)
prob_table.t1.n = t table after n iterations of Model1 training
prob_table.t2.n = t table after n iterations of Model2 training
prob_table.t2to3 = t table after transferring Model2 to Model3
prob_table.t3.n = t table after n iterations of Model3 training
prob_table.t4.n = t table after n iterations of Model4 training
Each line is of the following format:
s_id t_id P(t_id | s_id)
where:
s_id: the unique id of the source token
t_id: the unique id of the target token
P(t_id | s_id): the probability of translating s_id as t_id
sample part of a file:
3599 5697 0.0628115
2056 10686 0.000259988
8227 3738 3.57132e-13
5141 13720 5.52332e-12
10798 4102 6.53047e-06
8227 3750 6.97502e-14
7712 14080 6.0365e-20
7712 14082 2.68323e-17
7713 1083 3.94464e-15
7712 14084 2.98768e-15
Similar files with the prefix "prob_table.actual" are generated; they
contain the actual tokens instead of their unique ids. The same holds
for the fertility tables. For the final table, the inverse probability
table is also generated; it has the infix "ti".
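A minimal Python sketch of loading a t table (the function name and
concrete file name are illustrative):

    # Read a t table: each line is "s_id t_id P(t_id | s_id)".
    def read_ttable(path):
        table = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                s_id, t_id, prob = line.split()
                table[(int(s_id), int(t_id))] = float(prob)
        return table

    ttable = read_ttable("prob_table.t3.5")  # t table after 5 Model3 iterations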
2. N TABLE ( *.n3.* )
(Fertility table)
prob_table.n2to3 = n table estimated during the transfer from M2 to M3
prob_table.n3.X = n table after X iterations of Model3
Each line in this file is of the following format:
source_token_id p0 p1 p2 .... pn
where p0 is the probability that the source token has fertility zero,
p1 fertility one, and so on; n is the maximum possible fertility as
defined in the program.
sample:
1 0.475861 0.282418 0.133455 0.0653083 0.0329326 0.00844979 0.0014008
10 0.249747 0.000107778 0.307767 0.192208 0.0641439 0.15016 0.0358886
11 0.397111 0.390421 0.19925 0.013382 2.21286e-05 0 0
12 0.0163432 0.560621 0.374745 0.00231588 0 0 0
13 1.78045e-07 0.545694 0.299573 0.132127 0.0230494 9.00322e-05 0
14 1.41918e-18 0.332721 0.300773 0.0334969 0 0 0
15 0 5.98626e-10 0.47729 0.0230955 0 0 0
17 0 1.66346e-07 0.895883 0.103948 0 0 0
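A minimal Python sketch of reading the fertility table (the function
name is illustrative):

    # Read an n table: each line is "source_token_id p0 p1 ... pn".
    def read_ntable(path):
        table = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = line.split()
                table[int(fields[0])] = [float(p) for p in fields[1:]]
        return table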
3. P0 TABLE ( *.p0* )
(1 - P0 is the probability of inserting a null after a
source word.)
This file contains only one line with one real number which is the
value of P0, the probability of not inserting a NULL token.
4. A TABLE ( *.a[23].* )
The file names follow the naming conventions above. The format of each
line is as follows:
i j l m p(i | j, l, m)
where i, j, l, m are all integers and
j = position in the target sentence
i = position in the source sentence
l = length of the source sentence
m = length of the target sentence
and p(i | j, l, m) is the probability that a source word in position i
is aligned to position j in a sentence pair of lengths l and m.
sample:
15 14 15 14 0.630798
15 14 15 15 0.414137
15 14 15 16 0.268919
15 14 15 17 0.23171
15 14 15 18 0.117311
15 14 15 19 0.119202
15 14 15 20 0.111369
15 14 15 21 0.0358169
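A minimal Python sketch of loading an a table into a lookup keyed by
the conditioning variables (the function name is illustrative):

    # Read an a table: each line is "i j l m p(i | j, l, m)".
    def read_atable(path):
        table = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                i, j, l, m, prob = line.split()
                key = (int(j), int(l), int(m))
                table.setdefault(key, {})[int(i)] = float(prob)
        return table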
5. D3 TABLE ( *.d3.* )
distortion table
The format is similar to the A table, with a slight difference: the
positions of i and j are switched:
j i l m p(j | i, l, m)
sample:
15 14 14 15 0.286397
15 14 14 16 0.138898
15 14 14 17 0.109712
15 14 14 18 0.0868322
15 14 14 19 0.0535823
6. D4 TABLE ( *.d4.* )
distortion table for IBM-4
7. D5 TABLE: ( *.d5.* )
distortion table for IBM-5
8. HMM TABLE: ( *.hhmm.* )
alignment probability table for HMM alignment model
B. ALIGNMENT FILE ( *.A3.* )
In each iteration of the training, and for each sentence pair in the
training set, the best alignment (Viterbi alignment) is written to the
alignment file (if the dump parameters are set accordingly). The
alignment file is named prob_table.An.i, where n is the model number
(1, 2, 2to3, 3 or 4) and i is the iteration number. The format of the
alignment file is illustrated in the following sample:
# Sentence pair (1)
il s' agit de la même société qui a changé de propriétaires
NULL ({ }) UNK ({ }) UNK ({ }) ( ({ }) this ({ 4 11 }) is ({ }) the ({ }) same ({ 6 }) agency ({ }) which ({ 8 }) has ({ }) undergone ({ 1 2 3 7 9 10 12 }) a ({ }) change ({ 5 }) of ({ }) UNK ({ })
# Sentence pair (2)
UNK UNK , le propriétaire , dit que cela s' est produit si rapidement qu' il n' en connaît pas la cause exacte
NULL ({ 4 }) UNK ({ 1 2 }) UNK ({ }) , ({ 3 }) the ({ }) owner ({ 5 22 23 }) , ({ 6 }) says ({ 7 8 }) it ({ }) happened ({ 10 11 12 }) so ({ 13 }) fast ({ 14 19 }) he ({ 16 }) is ({ }) not ({ 20 }) sure ({ 15 17 }) what ({ }) went ({ 18 21 }) wrong ({ 9 })
Each sentence pair is represented by three lines in the alignment
file. The first line is a label that can be used, e.g., as a caption
by alignment visualization tools; it contains the sequential number of
the sentence pair in the training corpus, the sentence lengths, and
the alignment probability. The second line is the target sentence; the
third line is the source sentence. Each token in the source sentence
is followed by a set of zero or more numbers, which are the positions
of the target words to which this source word is connected according
to the alignment.
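A minimal Python sketch of parsing this format (the function name is
illustrative); it assumes exactly three lines per sentence pair, as
described above:

    import re

    # Matches "token ({ 4 11 })" groups in the source line.
    ALIGN_RE = re.compile(r"(\S+) \(\{((?: \d+)*) \}\)")

    def read_alignments(path):
        with open(path, encoding="utf-8") as f:
            lines = [line for line in f if line.strip()]
        for k in range(0, len(lines), 3):  # label, target, source+alignment
            target = lines[k + 1].split()
            alignment = [(token, [int(p) for p in positions.split()])
                         for token, positions in ALIGN_RE.findall(lines[k + 2])]
            yield lines[k].strip(), target, alignment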
C. Perplexity File ( *.perp )
This file will be generated at the end of training. It summarizes
perplexity values for each training iteration. Here is a sample
perplexity file that illustrates the format. The format is the same
for cross entropy. If no test corpus was provided, the values for it
will be set to "N/A".
# train-size test-size iter. model train-perplexity test-perplexity final(y/n) train-viterbi-perp test-viterbi-perp
447136 9625 0 1 187067 186722 n 3.34328e+06 3.35352e+06
447136 9625 1 1 192.88 248.763 n 909.879 1203.13
447136 9625 2 1 99.45 139.214 n 316.363 459.745
447136 9625 3 1 83.4746 126.046 n 214.612 341.27
447136 9625 4 1 78.6939 124.914 n 179.218 303.169
447136 9625 5 2 76.6848 125.986 n 161.874 286.226
447136 9625 6 2 50.7452 86.2273 n 84.7227 151.701
447136 9625 7 2 42.9178 74.5574 n 63.6644 116.034
447136 9625 8 2 40.0651 70.7444 n 56.3186 104.274
447136 9625 9 2 38.8471 69.4105 n 53.1277 99.6044
447136 9625 10 2to3 38.2561 68.9576 n 51.4856 97.4414
447136 9625 11 3 129.993 248.885 n 86.6675 165.012
447136 9625 12 3 79.2212 169.902 n 86.4842 171.367
447136 9625 13 3 75.0746 164.488 n 84.9647 172.639
447136 9625 14 3 73.412 162.765 n 83.5762 172.797
447136 9625 15 3 72.6107 162.254 y 82.4575 172.688
D. Revised Vocabulary files (*.src.vcb, *.trg.vcb)
The revised vocabulary files are similar in format to the original
vocabulary files. The only exception is that the frequency of each
token is calculated from the given corpus (i.e. it is exact), which is
not required in the input.
E. final parameter file: ( *.gizacfg )
This file includes all the parameter settings that were used to
perform this training; starting GIZA++ with this parameter file should
reproduce the same training run.
Part VI: LITERATURE
-------------------
The following two articles include a comparison of the alignment
models implemented in GIZA++:
@INPROCEEDINGS{och00:isa,
AUTHOR = {F.~J.~Och and H.~Ney},
TITLE ={Improved Statistical Alignment Models},
BOOKTITLE = ACL00 ,
PAGES ={440--447},
ADDRESS = {Hong Kong, China},
MONTH = {October},
YEAR = 2000}
@INPROCEEDINGS{och00:aco,
AUTHOR = {F.~J.~Och and H.~Ney},
TITLE = {A Comparison of Alignment Models for Statistical Machine Translation},
BOOKTITLE = COLING00,
ADDRESS = {Saarbr\"ucken, Germany},
YEAR = {2000},
MONTH = {August},
PAGES = {1086--1090}
}
The following article describes the statistical machine translation
toolkit EGYPT:
@MISC{ alonaizan99:smt,
AUTHOR = {Y. Al-Onaizan and J. Curin and M. Jahr and K. Knight and J. Lafferty and I. D. Melamed and F. J. Och and D. Purdy and N. A. Smith and D. Yarowsky},
TITLE = {Statistical Machine Translation, Final Report, {JHU} Workshop},
YEAR = {1999},
ADDRESS = {Baltimore, MD},
NOTE = {{\tt http://www.clsp.jhu.edu/ws99/projects/mt/final\_report/mt-final-report.ps}}
}
The implemented alignment models IBM-1 to IBM-5 and HMM were originally described in:
@ARTICLE{brown93:tmo,
AUTHOR = {Brown, P. F. and Della Pietra, S. A. and Della Pietra, V. J. and Mercer, R. L.},
TITLE = {The Mathematics of Statistical Machine Translation: Parameter Estimation},
JOURNAL = {Computational Linguistics},
YEAR = 1993,
VOLUME = 19,
NUMBER = 2,
PAGES = {263--311}
}
@INPROCEEDINGS{ vogel96:hbw,
AUTHOR = {Vogel, S. and Ney, H. and Tillmann, C.},
TITLE = {{HMM}-Based Word Alignment in Statistical Translation},
YEAR = 1996,
PAGES = {836--841},
MONTH = {August},
ADDRESS = {Copenhagen},
BOOKTITLE = COLING96
}
Part VII: New features
======================
2003-06-09:
- new parameter "-nbestalignments N": prints an N-best list of
alignments into a file *.NBEST
- If the program is compiled with "-DBINARY_SEARCH_FOR_TTABLE", it uses
a more memory-efficient data structure for the t table (a vector with
binary search instead of a hash table). The program then expects a
parameter "-CoocurrenceFile FILE" specifying a file that lists all
lexical cooccurrences in the training corpus. This file can be
produced by the snt2cooc.out tool.