- [ ] language model is built incorrectly when starting at MERT with
a parsed corpus (maybe SAMT should expect a plain corpus and a .parsed one)
- [ ] add recasing with recursive call to pipeline.pl (provide a 1-1
alignment)
- [ ] pipeline should output a script that can be easily
used to decode another test set
- [ ] add tree output for test sets
- [ ] run MERT multiple times
- [X] hadoop cluster roll-out
- [X] rm -r hadoop directory after retrieving grammar successfully
- [ ] change qsub arg defaults when doing SAMT
- [ ] don't put number in train files if maxlen == 0
- [ ] should be easier to stop and start runs (locations of canonical files)
- [ ] add KenLM binarization of the language model
- [ ] better tokenization (e.g., URL-aware)