CSR LM-1 language model training recipe
---------------------------------------
http://www.keithv.com/software/csr/

This is a recipe to train word n-gram language models using the
newswire text provided for the Continuous Speech Recognition (CSR)
project.  It also prepares the dictionaries needed to use the LMs with
the HTK and Sphinx speech recognizers.

After conditioning, the total word counts from each source were
(non-verbalized punctuation, excluding start/end words):

   WSJ     109M
   SJM      10M
   AP      103M
   Total   222M

About 12M words were held out from the training data to serve as the
eval and dev test sets.

Requirements:
   Linux computer with Perl installed
   LDC CSR-III text corpus
   SRILM language modeling toolkit
   A fair bit of disk space and memory

Basic steps:

1) Set LM1_TRAIN to point to the directory the recipe zip was
   extracted to:
      LM1_TRAIN=/rd/lm_csr;export LM1_TRAIN

2) Set LM1 to point to the CSR-III corpus top-level directory:
      LM1=/rd/corpus/lm1;export LM1

3) Install the SRILM toolkit and make sure its binaries are on your
   path.

4) Download the CMU dictionary and put it in the LM1_TRAIN directory
   under the name "c0.6".

5) A full set of LMs using the 5K, 20K and 64K vocabs, with both
   verbalized (VP) and non-verbalized punctuation (NVP), can be built
   by running "go_all.sh".

Defaults used by the scripts to build the LMs (see the example
commands below):
   * Interpolated, modified Kneser-Ney smoothing
   * 2-gram cutoff 1, 3-gram cutoff 3
   * Built with an unknown word
   * Both verbalized punctuation (VP) and non-verbalized punctuation
     (NVP) LMs are built
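
For illustration, here is roughly how one of these LMs could be built
and scored with SRILM.  This is a sketch rather than what "go_all.sh"
actually runs: the file names (train.txt, vocab_20k.txt, eval.txt) are
placeholders, and the -gt2min/-gt3min values are my reading of the
cutoffs listed above.

   # Sketch: 20K NVP trigram with interpolated modified Kneser-Ney,
   # an unknown word, and cutoffs of 1 (2-grams) and 3 (3-grams).
   ngram-count -order 3 -text train.txt -vocab vocab_20k.txt -unk \
               -kndiscount -interpolate -gt2min 2 -gt3min 4 \
               -lm lm_csr_20k_nvp_3gram.arpa.gz

   # Perplexity and OOV rate on a held-out test set (the kind of
   # numbers shown in the tables below).
   ngram -order 3 -unk -lm lm_csr_20k_nvp_3gram.arpa.gz -ppl eval.txt
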
Results:

Here are some OOV% and perplexity results measured on three different
held-out evaluation test sets:

   * csr    - held-out portion of the CSR LM-1 training data (~6M words)
   * setasd - text from the CSR LM-1 setaside dir (~15M words)
   * giga   - held-out portion of the Gigaword training data from my
              Gigaword training recipe; it uses different newswire
              sources and covers a different historical period than
              LM-1 (~25M words)

+-------+------+--------+------+-------+-------+-------+-------+-------+
| Vocab | Punc | Size   | csr  | csr   | setasd| setasd| giga  | giga  |
|       |      |        | OOV% | ppl   | OOV%  | ppl   | OOV%  | ppl   |
+-------+------+--------+------+-------+-------+-------+-------+-------+
| 5K    | NVP  | 2-gram | 10.8 | 108.9 | 10.6  | 109.9 | 13.5  | 119.5 |
| 20K   | NVP  | 2-gram |  3.1 | 176.8 |  3.0  | 177.1 |  4.9  | 225.2 |
| 64K   | NVP  | 2-gram |  0.9 | 206.9 |  0.9  | 206.9 |  2.0  | 291.7 |
+-------+------+--------+------+-------+-------+-------+-------+-------+
| 5K    | VP   | 2-gram |  9.8 |  82.7 |  9.5  |  82.0 | 12.6  |  89.6 |
| 20K   | VP   | 2-gram |  2.9 | 128.4 |  2.8  | 125.6 |  4.7  | 161.9 |
| 64K   | VP   | 2-gram |  1.0 | 148.4 |  0.9  | 144.5 |  2.0  | 206.7 |
+-------+------+--------+------+-------+-------+-------+-------+-------+
| 5K    | NVP  | 3-gram | 10.8 |  66.7 | 10.6  |  67.5 | 13.5  |  85.7 |
| 20K   | NVP  | 3-gram |  3.1 | 101.8 |  3.0  | 102.3 |  4.9  | 156.1 |
| 64K   | NVP  | 3-gram |  0.9 | 118.2 |  0.9  | 118.5 |  2.0  | 201.3 |
+-------+------+--------+------+-------+-------+-------+-------+-------+
| 5K    | VP   | 3-gram |  9.8 |  49.0 |  9.5  |  48.5 | 12.6  |  62.8 |
| 20K   | VP   | 3-gram |  2.9 |  71.0 |  2.8  |  69.5 |  4.7  | 109.1 |
| 64K   | VP   | 3-gram |  1.0 |  81.3 |  0.9  |  79.2 |  2.0  | 138.6 |
+-------+------+--------+------+-------+-------+-------+-------+-------+

For comparison, here is how the original Good-Turing smoothed LMs
provided in the CSR corpus performed:

+-------+------+--------+------+-------+-------+-------+-------+-------+
| Vocab | Punc | Size   | csr  | csr   | setasd| setasd| giga  | giga  |
|       |      |        | OOV% | ppl   | OOV%  | ppl   | OOV%  | ppl   |
+-------+------+--------+------+-------+-------+-------+-------+-------+
| 20K   | NVP  | 2-gram |  3.1 | 170.6*|  3.0  | 181.2 |  4.9  | 231.8 |
| 20K   | NVP  | 3-gram |  3.1 |  94.1*|  3.0  | 108.3 |  4.9  | 166.2 |
+-------+------+--------+------+-------+-------+-------+-------+-------+

* Note that these LMs had an unfair advantage on the csr test set, as
  they were trained on the full training data, which included the eval
  test set I used!

Notes:

The scripts use 95% of the text data for training, reserving 2.5% for
development test data and 2.5% for evaluation test data.  The test
sets are drawn from the AP, SJM and WSJ sources (entire files from the
CSR vp dir).

I tried tuning the discounting parameters on the dev set using a
simple grid search, but this yielded very little (~0.7%) reduction in
perplexity on the dev set.  This seems to agree with what Chen and
Goodman found in "An Empirical Study of Smoothing Techniques for
Language Modeling", Technical Report TR-10-98, Computer Science Group,
Harvard University, 1998 (figure 20).

The SRI toolkit doesn't output n-grams in the order expected by
Sphinx's lm3g2dmp utility.  You'll need to resort the LMs somehow if
you want to use them with the Sphinx decoder.

Have fun!

Keith Vertanen

Revision history:
-----------------
July 3rd, 2007 - Initial release of CSR LM-1 recipe.
July 6th, 2007 - Added results on Gigaword test set.