This is a recipe to train word n-gram language models using the newswire text provided for the Continuous Speech Recognition (CSR) project (222M words from WSJ, SJM and AP). It prepares the dictionaries needed to use the LMs with the HTK and Sphinx speech recognizers. You can also download the trained ARPA format language models.
Requirements:
By default, the scripts use interpolated modified Kneser-Ney smoothing, a bigram cutoff of 1, a trigram cutoff of 3, and an unknown-word token. The vocabularies are the top 5K, 20K, and 64K words occurring in the training text. Both verbalized punctuation (VP) and non-verbalized punctuation (NVP) LMs are built.
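As a rough illustration of the vocabulary step described above, the sketch below picks the most frequent word types and maps everything else to an unknown-word token. It is not the recipe's actual script: the function names, the `<unk>` spelling, and the alphabetical tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def top_k_vocab(tokens, k):
    """Return the k most frequent word types in `tokens` (hypothetical helper)."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically, so ties break deterministically.
    # The tie-breaking rule is an assumption; the original scripts may differ.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word for word, _ in ranked[:k]}


def map_oov(tokens, vocab, unk="<unk>"):
    """Replace every out-of-vocabulary token with the unknown-word symbol."""
    return [tok if tok in vocab else unk for tok in tokens]


if __name__ == "__main__":
    text = "the cat sat on the mat the cat".split()
    vocab = top_k_vocab(text, 2)
    print(sorted(vocab))
    print(map_oov(text, vocab))
```

With real data, `k` would be 5000, 20000, or 64000, and the mapped text would be what the n-gram counts are collected from, so that the unknown word receives probability mass like any other token.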
Here are out-of-vocabulary rates (OOV%) and perplexities (ppl) measured on three held-out evaluation sets:
- csr - held-out portion of CSR LM-1 training data (~6M words)
- setasd - text from the CSR LM-1 setaside dir (~15M words)
- giga - held-out portion of Gigaword training data from my Gigaword training recipe (~25M words)
Note that the Gigaword test sets are from different newswire sources and cover a different historical period than CSR.
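For readers unfamiliar with the two metrics reported here, the following is a minimal sketch of how an OOV rate and a perplexity are commonly computed. It assumes the per-token log probabilities are base 10 (as in ARPA-format models) and have already been obtained from some LM toolkit. Whether the reported numbers exclude OOV tokens or count sentence-end tokens is not stated in the text, so this sketch makes no attempt to reproduce them exactly.

```python
def oov_rate(tokens, vocab):
    """Percentage of tokens in `tokens` that are not in `vocab`."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    n_oov = sum(1 for tok in tokens if tok not in vocab)
    return 100.0 * n_oov / len(tokens)


def perplexity(log10_probs):
    """Perplexity from per-token base-10 log probabilities.

    ppl = 10 ** (-(1/N) * sum(log10 p_i)), where N is the number of scored tokens.
    """
    if not log10_probs:
        raise ValueError("log10_probs must be non-empty")
    avg = sum(log10_probs) / len(log10_probs)
    return 10.0 ** (-avg)


if __name__ == "__main__":
    vocab = {"a", "b", "c"}
    print(oov_rate(["a", "b", "c", "d"], vocab))
    print(perplexity([-2.0, -1.0, 0.0]))
```

A smaller vocabulary raises the OOV rate but tends to lower perplexity, because more tokens collapse into the single unknown word; this is why perplexities are only comparable between models that share a vocabulary.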
Vocab | Punc | Size | csr OOV% | csr ppl | setasd OOV% | setasd ppl | giga OOV% | giga ppl | Zip file
5K | NVP | 2-gram | 10.8 | 108.9 | 10.6 | 109.9 | 13.5 | 119.5 | Download
20K | NVP | 2-gram | 3.1 | 176.8 | 3.0 | 177.1 | 4.9 | 225.2 | Download
64K | NVP | 2-gram | 0.9 | 206.9 | 0.9 | 206.9 | 2.0 | 291.7 | Download
5K | VP | 2-gram | 9.8 | 82.7 | 9.5 | 82.0 | 12.6 | 89.6 | Download
20K | VP | 2-gram | 2.9 | 128.4 | 2.8 | 125.6 | 4.7 | 161.9 | Download
64K | VP | 2-gram | 1.0 | 148.4 | 0.9 | 144.5 | 2.0 | 206.7 | Download
5K | NVP | 3-gram | 10.8 | 66.7 | 10.6 | 67.5 | 13.5 | 85.7 | Download
20K | NVP | 3-gram | 3.1 | 101.8 | 3.0 | 102.3 | 4.9 | 156.1 | Download
64K | NVP | 3-gram | 0.9 | 118.2 | 0.9 | 118.5 | 2.0 | 201.3 | Download
5K | VP | 3-gram | 9.8 | 49.0 | 9.5 | 48.5 | 12.6 | 62.8 | Download
20K | VP | 3-gram | 2.9 | 71.0 | 2.8 | 69.5 | 4.7 | 109.1 | Download
64K | VP | 3-gram | 1.0 | 81.3 | 0.9 | 79.2 | 2.0 | 138.6 | Download