This is a recipe for training word n-gram language models on the newswire text in the English Gigaword corpus (~1200M words from NYT, APW, AFE, and XIE). It also prepares the dictionaries needed to use the LMs with the HTK and Sphinx speech recognizers.
Requirements:
By default, the scripts use interpolated modified Kneser-Ney smoothing, a bigram count cutoff of 3, a trigram count cutoff of 5, and an open vocabulary with an unknown-word token. I used the top 5K, 20K, and 64K most frequent words in the training text as vocabularies. Both verbalized punctuation (VP) and non-verbalized punctuation (NVP) LMs are built.
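For concreteness, here is a minimal sketch of what these default settings look like as SRILM commands, assuming the scripts drive SRILM's ngram-count (the file names and the 64K configuration are illustrative, not taken from the actual scripts):

```bash
# Count words in the training text and keep the 64K most frequent as the
# vocabulary (in practice the sentence markers <s>/</s> need filtering out).
ngram-count -text train.txt -order 1 -write word.counts
sort -k2,2 -nr word.counts | head -n 64000 | awk '{print $1}' > vocab.64k

# Train an open-vocabulary trigram LM with interpolated modified
# Kneser-Ney smoothing. The cutoffs map "bigram cutoff 3" and "trigram
# cutoff 5" to SRILM's -gtNmin, the minimum count for an n-gram to be
# included in the model.
ngram-count -text train.txt -order 3 \
    -vocab vocab.64k -unk \
    -kndiscount -interpolate \
    -gt2min 3 -gt3min 5 \
    -lm lm.64k.3gram.arpa
```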
Here are OOV rate (OOV%) and perplexity results measured on three held-out test sets; a sketch of the evaluation command follows the list.
- giga - held-out portion of Gigaword training data (~25M words)
- setasd - text from the CSR LM-1 set-aside directory (~15M words)
- csr - held-out portion of CSR LM-1 training data from my CSR LM-1 training recipe (~6M words)
Note that the CSR test sets are from different newswire sources and cover a different historical period than Gigaword.
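Numbers like those in the table below can be measured with SRILM's ngram tool, which prints the OOV token count and the perplexity for a test file; OOV% is the OOV count divided by the number of test-set words. A minimal sketch, again assuming SRILM and illustrative file names:

```bash
# Report perplexity and OOV count of a trained LM on one held-out set.
ngram -lm lm.64k.3gram.arpa -order 3 -unk -ppl giga_heldout.txt
```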
| Vocab | Punc | Order  | giga OOV% | giga ppl | setasd OOV% | setasd ppl | csr OOV% | csr ppl |
|-------|------|--------|-----------|----------|-------------|------------|----------|---------|
| 5K    | NVP  | 2-gram | 12.5      | 123.7    | 12.0        | 132.5      | 12.1     | 130.9   |
| 20K   | NVP  | 2-gram | 4.1       | 199.7    | 3.7         | 215.9      | 3.8      | 213.9   |
| 64K   | NVP  | 2-gram | 1.7       | 238.6    | 1.1         | 264.9      | 1.2      | 262.5   |
| 5K    | VP   | 2-gram | 11.4      | 96.4     | 10.8        | 103.6      | 11.0     | 104.0   |
| 20K   | VP   | 2-gram | 3.8       | 141.8    | 3.4         | 151.2      | 3.5      | 153.0   |
| 64K   | VP   | 2-gram | 1.8       | 161.4    | 1.2         | 176.2      | 1.2      | 178.9   |
| 5K    | NVP  | 3-gram | 12.5      | 82.0     | 12.0        | 91.1       | 12.1     | 89.7    |
| 20K   | NVP  | 3-gram | 4.1       | 121.3    | 3.7         | 138.5      | 3.8      | 136.4   |
| 64K   | NVP  | 3-gram | 1.7       | 144.6    | 1.1         | 170.1      | 1.2      | 167.5   |
| 5K    | VP   | 3-gram | 11.4      | 62.0     | 10.8        | 67.8       | 11.0     | 67.9    |
| 20K   | VP   | 3-gram | 3.8       | 79.3     | 3.4         | 88.2       | 3.5      | 88.7    |
| 64K   | VP   | 3-gram | 1.8       | 90.3     | 1.2         | 103.1      | 1.2      | 104.1   |
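To use one of these LMs with the Sphinx decoders, the ARPA-format model is typically converted to Sphinx's binary DMP format, and the dictionary is restricted to the LM vocabulary. A minimal sketch, assuming sphinx_lm_convert from sphinxbase and a local copy of CMUdict are available (both assumptions; file names illustrative):

```bash
# Convert the ARPA-format LM to Sphinx's binary DMP format.
sphinx_lm_convert -i lm.64k.3gram.arpa -o lm.64k.3gram.dmp

# Keep only dictionary entries whose headword is in the LM vocabulary
# (alternate pronunciations marked WORD(2) would need extra handling).
awk 'NR==FNR {vocab[$1]=1; next} ($1 in vocab)' vocab.64k cmudict.dict \
    > dict.64k
```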