CSR LM-1 language model training recipe

This is a recipe to train word n-gram language models using the newswire text provided for the Continuous Speech Recognition (CSR) project (222M words from WSJ, SJM and AP). It prepares the dictionaries needed to use the LMs with the HTK and Sphinx speech recognizers. You can also download the trained ARPA format language models.

Requirements:

  • the CSR LM-1 corpus (the source of the 222M words of WSJ, SJM, and AP training text)
  • an n-gram language model toolkit (the commands sketched below assume SRILM)

By default, the scripts build LMs using interpolated modified Kneser-Ney smoothing, a bigram cutoff of 1, a trigram cutoff of 3, and an open vocabulary with an unknown word token. I used the top 5K, 20K, and 64K words occurring in the training text as the vocabularies. Both verbalized punctuation (VP) and non-verbalized punctuation (NVP) LMs are built.
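
The zip's scripts handle the actual training; purely as an illustration, the defaults above correspond to roughly the following SRILM commands (a hedged sketch, not the recipe's own scripts: file names such as train.txt and wlist.20k are placeholders, and the cutoff flags assume a cutoff of N means n-grams occurring N or fewer times are dropped):

  # Build a 20K vocabulary from the most frequent words in the
  # training text (train.txt stands in for the normalized CSR data).
  ngram-count -text train.txt -order 1 -write1 train.1cnt
  sort -k 2,2 -n -r train.1cnt | head -20000 | awk '{print $1}' > wlist.20k

  # Train a trigram LM with the defaults above: interpolated modified
  # Kneser-Ney smoothing, an open vocabulary with an unknown word token,
  # a bigram cutoff of 1 (-gt2min 2) and a trigram cutoff of 3 (-gt3min 4).
  ngram-count -text train.txt -vocab wlist.20k -unk \
      -order 3 -interpolate -kndiscount \
      -gt2min 2 -gt3min 4 \
      -lm lm_csr_20k_3gram.arpa.gz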

Here are some OOV% and perplexity results measured on three different held-out test sets:

  • csr - held-out portion of the CSR LM-1 training data (~6M words)
  • setasd - text from the CSR LM-1 setaside directory (~15M words)
  • giga - held-out portion of the Gigaword training data from my Gigaword training recipe (~25M words)

Note that the Gigaword test sets are from different newswire sources and cover a different historical period than CSR.
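
If you want to check a model against one of these sets, a perplexity run with SRILM's ngram tool looks roughly like this (again a sketch; the LM and test file names are placeholders):

  # Perplexity of the open-vocabulary LM: -unk maps out-of-vocabulary
  # words to the unknown word token so they are scored rather than skipped.
  ngram -lm lm_csr_20k_nvp_3gram.arpa.gz -order 3 -unk -ppl test_csr.txt

  # Without -unk, ngram instead reports the OOV word count for the test
  # set (the basis of an OOV%), excluding those words from the perplexity.
  ngram -lm lm_csr_20k_nvp_3gram.arpa.gz -order 3 -ppl test_csr.txt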

Vocab  Punc  Order     csr OOV%  csr ppl  setasd OOV%  setasd ppl  giga OOV%  giga ppl  Zip file
5K     NVP   2-gram      10.8    108.9         10.6       109.9       13.5     119.5  Download
20K    NVP   2-gram       3.1    176.8          3.0       177.1        4.9     225.2  Download
64K    NVP   2-gram       0.9    206.9          0.9       206.9        2.0     291.7  Download
5K     VP    2-gram       9.8     82.7          9.5        82.0       12.6      89.6  Download
20K    VP    2-gram       2.9    128.4          2.8       125.6        4.7     161.9  Download
64K    VP    2-gram       1.0    148.4          0.9       144.5        2.0     206.7  Download
5K     NVP   3-gram      10.8     66.7         10.6        67.5       13.5      85.7  Download
20K    NVP   3-gram       3.1    101.8          3.0       102.3        4.9     156.1  Download
64K    NVP   3-gram       0.9    118.2          0.9       118.5        2.0     201.3  Download
5K     VP    3-gram       9.8     49.0          9.5        48.5       12.6      62.8  Download
20K    VP    3-gram       2.9     71.0          2.8        69.5        4.7     109.1  Download
64K    VP    3-gram       1.0     81.3          0.9        79.2        2.0     138.6  Download

Files:

  • lm_csr_recipe.zip - CSR LM-1 training recipe
  • readme.txt - Readme file (contained in the above zip)