CSR LM-1 language model training recipe


This is a recipe for training word n-gram language models on the newswire text provided for the Continuous Speech Recognition (CSR) project (222M words from WSJ, SJM and AP). It also prepares the dictionaries needed to use the LMs with the HTK and Sphinx speech recognizers. You can also download the trained ARPA-format language models.

Requirements:

Linux computer with Perl installed
LDC's CSR-III text corpus
SRI Language Modeling Toolkit

By default, the scripts build models with interpolated, modified Kneser-Ney smoothing, a bigram cutoff of 1, a trigram cutoff of 3, and an unknown-word token. I used the top 5K, 20K, and 64K words occurring in the training text as vocabularies. Both verbalized punctuation (VP) and non-verbalized punctuation (NVP) LMs are built.
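As a rough illustration of the vocabulary step, the sketch below picks the most frequent words from tokenized training text and maps everything else to an unknown-word token. It is not taken from the recipe's scripts: the function names `top_k_vocab` and `map_oov`, the toy sentences, and the `<unk>` spelling of the unknown-word token are all assumptions made for the example.

```python
from collections import Counter


def top_k_vocab(sentences, k):
    """Return the k most frequent words across the tokenized sentences."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, _ in counts.most_common(k)}


def map_oov(sentence, vocab, unk="<unk>"):
    """Replace every out-of-vocabulary word with the unknown-word token."""
    return [word if word in vocab else unk for word in sentence]


# Toy training text (hypothetical, not from the CSR corpus).
train = [
    ["the", "market", "rose"],
    ["the", "market", "fell"],
    ["the", "dollar"],
]

vocab = top_k_vocab(train, 2)                      # the two most frequent words
mapped = map_oov(["the", "dollar", "rose"], vocab)  # OOV words become <unk>
print(sorted(vocab), mapped)
```

With a real corpus, `k` would be 5000, 20000, or 64000, and the counts would be gathered in a streaming pass rather than from an in-memory list.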

Here are OOV% and perplexity (ppl) results measured on three held-out test sets:

csr - held-out portion of CSR LM-1 training data (~6M words)
setasd - text from the CSR LM-1 setaside directory (~15M words)
giga - held-out portion of Gigaword training data from my Gigaword training recipe (~25M words)

Note that the Gigaword test set comes from different newswire sources and covers a different historical period than the CSR data.
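For readers unfamiliar with the two metrics, here is a minimal sketch of how an OOV percentage and a perplexity can be computed. It assumes the per-token base-10 log probabilities have already been obtained from a language model. The function names and toy values are hypothetical, and the sketch does not claim to reproduce the exact accounting behind the table below (tools differ on whether OOV and end-of-sentence tokens are counted).

```python
def oov_rate(tokens, vocab):
    """Percentage of test tokens that are not in the vocabulary."""
    oov = sum(1 for tok in tokens if tok not in vocab)
    return 100.0 * oov / len(tokens)


def perplexity(log10_probs):
    """Perplexity from per-token base-10 log probabilities.

    ppl = 10 ** (-(1/N) * sum(log10 p_i))
    """
    avg = sum(log10_probs) / len(log10_probs)
    return 10 ** (-avg)


# Toy values (hypothetical): one of four test tokens is out of vocabulary,
# and each scored token has probability 0.1.
rate = oov_rate(["a", "b", "c", "d"], {"a", "b", "c"})
ppl = perplexity([-1.0, -1.0, -1.0])
print(rate, ppl)
```

A lower perplexity means the model assigns higher probability to the test text on average, which is why larger vocabularies (fewer words collapsed into the unknown-word token) tend to show higher perplexity alongside lower OOV%.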

Vocab	Punc	Size	csr OOV%	csr ppl	setasd OOV%	setasd ppl	giga OOV%	giga ppl	Zip file
5K	NVP	2-gram	10.8	108.9	10.6	109.9	13.5	119.5	Download
20K	NVP	2-gram	3.1	176.8	3.0	177.1	4.9	225.2	Download
64K	NVP	2-gram	0.9	206.9	0.9	206.9	2.0	291.7	Download
5K	VP	2-gram	9.8	82.7	9.5	82.0	12.6	89.6	Download
20K	VP	2-gram	2.9	128.4	2.8	125.6	4.7	161.9	Download
64K	VP	2-gram	1.0	148.4	0.9	144.5	2.0	206.7	Download
5K	NVP	3-gram	10.8	66.7	10.6	67.5	13.5	85.7	Download
20K	NVP	3-gram	3.1	101.8	3.0	102.3	4.9	156.1	Download
64K	NVP	3-gram	0.9	118.2	0.9	118.5	2.0	201.3	Download
5K	VP	3-gram	9.8	49.0	9.5	48.5	12.6	62.8	Download
20K	VP	3-gram	2.9	71.0	2.8	69.5	4.7	109.1	Download
64K	VP	3-gram	1.0	81.3	0.9	79.2	2.0	138.6	Download

Files:

lm_csr_recipe.zip - CSR LM-1 training recipe
readme.txt - Readme file (contained in the above zip)