CSR LM-1 language model training recipe
---------------------------------------
http://www.keithv.com/software/csr/

This is a recipe to train word n-gram language models using the
newswire text provided for the Continuous Speech Recognition (CSR)
project.  It also prepares the dictionaries needed to use the LMs with
the HTK and Sphinx speech recognizers.

After conditioning, the total word counts from each source were
(non-verbalized punctuation, excluding start/end words):

   WSJ     109M
   SJM      10M
   AP      103M
   Total   222M

About 12M words were held out from the training data to serve as the
eval and dev test sets.

Requirements:
   Linux computer with Perl installed
   LDC CSR-III text corpus
   SRILM language modeling toolkit
   A fair bit of disk space and memory

Basic steps:

1) Set LM1_TRAIN to point to the directory the recipe zip was
   extracted to:
      LM1_TRAIN=/rd/lm_csr;export LM1_TRAIN

2) Set LM1 to point to the CSR-III corpus top-level directory:
      LM1=/rd/corpus/lm1;export LM1

3) Install the SRILM toolkit and make sure its binaries are on your
   path.

4) Download the CMU dictionary and put it in the LM1_TRAIN directory
   under the name "c0.6".

5) A full set of LMs using the 5K, 20K and 64K vocabs, with both
   verbalized (VP) and non-verbalized punctuation (NVP), can be built
   by running "go_all.sh".

Defaults used by the scripts to build the LMs (see the example
commands below):
   * Interpolated, modified Kneser-Ney smoothing
   * 2-gram cutoff 1, 3-gram cutoff 3
   * Built with an unknown word
   * Both verbalized punctuation (VP) and non-verbalized punctuation
     (NVP) LMs are built
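
For illustration, here is roughly how one of these LMs could be built
and scored with SRILM.  This is a sketch rather than what "go_all.sh"
actually runs: the file names (train.txt, vocab_20k.txt, eval.txt) are
placeholders, and the -gt2min/-gt3min values are my reading of the
cutoffs listed above.

   # Sketch: 20K NVP trigram with interpolated modified Kneser-Ney,
   # an unknown word, and cutoffs of 1 (2-grams) and 3 (3-grams).
   ngram-count -order 3 -text train.txt -vocab vocab_20k.txt -unk \
               -kndiscount -interpolate -gt2min 2 -gt3min 4 \
               -lm lm_csr_20k_nvp_3gram.arpa.gz

   # Perplexity and OOV rate on a held-out test set (the kind of
   # numbers shown in the tables below).
   ngram -order 3 -unk -lm lm_csr_20k_nvp_3gram.arpa.gz -ppl eval.txt
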
Results:

Here are some OOV% and perplexity results measured on three different
held-out evaluation test sets:

   * csr    - held-out portion of the CSR LM-1 training data (~6M words)
   * setasd - text from the CSR LM-1 setaside dir (~15M words)
   * giga   - held-out portion of the Gigaword training data from my
              Gigaword training recipe; it uses different newswire
              sources and covers a different historical period than
              LM-1 (~25M words)

+-------+------+--------+------+-------+-------+-------+-------+-------+
| Vocab | Punc | Size   | csr  | csr   | setasd| setasd| giga  | giga  |
|       |      |        | OOV% | ppl   | OOV%  | ppl   | OOV%  | ppl   |
+-------+------+--------+------+-------+-------+-------+-------+-------+
| 5K    | NVP  | 2-gram | 10.8 | 108.9 | 10.6  | 109.9 | 13.5  | 119.5 |
| 20K   | NVP  | 2-gram |  3.1 | 176.8 |  3.0  | 177.1 |  4.9  | 225.2 |
| 64K   | NVP  | 2-gram |  0.9 | 206.9 |  0.9  | 206.9 |  2.0  | 291.7 |
+-------+------+--------+------+-------+-------+-------+-------+-------+
| 5K    | VP   | 2-gram |  9.8 |  82.7 |  9.5  |  82.0 | 12.6  |  89.6 |
| 20K   | VP   | 2-gram |  2.9 | 128.4 |  2.8  | 125.6 |  4.7  | 161.9 |
| 64K   | VP   | 2-gram |  1.0 | 148.4 |  0.9  | 144.5 |  2.0  | 206.7 |
+-------+------+--------+------+-------+-------+-------+-------+-------+
| 5K    | NVP  | 3-gram | 10.8 |  66.7 | 10.6  |  67.5 | 13.5  |  85.7 |
| 20K   | NVP  | 3-gram |  3.1 | 101.8 |  3.0  | 102.3 |  4.9  | 156.1 |
| 64K   | NVP  | 3-gram |  0.9 | 118.2 |  0.9  | 118.5 |  2.0  | 201.3 |
+-------+------+--------+------+-------+-------+-------+-------+-------+
| 5K    | VP   | 3-gram |  9.8 |  49.0 |  9.5  |  48.5 | 12.6  |  62.8 |
| 20K   | VP   | 3-gram |  2.9 |  71.0 |  2.8  |  69.5 |  4.7  | 109.1 |
| 64K   | VP   | 3-gram |  1.0 |  81.3 |  0.9  |  79.2 |  2.0  | 138.6 |
+-------+------+--------+------+-------+-------+-------+-------+-------+

For comparison, here is how the original Good-Turing smoothed LMs
provided in the CSR corpus performed:

+-------+------+--------+------+-------+-------+-------+-------+-------+
| Vocab | Punc | Size   | csr  | csr   | setasd| setasd| giga  | giga  |
|       |      |        | OOV% | ppl   | OOV%  | ppl   | OOV%  | ppl   |
+-------+------+--------+------+-------+-------+-------+-------+-------+
| 20K   | NVP  | 2-gram |  3.1 | 170.6*|  3.0  | 181.2 |  4.9  | 231.8 |
| 20K   | NVP  | 3-gram |  3.1 |  94.1*|  3.0  | 108.3 |  4.9  | 166.2 |
+-------+------+--------+------+-------+-------+-------+-------+-------+

* Note that these LMs had an unfair advantage on the csr test set, as
  they were trained on the full training data, which included the eval
  test set I used!

Notes:

The scripts use 95% of the text data for training, reserving 2.5% for
development test data and 2.5% for evaluation test data.  The test
sets are drawn from the AP, SJM and WSJ sources (entire files from the
CSR vp dir).

I tried tuning the discounting parameters on the dev set using a
simple grid search, but this yielded very little (~0.7%) reduction in
perplexity on the dev set.  This seems to agree with what Chen and
Goodman found in "An Empirical Study of Smoothing Techniques for
Language Modeling", Technical Report TR-10-98, Computer Science Group,
Harvard University, 1998 (figure 20).

The SRI toolkit doesn't output n-grams in the order expected by
Sphinx's lm3g2dmp utility.  You'll need to resort the LMs somehow if
you want to use them with the Sphinx decoder.

Have fun!

Keith Vertanen

Revision history:
-----------------
July 3rd, 2007 - Initial release of CSR LM-1 recipe.
July 6th, 2007 - Added results on Gigaword test set.