Here is a recipe to train the CMU Sphinx speech recognizer using the CMU pronouncing dictionary, the Wall Street Journal WSJ0 corpus and, optionally, the WSJ1 corpus. The Resource Management corpus is used to perform the initial forced alignment of the WSJ training data. You'll need the sph2pipe utility to decompress the WSJ audio files and NIST's scoring toolkit to evaluate results.
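As a rough illustration of the decompression step, here is a minimal sketch that builds one sph2pipe command per audio file. It is not taken from the recipe itself: the function names, the directory layout and the `*.wv1` file pattern are assumptions made for the example, and the only sph2pipe option used is `-f wav`, which requests RIFF/WAV output.

```python
from pathlib import Path


def sph2pipe_command(src, dst, sph2pipe="sph2pipe"):
    """Build the argument list that converts one SPHERE file to WAV.

    The helper name and the default executable name are assumptions;
    adjust `sph2pipe` if the binary is not on your PATH.
    """
    # "-f wav" selects RIFF/WAV output; the input file comes first,
    # followed by the output file.
    return [sph2pipe, "-f", "wav", str(src), str(dst)]


def plan_conversion(corpus_root, out_root, pattern="*.wv1"):
    """List the commands needed to convert every matching file.

    The "*.wv1" pattern is an assumption about how the corpus files
    are named on disk; change it to match your copy of the data.
    """
    corpus_root = Path(corpus_root)
    out_root = Path(out_root)
    commands = []
    for src in sorted(corpus_root.rglob(pattern)):
        # Mirror the corpus directory structure under out_root.
        dst = (out_root / src.relative_to(corpus_root)).with_suffix(".wav")
        commands.append(sph2pipe_command(src, dst))
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the output directories exist. Building the commands separately from running them makes it easy to inspect the plan before touching the corpus.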
This is mostly based on the Sphinx tutorial. A variety of acoustic models trained using this recipe are available for download. There is also a similar HTK recipe available. You can read about all the gory details in this technical report.
I also have some British English acoustic models.
I evaluated on the November 1992 ARPA WSJ set (Nov'92, 330 sentences) and the San Jose Mercury sentences from the WSJ1 Hub 2 test set (si_dt_s2, 207 sentences). Nov'92 was evaluated using the WSJ 5K non-verbalized closed vocabulary set and the standard WSJ 5K non-verbalized closed bigram language model. si_dt_s2 was evaluated using a 60K vocabulary and a bigram language model trained on the English Gigaword corpus (language model not included in this recipe). Models were evaluated using the Sphinx-3 decoder operating at close to real time.
| Training data | Nov'92 | si_dt_s2 |
| WSJ SI-84 | 28.91% | 52.42% |
| WSJ SI-284 | 7.34% | 24.27% |
| WSJ all | 6.33% | 21.26% |
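The table does not name its metric. Assuming the percentages are word error rates of the kind a scoring toolkit reports, the sketch below shows how such a figure can be computed from reference and hypothesis transcripts. It is an illustration only, not the NIST tool: the function names are made up for this example, and a real scorer also applies text normalization that is omitted here.

```python
def word_errors(reference, hypothesis):
    """Return (edit_distance, reference_length) for one sentence.

    The distance counts word substitutions, insertions and deletions,
    each with cost 1 (standard Levenshtein distance over words).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] is the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1], len(ref)


def word_error_rate(pairs):
    """Percentage of word errors over all (reference, hypothesis) pairs."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return 100.0 * errors / words
```

Errors and reference words are summed over the whole test set before dividing, so long sentences weigh more than short ones. Averaging per-sentence rates instead would give a different number.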