Even using models adapted on a large number of utterances, the test sets still had very high WER. What's worse, I'm a fairly expert dictator and I recorded my test sets in a quiet office environment. I'd expect novice users in a noisy environment to fare even worse. I ran a series of experiments to investigate the source of the high error rate and see if there was a way to mitigate it.
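All of the error rates below are word error rates (WER): the minimum number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference transcript, divided by the number of reference words. Here's a minimal sketch of the metric, just for illustration (not the scoring tool used for these numbers):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference word count,
    via the standard edit-distance dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("send the report by friday", "send a report friday"))  # 0.4
```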
I was fairly ruthless in choosing parameters to make the decoding go fast. I tested the existing configuration (near real-time on the N800), a slower version, and a much slower version. This test used a model fully adapted to my voice and was run on a desktop.
| Configuration | xRT (desktop) | si_dt_s2 | Enron | SMS |
|---|---|---|---|---|
| Normal | 0.1 | 34.3% | 31.5% | 45.5% |
| Slow | 0.5 | 29.7% | 28.0% | 45.0% |
| Slower | 9.0 | 29.5% | 27.9% | 47.8% |
As shown in the table, improved accuracy was possible with a modest (5 times) increase in CPU time. But throwing lots of CPU time at the problem (90 times) did not provide further improvements.
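The xRT numbers here and in the later tables are real-time factors: CPU time spent decoding divided by the duration of the audio, so 0.1 xRT means an utterance decodes in a tenth of the time it took to speak it. A minimal sketch of the measurement (the `decode` callable is a stand-in for whatever recognizer is being timed):

```python
import time
import wave

def real_time_factor(decode, audio_path):
    """xRT = decoding time / audio duration."""
    with wave.open(audio_path, "rb") as w:
        audio_seconds = w.getnframes() / float(w.getframerate())
    start = time.time()
    decode(audio_path)   # run the recognizer on the utterance
    return (time.time() - start) / audio_seconds
```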
Unfortunately, I could only get narrowband (8kHz) audio from the N800. I also had to use a wireless Bluetooth connection, which may have introduced some interference. I tested how big a hit in accuracy this caused by recording another version of the si_dt_s2 test set on a desktop at 16kHz using a Sennheiser USB microphone.
I compared performance on the si_dt_s2 set recorded on the N800 using the narrowband acoustic model with performance on the second si_dt_s2 set using a wideband acoustic model (other parameters of the model were the same). I also downsampled the wideband audio and tested against the narrowband model. This test used the normal set of decoding parameters, unadapted acoustic models, and was run on a desktop.
| Acoustic model | Test set | WER |
|---|---|---|
| Narrowband | N800 si_dt_s2 | 39.6% |
| Narrowband | Desktop si_dt_s2, downsampled | 43.0% |
| Wideband | Desktop si_dt_s2, wideband | 33.9% |
Bear in mind that the N800 and Desktop test sets were from separate recording sessions, so I wouldn't draw any strong conclusions from comparing the first two rows above. But comparing the second and third rows, the downsampled version of the same audio caused a 9% absolute increase in WER.
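For what it's worth, the downsampling itself is straightforward; here's a sketch using scipy (the file names are hypothetical):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

# Convert a 16kHz wideband recording to 8kHz so it can be decoded with the
# narrowband acoustic model. resample_poly low-pass filters before decimating,
# throwing away everything above 4kHz.
rate, audio = wavfile.read("desktop_si_dt_s2_utt.wav")   # hypothetical file
assert rate == 16000
narrowband = resample_poly(audio.astype(np.float64), up=1, down=2)
wavfile.write("desktop_si_dt_s2_utt_8k.wav", 8000, narrowband.astype(np.int16))
```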
I chose a semi-continuous acoustic model to keep decoding cheap and cheerful on the N800. Continuous acoustic models are more expensive to compute but are usually more accurate.
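Roughly, the difference is in how each HMM state's output probability is computed: semi-continuous states share one codebook of Gaussians and differ only in their mixture weights, while continuous states each own their own Gaussians. A toy sketch of the two scoring schemes (diagonal covariances, none of a real decoder's log-table tricks):

```python
import numpy as np

def gaussian_log_densities(x, means, variances):
    # Log density of frame x under each diagonal-covariance Gaussian.
    return -0.5 * np.sum(np.log(2 * np.pi * variances)
                         + (x - means) ** 2 / variances, axis=1)

def semi_continuous_score(x, codebook_means, codebook_vars, state_weights, top_n=4):
    # One shared codebook: score its Gaussians once per frame, keep the top-N,
    # then each state only needs a cheap weighted sum of those shared scores.
    ll = gaussian_log_densities(x, codebook_means, codebook_vars)
    top = np.argsort(ll)[-top_n:]
    return np.logaddexp.reduce(ll[top] + np.log(state_weights[top]))

def continuous_score(x, state_means, state_vars, state_weights):
    # Each state owns its Gaussians, so every active state pays for the full
    # density computation -- usually more accurate, but more expensive.
    ll = gaussian_log_densities(x, state_means, state_vars)
    return np.logaddexp.reduce(ll + np.log(state_weights))
```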
I tested the 8000 tied-state semi-continuous model with 256 codebook Gaussians against an 8000 tied-state continuous model with 16 Gaussians per state. Both models were trained on narrowband audio. The test used the normal set of decoding parameters, unadapted acoustic models, and was run on a desktop.
| Acoustic model | xRT (desktop) | Memory (desktop) | si_dt_s2 | Enron | SMS |
|---|---|---|---|---|---|
| Semi-continuous | 0.10 | 94.7MB | 39.6% | 37.6% | 50.1% |
| Continuous | 1.02 | 128.0MB | 34.2% | 32.6% | 41.0% |
The continuous model showed significant WER improvements on all three test sets. But decoding was only around real-time on a desktop, making it too slow for use on the N800. The continuous acoustic model also required more memory. Note that I didn't specifically tune the decoding parameters to the continuous model, so it is possible its speed could be improved somewhat without sacrificing too much accuracy (but probably not by 10x).
As we saw previously, the memory footprint was dominated by the language model. This is why I used a 2-gram language model instead of a longer-span, but more accurate, 3-gram model.
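The size gap is easy to see directly from the LM files; here's a small sketch, assuming the models are in the standard ARPA format (the header lists how many entries each order stores, which is what drives the memory footprint):

```python
def arpa_ngram_counts(lm_path):
    """Read the \\data\\ header of an ARPA-format LM and return the number
    of entries at each n-gram order, e.g. {1: 20000, 2: ..., 3: ...}."""
    counts = {}
    with open(lm_path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("ngram "):
                order, count = line[len("ngram "):].split("=")
                counts[int(order)] = int(count)
            elif line.startswith("\\1-grams:"):
                break
    return counts
```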
I tested 2-gram and 3-gram versions of the 20K blocked VP language model. The test used the normal set of decoding parameters, the unadapted acoustic model, and was run on a desktop.
| Language model | xRT (desktop) | Memory (desktop) | si_dt_s2 | Enron | SMS |
|---|---|---|---|---|---|
| 2-gram | 0.10 | 94.7MB | 39.6% | 37.6% | 50.1% |
| 3-gram | 0.11 | 134.9MB | 37.0% | 34.9% | 48.7% |
The 3-gram provided somewhat better recognition accuracy than the 2-gram. The 3-gram required slightly more processing time and significantly more memory.
Some words in the test utterances are not in the recognizer's vocabulary (si_dt_s2 is 4.4% OOV at 20K, Enron is 2.2% OOV). These out-of-vocabulary (OOV) words often cause collateral damage to nearby words during recognition.
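Measuring the OOV rate is just a matter of checking each reference word against the vocabulary; a minimal sketch (file formats assumed: one word per line for the vocabulary, one utterance per line for the transcripts):

```python
def oov_rate(vocab_path, transcript_path):
    """Fraction of reference words that are not in the recognizer's vocabulary."""
    with open(vocab_path) as f:
        vocab = {line.strip().lower() for line in f if line.strip()}
    total = oov = 0
    with open(transcript_path) as f:
        for line in f:
            for word in line.lower().split():
                total += 1
                oov += word not in vocab
    return oov / total
```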
I wanted to see if an open-vocabulary confusion network could improve accuracy. I created an LM consisting of 10K in-vocabulary words and 10K sub-word OOV chunks. I compared this with the 20K language model trained on blocked, verbalized-punctuation (VP) text and a language model trained on single sentences with no verbalized punctuation (NVP). The test used the normal set of decoding parameters, the fully adapted acoustic model, and was run on a desktop. The recognition result was the consensus hypothesis of the confusion network constructed from the recognition lattice.
| Language model | xRT (desktop) | si_dt_s2 | Enron |
|---|---|---|---|
| 20K in-vocab, blocked VP | 0.10 | 34.3% | 31.5% |
| 20K in-vocab, NVP | 0.10 | 33.4% | 29.5% |
| 10K in-vocab + 10K graphones | 0.25 | 38.6% | 33.2% |
The 20K NVP language model was a better match to the single, punctuation-free sentences in the si_dt_s2 and Enron test sets, and as a result it provided somewhat better accuracy than the 20K blocked VP LM. Unfortunately, the LM combining in-vocab words and OOV chunks only made things worse.