While short-message-style dictation proved too hard, I still wanted to field a real dictation application on the N800, so I made the task easier. The new task was dictating sentences from the Wall Street Journal (WSJ) that contain only words from a closed 5K vocabulary, so there are no out-of-vocabulary (OOV) words.
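As an illustration of what "closed vocabulary" means here, the following is a minimal sketch of keeping only sentences with no OOV words. The function names, the one-word-per-line vocabulary format, and the case-insensitive matching are all assumptions made for the example; this is not the script used to build the WSJ test set.

```python
def load_vocab(lines):
    """Build a set of upper-cased words from an iterable of lines.

    Assumes one word per line (hypothetical format); blank lines are skipped.
    """
    vocab = set()
    for line in lines:
        word = line.strip()
        if word:
            vocab.add(word.upper())
    return vocab


def is_closed_vocab(sentence, vocab):
    """Return True if the sentence is non-empty and every word is in vocab."""
    words = sentence.split()
    return bool(words) and all(w.upper() in vocab for w in words)


def filter_closed_vocab(sentences, vocab):
    """Keep only the sentences that contain no out-of-vocabulary words."""
    return [s for s in sentences if is_closed_vocab(s, vocab)]


if __name__ == "__main__":
    vocab = load_vocab(["the", "stock", "rose"])
    print(filter_closed_vocab(["The stock rose", "The bond fell"], vocab))
```

With a real word list, `load_vocab` would be fed the lines of the vocabulary file, and any prompt containing even one unknown word would be dropped.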
I recorded another test set to see how easy this task was. This set consisted of 330 utterances from the WSJ0 si_et_05 directory (Nov'92 test set). I read the prompts wearing two microphones: the Blueparrot B150 mic recording on the N800, and a wired Sennheiser USB mic recording on a desktop.
Nov'92 test set
Nov'92 test set, with my audio
I trained a 5K vocabulary 2-gram language model using just the words in the WSJ0 5K closed vocabulary list. I also trained a 3-gram, since the smaller vocabulary size should make the longer-span language model usable.
WSJ 5K NVP 2-gram
WSJ 5K NVP 3-gram
In this experiment, I compared performance on the WSJ 5K closed-vocabulary task using differing amounts of my adaptation data. I added data points for even smaller amounts of adaptation data (5%, 10%, 15% and 20%). Using the B150 Bluetooth mic, I compared word error rate (WER) with the 2-gram LM and the 3-gram LM. I also compared WER using the 3-gram with downsampled audio that was simultaneously recorded on a wired Sennheiser USB mic.
As shown in the figure above, in all cases the majority of adaptation benefit was gained in the first 5% (30 utterances) of adaptation data. The 3-gram substantially outperformed the 2-gram. Given the small vocab size, it is feasible to use this longer span LM on the N800.
Additionally, we see the downsampled Sennheiser audio was much better than the Bluetooth mic. This was despite the fact that the adaptation data was recorded on the B150 mic. This indicates that we're taking an accuracy hit either due to the wireless connection or the quality of the B150 mic.
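For reference, WER is conventionally the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The sketch below computes it with a standard word-level edit distance. It is only an illustration: the function names are hypothetical, no text normalization is applied, and it is not the scoring tool used for the results above.

```python
def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance between reference and hypothesis."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref in enumerate(ref_words, 1):
        cur = [i]
        for j, hyp in enumerate(hyp_words, 1):
            cur.append(min(
                prev[j] + 1,                 # deletion
                cur[j - 1] + 1,              # insertion
                prev[j - 1] + (ref != hyp),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def wer(pairs):
    """Percent WER over an iterable of (reference, hypothesis) strings."""
    pairs = list(pairs)
    errors = sum(word_errors(r.split(), h.split()) for r, h in pairs)
    n_ref = sum(len(r.split()) for r, _ in pairs)
    if n_ref == 0:
        raise ValueError("no reference words to score against")
    return 100.0 * errors / n_ref


if __name__ == "__main__":
    # One substitution plus one deletion against four reference words.
    print(wer([("the stock rose today", "the stocks rose")]))
```

Errors are pooled over the whole test set before dividing, so long utterances count for more than short ones.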
Live Streaming Recognition
My first tests doing live recognition using streaming audio from the Bluetooth mic produced terrible recognition accuracy. It seems that whenever the Bluetooth mic has been off for more than a few seconds, a beep is played when recording is started. While the captured audio doesn't contain an audible beep, it contains some sort of screwed up signal. This really messed up PocketSphinx, perhaps by screwing up the prior cepstral mean normalization. At any rate, I "solved" this by dropping the first 750ms of captured audio.
My adaptation data and test sets probably contain some instances of this audio artifact as well. But this would only occur if I paused for a period between recording utterances. So the majority are probably okay.
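The workaround of dropping the first 750 ms of captured audio can be sketched as below. This is a hedged illustration rather than the actual N800 code: the function name is hypothetical, and the 8 kHz sample rate, 16-bit samples and mono channel are assumed defaults, not values stated above.

```python
def drop_leading_audio(pcm_bytes, sample_rate=8000, drop_ms=750,
                       sample_width=2, channels=1):
    """Return raw PCM audio with the first drop_ms milliseconds removed.

    pcm_bytes is interleaved linear PCM. The defaults (8 kHz, 16-bit,
    mono) are assumptions for this example.
    """
    frame_size = sample_width * channels         # bytes per sample frame
    n_frames = sample_rate * drop_ms // 1000     # frames to discard
    # Slicing past the end simply yields empty bytes for short captures.
    return pcm_bytes[n_frames * frame_size:]


if __name__ == "__main__":
    one_second = bytes(8000 * 2)  # 1 s of silence at 8 kHz, 16-bit mono
    trimmed = drop_leading_audio(one_second)
    print(len(trimmed) // 2, "samples remain")
```

In a streaming setup the same idea applies to the running byte count: discard incoming buffers until the skipped total reaches the 750 ms mark, then start feeding audio to the recognizer.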