The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources
EMNLP '11: Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing, 2011.
Augmented and alternative communication (AAC) devices enable users with certain communication disabilities to participate in everyday conversations. Such devices often rely on statistical language models to improve text entry by offering word predictions. These predictions can be improved if the language model is trained on data that closely reflects the style of the users' intended communications. Unfortunately, there is no large dataset consisting of genuine AAC messages. In this paper we demonstrate how we can crowdsource the creation of a large set of fictional AAC messages. We show that these messages model conversational AAC better than the currently used datasets based on telephone conversations or newswire text. We leverage our crowdsourced messages to intelligently select sentences from much larger sets of Twitter, blog and Usenet data. Compared to a model trained only on telephone transcripts, our best performing model reduced perplexity on three test sets of AAC-like communications by 60-82% relative. This translated to a potential keystroke savings in a predictive keyboard interface of 5-11%.