Mobile Text Dataset and Language Models


This page contains supplementary materials for the paper Mining, Analyzing, and Modeling Text Written on Mobile Devices.

Mobile text dataset

Description Disk size Link
Mobile text dataset 624 MB Download

The above zip file contains the sentences we mined from public web forums and blogs. Additional details about the above dataset:

  • The data is split into training, development, and test sets based on the original domain name the text was mined from.
  • The sent_*.txt files are tab-delimited, with one sentence per line, each parsed from a particular post. Each line also contains the device name, forum software, device form factor (tablet or phone), and device input (touch or touch+key) associated with the post the sentence was obtained from.
  • The sets subdirectory contains the groupings used in Section 2.
  • A 64K word list (used in the paper), plus 5K and 20K word lists used for the forum-only models on this page.
  • Various word lists we used.
  • Posts and Email development and test sets.
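
As a rough illustration of reading the tab-delimited sent_*.txt files described above, here is a minimal Python sketch. The column order in ASSUMED_COLUMNS (and the position of the sentence text) is an assumption, since this page lists the fields but not their order; check the files in the zip archive before relying on it. The function name parse_sentences is hypothetical.

```python
from typing import Dict, Iterable, Iterator, Sequence

# ASSUMPTION: this column order is a guess for illustration only.
# The page names these fields but does not say how they are ordered.
ASSUMED_COLUMNS = ("device", "forum_software", "form_factor", "input", "sentence")


def parse_sentences(
    lines: Iterable[str],
    columns: Sequence[str] = ASSUMED_COLUMNS,
) -> Iterator[Dict[str, str]]:
    """Yield one dict per well-formed tab-delimited line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != len(columns):
            continue  # skip lines that do not match the assumed layout
        yield dict(zip(columns, fields))


# Usage with a made-up sample line (not taken from the dataset):
sample = ["somephone\tsomeforum\tphone\ttouch\tsee you at the game tonight\n"]
for row in parse_sentences(sample):
    print(row["form_factor"], "|", row["sentence"])
```

To read a real file, pass an open file handle (for example, open(path, encoding="utf-8")) as the lines argument and adjust the columns to match what the files actually contain.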

For further details, please see our paper.

Recommended language models

We recommend the following set of letter and word language models (Table 15 from our paper):

Type Vocab size Order Pruning Perplexity Disk size Link
Mix 30 characters 12-gram tiny 4.29 5 MB Download
Mix 30 characters 12-gram small 3.64 43 MB Download
Mix 30 characters 12-gram large 3.37 399 MB Download
Mix 64K words 3-gram tiny 195.42 4 MB Download
Mix 64K words 3-gram small 144.88 40 MB Download
Mix 64K words 3-gram large 133.34 401 MB Download

The word models were trained with a sentence start word of <s>, a sentence end word of </s>, and an unknown word <unk>. The word vocabulary was the most frequent 64K words in our forum dataset that were also in a list of 330K known English words. All words are in lowercase. The character models are 12-gram models and were trained using interpolated Witten-Bell smoothing. The character model vocabulary consists of the lowercase letters a-z, apostrophe, <sp> for a space, <s> for sentence start, and </s> for sentence end.
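
To query these models, input text needs to be mapped onto the vocabularies described above. The following Python sketch shows one way this might be done. The whitespace tokenization and the choice to drop characters outside the character vocabulary are assumptions for illustration, not necessarily the normalization used in the paper; the function names are hypothetical.

```python
from typing import List, Set

# Lowercase letters a-z plus apostrophe; together with <sp>, <s>, and </s>
# this gives the 30-symbol character vocabulary described above.
CHAR_VOCAB = set("abcdefghijklmnopqrstuvwxyz'")


def to_word_tokens(sentence: str, vocab: Set[str]) -> List[str]:
    """Lowercase, split on whitespace (an assumption), and map OOV words to <unk>."""
    words = sentence.lower().split()
    mapped = [w if w in vocab else "<unk>" for w in words]
    return ["<s>"] + mapped + ["</s>"]


def to_char_tokens(sentence: str) -> List[str]:
    """Convert a sentence into character-model symbols."""
    text = " ".join(sentence.lower().split())  # collapse runs of whitespace
    tokens = ["<s>"]
    for ch in text:
        if ch == " ":
            tokens.append("<sp>")
        elif ch in CHAR_VOCAB:
            tokens.append(ch)
        # ASSUMPTION: any other character (digits, punctuation) is dropped.
    tokens.append("</s>")
    return tokens


# Usage with a tiny made-up vocabulary:
print(to_word_tokens("Hello THERE world", {"hello", "world"}))
print(to_char_tokens("It's ok"))
```

In practice the vocab set would be loaded from the 64K word list included with the dataset.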

The perplexities in the above table are per-word (word models) or per-letter (character models) perplexities, averaged over four evaluation test sets. The test sets were:

The above mixture models were trained on a total of 504M words of data: 126M words of forum data, 126M words from Twitter's streaming API between December 2010 and June 2012, 126M words of forum data from the ICWSM 2011 Spinn3r dataset, and 126M words of blog data from the ICWSM 2009 Spinn3r dataset. These language models are released under a Creative Commons Attribution license (CC BY 4.0).
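
For readers who want to reproduce figures comparable to the perplexity column in the table above, the following Python sketch shows how a per-token perplexity can be computed from log probabilities and then averaged over several test sets. It assumes base-10 log probabilities and an arithmetic mean of the per-set perplexities; neither detail is confirmed on this page, and the function names are hypothetical.

```python
from typing import Sequence


def perplexity(log10_probs: Sequence[float]) -> float:
    """Perplexity of one test set from per-token log10 probabilities."""
    avg_log10 = sum(log10_probs) / len(log10_probs)
    return 10 ** (-avg_log10)


def mean_perplexity(per_set_log10_probs: Sequence[Sequence[float]]) -> float:
    """ASSUMPTION: test sets are combined by an arithmetic mean of perplexities."""
    values = [perplexity(lp) for lp in per_set_log10_probs]
    return sum(values) / len(values)


# Usage with made-up log probabilities for two tiny test sets:
set_a = [-1.0, -1.0]  # every token has probability 0.1
set_b = [-2.0]        # the single token has probability 0.01
print(perplexity(set_a), perplexity(set_b), mean_perplexity([set_a, set_b]))
```

Whether the sentence end symbol is counted as a token affects the result, so that choice should match whatever convention the paper uses.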

Forum-only language models

We also have word language models trained on only the forum data (141M words). For these models you have your choice of 5K, 20K, or 64K vocabulary sizes. These are available as 1-gram, 2-gram, 3-gram, or 4-gram models. Different entropy pruning thresholds were used to create a small and large version of each word language model. These language models are released under a Creative Commons attribution license (CC BY 4.0).

Type Vocab size Order Pruning Perplexity Disk size Link
Forum 5K 1-gram - 316.8 0.04 MB Download
Forum 5K 2-gram small 120.0 7 MB Download
Forum 5K 2-gram large 118.7 24 MB Download
Forum 5K 3-gram small 90.1 32 MB Download
Forum 5K 3-gram large 87.7 210 MB Download
Forum 5K 4-gram small 87.9 40 MB Download
Forum 5K 4-gram large 83.4 304 MB Download
Forum 20K 1-gram - 529.8 0.13 MB Download
Forum 20K 2-gram small 188.6 12 MB Download
Forum 20K 2-gram large 184.9 49 MB Download
Forum 20K 3-gram small 144.9 39 MB Download
Forum 20K 3-gram large 140.7 315 MB Download
Forum 20K 4-gram small 141.5 42 MB Download
Forum 20K 4-gram large 133.5 353 MB Download
Forum 64K 1-gram - 620.2 0.36 MB Download
Forum 64K 2-gram small 222.8 14 MB Download
Forum 64K 2-gram large 218.2 61 MB Download
Forum 64K 3-gram small 172.3 42 MB Download
Forum 64K 3-gram large 167.4 348 MB Download
Forum 64K 4-gram small 168.0 44 MB Download
Forum 64K 4-gram large 158.3 360 MB Download

These resources are also available at