I created several large English word lists by intersecting 12 different source word lists. I used the following sources:
- British National Corpus (336K words)
- Enron email corpus (135K words)
- Moby word list (355K words)
- CMU pronunciation dictionary (119K words)
- W3C email corpus (89K words)
- Wiktionary (218K words)
- Wikipedia (top 400K words)
- Gigaword newswire corpus (top 400K words)
- LM-CSR newswire corpus (top 400K words)
- Google corpus (top 400K words)
- Westbury Lab Usenet corpus (top 400K words)
- ICWSM 2009 blog corpus (top 400K words)
By varying the number of lists a word must appear in (from 1 to 12), I got word lists of varying size and "quality".
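The thresholding idea can be sketched roughly as follows. This is a minimal Python illustration under my own assumptions, not the actual script used to build the lists: the function name `words_in_at_least` and the toy `sources` data are hypothetical, and the real inputs would be the 12 source lists above.

```python
from collections import Counter


def words_in_at_least(word_lists, min_lists):
    """Return the sorted words that occur in at least `min_lists` of the lists."""
    counts = Counter()
    for words in word_lists:
        # Use a set so a word repeated within one list counts once for that list.
        counts.update(set(words))
    return sorted(w for w, n in counts.items() if n >= min_lists)


# Toy stand-ins for the real source lists (hypothetical data).
sources = [
    ["the", "and", "cat", "zyzzyva"],
    ["the", "and", "cat"],
    ["the", "and", "teh"],
]

# One output list per threshold, from "in any list" to "in every list".
for k in range(1, len(sources) + 1):
    print(k, words_in_at_least(sources, k))
```

A low threshold keeps rare words and typos that appear in only one source, while the highest threshold keeps only words that every source agrees on.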
Update: In March 2018 I updated the word lists. Previously I used 10 word lists, but several had problems that kept some common words like "and", as well as words with apostrophes, from appearing in the intersection of 9 or 10 of the lists. In the process of fixing this, I removed the American National Corpus and 20 Newsgroups word lists and added new word lists from blog, Usenet, W3C, and Wiktionary data. If you need the old lists for some reason, they are still available here.