Big English Word Lists

I created a set of large English word lists by combining 12 different source word lists and keeping words according to how many of those lists they appear in. I used the following sources:

British National Corpus (336K words)
Enron email corpus (135K words)
Moby word list (355K words)
CMU pronunciation dictionary (119K words)
W3C email corpus (89K words)
Wiktionary (218K words)
Wikipedia (top 400K words)
Gigaword newswire corpus (top 400K words)
LM-CSR newswire corpus (top 400K words)
Google corpus (top 400K words)
Westbury Lab Usenet corpus (top 400K words)
ICWSM 2009 blog corpus (top 400K words)

By varying the number of lists a word must appear in (from 1 to 12), I got word lists of varying size and "quality".
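
The thresholding described above can be sketched roughly as follows. This is only an illustration, not the actual scripts used to build the lists. It assumes that "words in N lists" means words found in at least N of the sources, which is what the growing file sizes suggest, and it does no normalization such as lowercasing. The function names and the toy data are made up for the example.

```python
from collections import Counter
from typing import Iterable, List


def count_list_membership(word_lists: Iterable[Iterable[str]]) -> Counter:
    """Count, for each word, how many of the source lists contain it."""
    counts: Counter = Counter()
    for words in word_lists:
        # Use a set so a word repeated within one list is counted only once.
        counts.update(set(words))
    return counts


def words_in_at_least(counts: Counter, min_lists: int) -> List[str]:
    """Return the sorted words that appear in at least `min_lists` lists."""
    return sorted(w for w, n in counts.items() if n >= min_lists)


# Three tiny made-up lists standing in for the 12 real sources.
toy_lists = [
    ["the", "and", "cat", "zyzzyva"],
    ["the", "and", "cat"],
    ["the", "and", "dog"],
]
counts = count_list_membership(toy_lists)
match3 = words_in_at_least(counts, 3)  # strictest threshold, smallest list
match2 = words_in_at_least(counts, 2)
match1 = words_in_at_least(counts, 1)  # loosest threshold, union of all lists
```

With a higher threshold, rare or noisy entries drop out; with a threshold of 1 the result is simply the union of all sources.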

Update: In March 2018 I updated the word lists. Previously I used 10 source lists, but several had problems that kept some common words, such as "and", and words with apostrophes out of the intersections involving 9 or 10 of the lists. While fixing this, I removed the American National Corpus and 20 Newsgroups word lists and added new ones built from blog, Usenet, W3C, and Wiktionary data. If you need the old lists for some reason, they are still available here.

Files:

	wlist_all.zip	All the word lists
	wlist_match12.zip	Words in 12 lists (27K words)
	wlist_match11.zip	Words in 11 lists (43K words)
	wlist_match10.zip	Words in 10 lists (60K words)
	wlist_match9.zip	Words in 9 lists (84K words)
	wlist_match8.zip	Words in 8 lists (111K words)
	wlist_match7.zip	Words in 7 lists (143K words)
	wlist_match6.zip	Words in 6 lists (181K words)
	wlist_match5.zip	Words in 5 lists (228K words)
	wlist_match4.zip	Words in 4 lists (289K words)
	wlist_match3.zip	Words in 3 lists (384K words)
	wlist_match2.zip	Words in 2 lists (587K words)
	wlist_match1.zip	Words in 1 list (1517K words)
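
For anyone loading one of these archives in a program, a minimal sketch is shown below. The layout inside the zip files is not described here, so the code assumes plain text with one word per line in UTF-8 and simply reads every file it finds in the archive. The function name `load_words` is hypothetical, and the encoding may need adjusting.

```python
import zipfile
from typing import Set


def load_words(zip_path: str, encoding: str = "utf-8") -> Set[str]:
    """Collect one word per non-empty line from every file in the archive.

    Assumes a one-word-per-line text format, which is a guess about the files.
    """
    words: Set[str] = set()
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            with archive.open(name) as handle:
                for raw in handle:
                    # Undecodable bytes are replaced rather than raising.
                    word = raw.decode(encoding, errors="replace").strip()
                    if word:
                        words.add(word)
    return words


# Example usage, assuming wlist_match12.zip has been downloaded locally:
#     vocabulary = load_words("wlist_match12.zip")
#     print("and" in vocabulary)
```

A set is used so that membership checks are fast; switch to a list if the original order of the words matters.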