Monday, 16 July 2012

Very large-scale corpus (COCA)

COCA (Corpus of Contemporary American English)

Intro: The Corpus of Contemporary American English (COCA) is the largest freely-available corpus of English, and the only large and balanced corpus of American English. The corpus was created by Mark Davies of Brigham Young University, and it is used by tens of thousands of users every month (linguists, teachers, translators, and other researchers). COCA is also related to other large corpora that we have created.
The corpus contains more than 450 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words each year from 1990-2012 and the corpus is also updated regularly (the most recent texts are from Summer 2012). Because of its design, it is perhaps the only corpus of English that is suitable for looking at current, ongoing changes in the language (see the 2011 article in Literary and Linguistic Computing).

Ngram corpus from COCA:
1) COCA Ngrams:
Link: see this post.

2) COHA Ngrams:
Intro: The Corpus of Historical American English (COHA) contain 400 million words of text from 1810-2009, and all of the n-grams from the corpus can be freely downloaded. They contain all n-grams that occur at least three times total in the corpus, and you can see the frequency of each of these n-grams in each decade from the 1810s-2000s. This data can be used offline to carry out powerful searches on a wide range of phenomena in the history of American English.

My thoughts:
- I have been developing a language-generic n-gram-based spell checking tool. So, this ngram corpus will be very beneficial.
- Other tasks in English NLP may need this corpus.