Saturday 28 May 2011

A USENET corpus (2005-2010)

http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html

This corpus was collected between Oct 2005 and Jan 2011, and covers 47,860 English-language, non-binary-file newsgroups:
+ Corpus size: over 30 billion words
+ Data size: over 34 GB compressed (delivered as weekly bundles of about 150 MB each)
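
As a rough sanity check on the figures above, the quoted total size and bundle size imply an approximate number of weekly bundles to download. The sketch below is only a back-of-the-envelope estimate: the function name is made up for illustration, decimal units (1 GB = 1000 MB) are an assumption, and both inputs are the approximate values quoted above ("over 34" and "about 150"), so the true count may well differ.

```python
import math

# Approximate figures quoted in the corpus description above.
TOTAL_COMPRESSED_GB = 34   # "over 34", so this is a lower bound
BUNDLE_SIZE_MB = 150       # "about 150" per weekly bundle
MB_PER_GB = 1000           # assumption: decimal units


def estimated_bundle_count(total_gb=TOTAL_COMPRESSED_GB, bundle_mb=BUNDLE_SIZE_MB):
    """Estimate how many bundles of bundle_mb are needed to hold total_gb."""
    return math.ceil(total_gb * MB_PER_GB / bundle_mb)


print(estimated_bundle_count())  # 227
```

Roughly 227 bundles is at least in the same range as the number of weeks between Oct 2005 and Jan 2011, which is what weekly delivery would suggest.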