Tuesday 13 May 2014

Wikipedia Extractor

Link: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
Intro: WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump. It writes the output to a given directory as a number of files of similar size, each containing several documents in the extractor's own document format.
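
The intro does not spell out the document format, so the following is only a minimal sketch of reading one output file back in. It assumes each document is wrapped in a `<doc id="..." url="..." title="...">` ... `</doc>` block, which is how the linked page describes the format as far as I recall. The helper name `iter_documents` and the sample text are hypothetical and not part of WikiExtractor itself.

```python
import re
from typing import Dict, Iterator

# Assumed layout of one extracted document (check against your own output):
#   <doc id="12" url="..." title="Anarchism">
#   ...cleaned text...
#   </doc>
DOC_RE = re.compile(r'<doc\s+([^>]*)>\n?(.*?)</doc>', re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def iter_documents(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per <doc> block: its attributes plus the body under 'text'."""
    for match in DOC_RE.finditer(text):
        doc = dict(ATTR_RE.findall(match.group(1)))
        doc["text"] = match.group(2).strip()
        yield doc


# Hypothetical sample standing in for the contents of one output file.
sample = (
    '<doc id="12" url="http://en.wikipedia.org/wiki?curid=12" title="Anarchism">\n'
    'Anarchism\n\nAnarchism is a political philosophy.\n'
    '</doc>\n'
    '<doc id="25" url="http://en.wikipedia.org/wiki?curid=25" title="Autism">\n'
    'Autism\n\nAutism is a disorder of neural development.\n'
    '</doc>\n'
)

docs = list(iter_documents(sample))
```

For a real run, replace `sample` with the contents of one of the files in the output directory. Titles that contain a double quote would need a more careful attribute parser than this regular expression.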