Wednesday, 28 March 2012


Intro: The OpenCalais Web Service automatically creates rich semantic metadata for the content you submit – in well under a second. Using natural language processing (NLP), machine learning and other methods, Calais analyzes your document and finds the entities within it. But, Calais goes well beyond classic entity identification and returns the facts and events hidden within your text as well.

Downloading full CiteSeerX data

Just saw this link and found it very interesting.


I am copying it here as a backup, in case the original link dies.

Steps for downloading the full dataset from CiteSeerX:
  1. Download and extract the "Demo" from
  2. Go to the directory of the extracted files and run the following command to download the full CiteSeerX dataset into the file "citeseerx_alldata.xml" (the ";" classpath separator is for Windows; on Linux or macOS use ":" instead):
    java -classpath .;oaiharvester.jar;xerces.jar org.acme.oai.OAIReaderRawDump -o citeseerx_alldata.xml
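
Once the download finishes, a quick sanity check is to count how many records ended up in the file. The snippet below is only a rough sketch of how that might be done in Python. It assumes the dump is a single well-formed XML document whose entries are `<record>` elements (as in OAI-PMH responses); I have not verified this against the real file, and the `count_records` helper and the sample data are made up for illustration. It streams the file instead of loading it whole, since the full dataset is likely to be large.

```python
import io
import xml.etree.ElementTree as ET


def count_records(source):
    """Count <record> elements in an XML stream (hypothetical helper).

    `source` may be a filename or a binary file-like object.
    """
    count = 0
    # iterparse yields each element once its closing tag has been read,
    # so the whole document never has to be parsed up front.
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Drop any "{namespace}" prefix so the check works with or without xmlns.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "record":
            count += 1
            elem.clear()  # release the children of the finished record
    return count


# Tiny stand-in for the real dump (assumed structure, not actual CiteSeerX data).
SAMPLE = b"""<harvest>
  <record><header><identifier>example:1</identifier></header></record>
  <record><header><identifier>example:2</identifier></header></record>
</harvest>"""

if __name__ == "__main__":
    print(count_records(io.BytesIO(SAMPLE)))
```

For the real file, calling `count_records("citeseerx_alldata.xml")` should work the same way, provided the assumptions above hold.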

Thanks to the author for that.