Thursday, 28 April 2011

Boilerpipe - Boilerplate Removal and Fulltext Extraction from HTML pages

Link: http://code.google.com/p/boilerpipe/

The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
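
To give a rough feel for what "detecting clutter" can mean, here is a minimal Python sketch. It is not boilerpipe's code or API (boilerpipe is a Java library); it is a simplified heuristic written for illustration that splits a page into text blocks and keeps only blocks with enough words and few link words. The names `BlockCollector` and `extract_main_text`, the tag lists, and the thresholds `min_words` and `max_link_density` are all assumptions made up for this example.

```python
from html.parser import HTMLParser

# Tags that start or end a text block (an assumption for this sketch).
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "h4", "h5", "h6",
              "ul", "ol", "table", "tr", "body", "br"}
# Tags whose content is never visible text.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of (words, number_of_words_in_links)
        self._words = []
        self._link_words = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it contains any words.
        if self._words:
            self.blocks.append((self._words, self._link_words))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for words, link_words in collector.blocks:
        link_density = link_words / len(words)
        if len(words) >= min_words and link_density <= max_link_density:
            kept.append(" ".join(words))
    return "\n".join(kept)
```

Calling `extract_main_text(page_html)` on a page with a navigation bar, one long paragraph, and a short footer would be expected to return only the paragraph. For real pages, the library itself is the better choice; the thresholds here are untuned guesses.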
