jusText is a tool for removing boilerplate content, such as navigation
links, headers, and footers, from HTML pages. It is designed to preserve
mainly text that contains full sentences, which makes it well suited to
building linguistic resources such as Web corpora.
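
To make the idea of "keeping mainly text with full sentences" concrete, here is a minimal sketch that uses only the Python standard library. It is not jusText's actual classifier. The names `BlockCollector`, `looks_like_sentence` and `extract_main_text`, the `min_words=8` threshold, and the list of block-level tags are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags whose boundaries are treated as separating text blocks (an assumption).
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol", "br",
              "h1", "h2", "h3", "h4", "h5", "h6", "nav", "header", "footer"}
# Tags whose contents are never visible text.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split an HTML page into plain-text blocks at block-level tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip_depth = 0

    def _flush(self):
        # Collapse whitespace and store the block if anything is left.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def looks_like_sentence(text, min_words=8):
    """Toy heuristic: enough words and sentence-final punctuation."""
    ends_ok = text.rstrip("\"')]").endswith((".", "!", "?"))
    return len(text.split()) >= min_words and ends_ok


def extract_main_text(html):
    """Return the text blocks that look like full sentences."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [block for block in parser.blocks if looks_like_sentence(block)]
```

Calling `extract_main_text(page_source)` on a page's HTML returns a list of the blocks that passed the heuristic. Short fragments such as menu entries or copyright lines tend to fail both checks, while ordinary paragraphs pass. A real tool needs considerably more than this, so treat the sketch as an illustration of the goal rather than a replacement for jusText.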