Friday, 1 June 2012

WIT3 - Web Inventory of Transcribed and Translated Talks


Intro: WIT3, an acronym for Web Inventory of Transcribed and Translated Talks, is a ready-to-use version, for research purposes, of the multilingual transcriptions of TED talks.
Since 2007, the TED Conference has posted video recordings of all its talks on its website, along with English subtitles and their translations into more than 80 languages. To make this collection of talks more effectively usable by the research community, the original textual contents are redistributed here, together with MT benchmarks and processing tools.

Tuesday, 29 May 2012

Layout-Aware Text Extraction from Full-text PDF of Scientific Articles

Description: The Portable Document Format (PDF) is the almost universally used file format for online scientific publications. It is also notoriously difficult to read and handle computationally, presenting challenges for developers of biomedical text mining or biocuration informatics systems that use the published literature as an information source. To facilitate the effective use of scientific literature in such systems, we introduce Layout-Aware PDF Text Extraction (LA-PDFText). The LA-PDFText system focuses only on the textual content of research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) detecting contiguous text blocks using spatial layout processing, (2) classifying text blocks into rhetorical categories using a rule-based method, and (3) stitching classified text blocks together in the correct order, so that text is extracted from blocks grouped by section.
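
To make the three stages concrete, here is a minimal Python sketch of such a pipeline. It is not LA-PDFText's actual code or API: the `Word` and `Block` types, the vertical-gap threshold, and the font-size rules are all assumptions chosen for illustration, and it presumes a single-column page whose words have already been pulled from the PDF with their positions and font sizes.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical input record: one word with its page position and font size.
    text: str
    x: float          # left edge
    y: float          # top edge (grows downward)
    font_size: float


@dataclass
class Block:
    words: list
    label: str = "unknown"

    @property
    def text(self):
        return " ".join(w.text for w in self.words)


def detect_blocks(words, max_gap=15.0):
    """Stage 1: group words into contiguous blocks by vertical spacing.

    Assumption: a vertical jump larger than max_gap starts a new block.
    """
    blocks = []
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        if blocks and w.y - blocks[-1].words[-1].y <= max_gap:
            blocks[-1].words.append(w)
        else:
            blocks.append(Block([w]))
    return blocks


def classify_block(block, body_size=10.0):
    """Stage 2: rule-based labelling (illustrative font-size rules only)."""
    size = max(w.font_size for w in block.words)
    if size >= body_size * 1.5:
        return "title"
    if size > body_size:
        return "heading"
    return "body"


def stitch(blocks):
    """Stage 3: join body blocks in reading order under the latest heading."""
    sections = {}
    current = "front matter"
    for b in blocks:  # blocks are already in top-to-bottom order
        if b.label == "heading":
            current = b.text
            sections.setdefault(current, [])
        elif b.label == "body":
            sections.setdefault(current, []).append(b.text)
    return {name: " ".join(parts) for name, parts in sections.items()}


def extract(words):
    blocks = detect_blocks(words)
    for b in blocks:
        b.label = classify_block(b)
    return stitch(blocks)
```

Calling `extract` on a list of `Word` records returns a dictionary that maps each detected heading to the body text found beneath it. A real system would need to handle multiple columns, headers and footers, and a much richer rule set than this sketch does.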