Thursday, 19 May 2011

Mining scientific texts

This post collects papers related to mining scientific texts (entity and relation extraction, summarization, ...).

(Extracting and Querying Relations in Scientific Papers on Language Technology)


Monday, 16 May 2011

hunalign – sentence aligner


hunalign aligns bilingual text on the sentence level. Its input is tokenized and sentence-segmented text in two languages. In the simplest case, its output is a sequence of bilingual sentence pairs (bisentences).

If a dictionary is available, hunalign combines the dictionary information with Gale-Church sentence-length information. Without a dictionary, it first aligns the text using sentence-length information alone, then builds an automatic dictionary from that alignment. Finally, it realigns the text in a second pass using the automatic dictionary.
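To make the sentence-length idea concrete, here is a minimal sketch of a length-only aligner using dynamic programming. It is not hunalign's code and it does not use the Gale-Church probabilistic cost: the `BEADS` penalties, the `length_cost` function, and the name `align_by_length` are all illustrative assumptions. It only shows the general shape of the first pass, where sentences are paired by how similar their character lengths are.

```python
from typing import List, Tuple

# Allowed alignment "beads": (source sentences used, target sentences used).
# The penalty values are ad hoc assumptions, chosen only for illustration.
BEADS = {(1, 1): 0.0, (2, 1): 1.0, (1, 2): 1.0, (1, 0): 3.0, (0, 1): 3.0}


def length_cost(src_len: int, tgt_len: int) -> float:
    """Relative character-length difference between the two sides of a bead."""
    return abs(src_len - tgt_len) / max(src_len, tgt_len, 1)


def align_by_length(
    src: List[str], tgt: List[str]
) -> List[Tuple[List[int], List[int]]]:
    """Return a monotone alignment as a list of (source indices, target indices)."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                cost = best[i][j] + penalty + length_cost(s_len, t_len)
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the chosen beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


if __name__ == "__main__":
    source = ["A short one.", "And a second short one."]
    target = ["A short one and a second short one, merged."]
    print(align_by_length(source, target))
```

In the usage example, the two source sentences are merged into a single 2-to-1 bead because that is cheaper than pairing one of them and dropping the other. A dictionary-based score, as described above, would be added to the cost alongside the length term.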

Like most sentence aligners, hunalign does not deal with changes of sentence order: it is unable to come up with crossing alignments, i.e., segments A and B in one language corresponding to segments B’ A’ in the other language.
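The restriction on sentence order can be stated as a small check. The sketch below is only an illustration (the function name `has_crossing` and the link representation are assumptions, not part of hunalign): an alignment is a set of (source index, target index) links, and it is crossing if some link points to an earlier target segment than a link that starts earlier in the source.

```python
from typing import Iterable, Tuple


def has_crossing(links: Iterable[Tuple[int, int]]) -> bool:
    """Return True if any two links cross, i.e. i1 < i2 but j1 > j2."""
    # Sort by source index, then check that target indices never decrease.
    targets = [j for _, j in sorted(links)]
    return any(a > b for a, b in zip(targets, targets[1:]))


if __name__ == "__main__":
    # A -> A', B -> B' where the translation has the order B' A'.
    swapped = [(0, 1), (1, 0)]
    in_order = [(0, 0), (1, 1)]
    print(has_crossing(swapped), has_crossing(in_order))
```

A monotone aligner can only output alignments for which this check is false, so for the swapped case it has to fall back on something else, such as one merged 2-to-2 segment or unaligned sentences.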

There is nothing Hungarian-specific in hunalign; the name simply reflects the fact that it is part of the hun* NLP toolchain.

hunalign is written in portable C++ and can be built on virtually any operating system.

YouAlign - Online document alignment solution

"Welcome to YouAlign, your online document alignment solution. No software to purchase, no software to install. With YouAlign you can quickly and easily create bitexts from your archived documents. A YouAlign bitext contains a document and its translation aligned at the sentence level. YouAlign generates TMX files that can be loaded into your translation memory. YouAlign can also generate HTML files that you can publish on the Internet, or use with a full-text search engine to search for terminology and phraseology in context.

YouAlign is powered by the AlignFactory engine, which supports all kinds of formats, including Microsoft Word, Excel and PowerPoint, PDF, HTML, XML, Corel WordPerfect, RTF, Lotus WordPro and plain text."
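For readers who have not seen TMX, the sketch below writes a list of sentence pairs as a minimal TMX 1.4 document using Python's standard library. It does not reproduce YouAlign's or AlignFactory's actual output: the function name `bisentences_to_tmx` and the header values (`creationtool`, `o-tmf`, and so on) are placeholder assumptions, chosen only to fill the attributes a TMX header is expected to carry.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

# ElementTree serializes this qualified name as the attribute "xml:lang".
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def bisentences_to_tmx(
    pairs: Iterable[Tuple[str, str]], src_lang: str, tgt_lang: str
) -> str:
    """Serialize (source, target) sentence pairs as a minimal TMX string."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(
        tmx,
        "header",
        {
            "creationtool": "example-script",  # placeholder value
            "creationtoolversion": "0.1",      # placeholder value
            "segtype": "sentence",
            "o-tmf": "plain-text",             # placeholder value
            "adminlang": "en",
            "srclang": src_lang,
            "datatype": "plaintext",
        },
    )
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per bisentence
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    print(bisentences_to_tmx([("Hello world.", "Bonjour le monde.")], "en", "fr"))
```

Each `tu` element holds one bisentence, with one `tuv` per language, which is the structure a translation memory tool reads when it imports the file.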