Monday 16 May 2011

hunalign – sentence aligner

http://mokk.bme.hu/resources/hunalign/

Intro

hunalign aligns bilingual text on the sentence level. Its input is tokenized and sentence-segmented text in two languages. In the simplest case, its output is a sequence of bilingual sentence pairs (bisentences).

In the presence of a dictionary, hunalign uses it, combining this information with Gale-Church sentence-length information. In the absence of a dictionary, it first falls back to sentence-length information, and then builds an automatic dictionary based on this alignment. Then it realigns the text in a second pass, using the automatic dictionary.

Like most sentence aligners, hunalign does not deal with changes of sentence order: it is unable to come up with crossing alignments, i.e., segments A and B in one language corresponding to segments B’ A’ in the other language.

There is nothing Hungarian-specific in hunalign, the name simply reflects the fact that it is part of the hun* NLP toolchain.

hunalign was written in portable C++. It can be built under basically any kind of operating system.

No comments:

Post a Comment