Tuesday, 10 August 2010

Tokenization & Sentence Boundary Detection

GATE (LGPL)
variety of tokenizers and splitters (generic & language specific)
http://gate.ac.uk/

MorphAdorner
http://morphadorner.northwestern.edu/
English only

"tokenize.pl" script from the WCDG parser:
http://nats-www.informatik.uni-hamburg.de/view/CDG/DownloadPage
(also handles de-hyphenation when used together with the parser's lexicon)

Java-based program, Segment
https://sourceforge.net/projects/segment/ (MIT-type licence)
uses SRX rules for sentence splitting; includes a library that is
used by LanguageTool and the Maligna sentence aligner
C++ library in development (GPL)

MeCab (successor of ChaSen)
http://mecab.sourceforge.net/
Japanese

Juman
http://www-lab25.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
includes dependency parsing etc.
Japanese

IceNLP (open source)
http://icenlp.sourceforge.net
tokenizer and sentence segmenter module, part of IceNLP, a toolkit for
Icelandic

Lingua::PT::PLNbase
Portuguese
heuristics with names and standard abbreviations

Tokenizer by Sebastian Nagel
http://www.cis.uni-muenchen.de/~wastl/misc/tokenizer.tgz
fast, rule-based tokenizer + sentence boundary detector
German, Russian, English
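
Several of the tools above are rule-based splitters that rely on lists of
standard abbreviations. As a rough illustration of that general idea (not
the implementation of any tool listed here), a minimal Python sketch might
look like the following. The abbreviation set and the function name
split_sentences are assumptions made up for this example; real tools ship
much larger, language-specific lists and more rules.

```python
import re

# Hypothetical, tiny abbreviation list used only for illustration.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "e.g", "i.e", "vs"}

# A candidate boundary is a run of ., ! or ? followed by whitespace or end of text.
BOUNDARY = re.compile(r"[.!?]+(?=\s+|$)")


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text at sentence-final punctuation, skipping known abbreviations."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # Look at the word immediately before the punctuation mark.
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        # A single period after a known abbreviation is not a boundary.
        if match.group() == "." and last_word in abbreviations:
            continue
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    # Keep any trailing text that has no final punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    for s in split_sentences("Dr. Smith met Prof. Jones. They talked, e.g. about parsing. Done!"):
        print(s)
```

A list-based approach like this is simple and fast, but it cannot handle an
abbreviation that also ends a sentence; trainable splitters such as the ones
listed further down try to address that case.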

Sentrick (GPLv3)
http://sourceforge.net/projects/sentrick/
sentence boundary detector for German, trainable

fullstop
http://hackage.haskell.org/package/fullstop
English sentence segmenter in Haskell

Grammatical Framework tool
http://hackage.haskell.org/package/toktok

MADA + TOKAN
http://www1.ccls.columbia.edu/~cadim/MADA.html
Arabic

Moses/Europarl tokenizer
http://www.statmt.org/wmt10/scripts.tgz

Europarl sentence splitter as Perl modules
http://code.google.com/p/corpus-tools/downloads/list
http://search.cpan.org/~achimru/Lingua-Sentence-1.01/lib/Lingua/Sentence.pm

Other Perl modules
http://search.cpan.org/~shlomoy/Lingua-EN-Sentence-0.25/lib/Lingua/EN/Sentence.pm
http://search.cpan.org/~holsten/Lingua-DE-Sentence-0.07/Sentence.pm
http://search.cpan.org/~shlomoy/Lingua-HE-Sentence-0.13/lib/Lingua/HE/Sentence.pm

Punkt
implemented in NLTK (Apache license)
http://www.nltk.org/
trainable (unsupervised)
existing models for different languages (?)

OpenNLP (LGPL)
http://opennlp.sourceforge.net/
trainable tokenizer & sentence boundary detector
models available for English, German, Spanish, Thai
further models to come, wiki at:
https://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page

huntoken (License?)
http://mokk.bme.hu/resources/huntoken
mainly for Hungarian (?)

Jena NLP tools
http://www.julielab.de/Resources/Software/NLP+Tools.html
trainable tokenizer & sentence splitter

FreeLing (GPL)
http://www.lsi.upc.edu/~nlp/freeling
regexp tokenizer
(mainly for Catalan & Spanish?)
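
To give an idea of what a regexp tokenizer does in general, here is a small
Python sketch. It is not FreeLing's rule set or API; the pattern and the
function name tokenize are assumptions chosen for this example. The
alternatives are tried in order, so numbers are matched before plain words.

```python
import re

# Hypothetical token pattern; alternatives are tried left to right.
TOKEN_PATTERN = re.compile(
    r"""
    \d+(?:[.,]\d+)*      # numbers, optionally with . or , separators (1.500,75)
  | \w+(?:[-']\w+)*      # words, allowing internal hyphens or apostrophes
  | [^\w\s]              # any other single non-space character (punctuation)
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("L'any 2010 costa 1.500,75 euros, no?"))
```

Keeping the rules in a single ordered pattern makes the tokenizer easy to
adapt per language, at the cost of having to order the rules carefully.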

Alpino for Dutch (tokenization + sentence splitting)
http://www.let.rug.nl/vannoord/alp/Alpino/

Ellogon (LGPL)
http://www.ellogon.org

ChaSen for Japanese (successor: MeCab, see above)
http://chasen-legacy.sourceforge.jp/

MXPOST & MXTERMINATOR (research only!)
ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
trainable sentence splitter

Perl Sentence Segmentation
http://www.koders.com/perl/fidFCD2926AB83BCD7179772D521830DE9A226A6195.aspx?s=open#L38

-----------------------
The list is compiled by Joerg Tiedemann.