A free/open-source machine translation platform: http://www.apertium.org/
Friday, 13 August 2010
Tuesday, 10 August 2010
Tokenization & Sentence Boundary Detection
GATE (LGPL)
variety of tokenizers and splitters (generic & language specific)
http://gate.ac.uk/
MorphAdorner
http://morphadorner.northwestern.edu/
English only
"tokenize.pl" script from the WCDG parser:
http://nats-www.informatik.uni-hamburg.de/view/CDG/DownloadPage
(also performs de-hyphenation when used together with the parser's lexicon)
Java-based program, Segment
https://sourceforge.net/projects/segment/ (MIT-type licence)
uses SRX rules for sentence splitting; includes a sentence-splitting
library that is used by LanguageTool and the Maligna sentence aligner
C++ library in development (GPL)
MeCab (successor of ChaSen)
http://mecab.sourceforge.net/
Japanese
Juman
http://www-lab25.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
includes dependency parsing etc
Japanese
IceNLP (open source)
http://icenlp.sourceforge.net
tokenizer/sentence segmenter module, part of IceNLP, a toolkit for
Icelandic
Lingua::PT::PLNbase
Portuguese
heuristics with names and standard abbreviations
http://www.cis.uni-muenchen.de/~wastl/misc/tokenizer.tgz
fast, rule-based, tokenizer + sentence boundary detector
German, Russian, English
Sebastian Nagel
Sentrick (GPLv3)
http://sourceforge.net/projects/sentrick/
sentence boundary detector for German, trainable
fullstop
http://hackage.haskell.org/package/fullstop
English sentence segmenter in Haskell
Grammatical Framework tool
http://hackage.haskell.org/package/toktok
MADA + TOKAN
http://www1.ccls.columbia.edu/~cadim/MADA.html
Arabic
Moses/Europarl tokenizer
http://www.statmt.org/wmt10/scripts.tgz
Europarl sentence splitter as Perl modules
http://code.google.com/p/corpus-tools/downloads/list
http://search.cpan.org/~achimru/Lingua-Sentence-1.01/lib/Lingua/Sentence.pm
Other Perl modules
http://search.cpan.org/~shlomoy/Lingua-EN-Sentence-0.25/lib/Lingua/EN/Sentence.pm
http://search.cpan.org/~holsten/Lingua-DE-Sentence-0.07/Sentence.pm
http://search.cpan.org/~shlomoy/Lingua-HE-Sentence-0.13/lib/Lingua/HE/Sentence.pm
Punkt
implemented in NLTK (Apache license)
http://www.nltk.org/
trainable (unsupervised)
existing models for different languages (?)
OpenNLP (GPL)
http://opennlp.sourceforge.net/
trainable tokenizer & sentence boundary detector
models available for English, German, Spanish, Thai
further models to come, wiki at:
https://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page
huntoken (License?)
http://mokk.bme.hu/resources/huntoken
mainly for Hungarian (?)
Jena NLP tools
http://www.julielab.de/Resources/Software/NLP+Tools.html
trainable tokenizer & sentence splitter
FreeLing (GPL)
http://www.lsi.upc.edu/~nlp/freeling
regexp tokenizer
(mainly for Catalan & Spanish?)
Alpino for Dutch (tokenization + sentence splitting)
http://www.let.rug.nl/vannoord/alp/Alpino/
Ellogon (LGPL)
http://www.ellogon.org
ChaSen for Japanese (successor: MeCab, see above)
http://chasen-legacy.sourceforge.jp/
MXPOST & MXTERMINATOR (research only!)
ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
trainable sentence splitter
Perl Sentence Segmentation
http://www.koders.com/perl/fidFCD2926AB83BCD7179772D521830DE9A226A6195.aspx?s=open#L38
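
Several of the tools above are described as rule-based, using regular expressions or abbreviation heuristics. As a rough illustration of that general idea (not the method of any particular tool in this list), here is a minimal Python sketch. The abbreviation set, the regular expressions, and the function names tokenize and split_sentences are all assumptions made up for this example; real tools ship much larger, language-specific resources.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "e.g", "i.e"}

# A token is a word (optionally joined by -, ' or . to further word parts)
# or a single non-space, non-word character such as punctuation.
TOKEN_RE = re.compile(r"\w+(?:[-'.]\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def split_sentences(text):
    """Split text at ., ! or ? unless a simple heuristic vetoes the boundary."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]+", text):
        end = match.end()
        following = text[end:]
        # Inside the text, a boundary must be followed by whitespace
        # and an uppercase letter.
        if following and not re.match(r"\s+[A-Z]", following):
            continue
        # A single period after a known abbreviation is not a boundary.
        words = text[start:match.start()].split()
        last_word = words[-1] if words else ""
        if match.group() == "." and last_word.lower() in ABBREVIATIONS:
            continue
        sentences.append(text[start:end].strip())
        start = end
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences


if __name__ == "__main__":
    sample = "Dr. Smith arrived at 3.15 p.m. yesterday. He met Prof. Jones! Was it good?"
    for sentence in split_sentences(sample):
        print(tokenize(sentence))
```

Heuristics like these fail on many inputs (for example, an abbreviation that really does end a sentence), which is one reason trainable splitters such as Punkt or the OpenNLP models exist.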
-----------------------
This list was compiled by Joerg Tiedemann.