Thursday, 30 December 2010
Friday, 10 December 2010
Sunday, 5 December 2010
NLP tools for Vietnamese
http://vspell.com/v4/spellcheck.aspx
2) VNSpeech
http://www.vnspeech.com/
3) JVnSegmenter: A Java-based Vietnamese Word Segmentation Tool
http://jvnsegmenter.sourceforge.net/
4) JVnTextPro: A Java-based Vietnamese Text Processing Tool
http://jvntextpro.sourceforge.net/
5) RDRPOSTagger: http://rdrpostagger.sourceforge.net/
6) ViWordNetwork (demo): http://112.137.129.215/ViWordNetwork
(to be continued)
Friday, 26 November 2010
MLcomp
http://mlcomp.org/
Thursday, 25 November 2010
Thursday, 18 November 2010
PaperMaker
Article: http://bioinformatics.oxfordjournals.org/content/26/7/982.full
Monday, 15 November 2010
Stanford CoreNLP
An all-in-one toolkit that includes:
- sentence splitter
- tokenizer
- part of speech tagger
- lemmatizer
- named entity recognizers (probabilistic and rule-based)
- numeric entity canonicalizer
- parser
- coreference system
Sunday, 14 November 2010
Friday, 12 November 2010
Wednesday, 10 November 2010
Monday, 8 November 2010
Searchable Parallel Corpora
1. CABAL: an online concordancer for contrastive linguistics
http://cabal.rezo.net/ (University of Poitiers)
English, French
About 200 articles are currently online (roughly 400,000 words). Most come from Le Monde diplomatique and date from 1998 to December 2003.
2. The CLUVI corpus:
http://sli.uvigo.es/CLUVI/
English, French, Spanish, Galician,
Corpus: UNESCO Corpus of English-Galician-French-
3. German(-English) parallel corpora (Europarl and German News)
http://corpus.leeds.ac.uk/
English, German
4. WebTCE (Translation Corpus Explorer)
http://khnt.hit.uib.no/webtce.
English, German, French, Spanish, Norwegian, Danish
5. EVROKORPUS Parallel corpora
http://evrokorpus.gov.si/
223 million words. English, French, German, Italian, Slovene and Spanish. Searches must involve Slovene and one other language.
6. TERMACOR terminology and corpus
http://evrokorpus.gov.si/k2/
98 million words in 22 European Languages. EU Commission data.
7. COMPARA Portuguese-English parallel corpus
http://www.linguateca.pt/
Three million words.
Portuguese, English
8. Termsearch
http://www.termsearch.info/ or a faster interface at:
http://www.bible-study-in-
English, French, Russian
Major international treaties, conventions, agreements, etc. 792 documents.
9. English-Inuktitut Parallel Corpus
http://www.inuktitutcomputing.
3.5 million words (of English), 1.5 million words of Inuktitut
English, Inuktitut (an Inuit Language of North-Eastern Canada)
10. English-Russian Parallel Corpus
http://ruscorpora.ru/search-
English, Russian, (some German?)
Interface only in Russian.
About 9 million words
13. http://corpus.consumer.es/
14. OPUS:
http://www.let.rug.nl/
http://www.let.rug.nl/
15. LinearB
16. MyMemory
http://mymemory.translated.
17. Natura corpora
http://linguateca.di.uminho.
18. Compara
http://www.linguateca.pt/
19. CLUVI
20. English-Chinese
http://score.crpp.nie.edu.sg/
22. English-Vietnamese
http://hellochao.com/
http://linkdict.com/Bilingual/
Tuesday, 2 November 2010
Tuesday, 19 October 2010
Saturday, 9 October 2010
Selected Papers at EMNLP'10
D10-1050 [bib]: Kristian Woodsend; Yansong Feng; Mirella Lapata
Title Generation with Quasi-Synchronous Grammar
D10-1049 [bib]: Gabor Angeli; Percy Liang; Dan Klein
A Simple Domain-Independent Probabilistic Approach to Generation
D10-1047 [bib]: Ahmet Aker; Trevor Cohn; Robert Gaizauskas
Multi-Document Summarization Using A* Search and Discriminative Learning
D10-1007 [bib]: Michael Paul; ChengXiang Zhai; Roxana Girju
Summarizing Contrastive Viewpoints in Opinionated Text
Keyphrase Extraction via Topic Modeling
D10-1036 [bib]: Zhiyuan Liu; Wenyi Huang; Yabin Zheng; Maosong Sun
Automatic Keyphrase Extraction via Topic Decomposition
Question Answering
D10-1010 [bib]: Razvan Bunescu; Yunfeng Huang
Learning the Relative Usefulness of Questions in Community QA
Information Extraction
D10-1099 [bib]: Limin Yao; Sebastian Riedel; Andrew McCallum
Collective Cross-Document Relation Extraction Without Labelled Data
D10-1107 [bib]: Quang Do; Dan Roth
Constraints Based Taxonomic Relation Classification
D10-1123 [bib]: Thomas Lin; Mausam; Oren Etzioni
Identifying Functional Relations in Web Text
Verb Selection in Language Learning
D10-1104 [bib]: Xiaohua Liu; Bo Han; Kuan Li; Stephan Hyeonjun Stiller; Ming Zhou
SRL-Based Verb Selection for ESL
Friday, 10 September 2010
Selected Papers at COLING'10
C10-1074 [bib]: Fangtao Li; Chao Han; Minlie Huang; Xiaoyan Zhu; Ying-Ju Xia; Shu Zhang; Hao Yu
Structure-Aware Review Mining and Summarization
C10-1101 [bib]: Vahed Qazvinian; Dragomir R. Radev; Arzucan Ozgur
Citation Summarization Through Keyphrase Extraction
C10-1111 [bib]: Chao Shen; Tao Li
Multi-Document Summarization via the Minimum Dominating Set
Information Extraction
C10-1018 [bib]: Yee Seng Chan; Dan Roth
Exploiting Background Knowledge for Relation Extraction
C10-2058 [bib]: Heng Ji
Challenges from Information Extraction to Information Fusion
C10-2041 [bib]: Brian Harrington
A Semantic Network Approach to Measuring Relatedness
(http://www.aclweb.org/anthology/C/C10/)
Just for fun: PhD study
http://matt.might.net/articles/phd-school-in-pictures/
10 easy ways to fail a Ph.D.
http://matt.might.net/articles/ways-to-fail-a-phd/
Thursday, 9 September 2010
Wednesday, 8 September 2010
Sunday, 5 September 2010
Friday, 3 September 2010
Just a note on email summarization
Situation 1: a new user asks a question that has already been partially or fully answered in one or more email threads on the mailing list. Would a question answering system or a summarizer help in this context?
Situation 2: a new user wants to search for a topic of interest. Is a retrieval system needed?
Situation 3: should such a mailing list classify its email threads?
Situation 4: TBA
Any suggestions?
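For Situation 2, one simple baseline would be to rank threads by TF-IDF cosine similarity to the user's query. The sketch below is only an illustration under that assumption: the function names (`tokenize`, `rank_threads`) and the thread ids are hypothetical, and a real system would need stemming, quoting removal, and proper evaluation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_threads(query, threads):
    """Return thread ids ordered by TF-IDF cosine similarity to the query.

    `threads` maps a (hypothetical) thread id to the concatenated thread text.
    Threads that share no term with the query are left out.
    """
    docs = {tid: Counter(tokenize(body)) for tid, body in threads.items()}
    n = len(docs)
    # Document frequency: in how many threads does each term occur?
    df = Counter(term for counts in docs.values() for term in counts)
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}

    def vector(counts):
        # Terms unseen in the collection are ignored.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vector(Counter(tokenize(query)))
    scored = [(cosine(q, vector(c)), tid) for tid, c in docs.items()]
    return [tid for score, tid in sorted(scored, reverse=True) if score > 0]
```

The same ranking could also serve Situation 1 as a first step: retrieve the most similar threads, then summarize or extract answers from them.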
--
Cheers,
Vu
Wednesday, 18 August 2010
Tuesday, 17 August 2010
Friday, 13 August 2010
Tuesday, 10 August 2010
Tokenization & Sentence Boundary Detection
http://gate.ac.uk/
MorphAdorner
http://morphadorner.northwestern.edu/
English only
"tokenize.pl" script from the WCDG parser:
http://nats-www.informatik.uni-hamburg.de/view/CDG/DownloadPage
(even de-hyphenation when used together with the parser's lexicon)
Java-based program, Segment
https://sourceforge.net/projects/segment/ (MIT-type licence)
SRX rules for sentence splitting, includes a library for
sentence splitting, which is used by LanguageTool and the Maligna
sentence aligner
C++ library in development (GPL)
Mecab (successor of Chasen)
http://mecab.sourceforge.net/
Japanese
Juman
http://www-lab25.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
includes dependency parsing etc
Japanese
IceNLP is open source
http://icenlp.sourceforge.net
tokenizer/sentence segmentizer module, part of IceNLP - a toolkit for
Icelandic
Lingua::PT::PLNbase
Portuguese
heuristics with names and standard abbreviations
http://www.cis.uni-muenchen.de/~wastl/misc/tokenizer.tgz
fast, rule-based, tokenizer + sentence boundary detector
German, Russian, English
Sebastian Nagel
Sentrick (GPLv3)
http://sourceforge.net/projects/sentrick/
sentence boundary detector for German, trainable
fullstop
http://hackage.haskell.org/package/fullstop
English sentence segmenter in Haskell
Grammatical Framework tool
http://hackage.haskell.org/package/toktok
MADA + TOKAN
http://www1.ccls.columbia.edu/~cadim/MADA.html
Arabic
Moses/Europarl tokenizer
http://www.statmt.org/wmt10/scripts.tgz
Europarl sentence splitter as Perl modules
http://code.google.com/p/corpus-tools/downloads/list
http://search.cpan.org/~achimru/Lingua-Sentence-1.01/lib/Lingua/Sentence.pm
Other Perl modules
http://search.cpan.org/~shlomoy/Lingua-EN-Sentence-0.25/lib/Lingua/EN/Sentence.pm
http://search.cpan.org/~holsten/Lingua-DE-Sentence-0.07/Sentence.pm
http://search.cpan.org/~shlomoy/Lingua-HE-Sentence-0.13/lib/Lingua/HE/Sentence.pm
Punkt
implemented in NLTK (Apache license)
http://www.nltk.org/
trainable (unsupervised)
existing models for different languages (?)
OpenNLP (GPL)
http://opennlp.sourceforge.net/
trainable tokenizer & sentence boundary detector
models available for English, German, Spanish, Thai
further models to come, wiki at:
https://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page
huntoken (License?)
http://mokk.bme.hu/resources/huntoken
mainly for Hungarian (?)
Jena NLP tools
http://www.julielab.de/Resources/Software/NLP+Tools.html
trainable tokenizer & sentence splitter
FreeLing (GPL)
http://www.lsi.upc.edu/~nlp/freeling
regexp tokenizer
(mainly for Catalan & Spanish?)
Alpino for Dutch (tokenization + sentence splitting)
http://www.let.rug.nl/vannoord/alp/Alpino/
Ellogon (LGPL)
http://www.ellogon.org
ChaSen for Japanese (successor: mecab (see above))
http://chasen-legacy.sourceforge.jp/
MXPOST & MXTERMINATOR (research only!)
ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
trainable sentence splitter
Perl Sentence Segmentation
http://www.koders.com/perl/fidFCD2926AB83BCD7179772D521830DE9A226A6195.aspx?s=open#L38
-----------------------
This list was compiled by Joerg Tiedemann.
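Several of the rule-based tools above rely on the same basic idea: treat sentence-final punctuation as a candidate boundary, then veto it with heuristics such as abbreviation lists. A minimal sketch of that idea follows, using only the Python standard library. It does not reproduce any of the listed tools; the abbreviation set and the function name `split_sentences` are made up for illustration.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; real tools ship far larger ones.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "e.g", "i.e", "etc", "vs"}

# Candidate boundary: one or more of . ! ? followed by whitespace.
_CANDIDATE = re.compile(r"([.!?]+)\s+")


def split_sentences(text):
    """Split text into sentences with two simple veto rules.

    A candidate boundary is rejected when
    1. the punctuation is a single period ending a known abbreviation, or
    2. the next character is not uppercase, a digit, or an opening quote/bracket.
    """
    sentences = []
    start = 0
    for match in _CANDIDATE.finditer(text):
        words = text[start:match.start(1)].split()
        last_word = words[-1].lower().strip("\"'([") if words else ""
        next_char = text[match.end():match.end() + 1]
        if match.group(1) == "." and last_word in ABBREVIATIONS:
            continue
        if next_char and not (
            next_char.isupper() or next_char.isdigit() or next_char in "\"'("
        ):
            continue
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Trainable systems such as Punkt learn the abbreviation list from raw text instead of hard-coding it, which is why they transfer to new languages more easily than a fixed rule set like this one.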
Monday, 19 July 2010
Friday, 9 July 2010
Wednesday, 7 July 2010
Thursday, 10 June 2010
Selected papers at ACL 2010
Pseudo-word for Phrase-based Machine Translation
Xiangyu Duan, Min Zhang and Haizhou Li
Learning Lexicalized Reordering Models from Reordering Graphs
Jinsong Su, Yang Liu, Yajuan Lv, Haitao Mi and Qun Liu
Filtering Syntactic Constraints for Statistical Machine Translation
Hailong Cao and Eiichiro Sumita
Error Detection for Statistical Machine Translation Using Linguistic Features
Deyi Xiong, Min Zhang and Haizhou Li
Boosting-based System Combination for Machine Translation
Tong Xiao, Jingbo Zhu, Muhua Zhu and Huizhen Wang
Bilingual Sense Similarity for Statistical Machine Translation
Boxing Chen, George Foster and Roland Kuhn
----
Summarization & Generation
A Risk Minimization Framework for Extractive Speech Summarization
Shih-Hsiang Lin and Berlin Chen
Entity-based local coherence modelling using topological fields
Jackie Chi Kit Cheung and Gerald Penn
Automatic Collocation Suggestion in Academic Writing
Jian-Cheng Wu, Yu-Chia Chang, Teruko Mitamura and Jason S. Chang
Identifying Non-Explicit Citing Sentences for Citation-Based Summarization
Vahed Qazvinian and Dragomir R. Radev
Automatic Generation of Story Highlights
Kristian Woodsend and Mirella Lapata
Plot Induction and Evolutionary Search for Story Generation
Neil McIntyre and Mirella Lapata
Metadata-Aware Measures for Answer Summarization in Community Question Answering
Mattia Tomasoni and Minlie Huang
A Hybrid Hierarchical Model for Multi-Document Summarization
Asli Celikyilmaz and Dilek Hakkani-Tur
A New Approach to Improving Multilingual Summarization using a Genetic Algorithm
Marina Litvak, Mark Last and Menahem Friedman
Cross-Language Document Summarization Based on Machine Translation Quality Prediction
Xiaojun Wan, Huiying Li and Jianguo Xiao
Generating image descriptions using dependency relational patterns
Ahmet Aker and Robert Gaizauskas
----
Information Extraction
Open Information Extraction Using Wikipedia
Fei Wu and Daniel S. Weld
----
Corpus
The Human Language Project: Building a Universal Corpus of the World’s Languages
Steven Abney and Steven Bird
(see full list of papers at http://nlp.csie.ncnu.edu.tw/~shin/acl2010/proceedings/CDROM/ACL/index.html)
Wednesday, 9 June 2010
Topic summarization
--> I think this research problem is still open and very challenging. It requires advanced processing that combines several areas of AI, such as NLP, IR, and IE.
Some initial work (including mine) is as follows:
1) Scientific Paper Summarization Using Citation Summary Networks by Qazvinian V. et al. (COLING 2008).
--> this work targets only single-article summarization, using a clustering approach based on citation summary networks.
2) Generating surveys of scientific paradigms by Saif Mohammad et al. (NAACL 2009).
--> this work explores the usefulness of citation summaries compared to summaries built from abstracts or from the full text of articles.
3) Towards Automated Related Work Summarization by Cong Duy Vu HOANG et al. (COLING 2010)
--> this work does not use citation summaries; instead it tries to exploit the full text of articles to generate a related work summary.
It makes the strong assumption that each related work summary follows a topic hierarchy tree, which is provided as input to the summarization system. The system then applies two different strategies (general and specific content summarization), derived from a manual rhetorical analysis of how humans use a topic hierarchy tree to write related work summaries.
4) Identifying Non-Explicit Citing Sentences for Citation-Based Summarization by Vahed Qazvinian and Dragomir R. Radev (ACL 2010)
--> TBA
5) Context Identification of Sentences in Related Work Sections using a Conditional Random Field: Towards Intelligent Digital Libraries by Angrosh M. A. et al. (JCDL 2010)
6) Imitating Human Literature Review Writing: An Approach to Multi-document Summarization by Jaidka K. et al. (ICADL 2010)
7) Analysis of the Macro-Level Discourse Structure of Literature Reviews by Jaidka K. et al. (Online Information Review)
8) Ultimate Research Assistant: http://ultimate-research-assistant.com/
9) iResearch Reporter: http://iresearch-reporter.com/
10) TBA
Future work (what comes to mind right now) includes:
- Given a research topic --> automatically generate a topic hierarchy tree of that topic.
- A systematic comparison of summaries built from citations, abstracts, full text of articles. Which ones are more useful to users?
- An initial add-in component integrated into online ACL anthology system.
- Other ways to improve summarization performance (e.g. using rhetorical discourse analysis, ...)
- ...
--
Cheers,
Vu
Tuesday, 8 June 2010
Friday, 4 June 2010
Machine Translation Books
1) "Statistical Machine Translation" by Philipp Koehn
--> will try to buy this book if I have money hehe :D.
2) TBA
Just an immediate thought about constraints in SMT decoding
http://www.cs.cmu.edu/~nbach/papers/naacl2009-cohesion-8pages.pdf
I wonder whether other constraints (e.g. reordering, chunking, translation boundaries, ...) can be integrated into beam-search phrasal SMT decoding (try to validate this later!).
Reordering:
http://www.spencegreen.com/pubs/naacl10-distortion.pdf
http://acl.ldc.upenn.edu/J/J03/J03-1005.pdf
(Linguistically Annotated Reordering: http://www1.i2r.a-star.edu.sg/~dyxiong/paper/cl-preprint.pdf)
--> Idea: given a set of reordering rules manually annotated by humans, the question is how to use them during SMT decoding.
Translation Boundaries: http://aclweb.org/anthology-new/N/N10/N10-1016.pdf
Chunking: http://www.mt-archive.info/MTS-2009-Yahyaei.pdf
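To make the idea concrete, here is a toy sketch of how such constraints could plug into a stack-based beam search: each constraint is a predicate that can veto a hypothesis expansion. This is not the method of any paper linked above. The phrase table, the scores, and all names (`decode`, `distortion_limit`, `TOY_TABLE`) are hypothetical, and there is no language model or distortion cost, so it only illustrates where a constraint hook would sit.

```python
from collections import namedtuple

# covered: source positions already translated; last_end: end of the last source
# phrase used; output: target phrases so far; score: sum of phrase scores.
Hypothesis = namedtuple("Hypothesis", "covered last_end output score")

# Hypothetical phrase table with made-up log-probability-like scores.
TOY_TABLE = {
    ("er",): [("he", -0.1)],
    ("geht",): [("goes", -0.2)],
    ("nach",): [("to", -0.5)],
    ("hause",): [("house", -0.9)],
    ("nach", "hause"): [("home", -0.3)],
}


def distortion_limit(limit):
    """Reordering constraint: the next source phrase must start within
    `limit` words of the end of the previously translated phrase."""
    def allowed(hyp, start, end):
        return abs(start - hyp.last_end) <= limit
    return allowed


def decode(source, phrase_table, constraints=(), beam_size=5, max_phrase_len=3):
    """Return (translation, score) for the best complete hypothesis, or None."""
    n = len(source)
    # stacks[k] holds hypotheses that cover exactly k source words.
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append(Hypothesis(frozenset(), 0, (), 0.0))
    for k in range(n):
        # Histogram pruning: keep only the best `beam_size` hypotheses.
        stacks[k] = sorted(stacks[k], key=lambda h: h.score, reverse=True)[:beam_size]
        for hyp in stacks[k]:
            for start in range(n):
                for end in range(start + 1, min(n, start + max_phrase_len) + 1):
                    span = set(range(start, end))
                    if span & hyp.covered:
                        continue  # overlaps words that are already translated
                    # Constraint hook: any predicate can veto this expansion.
                    if not all(ok(hyp, start, end) for ok in constraints):
                        continue
                    options = phrase_table.get(tuple(source[start:end]), [])
                    for target, score in options:
                        new = Hypothesis(hyp.covered | span, end,
                                         hyp.output + (target,), hyp.score + score)
                        stacks[len(new.covered)].append(new)
    if not stacks[n]:
        return None
    best = max(stacks[n], key=lambda h: h.score)
    return " ".join(best.output), best.score
```

With `constraints=[distortion_limit(0)]` the search becomes monotone, and `decode("er geht nach hause".split(), TOY_TABLE, constraints=[distortion_limit(0)])` yields the translation "he goes home". Chunking or translation-boundary constraints could in principle be written as further predicates over `(hyp, start, end)`, though whether hard vetoes or soft features work better is exactly the open question.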
--
Cheers,
Vu
Thursday, 3 June 2010
Thursday, 27 May 2010
Selected papers at NAACL 2010
N10-1140 [bib]: Michel Galley; Christopher D. Manning
Accurate Non-Hierarchical Phrase-Based Translation
N10-1141 [bib]: John DeNero; Shankar Kumar; Ciprian Chelba; Franz Och
Model Combination for Machine Translation
N10-1127 [bib]: Niyu Ge
A Direct Syntax-Driven Reordering Model for Phrase-Based Machine Translation
N10-1016 [bib]: Deyi Xiong; Min Zhang; Haizhou Li
Learning Translation Boundaries for Phrase-Based Decoding
Sentence Fusion
N10-1044 [bib]: Kathleen McKeown; Sara Rosenthal; Kapil Thadani; Coleman Moore
Time-Efficient Creation of an Accurate Sentence Fusion Corpus
Summarization
N10-1100 [bib]: Beaux Sharifi; Mark-Anthony Hutton; Jugal Kalita
Summarizing Microblogs Automatically
Question Answering
N10-1007 [bib]: Taniya Mishra; Srinivas Bangalore
Qme! : A Speech-based Question-Answering system on Mobile Devices
Monday, 12 April 2010
Sunday, 28 March 2010
Monday, 22 March 2010
Monday, 8 March 2010
Social Question Answering
The feature I like most in that system is real-time updating: questions posted to the online system may be answered in real time.
More information about Aardvark is in this blog entry (http://mendicantbug.com/2009/03/13/aardvark-and-social-qa/)
That's actually an interesting research direction.
--
Cheers,
Vu
Wednesday, 3 March 2010
Wikipedia issues
Dumps of Wikipedia: http://download.wikimedia.org
Extraction of plain text corpus from Wikipedia: http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-from-wikipedia/
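As a rough illustration of what such an extraction involves, the sketch below streams pages out of a MediaWiki XML export and strips a little of the wiki markup. It is not the procedure from the linked post. The function names are hypothetical, and the regular expressions only handle non-nested templates, simple links, and bold/italic quotes; real dumps need a far more thorough cleaner.

```python
import re
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the XML namespace, which differs between export versions."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_file):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, ""
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # release the finished page's content


def strip_markup(wikitext):
    """Very crude wiki markup removal (an approximation, not a full parser)."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # non-nested templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                              # bold/italic quote runs
    return re.sub(r"\s+", " ", text).strip()
```

`iter_pages` accepts any binary file object, so a compressed dump could presumably be read by passing the result of `bz2.open(path)` from the standard library instead of decompressing the file first.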
--
Cheers,
Vu
Sunday, 21 February 2010
Friday, 29 January 2010
Monday, 25 January 2010
Saturday, 23 January 2010
Machine Reading
Video at videolectures.net: http://videolectures.net/wsdm08_etzioni_mrws/
Demo of KnowItAll: http://www.cs.washington.edu/research/knowitall/
http://www.cs.washington.edu/homes/pjallen/aaaiss07/schedule.htm (Machine Reading symposium). Link to download papers here.
The research in Machine Reading may be relevant to the research in the project "Read the Web" (see here).
--
Cheers,
Vu
Read the Web Project at CMU
I should track this project regularly.
--
Cheers,
Vu
Sunday, 17 January 2010
Online resources for English syntax
http://faculty.washington.edu/dillon/GramResources/GramResources.html
--
Cheers,
Vu
Wednesday, 6 January 2010
Anaphora Resolution Tool
BART: http://www.assembla.com/wiki/show/bart-coref (with guide)
Besides the above, I should also look at this talk.
--
Cheers,
Vu