Sunday, 5 December 2010

NLP tools for Vietnamese

1) Vietnamese Spell Checker

2) VNSpeech

3) JVnSegmenter: A Java-based Vietnamese Word Segmentation Tool

4) JVnTextPro: A Java-based Vietnamese Text Processing Tool

5) RDRPOSTagger:

6) ViWordNetwork (demo):

(to be continued)

Graph visualization

Friday, 26 November 2010


MLcomp is a free website for objectively comparing machine learning programs across various datasets for multiple problem domains.

Resources for Text, Speech and Language Processing

Thursday, 25 November 2010

Thursday, 18 November 2010

MultiTree - A Digital Library of Language Relationships

(looks quite beautiful! :D)




Abstract: PaperMaker is a novel IT solution that receives a scientific manuscript via a Web interface, automatically analyses the publication, evaluates consistency parameters and interactively delivers feedback to the author. It analyses the proper use of acronyms and their definitions, and the use of specialized terminology. It provides Gene Ontology (GO) and Medline Subject Headings (MeSH) categorization of text passages, the retrieval of relevant publications from public scientific literature repositories, and the identification of missing or unused references.

Monday, 15 November 2010

Stanford CoreNLP

An all-in-one package that includes:
- sentence splitter
- tokenizer
- part of speech tagger
- lemmatizer
- named entity recognizers (probabilistic and rule-based)
- numeric entity canonicalizer
- parser
- coreference system

Gensim – Python Framework for Vector Space Modelling

Monday, 8 November 2010

Searchable Parallel Corpora

1. CABAL: an online concordancer for contrastive linguistics (University of Poitiers)

English, French

About 200 articles are currently online (roughly 400,000 words). Most come from Le Monde diplomatique and date from 1998 to December 2003.

2. The CLUVI corpus:

English, French, Spanish, Galician

Corpus: UNESCO Corpus of English-Galician-French-Spanish scientific-technical divulgation

3. German(-English) parallel corpora (Europarl and German News)

English, German

4. WebTCE (Translation Corpus Explorer)

English, German, French, Spanish, Norwegian, Danish

5. EVROKORPUS Parallel corpora

223 million words. English, French, German, Italian, Slovene and Spanish. Searches must involve Slovene and one other language.

6. TERMACOR terminology and corpus

98 million words in 22 European Languages. EU Commission data.

7. COMPARA Portuguese-English parallel corpus

Three million words.

Portuguese, English

8. Termsearch or a faster interface at:

English, French, Russian

Major international treaties, conventions, agreements, etc. 792 documents.

9. English-Inuktitut Parallel Corpus

3.5 million words (of English), 1.5 million words of Inuktitut

English, Inuktitut (an Inuit language of north-eastern Canada)

10. English-Russian Parallel Corpus

English, Russian, (some German?)

Interface only in Russian.

About 9 million words




14. OPUS:

15. LinearB

16. MyMemories

17. Natura corpora

18. Compara


20. English-Chinese

21. TransSearch

22. English-Vietnamese

23. WebLitera - free international online book library

(to be updated)

Saturday, 9 October 2010

Selected Papers at EMNLP'10

Summarization & Generation

D10-1050 [bib]: Kristian Woodsend; Yansong Feng; Mirella Lapata
Title Generation with Quasi-Synchronous Grammar

D10-1049 [bib]: Gabor Angeli; Percy Liang; Dan Klein
A Simple Domain-Independent Probabilistic Approach to Generation

D10-1047 [bib]: Ahmet Aker; Trevor Cohn; Robert Gaizauskas
Multi-Document Summarization Using A* Search and Discriminative Learning

D10-1007 [bib]: Michael Paul; ChengXiang Zhai; Roxana Girju
Summarizing Contrastive Viewpoints in Opinionated Text

Keyphrase Extraction via Topic Modeling

D10-1036 [bib]: Zhiyuan Liu; Wenyi Huang; Yabin Zheng; Maosong Sun
Automatic Keyphrase Extraction via Topic Decomposition

Question Answering

D10-1010 [bib]: Razvan Bunescu; Yunfeng Huang
Learning the Relative Usefulness of Questions in Community QA

Information Extraction

D10-1099 [bib]: Limin Yao; Sebastian Riedel; Andrew McCallum
Collective Cross-Document Relation Extraction Without Labelled Data

D10-1107 [bib]: Quang Do; Dan Roth
Constraints Based Taxonomic Relation Classification

D10-1123 [bib]: Thomas Lin; Mausam; Oren Etzioni
Identifying Functional Relations in Web Text

Verb Selection in Language Learning

D10-1104 [bib]: Xiaohua Liu; Bo Han; Kuan Li; Stephan Hyeonjun Stiller; Ming Zhou
SRL-Based Verb Selection for ESL


Friday, 10 September 2010

Selected Papers at COLING'10

Summarization & Generation

C10-1074 [bib]: Fangtao Li; Chao Han; Minlie Huang; Xiaoyan Zhu; Ying-Ju Xia; Shu Zhang; Hao Yu
Structure-Aware Review Mining and Summarization

C10-1101 [bib]: Vahed Qazvinian; Dragomir R. Radev; Arzucan Ozgur
Citation Summarization Through Keyphrase Extraction

C10-1111 [bib]: Chao Shen; Tao Li
Multi-Document Summarization via the Minimum Dominating Set

Information Extraction

C10-1018 [bib]: Yee Seng Chan; Dan Roth
Exploiting Background Knowledge for Relation Extraction

C10-2058 [bib]: Heng Ji
Challenges from Information Extraction to Information Fusion

C10-2041 [bib]: Brian Harrington
A Semantic Network Approach to Measuring Relatedness


Just fun about PhD study

The illustrated guide to a Ph.D.

10 easy ways to fail a Ph.D.

Friday, 3 September 2010

Just a note on email summarization

Imagine that we have a set of mailing lists (e.g. Dbworld, Corpora-List, Linguist, BioNLP, moses-support, and so on). Each of these mailing lists contains a large body of questions and answers from domain experts or semi-experts. A few situations can arise:
Situation 1: a new user asks a question that has already been partially or fully answered in one or more email threads of the mailing list. Do we need a question answering system or a summarizer in this context?
Situation 2: a new user wants to search for a topic of interest. Is a retrieval system needed?
Situation 3: does such a mailing list need a classification of its email threads?
Situation 4: TBA
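
As a rough starting point for Situations 1 and 2, here is a minimal sketch of TF-IDF retrieval over email threads, assuming each thread is available as one plain-text string. The function names, the smoothed IDF formula and the sample threads are all assumptions made for illustration; a real system would also need thread reconstruction, quoting removal and a proper evaluation.

```python
import math
import re
from collections import Counter

# Hypothetical sample threads, made up for this sketch.
THREADS = [
    "How do I train a phrase table with Moses? Use train-model.perl with a parallel corpus.",
    "Call for papers: workshop on biomedical text mining.",
    "Which sentence splitter works for German? Try a trainable sentence boundary detector.",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(threads):
    """Return one TF-IDF vector (a dict) per thread, plus the IDF table."""
    docs = [Counter(tokenize(t)) for t in threads]
    # Document frequency: iterating a Counter yields each distinct term once.
    df = Counter(term for doc in docs for term in doc)
    n = len(docs)
    # Smoothed IDF (an assumed variant) so no term gets a zero weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, threads, top_k=3):
    """Return (thread index, score) pairs for the best-matching threads."""
    vectors, idf = build_index(threads)
    counts = Counter(tokenize(query))
    # Query terms that never occur in any thread are ignored.
    q_vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, round(score, 3)) for score, i in scored[:top_k] if score > 0]


if __name__ == "__main__":
    for index, score in search("sentence splitter for German", THREADS):
        print(score, THREADS[index])
```

The ranked threads could then be passed to a summarizer (Situation 1) or shown directly as search results (Situation 2).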

Any suggestions?


LingPipe - a toolkit for processing text using computational linguistics

Tuesday, 10 August 2010

Tokenization & Sentence Boundary Detection

variety of tokenizers and splitters (generic & language specific)

English only

"" script from the WCDG parser:
(even de-hyphenation when used together with the parser's lexicon)

Java-based program, Segment (MIT-type licence)
SRX rules for sentence splitting, includes a library for
sentence splitting, which is used by LanguageTool and the Maligna
sentence aligner
C++ library in development (GPL)

Mecab (successor of Chasen)

includes dependency parsing etc

IceNLP is open source
tokenizer/sentence segmentizer module, part of IceNLP - a toolkit for

heuristics with names and standard abbreviations
fast, rule-based, tokenizer + sentence boundary detector
German, Russian, English
Sebastian Nagel

SentTrick (GPLv3)
sentence boundary detector for German, trainable

English sentence segmenter in Haskell

Grammatical Framework tool


Moses/Europarl tokenizer

Europarl sentence splitter as Perl modules

Other Perl modules

implemented in NLTK (Apache license)
trainable (unsupervised)
existing models for different languages (?)

trainable tokenizer & sentence boundary detector
models available for English, German, Spanish, Thai
further models to come, wiki at:

huntoken (License?)
mainly for Hungarian (?)

Jena NLP tools
trainable tokenizer & sentence splitter

FreeLing (GPL)
regexp tokenizer
(mainly for Catalan & Spanish?)

Alpino for Dutch (tokenization + sentence splitting)

Ellogon (LGPL)

ChaSen for Japanese (successor: mecab (see above))

MXPOST & MXTERMINATOR (research only!)
trainable sentence splitter

Perl Sentence Segmentation

The list is compiled by Joerg Tiedemann.
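
To illustrate the "rule-based tokenizer + sentence boundary detector with abbreviation heuristics" idea that several entries in the list mention, here is a minimal sketch in plain Python. It is not the code of any tool listed above: the abbreviation set, the function names and the rules are assumptions chosen for the example, and real tools use far larger lexicons or trained models.

```python
import re

# Illustrative abbreviation list (an assumption; real tools ship much larger ones).
ABBREVIATIONS = {"dr", "prof", "mr", "mrs", "ms", "e.g", "i.e", "etc", "vs"}


def split_sentences(text):
    """Split text into sentences using punctuation and abbreviation heuristics."""
    chunks = text.split()
    sentences, current = [], []
    for i, chunk in enumerate(chunks):
        current.append(chunk)
        # Candidate boundary: ., ! or ? optionally followed by closing quotes/brackets.
        if not re.search(r"[.!?][\"')\]]*$", chunk):
            continue
        core = chunk.rstrip("\"')]")
        if core.endswith("."):
            word = core[:-1].lstrip("\"'([").lower()
            # Known abbreviations and single-letter initials do not end a sentence.
            if word in ABBREVIATIONS or re.fullmatch(r"[a-z]", word):
                continue
        nxt = chunks[i + 1] if i + 1 < len(chunks) else ""
        # Require the next chunk to look like a sentence start.
        if nxt and not (nxt[0].isupper() or nxt[0].isdigit() or nxt[0] in "\"'(["):
            continue
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    tokens = []
    for chunk in sentence.split():
        # Keep known abbreviations together with their final period.
        if chunk.endswith(".") and chunk[:-1].lower() in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        # Words may contain internal hyphens, apostrophes or periods.
        tokens.extend(re.findall(r"\w+(?:[-'.]\w+)*|[^\w\s]", chunk))
    return tokens


if __name__ == "__main__":
    sample = ("Dr. Smith met Prof. Jones at 5 p.m. on Monday. "
              "They discussed e.g. tokenization! Was it useful?")
    for sentence in split_sentences(sample):
        print(tokenize(sentence))
```

The trainable tools in the list replace hand-written rules like these with statistics learned from text, which usually generalizes better across languages and domains.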

Thursday, 10 June 2010

Selected papers at ACL 2010

Machine Translation

Pseudo-word for Phrase-based Machine Translation
Xiangyu Duan, Min Zhang and Haizhou Li

Learning Lexicalized Reordering Models from Reordering Graphs
Jinsong Su, Yang Liu, Yajuan Lv, Haitao Mi and Qun Liu

Filtering Syntactic Constraints for Statistical Machine Translation
Hailong Cao and Eiichiro Sumita

Error Detection for Statistical Machine Translation Using Linguistic Features
Deyi Xiong, Min Zhang and Haizhou Li

Boosting-based System Combination for Machine Translation
Tong Xiao, Jingbo Zhu, Muhua Zhu and Huizhen Wang

Bilingual Sense Similarity for Statistical Machine Translation
Boxing Chen, George Foster and Roland Kuhn

Summarization & Generation

A Risk Minimization Framework for Extractive Speech Summarization
Shih-Hsiang Lin and Berlin Chen

Entity-based local coherence modelling using topological fields
Jackie Chi Kit Cheung and Gerald Penn

Automatic Collocation Suggestion in Academic Writing
Jian-Cheng Wu, Yu-Chia Chang, Teruko Mitamura and Jason S. Chang

Identifying Non-Explicit Citing Sentences for Citation-Based Summarization
Vahed Qazvinian and Dragomir R. Radev

Automatic Generation of Story Highlights
Kristian Woodsend and Mirella Lapata

Plot Induction and Evolutionary Search for Story Generation
Neil McIntyre and Mirella Lapata

Metadata-Aware Measures for Answer Summarization in Community Question Answering
Mattia Tomasoni and Minlie Huang

A Hybrid Hierarchical Model for Multi-Document Summarization
Asli Celikyilmaz and Dilek Hakkani-Tur

A new Approach to Improving Multilingual Summarization using a Genetic Algorithm
Marina Litvak, Mark Last and Menahem Friedman

Cross-Language Document Summarization Based on Machine Translation Quality Prediction
Xiaojun Wan, Huiying Li and Jianguo Xiao

Generating image descriptions using dependency relational patterns
Ahmet Aker and Robert Gaizauskas

Information Extraction

Open Information Extraction Using Wikipedia
Fei Wu and Daniel S. Weld


The Human Language Project: Building a Universal Corpus of the World’s Languages
Steven Abney and Steven Bird

(see full list of papers at

Wednesday, 9 June 2010

Topic summarization

Consider a scenario in which a system takes a research topic as input and must automatically generate a summary of the related work on that topic.
--> I think this research problem is still open and actually very challenging. It requires advanced processing that combines many fields of AI, such as NLP, IR, IE, ...

Some initial work (including mine) is as follows:
1) Scientific Paper Summarization Using Citation Summary Networks by Qazvinian V. et al. (COLING 2008).
--> this work only targets single article summarization using a clustering approach based on citation summary networks.
2) Generating surveys of scientific paradigms by Saif Mohammad et al. (NAACL 2009).
--> this work explores the usefulness of citation summaries compared with summaries built from the abstracts or full text of articles.
3) Towards Automated Related Work Summarization by Cong Duy Vu HOANG et al. (COLING 2010)
--> this work does not use citation summaries but tries to take advantage of the full text of articles when generating a related work summary.
It makes the strong assumption that each related work summary follows a topic hierarchy tree, which is provided as input to the summarization system. The system then proposes two different strategies (general and specific content summarization), based on a manual rhetorical analysis of how humans use a topic hierarchy tree to write a related work summary.
4) Identifying Non-Explicit Citing Sentences for Citation-Based Summarization by Vahed Qazvinian and Dragomir R. Radev (ACL 2010)
--> TBA
5) Context Identification of Sentences in Related Work Sections using a Conditional Random Field: Towards Intelligent Digital Libraries by Angrosh M. A. et al. (JCDL 2010)
6) Imitating Human Literature Review Writing: An Approach to Multi-document Summarization by Jaidka K. et al. (ICADL 2010)
7) Analysis of the Macro-Level Discourse Structure of Literature Reviews by Jaidka K. et al. (Online Information Review)
8) Ultimate Research Assistant:
9) iResearch Reporter:
10) TBA

Future work (what comes to my mind now) includes:
- Given a research topic --> automatically generate a topic hierarchy tree for that topic.
- A systematic comparison of summaries built from citations, abstracts and the full text of articles. Which ones are more useful to users?
- An initial add-in component integrated into the online ACL Anthology system.
- Other ways to improve summarization performance (e.g. using rhetorical discourse analysis, ...)
- ...

Friday, 4 June 2010

Machine Translation Books

Just want to collect books related to machine translation:

1) "Statistical Machine Translation" by Philipp Koehn
--> will try to buy this book if I have money hehe :D.

2) TBA

Just an immediate thought about constraints in SMT decoding

Cohesive constraints in Beam Search Phrasal Decoding

I wonder whether other constraints (e.g. reordering, chunking, translation boundaries, ...) can be integrated into beam search phrasal SMT decoding (will try to validate this later!).
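
To make the idea concrete, here is a toy sketch of how hard constraints could be plugged into a beam search phrasal decoder as simple predicates over a hypothesis and a candidate source span. It is only an illustration under strong assumptions: there is no language model, no hypothesis recombination and no future cost estimation, the score is just the sum of phrase log-probabilities minus a linear distortion penalty, and every name and number below (the phrase table, `decode`, `distortion_limit`, `respect_boundaries`) is hypothetical rather than taken from any paper or decoder.

```python
from typing import NamedTuple, Tuple


class Hypothesis(NamedTuple):
    covered: frozenset        # source positions translated so far
    last_end: int             # source index just after the last translated phrase
    output: Tuple[str, ...]   # target words produced so far
    score: float              # phrase log-probabilities minus distortion penalties


def distortion_limit(limit):
    """Constraint: the next phrase may start at most `limit` words from last_end."""
    def check(hyp, start, end):
        return abs(start - hyp.last_end) <= limit
    return check


def respect_boundaries(boundaries):
    """Constraint: a phrase span [start, end) must not cross a boundary position."""
    def check(hyp, start, end):
        return not any(start < b < end for b in boundaries)
    return check


def decode(source, phrase_table, constraints=(), beam_size=5,
           max_phrase_len=3, distortion_weight=0.5):
    """Return the best complete hypothesis, or None if the constraints block all."""
    n = len(source)
    # stacks[k] holds hypotheses that cover exactly k source words.
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append(Hypothesis(frozenset(), 0, (), 0.0))
    for size in range(n):
        # Histogram pruning: expand only the beam_size best hypotheses.
        beam = sorted(stacks[size], key=lambda h: -h.score)[:beam_size]
        for hyp in beam:
            for start in range(n):
                for end in range(start + 1, min(n, start + max_phrase_len) + 1):
                    span = frozenset(range(start, end))
                    if span & hyp.covered:
                        continue
                    options = phrase_table.get(tuple(source[start:end]), [])
                    # Every constraint must accept the extension.
                    if not options or not all(c(hyp, start, end) for c in constraints):
                        continue
                    penalty = distortion_weight * abs(start - hyp.last_end)
                    for target, logp in options:
                        stacks[size + end - start].append(Hypothesis(
                            hyp.covered | span, end,
                            hyp.output + tuple(target.split()),
                            hyp.score + logp - penalty))
    return max(stacks[n], key=lambda h: h.score) if stacks[n] else None


# Made-up phrase table: source phrase -> list of (target phrase, log-probability).
PHRASES = {
    ("er",): [("he", -0.1)],
    ("geht",): [("goes", -0.2)],
    ("nach",): [("to", -0.3)],
    ("hause",): [("home", -0.2)],
    ("nach", "hause"): [("home", -0.1)],
    ("geht", "nach"): [("goes to", -0.05)],
}

if __name__ == "__main__":
    src = "er geht nach hause".split()
    print(decode(src, PHRASES))
    print(decode(src, PHRASES, constraints=[respect_boundaries({2})]))
```

In this toy setup a translation boundary placed before "nach" forbids the phrase pair "geht nach", so the constrained search has to pick a different segmentation. Reordering rules or chunk information could be expressed as further predicates with the same signature.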

(Linguistically Annotated Reordering:
--> Idea: given a set of reordering rules manually annotated by humans, the question is how to use them during SMT decoding?

Translation Boundaries:



Thursday, 27 May 2010

Selected papers at NAACL 2010

Machine Translation

N10-1140 [bib]: Michel Galley; Christopher D. Manning
Accurate Non-Hierarchical Phrase-Based Translation

N10-1141 [bib]: John DeNero; Shankar Kumar; Ciprian Chelba; Franz Och
Model Combination for Machine Translation

N10-1127 [bib]: Niyu Ge
A Direct Syntax-Driven Reordering Model for Phrase-Based Machine Translation

N10-1016 [bib]: Deyi Xiong; Min Zhang; Haizhou Li
Learning Translation Boundaries for Phrase-Based Decoding

Sentence Fusion

N10-1044 [bib]: Kathleen McKeown; Sara Rosenthal; Kapil Thadani; Coleman Moore
Time-Efficient Creation of an Accurate Sentence Fusion Corpus


N10-1100 [bib]: Beaux Sharifi; Mark-Anthony Hutton; Jugal Kalita
Summarizing Microblogs Automatically

Question Answering

N10-1007 [bib]: Taniya Mishra; Srinivas Bangalore
Qme! : A Speech-based Question-Answering system on Mobile Devices

Monday, 8 March 2010

Social Question Answering - a new social question answering system!

The feature I like most in that system is real-time updating: questions posted to the system may be answered in real time.

More information about Aardvark is in this blog entry (

That's actually an interesting research direction.


Wednesday, 3 March 2010

Wikipedia issues

All Wikipedia related issues will be posted here:

Dumps of Wikipedia:
Extraction of plain text corpus from Wikipedia:


Saturday, 23 January 2010

Machine Reading (Machine Reading paper)
Video at
Demo of KnowItAll: (Machine Reading symposium). Link to download papers here.

The research in Machine Reading may be relevant to the research in the project "Read the Web" (see here).


Read the Web Project at CMU

The "Read the Web" project, which is being undertaken by researchers at CMU (e.g. Prof. Tom Mitchell):

I should track this project regularly.