Thursday, 30 December 2010
Friday, 10 December 2010
Sunday, 5 December 2010
3) JVnSegmenter: A Java-based Vietnamese Word Segmentation Tool
4) JVnTextPro: A Java-based Vietnamese Text Processing Tool
5) RDRPOSTagger: http://rdrpostagger.sourceforge.net/
6) ViWordNetwork (demo): http://126.96.36.199/ViWordNetwork
(to be continued)
Friday, 26 November 2010
Thursday, 25 November 2010
Thursday, 18 November 2010
Monday, 15 November 2010
- sentence splitter
- part of speech tagger
- named entity recognizers (probabilistic and rule-based)
- numeric entity canonicalizer
- coreference system
Sunday, 14 November 2010
Friday, 12 November 2010
Wednesday, 10 November 2010
Monday, 8 November 2010
http://cabal.rezo.net/ (University of Poitiers)
About 200 articles are currently online (roughly 400,000 words). Most come from Le Monde diplomatique and date from 1998 to December 2003.
2. The CLUVI corpus:
English, French, Spanish, Galician,
Corpus: UNESCO Corpus of English-Galician-French-
3. German(-English) parallel corpora (Europarl and German News)
4. WebTCE (Translation Corpus Explorer)
English, German, French, Spanish, Norwegian, Danish
5. EVROKORPUS Parallel corpora
223 million words. English, French, German, Italian, Slovene and Spanish. Searches must involve Slovene and one other language.
6. TERMACOR terminology and corpus
98 million words in 22 European Languages. EU Commission data.
7. COMPARA Portuguese-English parallel corpus
Three million words.
http://www.termsearch.info/ or a faster interface at:
English, French, Russian
Major international treaties, conventions, agreements, etc. 792 documents.
9. English-Inuktitut Parallel Corpus
3.5 million words (of English), 1.5 million words of Inuktitut
English, Inuktitut (an Inuit Language of North-Eastern Canada)
10. English-Russian Parallel Corpus
English, Russian, (some German?)
Interface only in Russian.
About 9 million words
17. Natura corpora
Tuesday, 2 November 2010
Tuesday, 19 October 2010
Saturday, 9 October 2010
D10-1047 [bib]: Ahmet Aker; Trevor Cohn; Robert Gaizauskas
Multi-Document Summarization Using A* Search and Discriminative Learning
Keyphrase Extraction via Topic Modeling
Verb Selection in Language Learning
Friday, 10 September 2010
C10-1074 [bib]: Fangtao Li; Chao Han; Minlie Huang; Xiaoyan Zhu; Ying-Ju Xia; Shu Zhang; Hao Yu
Structure-Aware Review Mining and Summarization
C10-1101 [bib]: Vahed Qazvinian; Dragomir R. Radev; Arzucan Ozgur
Citation Summarization Through Keyphrase Extraction
C10-1111 [bib]: Chao Shen; Tao Li
Multi-Document Summarization via the Minimum Dominating Set
C10-1018 [bib]: Yee Seng Chan; Dan Roth
Exploiting Background Knowledge for Relation Extraction
10 easy ways to fail a Ph.D.
Thursday, 9 September 2010
Wednesday, 8 September 2010
Sunday, 5 September 2010
Friday, 3 September 2010
Situation 1: a new user raises a question that has already been partially or fully answered in one or more email threads on the mailing list. Would a question answering system or a summarizer help in this context?
Situation 2: a new user wants to search for a topic of interest. Is a retrieval system needed?
Situation 3: does such a mailing list need a classification of its email threads?
Situation 4: TBA
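For Situation 2, here is a minimal sketch of what a retrieval system over the mailing list could look like. It assumes each email thread is already available as one plain-text string, and it uses plain TF-IDF with cosine similarity. The function names and the sample threads are made up for illustration and do not come from any existing system.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(threads):
    """Return (tf-idf vectors, idf table) for a list of thread texts."""
    docs = [Counter(tokenize(t)) for t in threads]
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(d.keys())
    # Smoothed idf so that terms occurring in every thread keep a positive weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in d.items()} for d in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def search(query, threads, top_k=3):
    """Return up to top_k (thread index, score) pairs with a non-zero score."""
    vectors, idf = build_index(threads)
    q = Counter(tokenize(query))
    qv = {t: tf * idf[t] for t, tf in q.items() if t in idf}
    scored = sorted(((cosine(qv, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [(i, round(s, 3)) for s, i in scored[:top_k] if s > 0]


# Hypothetical thread texts, for illustration only.
threads = [
    "How do I train the Moses decoder on a new language pair?",
    "Installation problem with GIZA++ on Ubuntu",
    "Tuning weights with MERT fails",
]
results = search("moses training new language pair", threads)
```

With these sample threads only the first one shares terms with the query, so it is the only result returned. The sketch rebuilds the index on every query, and a real system would also need stemming, quoted-reply stripping and thread reconstruction.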
Wednesday, 18 August 2010
Tuesday, 17 August 2010
Friday, 13 August 2010
Tuesday, 10 August 2010
"tokenize.pl" script from the WCDG parser:
(even de-hyphenation when used together with the parser's lexicon)
Java-based program, Segment
https://sourceforge.net/projects/segment/ (MIT-type licence)
SRX rules for sentence splitting, includes a library for
sentence splitting, which is used by LanguageTool and the Maligna
C++ library in development (GPL)
MeCab (successor of ChaSen)
includes dependency parsing etc
IceNLP is open source
tokenizer/sentence segmentizer module, part of IceNLP - a toolkit for
heuristics with names and standard abbreviations
fast, rule-based, tokenizer + sentence boundary detector
German, Russian, English
sentence boundary detector for German, trainable
English sentence segmenter in Haskell
Grammatical Framework tool
MADA + TOKAN
Europarl sentence splitter as Perl modules
Other Perl modules
implemented in NLTK (Apache license)
existing models for different languages (?)
trainable tokenizer & sentence boundary detector
models available for English, German, Spanish, Thai
further models to come, wiki at:
mainly for Hungarian (?)
Jena NLP tools
trainable tokenizer & sentence splitter
(mainly for Catalan & Spanish?)
Alpino for Dutch (tokenization + sentence splitting)
ChaSen for Japanese (successor: MeCab (see above))
MXPOST & MXTERMINATOR (research only!)
trainable sentence splitter
Perl Sentence Segmentation
The list is compiled by Joerg Tiedemann.
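Several tools in the list above are described as rule-based sentence boundary detectors that use heuristics with names and standard abbreviations. The following is a minimal sketch of that general idea in plain Python. It is not the algorithm of any listed tool. The abbreviation set is a tiny hypothetical one, whereas real tools ship much larger, language-specific lexicons.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (lowercase, without the final dot).
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "e.g", "i.e", "etc", "vs"}

# Candidate boundary: '.', '!' or '?' followed by whitespace and then
# an uppercase letter, an opening quote or an opening parenthesis.
BOUNDARY = re.compile(r'([.!?])\s+(?=[A-Z"(])')


def split_sentences(text):
    """Split text into sentences using punctuation plus abbreviation heuristics."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end(1)  # position just after the punctuation mark
        words = text[start:end - 1].split()
        last_word = words[-1].lower() if words else ""
        # A period after a known abbreviation or a single letter (an initial
        # such as "J.") is not treated as a sentence boundary.
        if match.group(1) == "." and (last_word in ABBREVIATIONS or len(last_word) == 1):
            continue
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "Dr. Smith met J. Doe in Berlin. They discussed parsing, e.g. Alpino. It went well!"
sentences = split_sentences(sample)
```

On the sample text this yields three sentences, keeping "Dr.", "J." and "e.g." inside their sentences. The single-letter rule will wrongly merge sentences that end in a one-letter word, which is one reason the trainable splitters in the list exist.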
Monday, 19 July 2010
Friday, 9 July 2010
Wednesday, 7 July 2010
Thursday, 10 June 2010
Pseudo-word for Phrase-based Machine Translation
Xiangyu Duan, Min Zhang and Haizhou Li
Learning Lexicalized Reordering Models from Reordering Graphs
Jinsong Su, Yang Liu, Yajuan Lv, Haitao Mi and Qun Liu
Filtering Syntactic Constraints for Statistical Machine Translation
Hailong Cao and Eiichiro Sumita
Error Detection for Statistical Machine Translation Using Linguistic Features
Deyi Xiong, Min Zhang and Haizhou Li
Boosting-based System Combination for Machine Translation
Tong Xiao, Jingbo Zhu, Muhua Zhu and Huizhen Wang
Bilingual Sense Similarity for Statistical Machine Translation
Boxing Chen, George Foster and Roland Kuhn
Summarization & Generation
A Risk Minimization Framework for Extractive Speech Summarization
Shih-Hsiang Lin and Berlin Chen
Entity-based local coherence modelling using topological fields
Jackie Chi Kit Cheung and Gerald Penn
Automatic Collocation Suggestion in Academic Writing
Jian-Cheng Wu, Yu-Chia Chang, Teruko Mitamura and Jason S. Chang
Identifying Non-Explicit Citing Sentences for Citation-Based Summarization
Vahed Qazvinian and Dragomir R. Radev
Automatic Generation of Story Highlights
Kristian Woodsend and Mirella Lapata
Plot Induction and Evolutionary Search for Story Generation
Neil McIntyre and Mirella Lapata
Metadata-Aware Measures for Answer Summarization in Community Question Answering
Mattia Tomasoni and Minlie Huang
A Hybrid Hierarchical Model for Multi-Document Summarization
Asli Celikyilmaz and Dilek Hakkani-Tur
A new Approach to Improving Multilingual Summarization using a Genetic Algorithm
Marina Litvak, Mark Last and Menahem Friedman
Cross-Language Document Summarization Based on Machine Translation Quality Prediction
Xiaojun Wan, Huiying Li and Jianguo Xiao
Generating image descriptions using dependency relational patterns
Ahmet Aker and Robert Gaizauskas
Open Information Extraction Using Wikipedia
Fei Wu and Daniel S. Weld
The Human Language Project: Building a Universal Corpus of the World’s Languages
Steven Abney and Steven Bird
(see full list of papers at http://nlp.csie.ncnu.edu.tw/~shin/acl2010/proceedings/CDROM/ACL/index.html)
Wednesday, 9 June 2010
--> I think this research problem is still open and actually very challenging. It requires advanced processing that combines many fields in AI, such as NLP, IR, IE, ...
Some initial work (including mine) is as follows:
1) Scientific Paper Summarization Using Citation Summary Networks by Qazvinian V. et al. (COLING 2008).
--> this work only targets single article summarization using a clustering approach based on citation summary networks.
2) Generating surveys of scientific paradigms by Saif Mohammad et al. (NAACL 2009).
--> this work explores the usefulness of citation summaries compared to summaries built from abstracts or the full text of articles.
3) Towards Automated Related Work Summarization by Cong Duy Vu HOANG et al. (COLING 2010)
--> this work does not use citation summaries but tries to take advantage of the full text of articles in generating a related work summary.
It makes a strong assumption that each related work summary follows a topic hierarchy tree, which is provided as input to the summarization system. The system then proposes two different strategies (general and specific content summarization) based on a manual rhetorical analysis of how humans use a topic hierarchy tree to generate a related work summary.
4) Identifying Non-Explicit Citing Sentences for Citation-Based Summarization by Vahed Qazvinian and Dragomir R. Radev (ACL 2010)
5) Context Identification of Sentences in Related Work Sections using a Conditional Random Field: Towards Intelligent Digital Libraries by Angrosh M. A. et al. (JCDL 2010)
6) Imitating Human Literature Review Writing: An Approach to Multi-document Summarization by Jaidka K. et al. (ICADL 2010)
7) Analysis of the Macro-Level Discourse Structure of Literature Reviews by Jaidka K. et al. (Online Information Review)
8) Ultimate Research Assistant: http://ultimate-research-assistant.com/
9) iResearch Reporter: http://iresearch-reporter.com//
Future work (what comes to mind now) includes:
- Given a research topic --> automatically generate a topic hierarchy tree of that topic.
- A systematic comparison of summaries built from citations, abstracts, and the full text of articles. Which ones are more useful to users?
- An initial add-in component integrated into online ACL anthology system.
- Other issues that could improve summarization performance (e.g. using rhetorical discourse analysis, ...)
Tuesday, 8 June 2010
Friday, 4 June 2010
Consider whether other constraints (e.g. reordering, chunking, translation boundaries, ...) can be integrated into beam search decoding for phrasal SMT (try to validate this later!).
(Linguistically Annotated Reordering: http://www1.i2r.a-star.edu.sg/~dyxiong/paper/cl-preprint.pdf)
--> Idea: given a set of reordering rules manually annotated by humans, the question is how to use them during SMT decoding.
Translation Boundaries: http://aclweb.org/anthology-new/N/N10/N10-1016.pdf
Thursday, 3 June 2010
Thursday, 27 May 2010
Monday, 12 April 2010
Sunday, 28 March 2010
Monday, 22 March 2010
Monday, 8 March 2010
The feature I like most in that system is real-time updating: questions posted on that online system might be answered in real time.
More information about Aardvark is in this blog entry (http://mendicantbug.com/2009/03/13/aardvark-and-social-qa/)
That's actually an interesting research direction.
Wednesday, 3 March 2010
Dumps of Wikipedia: http://download.wikimedia.org
Extraction of plain text corpus from Wikipedia: http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-from-wikipedia/
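As a rough illustration of the extraction step, the sketch below streams pages out of a MediaWiki XML export with the standard library and applies a very crude markup stripper. It is not the method described in the linked blog post. The function names, the regular expressions and the sample XML are assumptions made for this example, and real wikitext (nested templates, tables, references) needs a dedicated parser.

```python
import io
import re
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the XML namespace prefix, e.g. '{ns}page' becomes 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text or ""
            title, text = None, None
            elem.clear()  # release the finished page to keep memory use low


def strip_markup(wikitext):
    """Very rough wikitext cleanup; handles only the simplest constructs."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)  # non-nested templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)  # bold and italic quote runs
    text = re.sub(r"<[^>]+>", "", text)  # inline HTML-like tags
    return re.sub(r"\s+", " ", text).strip()


# Tiny made-up export, standing in for a real dump file.
sample = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>Parsing</title><revision><text>"
    b"'''Parsing''' is studied in [[natural language processing|NLP]].{{stub}}"
    b"</text></revision></page></mediawiki>"
)
pages = [(title, strip_markup(text)) for title, text in iter_pages(io.BytesIO(sample))]
```

For a real dump, the same `iter_pages` function should work on a file object opened with `bz2.open(path, "rb")`, where `path` points to a downloaded `.xml.bz2` file, so the archive never has to be fully decompressed on disk.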
Sunday, 21 February 2010
Friday, 29 January 2010
Monday, 25 January 2010
Saturday, 23 January 2010
Video at videolectures.net: http://videolectures.net/wsdm08_etzioni_mrws/
Demo of KnowItAll: http://www.cs.washington.edu/research/knowitall/
http://www.cs.washington.edu/homes/pjallen/aaaiss07/schedule.htm (Machine Reading symposium). Link to download papers here.
The research in Machine Reading may be relevant to the research in the project "Read the Web" (see here).
I should track this project regularly.