HOANG Cong Duy Vu's research logs: 2009

Project Gutenberg (http://www.gutenberg.org/wiki/Main_Page) - a site containing more than 100,000 free online books in various languages ^_^. In particular, it gives access to the raw text of the books for further processing (e.g. book summarization, a very interesting research direction that has been underexplored so far).

--
Cheers,
Vu

Tuesday, 27 October 2009

Reference Management Tools (part 2)

Just want to collect more information about reference management tools.

Part 1: here

1) BibDesk (http://bibdesk.sf.net) for Mac

2) Mendeley: http://www.mendeley.com/

If you know of any others, please suggest them to me! Thanks in advance!

--
Cheers,
Vu

Sunday, 25 October 2009

Remarkable papers for Vietnamese Word Segmentation

By Doan NGUYEN (Hewlett-Packard Company)

1) "Query Preprocessing: Improving Web Search Through a Vietnamese Word Tokenization Approach". SIGIR'08 (short paper)

2) "Using Search Engine to Construct a Scalable Corpus for Vietnamese Lexical Development for Word Segmentation". Proceedings of the 7th Workshop on Asian Language Resources, ACL-IJCNLP 2009.

--
Cheers,
Vu

Saturday, 24 October 2009

The Legacy of Randy Pausch

Prof. Randy Pausch is very famous for two lectures: "Time Management" and "The Last Lecture" (you MUST see them!).

http://www.cs.virginia.edu/~robins/Randy/ - a site collecting Prof. Randy's lectures.

--
Cheers,
Vu

Thursday, 22 October 2009

Surveys & Books for Automatic Summarization

This post is to collect some useful surveys or books about automatic summarization in the literature so far.

Surveys
1) A survey on Automatic Text Summarization (link)

2) A Survey on Multi-Document Summarization (link)

3) Automatic summarising: The state of the art (link)

Books
1) Automatic Summarization by Mani (link)

2) Text summarisation by Hovy (link)

--
Cheers,
Vu

Useful collective information about automatic summarization

http://www.summarizationonline.info

--
Cheers,
Vu

Semantic Similarity using WordNet

http://wn-similarity.sourceforge.net/

--
Cheers,
Vu

Tuesday, 6 October 2009

Modeling and Reasoning with Bayesian Networks

New book in 2009 about Bayesian Networks: "Modeling and Reasoning with Bayesian Networks" by Prof. Adnan Darwiche.

Download link: http://gigapedia.com/items/346116/modeling-and-reasoning-with-bayesian-networks (make sure that you are already logged in to that site)

Link on Amazon website: http://www.amazon.com/Modeling-Reasoning-Bayesian-Networks-Professor/dp/0521884381/ref=sr_1_1?ie=UTF8&s=books&qid=1239663545&sr=8-1

I am reading some chapters of this book to study Bayesian Networks and find that it is written in a very comprehensive and straightforward fashion; in particular, the accompanying examples make it much easier to deal with the heavy mathematical material. In my opinion, it may be better than the typical book on Bayesian Networks, namely "Learning Bayesian Networks" by Richard E. Neapolitan, which is very tough to read and understand.

--
Cheers,
Vu

Monday, 5 October 2009

Statistical Data Mining Tutorials

http://www.autonlab.org/tutorials/index.html by Prof. Andrew W. Moore.

--
Cheers,
Vu

Monday, 28 September 2009

SynView - A free syntax tree visualization tool

http://www.christian-behrenberg.de/work/SynView.html

Introduction trailer on YouTube: http://www.youtube.com/watch?v=zFi9ldFYlEs

Extremely awesome, with 3D manipulation. It may be helpful for people who regularly deal with syntax trees in NLP.

Cheers,

Sunday, 27 September 2009

True Knowledge - The Internet Answer Engine

http://www.trueknowledge.com/

It is just a beta version, but what it can do for us is very impressive.

--
Cheers,
Vu

Saturday, 26 September 2009

IBM PhD Fellowship

https://www.ibm.com/developerworks/university/phdfellowship/

--
Cheers,
Vu

Friday, 25 September 2009

Open source search engine

Below is some collected information on up-to-date open source search engine toolkits:

1) Lucene: http://lucene.apache.org/ (Java)
--> CLucene: http://sourceforge.net/projects/clucene/ (C++)

2) Minion: https://minion.dev.java.net
3) Galago: http://www.galagosearch.org/

4) Xapian: http://xapian.org/ (C++)

5) http://www.searchenginecaffe.com/2007/03/open-source-search-engines-in-java-and.html (very informative)

6) Minion vs. Lucene: http://blogs.sun.com/searchguy/entry/minion_and_lucene_query_languages

7) Search Engine Wrapper (Yee Fan Tan - NUS): http://wing.comp.nus.edu.sg/~tanyeefa/downloads/searchenginewrapper/

--
Cheers,
Vu

Wednesday, 16 September 2009

Bayesian Inference with Tears

I plan to read this article to understand more about Bayesian inference applied to NLP.
Link: http://www.isi.edu/natural-language/people/bayes-with-tears.pdf
(by Kevin Knight)

--
Cheers,
Vu

Wednesday, 2 September 2009

Notes in machine learning

http://www.ics.uci.edu/~welling/classnotes/classnotes.html

I think these notes are very useful for learning more about topics in machine learning.

Useful datasets for Machine Learning: http://archive.ics.uci.edu/ml/

--
Cheers,
Vu

Tuesday, 1 September 2009

Markov Logic Networks

Markov Logic Networks, a combination of First Order Logic and Markov Networks, are a new kind of graphical model that will be very important for AI modeling in the future. Prof. Pedro Domingos at Univ. of Washington is a pioneer in this field. Here are some major references, mostly from him:

1) New book "Markov Logic - An Interface Layer for AI".
Another editorial book: "Integrating Logic and Statistics: Novel Algorithms in Markov Logic Networks" by Marenglen Biba

2) The course about Markov Logic Networks given by Prof. Pedro Domingos at Univ. of Washington.

3) The article "What's missing in AI - The Interface Layer".

4) Alchemy - Open source AI: http://alchemy.cs.washington.edu/

I wonder whether some NLP problems can benefit from such a new model.
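
For my own reference, here is the standard definition as I understand it from the Markov logic literature (not copied from any of the references above): an MLN is a set of first-order formulas F_i with real-valued weights w_i, and it defines a probability distribution over possible worlds x:

```latex
P(X = x) \;=\; \frac{1}{Z} \exp\!\Big( \sum_{i} w_i \, n_i(x) \Big),
\qquad
Z \;=\; \sum_{x'} \exp\!\Big( \sum_{i} w_i \, n_i(x') \Big)
```

Here n_i(x) is the number of true groundings of formula F_i in world x, and Z is the normalizing constant. A world that violates a formula becomes less probable rather than impossible, and a higher weight w_i makes violations of F_i more costly.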

--
Cheers,
Vu

Graphical Models in a Nutshell

The paper by Prof. Daphne Koller:
http://robotics.stanford.edu/~koller/Papers/Koller+al:SRL07.pdf

MUST read this paper to understand the underlying principles behind graphical models before proceeding to investigate more!

--
Cheers,
Vu

Monday, 31 August 2009

Probability and Logic

Combining Probability and Logic - Journal of Applied Logic

This article is generally about how to use probability in combination with logic (also called language logic or metalanguage). Probabilistic approaches have clearly been getting more and more important in the field of Natural Language Processing and text processing.

Should read this article!

--
Cheers,
Vu

Sunday, 30 August 2009

Foundations of Probabilistic Modeling

The course by Prof. David Blei:
http://www.cs.princeton.edu/courses/archive/spr09/cos513/

--
Cheers,
Vu

Sunday, 16 August 2009

NLG systems

1) SimpleNLG (2009)
http://code.google.com/p/simplenlg/

2) FUF/SURGE
FUF: Functional Unification Formalism Interpreter
SURGE: A Syntactic Realization Grammar for Text Generation

http://www.cs.bgu.ac.il/surge/index.html (1999)
http://homepages.inf.ed.ac.uk/ccallawa/resources.html (newest version, 2005)

3) More in http://www.aclweb.org/aclwiki/index.php?title=Downloadable_NLG_systems

--
Cheers,
Vu

Friday, 14 August 2009

ILP for NLP

http://ilpnlp.wikidot.com/start

Mojo Web Framework

http://mojolicious.org/ - A next-generation web framework for the Perl programming language.

Wednesday, 5 August 2009

Markov Logic

Markov Logic - a new graphical model for Natural Language Processing

New book by Prof. Pedro Domingos: http://www.morganclaypool.com/doi/abs/10.2200/S00206ED1V01Y200907AIM007

I should study this model as soon as possible!

--
Cheers,
Vu

Sunday, 2 August 2009

ACL-IJCNLP'09 participation

Day 1 - 02/08/2009

Tutorial 1: Topics in Statistical Machine Translation by Kevin Knight (ISI) and Philipp Koehn (Edinburgh Univ.)

Some sub-topics from this tutorial that I need to look into are as follows:

- Minimum Bayes Risk Decoding
- Re-evaluation of phrase-based SMT outputs
- MT system combination
- Efficient decoding (e.g. using cube pruning)
- Discriminative training with various features
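
A note to self on the first sub-topic: Minimum Bayes Risk decoding picks, from an n-best list, the hypothesis with the lowest expected loss under the model's posterior, rather than simply the most probable one. Below is a minimal Python sketch, not taken from the tutorial: the n-best list and its probabilities are made up, the word-set overlap is only a stand-in for a real gain function such as BLEU, and `overlap_gain` and `mbr_decode` are hypothetical names.

```python
def overlap_gain(hyp, ref):
    """Toy similarity: Jaccard overlap of word sets (stand-in for BLEU)."""
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / len(h | r) if h | r else 1.0


def mbr_decode(nbest):
    """Return the hypothesis with the highest expected gain.

    nbest: list of (sentence, posterior probability) pairs.
    With loss = 1 - gain, maximizing expected gain is the same as
    minimizing expected loss (the Bayes risk).
    """
    def expected_gain(candidate):
        return sum(p * overlap_gain(candidate, other) for other, p in nbest)

    return max((sent for sent, _ in nbest), key=expected_gain)


# Made-up n-best list: the most probable hypothesis is not the MBR choice,
# because the other two hypotheses agree with each other more.
nbest = [
    ("the cat sat on a mat", 0.40),
    ("a cat sits on the mat", 0.35),
    ("a cat sits on a mat", 0.25),
]
print(mbr_decode(nbest))
```

The point of the toy data is that the top-scoring hypothesis can lose to one that is closer, on average, to everything else the model considers plausible.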

I am looking for their slides (soft copy) for further reference. If you have them, please share them with me, thanks a lot!

Day 2 - 03/08/2009
Session 2B: Generation and Summarization 1

Talk 1: DEPEVAL (summ): Dependency-based Evaluation for Automatic Summaries by Karolina Owczarzak
- the main idea is to use dependency relations for summary evaluation
- better in comparison with ROUGE (2004) and BE (2005)
+ Question: difference between DEPEVAL and BE?
Note:
- Lexical-Functional Grammar (e.g. two syntactic structures to one functional structure)
- LFG parser
+ Charniak-Johnson syntactic parser (2005)
+ LFG annotation (2008)

Talk 2: Summarizing Definition from Wikipedia by Shiren Ye
- raises a new problem, with its own challenges, in summarizing Wikipedia articles
+ recursive links
+ hidden information
- single-document summarization
- uses an existing approach named Document Concept Lattice (IPM 2007)

Talk 3: Automatically Generating Wikipedia Articles: A Structure-aware Approach by C. Sauper
- a new problem: generating overview articles for Wikipedia using various resources crawled from the Internet
- template creation by clustering existing section topics in the database
- proposes a joint learning model that integrates Integer Linear Programming (ILP) into learning to optimize weights (for each section topic)
Note:
- evaluation of quality of generated articles is subjective (Prof. Hovy asked about this)!

Talk 4: Learning to tell tales: A Data-driven Approach to Story Generation by Neil McIntyre
- an interesting problem that results in an end-to-end generation system
- content selection (content) -> content planning (grammar) -> generation (use LM)
Note:
- how to evaluate the quality of generated stories in terms of coherence and interestingness?

Day 3 - 04/08/2009
Relax to save my energy to enjoy the interesting remaining sessions, especially in EMNLP!

Day 4 - 05/08/2009

Talk 1: SMS based Interface for FAQ Retrieval
- I actually could not follow the speaker of this talk.

Talk 2: A Syntax-free Approach to Japanese Sentence Compression
- It is worth noting some material relevant to my current interests:
+ Intra-sentence positional term weighting
+ Patched language modeling
- Analysis of human-made reference compression
-> very helpful to figure out challenges in specific problems!
- combinatorial optimization problem
-> used to do parameter optimization (MCE-Minimum Classification Error in this paper)
- Statistical significance using the Wilcoxon signed-rank test

Talk 3: Application-driven Statistical Paraphrase Generation
- use SMT-like techniques but propose some new models within noisy channel model
+ paraphrase model (adapt)
+ LM (re-use)
+ usability (propose)
- the error analysis does not seem compelling (only the very good outputs of the proposed system are shown); it would also help to figure out which components of the proposed framework are most influential

Talk 4: Word or Phrase? Learning Which Unit to Stress for Information Retrieval
- This did not interest me much; it is IR stuff!

Talk 5: A Generative Blog Post Retrieval Model that Uses Query Expansion based on External Collections
- learn about query expansion techniques and how to integrate them into a specific problem (blog post retrieval in this paper)
+ worth noting: query expansion based on external resources

Talk 6: An Optimal-Time Binarization Algorithm for Linear Context-Free Rewriting Systems with Fan-Out Two
- a lot of parsing-related material (especially on algorithmic complexity) in this talk that left me extremely confused!

Talk 7: A Polynomial-Time Parsing Algorithm for TT-MCTAG
- could not understand any of the material!

Talk 8: Unsupervised Relation Extraction by Mining Wikipedia Texts Using Information from the Web
- quite simple idea (just in my opinion) based on observation from Wikipedia
- uses dependencies as the key component

Closing session:
- very interesting and sometimes funny!
- Prof. Frederick Jelinek received the Lifetime Achievement Award, followed by an interesting talk sketching his biography.
- announcements about future NLP conferences (COLING'10, ACL'10, ACL'11, NAACL'10, IJCNLP'10, LREC'10)
- announcements about best paper awards (see more details in ACL-IJCNLP'09 website)

An exhausting but helpful day. Time to prepare for the coming days with the ACL workshops and EMNLP sessions!

Day 5 - 06/08/2009
Talk 1 (invited talk): Query-focused Summarization Using Text-to-Text Generation: when Information Comes from Multilingual Sources by Kathleen R. McKeown
- This is the first time I have seen Prof. Kathleen McKeown in person; she was the supervisor of my current supervisor (A/P Min-Yen KAN) hehe.
Some main points:
- typical approach for query-based summarization:
+ choose key sentences (word freq, position, clue words)
+ matches of query term against sentence terms
=> leads to
+ irrelevant sentences
+ sentences placed out of context -> misconceptions
<= - generate new sentences from selected phrases + fluent sentences -> disfluent sentences
+ edit references to people (focus mainly on names)
- remove irrelevant sentences using sentence simplification
+ project DARPA GALE
+ interactive question user input
- NIGHTINGALE
+ use Wikipedia to expand query
+ consider name translation in multilingual resources
+ better if operating over phrases
- GLARF parser from NYU
- long sentences -> shorter sentences using sentence simplification
- redundancy detection => pairwise similarity across all sentences to identify concepts
+ alignment of dependency parses --> hypergraph
+ BOW
- future research direction: text generation for QA
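
To make the "redundancy detection => pairwise similarity across all sentences" point concrete for myself, here is a minimal Python sketch of the bag-of-words (BOW) variant only. It is my own illustration, not the system described in the talk: the cosine measure, the 0.7 threshold, the example sentences and the names `cosine` and `redundant_pairs` are all assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def redundant_pairs(sentences, threshold=0.7):
    """Return index pairs (i, j) whose BOW cosine similarity >= threshold."""
    bows = [Counter(s.lower().split()) for s in sentences]
    return [(i, j) for i, j in combinations(range(len(bows)), 2)
            if cosine(bows[i], bows[j]) >= threshold]


sentences = [
    "the president visited the city on monday",
    "on monday the president visited the city",
    "stock markets fell sharply",
]
print(redundant_pairs(sentences))
```

BOW ignores word order, so the first two sentences are flagged as redundant; the dependency-alignment variant mentioned in the notes would need a parser and is not sketched here.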

Talk 2: A Classification Algorithm for Predicting the Structure of Summaries
- Interesting motivating question: how to "paste" selected sentences during abstracting?
- abstracting
+ some of materials not present
+ be modeled by cut-and-paste operations (Mani, 01)
- use specific verbs (predicates), for example: present, conclude, include, ...
- language tools
+ GATE (POS, morpho)
+ SUPPLE parser

Talk 3: Entity Extraction via Ensemble Semantics
- web -> entities -> top-PMI (point-wise mutual information) entities

Talk 4: Clustering to Find Exemplar Terms for Keyphrase Extraction
- relatedness
+ co-occurrence (statistics)
+ Wikipedia-based (e.g. PMI)
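
Since PMI shows up in both Talk 3 and Talk 4, here is a quick reminder of how it is computed: PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ). The Python sketch below estimates the probabilities from document-level co-occurrence counts; the tiny document collection and the name `pmi_table` are made up for illustration and have nothing to do with the data used in the papers.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(docs):
    """PMI for every term pair that co-occurs in at least one document.

    docs: list of token lists. Probabilities are estimated as document
    frequencies divided by the number of documents.
    """
    n = len(docs)
    term_df = Counter()
    pair_df = Counter()
    for doc in docs:
        terms = sorted(set(doc))  # sorted so each pair has one canonical order
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    return {
        (x, y): math.log2((c / n) / ((term_df[x] / n) * (term_df[y] / n)))
        for (x, y), c in pair_df.items()
    }


docs = [
    ["keyphrase", "extraction", "clustering"],
    ["keyphrase", "extraction"],
    ["clustering", "graph"],
    ["graph", "ranking"],
]
scores = pmi_table(docs)
print(scores[("extraction", "keyphrase")])
print(scores[("clustering", "extraction")])
```

A positive value means the two terms co-occur more often than they would if they were independent; zero means they look independent in this sample.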

Day 6 - 07/08/2009
TBA

Just for taking notes!

--
Cheers,
Vu

Thursday, 30 July 2009

JAVA stuffs

Of course, there are a lot of sites about this; below is one of them:

http://www.java2s.com/

(to be updated!)

--
Cheers,
Vu

Wednesday, 29 July 2009

Porter's Stemming Algorithm Online

http://maya.cs.depaul.edu/~classes/ds575/porter.html

Useful for quick reference to Porter Stemming Algorithm for English!

--
Cheers,
Vu

Tuesday, 28 July 2009

Lucene stuffs

These days, I am trying to use Lucene for my own research purposes. Here are some things that may be relevant and useful:

Lucene in general: http://lucene.apache.org
-> you can find more details on this site!

Lucene Index Toolbox (Luke): http://www.getopt.org/luke/
-> This tool is very helpful for working with the functionality of the Lucene search engine. It supports index and document browsing, search, ... with a graphical UI (cross-platform).

Great articles:

1) Summarization with Lucene: http://sujitpal.blogspot.com/2009/02/summarization-with-lucene.html

-> In this article, the author implements his own summarizer, based mainly on two simple summarization algorithms, namely Classifier4J (C4J) and Open Text Summarizer (OTS), using Lucene, an open-source search engine API.

2) Lucene Analyzer, Tokenizer and TokenFilter: http://mext.at/?p=26
-> how to use analyzer, tokenizer, filter in Lucene

3) Lucene Indexing and Document Scoring: (googling with the keyword "lucene indexing and document scoring")
-> explains some basic Lucene concepts and definitions comprehensively.

4) Understanding Lucene Scoring: http://www.opensourcereleasefeed.com/article/show/understanding-lucene-scori

5) Lucene Query Syntax: http://lucene.apache.org/java/2_3_2/queryparsersyntax.html (replace the version "2_3_2" if you are using a newer one)

(to be continued ...)

--
Cheers,
Vu

IBM Many Aspects Document Summarization Tool

http://www.alphaworks.ibm.com/tech/manyaspects

--
Cheers,
Vu

Monday, 27 July 2009

AI softwares

Summarization: http://summarizer.intellexer.com/index.html
Extractor: http://www.extractor.com/

Surprising! AI software indeed!

--
Cheers,
Vu

Keywords Co-Occurrence and Semantic Connectivity

http://www.miislita.com/semantics/c-index-1.html

I would like to adapt some techniques mentioned in this article to the problem of keyword co-occurrence in scientific domain (e.g. ACL Anthology).

--
Cheers,
Vu

Sunday, 26 July 2009

Brown Coherence Toolkit

Link for download: http://www.cs.brown.edu/~melsner/egrid-distr.tgz
Manual: http://www.cs.brown.edu/~melsner/manual.html
Link to the author: http://www.cs.brown.edu/~melsner/

--
Cheers,
Vu

iOPENER

http://tangra.si.umich.edu/clair/iopener/index.html

The idea of automatically creating technical surveys using AI algorithms seems interesting but quite ambitious (as far as I understand). This is interdisciplinary research combining various techniques from Natural Language Processing, Natural Language Understanding and Natural Language Generation. To some extent, it is really hard and still far from today's reality :D.

See more in the newest paper at NAACL'09, "Using Citations to Generate Surveys of Scientific Paradigms"! Initially, the authors use existing citation contexts of articles combined with state-of-the-art techniques (e.g. Trimmer, LexRank, C-LexRank, C-RR) from extractive multi-document summarization (mostly from the news domain) to generate the surveys. They also drew some important conclusions:
- approaches from other domains applied to the scientific domain can produce satisfactory results
- citation contexts and abstracts contain much more useful information for summaries than the full texts of papers

My comments on this are as follows:
- the specific features of scientific survey articles are not used yet. For example: the structure of technical surveys, topic coherence, ...
- information fusion. Different citation contexts may contain overlapping information. How to pinpoint them?

--
Cheers,
Vu

Clair library

The Clair Library - A Perl package for Natural Language Processing, Information Retrieval and Network Analysis.

Just a note for further reference!

--
Cheers,
Vu

Scholarship links

Sites for finding scholarships at different levels (undergraduate, master, PhD):

1)
http://scholarship-position.blogspot.com/

2)
http://scholarshipsboard.com/

--
Cheers,
Vu

Thursday, 23 July 2009

NLP research links by Vlado Keselj

http://users.cs.dal.ca/~vlado/nlp/

This link compiled by Prof. Vlado Keselj at Dalhousie University contains quite a lot of research links in NLP.

Just a note for future search!

--
Cheers,
Vu

Wednesday, 22 July 2009

Language Experiments - The Portal for Psychological Experiments on Language

http://www.language-experiments.org/

This site may be very useful for people who want to create experiments for their own research. I am trying to figure out whether it can help me do something helpful.

--
Cheers,
Vu

Internet FAQ Archives

http://www.faqs.org/faqs/

--
Cheers,
Vu

Tuesday, 21 July 2009

PDF to raw texts

There are several ways to convert PDF files to raw text files. In my opinion, the two typical ways are with and without OCR technology. PDFBox is a freely available tool that does not use OCR, so the converted raw text suffers from some errors. To utilize OCR technology in the conversion, there are some tricks:
- use free or commercial tools like SimpleOCR, VeryPDF, OmniPage ...
- copy-and-paste directly from the PDF files. This trick only applies to PDF files that are not secured.
- use online tools (I am actually not sure about the internal technologies they use :()

+ Adobe: http://www.adobe.com/products/acrobat/access_onlinetools.html
+ http://pdftextonline.com

- leverage Google OCR (means Google will do this for us):

http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-google-ocr/5158/

Convert Scanned PDFs to Text

Now if you have bunch of scanned PDF files on your hard drive and no OCR software, here’s what you can do to convert them into recognizable text.

Create a folder in your website (say abc.com/pdf) and upload all the PDF images to that folder. Now create a public web page that links to all the PDF files. Wait for the Google bots to spider your stuff.

Once done, type the query "site:abc.com/pdf filetype:pdf" to see the PDF documents as HTML.

--
Cheers,
Vu

Machine Learning and Natural Language

http://l2r.cs.uiuc.edu/~danr/Teaching/CS546-09/lectures.html

This is a helpful course taught by Prof. Dan Roth at UIUC. Perhaps I should spend more time self-studying some of the material from this course to improve my background in Machine Learning for Natural Language.

--
Cheers,
Vu

Sunday, 19 July 2009

Term "Oracle"

I have sometimes encountered the term "oracle" in papers, especially in the Experiment and Evaluation sections, without really understanding what it means. I have recently figured it out: it can refer to the upper bound of a measure that is used to assess the performance of a method.
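
One common way this shows up, as far as I can tell, is in selection or re-ranking setups: the system picks one candidate per test item, while the oracle always picks the candidate that scores best against the gold standard. A minimal Python sketch with made-up scores (`system_score` and `oracle_score` are hypothetical names):

```python
def system_score(candidate_scores):
    """Average score when the system outputs its top-ranked candidate.

    candidate_scores: one list per test item, holding the evaluation
    score of each candidate in the system's own ranking order.
    """
    return sum(scores[0] for scores in candidate_scores) / len(candidate_scores)


def oracle_score(candidate_scores):
    """Average score when the best candidate is always chosen (upper bound)."""
    return sum(max(scores) for scores in candidate_scores) / len(candidate_scores)


# Made-up evaluation scores for three test items, three candidates each.
candidate_scores = [
    [0.6, 0.8, 0.5],
    [0.7, 0.4, 0.9],
    [0.3, 0.3, 0.5],
]
print(system_score(candidate_scores), oracle_score(candidate_scores))
```

The gap between the two numbers tells how much a better selection step could gain at most, given the same candidates.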

--
Cheers,
Vu

Baseline methods

How to effectively design the baseline methods for specific problems?
This is a question I raise whenever I approach a specific research problem. The baseline methods can be 1) state-of-the-art methods that have been well studied in previous work (yes, we can re-implement some of them, but how many are enough? It's a hard question) or 2) the simplest methods that we can naturally think of or easily propose, which should nevertheless not be too naive (because beating naive methods lowers the value of your proposed methods).

Just my thoughts, any corrections are welcome!

--
Cheers,
Vu

Wilcoxon signed-rank test

Wilcoxon signed-rank test: http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test

This test is used to determine whether the difference between paired results from two different methods is statistically significant.

See also the tool "Significance Testing" maintained by Sebastian Pado:
http://www.nlpado.de/~sebastian/sigf.html
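
To remind myself how the Wilcoxon signed-rank statistic is built, here is a small pure-Python sketch: take the paired differences, drop the zeros, rank the absolute differences (ties share the average rank), and sum the ranks of the positive and negative differences separately. The scores below are made up and `wilcoxon_statistic` is a hypothetical name. The sketch stops at the statistic; a p-value still needs a table or a normal approximation, or an existing implementation such as `scipy.stats.wilcoxon`.

```python
def wilcoxon_statistic(x, y):
    """Return (W_plus, W_minus) for paired samples x and y.

    Zero differences are dropped; tied absolute differences share
    the average of the ranks they span.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Made-up per-document scores (in percent) for two methods.
method_a = [82, 75, 90, 68, 77, 85]
method_b = [80, 76, 85, 68, 70, 81]
print(wilcoxon_statistic(method_a, method_b))
```

The test statistic is usually reported as the smaller of the two sums; a very small value means one method wins on almost every item.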

--
Cheers,
Vu

New Book: Search User Interfaces

A book written by Prof. Marti A. Hearst at University of California, Berkeley
Search User Interfaces: http://searchuserinterfaces.com/

Just a note, maybe it will be useful for me in the future.

--
Cheers,
Vu

Friday, 17 July 2009

Intuition and Observation in NLP research

I think that intuition and observation are crucial factors that strongly shape the methods/approaches proposed to solve problems in NLP research. I have read quite a lot of NLP papers and noticed this. Due to the ambiguity of natural language, some NLP problems are solved heuristically, based on intuition and observation of what humans are able to do naturally.

Anyway, this is just my opinion and quite subjective. Please correct me if you disagree!

--
Cheers,
Vu

Wednesday, 15 July 2009

NLP conferences links

Useful links to update information about newest conferences/journals

1) http://www.cs.rochester.edu/~tetreaul/conferences.html
2) http://www-tsujii.is.s.u-tokyo.ac.jp/~yoshinag/research/conference_link.html

NLP conference acceptance rates

http://aclweb.org/aclwiki/index.php?title=Conference_acceptance_rates

Computer Science Conference Ranking

http://www.cs-conference-ranking.org/conferencerankings/alltopics.html

According to this list, some NLP related conferences are ranked as follows:

AAAI: American Association for AI National Conference (0.99)
IJCAI: Intl Joint Conf on AI (0.96)
SIGIR: ACM SIGIR Conf on Information Retrieval (0.96)
ACL: Annual Meeting of the ACL - Association of Computational Linguistics (0.90)
NAACL: North American Chapter of the ACL (0.88)
CoNLL: Conference on Natural Language Learning (0.82)
EMNLP: Empirical Methods in Natural Language Processing (0.79)
COLING: International Conference on Computational Linguistics (0.64)
EACL: Annual Meeting of European Association Computational Linguistics (0.62)
PACLIC: Pacific Asia Conference on Language, Information and Computation (0.56)
RANLP: Recent Advances in Natural Language Processing (0.54)
NLPRS: Natural Language Pacific Rim Symposium (0.54)

--
Cheers,
Vu

ACL-IJCNLP'09 proceeding online

These proceedings will be officially archived in the ACL Anthology. The link below is only temporary, for those who want to quickly look up the newest articles of the ACL conference.

http://nlp.csie.ncnu.edu.tw/%7Eshin/acl-ijcnlp2009/proceedings/CDROM/ACLIJCNLP /index.html

--
Cheers,

Tuesday, 14 July 2009

NLP/Computational Linguistics Anthology

There are very useful resources that support research in the field of Computational Linguistics and Natural Language Processing. Some of them are currently available on the web.

* ACL Anthology: http://www.aclweb.org/anthology-new/

- archives papers of major conferences and journals such as ACL, NAACL, EMNLP, COLING, the journal Computational Linguistics, ...

* The ACL Anthology Network: http://belobog.si.umich.edu/clair/anthology/index.cgi

- a very helpful network built on data archived from the ACL Anthology. It acts like a social network that unveils relationships between papers and authors.

* ACL Anthology Reference Corpus (ACL ARC): http://acl-arc.comp.nus.edu.sg/

- a corpus recently built by some leading researchers around the world, aimed at boosting research in the scientific domain.

* ACL Anthology SearchBench: http://aclasb.dfki.de/

--
Cheers,
Vu

Monday, 13 July 2009

The Machine Learning Forum

http://seed.ucsd.edu/joomla15/

I think this is a great forum for anyone who wants to learn and apply machine learning techniques to solve research problems in a specific domain.

--
Cheers,
Vu

Linux Ubuntu stuffs

Some required configuration steps (of course, just appropriate in my situation):

1) Sharing folders between Windows XP (host) and Ubuntu Linux (guest) installed using VMware
On Linux:
- create an arbitrary folder to be shared
- install Samba; it can be installed automatically by right-clicking the shared folder and choosing the "Share" tab. Ubuntu will ask to install it, so just follow the prompts :D.
- use the command # ifconfig | grep "inet addr:" to see the IP address of Ubuntu Linux (guest).
On Windows:
- open "My Computer" and use the menu "Tools\Map Network Drive":
+ choose the drive on Windows which will be mapped to the shared folder on Ubuntu Linux
+ choose the address by clicking the "Browse" button and then selecting the appropriate shared folder on Ubuntu Linux. That's it!

2) update root password
- sudo passwd root

3) update vim editor with full version
- sudo apt-get install vim-full

4) auto remove and update with apt-get
- sudo apt-get update
- sudo apt-get autoremove

5) install java JDK
- sudo apt-get install sun-java6-jdk sun-java6-jre sun-java6-plugin

6) install netbeans
- sudo apt-get install netbeans
- set up javadocs for netbeans:
+ download the JDK javadocs from the Sun website (of course, choose the version matching your current JDK)
+ in the netbeans IDE, choose the menu Tools\Netbeans Platforms\javadoc and then locate the downloaded javadoc file

7) size of hard drives
- use the command # df -h

8) install eclipse
- sudo apt-get install eclipse

9) gnome commander - looks like Total Commander on Windows
http://www.nongnu.org/gcmd/ or sudo apt-get install gnome-commander

10) correct CGI/Perl bad interpreter - very useful tip
- Link
- use the command # perl -i.bak -pe 'tr/\r//d' script_file (e.g. *.pl, *.sh)

11) install JDK/JRE on Ubuntu Linux and related configuration
- Link
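
About the tip in item 10: the "bad interpreter" error typically comes from Windows line endings, where a carriage return ends up attached to the interpreter path on the first line, and the Perl one-liner deletes every CR character while keeping a .bak backup. Here is a rough Python equivalent as I understand it; `strip_carriage_returns` is a hypothetical helper name, not an existing tool.

```python
import shutil
from pathlib import Path


def strip_carriage_returns(path):
    """Remove every CR byte from the file in place, keeping a .bak backup.

    Mirrors what `perl -i.bak -pe 'tr/\\r//d' file` does: the original
    is saved as <file>.bak and the file itself is rewritten without CRs.
    """
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)
    data = path.read_bytes()
    path.write_bytes(data.replace(b"\r", b""))
    return backup
```

Usage would be something like strip_carriage_returns("script.pl"); working on bytes avoids any guess about the file's text encoding.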

Some useful links:
- IDEs for Developers: http://mashable.com/2007/11/17/ide-toolbox/
- Eclipse IDE: http://www.eclipse.org/
- EPIC (Eclipse Perl Integration): http://www.epic-ide.org/
- Anjuta IDE: http://projects.gnome.org/anjuta/index.shtml

--
Cheers,
Vu

Friday, 10 July 2009

Soft skills for scientific research

Very useful writings (just in Vietnamese):

1) http://tuanvannguyen.blogspot.com/2009/01/k-nng-mm-cho-nh-khoa-hc.html
2) http://groups.google.com/group/cvpr-hcmuns-vn/msg/13fe6e2c525e550a?

I think that I have been experiencing similar circumstances during my research life, though this is just the beginning ^_^. As shown in the two writings above, we should be diligent, patient, and consistent to maintain our passion for scientific research. Hopefully, I will overcome the biggest difficulties to realize my dream (to become a well-skilled research scientist ^_^, still far away from the present).

--
Cheers,
Vu

Thursday, 9 July 2009

Human computation

Manual data annotation takes a lot of effort: the process is time-consuming, labor-intensive and error-prone. Recently, human computation has emerged as a viable approach to data annotation; the idea is to harness what humans are good at but machines are poor at. Currently, many tasks are trivial for humans but continue to challenge even the most sophisticated computer programs. Thus, intelligently combining computers and humans through human computation to solve complex tasks is becoming a promising approach. Two typical frameworks representing human computation for data annotation are Games With A Purpose (GWAP) and Amazon Mechanical Turk (AMT).

During my earlier research at SoC@NUS, I had a chance to undertake an analysis survey on human computation. I will post it here as soon as possible for your quick reference.

I think that human computation can become a primary research tool for quickly creating evaluation data in the future.

--
Cheers,
Vu

Tuesday, 7 July 2009

How to read a scientific article?

Do you think you already read scientific articles effectively and efficiently? If not, you can read some of the following articles:

www.owlnet.rice.edu/~cainproj/courses/sci_article.pdf
http://www.lib.purdue.edu/phys/assets/SciPaperTutorial.swf (very nice presentation :D)

For a newbie in research like me, this is extremely important to learn.

--
Cheers,
Vu

Hypotheses

I have learned one lesson from my adviser.

Sometimes you are thinking about a problem and want to figure out a solution for it. The best way to do this is to first think of some hypotheses for your problem theoretically. You then validate your hypotheses empirically through experiments. Explain and analyze why the results look the way they do. It is worth noting that sometimes your data may not support your hypotheses.

Please do not only run experiments; that is not helpful.

Hopefully, this will help me a lot. I will keep up my best effort in my research.

--
Cheers,
Vu

Tools supporting our brainstorming

FreeMind

Link:
http://freemind.sourceforge.net/wiki/index.php/Main_Page
Tutorial
http://freemind.sourceforge.net/wiki/index.php/Tutorial_effort
FreeMind in YouTube
http://www.youtube.com/watch?v=grut_2cardM
Vietnamese book
http://www.vinabook.com/lap-ban-do-tu-duy-cong-cu-tu-duy-toi-uu-se-lam-thay-doi-cuoc-song-cua-ban-m11i21657.html

Thanks Prof. Duy-Dinh LE for sharing this information.

--
Cheers,
Vu

Topic modeling toolkit

MALLET: http://mallet.cs.umass.edu/
A famous tool supporting various algorithms in exploring latent topics in raw texts.

--
Cheers,
Vu

Summarization toolkit

MEAD: http://www.summarization.com/mead developed by Dragomir Radev at Univ. of Michigan
- Uses a centroid-based summarization algorithm; see the related papers for more details
- Some trouble may be encountered when installing this tool. Please read the README file carefully before using it.
- Useful FAQs relating to MEAD: http://www.summarization.com/~radev/mead/email/
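
A rough note on what "centroid-based" means, as I understand the general idea: build a centroid of the important words in the input, then prefer sentences that contain many high-weight centroid words. The Python sketch below is only a toy illustration and not MEAD's actual implementation: it uses raw term frequency instead of TF-IDF (which would need a background corpus), a tiny made-up stopword list, and a hypothetical function name `centroid_summary`.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not MEAD's list).
STOPWORDS = {"the", "a", "an", "and", "on", "of", "in", "to"}


def centroid_summary(sentences, k=1):
    """Return the k sentences scoring highest against the word centroid."""
    tokenized = [[w for w in s.lower().split() if w not in STOPWORDS]
                 for s in sentences]
    # Centroid: how often each content word occurs in the whole input.
    centroid = Counter(w for toks in tokenized for w in toks)
    # Sentence score: sum of centroid weights of its distinct words.
    scores = [sum(centroid[w] for w in set(toks)) for toks in tokenized]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # keep original order


sentences = [
    "the storm hit the coast on friday",
    "the storm damaged homes and roads on the coast",
    "officials praised volunteers",
]
print(centroid_summary(sentences, k=1))
```

Summing weights favors longer sentences, which is one reason a real system would normalize or combine this score with other features.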

ROUGE evaluation: http://www.isi.edu/licensed-sw/see/rouge/index.html

--
Cheers,
Vu

Machine Learning tools

WEKA: http://www.cs.waikato.ac.nz/~ml/index.html
Various machine learning algorithms such as SVM, Bayesian classifiers, decision trees, ... are integrated in this tool. It also supports visualization of machine learning data, which is very helpful for observation and analysis.

--
Cheers,
Vu

Reference Manager Tools (part 1)

There are currently many tools available for managing references in scientific research. I would like to introduce some that are quite good in my opinion:

JabRef: http://sourceforge.net/projects/jabref/. The JabRef output can be flexibly exported to various file formats (e.g. HTML, PDF, ...); see the following figure for a typical example (thanks to Zhao Shanheng for sharing this template file):

Zotero (Firefox add-on): http://www.zotero.org/

ForeCiteNote: http://forecitenote.comp.nus.edu.sg . This is one of the exciting projects undertaken by the WING research group at SoC@NUS. Try it!

--
Cheers,
Vu

Upcoming NLP conferences in 2010

- ACL 2010 (http://acl2010.org/) [rank 1]
- NAACL 2010 (http://naaclhlt2010.isi.edu/) [rank 1]
- EMNLP 2010 [rank 2]
- COLING 2010 (http://www.coling-2010.org/) [rank 2]
- CICLing 2010 [rank 3]
- AI related (NLP tracks):
+ AAAI'10 (http://www.aaai.org/Conferences/AAAI/aaai10.php) [rank 1]
+ ECAI'10 (http://ecai2010.appia.pt/) [rank 2]
+ PRICAI'10 (http://www.pricai2010.org/) [rank 3]
+ ICTAI'10 [rank 2]

My preference is AAAI>EMNLP/COLING/ECAI>PRICAI/CICLING.

Note that I will update their paper submission deadlines as soon as possible :D.

--
Cheers,
Vu

LaTeX and its related issues

LaTeX templates for beginners:
ACM templates: http://www.acm.org/sigs/publications/proceedings-templates

LaTeX editors:
TeXnicCenter: http://www.texniccenter.org/
Texmaker: http://www.xm1math.net/texmaker/ (cross-platform)
TeXstudio: http://texstudio.sourceforge.net/ (cross-platform)

PSTricks:
PSTricks: http://en.wikipedia.org/wiki/PSTricks
LaTeXDraw: http://latexdraw.sourceforge.net/
(supports automatic generation of LaTeX code, very effective)

Graphics tools:
Inkscape (vector graphics): http://www.inkscape.org
GIMP (raster graphics): http://www.gimp.org/

LaTeX tutorials:
http://www.stat.cmu.edu/~hseltman/LatexTips.html
http://www.artofproblemsolving.com/LaTeX/AoPS_L_About.php
http://en.wikibooks.org/wiki/LaTeX
http://dcwww.fys.dtu.dk/~schiotz/comp/LatexTips/LatexTips.html
http://www.maths.tcd.ie/~dwilkins/LaTeXPrimer/Index.html
http://texblog.wordpress.com/: a great LaTeX site

LaTeX resources:
http://www.eng.cam.ac.uk/help/tpl/textprocessing/

LaTeX community: http://www.latex-community.org/

LaTeX tips (ubiquitous):

1) Quotation Marks and Dashes

Single quotation marks are produced in LaTeX using ` and '. Double quotation marks are produced by typing `` and ''. (The `undirected double quote' character " produces double right quotation marks: it should never be used where left quotation marks are required.)

LaTeX allows you to produce dashes of various length, known as `hyphens', `en-dashes' and `em-dashes'. Hyphens are obtained in LaTeX by typing -, en-dashes by typing -- and em-dashes by typing ---.

One normally uses en-dashes when specifying a range of numbers. Thus for example, to specify a range of page numbers, one would type

on pages 155--219.

Dashes used for punctuating are often typeset as em-dashes, especially in older books. These are obtained by typing ---.

(Source: http://www.maths.tcd.ie/~dwilkins/LaTeXPrimer/QuotDash.html)
(to be continued)
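
As a small sketch of the quotation mark and dash rules above (the example sentences are made up for illustration and are not from the primer), a minimal document might look like this:

```latex
\documentclass{article}
\begin{document}
% Single quotes: backtick to open, apostrophe to close.
He called it a `quick fix'.

% Double quotes: two backticks to open, two apostrophes to close.
She replied, ``It is not that simple.''

% Hyphen (-) inside a compound word.
This is a well-known problem.

% En-dash (--) for a range of numbers.
See pages 155--219.

% Em-dash (---) for punctuation.
The result---as expected---was negative.
\end{document}
```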

2) LaTeX mathematical equation editor: http://www.codecogs.com/components/eqneditor/editor.php -> an interactive tool, very cool!

3) Some strategies to include graphics in LaTeX documents
http://www.tug.org/TUGboat/Articles/tb26-1/hoeppner.pdf

4) Using Visio to create EPS files (very helpful)
http://www.win.tue.nl/latex/visioeps.html
http://www.adobe.com/support/downloads/thankyou.jsp?ftpID=1500&fileID=1438 (printer driver for creating EPS files in Visio)

5) Rename "Contents" to "Table of Contents" using
\renewcommand\contentsname{Table of Contents}
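
A minimal sketch of where this line goes, assuming the standard article class: the redefinition should come before \tableofcontents. If the babel package is loaded, it resets \contentsname at \begin{document}, so the redefinition is usually wrapped in \addto\captionsenglish instead (shown commented out; the language name english is an assumption).

```latex
\documentclass{article}

% Without babel: redefine the heading directly in the preamble.
\renewcommand\contentsname{Table of Contents}

% With babel (assumed language: english), use this instead:
% \usepackage[english]{babel}
% \addto\captionsenglish{\renewcommand\contentsname{Table of Contents}}

\begin{document}
\tableofcontents

\section{Introduction}
Some text.
\end{document}
```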

6) Special characters in LaTeX:
http://www.noao.edu/noaoprop/help/symbols/

7) Spell Checking for LaTeX documents
http://www.microspell.com/cgi-bin/spellform.pl

8) Footnote with caption
http://www.latex-community.org/forum/viewtopic.php?f=5&t=1078

9) LaTeX tables with bar charts
http://www.keithv.com/software/barchart/

10) Tips for typesetting mathematical equations in LaTeX: http://moser.cm.nctu.edu.tw/docs/typeset_equations.pdf

11) ... (to be added)