HOANG Cong Duy Vu's research logs: 2014

Wednesday, 31 December 2014

Python

Python is a very powerful programming language, especially for text processing (or NLP), in terms of speed, simplicity, and the vast number of supporting libraries. I have been using it for only two years and already love it.

Here I collect my Python-related resources:

Runtime Libraries

Link: https://www.python.org/downloads/

Link: https://cloud.google.com/appengine/docs/python/

IDE

1) Microsoft Visual Studio

OS supported: Windows, Android, iOS

Debugging: fully supported

Auto completion: fully supported

Additional tools for Visual Studio: PTVS

2) Vim editor

OS supported: Linux

Debugging: none

Auto completion: none

3) More? See this.

Tutorials

1) Everything can be found on the Internet :D.

2) http://stackoverflow.com/

Community-based question answering for everything (including Python).

Thursday, 25 December 2014

Representation Learning

This research topic is very important in machine learning and is a first step for all kinds of machine learning algorithms. Having a robust representation plays a vital role in achieving good performance/accuracy.

Recent progress in representation learning relies on deep architectures (usually referred to as deep learning). Here I collect some very good papers worth reading:

1) Representation Learning: A Review and New Perspectives (Yoshua Bengio et al., 2014)

2) Deep Learning of Representations for Unsupervised and Transfer Learning (Yoshua Bengio et al., 2012)

3) ...

Sunday, 21 December 2014

LaTeX Template

Link: http://www.latextemplates.com
Intro: A BIG categorized collection of LaTeX templates.

Thursday, 11 December 2014

Neural Machine Translation

Researchers around the world (especially the Google guys) are taking the approaches of Statistical Machine Translation (SMT) (e.g. word-based, phrase-based, hierarchical, syntax-based) to the next level, namely Neural Machine Translation.

In general, Neural Machine Translation aims to simplify the SMT approaches by taking the source as an input sequence and producing the target as an output sequence via a single, large neural network.

Here I am trying to catch up with the recent progress in Neural Machine Translation.

*** People & Group

1) LISA Lab, University of Montreal 2014 led by Prof. Yoshua Bengio

Latest Demo: http://104.131.78.120/

2) Quoc Viet Le and co. at Google (e.g. Ilya Sutskever, Nal Kalchbrenner)

3) Phil Blunsom's group at Oxford Uni.

4) Dzmitry Bahdanau at Jacobs University Bremen

5) Richard Socher at Stanford Uni.

6) Kyunghyun Cho at NYU?

7) ...

*** Notable Papers

1a) Sequence to Sequence Learning with Neural Networks (Ilya Sutskever et al., NIPS 2014)

Note:

- The core idea behind neural machine translation.

1b) Generating Sequences With Recurrent Neural Networks (Alex Graves et al., ? 2014)
- TBA

2) Neural Machine Translation by Jointly Learning to Align and Translate (Dzmitry Bahdanau et al., arXiv 2014; ICLR 2015)

3) On the Properties of Neural Machine Translation: Encoder–Decoder Approaches (Kyunghyun Cho et al., SSST-8 2014)

4) Addressing the Rare Word Problem in Neural Machine Translation (Thang Luong et al., drafted version 2014)

5) On Using Monolingual Corpora in Neural Machine Translation (Caglar Gulcehre et al., arXiv 2015)

6) Ask Me Anything: Dynamic Memory Networks for Natural Language Processing (Richard Socher and co. at MetaMind, arXiv June 2015)
- MT result still not yet released!

7) Effective Approaches to Attention-based Neural Machine Translation (Thang Luong et al., EMNLP'15)

8) (to be updated)

In addition, some other approaches use neural models to enhance the current state-of-the-art SMT framework, for example:

*** For Language Model:

1) Decoding with large-scale neural language models improves translation (Ashish Vaswani et al., EMNLP 2013)

Note:

- Resulting toolkit: NPLM ver 0.3 (http://nlg.isi.edu/software/nplm/)

Comments

- It is quite hard to choose the optimal hyperparameters (e.g. number of hidden-layer nodes, input and output embedding dimensions) across datasets and domains.

- In Moses, the NPLM feature slows down decoding.

- It actually improves translation performance when used together with n-gram LM features. But I am not sure whether it can completely replace n-gram LM features.

2) OxLM: A Neural Language Modelling Framework for Machine Translation (Paul Baltescu et al., The Prague Bulletin of Mathematical Linguistics 2014)

Note:

- Resulting toolkit: OxLM (https://github.com/pauldb89/oxlm)

- Moses already has this feature.

3) rwthlm - A toolkit for training neural network language models (feed-forward, recurrent, and long short-term memory neural networks). The software was written by Martin Sundermeyer.

4) (to be updated)

*** For Translation Model:

1) Fast and Robust Neural Network Joint Models for Statistical Machine Translation (Devlin et al., ACL 2014)

Note:

- ACL 2014 best paper award.

- According to the paper, they obtained very impressive performance for Arabic-English translation and good performance for Chinese-English translation (datasets: OpenMT 2012, BOLT; domains: news, web forums).

- Moses already has this feature. Basic implementation of this model is already included in Moses under the name "BilingualLM".

- NPLM can be used to train the models for this.

Comments

- Personally, I tried this model with Moses and evaluated it on conversational domains (e.g. SMS, chat, conversational telephone speech) using the OpenMT'15 datasets. I obtained a good but not very impressive improvement (0.7-1.0 BLEU) over a basic baseline. Using this model together with other strong features did not give the significantly better performance reported in the paper :(.

- Optimizing the parameters for this model is an exhausting task.

2) Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation (Kyunghyun Cho et al., EMNLP 2014)

3) (to be updated)

*** For Reordering Model:

1) Advancements in Reordering Models for Statistical Machine Translation (Minwei Feng et al., ACL 2013)

2) A Neural Reordering Model for Phrase-based Translation (Peng Li et al., COLING 2014)

3) (to be updated)

Wednesday, 10 December 2014

NIPS paper repository

Link: http://papers.nips.cc/

CASMACAT

Link: http://www.casmacat.eu/index.php?n=Installation.HomePage
Intro: a CAT tool for MT

Monday, 8 December 2014

Competitive Programming Book

Link: https://sites.google.com/site/stevenhalim/
Intro: for programming contests

VisuAlgo

Link: http://visualgo.net/

Intro: a tool by Steven Halim to help his students better understand data structures and algorithms, by allowing them to learn the basics on their own and at their own pace.

Monday, 1 December 2014

IBM model 1

*** How to get IBM Model 1 scores

CORPUS=METEO

##########################
# Estimating IBM Model 1 with GIZA++

# First step of Moses training is symmetric
perl train-factored-phrase-model.perl -bin-dir . -scripts-root-dir . -root-dir . -corpus $CORPUS -f f -e e -first-step 1 -last-step 1 -alignment grow-diag-final-and -lm 0:3:lmfile >& log.train

mkdir -p ./giza.f-e

./snt2cooc.out ./corpus/e.vcb ./corpus/f.vcb ./corpus/f-e-int-train.snt > ./giza.f-e/f-e.cooc

# GIZA++ alignment is not symmetric
./GIZA++ -CoocurrenceFile ./giza.f-e/f-e.cooc -c ./corpus/f-e-int-train.snt -m1 19 -m2 0 -m3 0 -m4 0 -mh 0 -m5 0 -model1dumpfrequency 1 -nodumps 0 -o ./giza.f-e/f-e -onlyaldumps 0 -s ./corpus/e.vcb -t ./corpus/f.vcb -emprobforempty 0.0 -probsmooth 0.0 >& LOG.f-e
# Output file: giza.f-e/f-e.t1.X
# Format:
# e_code f_code P(f_word | e_word)

# This script transforms the codes into words (by looking them up in the vocabulary built in the first step)
cat giza.f-e/f-e.t1.19 | perl code2word.pl ./corpus/e.vcb ./corpus/f.vcb > f-e.ibm1.giza

##########################
# Estimating IBM Model 1 with a standalone script
perl ibm1.pl 20 $CORPUS.f $CORPUS.e > f-e.ibm1.standalone
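
For reference, here is a minimal Python sketch of what an IBM Model 1 estimator computes: plain EM over a sentence-aligned corpus, giving the same kind of P(f_word | e_word) table as the GIZA++ output described above. It is not the code of ibm1.pl. The function name train_ibm1, the NULL token handling, the toy corpus, and reading "20" as the number of EM iterations are all assumptions made for illustration.

```python
from collections import defaultdict


def train_ibm1(bitext, iterations=20):
    """Estimate t[(f, e)] = P(f | e) with EM.

    bitext: list of (f_tokens, e_tokens) sentence pairs.
    A NULL token is prepended to every e sentence (an assumption of this sketch).
    """
    f_vocab = {f for f_sent, _ in bitext for f in f_sent}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialisation

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e over all links
        # E-step: distribute each f word over the e words of its sentence
        for f_sent, e_sent in bitext:
            e_with_null = ["NULL"] + list(e_sent)
            for f in f_sent:
                z = sum(t[(f, e)] for e in e_with_null)
                for e in e_with_null:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the expected counts per e word
        t = defaultdict(float)
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


# Hypothetical toy corpus, only to show the call
toy = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
table = train_ibm1(toy, iterations=20)
for (f, e), p in sorted(table.items()):
    if p > 0.01:
        print(e, f, round(p, 3))  # e_word f_word P(f_word | e_word)
```

The printed columns follow the "e f P(f | e)" layout of the GIZA++ t-table, but with words instead of vocabulary codes.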

Machine Learning materials

*** Lecture notes or courses
1) http://dk-techlogic.blogspot.in/2012/05/best-machine-learning-resources.html?m=1
2) https://gtnlp.wordpress.com/readinglist/
3) http://cs229.stanford.edu/materials.html
4) http://ciml.info/
5) Machine Learning for NLP: http://www.cs.columbia.edu/~mcollins/courses/6998-2012/lectures.html

*** ML Community
1) http://www.metacademy.org/roadmaps/
2) http://fastml.com/
3) ...

*** Toolkits
1) Liblinear and Liblinear with SBM (C++, Java,...)
Link: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/…

2) StreamSVM (C++)
Link: http://www.ibis.t.u-tokyo.ac.jp/masin/streamsvm.html

3) Vowpal Wabbit (C++, Python wrapper)
Link: http://hunch.net/~vw/

4) SGD
Link: http://leon.bottou.org/projects/sgd

5) Super-big list of ML software
Link: http://mloss.org/software/

6) ...

(to be updated)

The ClueWeb09 Dataset

Link: http://www.lemurproject.org/clueweb09/

Intro: The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference.

Note: Huge corpus for LM

Deep Learning for NLP

1) CSLM: Continuous Space Language Model toolkit
Link: http://www-lium.univ-lemans.fr/cslm/

Intro: CSLM toolkit is open-source software which implements the so-called continuous space language model.

The basic idea of this approach is to project the word indices onto a continuous space and to use a probability estimator operating on this space. Since the resulting probability functions are smooth functions of the word representation, better generalization to unknown events can be expected. A neural network can be used to simultaneously learn the projection of the words onto the continuous space and to estimate the n-gram probabilities. This is still a n-gram approach, but the LM probabilities are interpolated for any possible context of length n-1 instead of backing-off to shorter contexts. This approach was successfully used in large vocabulary continuous speech recognition and in phrase-based SMT systems.
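
To make the idea above more concrete, here is a small, untrained Python sketch of the forward pass of a feed-forward continuous-space n-gram LM: context word indices are projected onto a continuous space, passed through a hidden layer, and a softmax gives a probability for every word in the vocabulary. It is not the CSLM toolkit's code or API. The names make_model and next_word_probs, the layer sizes, the random weights, and the omission of bias terms are assumptions chosen to keep the sketch short.

```python
import math
import random


def make_model(vocab_size, dim=4, hidden=8, order=3, seed=0):
    """Random (untrained) parameters for an n-gram model of the given order."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "E": mat(vocab_size, dim),            # projection of words onto the continuous space
        "W": mat(hidden, (order - 1) * dim),  # hidden layer over the concatenated context
        "U": mat(vocab_size, hidden),         # output layer, one row per vocabulary word
    }


def next_word_probs(model, context_ids):
    """P(w | context) for every word w, given the n-1 preceding word indices."""
    # 1) look up and concatenate the continuous representations of the context
    x = [v for i in context_ids for v in model["E"][i]]
    # 2) hidden layer with tanh activation (biases omitted in this sketch)
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in model["W"]]
    # 3) softmax over the whole vocabulary, so every context gets a full distribution
    scores = [sum(u * hi for u, hi in zip(row, h)) for row in model["U"]]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


model = make_model(vocab_size=5)
probs = next_word_probs(model, [1, 3])  # indices of the two preceding words (trigram)
print(probs)
```

Because the softmax always yields a full distribution, no back-off to shorter contexts is needed, which is the point made in the description above.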

2) Recurrent Neural Network LM (RNNLM)

Link: http://rnnlm.org/

Intro: Neural network based language models are nowadays among the most successful techniques for statistical language modeling. They can be easily applied in a wide range of tasks, including automatic speech recognition and machine translation, and provide significant improvements over classic backoff n-gram models. The 'rnnlm' toolkit can be used to train, evaluate and use such models.

3) word2vec

Link: https://github.com/dav/word2vec or http://deeplearning4j.org/word2vec.html

Intro: This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.
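
As a rough illustration of the two architectures, the Python sketch below only builds the training examples each one would consume: skip-gram predicts each context word from the centre word, while continuous bag-of-words (CBOW) predicts the centre word from its context. It does not train any vectors and is not the word2vec tool's code. The function names and the window parameter are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """(centre, context) pairs: the centre word is used to predict each neighbour."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """(context_words, centre) examples: the neighbours together predict the centre word."""
    examples = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        examples.append((context, centre))
    return examples


sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```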

4) Long Short Term Memory (LSTM)
Link: http://www.bioinf.jku.at/software/lstm/
Intro: Software for the state-of-the-art Long Short-Term Memory (LSTM) recurrent neural network.

5) DL4J Deep Learning for Java
Link: http://deeplearning4j.org/
Intro: Deeplearning4j is the first commercial-grade, open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Spark, DL4J is designed to be used in business environments, rather than as a research tool. It aims to be cutting-edge plug and play, more convention than configuration, which allows for fast prototyping for non-researchers.

6) CURRENNT
Link: http://sourceforge.net/projects/currennt/
Intro: CURRENNT is a CUDA-enabled machine learning library for Recurrent Neural Networks (RNNs) that uses NVIDIA graphics cards to accelerate the computations and runs on both Windows and Linux machines with CUDA support. The library implements uni- and bidirectional Long Short-Term Memory (LSTM) architectures and supports deep networks as well as very large data sets that do not fit into main memory.

7) ...

*** Deep learning materials

- For NLP

a) http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial

b) http://cl.naist.jp/~kevinduh/a/deep2014/

c) http://ronan.collobert.com/pub/matos/2009_tutorial_nips.pdf

- For general background

a) http://cl.naist.jp/~kevinduh/notes/duh13deepadvances.pdf

b) http://www.trivedigaurav.com/blog/quoc-les-lectures-on-deep-learning/

c) http://deeplearning.net/reading-list/
d) (new book, 2014, draft version) http://www.iro.umontreal.ca/~bengioy/dlbook/
e) http://people.idsia.ch/~juergen/deep-learning-overview.html

- ...

(to be updated)

25 Websites That Will Make You Smarter

Link: http://www.businessinsider.sg/25-websites-that-will-make-you-smarter-2014-11

Sentence-level alignment of parallel texts

Link: https://github.com/rsennrich/Bleualign

Intro: Bleualign is a tool to align parallel texts (i.e. a text and its translation) on a sentence level. In addition to the source and target text, Bleualign requires an automatic translation of at least one of the texts. The alignment is then performed on the basis of the similarity (modified BLEU score) between the source text sentences (translated into the target language) and the target text sentences.
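
The matching idea can be sketched in a few lines of Python: score every machine-translated source sentence against every target sentence and keep the most similar one. This is only an illustration, not Bleualign's implementation. The similarity here is a plain average of clipped 1-gram and 2-gram precision standing in for its modified BLEU score, and the function names and example sentences are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_score(hyp, ref, max_n=2):
    """Average clipped n-gram precision of hyp against ref for n = 1..max_n."""
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        total = sum(h.values())
        if total == 0:
            precisions.append(0.0)
            continue
        matched = sum(min(c, r[g]) for g, c in h.items())
        precisions.append(matched / total)
    return sum(precisions) / len(precisions)


def best_matches(translated_src, tgt):
    """For each translated source sentence, the index of the most similar target sentence."""
    result = []
    for hyp in translated_src:
        scores = [overlap_score(hyp.split(), ref.split()) for ref in tgt]
        result.append(max(range(len(tgt)), key=scores.__getitem__))
    return result


translated = ["the cat sleeps", "i like tea"]         # MT output of the source side
target = ["i really like tea", "the cat is sleeping"]  # real target-side sentences
print(best_matches(translated, target))
```

A real aligner also has to respect sentence order and handle 1-to-many alignments, which this greedy sketch ignores.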

Clustering of Parallel Text

Link: https://github.com/rsennrich/bitext_clusterer

Intro: This program performs sentence-level k-means clustering for parallel texts based on language model similarity.

Monday, 27 October 2014

Armadillo - C++ linear algebra library

Link: http://arma.sourceforge.net/

Intro: Armadillo is a high quality C++ linear algebra library, aiming towards a good balance between speed and ease of use; the syntax (API) is deliberately similar to Matlab.

Monday, 13 October 2014

Q&A community for machine learning

Link: http://metaoptimize.com/qa/

Tuesday, 23 September 2014

WaCky - The Web-As-Corpus Kool Yinitiative

Link: http://wacky.sslmit.unibo.it/doku.php?id=tools

Sunday, 6 July 2014

Online tools for researchers

Link: http://connectedresearchers.com/online-tools-for-researchers/

Assistant Tools for Scientific Paper Writing

1) PaperRater (online)
Link: http://www.paperrater.com/free_paper_grader

Intro: PaperRater.com is a free resource that utilizes Artificial Intelligence to help students write better. Our technology combines Natural Language Processing, Machine Learning, Information Retrieval, Computational Linguistics, and Data Mining to produce the most powerful automated proofreading tool available on the Internet today. PaperRater.com is used by schools and universities in over 46 countries to help students improve their writing and check for plagiarism.

2) SWAN (offline)
Link: https://cs.joensuu.fi/swan/index.html

Intro: This Swan - Scientific Writing AssistaNt - aims at helping writers with the content, not the grammar or spelling. It guides you towards known good scientific writing practices and helps your readers find your contribution. The tool was designed to help you with your writing, not to merely point out errors. Using the tool should be simple; just enter your text sections into the tool, optionally make some manual elaboration and click the "Evaluate" button. Once you have used the tool with one of your own scientific papers, do let us know how it has helped you.

3) TBA

Thursday, 3 July 2014

Freebase

Link: https://www.freebase.com/
API:
1) https://developers.google.com/freebase/data
2) http://wiki.freebase.com/wiki/Open_source

Tuesday, 13 May 2014

Wikipedia Extractor

Link: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor

Intro: WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump. The output is stored in a number of files of similar size in a given directory. Each file contains several documents in the document format.

Wednesday, 23 April 2014

Charset Detector Tool in Java

Link: http://sourceforge.net/projects/jchardet/
Intro: jchardet is a Java port of the source of Mozilla's automatic charset detection algorithm.

Monday, 14 April 2014

TempoWordNet

Link: https://tempowordnet.greyc.fr/

Intro: TempoWordNet is a free lexical knowledge base for temporal analysis where each synset of WordNet is assigned to its intrinsic temporal values. Each synset of WordNet is automatically time-tagged with four dimensions: atemporal, past, present and future.

Thursday, 20 March 2014

Cross-platform Programming

Some cross-platform programming IDE & tools:

1) Code::Blocks (an alternative to Microsoft VC++ IDE in Linux or even Windows): http://www.codeblocks.org/
2) wxWidgets (cross-platform GUI programming library): http://www.wxwidgets.org/downloads/
3) ...

Thursday, 13 March 2014

RegEx online

Link: http://regex101.com/
Intro: a great online tool for testing your regular expressions.
Complete Tutorial: http://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf
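
Once a pattern works in the online tester, it can be dropped into Python's re module. As a small sketch, the pattern below parses date headings like the ones used in this log (e.g. "Thursday, 13 March 2014"). The pattern, the group names, and the function name parse_heading are assumptions for illustration.

```python
import re

# Weekday, day of month, month name, four-digit year
DATE_RE = re.compile(
    r"^(?P<weekday>[A-Z][a-z]+day), (?P<day>\d{1,2}) (?P<month>[A-Z][a-z]+) (?P<year>\d{4})$"
)


def parse_heading(line):
    """Return the date parts as a dict, or None if the line is not a date heading."""
    m = DATE_RE.match(line.strip())
    return m.groupdict() if m else None


print(parse_heading("Thursday, 13 March 2014"))
print(parse_heading("RegEx online"))
```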

Tuesday, 18 February 2014

Some useful C++ programming libraries

1) http://partow.net/programming/
2) ...