Wednesday, 31 December 2014


Python is a very powerful programming language, especially for text processing (and NLP) in terms of speed, simplicity, and its vast library support. I've been using it for two years and already love it.

Here I collect my resources relating to Python:

Runtime Libraries


OS supported: Windows, Android, iOS
Debugging: fully supported
Auto completion: fully supported
Additional tools for Visual Studio: PTVS

2) Vim editor
OS supported: Linux
Debugging: none
Auto completion: none

3) More? See this.

1) Everything can be searched on the Internet :D.

Community-based question answering for everything (including Python).

Thursday, 25 December 2014

Representation Learning

This research topic is very important in machine learning and a first step for all kinds of machine learning algorithms. A robust representation definitely plays a vital role in achieving good performance/accuracy.

Recent progress in representation learning uses deep architectures (normally referred to as deep learning). Here I collect some very good papers worth reading:

3) ...

Sunday, 21 December 2014

LaTeX Templates

Intro: A BIG categorized collection of LaTeX templates.

Thursday, 11 December 2014

Neural Machine Translation

Scientists around the world (especially at Google) are moving the approaches of Statistical Machine Translation (SMT) (e.g. word-based, phrase-based, hierarchical, syntax-based) to the next level, namely Neural Machine Translation.

In general, Neural Machine Translation aims to simplify the SMT approaches by taking the source as an input sequence and producing the target as an output sequence via a single, large neural network.

Here I am trying to catch up with the recent progress of Neural Machine Translation.
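To make the encoder-decoder idea concrete, here is a toy sketch in plain numpy: a simple recurrent encoder folds the source sequence into one fixed-size vector, and a decoder step turns that vector into a distribution over the target vocabulary. The weights are random and untrained and all sizes are made up; this only illustrates the data flow, not any published model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: vocabulary of 10 tokens, 8-dim hidden state.
V, H = 10, 8
E = rng.normal(size=(V, H)) * 0.1        # embedding table
W_enc = rng.normal(size=(H, H)) * 0.1    # encoder recurrence
W_dec = rng.normal(size=(H, H)) * 0.1    # decoder recurrence
W_out = rng.normal(size=(H, V)) * 0.1    # hidden state -> vocabulary logits

def encode(src_ids):
    """Fold the whole source sequence into one fixed-size vector."""
    h = np.zeros(H)
    for i in src_ids:
        h = np.tanh(E[i] + W_enc @ h)
    return h

def decode_step(h, prev_id):
    """One decoder step: update the state, return the next-token distribution."""
    h = np.tanh(E[prev_id] + W_dec @ h)
    logits = h @ W_out
    p = np.exp(logits - logits.max())    # softmax
    return h, p / p.sum()

h = encode([3, 1, 4])       # "read" a source sentence of token ids
h, p = decode_step(h, 0)    # start decoding from a BOS id (0 here)
```

In a trained system the decoder would be run repeatedly, feeding back the chosen token at each step until an end-of-sentence symbol is produced.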

*** People & Group

1) LISA Lab, University of Montreal, led by Prof. Yoshua Bengio
Latest Demo:

2) Quoc Viet Le and co. at Google (e.g. Ilya Sutskever, Nal Kalchbrenner)

3) Phil Blunsom's group at Oxford Uni.

4) Dzmitry Bahdanau at Jacobs University Bremen

5) Richard Socher at Stanford Uni.

6) Kyunghyun Cho at NYU?

7) ...

*** Notable Papers
1a) Sequence to Sequence Learning with Neural Networks (Ilya Sutskever et al., NIPS 2014)
- The core idea behind neural machine translation.

1b) Generating Sequences With Recurrent Neural Networks (Alex Graves, arXiv 2013)

4) Addressing the Rare Word Problem in Neural Machine Translation (Thang Luong et al., drafted version 2014)

5) On Using Monolingual Corpora in Neural Machine Translation (Caglar Gulcehre et al., arXiv 2015)

6) Ask Me Anything: Dynamic Memory Networks for Natural Language Processing (Richard Socher and co. at MetaMind, arXiv June 2015)
- MT result still not yet released!

7) Effective Approaches to Attention-based Neural Machine Translation (Thang Luong et al., EMNLP'15)

8) (to be updated)

In addition, some other approaches utilize neural models to enhance the current state-of-the-art SMT framework, for example:

*** For Language Model:

1) Decoding with Large-Scale Neural Language Models Improves Translation (Ashish Vaswani et al., EMNLP 2013)
- Resulting toolkit: NPLM ver 0.3

- It is quite hard to choose optimal parameters (e.g. number of hidden layer nodes, input and output embedding dimensions) across data sets and domains.
- In Moses, the NPLM feature slows down the decoder.
- It actually improves translation performance when used together with n-gram LM features. But I am not sure whether it can completely replace n-gram LM features.

2) OxLM: A Neural Language Modelling Framework for Machine Translation (Paul Baltescu et al., The Prague Bulletin of Mathematical Linguistics 2014)

- Resulting toolkit: OxLM
- Moses already has this feature.

3) rwthlm - a toolkit for training neural network language models (feed-forward, recurrent, and long short-term memory neural networks), written by Martin Sundermeyer.

4) (to be updated)

*** For Translation Model:

1) Fast and Robust Neural Network Joint Models for Statistical Machine Translation (Jacob Devlin et al., ACL 2014)
- ACL 2014 best paper award.
- According to the paper, they obtained very impressive performance for Arabic-English translation and good performance for Chinese-English translation (datasets: OpenMT 2012, BOLT; domains: news, web forums).
- A basic implementation of this model is already included in Moses under the name "BilingualLM".
- NPLM can be used to train the models for this.

- Personally, I tried this model with Moses and evaluated it on conversational domains (e.g. SMS, chat, conversational telephone speech) using the OpenMT'15 datasets. I obtained good (but not very impressive, 0.7-1.0 BLEU) gains over a basic baseline. Using this model together with other strong features did not give significantly better performance as claimed in the paper :(.
- Optimizing the parameters for this model is an exhausting task.

3) (to be updated)

*** For Reordering Model:
1) Advancements in Reordering Models for Statistical Machine Translation (Minwei Feng et al., ACL 2013)

2) A Neural Reordering Model for Phrase-based Translation (Peng Li et al., COLING 2014)

3) (to be updated)

Monday, 8 December 2014

Competitive Programming Book

Intro: for programming contests


Intro: a tool to help students better understand data structures and algorithms, by allowing them to learn the basics on their own and at their own pace.

Monday, 1 December 2014

IBM model 1

*** Way to get IBM model 1 score


# Estimating IBM Model 1 with GIZA++

# First step of Moses training is symmetric
perl train-factored-phrase-model.perl -bin-dir . -scripts-root-dir . -root-dir . -corpus $CORPUS -f f -e e -first-step 1 -last-step 1 -alignment grow-diag-final-and -lm 0:3:lmfile >& log.train

mkdir -p ./giza.f-e

./snt2cooc.out ./corpus/e.vcb ./corpus/f.vcb ./corpus/f-e-int-train.snt > ./giza.f-e/f-e.cooc

# GIZA++ alignment is not symmetric
./GIZA++ -CoocurrenceFile ./giza.f-e/f-e.cooc -c ./corpus/f-e-int-train.snt -m1 19 -m2 0 -m3 0 -m4 0 -mh 0 -m5 0 -model1dumpfrequency 1 -nodumps 0 -o ./giza.f-e/f-e -onlyaldumps 0 -s ./corpus/e.vcb -t ./corpus/f.vcb -emprobforempty 0.0 -probsmooth 0.0 >& LOG.f-e
# Output file: giza.f-e/f-e.t1.X
# Format:
# e_code f_code P(f_word | e_word)

# With this script you transform codes into words (looking up the vocabulary built in the first step)
cat giza.f-e/f-e.t1.19 | perl ./corpus/e.vcb ./corpus/f.vcb > f-e.ibm1.giza

# Estimating IBM Model 1 with a standalone tool
perl 20 $CORPUS.f $CORPUS.e > f-e.ibm1.standalone

Machine Learning materials

*** Lecture notes or courses
5) Machine Learning for NLP:

*** ML Community
3) ...

*** Toolkits
1) Liblinear and Liblinear with SBM (C++, Java, ...)

2) StreamSVM (C++)

3) Vowpal Wabbit (C++, Python wrapper)

4) SGD

5) A super-big list of ML software

6) ...

(to be updated)

The ClueWeb09 Dataset

Intro: The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference.
Note: Huge corpus for LM

Deep Learning for NLP

1) CSLM: Continuous Space Language Model toolkit
Intro: the CSLM toolkit is open-source software which implements the so-called continuous space language model.
The basic idea of this approach is to project the word indices onto a continuous space and to use a probability estimator operating on this space. Since the resulting probability functions are smooth functions of the word representation, better generalization to unknown events can be expected. A neural network can be used to simultaneously learn the projection of the words onto the continuous space and to estimate the n-gram probabilities. This is still an n-gram approach, but the LM probabilities are interpolated for any possible context of length n-1 instead of backing off to shorter contexts. This approach was successfully used in large-vocabulary continuous speech recognition and in phrase-based SMT systems.
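The projection-plus-estimator idea can be sketched as the forward pass of a tiny feed-forward n-gram LM: look up each context word in a continuous projection table, concatenate, and push through a hidden layer to a softmax over the vocabulary. Weights are random and untrained and all sizes are made up; this illustrates the architecture only, not the CSLM toolkit itself.

```python
import numpy as np

rng = np.random.default_rng(1)

V, D, H, N = 20, 6, 16, 3          # toy vocab size, embedding dim, hidden dim, n-gram order
C = rng.normal(size=(V, D)) * 0.1  # projection layer: word index -> continuous vector
W1 = rng.normal(size=((N - 1) * D, H)) * 0.1
W2 = rng.normal(size=(H, V)) * 0.1

def ngram_prob(context_ids):
    """P(next word | n-1 context words), computed from the continuous representations."""
    x = np.concatenate([C[i] for i in context_ids])  # project and concatenate the context
    h = np.tanh(x @ W1)                              # hidden layer
    logits = h @ W2
    p = np.exp(logits - logits.max())                # softmax over the whole vocabulary
    return p / p.sum()

p = ngram_prob([4, 7])  # distribution over all 20 toy words given a 2-word context
```

Unlike a backoff n-gram model, nearby contexts share the smooth projection space, so any context of length n-1 gets a probability without backing off.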

2) Recurrent Neural Network LM (RNNLM)
Intro: Neural network based language models are nowadays among the most successful techniques for statistical language modeling. They can easily be applied in a wide range of tasks, including automatic speech recognition and machine translation, and provide significant improvements over classic backoff n-gram models. The 'rnnlm' toolkit can be used to train, evaluate and use such models.

3) word2vec
Intro: This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.
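The skip-gram architecture trains on (center word, context word) pairs drawn from a sliding window over the text. A minimal sketch of that pair extraction (my own illustration, not word2vec's actual code):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as in word2vec's skip-gram setup."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                       # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat sat on the mat".split(), window=1)
```

word2vec then learns vectors such that a center word's embedding predicts its context words (via hierarchical softmax or negative sampling); the continuous bag-of-words variant reverses the direction and predicts the center from the averaged context.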

4) Long Short Term Memory (LSTM)
Intro: software for state-of-the-art recurrent neural networks (Long Short-Term Memory).

5) DL4J Deep Learning for Java
Intro: Deeplearning4j is the first commercial-grade, open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Spark, DL4J is designed to be used in business environments, rather than as a research tool. It aims to be cutting-edge plug and play, more convention than configuration, which allows for fast prototyping for non-researchers.

6) CURRENNT
Intro: CURRENNT is a CUDA-enabled machine learning library for Recurrent Neural Networks (RNNs) which uses NVIDIA graphics cards to accelerate the computations, and runs on both Windows and Linux machines with CUDA support. The library implements uni- and bidirectional Long Short-Term Memory (LSTM) architectures and supports deep networks as well as very large data sets that do not fit into main memory.

7) ...

*** Deep learning materials
- For NLP

- For general background

- ...

(to be updated)

25 Websites That Will Make You Smarter


Sentence-level Alignment of Parallel Texts

Intro: Bleualign is a tool to align parallel texts (i.e. a text and its translation) at the sentence level. In addition to the source and target text, Bleualign requires an automatic translation of at least one of the texts. The alignment is then performed on the basis of the similarity (modified BLEU score) between the source text sentences (translated into the target language) and the target text sentences.
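As a rough illustration of the kind of modified-BLEU similarity involved, here is a toy smoothed sentence-level BLEU in Python. This is my own simplification for intuition, not Bleualign's actual scoring function:

```python
from collections import Counter
import math

def sent_bleu(hyp, ref, max_n=2):
    """Toy sentence-level BLEU: clipped n-gram precision with add-one smoothing."""
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        match = sum(min(c, r[g]) for g, c in h.items())        # clipped matches
        log_prec += math.log((match + 1) / (sum(h.values()) + 1)) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))           # brevity penalty
    return bp * math.exp(log_prec)
```

An aligner in this style scores each candidate (machine-translated source sentence, target sentence) pair with such a similarity and keeps the highest-scoring pairing.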

Clustering of Parallel Text

Intro: This program performs sentence-level k-means clustering for parallel texts based on language model similarity.
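The underlying k-means loop can be sketched in a few lines; here I use made-up 2-D points in place of real LM-based sentence features, and a naive deterministic initialization instead of the program's actual code:

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain k-means: assign each point to its nearest centroid, then recompute means."""
    # Naive deterministic init: pick k evenly spaced points (real code: k-means++).
    centroids = X[::max(1, len(X) // k)][:k].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():          # keep the old centroid if a cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated groups of points stand in for sentence feature vectors.
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
              [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]])
labels, centroids = kmeans(X, k=2)
```

For parallel text, both halves of a sentence pair get the same cluster label, so each cluster yields a smaller, more homogeneous parallel corpus.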

Monday, 27 October 2014

Armadillo - C++ linear algebra library

Intro: Armadillo is a high-quality C++ linear algebra library, aiming at a good balance between speed and ease of use; the syntax (API) is deliberately similar to Matlab's.

Sunday, 6 July 2014

Online tools for researchers


Assistant Tools for Scientific Paper Writing

1) PaperRater (online)
Intro: PaperRater is a free resource that utilizes Artificial Intelligence to help students write better. Its technology combines Natural Language Processing, Machine Learning, Information Retrieval, Computational Linguistics, and Data Mining to produce a powerful automated proofreading tool. It is used by schools and universities in over 46 countries to help students improve their writing and check for plagiarism.

2) SWAN (offline)
Intro: SWAN (Scientific Writing AssistaNt) aims at helping writers with the content, not the grammar or spelling. It guides you towards known good scientific writing practices and helps your readers find your contribution. The tool was designed to help you with your writing, not merely to point out errors. Using the tool is simple: just enter your text sections, optionally make some manual adjustments, and click the "Evaluate" button. Once you have used the tool with one of your own scientific papers, do let the authors know how it has helped you.

3) TBA

Tuesday, 13 May 2014

Wikipedia Extractor

Intro: a Python script that extracts and cleans text from a Wikipedia database dump. The output is stored in a number of files of similar size in a given directory. Each file contains several documents in the document format.

Wednesday, 23 April 2014

Charset Detector Tool in Java

Intro: chardet is a Java port of the source of Mozilla's automatic charset detection algorithm.

Monday, 14 April 2014


TempoWordNet

Intro: TempoWordNet is a free lexical knowledge base for temporal analysis where each synset of WordNet is assigned its intrinsic temporal values. Each synset of WordNet is automatically time-tagged with four dimensions: atemporal, past, present and future.

Thursday, 20 March 2014

Cross-platform Programming

Some cross-platform programming IDE & tools:

1) Code::Blocks (an alternative to Microsoft VC++ IDE in Linux or even Windows):
2) wxWidgets (cross-platform GUI programming library):
3) ...

Thursday, 13 March 2014

RegEx online

Intro: a great online tool for testing your regular expressions.
Complete Tutorial
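For quick offline checks, Python's standard re module works just as well as an online tester. A small example (the pattern and test string here are my own, made up for illustration):

```python
import re

# Hypothetical pattern: match dates like "13 March 2014" (day, month name, year).
pattern = re.compile(r"(\d{1,2})\s+([A-Z][a-z]+)\s+(\d{4})")

m = pattern.search("Posted on Thursday, 13 March 2014 by the author")
day, month, year = m.groups()   # capture groups: ("13", "March", "2014")
```

Online testers mainly add live highlighting and group inspection on top of this; the matching semantics are the same.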