Monday, 1 December 2014

IBM Model 1

*** How to compute IBM Model 1 scores


# Estimating IBM Model 1 with GIZA++

# The first step of Moses training (corpus preparation) is symmetric in the two languages
perl train-factored-phrase-model.perl -bin-dir . -scripts-root-dir . -root-dir . -corpus $CORPUS -f f -e e -first-step 1 -last-step 1 -alignment grow-diag-final-and -lm 0:3:lmfile >& log.train

mkdir -p ./giza.f-e

./snt2cooc.out ./corpus/e.vcb ./corpus/f.vcb ./corpus/f-e-int-train.snt > ./giza.f-e/f-e.cooc

# GIZA++ alignment is not symmetric; this run estimates the f-to-e direction
./GIZA++ -CoocurrenceFile ./giza.f-e/f-e.cooc -c ./corpus/f-e-int-train.snt -m1 19 -m2 0 -m3 0 -m4 0 -mh 0 -m5 0 -model1dumpfrequency 1 -nodumps 0 -o ./giza.f-e/f-e -onlyaldumps 0 -s ./corpus/e.vcb -t ./corpus/f.vcb -emprobforempty 0.0 -probsmooth 0.0 >& LOG.f-e
# Output file: giza.f-e/f-e.t1.X (X = Model 1 iteration number; 19 here, matching -m1 19)
# Format:
# e_code f_code P(f_word | e_word)

# This script maps the codes back to words (looking them up in the vocabularies built in the first step)
cat giza.f-e/f-e.t1.19 | perl <code2word-script> ./corpus/e.vcb ./corpus/f.vcb > f-e.ibm1.giza
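If the lookup script is not at hand, the mapping it performs can be sketched in Python. This is a hypothetical stand-in (the script name and this code are not from GIZA++), assuming the standard GIZA++ `.vcb` format of one `id word frequency` triple per line, with id 0 reserved for the NULL word:

```python
def load_vcb(path):
    """Read a GIZA++ vocabulary file into {id: word}."""
    vocab = {0: "NULL"}          # id 0 is the empty/NULL word by convention
    with open(path) as fh:
        for line in fh:
            idx, word, _freq = line.split()
            vocab[int(idx)] = word
    return vocab

def codes_to_words(t1_path, e_vcb_path, f_vcb_path, out_path):
    """Rewrite 'e_code f_code P(f|e)' lines as 'e_word f_word P(f|e)'."""
    e_vocab = load_vcb(e_vcb_path)
    f_vocab = load_vcb(f_vcb_path)
    with open(t1_path) as src, open(out_path, "w") as dst:
        for line in src:
            e_code, f_code, prob = line.split()
            dst.write(f"{e_vocab[int(e_code)]} {f_vocab[int(f_code)]} {prob}\n")
```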

# Estimating IBM Model 1 with a standalone program
perl <ibm1-script> 20 $CORPUS.f $CORPUS.e > f-e.ibm1.standalone
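The model that both tools estimate can be sketched in a few lines: IBM Model 1 is plain EM over a lexical translation table t(f|e). A toy illustration of the training loop (no smoothing or pruning, unlike GIZA++):

```python
# Minimal IBM Model 1 trainer (EM), illustrating what GIZA++ estimates
# with "-m1 N": N EM iterations of the lexical translation table t(f|e).
from collections import defaultdict

def train_ibm1(bitext, iterations=5):
    """bitext: list of (f_sentence, e_sentence) token-list pairs.
    Returns t[(f, e)] = P(f | e), with 'NULL' added to each e side."""
    # uniform initialization over the observed f vocabulary
    f_vocab = {f for fs, _ in bitext for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for fs, es in bitext:
            es = ["NULL"] + es
            for f in fs:
                z = sum(t[(f, e)] for e in es)      # normalization
                for e in es:
                    p = t[(f, e)] / z               # posterior of this link
                    count[(f, e)] += p
                    total[e] += p
        for (f, e), c in count.items():             # M-step
            t[(f, e)] = c / total[e]
    return t
```

On the classic two-sentence example below, the ambiguous word pair is resolved after a few iterations, which is exactly the effect the `t1.X` tables above record.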

Machine Learning materials

*** Lecture notes or courses
5) Machine Learning for NLP:

*** ML Community
3) ...

*** Toolkits
1) Liblinear and Liblinear with SBM (C++, Java, ...)

2) StreamSVM (C++)

3) Vowpal Wabbit (C++, Python wrapper)

4) SGD

5) A very large list of ML software

6) ...

(to be updated)

The ClueWeb09 Dataset

Intro: The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference.
Note: Huge corpus for LM

Deep Learning for NLP

1) CSLM: Continuous Space Language Model toolkit
Intro: The CSLM toolkit is open-source software which implements the so-called continuous space language model.
The basic idea of this approach is to project the word indices onto a continuous space and to use a probability estimator operating on this space. Since the resulting probability functions are smooth functions of the word representation, better generalization to unknown events can be expected. A neural network can be used to simultaneously learn the projection of the words onto the continuous space and to estimate the n-gram probabilities. This is still an n-gram approach, but the LM probabilities are interpolated for any possible context of length n-1 instead of backing off to shorter contexts. This approach was successfully used in large-vocabulary continuous speech recognition and in phrase-based SMT systems.
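The projection-plus-estimator idea can be sketched as a tiny feedforward n-gram model. This is only an illustration of the architecture CSLM implements, not CSLM code; the sizes and random weights below are made up:

```python
# Toy continuous-space n-gram LM forward pass: project the n-1 context
# word indices into a continuous space, then estimate P(w | context)
# with a small network and a softmax over the whole vocabulary.
import numpy as np

rng = np.random.default_rng(0)
V, dim, hidden, n = 1000, 32, 64, 4          # vocab, embedding, hidden, n-gram order

P = rng.normal(scale=0.1, size=(V, dim))     # shared projection matrix
W1 = rng.normal(scale=0.1, size=((n - 1) * dim, hidden))
W2 = rng.normal(scale=0.1, size=(hidden, V))

def ngram_probs(context_ids):
    """P(next word | n-1 context word ids) over the whole vocabulary."""
    x = P[context_ids].reshape(-1)           # concatenate the projections
    h = np.tanh(x @ W1)                      # hidden layer
    logits = h @ W2
    e = np.exp(logits - logits.max())        # numerically stable softmax
    return e / e.sum()

probs = ngram_probs([12, 7, 341])            # any n-1 in-vocabulary word ids
```

Because every context is mapped through the same smooth projection, nearby contexts get similar distributions, which is the generalization effect the paragraph above describes.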

2) Recurrent Neural Network LM (RNNLM)
Intro: Neural network based language models are nowadays among the most successful techniques for statistical language modeling. They can be easily applied in a wide range of tasks, including automatic speech recognition and machine translation, and provide significant improvements over classic backoff n-gram models. The 'rnnlm' toolkit can be used to train, evaluate and use such models.

3) word2vec
Intro: This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.
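In the skip-gram architecture, each word is trained to predict the words in a window around it. The pair extraction it learns from can be sketched as follows (a toy illustration, not word2vec's code):

```python
# Generate skip-gram (center, context) training pairs: for each position,
# every word within +/- window positions becomes a context example.

def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs within +/- window positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat"], window=1)
```

The continuous bag-of-words architecture inverts this: the averaged context predicts the center word.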

4) Long Short Term Memory (LSTM)
Intro: Software for state-of-the-art recurrent neural networks (Long Short-Term Memory Software).

5) DL4J Deep Learning for Java
Intro: Deeplearning4j is the first commercial-grade, open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Spark, DL4J is designed to be used in business environments, rather than as a research tool. It aims to be cutting-edge plug and play, more convention than configuration, which allows for fast prototyping for non-researchers.

6) CURRENNT
Intro: CURRENNT is a CUDA-enabled machine learning library for Recurrent Neural Networks (RNNs) which uses NVIDIA graphics cards to accelerate the computations, and runs on both Windows and Linux machines with CUDA-capable hardware. The library implements uni- and bidirectional Long Short-Term Memory (LSTM) architectures and supports deep networks as well as very large data sets that do not fit into main memory.

7) ...

*** Deep learning materials
- For NLP

- For general background

- ...

(to be updated)

25 Websites That Will Make You Smarter


Sentence-level alignment of parallel texts

Intro: Bleualign is a tool to align parallel texts (i.e. a text and its translation) on a sentence level. In addition to the source and target text, Bleualign requires an automatic translation of at least one of the texts. The alignment is then performed on the basis of the similarity (modified BLEU score) between the source text sentences (translated into the target language) and the target text sentences.
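The similarity idea can be sketched as follows: score each (machine-translated source, target) sentence pair with a clipped n-gram precision and pick the best match per sentence. Bleualign itself uses a modified BLEU score plus dynamic programming over the whole document; this toy sketch shows only the greedy, similarity-based core:

```python
# Greedy sentence matching by clipped unigram/bigram precision,
# a simplified stand-in for Bleualign's modified-BLEU similarity.
from collections import Counter

def modified_precision(hyp, ref, n):
    """Clipped n-gram precision of token list hyp against ref."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return overlap / max(1, sum(hyp_ngrams.values()))

def similarity(hyp, ref):
    """Average of clipped unigram and bigram precision."""
    return 0.5 * (modified_precision(hyp, ref, 1) + modified_precision(hyp, ref, 2))

def align(translated_src, tgt):
    """For each translated source sentence, index of the most similar target."""
    return [max(range(len(tgt)), key=lambda j: similarity(s, tgt[j]))
            for s in translated_src]
```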

Clustering of Parallel Text

Intro: This program performs sentence-level k-means clustering for parallel texts based on language model similarity.
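The clustering step can be sketched with plain k-means. The program's actual features are language-model similarities, which are specific to the tool; this stand-in just runs standard k-means over whatever numeric sentence vectors you supply (illustration only):

```python
# Standard k-means: assign each vector to the nearest centroid, then
# move each centroid to the mean of its assigned vectors, and repeat.
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Return (centroids, labels) for row-vector data X."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # squared distances, shape (n_points, k)
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for c in range(k):
            if (labels == c).any():
                centroids[c] = X[labels == c].mean(0)
    return centroids, labels
```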