HOANG Cong Duy Vu's research logs: 2015

Wednesday, 11 November 2015

Deep Learning Frameworks

There are many deep learning frameworks out there, and the right choice depends on your purpose, your familiarity with the programming language, and the tasks you work on. Here I summarize only the ones I am familiar with:

1) TensorFlow by Google (released on 10 Nov 2015): http://tensorflow.org/

Comment:

2) VELES by Samsung (released on 11 Nov 2015): https://velesnet.ml/

Comment:

3) cnn (lightweight and very fast neural network library in C++, also in Python, works both on Windows and Linux machines): https://github.com/kaishengyao/cnn

Comment: cnn is reported to be much faster than Theano, both with and without a GPU. It also suits production use of neural network models, since it is developed in C++ and, more importantly, supports both Windows and Linux.

4) to be updated

Sunday, 30 August 2015

Wiki Parallel Data Extractor

Link: https://github.com/clab/wikipedia-parallel-titles
Intro: Tools for extracting parallel corpora from article titles across languages in Wikipedia

Saturday, 25 July 2015

Tay Nung dictionary

Link: https://sites.google.com/site/tndict/home

Intro: I just learned about this thanks to some people in the Vietnamese NLP group on Facebook. One interesting problem is how to preserve and develop local regional languages (e.g. Tay Nung in the above link) in parallel with the official Vietnamese language.

P.S.: I will come back to this issue one day.

Thursday, 23 July 2015

Japanese-English Parallel Datasets

Link: http://phontron.com/japanese-translation-data.php

Tuesday, 21 July 2015

Hansard corpus

Link: http://www.hansard-corpus.org/

Intro: This Hansard corpus (or collection of texts) contains nearly every speech given in the British Parliament from 1803-2005, and it allows you to search these speeches (including semantically-based searches) in ways that are not possible with any other resource.

Sunday, 19 July 2015

easyloggingpp - lightweight logging library for C++

Link: https://github.com/easylogging/easyloggingpp

Intro: Single-header-only C++ logging library. It is extremely lightweight, robust, fast, thread-safe and type-safe, and comes with many built-in features. It provides the ability to write logs in your own customized format. It also provides support for logging your classes, third-party libraries, STL and third-party containers, etc.

Friday, 17 July 2015

OLAC - Open Language Archives Community

Link: http://www.language-archives.org/

Intro: OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources.

Visualgdb

Link: http://visualgdb.com/tutorials/linux/import/

Intro: This tool helps you import a Linux project from a Linux machine into Visual Studio so that you can build and debug it remotely.

It is probably a comfortable choice for a Windows and Visual Studio fan who wants to compile a project remotely on Linux.

From a Linux coder's point of view, it is not a good approach. You may want to learn how to use gdb from the command line instead.

Wednesday, 8 July 2015

IR book by Bruce Croft

Title: Search Engines: Information Retrieval in Practice, by Prof. Bruce Croft
Link: http://ciir.cs.umass.edu/irbook/

Intro: This book provides an overview of the important issues in information retrieval, and how those issues affect the design and implementation of search engines. Not every topic is covered at the same level of detail. The focus is on some of the most important alternatives to implementing search engine components and the information retrieval models underlying them. The target audience for the book is advanced undergraduates in computer science, although it is also a useful introduction for graduate students. (from the link)

NAACL 2015 papers

Link: http://naacl.org/naacl-hlt-2015/papers.html

Here is my subjective list of remarkable papers relating to MT research:

*** Neural Machine Translation

Paul Baltescu and Phil Blunsom. "Pragmatic Neural Language Modelling in Machine Translation"

Adrià de Gispert, Gonzalo Iglesias, Bill Byrne. "Fast and Accurate Preordering for SMT using Neural Networks"

*** Continuous Models for Statistical Machine Translation

Frédéric Blain, Fethi Bougares, Amir Hazem, Loïc Barrault, Holger Schwenk. "Continuous Adaptation to User Feedback for Statistical Machine Translation"

Kai Zhao, Hany Hassan, Michael Auli. "Learning Translation Models from Monolingual Continuous Representations"

*** Multi-language Translation

Raj Dabre, Fabien Cromieres, Sadao Kurohashi, Pushpak Bhattacharyya. "Leveraging Small Multilingual Corpora for SMT Using Many Pivot Languages"

*** Video to Text Translation

Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko. "Translating Videos to Natural Language Using Deep Recurrent Neural Networks"

*** Others

Jonathan H. Clark, Chris Dyer, Alon Lavie. "Locally Non-Linear Learning for Statistical Machine Translation via Discretization and Structured Regularization"

Graham Neubig, Philip Arthur, Kevin Duh. "Multi-Target Machine Translation with Multi-Synchronous Context-free Grammars"

Aurelien Waite and Bill Byrne. "The Geometry of Statistical Machine Translation"

Other papers are also worth reading:

*** News Processing

Areej Alhothali and Jesse Hoey. "Good News or Bad News: Using Affect Control Theory to Analyze Readers' Reaction Towards News Articles"

Tuesday, 7 July 2015

Python wrapper for online translators

If you want to use Google Translate and Microsoft Bing Translate for free, you may consider the following Python-based wrappers:

+ Code: Google Translate; Bing Translate

+ Samples:

# Google Translate
import googletrans
gs = googletrans.Googletrans()

languages = gs.get_languages()
print(languages['en'])

print(gs.translate('hello', 'de'))
print(gs.translate('hello', 'zh'))
print(gs.translate('hello', 'vi'))

print(gs.detect('some English words'))

# Bing Translate
from mstranslator import Translator

translator = Translator('cdvhoang', 'HlUUMftdkETWa8E9/jzD4l1CzC8sOhRSJxH+kk0MDBg=')

print(translator.translate('hello', lang_from='en', lang_to='vi'))

*** Please note that I don't encourage using the wrapper for Google Translate: you should respect the service and pay for it (it is now a commercial service ^_^).

Saturday, 4 July 2015

ACL 2015 papers

Link: http://acl2015.org/accepted_papers.html

Here is my subjective list of remarkable papers relating to MT research:

*** Conventional Statistical Machine Translation

A CONTEXT-AWARE TOPIC MODEL FOR STATISTICAL MACHINE TRANSLATION
Jinsong Su, Deyi Xiong, Yang Liu, Xianpei Han, Hongyu Lin and Junfeng Yao

NON-LINEAR LEARNING FOR STATISTICAL MACHINE TRANSLATION
Shujian Huang, Huadong Chen, Xinyu Dai and Jiajun Chen

MULTI-TASK LEARNING FOR MULTIPLE LANGUAGE TRANSLATION
Daxiang Dong, Hua Wu, Wei He, Dianhai Yu and Haifeng Wang

WHAT’S IN A DOMAIN? ANALYZING GENRE AND TOPIC DIFFERENCES IN STATISTICAL MACHINE TRANSLATION
Marlies van der Wees, Arianna Bisazza, Wouter Weerkamp and Christof Monz

*** Neural Machine Translation

ADDRESSING THE RARE WORD PROBLEM IN NEURAL MACHINE TRANSLATION
Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals and Wojciech Zaremba

ENCODING SOURCE LANGUAGE WITH CONVOLUTIONAL NEURAL NETWORK FOR MACHINE TRANSLATION
Fandong Meng, Zhengdong Lu, Mingxuan Wang, Hang Li, Wenbin Jiang and Qun Liu

IMPROVED NEURAL NETWORK FEATURES, ARCHITECTURE AND LEARNING FOR STATISTICAL MACHINE TRANSLATION
Hendra Setiawan, Zhongqiang Huang, Jacob Devlin, Thomas Lamar and Rabih Zbib

NON-PROJECTIVE DEPENDENCY-BASED PRE-REORDERING WITH RECURRENT NEURAL NETWORK FOR MACHINE TRANSLATION
Antonio Valerio Miceli Barone

ON USING VERY LARGE TARGET VOCABULARY FOR NEURAL MACHINE TRANSLATION
Sebastien Jean, Kyunghyun Cho, Roland Memisevic and Yoshua Bengio

CONTEXT-DEPENDENT TRANSLATION SELECTION USING CONVOLUTIONAL NEURAL NETWORK
Baotian Hu, Zhaopeng Tu, Zhengdong Lu and Hang Li

*** Machine Translation Evaluation and Quality Estimation

ONLINE MULTITASK LEARNING FOR MACHINE TRANSLATION QUALITY ESTIMATION
José G. C. de Souza, Matteo Negri, Marco Turchi and Elisa Ricci

PAIRWISE NEURAL MACHINE TRANSLATION EVALUATION
Francisco Guzmán, Shafiq Joty, Lluís Màrquez and Preslav Nakov

EVALUATING MACHINE TRANSLATION SYSTEMS WITH SECOND LANGUAGE PROFICIENCY TESTS
Takuya Matsuzaki, Akira Fujita, Naoya Todo and Noriko H. Arai

Some notes:

*** According to my observation, there are some research trends depending on data characteristics:
- very big data
- heterogeneous data
- multi-lingual data

*** And of course, deep learning research is still very hot.

Thursday, 25 June 2015

Torch vs. Theano vs. Caffe

Link: http://fastml.com/torch-vs-theano/

Here is my summary:

- Torch and Theano are better suited to deep learning (DL) research, whereas Caffe is geared more towards DL application development.

- Torch and Theano are comparable in speed and performance across different benchmarks. Hence, the choice between them comes down to which one you find easier to use.

(to be updated)

Monday, 25 May 2015

Jekyll

Links:
http://karpathy.github.io/2014/07/01/switching-to-jekyll/
http://jekyllrb.com/docs/home/
Intro: Jekyll transforms your plain text into static websites and blogs.

Sunday, 24 May 2015

Andrej Karpathy's blog

Link 1: http://karpathy.github.io/ (Neural Network's basics)
Link 2: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ (Recurrent NN's view)
Intro: A very useful blog from a very good PhD student at Stanford University.

Brat

Link: http://brat.nlplab.org/introduction.html

Intro: brat is a web-based tool for text annotation; that is, for adding notes to existing text documents. brat is designed in particular for structured annotation, where the notes are not free-form text but have a fixed form that can be automatically processed and interpreted by a computer.

Monday, 11 May 2015

Wikipedia and LM

Link: http://trulymadlywordly.blogspot.sg/2011/03/creating-text-corpus-from-wikipedia.html
Intro: How to leverage the Wikipedia repository to create a huge amount of language model (LM) training data.

Sunday, 19 April 2015

C++11

http://www.stroustrup.com/what-is-2009.pdf

http://stackoverflow.com/questions/8851670/relevant-boost-features-vs-c11

http://blog.smartbear.com/c-plus-plus/the-biggest-changes-in-c11-and-why-you-should-care/

https://marcoarena.wordpress.com/2012/02/05/modern-cpp-cookbooks/

It seems that I really don't need the Boost library ^^.

Thursday, 16 April 2015

Rapid conversion table

Link: http://www.rapidtables.com/

Thursday, 12 March 2015

git basics guide

Link: http://rogerdudler.github.io/git-guide/

Thursday, 5 March 2015

Vowpal Wabbit

Link: https://github.com/moses-smt/vowpal_wabbit

Intro: The Vowpal Wabbit (VW) project is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research. Support is available through the mailing list.

There are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. This project is about approach (b), and it's reached a state where it may be useful to others as a platform for research and experimentation.
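To illustrate what "out-of-core" online learning means in practice, here is a minimal sketch of one-pass logistic regression with hashed features. This is not VW's implementation or API: the function names, the table size, the learning rate and the toy data are all assumptions chosen for illustration. The point is only that the learner reads one example at a time and keeps nothing in memory except a fixed-size weight table.

```python
import math
import zlib

NUM_BITS = 18                       # assumed table size: 2**18 weights
weights = [0.0] * (1 << NUM_BITS)


def feature_index(name):
    """Hash a feature name into the fixed-size weight table."""
    return zlib.crc32(name.encode("utf-8")) & ((1 << NUM_BITS) - 1)


def predict(features):
    """Probability of the positive class for a list of feature names."""
    score = sum(weights[feature_index(f)] for f in features)
    return 1.0 / (1.0 + math.exp(-score))


def learn(stream, learning_rate=0.5):
    """One pass of SGD over an iterable of (label, features); label is 0 or 1."""
    for label, features in stream:
        # Gradient of the log loss with respect to the score.
        error = predict(features) - label
        for f in features:
            weights[feature_index(f)] -= learning_rate * error


def toy_stream():
    # Stand-in for lazily reading examples from a file too large for memory.
    for _ in range(200):
        yield 1, ["good", "great"]
        yield 0, ["bad", "awful"]


learn(toy_stream())
print(predict(["good"]), predict(["bad"]))
```

Because `learn` consumes a generator, the data set never has to fit in memory; only the weight table does.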

Sunday, 1 March 2015

Training Google 1T web corpus with IRSTLM

Thanks to http://www44.atwiki.jp/keisks/pages/50.html , here is how to train an enormous LM on the Google 1T web corpus using IRSTLM:

build-sublm.pl --size 3 --ngrams "gunzip -c 3gms/*.gz" --sublm LM.000 --witten-bell
merge-sublm.pl --size 3 --sublm LM -lm g_3grams_LM.gz
compile-lm g_3grams_LM.gz g_3grams_LM.blm
(If you get the error "lt-compile-lm: lmtable.h:247: virtual double lmtable::setlogOOVpenalty(int): Assertion `dub > dict->size()' failed.", rerun compile-lm with a larger -dub value:)
compile-lm -dub=100000000 g_3grams_LM.gz g_3grams_LM.blm

I will validate it soon.

Friday, 27 February 2015

Simple batch programming on Windows

Some simple tricks for batch programming in the Windows environment:

1) For loop (refer: http://ss64.com/nt/for_l.html)
Syntax
FOR /L %%parameter IN (start,step,end) DO command

Key
start : The first number
step : The amount by which to increment the sequence
end : The last number

command : The command to carry out, including any command-line parameters.

%%parameter : A replaceable parameter: in a batch file use %%G (on the command line %G)

Example
FOR /L %%G IN (0,1,195) DO (
    echo %%G & ping 1.1.1.1 -n 1 -w 1000 >nul
)
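For readers more at home in Python, the FOR /L counting rule can be sketched as below. This is only an illustration: `for_l` is a hypothetical helper name, not part of any library. The thing to notice is that the end value is inclusive, unlike Python's built-in `range`.

```python
def for_l(start, step, end):
    """Yield the values FOR /L %%G IN (start,step,end) would assign to %%G."""
    if step == 0:
        # A zero step would never advance, so refuse it in this sketch.
        raise ValueError("step must be non-zero")
    value = start
    # The end value is inclusive, both when counting up and when counting down.
    while (value <= end) if step > 0 else (value >= end):
        yield value
        value += step


# Same sequence as the batch example: 0, 1, ..., 195.
numbers = list(for_l(0, 1, 195))
print(numbers[0], numbers[-1], len(numbers))
```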

2) ...

Thursday, 26 February 2015

SEAlang - Southeast Asian Languages Library

Link: http://sealang.net/library/

Intro: The SEAlang Library was established in 2005. It provides language reference materials for Southeast Asia; initially focused on the non-roman script languages used throughout the mainland, and now concentrating on the languages of insular SEA.

* Take note of the Vietnamese corpus (including dictionary, bitexts, ...)

Sunday, 22 February 2015

How to be a good graduate student

1) From Kevin Murphy: http://www.cs.ubc.ca/~murphyk/Teaching/guideForStudents.html

2) ...

(to be updated)

Thursday, 5 February 2015

Available Data from the CommonCrawl

Link: http://statmt.org/ngrams/
Intro: Multi-language data used for training large-scale LMs crawled from CommonCrawl.

Thursday, 29 January 2015

Puck - GPU-based natural language parser

Link: https://github.com/dlwh/puck

Intro: Puck is a high-speed, high-accuracy parser for natural languages. It's (currently) designed for use with grammars trained with the Berkeley Parser and on NVIDIA cards. On recent-ish NVIDIA cards (e.g. a GTX 680), it parses around 400 sentences a second with a full Berkeley grammar for sentences of length <= 40.

Puck is only useful if you plan on parsing a lot of sentences, on the order of a few thousand. Also, it's designed for throughput, not latency.

Thursday, 22 January 2015

TCLAP - Templatized C++ Command Line Parser Library

Link: http://tclap.sourceforge.net/

Intro: TCLAP is a small, flexible library that provides a simple interface for defining and accessing command line arguments. It was initially inspired by the user-friendly CLAP library. The difference is that this library is templatized, so the argument class is type independent. Type independence avoids identical-except-for-type objects, such as IntArg, FloatArg, and StringArg. While the library is not strictly compliant with the GNU or POSIX standards, it is close.

Thursday, 15 January 2015

OpenMPI

Link: http://www.open-mpi.org/

Intro: The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to combine the expertise, technologies, and resources from all across the High Performance Computing community in order to build the best MPI library available. Open MPI offers advantages for system and software vendors, application developers and computer science researchers.

Sunday, 11 January 2015

Machine Learning podcast

Link: http://www.thetalkingmachines.com/

Intro: Talking Machines is your window into the world of machine learning.

Friday, 9 January 2015

Word Aligners for Machine Translation

Here is an incomplete list of word aligners used for Machine Translation:

1) Unsupervised Aligners
- GIZA++
- fast_align (with cdec)
- pialign
- BerkeleyAligner

2) Supervised Aligners
- BerkeleyAligner
- NILE

(to be updated ...)

Tuesday, 6 January 2015

Multi-Task Learning toolkit

1) MALSAR
Link: http://www.public.asu.edu/~jye02/Software/MALSAR/

Intro: the MALSAR (Multi-tAsk Learning via StructurAl Regularization) package includes the following multi-task learning algorithms:

Mean-Regularized Multi-Task Learning
Multi-Task Learning with Joint Feature Selection
Robust Multi-Task Feature Learning
Trace-Norm Regularized Multi-Task Learning
Alternating Structural Optimization
Incoherent Low-Rank and Sparse Learning
Robust Low-Rank Multi-Task Learning
Clustered Multi-Task Learning
Multi-Task Learning with Graph Structures
Disease Progression Models
Incomplete Multi-Source Fusion (iMSF)
Multi-Stage Multi-Source Fusion
Multi-Task Clustering

2)
Link: http://klcl.pku.edu.cn/member/sunxu/software/MultiTask.zip
Intro: This is a general purpose software for online multi-task learning. The online multi-task learning is mainly based on Conditional Random Fields (CRF) model and Stochastic Gradient Descent (SGD) training.

I am going to explore this technique in more depth for machine translation and domain adaptation.

Monday, 5 January 2015

MTTK - Machine Translation Toolkit

Link: http://mi.eng.cam.ac.uk/~wjb31/distrib/mttkv1/

Intro: MTTK is a collection of software tools for the alignment of parallel text for use in Statistical Machine Translation. With MTTK you can ...

- Align document translation pairs at the sentence or sub-sentence level, sometimes known as chunking. This is a useful pre-processing step to prepare collections of translations for use in estimating the parameters of complex alignment models. Sub-sentence alignment in particular makes it possible to segment long sentences into shorter aligned segments that otherwise would have to be discarded.
- Train statistical models for parallel text alignment. The following models are supported:
  - IBM Model-1 and Model-2
  - Word-to-Word HMMs
  - Word-to-Phrase HMMs, with bigram translation probabilities
- Parallelize your model training procedures. If you have multiple CPUs available, you can partition your translation training texts into subsets, thus speeding up iterative parameter re-estimation procedures and reducing the amount of memory needed in training. This is done under exact EM-based parameter estimation procedures.
- Generate word-to-word and word-to-phrase alignments of parallel text. MTTK can generate Viterbi alignments of parallel text (both training text and other texts) under the supported alignment models.
- Extract word-to-word translation tables from aligned bitext and from the estimated models.
- Extract phrase-to-phrase translation tables (phrase-pair inventories) from aligned parallel text.
- Use the HMM alignment models to induce phrase translations under its statistical models. Phrase-pair induction can generate richer inventories of phrase translations than can be extracted from Viterbi alignments.
- Edit the C++ source code to implement your own estimation and alignment procedures.
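
To make the simplest of the supported models concrete, here is a minimal sketch of IBM Model-1 training with EM, followed by Viterbi alignment. It is not MTTK's code (MTTK is written in C++). The function names `train_ibm1` and `viterbi_align`, the toy bitext and the uniform initialisation are my own assumptions for illustration, and the NULL word is left out for brevity.

```python
from collections import defaultdict


def train_ibm1(bitext, iterations=10):
    """Estimate t(f|e) with EM. bitext is a list of (e_words, f_words) pairs."""
    f_vocab = {f for _, fs in bitext for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)     # t[(f, e)] approximates P(f | e)
    for _ in range(iterations):
        count = defaultdict(float)       # expected count of f aligned to e
        total = defaultdict(float)       # expected count of e being aligned
        # E-step: distribute each word f over the words e in its sentence pair.
        for es, fs in bitext:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    p = t[(f, e)] / norm
                    count[(f, e)] += p
                    total[e] += p
        # M-step: renormalise the expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def viterbi_align(es, fs, t):
    """For each word f, return the index of the word e that maximises t(f|e)."""
    return [max(range(len(es)), key=lambda i: t[(f, es[i])]) for f in fs]


bitext = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]
t = train_ibm1(bitext)
print(viterbi_align("the house".split(), "das Haus".split(), t))
```

On this toy bitext the printed alignment is [0, 1], i.e. "das" is linked to "the" and "Haus" to "house". Under Model-1 each word is aligned independently, which is why the Viterbi alignment reduces to a per-word argmax.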