HOANG Cong Duy Vu's research logs: 2015-01-04

Friday, 9 January 2015

Word Aligners for Machine Translation

Here is a non-exhaustive list of word aligners used in machine translation:

1) Unsupervised Aligners
- GIZA++
- fast_align (with cdec)
- pialign
- BerkeleyAligner

2) Supervised Aligners
- BerkeleyAligner
- NILE

(to be updated ...)

Tuesday, 6 January 2015

Multi-Task Learning toolkits

1) MALSAR
Link: http://www.public.asu.edu/~jye02/Software/MALSAR/

Intro: the MALSAR (Multi-tAsk Learning via StructurAl Regularization) package includes the following multi-task learning algorithms:

- Mean-Regularized Multi-Task Learning
- Multi-Task Learning with Joint Feature Selection
- Robust Multi-Task Feature Learning
- Trace-Norm Regularized Multi-Task Learning
- Alternating Structural Optimization
- Incoherent Low-Rank and Sparse Learning
- Robust Low-Rank Multi-Task Learning
- Clustered Multi-Task Learning
- Multi-Task Learning with Graph Structures
- Disease Progression Models
- Incomplete Multi-Source Fusion (iMSF)
- Multi-Stage Multi-Source Fusion
- Multi-Task Clustering

2)
Link: http://klcl.pku.edu.cn/member/sunxu/software/MultiTask.zip
Intro: This is general-purpose software for online multi-task learning. The online multi-task learning is based mainly on the Conditional Random Fields (CRF) model and Stochastic Gradient Descent (SGD) training.

I plan to explore this technique further for machine translation and domain adaptation.

Monday, 5 January 2015

MTTK - Machine Translation Toolkit

Link: http://mi.eng.cam.ac.uk/~wjb31/distrib/mttkv1/

Intro: MTTK is a collection of software tools for the alignment of parallel text for use in Statistical Machine Translation. With MTTK you can ...

- Align document translation pairs at the sentence or sub-sentence level, sometimes known as chunking. This is a useful pre-processing step to prepare collections of translations for use in estimating the parameters of complex alignment models. Sub-sentence alignment in particular makes it possible to segment long sentences into shorter aligned segments that otherwise would have to be discarded.
- Train statistical models for parallel text alignment. The following models are supported:
  - IBM Model-1 and Model-2
  - Word-to-Word HMMs
  - Word-to-Phrase HMMs, with bigram translation probabilities
- Parallelize your model training procedures. If you have multiple CPUs available, you can partition your translation training texts into subsets, thus speeding up iterative parameter re-estimation procedures and reducing the amount of memory needed in training. This is done under exact EM-based parameter estimation procedures.
- Generate word-to-word and word-to-phrase alignments of parallel text. MTTK can generate Viterbi alignments of parallel text (both training text and other texts) under the supported alignment models.
- Extract word-to-word translation tables from aligned bitext and from the estimated models.
- Extract phrase-to-phrase translation tables (phrase-pair inventories) from aligned parallel text.
- Use the HMM alignment models to induce phrase translations under its statistical models. Phrase-pair induction can generate richer inventories of phrase translations than can be extracted from Viterbi alignments.
- Edit the C++ source code to implement your own estimation and alignment procedures.
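
Note to self: to make the training, parallelization and Viterbi alignment points above concrete, here is a minimal sketch of IBM Model-1 in Python. It is not MTTK code and does not reflect MTTK's interface. The function names (expected_counts, train_model1, viterbi_align), the toy bitext and the sharding scheme are my own assumptions, and the NULL word is left out for brevity. The E-step is run per shard and the fractional counts are summed before the M-step, which is why partitioning the training text can still be exact EM: the merged counts equal the counts from a single pass, up to floating-point rounding.

```python
from collections import defaultdict


def expected_counts(shard, t):
    """E-step on one shard: fractional counts c(f, e) and c(e) under t(f | e)."""
    pair_counts = defaultdict(float)
    cond_counts = defaultdict(float)
    for e_sent, f_sent in shard:
        for f in f_sent:
            # Posterior over which word e generated f (uniform alignment prior).
            norm = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                p = t[(f, e)] / norm
                pair_counts[(f, e)] += p
                cond_counts[e] += p
    return pair_counts, cond_counts


def train_model1(bitext, iterations=10, num_shards=1):
    """Estimate t(f | e) with EM; shards stand in for separate CPUs (assumption)."""
    t = defaultdict(lambda: 1.0)  # uniform start; only ratios matter in the E-step
    shards = [bitext[i::num_shards] for i in range(num_shards)]
    for _ in range(iterations):
        pair_total = defaultdict(float)
        cond_total = defaultdict(float)
        for shard in shards:  # each call is independent, so it could run in parallel
            pair_counts, cond_counts = expected_counts(shard, t)
            for key, value in pair_counts.items():
                pair_total[key] += value
            for key, value in cond_counts.items():
                cond_total[key] += value
        # M-step: renormalize the merged counts.
        t = {(f, e): c / cond_total[e] for (f, e), c in pair_total.items()}
    return t


def viterbi_align(e_sent, f_sent, t):
    """Most probable Model-1 alignment: each f links to its best-scoring e."""
    links = []
    for j, f in enumerate(f_sent):
        best = max(range(len(e_sent)), key=lambda i: t.get((f, e_sent[i]), 0.0))
        links.append((best, j))
    return links


if __name__ == "__main__":
    # Toy bitext (hypothetical): pairs of (conditioning side, generated side).
    toy = [
        (["the", "house"], ["das", "Haus"]),
        (["the", "book"], ["das", "Buch"]),
        (["a", "book"], ["ein", "Buch"]),
    ]
    table = train_model1(toy, iterations=10, num_shards=2)
    print(viterbi_align(["the", "house"], ["das", "Haus"], table))
```

Under Model-1 the Viterbi alignment decomposes per word, so no dynamic programming is needed; the HMM models would need a proper Viterbi pass over alignment positions instead.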
