Friday, 9 January 2015

Word Aligners for Machine Translation

Here is a non-exhaustive list of word aligners used in Machine Translation:

1) Unsupervised Aligners
- GIZA++
- fast_align (with cdec)
- pialign

2) Supervised Aligners
- BerkeleyAligner
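In practice, unsupervised aligners such as GIZA++ and fast_align are usually run in both translation directions and the two alignments are then symmetrized. A minimal sketch of the two simplest symmetrization heuristics, intersection and union (the alignment sets below are toy data, invented for illustration):

```python
# Sketch: symmetrizing bidirectional word alignments.
# An alignment is represented as a set of (source_idx, target_idx) pairs;
# the backward alignment is assumed to already be flipped into that order.

def symmetrize(forward, backward):
    intersection = forward & backward   # high precision, fewer links
    union = forward | backward          # high recall, more links
    return intersection, union

# Toy example (indices are made up):
fwd = {(0, 0), (1, 2), (2, 1)}
bwd = {(0, 0), (1, 2), (3, 3)}
inter, uni = symmetrize(fwd, bwd)
```

Heuristics such as grow-diag-final interpolate between these two extremes, starting from the intersection and adding neighbouring links from the union.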

(to be updated ...)

Tuesday, 6 January 2015

Multi-Task Learning toolkit

Intro: The MALSAR (Multi-tAsk Learning via StructurAl Regularization) package includes the following multi-task learning algorithms:

  • Mean-Regularized Multi-Task Learning
  • Multi-Task Learning with Joint Feature Selection
  • Robust Multi-Task Feature Learning
  • Trace-Norm Regularized Multi-Task Learning
  • Alternating Structural Optimization
  • Incoherent Low-Rank and Sparse Learning
  • Robust Low-Rank Multi-Task Learning
  • Clustered Multi-Task Learning
  • Multi-Task Learning with Graph Structures
  • Disease Progression Models
  • Incomplete Multi-Source Fusion (iMSF)
  • Multi-Stage Multi-Source Fusion
  • Multi-Task Clustering
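The first of these formulations, mean-regularized multi-task learning, trains one weight vector per task while penalizing each vector's distance from the task mean. A minimal gradient-descent sketch (least-squares losses; the data layout, rho, and learning rate are my own illustrative choices, not MALSAR code):

```python
import numpy as np

# Sketch: mean-regularized multi-task learning.
# Objective: sum_t ||X_t w_t - y_t||^2 / 2 + (rho / 2) * sum_t ||w_t - w_mean||^2
# where w_mean is the average of the task weight vectors.

def mean_regularized_mtl(Xs, ys, rho=1.0, lr=0.01, iters=500):
    T = len(Xs)
    d = Xs[0].shape[1]
    W = np.zeros((T, d))
    for _ in range(iters):
        mean_w = W.mean(axis=0)
        for t in range(T):
            grad = Xs[t].T @ (Xs[t] @ W[t] - ys[t])  # per-task squared loss
            grad += rho * (W[t] - mean_w)            # pull toward the task mean
            W[t] -= lr * grad
    return W
```

Increasing rho couples the tasks more tightly; with rho = 0 the tasks are trained independently.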

Intro: This is a general-purpose package for online multi-task learning, built mainly on the Conditional Random Fields (CRF) model with Stochastic Gradient Descent (SGD) training.
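The per-example SGD update at the heart of such online learners can be sketched with plain logistic regression (a full CRF adds structured features, but the update has the same shape; the model and learning rate here are illustrative assumptions):

```python
import math

# Sketch: online learning with SGD on logistic regression.
# Each (features, label) pair triggers one gradient step, so the model
# can be updated as examples stream in.

def sgd_online(stream, dim, lr=0.1):
    """stream yields (features, label) pairs with label in {0, 1}."""
    w = [0.0] * dim
    for x, y in stream:
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))      # predicted probability
        for i in range(dim):
            w[i] += lr * (y - p) * x[i]     # gradient step on this example
    return w
```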

I plan to explore this technique further for machine translation and domain adaptation.

Monday, 5 January 2015

MTTK - Machine Translation Toolkit

Intro: MTTK is a collection of software tools for the alignment of parallel text for use in Statistical Machine Translation. With MTTK you can ...
  • Align document translation pairs at the sentence or sub-sentence level, sometimes known as chunking. This is a useful pre-processing step to prepare collections of translations for use in estimating the parameters of complex alignment models. Sub-sentence alignment in particular makes it possible to segment long sentences into shorter aligned segments that otherwise would have to be discarded.
  • Train statistical models for parallel text alignment. The following models are supported:
      • IBM Model-1 and Model-2
      • Word-to-Word HMMs
      • Word-to-Phrase HMMs, with bigram translation probabilities
  • Parallelize your model training procedures. If you have multiple CPUs available, you can partition your translation training texts into subsets, thus speeding up iterative parameter re-estimation procedures and reducing the amount of memory needed in training. This is done under exact EM-based parameter estimation procedures.
  • Generate word-to-word and word-to-phrase alignments of parallel text. MTTK can generate Viterbi alignments of parallel text (both training text and other texts) under the supported alignment models.
  • Extract word-to-word translation tables from aligned bitext and from the estimated models.
  • Extract phrase-to-phrase translation tables (phrase-pair inventories) from aligned parallel text.
  • Use the HMM alignment models to induce phrase translations under their statistical models. Phrase-pair induction can generate richer inventories of phrase translations than can be extracted from Viterbi alignments.
  • Edit the C++ source code to implement your own estimation and alignment procedures.
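The simplest of the supported models, IBM Model 1, can be trained with a few lines of EM. This is a textbook re-implementation, not MTTK's own C++ code, and the toy bitext below is invented for illustration:

```python
from collections import defaultdict

# Sketch: EM training of IBM Model 1 translation probabilities.
# bitext is a list of (source_tokens, target_tokens) sentence pairs;
# a NULL token is prepended to the source side, as in the IBM models.

def train_ibm_model1(bitext, iterations=10):
    """Returns t[(f, e)] = P(target word f | source word e)."""
    t = defaultdict(lambda: 1.0)   # uniform start; E-step normalizes
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in bitext:
            src = ["NULL"] + src
            for f in tgt:
                z = sum(t[(f, e)] for e in src)   # normalization constant
                for e in src:
                    c = t[(f, e)] / z             # expected fractional count
                    count[(f, e)] += c
                    total[e] += c
        # M-step: re-estimate translation probabilities from the counts.
        t = defaultdict(float,
                        {(f, e): count[(f, e)] / total[e] for (f, e) in count})
    return t
```

On a toy German-English bitext such as ("das Haus", "the house"), ("das Buch", "the book"), ("ein Buch", "a book"), a few EM iterations concentrate probability mass on the correct word pairs.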