Monday, 5 November 2012


Intro: MultEval takes machine translation hypotheses from several runs of an optimizer and provides three popular metric scores, as well as standard deviations (via bootstrap resampling) and p-values (via approximate randomization). This allows researchers to mitigate some of the risk of using unstable optimizers such as MERT, MIRA, and MCMC. It is intended to help in evaluating the impact of in-house experimental variations on translation quality; it is currently not set up for bake-off style comparisons (bake-offs can't require multiple optimizer runs or a standard tokenization).
Related (Code for Statistical Significance Testing for MT Evaluation Metrics)
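The two statistics MultEval reports can be sketched in a few lines of Python. This is an illustrative toy using a generic per-sentence score in place of BLEU/METEOR/TER, not MultEval's actual implementation:

```python
import random

def bootstrap_stddev(scores, samples=1000, seed=0):
    """Estimate the standard deviation of a corpus-level metric
    by resampling sentence-level scores with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(samples):
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    mean = sum(means) / samples
    var = sum((m - mean) ** 2 for m in means) / samples
    return var ** 0.5

def approximate_randomization(scores_a, scores_b, trials=1000, seed=0):
    """p-value for the difference between two systems' mean scores:
    randomly swap paired sentence scores and count how often the
    shuffled difference is at least as large as the observed one."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(trials):
        a, b = 0.0, 0.0
        for x, y in zip(scores_a, scores_b):
            if rng.random() < 0.5:   # swap this pair at random
                x, y = y, x
            a += x
            b += y
        if abs(a - b) / n >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)
```

For identical score lists the randomization p-value comes out at 1.0; for clearly separated systems it drops toward zero.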

Saturday, 3 November 2012


Intro: The MacPorts Project is an open-source community initiative to design an easy-to-use system for compiling, installing, and upgrading command-line, X11, or Aqua based open-source software on the Mac OS X operating system. To that end, it provides the command-line driven MacPorts software package under a BSD License, and through it easy access to thousands of ports that greatly simplify the task of compiling and installing open-source software on your Mac.

Tuesday, 16 October 2012


Intro: The goal of TalkBank is to foster fundamental research in the study of human and animal communication. It will construct sample databases within each of the subfields studying communication. It will use these databases to advance the development of standards and tools for creating, sharing, searching, and commenting upon primary materials via networked computers.

Tuesday, 2 October 2012



OpenMobster - Open Source Mobile Enterprise Backend


  • OpenMobster is an open-source Enterprise Backend for Mobile Apps, or 
  • OpenMobster is an open-source Mobile Backend as a Service that can be deployed privately (on-premise) within your Enterprise, or
  • OpenMobster is an open-source MEAP (Mobile Enterprise Application Platform).

Open-source implementation of Boostexter

Intro: Boosting is a meta-learning approach that aims at combining an ensemble of weak classifiers to form a strong classifier. Adaptive Boosting (AdaBoost) greedily searches for a linear combination of classifiers by overweighting the examples that are misclassified by each classifier. icsiboost implements AdaBoost over stumps (one-level decision trees) on discrete and continuous attributes (words and real values).
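As a rough sketch of the idea (not icsiboost's actual implementation), here is AdaBoost over stumps on a single continuous attribute; labels are ±1 and misclassified examples are overweighted each round:

```python
import math

def train_adaboost(xs, ys, rounds=10):
    """xs: feature values, ys: labels in {-1, +1}.
    Returns a list of (threshold, polarity, alpha) stumps."""
    n = len(xs)
    w = [1.0 / n] * n                     # example weights
    stumps = []
    for _ in range(rounds):
        # pick the stump (threshold, polarity) with lowest weighted error
        best = None
        for t in sorted(set(xs)):
            for pol in (+1, -1):
                err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                          if (pol if xi >= t else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        stumps.append((t, pol, alpha))
        # overweight the misclassified examples, then renormalize
        w = [wi * math.exp(-alpha * yi * (pol if xi >= t else -pol))
             for xi, yi, wi in zip(xs, ys, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return stumps

def predict(stumps, x):
    """Sign of the alpha-weighted vote of all stumps."""
    s = sum(alpha * (pol if x >= t else -pol) for t, pol, alpha in stumps)
    return 1 if s >= 0 else -1
```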

Thursday, 27 September 2012

TurboParser - Dependency Parser with Linear Programming

Intro: TurboParser is a free C++ implementation of a multilingual non-projective dependency parser based on linear programming relaxations.

Sunday, 23 September 2012

ICU - International Components for Unicode

Intro: ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.

Here are a few highlights of the services provided by ICU:

Code Page Conversion: Convert text data to or from Unicode and nearly any other character set or encoding. ICU's conversion tables are based on charset data collected by IBM over the course of many decades, and are the most complete available anywhere.

Collation: Compare strings according to the conventions and standards of a particular language, region or country. ICU's collation is based on the Unicode Collation Algorithm plus locale-specific comparison rules from the Common Locale Data Repository, a comprehensive source for this type of data.

Formatting: Format numbers, dates, times and currency amounts according to the conventions of a chosen locale. This includes translating month and day names into the selected language, choosing appropriate abbreviations, ordering fields correctly, etc. This data also comes from the Common Locale Data Repository.

Time Calculations: Multiple types of calendars are provided beyond the traditional Gregorian calendar. A thorough set of timezone calculation APIs are provided.

Unicode Support: ICU closely tracks the Unicode standard, providing easy access to all of the many Unicode character properties, Unicode Normalization, Case Folding and other fundamental operations as specified by the Unicode Standard.

Regular Expression: ICU's regular expressions fully support Unicode while providing very competitive performance.

Bidi: Support for handling text containing a mixture of left-to-right (English) and right-to-left (Arabic or Hebrew) data.

Text Boundaries: Locate the positions of words, sentences, paragraphs within a range of text, or identify locations that would be suitable for line wrapping when displaying the text.
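ICU itself is a C/C++ and Java library, but a couple of these services have small standard-library analogues in Python (unicodedata, str.casefold) that illustrate what Unicode normalization and case folding are for:

```python
import unicodedata

# "café" written two ways: precomposed é vs. "e" + combining acute accent.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# The code-point sequences differ, but NFC normalization makes them
# identical -- essential before comparing or hashing "equal" strings.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# Character properties, another service ICU exposes:
name = unicodedata.name("\u00e9")     # LATIN SMALL LETTER E WITH ACUTE

# Case folding for caseless comparison (German ß folds to "ss"):
folded_equal = "STRASSE".casefold() == "straße".casefold()
```

A real application needing collation, locale-aware formatting, or Bidi would still reach for ICU itself (e.g. via PyICU); the stdlib covers only this small subset.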

Deep learning

I have just found that this research topic is quite new.
I intend to dig deeper into it, especially its impact on NLP research.

Some review papers and tutorials:


2) ACL 2012 tutorial:

3) Ronan Collobert (

4) ...

Wednesday, 12 September 2012

How to solve "incorrect side-by-side configuration" error when running VC++ app

When I compiled an app using Visual Studio 2005 Pro on my first PC and ran it on my second PC, I got the following error:

"The application has failed to start because its side-by-side configuration is incorrect ... "

I found the reason: the side-by-side configuration on my first PC differed from the one on my second PC.

How to fix that:
- First, check the side-by-side configuration in the registry (type regedit at the command prompt) and navigate to the following keys:



Alternatively, you can check this using the built-in Windows tool Event Viewer (under Control Panel\Administrative Tools). In Event Viewer, check the errors under Windows Logs\Application and find the one corresponding to your app.

Next, check the Default values of the above keys. They should be identical and hold the highest (most recent) version numbers. Importantly, these values MUST be the same on both PCs.

- Second, try to install the correct side-by-side configuration by downloading and installing the latest update from the Microsoft website.
For example, the side-by-side configuration of the first key on my first PC is 8.0.50727.6195, so I needed the update whose version matches 8.0.50727.6195.

That solved the problem. If you run into the same error, you can follow these steps.

Hope it helps!


Web translation service


Tuesday, 11 September 2012

NiuTrans: A Statistical Machine Translation System

Intro: NiuTrans is an open-source statistical machine translation system developed by the Natural Language Processing Group at Northeastern University, China. The NiuTrans system is developed entirely in C++, so it runs fast and uses little memory. It currently supports phrase-based, hierarchical phrase-based, and syntax-based (string-to-tree, tree-to-string and tree-to-tree) models for research-oriented studies.

Wednesday, 5 September 2012

Docent - Document-level SMT

Intro: Docent is a decoder for phrase-based Statistical Machine Translation (SMT). Unlike most existing SMT decoders, it treats complete documents, rather than single sentences, as translation units and permits the inclusion of features with cross-sentence dependencies to facilitate the development of discourse-level models for SMT. Docent implements the local search decoding approach described by Hardmeier et al. (EMNLP 2012).

Wednesday, 29 August 2012

Fangorn: a system for querying very large treebanks

Intro: Fangorn is an open source tool for querying very large treebanks, built on top of Apache Lucene.  Fangorn implements the LPath linguistic path language, which has an XPath-like syntax along with linguistically motivated extensions.  Result trees are annotated with the query in order to show how the query matched the tree, and these annotations can themselves be modified and submitted as further queries.

Tuesday, 28 August 2012

Intel® Threading Building Blocks (Intel® TBB)

Intro: Intel® Threading Building Blocks (Intel® TBB) offers a rich and complete approach to expressing parallelism in a C++ program. It is a library that helps you take advantage of multi-core processor performance without having to be a threading expert. Intel TBB is not just a threads-replacement library. It represents a higher-level, task-based parallelism that abstracts platform details and threading mechanisms for scalability and performance. 
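TBB is a C++ library; as a loose conceptual analogue, the sketch below (Python, concurrent.futures) shows the same shift from managing threads by hand to expressing work as tasks handed to a scheduler:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map_reduce(func, items, reduce_op, initial, workers=4):
    """Apply func to items in parallel (the 'parallel_for' part),
    then fold the results together (the 'reduce' part). The pool,
    not the caller, decides how work maps onto threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(func, items)
        acc = initial
        for r in results:
            acc = reduce_op(acc, r)
        return acc

# Sum of squares of 0..9, computed task-by-task.
total = parallel_map_reduce(lambda x: x * x, range(10),
                            lambda a, b: a + b, 0)
```

Note that Python threads share the GIL, so this buys concurrency rather than C++-style parallel speedups for pure-Python arithmetic; it is the task-based pattern, not the performance, that carries over.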

Monday, 27 August 2012


Intro: The ACCURAT project ( is pleased to announce the release of ACCURAT Toolkit - a collection of tools for comparable corpora collection and multi-level alignment and information extraction from comparable corpora. By using the ACCURAT Toolkit, users may obtain:
- Comparable corpora from the Web (current news corpora, filtered Wikipedia corpora, and narrow domain focussed corpora);
- Comparable document alignments;
- Semi-parallel sentence/phrase mapping from comparable corpora (for SMT training purposes or other tasks);
- Translated terminology extracted and mapped from bilingual comparable corpora;
- Translated named entities extracted and mapped from bilingual comparable corpora.

Thursday, 23 August 2012

OpenFst Library

Intro: OpenFst is a library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). Weighted finite-state transducers are automata where each transition has an input label, an output label, and a weight. The more familiar finite-state acceptor is represented as a transducer with each transition's input and output label equal. Finite-state acceptors are used to represent sets of strings (specifically, regular or rational sets); finite-state transducers are used to represent binary relations between pairs of strings (specifically, rational transductions). The weights can be used to represent the cost of taking a particular transition.
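A toy illustration of the data model (not OpenFst's API): a deterministic weighted transducer as a transition table in the tropical semiring, where a path's weight is the sum of its transition costs:

```python
# transitions: (state, input_label) -> (next_state, output_label, weight)
fst = {
    (0, "a"): (1, "x", 0.5),
    (1, "b"): (2, "y", 1.0),
    (1, "c"): (2, "z", 0.25),
}
final_states = {2}

def transduce(fst, finals, inputs, start=0):
    """Run a deterministic WFST: map input labels to output labels,
    accumulating the path weight; returns (outputs, weight), or None
    if the input is rejected."""
    state, outputs, weight = start, [], 0.0
    for label in inputs:
        arc = fst.get((state, label))
        if arc is None:
            return None                  # no matching transition
        state, out, w = arc
        outputs.append(out)
        weight += w
    return (outputs, weight) if state in finals else None
```

A real FST library also handles nondeterminism, epsilon labels, composition, and shortest-path search, which is exactly what OpenFst provides.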

Wednesday, 15 August 2012

Champollion Tool Kit - Text Sentence Aligner

Intro: Built around LDC's champollion sentence aligner kernel, Champollion Tool Kit (CTK) aims to provide ready-to-use parallel text sentence alignment tools for as many language pairs as possible.
Champollion depends heavily on lexical information, but uses sentence length information as well. A translation lexicon is required. Past experiments indicate that champollion's performance improves as the translation lexicon becomes larger.
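The length-based component can be sketched as a small dynamic program in the Gale-Church spirit (Champollion's real scoring is primarily lexical, and the skip penalty below is an arbitrary illustrative constant):

```python
def align_by_length(src_lens, tgt_lens, skip_penalty=10.0):
    """Align two sentence-length sequences with 1-1, 1-0 and 0-1 moves;
    a 1-1 pair is scored by how far its length ratio deviates from 1.
    Returns the list of matched (src_index, tgt_index) pairs."""
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:          # 1-1 match
                c = (abs(src_lens[i] - tgt_lens[j])
                     / max(src_lens[i], tgt_lens[j]))
                if cost[i][j] + c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1] = cost[i][j] + c
                    back[i + 1][j + 1] = (i, j)
            if i < n and cost[i][j] + skip_penalty < cost[i + 1][j]:
                cost[i + 1][j] = cost[i][j] + skip_penalty   # 1-0 skip
                back[i + 1][j] = (i, j)
            if j < m and cost[i][j] + skip_penalty < cost[i][j + 1]:
                cost[i][j + 1] = cost[i][j] + skip_penalty   # 0-1 skip
                back[i][j + 1] = (i, j)
    pairs, i, j = [], n, m               # trace back the best path
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((pi, pj))
        i, j = pi, pj
    return list(reversed(pairs))
```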

Monday, 30 July 2012

PML Tree Query

Intro: PML-TQ is a powerful open-source search tool for all kinds of linguistically annotated treebanks, with several client interfaces and two search backends (one based on an SQL database and one based on Perl and the TrEd toolkit). The tool works natively with treebanks encoded in the PML data format (conversion scripts are available for many established treebank formats).

Friday, 27 July 2012

PET - Post-Editing Translation Tool

Intro: PET is a stand-alone, open-source (under LGPL) tool written in Java that should help you post-edit and assess machine or human translations while gathering detailed statistics about post-editing time amongst other effort indicators.

Tuesday, 24 July 2012

Subtitle Translation

Subtitle corpus (more)

*** The Google Translate API is no longer freely available. Can we use state-of-the-art SMT techniques to build a subtitle SMT system ourselves? What are the challenges?

Monday, 23 July 2012


Intro: XML-RPC is a remote procedure call (RPC) protocol which uses XML to encode its calls and HTTP as a transport mechanism. "XML-RPC" also refers generically to the use of XML for remote procedure calls, independently of the specific protocol.
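Python's standard library ships with both sides of the protocol, which makes a minimal round trip easy to demonstrate:

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Start a tiny XML-RPC server on a free local port, in a background thread.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lambda a, b: a + b, "add")
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Call it: the method name and arguments travel as XML over HTTP POST.
client = ServerProxy(f"http://127.0.0.1:{port}/")
result = client.add(2, 3)
server.shutdown()
```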

C++ Tool: (tested)


Intro: OpenStreetMap is a free worldwide map, created by people like you.

Monday, 16 July 2012

Very large-scale corpus (COCA)

COCA (Corpus of Contemporary American English)

Intro: The Corpus of Contemporary American English (COCA) is the largest freely-available corpus of English, and the only large and balanced corpus of American English. The corpus was created by Mark Davies of Brigham Young University, and it is used by tens of thousands of users every month (linguists, teachers, translators, and other researchers). COCA is also related to other large corpora that we have created.
The corpus contains more than 450 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words each year from 1990-2012 and the corpus is also updated regularly (the most recent texts are from Summer 2012). Because of its design, it is perhaps the only corpus of English that is suitable for looking at current, ongoing changes in the language (see the 2011 article in Literary and Linguistic Computing).

Ngram corpus from COCA:
1) COCA Ngrams:
Link: see this post.

2) COHA Ngrams:
Intro: The Corpus of Historical American English (COHA) contains 400 million words of text from 1810-2009, and all of the n-grams from the corpus can be freely downloaded. They contain all n-grams that occur at least three times total in the corpus, and you can see the frequency of each of these n-grams in each decade from the 1810s-2000s. This data can be used offline to carry out powerful searches on a wide range of phenomena in the history of American English.

My thoughts:
- I have been developing a language-generic n-gram-based spell checking tool, so this n-gram corpus will be very beneficial.
- Other tasks in English NLP may also need this corpus.
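A toy sketch of what n-gram-based checking looks like; the tiny hand-made corpus here stands in for COCA-style frequency lists:

```python
from collections import Counter

# Stand-in corpus; real use would load n-gram counts from COCA/COHA files.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def context_score(prev_word, word):
    """Relative frequency of `word` following `prev_word`;
    a score of 0 flags an unattested (suspicious) bigram."""
    if unigrams[prev_word] == 0:
        return 0.0
    return bigrams[(prev_word, word)] / unigrams[prev_word]

ok = context_score("the", "cat")    # attested bigram, positive score
bad = context_score("the", "sat")   # unattested: looks like an error
```

A real checker would smooth these counts and combine scores across all n-gram windows covering a word, but the flagging principle is the same.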

Thursday, 12 July 2012

C&C semantic tools

CCG (Combinatory Categorial Grammar) Parser:

Intro: Boxer is developed by Johan Bos and generates semantic representations. It takes as input CCG (Combinatory Categorial Grammar) derivations and produces DRSs (Discourse Representation Structures, from Hans Kamp's Discourse Representation Theory) as output. It is distributed with the C&C tools.


Intro: A tool for automatically producing RDF/OWL ontologies and linked data from natural language sentences, currently limited to English.

Wednesday, 11 July 2012

Interesting Papers at EMNLP 2012

1) Tenses in SMT

D12-1026: Zhengxian Gong; Min Zhang; Chew Lim Tan; Guodong Zhou
N-gram-based Tense Models for Statistical Machine Translation


2) Forced derivation trees
D12-1041: Nan Duan; Mu Li; Ming Zhou
Forced Derivation Tree based Model Training to Statistical Machine Translation

3) ...

ESAXX - suffix array tool

Intro: esaxx is a C++ template library for building an enhanced suffix array, which is useful for various string algorithms. For an input text of length N, esaxx builds a suffix tree in linear time using almost 20N bytes of working space (alphabet size independent).
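For contrast with esaxx's linear-time construction, here is the naive way to build and query a plain suffix array (O(n² log n) to build, but the resulting structure is the same):

```python
def suffix_array(text):
    """Indices of all suffixes of `text` in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text, sa, pattern):
    """Count occurrences of `pattern` by binary search over the array:
    all suffixes starting with `pattern` form one contiguous block."""
    def lower(p):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if text[sa[mid]:sa[mid] + len(p)] < p:
                lo = mid + 1
            else:
                hi = mid
        return lo
    # "\uffff" acts as a sentinel larger than any character in the text
    return lower(pattern + "\uffff") - lower(pattern)
```

An *enhanced* suffix array like esaxx's additionally stores LCP and child information, which is what lets it emulate a suffix tree.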

Interesting Papers at ACL 2012

1) New NLP topic: automatic document dating
P12-1011: Nathanael Chambers
Labeling Documents with Timestamps: Learning from their Time Expressions

2) Distortion modeling
P12-1050: Arianna Bisazza; Marcello Federico
Modified Distortion Matrices for Phrase-Based Statistical Machine Translation

3) ...

Sunday, 8 July 2012

Saffron - Extracting the Valuable Threads of Expertise


Intro: Saffron provides insights into a research community or organization by analysing its main topics of investigation and the experts associated with these topics.
Saffron analysis is fully automatic and is based on text mining and linked data principles.
This instance of Saffron analyzes the research community in Natural Language Processing based on the proceedings of the conferences organized by the Association for Computational Linguistics (ACL).

Saturday, 23 June 2012


jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages. It is designed to preserve mainly text containing full sentences and it is therefore well suited for creating linguistic resources such as Web corpora.

Friday, 1 June 2012

WIT3 - Web Inventory of Transcribed and Translated Talks


Intro: WIT3 - acronym for Web Inventory of Transcribed and Translated Talks - is a ready-to-use version for research purposes of the multilingual transcriptions of TED talks. 
Since 2007, the TED Conference has been posting on its website all video recordings of its talks, English subtitles and their translations in more than 80 languages. In order to make this collection of talks more effectively usable by the research community, the original textual contents are redistributed here, together with MT benchmarks and processing tools.

Tuesday, 29 May 2012

Layout-Aware Text Extraction from Full-text PDF of Scientific Articles

Description: The Portable Document Format (PDF) is the almost universally used file format for online scientific publications. It is also notoriously difficult to read and handle computationally, presenting challenges for developers of biomedical text mining or biocuration informatics systems that use the published literature as an information source. To facilitate the effective use of scientific literature in such systems we introduce Layout-Aware PDF Text Extraction (LA-PDFText). The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) Detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) Classifying text blocks into rhetorical categories using a rule-based method and (3) Stitching classified text blocks together in the correct order resulting in the extraction of text from section-wise grouped blocks.

Monday, 9 April 2012

The advisor

Intro: Researchers typically rely on manual methods to discover research of interest, such as keyword-based search on a search engine, browsing the publication lists of known experts, or reading the references of interesting papers. These techniques are time-consuming and only allow researchers to reach a limited set of documents.

Monday, 2 April 2012

NRL: The Natural Rule Language

Intro: The Natural Rule Language is a model-driven language aimed at improving quality and time to market in integration projects. It enables users to constrain, modify and map data in diverse formats. NRL works at a high level, and is designed for automatic translation to execution languages.
NRL's main remit is to provide a user-friendly alternative to languages like OCL, XSLT, XPath, Schematron, and many others, particularly in scenarios where they would be considered too technical.

Wednesday, 28 March 2012


Intro: The OpenCalais Web Service automatically creates rich semantic metadata for the content you submit – in well under a second. Using natural language processing (NLP), machine learning and other methods, Calais analyzes your document and finds the entities within it. But, Calais goes well beyond classic entity identification and returns the facts and events hidden within your text as well.

Downloading full CiteSeerX data

Just saw this link and found it very interesting.


I copy it here for backup (in case the original link dies).

Steps for downloading the full dataset from CiteSeerX:
  1. Download and extract the "Demo" from
  2. Go to the directory of the extracted files and type the following command to download the full CiteSeerX dataset to the file "citeseerx_alldata.xml" (the ";" classpath separator is for Windows; on Linux/Mac OS X use ":"):
    java -classpath .;oaiharvester.jar;xerces.jar org.acme.oai.OAIReaderRawDump -o citeseerx_alldata.xml

Thanks to the author for that.


Tuesday, 20 March 2012

Language Technology related Companies

Here, I will collect information about companies involved in developing and using Language Technology (LT). I would like to see how much potential LT has in industry.

1) Bimaple

2) Vietgle (Lac Viet Company)
3) ViLangTek

Monday, 19 March 2012

Parallel Text Mining for SMT

Problem: given a relatively large collection of parallel texts and a state-of-the-art SMT system, how can we incrementally and automatically mine the parallel texts available on the Web? The newly added texts should be guaranteed to improve the current SMT system.

Papers related:
1) Large Scale Parallel Document Mining for Machine Translation. COLING 2010. Link.
2) TBA

Crowdsourcing for NLP annotations

My new journal article

Crowd-sourcing has emerged as a new method for obtaining annotations for training models for machine learning. While many variants of this process exist, they largely differ in their methods of motivating subjects to contribute and the scale of their applications. To date, there has yet to be a study that helps the practitioner to decide what form an annotation application should take to best reach its objectives within the constraints of a project. To fill this gap, we provide a faceted analysis of crowdsourcing from a practitioner’s perspective, and show how our facets apply to existing published crowdsourced annotation applications. We then summarize how the major crowdsourcing genres fill different parts of this multi-dimensional space, which leads to our recommendations on the potential opportunities crowdsourcing offers to future annotation efforts. 


WebAnnotator is a new tool for annotating Web pages implemented at LIMSI. Giving it a try will take you no more than 10 minutes.

WebAnnotator is implemented as a Firefox extension, allowing annotation of both online and offline pages. The HTML rendering is fully preserved and all annotations consist of new HTML spans with specific styles. 
WebAnnotator provides an easy and general-purpose framework and is made available under CeCILL free license (close to GNU GPL), so that use and further contributions are made simple.

WebAnnotator can be downloaded on the official Mozilla web page:

All parts of an HTML document can be annotated: text, images, videos, tables, menus, etc. The annotations are created by simply selecting a part of the document and clicking on the relevant type and subtypes. The annotated elements are then highlighted in a specific color. Annotation schemas can be defined by the user by creating a simple DTD representing the types and subtypes that must be highlighted. Finally, annotations can be saved (HTML with highlighted parts of documents) or exported (in a machine-readable format).

WebAnnotator will be presented at LREC conference in May 2012.

EVBCorpus - English-Vietnamese Bilingual Corpus

Monday, 20 February 2012

Topic Hierarchy Generation

This post continues the problem of topic summarization posted earlier. Here I try to collect research articles related to the problem of topic hierarchy generation which is an important step for topic summarization.

1) Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes (link). Journal of Machine Learning Research 2011.

2) A Practical Web-based Approach to Generating Topic Hierarchy for Text Segments (link). CIKM 2004.

3) Finding Topic Words for Hierarchical Summarization (link). SIGIR 2001.

4) The Nested Chinese Restaurant Process and Bayesian Non-parametric Inference of Topic Hierarchies (link). Journal of the ACM 2010.

5) Mining bilingual topic hierarchies from unaligned text (link). IJCNLP 2011.

6) Domain-Assisted Product Aspect Hierarchy Generation: Towards Hierarchical Organization of Unstructured Consumer Reviews (link). ACL 2011.

7) (TBA)

(to be updated).

Sunday, 19 February 2012

Albatross Toolkit

Albatross is a small and flexible Python toolkit for developing highly stateful web applications. The toolkit has been designed to take a lot of the pain out of constructing intranet applications although you can also use Albatross for deploying publicly accessed web applications.

Brat Rapid Annotation Tool

Looks great!

Wednesday, 4 January 2012

Corpus Management Tool

AntConc & AntWordProfiler:

NoSketch Engine is an open-source project combining Manatee and Bonito into a powerful and free corpus management system. It is a limited version of the software powering the famous Sketch Engine service, a commercial variant offering word sketches, a thesaurus, keyword computation, user-friendly corpus creation and many other excellent features.

Bookmarks for Corpus-based Linguists

Tuesday, 3 January 2012

Scientific Summarization

This post aims to collect the newest research papers in the literature on scientific summarization:

1) Abstract Summarization:

2) TBA

3) Review