Saturday, 16 November 2013


Intro: A perfect hash function maps a static set of n keys into a set of m integer numbers without collisions, where m is greater than or equal to n. If m is equal to n, the function is called minimal.
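
To make the definition concrete, here is a minimal sketch that brute-forces a seed until a small static key set hashes into m slots without collisions. The seeded FNV-1a hash, the names seeded_hash and find_perfect_seed, and the seed limit are all assumptions chosen for illustration; practical perfect-hashing tools use far more efficient constructions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Seeded FNV-1a hash. The hash family is an illustrative assumption,
// not something the definition above prescribes.
std::uint64_t seeded_hash(const std::string& key, std::uint64_t seed) {
    std::uint64_t h = 14695981039346656037ULL ^ seed;  // FNV offset basis mixed with the seed
    for (unsigned char c : key) {
        h ^= c;
        h *= 1099511628211ULL;  // FNV prime
    }
    return h;
}

// Returns the first seed in [0, max_seed) that sends every key to a distinct
// slot in [0, m), or -1 if no such seed is found. Keys must be distinct.
// When m == keys.size(), the resulting function is minimal.
long long find_perfect_seed(const std::vector<std::string>& keys, std::size_t m,
                            long long max_seed = 1000000) {
    assert(m > 0 && m >= keys.size());
    for (long long seed = 0; seed < max_seed; ++seed) {
        std::vector<bool> used(m, false);
        bool collision = false;
        for (const std::string& key : keys) {
            std::size_t slot = seeded_hash(key, static_cast<std::uint64_t>(seed)) % m;
            if (used[slot]) {
                collision = true;
                break;
            }
            used[slot] = true;
        }
        if (!collision) return seed;
    }
    return -1;
}
```

Calling find_perfect_seed(keys, keys.size()) searches for a minimal perfect hash; passing a larger m makes a collision-free seed easier to find at the cost of unused slots.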

Sunday, 10 November 2013


Intro: A tool to assist manual alignment in machine translation (MT), summarization, ...

Thursday, 7 November 2013

Pure for simple CSS

Intro: A set of small, responsive CSS modules that you can use in every web project.

Wednesday, 30 October 2013


Intro: A platform for paid problem solving around the world.

Monday, 7 October 2013

pialign - Phrasal ITG Aligner

Intro: pialign is a package that allows you to create a phrase table and word alignments from an unaligned parallel corpus. It is unlike other unsupervised word alignment tools in that it is able to create a phrase table using a fully statistical model, no heuristics. As a result, it is able to build phrase tables for phrase-based machine translation that achieve competitive results but are only a fraction of the size of those created with heuristic methods.

*** Note: pialign can extract a very compact phrase table directly from unaligned parallel data. This may be very helpful for SMT systems in mobile environments.

Tuesday, 24 September 2013

Tree visualization with Treebolic

Intro: Treebolic is a Java component (widget) whose purpose is to provide a hyperbolic rendering of hierarchical data.
A tree is rendered with nodes and edges but display space is subject to a particular curvature (hence the name): more space is allocated to the focus node while the parent and children, still in the immediate visual context, appear slightly smaller. The grandparents and grandchildren are still visible but come out even smaller. As we move away from the focus node, less display space is allotted to the nodes, which gradually disappear towards the disk's border, as though the whole hierarchy were seen through a fisheye lens.
Wrapped as a Java applet, the Treebolic widget can be embedded in a web page. Nodes may then contain hypertext links and direct the browser to other web pages.
The tree is dynamic (animation brings the focus node to the center) and responds to user interaction.

It looks good. See the Vietnamese WordNet for an illustrative example.

Vietnamese Wordnet

Intro: Vietnamese WordNet uses the WNMS tools developed by AsianWordNet to create and share WordNets among Asian languages, based on WordNet® version 3.0. It is a co-operation between TCL and -- Vietnam --, established in October 2007.

It's sad that it has not been developed by Vietnamese people :(.

Sunday, 22 September 2013


Intro: Statistical Machine Translation relies on parallel corpora for training translation models. However these corpora are limited and take time to create. Yalign is designed to automate this process by finding sentences that are close translation matches from comparable corpora. This opens up avenues for harvesting parallel corpora from sources like translated documents and the web.

Tuesday, 27 August 2013

Monday, 26 August 2013

Tuesday, 20 August 2013

Java-based Wiktionary Library (JWKTL)

Intro: JWKTL (Java-based Wiktionary Library) is an application programming interface for the free multilingual online dictionary Wiktionary. JWKTL enables efficient and structured access to the information encoded in the English, German, and Russian Wiktionary language editions, including sense definitions, part-of-speech tags, etymology, example sentences, translations, semantic relations, and many other lexical information types. The Russian JWKTL parser is based on Wikokit.

Wednesday, 3 July 2013

Temporal Tagger HeidelTime

Intro: HeidelTime is a multilingual, cross-domain temporal tagger developed at the Database Systems Research Group at Heidelberg University. It extracts temporal expressions from documents and normalizes them according to the TIMEX3 annotation standard. HeidelTime is available as a UIMA annotator and as a standalone version. HeidelTime currently understands documents in English, German, Dutch, Vietnamese, Arabic, Spanish and Italian.

Thursday, 27 June 2013

How to embed Java code in C++ code

#include <jni.h> /* where everything is defined */

int main() {
    JavaVM *jvm; /* denotes a Java VM */
    JNIEnv *env; /* pointer to native method interface */
    JavaVMInitArgs vm_args; /* VM initialization arguments (JDK 1.2+ interface) */
    JavaVMOption options[1];
    /* set the class path; replace the ... with the real path */
    options[0].optionString = (char *)"-Djava.class.path=...";
    vm_args.version = JNI_VERSION_1_6;
    vm_args.nOptions = 1;
    vm_args.options = options;
    vm_args.ignoreUnrecognized = JNI_FALSE;
    /* load and initialize a Java VM, return a JNI interface
     * pointer in env */
    if (JNI_CreateJavaVM(&jvm, (void **)&env, &vm_args) != JNI_OK) {
        return 1;
    }
    /* invoke the Main.test method using the JNI */
    jclass cls = env->FindClass("Main");
    if (!cls) { /* class not found on the class path */
        jvm->DestroyJavaVM();
        return 1;
    }
    jmethodID mid = env->GetStaticMethodID(cls, "test", "(I)V");
    if (mid) {
        env->CallStaticVoidMethod(cls, mid, 100);
    }
    /* We could have created an Object and called methods on it instead */
    /* We are done. */
    jvm->DestroyJavaVM();
    return 0;
}

Scientific Writing Assistant

Intro: Swan (Scientific Writing AssistaNt) aims at helping writers with the content, not the grammar or spelling. It guides you towards known good scientific writing practices and helps your readers find your contribution. The tool was designed to help you with your writing, not to merely point out errors. Using the tool should be simple: just enter your text sections into the tool, optionally make some manual elaboration, and click the "Evaluate" button. Once you have used the tool with one of your own scientific papers, do let us know how it has helped you.

Tuesday, 16 April 2013

DLLs on different Visual Studio versions

A quick note:
- Memory cannot be allocated in a DLL built with a newer version (VS 2010) when it is loaded and accessed by a program built with an older version (VS 2005).

Thursday, 11 April 2013

Nile - Syntax-based Word Alignment Tool

Intro: Nile is a supervised, discriminative word alignment package that can make use of arbitrary and overlapping features.

Word Alignment Visualization Tool


Sunday, 7 April 2013

SENNA toolkit

Intro: SENNA is software distributed under a non-commercial license, which outputs a host of Natural Language Processing (NLP) predictions: part-of-speech (POS) tags, chunking (CHK), named entity recognition (NER), semantic role labeling (SRL) and syntactic parsing (PSG).
SENNA is fast because it uses a simple architecture, self-contained because it does not rely on the output of existing NLP systems, and accurate because it offers state-of-the-art or near state-of-the-art performance.

Sunday, 3 March 2013

Detection of near-duplicate documents

Intro: A sample implementation of Charikar's hash for identification of similar documents.
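
For readers unfamiliar with the idea, the following is a minimal sketch of a Charikar-style fingerprint (often called simhash), not the linked implementation. It assumes whitespace tokenization, a weight of 1 per token, and FNV-1a as the per-token hash; the function names are made up for illustration. Each token votes +1 or -1 on every bit position, and the sign of each total decides the fingerprint bit, so documents sharing most tokens tend to get fingerprints with a small Hamming distance.

```cpp
#include <bitset>
#include <cstdint>
#include <sstream>
#include <string>

// 64-bit FNV-1a, used here as the per-token hash (an illustrative choice).
std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Builds a 64-bit fingerprint: every whitespace-separated token adds +1 to
// the vote of each bit set in its hash and -1 to each unset bit. A
// fingerprint bit is 1 when its vote total is positive.
std::uint64_t simhash(const std::string& text) {
    int votes[64] = {0};
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        const std::uint64_t h = fnv1a(token);
        for (int bit = 0; bit < 64; ++bit) {
            votes[bit] += ((h >> bit) & 1ULL) ? 1 : -1;
        }
    }
    std::uint64_t fingerprint = 0;
    for (int bit = 0; bit < 64; ++bit) {
        if (votes[bit] > 0) fingerprint |= (1ULL << bit);
    }
    return fingerprint;
}

// Number of differing bits between two fingerprints.
int hamming_distance(std::uint64_t a, std::uint64_t b) {
    return static_cast<int>(std::bitset<64>(a ^ b).count());
}
```

Two documents would then be treated as near-duplicates when hamming_distance(simhash(a), simhash(b)) falls below a threshold tuned on real data; the threshold is left open here on purpose.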

Intro: DKPro Similarity is an open source software package for developing text similarity algorithms. The framework is designed to complement DKPro Core, a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. By leveraging the power of the tools available in DKPro Core, it allows for a rich set of similarity computation operations, including the design of full-fledged language processing pipelines and fully customizable processing steps.

Tuesday, 29 January 2013

The DBpedia Data Set

Intro: The DBpedia data set uses a large multi-domain ontology which has been derived from Wikipedia. The English version of the DBpedia data set currently describes 3.77 million “things” with 400 million “facts”.

Monday, 21 January 2013

Explicit Semantic Analysis - ESA

Intro: ESA is a vector representation of texts that uses Wikipedia as an external knowledge base.
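
As a rough illustration of the idea, the sketch below maps a text to a weighted vector of concepts and compares two texts by cosine similarity. The tiny word-to-concept index and its weights are hypothetical; a real ESA index would be derived from Wikipedia articles, typically with TF-IDF weights. All names are assumptions made for this example.

```cpp
#include <cmath>
#include <map>
#include <sstream>
#include <string>

using ConceptVector = std::map<std::string, double>;        // concept -> weight
using ConceptIndex = std::map<std::string, ConceptVector>;  // word -> concept weights

// Sums the concept vectors of all known words in the text.
ConceptVector esa_vector(const std::string& text, const ConceptIndex& index) {
    ConceptVector result;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        auto entry = index.find(word);
        if (entry == index.end()) continue;  // word not covered by the index
        for (const auto& concept_weight : entry->second) {
            result[concept_weight.first] += concept_weight.second;
        }
    }
    return result;
}

// Cosine similarity between two concept vectors; 0 if either is empty.
double cosine(const ConceptVector& a, const ConceptVector& b) {
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (const auto& kv : a) {
        norm_a += kv.second * kv.second;
        auto other = b.find(kv.first);
        if (other != b.end()) dot += kv.second * other->second;
    }
    for (const auto& kv : b) norm_b += kv.second * kv.second;
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}

// Hypothetical toy index; concepts and weights are invented for illustration.
ConceptIndex toy_index() {
    return {
        {"cat", {{"Felidae", 0.9}, {"Pet", 0.5}}},
        {"dog", {{"Canidae", 0.9}, {"Pet", 0.6}}},
        {"stock", {{"Finance", 0.8}}},
    };
}
```

With this toy index, "cat" and "dog" share the Pet concept and so get a positive similarity, while "cat" and "stock" share no concept and score zero.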

Wednesday, 16 January 2013

Statistical Methods in Language and Linguistic Research