HOANG Cong Duy Vu's research logs: 2013

Saturday, 16 November 2013

CMPH

Intro: A perfect hash function maps a static set of n keys into a set of m integer numbers without collisions, where m is greater than or equal to n. If m is equal to n, the function is called minimal.
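
As a minimal sketch of the definition above (not CMPH's actual algorithm or API), the following C++ brute-forces a seed for a simple seeded hash until every key of a small static set lands in its own slot in [0, m). When m equals the number of keys, the result is a minimal perfect hash. The names seeded_hash, is_perfect and find_perfect_seed are hypothetical, and the FNV-1a style mixing is only an illustrative choice.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// FNV-1a style hash mixed with a seed (an illustrative choice, not CMPH's).
std::uint64_t seeded_hash(const std::string& key, std::uint64_t seed) {
    std::uint64_t h = 14695981039346656037ULL ^ seed;
    for (unsigned char c : key) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// True if the seed maps every key to a distinct slot in [0, m).
bool is_perfect(const std::vector<std::string>& keys, std::size_t m,
                std::uint64_t seed) {
    if (m < keys.size()) return false;  // more keys than slots: impossible
    std::vector<bool> used(m, false);
    for (const std::string& k : keys) {
        std::size_t slot = static_cast<std::size_t>(seeded_hash(k, seed) % m);
        if (used[slot]) return false;   // collision, this seed is rejected
        used[slot] = true;
    }
    return true;
}

// Brute-force search over seeds; returns true and sets seed_out on success.
bool find_perfect_seed(const std::vector<std::string>& keys, std::size_t m,
                       std::uint64_t& seed_out,
                       std::uint64_t max_tries = 1000000) {
    for (std::uint64_t seed = 0; seed < max_tries; ++seed) {
        if (is_perfect(keys, m, seed)) {
            seed_out = seed;
            return true;
        }
    }
    return false;
}
```

Calling find_perfect_seed(keys, keys.size(), seed) asks for the minimal case (m = n); a key is then looked up with seeded_hash(key, seed) % keys.size(). Brute force is only practical for tiny key sets, which is why dedicated libraries use more elaborate constructions.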

Sunday, 10 November 2013

HandAlign

Link: http://www.umiacs.umd.edu/~hal/HandAlign/index.html
Intro: a tool to assist manual alignment in MT, summarization, ...

Thursday, 7 November 2013

Pure for simple CSS

Link: http://purecss.io/
Intro: A set of small, responsive CSS modules that you can use in every web project.

Wednesday, 30 October 2013

Kaggle

Link: http://www.kaggle.com/competitions
Intro: A platform for paid problem solving around the world.

Monday, 7 October 2013

pialign - Phrasal ITG Aligner

Link: http://phontron.com/pialign/

Intro: pialign is a package that allows you to create a phrase table and word alignments from an unaligned parallel corpus. It is unlike other unsupervised word alignment tools in that it is able to create a phrase table using a fully statistical model, no heuristics. As a result, it is able to build phrase tables for phrase-based machine translation that achieve competitive results but are only a fraction of the size of those created with heuristic methods.

*** Note: pialign can extract a very compact phrase table directly from unaligned parallel data. This may be very helpful for SMT systems in mobile environments.

Wednesday, 2 October 2013

Source Code Search

Links:
1) http://www.programmableweb.com/apis/directory
2) http://code.ohloh.net/
3) http://runnable.com/

Tuesday, 24 September 2013

Tree visualization with Treebolic

Link: http://treebolic.sourceforge.net/en/index.html

Intro: Treebolic is a Java component (widget) whose purpose is to provide a hyperbolic rendering of hierarchical data.

A tree is rendered with nodes and edges, but display space is subject to a particular curvature (hence the name): more space is allocated to the focus node while the parent and children, still in the immediate visual context, appear slightly smaller. The grandparents and grandchildren are still visible but come out even smaller. As we move away from the focus node, less display space is allotted to the nodes, which gradually disappear towards the disk's border, as though the whole hierarchy were seen through a fisheye lens.

Wrapped as a Java applet, the Treebolic widget can be embedded in a web page. Nodes may then contain hypertext links that direct the browser to other web pages.

The tree is dynamic (animation brings the focus node to the center) and responds to user interaction.

It looks good. See an illustrative example in the Vietnamese WordNet entry below.

Vietnamese Wordnet

Link: http://vi.asianwordnet.org/

Intro: Vietnamese WordNet uses the WNMS tools developed by AsianWordNet to create and share WordNets among Asian languages, based on WordNet® version 3.0. It is a co-operation between TCL and Vietnam, established in October 2007.

It's sad that it has not been developed by Vietnamese people :(.

Sunday, 22 September 2013

YALIGN

Link: http://yalign.machinalis.com/

Intro: Statistical Machine Translation relies on parallel corpora for training translation models. However, these corpora are limited and take time to create. Yalign is designed to automate this process by finding sentences that are close translation matches from comparable corpora. This opens up avenues for harvesting parallel corpora from sources like translated documents and the web.

Sunday, 15 September 2013

WordWanderer

Link: http://wordwanderer.org/

It's fun, right?

TinyTM

Link: http://tinytm.sourceforge.net/en/index.html
Intro: TinyTM - Open-Source Translation Memory

Tuesday, 27 August 2013

Alternative to Amazon Mechanical Turk

Link: http://crowdflower.com/

Note: This online service is not restricted to the US.

Monday, 26 August 2013

Data visualization tool

Link: http://www.drasticdata.nl/DDHome.php
Intro: some kinds of beautiful data visualization

(linking to http://newsmap.jp)

Tuesday, 20 August 2013

Java-based Wiktionary Library (JWKTL)

Link: https://code.google.com/p/jwktl/

Intro: JWKTL (Java-based Wiktionary Library) is an application programming interface for the free multilingual online dictionary Wiktionary (http://www.wiktionary.org). JWKTL enables efficient and structured access to the information encoded in the English, the German, and the Russian Wiktionary language editions, including sense definitions, part of speech tags, etymology, example sentences, translations, semantic relations, and many other lexical information types. The Russian JWKTL parser is based on Wikokit (http://code.google.com/p/wikokit/).

Monday, 12 August 2013

Topic modeling packages

1) MALLET: http://mallet.cs.umass.edu/
2) TMT: http://nlp.stanford.edu/software/tmt/tmt-0.4/
3) GENSIM: http://radimrehurek.com/gensim/index.html

Wednesday, 3 July 2013

Temporal Tagger HeidelTime

Link: https://code.google.com/p/heideltime/

Demo: http://heideltime.ifi.uni-heidelberg.de/heideltime/

Intro: HeidelTime is a multilingual, cross-domain temporal tagger developed at the Database Systems Research Group at Heidelberg University. It extracts temporal expressions from documents and normalizes them according to the TIMEX3 annotation standard. HeidelTime is available as a UIMA annotator and as a standalone version. HeidelTime currently understands documents in English, German, Dutch, Vietnamese, Arabic, Spanish and Italian.

Thursday, 27 June 2013

How to embed Java code in C++ code

Example Link : http://stackoverflow.com/questions/7506329/embed-java-into-a-c-application

Sun's Guide: http://docs.oracle.com/javase/1.5.0/docs/guide/jni/


#include <jni.h>       /* where everything is defined */

int main() {
  JavaVM *jvm;             /* denotes a Java VM */
  JNIEnv *env;             /* pointer to native method interface */
  JavaVMInitArgs vm_args;  /* JDK 1.2+ VM initialization arguments;
                            * replaces the JDK 1.1 JDK1_1InitArgs */
  JavaVMOption options[1];

  /* Set the class path (the value is left elided, as in the
   * original note) */
  options[0].optionString = (char *)"-Djava.class.path=...";
  vm_args.version = JNI_VERSION_1_2;
  vm_args.nOptions = 1;
  vm_args.options = options;
  vm_args.ignoreUnrecognized = JNI_FALSE;

  /* load and initialize a Java VM, return a JNI interface
   * pointer in env */
  if (JNI_CreateJavaVM(&jvm, (void **)&env, &vm_args) != JNI_OK)
    return 1;

  /* invoke the Main.test method using the JNI */
  jclass cls = env->FindClass("Main");
  jmethodID mid = env->GetStaticMethodID(cls, "test", "(I)V");
  env->CallStaticVoidMethod(cls, mid, 100);

  /* We could have created an Object and called methods on it instead */

  /* We are done. */
  jvm->DestroyJavaVM();
  return 0;
}

Scientific Writing Assistant

SWAN: http://cs.joensuu.fi/swan/

Intro: SWAN (Scientific Writing AssistaNt) aims at helping writers with the content, not the grammar or spelling. It guides you towards known good scientific writing practices and helps your readers find your contribution. The tool was designed to help you with your writing, not merely to point out errors. Using the tool should be simple: just enter your text sections into the tool, optionally make some manual elaboration and click the "Evaluate" button. Once you have used the tool with one of your own scientific papers, do let us know how it has helped you.

Tuesday, 16 April 2013

DLLs on different Visual Studio versions

A quick note:

- Memory cannot be allocated in a DLL built with a newer version (VS 2010) when that DLL is loaded and accessed by a program built with an older version (VS 2005)

- TBA

Monday, 15 April 2013

Google Relation Extraction Corpus

Link: https://code.google.com/p/relation-extraction-corpus/
News: http://googleresearch.blogspot.sg/2013/04/50000-lessons-on-how-to-read-relation.html

Thursday, 11 April 2013

Nile - Syntax-based Word Alignment Tool

Link: https://code.google.com/p/nile/
Intro: Nile is a supervised, discriminative word alignment package that can make use of arbitrary and overlapping features.

Word Alignment Visualization Tool

Link: http://nlg.isi.edu/demos/picaro/

Sunday, 7 April 2013

SENNA toolkit

Intro: SENNA is software distributed under a non-commercial license, which outputs a host of Natural Language Processing (NLP) predictions: part-of-speech (POS) tags, chunking (CHK), named entity recognition (NER), semantic role labeling (SRL) and syntactic parsing (PSG).

SENNA is fast because it uses a simple architecture, self-contained because it does not rely on the output of existing NLP systems, and accurate because it offers state-of-the-art or near state-of-the-art performance.

Link: http://ronan.collobert.com/senna/

Sunday, 3 March 2013

Detection of near-duplicate documents

Link: https://github.com/vilda/shash/

Intro: A sample implementation of Charikar's hash for identification of similar documents.
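
For reference, Charikar-style fingerprinting can be sketched in a few lines of C++. This is a minimal illustration under stated assumptions, not the code from the shash repository: tokens are whitespace-separated words with weight 1, the token hash is 64-bit FNV-1a, and the names fnv1a, simhash and hamming_distance are hypothetical. Each token votes +1 or -1 on every bit position; the sign of each total gives one fingerprint bit, so similar documents tend to differ in few bits.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// 64-bit FNV-1a token hash (an illustrative choice of hash function).
std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Fingerprint over whitespace-separated tokens, each with weight 1.
std::uint64_t simhash(const std::string& text) {
    int counts[64] = {0};
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        std::uint64_t h = fnv1a(token);
        for (int b = 0; b < 64; ++b)
            counts[b] += ((h >> b) & 1ULL) ? 1 : -1;  // vote per bit position
    }
    std::uint64_t fp = 0;
    for (int b = 0; b < 64; ++b)
        if (counts[b] > 0) fp |= (1ULL << b);         // keep the majority bit
    return fp;
}

// Number of differing bits between two fingerprints.
int hamming_distance(std::uint64_t a, std::uint64_t b) {
    std::uint64_t x = a ^ b;
    int d = 0;
    while (x) {
        x &= x - 1;  // clear the lowest set bit
        ++d;
    }
    return d;
}
```

Two documents would then be treated as near-duplicates when hamming_distance(simhash(a), simhash(b)) falls below a small threshold; the right threshold depends on the data and is not specified here.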

Link: https://code.google.com/p/dkpro-similarity-asl/

Intro: DKPro Similarity is an open source software package for developing text similarity algorithms. The framework is designed to complement DKPro Core, a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. By leveraging the power of the tools available in DKPro Core, it allows for a rich set of similarity computation operations, including the design of full-fledged language processing pipelines and fully customizable processing steps.

Tuesday, 29 January 2013

The DBpedia Data Set

Link: http://wiki.dbpedia.org/Datasets

Intro: The DBpedia data set uses a large multi-domain ontology which has been derived from Wikipedia. The English version of the DBpedia data set currently describes 3.77 million “things” with 400 million “facts”.

Monday, 21 January 2013

Explicit Semantic Analysis - ESA

Intro: ESA is a vector representation of texts based on Wikipedia as an external knowledge base.
Link: http://www.cs.technion.ac.il/~gabr/resources/code/esa/esa.html
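
To make the idea concrete, here is a rough C++ sketch of the representation, assuming a precomputed inverted index from words to weighted Wikipedia concepts (for example TF-IDF weights). The type names ConceptVector and InvertedIndex and the functions esa_vector and cosine are hypothetical and are not taken from the linked code. A text's vector is the sum of its words' concept vectors, and relatedness is the cosine between two such vectors.

```cpp
#include <cmath>
#include <map>
#include <sstream>
#include <string>

typedef std::map<std::string, double> ConceptVector;         // concept -> weight
typedef std::map<std::string, ConceptVector> InvertedIndex;  // word -> concepts

// Sum the concept vectors of all known words in the text.
ConceptVector esa_vector(const std::string& text, const InvertedIndex& index) {
    ConceptVector v;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        InvertedIndex::const_iterator it = index.find(word);
        if (it == index.end()) continue;  // words missing from the index are skipped
        for (ConceptVector::const_iterator c = it->second.begin();
             c != it->second.end(); ++c)
            v[c->first] += c->second;
    }
    return v;
}

// Cosine similarity between two sparse concept vectors (0 if either is empty).
double cosine(const ConceptVector& a, const ConceptVector& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (ConceptVector::const_iterator p = a.begin(); p != a.end(); ++p) {
        na += p->second * p->second;
        ConceptVector::const_iterator q = b.find(p->first);
        if (q != b.end()) dot += p->second * q->second;
    }
    for (ConceptVector::const_iterator q = b.begin(); q != b.end(); ++q)
        nb += q->second * q->second;
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```

In a real setup the index would be built from a Wikipedia dump; any small hand-made index is only good for checking that texts sharing concepts score higher than texts that share none.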

Wednesday, 16 January 2013

Statistical Methods in Language and Linguistic Research

Link: https://www.equinoxpub.com/equinox/books/showbook.asp?bkid=348&keyword=