Intro: A perfect hash function maps a static set of n keys into a set of m integer numbers without collisions, where m is greater than or equal to n. If m is equal to n, the function is called minimal.
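To make the definition above concrete, here is a minimal sketch that searches for a seed under which a small, fixed key set hashes without collisions. It is a brute-force illustration only, not the construction used by any particular perfect-hashing library; the names `seeded_hash`, `slot_of` and `find_perfect_seed`, and the choice of FNV-1a with a finalizer, are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Seeded string hash: FNV-1a followed by a 64-bit finalizer, so that the
// low bits used by "% m" depend on the whole state. Illustrative choice.
std::uint64_t seeded_hash(const std::string& key, std::uint64_t seed) {
    std::uint64_t h = 14695981039346656037ULL ^ seed;
    for (std::size_t i = 0; i < key.size(); ++i) {
        h ^= static_cast<unsigned char>(key[i]);
        h *= 1099511628211ULL;
    }
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    return h;
}

// Maps a key to one of m slots for a given seed.
std::size_t slot_of(const std::string& key, std::uint64_t seed, std::size_t m) {
    assert(m > 0);
    return static_cast<std::size_t>(seeded_hash(key, seed) % m);
}

// Returns the first seed in [0, max_seed) under which all keys land in
// distinct slots, or -1 if none is found. With m == keys.size() the
// resulting function is minimal.
long long find_perfect_seed(const std::vector<std::string>& keys,
                            std::size_t m, long long max_seed) {
    if (m < keys.size()) return -1;  // more keys than slots: impossible
    for (long long seed = 0; seed < max_seed; ++seed) {
        std::vector<bool> used(m, false);
        bool collision = false;
        for (std::size_t i = 0; i < keys.size(); ++i) {
            std::size_t slot =
                slot_of(keys[i], static_cast<std::uint64_t>(seed), m);
            if (used[slot]) { collision = true; break; }
            used[slot] = true;
        }
        if (!collision) return seed;
    }
    return -1;
}
```

Calling `find_perfect_seed(keys, keys.size(), 1000000)` on a handful of keys and then using `slot_of(key, seed, keys.size())` gives a minimal perfect hash for that set. The search cost grows quickly with n, which is why practical tools use more elaborate constructions.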
Saturday, 16 November 2013
Sunday, 10 November 2013
HandAlign
Link: http://www.umiacs.umd.edu/~hal/HandAlign/index.html
Intro: a tool to assist manual alignment in MT, summarization, ...
Labels:
MT,
NLP,
phrase alignment,
SMT,
text alignment,
toolkit,
visualization
Thursday, 7 November 2013
Pure for simple CSS
Link: http://purecss.io/
Intro: A set of small, responsive CSS modules that you can use in every web project.
Wednesday, 30 October 2013
Kaggle
Link: http://www.kaggle.com/competitions
Intro: A platform for paid problem solving around the world.
Monday, 7 October 2013
pialign - Phrasal ITG Aligner
Intro: pialign is a package that allows you to create a phrase table and word alignments from an unaligned parallel corpus. It is unlike other unsupervised word alignment tools in that it is able to create a phrase table using a fully statistical model, no heuristics. As a result, it is able to build phrase tables for phrase-based machine translation that achieve competitive results but are only a fraction of the size of those created with heuristic methods.
*** Note: pialign can extract a very compact phrase table directly from unaligned parallel data. This may be very helpful for SMT systems in mobile environments.
Wednesday, 2 October 2013
Tuesday, 24 September 2013
Tree visualization with Treebolic
Intro: Treebolic is a Java component (widget) whose purpose is to provide a hyperbolic rendering of hierarchical data.
A tree is rendered with nodes and edges but display space is subject to a particular curvature (hence the name): more space is allocated to the focus node while the parent and children, still in the immediate visual context, appear slightly smaller. The grandparents and grandchildren are still visible but come out even smaller. As we move away from the focus node, less display space is allotted to the nodes, which gradually disappear towards the disk's border, as though the whole hierarchy were seen through a fisheye lens.
Wrapped as a Java applet, the Treebolic widget can be embedded in a web page. Nodes may then contain hypertext links and point the browser to other web pages.
The tree is dynamic (animation brings the focus node to the center) and responds to user interaction.
It looks good. See the Vietnamese WordNet below for an illustrative example.
Vietnamese Wordnet
Link: http://vi.asianwordnet.org/
Intro: Vietnamese WordNet uses the WNMS tools developed by AsianWordNet to create and share WordNets among Asian languages, based on WordNet® version 3.0. It is a co-operation between TCL and -- Vietnam --, established in October 2007.
It's sad that it has not been developed by Vietnamese people :(.
Sunday, 22 September 2013
YALIGN
Intro: Statistical Machine Translation relies on parallel corpora for training translation models. However, these corpora are limited and take time to create. Yalign is designed to automate this process by finding sentences that are close translation matches from comparable corpora. This opens up avenues for harvesting parallel corpora from sources like translated documents and the web.
Sunday, 15 September 2013
Tuesday, 27 August 2013
Alternative to Amazon Mechanical Turk
Link: http://crowdflower.com/
Note: This online service is not restricted to the US.
Labels:
amazon mechanical turk,
AMT,
crowdsourcing,
data annotation,
online
Monday, 26 August 2013
Data visualization tool
Link: http://www.drasticdata.nl/DDHome.php
Intro: some kinds of beautiful data visualization
(linking to http://newsmap.jp)
Labels:
news,
newsmap,
open source,
tree mapping,
visualization
Tuesday, 20 August 2013
Java-based Wiktionary Library (JWKTL)
Intro: JWKTL (Java-based Wiktionary Library) is an application programming interface for the free multilingual online dictionary Wiktionary (http://www.wiktionary.org). JWKTL enables efficient and structured access to the information encoded in the English, the German, and the Russian Wiktionary language editions, including sense definitions, part of speech tags, etymology, example sentences, translations, semantic relations, and many other lexical information types. The Russian JWKTL parser is based on Wikokit (http://code.google.com/p/wikokit/).
Monday, 12 August 2013
Wednesday, 3 July 2013
Temporal Tagger HeidelTime
Intro: HeidelTime is a multilingual, cross-domain temporal tagger developed at the Database Systems Research Group at Heidelberg University. It extracts temporal expressions from documents and normalizes them according to the TIMEX3 annotation standard. HeidelTime is available as a UIMA annotator and as a standalone version. HeidelTime currently understands documents in English, German, Dutch, Vietnamese, Arabic, Spanish and Italian.
Thursday, 27 June 2013
How to embed Java code in C++ code
Sun's Guide: http://docs.oracle.com/javase/1.5.0/docs/guide/jni/
#include <jni.h>       /* where everything is defined */

int main() {
    JavaVM *jvm;                /* denotes a Java VM */
    JNIEnv *env;                /* pointer to native method interface */
    JDK1_1InitArgs vm_args;     /* JDK 1.1 VM initialization arguments */

    vm_args.version = 0x00010001;   /* New in 1.1.2: VM version */
    /* Get the default initialization arguments and set the class
     * path */
    JNI_GetDefaultJavaVMInitArgs(&vm_args);
    vm_args.classpath = ...;

    /* load and initialize a Java VM, return a JNI interface
     * pointer in env */
    JNI_CreateJavaVM(&jvm, &env, &vm_args);

    /* invoke the Main.test method using the JNI */
    jclass cls = env->FindClass("Main");
    jmethodID mid = env->GetStaticMethodID(cls, "test", "(I)V");
    env->CallStaticVoidMethod(cls, mid, 100);

    /* We could have created an Object and called methods on it instead */
    /* We are done. */
    jvm->DestroyJavaVM();
}
Scientific Writing Assistant
Intro: This Swan - Scientific Writing AssistaNt - aims at helping writers with the content, not the grammar or spelling. It guides you towards known good scientific writing practices and helps your readers find your contribution. The tool was designed to help you with your writing, not to merely point out errors. Using the tool should be simple; just enter your text sections into the tool, optionally make some manual elaboration and click the "Evaluate" button. Once you have used the tool with one of your own scientific papers, do let us know how it has helped you.
Labels:
assistant,
papers,
research,
scientific writing,
tool
Tuesday, 16 April 2013
DLLs on different Visual Studio versions
A quick note:
- Memory cannot be allocated in a DLL built with a newer version (VS 2010) when that DLL is loaded and accessed by a program built with an older version (VS 2005)
- TBA
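One common cause of this kind of failure (an assumption here, since the note gives no details) is that binaries built with different Visual Studio versions link against different C runtimes, and memory allocated by one runtime should be released by that same runtime. A usual workaround is to have the DLL export both the allocating and the releasing function, so that `new` and `delete` always run on the DLL's side. The sketch below illustrates that pattern; `EXAMPLE_API`, `example_buffer_create` and `example_buffer_destroy` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Export with C linkage so callers built with another compiler version
// do not depend on C++ name mangling. Outside Windows the macro only
// selects C linkage, which keeps the sketch portable.
#if defined(_WIN32)
#define EXAMPLE_API extern "C" __declspec(dllexport)
#else
#define EXAMPLE_API extern "C"
#endif

// Allocates a zero-filled buffer inside the DLL, using the DLL's runtime.
// Returns a null pointer if the allocation fails.
EXAMPLE_API char* example_buffer_create(std::size_t size) {
    return new (std::nothrow) char[size]();
}

// Releases a buffer with the same runtime that allocated it.
// Passing a null pointer is allowed and does nothing.
EXAMPLE_API void example_buffer_destroy(char* buffer) {
    delete[] buffer;
}
```

The calling program then only ever holds the pointer and hands it back to `example_buffer_destroy`; it never calls `delete[]` or `free` on it directly.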
Monday, 15 April 2013
Thursday, 11 April 2013
Nile - Syntax-based Word Alignment Tool
Link: https://code.google.com/p/nile/
Intro: Nile is a supervised, discriminative word alignment package that can make use of arbitrary and overlapping features.
Sunday, 7 April 2013
SENNA toolkit
Intro: SENNA is software distributed under a non-commercial license, which outputs a host of Natural Language Processing (NLP) predictions: part-of-speech (POS) tags, chunking (CHK), named entity recognition (NER), semantic role labeling (SRL) and syntactic parsing (PSG).
SENNA is fast because it uses a simple architecture, self-contained because it does not rely on the output of existing NLP systems, and accurate because it offers state-of-the-art or near state-of-the-art performance.
Sunday, 3 March 2013
Detection of near-duplicate documents
Intro: A sample implementation of Charikar's hash for identification of similar documents.
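Charikar's hash is commonly known as simhash. As a rough illustration of the idea (not the linked implementation), the sketch below hashes each token to 64 bits, lets every token vote +1 or -1 on each bit position, and keeps the sign of each total as the fingerprint bit. Similar documents tend to produce fingerprints with a small Hamming distance. Whitespace tokenization, unweighted tokens, the FNV-1a token hash, and the names `token_hash`, `simhash` and `hamming_distance` are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// 64-bit FNV-1a hash of a single token (illustrative choice of hash).
std::uint64_t token_hash(const std::string& token) {
    std::uint64_t h = 14695981039346656037ULL;
    for (std::size_t i = 0; i < token.size(); ++i) {
        h ^= static_cast<unsigned char>(token[i]);
        h *= 1099511628211ULL;
    }
    return h;
}

// Computes a 64-bit simhash fingerprint of whitespace-separated text.
std::uint64_t simhash(const std::string& text) {
    int weight[64] = {0};  // per-bit vote totals
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        const std::uint64_t h = token_hash(token);
        for (int bit = 0; bit < 64; ++bit) {
            weight[bit] += ((h >> bit) & 1ULL) ? 1 : -1;
        }
    }
    std::uint64_t fingerprint = 0;
    for (int bit = 0; bit < 64; ++bit) {
        if (weight[bit] > 0) fingerprint |= (1ULL << bit);
    }
    return fingerprint;
}

// Number of bit positions in which two fingerprints differ.
int hamming_distance(std::uint64_t a, std::uint64_t b) {
    std::uint64_t x = a ^ b;
    int count = 0;
    while (x != 0) {
        x &= x - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}
```

Two documents would then be treated as near-duplicates when `hamming_distance(simhash(a), simhash(b))` falls below a threshold chosen for the collection; the right threshold is data-dependent and is not specified here.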
Intro: DKPro Similarity is an open source software package for developing text similarity algorithms. The framework is designed to complement DKPro Core, a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. By leveraging the power of the tools available in DKPro Core, it allows for a rich set of similarity computation operations, including the design of full-fledged language processing pipelines and fully customizable processing steps.
Labels:
document similarity,
hashing,
link,
NLP,
text processing,
tool
Tuesday, 29 January 2013
The DBpedia Data Set
Link: http://wiki.dbpedia.org/Datasets
Intro: The DBpedia data set uses a large multi-domain ontology which has been derived from Wikipedia. The English version of the DBpedia data set currently describes 3.77 million “things” with 400 million “facts”.
Labels:
corpus,
DBpedia,
information extraction,
large-scale,
link,
multi-lingual,
NLP,
Wikipedia
Monday, 21 January 2013
Explicit Semantic Analysis - ESA
Intro: ESA is a vector representation of texts based on Wikipedia as an external knowledge base.
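In ESA, a text is represented as a weighted vector over Wikipedia concepts, and semantic relatedness between two texts is typically measured as the cosine between their vectors. The sketch below shows only that last comparison step and assumes the concept vectors have already been computed elsewhere; `ConceptVector` and `cosine_relatedness` are hypothetical names, and any weights used with it here would be made-up illustration values, not real ESA output.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Sparse concept vector: Wikipedia concept title -> weight.
typedef std::map<std::string, double> ConceptVector;

// Cosine of the angle between two sparse concept vectors.
// Returns 0.0 if either vector is empty or has zero norm.
double cosine_relatedness(const ConceptVector& a, const ConceptVector& b) {
    double dot = 0.0;
    double norm_a = 0.0;
    double norm_b = 0.0;
    for (ConceptVector::const_iterator it = a.begin(); it != a.end(); ++it) {
        norm_a += it->second * it->second;
        ConceptVector::const_iterator match = b.find(it->first);
        if (match != b.end()) {
            dot += it->second * match->second;  // shared concept
        }
    }
    for (ConceptVector::const_iterator it = b.begin(); it != b.end(); ++it) {
        norm_b += it->second * it->second;
    }
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

Texts that share no concepts score 0, and a text compared with itself scores 1.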
Link: http://www.cs.technion.ac.il/~gabr/resources/code/esa/esa.html
Labels:
esa,
Explicit Semantic Analysis,
NLP,
research,
semantic relatedness,
Wikipedia
Wednesday, 16 January 2013
Statistical Methods in Language and Linguistic Research
Link: https://www.equinoxpub.com/equinox/books/showbook.asp?bkid=348&keyword=