Intro: A perfect hash function maps a static set of n keys into a set of m integer numbers without collisions, where m is greater than or equal to n. If m is equal to n, the function is called minimal.
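To make the definition concrete, here is a minimal sketch in Python. It is not how any particular perfect-hashing library works: it simply tries seeds for a salted SHA-256 hash until the static key set lands in distinct slots. The helper names `slot` and `find_perfect_seed` are made up for this illustration, and the brute-force search is only practical for small key sets.

```python
import hashlib


def slot(key: str, seed: int, m: int) -> int:
    """Map a key to one of m slots using a seed-salted SHA-256 hash."""
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m


def find_perfect_seed(keys, m=None, max_tries=1_000_000):
    """Search for a seed that makes `slot` collision-free on `keys`.

    With m == len(keys) (the default) the resulting function is minimal.
    """
    keys = list(keys)
    n = len(keys)
    if len(set(keys)) != n:
        raise ValueError("keys must be distinct")
    m = n if m is None else m
    if m < n:
        raise ValueError("m must be greater than or equal to n")
    for seed in range(max_tries):
        # A seed is perfect when all n keys occupy n different slots.
        if len({slot(k, seed, m) for k in keys}) == n:
            return seed
    raise RuntimeError("no perfect seed found within max_tries")


keys = ["apple", "banana", "cherry", "date", "elderberry"]
seed = find_perfect_seed(keys)  # m defaults to n, so the hash is minimal
table = {k: slot(k, seed, len(keys)) for k in keys}
```

Because the key set is static, the seed is computed once and stored; lookups afterwards need only one hash evaluation.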
Sunday, 10 November 2013
Thursday, 7 November 2013
Wednesday, 30 October 2013
Monday, 7 October 2013
Intro: pialign is a package that allows you to create a phrase table and word alignments from an unaligned parallel corpus. It is unlike other unsupervised word alignment tools in that it is able to create a phrase table using a fully statistical model, no heuristics. As a result, it is able to build phrase tables for phrase-based machine translation that achieve competitive results but are only a fraction of the size of those created with heuristic methods.
*** Note: pialign can extract a very compact phrase table directly from unaligned parallel data. This may be very helpful for SMT systems in mobile environments.
Wednesday, 2 October 2013
Tuesday, 24 September 2013
Intro: Treebolic is a Java component (widget) whose purpose is to provide a hyperbolic rendering of hierarchical data.
A tree is rendered with nodes and edges but display space is subject to a particular curvature (hence the name): more space is allocated to the focus node while the parent and children, still in the immediate visual context, appear slightly smaller. The grandparents and grandchildren are still visible but come out even smaller. As we move away from the focus node, less display space is allotted to the nodes, which gradually disappear towards the disk's border, as though the whole hierarchy were seen through a fisheye lens.
Wrapped as a Java applet, the Treebolic widget can be embedded in a web page. Nodes may then contain hypertext links that direct the browser to other web pages.
The tree is dynamic (animation brings the focus node to the center) and responds to user interaction.
It looks good. See an illustrative example in the Vietnamese WordNet here.
Intro: Vietnamese WordNet uses the WNMS tools developed by AsianWordNet to create and share WordNet among Asian languages, based on WordNet® version 3.0. The co-operation between TCL and -- Vietnam -- was established in October 2007.
It's sad that it has not been developed by Vietnamese people :(.
Sunday, 22 September 2013
Intro: Statistical Machine Translation relies on parallel corpora for training translation models. However, these corpora are limited and take time to create. Yalign is designed to automate this process by finding sentences that are close translation matches from comparable corpora. This opens up avenues for harvesting parallel corpora from sources like translated documents and the web.
Sunday, 15 September 2013
Tuesday, 27 August 2013
Monday, 26 August 2013
Tuesday, 20 August 2013
Intro: JWKTL (Java-based Wiktionary Library) is an application programming interface for the free multilingual online dictionary Wiktionary (http://www.wiktionary.org). JWKTL enables efficient and structured access to the information encoded in the English, the German, and the Russian Wiktionary language editions, including sense definitions, part of speech tags, etymology, example sentences, translations, semantic relations, and many other lexical information types. The Russian JWKTL parser is based on Wikokit (http://code.google.com/p/wikokit/).
Monday, 12 August 2013
Wednesday, 3 July 2013
Intro: HeidelTime is a multilingual, cross-domain temporal tagger developed at the Database Systems Research Group at Heidelberg University. It extracts temporal expressions from documents and normalizes them according to the TIMEX3 annotation standard. HeidelTime is available as a UIMA annotator and as a standalone version. HeidelTime currently understands documents in English, German, Dutch, Vietnamese, Arabic, Spanish and Italian.
Thursday, 27 June 2013
Intro: This Swan - Scientific Writing AssistaNt - aims at helping writers with the content, not the grammar or spelling. It guides you towards known good scientific writing practices and helps your readers find your contribution. The tool was designed to help you with your writing, not to merely point out errors. Using the tool should be simple; just enter your text sections into the tool, optionally make some manual elaboration and click the "Evaluate" button. Once you have used the tool with one of your own scientific papers, do let us know how it has helped you.
Tuesday, 16 April 2013
Monday, 15 April 2013
Thursday, 11 April 2013
Sunday, 7 April 2013
Intro: SENNA is software distributed under a non-commercial license, which outputs a host of Natural Language Processing (NLP) predictions: part-of-speech (POS) tags, chunking (CHK), named entity recognition (NER), semantic role labeling (SRL) and syntactic parsing (PSG).
SENNA is fast because it uses a simple architecture, self-contained because it does not rely on the output of existing NLP systems, and accurate because it offers state-of-the-art or near state-of-the-art performance.
Sunday, 3 March 2013
Intro: A sample implementation of Charikar's hash for identification of similar documents.
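For readers who have not met Charikar's hash (often called simhash), here is a rough sketch of the idea in Python. It is not the sample implementation mentioned above; the function names, the 64-bit fingerprint width, the use of MD5 as the per-token hash, and the equal weighting of tokens are all assumptions made for illustration.

```python
import hashlib

BITS = 64  # fingerprint width chosen for this sketch


def token_hash(token: str) -> int:
    """Hash one token to a 64-bit integer (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(tokens) -> int:
    """Combine token hashes into one fingerprint by per-bit majority vote."""
    counts = [0] * BITS
    for token in tokens:
        h = token_hash(token)
        for i in range(BITS):
            # Each token votes +1 for a set bit and -1 for a cleared bit.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the quick brown fox jumped over the lazy dog".split()
distance = hamming_distance(simhash(doc_a), simhash(doc_b))
```

Documents that share most of their tokens tend to produce fingerprints with a small Hamming distance, so near-duplicates can be found by comparing short fingerprints instead of full texts.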
Intro: DKPro Similarity is an open source software package for developing text similarity algorithms. The framework is designed to complement DKPro Core, a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. By leveraging the power of the tools available in DKPro Core, it allows for a rich set of similarity computation operations, including the design of full-fledged language processing pipelines and fully customizable processing steps.
Tuesday, 29 January 2013
Intro: The DBpedia data set uses a large multi-domain ontology which has been derived from Wikipedia. The English version of the DBpedia data set currently describes 3.77 million “things” with 400 million “facts”.
Monday, 21 January 2013
Intro: ESA is a vector representation of texts based on Wikipedia as an external knowledge base.
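The idea can be sketched in a few lines of Python. This is a toy illustration only: three hand-written "articles" stand in for Wikipedia, the function names (`build_index`, `esa_vector`, `cosine`) are invented for this example, and the weighting is a simplified TF-IDF rather than the exact scheme of any ESA implementation.

```python
import math
from collections import Counter, defaultdict


def build_index(concepts):
    """Build an inverted index: term -> {concept name: TF-IDF weight}."""
    tf = {name: Counter(text.lower().split()) for name, text in concepts.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    n = len(concepts)
    index = defaultdict(dict)
    for name, counts in tf.items():
        for term, count in counts.items():
            index[term][name] = count * math.log(n / df[term])
    return index


def esa_vector(text, index):
    """Represent a text as a weighted vector over concepts."""
    vector = defaultdict(float)
    for term in text.lower().split():
        for concept, weight in index.get(term, {}).items():
            vector[concept] += weight
    return dict(vector)


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy knowledge base; a real ESA index would use Wikipedia articles.
concepts = {
    "Cat": "cat feline pet animal",
    "Dog": "dog canine pet animal",
    "Car": "car engine vehicle road",
}
index = build_index(concepts)
similarity = cosine(esa_vector("cat pet", index), esa_vector("dog pet", index))
```

Two texts are then compared through the concepts they activate rather than through the words they share directly.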