Thursday, 29 December 2011
Common Crawl Foundation is a California 501(c)(3) non-profit founded by Gil Elbaz with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible.
Saturday, 24 December 2011
Intro: WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that browses and processes Web pages automatically.
Monday, 19 December 2011
Sunday, 18 December 2011
Now I am thinking about timeline ... for news readers. Searching on the Internet, I've found some links:
This timeline feature is still under development. There is still room for us ^^.
Tuesday, 13 December 2011
Wednesday, 30 November 2011
Tuesday, 29 November 2011
Tuesday, 22 November 2011
Intro: These n-grams are based on the largest publicly available, genre-balanced corpus of English -- the 450-million-word Corpus of Contemporary American English (COCA). With this n-grams data (2-, 3-, 4-, and 5-word sequences, with their frequencies), you can carry out powerful queries offline -- without needing to access the corpus via the web interface.
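Once downloaded, frequency-listed n-grams like these can be queried offline with a few lines of code. A minimal sketch in Python, assuming a tab-separated layout with the frequency first and then the words (check the actual file format of the distribution you download):

```python
from collections import defaultdict

def load_ngrams(path):
    """Load an n-gram file assumed to be tab-separated: frequency, then words."""
    table = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            freq, words = int(parts[0]), tuple(parts[1:])
            table[words] += freq
    return table

def continuations(table, prefix):
    """Rank the words that follow a given (n-1)-word prefix, by frequency."""
    hits = [(words[-1], freq) for words, freq in table.items()
            if words[:-1] == tuple(prefix)]
    return sorted(hits, key=lambda x: -x[1])
```

For the full 450-million-word dataset you would want a more compact store (a trie, or an on-disk key-value database) rather than an in-memory dict, but the query pattern stays the same.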
Monday, 7 November 2011
Tuesday, 25 October 2011
AlchemyAPI is a suite of products from the Alchemy company for knowledge extraction from text. The name "Alchemy" may cause confusion with the Alchemy open-source AI system developed at the University of Washington.
It is quite interesting to see how language technologies are used in real applications.
Wednesday, 19 October 2011
Thursday, 13 October 2011
Tuesday, 4 October 2011
SENNA is software distributed under a non-commercial license, which outputs a host of Natural Language Processing (NLP) predictions: part-of-speech (POS) tags, chunking (CHK), named entity recognition (NER) and semantic role labeling (SRL).
Thursday, 29 September 2011
Exhibit lets you easily create web pages with advanced text search and filtering functionalities, with interactive maps, timelines, and other visualizations.
Monday, 26 September 2011
Wednesday, 21 September 2011
Tuesday, 13 September 2011
Thursday, 8 September 2011
That's great. I had intended to develop something similar for Vietnamese. Now I have an example to follow.
Sunday, 4 September 2011
Thursday, 1 September 2011
Wednesday, 31 August 2011
Wednesday, 3 August 2011
TERp is an automatic evaluation metric for Machine Translation, which takes as input a set of reference translations, and a set of machine translation output for that same data. It aligns the MT output to the reference translations, and measures the number of 'edits' needed to transform the MT output into the reference translation. TERp is an extension of TER (Translation Edit Rate) that utilizes phrasal substitutions (using automatically generated paraphrases), stemming, synonyms, relaxed shifting constraints and other improvements.
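The core of TER is a token-level edit distance normalized by reference length. A simplified sketch of that core (no shift operation, and none of TERp's paraphrase, stemming, or synonym matching, so this toy version is really closer to plain word error rate):

```python
def ter_core(hyp, ref):
    """Token-level edit distance (insert/delete/substitute) divided by
    reference length -- the core of TER, without the shift operation."""
    h, r = hyp.split(), ref.split()
    # d[i][j] = edits to turn the first i hypothesis tokens
    # into the first j reference tokens
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(h)][len(r)] / max(len(r), 1)
```

With multiple references, TER takes the minimum number of edits against any reference; TERp additionally allows near-free edits for paraphrases, stems, and synonyms.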
MANY is an MT system combination software whose architecture is described in the following picture:
The combination can be decomposed into three steps:
- 1-best hypotheses from all M systems are aligned in order to build M confusion networks (one for each system considered as backbone).
- All CNs are connected into a single lattice. The first nodes of each CN are connected to a unique first node with probabilities equal to the priors probabilities assigned to the corresponding backbone. The final nodes are connected to a single final node with arc probability of one.
- A token pass decoder is used along with a language model to decode the resulting lattice and the best hypothesis is generated.
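The three steps above can be illustrated with toy data structures. Here each confusion network is a list of slots (word to posterior probability), and the lattice is just a list of weighted arcs; this sketches only step 2 (connecting the CNs into one lattice) and is not MANY's actual implementation:

```python
def connect_lattice(confusion_networks, priors):
    """Connect M confusion networks into a single lattice.
    Node names are made unique per CN as (cn_index, position).
    Entry arcs carry the backbone prior; exit arcs carry probability 1."""
    arcs = []  # each arc: (from_node, to_node, word_or_None, probability)
    for m, cn in enumerate(confusion_networks):
        # arc from the unique first node, weighted by this backbone's prior
        arcs.append(("START", (m, 0), None, priors[m]))
        for t, slot in enumerate(cn):  # slot: dict word -> posterior
            for word, p in slot.items():
                arcs.append(((m, t), (m, t + 1), word, p))
        # arc into the unique final node with probability one
        arcs.append(((m, len(cn)), "END", None, 1.0))
    return arcs
```

Step 3 would then run a token-passing decoder with a language model over these arcs to pick the best path.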
1) Felipe Sánchez-Martínez. Choosing the best machine translation system to translate a sentence by using only source-language information. In Proceedings of the 15th Annual Conference of the European Association for Machine Translation, p. 97-104, May 30-31, 2011, Leuven, Belgium.
2) Víctor M. Sánchez-Cartagena, Felipe Sánchez-Martínez, Juan Antonio Pérez-Ortiz. Enriching a statistical machine translation system trained on small parallel corpora with rule-based bilingual phrases. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP 2011, p ?-?, September 12-14, 2011, Hissar, Bulgaria (forthcoming)
David McClosky, Mihai Surdeanu, and Christopher D. Manning. 2011. Event Extraction as Dependency Parsing. In Proceedings of the Association for Computational Linguistics - Human Language Technologies 2011 Conference (ACL-HLT 2011). [PDF]
Tuesday, 2 August 2011
Tuesday, 26 July 2011
(to be updated)
1) BLAST: http://www.ida.liu.se/~sarst/blast/
(Demo paper at ACL'2011: http://www.aclweb.org/anthology/P/P11/P11-4010.pdf)
Maja Popovic et al. Towards Automatic Error Analysis of Machine Translation Output. (Computational Linguistics 2011)
Mireia F. et al. Overcoming statistical machine translation limitations: error analysis and proposed solutions for the Catalan–Spanish language pair. LREC 2011.
S. Condon. Machine Translation Errors: English and Iraqi Arabic. TALIP 2011.
Maja Popović, Adrià de Gispert, Deepa Gupta, Patrik Lambert, Hermann Ney, José B. Mariño and Rafael Banchs. Morpho-syntactic Information for Automatic Error Analysis of Statistical Machine Translation Output. HLT/NAACL Workshop on Statistical Machine Translation, pages 1-6, New York, NY, June 2006.
Maja Popović and Hermann Ney. Error Analysis of Verb Inflections in Spanish Translation Output. TC-Star Workshop on Speech-to-Speech Translation, pages 99-103, Barcelona, Spain, June 2006.
David Vilar et al. Error Analysis of Statistical Machine Translation Output. LREC 2006.
Sunday, 24 July 2011
Wednesday, 20 July 2011
HeidelTime is a multilingual temporal tagger that extracts temporal expressions from documents and normalizes them according to the TIMEX3 annotation standard, which is part of the markup language TimeML (with focus on the "value" attribute). HeidelTime uses different normalization strategies depending on the domain of the documents that are to be processed (news or narratives). It is a rule-based system, and because the source code and the resources (patterns, normalization information, and rules) are strictly separated, one can easily develop resources for additional languages using HeidelTime's well-defined rule syntax.
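A toy illustration of the rule-based idea, with one hard-coded pattern for expressions like "December 29, 2011" normalized to a TIMEX3-style "value" (HeidelTime itself keeps such patterns and normalization data in separate, per-language resource files):

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"])}

# One example rule: "Month D, YYYY"
PATTERN = re.compile(r"\b(%s) (\d{1,2}), (\d{4})\b" % "|".join(MONTHS))

def tag_temporal(text):
    """Find 'Month D, YYYY' expressions and normalize each to a
    TIMEX3-style value of the form YYYY-MM-DD."""
    results = []
    for m in PATTERN.finditer(text):
        month, day, year = m.group(1), int(m.group(2)), int(m.group(3))
        value = "%04d-%02d-%02d" % (year, MONTHS[month], day)
        results.append((m.group(0), value))
    return results
```

Real temporal tagging also has to resolve underspecified expressions ("next Monday", "two weeks ago") relative to the document creation time, which is where the domain-dependent normalization strategies come in.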
Tuesday, 19 July 2011
Friday, 15 July 2011
Friday, 8 July 2011
Thursday, 7 July 2011
It's amazing. It's free!
Sunday, 26 June 2011
AlchemyAPI provides content owners and web developers with a rich suite of content analysis and meta-data annotation tools.
Expose the semantic richness hidden in any content, using named entity extraction, keyword extraction, sentiment analysis, document categorization, concept tagging, language detection, and structured content scraping. Use AlchemyAPI to enhance your website, blog, content management system, or semantic web application.
Amazon EC2’s simple web service interface allows you to obtain and configure capacity with minimal friction. It provides you with complete control of your computing resources and lets you run on Amazon’s proven computing environment. Amazon EC2 reduces the time required to obtain and boot new server instances to minutes, allowing you to quickly scale capacity, both up and down, as your computing requirements change. Amazon EC2 changes the economics of computing by allowing you to pay only for capacity that you actually use. Amazon EC2 provides developers the tools to build failure resilient applications and isolate themselves from common failure scenarios.
Saturday, 18 June 2011
Tuesday, 7 June 2011
Thursday, 2 June 2011
Elsevier SIGIR 2011 Application Challenge: http://developer.sciverse.com/SIGIR2011
+ Start date: June 6, 2011
+ End date: July 23, 2011
+ Judging starts: July 24, 2011
+ Judging ends: July 26, 2011
+ Announcement of the Winners: July 26, 2011
+ First prize: 1,500 USD (VISA gift card)
+ Second prize: 1,000 USD (VISA gift card)
+ Third prize: 500 USD (VISA gift card)
Wednesday, 1 June 2011
It is in multiple languages. Great!
Sunday, 29 May 2011
1) A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents
Yufan Guo, Anna Korhonen and Thierry Poibeau
2) Linear Text Segmentation Using Affinity Propagation
Anna Kazantseva and Stan Szpakowicz
3) Identifying Relations for Open Information Extraction
Anthony Fader, Stephen Soderland and Oren Etzioni
4) Active Learning with Amazon Mechanical Turk
Florian Laws, Christian Scheible and Hinrich Schütze
5) Extreme Extraction — Machine Reading in a Week
Marjorie Freedman, Lance Ramshaw, Elizabeth Boschee, Ryan Gabbard, Nicolas Ward and Ralph Weischedel
6) Discovering Relations between Noun Categories
Thahir Mohamed, Estevam Hruschka and Tom Mitchell
7) Bootstrapped Named Entity Recognition for Product Attribute Extraction
Duangmanee Putthividhya and Junling Hu
8) Predicting a Scientific Community’s Response to an Article
Dani Yogatama, Michael Heilman, Brendan O'Connor, Chris Dyer, Bryan R. Routledge and Noah A. Smith
9) Language Models for Machine Translation: Original vs. Translated Texts
Gennadi Lembersky, Noam Ordan and Shuly Wintner
10) Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter
Eiji Aramaki, Sachiko Maskawa and Mizuki Morita
11) Rumor has it: Identifying Misinformation in Microblogs
Vahed Qazvinian, Emily Rosengren, Dragomir R. Radev and Qiaozhu Mei
Another customized version: https://github.com/jhclark/salm
Saturday, 28 May 2011
This corpus was collected between Oct 2005 and Jan 2011, and covers 47,860 English-language, non-binary-file newsgroups:
+ Corpus size: over 30 billion words,
+ Data size: over 34 GB, compressed (delivered as weekly bundles of about 150 MB each.)
Thursday, 19 May 2011
(Extracting and Querying Relations in Scientific Papers on Language Technology)
Tuesday, 17 May 2011
Monday, 16 May 2011
hunalign aligns bilingual text on the sentence level. Its input is tokenized and sentence-segmented text in two languages. In the simplest case, its output is a sequence of bilingual sentence pairs (bisentences).
In the presence of a dictionary, hunalign uses it, combining this information with Gale-Church sentence-length information. In the absence of a dictionary, it first falls back to sentence-length information, and then builds an automatic dictionary based on this alignment. Then it realigns the text in a second pass, using the automatic dictionary.
Like most sentence aligners, hunalign does not deal with changes of sentence order: it is unable to come up with crossing alignments, i.e., segments A and B in one language corresponding to segments B’ A’ in the other language.
There is nothing Hungarian-specific in hunalign; the name simply reflects the fact that it is part of the hun* NLP toolchain.
hunalign was written in portable C++. It can be built under basically any kind of operating system.
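The length-based part of the idea can be sketched as a small dynamic program over alignment "beads" (1-1, 1-0, 0-1, 2-1, 1-2). This toy version uses a crude absolute length-difference cost instead of Gale-Church's Gaussian model of character-length ratios, and has none of hunalign's dictionary scoring:

```python
def align_by_length(src, tgt):
    """Minimal Gale-Church-style sentence aligner using only lengths."""
    INF = float("inf")
    n, m = len(src), len(tgt)
    ls = [len(s) for s in src]
    lt = [len(t) for t in tgt]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # (src sentences consumed, tgt sentences consumed, bead penalty)
    beads = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (2, 1, 2), (1, 2, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for di, dj, penalty in beads:
                if i + di <= n and j + dj <= m:
                    c = abs(sum(ls[i:i + di]) - sum(lt[j:j + dj])) + penalty
                    if cost[i][j] + c < cost[i + di][j + dj]:
                        cost[i + di][j + dj] = cost[i][j] + c
                        back[i + di][j + dj] = (di, dj)
    # trace back the best bead sequence
    pairs, i, j = [], n, m
    while i or j:
        di, dj = back[i][j]
        pairs.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    return pairs[::-1]
```

Note that the DP only ever moves forward in both texts, which is exactly why crossing alignments (A B vs. B' A') cannot be produced.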
"Welcome to YouAlign, your online document alignment solution. No software to purchase, no software to install. With YouAlign you can quickly and easily create bitexts from your archived documents. A YouAlign bitext contains a document and its translation aligned at the sentence level. YouAlign generates TMX files that can be loaded into your translation memory. YouAlign can also generate HTML files that you can publish on the Internet, or use with a full-text search engine to search for terminology and phraseology in context.
YouAlign is powered by the AlignFactory engine, which supports all kinds of formats, including Microsoft Word, Excel and PowerPoint, PDF, HTML, XML, Corel WordPerfect, RTF, Lotus WordPro and plain text."
Thursday, 12 May 2011
The corpus has most of the functionality of the other corpora from http://corpus.byu.edu (e.g. COCA, COHA, and our interface to the BNC), including: searching by part of speech, wildcards, and lemma (and thus advanced syntactic searches), synonyms, collocate searches, frequency by decade (tables listing each individual string, or charts for total frequency), comparisons of two historical periods (e.g. collocates of "women" or "music" in the 1800s and the 1900s), and more." (From Corpora-List)
Tuesday, 3 May 2011
My Interested Papers:
1) Summarizing the Differences in Multilingual News
Xiaojun Wan, Houping Jia
2) Multifaceted Toponym Recognition for Streaming News
Michael Lieberman, Hanan Samet
3) Toward Social Context Summarization For Web Documents
Zi Yang, Keke Cai, Jie Tang, Li Zhang, Zhong Su, Juanzi Li
4) Evolutionary Timeline Summarization: a Balanced Optimization Framework via Iterative Substitution
Rui Yan, Xiaojun Wan, Jahna Otterbacher, Liang Kong, Yan Zhang, Xiaoming Li
5) The Economics in Interactive Information Retrieval
6) Composite Hashing with Multiple Information Sources
Dan Zhang, Fei Wang, Luo Si
7) Inverted Indexes for Phrases and Strings
Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Vitter, Sabrina Chandrasekaran
8) Multimedia Answering: Enriching Text QA with Media Information
Liqiang Nie, Meng Wang, Zheng-Jun Zha, Guangda Li, Tat-Seng Chua
9) SCENE : A Scalable Two-Stage Personalized News Recommendation System
Lei Li, Dingding Wang, Tao Li
10) Ranking Related News Predictions
Nattiya Kanhabua, Roi Blanco, Michael Matthews
Friday, 29 April 2011
1) Only use basic I/O in
2) Use memory-mapped file mechanism.
+ Boost C++ memory-mapped file support: http://www.boost.org/doc/libs/1_38_0/libs/iostreams/doc/index.html
3) TBA (please let me know if you have others. Thanks!)
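For illustration, option 2 in Python: the mmap module exposes the same OS mechanism, letting you scan a large file without copying it through read() buffers. A minimal sketch that counts lines:

```python
import mmap

def count_lines_mmap(path):
    """Count newline bytes by scanning a memory-mapped file; the OS pages
    data in on demand instead of copying it into userspace buffers."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count, pos = 0, 0
            while True:
                idx = mm.find(b"\n", pos)
                if idx == -1:
                    break
                count += 1
                pos = idx + 1
            return count
```

(Note that mmap.mmap raises ValueError on an empty file, so guard for that case in real code.) The Boost.Iostreams mapped_file classes linked above provide the equivalent facility in C++.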
Thursday, 28 April 2011
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
Wednesday, 27 April 2011
...one of the most highly regarded and expertly designed C++ library projects in the world.— Herb Sutter and Andrei Alexandrescu, C++ Coding Standards
*** Installation Tips
- with b2
b2 address-model=32 --build-type=complete --stagedir=stage
b2 address-model=64 --build-type=complete --stagedir=stage_x64
(regex with ICU lib)
b2 -sICU_PATH=C:\icu4c-54_1-src\icu address-model=32 --with-regex --stagedir=stage
b2 -sICU_PATH=C:\icu4c-54_1-src\icu address-model=64 --with-regex --stagedir=stage_x64
(iostream with zlib)
b2 -sZLIB_SOURCE=C:\zlib128-dll\include address-model=32 --with-iostreams --stagedir=stage
b2 -sZLIB_SOURCE=C:\zlib128-dll\include address-model=64 --with-iostreams --stagedir=stage_x64
- with bjam
(for different versions of Microsoft Visual C++)
bjam --toolset=msvc-12.0 address-model=64 --build-type=complete stage
bjam --toolset=msvc-11.0 address-model=64 --build-type=complete stage
bjam --toolset=msvc-10.0 address-model=64 --build-type=complete stage
bjam --toolset=msvc-9.0 address-model=64 --build-type=complete stage
bjam --toolset=msvc-8.0 address-model=64 --build-type=complete stage
bjam --toolset=msvc-12.0 --build-type=complete stage
bjam --toolset=msvc-11.0 --build-type=complete stage
bjam --toolset=msvc-10.0 --build-type=complete stage
bjam --toolset=msvc-9.0 --build-type=complete stage
bjam --toolset=msvc-8.0 --build-type=complete stage
Monday, 18 April 2011
1) Hunspell (original version): http://hunspell.sourceforge.net/
Java source code for Hunspell: http://dren.dk/hunspell.html
2) Aspell: http://aspell.net/
Available dictionaries: ftp://ftp.gnu.org/gnu/aspell/dict/0index.html
3) IBM csSpell (Context-sensitive Spelling Checker): http://www.alphaworks.ibm.com/tech/csspell
2) Google Web N-gram
2a) Google Web N-gram Viewer: http://ngrams.googlelabs.com/
2b) Google Web N-gram Patterns: http://n-gram-patterns.sourceforge.net/
3) Microsoft Web N-gram: http://web-ngram.research.microsoft.com/info/
4) N-gram Statistics Package: http://ngram.sourceforge.net/
5) CMU Language Modeling Toolkit (version 2): http://www.speech.cs.cmu.edu/SLM/toolkit.html
1) TMX software: https://sourceforge.net/
2) R: www.r-project.org
With accompanying books:
3) Lexico3: http://www.tal.univ-paris3.fr/lexico/lexico3.htm (seemingly a commercial tool)
If you know others, please let me know!
Thursday, 7 April 2011
Tuesday, 5 April 2011
Sunday, 13 March 2011
Friday, 11 March 2011
Wednesday, 2 March 2011
1) David Talbot and Miles Osborne. Smoothed Bloom filter language models: Tera-Scale LMs on the Cheap. EMNLP, Prague, Czech Republic 2007.
2) David Talbot and Miles Osborne. Randomised Language Modelling for Statistical Machine Translation. ACL, Prague, Czech Republic 2007.
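The key idea in these papers is to store n-gram statistics in a Bloom filter, trading a small, controllable false-positive rate for a huge reduction in memory. A minimal membership-only sketch (the papers additionally encode quantized counts and handle smoothing, which this toy version omits):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over a bit array.
    Storing n-grams this way lets web-scale models fit in RAM,
    at the cost of occasional false positives (but no false negatives)."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # derive k independent positions by salting a cryptographic hash
        for i in range(self.k):
            h = hashlib.md5(("%d:%s" % (i, item)).encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

The false-positive rate is tuned by the bits-per-item and number of hash functions; Talbot and Osborne show how to fold this uncertainty into the language model scores.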
Wednesday, 23 February 2011
Monday, 14 February 2011
Wednesday, 19 January 2011
Saturday, 15 January 2011
Thursday, 13 January 2011
Wednesday, 12 January 2011
Sunday, 9 January 2011
NUS demo: http://wing.comp.nus.edu.sg/~linzihen/parser/demo.html
(to be continued!)