Thursday, 29 December 2011
Common Crawl
Common Crawl Foundation is a California 501(c)(3) non-profit founded by Gil Elbaz with the goal of democratizing access to web information by producing and maintaining an open, universally accessible repository of web crawl data.
Saturday, 24 December 2011
WebSPHINX
Intro: WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that browses and processes Web pages automatically.
Monday, 19 December 2011
Adobe AIR 3
Intro: The Adobe® AIR® runtime enables developers to deploy standalone applications built with HTML, JavaScript, ActionScript®, Flex, Adobe Flash® Professional, and Adobe Flash Builder® across platforms and devices — including Android™, BlackBerry®, iOS devices, personal computers, and televisions.
Sunday, 18 December 2011
Timeline for news readers
Now I am thinking about timeline ... for news readers. Searching on the Internet, I've found some links:
- http://html5.labs.ap.org/
- http://feeds.allofme.com/RSS_Timeline.html?target=http://www.life.com/rss/news
- http://www.labnol.org/internet/google-news-time-as-rss-reader/9089/
Timeline features like these are already being developed, but there is still room for us ^^.
Tuesday, 13 December 2011
Wednesday, 30 November 2011
Mono - cross platform, open source .NET development framework
Tuesday, 29 November 2011
Tuesday, 22 November 2011
N-GRAMS from the COCA and COHA corpora of American English
Intro: These n-grams are based on the largest publicly available, genre-balanced corpus of English -- the 450-million-word Corpus of Contemporary American English (COCA). With this n-gram data (2-, 3-, 4-, and 5-word sequences, with their frequencies), you can carry out powerful queries offline -- without needing to access the corpus via the web interface.
Monday, 7 November 2011
PiCloud
Tuesday, 25 October 2011
Visual News Readers
Pulse: http://www.makeuseof.com/tag/pulse-free-visual-display-rss-news-reader-ipad/
AlchemyAPI - Transforming Text into Knowledge
AlchemyAPI is a suite of products from the Alchemy company for extracting knowledge from text. The name "Alchemy" may cause confusion with the Alchemy open-source AI system developed at the University of Washington.
It is quite interesting to see how language technologies are used in real applications.
--
Cheers,
Vu
Wednesday, 19 October 2011
Thursday, 13 October 2011
Tuesday, 4 October 2011
Senna - NLP toolbox from NEC-Labs
SENNA is software distributed under a non-commercial license that outputs a host of Natural Language Processing (NLP) predictions: part-of-speech (POS) tags, chunking (CHK), named entity recognition (NER), and semantic role labeling (SRL).
Thursday, 29 September 2011
Exhibit - Publishing Framework for Data-Rich Interactive Web Pages
Exhibit lets you easily create web pages with advanced text search and filtering functionalities, with interactive maps, timelines, and other visualizations.
Monday, 26 September 2011
Microsoft WebMatrix
Tbot - Translation Buddy for Windows Live Messenger
Wednesday, 21 September 2011
Tuesday, 13 September 2011
Thursday, 8 September 2011
EMM NewsExplorer
More: http://emm.newsbrief.eu/overview.html
That's great. I intended to develop something similar for Vietnamese. Now I have one to follow.
--
Cheers,
Vu
Sunday, 4 September 2011
Thursday, 1 September 2011
Bilingual Sentence Aligner
Wednesday, 31 August 2011
Wednesday, 3 August 2011
TER-Plus (TERp)
Intro
TERp is an automatic evaluation metric for machine translation. It takes as input a set of reference translations and machine translation output for the same data, aligns the MT output to the references, and measures the number of 'edits' needed to transform the MT output into the reference translation. TERp is an extension of TER (Translation Edit Rate) that adds phrasal substitutions (using automatically generated paraphrases), stemming, synonyms, relaxed shifting constraints, and other improvements.
Open Source Machine Translation System Combination
MANY is MT system combination software whose architecture is described in the picture below:
The combination can be decomposed into three steps:
- The 1-best hypotheses from all M systems are aligned in order to build M confusion networks (one for each system considered as the backbone).
- All CNs are connected into a single lattice. The first nodes of each CN are connected to a unique first node, with probabilities equal to the prior probabilities assigned to the corresponding backbone. The final nodes are connected to a single final node with an arc probability of one.
- A token-pass decoder is used along with a language model to decode the resulting lattice, and the best hypothesis is generated.
--
Cheers,
Vu
System Combination for Machine Translation
1) Felipe Sánchez-Martínez. Choosing the best machine translation system to translate a sentence by using only source-language information. In Proceedings of the 15th Annual Conference of the European Association for Machine Translation, p. 97-104, May 30-31, 2011, Leuven, Belgium.
2) Víctor M. Sánchez-Cartagena, Felipe Sánchez-Martínez, Juan Antonio Pérez-Ortiz. Enriching a statistical machine translation system trained on small parallel corpora with rule-based bilingual phrases. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP 2011, p ?-?, September 12-14, 2011, Hissar, Bulgaria (forthcoming)
--
Cheers,
Vu
Open Toolkit for Automatic MT (Meta-) Evaluation
Hybrid Example-based and Statistical MT System
Stanford Biomedical Event Parser
David McClosky, Mihai Surdeanu, and Christopher D. Manning. 2011. Event Extraction as Dependency Parsing. In Proceedings of the Association for Computational Linguistics - Human Language Technologies 2011 Conference (ACL-HLT 2011). [PDF]
Tuesday, 2 August 2011
Document Translation Retrieval
http://code.google.com/p/doctrans/downloads/list
Tuesday, 26 July 2011
Bootstrapping for Named Entity Extraction & Recognition
1)
2)
(to be updated)
Error Analysis for Machine Translation Output
1) BLAST: http://www.ida.liu.se/~sarst/blast/
(Demo paper at ACL'2011: http://www.aclweb.org/anthology/P/P11/P11-4010.pdf)
2)
Maja Popovic et al. Towards Automatic Error Analysis of Machine Translation Output. (Computational Linguistics 2011)
3)
Mireia F. et al. Overcoming statistical machine translation limitations: error analysis and proposed solutions for the Catalan–Spanish language pair. LREC 2011.
4)
S. Condon. Machine Translation Errors: English and Iraqi Arabic. TALIP 2011.
5)
Maja Popović, Adrià de Gispert, Deepa Gupta, Patrik Lambert, Hermann Ney, José B. Mariño and Rafael Banchs. Morpho-syntactic Information for Automatic Error Analysis of Statistical Machine Translation Output. HLT/NAACL Workshop on Statistical Machine Translation, pages 1-6, New York, NY, June 2006.
Maja Popović and Hermann Ney. Error Analysis of Verb Inflections in Spanish Translation Output. TC-Star Workshop on Speech-to-Speech Translation, pages 99-103, Barcelona, Spain, June 2006.
David Vilar et al. Error Analysis of Statistical Machine Translation Output. LREC 2006.
Sunday, 24 July 2011
Interesting Articles in the Computational Linguistics Journal (Vol. 35, Issue 1)
Wednesday, 20 July 2011
HeidelTime - Temporal Tagger
Intro
HeidelTime is a multilingual temporal tagger that extracts temporal expressions from documents and normalizes them according to the TIMEX3 annotation standard, which is part of the markup language TimeML (with a focus on the "value" attribute). HeidelTime uses different normalization strategies depending on the domain of the documents to be processed (news or narratives). It is a rule-based system, and because its source code and resources (patterns, normalization information, and rules) are strictly separated, one can develop resources for additional languages using HeidelTime's well-defined rule syntax.
Xtractor
Tuesday, 19 July 2011
Friday, 15 July 2011
Friday, 8 July 2011
Thursday, 7 July 2011
OPUS - The Open Parallel Corpus
Intro
It's amazing. It's free!
--
Cheers,
Vu
Sunday, 26 June 2011
AlchemyAPI - transforming text into knowledge
AlchemyAPI provides content owners and web developers with a rich suite of content analysis and meta-data annotation tools.
Expose the semantic richness hidden in any content, using named entity extraction, keyword extraction, sentiment analysis, document categorization, concept tagging, language detection, and structured content scraping. Use AlchemyAPI to enhance your website, blog, content management system, or semantic web application.
Amazon Elastic Compute Cloud (Amazon EC2)
Intro
Amazon EC2’s simple web service interface allows you to obtain and configure capacity with minimal friction. It provides you with complete control of your computing resources and lets you run on Amazon’s proven computing environment. Amazon EC2 reduces the time required to obtain and boot new server instances to minutes, allowing you to quickly scale capacity, both up and down, as your computing requirements change. Amazon EC2 changes the economics of computing by allowing you to pay only for capacity that you actually use. Amazon EC2 provides developers the tools to build failure resilient applications and isolate themselves from common failure scenarios.
Saturday, 18 June 2011
Detection of Errors and Correction in Corpus Annotation
Tuesday, 7 June 2011
Thursday, 2 June 2011
SciVerse
http://www.applications.sciverse.com/action/gallery
Elsevier SIGIR 2011 Application Challenge: http://developer.sciverse.com/SIGIR2011
Important Dates
+ Startdate: June 6, 2011
+ Enddate: July 23, 2011
+ Judging starts: July 24, 2011
+ Judging ends: July 26, 2011
+ Announcement of the Winners: July 26, 2011
Prizes
+ First prize: 1,500 USD (VISA gift card)
+ Second prize: 1,000 USD (VISA gift card)
+ Third prize: 500 USD (VISA gift card)
Wednesday, 1 June 2011
Free Online File Converter
Fantastic UI. Great functionality for converting PDF to various file types (e.g. office formats, HTML, ...).
Best of all, it's free!
--
Cheers,
Vu
Topic Directory (~590K available categories so far)
It is in multiple languages. Great!
--
Cheers,
Vu
Sunday, 29 May 2011
Interesting Papers at EMNLP 2011
1) A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents
Yufan Guo, Anna Korhonen and Thierry Poibeau
2) Linear Text Segmentation Using Affinity Propagation
Anna Kazantseva and Stan Szpakowicz
3) Identifying Relations for Open Information Extraction
Anthony Fader, Stephen Soderland and Oren Etzioni
4) Active Learning with Amazon Mechanical Turk
Florian Laws, Christian Scheible and Hinrich Schütze
5) Extreme Extraction — Machine Reading in a Week
Marjorie Freedman, Lance Ramshaw, Elizabeth Boschee, Ryan Gabbard, Nicolas Ward and Ralph Weischedel
6) Discovering Relations between Noun Categories
Thahir Mohamed, Estevam Hruschka and Tom Mitchell
7) Bootstrapped Named Entity Recognition for Product Attribute Extraction
Duangmanee Putthividhya and Junling Hu
8) Predicting a Scientific Community’s Response to an Article
Dani Yogatama, Michael Heilman, Brendan O'Connor, Chris Dyer, Bryan R. Routledge and Noah A. Smith
9) Language Models for Machine Translation: Original vs. Translated Texts
Gennadi Lembersky, Noam Ordan and Shuly Wintner
10) Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter
Eiji Aramaki, Sachiko Maskawa and Mizuki Morita
11) Rumor has it: Identifying Misinformation in Microblogs
Vahed Qazvinian, Emily Rosengren, Dragomir R. Radev and Qiaozhu Mei
SALM: Suffix Array and its Applications in Empirical Language Processing
Another customized version: https://github.com/jhclark/salm
Saturday, 28 May 2011
A USENET corpus (2005-2010)
This corpus was collected between Oct 2005 and Jan 2011, and covers 47,860 English-language, non-binary newsgroups:
+ Corpus size: over 30 billion words,
+ Data size: over 34 GB compressed (delivered as weekly bundles of about 150 MB each).
Thursday, 19 May 2011
Mining scientific texts
1) http://www.lrec-conf.org/proceedings/lrec2008/pdf/773_paper.pdf
(Extracting and Querying Relations in Scientific Papers on Language Technology)
2)
Tuesday, 17 May 2011
Monday, 16 May 2011
hunalign – sentence aligner
Intro
hunalign aligns bilingual text on the sentence level. Its input is tokenized and sentence-segmented text in two languages. In the simplest case, its output is a sequence of bilingual sentence pairs (bisentences).
In the presence of a dictionary, hunalign uses it, combining this information with Gale-Church sentence-length information. In the absence of a dictionary, it first falls back to sentence-length information, and then builds an automatic dictionary based on this alignment. Then it realigns the text in a second pass, using the automatic dictionary.
Like most sentence aligners, hunalign does not deal with changes of sentence order: it is unable to come up with crossing alignments, i.e., segments A and B in one language corresponding to segments B’ A’ in the other language.
There is nothing Hungarian-specific in hunalign; the name simply reflects the fact that it is part of the hun* NLP toolchain.
hunalign is written in portable C++ and can be built on virtually any operating system.
YouAlign - Online document alignment solution
"Welcome to YouAlign, your online document alignment solution. No software to purchase, no software to install. With YouAlign you can quickly and easily create bitexts from your archived documents. A YouAlign bitext contains a document and its translation aligned at the sentence level. YouAlign generates TMX files that can be loaded into your translation memory. YouAlign can also generate HTML files that you can publish on the Internet, or use with a full-text search engine to search for terminology and phraseology in context.
YouAlign is powered by the AlignFactory engine, which supports all kinds of formats, including Microsoft Word, Excel and PowerPoint, PDF, HTML, XML, Corel WordPerfect, RTF, Lotus WordPro and plain text."
Thursday, 12 May 2011
Google Books Corpus
"The corpus has most of the functionality of the other corpora from http://corpus.byu.edu (e.g. COCA, COHA, and our interface to the BNC), including: searching by part of speech, wildcards, and lemma (and thus advanced syntactic searches), synonyms, collocate searches, frequency by decade (tables listing each individual string, or charts for total frequency), comparisons of two historical periods (e.g. collocates of "women" or "music" in the 1800s and the 1900s), and more." (From Corpora-List)
Tuesday, 3 May 2011
Interesting Papers at SIGIR 2011
1) Summarizing the Differences in Multilingual News
Xiaojun Wan, Houping Jia
2) Multifaceted Toponym Recognition for Streaming News
Michael Lieberman, Hanan Samet
3) Toward Social Context Summarization For Web Documents
Zi Yang, Keke Cai, Jie Tang, Li Zhang, Zhong Su, Juanzi Li
4) Evolutionary Timeline Summarization: a Balanced Optimization Framework via Iterative Substitution
Rui Yan, Xiaojun Wan, Jahna Otterbacher, Liang Kong, Yan Zhang, Xiaoming Li
5) The Economics in Interactive Information Retrieval
Leif Azzopardi
6) Composite Hashing with Multiple Information Sources
Dan Zhang, Fei Wang, Luo Si
7) Inverted Indexes for Phrases and Strings
Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Vitter, Sabrina Chandrasekaran
8) Multimedia Answering: Enriching Text QA with Media Information
Liqiang Nie, Meng Wang, Zheng-Jun Zha, Guangda Li, Tat-Seng Chua
9) SCENE: A Scalable Two-Stage Personalized News Recommendation System
Lei Li, Dingding Wang, Tao Li
10) Ranking Related News Predictions
Nattiya Kanhabua, Roi Blanco, Michael Matthews
--
Cheers,
Vu
Friday, 29 April 2011
[C++] - how to deal with very large files
1) Use only basic (buffered) I/O, processing the file in chunks.
2) Use memory-mapped file mechanism.
Possible links:
+ Boost C++ memory-mapped file support: http://www.boost.org/doc/libs/1_38_0/libs/iostreams/doc/index.html
+ http://codingplayground.blogspot.com/2009/03/memory-mapped-files-in-boost-and-c.html#comment-form
3) TBA (please let me know if you have others. Thanks!)
--
Cheers,
Vu
Thursday, 28 April 2011
Boilerpipe - Boilerplate Removal and Fulltext Extraction from HTML pages
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
Wednesday, 27 April 2011
Q&A for professional and enthusiast programmers
Boost C++ Library
"...one of the most highly regarded and expertly designed C++ library projects in the world." -- Herb Sutter and Andrei Alexandrescu, C++ Coding Standards
*** Installation Tips
- with b2
b2 address-model=32 --build-type=complete --stagedir=stage
b2 address-model=64 --build-type=complete --stagedir=stage_x64
(regex with ICU lib)
b2 -sICU_PATH=C:\icu4c-54_1-src\icu address-model=32 --with-regex --stagedir=stage
b2 -sICU_PATH=C:\icu4c-54_1-src\icu address-model=64 --with-regex --stagedir=stage_x64
(iostream with zlib)
b2 -sZLIB_SOURCE=C:\zlib128-dll\include address-model=32 --with-iostreams --stagedir=stage
b2 -sZLIB_SOURCE=C:\zlib128-dll\include address-model=64 --with-iostreams --stagedir=stage_x64
- with bjam
(for different versions of Microsoft Visual C++)
bjam --toolset=msvc-12.0 address-model=64 --build-type=complete stage
bjam --toolset=msvc-11.0 address-model=64 --build-type=complete stage
bjam --toolset=msvc-10.0 address-model=64 --build-type=complete stage
bjam --toolset=msvc-9.0 address-model=64 --build-type=complete stage
bjam --toolset=msvc-8.0 address-model=64 --build-type=complete stage
bjam --toolset=msvc-12.0 --build-type=complete stage
bjam --toolset=msvc-11.0 --build-type=complete stage
bjam --toolset=msvc-10.0 --build-type=complete stage
bjam --toolset=msvc-9.0 --build-type=complete stage
bjam --toolset=msvc-8.0 --build-type=complete stage
--
Monday, 18 April 2011
Tools for Vietnamese Spell Checking
1) Hunspell
Original version: http://hunspell.sourceforge.net/
Java source code for Hunspell: http://dren.dk/hunspell.html
2) Aspell: http://aspell.net/
Available dictionaries: ftp://ftp.gnu.org/gnu/aspell/dict/0index.html
3) IBM csSpell (Context-sensitive Spelling Checker): http://www.alphaworks.ibm.com/tech/csspell
4) TBA
--
Cheers,
Vu
N-gram tools
2) Google Web N-gram
2a) Google Web N-gram Viewer: http://ngrams.googlelabs.com/
2b) Google Web N-gram Patterns: http://n-gram-patterns.sourceforge.net/
3) Microsoft Web N-gram: http://web-ngram.research.microsoft.com/info/
4) N-gram Statistics Package: http://ngram.sourceforge.net/
5) CMU Language Modeling Toolkit (version 2): http://www.speech.cs.cmu.edu/SLM/toolkit.html
Tools for corpus statistics
1) TMX software: https://sourceforge.net/
2) R: www.r-project.org
With accompanying books:
http://www.amazon.com/dp/
http://www.amazon.com/dp/
3) Lexico3: http://www.tal.univ-paris3.fr/lexico/lexico3.htm (seemingly a commercial tool)
4) TBA
If you know others, please let me know!
--
Cheers,
Vu
Thursday, 7 April 2011
C++ STL with UTF-8
http://utfcpp.sourceforge.net/
https://sourceforge.net/projects/utfcpp
Conversion Tool
http://www.gnu.org/software/libiconv/
Articles
http://www.codeproject.com/KB/stl/upgradingstlappstounicode.aspx
http://www.codeproject.com/KB/stl/utf8facet.aspx
http://www.cplusplus.com/forum/beginner/7233/
--
Cheers,
Vu
Tuesday, 5 April 2011
Sunday, 13 March 2011
Friday, 11 March 2011
Summarizing content across websites on the Internet
http://www.iresearch-reporter.com/
http://ultimate-research-assistant.com/
--
Cheers,
Vu
Wednesday, 2 March 2011
Vietnamese Language Processing
It's amazing! But I've not checked their quality yet. Will do ASAP.
RandLM - the randomised language modelling toolkit
Reference:
1) David Talbot and Miles Osborne. Smoothed Bloom filter language models: Tera-Scale LMs on the Cheap. EMNLP, Prague, Czech Republic 2007.
2) David Talbot and Miles Osborne. Randomised Language Modelling for Statistical Machine Translation. ACL, Prague, Czech Republic 2007.
Wednesday, 23 February 2011
Monday, 14 February 2011
Precision Translation Tools
Wednesday, 19 January 2011
NLP News
This site aggregates news in Natural Language Processing (NLP) from various sources (websites, blogs) around the world. It's great!
--
Vu
Saturday, 15 January 2011
OpenCog - The Open Cognition Project
RelEx Dependency Relationship Extractor
Thursday, 13 January 2011
Wednesday, 12 January 2011
Training SMT incrementally
Sunday, 9 January 2011
Discourse Parsing
SPADE: http://www.isi.edu/licensed-sw/spade/
Text level:
HILDA: http://nlp.prendingerlab.net/hilda/
NUS demo: http://wing.comp.nus.edu.sg/~linzihen/parser/demo.html
(to be continued!)