Thursday, 29 December 2011
Common Crawl Foundation is a California 501(c)(3) non-profit founded by Gil Elbaz with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible.
Saturday, 24 December 2011
Intro: WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that browses and processes Web pages automatically.
Monday, 19 December 2011
Sunday, 18 December 2011
Now I am thinking about timeline ... for news readers. Searching on the Internet, I've found some links:
This timeline feature is still under development. There is still room for us ^^.
Tuesday, 13 December 2011
Wednesday, 30 November 2011
Tuesday, 29 November 2011
Tuesday, 22 November 2011
Intro: These n-grams are based on the largest publicly available, genre-balanced corpus of English -- the 450-million-word Corpus of Contemporary American English (COCA). With this n-grams data (2-, 3-, 4-, and 5-word sequences, with their frequencies), you can carry out powerful queries offline -- without needing to access the corpus via the web interface.
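Once downloaded, frequency-listed n-grams like these can be queried offline with a few lines of code. A minimal sketch in Python, assuming a tab-separated layout with the frequency first and then the words (check the actual file format of the distribution you download):

```python
from collections import defaultdict

def load_ngrams(path):
    """Load an n-gram file assumed to be tab-separated: frequency, then words."""
    table = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            freq, words = int(parts[0]), tuple(parts[1:])
            table[words] += freq
    return table

def continuations(table, prefix):
    """Rank the words that follow a given (n-1)-word prefix, by frequency."""
    hits = [(words[-1], freq) for words, freq in table.items()
            if words[:-1] == tuple(prefix)]
    return sorted(hits, key=lambda x: -x[1])
```

For the full 450-million-word dataset you would want a more compact store (a trie, or an on-disk key-value database) rather than an in-memory dict, but the query pattern stays the same.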
Monday, 7 November 2011
Tuesday, 25 October 2011
AlchemyAPI is a suite of products from the Alchemy company for knowledge extraction from text. The name "Alchemy" may cause confusion with the Alchemy open-source AI system developed at the University of Washington.
It is quite interesting to see how language technologies are used in real applications.
Wednesday, 19 October 2011
Thursday, 13 October 2011
Tuesday, 4 October 2011
SENNA is software distributed under a non-commercial license, which outputs a host of Natural Language Processing (NLP) predictions: part-of-speech (POS) tags, chunking (CHK), named entity recognition (NER) and semantic role labeling (SRL).
Thursday, 29 September 2011
Exhibit lets you easily create web pages with advanced text search and filtering functionalities, with interactive maps, timelines, and other visualizations.
Monday, 26 September 2011
Wednesday, 21 September 2011
Tuesday, 13 September 2011
Thursday, 8 September 2011
That's great. I had intended to develop something similar for Vietnamese. Now I have an example to follow.
Sunday, 4 September 2011
Thursday, 1 September 2011
Wednesday, 31 August 2011
Wednesday, 3 August 2011
TERp is an automatic evaluation metric for Machine Translation, which takes as input a set of reference translations, and a set of machine translation output for that same data. It aligns the MT output to the reference translations, and measures the number of 'edits' needed to transform the MT output into the reference translation. TERp is an extension of TER (Translation Edit Rate) that utilizes phrasal substitutions (using automatically generated paraphrases), stemming, synonyms, relaxed shifting constraints and other improvements.
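The core of TER is a token-level edit distance normalized by reference length. A simplified sketch of that core (no shift operation, and none of TERp's paraphrase, stemming, or synonym matching, so this toy version is really closer to plain word error rate):

```python
def ter_core(hyp, ref):
    """Token-level edit distance (insert/delete/substitute) divided by
    reference length -- the core of TER, without the shift operation."""
    h, r = hyp.split(), ref.split()
    # d[i][j] = edits to turn the first i hypothesis tokens
    # into the first j reference tokens
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(h)][len(r)] / max(len(r), 1)
```

With multiple references, TER takes the minimum number of edits against any reference; TERp additionally allows near-free edits for paraphrases, stems, and synonyms.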
MANY is an MT system combination software whose architecture is described in the following picture:
The combination can be decomposed into three steps:
- 1-best hypotheses from all M systems are aligned in order to build M confusion networks (one for each system considered as backbone).
- All CNs are connected into a single lattice. The first nodes of each CN are connected to a unique first node with probabilities equal to the priors probabilities assigned to the corresponding backbone. The final nodes are connected to a single final node with arc probability of one.
- A token pass decoder is used along with a language model to decode the resulting lattice and the best hypothesis is generated.
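The three steps above can be illustrated with toy data structures. Here each confusion network is a list of slots (word to posterior probability), and the lattice is just a list of weighted arcs; this sketches only step 2 (connecting the CNs into one lattice) and is not MANY's actual implementation:

```python
def connect_lattice(confusion_networks, priors):
    """Connect M confusion networks into a single lattice.
    Node names are made unique per CN as (cn_index, position).
    Entry arcs carry the backbone prior; exit arcs carry probability 1."""
    arcs = []  # each arc: (from_node, to_node, word_or_None, probability)
    for m, cn in enumerate(confusion_networks):
        # arc from the unique first node, weighted by this backbone's prior
        arcs.append(("START", (m, 0), None, priors[m]))
        for t, slot in enumerate(cn):  # slot: dict word -> posterior
            for word, p in slot.items():
                arcs.append(((m, t), (m, t + 1), word, p))
        # arc into the unique final node with probability one
        arcs.append(((m, len(cn)), "END", None, 1.0))
    return arcs
```

Step 3 would then run a token-passing decoder with a language model over these arcs to pick the best path.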
1) Felipe Sánchez-Martínez. Choosing the best machine translation system to translate a sentence by using only source-language information. In Proceedings of the 15th Annual Conference of the European Association for Machine Translation, p. 97-104, May 30-31, 2011, Leuven, Belgium.
2) Víctor M. Sánchez-Cartagena, Felipe Sánchez-Martínez, Juan Antonio Pérez-Ortiz. Enriching a statistical machine translation system trained on small parallel corpora with rule-based bilingual phrases. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP 2011, p ?-?, September 12-14, 2011, Hissar, Bulgaria (forthcoming)
David McClosky, Mihai Surdeanu, and Christopher D. Manning. 2011. Event Extraction as Dependency Parsing. In Proceedings of the Association for Computational Linguistics - Human Language Technologies 2011 Conference (ACL-HLT 2011). [PDF]
Tuesday, 2 August 2011
Tuesday, 26 July 2011
(to be updated)
1) BLAST: http://www.ida.liu.se/~sarst/blast/
(Demo paper at ACL'2011: http://www.aclweb.org/anthology/P/P11/P11-4010.pdf)
Maja Popovic et al. Towards Automatic Error Analysis of Machine Translation Output. (Computational Linguistics 2011)
Mireia F. et al. Overcoming statistical machine translation limitations: error analysis and proposed solutions for the Catalan–Spanish language pair. LREC 2011.
S. Condon. Machine Translation Errors: English and Iraqi Arabic. TALIP 2011.
Maja Popović, Adrià de Gispert, Deepa Gupta, Patrik Lambert, Hermann Ney, José B. Mariño and Rafael Banchs. Morpho-syntactic Information for Automatic Error Analysis of Statistical Machine Translation Output. HLT/NAACL Workshop on Statistical Machine Translation, pages 1-6, New York, NY, June 2006.
Maja Popović and Hermann Ney. Error Analysis of Verb Inflections in Spanish Translation Output. TC-Star Workshop on Speech-to-Speech Translation, pages 99-103, Barcelona, Spain, June 2006.
David Vilar et al. Error Analysis of Statistical Machine Translation Output. LREC 2006.
Sunday, 24 July 2011
Wednesday, 20 July 2011
HeidelTime is a multilingual temporal tagger that extracts temporal expressions from documents and normalizes them according to the TIMEX3 annotation standard, which is part of the markup language TimeML (with focus on the "value" attribute). HeidelTime uses different normalization strategies depending on the domain of the documents that are to be processed (news or narratives). It is a rule-based system, and because the source code and the resources (patterns, normalization information, and rules) are strictly separated, one can easily develop resources for additional languages using HeidelTime's well-defined rule syntax.
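A toy illustration of the rule-based idea, with one hard-coded pattern for expressions like "December 29, 2011" normalized to a TIMEX3-style "value" (HeidelTime itself keeps such patterns and normalization data in separate, per-language resource files):

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"])}

# One example rule: "Month D, YYYY"
PATTERN = re.compile(r"\b(%s) (\d{1,2}), (\d{4})\b" % "|".join(MONTHS))

def tag_temporal(text):
    """Find 'Month D, YYYY' expressions and normalize each to a
    TIMEX3-style value of the form YYYY-MM-DD."""
    results = []
    for m in PATTERN.finditer(text):
        month, day, year = m.group(1), int(m.group(2)), int(m.group(3))
        value = "%04d-%02d-%02d" % (year, MONTHS[month], day)
        results.append((m.group(0), value))
    return results
```

Real temporal tagging also has to resolve underspecified expressions ("next Monday", "two weeks ago") relative to the document creation time, which is where the domain-dependent normalization strategies come in.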
Tuesday, 19 July 2011
Friday, 15 July 2011
Friday, 8 July 2011
Thursday, 7 July 2011
It's amazing. It's free!
Sunday, 26 June 2011
AlchemyAPI provides content owners and web developers with a rich suite of content analysis and meta-data annotation tools.
Expose the semantic richness hidden in any content, using named entity extraction, keyword extraction, sentiment analysis, document categorization, concept tagging, language detection, and structured content scraping. Use AlchemyAPI to enhance your website, blog, content management system, or semantic web application.
Amazon EC2’s simple web service interface allows you to obtain and configure capacity with minimal friction. It provides you with complete control of your computing resources and lets you run on Amazon’s proven computing environment. Amazon EC2 reduces the time required to obtain and boot new server instances to minutes, allowing you to quickly scale capacity, both up and down, as your computing requirements change. Amazon EC2 changes the economics of computing by allowing you to pay only for capacity that you actually use. Amazon EC2 provides developers the tools to build failure resilient applications and isolate themselves from common failure scenarios.
Saturday, 18 June 2011
Tuesday, 7 June 2011
Thursday, 2 June 2011
Elsevier SIGIR 2011 Application Challenge: http://developer.sciverse.com/SIGIR2011
+ Start date: June 6, 2011
+ End date: July 23, 2011
+ Judging starts: July 24, 2011
+ Judging ends: July 26, 2011
+ Announcement of the Winners: July 26, 2011
+ First prize: 1,500 USD (VISA gift card)
+ Second prize: 1,000 USD (VISA gift card)
+ Third prize: 500 USD (VISA gift card)
Wednesday, 1 June 2011
It is in multiple languages. Great!
Sunday, 29 May 2011
1) A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents
Yufan Guo, Anna Korhonen and Thierry Poibeau
2) Linear Text Segmentation Using Affinity Propagation
Anna Kazantseva and Stan Szpakowicz
3) Identifying Relations for Open Information Extraction
Anthony Fader, Stephen Soderland and Oren Etzioni
4) Active Learning with Amazon Mechanical Turk
Florian Laws, Christian Scheible and Hinrich Schütze
5) Extreme Extraction — Machine Reading in a Week
Marjorie Freedman, Lance Ramshaw, Elizabeth Boschee, Ryan Gabbard, Nicolas Ward and Ralph Weischedel
6) Discovering Relations between Noun Categories
Thahir Mohamed, Estevam Hruschka and Tom Mitchell
7) Bootstrapped Named Entity Recognition for Product Attribute Extraction
Duangmanee Putthividhya and Junling Hu
8) Predicting a Scientific Community’s Response to an Article
Dani Yogatama, Michael Heilman, Brendan O'Connor, Chris Dyer, Bryan R. Routledge and Noah A. Smith
9) Language Models for Machine Translation: Original vs. Translated Texts
Gennadi Lembersky, Noam Ordan and Shuly Wintner
10) Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter
Eiji Aramaki, Sachiko Maskawa and Mizuki Morita
11) Rumor has it: Identifying Misinformation in Microblogs
Vahed Qazvinian, Emily Rosengren, Dragomir R. Radev and Qiaozhu Mei
Another customized version: https://github.com/jhclark/salm
Saturday, 28 May 2011
This corpus was collected between Oct 2005 and Jan 2011, and covers 47,860 English-language, non-binary-file newsgroups:
+ Corpus size: over 30 billion words,
+ Data size: over 34 GB, compressed (delivered as weekly bundles of about 150 MB each.)
Thursday, 19 May 2011
(Extracting and Querying Relations in Scientific Papers on Language Technology)
Tuesday, 17 May 2011
Monday, 16 May 2011
hunalign aligns bilingual text on the sentence level. Its input is tokenized and sentence-segmented text in two languages. In the simplest case, its output is a sequence of bilingual sentence pairs (bisentences).
In the presence of a dictionary, hunalign uses it, combining this information with Gale-Church sentence-length information. In the absence of a dictionary, it first falls back to sentence-length information, and then builds an automatic dictionary based on this alignment. Then it realigns the text in a second pass, using the automatic dictionary.
Like most sentence aligners, hunalign does not deal with changes of sentence order: it is unable to come up with crossing alignments, i.e., segments A and B in one language corresponding to segments B’ A’ in the other language.
There is nothing Hungarian-specific in hunalign; the name simply reflects the fact that it is part of the hun* NLP toolchain.
hunalign was written in portable C++. It can be built under basically any kind of operating system.
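The length-based part of the idea can be sketched as a small dynamic program over alignment "beads" (1-1, 1-0, 0-1, 2-1, 1-2). This toy version uses a crude absolute length-difference cost instead of Gale-Church's Gaussian model of character-length ratios, and has none of hunalign's dictionary scoring:

```python
def align_by_length(src, tgt):
    """Minimal Gale-Church-style sentence aligner using only lengths."""
    INF = float("inf")
    n, m = len(src), len(tgt)
    ls = [len(s) for s in src]
    lt = [len(t) for t in tgt]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # (src sentences consumed, tgt sentences consumed, bead penalty)
    beads = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (2, 1, 2), (1, 2, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for di, dj, penalty in beads:
                if i + di <= n and j + dj <= m:
                    c = abs(sum(ls[i:i + di]) - sum(lt[j:j + dj])) + penalty
                    if cost[i][j] + c < cost[i + di][j + dj]:
                        cost[i + di][j + dj] = cost[i][j] + c
                        back[i + di][j + dj] = (di, dj)
    # trace back the best bead sequence
    pairs, i, j = [], n, m
    while i or j:
        di, dj = back[i][j]
        pairs.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    return pairs[::-1]
```

Note that the DP only ever moves forward in both texts, which is exactly why crossing alignments (A B vs. B' A') cannot be produced.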
"Welcome to YouAlign, your online document alignment solution. No software to purchase, no software to install. With YouAlign you can quickly and easily create bitexts from your archived documents. A YouAlign bitext contains a document and its translation aligned at the sentence level. YouAlign generates TMX files that can be loaded into your translation memory. YouAlign can also generate HTML files that you can publish on the Internet, or use with a full-text search engine to search for terminology and phraseology in context.
YouAlign is powered by the AlignFactory engine, which supports all kinds of formats, including Microsoft Word, Excel and PowerPoint, PDF, HTML, XML, Corel WordPerfect, RTF, Lotus WordPro and plain text."
Thursday, 12 May 2011
The corpus has most of the functionality of the other corpora from http://corpus.byu.edu (e.g. COCA, COHA, and our interface to the BNC), including: searching by part of speech, wildcards, and lemma (and thus advanced syntactic searches), synonyms, collocate searches, frequency by decade (tables listing each individual string, or charts for total frequency), comparisons of two historical periods (e.g. collocates of "women" or "music" in the 1800s and the 1900s), and more." (From Corpora-List)
Tuesday, 3 May 2011
My Interested Papers:
1) Summarizing the Differences in Multilingual News
Xiaojun Wan, Houping Jia
2) Multifaceted Toponym Recognition for Streaming News
Michael Lieberman, Hanan Samet
3) Toward Social Context Summarization For Web Documents
Zi Yang, Keke Cai, Jie Tang, Li Zhang, Zhong Su, Juanzi Li
4) Evolutionary Timeline Summarization: a Balanced Optimization Framework via Iterative Substitution
Rui Yan, Xiaojun Wan, Jahna Otterbacher, Liang Kong, Yan Zhang, Xiaoming Li
5) The Economics in Interactive Information Retrieval
6) Composite Hashing with Multiple Information Sources
Dan Zhang, Fei Wang, Luo Si
7) Inverted Indexes for Phrases and Strings
Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Vitter, Sabrina Chandrasekaran
8) Multimedia Answering: Enriching Text QA with Media Information
Liqiang Nie, Meng Wang, Zheng-Jun Zha, Guangda Li, Tat-Seng Chua
9) SCENE : A Scalable Two-Stage Personalized News Recommendation System
Lei Li, Dingding Wang, Tao Li
10) Ranking Related News Predictions
Nattiya Kanhabua, Roi Blanco, Michael Matthews
Friday, 29 April 2011
1) Only use basic I/O in
2) Use memory-mapped file mechanism.
+ Boost C++ memory-mapped file support: http://www.boost.org/doc/libs/1_38_0/libs/iostreams/doc/index.html
3) TBA (please let me know if you have others. Thanks!)
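For illustration, option 2 in Python: the mmap module exposes the same OS mechanism, letting you scan a large file without copying it through read() buffers. A minimal sketch that counts lines:

```python
import mmap

def count_lines_mmap(path):
    """Count newline bytes by scanning a memory-mapped file; the OS pages
    data in on demand instead of copying it into userspace buffers."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count, pos = 0, 0
            while True:
                idx = mm.find(b"\n", pos)
                if idx == -1:
                    break
                count += 1
                pos = idx + 1
            return count
```

(Note that mmap.mmap raises ValueError on an empty file, so guard for that case in real code.) The Boost.Iostreams mapped_file classes linked above provide the equivalent facility in C++.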
Thursday, 28 April 2011
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
Wednesday, 27 April 2011
...one of the most highly regarded and expertly designed C++ library projects in the world.— Herb Sutter and Andrei Alexandrescu, C++ Coding Standards
*** Installation Tips
- with b2
b2 address-model=32 --build-type=complete --stagedir=stage
b2 address-model=64 --build-type=complete --stagedir=stage_x64
(regex with ICU lib)
b2 -sICU_PATH=C:\icu4c-54_1-src\icu address-model=32 --with-regex --stagedir=stage
b2 -sICU_PATH=C:\icu4c-54_1-src\icu address-model=64 --with-regex --stagedir=stage_x64
(iostream with zlib)
b2 -sZLIB_SOURCE=C:\zlib128-dll\include address-model=32 --with-iostreams --stagedir=stage
b2 -sZLIB_SOURCE=C:\zlib128-dll\include address-model=64 --with-iostreams --stagedir=stage_x64
- with bjam
(for different versions of Microsoft Visual C++)
bjam --toolset=msvc-12.0 address-model=64 --build-type=complete stage
bjam --toolset=msvc-11.0 address-model=64 --build-type=complete stage
bjam --toolset=msvc-10.0 address-model=64 --build-type=complete stage
bjam --toolset=msvc-9.0 address-model=64 --build-type=complete stage
bjam --toolset=msvc-8.0 address-model=64 --build-type=complete stage
bjam --toolset=msvc-12.0 --build-type=complete stage
bjam --toolset=msvc-11.0 --build-type=complete stage
bjam --toolset=msvc-10.0 --build-type=complete stage
bjam --toolset=msvc-9.0 --build-type=complete stage
bjam --toolset=msvc-8.0 --build-type=complete stage
Monday, 18 April 2011
1) Hunspell (original version): http://hunspell.sourceforge.net/
Java source code for Hunspell: http://dren.dk/hunspell.html
2) Aspell: http://aspell.net/
Available dictionaries: ftp://ftp.gnu.org/gnu/aspell/dict/0index.html
3) IBM csSpell (Context-sensitive Spelling Checker): http://www.alphaworks.ibm.com/tech/csspell
2) Google Web N-gram
2a) Google Web N-gram Viewer: http://ngrams.googlelabs.com/
2b) Google Web N-gram Patterns: http://n-gram-patterns.sourceforge.net/
3) Microsoft Web N-gram: http://web-ngram.research.microsoft.com/info/
4) N-gram Statistics Package: http://ngram.sourceforge.net/
5) CMU Language Modeling Toolkit (version 2): http://www.speech.cs.cmu.edu/SLM/toolkit.html
1) TMX software: https://sourceforge.net/
2) R: www.r-project.org
With accompanying books:
3) Lexico3: http://www.tal.univ-paris3.fr/lexico/lexico3.htm (seemingly a commercial tool)
If you know others, please let me know!
Thursday, 7 April 2011
Tuesday, 5 April 2011
Sunday, 13 March 2011
Friday, 11 March 2011
Wednesday, 2 March 2011
1) David Talbot and Miles Osborne. Smoothed Bloom filter language models: Tera-Scale LMs on the Cheap. EMNLP, Prague, Czech Republic 2007.
2) David Talbot and Miles Osborne. Randomised Language Modelling for Statistical Machine Translation. ACL, Prague, Czech Republic 2007.
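The key idea in these papers is to store n-gram statistics in a Bloom filter, trading a small, controllable false-positive rate for a huge reduction in memory. A minimal membership-only sketch (the papers additionally encode quantized counts and handle smoothing, which this toy version omits):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over a bit array.
    Storing n-grams this way lets web-scale models fit in RAM,
    at the cost of occasional false positives (but no false negatives)."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # derive k independent positions by salting a cryptographic hash
        for i in range(self.k):
            h = hashlib.md5(("%d:%s" % (i, item)).encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

The false-positive rate is tuned by the bits-per-item and number of hash functions; Talbot and Osborne show how to fold this uncertainty into the language model scores.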
Wednesday, 23 February 2011
Monday, 14 February 2011
Wednesday, 19 January 2011
Saturday, 15 January 2011
Thursday, 13 January 2011
Wednesday, 12 January 2011
Sunday, 9 January 2011
NUS demo: http://wing.comp.nus.edu.sg/~linzihen/parser/demo.html
(to be continued!)