C++ String Toolkit (StrTk) Tokenizer
Link: http://www.codeproject.com/Articles/23198/C-String-Toolkit-StrTk-Tokenizer
Intro: StrTk is a general-purpose C++ string-processing library; it can be used as a lightweight alternative to Boost for tokenizing and parsing.
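For a quick feel of the API, here is a minimal tokenizing sketch (my own example, assuming the single-header strtk.hpp is on the include path and a C++11 compiler):

// Split a delimited string into tokens with StrTk.
// Assumes strtk.hpp (single header) is on the include path.
#include <iostream>
#include <string>
#include <vector>
#include "strtk.hpp"

int main()
{
    const std::string line = "token1,token2|token3;token4";
    std::vector<std::string> tokens;

    // Split on any of the listed delimiter characters.
    strtk::parse(line, ",|;", tokens);

    for (const std::string& t : tokens)
        std::cout << t << "\n";
    return 0;
}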
Monday, 5 November 2012
MultEval
Link: https://github.com/jhclark/multeval
Intro: MultEval takes machine translation hypotheses from several runs of an optimizer and provides three popular metric scores, as well as standard deviations (via bootstrap resampling) and p-values (via approximate randomization). This allows researchers to mitigate some of the risk of using unstable optimizers such as MERT, MIRA, and MCMC. It is intended to help in evaluating the impact of in-house experimental variations on translation quality; it is currently not set up to do bake-off style comparisons (bake-offs can't require multiple optimizer runs or a standard tokenization).
Related: http://www.ark.cs.cmu.edu/MT/ (Code for Statistical Significance Testing for MT Evaluation Metrics)
Saturday, 3 November 2012
MacPorts
Link: http://www.macports.org/index.php
Intro: The MacPorts Project is an open-source community initiative to design an easy-to-use system for compiling, installing, and upgrading either command-line, X11 or Aqua based open-source software on the Mac OS X operating system. To that end we provide the command-line driven MacPorts software package under a BSD License, and through it easy access to thousands of ports that greatly simplify the task of compiling and installing open-source software on your Mac.
Tuesday, 16 October 2012
TalkBank
Link: http://talkbank.org/
Intro: The goal of TalkBank is to foster fundamental research in the study of human and animal communication. It will construct sample databases within each of the subfields studying communication. It will use these databases to advance the development of standards and tools for creating, sharing, searching, and commenting upon primary materials via networked computers.
Labels:
child language,
data,
language learning,
NLP,
research,
speech
Tuesday, 2 October 2012
OpenMobster - Open Source Mobile Enterprise Backend
Link: https://code.google.com/p/openmobster/
Intro:
- OpenMobster is an open-source Enterprise Backend for Mobile Apps, or
- OpenMobster is an open-source Mobile Backend as a Service that can be deployed privately (on-premise) within your Enterprise, or
- OpenMobster is an open-source MEAP (Mobile Enterprise Application Platform).
Labels:
android,
backend,
iOS,
links,
mobile programming,
open source,
platform
Open-source implementation of Boostexter
Link: http://code.google.com/p/icsiboost/
Intro: Boosting is a meta-learning approach that aims at combining an ensemble of weak classifiers to form a strong classifier. Adaptive Boosting (AdaBoost) greedily searches for a linear combination of classifiers by overweighting the examples that are misclassified by each classifier. icsiboost implements AdaBoost over stumps (one-level decision trees) on discrete and continuous attributes (words and real values).
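To make the reweighting idea concrete, here is a toy C++ sketch of a single AdaBoost round over a generic weak classifier (my own illustration, not icsiboost's actual code):

// Toy AdaBoost round (illustration only, not icsiboost code).
// labels[i] and pred[i] are in {-1, +1}; w holds the example weights.
#include <cmath>
#include <vector>

double adaboost_round(std::vector<double>& w,
                      const std::vector<int>& labels,
                      const std::vector<int>& pred)
{
    // Weighted error of the weak classifier (assumed in (0, 0.5)).
    double err = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i)
        if (pred[i] != labels[i]) err += w[i];

    // Vote of this classifier: the lower the error, the larger the vote.
    const double alpha = 0.5 * std::log((1.0 - err) / err);

    // Overweight misclassified examples, then renormalize.
    double z = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i) {
        w[i] *= std::exp(-alpha * labels[i] * pred[i]);
        z += w[i];
    }
    for (double& wi : w) wi /= z;
    return alpha;
}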
Thursday, 27 September 2012
TurboParser - Dependency Parser with Linear Programming
Link: http://www.ark.cs.cmu.edu/TurboParser/
Intro: TurboParser is a free C++ implementation of a multilingual non-projective dependency parser based on linear programming relaxations.
Wednesday, 26 September 2012
Text extraction from HTML pages
1) Link: http://cogcomp.cs.illinois.edu/page/software_view/MSS
2) Link: http://researchlog-duyvuleo.blogspot.sg/2010/11/easy-way-to-extract-useful-text-from.html
3) Link: http://researchlog-duyvuleo.blogspot.sg/2012/06/justext.html
4) Link (PhD thesis): http://is.muni.cz/th/45523/fi_d/phdthesis.pdf
Labels:
HTML,
links,
news processing,
text extraction,
tools
Sunday, 23 September 2012
ICU - International Components for Unicode
Intro: ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.
Code Page Conversion: Convert text data to or from Unicode and nearly any other character set or encoding. ICU's conversion tables are based on charset data collected by IBM over the course of many decades, and are the most complete available anywhere.
Collation: Compare strings according to the conventions and standards of a particular language, region or country. ICU's collation is based on the Unicode Collation Algorithm plus locale-specific comparison rules from the Common Locale Data Repository, a comprehensive source for this type of data.
Formatting: Format numbers, dates, times and currency amounts according to the conventions of a chosen locale. This includes translating month and day names into the selected language, choosing appropriate abbreviations, ordering fields correctly, etc. This data also comes from the Common Locale Data Repository.
Time Calculations: Multiple types of calendars are provided beyond the traditional Gregorian calendar. A thorough set of timezone calculation APIs is provided.
Unicode Support: ICU closely tracks the Unicode standard, providing easy access to all of the many Unicode character properties, Unicode Normalization, Case Folding and other fundamental operations as specified by the Unicode Standard.
Regular Expression: ICU's regular expressions fully support Unicode while providing very competitive performance.
Bidi: Support for handling text containing a mixture of left-to-right (English) and right-to-left (Arabic or Hebrew) data.
Text Boundaries: Locate the positions of words, sentences, paragraphs within a range of text, or identify locations that would be suitable for line wrapping when displaying the text.
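As a small taste of the C++ API, the sketch below locates word boundaries with a BreakIterator, one of the Text Boundaries features above (assuming the ICU development headers and libraries are installed):

// Word segmentation with ICU's BreakIterator.
#include <iostream>
#include <memory>
#include <string>
#include <unicode/brkiter.h>
#include <unicode/unistr.h>

int main()
{
    UErrorCode status = U_ZERO_ERROR;
    icu::UnicodeString text =
        icu::UnicodeString::fromUTF8("ICU finds word boundaries.");

    std::unique_ptr<icu::BreakIterator> bi(
        icu::BreakIterator::createWordInstance(icu::Locale::getUS(), status));
    if (U_FAILURE(status)) return 1;

    bi->setText(text);
    int32_t start = bi->first();
    for (int32_t end = bi->next(); end != icu::BreakIterator::DONE;
         start = end, end = bi->next()) {
        std::string token;
        text.tempSubStringBetween(start, end).toUTF8String(token);
        std::cout << "[" << token << "]\n";
    }
    return 0;
}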
Deep learning
I just found out about this research topic; it is quite new.
I intend to get deeper into it, especially its impact on NLP research.
Some review papers and tutorials:
1) http://deeplearning.net/
(tutorial: http://deeplearning.net/tutorial/)
2) ACL 2012 tutorial:
http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial
3) Ronan Collobert (http://ronan.collobert.com/)
http://ronan.collobert.com/pub/matos/2008_nlp_icml.pdf
http://ronan.collobert.com/pub/matos/2009_tutorial_nips.pdf
http://ronan.collobert.com/pub/matos/2011_nlp_jmlr.pdf
4) ...
Wednesday, 12 September 2012
How to solve "incorrect side-by-side configuration" error when running VC++ app
When I compiled an app with Visual Studio 2005 Pro on my first PC and ran it on my second PC, I got the following error:
"The application has failed to start because its side-by-side configuration is incorrect ... "
I found the reason: the side-by-side configuration on my first PC differed from the one on my second PC.
How to fix that:
- First of all, check the side-by-side configuration in the registry (type regedit in cmd) and navigate to the following keys:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\SideBySide\Winners\x86_policy.8.0.microsoft.vc80.crt_1fc8b3b9a1e18e3b_none_e8a8ec119a3821e7
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\SideBySide\Winners\x86_policy.8.0.microsoft.vc80.atl_1fc8b3b9a1e18e3b_none_e8ff9ccd99f7096b
Alternatively, you can check this with the built-in Windows tool Event Viewer (under Control Panel\Administrative Tools). In Event Viewer, check the errors under Windows Logs\Application and find the one corresponding to your app.
Next, check the Default values of the above keys. They should be the same and have the highest values (the last numbers). Importantly, these values MUST be the same on both PCs.
- Secondly, install the correct side-by-side configuration by downloading and installing the latest update from the Microsoft website (see http://support.microsoft.com/kb/2538218 and http://technet.microsoft.com/en-us/security/bulletin/ms11-025).
For example, the side-by-side version of the first key on my first PC is 8.0.50727.6195.
I needed the update (http://support.microsoft.com/kb/2538218) that installs that same version, 8.0.50727.6195.
That solved the problem. If you run into the same error, you can follow these steps.
Hope it helps!
--
Vu
"The application has failed to start because its side-by-side configuration is incorrect ... "
I found the reason. That is because the side-by-side configuration of my first PC was configured differently from the one on my second PC.
How to fix that:
- First of all, check the side-by-side configuration on registry (type regedit on cmd), proceed to the following keys:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\SideBySide\Winners\x86_policy.8.0.microsoft.vc80.crt_1fc8b3b9a1e18e3b_none_e8a8ec119a3821e7
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\SideBySide\Winners\x86_policy.8.0.microsoft.vc80.atl_1fc8b3b9a1e18e3b_none_e8ff9ccd99f7096b
Alternatively, you can check this by using the existing tool on Windows namely Event Viewer (under Control Panel\Administration Tools). On Event Viewer, check errors on Windows Logs\Application and find the appropriate error on your app.
Next, check the Default values of the above keys. They should be the same and have the highest values (the last numbers). Importantly, these values MUST be the same on two PCs.
- Secondly, try to install the correct side-by-side configuration by downloading and installing the latest update from Microsoft website. (see: http://support.microsoft.com/kb/2538218 and http://technet.microsoft.com/en-us/security/bulletin/ms11-025).
For example, the side-by-side configuration of the first key on my first PC is 8.0.50727.6195.
I need the update version (http://support.microsoft.com/kb/2538218) which has the same value as 8.0.50727.6195.
The problem is solved. When you get the same problem, you can follow my direction.
Hope it helps!
--
Vu
Labels:
C++,
error,
Microsoft,
programming,
visual studio 2005
Tuesday, 11 September 2012
NiuTrans: A Statistical Machine Translation System
Intro: NiuTrans is an open-source statistical machine translation system developed by the Natural Language Processing Group at Northeastern University, China. The NiuTrans system is fully developed in C++, so it runs fast and uses less memory. It currently supports phrase-based, hierarchical phrase-based, and syntax-based (string-to-tree, tree-to-string and tree-to-tree) models for research-oriented studies.
Paper (ACL 2012 Demo): http://www.nlplab.com/NiuPlan/paper/ACL2012-NiuTrans.pdf
Labels:
machine translation,
NLP,
research,
SMT,
statistical machine translation,
systems,
toolkits
Wednesday, 5 September 2012
Docent - Document-level SMT
Intro: Docent is a decoder for phrase-based Statistical Machine Translation (SMT). Unlike most existing SMT decoders, it treats complete documents, rather than single sentences, as translation units and permits the inclusion of features with cross-sentence dependencies to facilitate the development of discourse-level models for SMT. Docent implements the local search decoding approach described by Hardmeier et al. (EMNLP 2012).
Paper: https://aclweb.org/anthology-new/D/D12/D12-1108.pdf
Wednesday, 29 August 2012
Fangorn: a system for querying very large treebanks
Link: http://nltk.ldc.upenn.edu:9090/index
Intro: Fangorn is an open source tool for querying very large treebanks, built on top of Apache Lucene. Fangorn implements the LPath linguistic path language, which has an XPath-like syntax along with linguistically motivated extensions. Result trees are annotated with the query in order to show how the query matched the tree, and these annotations can themselves be modified and submitted as further queries.
Tuesday, 28 August 2012
Intel® Threading Building Blocks (Intel® TBB)
Intro: Intel® Threading Building Blocks (Intel® TBB) offers a rich and complete approach to expressing parallelism in a C++ program. It is a library that helps you take advantage of multi-core processor performance without having to be a threading expert. Intel TBB is not just a threads-replacement library. It represents a higher-level, task-based parallelism that abstracts platform details and threading mechanisms for scalability and performance.
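For example, parallelizing a loop takes one call to tbb::parallel_for and no explicit thread management (a minimal sketch, assuming TBB is installed and linked with -ltbb):

// Double-and-add over a large vector, split across cores by TBB.
#include <cstddef>
#include <iostream>
#include <vector>
#include <tbb/parallel_for.h>

int main()
{
    std::vector<double> v(1000000, 1.0);

    // TBB chops the index range into tasks and schedules them
    // on the available cores.
    tbb::parallel_for(std::size_t(0), v.size(), [&](std::size_t i) {
        v[i] = v[i] * 2.0 + 1.0;
    });

    std::cout << v[0] << "\n";  // prints 3
    return 0;
}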
Monday, 27 August 2012
ACCURAT Toolkit
Intro: The ACCURAT project (http://www.accurat-project.eu/) is pleased to announce the release of ACCURAT Toolkit - a collection of tools for comparable corpora collection and multi-level alignment and information extraction from comparable corpora. By using the ACCURAT Toolkit, users may obtain:
- Comparable corpora from the Web (current news corpora, filtered Wikipedia corpora, and narrow domain focussed corpora);
- Comparable document alignments;
- Semi-parallel sentence/phrase mapping from comparable corpora (for SMT training purposes or other tasks);
- Translated terminology extracted and mapped from bilingual comparable corpora;
- Translated named entities extracted and mapped from bilingual comparable corpora.
Labels:
comparable corpora,
machine translation,
SMT,
text alignment,
toolkits
Thursday, 23 August 2012
OpenFst Library
Link: http://www.openfst.org/twiki/bin/view/FST/WebHome
Intro: OpenFst is a library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). Weighted finite-state transducers are automata where each transition has an input label, an output label, and a weight. The more familiar finite-state acceptor is represented as a transducer with each transition's input and output label equal. Finite-state acceptors are used to represent sets of strings (specifically, regular or rational sets); finite-state transducers are used to represent binary relations between pairs of strings (specifically, rational transductions). The weights can be used to represent the cost of taking a particular transition.
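The sketch below builds a tiny weighted transducer with the C++ API, in the spirit of the library's documentation example (StdArc means the tropical semiring; assumes OpenFst is installed and linked with -lfst):

// Build and save a two-state weighted transducer.
#include <fst/fstlib.h>

int main()
{
    fst::StdVectorFst f;

    f.AddState();       // state 0
    f.SetStart(0);
    f.AddState();       // state 1
    f.SetFinal(1, 0.3); // final state with weight 0.3

    // One transition: input label 1, output label 2, weight 0.5, to state 1.
    f.AddArc(0, fst::StdArc(1, 2, 0.5, 1));

    // The binary file can be inspected with the fstprint/fstinfo tools.
    f.Write("tiny.fst");
    return 0;
}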
Labels:
finite state transducer,
FST,
machine learning,
NLP,
toolkits
Wednesday, 15 August 2012
Champollion Tool Kit - Text Sentence Aligner
Link: http://champollion.sourceforge.net/
Intro: Built around LDC's champollion sentence aligner kernel, the Champollion Tool Kit (CTK) aims to provide ready-to-use parallel-text sentence alignment tools for as many language pairs as possible.
Champollion depends heavily on lexical information, but uses sentence-length information as well. A translation lexicon is required. Past experiments indicate that Champollion's performance improves as the translation lexicon becomes larger.
Monday, 30 July 2012
PML Tree Query
Intro: PML-TQ is a powerful open-source search tool for all kinds of linguistically annotated treebanks, with several client interfaces and two search backends (one based on an SQL database and one based on Perl and the TrEd toolkit). The tool works natively with treebanks encoded in the PML data format (conversion scripts are available for many established treebank formats).
Labels:
dependency parsing,
link,
parser,
parsing,
query,
tools,
treebank search
Friday, 27 July 2012
PET - Post-Editing Translation Tool
Intro: PET is a stand-alone, open-source (under LGPL) tool written in Java that should help you post-edit and assess machine or human translations while gathering detailed statistics about post-editing time amongst other effort indicators.
Labels:
machine translation,
open source,
post-editing,
SMT,
tools
Tuesday, 24 July 2012
Subtitle Translation
Subtitle corpus: http://opus.lingfil.uu.se/ (more)
*** The Google Translation API is no longer freely available. Can we use state-of-the-art SMT techniques to build a subtitle SMT system by ourselves? What are the challenges???
Labels:
open source,
research,
SMT,
statistical machine translation,
subtitle,
tools
Monday, 23 July 2012
XML-RPC
Link: http://en.wikipedia.org/wiki/XML-RPC
C++ Tool: (tested) http://xmlrpcpp.sourceforge.net/
Intro: XML-RPC is a remote procedure call (RPC) protocol which uses XML to encode its calls and HTTP as a transport mechanism. "XML-RPC" also refers generically to the use of XML for remote procedure calls, independently of the specific protocol.
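A minimal client call with XmlRpc++ could look like the sketch below (the host, port, and method name "sample.add" are made up for illustration; substitute your own server's):

// Call a remote method over XML-RPC with XmlRpc++ (xmlrpcpp).
#include <iostream>
#include "XmlRpc.h"

int main()
{
    XmlRpc::XmlRpcClient client("localhost", 8080);

    XmlRpc::XmlRpcValue args, result;
    args[0] = 2;
    args[1] = 3;

    // execute() encodes the call as XML, POSTs it over HTTP,
    // and decodes the response into result.
    if (client.execute("sample.add", args, result))
        std::cout << "sum = " << int(result) << "\n";
    else
        std::cout << "call failed\n";
    return 0;
}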
Labels:
C++,
links,
server/client programming,
tools,
XML RPC
OpenStreetMap
Link: http://www.openstreetmap.org/
Intro: OpenStreetMap is a free worldwide map, created by people like you.
Data: http://planet.openstreetmap.org/
Labels:
crowdsourcing,
link,
NLP,
research,
social networks,
web 2.0,
world map
Monday, 16 July 2012
Very large-scale corpus (COCA)
COCA (Corpus of Contemporary American English)
Link: http://corpus.byu.edu/coca/
Intro: The Corpus of Contemporary American English (COCA) is the largest freely-available corpus of English, and the only large and balanced corpus of American English. The corpus was created by Mark Davies of Brigham Young University, and it is used by tens of thousands of users every month (linguists, teachers, translators, and other researchers). COCA is also related to other large corpora that we have created.
The corpus contains more than 450 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words each year from 1990-2012 and the corpus is also updated regularly (the most recent texts are from Summer 2012). Because of its design, it is perhaps the only corpus of English that is suitable for looking at current, ongoing changes in the language (see the 2011 article in Literary and Linguistic Computing).
***
Ngram corpus from COCA:
1) COCA Ngrams:
Link: see this post.
2) COHA Ngrams:
Link: http://www.ngrams.info/download_coha.asp
Intro: The Corpus of Historical American English (COHA) contains 400 million words of text from 1810-2009, and all of the n-grams from the corpus can be freely downloaded. They contain all n-grams that occur at least three times total in the corpus, and you can see the frequency of each of these n-grams in each decade from the 1810s-2000s. This data can be used offline to carry out powerful searches on a wide range of phenomena in the history of American English.
---------------------------------
My thoughts:
- I have been developing a language-generic n-gram-based spell checking tool, so this n-gram corpus will be very beneficial.
- Other tasks in English NLP may need this corpus.
Thursday, 12 July 2012
C&C semantic tools
CCG (Combinatory Categorial Grammar) Parser: http://svn.ask.it.usyd.edu.au/trac/candc/wiki
Boxer: http://svn.ask.it.usyd.edu.au/trac/candc/wiki/boxer
Intro: Boxer is developed by Johan Bos and generates semantic representations. It takes as input CCG (Combinatory Categorial Grammar) derivations and produces DRSs (Discourse Representation Structures, from Hans Kamp's Discourse Representation Theory) as output. It is distributed with the C&C tools.
FRED
Intro: A tool for automatically producing RDF/OWL ontologies and linked data from natural language sentences, currently limited to English.
Wednesday, 11 July 2012
Interested Papers at EMNLP 2012
1) Tenses in SMT
D12-1026: Zhengxian Gong; Min Zhang; Chew Lim Tan; Guodong Zhou
N-gram-based Tense Models for Statistical Machine Translation
2)
D12-1041: Nan Duan; Mu Li; Ming Zhou
Forced Derivation Tree based Model Training to Statistical Machine Translation
3) ...
ESAXX - suffix array tool
Link: http://code.google.com/p/esaxx/
Intro: esaxx is a C++ template library for building an enhanced suffix array, which is useful for various string algorithms. For an input text of length N, esaxx builds a suffix tree in linear time using almost 20N bytes of working space (independent of alphabet size).
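To see what the underlying structure is (not how esaxx builds it), here is a deliberately naive suffix-array sketch; esaxx computes the same ordering, plus the enhanced part, in linear time:

// Naive suffix array for illustration only (quadratic comparisons).
#include <algorithm>
#include <iostream>
#include <numeric>
#include <string>
#include <vector>

int main()
{
    const std::string text = "banana";
    std::vector<int> sa(text.size());
    std::iota(sa.begin(), sa.end(), 0);

    // Order suffix start positions by the suffixes they denote.
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return text.substr(a) < text.substr(b);
    });

    // Prints 5 3 1 0 4 2: a, ana, anana, banana, na, nana.
    for (int pos : sa)
        std::cout << pos << ": " << text.substr(pos) << "\n";
    return 0;
}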
Interested Papers at ACL 2012
1) New NLP topic: automatic document dating
P12-1011: Nathanael Chambers
Labeling Documents with Timestamps: Learning from their Time Expressions
2)
P12-1050: Arianna Bisazza; Marcello Federico
Modified Distortion Matrices for Phrase-Based Statistical Machine Translation
3) ...
Sunday, 8 July 2012
Saffron - Extracting the Valuable Threads of Expertise
Link: http://saffron.deri.ie/acl
Intro: Saffron provides insights into a research community or organization by analysing its main topics of investigation and the experts associated with these topics.
Saffron analysis is fully automatic and is based on text mining and linked data principles.
This instance of Saffron analyzes the research community in Natural Language Processing based on the proceedings of the conferences organized by the Association for Computational Linguistics (ACL).
Tuesday, 3 July 2012
C++ web development frameworks
1) Witty: http://www.webtoolkit.eu/wt
2) CppCMS: http://cppcms.com/wikipp/en/page/main
3) POCO: http://pocoproject.org/
4) ...
Labels:
C programming,
C++,
framework,
links,
web programming
Saturday, 23 June 2012
jusText
http://code.google.com/p/justext/
jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, and it is therefore well suited for creating linguistic resources such as Web corpora.
Friday, 1 June 2012
WIT3 - Web Inventory of Transcribed and Translated Talks
Link: https://wit3.fbk.eu/
Intro: WIT3 - acronym for Web Inventory of Transcribed and Translated Talks - is a ready-to-use version for research purposes of the multilingual transcriptions of TED talks.
Since 2007, the TED Conference has been posting on its website all video recordings of its talks, English subtitles and their translations in more than 80 languages. In order to make this collection of talks more effectively usable by the research community, the original textual contents are redistributed here, together with MT benchmarks and processing tools.
Labels:
links,
machine translation,
parallel corpora,
TED talks
Tuesday, 29 May 2012
Layout-Aware Text Extraction from Full-text PDF of Scientific Articles
Description: The Portable Document Format (PDF) is the almost universally used file format for online scientific publications. It is also notoriously difficult to read and handle computationally, presenting challenges for developers of biomedical text mining or biocuration informatics systems that use the published literature as an information source. To facilitate the effective use of scientific literature in such systems we introduce Layout-Aware PDF Text Extraction (LA-PDFText). The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) Detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) Classifying text blocks into rhetorical categories using a rule-based method and (3) Stitching classified text blocks together in the correct order resulting in the extraction of text from section-wise grouped blocks.
Monday, 9 April 2012
The advisor
Intro: Researchers typically rely on manual methods to discover research of interest, such as keyword-based search on a search engine, browsing the publication lists of known experts, or reading the references of interesting papers. These techniques are time-consuming and only allow reaching a limited set of documents.
Monday, 2 April 2012
NRL: The Natural Rule Language
Link: http://nrl.sourceforge.net/
Intro: The Natural Rule Language is a model-driven language aimed at improving quality and time to market in integration projects. It enables users to constrain, modify and map data in diverse formats. NRL works at a high level, and is designed for automatic translation to execution languages.
NRL's main remit is to provide a user-friendly alternative to languages like OCL, XSLT, XPath, Schematron, and many others, particularly in scenarios where they would be considered too technical.
Wednesday, 28 March 2012
Calais
Link: http://www.opencalais.com/
Intro: The OpenCalais Web Service automatically creates rich semantic metadata for the content you submit – in well under a second. Using natural language processing (NLP), machine learning and other methods, Calais analyzes your document and finds the entities within it. But, Calais goes well beyond classic entity identification and returns the facts and events hidden within your text as well.
Labels:
information extraction,
machine learning,
NLP,
toolkits
Downloading full CiteSeerX data
Just saw this link and found it very interesting.
Link: http://b010.blogspot.com/2008/11/downloading-full-citeseerx-data.html
I copy it here for backup (in case the original link dies).
Steps for downloading the full dataset from CiteSeerX:
- Download and extract the "Demo" from http://www.oclc.org/research/software/oai/harvester.htm
- Go to the directory of the extracted files and type the following command to download the full dataset of CiteSeerX into the file "citeseerx_alldata.xml":
java -classpath .;oaiharvester.jar;xerces.jar org.acme.oai.OAIReaderRawDump http://citeseerx.ist.psu.edu/oai2 -o citeseerx_alldata.xml
Thanks the author for that.
--
Cheers,
Vu
Tuesday, 27 March 2012
Preference Learning
Introduction: http://www.ke.tu-darmstadt.de/publications/papers/PLBook-Introduction.pdf
Applications to NLP???
Labels:
learning to rank,
links,
machine learning,
NLP,
preference learning,
technology
Tuesday, 20 March 2012
Language Technology related Companies
Here, I will collect information about companies that develop and use Language Technology (LT). I would like to see how much potential LT has in industry.
Worldwide
1) LingvoSoft
2) Bimaple
3)
Vietnam
Labels:
company,
language technology,
links,
research,
start-up,
Vietnamese NLP
Monday, 19 March 2012
Parallel Text Mining for SMT
Problem: given a relatively large collection of parallel texts and a state-of-the-art SMT system, how do we incrementally and automatically mine the parallel texts available on the Web? The newly added texts should be guaranteed to improve the current SMT system.
Papers related:
1) Large Scale Parallel Document Mining for Machine Translation. COLING 2010. Link.
2) TBA
Labels:
comparable corpora,
machine translation,
NLP,
parallel corpora,
SMT,
text mining
Crowdsourcing for NLP annotations
My new journal article: http://www.springerlink.com/content/n5q5n08853t06131/
Abstract
Crowd-sourcing has emerged as a new method for obtaining annotations for training models for machine learning. While many variants of this process exist, they largely differ in their methods of motivating subjects to contribute and the scale of their applications. To date, there has yet to be a study that helps the practitioner to decide what form an annotation application should take to best reach its objectives within the constraints of a project. To fill this gap, we provide a faceted analysis of crowdsourcing from a practitioner’s perspective, and show how our facets apply to existing published crowdsourced annotation applications. We then summarize how the major crowdsourcing genres fill different parts of this multi-dimensional space, which leads to our recommendations on the potential opportunities crowdsourcing offers to future annotation efforts.
WebAnnotator
WebAnnotator is a new tool for annotating Web pages implemented at LIMSI. Giving it a try will take you no more than 10 minutes.
WebAnnotator is implemented as a Firefox extension, allowing annotation of both online and offline pages. The HTML rendering is fully preserved, and all annotations consist of new HTML spans with specific styles.
WebAnnotator provides an easy and general-purpose framework and is made available under CeCILL free license (close to GNU GPL), so that use and further contributions are made simple.
WebAnnotator can be downloaded on the official Mozilla web page:
https://addons.mozilla.org/en-US/firefox/addon/webannotator/.
A quick manual can be found here:
http://perso.limsi.fr/Individu/xtannier/en/WebAnnotator/
All parts of an HTML document can be annotated: text, images, videos, tables, menus, etc. The annotations are created by simply selecting a part of the document and clicking on the relevant type and subtypes. The annotated elements are then highlighted in a specific color. Annotation schemas can be defined by the user by creating a simple DTD representing the types and subtypes that must be highlighted. Finally, annotations can be saved (HTML with highlighted parts of documents) or exported (in a machine-readable format).
WebAnnotator will be presented at LREC conference in May 2012.
Sunday, 11 March 2012
HTML Parsers
1) Jericho HTML Parser: http://jericho.htmlparser.net/docs/index.html
2) boilerpipe: http://researchlog-duyvuleo.blogspot.com/search?q=boilerpipe
3) HTML Parser: http://htmlparser.sourceforge.net/
4) TBA
Labels:
HTML,
link,
parser,
text extraction,
tools,
web crawler
Thursday, 23 February 2012
SMT without Parallel Corpora
Toward Statistical Machine Translation without Parallel Corpora (link). EMNLP 2011.
Labels:
links,
papers,
parallel corpora,
research,
SMT,
statistical machine translation
Monday, 20 February 2012
Topic Hierarchy Generation
This post continues the problem of topic summarization posted earlier. Here I try to collect research articles related to the problem of topic hierarchy generation, which is an important step for topic summarization.
1) Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes (link). Journal of Machine Learning Research 2011.
2) A Practical Web-based Approach to Generating Topic Hierarchy for Text Segments (link). CIKM 2004.
3) Finding Topic Words for Hierarchical Summarization (link). SIGIR 2001.
4) The Nested Chinese Restaurant Process and Bayesian Non-parametric Inference of Topic Hierarchies (link). Journal of the ACM 2010.
5) Mining bilingual topic hierarchies from unaligned text (link). IJCNLP 2011.
6) Domain-Assisted Product Aspect Hierarchy Generation: Towards Hierarchical Organization of Unstructured Consumer Reviews (link). ACL 2011.
7) (TBA)
(to be updated).
Labels:
links,
NLP,
research,
summarization,
topic summarization
Sunday, 19 February 2012
Albatross Toolkit
http://www.object-craft.com.au/projects/albatross/
Albatross is a small and flexible Python toolkit for developing highly stateful web applications. The toolkit has been designed to take a lot of the pain out of constructing intranet applications although you can also use Albatross for deploying publicly accessed web applications.
Saturday, 4 February 2012
Vietnamese Slang Dictionary
http://giaoducsangtao.com/san-pham-sang-tao/v2v-tu-dien-tieng-long.html
(note this one for future investigation)
Wednesday, 4 January 2012
Corpus Management Tool
AntConc & AntWordProfiler: http://www.antlab.sci.waseda.ac.jp/software.html
NoSketch Engine: http://nlp.fi.muni.cz/trac/noske
NoSketch Engine is an open-source project combining Manatee and Bonito into a powerful and free corpus management system. It is a limited version of the software powering the famous Sketch Engine service, a commercial variant offering word sketches, a thesaurus, keyword computation, user-friendly corpus creation and many other excellent features.
Tuesday, 3 January 2012
Scientific Summarization
This post aims to collect the newest research papers in the literature on scientific summarization:
1) Abstract Summarization:
http://www.springerlink.com/content/q505w20k054k10w4/
2) TBA
3) Review
http://www.springerlink.com/content/4455125331140684/