HOANG Cong Duy Vu's research logs: 2012-09-23

Thursday, 27 September 2012

TurboParser - Dependency Parser with Linear Programming

Link: http://www.ark.cs.cmu.edu/TurboParser/

Intro: TurboParser is a free C++ implementation of a multilingual non-projective dependency parser based on linear programming relaxations.

Wednesday, 26 September 2012

Text extraction from HTML pages

1) Link: http://cogcomp.cs.illinois.edu/page/software_view/MSS
2) Link: http://researchlog-duyvuleo.blogspot.sg/2010/11/easy-way-to-extract-useful-text-from.html
3) Link: http://researchlog-duyvuleo.blogspot.sg/2012/06/justext.html
4) Link (PhD thesis): http://is.muni.cz/th/45523/fi_d/phdthesis.pdf

Sunday, 23 September 2012

ICU - International Components for Unicode

Link: http://site.icu-project.org/

Intro: ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.

Here are a few highlights of the services provided by ICU:

Code Page Conversion: Convert text data to or from Unicode and nearly any other character set or encoding. ICU's conversion tables are based on charset data collected by IBM over the course of many decades, and are the most complete available anywhere.

Collation: Compare strings according to the conventions and standards of a particular language, region or country. ICU's collation is based on the Unicode Collation Algorithm plus locale-specific comparison rules from the Common Locale Data Repository, a comprehensive source for this type of data.

Formatting: Format numbers, dates, times and currency amounts according to the conventions of a chosen locale. This includes translating month and day names into the selected language, choosing appropriate abbreviations, ordering fields correctly, etc. This data also comes from the Common Locale Data Repository.

Time Calculations: Multiple types of calendars are provided beyond the traditional Gregorian calendar. A thorough set of timezone calculation APIs is provided.

Unicode Support: ICU closely tracks the Unicode standard, providing easy access to all of the many Unicode character properties, Unicode Normalization, Case Folding and other fundamental operations as specified by the Unicode Standard.

Regular Expression: ICU's regular expressions fully support Unicode while providing very competitive performance.

Bidi: Support for handling text containing a mixture of left-to-right (English) and right-to-left (Arabic or Hebrew) data.

Text Boundaries: Locate the positions of words, sentences and paragraphs within a range of text, or identify locations that would be suitable for line wrapping when displaying the text.
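
To make a few of these services concrete, here is a minimal sketch of three of them: code page conversion, normalization with case folding, and bidi character classes. Note that it does not use ICU at all. It swaps in Python's standard library (the built-in codecs and the unicodedata module) as a stand-in, so it only illustrates the kind of operation and not ICU's actual API. The function names (decode_legacy, caseless_key, bidi_classes) are made up for this note.

```python
import unicodedata
from typing import List


def decode_legacy(data: bytes, charset: str) -> str:
    """Code page conversion: legacy-encoded bytes -> Unicode string.

    Stand-in only: uses Python's built-in codecs, not ICU's converters.
    """
    return data.decode(charset)


def caseless_key(text: str) -> str:
    """Normalize to NFC, then case-fold, giving a key for loose comparison.

    This is a simplified recipe (an assumption for this sketch), not the
    full caseless-matching definition from the Unicode Standard.
    """
    return unicodedata.normalize("NFC", text).casefold()


def bidi_classes(text: str) -> List[str]:
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]


if __name__ == "__main__":
    # 0xE9 is "é" in Windows code page 1252.
    print(decode_legacy(b"caf\xe9", "cp1252"))

    # "e" + combining acute accent vs. precomposed "É": same key.
    print(caseless_key("e\u0301") == caseless_key("\u00c9"))

    # Latin "a", Hebrew alef, Arabic alef.
    print(bidi_classes("a\u05d0\u0627"))
```

The bidi classes only say which direction each character prefers. Actually reordering mixed-direction text for display needs the full Unicode Bidirectional Algorithm, which is the part ICU provides and this sketch does not.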

Deep learning

I have just found out that this research topic is quite new.
I intend to dig deeper into it, especially its impact on NLP research.

Some review papers and tutorials:

1) http://deeplearning.net/
(tutorial: http://deeplearning.net/tutorial/)

2) ACL 2012 tutorial:
http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial

3) Ronan Collobert (http://ronan.collobert.com/)
http://ronan.collobert.com/pub/matos/2008_nlp_icml.pdf
http://ronan.collobert.com/pub/matos/2009_tutorial_nips.pdf
http://ronan.collobert.com/pub/matos/2011_nlp_jmlr.pdf

4) ...
