Thursday 23 July 2009

NLP research links by Vlado Keselj

http://users.cs.dal.ca/~vlado/nlp/

This page, compiled by Prof. Vlado Keselj at Dalhousie University, contains quite a lot of research links in NLP.

Just a note for future reference!

--
Cheers,
Vu

Wednesday 22 July 2009

Language Experiments - The Portal for Psychological Experiments on Language

http://www.language-experiments.org/

This site may be very useful for people who want to create experiments for their own research. I am still trying to figure out whether it can help with my own work.

--
Cheers,
Vu

Internet FAQ Archives

http://www.faqs.org/faqs/

--
Cheers,
Vu

Tuesday 21 July 2009

PDF to raw texts

There are several ways to convert PDF files to raw text files. In my opinion, they fall into two typical groups: those that use OCR technology and those that do not. PDFBox is a freely available tool that does not use OCR, so the converted raw text can contain some errors. Other tricks, some of which bring OCR into the conversion, are as follows:
- use free or commercial tools like SimpleOCR, VeryPDF, OmniPage ...
- copy-and-paste directly from the PDF files. This trick only works for PDF files that are not secured.
- use online tools (I am actually not sure which technologies they use internally :()
- leverage Google OCR (that is, let Google do this for us):
Convert Scanned PDFs to Text

Now if you have a bunch of scanned PDF files on your hard drive and no OCR software, here’s what you can do to convert them into recognizable text.

Create a folder in your website (say abc.com/pdf) and upload all the PDF images to that folder. Now create a public web page that links to all the PDF files. Wait for the Google bots to spider your stuff.

Once done, type the query "site:abc.com/pdf filetype:pdf" to see the PDF documents as HTML.
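
The "public web page that links to all the PDF files" step can be scripted. Below is a minimal sketch, assuming the PDFs sit in one local folder and will be uploaded under a single base URL. The function names build_index and site_query are my own (hypothetical), and abc.com/pdf is just the placeholder from the text above. It only generates the page and the query string; uploading and waiting for Google is still manual.

```python
from html import escape
from pathlib import Path
from urllib.parse import quote


def build_index(pdf_dir, base_url):
    """Return an HTML page with one link per PDF file found in pdf_dir."""
    links = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # Percent-encode the file name so spaces and similar characters survive in the URL.
        url = base_url.rstrip("/") + "/" + quote(pdf.name)
        links.append('<li><a href="{}">{}</a></li>'.format(escape(url), escape(pdf.name)))
    return "<html><body><ul>\n" + "\n".join(links) + "\n</ul></body></html>\n"


def site_query(site_path):
    """Build the search query described above for a given site path."""
    return "site:{} filetype:pdf".format(site_path)
```

For example, build_index("scans", "http://abc.com/pdf") returns the page as a string, which can be saved as index.html and uploaded next to the PDF files.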
--
Cheers,
Vu

Machine Learning and Natural Language

http://l2r.cs.uiuc.edu/~danr/Teaching/CS546-09/lectures.html

This is a helpful course taught by Prof. Dan Roth at UIUC. I should probably spend more time studying some of the materials from this course on my own to improve my background in Machine Learning for Natural Language.

--
Cheers,
Vu

Sunday 19 July 2009

Term "Oracle"

I sometimes encountered the term "Oracle" in papers, especially in the Experiment and Evaluation sections, but I did not quite understand what it meant. I have recently figured it out: it can refer to the upper bound of whatever measure is used to assess the performance of a method.
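
To make the "upper bound" idea concrete, here is a small sketch of one setting where I believe the term is often used: a system produces a ranked list of candidates for each input, and the oracle is an imaginary selector that always picks the gold answer whenever it appears in the list. The setting, the function names and the toy data are all my own assumptions for illustration; they do not come from any specific paper.

```python
def system_accuracy(nbest_lists, gold):
    """Accuracy when the system always outputs its top-ranked candidate."""
    hits = sum(1 for cands, g in zip(nbest_lists, gold) if cands and cands[0] == g)
    return hits / len(gold)


def oracle_accuracy(nbest_lists, gold):
    """Accuracy when an oracle picks the gold answer whenever it is in the list."""
    hits = sum(1 for cands, g in zip(nbest_lists, gold) if g in cands)
    return hits / len(gold)


# Hypothetical toy data: three inputs, each with a ranked candidate list.
nbest = [["NN", "VB"], ["VB", "NN"], ["JJ", "RB"]]
gold = ["NN", "NN", "NN"]

print(system_accuracy(nbest, gold))
print(oracle_accuracy(nbest, gold))
```

No way of choosing among the listed candidates can score higher than the oracle, which is why it serves as an upper bound for methods that only re-rank or select from those lists.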

--
Cheers,
Vu

Baseline methods

How can we effectively design the baseline methods for a specific problem?
This is a question I raise myself whenever I approach a specific research problem. The baseline methods can be 1) state-of-the-art methods that have been well studied in previous work (yeah, we can use them by re-implementing some of them, but how many are enough? It's a hard question), or 2) the simplest methods that we can naturally think of or easily propose, which should not be too naive (because beating naive methods lowers the value of your proposed method).

Just my thoughts, any corrections are welcome!

--
Cheers,
Vu

Wilcoxon signed-rank test


This is a test for paired samples that is used to determine whether the difference between the results of two methods is statistically significant.

See also the "Significance Testing" tool maintained by Sebastian Pado:
http://www.nlpado.de/~sebastian/sigf.html
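
For my own reference, here is a minimal pure-Python sketch of the signed-rank test itself. It is not the tool linked above and has nothing to do with its implementation. It assumes paired scores (for example, two methods evaluated on the same test items), drops zero differences, and computes a two-sided p-value from the normal approximation without tie or continuity correction, so it is only rough for small samples. The function name is my own; in practice a library routine such as scipy.stats.wilcoxon is the safer choice.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test for paired scores x and y.

    Returns (W, p) where W = min(W+, W-). The p-value uses the normal
    approximation without tie or continuity correction.
    """
    # Paired differences; zero differences are dropped.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Sum the ranks of positive and negative differences separately.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Under the null hypothesis W+ is roughly normal with this mean and deviation.
    mean = n * (n + 1) / 4.0
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / std
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of the standard normal
    return min(w_plus, w_minus), p
```

Calling wilcoxon_signed_rank(scores_a, scores_b) with two equally long lists of per-item scores returns the statistic and the approximate p-value; a small p-value suggests the two methods really differ on those items.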

--
Cheers,
Vu

New Book: Search User Interfaces

A book written by Prof. Marti A. Hearst at the University of California, Berkeley
Search User Interfaces: http://searchuserinterfaces.com/

Just a note, maybe it will be useful for me in the future.

--
Cheers,
Vu