Tuesday, 21 July 2009

PDF to raw text

There are several ways to convert PDF files to raw text files. As I see it, they fall into two groups: those that use OCR and those that do not. PDFBox is a freely available tool that does not use OCR; it only extracts the text already stored in the PDF, so the converted text can contain errors. Other options, some of them based on OCR, are as follows:
- use free or commercial tools such as SimpleOCR, VeryPDF, or OmniPage
- copy and paste directly from the PDF file. This only works for PDF files that contain real text and are not secured against copying.
- use online tools (I am not sure which technologies they use internally :()
- leverage Google's OCR (that is, let Google do the work for us):
Convert Scanned PDFs to Text

Now if you have a bunch of scanned PDF files on your hard drive and no OCR software, here’s what you can do to convert them into recognizable text.

Create a folder on your website (say abc.com/pdf) and upload all the scanned PDF files to that folder. Now create a public web page that links to all the PDF files. Wait for Google's crawler to index them.
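
The public page that links to all the PDF files does not have to be written by hand. Here is a minimal sketch in Python, assuming the PDFs sit in a local folder that mirrors abc.com/pdf. The function name build_index and the base_url parameter are my own placeholders, not part of the original tip.

```python
from html import escape
from pathlib import Path
from urllib.parse import quote


def build_index(pdf_dir, base_url):
    """Return HTML for a page linking to every PDF found in pdf_dir.

    build_index and base_url are illustrative names (assumptions),
    not something the original tip prescribes.
    """
    base = base_url.rstrip("/")
    # Keep only regular files with a .pdf extension, in a stable order.
    names = sorted(
        p.name for p in Path(pdf_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    # One list item per file; percent-encode the URL, escape the label.
    links = "\n".join(
        f'<li><a href="{base}/{quote(name)}">{escape(name)}</a></li>'
        for name in names
    )
    return (
        "<!DOCTYPE html>\n<html><head><title>PDF files</title></head>\n"
        f"<body>\n<ul>\n{links}\n</ul>\n</body></html>\n"
    )
```

Something like Path("index.html").write_text(build_index("pdf", "http://abc.com/pdf")) would then give you a page to upload next to the PDFs.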

Once the files are indexed, type the query "site:abc.com/pdf filetype:pdf" into Google to see the PDF documents as HTML.
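
If you have several folders to check, the query can be assembled in code. This is only a sketch: the helper names are mine, and it assumes the usual https://www.google.com/search?q=... URL form. It builds the link and does not fetch or parse any results.

```python
from urllib.parse import urlencode


def site_filetype_query(site_path, filetype="pdf"):
    """Build the query from the tip, e.g. 'site:abc.com/pdf filetype:pdf'."""
    return f"site:{site_path} filetype:{filetype}"


def google_search_url(query):
    """Return a Google search URL for the query (hypothetical helper)."""
    # urlencode percent-encodes ':' and '/' and turns spaces into '+'.
    return "https://www.google.com/search?" + urlencode({"q": query})
```

For example, google_search_url(site_filetype_query("abc.com/pdf")) gives a link you can open in a browser.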
--
Cheers,
Vu
