There are several ways to convert PDF files to raw text files. In my opinion, they fall into two typical groups: those that use OCR technology and those that do not. PDFBox is a freely available tool that does not use OCR, so the raw text it produces suffers from some errors. To make use of OCR technology in the conversion, here are some tricks:
- use free or commercial tools like SimpleOCR, VeryPDF, OmniPage ...
- copy and paste directly from the PDF files. This trick only works for PDF files that are not secured.
- use online tools (I am actually not sure which technologies they use internally :()
- leverage Google OCR (that is, let Google do the work for us):
Convert Scanned PDFs to Text
Now if you have a bunch of scanned PDF files on your hard drive and no OCR software, here’s what you can do to convert them into recognizable text.
Create a folder on your website (say abc.com/pdf) and upload all the scanned PDFs to that folder. Now create a public web page that links to all the PDF files. Wait for the Google bots to spider your stuff.
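The "public web page that links to all the PDF files" can be generated rather than written by hand. Here is a minimal Python sketch of that step. The function name build_index and the way the base URL is passed in are my own assumptions for illustration, not part of the original trick. It only picks up files ending in lowercase ".pdf".

```python
from html import escape
from pathlib import Path
from urllib.parse import quote


def build_index(pdf_dir: Path, base_url: str) -> str:
    """Return a small HTML page with one link per PDF found in pdf_dir.

    build_index is a hypothetical helper; base_url is the public address
    of the folder, e.g. "http://abc.com/pdf".
    """
    items = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        # Percent-encode the file name so spaces and the like survive in the URL.
        href = f"{base_url.rstrip('/')}/{quote(pdf.name)}"
        items.append(f'<li><a href="{escape(href)}">{escape(pdf.name)}</a></li>')
    body = "\n".join(items)
    return f"<!DOCTYPE html>\n<html><body>\n<ul>\n{body}\n</ul>\n</body></html>\n"
```

To use it, call build_index(Path("pdf"), "http://abc.com/pdf"), save the returned string as index.html, and upload that file next to the PDFs.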
Once done, type the query "site:abc.com/pdf filetype:pdf" to see the PDF documents as HTML.
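If you have several folders to check, the query can be built programmatically. This is just a sketch: the helper names site_pdf_query and search_url are made up for illustration, and it only assembles the query string and an ordinary Google search URL. It does not fetch or parse any results.

```python
from urllib.parse import urlencode


def site_pdf_query(site_path: str) -> str:
    """Build the query from the post, e.g. 'site:abc.com/pdf filetype:pdf'."""
    return f"site:{site_path} filetype:pdf"


def search_url(site_path: str) -> str:
    """Return a Google search URL for that query (hypothetical helper)."""
    return "https://www.google.com/search?" + urlencode({"q": site_pdf_query(site_path)})
```

For example, search_url("abc.com/pdf") gives a link you can open in the browser instead of typing the query by hand.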
Cheers,
Vu