HOW TO : Perform OCR on PDF files for free

I recently had to convert a scanned PDF file into an editable document. You can do this using OCR, and there is a ton of software out there that does it, including web-based services. But each option I found had limitations: either I had to buy the software, or there was a limit on the number of pages that could be scanned. I didn’t want to buy a license, since this is not something I would be doing regularly, and the document I had to convert was 61 pages long, so none of the online services would let me do it. I remembered reading that Google Docs added OCR capability a while ago, and since I have a Google Apps account, I decided to give it a try.

It turns out Google also has a limit: 2 pages per OCR conversion. So after some brainstorming, I came up with this quick hack to use Google Docs for converting large PDF files into editable content.

  1. Split the PDF file into two-page documents using PDFsam (Open Source PDF Split and Merge Tool).
  2. Log into your Google Docs interface at http://docs.google.com . All you need is a Google Account to use this feature.
  3. Create a folder (collection) to organize your files. This is not required, but it will make searching for the files a lot easier.
  4. Check the setting that converts PDF files to editable documents.
  5. Upload the PDF files you created in step 1.
  6. As you upload the files, Google creates an editable document with the text from each PDF file. You can then create a new document and copy/paste the content from all the smaller documents into it, in page order.
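
To see how many pieces step 1 produces, here is a minimal Python sketch of the page arithmetic behind the split. It does not touch the PDF itself (PDFsam does the actual splitting); the function names `chunk_ranges` and `chunk_filename`, and the `part_NNN.pdf` naming scheme, are my own assumptions for illustration.

```python
def chunk_ranges(total_pages, chunk_size=2):
    """Return 1-based, inclusive (first, last) page ranges covering the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size must be >= 1")
    return [
        # The last chunk may be shorter when the page count is not a multiple
        # of chunk_size, so clamp its end to total_pages.
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


def chunk_filename(index, prefix="part"):
    """Hypothetical naming scheme; zero-padding keeps the files sorted in page order."""
    return f"{prefix}_{index:03d}.pdf"


ranges = chunk_ranges(61)         # the 61-page document from this post
print(len(ranges))                # 31
print(ranges[0], ranges[-1])      # (1, 2) (61, 61)
print(chunk_filename(1))          # part_001.pdf
```

So the 61-page document ends up as 31 uploads: 30 two-page files plus one single-page file at the end. Zero-padded names are worth the trouble, because they keep the converted documents listed in page order when you do the copy/paste in step 6.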

I think someone with more programming chops than me can improve this by using the Google API to do the copy/paste from the smaller docs into the final document :).
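
I have not tried the Google API route, but here is a rough sketch of the merge half of the problem done locally instead. It assumes you have already downloaded each converted document as a plain-text file into one folder, named something like `part_001.txt`, `part_002.txt` and so on. The file names, the `merge_parts` and `part_number` functions, and the `merged.txt` output name are all assumptions of mine, not anything Google provides.

```python
import re
from pathlib import Path


def part_number(path):
    """Extract the first number in the file name, so part_10 sorts after part_2."""
    match = re.search(r"(\d+)", path.stem)
    return int(match.group(1)) if match else 0


def merge_parts(folder, output_name="merged.txt", pattern="part_*.txt"):
    """Concatenate the text of every matching file in page order.

    Assumes the files were exported by hand from the converted documents;
    nothing here talks to Google.
    """
    folder = Path(folder)
    parts = sorted(folder.glob(pattern), key=part_number)
    # Separate the chunks with a blank line so page breaks stay visible.
    merged = "\n\n".join(p.read_text(encoding="utf-8").strip() for p in parts)
    output = folder / output_name
    output.write_text(merged + "\n", encoding="utf-8")
    return output
```

Calling `merge_parts("ocr_output")` (with `ocr_output` being whatever folder you saved the text files in) writes a single `merged.txt` next to the parts. Sorting by the number in the name rather than by the name itself means it still works if the files were not zero-padded.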