HOW TO : Perform OCR on PDF files for free

I recently had to convert a scanned PDF file into an editable document. You can do this using OCR, and there is a ton of software out there that does this. There are even web-based services that do it. But each of them had limitations (I either had to buy the software or was limited in the number of pages that could be scanned). I didn’t want to buy a license, since this is not something I would be doing regularly, and the document I had to convert was 61 pages, so none of the online services allowed me to do it. I remembered reading that Google Docs added this (OCR) capability a while ago, and since I have a Google Apps account, I decided to give it a try.

Google also has a limit of 2 pages per OCR conversion. So after some brainstorming, I came up with this quick hack to use Google Docs for converting large PDF files into editable content.

  1. Split the PDF file into two-page documents using PDFsam (an open source PDF split and merge tool).
  2. Log into the Google Docs interface at http://docs.google.com. All you need is a Google Account to use this feature.
  3. Create a folder (collection) to organize your files. This is not required, but it will make searching for the files a lot easier.
  4. Check the setting that converts uploaded PDF files to editable documents.
  5. Upload the PDF files you created in step 1.
  6. As you upload the files, Google creates an editable document with the text from the PDF files. You can then create a new document and copy/paste the content from all the smaller files.
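To plan step 1, it helps to know which page ranges you will end up with. The following is a minimal sketch that only does the arithmetic; page_ranges is a hypothetical helper name, and it does not read or split any PDF itself. The ranges it prints can be typed into whatever splitting tool you use.

```shell
#!/bin/bash
# Print the page ranges needed to cut a document into fixed-size chunks.
# page_ranges is a hypothetical helper; it does not touch any PDF file.
page_ranges() {
    local total="$1" chunk="${2:-2}" start end
    for (( start = 1; start <= total; start += chunk )); do
        end=$(( start + chunk - 1 ))
        # The last chunk may be shorter than the others.
        (( end > total )) && end="$total"
        echo "${start}-${end}"
    done
}

# A 61-page document in two-page chunks: show the last three ranges.
page_ranges 61 2 | tail -n 3
```

For the 61-page document above, this gives 31 chunks, with the last one holding only page 61.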

I think someone with more programming chops than me can improve this by using the Google API to do the copy/paste from the smaller docs into the final document :).

HOW TO : List files that don't contain a string using find and grep

If you run into a situation where you need to search through a bunch of files and print the names of the files that don’t contain a particular string, here is how you do it in Linux

[code]find . -name 'PATTERN_FOR_FILE_NAMES' -print0 | xargs -0 grep -L 'STRING_YOU_ARE_SEARCHING_FOR' [/code]

The -L option for grep does this (according to the manual):

Suppress normal output; instead print the name of each input file from which no output would normally have been printed. The scanning will stop on the first match.
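Here is a small, self-contained demo of the same idea. The directory, file names and contents are made up for illustration; only find, xargs and grep -L come from the command above.

```shell
#!/bin/bash
# Demo of grep -L in a throwaway directory (file names and contents are made up).
demo_dir="$(mktemp -d)"
printf 'error: disk full\n' > "$demo_dir/a.log"
printf 'all good\n'         > "$demo_dir/b.log"
printf 'error: timeout\n'   > "$demo_dir/c.txt"

# List the *.log files that do NOT contain the string "error".
# -print0 and -0 keep file names with spaces intact.
missing="$(find "$demo_dir" -name '*.log' -print0 | xargs -0 grep -L 'error')"
echo "$missing"

rm -rf "$demo_dir"
```

Only b.log is listed: a.log contains the string, and c.txt is never searched because it does not match the file name pattern.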

Project Uptime : Progress Report – 4

Continuing to lock down the server a bit more as part of Project Uptime. I highly recommend enabling and using iptables on every Linux server. I want to restrict inbound traffic to the server to only SSH (tcp port 22) and HTTP(S) (tcp ports 80/443). Here’s the process

Check the current rules on the server

[code]sudo iptables -L [/code]

Add rules to allow SSH, HTTP and HTTPS traffic and all traffic from the loopback interface

[code]sudo iptables -I INPUT -i lo -j ACCEPT
sudo iptables -A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
sudo iptables -A INPUT -p tcp --dport ssh -j ACCEPT
sudo iptables -A INPUT -p tcp --dport http -j ACCEPT
sudo iptables -A INPUT -p tcp --dport https -j ACCEPT
[/code]

Drop any traffic that doesn’t match the above mentioned criteria

[code]sudo iptables -A INPUT -j DROP [/code]

Save the config to a file, so that a script can reload the rules after a reboot, by running

[code]sudo su -
iptables-save > /etc/firewall.rules[/code]
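Before relying on the saved file, it may be worth a quick sanity check so you don't lock yourself out of SSH. This is only a sketch: check_rules is a hypothetical helper, and it assumes the usual iptables-save output, where service names are written as numeric ports (for example "--dport 22").

```shell
#!/bin/bash
# Sanity-check a rules file written by iptables-save before relying on it.
# check_rules is a hypothetical helper; it only reads the file.
check_rules() {
    local file="$1" port
    for port in 22 80 443; do
        if ! grep -Eq -- "^-A INPUT .*--dport ${port} .*-j ACCEPT" "$file"; then
            echo "no ACCEPT rule for tcp port ${port}" >&2
            return 1
        fi
    done
    # The catch-all DROP should be the last INPUT rule in the file.
    if [ "$(grep -- '^-A INPUT ' "$file" | tail -n 1)" != "-A INPUT -j DROP" ]; then
        echo "last INPUT rule is not the catch-all DROP" >&2
        return 1
    fi
    echo "rules look sane"
}

# Example (run as root if the file is not world-readable):
#   check_rules /etc/firewall.rules
```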

Now create a simple script that will load these rules during startup. Ubuntu provides a pretty neat way to do this: you can write a simple script and place it in /etc/network/if-pre-up.d, and the system will execute it before bringing up the interfaces. You can get pretty fancy with this, but here is a simple script that I use

[code]
$ cat /etc/network/if-pre-up.d/startfirewall
#!/bin/bash

# Import iptables rules if the rules file exists

if [ -f /etc/firewall.rules ]; then
    iptables-restore < /etc/firewall.rules
fi

exit 0
[/code]

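Make sure the script is executable (chmod +x /etc/network/if-pre-up.d/startfirewall), since scripts in this directory are only run if they are executable. Now you can reboot the server and check if your firewall rules are still in effect by running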

[code]sudo iptables -L [/code]

Project Uptime : Progress Report – 3

Things have been a bit hectic at work, so I didn’t get a lot of time to work on this project. Now that the new server has been set up and the kernel updated, we get down to the mundane task of installing the software.

One of the first things I do when configuring any new server is to restrict the root user from logging into the server remotely. SSH is the default remote shell access method nowadays. Please don’t tell me you are still using telnet :).

Before restricting remote access for the root user, add a new user that you want to use for regular activities, add the user to the sudo group, and ensure you can log in and sudo to root as this user. Here are the steps I follow to do this on an Ubuntu server

Add a new user

[code]useradd -m -s /bin/bash xxxx
passwd xxxx [/code]

Add user to sudo group

[code]usermod -G sudo -a xxxx[/code]

Check user can sudo to gain root access

[code]sudo su - xxxx
sudo su - [/code]
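Once the new user can sudo, root logins over SSH can be switched off. The steps above don't show that part, so here is a minimal sketch, assuming the stock OpenSSH server, whose config file uses the PermitRootLogin directive. disable_root_login is a hypothetical helper name, and the sed -i call assumes GNU sed, which is what Ubuntu ships.

```shell
#!/bin/bash
# Sketch: set "PermitRootLogin no" in an sshd_config-style file.
# disable_root_login is a hypothetical helper name, not a standard tool.
disable_root_login() {
    local cfg="$1"
    if grep -Eq '^[[:space:]]*#?[[:space:]]*PermitRootLogin[[:space:]]' "$cfg"; then
        # Rewrite the existing (possibly commented-out) directive in place.
        sed -i -E 's/^[[:space:]]*#?[[:space:]]*PermitRootLogin[[:space:]].*/PermitRootLogin no/' "$cfg"
    else
        # No directive present at all, so append one.
        echo 'PermitRootLogin no' >> "$cfg"
    fi
}

# Example (run as root, and keep your current SSH session open while testing):
#   disable_root_login /etc/ssh/sshd_config
#   sshd -t && service ssh restart
```

Running sshd -t first checks the config for syntax errors, so a typo doesn't take the SSH daemon down with it.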

Now moving into the software installation part

Install MySQL

[code]sudo apt-get install mysql-server [/code]

You will be prompted to set the password for the MySQL root user during this install. This is quite convenient, unlike the older installs, where you had to set the root password later on.

Install PHP

[code]sudo apt-get install php5-mysql [/code]

In addition to installing php5-mysql, this will also install Apache. I know I mentioned that I would like to try out the new version of Apache, but it looks like Ubuntu doesn’t have a package for it yet. And I am too lazy to compile from source :).

With this, you have all the basic software for WordPress. Next, we will tweak this software to use fewer system resources.

HOW TO : Identify good talent

In Fortune magazine, there is a section (and for the life of me, I cannot seem to find it online, and I just cleared my dining table of all the back issues of the magazine :( ) on how business leaders see trends in real life and then make judgements on where the economy is headed. For example, if someone sees a Starbucks they visit regularly become busier than normal, that person judges that the economy is doing well.

I judge good talent (think sysadmin, DBA, programmer, application engineer, etc.) in a somewhat similar fashion. I should note that this system is not perfect, but it has worked out more often than not for me so far. I judge their talent based on which browser they use and how they use it.

THE BEST

Ever see a person who has multiple browsers (not tabs) open and is doing a specialized task in each one of them? Their home screen is usually set to a specialized search engine (blekko, duckduckgo, wolframalpha), and they have add-ons that block ads and show them a variety of information about the sites they visit. These are usually the best folks to have on your team. These are the folks that you want designing your systems.

THE GOOD

This group tends to use either Firefox or Chrome, has Google as their home page, and knows how to use multiple tabs. Yeah.. sorry, I am a browser discriminator :). Since there can only be a few rock stars in the world, you should consider yourself lucky if most of the members of your team belong to this group.

THE REST

Ever see someone whose homepage looks like Google but is not, who has a bunch of “toolbars” that take up a third of the screen, and who has popups showing up randomly? Yep.. these are the folks you don’t want touching your code. Not even with a 10-foot pole.. no sir. Having these folks move to Firefox and/or Chrome doesn’t help the situation either :).

HOW TO : Route traffic to loopback interface in Linux

Back in 2009 (last decade!! :) ), I wrote a blog post on how you can trick Windows into routing traffic destined to a particular IP address to a black hole. In it, I mentioned that the command to route traffic to /dev/null in Linux was [code]route ADD IP_ADDRESS_OF_MAIL_SERVER MASK 255.255.255.255 127.0.0.1 [/code]

I ran into a need to try it today, and it looks like the trick doesn’t work :). So here is the right command if you want to route traffic destined to a particular IP address to the loopback interface (or black hole) [code]sudo route add -host IP_ADDRESS_OF_HOST/NETWORK_MASK lo [/code]

For example, if I want to black-hole traffic destined to 74.205.216.2, I would do the following [code]sudo route add -host 74.205.216.2/32 lo [/code]

Dreaming

James Cameron successfully completed a dive to the deepest part of the ocean today. And with it, he repeated a feat that was tried 50 years ago. I cannot wait for the documentary he is going to make out of this dive. I put this on the same scale as someone flying to the moon and back. Yes, I know it was done before. But it had never been repeated.

By doing this, Mr. Cameron has inspired us to start dreaming again. That we can still fly to the moon.. that we can one day travel beyond the boundaries of our solar system.

Thank you Mr. Cameron, for letting us dream again.

Driving in Texas

If you are ever driving in the country of Texas (yes, for the uninitiated, Texas is a country by itself :) ), drive in the leftmost lane if you don’t want to get off the highway and drive in the rightmost lane if you want to get off the highway. Even the folks that only drive at 55 and are always in the right lane need to heed this advice..

Or you will be getting off and on the never-ending “frontage” roads :)

I shared my two cents!!

DID YOU KNOW : Windows mobile and wildcard certs don't work together

Wildcard SSL certificates allow you to use one certificate for all subdomains (up to one level) of a domain. Say I got a wildcard SSL certificate for *.kudithipudi.org; I would be able to use it to provide SSL on blah.kudithipudi.org, ssltest.kudithipudi.org and youcannotbeserious.kudithipudi.org, and the clients won’t complain about it.
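The "up to one level" rule can be sketched in a few lines of shell. This is only an illustration of the rule: wildcard_covers is a hypothetical helper, it compares names case-sensitively, and it is not how any real SSL library does its matching.

```shell
#!/bin/bash
# Sketch of the "one level only" wildcard rule.
# wildcard_covers is a hypothetical helper, not how any TLS library is implemented.
wildcard_covers() {
    local pattern="$1" host="$2"
    local suffix="${pattern#\*.}"            # "*.example.org" -> "example.org"
    [ "$suffix" != "$pattern" ] || return 1  # pattern must start with "*."
    local label="${host%".$suffix"}"         # strip ".example.org" from the host
    [ "$label" != "$host" ] || return 1      # host must end with the suffix
    # Exactly one non-empty label may replace the "*": no dots allowed in it.
    [[ -n "$label" && "$label" != *.* ]]
}

wildcard_covers '*.kudithipudi.org' 'blah.kudithipudi.org' && echo "covered"
wildcard_covers '*.kudithipudi.org' 'a.b.kudithipudi.org'  || echo "not covered"
```

The first call prints "covered" and the second prints "not covered", since a.b.kudithipudi.org sits two levels below the domain. The bare domain kudithipudi.org is not covered either.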

For some reason though, Windows Mobile phones don’t like wildcard certs. So if you are ever scratching your head over why every other client works but Windows Mobile devices don’t, stop scratching and get a regular SSL certificate for your website/application.

Apparently, this is the case with

  • Windows CE
  • Windows Mobile 5.0
  • Windows Mobile 6.0
  • Windows Mobile 7.0

Don’t you get the feeling that someone keeps using the same library and never bothered to check/fix it? And searching on MSDN or any other Microsoft resource won’t provide you this information. This is my own deduction after beating my head against the wall for more than 3 days :).