
HOW TO : Search and Replace text in a file with Perl

There are tons of sites (and tons of different ways to do this) covering this information.. but I wanted to note it down for my personal records. If you ever want to search for and replace certain text in a file, you can do it with Perl using this quick one-liner:

[code]perl -p -i -e 's/ORIGINAL_STRING/NEW_STRING/g' FILE_NAME[/code]
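For example, to replace every occurrence of localhost with samurai.example.com in a file named httpd.conf (a made-up file name, just for illustration), you would run

[code]perl -p -i -e 's/localhost/samurai.example.com/g' httpd.conf[/code]

And if you want a backup of the original before Perl edits the file in place, pass an extension to -i (e.g. -i.bak) and the untouched copy is saved as httpd.conf.bak.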

Demonstrating the power of Perl

I haven't scripted in Perl for quite some time (one of the disadvantages of moving into management 🙂 ). Today, we had to analyze some log files at work, so I thought I would dust off my scripting skills..

The source data is Apache web logs and we had to find the number of hits from each unique IP address for a particular scenario.

Pretty simple, right? grep will do the job very well, as demonstrated in this blog post. But we had to analyze the data for a ton of servers and I really didn't want to repeat the same command again and again. Did you know that laziness is the mother of invention? 🙂 So I wrote a simple Perl script to do the job for me. The biggest advantage of writing this Perl script was not that it reduced the copy/paste work, but the speed at which the script ran. Details of the comparison are below.

HOW 99% OF ENGINEERS WOULD DO IT

The analysis consisted of getting web logs for the last week (some of these log files were already rotated/compressed), concatenating them to create one large file and then getting the number of hits by IP for a certain condition. This can be done very simply by using a few commands that come standard with any *nix system:

  • cp
  • gunzip (for the rotated/compressed logs)
  • cat
  • grep for each day we needed the data

The final grep command would look like this:

[code]grep -i "\[20/Feb/2012" final_log | grep -i "splash.do" | grep -i productcode | cut -d' ' -f 1 | sort | uniq -c | sort -rn > ~/2_20_2012_ip_report.log[/code]

Timing this command showed that it took ~1 minute and 22 seconds to run.
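In case you are wondering how to measure that, prefixing the pipeline with the bash time keyword does the trick, along these lines:

[code]time (grep -i "\[20/Feb/2012" final_log | grep -i "splash.do" | grep -i productcode | cut -d' ' -f 1 | sort | uniq -c | sort -rn > ~/2_20_2012_ip_report.log)[/code]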

HOW THE 1% DO IT :)

I wrote this Perl script (disclaimer: I am not a programmer 🙂, so please excuse the hacky code).

[code]

#!/usr/bin/perl
# Modules to load
use strict;
use warnings;

# Variables
my $version = 0.1;

# Clear the screen
system $^O eq 'MSWin32' ? 'cls' : 'clear';

# Create one large file to parse (note: this assumes the script is run from your home directory)
`cp /opt/apache/logs/access_log ~/access_log`;
`cp /opt/apache/logs/access_log.1.gz ~/access_log.1.gz`;
`cp /opt/apache/logs/access_log.2.gz ~/access_log.2.gz`;

`gunzip access_log.1.gz`;
`gunzip access_log.2.gz`;

`cat access_log.2 access_log.1 access_log > final_access_log`;

# Hostname
my $hostName = `hostname`;
chomp($hostName);

print "The Hostname of the server is : $hostName \n";

# Process the log file, one line at a time
open(INPUTFILE,"< final_access_log") || die "Couldn't open log file, exiting $!\n";

while (defined (my $line = <INPUTFILE>)) {
    chomp $line;
    # Split the log by date: the 20th through the 28th of Feb each go to their own file
    if ($line =~ m/\[2([0-8])\/Feb\/2012/)
    {
        my $day = $1;
        open(OUTPUTFILE, ">> 2_2${day}_2012_log_file") || die "Couldn't open log file, exiting $!\n";
        print OUTPUTFILE "$line\n";
        close(OUTPUTFILE);
    }
}
close(INPUTFILE);

`rm final_access_log`;
`rm access_log`;
`rm access_log.1`;
`rm access_log.2`;

for (my $day = 0; $day < 9; $day++)
{
    my $outputLog = $hostName."_2_2".$day."_2012.txt";
    my $inputLog = "2_2".$day."_2012_log_file";

    my $dateString = "\\[2".$day."/Feb/2012";

    print "Running the aggregator with the following data\n";
    print "Input File : $inputLog\n";
    print "Output Log : $outputLog\n";
    print "Date String: $dateString\n";

    `grep -i "splash.do" $inputLog | grep -i productcode | cut -d' ' -f 1 | sort | uniq -c | sort -rn > ~/$outputLog`;

    # Cleanup after yourself
    `rm $inputLog`;
}

[/code]
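One thing I would clean up if I revisit this: the script opens and closes the per-day output file for every single matching line. Keeping one filehandle per day in a hash avoids all of those repeated opens. Here is a quick sketch of that idea (untested, just to show the approach):

[code]
#!/usr/bin/perl
use strict;
use warnings;

# Keep one open filehandle per day instead of re-opening
# the output file for every matching line
my %handles;

open(my $in, "<", "final_access_log") || die "Couldn't open log file, exiting $!\n";
while (my $line = <$in>) {
    if ($line =~ m/\[2([0-8])\/Feb\/2012/) {
        my $day = $1;
        # Open the per-day file once, on first use
        unless ($handles{$day}) {
            open($handles{$day}, ">>", "2_2${day}_2012_log_file")
                || die "Couldn't open output file, exiting $!\n";
        }
        print { $handles{$day} } $line;
    }
}
close($in);
close($_) for values %handles;
[/code]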

I wrote a smaller script to do the same job as the command line hack that I tried earlier and compared the times. First, here is the smaller script:

[code]

#!/usr/bin/perl
# Modules to load
use strict;
use warnings;

# Variables
my $version = 0.1;
# Clear the screen
system $^O eq 'MSWin32' ? 'cls' : 'clear';
open(TEMPFILE, "< final_log") || die "Couldn't open log file, exiting $!\n";

# Match the date and write matching lines to another log file
while (defined (my $line = <TEMPFILE>)) {
    chomp $line;
    if ($line =~ m/\[20\/Feb\/2012/)
    {
        open(OUTPUTFILE, ">> perl_speed_test_output.log") || die "Couldn't open output file, exiting $!\n";
        print OUTPUTFILE "$line\n";
        close(OUTPUTFILE);
    }
}
close(TEMPFILE);

`grep -i "splash.do" perl_speed_test_output.log | grep -i productcode | cut -d' ' -f 1 | sort | uniq -c | sort -rn > ~/perl_speed_test_output_ip.log`;

[/code]

Timing this script showed that it took 21 seconds to run, almost a 4x speedup, and, more importantly, it put less load (RAM utilization) on the system.

One has to love technology :).

HOW TO : Sort Apache Web Logs for hits by Unique IP Addresses


Say you want to find out how many hits you are getting to a specific page from each source IP. You can use this quick combination of Linux tools to get the data:

[code]grep -i "URL_TO_CHECK" PATH_TO_APACHE_ACCESS_LOG | cut -d' ' -f 1 | sort | uniq -c | sort -rn > ~/ip_report.txt[/code]

You are using:

  • grep to filter for the page you want the report on
  • cut to extract the IP address from each log line
  • sort and uniq -c to count the hits per unique IP address
  • and finally sort -rn to sort the counts in descending order

Example :

[code]grep -i "GET /" /opt/apache/logs/access_log | cut -d' ' -f 1 | sort | uniq -c | sort -rn > ~/ip_report.txt[/code]

gets you the report of hits to the index page.

HOW TO : Find list of files used by a process in Linux

Quick howto on finding the list of files being accessed by a process in Linux. I needed to find this for troubleshooting an issue where a particular process was using an abnormally high percentage of CPU. I wanted to find out what this particular process was doing and accessing.

  1. Find the process ID (pid) of the process you want to analyze by running [code]ps -ef | grep NAME_OF_PROCESS[/code]
  2. Find the files the process is accessing at a given time by running [code]sudo ls -l /proc/PROCESS_ID/fd[/code]

For example, if I wanted to find the list of files being accessed by mysql, the process would look like this

[code] ps -ef | grep mysqld [/code]

which would show the output as

[code]samurai@samurai:~$ ps -ef | grep mysqld
mysql     3304     1  0 Feb04 ?        00:00:23 /usr/sbin/mysqld
samurai  23389 23374  0 14:57 pts/0    00:00:00 grep --color=auto mysqld
[/code]

I can then find the list of files being used by mysql by running

[code] sudo ls -l /proc/3304/fd [/code]

which would give me

[code]

lrwx------ 1 root root 64 Feb  7 15:00 0 -> /dev/null
lrwx------ 1 root root 64 Feb  7 15:00 1 -> /var/log/mysql/error.log
lrwx------ 1 root root 64 Feb  7 15:00 10 -> socket:[4958]
lrwx------ 1 root root 64 Feb  7 15:00 11 -> /tmp/ibdu9WRh (deleted)
lrwx------ 1 root root 64 Feb  7 15:00 12 -> socket:[4959]
lrwx------ 1 root root 64 Feb  7 15:00 14 -> /var/lib/mysql/blog/wp_term_relationships.MYI
lrwx------ 1 root root 64 Feb  7 15:00 15 -> /var/lib/mysql/blog/wp_postmeta.MYI
lrwx------ 1 root root 64 Feb  7 15:00 17 -> /var/lib/mysql/blog/wp_term_relationships.MYD
lrwx------ 1 root root 64 Feb  7 15:00 18 -> /var/lib/mysql/blog/wp_term_taxonomy.MYI
lrwx------ 1 root root 64 Feb  7 15:00 2 -> /var/log/mysql/error.log
lrwx------ 1 root root 64 Feb  7 15:00 20 -> /var/lib/mysql/blog/wp_postmeta.MYD
lrwx------ 1 root root 64 Feb  7 15:00 21 -> /var/lib/mysql/blog/wp_term_taxonomy.MYD
lrwx------ 1 root root 64 Feb  7 15:00 22 -> /var/lib/mysql/blog/wp_terms.MYI
lrwx------ 1 root root 64 Feb  7 15:00 23 -> /var/lib/mysql/blog/wp_terms.MYD
lrwx------ 1 root root 64 Feb  7 15:00 3 -> /var/lib/mysql/ibdata1
lrwx------ 1 root root 64 Feb  7 15:00 4 -> /tmp/ibvANyz7 (deleted)
lrwx------ 1 root root 64 Feb  7 15:00 5 -> /tmp/ibonS0mU (deleted)
lrwx------ 1 root root 64 Feb  7 15:00 6 -> /tmp/ibcKctaH (deleted)
lrwx------ 1 root root 64 Feb  7 15:00 7 -> /tmp/ibB5DS5t (deleted)
lrwx------ 1 root root 64 Feb  7 15:00 8 -> /var/lib/mysql/ib_logfile0
lrwx------ 1 root root 64 Feb  7 15:00 9 -> /var/lib/mysql/ib_logfile1
[/code]
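As a side note, if you have the lsof utility installed, you can get similar (and more detailed) information in one shot by pointing it at the same process ID:

[code]sudo lsof -p 3304[/code]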

Overheard : Comment about trust and security

Very thought provoking comment on trust and security by Mark Boyle, the Moneyless Man, on a recent episode of PRI's To the Best of Our Knowledge program (I personally transcribed this.. so please overlook any minor typos 🙂 )

What money has become is.. a substitute for trust. It has now become our primary source of security in the world and what I am trying to do personally is to find my primary source of security in the friendships I have and in my local community and my relationship with earth. Because most countries, such as Argentina and Indonesia and currently Zimbabwe, have experienced this hyperinflation and you can have a million dollars in the bank. One day, with devaluation, it can be almost worthless. No matter how badly I behave, my friends don't devalue me that quickly. And I think real security comes in our relationships, whether it be with our planet or with our local community. I think what we all can do is build a bit more diversity in how we meet our needs and to not be so reliant on cash.

You can get the full interview at http://feedproxy.google.com/~r/TTBOOK/~3/X009WjbiqB0/tbk120205a.mp3. Segment with Mark starts at ~42 min.

Overheard : Comment on Work

I was standing in line to board a plane yesterday and heard this comment made by a gentleman to his friend

You know.. funny thing about work, it has to get done!!

The guys were discussing how their wives don't understand the pressures of work :).

Protesting SOPA and PIPA

Unless you are living under a rock or outside the US :).. you have probably heard about the crazy legislation that the US House and Senate are proposing to help protect content creators (AKA Hollywood) from piracy. While I personally don't have any issues with giving protection to content creators, it should not come at the cost of freedom for the rest of the world. Go to http://americancensorship.org/ to get more information about why this proposed legislation is bad.

Today (1/18/2012) has been designated as "Protest SOPA/PIPA day" by the technology world. I believe in the old adage, put your money where your mouth is :).. so I checked the top 25 US sites (according to Alexa) to see how many of them are supporting this protest in a visible manner. Only 4 out of the 25 sites put visible content on their websites regarding the protest. I think Google's message was the most effective: they did not reduce the functionality of the website, but provided a lot of visibility to the protest. I know which companies I am going to support/use moving forward :). I was very happy to see that three of the sites that I use on a regular basis (Google, Amazon and Wikipedia) are supporting this protest. Here are screenshots of the protest from the 4 sites that are in the top 25 visited sites in the US

Google.com

Amazon.com

Wikipedia.Org

WordPress.com

Screenshots of some other sites that I visit on a regular basis and are supporting the protest

Boingboing.net

Wired.com

Arstechnica.com

Reddit.com

DuckDuckGo.com

G+ or Blog

I started using Google Plus last November and I should say that, even though I am a big proponent of keeping control over your digital avatar, it has been much easier to post quick updates on Google Plus than on this blog. Plus, my friends and family don't have to make a special trip to this site to get updates. They get the G+ updates as part of their regular email and/or when they log into their G+ stream. It is less work on everyone's part.

That is one of the reasons I believe G+ will be one of the first real contenders to Facebook. Even though Facebook boasts more than 800 million users, it is still a "separate" site that folks have to log into, unlike Google Plus, which is fast becoming part of the regular Google experience. Especially with the tweaks Google made last week incorporating G+ data into search results, the line between a Google search and using Google Plus gets blurrier.

So the question (for me) is not if it is Facebook or G+.. but if it is the blog or G+..