I haven’t scripted in Perl for quite some time (one of the disadvantages of moving into management 🙂 ). Today, we had to analyze some log files at work, so I thought I would dust off my scripting skills.
The source data is Apache web logs and we had to find out the number of hits from a unique IP address for a particular scenario.
Pretty simple, right? grep will do the job very well, as demonstrated in this blog post. But we had to analyze the data for a ton of servers, and I really didn’t want to repeat the same command again and again. Did you know that laziness is the mother of invention? :) So I wrote a simple Perl script to do the job for me. The biggest advantage of writing the script was not that it cut down on the copy/paste work, but how much faster it ran. Details of the comparison are below.
HOW 99% OF ENGINEERS WOULD DO IT
The analysis consisted of getting the web logs for the last week (some of which were already rotated/compressed), concatenating them into one large file, and then getting the number of hits by IP for a certain condition. This can be done very simply using a couple of commands that come standard with any *nix system:
- cp
- cat
- grep for each day we needed the data
The final grep command would look like this:
[code] grep -i "\[20/Feb/2012" final_log | grep -i "splash.do" | grep -i productcode | cut -d' ' -f 1 - | sort | uniq -c | sort -rn > ~/2_20_2012_ip_report.log [/code]
Timing this command showed that it took ~1 minute and 22 seconds to run.
HOW THE 1% DO IT :)
I wrote this Perl script (disclaimer: I am not a programmer :), so please excuse the hack code).
[code]
#!/usr/bin/perl
# Modules to load
# use strict;
use warnings;
# Variables
my $version = 0.1;
# Clear the screen
system $^O eq 'MSWin32' ? 'cls' : 'clear';
# Create one large file to parse
`cp /opt/apache/logs/access_log ~/access_log`;
`cp /opt/apache/logs/access_log.1.gz ~/access_log.1.gz`;
`cp /opt/apache/logs/access_log.2.gz ~/access_log.2.gz`;
`gunzip access_log.1.gz`;
`gunzip access_log.2.gz`;
`cat access_log.2 access_log.1 access_log > final_access_log`;
# Hostname
$hostName=`hostname`;
chomp($hostName);
print "The Hostname of the server is : $hostName \n";
# Process the log file file, one line at a time
open(INPUTFILE, "< final_access_log") || die "Couldn't open log file, exiting $!\n";
while (defined ($line = <INPUTFILE>)) {
chomp $line;
if ($line =~ m/\[20\/Feb\/2012/)
{
> 2_20">
open(OUTPUTFILE, ">> 2_20_2012_log_file") || die "Couldn't open log file, exiting $!\n";
print OUTPUTFILE "$line\n";
close(OUTPUTFILE);
next;
}
if ($line =~ m/\[21\/Feb\/2012/)
{
> 2_21">
open(OUTPUTFILE, ">> 2_21_2012_log_file") || die "Couldn't open log file, exiting $!\n";
print OUTPUTFILE "$line\n";
close(OUTPUTFILE);
next;
}
if ($line =~ m/\[22\/Feb\/2012/)
{
> 2_22">
open(OUTPUTFILE, ">> 2_22_2012_log_file") || die "Couldn't open log file, exiting $!\n";
print OUTPUTFILE "$line\n";
close(OUTPUTFILE);
next;
}
if ($line =~ m/\[23\/Feb\/2012/)
{
> 2_23">
open(OUTPUTFILE, ">> 2_23_2012_log_file") || die "Couldn't open log file, exiting $!\n";
print OUTPUTFILE "$line\n";
close(OUTPUTFILE);
next;
}
if ($line =~ m/\[24\/Feb\/2012/)
{
> 2_24">
open(OUTPUTFILE, ">> 2_24_2012_log_file") || die "Couldn't open log file, exiting $!\n";
print OUTPUTFILE "$line\n";
close(OUTPUTFILE);
next;
}
if ($line =~ m/\[25\/Feb\/2012/)
{
> 2_25">
open(OUTPUTFILE, ">> 2_25_2012_log_file") || die "Couldn't open log file, exiting $!\n";
print OUTPUTFILE "$line\n";
close(OUTPUTFILE);
next;
}
if ($line =~ m/\[26\/Feb\/2012/)
{
> 2_26">
open(OUTPUTFILE, ">> 2_26_2012_log_file") || die "Couldn't open log file, exiting $!\n";
print OUTPUTFILE "$line\n";
close(OUTPUTFILE);
next;
}
if ($line =~ m/\[27\/Feb\/2012/)
{
> 2_27">
open(OUTPUTFILE, ">> 2_27_2012_log_file") || die "Couldn't open log file, exiting $!\n";
print OUTPUTFILE "$line\n";
close(OUTPUTFILE);
next;
}
if ($line =~ m/\[28\/Feb\/2012/)
{
> 2_28">
open(OUTPUTFILE, ">> 2_28_2012_log_file") || die "Couldn't open log file, exiting $!\n";
print OUTPUTFILE "$line\n";
close(OUTPUTFILE);
next;
}
}
`rm final_access_log`;
`rm access_log`;
`rm access_log.1`;
`rm access_log.2`;
for ($day=0; $day < 9; $day++)
{
$outputLog = $hostName."_2_2".$day."_2012.txt";
$inputLog = "2_2".$day."_2012_log_file";
$dateString = "\\[2".$day."/Feb/2012";
print "Running the aggregator with following data\n";
print "Input File : $inputLog\n";
print "Output Log : $outputLog\n";
print "Date String: $dateString\n";
`grep -i "splash.do" $inputLog | grep -i productcode | cut -d' ' -f 1 - | sort | uniq -c | sort -rn > ~/$outputLog`;
# Cleanup after yourself
`rm $inputLog`;
}
[/code]
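Looking at it now, the nine near-identical if blocks could be collapsed into a single regex capture plus a hash of output filehandles. Here is a minimal sketch of that idea, under the same assumptions as above (February 2012 logs, days 20 through 28, read from final_access_log); I haven’t benchmarked this variant, so treat it as illustrative rather than what I actually ran:
[code]
#!/usr/bin/perl
use strict;
use warnings;

# Split final_access_log into one file per day (e.g. 2_20_2012_log_file)
# by capturing the day of the month from the Apache timestamp, instead of
# hard-coding one if block per date.
my %out;    # day of month => output filehandle

open( my $in, '<', 'final_access_log' ) or die "Couldn't open log file, exiting $!\n";
while ( my $line = <$in> ) {
    # Timestamps look like [20/Feb/2012:10:15:32 -0500]
    next unless $line =~ m/\[(2[0-8])\/Feb\/2012/;
    my $day = $1;
    unless ( $out{$day} ) {
        open( $out{$day}, '>>', "2_${day}_2012_log_file" )
            or die "Couldn't open output file for day $day, exiting $!\n";
    }
    print { $out{$day} } $line;
}
close($in);
close($_) for values %out;
[/code]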
I wrote a smaller script to do the same job as the command-line hack I tried earlier and compared the times. First, here is the smaller script:
[code]
#!/usr/bin/perl
# Modules to load
# use strict;
use warnings;
# Variables
my $version = 0.1;
# Clear the screen
system $^O eq 'MSWin32' ? 'cls' : 'clear';
open (TEMPFILE,"< final_log") || die "Couldn't open log file, exiting $!\n";
# Match date and write to another log file
while (defined ($line = <TEMPFILE>)) {
chomp $line;
if ($line =~ m/\[20\/Feb\/2012/)
{
open(OUTPUTFILE, ">> perl_speed_test_output.log");
print OUTPUTFILE "$line\n";
close(OUTPUTFILE);
next;
}
}
`grep -i "splash.do" perl_speed_test_output.log | grep -i productcode | cut -d' ' -f 1 - | sort | uniq -c | sort -rn > ~/perl_speed_test_output_ip.log`;
[/code]
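As an aside, the shell pipeline at the end (grep | cut | sort | uniq -c | sort -rn) could also be replaced with a Perl hash, avoiding the shell-out entirely. A rough sketch under the same assumptions (the client IP is the first whitespace-separated field of each line, and we want the same case-insensitive splash.do / productcode filters); it prints the counts to stdout instead of writing the report file:
[code]
#!/usr/bin/perl
use strict;
use warnings;

# Count hits per IP, mimicking:
# grep -i "splash.do" | grep -i productcode | cut -d' ' -f 1 | sort | uniq -c | sort -rn
my %hits;

open( my $in, '<', 'perl_speed_test_output.log' ) or die "Couldn't open log file, exiting $!\n";
while ( my $line = <$in> ) {
    next unless $line =~ m/splash\.do/i and $line =~ m/productcode/i;
    my ($ip) = split ' ', $line;    # first field is the client IP in the default Apache log format
    $hits{$ip}++;
}
close($in);

# Highest hit counts first, like `uniq -c | sort -rn`
for my $ip ( sort { $hits{$b} <=> $hits{$a} } keys %hits ) {
    printf "%7d %s\n", $hits{$ip}, $ip;
}
[/code]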
Timing this script showed that it took 21 seconds to run: roughly a 4x improvement in speed (~82 seconds down to 21 seconds) and, more importantly, less load (RAM utilization) on the system.
One has to love technology :).