PerlMonks  

Parsing Apache Log to Get Most Recent File Access

by enoch (Chaplain)
on Nov 25, 2002 at 22:13 UTC ( [id://215742]=perlquestion )

enoch has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a quick script that parses my Apache access log and prints the most recent file accesses. The catch is that I want it to print only the most recent file accessed from each unique IP, and I want the output sorted by date.

Here is my code:

#!/usr/bin/perl
use warnings;
use strict;
use Date::Manip;
use vars qw(%ipHash);

while(<DATA>) {
    /
    ^((\d{1,3}\.){3}\d{1,3})              # grab the IP address into $1
    \s\-\s\-\s\[
    (\d\d\/\w{3}\/\d\d(\d\d\:){3}\d\d)    # grab the date into $3
    \s\-\d{4}\]\s"\w{1,4}\s
    ([\/|\w|\.|_]+)                       # grab the file path into $5
    /x
    and $ipHash{&UnixDate($3,"%s")} = [$1, $3, $5];
}

print join "\n",
    map {$ipHash{$_}[0] . " => " . $ipHash{$_}[1] . "\t" . $ipHash{$_}[2]}
    sort keys %ipHash;

__DATA__
209.36.83.252 - - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%%35%63../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 400 303
209.36.83.252 - - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%%35c../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 400 303
209.36.83.252 - - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%25%35%63../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 313
209.36.83.252 - - [17/Oct/2002:05:53:17 -0400] "GET /scripts/..%252f../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 313
68.9.44.75 - - [17/Oct/2002:06:50:34 -0400] "GET /phpMyAdmin HTTP/1.1" 301 322
68.9.44.75 - - [17/Oct/2002:06:50:34 -0400] "GET /phpMyAdmin/ HTTP/1.1" 200 898
68.9.44.75 - - [17/Oct/2002:06:50:36 -0400] "GET /phpMyAdmin/left.php?lang=en-iso-8859-1&convcharset=iso-8859-1&server=1 HTTP/1.1" 200 1024
129.22.39.158 - - [17/Oct/2002:18:05:10 -0400] "OPTIONS / HTTP/1.1" 200 0
160.79.211.121 - - [17/Oct/2002:19:51:31 -0400] "GET /default.ida?NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN%u9090%u6858%ucbd3%u7801%u9090%u6858%ucbd3%u7801%u9090%u6858%ucbd3%u7801%u9090%u9090%u8190%u00c3%u0003%u8b00%u531b%u53ff%u0078%u0000%u00=a HTTP/1.0" 400 303
129.22.82.8 - - [17/Oct/2002:20:37:10 -0400] "GET /index.php HTTP/1.1" 200 25430
129.22.82.8 - - [17/Oct/2002:20:37:10 -0400] "GET /index.php?=PHPE9568F35-D428-11d2-A769-00AA001ACF42 HTTP/1.1" 200 4440
129.22.82.8 - - [17/Oct/2002:20:37:10 -0400] "GET /index.php?=PHPE9568F34-D428-11d2-A769-00AA001ACF42 HTTP/1.1" 200 2962
129.22.82.8 - - [17/Oct/2002:21:25:44 -0400] "GET / HTTP/1.1" 200 2673
129.22.82.8 - - [17/Oct/2002:21:25:44 -0400] "GET /manual/images/apache_pb.gif HTTP/1.1" 404 302

That code produces the following output:

209.36.83.252 => 17/Oct/2002:05:53:17  /scripts/..
68.9.44.75 => 17/Oct/2002:06:50:34  /phpMyAdmin/
68.9.44.75 => 17/Oct/2002:06:50:36  /phpMyAdmin/left.php
160.79.211.121 => 17/Oct/2002:19:51:31  /default.ida
129.22.82.8 => 17/Oct/2002:20:37:10  /index.php
129.22.82.8 => 17/Oct/2002:21:25:44  /manual/images/apache_pb.gif

Ideally, I would not want the IPs repeated. I just want to see the last file accessed by each IP, so the output would look like:

209.36.83.252 => 17/Oct/2002:05:53:17  /scripts/..
68.9.44.75 => 17/Oct/2002:06:50:36  /phpMyAdmin/left.php
160.79.211.121 => 17/Oct/2002:19:51:31  /default.ida
129.22.82.8 => 17/Oct/2002:21:25:44  /manual/images/apache_pb.gif

But, to maintain my sorting by date, I key the hash by the Unix timestamp and not by the IP. Would I need to set up a dual-hash arrangement so I can sort by date but keep only one entry for each IP address? I just can't seem to wrap my head around this and wondered if any monks had some nifty ideas.

Thanks,
enoch

P.S. You gotta love those 'default.ida?NNNNNNN' entries.

Replies are listed 'Best First'.
Re: Parsing Apache Log to Get Most Recent File Access
by Ovid (Cardinal) on Nov 25, 2002 at 22:22 UTC
      I installed Apache::ParseLog and started playing around with it. Here is my quick mock-up code with it:

      #!/usr/bin/perl
      use warnings;
      use strict;
      use Apache::ParseLog;

      my $parseObj = new Apache::ParseLog();
      $parseObj = $parseObj->config(transferlog => '/usr/local/apache/logs/access_log');
      my $transferLog = $parseObj->getTransferLog();
      my %hosts = $transferLog->host();
      foreach my $key (keys %hosts) {
          print "$key => $hosts{$key}\n";
      }

      The problem is that none of the hashes you can access via its methods contains all three items I am looking for: IP, filename, and date. Here are all the methods that return information about the logs. None of them returns a hash with all the info I want. The two closest methods are host(), which returns a hash keyed by IP whose values are the total number of times that IP has come to the site, and hitbydatetime(), which returns a hash keyed by datetime stamp whose values are the number of hits at that time. But I can't cross-reference one with the other.

      enoch
Re: Parsing Apache Log to Get Most Recent File Access
by cLive ;-) (Prior) on Nov 26, 2002 at 01:29 UTC

    Untested, but looks right:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Date::Manip;
    use vars qw(%ipHash);

    my $order=0;
    while(<DATA>) {
        /
        ^((\d{1,3}\.){3}\d{1,3})              # grab the IP address into $1
        \s\-\s\-\s\[
        (\d\d\/\w{3}\/\d\d(\d\d\:){3}\d\d)    # grab the date into $3
        \s\-\d{4}\]\s"\w{1,4}\s
        ([\/|\w|\.|_]+)                       # grab the file path into $5
        /x;
        $ipHash{$1}{'order'} = $order++;
        $ipHash{$1}{'path'} = $5;
        $ipHash{$1}{'date'} = &UnixDate($3,"%s");
    }

    for (sort { $ipHash{$a}{'order'} <=> $ipHash{$b}{'order'}; } keys %ipHash) {
        print "$_ => $ipHash{$a}{'date'}\t$ipHash{$a}{'path'}\n";
    }

    cLive ;-)

    PS - if you're not doing so already, you might want to use tail to grab the end of the logfile
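
A minimal sketch of the tail suggestion, assuming a Unix-like system with the `tail` command on the PATH. The helper name `tail_lines` and the line count of 1000 are illustrative choices, not part of the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the last $count lines of $path by reading from the external
# `tail` command (assumed to be installed; not available everywhere).
sub tail_lines {
    my ($path, $count) = @_;
    # The list form of a pipe open avoids passing $path through a shell.
    open my $fh, '-|', 'tail', '-n', $count, $path
        or die "Cannot run tail on $path: $!";
    my @lines = <$fh>;
    close $fh or die "tail exited abnormally for $path";
    return @lines;
}

# Hypothetical usage: the log path is given on the command line.
if (@ARGV) {
    print tail_lines($ARGV[0], 1000);
}
```

The returned lines could then be fed to the parsing loop in place of `<DATA>`, so only the end of a large log is examined.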

Re: Parsing Apache Log to Get Most Recent File Access
by TStanley (Canon) on Nov 26, 2002 at 16:45 UTC
    You might also want to give Logfile::Apache a look. davorg gives some good coverage of this module in his book.

    TStanley
    --------
    It is God's job to forgive Osama Bin Laden. It is our job to arrange the meeting -- General Norman Schwartzkopf
Re: Parsing Apache Log to Get Most Recent File Access
by enoch (Chaplain) on Nov 26, 2002 at 19:36 UTC
    ++'s all around!

    cLive ;-)'s solution worked with a couple of minor tweaks. Here is the final working version.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Date::Manip;
    use vars qw(%ipHash);

    my $order=0;
    while(<DATA>) {
        /
        ^((\d{1,3}\.){3}\d{1,3})              # grab the IP address into $1
        \s\-\s\-\s\[
        (\d\d\/\w{3}\/\d\d(\d\d\:){3}\d\d)    # grab the date into $3
        \s\-\d{4}\]\s"\w{1,4}\s
        ([\/|\w|\.|_]+)                       # grab the file path into $5
        /x;
        $ipHash{$1}{'order'} = $order++;
        $ipHash{$1}{'path'} = $5;
        $ipHash{$1}{'date'} = $3;
        $ipHash{$1}{'timestamp'} = &UnixDate($3,"%s");
    }

    for (sort { $ipHash{$a}{'order'} <=> $ipHash{$b}{'order'}; } keys %ipHash) {
        print "$_ => $ipHash{$_}{'date'}\t$ipHash{$_}{'path'}\n";
    }

    Notice that only two things changed from cLive ;-)'s version: the $a's in the final print statement became $_, and the 'date' hash element now holds the nicely formatted date instead of the Unix timestamp, which moved to a new 'timestamp' element.

    Thanks, everyone!

    enoch
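
A variation on the final version, offered only as a sketch: it keys the hash by IP as above but sorts on the parsed timestamp rather than on read order, and it skips lines that fail to match. It makes several assumptions that are not in the thread: the core module Time::Local stands in for Date::Manip, the looser regex accepts any request method and keeps the query string in the path, and the names `apache_epoch`, `latest_per_ip`, and `ips_by_time` are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);

my %MONTH;
@MONTH{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (0 .. 11);

# Convert an Apache timestamp such as "17/Oct/2002:05:53:17 -0400"
# into epoch seconds, applying the numeric zone offset.
sub apache_epoch {
    my ($stamp) = @_;
    my ($d, $mon, $y, $h, $mi, $s, $sign, $oh, $om) =
        $stamp =~ m{^(\d\d)/(\w{3})/(\d{4}):(\d\d):(\d\d):(\d\d) ([+-])(\d\d)(\d\d)$}
        or return;
    return unless exists $MONTH{$mon};
    my $offset = ($oh * 3600 + $om * 60) * ($sign eq '-' ? -1 : 1);
    return timegm($s, $mi, $h, $d, $MONTH{$mon}, $y) - $offset;
}

# Keep only the newest request per IP; non-matching lines are skipped.
sub latest_per_ip {
    my (@lines) = @_;
    my %latest;
    for my $line (@lines) {
        my ($ip, $stamp, $path) =
            $line =~ m{^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)}
            or next;
        my $epoch = apache_epoch($stamp);
        next unless defined $epoch;
        # ">=" lets a later line win a tie within the same second.
        if (!$latest{$ip} || $epoch >= $latest{$ip}{epoch}) {
            $latest{$ip} = { epoch => $epoch, date => $stamp, path => $path };
        }
    }
    return \%latest;
}

# IPs ordered by the time of their most recent request, oldest first.
sub ips_by_time {
    my ($latest) = @_;
    return sort {
        $latest->{$a}{epoch} <=> $latest->{$b}{epoch} || $a cmp $b
    } keys %$latest;
}

# Usage with four of the log lines from the original question.
my @sample = (
    '68.9.44.75 - - [17/Oct/2002:06:50:34 -0400] "GET /phpMyAdmin/ HTTP/1.1" 200 898',
    '129.22.82.8 - - [17/Oct/2002:20:37:10 -0400] "GET /index.php HTTP/1.1" 200 25430',
    '68.9.44.75 - - [17/Oct/2002:06:50:36 -0400] "GET /phpMyAdmin/left.php?lang=en-iso-8859-1&convcharset=iso-8859-1&server=1 HTTP/1.1" 200 1024',
    '129.22.82.8 - - [17/Oct/2002:21:25:44 -0400] "GET / HTTP/1.1" 200 2673',
);
my $latest = latest_per_ip(@sample);
for my $ip (ips_by_time($latest)) {
    print "$ip => $latest->{$ip}{date}\t$latest->{$ip}{path}\n";
}
```

Sorting on the stored epoch value means the result does not depend on the log being read in chronological order, which the 'order' counter quietly assumes.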
