mkent has asked for the wisdom of the Perl Monks concerning the following question:
I'm a newbie with a deadline, so any help is appreciated. I need to write a program with a web interface where periods of time can be specified (last 2 hours, last 24 hours, last 2 hours and 15 mins) and then the web log is read to fetch all entries matching that time period, find the referrers and add them up to display the top 10 referrers, in order.
As a first step, I'm trying to pull out the time and http referrer from web log data, but it's not going well since the only way I can see to do it is to strip out the unwanted parts of the log line and then use the timelocal function to convert the log time to real time to match whatever math is done to the current time. Here's what I have so far as a test:
#!/usr/local/bin/perl
use CGI qw(:standard);
use CGI::Carp qw(fatalsToBrowser carpout);
use Time::Local;
print "Content-type: text/html\n\n";
#$time = timelocal($sec,$min,$hour,$mday,$mon,$year);
open LOGFILE, "datafile.html";
@log_data = <LOGFILE>;
foreach $log_line(@log_data) {
    $log_line =~ s/.*\[/ /;
    $log_line =~ s/"GET.*"h/ /;
    $log_line =~ s/".*/ /;
    print $log_line, "<p>";
}
The last $log_line does not work.
datafile.html contains data in this form (the date/times are enclosed in square brackets):
24.208.200.247 - - [10/Dec/2002:18:05:09 -0500] "GET /images/header_aod2_08.gif HTTP/1.0" 200 663 "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"
24.208.200.247 - - [10/Dec/2002:18:05:09 -0500] "GET /images/header_aod2_10.gif HTTP/1.0" 304 - "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"
24.208.200.247 - - [10/Dec/2002:18:05:09 -0500] "GET /images/storysearch2.gif HTTP/1.0" 200 142 "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"
Re: pulling by regex
by BrowserUk (Patriarch) on Dec 11, 2002 at 03:04 UTC
This may help get you started. Incorporating this into a CGI.pm script is left as an exercise for the reader. (Hint: there's not much point in use-ing CGI if you're going to produce the HTML yourself.)
Using Date::Manip makes the date calculation part easy (though the verbose but entirely opaque documentation has me gritting my teeth and banging my head every time). The regex I've used may not be robust, but there are plenty of other offers in this thread to choose from.
#! perl -slw
use strict;
use Date::Manip;
use Data::Dumper;
my $now = ParseDate( scalar localtime());
my $then = DateCalc( $now, ParseDateDelta( "7 hours 48 minutes ago" ));
my $err;
my $re = qr/
    ^.*?              # Skip the first part
    \[([^\]]+)\]\s+   # capture everything between []
    "[^"]+"\s+        # skip a quoted string and whitespace
    .*?               # and a couple of numbers or blanks
    "( [^"]+ )"       # capture the next quoted string
/x;
my %referrers;
while(<DATA>) {
    my @chunks = /$re/;
    my $ts = ParseDate $chunks[0];
    print "The line '@chunks' was logged ",
        Delta_Format( DateCalc( $ts, $now, \$err ), 2, ("%mt")),
        " minutes ago.";
    if ( Date_Cmp( $ts, $then ) > 0
        and Date_Cmp( $ts, $now ) < 0 ) {
        print "The previous line is within the window. Counting...";
        $referrers{$chunks[1]}++;
    }
    else {
        print "Discarding previous line";
    }
}
print "\nThese are the referrers counted:\n", Dumper(\%referrers);
__DATA__
24.208.200.247 - - [10/Dec/2002:18:05:09 -0500] "GET /images/header_aod2_08.gif HTTP/1.0" 200 663 "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"
24.208.200.247 - - [10/Dec/2002:18:08:13 -0500] "GET /images/header_aod2_10.gif HTTP/1.0" 304 - "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"
24.208.200.247 - - [10/Dec/2002:18:11:19 -0500] "GET /images/storysearch2.gif HTTP/1.0" 200 142 "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"
Produces
C:\test>218961
The line '10/Dec/2002:18:05:09 -0500 http://www.indystar.com/help/help/available.html' was logged 469.23 minutes ago.
Discarding previous line
The line '10/Dec/2002:18:08:13 -0500 http://www.indystar.com/help/help/available.html' was logged 466.17 minutes ago.
The previous line is within the window. Counting...
The line '10/Dec/2002:18:11:19 -0500 http://www.indystar.com/help/help/available.html' was logged 463.07 minutes ago.
The previous line is within the window. Counting...
These are the referrers counted:
$VAR1 = {
'http://www.indystar.com/help/help/available.html' => '2'
};
Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
Pick up your cloud down the end and "Yes" if you get allocated a grey one they are a bit damp under foot, but someone has to get them.
Get used to the wings fast cos its an 8 hour day...unless the Govenor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.
BrowserUk, I don't think I quite understand your code. I modified it to read my data, but it looks like I don't have it quite right:
#!/usr/local/bin/perl -slw
use strict;
use Date::Manip;
use Data::Dumper;
my $now = ParseDate( scalar localtime());
print "now is $now<p>";
my $then = DateCalc( $now, ParseDateDelta( "7 hours 48 minutes ago" ));
my $err;
open LOGFILE, "datafile.html" || die "Can't open file";
my $re = qr/
    ^.*?              # Skip the first part
    \[([^\]]+)\]\s+   # capture everything between []
    "[^"]+"\s+        # skip a quoted string and whitespace
    .*?               # and a couple of numbers or blanks
    "( [^"]+ )"       # capture the next quoted string
/x;
my %referrers;
while(<LOGFILE>) {
    my @chunks = /$re/;
    my $ts = ParseDate $chunks[0];
    print "The line '@chunks' was logged ",
        Delta_Format( DateCalc( $ts, $now, \$err ), 2, ("%mt")),
        " minutes ago.";
    if ( Date_Cmp( $ts, $then ) > 0
        and Date_Cmp( $ts, $now ) < 0 ) {
        print "The previous line is within the window. Counting...";
        $referrers{$chunks[1]}++;
    }
    else {
        print "Discarding previous line";
    }
}
print "\nThese are the referrers counted:\n", Dumper(\%referrers);
datafile.html contains (in part):
68.22.179.211 - - [15/Dec/2002:14:52:12 -0500] "GET /scripts/s_code.js HTTP/1.1"
304 - "http://www.indystar.com/print/articles/6/008596-6466-040.html" "Mozilla/
4.0 (compatible; MSIE 5.5; Windows 98)"
152.163.188.37 - - [15/Dec/2002:14:52:12 -0500] "GET /icons/unknown.gif HTTP/1.1
" 200 245 "http://www.indystar.com/print/articles/?S=D" "Mozilla/4.0 (compatible
; MSIE 5.5; AOL 7.0; Windows 98)"
68.22.179.211 - - [15/Dec/2002:14:52:12 -0500] "GET /images/white_159x60.gif HTT
P/1.1" 304 - "http://www.indystar.com/print/articles/6/008596-6466-040.html" "Mo
zilla/4.0 (compatible; MSIE 5.5; Windows 98)"
141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /print/articles/2/008227-9
652-031.html HTTP/1.0" 200 7275 "http://www.fark.com/" "Mozilla/4.79 [en] (Windo
ws NT 5.0; U)"
68.22.179.211 - - [15/Dec/2002:14:52:13 -0500] "GET /images/black_1x60.gif HTTP/
1.1" 304 - "http://www.indystar.com/print/articles/6/008596-6466-040.html" "Mozi
lla/4.0 (compatible; MSIE 5.5; Windows 98)"
68.22.179.211 - - [15/Dec/2002:14:52:13 -0500] "GET /images/69.gif HTTP/1.1" 200
1348 "http://www.indystar.com/print/articles/6/008596-6466-040.html" "Mozilla/4
.0 (compatible; MSIE 5.5; Windows 98)"
141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/header_aod2_01.gif
HTTP/1.0" 200 2011 "http://www.indystar.com/print/articles/2/008227-9652-031.ht
ml" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/header_aod2_15.gif
HTTP/1.0" 200 4162 "http://www.indystar.com/print/articles/2/008227-9652-031.ht
ml" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/header_aod2_10.gif
HTTP/1.0" 200 3034 "http://www.indystar.com/print/articles/2/008227-9652-031.ht
ml" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/go_blue.gif HTTP/1
.0" 200 133 "http://www.indystar.com/print/articles/2/008227-9652-031.html" "Moz
illa/4.79 [en] (Windows NT 5.0; U)"
141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/aod_searchend2.gif
HTTP/1.0" 200 186 "http://www.indystar.com/print/articles/2/008227-9652-031.htm
l" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
24.79.125.220 - - [15/Dec/2002:14:52:13 -0500] "GET /images/coheader2_aod_08.gif
HTTP/1.1" 304 - "http://www.indystar.com/forums/showthread.php?s=&postid=177044
" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)"
24.79.125.220 - - [15/Dec/2002:14:52:13 -0500] "GET /images/coheader2_aod_10.gif
HTTP/1.1" 304 - "http://www.indystar.com/forums/showthread.php?s=&postid=177044
" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)"
141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/email.gif HTTP/1.0
" 200 138 "http://www.indystar.com/print/articles/2/008227-9652-031.html" "Mozil
la/4.79 [en] (Windows NT 5.0; U)"
66.149.178.96 - - [15/Dec/2002:14:52:14 -0500] "GET /forums/showthread.php?s=&po
stid=177042 HTTP/1.1" 200 7302 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1
.0.1) Gecko/20021003"
24.79.125.220 - - [15/Dec/2002:14:52:14 -0500] "GET /images/coheader2_aod_11.gif
HTTP/1.1" 200 954 "http://www.indystar.com/forums/showthread.php?s=&postid=1770
44" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)"
Edit: Added <code> tags. Escaped [s and ]s. larsen
Hi. Please read Site How To before you submit code next time and save the editors and yourself a lot of work. Thanks.
I just appended the data lines from above to the end of the code I gave you in "Re: pulling by regex", and it parsed them correctly.
Then I looked at your version of the code and noticed this:
open LOGFILE, "datafile.html" || die "Can't open file";
The problem with this line is that, because there are no parentheses around the arguments to open and || has relatively high precedence, it is parsed as
open( LOGFILE, ("datafile.html" || die "Can't open file") );
Because the first operand of || (the string "datafile.html") is always true, the second operand (die "Can't open file") is never evaluated. Even if the open fails (because the input file does not exist, is not in the current directory, etc.), you will never see any error message. Could this be your problem?
The fix is to use either
open(LOGFILE, "datafile.html") || die "Can't open file: $!";
or
open LOGFILE, "datafile.html" or die "Can't open file: $!";
Please also note the inclusion of $! in the error message. This tells you why the open failed, not just that it did. See Error Indicators for further details.
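To see the precedence difference in isolation, here is a minimal sketch. It assumes a file name that does not exist (the name is made up for the demonstration) and uses a lexical filehandle with three-argument open rather than the bareword LOGFILE from the post. The eval blocks are only there to catch the die.

```perl
use strict;
use warnings;

# Hypothetical file name, assumed not to exist.
my $file = "no-such-file-$$.log";

# '||' binds tighter than the comma: die is attached to $file, which is
# always true, so a failed open goes unnoticed.
my $high = eval { open my $fh, '<', $file || die "never reached\n"; 1 };

# 'or' binds looser than the comma: die is attached to the result of open.
my $low = eval { open my $fh, '<', $file or die "Can't open $file: $!\n"; 1 };
my $error = $@;

print "||: ", ( $high ? "failure went unnoticed" : "died" ), "\n";
print "or: ", ( $low  ? "failure went unnoticed" : "died: $error" );
```

If the file really is missing, the || form should report nothing, while the or form dies with the reason taken from $!.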
The second thing I noted was the name of the file: "datafile.html"?? If this is a logfile, why is it named .html? If the file contains HTML tags, the regex supplied will not parse the data.
You're not by any chance viewing and saving the logs via a web interface, are you? If so, you need to cut and paste from the screen to a file, or use "Save as...type *.txt" if your browser has that option, in order to remove the HTML tags from the file.
If that doesn't explain and allow you to fix the problem come back and post the error message or otherwise describe what you are seeing (eg. No output, wrong output, etc).
No need to re-post the code or data again unless it has changed substantially.
Good luck.
Examine what is said, not who speaks.
Re: pulling by regex
by Enlil (Parson) on Dec 11, 2002 at 01:04 UTC
I am not all that certain what you mean by "the last $log_line does not work". It seems that you are getting the information you want, except that the h is missing from your URLs; it is removed by the line: $log_line =~ s/"GET.*"h/ /;
One thing I would advise is that, instead of stripping everything around what you want, you take some time to look over perlre to get a better grasp of regular expressions, and then capture what you want from the lines instead. For instance, you are overusing the dot star; in most cases you would be better off putting a ? after it, so that it does not match all the way to the end of the line and then backtrack until it finds a match.
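A small sketch of the greedy versus non-greedy difference, using a made-up, shortened log fragment (the example.com referrer is an assumption, not taken from the logs in this thread):

```perl
use strict;
use warnings;

# Shortened, invented log fragment with three quoted fields.
my $line = '"GET /a.gif HTTP/1.0" 200 663 "http://example.com/ref.html" "Mozilla/4.0"';

# Greedy: .* runs to the end of the line, then backs up to the last quote.
my ($greedy) = $line =~ /"(.*)"/;

# Non-greedy: .*? stops at the first closing quote it can use.
my ($lazy) = $line =~ /"(.*?)"/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture swallows all three quoted fields, while the non-greedy one returns only the request string.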
Anyhow, I might do something like the following inside the for loop:
foreach my $log_line(@log_data) {
    my ($date_string,$referrer) = ($log_line =~ /\[([^\]]+)\] "[^"]+"[^"]+"([^"]+)"/);
    print "$date_string,$referrer<P>\n";
}
This, as I mentioned, gets what I want and nothing else. (I am making some assumptions about the rest of your data, but based on what you have shown it should work.) <rant>You should be using strict as well.</rant> -enlil
Re: pulling by regex
by petral (Curate) on Dec 11, 2002 at 01:01 UTC
Not sure what's wrong with the last $log_line; it works for me. Another way to approach it is to capture the parts you do want:
$log_line =~ /\[([^]]+)\] "[^"]+" [^"]+ "([^"]+)"/;
print "$1 $2<p>";
p
Re: pulling by regex
by Abigail-II (Bishop) on Dec 11, 2002 at 10:49 UTC
So, you are reading your weblogs over and over, once for each
request? That's not very efficient. Why not dump the log data
into a database, and query the database? You could have the
database do most of the work, including finding the top 10.
Abigail
On reflection, that's a good idea, using MySQL. But wouldn't it waste time overwriting the same database each time the script is called, since there would be no point in keeping the old data?
As I would envision this, translate to a date string plus the referrer and send them both to MySQL in two fields. Then process the input from the web page and use that information to pull from the database. What do you think?
I think dumping the information from the log into a database
each time the script is run is pretty stupid, and defeats
the benefits. What's in the database is in the database, and
doesn't have to be inserted again. Just dump the new
logs to a database on a regular basis.
Abigail
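One way to pick up only the new log entries on each run, sketched in plain Perl: remember the byte offset where the previous run stopped and seek to it next time. The helper name read_new_lines and the temporary file are hypothetical, and the database insert itself is left out. This is an illustration of the idea only, not a method spelled out in the post above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return the lines added to $log since byte $offset,
# plus the new offset to remember for the next run.
sub read_new_lines {
    my ($log, $offset) = @_;
    open my $fh, '<', $log or die "Can't open $log: $!";
    my $size = -s $fh;
    $offset = 0 if $offset > $size;    # file shrank: rotated or truncated
    seek $fh, $offset, 0 or die "Can't seek in $log: $!";
    my @lines = <$fh>;
    my $new_offset = tell $fh;
    close $fh;
    return ( \@lines, $new_offset );
}

# Demonstration with a temporary file standing in for the web log.
my ($out, $log) = tempfile( UNLINK => 1 );
print {$out} "first\nsecond\n";
close $out or die "Can't close $log: $!";

my ($batch1, $pos1) = read_new_lines( $log, 0 );

# Simulate the server appending another entry between runs.
open my $append, '>>', $log or die "Can't append to $log: $!";
print {$append} "third\n";
close $append or die "Can't close $log: $!";

my ($batch2, $pos2) = read_new_lines( $log, $pos1 );

print "first run:  ", scalar @$batch1, " lines\n";
print "second run: ", scalar @$batch2, " line(s): @$batch2";
```

In a real script the offset would have to be saved somewhere between runs (a small state file, for instance); that part is left as an assumption here.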
Hey, guys, thanks!!! This is a wonderful resource, and I incorporated some suggestions into the revised script below. I still have some questions, though!
BrowserUk, I decided against using Date::Manip even though I really like that module. That's because the module's documentation warns that it's slower than other time modules, and this script will be used most often when the web server is overloaded with requests; thus, speed is essential.
Abigail-II, a database would be nice, but the server is producing regular logs, so that's what I have to use.
In the following script, here are my questions:
1) Using strict produces errors that I don't have a global module loaded; what module is that?
2) The simulated $month switch statement doesn't work as expected; instead of values 0 through 11, it gives everything a value of 1. I need the month as a number so that timelocal gives an accurate result.
3) At the end, I pack all the referrers into an array; what I need to do is count each unique referrer URL, so that www.you.com is counted x times and www.me.com is counted y times. Then I can tell the top referrers in the time period stipulated by the web page (which just has hours and minutes to enter). That will let me create output like
www.you.com 22
www.me.com 19
etc
How can I count an unknown value and produce this output? And is an array the best way to do it?
Any and all ideas welcome, and thanks in advance. I really appreciate the help!
Here's the script, followed by some raw log data:
#!/usr/local/bin/perl
#use strict;
use CGI qw(:standard);
use CGI::Carp qw(fatalsToBrowser carpout);
use Time::Local;
# Grab information returned by web page
$hour = param ("hour");
$minute = param ("minute");
# Allow perl to write to browser window
print "Content-type: text/html\n\n";
# Current time in seconds
$now = time;
# Convert submitted time to seconds
$compare_time = ($hour * 3600) + ($minute * 60);
# Times extracted by logs must be >= to $target
$target = $now - $compare_time;
open LOGFILE, "datafile.html" || die "Can't open file";
@log_data =<LOGFILE>;
# Grab useful information from each line of the web log
foreach $log_line(@log_data) {
    # Grab date/time and referer
    ($date_string, $referrer) = ($log_line =~ /\[([^\]]+)\] "[^"]+"[^"]+"([^"]+)"/);
    # Replace / and : with spaces
    $date_string =~ s!/! !g;
    $date_string =~ s!:! !g;
    # Dump junk at end of line
    $date_string =~ s! -[0-9]+!!;
    # Split date/time into useful information
    ($day, $month, $year, $hhour, $min, $sec) = split(' ', $date_string);
    # Convert month from text to number
    if ($month == 'Jan') {$month = 0}
    elsif ($month == 'Feb') {$month = 1}
    elsif ($month == 'Mar') {$month = 2}
    elsif ($month == 'Apr') {$month = 3}
    elsif ($month == 'May') {$month = 4}
    elsif ($month == 'Jun') {$month = 5}
    elsif ($month == 'Jul') {$month = 6}
    elsif ($month == 'Aug') {$month = 7}
    elsif ($month == 'Sep') {$month = 8}
    elsif ($month == 'Oct') {$month = 9}
    elsif ($month == 'Nov') {$month = 10}
    else {$month = 11}
    # Calculate time on the log line in seconds
    $log_time = timelocal($sec,$min,$hhour,$day,$month,$year);
    if ($log_time >= $target) {
        push @refers, $referrer;
    }
}
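On question 2: == compares numbers, and a string such as 'Dec' counts as 0 in numeric context, so the first test succeeds whatever the month is; eq is the string comparison. A hash lookup avoids the chain of tests altogether. The sketch below is one possible approach, not the poster's code: log_time_to_epoch is a hypothetical helper, the timestamp is treated as local time (the -0500 offset is ignored, as in the script above), and the variables are declared with my, which is what use strict asks for.

```perl
use strict;
use warnings;
use Time::Local;

# Map month abbreviations to the 0-based month numbers timelocal expects.
my %month_num;
@month_num{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = 0 .. 11;

# Hypothetical helper: convert a log timestamp to epoch seconds,
# treating the clock time as local time. Returns nothing on bad input.
sub log_time_to_epoch {
    my ($stamp) = @_;
    my ($day, $mon, $year, $hour, $min, $sec) =
        $stamp =~ m{^(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)}
        or return;
    return unless exists $month_num{$mon};
    return timelocal( $sec, $min, $hour, $day, $month_num{$mon}, $year );
}

my $epoch = log_time_to_epoch('12/Dec/2002:18:39:15 -0500');
print scalar localtime($epoch), "\n";
```

The returned value can be compared directly with a cutoff such as time minus the requested number of seconds.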
Some data:
216.45.43.42 - - [12/Dec/2002:18:39:15 -0500] "GET /news/opinions/varvel.gif HTTP/1.1" 302 313 "http://www.freerepublic.com/forum/a3a95ca3c24a0.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"
12.222.75.65 - - [12/Dec/2002:18:39:15 -0500] "GET /images/header_aod2_15.gif HTTP/1.1" 200 4162 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
12.222.75.65 - - [12/Dec/2002:18:39:15 -0500] "GET /images/storysearch2.gif HTTP/1.1" 200 142 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
12.222.75.65 - - [12/Dec/2002:18:39:15 -0500] "GET /users/ads/misc/remax_searchad3.gif HTTP/1.1" 200 2335 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/sports_03_aod.gif HTTP/1.1" 200 3195 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/email.gif HTTP/1.1" 200 138 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/print.gif HTTP/1.1" 200 139 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/sidelinksend2.gif HTTP/1.1" 200 1009 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/pics2/image-007735-7410.jpg HTTP/1.1" 200 18319 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/advertisement_250strip.gif HTTP/1.1" 200 238 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
12.222.75.65 - - [12/Dec/2002:18:39:17 -0500] "GET /users/ads/story/macselect/macselect_250_Oct.gif HTTP/1.1" 200 10436 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
update (broquaint): changed <pre> tags to <code> tags
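For question 3, a hash keyed on the referrer is the usual way to count values that are not known in advance. A minimal sketch follows; the URLs in @refers are made up to stand in for the referrers collected by the script, and the variable names are illustrative.

```perl
use strict;
use warnings;

# Invented referrers standing in for the values pushed onto @refers.
my @refers = qw(
    http://www.you.com/ http://www.me.com/ http://www.you.com/
    http://www.you.com/ http://www.me.com/ http://www.other.com/
);

# One hash key per distinct referrer; the value is how often it was seen.
my %count;
$count{$_}++ for @refers;

# Highest count first; ties broken alphabetically so the order is stable.
my @ranked = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

# Keep at most the first ten entries.
my $last = $#ranked < 9 ? $#ranked : 9;
my @top  = @ranked[ 0 .. $last ];

print "$_ $count{$_}\n" for @top;
```

The intermediate array is not strictly needed: the counting line could sit where the script currently does its push, so the log is only walked once.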