Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

Re: Pulling by regex II

by PhiRatE (Monk)
on Dec 13, 2002 at 11:11 UTC ( #219547=note: print w/ replies, xml ) Need Help??


in reply to Pulling by regex II

Right. Lets get down to it. You (and I say you in a friendly, you're-not-the-only-one-everyone-does-this way) have a problem of bad priorities, so, lets list the priorities you have stated so far:

  • Tight deadline
  • Performance
  • Working code

Now, a few of these don't match. Even more don't add up when you add priorities that you *should* have, notably "secure".

So, lets detail some of the issues we have with the code thus far:

  • Reading in entire log file (into ram no less) every time the script is run
  • No use of taint mode
  • No use of warnings
  • Bad choice of time comparison algorithm
  • Iffy regexen

Under these circumstances, without access to help, you should be getting your priorities in order. While it depends on your particular position, the following is a good start:

  1. Working code
  2. Working, secure code
  3. Working, secure code by deadline
  4. Working, secure, maintainable code by deadline
  5. Working, secure, maintainable code that performs well, by deadline

What does this mean? Well, the first thing it means is that if a well known, already-debugged module is around that can do any part of the problem you're trying to solve, use it. Performance is waaay down on the list, just use it. If it really turns out too slow in use, *then* you worry about it.

The second thing it means is a quick investigation of language features that can help you out with debugging and security. Taint mode, Strict and Warnings should all be on, and you shouldn't move an inch until your code is working with all of them.

Now, if you had done this, most of the code above, that isn't working, wouldn't even exist. You'd have something a fraction of the size, that worked fine. Maybe it'd be a bit slow, maybe not, but if it was, you'd be posting to perlmonks saying "Hey, here's my working code, how can I make it go a bit faster", and then people like Abigail-II would have been able to tell you how exactly a database would aid your cause, even when the logs are in text file format. But we're not there yet.

#!/usr/bin/perl -t use strict; use warnings; use Date::Manip; use CGI qw/:standard/; # Make sure neither we, nor any of our submodules compromise security # by calling unpathed programs. $ENV{PATH} = "/bin:/usr/bin"; $ENV{IFS}=""; # Use CGI to print our headers print header; my %referers = (); # Retrieve and security-check parameters my $hour = param('hour'); my $minute = param('minute'); if ($hour !~ /^\d\d?$/) { die('Invalid hour'); } if ($minute !~ /^\d\d?$/) { die('Invalid minute'); } # Get date object for our check point my $check_date = ParseDate("${hour}hours ${minute}minutes ago"); # File handling, one line at a time open(FH,"file") || die('Could not open log file'); while (my $line = <FH>) { next if ($line !~ /^\S+ \S \S (\S+) \S+ "[^"]+" \d+ \d+ "([^"]+)"/ +); my $line_date = ParseDate($1); # Check to see if the line date is in the range we're after next unless Date_Cmp($line_date, $check_date)>0; # If the referer is new, we set to 1 entry, otherwise increment (i +ncrementing undef doesn't work well) if (!$referers{$2}) { $referers{$2}=1; } else { $referers{$2}++; } } close(FH); my $row = 0; # Sort our referers by the number of hits for (sort {$referers{$b} <=> $referers{$a}} keys %referers) { # break out after the tenth one last if $row++==10; print "$_: ".$referers{$_}."\n"; }

This is by no means perfect, I didn't have enough log examples to work from to be sure the regex will work well, but it contains all the properties of a good first effort. All the necessary warnings and taints turned on, the relevant modules used (there is, I think, an apache log parsing module too, but I wasn't sure you were using apache), all user input is verified and no more code than strictly necessary is used.

The end result...no weird month name/number conversions, about 30 seconds of debugging, and a much better product. I highly recommend, prior to your next project, sitting down and listing your priorities as above, it will help focus you on what really matters first, and what can be tweaked second.


Comment on Re: Pulling by regex II
Download Code
Re: Re: Pulling by regex II
by PhiRatE (Monk) on Dec 13, 2002 at 11:21 UTC
    I note that my regex is not entirely correct, my cut&paste of the logs that I used for testing did not contain the square brackets, not sure how that happened, regex should be:

    /^\S+ \S \S \[(\S+) \S+\] "[^"]+" \d+ \d+ "([^"]+)"/
    I believe
Re: Re: Pulling by regex II
by mkent (Acolyte) on Dec 14, 2002 at 23:07 UTC
    Thanks. I've installed Date::Manip and tried to run your code as a test, but I keep getting "Premature end of script headers" errors. I've checked the path on my server but can't figure out why it's generating these errors. Can you tell me why? Thanks.
      As a hunch, you may need to try switching off taint (the -t switch). Date::Manip may not be handling taint correctly, other than that, no reason that I can think of.
        Thank you. Turns out on my system, taint is a capital T. I also added the extra lines after header as suggested by DapperDan. Here's the resulting code plus some recent data, but all I get as output is "-: 1". Not sure why.

        #!/usr/local/bin/perl -slwT
        
        use strict;
        use warnings;
        use Date::Manip;
        use CGI qw/:standard/;
        
        # Make sure neither we, nor any of our submodules compromise security
        # by calling unpathed programs.
        $ENV{PATH} = "/bin:/usr/bin";
        $ENV{IFS}="";
        
        # Use CGI to print our headers
        print header, "\n\n";
        
        my %referers = ();
        
        # Retrieve and security-check parameters
        my $hour = param('hour');
        my $minute = param('minute');
        
        if ($hour !~ /^\d\d?$/) { die('Invalid hour'); }
        if ($minute !~ /^\d\d?$/) { die('Invalid minute'); }
        
        # Get date object for our check point
        my $check_date = ParseDate("${hour}hours ${minute}minutes ago");
        
        # File handling, one line at a time
        open(FH,"datafile.html") || die('Could not open log file');
        while (my $line = <FH>) {
        
            next if ($line !~ /^\S+ \S \S \(\S+) \S+\ "^"+" \d+ \d+ "(^"+)"/);
            
            my $line_date = ParseDate($1);
        
            # Check to see if the line date is in the range we're after
            next unless Date_Cmp($line_date, $check_date)>0;
        
            # If the referer is new, we set to 1 entry, otherwise increment (incrementing undef doesn't work well)
            if (!$referers{$2}) {
                $referers{$2}=1;
            } else {
                $referers{$2}++;
            }
        }
        close(FH);
        
        my $row = 0;
        
        # Sort our referers by the number of hits
        for (sort {$referers{$b} <=> $referers{$a}} keys %referers) {
            # break out after the tenth one
            last if $row++==10;
            print "$_: ".$referers{$_}."\n";
        }
        

        Recent data:

        68.22.179.211 - - 15/Dec/2002:14:52:13 -0500 "GET /images/69.gif HTTP/1.1" 200 1348 "http://www.indystar.com/print/articles/6/008596-6466-040.html" "Mozilla/4 .0 (compatible; MSIE 5.5; Windows 98)"
        141.154.123.193 - - 15/Dec/2002:14:52:13 -0500 "GET /images/header_aod2_01.gif HTTP/1.0" 200 2011 "http://www.indystar.com/print/articles/2/008227-9652-031.ht ml" "Mozilla/4.79 en (Windows NT 5.0; U)"
        141.154.123.193 - - 15/Dec/2002:14:52:13 -0500 "GET /images/header_aod2_15.gif HTTP/1.0" 200 4162 "http://www.indystar.com/print/articles/2/008227-9652-031.ht ml" "Mozilla/4.79 en (Windows NT 5.0; U)"
        141.154.123.193 - - 15/Dec/2002:14:52:13 -0500 "GET /images/header_aod2_10.gif HTTP/1.0" 200 3034 "http://www.indystar.com/print/articles/2/008227-9652-031.ht ml" "Mozilla/4.79 en (Windows NT 5.0; U)"
        141.154.123.193 - - 15/Dec/2002:14:52:13 -0500 "GET /images/go_blue.gif HTTP/1 .0" 200 133 "http://www.indystar.com/print/articles/2/008227-9652-031.html" "Moz illa/4.79 en (Windows NT 5.0; U)"
        141.154.123.193 - - 15/Dec/2002:14:52:13 -0500 "GET /images/aod_searchend2.gif HTTP/1.0" 200 186 "http://www.indystar.com/print/articles/2/008227-9652-031.htm l" "Mozilla/4.79 en (Windows NT 5.0; U)"
        24.79.125.220 - - 15/Dec/2002:14:52:13 -0500 "GET /images/coheader2_aod_08.gif HTTP/1.1" 304 - "http://www.indystar.com/forums/showthread.php?s=&postid=177044 " "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)"
        24.79.125.220 - - 15/Dec/2002:14:52:13 -0500 "GET /images/coheader2_aod_10.gif HTTP/1.1" 304 - "http://www.indystar.com/forums/showthread.php?s=&postid=177044 " "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)"
        141.154.123.193 - - 15/Dec/2002:14:52:13 -0500 "GET /images/email.gif HTTP/1.0 " 200 138 "http://www.indystar.com/print/articles/2/008227-9652-031.html" "Mozil la/4.79 en (Windows NT 5.0; U)"
        66.149.178.96 - - 15/Dec/2002:14:52:14 -0500 "GET /forums/showthread.php?s=&po stid=177042 HTTP/1.1" 200 7302 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1 .0.1) Gecko/20021003"
        24.79.125.220 - - 15/Dec/2002:14:52:14 -0500 "GET /images/coheader2_aod_11.gif HTTP/1.1" 200 954 "http://www.indystar.com/forums/showthread.php?s=&postid=1770 44" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)"

      It's a problem with the output of the script; which is more than likely quite easy to fix. You print your headers using a call to CGI.pm (see the print header; line?), followed by a blank line, and then the page data (<html> etc.).

      You get the "premature end of script headers error" if you didn't output the blank line. That means your problem may be something to do with your print header; line.

      Try reading the CGI spec, the CGI.pm perldoc , and try again.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://219547]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (12)
As of 2014-07-11 15:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (230 votes), past polls