Splitting Apache Log Files

cmm7825 has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a script that will split parts of a log file to different output files if the line matches certain keywords or regex. First the script reads in the user defined keywords/regex and name of the output files. Here is an example rule file:

## This is a comment!
OPTIONS --> OPTIONS.txt
GET --> GET.txt
REGEX: ^124\.40\.\d{1,3}\.\d{1,3} --> REGEX.txt
::default:: --> default.txt

## marks a comment, lines starting with REGEX: are regexes, and anything that doesn't match a keyword or rule gets placed in a default file. I tested this on 500 MB log files and it takes about 1 min 20 sec to execute. I'm looking to use this on much bigger logs and was wondering whether it could be optimized any further. I'm aware it's better to seek the file on disk, but I'd like to use STDIN since this script will most likely be used at the end of a long pipe:

#! /usr/bin/perl
use strict;
use warnings;

my $default;
my %rules;

open INFILE, shift || "split_rules.txt" or die $!;
while(<INFILE>)
{
        unless(m/^##/)
        {
                if(m/::default:: --> (\S+)/)
                {
                        $default = $1;
                        open DEFAULT, ">", $default or die $!;
                }
                elsif(m/REGEX: (\S+) --> (\S+)/)
                {
                        $rules{qr/$1/} = $2;
                }
                elsif(m/(\S+) --> (\S+)/)
                {
                        my $string = quotemeta($1);
                        $rules{qr/$string/} = $2;
                }
                else
                {
                        die "$0: Syntax Error!\n";
                }
        }
}
close INFILE;

foreach my $rule (keys %rules)
{
        open(my $fh, ">", $rules{$rule}) || die $!;
        $rules{$rule} = $fh;
}

while(my $line = <STDIN>)
{
        study $line;
        my $match = 0;
        foreach my $rule (keys %rules)
        {
                if($line =~ /$rule/)
                {
                        $match=1;
                        print {$rules{$rule}}  $line;
                }
        }
        if(defined($default) && $match!=1)
        {
                print DEFAULT $line;
        }
}

foreach my $rule (keys %rules)
{
        close $rules{$rule};
}

if(defined($default))
{
        close DEFAULT;
}

Thanks! UPDATE: crashtest pointed out that using a compiled regex as a hash key converts it back to a string. My execution time went down to 22 seconds. Thanks!
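
The stringification mentioned in the update is easy to check in isolation. The following is a minimal sketch with a made-up pattern and variable names (none of them come from the script above): a qr// value is a Regexp object, but hash keys are always plain strings, so the key returned by keys has lost its compiled form even though it still works as a pattern.

```perl
use strict;
use warnings;

# Illustrative pattern; any qr// object behaves the same way.
my $re = qr/^124\.40\./;

# Hash keys are always strings, so the Regexp object is stringified here.
my %rules = ( $re => 'REGEX.txt' );
my ($key) = keys %rules;

my $ref_before = ref $re;     # class name of the compiled regex object
my $ref_after  = ref $key;    # empty string: the key is no longer a reference

print "before: '$ref_before', after: '$ref_after'\n";

# The string is still a valid pattern, but it has to be compiled again to be used.
print "key still matches\n" if '124.40.1.2 - - "GET /"' =~ /$key/;
```

This is why the original loop still produced correct output: only the compiled form was lost, not the pattern itself.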

Re: Splitting Apache Log Files by ig (Vicar) on Apr 26, 2010 at 17:04 UTC
It looks like you are recompiling each regular expression once for each line of input to be scanned. Compiling regular expressions can be expensive, so performance might be improved if you pre-compile all the regular expressions. http://www.stonehenge.com/merlyn/UnixReview/col28.html provides a nice introduction to some of the options for compiling regular expressions.
Re^2: Splitting Apache Log Files by cmm7825 (Novice) on Apr 26, 2010 at 18:19 UTC
Thanks for the link. When I read in the keywords, I use the qr// operator. It was my understanding that this compiles the regular expression.
Re^3: Splitting Apache Log Files by crashtest (Curate) on Apr 26, 2010 at 19:01 UTC
`qr//` is a regular expression quote, and as such does, in a sense, compile regular expressions. Unfortunately, you're using the regular expression as a hash key, at which point it's turned back into a string. As you process the Apache log file, `$rule` is just a string. When you use it as a regular expression, it has to be compiled again - each time through the loop. If I were writing your code, I would store the regular expression rules/filehandles in an array. Here's a sketch of what it might look like:

my @rules; # not %rules.
...
# Process input file of processing rules
while(<INFILE>) {
    ...
    push @rules, { regex => qr/$string/, file_handle => $fh };
}
...
# Read Apache log file and print to various other files
while (my $line = <STDIN>) {
    for my $rule_ref (@rules){
        my $regex = $rule_ref->{regex};
        my $fh    = $rule_ref->{file_handle};
        if ($line =~ $regex) {
            print $fh $line;
        }
    }
}

Hope this helps.
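
For readers who want to try the array-based approach end to end, here is a self-contained sketch along the same lines. It is not the code cmm7825 ended up with: the sub name split_lines, the sample log lines, and the use of in-memory filehandles (so nothing is written to disk) are all assumptions made for illustration. The default-file behaviour from the original script is kept.

```perl
use strict;
use warnings;

# Hypothetical helper: send each input line to every rule that matches it,
# or to the default handle when no rule matches.
sub split_lines {
    my ($in, $rules, $default) = @_;
    while (my $line = <$in>) {
        my $matched = 0;
        for my $rule (@$rules) {
            # $rule->{regex} is a qr// object, so it is not recompiled here.
            if ($line =~ $rule->{regex}) {
                print { $rule->{file_handle} } $line;
                $matched = 1;
            }
        }
        print {$default} $line if defined $default && !$matched;
    }
}

# Made-up sample input, read through an in-memory filehandle.
my $log = join '',
    qq{124.40.1.2 - - "GET /index.html HTTP/1.1" 200\n},
    qq{10.0.0.1 - - "OPTIONS * HTTP/1.1" 200\n},
    qq{10.0.0.2 - - "POST /form HTTP/1.1" 302\n};
open my $in, '<', \$log or die $!;

# In-memory output "files" so the sketch leaves no files behind.
my ($get_out, $regex_out, $default_out) = ('', '', '');
open my $get_fh,     '>', \$get_out     or die $!;
open my $regex_fh,   '>', \$regex_out   or die $!;
open my $default_fh, '>', \$default_out or die $!;

# Rules live in an array, so the compiled regexes are never used as hash keys.
my @rules = (
    { regex => qr/\QGET\E/,                    file_handle => $get_fh },
    { regex => qr/^124\.40\.\d{1,3}\.\d{1,3}/, file_handle => $regex_fh },
);

split_lines($in, \@rules, $default_fh);
close $_ for $in, $get_fh, $regex_fh, $default_fh;

print "GET:\n$get_out", "REGEX:\n$regex_out", "default:\n$default_out";
```

In a real run the in-memory handles would be replaced by handles opened on the output files named in the rule file, and $in by STDIN. An array also keeps the rules in the order they were read, which a hash does not guarantee.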
Re^4: Splitting Apache Log Files by cmm7825 (Novice) on Apr 26, 2010 at 20:16 UTC
Re^5: Splitting Apache Log Files by BrowserUk (Patriarch) on Apr 26, 2010 at 21:02 UTC
Re^5: Splitting Apache Log Files by Marshall (Canon) on Apr 27, 2010 at 17:51 UTC
Re: Splitting Apache Log Files by crashtest (Curate) on Apr 26, 2010 at 17:33 UTC
If you think you need to optimize your code, have a look at Devel::NYTProf, a profiler which can tell you how much time is spent on each statement in your code. Then you can measure the difference as you make changes and, hopefully, improvements.
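
Devel::NYTProf itself is driven from the command line (perl -d:NYTProf script.pl, then nytprofhtml to build the report). For the narrower job of measuring one change against another, the core Benchmark module is a different but complementary tool. The sketch below is an illustration only: the patterns, the synthetic log lines, and the sub name count_matches are made up, and it compares matching against stringified patterns (what hash keys hold) with matching against qr// objects. Timings depend on the machine and the Perl version, so no expected output is shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample data: a few patterns and some synthetic log lines.
my @patterns = ('GET', 'OPTIONS', '^124\.40\.\d{1,3}\.\d{1,3}');
my @lines    = map { qq{10.0.0.$_ - - "GET /page$_ HTTP/1.1" 200\n} } 1 .. 200;

# Regexp objects, and the plain strings they turn into as hash keys.
my @as_compiled = map { qr/$_/ } @patterns;
my @as_strings  = map { "$_" } @as_compiled;

# Count how many (line, rule) pairs match; the work is the same for both sets.
sub count_matches {
    my ($rules) = @_;
    my $count = 0;
    for my $line (@lines) {
        for my $rule (@$rules) {
            $count++ if $line =~ /$rule/;
        }
    }
    return $count;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    strings  => sub { count_matches(\@as_strings) },
    compiled => sub { count_matches(\@as_compiled) },
});
```

A profiler run on the full script is still the better way to find where the time goes in the first place; a benchmark like this only confirms whether a specific change helped.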

