PerlMonks  

Splitting Apache Log Files

by cmm7825 (Novice)
on Apr 26, 2010 at 16:30 UTC ([id://836952]=perlquestion)

cmm7825 has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a script that routes lines of a log file to different output files when a line matches certain keywords or regexes. First the script reads the user-defined keywords/regexes and the names of the output files. Here is an example rule file:

    ## This is a comment!
    OPTIONS --> OPTIONS.txt
    GET --> GET.txt
    REGEX: ^124\.40\.\d{1,3}\.\d{1,3} --> REGEX.txt
    ::default:: --> default.txt

Lines starting with ## are comments, lines starting with REGEX: are regexes, and anything that doesn't match a keyword or rule gets placed in a default file. I tested this on 500 MB log files and it takes about 1 min 20 sec to execute. I'm looking to use this on much bigger logs and was wondering whether it could be optimized further. I'm aware it's better to seek the file on disk, but I'd like to use STDIN since this script will most likely be used at the end of a long pipe:
    #! /usr/bin/perl
    use strict;
    use warnings;

    my $default;
    my %rules;

    # Read the rule file (first argument, or split_rules.txt by default).
    open INFILE, shift || "split_rules.txt" or die $!;
    while(<INFILE>) {
        unless(m/^##/) {
            if(m/::default:: --> (\S+)/) {
                $default = $1;
                # "or die $!" rather than "|| die $1": "||" binds to the
                # filename string, so a failed open was never reported.
                open DEFAULT, ">$default" or die $!;
            }
            elsif(m/REGEX: (\S+) --> (\S+)/) {
                $rules{qr/$1/} = $2;
            }
            elsif(m/(\S+) --> (\S+)/) {
                my $string = quotemeta($1);
                $rules{qr/$string/} = $2;
            }
            else {
                die "$0: Syntax Error!\n";
            }
        }
    }
    close INFILE;

    # Replace each output file name with an open filehandle.
    foreach my $rule (keys %rules) {
        open(my $fh, ">", $rules{$rule}) || die $!;
        $rules{$rule} = $fh;
    }

    # Copy each input line to every matching output, else to the default.
    while(my $line = <STDIN>) {
        study $line;
        my $match = 0;
        foreach my $rule (keys %rules) {
            if($line =~ /$rule/) {
                $match=1;
                print {$rules{$rule}} $line;
            }
        }
        if(defined($default) && $match!=1) {
            print DEFAULT $line;
        }
    }

    foreach my $rule (keys %rules) {
        close $rules{$rule};
    }
    if(defined($default)) {
        close DEFAULT;
    }

Thanks!

UPDATE: crashtest pointed out that using a compiled regex as a hash key converts it back to a string. My execution time went down to 22 seconds. Thanks!

Replies are listed 'Best First'.
Re: Splitting Apache Log Files
by ig (Vicar) on Apr 26, 2010 at 17:04 UTC

    It looks like you are recompiling each regular expression once for each line of input to be scanned. Compiling regular expressions can be expensive, so performance might be improved if you pre-compile all the regular expressions.

    http://www.stonehenge.com/merlyn/UnixReview/col28.html provides a nice introduction to some of the options for compiling regular expressions.

      Thanks for the link. When I read in the words, I use the qr// operator. It was my understanding that this compiles the regular expression.

        qr// is a regular expression quote, and as such does, in a sense, compile regular expressions. Unfortunately, you're using the regular expression as a hash key, at which point it's turned back into a string. As you process the Apache log file, $rule is just a string. When you use it as a regular expression, it has to be compiled again - each time through the loop.
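The stringification described above is easy to check in isolation. The following is a minimal sketch, not taken from the original script: the names `$re`, `%rules`, and `$key` are purely illustrative. It compares what `ref` reports for a `qr//` object with what it reports for the same value after a round trip through a hash key.

```perl
use strict;
use warnings;

# A compiled pattern is a blessed reference of class "Regexp".
my $re = qr/^GET/;

# Hash keys are always plain strings, so the object is stringified here.
my %rules = ($re => 'GET.txt');
my ($key) = keys %rules;

print "qr// object: ref is '", ref($re),  "'\n";
print "hash key:    ref is '", ref($key), "'\n";

# The key still works as a pattern, but only because Perl compiles
# the string again when it is interpolated into a match.
print "key still matches\n" if 'GET /index.html' =~ /$key/;
```

An empty `ref` for the key shows that only the pattern's text survived, which is why an array of pairs (or a hash whose values hold the compiled regex) keeps the `qr//` object intact.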

        If I were writing your code, I would store the regular expression rules/filehandles in an array. Here's a sketch of what it might look like:

        my @rules; # not %rules.
        ...
        # Process input file of processing rules
        while(<INFILE>) {
            ...
            push @rules, { regex => qr/$string/, file_handle => $fh };
        }
        ...
        # Read Apache log file and print to various other files
        while (my $line = <STDIN>) {
            for my $rule_ref (@rules){
                my $regex = $rule_ref->{regex};
                my $fh = $rule_ref->{file_handle};
                if ($line =~ $regex) {
                    print $fh $line;
                }
            }
        }
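For anyone who wants to try the array-based idea without touching real files, here is one possible self-contained variant. It is only a sketch under stated assumptions: the helper names `compile_rules` and `targets_for` are hypothetical (they do not appear in the original post), the rule patterns are anchored at the start of the line, which the original did not do, and the functions return output file names instead of printing to filehandles so the routing logic can be checked on its own.

```perl
use strict;
use warnings;

# Parse rule lines into an ordered list of [compiled regex, output name]
# pairs. Returns that list plus the default output name (undef if absent).
sub compile_rules {
    my @lines = @_;
    my (@rules, $default);
    for my $line (@lines) {
        next if $line =~ /^##/ or $line =~ /^\s*$/;   # comments, blanks
        if ($line =~ /^::default:: --> (\S+)/) {
            $default = $1;
        }
        elsif ($line =~ /^REGEX: (\S+) --> (\S+)/) {
            push @rules, [ qr/$1/, $2 ];              # user-supplied regex
        }
        elsif ($line =~ /^(\S+) --> (\S+)/) {
            push @rules, [ qr/\Q$1\E/, $2 ];          # literal keyword
        }
        else {
            die "Syntax error in rule: $line\n";
        }
    }
    return (\@rules, $default);
}

# Return the output names a log line should go to: every matching rule,
# or the default when nothing matched and a default was configured.
sub targets_for {
    my ($line, $rules, $default) = @_;
    my @hits = map { $_->[1] } grep { $line =~ $_->[0] } @$rules;
    return @hits if @hits;
    return defined $default ? ($default) : ();
}

my ($rules, $default) = compile_rules(
    '## This is a comment!',
    'OPTIONS --> OPTIONS.txt',
    'GET --> GET.txt',
    'REGEX: ^124\.40\.\d{1,3}\.\d{1,3} --> REGEX.txt',
    '::default:: --> default.txt',
);

for my $line ('124.40.1.2 - - "GET / HTTP/1.1"', '10.0.0.1 - - "POST /form"') {
    print join(', ', targets_for($line, $rules, $default)), "\n";
}
```

Because the compiled patterns live in array elements rather than hash keys, they stay `Regexp` objects for the whole run, and the rules are also checked in the order they were written.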

        Hope this helps.

Re: Splitting Apache Log Files
by crashtest (Curate) on Apr 26, 2010 at 17:33 UTC

    If you think you need to optimize your code, have a look at Devel::NYTProf, a profiler which can tell you how much time is spent on each statement in your code. Then you can measure the difference as you make changes and, hopefully, improvements.
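Devel::NYTProf is normally run from the command line (`perl -d:NYTProf yourscript.pl`, then `nytprofhtml` to build the report). As a lighter complement, and explicitly a different tool from the profiler recommended above, the core Benchmark module can time two variants of the inner loop side by side. The sketch below is an assumption-laden illustration: the sample log line, the pattern list, and the helper `count_matches` are made up for the comparison, and the timings will vary by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

my @patterns = ('OPTIONS', 'GET', '^124\.40\.\d{1,3}\.\d{1,3}');

# Compiled regex objects, as stored in an array of rules.
my @objects = map { qr/$_/ } @patterns;

# The same patterns as plain strings, which is what hash keys would hold.
my @strings = map { my $as_string = "$_"; $as_string } @objects;

# Hypothetical sample line, loosely in Apache access-log style.
my $line = '124.40.7.9 - - [26/Apr/2010:16:30:00 +0000] "GET /index.html HTTP/1.1" 200 512';

# Count how many rules match one line; mirrors the inner loop of the script.
sub count_matches {
    my ($text, @rules) = @_;
    my $hits = 0;
    for my $rule (@rules) {
        $hits++ if $text =~ /$rule/;
    }
    return $hits;
}

# Run each variant the same number of times and print the timings.
timethese(100_000, {
    string_rules => sub { count_matches($line, @strings) },
    qr_rules     => sub { count_matches($line, @objects) },
});
```

A micro-benchmark like this only covers the matching loop; a profiler run on the real script and real logs is still the better guide to where the time goes.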
