I'm working on a script that splits lines of a log file into different output files when a line matches certain keywords or regexes. First the script reads in the user-defined keywords/regexes and the names of the output files. Here is an example rule file:
## This is a comment!
OPTIONS --> OPTIONS.txt
GET --> GET.txt
REGEX: ^124\.40\.\d{1,3}\.\d{1,3} --> REGEX.txt
::default:: --> default.txt
Lines starting with ## are comments, lines starting with REGEX: are regexes, and anything that doesn't match a keyword or rule gets placed in a default file. Anyway, I tested this on 500 MB log files and it takes about 1 min 20 sec to execute. I'm looking to use this on much bigger logs and was wondering if this could be optimized any better. I'm aware it's better to seek the file on disk, but I'd like to read from STDIN since this script will most likely be used at the end of a long pipe:
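For context, a hypothetical invocation at the end of a pipe might look like this (the script name, log file, and grep filter are just illustrative; the rule file is the one shown above, and it's also the default if no argument is given):

```shell
# Decompress, pre-filter, then fan lines out to per-rule files.
zcat access.log.gz \
    | grep -v 'healthcheck' \
    | ./split_log.pl split_rules.txt
```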
#! /usr/bin/perl
use strict;
use warnings;
my $default;
my %rules;
open INFILE, '<', shift || "split_rules.txt" or die $!;
while (<INFILE>)
{
    unless (m/^##/)
    {
        if (m/::default:: --> (\S+)/)
        {
            $default = $1;
            # 'or' instead of '||': with '||' the die bound to the filename
            # string and could never fire, so a failed open went unnoticed
            # (and the error to report is $!, not $1)
            open DEFAULT, '>', $default or die $!;
        }
        elsif (m/REGEX: (\S+) --> (\S+)/)
        {
            $rules{qr/$1/} = $2;
        }
        elsif (m/(\S+) --> (\S+)/)
        {
            my $string = quotemeta($1);
            $rules{qr/$string/} = $2;
        }
        else
        {
            die "$0: Syntax Error!\n";
        }
    }
}
close INFILE;
# Swap each output filename for an open write handle.
foreach my $rule (keys %rules)
{
    open(my $fh, ">", $rules{$rule}) or die $!;
    $rules{$rule} = $fh;
}
while (my $line = <STDIN>)
{
    study $line;    # hint to the regex engine: many matches against one string
    my $match = 0;
    foreach my $rule (keys %rules)
    {
        if ($line =~ /$rule/)
        {
            $match = 1;
            print {$rules{$rule}} $line;
        }
    }
    if (defined($default) && $match != 1)
    {
        print DEFAULT $line;
    }
}
foreach my $rule (keys %rules)
{
    close $rules{$rule};
}
if (defined($default))
{
    close DEFAULT;
}
Thanks!
UPDATE: crashtest pointed out that using a compiled regex as a hash key converts it back to a string. After fixing that, my execution time went down to 22 seconds. Thanks!
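For anyone hitting the same issue, here is a minimal sketch of what that fix looks like (rule patterns and filenames are illustrative): hash keys are always strings, so a qr// object used as a key is stringified and the pattern gets recompiled on every match. Keeping the compiled Regexp in an array of [regex, target] pairs preserves the object:

```perl
use strict;
use warnings;

# Using a compiled regex as a hash key stringifies it, so every
# match against the key recompiles the pattern from scratch:
my %slow = ( qr/^GET\b/ => 'GET.txt' );
print ref( (keys %slow)[0] ), "\n";    # empty: the key is a plain string

# Storing [compiled_regex, target] pairs in an array keeps the
# Regexp object intact, so no per-line recompilation happens:
my @rules = (
    [ qr/^GET\b/,                     'GET.txt'   ],
    [ qr/^124\.40\.\d{1,3}\.\d{1,3}/, 'REGEX.txt' ],
);

for my $pair (@rules) {
    my ( $re, $target ) = @$pair;
    print "$target\n" if 'GET /index.html HTTP/1.1' =~ $re;
}
```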