PerlMonks  

Fast/efficient way of check if a string in an strArr is contained in a line

by cpomp (Initiate)
on Nov 12, 2008 at 00:37 UTC

cpomp has asked for the wisdom of the Perl Monks concerning the following question:

I have a huge log file (millions of lines/records) that is parsed line by line in a script. The user can supply a file of skip strings; if a record contains any line from that file, I skip the record. The option to supply this file creates a bottleneck in my script: it takes more than 10 times as long to execute when that option is given. This is the code that I currently have:
# $line is the current line
# @bypList is the array containing all lines in the skip file
sub check_bypass {
    my $line = shift;
    my @bypList = @_;
    my $ret = 1;
    foreach (@bypList) {
        if ($line =~ /$_/i) {
            $ret = 0;
        }
    }
    return $ret;
}
As you can see, the solution that I have goes through each element of the array for every line. I thought about using a hash, but that didn't work because it isn't an exact match, so I have to run a regexp to see if the line contains any of the lines of the skip array. How can I improve on this? Thanks!

Replies are listed 'Best First'.
Re: Fast/efficient way of check if a string in an strArr is contained in a line
by Limbic~Region (Chancellor) on Nov 12, 2008 at 01:00 UTC
    cpomp,
    First, there is more than a nominal amount of time spent calling check_bypass() millions of times so it would be best to inline that code. It would be better if your skip file contained exact matches instead of regular expressions because then you could just say next if $skip{$_}; Instead, I would suggest the following (untested):
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Regexp::Assemble;

    my $skip_file = $ARGV[0] or die "Usage: $0 <skip_file>";

    # Assemble every pattern in the skip file into a single regex
    my $ra = Regexp::Assemble->new();
    $ra->add_file($skip_file);
    my $skip = $ra->re;

    open(my $fh, '<', 'log.txt') or die "Unable to open 'log.txt' for reading: $!";
    while (<$fh>) {
        next if /$skip/;
        print;
    }

    Minor addition: Depending on the contents of your skip file, there are other modules that may be a better choice but only you would know that unless you share examples.

    Cheers - L~R

Re: Fast/efficient way of check if a string in an strArr is contained in a line
by dragonchild (Archbishop) on Nov 12, 2008 at 01:05 UTC
    Create a master regex. Let's say your options were
    foo.*bar
    \w+\d+\s+\d+:\d$
    You'd do something like:
    my @skips = get_skips(); # However you get them
    my $skip_regex = join '|', map { "(?:${_})" } @skips;
    while ( <BIG_FILE> ) {
        next if /$skip_regex/;
        # Now you've checked everything.
    }
    It's not going to be as fast as a hash lookup, but it'll be faster than your for-loop in most cases. Plus, it'll be more self-documenting.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Fast/efficient way of check if a string in an strArr is contained in a line
by GrandFather (Saint) on Nov 12, 2008 at 01:07 UTC

    If you allow pattern matches, try Regexp::Match::List; if you are working with literal matches, try Regexp::List. You can then replace your check_bypass call with $line =~ $combinedRegex, having previously built $combinedRegex using the module.
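
    For the literal-match case, a combined regex can also be built in plain Perl without either module. The following is only a minimal sketch of that idea, not the modules' own API: the skip strings and the keep_line name are made up for illustration, and quotemeta is used so that characters such as "." or "[" are matched literally.

```perl
use strict;
use warnings;

# Hypothetical skip strings; in the real script these come from the skip file.
my @skip_strings = ('ate my', 'a.b', '[warn]');

# Escape regex metacharacters so each entry is matched literally,
# then join everything into one case-insensitive alternation.
# Note: guard against an empty list, since an empty pattern matches every line.
my $alternation   = join '|', map { quotemeta $_ } @skip_strings;
my $combinedRegex = qr/$alternation/i;

# Returns 1 if the line should be processed, 0 if it should be skipped.
sub keep_line {
    my ($line) = @_;
    return $line =~ $combinedRegex ? 0 : 1;
}

for my $line ("Aliens ATE MY baby-sitter\n", "axb is fine\n", "[WARN] disk full\n") {
    print $line if keep_line($line);
}
```

    The regex is compiled once with qr//, so the per-line cost is a single match instead of one match per skip entry.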


    Perl reduces RSI - it saves typing
Re: Fast/efficient way of check if a string in an strArr is contained in a line
by toolic (Bishop) on Nov 12, 2008 at 00:48 UTC
    You could get out of the "foreach" loop as soon as a match occurs, either using last or return:
    sub check_bypass {
        my $line = shift;
        foreach (@_) {
            if ($line =~ /$_/i) { return 0 }
        }
        return 1;
    }
Re: Fast/efficient way of check if a string in an strArr is contained in a line
by betterworld (Curate) on Nov 12, 2008 at 01:16 UTC
    if($line=~/$_/i)

    On the assumption that you don't actually want to do regex searches I suggest the following:

    You can create an index (hash) of the words of those lines that you want to search for. That way you only have to do a hash lookup for each word in the log. I've created a small sample script that demonstrates this.

    use strict;
    use warnings;

    my @logfile = (
        'Aliens ate my baby-sitter',
        'Pearls of Light',
        'Really long line of logs',
    );
    my @searchfile = (
        'test these words please',
        'ate my',
    );

    my %wordhash;
    for my $line (@searchfile) {
        $line = lc $line;
        # Index the searchlines by their words
        for my $word (split ' ', $line) {
            push @{$wordhash{$word}}, \$line;
        }
    }

    LOGLINE:
    for my $line (@logfile) {
        my $lower = lc $line;
        # For every word in the logline, check if we
        # have a searchline containing that word.
        for my $word (split ' ', $lower) {
            for my $searchline (@{$wordhash{$word} || []}) {
                # If the word was found, compare the lines
                if (index($lower, $$searchline) >= 0) {
                    next LOGLINE;
                }
            }
        }
        # If there was no matching searchline, print the logline.
        print "Processing the log line $line\n";
    }
Re: Fast/efficient way of check if a string in an strArr is contained in a line
by almut (Canon) on Nov 12, 2008 at 00:57 UTC

    Do they actually specify patterns in the skip file, or just simple substrings to be searched for in the log file records?  In case of the latter, you could try index, which might be marginally faster than a regex match... (haven't verified it, though).
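
    A minimal sketch of the index approach, assuming the skip file holds plain substrings and that matching should stay case-insensitive as in the original /i match. The skip strings and the should_skip name are hypothetical, and whether this actually beats a regex is still unverified.

```perl
use strict;
use warnings;

# Hypothetical skip strings, lowercased once up front.
my @skip_strings = map { lc $_ } ('ate my', 'Connection Reset');

# Returns 1 as soon as any skip string occurs in the line, else 0.
sub should_skip {
    my ($line) = @_;
    my $lower = lc $line;    # lowercase the line once, not once per skip string
    for my $needle (@skip_strings) {
        return 1 if index($lower, $needle) >= 0;
    }
    return 0;
}

for my $line ("Aliens ATE my baby-sitter\n", "Pearls of Light\n") {
    print $line unless should_skip($line);
}
```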

Re: Fast/efficient way of check if a string in an strArr is contained in a line
by gone2015 (Deacon) on Nov 12, 2008 at 01:03 UTC

    One approach might be to pull in bunches of log file lines and scan them together, amortising the overhead of the loop and the regex start-up.

    How many entries do you get in the @bypList ? Could you put them all together into a single regex, which you could qr// ? Eliminating the foreach and the regex start-up.

    Combining both the above might make a difference.

    How general are the regexes ?

    Otherwise, can you change the problem so that the @bypList is at least a single regex constructed with some intelligence ? Or construct a two step process, with a first, simple, regex (or something else) that rules out a high proportion of lines ? Even if the full generality of a regex is required for final identification of lines to be skipped, can some quicker filter be constructed -- by hand or automatically ?

    Doesn't look entirely straightforward to me...
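
    A rough sketch of combining the two ideas (a single qr// plus scanning lines in bunches), under some loud assumptions: the patterns are hypothetical, every line ends in a newline, and no pattern uses \A or \z. The whole chunk is tested once; only if something matches does it fall back to a per-line check, so a spurious chunk-level match costs time but not correctness.

```perl
use strict;
use warnings;

# Hypothetical skip patterns; in the real script they come from the skip file.
my @patterns    = ('foo.*bar', 'timeout\s+\d+$');
my $alternation = join '|', map { "(?:$_)" } @patterns;

# /m lets ^ and $ match at line boundaries inside a multi-line chunk.
my $skip = qr/$alternation/im;

# Returns the lines of a chunk that match no skip pattern.
sub filter_chunk {
    my @lines = @_;
    my $chunk = join '', @lines;

    # Fast path: one match attempt covers the whole chunk.
    return @lines unless $chunk =~ $skip;

    # Slow path: something matched (possibly across a line boundary),
    # so check each line individually.
    return grep { $_ !~ $skip } @lines;
}

my @kept = filter_chunk("alpha\n", "FOO then bar\n", "timeout 30\n", "beta\n");
print @kept;
```

    This only pays off when most chunks contain no skipped line; if nearly every chunk has a match, the fast path is wasted work.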
