Fast/efficient way of check if a string in an strArr is contained in a line

cpomp has asked for the wisdom of the Perl Monks concerning the following question:

Re: Fast/efficient way of check if a string in an strArr is contained in a line by Limbic~Region (Chancellor) on Nov 12, 2008 at 01:00 UTC
cpomp,

First, there is more than a nominal amount of time spent calling check_bypass() millions of times, so it would be best to inline that code. It would be better still if your skip file contained exact matches instead of regular expressions, because then you could just say `next if $skip{$_};`. Instead, I would suggest the following (untested):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Regexp::Assemble;

    my $skip_file = $ARGV[0] or die "Usage: $0 <skip_file>";

    # Assemble every pattern in the skip file into one regex
    my $ra = Regexp::Assemble->new();
    $ra->add_file($skip_file);
    my $skip = $ra->re;

    open(my $fh, '<', 'log.txt') or die "Unable to open 'log.txt' for reading: $!";
    while (<$fh>) {
        next if /$skip/;
        print;
    }

Minor addition: Depending on the contents of your skip file, there are other modules that may be a better choice, but only you would know that unless you share examples.

Cheers - L~R
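For comparison, the exact-match case mentioned above (`next if $skip{$_};`) could look like the following minimal sketch. It assumes the skip entries are whole lines rather than patterns; the `load_skip_set` and `filter_lines` helpers and the sample data are hypothetical and not part of the original code.

```perl
use strict;
use warnings;

# Build a lookup set from a list of exact lines to skip (hypothetical helper).
sub load_skip_set {
    my @entries = @_;
    my %skip;
    for my $entry (@entries) {
        chomp(my $key = $entry);
        $skip{$key} = 1;
    }
    return \%skip;
}

# Return only the lines that are not exact matches for a skip entry.
sub filter_lines {
    my ($skip, @lines) = @_;
    my @kept;
    for my $line (@lines) {
        chomp(my $copy = $line);
        next if $skip->{$copy};    # one hash lookup per line, no regex involved
        push @kept, $copy;
    }
    return @kept;
}

# Made-up sample data for illustration only.
my $skip = load_skip_set("heartbeat ok\n", "cache flushed\n");
my @kept = filter_lines($skip, "heartbeat ok\n", "disk full\n", "cache flushed\n");
print "$_\n" for @kept;
```

The cost per log line is then a single hash lookup, independent of how many skip entries there are, which is why exact matches would be preferable if the data allows them.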
Re: Fast/efficient way of check if a string in an strArr is contained in a line by dragonchild (Archbishop) on Nov 12, 2008 at 01:05 UTC
Create a master regex. Let's say your options were:

    foo.*bar
    \w+\d+\s+\d+:\d$

You'd do something like:

    my @skips = get_skips(); # However you get them
    my $skip_regex = join '|', map { "(?:${_})" } @skips;

    while ( <BIG_FILE> ) {
        next if /$skip_regex/;
        # Now you've checked everything.
    }

It's not going to be as fast as a hash lookup, but it'll be faster than your for-loop in most cases. Plus, it'll be more self-documenting.

My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
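A self-contained variant of the master-regex idea might look like the sketch below. The `build_skip_regex` helper and the sample lines are hypothetical; precompiling with `qr//` and the `/i` flag are assumptions (the flag mirrors the case-insensitive match used in check_bypass elsewhere in this thread), not something the code above does.

```perl
use strict;
use warnings;

# Combine several pattern strings into one precompiled, case-insensitive regex
# (hypothetical helper, not from the original post).
sub build_skip_regex {
    my @patterns = @_;
    # An empty alternation would match every line, so fall back to
    # a regex that can never match.
    return qr/(?!)/ unless @patterns;
    my $alternation = join '|', map { "(?:$_)" } @patterns;
    return qr/$alternation/i;
}

my $skip_regex = build_skip_regex('foo.*bar', '\w+\d+\s+\d+:\d$');

# Made-up sample lines for illustration only.
my @lines = ("a foo then bar", "nothing to skip", "host42 10:5");
my @kept  = grep { !/$skip_regex/ } @lines;
print "$_\n" for @kept;
```

Wrapping each pattern in `(?:...)` keeps one pattern's alternations or anchors from leaking into its neighbours once everything is joined with `|`.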
Re: Fast/efficient way of check if a string in an strArr is contained in a line by GrandFather (Saint) on Nov 12, 2008 at 01:07 UTC
If you allow pattern matches, try Regexp::Match::List; if you are working with literal matches, try Regexp::List. You can then replace your check_bypass call with `$line =~ $combinedRegex`, having previously built $combinedRegex using the module.

Perl reduces RSI - it saves typing
Re: Fast/efficient way of check if a string in an strArr is contained in a line by toolic (Bishop) on Nov 12, 2008 at 00:48 UTC
You could get out of the "foreach" loop as soon as a match occurs, either using last or return:

    sub check_bypass {
        my $line = shift;
        foreach (@_) {
            if ($line =~ /$_/i) { return 0 }
        }
        return 1;
    }
Re: Fast/efficient way of check if a string in an strArr is contained in a line by betterworld (Curate) on Nov 12, 2008 at 01:16 UTC
`if($line=~/$_/i)`

On the assumption that you don't actually want to do regex searches, I suggest the following: You can create an index (hash) of the words of those lines that you want to search for. That way you only have to do a hash lookup for each word in the log. I've created a small sample script that demonstrates this.

    use strict;
    use warnings;

    my @logfile = (
        'Aliens ate my baby-sitter',
        'Pearls of Light',
        'Really long line of logs',
    );

    my @searchfile = (
        'test these words please',
        'ate my',
    );

    my %wordhash;

    for my $line (@searchfile) {
        $line = lc $line;
        # Index the searchlines by their words
        for my $word (split ' ', $line) {
            push @{$wordhash{$word}}, \$line;
        }
    }

    LOGLINE:
    for my $line (@logfile) {
        my $lower = lc $line;
        # For every word in the logline, check if we
        # have a searchline containing that word.
        for my $word (split ' ', $lower) {
            for my $searchline (@{$wordhash{$word} || []}) {
                # If the word was found, compare the lines
                if (index($lower, $$searchline) >= 0) {
                    next LOGLINE;
                }
            }
        }
        # If there was no matching searchline, print the logline.
        print "Processing the log line $line\n";
    }
Re: Fast/efficient way of check if a string in an strArr is contained in a line by almut (Canon) on Nov 12, 2008 at 00:57 UTC
Do they actually specify patterns in the skip file, or just simple substrings to be searched for in the log file records? In the latter case, you could try index, which might be marginally faster than a regex match (I haven't verified that, though).
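The `index` suggestion can be sketched as follows, assuming the skip entries are plain substrings and that matching should stay case-insensitive. The `contains_any` helper and the sample data are hypothetical, and whether this beats a regex on real data would need benchmarking.

```perl
use strict;
use warnings;

# Return true if any literal substring occurs in the line.
# The needles are expected to be lowercased already (hypothetical helper).
sub contains_any {
    my ($line, @needles) = @_;
    my $lower = lc $line;
    for my $needle (@needles) {
        return 1 if index($lower, $needle) >= 0;    # stop at the first hit
    }
    return 0;
}

# Made-up skip substrings and log lines for illustration only.
my @needles = map { lc $_ } ('DEBUG', 'keep-alive');
my @lines   = ("2008-11-12 debug: tick", "2008-11-12 ERROR: disk full");
my @kept    = grep { !contains_any($_, @needles) } @lines;
print "$_\n" for @kept;
```

Lowercasing the needles once, outside the loop, avoids repeating that work for every log line.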
Re: Fast/efficient way of check if a string in an strArr is contained in a line by gone2015 (Deacon) on Nov 12, 2008 at 01:03 UTC
One approach might be to pull in bunches of log file lines and scan them together, amortising the overhead of the loop and the regex start-up.

How many entries do you get in the `@bypList`? Could you put them all together into a single regex, which you could `qr//`? That would eliminate the `foreach` and the regex start-up.

Combining both of the above might make a difference.

How general are the regexes? Otherwise, can you change the problem so that the `@bypList` is at least a single regex constructed with some intelligence? Or construct a two-step process, with a first, simple regex (or something else) that rules out a high proportion of lines?

Even if the full generality of a regex is required for final identification of lines to be skipped, can some quicker filter be constructed, by hand or automatically?

Doesn't look entirely straightforward to me...
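The two-step idea could be sketched roughly as below: a cheap literal test rejects most lines, and the full regex runs only on the survivors. This is only correct under the assumption that the literal occurs in every line the full regex can match; the `should_skip` helper, the literal, the regex, and the sample lines are all hypothetical.

```perl
use strict;
use warnings;

# Stage 1: a cheap literal test that most lines are expected to fail.
# Stage 2: the full regex, run only on lines that pass stage 1.
sub should_skip {
    my ($line, $literal, $full_regex) = @_;
    return 0 if index($line, $literal) < 0;    # cheap rejection, no regex
    return $line =~ $full_regex ? 1 : 0;
}

# Assumption: every line matched by $full_regex contains $literal.
my $literal    = 'session';
my $full_regex = qr/session \d+ (?:opened|closed)$/;

# Made-up log lines for illustration only.
my @lines = (
    "user login failed",
    "session 17 opened",
    "session table corrupt",
);
my @kept = grep { !should_skip($_, $literal, $full_regex) } @lines;
print "$_\n" for @kept;
```

Whether the prefilter pays off depends on how many lines it rejects; if most lines contain the literal, it only adds work.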

