Re: line by line match on an array of strings
by j1n3l0 (Friar) on Jan 09, 2008 at 09:48 UTC
|
You could look at the any() function from the List::MoreUtils module. I think it would be something like so:
#!/usr/bin/perl -w
use strict;
use List::MoreUtils qw(any);
my @typedefs = qw(do re me fa so la ti do);
while ( my $line = <DATA> ) {
if ( any { $line =~ /$_/ } @typedefs ) {
# perform various actions here if line match
print $line;
}
}
exit 0;
__END__
do
me
a
favour
will
you
This prints:
do
me
favour
UPDATE
From the docs:
Using the functions from this module however should give slightly better performance as everything is implemented in C.
Hope that helps =)
Smoothie, smoothie, hundre prosent naturlig!
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: line by line match on an array of strings
by jbert (Priest) on Jan 09, 2008 at 10:42 UTC
|
You might get some minor joy by:
- Compiling your regexps and saving them. See qr in perldoc perlop.
- If each line can only match one of these typedefs, make sure you exit the typedef loop when you find a match.
- In combination with the previous, if you have some typedefs which are more likely to occur, order the typedefs array with the most likely at the front (for early exit).
In combination:
my @typedefs = qw(...sorted list of typedefs, common ones first...);
my @regexes = map { qr{$_} } @typedefs;
REGEX:
foreach my $regex (@regexes) {
if ($line =~ $regex) {
# found typedef so do work
last REGEX; # Stop looking
}
}
Ah...if you want to perform the same action, you can munge all your typedefs together into one regex.
my @typedefs = qw(...sorted list of typedefs, common ones first...);
# Make a big '(a|b|c|...)' regex str
# Should probably do some quoting here.
my $regexStr = "(" . join("|", @typedefs) . ")";
my $regex = qr{$regexStr};
if ($line =~ $regex) {
# found typedef so do work
}
That should help a bit, since you're now handling all the alternatives in the regex engine, i.e. looping in C not perl.
UPDATE: in case it's not obvious, you want to produce the @regex array or $regex once, before looping through all your lines. Don't do it per line :-) | [reply] [Watch: Dir/Any] [d/l] [select] |
Re: line by line match on an array of strings
by bunch (Sexton) on Jan 09, 2008 at 11:15 UTC
|
You can use Regexp::Assemble to build an optimized re that matches all your typedefs:
use Regexp::Assemble;
my $re = Regexp::Assemble->new->add(@typedefs)->re
...
if ($line =~ $re) {
# do something
}
| [reply] [Watch: Dir/Any] [d/l] |
Re: line by line match on an array of strings
by holli (Abbot) on Jan 09, 2008 at 11:29 UTC
|
You could try to feed your strings into Regexp::Assemble. That way you get a single regex you can use to test your lines against, which would probably speed things up. OTOH, you cannot tell wich of your original search strings actually matched the line, if that is important for you. If it is, then you would have to do your explicit tests after the regex matched. Depending on the blank/hit ratio it may still be faster.
Update: as per bibliophile's comment below.
| [reply] [Watch: Dir/Any] [d/l] |
|
Interesting. I'd never heard of Regexp::Assemble...
From the doc:
It is also possible to track the original patterns,
so that you can determine which, among the source patterns
that form the assembled pattern, was the one that caused
the match to occur.
| [reply] [Watch: Dir/Any] |
Re: line by line match on an array of strings
by Punitha (Priest) on Jan 09, 2008 at 10:13 UTC
|
use strict;
my @typedefs = qw(do re me fa so la ti do);
while (my $line = <DATA>) {
chomp($line);
if(grep/$line/, @typedefs){
print "$line\n";
## perform various actions here if line match
}
}
__DATA__
do
me
a
favour
will
you
Punitha | [reply] [Watch: Dir/Any] [d/l] |
|
May not be the OP's intent???
prints
do
me
a
Update (after what I judge to have been too many minutes of puzzling over this)
Still not clear why "a" matched (it'll come to me yet), but the obvious cure (IMO) is to anchor the match, thus
if(grep/^$line$/, @typedefs){
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: line by line match on an array of strings
by balaji_red83 (Acolyte) on Jan 09, 2008 at 09:50 UTC
|
Instead of reading line by line, you can slurp in the contents of the entire file into a scalar variable and then look for a match against an array of strings.
local $/ = undef;
open(FH,"<","test.txt") or die "Could not open the file due to $!";
my $fcontent = <FH>;
close FH;
foreach (@typedefs) {
if ($fcontent =~ /$_/){
# perform various actions here if line match
}
}
| [reply] [Watch: Dir/Any] [d/l] |
|
Perhaps you missed this in the original post?
with a few hundred thousand lines to search
-- [ e d @ h a l l e y . c c ]
| [reply] [Watch: Dir/Any] |
Re: line by line match on an array of strings
by pilcrow (Sexton) on Jan 09, 2008 at 18:05 UTC
|
Is there a more elegant solution?
Substituting speed for elegance, there are many quick wins for this sort of thing. Precompile your regexen with qr//, so each iteration doesn't compile them anew. Break out of the loop after the first successful match, if appropriate (the List::MoreUtils any approach does this, I think). Optimize the regex list and its application: can you profitably combine them, add modifiers like \b or ^, etc? Keep track of regex hit counts and sort your regex list now and then to apply the most common matches first, if appropriate.
The biggest win typically comes from rethinking the problem, of course. Without really knowing what you're attempting, it looks as though you might be trying to do some comparatively simple token matches. Something like
while (<INPUT>) {
if (/\b keyword \s+ (\w+)/ and exists $keywords{ $1 }) {
# .. do something with the token in $1
}
}
might do the trick. We'd need to see specific examples to give more specific advice.
-Mike | [reply] [Watch: Dir/Any] [d/l] |
|
Just trying to restate what Mike was saying.
> however with a few hundred thousand lines to seach, and an array of a few hundred it is far too slow.
It could be slow because it has to do a lot of work at each end step, which is where optimizing the regex helps.
I think it is slow because your looping is of order (a few hundred thousand) TIMES (a few hundred).
It would be much better if the looping is of order (a few hundred thousand) times a big constant. You might be able to get away with that by 'precompile your regexen' (wonderful phrase) -- or imho more likely if your line can be broken into a small number of tokens, just do a dispatch table on tokens broken out from the line.
--woody
| [reply] [Watch: Dir/Any] |
Re: line by line match on an array of strings
by vdrab (Initiate) on Jan 11, 2008 at 02:57 UTC
|
Hi,
What exactly do you mean by "see if there's a match against any of the strings in the array"? Do you just have to check for "a" match?
Just thinking loud (I don't know what the data is or where it came from), but would it be an option to sort the array, and look for matches using binary search? That would save you a fair lot of processing (assuming the sorting can be done offline).
v. | [reply] [Watch: Dir/Any] |