line by line match on an array of strings

barryscott has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: line by line match on an array of strings by j1n3l0 (Friar) on Jan 09, 2008 at 09:48 UTC
You could look at the any() function from the List::MoreUtils module. I think it would be something like so: `#!/usr/bin/perl -w use strict; use List::MoreUtils qw(any); my @typedefs = qw(do re me fa so la ti do); while ( my $line = <DATA> ) { if ( any { $line =~ /$_/ } @typedefs ) { # perform various actions here if line match print $line; } } exit 0; __END__ do me a favour will you` [download] This prints: `do me favour` [download] UPDATE From the docs: Using the functions from this module however should give slightly better performance as everything is implemented in C. Hope that helps =) Smoothie, smoothie, hundre prosent naturlig!	[reply] [d/l] [select]
Re: line by line match on an array of strings by jbert (Priest) on Jan 09, 2008 at 10:42 UTC
You might get some minor joy by: Compiling your regexps and saving them. See `qr` in perldoc perlop. If each line can only match one of these typedefs, make sure you exit the typedef loop when you find a match. In combination with the previous, if you have some typedefs which are more likely to occur, order the typedefs array with the most likely at the front (for early exit). In combination: `my @typedefs = qw(...sorted list of typedefs, common ones first...); my @regexes = map { qr{$_} } @typedefs; REGEX: foreach my $regex (@regexes) { if ($line =~ $regex) { # found typedef so do work last REGEX; # Stop looking } }` [download] Ah...if you want to perform the same action, you can munge all your typedefs together into one regex. `my @typedefs = qw(...sorted list of typedefs, common ones first...); # Make a big '(a\|b\|c\|...)' regex str # Should probably do some quoting here. my $regexStr = "(" . join("\|", @typedefs) . ")"; my $regex = qr{$regexStr}; if ($line =~ $regex) { # found typedef so do work }` [download] That should help a bit, since you're now handling all the alternatives in the regex engine, i.e. looping in C not perl. UPDATE: in case it's not obvious, you want to produce the `@regex` array or `$regex` once, before looping through all your lines. Don't do it per line :-)	[reply] [d/l] [select]
Re: line by line match on an array of strings by bunch (Sexton) on Jan 09, 2008 at 11:15 UTC
You can use Regexp::Assemble to build an optimized re that matches all your typedefs: `use Regexp::Assemble; my $re = Regexp::Assemble->new->add(@typedefs)->re ... if ($line =~ $re) { # do something }` [download]	[reply] [d/l]
Re: line by line match on an array of strings by holli (Abbot) on Jan 09, 2008 at 11:29 UTC
You could try to feed your strings into Regexp::Assemble. That way you get a single regex you can use to test your lines against, which would probably speed things up. OTOH, you cannot tell wich of your original search strings actually matched the line, if that is important for you. If it is, then you would have to do your explicit tests after the regex matched. Depending on the blank/hit ratio it may still be faster. Update: as per bibliophile's comment below. holli, /regexed monk/	[reply] [d/l]
Re^2: line by line match on an array of strings by bibliophile (Prior) on Jan 09, 2008 at 14:52 UTC
Interesting. I'd never heard of Regexp::Assemble... From the doc: It is also possible to track the original patterns, so that you can determine which, among the source patterns that form the assembled pattern, was the one that caused the match to occur.	[reply]
Re: line by line match on an array of strings by Punitha (Priest) on Jan 09, 2008 at 10:13 UTC
Hi barryscott This may also help you in addition to j1n3l0 `use strict; my @typedefs = qw(do re me fa so la ti do); while (my $line = <DATA>) { chomp($line); if(grep/$line/, @typedefs){ print "$line\n"; ## perform various actions here if line match } } __DATA__ do me a favour will you` [download] Punitha	[reply] [d/l]
Re^2: line by line match on an array of strings by ww (Archbishop) on Jan 09, 2008 at 14:13 UTC
May not be the OP's intent??? prints `do me a` [download] Update (after what I judge to have been too many minutes of puzzling over this) Still not clear why "a" matched (it'll come to me yet), but the obvious cure (IMO) is to anchor the match, thus `if(grep/^$line$/, @typedefs){`	[reply] [d/l] [select]
Re: line by line match on an array of strings by balaji_red83 (Acolyte) on Jan 09, 2008 at 09:50 UTC
Instead of reading line by line, you can slurp in the contents of the entire file into a scalar variable and then look for a match against an array of strings. `local $/ = undef; open(FH,"<","test.txt") or die "Could not open the file due to $!"; my $fcontent = <FH>; close FH; foreach (@typedefs) { if ($fcontent =~ /$_/){ # perform various actions here if line match } }` [download]	[reply] [d/l]
Re^2: line by line match on an array of strings by halley (Prior) on Jan 09, 2008 at 14:20 UTC
Perhaps you missed this in the original post? with a few hundred thousand lines to search -- `[ e d @ h a l l e y . c c ]`	[reply]
Re: line by line match on an array of strings by pilcrow (Sexton) on Jan 09, 2008 at 18:05 UTC
Is there a more elegant solution? Substituting speed for elegance, there are many quick wins for this sort of thing. Precompile your regexen with qr//, so each iteration doesn't compile them anew. Break out of the loop after the first successful match, if appropriate (the List::MoreUtils any approach does this, I think). Optimize the regex list and its application: can you profitably combine them, add modifiers like \b or ^, etc? Keep track of regex hit counts and sort your regex list now and then to apply the most common matches first, if appropriate. The biggest win typically comes from rethinking the problem, of course. Without really knowing what you're attempting, it looks as though you might be trying to do some comparatively simple token matches. Something like `while (<INPUT>) { if (/\b keyword \s+ (\w+)/ and exists $keywords{ $1 }) { # .. do something with the token in $1 } }` [download] might do the trick. We'd need to see specific examples to give more specific advice. -Mike	[reply] [d/l]
Re^2: line by line match on an array of strings by WoodyWeaver (Monk) on Jan 09, 2008 at 21:38 UTC
Just trying to restate what Mike was saying. > however with a few hundred thousand lines to seach, and an array of a few hundred it is far too slow. It could be slow because it has to do a lot of work at each end step, which is where optimizing the regex helps. I think it is slow because your looping is of order (a few hundred thousand) TIMES (a few hundred). It would be much better if the looping is of order (a few hundred thousand) times a big constant. You might be able to get away with that by 'precompile your regexen' (wonderful phrase) -- or imho more likely if your line can be broken into a small number of tokens, just do a dispatch table on tokens broken out from the line. --woody	[reply]
Re: line by line match on an array of strings by vdrab (Initiate) on Jan 11, 2008 at 02:57 UTC
Hi, What exactly do you mean by "see if there's a match against any of the strings in the array"? Do you just have to check for "a" match? Just thinking loud (I don't know what the data is or where it came from), but would it be an option to sort the array, and look for matches using binary search? That would save you a fair lot of processing (assuming the sorting can be done offline). v.	[reply]


Don't ask to ask, just ask
	PerlMonks