Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

line by line match on an array of strings

by barryscott (Acolyte)
on Jan 09, 2008 at 09:31 UTC ( [id://661292]=perlquestion: print w/replies, xml ) Need Help??

barryscott has asked for the wisdom of the Perl Monks concerning the following question:

Morning monks. Another day, another problem ;oD I have dynamically collated an array of strings in my perl code, and now need to search line by line (this mechanism exists in the program) to see if there's a match against any of the strings in the array. So far I have sometyhing like this
foreach (@typedefs){ if ($line=~ /$_/) { # perform various actions here if line match } }
however with a few hundred thousand lines to seach, and an array of a few hundred it is far too slow. Is there a more elegant solution? Many thanks.

Replies are listed 'Best First'.
Re: line by line match on an array of strings
by j1n3l0 (Friar) on Jan 09, 2008 at 09:48 UTC
    You could look at the any() function from the List::MoreUtils module. I think it would be something like so:

    #!/usr/bin/perl -w use strict; use List::MoreUtils qw(any); my @typedefs = qw(do re me fa so la ti do); while ( my $line = <DATA> ) { if ( any { $line =~ /$_/ } @typedefs ) { # perform various actions here if line match print $line; } } exit 0; __END__ do me a favour will you

    This prints:

    do me favour

    UPDATE

    From the docs:

    Using the functions from this module however should give slightly better performance as everything is implemented in C.

    Hope that helps =)


    Smoothie, smoothie, hundre prosent naturlig!
Re: line by line match on an array of strings
by jbert (Priest) on Jan 09, 2008 at 10:42 UTC
    You might get some minor joy by:
    • Compiling your regexps and saving them. See qr in perldoc perlop.
    • If each line can only match one of these typedefs, make sure you exit the typedef loop when you find a match.
    • In combination with the previous, if you have some typedefs which are more likely to occur, order the typedefs array with the most likely at the front (for early exit).
    In combination:
    my @typedefs = qw(...sorted list of typedefs, common ones first...); my @regexes = map { qr{$_} } @typedefs; REGEX: foreach my $regex (@regexes) { if ($line =~ $regex) { # found typedef so do work last REGEX; # Stop looking } }
    Ah...if you want to perform the same action, you can munge all your typedefs together into one regex.
    my @typedefs = qw(...sorted list of typedefs, common ones first...); # Make a big '(a|b|c|...)' regex str # Should probably do some quoting here. my $regexStr = "(" . join("|", @typedefs) . ")"; my $regex = qr{$regexStr}; if ($line =~ $regex) { # found typedef so do work }
    That should help a bit, since you're now handling all the alternatives in the regex engine, i.e. looping in C not perl.

    UPDATE: in case it's not obvious, you want to produce the @regex array or $regex once, before looping through all your lines. Don't do it per line :-)

Re: line by line match on an array of strings
by bunch (Sexton) on Jan 09, 2008 at 11:15 UTC
    You can use Regexp::Assemble to build an optimized re that matches all your typedefs:
    use Regexp::Assemble; my $re = Regexp::Assemble->new->add(@typedefs)->re ... if ($line =~ $re) { # do something }
Re: line by line match on an array of strings
by holli (Abbot) on Jan 09, 2008 at 11:29 UTC
    You could try to feed your strings into Regexp::Assemble. That way you get a single regex you can use to test your lines against, which would probably speed things up. OTOH, you cannot tell wich of your original search strings actually matched the line, if that is important for you. If it is, then you would have to do your explicit tests after the regex matched. Depending on the blank/hit ratio it may still be faster.

    Update: as per bibliophile's comment below.


    holli, /regexed monk/
      Interesting. I'd never heard of Regexp::Assemble...
      From the doc:
      It is also possible to track the original patterns,
      so that you can determine which, among the source patterns
      that form the assembled pattern, was the one that caused
      the match to occur.
      
Re: line by line match on an array of strings
by Punitha (Priest) on Jan 09, 2008 at 10:13 UTC

    Hi barryscott

    This may also help you in addition to j1n3l0

    use strict; my @typedefs = qw(do re me fa so la ti do); while (my $line = <DATA>) { chomp($line); if(grep/$line/, @typedefs){ print "$line\n"; ## perform various actions here if line match } } __DATA__ do me a favour will you

    Punitha

      May not be the OP's intent???

      prints
      do me a

      Update (after what I judge to have been too many minutes of puzzling over this)

      Still not clear why "a" matched (it'll come to me yet), but the obvious cure (IMO) is to anchor the match, thus

      if(grep/^$line$/, @typedefs){
Re: line by line match on an array of strings
by balaji_red83 (Acolyte) on Jan 09, 2008 at 09:50 UTC
    Instead of reading line by line, you can slurp in the contents of the entire file into a scalar variable and then look for a match against an array of strings.
    local $/ = undef; open(FH,"<","test.txt") or die "Could not open the file due to $!"; my $fcontent = <FH>; close FH; foreach (@typedefs) { if ($fcontent =~ /$_/){ # perform various actions here if line match } }
      Perhaps you missed this in the original post?
        with a few hundred thousand lines to search

      --
      [ e d @ h a l l e y . c c ]

Re: line by line match on an array of strings
by pilcrow (Sexton) on Jan 09, 2008 at 18:05 UTC
    Is there a more elegant solution?

    Substituting speed for elegance, there are many quick wins for this sort of thing. Precompile your regexen with qr//, so each iteration doesn't compile them anew. Break out of the loop after the first successful match, if appropriate (the List::MoreUtils any approach does this, I think). Optimize the regex list and its application: can you profitably combine them, add modifiers like \b or ^, etc? Keep track of regex hit counts and sort your regex list now and then to apply the most common matches first, if appropriate.

    The biggest win typically comes from rethinking the problem, of course. Without really knowing what you're attempting, it looks as though you might be trying to do some comparatively simple token matches. Something like

    while (<INPUT>) { if (/\b keyword \s+ (\w+)/ and exists $keywords{ $1 }) { # .. do something with the token in $1 } }
    might do the trick. We'd need to see specific examples to give more specific advice. -Mike
      Just trying to restate what Mike was saying.

      > however with a few hundred thousand lines to seach, and an array of a few hundred it is far too slow.

      It could be slow because it has to do a lot of work at each end step, which is where optimizing the regex helps.

      I think it is slow because your looping is of order (a few hundred thousand) TIMES (a few hundred).

      It would be much better if the looping is of order (a few hundred thousand) times a big constant. You might be able to get away with that by 'precompile your regexen' (wonderful phrase) -- or imho more likely if your line can be broken into a small number of tokens, just do a dispatch table on tokens broken out from the line.

      --woody

Re: line by line match on an array of strings
by vdrab (Initiate) on Jan 11, 2008 at 02:57 UTC
    Hi, What exactly do you mean by "see if there's a match against any of the strings in the array"? Do you just have to check for "a" match? Just thinking loud (I don't know what the data is or where it came from), but would it be an option to sort the array, and look for matches using binary search? That would save you a fair lot of processing (assuming the sorting can be done offline). v.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://661292]
Approved by Corion
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others scrutinizing the Monastery: (5)
As of 2024-03-19 08:08 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found