http://www.perlmonks.org?node_id=583928

ultibuzz has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

i have the following Situation: a String of Numbers is given, an Array for matching several Numbers is given, i need to cut the matching Number and then print both the mathing Number and the rest with a ; between.

it shoud match from the beginning and it shoud match the longes possibility

Example: this number is given 012345666666 , these matchings are given 012345 and 0123456 , then 0123456 shoud be returnt not 012345

the array itself is 5025 elements big and i'm searching for faster ways then to join the array and doing a regex on each element.

also i can't find a way that the regex deliver back the longest matching possibility, for this reason i sortet the array so that the longest numbers comes first

my $filterregex = join('|',@Numbers)
/^$filterregex/

i'am using the Regex vars matching and rest to get everything i need, but the regex is still not working correctly and it's not that fast.

Summary: regex needed that matches from the beginning and delivers just the longes matching value, a possible faster way woud be great

kd

ultibuzz

  • Comment on Regex only the longes Value , and possible speedup Problem

Replies are listed 'Best First'.
Re: Regex only the longes Value , and possible speedup Problem
by reneeb (Chaplain) on Nov 14, 2006 at 09:50 UTC
    You could use this:
    #!/usr/bin/perl use strict; use warnings; my @numbers = ('012345','0123456'); @numbers = sort{length $b <=> length $a}@numbers; my $regex = join('|',@numbers); my $string = '0123456778945'; $string =~ s!^($regex)(.*)!$1;$2!; print $string;
      hi reneeb,
      i just testet your code a bit changed with a live case and it looks like it's working perfectly, and it was fast

      kd ultibuzz

      Update: time for a 500000 file with additional loops for stuff is 2.632 seconds, thats fast ;)
Re: Regex only the longes Value , and possible speedup Problem
by GrandFather (Saint) on Nov 14, 2006 at 09:10 UTC

    It's pretty hard to give a faster answer when we don't know exactly what you are doing currently (note that the code should compile and run!). Perhaps you could provide just enough of your code to demonstrate the problem and just enought data (in a __DATA__ section) to illustrate the matching issues.

    A few other things that it may help to know are:

    1. does the (5025) element set change from run to run?
    2. do the match strings change from run to run?
    3. how many matching strings are there?
    4. what is the minimum likely match length?
    5. what is the maximum likely match length?
    6. are you looking for a single best match or multiple matches?

    DWIM is Perl's answer to Gödel
      hi grandfather,

      1. normaly not but it can happen so i load these elements from a file and read it into an array
      2. yes they do everytime
      3. can be 1 can be 500000
      4. the minnimum matching is 4 numbers , the maximum i don't know but i never see everything larger then 14 so 20 shoud be the max
      6. im loking for a single best match

      here is some sample data for the array
      __Array_DATA__ 4930 49201 49202 49203 492041 492043 492045 492051 492052 492053 492054 492056 492058 492064 492065 492066 49208 49209 492102 492103 492104

      here are some strings
      4920911223344 492065577883667 49206656672

      the result for these strings shoud look like
      49209;11223344 492065;577883667 492066;56672

      finaly my testing code wich is far from greatness ;)
      use strict; use warnings; use File::Spec; open(my $fh_in, '<', 'Script\\onkz.txt') or die("open failed: $!"); open(INFILE, '<', 'data.txt') or die("open failed: $!"); my @ONKZ = <$fh_in>; my $filterregex = join('|',@ONKZ); while (<INFILE>) { if($_ =~ /^($filterregex)/){ print "$&;$'\n"; } }
Re: Regex only the longes Value , and possible speedup Problem
by BrowserUk (Patriarch) on Nov 14, 2006 at 10:06 UTC

    If you

    1. sort your search patterns by length and then value (reversed);
    2. chop characters off the string one at a time until it is less than the longest/highest string;
    3. then descend down the array comparing against each string in turn;
    4. quit the loop when you find a match, or loop and repeat;

    The worst case time I've seen is always less than 2 milliseconds. How does that compare with the regex solution?

    #! perl -slw use strict; use Math::Random::MT qw[ rand ]; use Benchmark::Timer; my @array = sort{ length( $b ) <=> length( $a ) || $b cmp $a } qw[ 0123456 012345 01234], map int rand( 123456789 ), 1 .. 5022; my $str = '012345666666'; my $T = new Benchmark::Timer; $T->start( '11-char string in 5025 possibles' ); my $copy = $str; my $n = 0; while( $n < $#array and $copy ne $array[ $n ] ) { chop $copy while $copy gt $array[ $n ]; $n++ while $copy lt $array[ $n ]; last if $copy eq $array[ $n ]; } if( length $copy ) { print "Best match found $copy (at posn: $n )"; } else { print "No match found"; } $T->stop( '11-char string in 5025 possibles' ); $T->report; print for grep defined, @array[ $n - 10 .. $n + 10 ]; __END__ C:\test>583928 Best match found 0123456 (at posn: 4975 ) 1 trial of 11-char string in 5025 possibles (1.621ms total) 1166599 1157595 1146996 1065416 1063127 1063055 1052870 1034509 1021521 1004863 0123456 997625 973306 962972 943968 911619 881691 873333 838373 832949 788294

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Regex only the longes Value , and possible speedup Problem
by Anonymous Monk on Nov 14, 2006 at 09:46 UTC
    Why bother with a regex? I'd use substr(), and do something like:
    my $search_str = "..."; my @numbers = (...); foreach my $n (sort {length($b) <=> length($a)} grep {length($_) <= length($search_str)} @numbers) { if ($n eq substr($search_str,0,length $n)) { print "$n matched!\n"; exit; } } print "No match!\n";