Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

Efficient run determination.

by BrowserUk (Pope)
on Nov 14, 2002 at 09:23 UTC ( #212796=perlquestion: print w/ replies, xml ) Need Help??
BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Given a string consisting of runs of chars, eg.

0 1 2 3 4 5 + 6 01234567890123456789012345678901234567890123456789012345 +67890 my $string =' aaaa bbbbccccccccbbbb aaaabbbbbcdddddddddddddddd +dddd';

I need to iterate over the string returning, (value, startpos and length) after each iteration. So the values for the above string would be

(' ',0,4), ('a',4,4),(' ',8,4),('b',12,4),('c',16,8),('b',24,4),(' ',28,3),('a',31,4),('b',40,1)...

There is no significance in the choice of example string, each run could be of any char 0..255. Nor the lengths, the number of 4's and multiples thereof seems to be something to do with the repeat speed of my keyboard and my reaction time.

I don't really mind if that means I get an array with the values interleaved, 3 arrays with the triples spanning the arrays.

I've tried various methods, but I feel that a clever regex solution using \G and /cg in a loop would be possible and probably more efficient that most I've tried, but it escapes me. NOTE: My goal here is absolutely efficiency not keystroke golf.

I'll benchmark any and all solutions offered, what I'm really after is the regex solution if its possible and also alternative methods.

Thanks, BrowserUK


Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
Pick up your cloud down the end and "Yes" if you get allocated a grey one they are a bit damp under foot, but someone has to get them.
Get used to the wings fast cos its an 8 hour day...unless the Govenor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

Comment on Efficient run determination.
Download Code
Re: Efficient run determination.
by robartes (Priest) on Nov 14, 2002 at 10:07 UTC
    OK, I'm not really answering your question (I'm not using a regexp), and this is a method you have probably tried before, but if it's speed you want, how about simply converting the string to an array and looping once over its elements?
    use strict; my $string=" aaaa bbbbccccccccbbbb aaaabbbbbcddddddddddddddddd +dddd"; my @listedstring= split//,$string; my $prev=shift @listedstring; my $currstart=my $index=0; for (@listedstring) { if ($_ eq $prev) { $index++; } else { print "('$prev',$currstart,".($index-$currstart+1).")\n"; $currstart=++$index; $prev=$_; } } print "('$prev',$currstart,".($index-$currstart+1).")\n";
    Interesting problem, BrowserUk!

    CU
    Robartes-

Re: Efficient run determination.
by tommyw (Hermit) on Nov 14, 2002 at 10:11 UTC

    Is $string=~/((.)\2*)/g what you're looking for?

    #! /usr/bin/perl -w use Data::Dumper; use strict; my $string =' aaaa bbbbccccccccbbbb aaaabbbbbcdddddddddddddddd +dddd'; my $pos=0; my @triples=(); my @reps=$string=~/((.)\2*)/g; while (@reps) { my $hits=shift @reps; my $char=shift @reps; push @triples, [$char, $pos, length $hits]; $pos+=length $hits; } $Data::Dumper::Indent=0; print Dumper @triples; __DATA__ $VAR1 = [' ',0,4];$VAR2 = ['a','4',4];$VAR3 = [' ','8',4];$VAR4 = ['b' +,'12',4];$VAR5 = ['c','16',8];$VAR6 = ['b','24',4];$VAR7 = [' ','28', +3];$VAR8 = ['a','31',4];$VAR9 = ['b','35',5];$VAR10 = ['c','40',1];$V +AR11 = ['d','41',20];

    Memo to self: must learn to use Data::Dumper properly ;-)

    --
    Tommy
    Too stupid to live.
    Too stubborn to die.

Re: Efficient run determination.
by dingus (Friar) on Nov 14, 2002 at 10:23 UTC
    Here's the no regex simple iterator version. Just for comparison as I suspect that there is a faster method.
    use Data::Dumper; my $string =' aaaa bbbbccccccccbbbb aaaabbbbbcdddddddddddddddd +dddd'; my (@res, $c, $p, $i); $p = 0; $c = substr($string,$p,1); for ($i=1; $i<length($string); $i++) { next if ($c eq substr($string,$i,1)); push (@res, [$c,$p,($i-$p)]); $c = substr($string,$i,1); $p = $i; } push (@res, [$c,$p,($i-$p)]); Dumper (\@res);

    Dingus


    Enter any 47-digit prime number to continue.
Re: Efficient run determination.
by dingus (Friar) on Nov 14, 2002 at 10:34 UTC
    And here's a simple while (//g;) loop version:
    use Data::Dumper; my $string =' aaaa bbbbccccccccbbbb aaaabbbbbcdddddddddddddddd +dddd'; my (@res, $i); $i = 0; while ($string =~ /(.)\1*/g) { push (@res, [$1, $i, pos($string)-$i]); $i = pos($string); } Dumper (\@res);

    Dingus


    Enter any 47-digit prime number to continue.
Re: Efficient run determination.
by Enlil (Parson) on Nov 14, 2002 at 10:38 UTC
    Neat problem. Here is my initial crack at it:
    use strict; use Data::Dumper; my $string =' aaaa bbbbccccccccbbbb aaaabbbbbcdddddddddddddddd +dddd'; my @bah; my $old_pos = 0; while ($string =~ /((.)\2*)/g) { push @bah, [$2,$old_pos,length($1)]; $old_pos = pos($string); } print Dumper(\@bah);
    I am really interested as to how the benchmarking turns out. Let us know.

    Update: It is late, but I just realized i would probably benchmark all the code that turns up on this post anyhow. Once again interesting problem.

    Updated Again: Changed the \2+ to \2* or it would have failed when there was a series one character long.

    -enlil

      Second Try. I am going off to bed after this, my benchmarking tells me this is faster than my previous try:
      use strict; use Data::Dumper; my $string =' aaaa bbbbccccccccbbbb aaaabbbbbcdddddddddddddddd +dddd'; my @bah; while ($string =~ /((.)\2*)/g) { push (@bah, [$2,$-[1],$+[1] - $-[1]]); } print Dumper(\@bah);

      Update Changed the \2+ to \2* or it would have failed when there was a series one character long.

      -enlil

Re: Efficient run determination.
by rasta (Hermit) on Nov 14, 2002 at 10:42 UTC
    Not enought laconically, although fast enought:
    $string =' aaaa bbbbccccccccbbbb aaaabbbbbcddddddddddddddddddd +d'; $l = length($string); $pp = 0; while ($pp < $l) { $c = substr $string, $pp, 1; if ($string =~ /\G\Q$c\E+/gc) { print "('$c',", $pp, ",",pos($string) - $pp,")\n"; $pp = pos($string); } }
Re: Efficient run determination.
by pg (Canon) on Nov 14, 2002 at 17:34 UTC
    I donít think regexp will help improve efficiency in this case. The requirement is very straight forward, and I believe everyone agrees that, any solution has to iterate thru the string (if you use regexp, regexp will do this), doesnít matter how quick or slow it is. Theoretically, in this case, for any reasonably efficient algorithm, including the best solution anyone would be able to find, t ~ o(l) stands true for all solutions (t is the time spent, and the length of the string is l, and t ~ o(l) means t is a linear function of l).

    Considering regexp is more general, and most likely would do more than you want in a specific case. It is reasonable that a specific function you wrote would be faster than use regexp.
Re: Efficient run determination.
by PhiRatE (Monk) on Nov 14, 2002 at 21:23 UTC
    Its all about using the right language for the job.

    @res = p_process(' aaaa bbbbccccccccbbb aaaaabbbbbcddddddddddddd +ddddddd'); print Dumper(@res); use Inline C => <<'END_OF_C_CODE'; void p_process(char *s) { char prev = 'Q'; long count = 0; long pos = 0; long i=0; AV *array; Inline_Stack_Vars; Inline_Stack_Reset; while(*s != 0) { if (count==0) { pos = i; prev = *s; count = 1; } else if (prev == *s) { count++; } else { array = newAV(); av_push(array,newSVpvn(&prev,1)); av_push(array,newSViv(pos)); av_push(array,newSViv(count)); Inline_Stack_Push(newRV_inc(array)); pos=i; prev = *s; count=1; } i++; s++; } Inline_Stack_Done; } END_OF_C_CODE

    (Note, if you have a very long string, you should look at shifting things into a top-level array instead of onto the stack directly, since the stack might run out of room)

    This one doesn't handle char 0, since that is the C string termination character. A fairly trivial modification to using the SV as it comes in would fix that however.

      Well, I ran some benchmarks myself, basically because I wanted validation :) I like to think I'm not biased but that's probably untrue :)

      For 1000 iterations over the given test string, the time to complete is given in seconds. Some of the solutions provided were modified slightly so that they were contained in a function, and returned an AoA containing the results.

      PhiRatE 1:  0.042767s
      Dingus 2:   0.083538s
      Dingus 1:   0.106047s
      TommyW 1:   0.117478s
      Enlil 2:    0.138005s
      Robartes 1: 0.217086s
      Rasta 1:    0.273936s
      

      The Inline C function I proposed is nearly twice as fast as the nearest all-perl competitor (Dingus' 2nd attempt). I must admit its not as much of an advantage as I had expected, but still clear. A run of 10000 each revealed almost exactly the same ratios.

      Please note that no attempt to validate the correctness of any of the entries was made.

      For further interest I did a test with a considerably longer string (1920 chars):

      PhiRatE:     2.238917s
      Dingus 1:    5.350828s
      Dingus 2:    5.474964s
      Enlil 2:     9.461141s
      TommyW 1:    8.870724s
      Robartes 1: 10.239816s
      Rasta 1:    19.996283s
      

      The inline method is even further ahead in this test. Interestingly Dingus 2, which had previously performed second-best, dropped to 3rd. Possibly due to the different nature of the data or some non O(1) operations within the calls made. Others held up variously against the increased length. The operation was fairly memory intensive but I ran all the competitors in various orders to ensure that it wasn't affecting anything.

        Nice one. Now send me a version that I don't have to compile:)

        Seriously, thanks for doing the benchmark. I am doing the same right now, but am having trouble ensuring that I'm comparing apples with apples. I have one (perl) solution that hasn't been posted yet which might do better, but it's very difficult to benchmark as it relies on @- to get the positions and lengths and this system var goes out of scope very quickly.

        Here is my simplistic test program run on a 1920 char string. One run only and a different string from yours so direct comparison is dodgey, but it finds 351 triplets in the 1920 chars string in a little under 0.4 of a second, which may mean it isn't doing so bad by comparison with your C version. Maybe you could include it into you benchmark (HARD!) or at least time one run of it using the same data as you did for your other tests for comparison.

        #! perl -slw use strict; use re 'eval'; #! Took me ages to find this. use Benchmark::Timer; my $t = new Benchmark::Timer; #! Set up big regex. 1-time hit. my $re ='(?:(.)(??{"$+*"}))?' x 500; $re = qr/$re/o; # Test data(+2) x 32 to make it more comparable $_ = ' aaaa bbbbccccccccbbbb aaaabbbbbcdddddddddddddddddddddd' x +32; #! All the data is gathered here in one shot. #! Chars in @c - start positions in @-, lengths derived: length N = $- +[N+2] - $-[N+1] #! Caveat. lengths must be used immediately (or stored elsewhere which + costs) else the go stale:) $t->start('bigre'); my @c = m/$re/; #! THIS LINE DOES ALL THE WORK. #! This truncates the list to exclude null matches returned from regex +. $#c = $#- -1; $t->stop('bigre'); #! data accessed here. printf "('%1s', %3d, %3d)\n", $c[$_], $-[$_+1], ( $-[$_+2] || $+[0] ) +- $-[$_+1] for 0..$#c; print $#c, ' triplets found in '; $t->report; __END__ C:\test>212796-2 (' ', 0, 3) ('a', 3, 4) (' ', 7, 3) .... 345 values omitted .... ('b', 1892, 5) ('c', 1897, 1) ('d', 1898, 22) 351 triplets found in 1 trial of bigre (371ms total) C:\test>

        The basic idea was to push the loop into the regex engine instead of external to it, and use the regex engines well tuned optimisations to do the data gathering for me. I originally though that using the /g modifier would do the trick, but it only returns the positions for the last match made. Then I thought of using a repeated regex ({1..n}) but again this is treated such that only the positions for the last capture group are available.

        The answer was to construct a single large regex with enough capture groups to gather the data I needed. This was made harder by the need to only capture the first char of the group, but also to know its end position. That's when I discovered the (??{}) re and combined this with $+ in order to achieve my aims.

        I'm not sure that this is good or maintainable Perl, but it seems to work and appears on preliminary tests to be very fast, which was and is the only criteria by which I am judging it.


        Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
        Pick up your cloud down the end and "Yes" if you get allocated a grey one they are a bit damp under foot, but someone has to get them.
        Get used to the wings fast cos its an 8 hour day...unless the Govenor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
        Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

My final entry :)
by PhiRatE (Monk) on Nov 15, 2002 at 12:21 UTC
    use Inline C => <<'END_OF_C_CODE'; void p_process_2(SV *svv) { char prev = 0; long count = 0; long i=0; long len = SvCUR(svv); AV *array; AV *list; char *s = SvPV(svv, len); Inline_Stack_Vars; Inline_Stack_Reset; list = newAV(); av_extend(list, 500); prev = *s; for (i=0; i<len; i++) { if (prev == *s) { count++; } else { array = newAV(); av_push(array,newSVpvn(&prev,1)); av_push(array,newSViv(i-count)); av_push(array,newSViv(count)); av_push(list,newRV_inc(array)); prev = *s; count=1; } s++; } Inline_Stack_Push(newRV_inc(list)); Inline_Stack_Done; } END_OF_C_CODE

    Featuring no problems with null chars (binary clean), array pre-extension and the fastest performance so far :)

    Its kinda like golf with benchmarks.

Re: Efficient run determination.
by BrowserUk (Pope) on Nov 15, 2002 at 17:22 UTC

    Okay. After some interesting discussion with PhiRatE, and utilsing his benchmark, I conclude that barring an Inline::C approach for various reasons but mostly because I want the solution to run on machines that do not have a C compiler set up, I declare Tah Tah Tah Tah. PhiRatE the winner anyway with his pure Perl solution.

    Much to my surprise, his chop backwards and count approach seems faster than any another (in Perl). PhiRatE++ and thanks.

    With some adjustments to the regex to prevent backtracking and the need for runtime interpolation, my big regex approach doesn't fair so badly as before. I should point out that mine is not a full solution in that it doesn't return the positions/length from the subroutine. These would be derived from @-, in situ. This biases the test in my favour somewhat. However, utilising the values that are in @- rather than copying them elsewhere would be a valid strategy in use. However, even with this advantage, it's still well short of PhiRatE's perl Solution.

    My thanks to everyone that took a crack.

    The results are

    C:\test>213121.pl Iterations: 500 Length: 500 Enlil_2: 0.02351 sec/iteration found:208 Dingus_1: 0.01400 sec/iteration found:208 PhiRatE_2: 0.00987 sec/iteration found:208 Rasta_1: 0.05989 sec/iteration found:208 TommyW_1: 0.01516 sec/iteration found:208 Robartes_1: 0.01560 sec/iteration found:208 BrowserUk: 0.01186 sec/iteration found:208 Dingus_2: 0.01785 sec/iteration found:208 C:\test>

    It doesn't include the C version as I'm not set up for Inline::C.

    The benchmark code follows:

      I was looking over the results and I mistakenly put that my first solution was slower than my second one (did I mention that it was late).

      Anyhow, I went ahead and used the same benchmarking methods as used by BrowserUK above to see the results for myself except that I wanted to see how different solutions would do under different conditions. So I modified the code a little (both to do a more thorough benchmark (on the Perl only solutions, as the Inline C version is tons faster so no need to beat another dead horse),and also to add my first attempt (Enlil_1)):

      One of the first things that I noticed, is that PhiRate's Perl code is broken, as the while($next||$prev) will return false when two 0's show up in a row (i.e. "00"). Regardless, running this code on both a RH Linux 7.2 running Perl 5.8 (compiled from source) or AS Perl on WinXP Pro Enlil_1 and Dingus_2 were usually neck in neck (when PhiRate's solution returns correctly it is usually close as well though almost always last most of my results slower).

      Even when I play around with the $max_size, and the multiplier (the "x int(rand(9))" part), the rankings seem to stay the same.

      Here are some results (from both places though I have removed the string as it is sometimes rather long.):

      -enlil

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://212796]
Approved by davis
Front-paged by robartes
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (12)
As of 2014-09-30 12:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (366 votes), past polls