Re^5: how to count the number of repeats in a string (really!)

I've made just a few changes (mostly to my sub, so that it comes closer to my original design - your modifications added a few unnecessary steps) and run the benchmarks again.

The results seem to support my original suspicion that - at least for this particular problem - a regexp based solution would have to loose the fight against an approach that never has to look behind or ahead, but just touches every possible substring exactly 1 time:

Results for string:

"aabcdabcabcecdecd "

          Rate  blazar  kramba ikegami   lodin     oha
blazar   464/s      --    -77%    -89%    -90%    -90%
kramba  2062/s    344%      --    -52%    -54%    -58%
ikegami 4255/s    817%    106%      --     -4%    -13%
lodin   4444/s    858%    116%      4%      --     -9%
oha     4878/s    951%    137%     15%     10%      --

Results for string:

"aabcdabcabcecdecd aabcdabcabcecdecd "

          Rate  blazar     oha  kramba ikegami   lodin
blazar  92.4/s      --    -85%    -86%    -86%    -87%
oha      610/s    560%      --     -7%     -9%    -14%
kramba   658/s    612%      8%      --     -1%     -7%
ikegami  667/s    621%      9%      1%      --     -6%
lodin    709/s    667%     16%      8%      6%      --

Results for string:

"aabcdabcabcecdecd aabcdabcabcecdecd aabcdabcabcecdecd aabcdabcabcecde
+cd "

          Rate  blazar   lodin ikegami     oha  kramba
blazar  21.4/s      --    -85%    -86%    -86%    -90%
lodin    144/s    574%      --     -3%     -5%    -35%
ikegami  148/s    594%      3%      --     -2%    -33%
oha      151/s    607%      5%      2%      --    -31%
kramba   220/s    930%     53%     48%     46%      --
[download]

Here's the full code I've used for benchmarking.

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use constant MIN_LENGTH => 2;
use constant MIN_REPEATS => 2;
use Benchmark qw/:all :hireswallclock/;

my $str='aabcdabcabcecdecd';

sub blazar {
    local $_=shift;
    my $l=length;
    my %count;

    for my $off (0..$l-1) {
        for my $len (MIN_LENGTH .. $l-$off) {
            my $s = substr $_, $off, $len;
            $count{ $s } ||= ()= /$s/g;
        }
        $count{$_} < MIN_REPEATS and
          delete $count{$_} for keys %count;
    }
    \%count;
}

sub oha {
    my $s=shift;
    my %saw;

    while($s =~ /(..+)(?=.*?\1)/g) {
        for my $x (0..length $1) {
            @saw{ map {substr $1, $x, $_} $x+2..length $1 } = ();
        }
    }
    $saw{$_} =()= $s =~ /\Q$_/g for keys %saw;
    \%saw;
}

sub ikegami {
    my $str = shift;

    local our %counts;
    $str =~ /
      (.{2,})   # or (.+)
      (?(?{ !$counts{$1} })(?=.*\1))
      (?{ ++$counts{$1} })
      (?!)
      /x;
    \%counts;
}

sub lodin {
    my $str = shift;

    local our %count;
    $str =~ /
      (.{2,})
      (?(?{ $count{$1} })
          (?!)
      )
      .*
      \1
      (?{ ($count{$1} ||= 1)++ })
      (?!)
      /x;
    \%count;
}


{
    my %count;

    sub kramba {
        my( $string) = @_;
        my $length = length( $string );

        if ($length < MIN_LENGTH) {
            for (keys %count) {
                delete $count{$_}
                    if $count{$_} < MIN_REPEATS;
            }
            return \%count;
        }

        for my $l (MIN_LENGTH..$length) {
            my $s = substr( $string, 0, $l );
            $count{ $s } += 1;
        }

        kramba( substr( $string, 1 ) );
    };
}

for my $multiplier (1, 2, 4) {
   my $work_str = "$str " x $multiplier;

print "Results for string:\n\n\"$work_str\"\n\n";
    cmpthese 2000/$multiplier => {
        blazar  => sub { blazar $work_str },
        oha     => sub { oha $work_str },
        kramba  => sub { kramba $work_str },
        ikegami => sub { ikegami $work_str },
        lodin   => sub { lodin $work_str },
    }
}
[download]

Another mention I'd make is that if some changes would be needed to the subs - like for example considering at least MIN_REPEATS repetitions of a string to be counted - I'm afraid it might be rather challenging in modifying the RX-ish solutions.
Speaking for me, I wouldn't know how to make it in the code above, even if I think of me as not being a novice any more when dealing with regular expressions.

Update Ahmm... there are some things broken, and I'll have to find out which one. Checking results for simple cases looked ok, so I thought things are ok. But then trying to run the benchmark for the longer text that Oha proposed, I noticed some problems and so tried to just output the _count_ of strings retained by each sub for a string like

'aabcdabcabcecdecd aabcdabcabcecdecd aabcdabcabcecdecd aabcdabcabcecdecd '

Much to my surprize, that came out as

467,337,791,467,467

for respectively blazar, oha, kramba, ikegami, lodin. Ooops...

How was that: who has a clock, knows what the time is, who has 2 clocks, has a problem... :)

Update 2 With Oha's longer latin text, the counts are - in the same order as above - 419,244,371,371,371 and my little recursive beauty complains about 'Deep recursion on subroutine "main::kramba" at ./test.pl line 95'. Well, understandable...

Krambambuli
---
enjoying Mark Jason Dominus' Higher-Order Perl

Comment on Re^5: how to count the number of repeats in a string (really!) Select or Download Code

Replies are listed 'Best First'.
Re^6: how to count the number of repeats in a string (really!) by ikegami (Patriarch) on Nov 16, 2007 at 02:37 UTC
Another mention I'd make is that if some changes would be needed to the subs - like for example considering at least MIN_REPEATS repetitions of a string to be counted - I'm afraid it might be rather challenging in modifying the RX-ish solutions. One simple solution is to go back to my (unposted) original solution and check MIN_REPEATS outside. `use constant MIN_REPEATS => 2; # Must have at least this many repeats my $str = 'aabcdabcabcecdecd'; local our %counts; $str =~ /(.{2,})(?{ ++$counts{$1} })(?!)/s; delete @counts{ grep $counts{$_}<MIN_REPEATS, keys %counts }; use Data::Dumper; print Dumper \%counts;` [download] That way, there's nothing complicated left in the regex. The only part that's modifiable is `/.{2,}/`, which is easier to understand and modify than hand-rolled parsing code. a regexp based solution would have to loose the fight against an approach that never has to look behind or ahead, but just touches every possible substring exactly 1 time Not so! The above regex never has to look beind or ahead and touches every possible substring exactly 1 time.	[reply] [d/l] [select]
Re^6: how to count the number of repeats in a string (really!) by lodin (Hermit) on Nov 16, 2007 at 01:38 UTC
Another mention I'd make is that if some changes would be needed to the subs - like for example considering at least MIN_REPEATS repetitions of a string to be counted - I'm afraid it might be rather challenging in modifying the RX-ish solutions. I've updated my original reply to include minimum substring length and minimum matched patterns. It would be interesting to see our and the other subroutines benchmarked against each other with this addition. lodin	[reply]
Re^6: how to count the number of repeats in a string (really!) by blazar (Canon) on Nov 16, 2007 at 23:31 UTC
I've made just a few changes (mostly to my sub, so that it comes closer to my original design - your modifications added a few unnecessary steps) and run the benchmarks again. I personally believe that those steps were not all that unnecessary. To be definite, I chose to compare subs that accept a string and return a hashref of the counts. Yours doesn't, so some extra step is required. According to the last update to Re^4: how to count the number of repeats in a string (really!), I'm posting a new benchmark here, with your sub as a thin layer around the recursive sub. I hope that is fine... Read more... (631 Bytes) So, instead of the benchmark, for the moment I'm posting the script with the tests: Read more... (3 kB) 15 tests out 18 fail. Update: I'm an idiot! All tests run smoothly once I use the correct reference. The lesson in this is: don't post when you're too tired. I'm going to sleep and I will update the node with the actual benchmark tomorrow morning... Sorry for the noise! Update: Ok, I woke up and I'm not that tired anymore. Here's the complete script: Read more... (3 kB) And here's the output: Read more... (3 kB) Krambambuli seems to win in the long string case. Anyway, analysis? Refinements, comments, additions? Update: well done, lodin++.	[reply] [d/l] [select]
Re^7: how to count the number of repeats in a string (really!) by lodin (Hermit) on Nov 17, 2007 at 19:46 UTC
Anyway, analysis? Refinements, comments, additions? These routines are still not equivalent. They all give different results for e.g. 'aaaaa'. You're not using ikegami's version from Re^6: how to count the number of repeats in a string (really!). With that version, ikegami's and krambambuli's versions are equivalent. A simple modification to the regex in your routine makes that equivalent too. I also removed the tail recursion from `kramba`, which improves the performance with about 50 %. The return value from these routines should be interpreted as the answer to "how many unique substrings are there in the string, and what's their frequency". This is a different problem from "how many times is a substring sequentially repeated (possibly with other substrings inbetween) in the string", which is the problem most of us solved initially. Here's the code and the output: use strict; use warnings; use Test::More 'no_plan'; use Benchmark qw/:all :hireswallclock/; my $str = 'aabcdabcabcecdecdaaaa'; sub blazar { my ($str, $min_len, $min_rep)=@_; my $l=length($str); my %count; for my $off (0..$l-1) { for my $len ($min_len .. $l-$off) { my $s = quotemeta substr $str, $off, $len; $count{ $s } \|\|= () = $str =~ /(?=$s)./gs; } $count{$_} < $min_rep and delete $count{$_} for keys %count; } \%count; } sub kramba { my ($str, $min_len, $min_rep)=@_; my %count; for my $c (0 .. length($str) - $min_len) { my $string = substr($str, $c); for my $l ($min_len .. length $string) { $count{substr($string, 0, $l)}++; } } for (keys %count) { delete $count{$_} if $count{$_} < $min_rep; } return \%count; } sub ikegami { my ($str, $min_len, $min_rep)=@_; local our %counts; use re 'eval'; $str =~ / (.{$min_len,}) (?{ ++$counts{$1} }) (?!) /x; for (keys %counts) { delete $counts{$_} if $counts{$_} < $min_rep; } \%counts; } my %subs = ( ikegami => \&ikegami, blazar => \&blazar, kramba => \&kramba, ); for my $len (1..3) { for my $rep (2..3) { my $tag="len=$len, rep=$rep"; my $ref = ikegami($str, $len, $rep); for my $name (keys %subs) { my $code = $subs{$name}; is_deeply($code->($str, $len, $rep), $ref, "$name - $tag") +; } } } print "\n"; for my $s ( map {$str x $_} 1,3,42) { for my $len (1..2) { for my $rep (2..3) { printf "length=%d, min_len=%d, min_rep=%d\n", length $s, $len, $rep ; cmpthese(-60, { map { my $c = $subs{$_}; $_ => sub { $c->($s, $len, $rep) } } keys %subs }); print "\n"; } } } __END__ ok 1 - blazar - len=1, rep=2 ok 2 - ikegami - len=1, rep=2 ok 3 - kramba - len=1, rep=2 ok 4 - blazar - len=1, rep=3 ok 5 - ikegami - len=1, rep=3 ok 6 - kramba - len=1, rep=3 ok 7 - blazar - len=2, rep=2 ok 8 - ikegami - len=2, rep=2 ok 9 - kramba - len=2, rep=2 ok 10 - blazar - len=2, rep=3 ok 11 - ikegami - len=2, rep=3 ok 12 - kramba - len=2, rep=3 ok 13 - blazar - len=3, rep=2 ok 14 - ikegami - len=3, rep=2 ok 15 - kramba - len=3, rep=2 ok 16 - blazar - len=3, rep=3 ok 17 - ikegami - len=3, rep=3 ok 18 - kramba - len=3, rep=3 length=21, min_len=1, min_rep=2 Rate blazar ikegami kramba blazar 364/s -- -84% -86% ikegami 2329/s 539% -- -13% kramba 2674/s 634% 15% -- length=21, min_len=1, min_rep=3 Rate blazar ikegami kramba blazar 355/s -- -85% -87% ikegami 2316/s 553% -- -16% kramba 2744/s 674% 18% -- length=21, min_len=2, min_rep=2 Rate blazar ikegami kramba blazar 381/s -- -85% -87% ikegami 2462/s 546% -- -13% kramba 2826/s 642% 15% -- length=21, min_len=2, min_rep=3 Rate blazar ikegami kramba blazar 381/s -- -84% -86% ikegami 2424/s 537% -- -13% kramba 2802/s 636% 16% -- length=63, min_len=1, min_rep=2 Rate blazar ikegami kramba blazar 28.8/s -- -91% -94% ikegami 328/s 1039% -- -26% kramba 444/s 1441% 35% -- length=63, min_len=1, min_rep=3 Rate blazar ikegami kramba blazar 35.1/s -- -90% -92% ikegami 340/s 870% -- -21% kramba 432/s 1130% 27% -- length=63, min_len=2, min_rep=2 Rate blazar ikegami kramba blazar 32.0/s -- -91% -93% ikegami 355/s 1009% -- -21% kramba 452/s 1312% 27% -- length=63, min_len=2, min_rep=3 Rate blazar ikegami kramba blazar 35.0/s -- -90% -92% ikegami 335/s 855% -- -21% kramba 423/s 1106% 26% -- length=882, min_len=1, min_rep=2 Rate blazar ikegami kramba blazar 6.46e-002/s -- -95% -95% ikegami 1.20/s 1754% -- -16% kramba 1.42/s 2104% 19% -- length=882, min_len=1, min_rep=3 Rate blazar ikegami kramba blazar 6.16e-002/s -- -95% -96% ikegami 1.17/s 1804% -- -17% kramba 1.41/s 2189% 20% -- length=882, min_len=2, min_rep=2 Rate blazar ikegami kramba blazar 6.14e-002/s -- -95% -96% ikegami 1.18/s 1829% -- -17% kramba 1.42/s 2213% 20% -- length=882, min_len=2, min_rep=3 Rate blazar ikegami kramba blazar 6.27e-002/s -- -95% -96% ikegami 1.18/s 1779% -- -18% kramba 1.44/s 2197% 22% -- 1..18 [download] lodin	[reply] [d/l] [select]