Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Re^6: how to count the no of repeats in a string (really!)

by blazar (Canon)
on Nov 16, 2007 at 23:31 UTC ( #651345=note: print w/replies, xml ) Need Help??


in reply to Re^5: how to count the no of repeats in a string (really!)
in thread how to count the no of repeats in a string (really!)

I've made just a few changes (mostly to my sub, so that it comes closer to my original design - your modifications added a few unnecessary steps) and run the benchmarks again.

I personally believe that those steps were not all that unnecessary. To be definite, I chose to compare subs that accept a string and return a hashref of the counts. Yours doesn't, so some extra step is required. According to the last update to Re^4: how to count the no of repeats in a string (really!), I'm posting a new benchmark here, with your sub as a thin layer around the recursive sub. I hope that is fine...

The subs are the last updates of their respective authors, suitably modified just to be consistent with each other and at the explicit request from some of the people who took part to this thread, for generality I changed them to accept as arguments in order: the minimum string length, the minimum repetition count, and the string to be processed.

Note: I wanted to post the benchmark. But

  • I had problems with oha's sub even in the simplest case;
  • to be sure, before proceeding I added some tests: sadly enough they seem to fail.

So, instead of the benchmark, for the moment I'm posting the script with the tests:

#!/usr/bin/perl use strict; use warnings; use Test::More 'no_plan'; use Benchmark qw/:all :hireswallclock/; my $str='aabcdabcabcecdecd'; sub blazar { my ($str, $min_len, $min_rep)=@_; my $l=length($str); my %count; for my $off (0..$l-1) { for my $len ($min_len .. $l-$off) { my $s = substr $str, $off, $len; $count{ $s } ||= ()= $str =~ /$s/g; } $count{$_} < $min_rep and delete $count{$_} for keys %count; } \%count; } sub kramba { my ($str, $min_len, $min_rep)=@_; my %count; local *count = sub { my( $string) = @_; my $length = length( $string ); if ($length < $min_len) { for (keys %count) { delete $count{$_} if $count{$_} < $min_rep; } return \%count; } for my $l ($min_len..$length) { my $s = substr( $string, 0, $l ); $count{ $s } += 1; } count( substr( $string, 1 ) ); }; count($str); \%count; } sub ikegami { my ($str, $min_len, $min_rep)=@_; local our %counts; use re 'eval'; $str =~ / (.{$min_len,}) # or (.+) (?(?{ !$counts{$1} })(?=.*\1)) (?{ ++$counts{$1} }) (?!) /x; for (keys %counts) { delete $counts{$_} if $counts{$_} < $min_rep; } \%counts; } sub lodin { my ($str, $min_len, $min_rep)=@_; local our %count; use re 'eval'; $str =~ / (.{$min_len,}) (?(?{ $count{$1} }) (?!) ) (?> .*? \1 ){@{[ $min_rep - 2 ]}} .* \1 (?{ ($count{$1} ||= $min_rep-1)++ }) (?!) /x; \%count; } { my %cache; sub _reference { $cache{$_[0]} ||= blazar @_ } } for my $len (1..3) { for my $rep (2..3) { my $tag="len=$len, rep=$rep"; is_deeply kramba($str,$len,$rep), _reference($str,$len,$rep), +"kramba $tag"; is_deeply ikegami($str,$len,$rep), _reference($str,$len,$rep), + "ikegami $tag"; is_deeply lodin($str,$len,$rep), _reference($str,$len,$rep), " +lodin $tag"; } } __END__

15 tests out 18 fail.

Update: I'm an idiot! All tests run smoothly once I use the correct reference. The lesson in this is: don't post when you're too tired. I'm going to sleep and I will update the node with the actual benchmark tomorrow morning... Sorry for the noise!

Update: Ok, I woke up and I'm not that tired anymore. Here's the complete script:

#!/usr/bin/perl use strict; use warnings; use Test::More 'no_plan'; use Benchmark qw/:all :hireswallclock/; my $str='aabcdabcabcecdecd'; sub blazar { my ($str, $min_len, $min_rep)=@_; my $l=length($str); my %count; for my $off (0..$l-1) { for my $len ($min_len .. $l-$off) { my $s = substr $str, $off, $len; $count{ $s } ||= ()= $str =~ /$s/g; } $count{$_} < $min_rep and delete $count{$_} for keys %count; } \%count; } sub kramba { my ($str, $min_len, $min_rep)=@_; my %count; no warnings 'recursion'; local *count = sub { my( $string) = @_; my $length = length( $string ); if ($length < $min_len) { for (keys %count) { delete $count{$_} if $count{$_} < $min_rep; } return \%count; } for my $l ($min_len..$length) { my $s = substr( $string, 0, $l ); $count{ $s } += 1; } count( substr( $string, 1 ) ); }; count($str); \%count; } sub ikegami { my ($str, $min_len, $min_rep)=@_; local our %counts; use re 'eval'; $str =~ / (.{$min_len,}) # or (.+) (?(?{ !$counts{$1} })(?=.*\1)) (?{ ++$counts{$1} }) (?!) /x; for (keys %counts) { delete $counts{$_} if $counts{$_} < $min_rep; } \%counts; } sub lodin { my ($str, $min_len, $min_rep)=@_; local our %count; use re 'eval'; $str =~ / (.{$min_len,}) (?(?{ $count{$1} }) (?!) ) (?> .*? \1 ){@{[ $min_rep - 2 ]}} .* \1 (?{ ($count{$1} ||= $min_rep-1)++ }) (?!) /x; \%count; } { my %cache; sub _reference { $cache{"@_"} ||= blazar @_ } } for my $s ( map {$str x $_} 1,3,42) { for my $len (1..2) { for my $rep (2..3) { my $strlen=length $s; print "\nstring length=$strlen, len=$len, rep=$rep\n\n"; cmpthese +($strlen < 100 ? 10_000 : -60) => { kramba => sub { kramba($s,$len,$rep) }, ikegami => sub { ikegami($s,$len,$rep) }, lodin => sub { lodin($s,$len,$rep) }, }; } } } print "\n"; for my $len (1..3) { for my $rep (2..3) { my $tag="len=$len, rep=$rep"; is_deeply kramba($str,$len,$rep), _reference($str,$len,$rep), +"kramba $tag"; is_deeply ikegami($str,$len,$rep), _reference($str,$len,$rep), + "ikegami $tag"; is_deeply lodin($str,$len,$rep), _reference($str,$len,$rep), " +lodin $tag"; } } __END__

And here's the output:

string length=17, len=1, rep=2 Rate kramba ikegami lodin kramba 1860/s -- -59% -60% ikegami 4570/s 146% -- -1% lodin 4636/s 149% 1% -- string length=17, len=1, rep=3 Rate kramba lodin ikegami kramba 1866/s -- -58% -60% lodin 4413/s 137% -- -5% ikegami 4638/s 149% 5% -- string length=17, len=2, rep=2 Rate kramba lodin ikegami kramba 2000/s -- -63% -64% lodin 5470/s 174% -- -3% ikegami 5618/s 181% 3% -- string length=17, len=2, rep=3 Rate kramba lodin ikegami kramba 1905/s -- -59% -66% lodin 4604/s 142% -- -18% ikegami 5612/s 195% 22% -- string length=51, len=1, rep=2 Rate kramba lodin ikegami kramba 212/s -- -0% -18% lodin 212/s 0% -- -18% ikegami 259/s 22% 22% -- string length=51, len=1, rep=3 Rate kramba ikegami lodin kramba 201/s -- -5% -6% ikegami 212/s 6% -- -1% lodin 215/s 7% 1% -- string length=51, len=2, rep=2 Rate kramba lodin ikegami kramba 202/s -- -5% -7% lodin 213/s 5% -- -2% ikegami 217/s 7% 2% -- string length=51, len=2, rep=3 Rate kramba ikegami lodin kramba 200/s -- -5% -10% ikegami 210/s 5% -- -5% lodin 221/s 11% 5% -- string length=714, len=1, rep=2 s/iter lodin ikegami kramba lodin 2.50 -- -5% -50% ikegami 2.37 6% -- -48% kramba 1.24 102% 91% -- string length=714, len=1, rep=3 s/iter lodin ikegami kramba lodin 3.48 -- -30% -64% ikegami 2.45 42% -- -49% kramba 1.25 180% 96% -- string length=714, len=2, rep=2 s/iter lodin ikegami kramba lodin 2.50 -- -1% -51% ikegami 2.49 1% -- -50% kramba 1.24 103% 101% -- string length=714, len=2, rep=3 s/iter lodin ikegami kramba lodin 3.40 -- -28% -63% ikegami 2.44 39% -- -49% kramba 1.25 173% 96% -- ok 1 - kramba len=1, rep=2 ok 2 - ikegami len=1, rep=2 ok 3 - lodin len=1, rep=2 ok 4 - kramba len=1, rep=3 ok 5 - ikegami len=1, rep=3 ok 6 - lodin len=1, rep=3 ok 7 - kramba len=2, rep=2 ok 8 - ikegami len=2, rep=2 ok 9 - lodin len=2, rep=2 ok 10 - kramba len=2, rep=3 ok 11 - ikegami len=2, rep=3 ok 12 - lodin len=2, rep=3 ok 13 - kramba len=3, rep=2 ok 14 - ikegami len=3, rep=2 ok 15 - lodin len=3, rep=2 ok 16 - kramba len=3, rep=3 ok 17 - ikegami len=3, rep=3 ok 18 - lodin len=3, rep=3 1..18

Krambambuli seems to win in the long string case. Anyway, analysis? Refinements, comments, additions?

Update: well done, lodin++.

Replies are listed 'Best First'.
Re^7: how to count the no of repeats in a string (really!)
by lodin (Hermit) on Nov 17, 2007 at 19:46 UTC

    Anyway, analysis? Refinements, comments, additions?

    These routines are still not equivalent. They all give different results for e.g. 'aaaaa'. You're not using ikegami's version from Re^6: how to count the no of repeats in a string (really!). With that version, ikegami's and krambambuli's versions are equivalent. A simple modification to the regex in your routine makes that equivalent too. I also removed the tail recursion from kramba, which improves the performance with about 50 %.

    The return value from these routines should be interpreted as the answer to "how many unique substrings are there in the string, and what's their frequency". This is a different problem from "how many times is a substring sequentially repeated (possibly with other substrings inbetween) in the string", which is the problem most of us solved initially.

    Here's the code and the output:

    use strict; use warnings; use Test::More 'no_plan'; use Benchmark qw/:all :hireswallclock/; my $str = 'aabcdabcabcecdecdaaaa'; sub blazar { my ($str, $min_len, $min_rep)=@_; my $l=length($str); my %count; for my $off (0..$l-1) { for my $len ($min_len .. $l-$off) { my $s = quotemeta substr $str, $off, $len; $count{ $s } ||= () = $str =~ /(?=$s)./gs; } $count{$_} < $min_rep and delete $count{$_} for keys %count; } \%count; } sub kramba { my ($str, $min_len, $min_rep)=@_; my %count; for my $c (0 .. length($str) - $min_len) { my $string = substr($str, $c); for my $l ($min_len .. length $string) { $count{substr($string, 0, $l)}++; } } for (keys %count) { delete $count{$_} if $count{$_} < $min_rep; } return \%count; } sub ikegami { my ($str, $min_len, $min_rep)=@_; local our %counts; use re 'eval'; $str =~ / (.{$min_len,}) (?{ ++$counts{$1} }) (?!) /x; for (keys %counts) { delete $counts{$_} if $counts{$_} < $min_rep; } \%counts; } my %subs = ( ikegami => \&ikegami, blazar => \&blazar, kramba => \&kramba, ); for my $len (1..3) { for my $rep (2..3) { my $tag="len=$len, rep=$rep"; my $ref = ikegami($str, $len, $rep); for my $name (keys %subs) { my $code = $subs{$name}; is_deeply($code->($str, $len, $rep), $ref, "$name - $tag") +; } } } print "\n"; for my $s ( map {$str x $_} 1,3,42) { for my $len (1..2) { for my $rep (2..3) { printf "length=%d, min_len=%d, min_rep=%d\n", length $s, $len, $rep ; cmpthese(-60, { map { my $c = $subs{$_}; $_ => sub { $c->($s, $len, $rep) } } keys %subs }); print "\n"; } } } __END__ ok 1 - blazar - len=1, rep=2 ok 2 - ikegami - len=1, rep=2 ok 3 - kramba - len=1, rep=2 ok 4 - blazar - len=1, rep=3 ok 5 - ikegami - len=1, rep=3 ok 6 - kramba - len=1, rep=3 ok 7 - blazar - len=2, rep=2 ok 8 - ikegami - len=2, rep=2 ok 9 - kramba - len=2, rep=2 ok 10 - blazar - len=2, rep=3 ok 11 - ikegami - len=2, rep=3 ok 12 - kramba - len=2, rep=3 ok 13 - blazar - len=3, rep=2 ok 14 - ikegami - len=3, rep=2 ok 15 - kramba - len=3, rep=2 ok 16 - blazar - len=3, rep=3 ok 17 - ikegami - len=3, rep=3 ok 18 - kramba - len=3, rep=3 length=21, min_len=1, min_rep=2 Rate blazar ikegami kramba blazar 364/s -- -84% -86% ikegami 2329/s 539% -- -13% kramba 2674/s 634% 15% -- length=21, min_len=1, min_rep=3 Rate blazar ikegami kramba blazar 355/s -- -85% -87% ikegami 2316/s 553% -- -16% kramba 2744/s 674% 18% -- length=21, min_len=2, min_rep=2 Rate blazar ikegami kramba blazar 381/s -- -85% -87% ikegami 2462/s 546% -- -13% kramba 2826/s 642% 15% -- length=21, min_len=2, min_rep=3 Rate blazar ikegami kramba blazar 381/s -- -84% -86% ikegami 2424/s 537% -- -13% kramba 2802/s 636% 16% -- length=63, min_len=1, min_rep=2 Rate blazar ikegami kramba blazar 28.8/s -- -91% -94% ikegami 328/s 1039% -- -26% kramba 444/s 1441% 35% -- length=63, min_len=1, min_rep=3 Rate blazar ikegami kramba blazar 35.1/s -- -90% -92% ikegami 340/s 870% -- -21% kramba 432/s 1130% 27% -- length=63, min_len=2, min_rep=2 Rate blazar ikegami kramba blazar 32.0/s -- -91% -93% ikegami 355/s 1009% -- -21% kramba 452/s 1312% 27% -- length=63, min_len=2, min_rep=3 Rate blazar ikegami kramba blazar 35.0/s -- -90% -92% ikegami 335/s 855% -- -21% kramba 423/s 1106% 26% -- length=882, min_len=1, min_rep=2 Rate blazar ikegami kramba blazar 6.46e-002/s -- -95% -95% ikegami 1.20/s 1754% -- -16% kramba 1.42/s 2104% 19% -- length=882, min_len=1, min_rep=3 Rate blazar ikegami kramba blazar 6.16e-002/s -- -95% -96% ikegami 1.17/s 1804% -- -17% kramba 1.41/s 2189% 20% -- length=882, min_len=2, min_rep=2 Rate blazar ikegami kramba blazar 6.14e-002/s -- -95% -96% ikegami 1.18/s 1829% -- -17% kramba 1.42/s 2213% 20% -- length=882, min_len=2, min_rep=3 Rate blazar ikegami kramba blazar 6.27e-002/s -- -95% -96% ikegami 1.18/s 1779% -- -18% kramba 1.44/s 2197% 22% -- 1..18

    lodin

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://651345]
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (7)
As of 2018-06-18 20:16 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?



    Results (110 votes). Check out past polls.

    Notices?