Another mention I'd make is that if some changes would be needed to the subs - like for example considering at least MIN_REPEATS repetitions of a string to be counted - I'm afraid it might be rather challenging in modifying the RX-ish solutions.
One simple solution is to go back to my (unposted) original solution and check MIN_REPEATS outside.
use constant MIN_REPEATS => 2; # Must have at least this many repeats
my $str = 'aabcdabcabcecdecd';
local our %counts;
$str =~ /(.{2,})(?{ ++$counts{$1} })(?!)/s;
delete @counts{ grep $counts{$_}<MIN_REPEATS, keys %counts };
use Data::Dumper;
print Dumper \%counts;
That way, there's nothing complicated left in the regex. The only part that's modifiable is /.{2,}/, which is easier to understand and modify than hand-rolled parsing code.
a regexp based solution would have to loose the fight against an approach that never has to look behind or ahead, but just touches every possible substring exactly 1 time
Not so! The above regex never has to look beind or ahead and touches every possible substring exactly 1 time.
| [reply] [d/l] [select] |
I've made just a few changes (mostly to my sub, so that it comes closer to my original design - your modifications added a few unnecessary steps) and run the benchmarks again.
I personally believe that those steps were not all that unnecessary. To be definite, I chose to compare subs that accept a string and return a hashref of the counts. Yours doesn't, so some extra step is required. According to the last update to Re^4: how to count the number of repeats in a string (really!), I'm posting a new benchmark here, with your sub as a thin layer around the recursive sub. I hope that is fine...
So, instead of the benchmark, for the moment I'm posting the script with the tests:
15 tests out 18 fail.
Update: I'm an idiot! All tests run smoothly once I use the correct reference. The lesson in this is: don't post when you're too tired. I'm going to sleep and I will update the node with the actual benchmark tomorrow morning... Sorry for the noise!
Update: Ok, I woke up and I'm not that tired anymore. Here's the complete script:
And here's the output:
Krambambuli seems to win in the long string case. Anyway, analysis? Refinements, comments, additions?
Update: well done, lodin++.
| [reply] [d/l] [select] |
Anyway, analysis? Refinements, comments, additions?
These routines are still not equivalent. They all give different results for e.g. 'aaaaa'. You're not using ikegami's version from Re^6: how to count the number of repeats in a string (really!). With that version, ikegami's and krambambuli's versions are equivalent. A simple modification to the regex in your routine makes that equivalent too. I also removed the tail recursion from kramba, which improves the performance with about 50 %.
The return value from these routines should be interpreted as the answer to "how many unique substrings are there in the string, and what's their frequency". This is a different problem from "how many times is a substring sequentially repeated (possibly with other substrings inbetween) in the string", which is the problem most of us solved initially.
Here's the code and the output:
use strict;
use warnings;
use Test::More 'no_plan';
use Benchmark qw/:all :hireswallclock/;
my $str = 'aabcdabcabcecdecdaaaa';
sub blazar {
my ($str, $min_len, $min_rep)=@_;
my $l=length($str);
my %count;
for my $off (0..$l-1) {
for my $len ($min_len .. $l-$off) {
my $s = quotemeta substr $str, $off, $len;
$count{ $s } ||= () = $str =~ /(?=$s)./gs;
}
$count{$_} < $min_rep and
delete $count{$_} for keys %count;
}
\%count;
}
sub kramba {
my ($str, $min_len, $min_rep)=@_;
my %count;
for my $c (0 .. length($str) - $min_len) {
my $string = substr($str, $c);
for my $l ($min_len .. length $string) {
$count{substr($string, 0, $l)}++;
}
}
for (keys %count) {
delete $count{$_}
if $count{$_} < $min_rep;
}
return \%count;
}
sub ikegami {
my ($str, $min_len, $min_rep)=@_;
local our %counts;
use re 'eval';
$str =~ /
(.{$min_len,})
(?{ ++$counts{$1} })
(?!)
/x;
for (keys %counts) {
delete $counts{$_} if $counts{$_} < $min_rep;
}
\%counts;
}
my %subs = (
ikegami => \&ikegami,
blazar => \&blazar,
kramba => \&kramba,
);
for my $len (1..3) {
for my $rep (2..3) {
my $tag="len=$len, rep=$rep";
my $ref = ikegami($str, $len, $rep);
for my $name (keys %subs) {
my $code = $subs{$name};
is_deeply($code->($str, $len, $rep), $ref, "$name - $tag")
+;
}
}
}
print "\n";
for my $s ( map {$str x $_} 1,3,42) {
for my $len (1..2) {
for my $rep (2..3) {
printf "length=%d, min_len=%d, min_rep=%d\n",
length $s,
$len,
$rep
;
cmpthese(-60, {
map {
my $c = $subs{$_};
$_ => sub { $c->($s, $len, $rep) }
}
keys %subs
});
print "\n";
}
}
}
__END__
ok 1 - blazar - len=1, rep=2
ok 2 - ikegami - len=1, rep=2
ok 3 - kramba - len=1, rep=2
ok 4 - blazar - len=1, rep=3
ok 5 - ikegami - len=1, rep=3
ok 6 - kramba - len=1, rep=3
ok 7 - blazar - len=2, rep=2
ok 8 - ikegami - len=2, rep=2
ok 9 - kramba - len=2, rep=2
ok 10 - blazar - len=2, rep=3
ok 11 - ikegami - len=2, rep=3
ok 12 - kramba - len=2, rep=3
ok 13 - blazar - len=3, rep=2
ok 14 - ikegami - len=3, rep=2
ok 15 - kramba - len=3, rep=2
ok 16 - blazar - len=3, rep=3
ok 17 - ikegami - len=3, rep=3
ok 18 - kramba - len=3, rep=3
length=21, min_len=1, min_rep=2
Rate blazar ikegami kramba
blazar 364/s -- -84% -86%
ikegami 2329/s 539% -- -13%
kramba 2674/s 634% 15% --
length=21, min_len=1, min_rep=3
Rate blazar ikegami kramba
blazar 355/s -- -85% -87%
ikegami 2316/s 553% -- -16%
kramba 2744/s 674% 18% --
length=21, min_len=2, min_rep=2
Rate blazar ikegami kramba
blazar 381/s -- -85% -87%
ikegami 2462/s 546% -- -13%
kramba 2826/s 642% 15% --
length=21, min_len=2, min_rep=3
Rate blazar ikegami kramba
blazar 381/s -- -84% -86%
ikegami 2424/s 537% -- -13%
kramba 2802/s 636% 16% --
length=63, min_len=1, min_rep=2
Rate blazar ikegami kramba
blazar 28.8/s -- -91% -94%
ikegami 328/s 1039% -- -26%
kramba 444/s 1441% 35% --
length=63, min_len=1, min_rep=3
Rate blazar ikegami kramba
blazar 35.1/s -- -90% -92%
ikegami 340/s 870% -- -21%
kramba 432/s 1130% 27% --
length=63, min_len=2, min_rep=2
Rate blazar ikegami kramba
blazar 32.0/s -- -91% -93%
ikegami 355/s 1009% -- -21%
kramba 452/s 1312% 27% --
length=63, min_len=2, min_rep=3
Rate blazar ikegami kramba
blazar 35.0/s -- -90% -92%
ikegami 335/s 855% -- -21%
kramba 423/s 1106% 26% --
length=882, min_len=1, min_rep=2
Rate blazar ikegami kramba
blazar 6.46e-002/s -- -95% -95%
ikegami 1.20/s 1754% -- -16%
kramba 1.42/s 2104% 19% --
length=882, min_len=1, min_rep=3
Rate blazar ikegami kramba
blazar 6.16e-002/s -- -95% -96%
ikegami 1.17/s 1804% -- -17%
kramba 1.41/s 2189% 20% --
length=882, min_len=2, min_rep=2
Rate blazar ikegami kramba
blazar 6.14e-002/s -- -95% -96%
ikegami 1.18/s 1829% -- -17%
kramba 1.42/s 2213% 20% --
length=882, min_len=2, min_rep=3
Rate blazar ikegami kramba
blazar 6.27e-002/s -- -95% -96%
ikegami 1.18/s 1779% -- -18%
kramba 1.44/s 2197% 22% --
1..18
lodin
| [reply] [d/l] [select] |