how to count the number of repeats in a string (really!)

blazar has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: how to count the number of repeats in a string (really!) by blokhead (Monsignor) on Nov 14, 2007 at 15:09 UTC
Your regex `/(.{2,}).\1/g` will always try to capture the largest* thing it can in $1. In your example string, every "b" character is followed by a "c". So every position where the string could match `/b.b/`, it could also match `/bc.bc/`. Since the "bc" version is longer, that's the one that will be tried first by the regex engine, and will return with success. It will never return success with `$1 eq "b"`, even though a "b" character repeats itself in the string. Update: it's also worth noting that m//g does not mean "try to match every possible way this match could succeed". Instead it means, "try to find one match starting at each position in the string" .. So in the above, when it matches on "bc", it will not continue backtracking to pick up the match with "b". Instead, it will be satisfied that it found a match starting at that position, increment pos, and move on. blokhead	[reply] [d/l] [select]
Re^2: how to count the number of repeats in a string (really!) by blazar (Canon) on Nov 14, 2007 at 15:33 UTC
Your regex `/(.{2,}).\1/g` will always try to capture the largest* thing it can in $1. In your example string, every "b" character is followed by a "c". So every position where the string could match `/b.b/`, it could also match `/bc.bc/`. Since the "bc" version is longer, that's the one that will be tried first by the regex engine, and will return with success. It will never return success with `$1 eq "b"`, even though a "b" character repeats itself in the string. I personally believe that this obvious... now that you point it out... Anyway I now wonder if at this point the best thing could be to generate all substrings e.g. with two nested maps and a uniq-like technique and possibly filter out those that have a count of 1 if one is not interested in them. My approach at a filtering in the generation phase by means of a regex may be fixable somehow but I can't see an easy way... Update: it's also worth noting that m//g does not mean "try to match every possible way this match could succeed". Instead it means, "try to find one match starting at each position in the string" .. So in the above, when it matches on "bc", it will not continue backtracking to pick up the match with "b". Instead, it will be satisfied that it found a match starting at that position, increment pos, and move on. But in fact this is the reason why I explicitly set pos. Perl 6 provides an adverb to do so in the first place instead -matching with superimpositions-, which is very good. Update: the following, for example, finally works really correctly. Read more... (685 Bytes)	[reply] [d/l] [select]
Re^3: how to count the number of repeats in a string (really!) by Krambambuli (Curate) on Nov 14, 2007 at 17:56 UTC
I've tried to smoothen the differences between your non-recursive but using regexp solution and mine, which is recursive but doesn't use any RX. After benchmarking, you're the clear winner: `Benchmark: timing 2000 iterations of Krambambuli, blazar... Krambambuli: 4 wallclock secs ( 3.28 usr + 0.01 sys = 3.29 CPU) @ 6 +07.90/s (n=2000) blazar: 1 wallclock secs ( 0.80 usr + 0.00 sys = 0.80 CPU) @ 25 +00.00/s (n=2000)` [download] Congrats! :) Here's the code I've used for the benchmarking. Read more... (1170 Bytes) Update: Actually, it's in fact the other way round, my code is faster - I've just named the benchmarked subs wrongly. Duh! Sorry. Krambambuli --- enjoying Mark Jason Dominus' Higher-Order Perl	[reply] [d/l] [select]
Re^4: how to count the number of repeats in a string (really!) by blazar (Canon) on Nov 14, 2007 at 22:08 UTC
Re^5: how to count the number of repeats in a string (really!) by lodin (Hermit) on Nov 14, 2007 at 23:41 UTC
Re^5: how to count the number of repeats in a string (really!) by lodin (Hermit) on Nov 15, 2007 at 00:00 UTC
Re^5: how to count the number of repeats in a string (really!) by oha (Friar) on Nov 15, 2007 at 11:22 UTC
Re^5: how to count the number of repeats in a string (really!) by Krambambuli (Curate) on Nov 15, 2007 at 18:35 UTC
Some notes below your chosen depth have not been shown here
Re: how to count the number of repeats in a string (really!) [regexp solution] by lodin (Hermit) on Nov 14, 2007 at 18:59 UTC
This is a perfect task for the regex engine. `local our %count; $str =~ / (.+) # or .{N,} where N is minimum length. (?(?{ $count{$1} }) (?!) ) .* \1 (?{ ($count{$1} \|\|= 1)++ }) (?!) /x;` [download] A more generalized version where you can specify the minimum substring length and minimum number of occurances is `my $min_len = 2; # Substring is at least two chars long. my $min_count = 3; # Substring occures at least three times. local our %count; use re 'eval'; $str =~ / (.{$min_len,}) (?(?{ $count{$1} }) (?!) ) (?> .? \1 ){@{[ $min_count - 2 ]}} . \1 (?{ ($count{$1} \|\|= $min_count-1)++ }) (?!) /x;` [download] lodin Update: While writing this, ikegami posted a very similar-looking reply. While they look very much alike they work quite differently. ikegami's work by requiring that each match is repeated further into the string, and then goes on to count all those successive matches kind of like a global match. Mine does the counting right away, and then forces the engine to not count those again. So they work rather opposite of each other. I did a shallow benchmark. It seems that mine is a slight favourite (5-10%) in many, but not all, situations. I also get the impression that ikegami's scales slightly better, see Re^5: how to count the number of repeats in a string (really!). Update 2: Added the comment in the regex. Update 3: Added the generalized version.	[reply] [d/l] [select]
Re^2: how to count the number of repeats in a string (really!) [regexp solution] by ikegami (Patriarch) on Nov 16, 2007 at 20:41 UTC
You made the same mistake I did. It fails to find two `'aba'` in `'ababa'`.	[reply] [d/l] [select]
Re^3: how to count the number of repeats in a string (really!) [regexp solution] by lodin (Hermit) on Nov 16, 2007 at 22:23 UTC
Mistake and mistake. That's not how I interpreted the question. The problem statement is a bit vague on this. For practical purposes I don't think 'aaa' should report 2 x 'aa' as well. If this is the task however, it's better to just exhaustively match everything of a minimum length. The problem as I (and you?) interpreted it is actually more interesting, I think. lodin	[reply]
Re: how to count the number of repeats in a string (really!) by lima1 (Curate) on Nov 14, 2007 at 16:05 UTC
Sorry, couldn't resist :) : Once again a standard application of the nice data structure suffix trees/arrays. It is in general easier to explain this with suffix trees, but it also works quite analogous with arrays (which have several advantages). Take a look at the `mississippi` example in http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Suffix/. Now just enumerate all edge labels of nodes with more than one leaf: `issi, i, ssi, si, p`. These are all repeats because they are common prefixes of different suffixes of the string. There are algorithms that output only the maximal (overlapping) repeats `issi` and `p`. Construction and calculation are both possible in linear time, but obviously with a lot of overhead (so the string must be quite large that this pays off - and then you don't want a Perl implementation). And btw. it is not an "artificial problem". The human genome consists of many, many repeats (this is in fact the reason why the assembly is so hard) and we don't know much about most of them. See also: http://en.wikipedia.org/wiki/Suffix_tree	[reply]
Re^2: how to count the number of repeats in a string (really!) by blazar (Canon) on Nov 14, 2007 at 16:17 UTC
And btw. it is not an "artificial problem". I personally believe that I meant "not an actual problem of mine". Of course it may be an actual problem of someone else... Thank you for the supplied information. I'd still be interested in ways to do it with Perl. In fact it's true that the abstract algorithm could be implemented in any sufficiently powerful language. But some languages have syntactical features that make some taks particularly easy or difficult to implement... How can Perl specifically come to the rescue in this regard?	[reply]
Re: how to count the number of repeats in a string (really!) by ikegami (Patriarch) on Nov 14, 2007 at 18:45 UTC
The regex engine is quite adept at backtracking to find all inputs that satisfy a set of conditions. The shortest solution should be a regex one. `my $str = 'aabcdabcabcecdecd'; local our %counts; $str =~ / (.{2,}) # or (.+) (?(?{ !$counts{$1} })(?=.\1)) (?{ ++$counts{$1} }) (?!) /x; use Data::Dumper; print Dumper \%counts;` [download] Update: `(?(?{ !$counts{$1} })(?=.\1))` might be more efficient as `(?> (?(?{ !$counts{$1} })(?=.\1)) )` Update*: The above doesn't work. `$str='ababa'` fails. My simpler version doesn't suffer from this bug.	[reply] [d/l] [select]
Re^2: how to count the number of repeats in a string (really!) by ikegami (Patriarch) on Nov 14, 2007 at 19:28 UTC
I guess I didn't explain why mine works and why the OP's doesn't work with ".+". Compare `while ($str =~ /($re)/g) { print("$1\n"); }` [download] `use re 'eval'; $str =~ / ($re) (?{ print("$1\n"); }) (?!) /x;` [download] Looks similar? But for the same input, they produce different results. Input: `my $str = 'aabcdabcabce'; my $re = qr/a[^a]*/;` [download] Output from `/.../g`: `a abcd abc abce` [download] Output from `/...(?{ save results })(?!)/`: `a abcd abc ab a abc ab a abce abc ab a` [download] `/...(?{ save results })(?!)/` is key in finding all possible matches. It forces the regex to try everything to obtain a match.	[reply] [d/l] [select]
Re^3: how to count the number of repeats in a string (really!) by blazar (Canon) on Nov 14, 2007 at 21:49 UTC
I guess I didn't explain why mine works and why the OP's doesn't work with ".+". I personally believe that blokhead already did. It was just a trivial mistake/overview on my part.	[reply]
Re: how to count the number of repeats in a string (really!) by Krambambuli (Curate) on Nov 14, 2007 at 16:33 UTC
If there wouldn't be other constraints, I believe I'd go rather another way, towards something like use strict; use warnings; my $str = 'aabcdabcabcecdecd'; my $min_length = 1; my $min_count = 2; my %count; count( $str ); while (my ($string, $counts) = each %count) { if ($counts >= $min_count) { print "$string => $counts\n" if length($string) >= $min_length; } } exit; sub count { my( $string) = @_; my $length = length( $string ); return if $length < $min_length; for my $l ($min_length..$length) { my $substring = substr( $string, 0, $l ); $count{$substring} += 1; } count( substr( $string, 1 ) ); } [download] Which gives a result that slightly differs of your expected values, but I believe is a accurate: `ec => 2 ecd => 2 ab => 3 abc => 3 bc => 3 cd => 3` [download] and respectively `e => 2 a => 4 ecd => 2 ab => 3 c => 5 abc => 3 bc => 3 cd => 3 d => 3 ec => 2 b => 3` [download] Krambambuli --- enjoying Mark Jason Dominus' Higher-Order Perl	[reply] [d/l] [select]
Re^2: how to count the number of repeats in a string (really!) by blazar (Canon) on Nov 14, 2007 at 21:08 UTC
Which gives a result that slightly differs of your expected values, but I believe is a accurate: I personally believe that my initially expected values were wrong. In fact with the version of the program I posted in a previous reply I get exactly the same result as yours.	[reply]
Re: how to count the number of repeats in a string (really!) by oha (Friar) on Nov 14, 2007 at 16:34 UTC
First of all, i will find the longest matching sequences possibile, which are in the following string xx, abc and ecd. (I use the zero-width lookahead to avoid to reset pos) Then I'll break those substrings in parts, if abc is repeated i suspect also bc is repeated, isn't it? Then I'll count the repetitions of only those substrings: `$s = 'xxaabcdabcabcecdecdxx'; $min_len = 2; $min_rep = 2; while($s=~/(.{$min_len,})(?=.?\1)/g) { for my $x (0..length $1) { @saw{ map {substr $1, $x, $_} $x+2..length $1 } = (); } pos($s) = pos($s) +1 -length $1; # fix } for (keys %saw) { my $saw{$_} =()= $s=~/\Q$_/g; delete $saw{$_} if $saw{$_}<$min_rep; } ____ a 4 ab 3 abc 3 bc 3 c 5 cd 3 d 3 e 2 ec 2 ecd 2 x 4 xx 2` [download] The first loop find the repetitions, the second count them. if you want to get only 2 or more char substring, change $x+1 to $x+2. Oha Update: added regex quoting to the last re Update: shorter and print in order of findings: `while($s=~/(\w\w+)(?=.?\1)/g) { foreach $x (0..length $1) { map { $y = substr $1, $x, $_; $saw{$y}++ \|\| do { local pos $s = 0; my $c=()= $s=~/\Q$y/g; print "$y => $c\n"; } } ($x+1..length $1) } pos($s) = pos($s) +1 -length $1; # fix }` [download] Update: fix a bug in the above code, added a pos() relocation (see #fix)	[reply] [d/l] [select]
Re^2: how to count the number of repeats in a string (really!) by blazar (Canon) on Nov 14, 2007 at 21:25 UTC
Then I'll break those substrings in parts, if abc is repeated i suspect also bc is repeated, isn't it? I personally believe that if it were only a suspect it wouldn't be enough. The nice part is that it's obviously certain it is! Anyway, I like your approach very much. For completeness I'm recasting your code in a sub with a similar behaviour to the ones in previous code: `sub oha { my $s=shift; my %saw; while($s =~ /(..+)(?=.?\1)/g) { for my $x (0..length $1) { @saw{ map {substr $1, $x, $_} $x+2..length $1 } = (); } } $saw{$_} =()= $s =~ /\Q$_/g for keys %saw; \%saw; }` [download] Update:* I see that you chaged your nodes content and that the original code is not there anymore. I recommend you to only post updates instead, and if you feel that something is wrong and needs to be "deleted", then possibly use `<strike>` tags. To keep the visual size of your node limited for those that do not want to read all of its details, you can also adopt `<readmore>` tags.	[reply] [d/l] [select]
Re^3: how to count the number of repeats in a string (really!) by oha (Friar) on Nov 16, 2007 at 14:58 UTC
i added a line marked with # fix as i explained, there is a bug in the code you used to rearrange: you must add the fix `pos($s)=pos($s)+1-length $1;` at the end of the while:; `sub oha { my $s=shift; my %saw; while($s =~ /(..+)(?=.*?\1)/g) { for my $x (0..length $1) { @saw{ map {substr $1, $x, $_} $x+2..length $1 } = (); } pos($s)=pos($s)+1-length $1; # fix } $saw{$_} =()= $s =~ /\Q$_/g for keys %saw; \%saw; }` [download] i apologize for the bug Oha	[reply] [d/l] [select]


Perl Monk, Perl Meditation
	PerlMonks