Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

how to count the number of repeats in a string (really!)

by blazar (Canon)
on Nov 14, 2007 at 14:50 UTC ( [id://650761]=perlquestion: print w/replies, xml ) Need Help??

blazar has asked for the wisdom of the Perl Monks concerning the following question:

Inspired by today's how to count the no of repeats in a string, I came up with an admittedly artificial but possibly very interesting problem: that of counting for a string the repetition count of each substring that is actually repeated. For example, given the string

my $str = 'aabcdabcabcecdecd';

I see that -limiting ourselves to substrings of length 2 or more- we have the following counts:

(abc => 3, bc => 3, cd => 3, ecd => 2)

One possible way to get those counts is as follows -I'm casting the thing in the form of a sub-:

sub count1 { local $_=shift; my %count; for my $pos (0..(length)-1) { pos=$pos; @count{ /(.{2,}).*\1/g } = (); } for my $k (keys %count) { $count{$k} =()= /$k/g; } \%count; }

In fact, print Dumper count1 $str; gives me:

$VAR1 = { 'cd' => 3, 'ecd' => 2, 'abc' => 3, 'bc' => 3 };

To my unpleasant surprise, if I change .{2,} to .+ I do not get all substrings (of length 1):

$VAR1 = { 'cd' => 3, 'c' => 5, 'ecd' => 2, 'a' => 4, 'abc' => 3, 'd' => 3, 'bc' => 3 };

And I can't understand why...

I personally believe that map based solutions are often very cute and so I rolled my own too:

sub count2 { my $s=shift; my %saw; my %count = map { map { $saw{$_}++ ? () : $_ => scalar(()= $s =~ /$_/g); } do { pos=$_; $s =~ /(.{2,}).*\1/g }; } 0..length($s)-1; \%count; }

But the latter

  1. is clumsier than the non-map version and;
  2. what's worst, is not correct at all:
    $VAR1 = { 'ecd' => 2, 'abc' => 3, '3' => 2 };

And I can't understand why, either.

What's your take on the problem?

Replies are listed 'Best First'.
Re: how to count the number of repeats in a string (really!)
by blokhead (Monsignor) on Nov 14, 2007 at 15:09 UTC
    Your regex /(.{2,}).*\1/g will always try to capture the largest thing it can in $1. In your example string, every "b" character is followed by a "c". So every position where the string could match /b.*b/, it could also match /bc.*bc/. Since the "bc" version is longer, that's the one that will be tried first by the regex engine, and will return with success. It will never return success with $1 eq "b", even though a "b" character repeats itself in the string.

    Update: it's also worth noting that m//g does not mean "try to match every possible way this match could succeed". Instead it means, "try to find one match starting at each position in the string" .. So in the above, when it matches on "bc", it will not continue backtracking to pick up the match with "b". Instead, it will be satisfied that it found a match starting at that position, increment pos, and move on.

    blokhead

      Your regex /(.{2,}).*\1/g will always try to capture the largest thing it can in $1. In your example string, every "b" character is followed by a "c". So every position where the string could match /b.*b/, it could also match /bc.*bc/. Since the "bc" version is longer, that's the one that will be tried first by the regex engine, and will return with success. It will never return success with $1 eq "b", even though a "b" character repeats itself in the string.

      I personally believe that this obvious... now that you point it out... Anyway I now wonder if at this point the best thing could be to generate all substrings e.g. with two nested maps and a uniq-like technique and possibly filter out those that have a count of 1 if one is not interested in them. My approach at a filtering in the generation phase by means of a regex may be fixable somehow but I can't see an easy way...

      Update: it's also worth noting that m//g does not mean "try to match every possible way this match could succeed". Instead it means, "try to find one match starting at each position in the string" .. So in the above, when it matches on "bc", it will not continue backtracking to pick up the match with "b". Instead, it will be satisfied that it found a match starting at that position, increment pos, and move on.

      But in fact this is the reason why I explicitly set pos. Perl 6 provides an adverb to do so in the first place instead -matching with superimpositions-, which is very good.

      Update: the following, for example, finally works really correctly.

        I've tried to smoothen the differences between your non-recursive but using regexp solution and mine, which is recursive but doesn't use any RX.

        After benchmarking, you're the clear winner:
        Benchmark: timing 2000 iterations of Krambambuli, blazar... Krambambuli: 4 wallclock secs ( 3.28 usr + 0.01 sys = 3.29 CPU) @ 6 +07.90/s (n=2000) blazar: 1 wallclock secs ( 0.80 usr + 0.00 sys = 0.80 CPU) @ 25 +00.00/s (n=2000)
        Congrats! :)

        Here's the code I've used for the benchmarking.
        Update: Actually, it's in fact the other way round, my code is faster - I've just named the benchmarked subs wrongly. Duh! Sorry.

        Krambambuli
        ---
        enjoying Mark Jason Dominus' Higher-Order Perl
Re: how to count the number of repeats in a string (really!) [regexp solution]
by lodin (Hermit) on Nov 14, 2007 at 18:59 UTC

    This is a perfect task for the regex engine.

    local our %count; $str =~ / (.+) # or .{N,} where N is minimum length. (?(?{ $count{$1} }) (?!) ) .* \1 (?{ ($count{$1} ||= 1)++ }) (?!) /x;
    A more generalized version where you can specify the minimum substring length and minimum number of occurances is
    my $min_len = 2; # Substring is at least two chars long. my $min_count = 3; # Substring occures at least three times. local our %count; use re 'eval'; $str =~ / (.{$min_len,}) (?(?{ $count{$1} }) (?!) ) (?> .*? \1 ){@{[ $min_count - 2 ]}} .* \1 (?{ ($count{$1} ||= $min_count-1)++ }) (?!) /x;

    lodin

    Update:

    While writing this, ikegami posted a very similar-looking reply. While they look very much alike they work quite differently. ikegami's work by requiring that each match is repeated further into the string, and then goes on to count all those successive matches kind of like a global match. Mine does the counting right away, and then forces the engine to not count those again. So they work rather opposite of each other.

    I did a shallow benchmark. It seems that mine is a slight favourite (5-10%) in many, but not all, situations. I also get the impression that ikegami's scales slightly better, see Re^5: how to count the number of repeats in a string (really!).

    Update 2:

    Added the comment in the regex.

    Update 3:

    Added the generalized version.

      You made the same mistake I did. It fails to find two 'aba' in 'ababa'.

        Mistake and mistake. That's not how I interpreted the question. The problem statement is a bit vague on this. For practical purposes I don't think 'aaa' should report 2 x 'aa' as well.

        If this is the task however, it's better to just exhaustively match everything of a minimum length. The problem as I (and you?) interpreted it is actually more interesting, I think.

        lodin

Re: how to count the number of repeats in a string (really!)
by lima1 (Curate) on Nov 14, 2007 at 16:05 UTC
    Sorry, couldn't resist :) : Once again a standard application of the nice data structure suffix trees/arrays. It is in general easier to explain this with suffix trees, but it also works quite analogous with arrays (which have several advantages). Take a look at the mississippi example in http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Suffix/. Now just enumerate all edge labels of nodes with more than one leaf: issi, i, ssi, si, p. These are all repeats because they are common prefixes of different suffixes of the string. There are algorithms that output only the maximal (overlapping) repeats issi and p.

    Construction and calculation are both possible in linear time, but obviously with a lot of overhead (so the string must be quite large that this pays off - and then you don't want a Perl implementation).

    And btw. it is not an "artificial problem". The human genome consists of many, many repeats (this is in fact the reason why the assembly is so hard) and we don't know much about most of them.

    See also: http://en.wikipedia.org/wiki/Suffix_tree

      And btw. it is not an "artificial problem".

      I personally believe that I meant "not an actual problem of mine". Of course it may be an actual problem of someone else...

      Thank you for the supplied information. I'd still be interested in ways to do it with Perl. In fact it's true that the abstract algorithm could be implemented in any sufficiently powerful language. But some languages have syntactical features that make some taks particularly easy or difficult to implement... How can Perl specifically come to the rescue in this regard?

Re: how to count the number of repeats in a string (really!)
by ikegami (Patriarch) on Nov 14, 2007 at 18:45 UTC

    The regex engine is quite adept at backtracking to find all inputs that satisfy a set of conditions. The shortest solution should be a regex one.

    my $str = 'aabcdabcabcecdecd'; local our %counts; $str =~ / (.{2,}) # or (.+) (?(?{ !$counts{$1} })(?=.*\1)) (?{ ++$counts{$1} }) (?!) /x; use Data::Dumper; print Dumper \%counts;

    Update:
    (?(?{ !$counts{$1} })(?=.*\1))
    might be more efficient as
    (?> (?(?{ !$counts{$1} })(?=.*\1)) )

    Update: The above doesn't work. $str='ababa' fails. My simpler version doesn't suffer from this bug.

      I guess I didn't explain why mine works and why the OP's doesn't work with ".+". Compare

      while ($str =~ /($re)/g) { print("$1\n"); }
      use re 'eval'; $str =~ / ($re) (?{ print("$1\n"); }) (?!) /x;

      Looks similar? But for the same input, they produce different results.

      Input:

      my $str = 'aabcdabcabce'; my $re = qr/a[^a]*/;

      Output from /.../g:

      a abcd abc abce

      Output from /...(?{ save results })(?!)/:

      a abcd abc ab a abc ab a abce abc ab a

      /...(?{ save results })(?!)/ is key in finding all possible matches. It forces the regex to try everything to obtain a match.

        I guess I didn't explain why mine works and why the OP's doesn't work with ".+".

        I personally believe that blokhead already did. It was just a trivial mistake/overview on my part.

Re: how to count the number of repeats in a string (really!)
by Krambambuli (Curate) on Nov 14, 2007 at 16:33 UTC
    If there wouldn't be other constraints, I believe I'd go rather another way, towards something like
    use strict; use warnings; my $str = 'aabcdabcabcecdecd'; my $min_length = 1; my $min_count = 2; my %count; count( $str ); while (my ($string, $counts) = each %count) { if ($counts >= $min_count) { print "$string => $counts\n" if length($string) >= $min_length; } } exit; sub count { my( $string) = @_; my $length = length( $string ); return if $length < $min_length; for my $l ($min_length..$length) { my $substring = substr( $string, 0, $l ); $count{$substring} += 1; } count( substr( $string, 1 ) ); }
    Which gives a result that slightly differs of your expected values, but I believe is a accurate:
    ec => 2 ecd => 2 ab => 3 abc => 3 bc => 3 cd => 3
    and respectively
    e => 2 a => 4 ecd => 2 ab => 3 c => 5 abc => 3 bc => 3 cd => 3 d => 3 ec => 2 b => 3

    Krambambuli
    ---
    enjoying Mark Jason Dominus' Higher-Order Perl
      Which gives a result that slightly differs of your expected values, but I believe is a accurate:

      I personally believe that my initially expected values were wrong. In fact with the version of the program I posted in a previous reply I get exactly the same result as yours.

Re: how to count the number of repeats in a string (really!)
by oha (Friar) on Nov 14, 2007 at 16:34 UTC
    First of all, i will find the longest matching sequences possibile, which are in the following string xx, abc and ecd.
    (I use the zero-width lookahead to avoid to reset pos)
    Then I'll break those substrings in parts, if abc is repeated i suspect also bc is repeated, isn't it?
    Then I'll count the repetitions of only those substrings:
    $s = 'xxaabcdabcabcecdecdxx'; $min_len = 2; $min_rep = 2; while($s=~/(.{$min_len,})(?=.*?\1)/g) { for my $x (0..length $1) { @saw{ map {substr $1, $x, $_} $x+2..length $1 } = (); } pos($s) = pos($s) +1 -length $1; # fix } for (keys %saw) { my $saw{$_} =()= $s=~/\Q$_/g; delete $saw{$_} if $saw{$_}<$min_rep; } ____ a 4 ab 3 abc 3 bc 3 c 5 cd 3 d 3 e 2 ec 2 ecd 2 x 4 xx 2
    The first loop find the repetitions, the second count them. if you want to get only 2 or more char substring, change $x+1 to $x+2.

    Oha

    Update: added regex quoting to the last re

    Update: shorter and print in order of findings:

    while($s=~/(\w\w+)(?=.*?\1)/g) { foreach $x (0..length $1) { map { $y = substr $1, $x, $_; $saw{$y}++ || do { local pos $s = 0; my $c=()= $s=~/\Q$y/g; print "$y => $c\n"; } } ($x+1..length $1) } pos($s) = pos($s) +1 -length $1; # fix }

    Update: fix a bug in the above code, added a pos() relocation (see #fix)

      Then I'll break those substrings in parts, if abc is repeated i suspect also bc is repeated, isn't it?

      I personally believe that if it were only a suspect it wouldn't be enough. The nice part is that it's obviously certain it is! Anyway, I like your approach very much. For completeness I'm recasting your code in a sub with a similar behaviour to the ones in previous code:

      sub oha { my $s=shift; my %saw; while($s =~ /(..+)(?=.*?\1)/g) { for my $x (0..length $1) { @saw{ map {substr $1, $x, $_} $x+2..length $1 } = (); } } $saw{$_} =()= $s =~ /\Q$_/g for keys %saw; \%saw; }

      Update: I see that you chaged your nodes content and that the original code is not there anymore. I recommend you to only post updates instead, and if you feel that something is wrong and needs to be "deleted", then possibly use <strike> tags. To keep the visual size of your node limited for those that do not want to read all of its details, you can also adopt <readmore> tags.

        i added a line marked with # fix as i explained, there is a bug in the code you used to rearrange: you must add the fix pos($s)=pos($s)+1-length $1; at the end of the while:;
        sub oha { my $s=shift; my %saw; while($s =~ /(..+)(?=.*?\1)/g) { for my $x (0..length $1) { @saw{ map {substr $1, $x, $_} $x+2..length $1 } = (); } pos($s)=pos($s)+1-length $1; # fix } $saw{$_} =()= $s =~ /\Q$_/g for keys %saw; \%saw; }

        i apologize for the bug

        Oha

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://650761]
Approved by marto
Front-paged by naikonta
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (7)
As of 2024-04-16 11:47 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found