http://www.perlmonks.org?node_id=432494

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

A simple pattern matching question: I want to count the number of occurrences of "AA" in the string "AAAA". I want the answer to be 3. The standard while("AAAA" =~ /AA/g) {$count++} gives $count=2 and I can't seem to come up with a magical combination of ?,+,*,. that gives 3. Is there a way to do it this way (i.e. without resorting to looping over substrings)? perlre doesn't seem to provide an (obvious) answer and SuperSearch has (alas) failed me.

Replies are listed 'Best First'.
Re: counting overlapping patterns
by Eimi Metamorphoumai (Deacon) on Feb 18, 2005 at 20:39 UTC
    The key is to match without consuming text. The following works.
    while("AAAA" =~ /(?=AA)/g){ $count++; }
    It'll go through it repeatedly, starting at each position, but never consuming anything.

      Alternatively,

      while ('AAAA' =~ /A(?=A)/g) { $count++; }

      which looks slightly less weird to me.

        But requires rewriting the pattern. Mine will work even if the pattern comes from a variable, or contains different possibilities for the first character ('/AA|BB/' becomes '/(?=AA|BB)/'). My approach was to try to do minimal tinkering with the original, but yeah, it's not as obvious what's actually going on.
Re: counting overlapping patterns
by merlyn (Sage) on Feb 18, 2005 at 23:33 UTC
    Here's a bizarre way of doing it, taking a hint from my Prolog studies (thanks Ovid!):
    sub match_count { my ($string, $pattern) = @_; my $n = 0; $string =~ /$pattern(?{ $n++ })(?!)/; return $n; }
    Essentially, match successfully, count it, but then fail the match (an empty string can always match, so negating that always fails).

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

Re: counting overlapping patterns
by betterworld (Curate) on Feb 19, 2005 at 02:43 UTC
    Why not let Perl do the incrementing itself?

    $count = () = "AAAA" =~ m/(?=AA)/g;

Re: counting overlapping patterns
by bobf (Monsignor) on Feb 19, 2005 at 03:24 UTC

    This might not be as slick as the lookahead approaches described above, but in the spirit of TMTOWTDI here's a version that uses pos and the @- array, which contains the offset of the start of the last match (see perlvar for more info):

    sub count_matches { my ( $pattern, $string ) = @_; my $num_matches = 0; while( $string =~ m/$pattern/gi ) { pos( $string ) = $-[0] + 1; $num_matches++ } return $num_matches; }

    Update: For monks (like me) that didn't understand why a pattern consisting entirely of a zero-width lookahead assertion (m/(?=AA)/g, see the above responses) doesn't get stuck in an infinite loop, see perlre, "Repeated patterns matching zero-length substring". From that doc:

    Perl allows such constructs, by forcefully breaking the infinite loop ... when Perl detects that a repeated expression matched a zero-length substring.

    To break the loop, the following match after a zero-length match is prohibited to have a length of zero.

    ... the second best match is chosen if the best match is of zero length ... the second-best match is the match at the position one notch further in the string.
    Thanks to ambrus for the pointer to the right section in the docs.

Re: counting overlapping patterns
by saintmike (Vicar) on Feb 18, 2005 at 20:19 UTC
    You need to tell the regex engine to pick up after the last match, using the \G anchor:
    my $count = 0; my $string = "AAAA"; $count++ while $string =~ /\GAA/gc;
    Argh, sorry, I was wrong. Check out this thread instead.
      $count++ while $string =~ /\GAA/gc;
      Er, no, the \G there serves no purpose as that's the default behaviour anyway. On this other hand this will work, by only consuming the first character of the match:
      $count++ while $string =~ /A(?=A)/g;

      Dave.

        Dave you truly are "the m". That works as promised. Thanks!
      Ah... the \G anchor, of course.... it doesn't seem to work for me. Still getting 2. Are my cut-n-paste skills poor?
      This returns 2. Reasonable because the position of the last match is right after the 2nd "A".
      IŽd like to be disproven, but i think this cannot be solved by a simple regex.
      Disproven ,)


      holli, /regexed monk/
Re: counting overlapping patterns
by ikegami (Patriarch) on Feb 18, 2005 at 21:49 UTC

    There's also the simple:

    while ("AAAA" =~ /A(A+)/g) { $count += length($1); }

    Unlike C's strlen, Perl's length doesn't loop, so this snippet is compliant with your request for no further looping.

Re: counting overlapping patterns
by Limbic~Region (Chancellor) on Feb 18, 2005 at 20:30 UTC
    Anonymous Monk,
    Here is some code that covers your example, but it admittedly only works on fixed strings since it doesn't use regular expressions at all.
    #!/usr/bin/perl use strict; use warnings; print str_count('AAAA', 'AA'), "\n"; sub str_count { my ($str, $pat) = @_; my $tot; for ( 0 .. length( $str ) - length( $pat ) ) { $tot++ if index($str, $pat, $_) - $_ == 0;; } return $tot; }

    Cheers - L~R

      While discussing other methods that avoided regular expressions in the CB with bobf and nothingmuch, I mentioned an unpack/hash solution. nothinmuch asked to see it, so here is a highly untested alternative.
      #!/usr/bin/perl use strict; use warnings; print str_count('ABAABAAAA', 'AA'), "\n"; sub str_count { my ($str, $pat) = @_; my %substr; my ($p_len, $s_len) = (length $pat, length $str); my $template = ("A$p_len" . 'X' . ($p_len - 1)) x ($s_len - $p_len + + 1); $substr{$_}++ for unpack $template, $str; return $substr{$pat}; }

      Cheers - L~R