counting overlapping patterns

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

A simple pattern matching question: I want to count the number of occurrences of "AA" in the string "AAAA". I want the answer to be 3. The standard while("AAAA" =~ /AA/g) {$count++} gives $count=2 and I can't seem to come up with a magical combination of ?,+,*,. that gives 3. Is there a way to do it this way (i.e. without resorting to looping over substrings)? perlre doesn't seem to provide an (obvious) answer and SuperSearch has (alas) failed me.

Comment on counting overlapping patterns Download Code

Replies are listed 'Best First'.
Re: counting overlapping patterns by Eimi Metamorphoumai (Deacon) on Feb 18, 2005 at 20:39 UTC
The key is to match without consuming text. The following works. `while("AAAA" =~ /(?=AA)/g){ $count++; }` [download] It'll go through it repeatedly, starting at each position, but never consuming anything.	[reply] [d/l]
Re^2: counting overlapping patterns by ikegami (Patriarch) on Feb 18, 2005 at 21:46 UTC
Alternatively, `while ('AAAA' =~ /A(?=A)/g) { $count++; }` [download] which looks slightly less weird to me.	[reply] [d/l]
Re^3: counting overlapping patterns by Eimi Metamorphoumai (Deacon) on Feb 18, 2005 at 22:15 UTC
But requires rewriting the pattern. Mine will work even if the pattern comes from a variable, or contains different possibilities for the first character ('`/AA\|BB/`' becomes '`/(?=AA\|BB)/`'). My approach was to try to do minimal tinkering with the original, but yeah, it's not as obvious what's actually going on.	[reply] [d/l] [select]
Re^4: counting overlapping patterns by ikegami (Patriarch) on Feb 18, 2005 at 23:39 UTC
Re: counting overlapping patterns by merlyn (Sage) on Feb 18, 2005 at 23:33 UTC
Here's a bizarre way of doing it, taking a hint from my Prolog studies (thanks Ovid!): `sub match_count { my ($string, $pattern) = @_; my $n = 0; $string =~ /$pattern(?{ $n++ })(?!)/; return $n; }` [download] Essentially, match successfully, count it, but then fail the match (an empty string can always match, so negating that always fails). -- Randal L. Schwartz, Perl hacker Be sure to read my standard disclaimer if this is a reply.	[reply] [d/l]
Re: counting overlapping patterns by betterworld (Curate) on Feb 19, 2005 at 02:43 UTC
Why not let Perl do the incrementing itself? `$count = () = "AAAA" =~ m/(?=AA)/g;`	[reply] [d/l]
Re: counting overlapping patterns by bobf (Monsignor) on Feb 19, 2005 at 03:24 UTC
This might not be as slick as the lookahead approaches described above, but in the spirit of TMTOWTDI here's a version that uses pos and the `@-` array, which contains the offset of the start of the last match (see perlvar for more info): `sub count_matches { my ( $pattern, $string ) = @_; my $num_matches = 0; while( $string =~ m/$pattern/gi ) { pos( $string ) = $-[0] + 1; $num_matches++ } return $num_matches; }` [download] Update: For monks (like me) that didn't understand why a pattern consisting entirely of a zero-width lookahead assertion (`m/(?=AA)/g`, see the above responses) doesn't get stuck in an infinite loop, see perlre, "Repeated patterns matching zero-length substring". From that doc: Perl allows such constructs, by forcefully breaking the infinite loop ... when Perl detects that a repeated expression matched a zero-length substring. To break the loop, the following match after a zero-length match is prohibited to have a length of zero. ... the second best match is chosen if the best match is of zero length ... the second-best match is the match at the position one notch further in the string. Thanks to ambrus for the pointer to the right section in the docs.	[reply] [d/l] [select]
Re: counting overlapping patterns by saintmike (Vicar) on Feb 18, 2005 at 20:19 UTC
You need to tell the regex engine to pick up after the last match, using the `\G` anchor: `my $count = 0; my $string = "AAAA"; $count++ while $string =~ /\GAA/gc;` [download] Argh, sorry, I was wrong. Check out this thread instead.	[reply] [d/l] [select]
Re^2: counting overlapping patterns by dave_the_m (Monsignor) on Feb 18, 2005 at 20:28 UTC
`$count++ while $string =~ /\GAA/gc;` Er, no, the \G there serves no purpose as that's the default behaviour anyway. On this other hand this will work, by only consuming the first character of the match: `$count++ while $string =~ /A(?=A)/g;` [download] Dave.	[reply] [d/l] [select]
Re^3: counting overlapping patterns by Anonymous Monk on Feb 18, 2005 at 20:34 UTC
Dave you truly are "the m". That works as promised. Thanks!	[reply]
Re^2: counting overlapping patterns by Anonymous Monk on Feb 18, 2005 at 20:25 UTC
Ah... the \G anchor, of course.... it doesn't seem to work for me. Still getting 2. Are my cut-n-paste skills poor?	[reply]
Re^2: counting overlapping patterns by holli (Abbot) on Feb 18, 2005 at 20:25 UTC
This returns 2. Reasonable because the position of the last match is right after the 2nd "A". ~~I´d like to be disproven, but i think this cannot be solved by a simple regex.~~ Disproven ,) holli, /regexed monk/	[reply] [d/l]
Re: counting overlapping patterns by ikegami (Patriarch) on Feb 18, 2005 at 21:49 UTC
There's also the simple: `while ("AAAA" =~ /A(A+)/g) { $count += length($1); }` [download] Unlike C's `strlen`, Perl's `length` doesn't loop, so this snippet is compliant with your request for no further looping.	[reply] [d/l] [select]
Re: counting overlapping patterns by Limbic~Region (Chancellor) on Feb 18, 2005 at 20:30 UTC
Anonymous Monk, Here is some code that covers your example, but it admittedly only works on fixed strings since it doesn't use regular expressions at all. `#!/usr/bin/perl use strict; use warnings; print str_count('AAAA', 'AA'), "\n"; sub str_count { my ($str, $pat) = @_; my $tot; for ( 0 .. length( $str ) - length( $pat ) ) { $tot++ if index($str, $pat, $_) - $_ == 0;; } return $tot; }` [download] Cheers - L~R	[reply] [d/l]
Re^2: counting overlapping patterns by Limbic~Region (Chancellor) on Feb 19, 2005 at 16:14 UTC
While discussing other methods that avoided regular expressions in the CB with bobf and nothingmuch, I mentioned an unpack/hash solution. nothinmuch asked to see it, so here is a highly untested alternative. `#!/usr/bin/perl use strict; use warnings; print str_count('ABAABAAAA', 'AA'), "\n"; sub str_count { my ($str, $pat) = @_; my %substr; my ($p_len, $s_len) = (length $pat, length $str); my $template = ("A$p_len" . 'X' . ($p_len - 1)) x ($s_len - $p_len + + 1); $substr{$_}++ for unpack $template, $str; return $substr{$pat}; }` [download] Cheers - L~R	[reply] [d/l]

Back to Seekers of Perl Wisdom