Re: counting overlapping patterns
by Eimi Metamorphoumai (Deacon) on Feb 18, 2005 at 20:39 UTC
|
The key is to match without consuming text. The following works.
while("AAAA" =~ /(?=AA)/g){
$count++;
}
It'll go through it repeatedly, starting at each position, but never consuming anything. | [reply] [d/l] |
|
while ('AAAA' =~ /A(?=A)/g) {
$count++;
}
which looks slightly less weird to me.
| [reply] [d/l] |
|
But requires rewriting the pattern. Mine will work even if the pattern comes from a variable, or contains different possibilities for the first character ('/AA|BB/' becomes '/(?=AA|BB)/'). My approach was to try to do minimal tinkering with the original, but yeah, it's not as obvious what's actually going on.
| [reply] [d/l] [select] |
|
Re: counting overlapping patterns
by merlyn (Sage) on Feb 18, 2005 at 23:33 UTC
|
Here's a bizarre way of doing it, taking a hint from my Prolog studies (thanks Ovid!):
sub match_count {
my ($string, $pattern) = @_;
my $n = 0;
$string =~ /$pattern(?{ $n++ })(?!)/;
return $n;
}
Essentially, match successfully, count it, but then fail the match (an empty string can always match, so negating that always fails).
| [reply] [d/l] |
Re: counting overlapping patterns
by betterworld (Curate) on Feb 19, 2005 at 02:43 UTC
|
| [reply] [d/l] |
Re: counting overlapping patterns
by bobf (Monsignor) on Feb 19, 2005 at 03:24 UTC
|
This might not be as slick as the lookahead approaches described above, but in the spirit of TMTOWTDI here's a version that uses pos and the @- array, which contains the offset of the start of the last match (see perlvar for more info):
sub count_matches
{
my ( $pattern, $string ) = @_;
my $num_matches = 0;
while( $string =~ m/$pattern/gi )
{
pos( $string ) = $-[0] + 1;
$num_matches++
}
return $num_matches;
}
Update: For monks (like me) that didn't understand why a pattern consisting entirely of a zero-width lookahead assertion (m/(?=AA)/g, see the above responses) doesn't get stuck in an infinite loop, see perlre, "Repeated patterns matching zero-length substring". From that doc:
Perl allows such constructs, by forcefully breaking the infinite loop ... when Perl detects that a repeated expression matched a zero-length substring.
To break the loop, the following match after a zero-length match is prohibited to have a length of zero.
... the second best match is chosen if the best match is of zero length ... the second-best match is the match at the position one notch further in the string.
Thanks to ambrus for the pointer to the right section in the docs.
| [reply] [d/l] [select] |
Re: counting overlapping patterns
by saintmike (Vicar) on Feb 18, 2005 at 20:19 UTC
|
| [reply] [d/l] [select] |
|
$count++ while $string =~ /\GAA/gc;
Er, no, the \G there serves no purpose as that's the default behaviour anyway. On this other hand this will work, by only consuming the first character of the match:
$count++ while $string =~ /A(?=A)/g;
Dave. | [reply] [d/l] [select] |
|
Dave you truly are "the m". That works as promised. Thanks!
| [reply] |
|
Ah... the \G anchor, of course.... it doesn't seem to work for me. Still getting 2. Are my cut-n-paste skills poor?
| [reply] |
|
This returns 2. Reasonable because the position of the last match is right after the 2nd "A".
IŽd like to be disproven, but i think this cannot be solved by a simple regex.
Disproven ,)
| [reply] [d/l] |
Re: counting overlapping patterns
by ikegami (Patriarch) on Feb 18, 2005 at 21:49 UTC
|
while ("AAAA" =~ /A(A+)/g) {
$count += length($1);
}
Unlike C's strlen, Perl's length doesn't loop, so this snippet is compliant with your request for no further looping.
| [reply] [d/l] [select] |
Re: counting overlapping patterns
by Limbic~Region (Chancellor) on Feb 18, 2005 at 20:30 UTC
|
Anonymous Monk,
Here is some code that covers your example, but it admittedly only works on fixed strings since it doesn't use regular expressions at all.
#!/usr/bin/perl
use strict;
use warnings;
print str_count('AAAA', 'AA'), "\n";
sub str_count {
my ($str, $pat) = @_;
my $tot;
for ( 0 .. length( $str ) - length( $pat ) ) {
$tot++ if index($str, $pat, $_) - $_ == 0;;
}
return $tot;
}
| [reply] [d/l] |
|
While discussing other methods that avoided regular expressions in the CB with bobf and nothingmuch, I mentioned an unpack/hash solution. nothinmuch asked to see it, so here is a highly untested alternative.
#!/usr/bin/perl
use strict;
use warnings;
print str_count('ABAABAAAA', 'AA'), "\n";
sub str_count {
my ($str, $pat) = @_;
my %substr;
my ($p_len, $s_len) = (length $pat, length $str);
my $template = ("A$p_len" . 'X' . ($p_len - 1)) x ($s_len - $p_len
+ + 1);
$substr{$_}++ for unpack $template, $str;
return $substr{$pat};
}
| [reply] [d/l] |