Re: Regexes: finding ALL matches (including overlap)
by blokhead (Monsignor) on Jun 04, 2005 at 04:09 UTC
|
Note: I would want "abcdef" =~ m/..*..*./g to return 20 = 6 choose 3 matches.
You can add a simple counter to your regexes with (?{code}):
local $_ = "abcdef";
my $count;
/..*..*.(?{$count++})(?!)/;
print "$count matches\n"; ## "20 matches"
How does that fancy regex work? Every time it passes the "normal" part of the regex, it increments the counter, but the final (?!) part makes the overall expression fail and backtrack (back past the (?{code})) to try again. This process only stops when it has exhausted every possible way to match the "normal" part of the regex.
There are some issues though: It's a little messy to reuse this, because to do it programatically requires use re 'eval', and lexicals that get closured inside regexes don't always behave like you think they should. You may have to resort to a symbol-table variable for the counter.
| [reply] [Watch: Dir/Any] [d/l] |
|
It's a little messy to reuse this, because to do it programatically requires use re 'eval'
No, you can (and should) use qr// to avoid this.
local our $count;
my $inc_count = qr/(?{$count++})/;
/..*..*.$inc_count(?!)/;
Update: local our not my. | [reply] [Watch: Dir/Any] [d/l] [select] |
|
sub match_all_ways {
my ($string, $regex) = @_;
my $count;
my $incr = qr/(?{$count++})/;
$string =~ /(?:$regex)$incr(?!)/;
return $count;
}
print match_all_ways("abcdef", qr/..*..*./); # 20
print match_all_ways("abcdef", qr/..*..*./); # undef
It's because the qr// object is compiled just once and always refers to the first instance of $count. If you call this sub more than once, you will always get undef.
You have to do something ugly like this to get around it:
sub match_all_ways {
use vars '$count';
my ($string, $regex) = @_;
local $count = 0;
my $incr = qr/(?{$count++})/;
$string =~ /(?:$regex)$incr(?!)/;
return $count;
}
or this
{
my $count;
my $incr = qr/(?{$count++})/;
sub match_all_ways {
my ($string, $regex) = @_;
$count = 0;
$string =~ /(?:$regex)$incr(?!)/;
return $count;
}
}
So yes, it can be done programatically without use re 'eval', but it's non-trivial and a little messy ;)
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
|
|
Great! This is exactly the code idea I wanted. Are there any other ways without using such a construct (just for the sake of TIMTOWDI)?
I was always unsure of the level of support of enclosing code within regexen. Do you know what kinds of things can go wrong?
| [reply] [Watch: Dir/Any] |
|
my $count;
'ac' =~
/
a (?{ $count++ }) b
|
a (?{ $count++ }) c
/x;
# 1. Matches 'a' in first branch.
# 2. Increments $count to 1.
# 3. Fails to match 'b'.
# 4. Matches 'a' in second branch.
# 5. Increments $count to 2.
# 6. Matches 'c'.
print("$count\n"); # 2
The fix is to use local. When the regexp backtracks through a local, the old value is restored. The old value is also restored when the regexp succesfully matches, so you need to save the result.
my $count;
our $c = 0;
'ac' =~
/
(?:
a (?{ local $c = $c + 1 }) b
|
a (?{ local $c = $c + 1 }) c
)
(?{ $count = $c }) # Save result.
/x;
# 1. Matches 'a' in first branch.
# 2. Increments $c to 1.
# 3. Fails to match 'b'.
# 4. Undoes increment ($c = 0).
# 5. Matches 'a' in second branch.
# 6. Increments $c to 1.
# 7. Matches 'c'.
# 8. $count = $c.
print("$count\n"); # 1
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
So I just read through perlre and I couldn't find something: how does one include a (code-based) conditional expression in a regex, analogous to actions in P::RD? Is it even possible? If so, then one could not only find the last match (which may differ slightly from reversing the result of a reversed regex):
"abcdef" =~ /(..*..*.)(?{$last = $^N})(?!)/;
print "[$last]\n"; ## "[def]"
but also the (say) tenth match.
Another solution to my problem would be possible if P::RD had non-greedy matches. Is it likely that this will be implemented soon? I guess I could try hacking on it myself.
P.S.: Has anyone ever used customre? Super Search gave back only one result ... | [reply] [Watch: Dir/Any] [d/l] |
Re: Regexes: finding ALL matches (including overlap)
by bart (Canon) on Jun 04, 2005 at 10:52 UTC
|
$count = () = map /y/g, /<.*?>/g;
A little more complex, but possibly a little more memory friendly, is:
use List::Util 'sum';
$count = sum map { my $x = () = /y/g } /<.*?>/g;
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
Very nice! In fact, the first time I looked at it, I thought it wouldn't work, because I misunderstood what it did; playing around with it convinced me it worked, however.
| [reply] [Watch: Dir/Any] |
|
I forgot to mention my mistakes:
- I confused /y/g with the y// transliteration operator.
- I forgot map provided list context.
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: Regexes: finding ALL matches (including overlap)
by ikegami (Patriarch) on Jun 04, 2005 at 04:20 UTC
|
For the in-bracket example, you could extract the contents of the brackets, then search through the extracted contents for 'y'. That problem is also very well suited for parsers:
use strict;
use warnings;
my $count;
local $_ = "<Pooh,> said Rabbit kindly, <you haven't any brain> <I kn
+ow,> said Pooh humbly.";
our $c = 0;
/
^
(?:
# Outside of brackets
[^<]
|
# Inside of brackets
<
[^y>]*
(?:
y (?{ local $c = $c + 1 })
[^y>]*
)*
>? # Optional in case of unmatched bracket.
)*
$
(?{ $count = $c }) # Save count.
/x;
print("$count\n");
Since the above will match every string without ever backtracking, using $c is optional. You can replace (?{ local $c = $c + 1 }) with (?{ $count++ }) and drop (?{ $count = $c }).
Sorry, I don't have any general solutions.
Update: Fixed a bug in the regexp. | [reply] [Watch: Dir/Any] [d/l] [select] |
|
<
(?:
[^y>]*
y (?{ local $c = $c + 1 })
)*
.*?
> # Closing bracket not optional
? I see that the original allows for the possibility of unmatched left angle brackets, but I don't see why one would want this; i.e. I don't see why one would want to count the "y" in "<xyz", for example, but not the one in "xyz>".
| [reply] [Watch: Dir/Any] [d/l] |
|
but I don't see why one would want this
I had to make a decision since I had insufficient information. If someone needs a different behaviour, they can change the code or ask me to do so. I decided to adopt Windows quoting behaviour. For example, dir "c:\program files works.
| [reply] [Watch: Dir/Any] [d/l] |
|
Re: Regexes: finding ALL matches (including overlap)
by nobull (Friar) on Jun 04, 2005 at 09:06 UTC
|
I gave a talk about this amongst other things at YAPC::Europe::2004. This question started at slide 20.
I would want "abcdef" =~ m/..*..*./g to return 20 = 6 choose 3 matches.
Hmmm... that's not quite the same
thing I was talking about. How is Perl to know that .* is different from . ?
As far as my solution (actually largely due to abigail) is concerned /..*..*../ is simply /.{4,}/ and all matches thereof in "abcdef" would be 6..
- substr("abcdef",0,4)
- substr("abcdef",0,5)
- substr("abcdef",0,6)
- substr("abcdef",1,4)
- substr("abcdef",1,5)
- substr("abcdef",2,4)
Update: changed /.*/ to /.{4,}/ and made resulting changes. | [reply] [Watch: Dir/Any] [d/l] [select] |
|
/..*..*../ is simply /.*/
That strikes me as somewhat odd. The pattern on the right can match a string of less than 4 characters, the pattern of the left can not.
---
$world=~s/war/peace/g
| [reply] [Watch: Dir/Any] |
|
Yes thanks. Updating the previous node.
| [reply] [Watch: Dir/Any] |
|
I would want "abcdef" =~ m/..*..*./g to return 20 = 6 choose 3 matches. Hmmm... that's not quite the same thing I was talking about. How is Perl to know that .* is different from . ?
Easy: imagine I was matching m/\w.*\w.*\w/g instead. There really is no other possibility than have this return 20 matches (each \w has to match one of the 6 letters). Here are some more examples of what I would want (assuming I made no mathematical mistakes):
- "abcdef" =~ m/..*..*./g returns 20 = 6 choose 3
- "abcdef" =~ m/.*/g returns 28 = (6+2) choose 2 = number of substrings of length 6 string
- "abcdef" =~ m/....*/g returns 10 = number of length 3 or greater substrings of length 6 string
- "abcdef" =~ m/^.*$/g returns 1
- "abcdef" =~ m/^.*.*$/g returns 7 = number of ways of splitting a length 6 string into two parts
| [reply] [Watch: Dir/Any] [d/l] [select] |