Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW

Regexes: finding ALL matches (including overlap)

by kaif (Friar)
on Jun 04, 2005 at 03:48 UTC ( #463461=perlquestion: print w/replies, xml ) Need Help??
kaif has asked for the wisdom of the Perl Monks concerning the following question:

One particular issue about regular expressions that bites me every once in a while is that the m//g modifier does not really find all matches, in the following sense:

$_ = "<Pooh,> said Rabbit kindly, <you haven't any brain> <I know,> s +aid Pooh humbly."; $ys = () = m/<[^>]*y/g; # count number of y's within angled brackets ( +assume no nesting) print "$ys\n";
prints "1", rather than "2". I understand why matching has to have this property, so instead I am looking for idioms that people use in such situations. For example, if I wanted to do something else, replace all y's in angled brackets with x's, I would use 1 while s/(<[^>]*)y/$1x/g;. However, this does not yield itself nicely to an analog for counting matches. Any ideas?

One thing that occured to me is perhaps using embedded code to count matches (i.e., if a match "almost succeeded", run some code that increments a counter and then make the match fail and retry elsewhere --- I don't know enough about embedding code to do this properly). Perhaps someone could provide some working code that would make this work?

Note: I would want "abcdef" =~ m/..*..*./g to return 20 = 6 choose 3 matches.

Replies are listed 'Best First'.
Re: Regexes: finding ALL matches (including overlap)
by blokhead (Monsignor) on Jun 04, 2005 at 04:09 UTC
    Note: I would want "abcdef" =~ m/..*..*./g to return 20 = 6 choose 3 matches.
    You can add a simple counter to your regexes with (?{code}):
    local $_ = "abcdef"; my $count; /..*..*.(?{$count++})(?!)/; print "$count matches\n"; ## "20 matches"
    How does that fancy regex work? Every time it passes the "normal" part of the regex, it increments the counter, but the final (?!) part makes the overall expression fail and backtrack (back past the (?{code})) to try again. This process only stops when it has exhausted every possible way to match the "normal" part of the regex.

    There are some issues though: It's a little messy to reuse this, because to do it programatically requires use re 'eval', and lexicals that get closured inside regexes don't always behave like you think they should. You may have to resort to a symbol-table variable for the counter.


      It's a little messy to reuse this, because to do it programatically requires use re 'eval'
      No, you can (and should) use qr// to avoid this.
      local our $count; my $inc_count = qr/(?{$count++})/; /..*..*.$inc_count(?!)/;

      Update: local our not my.

        You're right, use re 'eval' is not absolutely required, and I shouldn't have said it like that. But beware! Your example code works fine on just an instance-by-instance basis. But if you want to do this programatically and extensibly, then my warning about closure-ing lexicals applies. It's tricky to make a generic-use sub that does this kind of matching.

        You may be tempted to do the following, but it won't work:

        sub match_all_ways { my ($string, $regex) = @_; my $count; my $incr = qr/(?{$count++})/; $string =~ /(?:$regex)$incr(?!)/; return $count; } print match_all_ways("abcdef", qr/..*..*./); # 20 print match_all_ways("abcdef", qr/..*..*./); # undef
        It's because the qr// object is compiled just once and always refers to the first instance of $count. If you call this sub more than once, you will always get undef.

        You have to do something ugly like this to get around it:

        sub match_all_ways { use vars '$count'; my ($string, $regex) = @_; local $count = 0; my $incr = qr/(?{$count++})/; $string =~ /(?:$regex)$incr(?!)/; return $count; }
        or this
        { my $count; my $incr = qr/(?{$count++})/; sub match_all_ways { my ($string, $regex) = @_; $count = 0; $string =~ /(?:$regex)$incr(?!)/; return $count; } }
        So yes, it can be done programatically without use re 'eval', but it's non-trivial and a little messy ;)


      Great! This is exactly the code idea I wanted. Are there any other ways without using such a construct (just for the sake of TIMTOWDI)?

      I was always unsure of the level of support of enclosing code within regexen. Do you know what kinds of things can go wrong?

        Do you know what kinds of things can go wrong?

        Backtracking can screw things up:

        my $count; 'ac' =~ / a (?{ $count++ }) b | a (?{ $count++ }) c /x; # 1. Matches 'a' in first branch. # 2. Increments $count to 1. # 3. Fails to match 'b'. # 4. Matches 'a' in second branch. # 5. Increments $count to 2. # 6. Matches 'c'. print("$count\n"); # 2

        The fix is to use local. When the regexp backtracks through a local, the old value is restored. The old value is also restored when the regexp succesfully matches, so you need to save the result.

        my $count; our $c = 0; 'ac' =~ / (?: a (?{ local $c = $c + 1 }) b | a (?{ local $c = $c + 1 }) c ) (?{ $count = $c }) # Save result. /x; # 1. Matches 'a' in first branch. # 2. Increments $c to 1. # 3. Fails to match 'b'. # 4. Undoes increment ($c = 0). # 5. Matches 'a' in second branch. # 6. Increments $c to 1. # 7. Matches 'c'. # 8. $count = $c. print("$count\n"); # 1

      So I just read through perlre and I couldn't find something: how does one include a (code-based) conditional expression in a regex, analogous to actions in P::RD? Is it even possible? If so, then one could not only find the last match (which may differ slightly from reversing the result of a reversed regex):

      "abcdef" =~ /(..*..*.)(?{$last = $^N})(?!)/; print "[$last]\n"; ## "[def]"
      but also the (say) tenth match.

      Another solution to my problem would be possible if P::RD had non-greedy matches. Is it likely that this will be implemented soon? I guess I could try hacking on it myself.

      P.S.: Has anyone ever used customre? Super Search gave back only one result ...

Re: Regexes: finding ALL matches (including overlap)
by bart (Canon) on Jun 04, 2005 at 10:52 UTC
    From the top of my head:
    $count = () = map /y/g, /<.*?>/g;
    A little more complex, but possibly a little more memory friendly, is:
    use List::Util 'sum'; $count = sum map { my $x = () = /y/g } /<.*?>/g;

      Very nice! In fact, the first time I looked at it, I thought it wouldn't work, because I misunderstood what it did; playing around with it convinced me it worked, however.

        I forgot to mention my mistakes:
        1. I confused /y/g with the y// transliteration operator.
        2. I forgot map provided list context.
Re: Regexes: finding ALL matches (including overlap)
by ikegami (Pope) on Jun 04, 2005 at 04:20 UTC
    For the in-bracket example, you could extract the contents of the brackets, then search through the extracted contents for 'y'. That problem is also very well suited for parsers:
    use strict; use warnings; my $count; local $_ = "<Pooh,> said Rabbit kindly, <you haven't any brain> <I kn +ow,> said Pooh humbly."; our $c = 0; / ^ (?: # Outside of brackets [^<] | # Inside of brackets < [^y>]* (?: y (?{ local $c = $c + 1 }) [^y>]* )* >? # Optional in case of unmatched bracket. )* $ (?{ $count = $c }) # Save count. /x; print("$count\n");

    Since the above will match every string without ever backtracking, using $c is optional. You can replace (?{ local $c = $c + 1 }) with (?{ $count++ }) and drop (?{ $count = $c }).

    Sorry, I don't have any general solutions.

    Update: Fixed a bug in the regexp.

      Just out of curiosity, why not make the "inside of brackets" subexpression something more like

      < (?: [^y>]* y (?{ local $c = $c + 1 }) )* .*? > # Closing bracket not optional
      ? I see that the original allows for the possibility of unmatched left angle brackets, but I don't see why one would want this; i.e. I don't see why one would want to count the "y" in "<xyz", for example, but not the one in "xyz>".

      the lowliest monk

        but I don't see why one would want this

        I had to make a decision since I had insufficient information. If someone needs a different behaviour, they can change the code or ask me to do so. I decided to adopt Windows quoting behaviour. For example, dir "c:\program files works.

Re: Regexes: finding ALL matches (including overlap)
by nobull (Friar) on Jun 04, 2005 at 09:06 UTC
    I gave a talk about this amongst other things at YAPC::Europe::2004. This question started at slide 20.
    I would want "abcdef" =~ m/..*..*./g to return 20 = 6 choose 3 matches.

    Hmmm... that's not quite the same thing I was talking about. How is Perl to know that .* is different from . ?

    As far as my solution (actually largely due to abigail) is concerned /..*..*../ is simply /.{4,}/ and all matches thereof in "abcdef" would be 6..

    • substr("abcdef",0,4)
    • substr("abcdef",0,5)
    • substr("abcdef",0,6)
    • substr("abcdef",1,4)
    • substr("abcdef",1,5)
    • substr("abcdef",2,4)
    Update: changed /.*/ to /.{4,}/ and made resulting changes.

      /..*..*../ is simply /.*/

      That strikes me as somewhat odd. The pattern on the right can match a string of less than 4 characters, the pattern of the left can not.


        Yes thanks. Updating the previous node.
      I would want "abcdef" =~ m/..*..*./g to return 20 = 6 choose 3 matches.
      Hmmm... that's not quite the same thing I was talking about. How is Perl to know that .* is different from . ?
      Easy: imagine I was matching m/\w.*\w.*\w/g instead. There really is no other possibility than have this return 20 matches (each \w has to match one of the 6 letters). Here are some more examples of what I would want (assuming I made no mathematical mistakes):
      • "abcdef" =~ m/..*..*./g   returns 20 = 6 choose 3
      • "abcdef" =~ m/.*/g   returns 28 = (6+2) choose 2 = number of substrings of length 6 string
      • "abcdef" =~ m/....*/g   returns 10 = number of length 3 or greater substrings of length 6 string
      • "abcdef" =~ m/^.*$/g   returns 1
      • "abcdef" =~ m/^.*.*$/g   returns 7 = number of ways of splitting a length 6 string into two parts

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://463461]
Approved by davido
Front-paged by tlm
[erix]: a few years ago there was suddenly the news that the Kremlin was using typewriters again. Heard nothing about it afterward
[erix]: (maybe that means it worked)
[oiskuu]: Yeah, it might work as long as there are no root exploits. ;-)
[Corion]: Also, it's much harder to leak paper sheets than it is to leak documents that are available electronically
[oiskuu]: tye, do you make use of remote logging?
[oiskuu]: erix, this gives a new definition to e-paper: something that records the pressure imprints and is bluetooh capable, looks like ordinary paper.
[oiskuu]: So to be really, really sure, you must microwave the paper before you type on it.

How do I use this? | Other CB clients
Other Users?
Others studying the Monastery: (6)
As of 2017-06-23 20:26 GMT
Find Nodes?
    Voting Booth?
    How many monitors do you use while coding?

    Results (555 votes). Check out past polls.