RFC: Regexp::AllMatches

by lodin (Hermit)
on Aug 06, 2007 at 16:54 UTC (#630855, perlmeditation)

The problem

Some years ago I wrote a class to find all possible matches of a pattern against a string, including overlapping matches. If there were an /a switch that accomplished this, it would look like this:

    foreach ('abc' =~ /.+?/a) {
        print "$_\n";
    }
    __END__
    a
    ab
    abc
    b
    bc
    c
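For comparison, stock Perl can already produce this list with the backtracking idiom described in perlre: an embedded code block records the current match and a `(?!)` then forces the engine to try the next alternative. This is only a sketch of the idea and says nothing about how the module itself is implemented; the variable name `@found` is made up for the example.

```perl
use strict;
use warnings;

# Package variable, so the embedded code block can see it on any Perl version.
our @found;

# (.+?) captures a candidate, (?{ ... }) records it, and (?!) always fails,
# which makes the engine backtrack into every other way of matching.
'abc' =~ /(.+?)(?{ push @found, $1 })(?!)/;

print "$_\n" for @found;    # a, ab, abc, b, bc, c (one per line)
```

The overall match always fails here, so everything interesting happens as a side effect. That is exactly what makes this form awkward to wrap in an iterator.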

I figure I'm going to release it to CPAN. Before I do that I'd appreciate some feedback.

Description of the classes

I currently call it Regexp::AllMatches and use an OO interface. Here's how it's used:

    use Regexp::AllMatches;

    my $matcher = Regexp::AllMatches->new(STRING => qr/PATTERN/);
    while (my ($match) = $matcher->next) {
        print "$match\n";
    }

$matcher is a simple iterator, and the only methods are

  • new
  • clone
  • next

$match is a match object that stringifies to the matched string ($&) and implements the following methods:

  • prematch ($`)
  • match ($&)
  • postmatch ($')
  • group ($<digits>, i.e. $1, $2, ...)
  • groups
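As a rough illustration of what these accessors correspond to, the same pieces can be recovered from the offset arrays `@-` and `@+` after an ordinary match. This is a sketch using core Perl only; the module's match object may well compute them differently.

```perl
use strict;
use warnings;

my $str = 'hello world';
$str =~ /o (w)/ or die "no match";

# $-[0] and $+[0] are the start and end offsets of the whole match;
# $-[1] and $+[1] are the offsets of the first capture group.
my ($start, $end) = ($-[0], $+[0]);
my $prematch  = substr $str, 0, $start;
my $match     = substr $str, $start, $end - $start;
my $postmatch = substr $str, $end;
my $group1    = substr $str, $-[1], $+[1] - $-[1];

printf "[%s][%s][%s] group 1: %s\n", $prematch, $match, $postmatch, $group1;
# prints: [hell][o w][orld] group 1: w
```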

I also wrote Regexp::AllMatches::Extended, which implements some extra convenience methods at the cost of memory and speed:

  • curr
  • prev
  • reset
  • all

Regexp::AllMatches and Regexp::AllMatches::Extended will be two different modules. The match object is currently defined in Regexp::AllMatches, and is at the moment not for public instantiation.

So, what do you think of

  • the class names?
  • the overall design?
Any feedback and thoughts will be appreciated.

lodin

Update: Regexp::Exhaustive is the new name. I'm not too happy about Regexp::Exhaustive::Extended though. Any ideas? How about Regexp::Exhaustive::Extra(s) or Regexp::Exhaustive::Convenient?

Update: After a bit of cleaning Regexp::Exhaustive::Extended became nothing but a generic iterator decorator, so it's gone. The all method is now put directly in Regexp::Exhaustive instead.

Update: Uploaded to CPAN as Regexp::Exhaustive.

Replies are listed 'Best First'.
Re: RFC: Regexp::AllMatches
by moritz (Cardinal) on Aug 06, 2007 at 17:06 UTC
    In Perl 6 that's the :ex or :exhaustive modifier, so maybe you could consider calling your module Regexp::Exhaustive or Regexp::Match::Exhaustive or something.

    The API looks nice, I just wonder why you use the parens in while (my ($match) = $matcher->next) - does ->next() return a list? if yes, why?

    Update: While rereading S05 I also found the :overlap modifier. Please check whether you are implementing :exhaustive or :overlap. I think it's :exhaustive, but I'm not sure.
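To make the distinction concrete: as I read S05, :overlap yields at most one match per starting position, while :exhaustive yields every way the pattern can match. The :overlap behaviour can be approximated in Perl 5 with a capturing lookahead, as in this small sketch (the variable name is made up for the example):

```perl
use strict;
use warnings;

# The lookahead consumes nothing, so /g retries at every position and
# captures the single (greedy) match that starts there.
my @overlap = 'abc' =~ /(?=(.+))/g;

print "@overlap\n";    # abc bc c
```

An exhaustive matcher would additionally report ab, a and b for the same pattern and string, which is what the example output in the original post suggests.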

      I'll definitely consider ::Exhaustive. Thanks!

      does ->next() return a list? if yes, why?

      There are different opinions and tastes regarding this. I prefer to design iterators so that the next method returns undef in scalar context and the empty list in list context when it's exhausted. A successful match may be "" or "0", and that would force me to write

      while (defined(my $match = $matcher->next)) { ... }
      to not leave the loop prematurely.
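A small sketch of that convention, using a hypothetical closure-based iterator (`make_iterator` is an illustrative name, not part of the module). A bare `return` gives the empty list in list context and undef in scalar context, so the list-assignment form of the loop survives false values such as "" and "0":

```perl
use strict;
use warnings;

# Hypothetical helper: builds an iterator over a fixed list of values.
sub make_iterator {
    my @items = @_;
    return sub {
        return shift @items if @items;
        return;    # () in list context, undef in scalar context
    };
}

my @values = ('a', '', '0', 'b');

# List assignment in the condition counts the returned elements,
# so the loop only stops when the iterator returns the empty list.
my $it = make_iterator(@values);
my @list_form;
while (my ($v) = $it->()) { push @list_form, $v }

# The plain scalar form tests the value itself and stops at ''.
$it = make_iterator(@values);
my @scalar_form;
while (my $v = $it->()) { push @scalar_form, $v }

printf "list form saw %d items, plain scalar form saw %d\n",
    scalar @list_form, scalar @scalar_form;
# prints: list form saw 4 items, plain scalar form saw 1
```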

      lodin

        There is another way to handle the return value in a loop. Instead of returning a string, return an object that overloads the boolean, numerical, and string operators.

        An example of this in practice is the IO::Prompt module on CPAN. Here is an excerpt of the source:
        package IO::Prompt::ReturnVal;

        use overload
            q{bool} => sub {
                $_ = $_[0]{value} if $_[0]{set_val};
                $_[0]{handled} = 1;
                $_[0]{success};
            },
            q{""} => sub {
                $_[0]{handled} = 1;
                "$_[0]{value}";
            },
            q{0+} => sub {
                $_[0]{handled} = 1;
                0 + $_[0]{value};
            },
            fallback => 1,
        ;

        sub DESTROY {
            $_ = $_[0]{value} unless $_[0]{handled};
        }
        As you can see, it also provides mechanisms for setting $_.
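Applied to the iterator question, a stripped-down version of the same idea might look like the sketch below. The class name `My::Result` is purely illustrative. Every result object is true in boolean context, so a match of "0" no longer ends a plain `while (my $m = ...)` loop, while an exhausted iterator can still return undef.

```perl
package My::Result;    # hypothetical class name, for illustration only
use strict;
use warnings;
use overload
    'bool'   => sub { 1 },               # any result object counts as true
    '""'     => sub { $_[0]{value} },    # but it still prints as its value
    fallback => 1;

sub new {
    my ($class, $value) = @_;
    return bless { value => $value }, $class;
}

package main;

my $result = My::Result->new('0');
print "still true, stringifies to '$result'\n" if $result;
# prints: still true, stringifies to '0'
```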

        Anyway, I just wanted to throw out another option. Please note that, AFAIK, the IO::Prompt module currently only runs on Unix-based systems.

        - Miller
Re: RFC: Regexp::AllMatches
by blokhead (Monsignor) on Aug 07, 2007 at 02:07 UTC
    I'm interested in how this is implemented as an iterator. I can imagine two ways this might be done:
    • You exhaustively generate all matches, and the iterator is just added on to have a nice interface. Often if you see an iterator interface you assume that it is only generating the return values "on demand" and not pre-computing them all.
    • The iterator is implemented by somehow jumping out of a regex during a match with lots of back-tracking. If this is the case, I wonder if re-entrancy would be a problem. Could the regex engine be used while this iterator object is "active?" Actually, implementing an iterator instead of a callback in this way seems highly non-trivial to me, so if this is the case, I'd be interested to see the implementation details.
    Either way, I think it would be nice if the documentation made clear what was going on with respect to iterators.

    I like how the interface provides a way to get the $1, $2, etc match variables.

    blokhead

      Often if you see an iterator interface you assume that it is only generating the return values "on demand" and not pre-computing them all.

      ... and that is also true here. Since backtracking patterns quickly generate a very large number of matches I don't dare to precompute them.

      Could the regex engine be used while this iterator object is "active?"

      The following code works, and that makes me believe there won't be any other re-entrancy issues. But I know very little about the internals of the engine, and what may blow up under certain circumstances.

      use Test::More 'no_plan';
      use Regexp::AllMatches;

      my $str = 'abc';
      my $m1 = Regexp::AllMatches::->new($str => qr/.+/);
      is($m1->next, 'abc');
      my $m2 = $m1->clone;
      is($m1->next, 'ab');
      is_deeply([ $str =~ /./g ], [qw/ a b c /]);
      is($m1->next, 'a');
      is($m2->next, 'ab');
      __END__
      ok 1
      ok 2
      ok 3
      ok 4
      ok 5
      1..5
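For readers wondering how an on-demand, cloneable iterator can coexist with normal regex use at all, here is a deliberately naive sketch. It is not the module's technique: instead of suspending the engine mid-backtrack, it walks every (start, length) span and tests each substring against an anchored copy of the pattern. The class name `My::LazySpans` is hypothetical. Results come out in a different order than in the test above, each span is reported at most once, and lookarounds cannot see outside the substring being tested.

```perl
package My::LazySpans;    # hypothetical class, for illustration only
use strict;
use warnings;

sub new {
    my ($class, $str, $re) = @_;
    # The whole state is a string, an anchored pattern and two integers.
    return bless {
        str => $str, re => qr/\A(?:$re)\z/, start => 0, len => 0,
    }, $class;
}

sub clone {
    my ($self) = @_;
    my %copy = %$self;    # copying the counters is enough
    return bless \%copy, ref $self;
}

sub next {
    my ($self) = @_;
    my $n = length $self->{str};
    while ($self->{start} <= $n) {
        while ($self->{len} <= $n - $self->{start}) {
            my $cand = substr $self->{str}, $self->{start}, $self->{len}++;
            return $cand if $cand =~ $self->{re};
        }
        $self->{start}++;
        $self->{len} = 0;
    }
    return;    # exhausted: () in list context, undef in scalar context
}

package main;

my $it    = My::LazySpans->new('abc', qr/.+/);
my $first = $it->next;          # only the work needed for one match is done
my $copy  = $it->clone;         # independent snapshot of the position
my @plain = 'abc' =~ /./g;      # ordinary regex use in between
my @rest;
while (my ($m) = $it->next) { push @rest, $m }
my $from_copy = $copy->next;

print "first=$first rest=@rest copy=$from_copy\n";
# prints: first=a rest=ab abc b bc c copy=ab
```

Because no regex match is ever left half-finished between calls, re-entrancy is not an issue for this variant. The price is one anchored match attempt per candidate span.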

      lodin
