PerlMonks  

Debugging Regexes

by perlgags78 (Acolyte)
on Jun 28, 2004 at 16:57 UTC ( #370249=perlquestion )
perlgags78 has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks, is there any way to debug regular expressions?
I have a regex that resolves a lookahead expression at
the beginning of the expression. It looks like the
following:
( "hello Gags" =~ /^(?=hello).*[^G][^a].*$/ ) && print ("Matched") || print ("Unmatched")
Basically what it's saying is that if the engine looks
ahead from the start and locates the word 'hello'
then if the rest of the string DOES NOT contain the
char 'G' followed by the char 'a' then print matched or
else print unmatched.
i.e. for the above string it should return Unmatched... but
it doesn't.

Would anyone be able to tell me what the internal string
changes to after the engine reads (?=hello) and actually
locates the first hello?
Is there any way to debug the resolution of regexes? I'm
using the debugger but it only does statements at a time.

Thanks,
Mark.

Re: Debugging Regexes
by gellyfish (Monsignor) on Jun 28, 2004 at 17:02 UTC

    the re pragma may be your friend here:

    use re qw(debug);
    ( "hello Gags" =~ /^(?=hello).*[^G][^a].*$/ ) && print ("Matched") || print ("Unmatched");

    Should give you more information than you need.
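
    To keep the trace manageable, the pragma can be confined to the one match you care about. The following is a minimal sketch, not part of the original reply: the helper name traced_match is hypothetical, and it assumes a perl recent enough that use re 'debug' is lexically scoped (older perls may trace more than this one regex). The trace is written to STDERR.

```perl
use strict;
use warnings;

# Hypothetical helper: runs the original match with regex tracing switched on.
sub traced_match {
    my ($string) = @_;
    # The pragma sits inside this sub so that only this regex is traced
    # (assumes lexical scoping of 'debug'; older perls may behave differently).
    use re 'debug';
    return $string =~ /^(?=hello).*[^G][^a].*$/ ? 1 : 0;
}

# The trace goes to STDERR; the verdict goes to STDOUT.
my $result = traced_match("hello Gags");
print $result ? "Matched\n" : "Unmatched\n";
```

    Redirecting STDERR to a file (for example with 2> trace.txt on the command line) makes the compile and execution steps easier to read at leisure.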

    /J\

      That's a very handy tool indeed!!!
      Is there anywhere that I can find any docs on it?
      I've to run for my train, thanks again for your help.

      Thanks,
      Mark.
        perldoc re

        HTH

        Sweetblood

Re: Debugging Regexes
by fletcher_the_dog (Friar) on Jun 28, 2004 at 17:49 UTC
    Why your regex matches the string can be explained like this:
    "(?=hello)" matches "hello"
    ".*" matches "hello Ga"
    "[^G]" matches "g"
    "[^a]" matches "s"
    ".*" matches nothing
    Your regex properly skips matching [^G][^a] with "Ga" but will happily match stuff after that. You probably want to do something like this with a negative lookahead:
    ( "hello Gags" =~ /^(?=hello)(?!.*Ga)/ ) && print ("Matched") || print ("Unmatched")
    Update
    To answer your question about debugging: I find that throwing in a few capturing parens can often help you see what is being matched. Here is a simple example:
    use strict;
    ( "hello Gags" =~ /^(?=hello)(.*)([^G][^a])(.*)$/ ) ? matches() : print ("Unmatched");
    sub matches {
        no strict 'refs';
        for (1..10) {
            # Symbolic reference: ${"1"} is $1, and so on. Empty or unset groups are skipped.
            my $tmp = ${$_} or next;
            print "\$$_ = '$tmp'\n";
        }
        return 1;
    }
    __OUTPUT__
    $1 = 'hello Ga'
    $2 = 'gs'
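
    The loop above skips any group that captured an empty string, so the third group never shows up. As a hedged variation on the same idea, the offsets in the built-in @- and @+ arrays can be used to report every group that took part in the match, empty ones included. The helper name describe_captures is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: one description line per capture group that took part
# in a successful match of $regex against $string (empty list on failure).
sub describe_captures {
    my ($string, $regex) = @_;
    return () unless $string =~ $regex;
    my @lines;
    # $#+ is the number of capture groups in the last successful match.
    for my $n (1 .. $#+) {
        next unless defined $-[$n];    # group did not participate
        my $text = substr($string, $-[$n], $+[$n] - $-[$n]);
        push @lines, sprintf "\$%d = '%s' (offsets %d..%d)",
            $n, $text, $-[$n], $+[$n];
    }
    return @lines;
}

print "$_\n" for describe_captures(
    "hello Gags", qr/^(?=hello)(.*)([^G][^a])(.*)$/ );
```

    For "hello Gags" this should list three groups, the last one empty, which makes it visible that the trailing .* matched nothing.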
Re: Debugging Regexes
by BrowserUk (Pope) on Jun 29, 2004 at 05:07 UTC

    It's nearly always easier and quicker (in coding time if not execution time) to code that sort of test using separate regexes and boolean operators than it is a single regex.

    $_ = "hello Gags";
    if( /hello/ and not /Ga/ ) { print 'Matched' } else { print 'Failed' };

    Failed
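
    The same two-step test can be wrapped in a small sub and run over a few inputs. This is only a sketch: the name hello_without_ga and the sample strings are made up for illustration, and /hello/ is left unanchored as in the snippet above (use /^hello/ if the word must come first).

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the two-regex test from the snippet above.
sub hello_without_ga {
    my ($text) = @_;
    return ( $text =~ /hello/ and $text !~ /Ga/ ) ? 1 : 0;
}

# Illustrative inputs only.
for my $text ( "hello Gags", "hello world", "goodbye" ) {
    printf "%-13s %s\n", "'$text'",
        hello_without_ga($text) ? 'Matched' : 'Failed';
}
```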

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Debugging Regexes
by skyknight (Hermit) on Jun 29, 2004 at 12:46 UTC

    Personally, I think that the way you wrote that statement is evil and such constructs should be avoided. I really shouldn't have to think too hard about operator evaluation order to figure out how your code works. You could have at least put parentheses around the operands of the &&. As the code presently exists, I'd have to run it through a debugger to convince myself that it works the way you claim it works.

    Maybe this is an extreme opinion, but I think I'd prefer that compilers/interpreters forced you to parenthesize for stuff like that, and threw ambiguity errors when you didn't.

      yer.. the OP should have used something like:
       print /foo/ ? 'bar' : 'baz'
      IMO
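
      One concrete reason to prefer the ternary: A && B || C also runs C whenever B returns false, and print does return false when a write fails. The sketch below is an illustration only; it swaps print for a hypothetical recording sub that always returns false, so that the effect is visible.

```perl
use strict;
use warnings;

# Hypothetical demo: same shape as the original statement, (match && A) || B,
# with a stand-in for print that records its argument and returns false.
sub and_or_style {
    my ($text) = @_;
    my @calls;
    my $failing_print = sub { push @calls, $_[0]; return 0 };
    ( $text =~ /hello/ ) && $failing_print->("Matched")
        || $failing_print->("Unmatched");
    return @calls;
}

# The ternary picks exactly one branch, whatever the branches return.
sub ternary_style {
    my ($text) = @_;
    return ( $text =~ /hello/ ) ? 'Matched' : 'Unmatched';
}

my @calls = and_or_style("hello world");
print "and/or ran: @calls\n";
print "ternary:    ", ternary_style("hello world"), "\n";
```

      With the false-returning stand-in, the and/or form runs both branches, while the ternary reports only one.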
Re: Debugging Regexes
by hv (Parson) on Jun 29, 2004 at 13:03 UTC

    ... locates the word 'hello' then if the rest of the string ...

    This is your first problem: a lookahead peeks into the string, but doesn't change the position at which the rest of the pattern will attempt to match. You don't want a lookahead for this, but a straight match - then the rest of the pattern will correctly be tried against the rest of the string:

    /^hello.../
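
    The difference can be seen by asking where each match ends. This is a small sketch rather than anything from the original post: the helper name match_end is hypothetical, and it relies on the built-in @+ array, whose first element holds the end offset of the last successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: offset at which a successful match ended, or undef.
sub match_end {
    my ($string, $regex) = @_;
    return $string =~ $regex ? $+[0] : undef;
}

my $string = "hello Gags";

# The lookahead succeeds without consuming anything, so the match ends at 0.
printf "lookahead /^(?=hello)/ ends at offset %d\n",
    match_end($string, qr/^(?=hello)/);

# The plain match consumes the five characters of "hello".
printf "consuming /^hello/     ends at offset %d\n",
    match_end($string, qr/^hello/);
```

    The string itself is never altered; only the engine's current position differs, which is why the .* in the original pattern still starts from the very beginning of "hello Gags".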

    ... if the rest of the string DOES NOT contain the char 'G' followed by the char 'a' ...

    A simple negative lookahead:

    /
        (?!     # fail to match if you find (
          .*    #   zero or more characters
          Ga    #   followed by "Ga"
        )       # )
    /x

    So your specification is satisfied simply by:

    /^hello(?!.*Ga)/
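
    A quick way to gain confidence in a pattern like this is to run it over a handful of strings with known answers. The sketch below does that; the wrapper name greeting_ok and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern given above.
sub greeting_ok {
    my ($text) = @_;
    return $text =~ /^hello(?!.*Ga)/ ? 1 : 0;
}

# Illustrative inputs: "Ga" present, "Ga" absent, different case, hello not first.
my @samples = ( "hello Gags", "hello world", "hello gaGs", "oh hello" );
for my $text (@samples) {
    printf "%-14s %s\n", "'$text'",
        greeting_ok($text) ? 'Matched' : 'Unmatched';
}
```

    Note that the test is case-sensitive, so "hello gaGs" is accepted: it contains "ga" and "aG" but never "Ga".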

    Hope this helps,

    Hugo

Re: Debugging Regexes
by OhReally (Monk) on Jun 30, 2004 at 09:10 UTC
    There is a program called The Regex Coach which may well help you. It allows you to test regexes interactively and also shows the regex in English form and other details such as the parse tree.
    It is available for Windows and Linux and is free. I have found it quite useful.
Re: Debugging Regexes
by perlgags78 (Acolyte) on Jul 01, 2004 at 10:30 UTC
    Monks,
    Thanks a million for all the help. I'll definitely be taking a
    look at The Regex Coach.
    The reason I didn't do it incrementally using a number of
    ifs is that I wanted to learn how to write regexes, and
    the best way to learn, I guess, is to try and do
    something complex. An aim-for-the-sky-and-hit-the-ceiling
    effort. Once again, though, your generous help and attention to
    detail are thoroughly appreciated.

    Thanks,
    Mark.
