PerlMonks  

Extended Regular Expressions

by jroberts (Acolyte)
on Aug 16, 2001 at 17:38 UTC ( [id://105361]=perlquestion )

jroberts has asked for the wisdom of the Perl Monks concerning the following question:

Help needed,

$_ = '<a href= ?a=500011&w=2&r=1 >joe@blow.com</a>';
print $& if /^<a href.*>[^\z])/;
print $& if /^<a href.*>(?=[^\z])/x;

The 1st print prints "<a href= ?a=500011&w=2&r=1 >j"; the 2nd print prints the entire string "$_". Why is that? This is under ActiveState Perl 5.6.1 for WinNT.

What I really want is the stuff to the right of the 1st '>' character. The stuff to the left of the 1st '>' is too unpredictable to allow a flexible approach to stop the .* from "eating up" everything until the very last '>'. The 1st statement stops the .*, but $' does not include the 'j' from "joe", and thus $' is not correct for my purposes.

I thought the extended zero-width lookahead would solve my problem, but it doesn't act like I expect. With the exception of the values of $' and $&, shouldn't those expressions have the same results?

Thanks in advance,
jroberts

Replies are listed 'Best First'.
Re: Extended Regular Expressions
by azatoth (Curate) on Aug 16, 2001 at 17:40 UTC
      Thanks, I got this to work as expected using "negated" character classes. My original question still remains a puzzle: why don't the results (i.e. the values of $&) of the two pattern matches differ by just the single character 'j' (taken from "joe")? It seems they should. The zero-width lookahead doesn't seem to be "zero-width" at all. Regards, jroberts

        Ok, the problem with [^\z] can be easily seen when you use warnings, which is a good idea in general. Perl then complains about an unrecognized escape sequence in the character class. This means that \z is not the end of the string in a character class!! Most metacharacters lose their special meaning in a character class, and zero-width assertions such as \z are among them.

        So [^\z] is equivalent to [^z], which matches a single character that is not a 'z'. Looking at your original regex:

        print $& if /^<a href.*>(?=[^\z])/x;

        Consequently, the .* in your regex eats up everything up to the last '>', and the lookahead then checks whether the next character is not a 'z'. Which is true, as it is a newline (presumably the trailing newline of your real input). Ergo, match found, mystery solved :)

        -- Hofmator
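
        A minimal sketch of the point above, for anyone who wants to try it. It assumes, as explained, that [^\z] behaves like [^z], so the script writes [^z] directly to stay warning-free. The helper name matched_text is made up for illustration, and the trailing newline is a guess at what the real input looked like.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the matched text ($&), or undef on failure.
sub matched_text {
    my ($string) = @_;
    # [^z] stands in for the original [^\z] (assumed equivalent, see above).
    return $string =~ /^<a href.*>(?=[^z])/ ? $& : undef;
}

my $link = '<a href= ?a=500011&w=2&r=1 >joe@blow.com</a>';

# Without a trailing newline there is nothing after the last '>',
# so the lookahead fails there and the regex backtracks to the first '>'.
my $without_newline = matched_text($link);

# With a trailing newline the lookahead finds "\n" (which is not 'z')
# right after the last '>', so the match covers the whole line.
my $with_newline = matched_text("$link\n");

print "no newline:   $without_newline\n";
print "with newline: $with_newline\n";
```

        If that assumption holds, the first line shows the match stopping at the first '>' and the second shows it running to the last one.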

Re: Extended Regular Expressions
by japhy (Canon) on Aug 16, 2001 at 20:16 UTC
    Hmm... you've confused me a bit here. You have a typo in the code, and I'm not sure why you have the /x on your second regex. The /x merely means "allow extraneous whitespace and comments."
    #!/usr/bin/perl -l
    $_ = '<a href= ?a=500011&w=2&r=1 >joe@blow.com</a>';
    print $& if /^<a href.*>[^\z]/;
    print $& if /^<a href.*>(?=[^\z])/;

    This code prints:

    <a href= ?a=500011&w=2&r=1 >j
    <a href= ?a=500011&w=2&r=1 >
    Perhaps you want to use:
    /^<a [^>]+>([^<]+)/
    That kinda matches a tag, followed by kinda the non-tag stuff after it.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
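
    A small sketch of how the negated-character-class pattern suggested above could be used to pull out the link text. The variable names $html and $link_text are only illustrative, and this is not meant as a general HTML parser.

```perl
use strict;
use warnings;

my $html = '<a href= ?a=500011&w=2&r=1 >joe@blow.com</a>';

# [^>]+ cannot run past the first '>', so the greedy-match problem
# with .* never comes up; ([^<]+) captures up to the next tag.
my ($link_text) = $html =~ /^<a [^>]+>([^<]+)/;

print defined $link_text ? "$link_text\n" : "no match\n";
```

    Matching in list context returns the capture directly, so there is no need to touch $& or $' at all.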

Re: Extended Regular Expressions
by jlongino (Parson) on Aug 16, 2001 at 17:49 UTC
    If what you want is to the right of the first >, can't you just print $' (or the English $POSTMATCH) after the first print statement?

    E.g.,

    print $& if /^<a href.*>[^\z])/; print $';
    Update: Sorry, overlooked that somehow.

    If the code and the comments disagree, then both are probably wrong. -- Norm Schryer

      print $& if /^<a href.*>[^\z]/; print $';
      This does not work - as jroberts already pointed out - because the first letter after the closing angle bracket is eaten as well. So it's not printed with $POSTMATCH.

      Furthermore the regex contained an additional closing parenthesis - this was already in the original post and just a copy and paste error. I fixed that above.

      -- Hofmator

Re: Extended Regular Expressions
by arturo (Vicar) on Aug 16, 2001 at 22:13 UTC

    And while we're at it, if you're planning on using this code for an application you need to be robust, I'd recommend using HTML::Parser or a similar module to do the work of extracting information from HTML files. Of course that has little to do with understanding why the regex doesn't work as you thought it would, but it's worth noting.

    perl -e 'print "How sweet does a rose smell? "; chomp ($n = <STDIN>); $rose = "smells sweet to degree $n"; *other_name = *rose; print "$other_name\n"'
Re: Extended Regular Expressions
by larryk (Friar) on Aug 16, 2001 at 18:01 UTC
    to get everything in between...
    />(.*?)<\/a>/; print $1;
    the first one doesn't get everything as you expect because the [^\z] has to match one actual char (in a character class \z is just 'z', so it means "any char but z") so...
    1. .* gobbles everything to EOL
    2. the regex backtracks to match the > (last char on line)
    3. regex attempts to match a non-'z' char - FAIL - no chars left in string
    4. regex backtracks to match the previous >
    5. regex matches a non-'z' char "j" - COMPLETE
    the second one just _looks_ for a non-'z' char after the match but is zero-width so works as you expect.

    hope this helps

       larryk                                          
    perl -le "s,,reverse killer,e,y,rifle,lycra,,print"
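
    The backtracking steps above can be checked with a short script that records $& and $' for both versions. This is only a sketch: [^z] is written in place of [^\z] (on the assumption, discussed elsewhere in this thread, that they behave the same), and the variable names are made up.

```perl
use strict;
use warnings;

my $html = '<a href= ?a=500011&w=2&r=1 >joe@blow.com</a>';

# Consuming version: the char after '>' becomes part of the match,
# so it is missing from the postmatch.
my ($consumed_match, $consumed_rest);
if ($html =~ /^<a href.*>[^z]/) {
    ($consumed_match, $consumed_rest) = ($&, $');
}

# Lookahead version: the char after '>' is only inspected, not consumed,
# so it stays in the postmatch.
my ($peeked_match, $peeked_rest);
if ($html =~ /^<a href.*>(?=[^z])/) {
    ($peeked_match, $peeked_rest) = ($&, $');
}

print "consumed: match=[$consumed_match] rest=[$consumed_rest]\n";
print "peeked:   match=[$peeked_match] rest=[$peeked_rest]\n";
```

    The two matches differ by exactly the 'j', which ends up in $' only in the lookahead version.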
