Extended Regular Expressions

jroberts has asked for the wisdom of the Perl Monks concerning the following question:

Re: Extended Regular Expressions by azatoth (Curate) on Aug 16, 2001 at 17:40 UTC
Read up on Character Classes & Negated Character Classes, or speak to japhy about them if he's around. Also, take a look at Death to Dot Star!. Feel free to ask in the ChatterBox if you are still having problems after you've done the required reading. Azatoth a.k.a Captain Whiplash
Re: Re: Extended Regular Expressions by jroberts (Acolyte) on Aug 16, 2001 at 18:02 UTC
Thanks, I got this to work as expected using "negated" character classes. My original question still remains a puzzle. Why don't the results (i.e. the values of `$&`) of the two pattern matches differ by the single character 'j' (taken from "joe")? It seems they should. The zero-width lookahead doesn't seem to be "zero-width" at all. Regards, jroberts
Re3: Extended Regular Expressions by Hofmator (Curate) on Aug 16, 2001 at 19:40 UTC
OK, the problem with `[^\z]` is easy to see when you `use warnings`, which is a good idea in general. Perl then complains about an unrecognized escape sequence in the character class. This means that `\z` is not the end of the string inside a character class! Most metacharacters lose their special meaning in a character class, so `[^\z]` is equivalent to `[^z]`, which matches a single character that is not a 'z'. Looking at your original regex `print $& if /^<a href.*>(?=[^\z])/x;` the `.*` consequently eats up everything up to the last '>', and the lookahead then checks whether the next character is not a 'z'. That is true, as it is a newline. Ergo, match found, mystery solved :) -- Hofmator
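A quick way to check the claim that `[^\z]` behaves like `[^z]` is to run both classes against a few single characters. This is only a sketch: the probe characters and variable names are made up for illustration, and the `regexp` warning category is switched off so that the "Unrecognized escape" warning mentioned above does not clutter the output.

```perl
#!/usr/bin/perl
use strict;
use warnings;
no warnings 'regexp';   # hide "Unrecognized escape \z in character class passed through"

# Hypothetical probe characters: a literal 'z', an ordinary letter, a newline.
my @probes = ('z', 'j', "\n");
my %same;

for my $ch (@probes) {
    my $with_escape = ($ch =~ /[^\z]/) ? 1 : 0;   # \z is taken as a plain 'z' here
    my $plain       = ($ch =~ /[^z]/)  ? 1 : 0;
    $same{$ch} = ($with_escape == $plain);        # true when both classes agree

    (my $shown = $ch) =~ s/\n/\\n/;               # make the newline visible
    printf "%-2s  [^\\z]: %d   [^z]: %d\n", $shown, $with_escape, $plain;
}
```

Both columns should agree for every probe, and the newline counts as "not a z", which is why the lookahead succeeds at the end of a line read from a file.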
Re: Extended Regular Expressions by japhy (Canon) on Aug 16, 2001 at 20:16 UTC
Hmm... you've confused me a bit here. You have a typo in the code, and I'm not sure why you have the /x on your second regex. The /x merely means "allow extraneous whitespace and comments." `#!/usr/bin/perl -l $_ = '<a href= ?a=500011&w=2&r=1 >joe@blow.com</a>'; print $& if /^<a href.*>[^\z]/; print $& if /^<a href.*>(?=[^\z])/;` This code prints: `<a href= ?a=500011&w=2&r=1 >j` and then `<a href= ?a=500011&w=2&r=1 >` Perhaps you want to use: `/^<a [^>]+>([^<]+)/` That kinda matches a tag, followed by kinda the non-tag stuff after it. Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker. `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`
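For what it's worth, here is a minimal sketch of the suggested pattern pulling the link text out through `$1`. It reuses the sample string from the post; the variable names `$line` and `$text` are just illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample line from the post above (single quotes keep @blow from interpolating).
my $line = '<a href= ?a=500011&w=2&r=1 >joe@blow.com</a>';

my $text;
if ($line =~ /^<a [^>]+>([^<]+)/) {
    $text = $1;          # everything after the tag, up to the next '<'
    print "$text\n";     # prints: joe@blow.com
}
```

Because `[^>]+` cannot run past the first `>`, there is no backtracking from the end of the line and no need for a lookahead at all.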
Re: Extended Regular Expressions by jlongino (Parson) on Aug 16, 2001 at 17:49 UTC
If what you want is to the right of the first `>`, can't you just print `$'` (or `$POSTMATCH` from the English module) after the first print statement? E.g., `print $& if /^<a href.*>[^\z])/; print $';` Update: Sorry, overlooked that somehow. "If the code and the comments disagree, then both are probably wrong." -- Norm Schryer
Re: Re: Extended Regular Expressions by Hofmator (Curate) on Aug 16, 2001 at 18:07 UTC
`print $& if /^<a href.*>[^\z]/; print $';` This does not work, as jroberts already pointed out, because the first letter after the closing angle bracket is eaten as well, so it is not printed with `$POSTMATCH`. Furthermore, the regex contained an additional closing parenthesis. That was already in the original post and is just a copy-and-paste error; I fixed it above. -- Hofmator
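To make the difference concrete, the sketch below runs the consuming pattern and the lookahead pattern against japhy's sample string and records `$'` after each. The `%post` hash is only an illustrative name, and the `regexp` warnings are silenced because of the `\z` inside the character class.

```perl
#!/usr/bin/perl
use strict;
use warnings;
no warnings 'regexp';   # [^\z] triggers an "Unrecognized escape" warning

# Sample line taken from japhy's reply in this thread.
my $line = '<a href= ?a=500011&w=2&r=1 >joe@blow.com</a>';

my %post;   # what is left after each match ($' a.k.a. $POSTMATCH)

# The character class consumes the 'j', so it ends up in $& and not in $'.
$post{consuming} = $' if $line =~ /^<a href.*>[^\z]/;

# The lookahead only peeks at the 'j', so it stays in $'.
$post{lookahead} = $' if $line =~ /^<a href.*>(?=[^\z])/;

print "consuming: $post{consuming}\n";   # oe@blow.com</a>
print "lookahead: $post{lookahead}\n";   # joe@blow.com</a>
```

So the lookahead really is zero-width: the two post-match strings differ by exactly the one character 'j'.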
Re: Extended Regular Expressions by arturo (Vicar) on Aug 16, 2001 at 22:13 UTC
And while we're at it, if you're planning on using this code for an application you need to be robust, I'd recommend using HTML::Parser or a similar module to do the work of extracting information from HTML files. Of course that has little to do with understanding why the regex doesn't work as you thought it would, but it's worth noting. `perl -e 'print "How sweet does a rose smell? "; chomp ($n = <STDIN>); $rose = "smells sweet to degree $n"; *other_name = *rose; print "$other_name\n"'`
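As a rough illustration of the HTML::Parser route, the sketch below collects the `href` and the text of each `<a>` element using the module's version 3 handler API. It assumes HTML::Parser is installed from CPAN; the HTML snippet, the `@links` array, and the `$in_a` flag are all made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();    # CPAN module, not part of core Perl

# Hypothetical input; single quotes keep @blow from interpolating.
my $html = '<p>Contact <a href="/contact?a=1">joe@blow.com</a> today.</p>';

my $in_a = 0;           # true while we are inside an <a> element
my @links;              # one { href => ..., text => ... } hash per link

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr) = @_;
        return unless $tag eq 'a';
        $in_a = 1;
        push @links, { href => $attr->{href}, text => '' };
    }, 'tagname, attr' ],
    text_h => [ sub { $links[-1]{text} .= $_[0] if $in_a }, 'dtext' ],
    end_h  => [ sub { $in_a = 0 if $_[0] eq 'a' }, 'tagname' ],
);

$p->parse($html);
$p->eof;                # flush anything still buffered

print "$_->{href} => $_->{text}\n" for @links;
```

The text handler appends rather than assigns, since the parser may deliver one stretch of text in several pieces.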
Re: Extended Regular Expressions by larryk (Friar) on Aug 16, 2001 at 18:01 UTC
To get everything in between... `/>(.*?)<\/a>/; print $1;` The first one doesn't get everything as you expect because the `[^\z]` matches one non-EOL char, so...
1. `.*` gobbles everything to EOL
2. the regex backtracks to match the > (last char on line)
3. the regex attempts to match a non-EOL char: FAIL, no chars left in string
4. the regex backtracks to match the previous >
5. the regex matches a non-EOL char "j": COMPLETE
The second one just _looks_ for a non-EOL after the match but is zero-width, so it works as you expect. Hope this helps. larryk `perl -le "s,,reverse killer,e,y,rifle,lycra,,print"`
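The effect of the `?` in `.*?` shows up once a line holds more than one link. The following is only a sketch with a made-up two-link line; `$lazy` and `$greedy` are illustrative names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical line with two links, to show where each quantifier stops.
my $line = '<a href="x">one</a> <a href="y">two</a>';

# Non-greedy: stop at the first </a> that lets the match succeed.
my ($lazy)   = $line =~ />(.*?)<\/a>/;

# Greedy: run to the end of the line, then backtrack to the last </a>.
my ($greedy) = $line =~ />(.*)<\/a>/;

print "lazy:   $lazy\n";     # one
print "greedy: $greedy\n";   # one</a> <a href="y">two
```

The greedy version follows the same gobble-then-backtrack sequence as the steps listed above, only with `</a>` as the thing it backs up to.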

