PerlMonks  

Regex look-behind problem.

by the_0ne (Pilgrim)
on Jul 12, 2007 at 21:21 UTC (#626319)
the_0ne has asked for the wisdom of the Perl Monks concerning the following question:

Hey monks, have a regex problem that I'm hoping you can help with.

First off, a disclaimer: the reason I am not using an HTML parser is that the format I am converting to does not map well onto HTML converters. I'm working with a very small subset, so I'm hoping to bang this out with regexes instead of a full-blown HTML parser.

Here's the code...
$foo = "<italic>Here's a <bold>larger<normal> paragraph, <italic>where I'm<normal> going to <bold>bold some <italic>";
print "\nfoo before:\n$foo\n\n";
#foo.gsub!(/(?<=<italic>)(?<!<normal>)(.*?)<bold>/, '\1<bold-italic>')
$foo =~ s/(?<=<italic>)(?<!<normal>)(.*?)<bold>/\1<bold-italic>/g;
print "foo after:\n$foo\n";
Here's the output I am getting...
# Output is...
# <italic>Here's a <bold-italic>larger<normal> paragraph, <italic>where I'm<normal> going to <bold-italic>bold some <italic>
Notice the second <bold> is being replaced with <bold-italic>. By the regex (at least I think I have the regex right) the second bold *should not* be replaced since I perform a look-behind for <normal>. If <normal> is between the <italic> and the <bold>, then the <bold> should be left alone. At least this is what I am trying to get at.

Here's what I would like to see...
# However, should be...
# <italic>Here's a <bold-italic>larger<normal> paragraph, <italic>where I'm<normal> going to <bold>bold some <italic>
Notice the second <bold> is not replaced.

I'm confused as to what is wrong with my regex.

Thanks again Monks for all your help.

Replies are listed 'Best First'.
Re: Regex look-behind problem.
by ikegami (Pope) on Jul 12, 2007 at 22:43 UTC

    (?!<normal>).*? will happily match " <normal>", so you need to check every . to make sure it's not the start of <normal>. Or, since you're looking backwards, you could check to make sure every . is not the end of <normal>.

    s/ (?<=<italic>) ( (?: .(?<! <normal>) )* ) <bold> /$1<bold-italic>/xg

    It's a lot more sane going forward instead of backwards.

    s/ ( <italic> (?: (?!<normal>). )* ) <bold> /$1<bold-italic>/xg

    By the way, you should use $1 rather than \1 in the second (non-regex) half of the substitution operator.
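
For anyone who wants to try the two substitutions above, here is a minimal sketch that runs both against the sample string from the question (with the line-wrap "+" markers removed). The sub names fix_backward and fix_forward are made up for this illustration and are not from the thread; the patterns are the ones given above, apart from dropping an insignificant space under /x.

```perl
use strict;
use warnings;

# Sample input from the question, line-wrap artifacts removed.
my $input = "<italic>Here's a <bold>larger<normal> paragraph, "
          . "<italic>where I'm<normal> going to <bold>bold some <italic>";

# Hypothetical helper: backward-looking variant.
# After each consumed character, reject it if it completes "<normal>".
sub fix_backward {
    my ($s) = @_;
    $s =~ s/ (?<=<italic>) ( (?: .(?<!<normal>) )* ) <bold> /$1<bold-italic>/xg;
    return $s;
}

# Hypothetical helper: forward-looking variant.
# Before each consumed character, reject it if it starts "<normal>".
sub fix_forward {
    my ($s) = @_;
    $s =~ s/ ( <italic> (?: (?!<normal>). )* ) <bold> /$1<bold-italic>/xg;
    return $s;
}

print "backward: ", fix_backward($input), "\n";
print "forward:  ", fix_forward($input), "\n";
```

On this input both variants should rewrite only the first <bold>, since a <normal> sits between the second <italic> and the second <bold>.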

      Thanks ikegami, that worked perfectly.

      lol, thanks for the tip. However, there is a reason for going backwards. The conversions I'm doing actually matter in reverse more than forward. But believe me, if I can take your second example and convert some of my regexes, I will.

      Thanks again.
Re: Regex look-behind problem.
by runrig (Abbot) on Jul 12, 2007 at 21:30 UTC
    The ".*?" allows the regex to find lots of places before the "bold" where it is not "normal". So it matches "bold" and replaces.
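
A small sketch of that point, assuming a shortened version of the string from the question: the two look-behinds are tested once, at the position right after <italic>, and nothing stops .*? from then walking across <normal> on its way to <bold>. The variable names are only for illustration.

```perl
use strict;
use warnings;

# Shortened sample: a <normal> sits between <italic> and <bold>.
my $input = "<italic>where I'm<normal> going to <bold>bold";

# Original pattern from the question. Both look-behinds are checked
# at the start of the match only, before .*? consumes anything.
my ($captured) = $input =~ /(?<=<italic>)(?<!<normal>)(.*?)<bold>/;

# The capture shows what .*? was allowed to skip over.
my $saw_normal = index($captured, '<normal>') >= 0 ? 1 : 0;

print "captured: [$captured]\n";
print "capture contains <normal>: ", ($saw_normal ? "yes" : "no"), "\n";
```

The capture includes <normal>, which is why the substitution still fires for the second <bold>.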
