Regex look-behind problem.

by the_0ne (Pilgrim)
on Jul 12, 2007 at 21:21 UTC ( #626319=perlquestion )
the_0ne has asked for the wisdom of the Perl Monks concerning the following question:

Hey monks, have a regex problem that I'm hoping you can help with.

First off, a disclaimer: the reason I am not using an HTML parser is that the format I am converting to does not map well onto what HTML converters produce. I'm working with a very small subset of tags, so I'm hoping to bang this out with regexes instead of a full-blown HTML parser.

Here's the code...
$foo = "<italic>Here's a <bold>larger<normal> paragraph, <italic>where I'm<normal> going to <bold>bold some <italic>";
print "\nfoo before:\n$foo\n\n";
#foo.gsub!(/(?<=<italic>)(?<!<normal>)(.*?)<bold>/, '\1<bold-italic>')
$foo =~ s/(?<=<italic>)(?<!<normal>)(.*?)<bold>/\1<bold-italic>/g;
print "foo after:\n$foo\n";
Here's the output I am getting...
# Output is...
# <italic>Here's a <bold-italic>larger<normal> paragraph, <italic>where I'm<normal> going to <bold-italic>bold some <italic>
Notice the second <bold> is being replaced with <bold-italic>. By the regex (at least I think I have the regex right) the second bold *should not* be replaced since I perform a look-behind for <normal>. If <normal> is between the <italic> and the <bold>, then the <bold> should be left alone. At least this is what I am trying to get at.

Here's what I would like to see...
# However, should be...
# <italic>Here's a <bold-italic>larger<normal> paragraph, <italic>where I'm<normal> going to <bold>bold some <italic>
Notice the second <bold> is not replaced.

I'm confused as to what is wrong with my regex.

Thanks again Monks for all your help.

Replies are listed 'Best First'.
Re: Regex look-behind problem.
by ikegami (Pope) on Jul 12, 2007 at 22:43 UTC

    (?!<normal>).*? will happily match " <normal>", so you need to check every . to make sure it's not the start of <normal>. Or, since you're looking backwards, you could check to make sure every . is not the end of <normal>.

    s/ (?<=<italic>) ( (?: .(?<! <normal>) )* ) <bold> /$1<bold-italic>/xg
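
    As a quick check, the look-behind substitution above can be run against the string from the question. This is only a minimal sketch: the variable names $foo and $fixed are illustrative, and the input has the line-wrap artifacts from the original post removed.

    ```perl
    use strict;
    use warnings;

    # Sample string from the question.
    my $foo = "<italic>Here's a <bold>larger<normal> paragraph, "
            . "<italic>where I'm<normal> going to <bold>bold some <italic>";

    # Each character consumed after <italic> must not complete "<normal>",
    # so the match can never run past a <normal> tag to reach a later <bold>.
    (my $fixed = $foo) =~
        s/ (?<=<italic>) ( (?: .(?<!<normal>) )* ) <bold> /$1<bold-italic>/xg;

    print "$fixed\n";
    ```

    With this input only the first <bold> (the one with no <normal> between it and its <italic>) should become <bold-italic>.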

    It's a lot more sane going forward instead of backwards.

    s/ ( <italic> (?: (?!<normal>). )* ) <bold> /$1<bold-italic>/xg
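
    The forward-looking form can be tried the same way. In this sketch the helper name italic_bold and the two short inputs are made up for illustration; only the substitution itself comes from the reply.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper wrapping the forward-looking substitution.
    sub italic_bold {
        my ($text) = @_;
        # Walk forward from <italic>, refusing to step onto the start of
        # <normal>, and rewrite <bold> only if it is reached that way.
        $text =~ s/ ( <italic> (?: (?!<normal>). )* ) <bold> /$1<bold-italic>/xg;
        return $text;
    }

    # Illustrative inputs (not from the original post).
    my $still_italic = italic_bold("<italic>a <bold>b");
    my $after_normal = italic_bold("<italic>a<normal> <bold>b");

    print "$still_italic\n";   # <bold> follows <italic> directly
    print "$after_normal\n";   # <normal> intervenes, so nothing changes
    ```

    Because <italic> is captured as part of $1 here, no look-behind is needed at all.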

    By the way, you should use $1 rather than \1 in the second (non-regexp) half of the substitution operator. \1 is meant for backreferences inside the pattern itself.

      Thanks ikegami, that worked perfectly.

      lol, thanks for the tip. However, there is a reason for going backwards. The conversions I'm doing actually matter in reverse more than forward. But believe me, if I can take your second example and convert some of my regexes, I will.

      Thanks again.
Re: Regex look-behind problem.
by runrig (Abbot) on Jul 12, 2007 at 21:30 UTC
    The ".*?" allows the regex to find lots of places before the "bold" where it is not "normal". So it matches "bold" and replaces.
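
    One way to see this is to collect what (.*?) captures on each match of the original pattern. A rough sketch, using the string from the question with the wrap artifacts removed; the array name @captures is illustrative:

    ```perl
    use strict;
    use warnings;

    my $foo = "<italic>Here's a <bold>larger<normal> paragraph, "
            . "<italic>where I'm<normal> going to <bold>bold some <italic>";

    # (?<!<normal>) is tested once, at the position right after <italic>.
    # It says nothing about the text that .*? goes on to consume.
    my @captures;
    while ($foo =~ /(?<=<italic>)(?<!<normal>)(.*?)<bold>/g) {
        push @captures, $1;
    }

    print "captured: [$_]\n" for @captures;
    ```

    The second capture includes a <normal> tag, which is why the second <bold> is rewritten even though the author intended the look-behind to prevent it.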

Node Type: perlquestion [id://626319]
Approved by pKai