http://www.perlmonks.org?node_id=146178

costas has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a regex question I need help with.

I am translating large HTML pages from English to Swedish and need help keeping tags such as the HTML bold and italic tags intact while changing only the text visible in a browser.

Basically, if I have
<td>english text</td>
and want to change it to
<td>swedish text</td>
it works fine. However, some td tags contain b tags and I want to keep them intact. How can I write the regex so that if a b tag exists it is kept? I.e., go from
<td><b>uk text</b></td>
to
<td><b>swedish text</b></td>
Thanks

Replies are listed 'Best First'.
Re: shameful reg expression
by dragonchild (Archbishop) on Feb 18, 2002 at 17:32 UTC
    Heh. This is where CPAN is your friend. Go find HTML::Parser and use it. Heck, in a pinch, CGI and CGI::Simple may be usable. :-)

    In other words, use the wheel that has already been invented.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Re: shameful reg expression
by grep (Monsignor) on Feb 18, 2002 at 17:34 UTC
    Check out HTML::TreeBuilder. You should never try to do this with a single regexp; you want a parser.

    Also before you post you should:
  • use search... this would have answered your question many times over (and faster)
  • read Before you post...

    UPDATE: Just for laughs I did a search on HTML Parse - 2390 posts, all pretty much telling you the same as we're saying here.

    grep
    grep> rm -f /bin/laden

Try HTML::Parser
by Kozz (Friar) on Feb 18, 2002 at 17:36 UTC
    You should give the HTML::Parser module a try. Otherwise, I don't know what your regex looks like (where you're capturing the text), but you might try placing this into your existing regex:
    ([^><]+)
    which captures a string of characters that do NOT match either > or <. Keep in mind, however, that it has to sit between a closing > and an opening < in the regex; otherwise it will also match tag names like "td" and "b".
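
    As a rough illustration of that placement, here is a minimal sketch using only core Perl. The %swedish_for lookup table and the translate_text_nodes name are made up for the example, and the lookarounds are just one way of anchoring the capture between a > and a <. It assumes simple markup with no > inside attribute values, no comments and no script blocks.

```perl
use strict;
use warnings;

# Hypothetical English => Swedish lookup table (illustration only);
# a real translation table would be far larger.
my %swedish_for = (
    'english text' => 'svensk text',
    'uk text'      => 'svensk text',
);

# Replace only the text that sits between tags, leaving the tags untouched.
# (?<=>) requires the captured run to start right after a '>' and
# (?=<) requires it to end right before a '<', so tag names such as
# "td" and "b" are never captured.
sub translate_text_nodes {
    my ($html) = @_;
    $html =~ s{(?<=>)([^><]+)(?=<)}{
        exists $swedish_for{$1} ? $swedish_for{$1} : $1
    }ge;
    return $html;
}

print translate_text_nodes('<td>english text</td>'), "\n";
print translate_text_nodes('<td><b>uk text</b></td>'), "\n";
```

    Text that has no entry in the table is left as it is, and the lookup is an exact string match, so leading or trailing whitespace in a cell would need trimming first.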

    But again, this may not necessarily work in all situations, even if you write a damned good regex. For best results, look into HTML::Parser.