PerlMonks  

How do I remove HTML from a string?

by faq_monk (Initiate)
on Oct 08, 1999 at 00:32 UTC (#758=perlfaq)

Current Perl documentation can be found at perldoc.perl.org.

Here is our local, out-dated (pre-5.6) version:

The most correct way (albeit not the fastest) is to use HTML::Parse from CPAN (part of the libwww-perl distribution, which is a must-have module for all web hackers).

Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because tags may continue over line breaks, they may contain quoted angle brackets, or HTML comments may be present. Plus, folks forget to convert entities, like &lt;.
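To make those failure modes concrete, here is a minimal sketch that runs the naive substitution over three small inputs. The helper name naive_strip and the sample strings are made up for illustration; only the s/<.*?>//g pattern comes from the text above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper wrapping the naive substitution quoted above.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<.*?>//g;
    return $html;
}

# A quoted ">" ends the match early, so the tail of the tag is left behind.
my $quoted = naive_strip('<IMG SRC="foo.gif" ALT="A > B">text');

# Without /s, "." does not match a newline, so a tag that spans
# two lines is never matched and survives intact.
my $multiline = naive_strip(qq{<A\nHREF="x.html">link</A>});

# Entities are not converted at all.
my $entity = naive_strip('<B>1 &lt; 2</B>');

print "[$quoted]\n[$multiline]\n[$entity]\n";
```

The first result still contains a fragment of the ALT attribute, the second still contains the opening tag, and the third still contains the literal text &lt;.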

Here's one "simple-minded" approach that works for most files:

    #!/usr/bin/perl -p0777
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
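If you would rather call this from a larger program than run it as a filter, the same pattern can be wrapped in a subroutine. This is only a sketch: strip_tags and decode_basic_entities are hypothetical names, and the entity table covers just four named entities as a stand-in for real entity conversion.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: applies the pattern from the one-liner to a string.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# Hypothetical helper: decodes only four common named entities.
# A single pass means "&amp;lt;" becomes "&lt;" and is not decoded twice.
my %ENTITY = (lt => '<', gt => '>', amp => '&', quot => '"');

sub decode_basic_entities {
    my ($text) = @_;
    $text =~ s/&(lt|gt|amp|quot);/$ENTITY{$1}/g;
    return $text;
}

# The tag spans two lines and contains a quoted ">".
my $html = qq{<IMG SRC = "foo.gif"\n     ALT = "A > B">1 &lt; 2 <B>&amp;</B> more\n};
print decode_basic_entities(strip_tags($html));
```

Stripping the tags before decoding matters here: decoding first would turn &lt; into a real angle bracket that the tag pattern might then swallow.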

If you want a more complete solution, see the 3-stage striphtml program in http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz .

Here are some tricky cases that you should think about when picking a solution:

    <IMG SRC = "foo.gif" ALT = "A > B">

    <IMG SRC = "foo.gif" 
         ALT = "A > B">

    <!-- <A comment> -->

    <script>if (a<b && a>c)</script>

    <# Just data #>

    <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
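One way to think about these cases is to feed a few of them to whatever pattern you are considering and look at what comes out. The sketch below does that for the regular expression shown earlier; strip_tags is a hypothetical wrapper name, not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the "simple-minded" pattern shown earlier.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

my @cases = (
    '<IMG SRC = "foo.gif" ALT = "A > B">',
    '<!-- <A comment> -->',
    '<script>if (a<b && a>c)</script>',
    '<# Just data #>',
);

# Print each input next to what is left after stripping.
for my $case (@cases) {
    printf "%-38s => [%s]\n", $case, strip_tags($case);
}
```

The quoted-attribute case comes out clean, but the comment leaves a stray " -->" behind, and the script body is mangled because "<b && a>" looks like a tag to the pattern.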

If HTML comments include other tags, the regular expression solutions above would also break on text like this:

    <!-- This section commented out.
        <B>You can't see me!</B>
    -->
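A common workaround, sketched here as an assumption rather than as what striphtml actually does, is to delete comments in a first pass and strip tags in a second. The name strip_html is hypothetical, and the comment pattern is itself simple-minded: it would be fooled by "-->" inside a quoted attribute or a script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-pass helper: comments first, then tags.
sub strip_html {
    my ($html) = @_;
    $html =~ s/<!--.*?-->//gs;                    # pass 1: drop whole comments
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;    # pass 2: drop remaining tags
    return $html;
}

my $html = "before <!-- This section commented out.\n"
         . "    <B>You can't see me!</B>\n"
         . "--> after\n";

print strip_html($html);
```

With only the second pass, the text inside the comment would leak into the output; removing the comment first keeps it hidden.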
