Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

How do I remove HTML from a string?

by faq_monk (Initiate)
on Oct 08, 1999 at 00:32 UTC ( #758=perlfaq nodetype: print w/replies, xml ) Need Help??

Current Perl documentation can be found at perldoc.perl.org.

Here is our local, out-dated (pre-5.6) version:

The most correct way (albeit not the fastest) is to use HTML::Parse from CPAN (part of the libwww-perl distribution, which is a must-have module for all web hackers).

Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comment may be present. Plus folks forget to convert entities, like &lt; for example.

Here's one ``simple-minded'' approach, that works for most files:

    #!/usr/bin/perl -p0777
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml program in http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz .

Here are some tricky cases that you should think about when picking a solution:

    <IMG SRC = "foo.gif" ALT = "A > B">

    <IMG SRC = "foo.gif" 
         ALT = "A > B">

    <!-- <A comment> -->

    <script>if (a<b && a>c)</script>

    <# Just data #>

    <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also break on text like this:

    <!-- This section commented out.
        <B>You can't see me!</B>
    -->

Log In?
Username:
Password:

What's my password?
Create A New User
Chatterbox?
[stevieb]: ...unrelated which requires replacing a lot of code and a whole lib. I'm about to go nix only ffs
[shmem]: stevieb: what you're doing sounds afwully complex. Too much for me this evening to provide brighter insight ;-)
[stevieb]: I don't even own a Windows computer. Both my girl and I have a laptop each with Linux. I'm supporting Windows in some of my projects and I can't even guage whether it's worth it or not.
[stevieb]: shmem It's something I desired to have years ago, which is why I took over berrybrew. Cross-platform build/test automation locally, or over the network Test::BrewBuild
[shmem]: sounds good.
[shmem]: but I'm crumbling smaller stones. remember...
[stevieb]: I'm working on it to fatten it up and make it more reliant so I can finalize my Raspberry Pi automated build system for that software :) It's all well and fun, until I try to make it work with Windows lol
[shmem]: "debugging a program is more difficult than to write it in the first place. If you code your program as smart as you are, you are, by definition, too dumb to debug it."

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (4)
As of 2017-03-28 22:10 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Should Pluto Get Its Planethood Back?



    Results (342 votes). Check out past polls.