Remove HTML tags from document

by matth (Monk)
on Aug 03, 2003 at 18:09 UTC (#280476)
matth has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

What is now regarded as the best way to remove all tags from an HTML document? I have briefly tried to work with HTML::Parser but I don't understand it all that well.

20030803 Edit by jeffa: Changed title from 'HTML tags '

Replies are listed 'Best First'.
Re: Remove HTML tags from document
by pzbagel (Chaplain) on Aug 03, 2003 at 18:25 UTC

    You could use HTML::TokeParser::Simple and print only the text tokens.

    # almost straight from the TokeParser::Simple POD
    use HTML::TokeParser::Simple;
    my $p = HTML::TokeParser::Simple->new( $somefile );
    while ( my $token = $p->get_token ) {
        print $token->as_is if $token->is_text;
    }


      This works nicely. Is there an easy adaptation that would let me preserve the spacing in the HTML document?

        I'm not sure I understand. I recall that HTML::TokeParser::Simple does in fact maintain newlines in the text. I tested the code quickly just to make sure and it does maintain newlines in the html. Do you have tags that are multi-line? What exactly is happening?

Re: Remove HTML tags from document
by fglock (Vicar) on Aug 03, 2003 at 21:35 UTC

    HTML::Strip - Perl extension for stripping HTML markup from text.

    use HTML::Strip;
    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse( $raw_html );
    $hs->eof;
Re: Remove HTML tags from document
by Juerd (Abbot) on Aug 04, 2003 at 09:26 UTC


    perldoc -q 'remove html'

    How do I remove HTML from a string?

    The most correct way (albeit not the fastest) is to use HTML::Parser from CPAN. Another mostly correct way is to use HTML::FormatText which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.
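    Since the original question mentions trouble with HTML::Parser, here is a minimal sketch of how it might be used just to collect text. The sub name strip_html and the sample input are made up for illustration. It assumes HTML::Parser 3.x from CPAN is installed, and the choice to skip script and style contents is an assumption about what "text" should mean.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Collect only the text content of an HTML string (hypothetical helper).
sub strip_html {
    my ($html) = @_;
    my $text = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the handler text with entities already decoded
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    $p->ignore_elements(qw(script style));    # drop their contents as well
    $p->parse($html);
    $p->eof;                                   # flush any buffered text
    return $text;
}

print strip_html('<p>1 &lt; 2 <b>really</b></p>'), "\n";
```

    Whitespace and newlines in the text are passed through untouched, so any layout beyond that has to be added by hand.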

    Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comments may be present. Plus, folks forget to convert entities--like &lt; for example.

    Here's one "simple-minded" approach, that works for most files:

    #!/usr/bin/perl -p0777
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
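    To see the difference between the two patterns quoted above, here is a small sketch that runs both on a tag with a quoted '>' that also spans a line break. The sub names and the sample string are invented for the demonstration. Neither pattern converts entities or handles comments.

```perl
use strict;
use warnings;

# Naive pattern: ends the tag at the first '>' it sees, even inside an attribute.
sub strip_naive {
    my ($html) = @_;
    $html =~ s/<.*?>//g;
    return $html;
}

# The FAQ's pattern: '<', then runs of characters that are neither quotes
# nor '>', or complete quoted strings, then the closing '>'.
sub strip_faq {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

my $html = qq{<img alt="a > b"\n src="x.png">caption};

print "naive: [", strip_naive($html), "]\n";   # leaves part of the tag behind
print "faq:   [", strip_faq($html),   "]\n";   # prints "faq:   [caption]"
```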

    If you want a more complete solution, see the 3-stage striphtml program in

    Also, with Super Search or Google, you can find hundreds of answers.

    See also How (Not) To Ask A Question.

    Juerd # { site => '', plp_site => '', do_not_use => 'spamtrap' }

Re: Remove HTML tags from document
by ido50 (Scribe) on Aug 03, 2003 at 20:37 UTC
    If you want a good module with good documentation, I suggest you try HTML::TokeParser.'s got a free full chapter from "Perl & LWP" which deals with this module exclusively. You can find it on in a nice pdf document.

    Live fat, die young
Re: Remove HTML tags from document
by trs80 (Priest) on Aug 03, 2003 at 20:09 UTC
    You might want to try w3m; it also preserves the formatting of tables in plain text fairly well. It's not Perl, but it works :)
      This is an old package. Is it really any good?
        I use this package to convert my HTML reports into text so they can be emailed to users whose email clients don't support HTML. It works well with the content I deal with. I don't feel the value of a package should be derived from its age if it solves the problem at hand.
Re: Remove HTML tags from document
by LazerRed (Pilgrim) on Aug 03, 2003 at 22:12 UTC
    Here's something I've been playing with lately. Maybe it'll help you.

    use HTML::PullParser;

    sub strip {
        my $html = shift;
        # ask only for text events; each token is [ $text ]
        my $p = HTML::PullParser->new(
            doc  => $html,
            text => 'text',
        );
        my $result = '';
        while ( my $t = $p->get_token ) {
            $result .= $t->[0];
        }
        return $result;
    }

    I use this sub in a script that checks a status page on many different servers. It feeds the raw stats pages through the above sub, then parses the output text to generate a consolidated status report.

    Whip me, Beat me, Make me use Y-ModemG.
Re: Remove HTML tags from document
by daeve (Deacon) on Aug 04, 2003 at 03:52 UTC
    And in the spirit of TIMTOWTDI...

    If you just need to strip all the html tags from a page, and are on a platform with lynx, you can use:

    #!/usr/bin/perl
    use strict;
    use warnings;
    my $text = `lynx -dump htmlDocument.html`;
    print $text;


      How can I get this to print out to a file instead of the STDOUT? I have very large HTML files.
        perldoc -f open
        perldoc -f print
        perldoc perlopentut
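        For very large files, one way to apply those docs is to read the command's output through a pipe and write it out line by line instead of holding it all in a backtick string. This is only a sketch: the sub name dump_to_file and the file names in the usage comment are hypothetical, and the lynx call assumes lynx is on the PATH.

```perl
use strict;
use warnings;

# Run a command and copy its standard output to a file line by line,
# so the whole document never has to sit in memory.
sub dump_to_file {
    my ($out_path, @cmd) = @_;
    open my $in,  '-|', @cmd      or die "Cannot run @cmd: $!";
    open my $out, '>',  $out_path or die "Cannot write $out_path: $!";
    print {$out} $_ while <$in>;
    close $out or die "Cannot close $out_path: $!";
    close $in  or die "@cmd failed (exit status $?)";
    return;
}

# Hypothetical usage, needs lynx installed:
#   perl dump.pl htmlDocument.html htmlDocument.txt
if (@ARGV == 2) {
    my ($html_file, $text_file) = @ARGV;
    dump_to_file($text_file, 'lynx', '-dump', $html_file);
}
```

        The list form of open avoids passing the file name through a shell, which matters if the name can contain spaces or other special characters.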


      That is definitely not in the spirit of TIMTOWTDI. It may be another way to do it, but it's not perl, so it really misses the point.
        But it is perl. Or the calling structure is perl. Now if I had just posted

        lynx -dump htmlDocument.html > htmlDocument.txt
        I would certainly agree with you about that not being perl. But even this does what the OP requested. I realize that lynx isn't perl, but neither are a lot of GNU/Linux and Unix system calls that are easier and shorter than the alternative "pure" perl methods. I use whatever allows me to get the job done in the shortest amount of time with the least trouble. In the case of stripping html tags out of pages, lynx works better and quicker than any regex I've seen so far. Then if there are formatting changes that need to be made, once the tags are stripped out, you can use perl to modify the document as needed.

        As I said in my original post - TIMTOWTDI ;-)


        Oh? Using a module and calling a function, spawning a utility, where's the difference? You're using a blackbox either way.

        Makeshifts last the longest.

Re: Remove HTML tags from document
by BUU (Prior) on Aug 03, 2003 at 18:24 UTC

      this won't work properly if there are any tags with a '>' in one of the attributes. eg,

      <img alt="some text with a > in it" ...>

      anders pearson

Node Type: perlquestion [id://280476]
Approved by blue_cowdawg