PerlMonks  

Remove HTML tags from document

by matth (Monk)
on Aug 03, 2003 at 18:09 UTC (#280476=perlquestion)
matth has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

What is now regarded as the best way to remove all tags from an HTML document? I have briefly tried to work with HTML::Parser, but I don't understand it all that well.

20030803 Edit by jeffa: Changed title from 'HTML tags '

Re: Remove HTML tags from document
by BUU (Prior) on Aug 03, 2003 at 18:24 UTC

      This won't work properly if there are any tags with a '>' in one of the attributes, e.g.,

      <img alt="some text with a > in it" ...>

      anders pearson

Re: Remove HTML tags from document
by pzbagel (Chaplain) on Aug 03, 2003 at 18:25 UTC

    You could use HTML::TokeParser::Simple and print only the text tokens.

    # almost straight from the TokeParser::Simple POD
    use HTML::TokeParser::Simple;
    my $p = HTML::TokeParser::Simple->new( $somefile );
    while ( my $token = $p->get_token ) {
        print $token->as_is if $token->is_text;
    }

    HTH

      This works nicely. Is there an easy adaptation that would allow me to maintain the spacing that is in the HTML document?

        I'm not sure I understand. I recall that HTML::TokeParser::Simple does in fact maintain newlines in the text. I tested the code quickly just to make sure and it does maintain newlines in the html. Do you have tags that are multi-line? What exactly is happening?

Re: Remove HTML tags from document
by trs80 (Priest) on Aug 03, 2003 at 20:09 UTC
    You might want to try w3m; it also preserves the formatting of tables in plain text fairly well. It's not Perl, but it works :)
      This is an old package. Is it really any good?
        I use this package to convert my HTML reports into text so they can be emailed to users whose email clients don't support HTML. It works well with the content I deal with. I don't feel the value of a package should be derived from its age if it solves the problem at hand.
Re: Remove HTML tags from document
by ido50 (Scribe) on Aug 03, 2003 at 20:37 UTC
    If you want a good module with good documentation, I suggest you try HTML::TokeParser. oreilly.com's got a free full chapter from "Perl & LWP" which deals with this module exclusively. You can find it at http://www.oreilly.com/catalog/perllwp/ in a nice pdf document.

    ------------------------
    Live fat, die young
Re: Remove HTML tags from document
by fglock (Vicar) on Aug 03, 2003 at 21:35 UTC

    HTML::Strip - Perl extension for stripping HTML markup from text.

    use HTML::Strip;
    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse( $raw_html );
    $hs->eof;
Re: Remove HTML tags from document
by LazerRed (Pilgrim) on Aug 03, 2003 at 22:12 UTC
    Here's something I've been playing with lately. Maybe it'll help you.

    use HTML::PullParser;

    sub strip {
        my $html = shift;
        # Ask the parser to report only text events, as the raw text
        my $p = HTML::PullParser->new(
            doc  => $html,
            text => 'text',
        );
        my $result = '';
        while ( my $t = $p->get_token ) {
            $result .= $t->[0];
        }
        return $result;
    }

    I use this sub in a script that checks a status page on many different servers. It feeds the raw stats pages through the above sub, then parses the output text to generate a consolidated status report.

    Whip me, Beat me, Make me use Y-ModemG.
Re: Remove HTML tags from document
by daeve (Deacon) on Aug 04, 2003 at 03:52 UTC
    And in the spirit of TIMTOWTDI...

    If you just need to strip all the html tags from a page, and are on a platform with lynx, you can use:

    #!/usr/bin/perl
    use strict;
    use warnings;
    my $text = `lynx -dump htmlDocument.html`;
    print $text;

    HTH
    Daeve

      That is definitely not in the spirit of TIMTOWTDI. It may be another way to do it, but it's not perl, so it really misses the point.
        Oh? Using a module and calling a function, spawning a utility, where's the difference? You're using a blackbox either way.

        Makeshifts last the longest.

        But it is perl. Or the calling structure is perl. Now if I had just posted

        lynx -dump htmlDocument.html > htmlDocument.txt
        I would certainly agree with you about that not being perl. But even this does what the OP requested. I realize that lynx isn't perl, but neither are a lot of GNU/Linux and Unix system calls that are easier and shorter than the alternative "pure" perl methods. I use whatever allows me to get the job done in the shortest amount of time with the least trouble. In the case of stripping html tags out of pages, lynx works better and quicker than any regex I've seen so far. Then if there are formatting changes that need to be made, once the tags are stripped out, you can use perl to modify the document as needed.

        As I said in my original post - TIMTOWTDI ;-)

        Daeve

      How can I get this to print out to a file instead of the STDOUT? I have very large HTML files.
        perldoc -f open
        perldoc -f print
        perldoc perlopentut

        Abigail
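
Those perldoc pointers boil down to opening a filehandle for writing and printing to it. Here is a minimal sketch, assuming lynx is installed, that reads its output through a pipe one line at a time so a very large document never has to sit in memory. The sub name dump_command_to_file and the command-line arguments are made up for illustration, not taken from any post above:

```perl
use strict;
use warnings;

# Copy the output of an external command into a file, line by line.
# (Hypothetical helper; the name is not from the thread.)
sub dump_command_to_file {
    my ( $outfile, @cmd ) = @_;

    # List-form pipe open: runs @cmd without going through a shell
    open my $in,  '-|', @cmd    or die "Cannot run '@cmd': $!";
    open my $out, '>', $outfile or die "Cannot write '$outfile': $!";

    while ( my $line = <$in> ) {
        print {$out} $line;
    }

    close $out or die "Cannot close '$outfile': $!";
    close $in  or die "'@cmd' failed: " . ( $! || 'exit status ' . ( $? >> 8 ) );
    return;
}

# Only runs when given two arguments: input HTML file, output text file
dump_command_to_file( $ARGV[1], 'lynx', '-dump', $ARGV[0] ) if @ARGV == 2;
```

Saved under some name such as html2txt.pl (also hypothetical), it would be run as: perl html2txt.pl htmlDocument.html htmlDocument.txt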

Re: Remove HTML tags from document
by Juerd (Abbot) on Aug 04, 2003 at 09:26 UTC

    RTFM.

    perldoc -q 'remove html'

    How do I remove HTML from a string?

    The most correct way (albeit not the fastest) is to use HTML::Parser from CPAN. Another mostly correct way is to use HTML::FormatText which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.

    Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comments may be present. Plus, folks forget to convert entities--like &lt; for example.

    Here's one "simple-minded" approach, that works for most files:

    #!/usr/bin/perl -p0777
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

    If you want a more complete solution, see the 3-stage striphtml program in http://www.cpan.org/authors/Tom_Christiansen/scripts/striphtml.gz.
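
To see what the FAQ's regex does and does not cover, here is a small sketch that wraps it in a sub and runs it on a string with a quoted '>' inside an attribute. The sub names and the four-entry entity table are assumptions added for illustration; they are not part of the FAQ, and the table is nowhere near a complete entity decoder:

```perl
use strict;
use warnings;

# The FAQ's "simple-minded" regex: it steps over quoted attribute
# values, so a '>' inside quotes does not end the tag early.
sub strip_tags_naive {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# Illustrative only: decodes just four common entities, in one pass,
# so that '&amp;lt;' becomes '&lt;' rather than '<'.
my %ENTITY = ( lt => '<', gt => '>', quot => '"', amp => '&' );

sub decode_basic_entities {
    my ($text) = @_;
    $text =~ s/&(lt|gt|quot|amp);/$ENTITY{$1}/g;
    return $text;
}

my $html = '<p>1 &lt; 2 <img alt="a > b" src="x.png">and so on</p>';
print decode_basic_entities( strip_tags_naive($html) ), "\n";
# prints: 1 < 2 and so on
```

As the FAQ warns, HTML comments can still trip this up: a comment containing a '>' ends the match early and leaves the tail of the comment in the output. For real entity handling, the HTML::Entities module on CPAN covers the full set.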

    Also, with Super Search or Google, you can find hundreds of answers.

    See also How (Not) To Ask A Question.

    Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Node Type: perlquestion [id://280476]
Approved by blue_cowdawg