Remove HTML tags from document

matth has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

What is now regarded as the best way to remove all tags from an HTML document? I have briefly tried to work with HTML::Parser, but I don't understand it all that well.

20030803 Edit by jeffa: Changed title from 'HTML tags '

Replies are listed 'Best First'.
Re: Remove HTML tags from document by pzbagel (Chaplain) on Aug 03, 2003 at 18:25 UTC
You could use HTML::TokeParser::Simple and print only the text tokens.

```perl
# almost straight from the TokeParser::Simple POD
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new( $somefile );

while ( my $token = $p->get_token ) {
    print $token->as_is if $token->is_text;
}
```

HTH
Re: Re: Remove HTML tags from document by matth (Monk) on Aug 04, 2003 at 09:18 UTC
This works nicely. Is there an easy adaptation that would allow me to maintain the spacing that is in the HTML document?
Re: Re: Re: Remove HTML tags from document by pzbagel (Chaplain) on Aug 04, 2003 at 09:47 UTC
I'm not sure I understand. I recall that HTML::TokeParser::Simple does in fact maintain newlines in the text. I tested the code quickly just to make sure and it does maintain newlines in the html. Do you have tags that are multi-line? What exactly is happening?
Re: Re: Re: Re: Remove HTML tags from document by matth (Monk) on Aug 04, 2003 at 10:04 UTC
Re: Remove HTML tags from document by fglock (Vicar) on Aug 03, 2003 at 21:35 UTC
HTML::Strip - Perl extension for stripping HTML markup from text.

```perl
use HTML::Strip;

my $hs = HTML::Strip->new();
my $clean_text = $hs->parse( $raw_html );
$hs->eof;
```
Re: Remove HTML tags from document by Juerd (Abbot) on Aug 04, 2003 at 09:26 UTC
RTFM. `perldoc -q 'remove html'`

How do I remove HTML from a string?

The most correct way (albeit not the fastest) is to use HTML::Parser from CPAN. Another mostly correct way is to use HTML::FormatText which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.

Many folks attempt a simple-minded regular expression approach, like `s/<.*?>//g`, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comments may be present. Plus, folks forget to convert entities--like `&lt;` for example.

Here's one "simple-minded" approach, that works for most files:

```perl
#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
```

If you want a more complete solution, see the 3-stage striphtml program in http://www.cpan.org/authors/Tom_Christiansen/scripts/striphtml.gz.

Also, with Super Search or Google, you can find hundreds of answers. See also How (Not) To Ask A Question.

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }
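To make the failure modes listed in the FAQ excerpt concrete (tags spanning line breaks, quoted angle brackets, comments), here is a minimal sketch that runs both regular expressions over a few hand-made inputs. The sub names `strip_naive` and `strip_faq` and the sample strings are made up for illustration; only the two patterns come from the FAQ. It is a demonstration of where regexes break, not a recommended stripper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Naive pattern: a non-greedy match between angle brackets.
# Without /s, "." will not cross a newline.
sub strip_naive {
    my ($html) = @_;
    $html =~ s/<.*?>//g;
    return $html;
}

# The FAQ's "simple-minded" pattern: it steps over quoted attribute
# values and, because of /s, lets a quoted value span line breaks.
sub strip_faq {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# Hypothetical inputs, one per failure mode named in the FAQ.
my @samples = (
    [ 'plain tags'      => '<b>bold</b> text' ],
    [ 'tag over lines'  => "<a\nhref='x.html'>link</a>" ],
    [ 'quoted bracket'  => '<img alt="a > b" src="x.png">after' ],
    [ 'comment with >'  => '<!-- if a > b -->text' ],
);

for my $sample (@samples) {
    my ( $label, $html ) = @$sample;
    printf "%-16s naive: [%s]\n", $label, strip_naive($html);
    printf "%-16s faq:   [%s]\n", $label, strip_faq($html);
}
```

The FAQ pattern copes with the multi-line tag and the quoted `>`, but both patterns leave debris behind for the comment case, and neither converts entities. That is the argument for using a real parser such as HTML::Parser when the input is not under your control.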
Re: Remove HTML tags from document by ido50 (Scribe) on Aug 03, 2003 at 20:37 UTC
If you want a good module with good documentation, I suggest you try HTML::TokeParser. oreilly.com has a free full chapter from "Perl & LWP" that deals exclusively with this module. You can find it at http://www.oreilly.com/catalog/perllwp/ as a PDF document.

------------------------ Live fat, die young
Re: Remove HTML tags from document by trs80 (Priest) on Aug 03, 2003 at 20:09 UTC
You might want to try w3m; it also preserves the formatting of tables fairly well in plain text. It's not Perl, but it works :)
Re: Re: Remove HTML tags from document by matth (Monk) on Aug 04, 2003 at 08:39 UTC
This is an old package. Is it really any good?
Re: Re: Re: Remove HTML tags from document by trs80 (Priest) on Aug 04, 2003 at 15:21 UTC
I use this package to convert my HTML reports into text so they can be emailed to users whose email clients don't support HTML. It works well with the content I deal with. I don't feel the value of a package should be judged by its age if it solves the problem at hand.
Re4: Remove HTML tags from document by dragonchild (Archbishop) on Aug 04, 2003 at 15:30 UTC
Re: Remove HTML tags from document by LazerRed (Pilgrim) on Aug 03, 2003 at 22:12 UTC
Here's something I've been playing with lately. Maybe it'll help you.

```perl
use HTML::PullParser;

sub strip {
    my $html = shift;

    # Ask the parser for text events only, each reported as its raw text
    my $p = HTML::PullParser->new(
        doc  => $html,
        text => 'text',
    );

    my $result = '';
    while ( my $t = $p->get_token ) {
        $result .= $t->[0];
    }
    return $result;
}
```

I use this sub in a script that checks a status page on many different servers. It feeds the raw stats pages through the above sub, then parses the output text to generate a consolidated status report.

`Whip me, Beat me, Make me use Y-ModemG.`
Re: Remove HTML tags from document by daeve (Deacon) on Aug 04, 2003 at 03:52 UTC
And in the spirit of TIMTOWTDI... If you just need to strip all the HTML tags from a page, and are on a platform with lynx, you can use:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = `lynx -dump htmlDocument.html`;
print $text;
```

HTH Daeve
Re: Re: Remove HTML tags from document by matth (Monk) on Aug 04, 2003 at 08:36 UTC
How can I get this to print to a file instead of STDOUT? I have very large HTML files.
Re: Remove HTML tags from document by Abigail-II (Bishop) on Aug 04, 2003 at 08:40 UTC
```
perldoc -f open
perldoc -f print
perldoc perlopentut
```

Abigail
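For anyone landing here with the same question about writing to a file instead of STDOUT, those three pointers boil down to something like the sketch below. It assumes the lynx approach from earlier in the thread. The sub name `copy_to_file` and the command-line file arguments are made up for illustration. Reading lynx's output through a pipe, rather than with backticks, means a very large document is copied line by line and never held in memory all at once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy everything from an already-open input handle into a file,
# one line at a time.
sub copy_to_file {
    my ( $in, $out_path ) = @_;

    open my $out, '>', $out_path
        or die "Cannot open $out_path for writing: $!";

    while ( my $line = <$in> ) {
        print {$out} $line;
    }

    close $out or die "Cannot close $out_path: $!";
    return;
}

# Usage sketch: script.pl input.html output.txt
# Both file names are placeholders, and lynx must be on the PATH.
# The list form of a piped open avoids the shell, so odd characters
# in the file name are passed to lynx untouched.
if ( @ARGV == 2 ) {
    my ( $html_file, $text_file ) = @ARGV;

    open my $lynx, '-|', 'lynx', '-dump', $html_file
        or die "Cannot run lynx: $!";

    copy_to_file( $lynx, $text_file );

    close $lynx or die "lynx did not exit cleanly (status $?)";
}
```

The same sub works with any readable handle, so it could equally take the output of w3m or a plain file opened for reading.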