Re: Remove HTML tags from document
by pzbagel (Chaplain) on Aug 03, 2003 at 18:25 UTC
|
#almost straight from the TokeParser::Simple POD
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new( $somefile );
while ( my $token = $p->get_token ) {
print $token->as_is if $token->is_text;
}
HTH
|
This works nicely. Is there an easy adaptation that would let me preserve the spacing of the HTML document?
Re: Remove HTML tags from document
by fglock (Vicar) on Aug 03, 2003 at 21:35 UTC
|
use HTML::Strip;
my $hs = HTML::Strip->new();
my $clean_text = $hs->parse( $raw_html );
$hs->eof;
Re: Remove HTML tags from document
by Juerd (Abbot) on Aug 04, 2003 at 09:26 UTC
|
perldoc -q 'remove html'
How do I remove HTML from a string?
The most correct way (albeit not the fastest) is to use
HTML::Parser from CPAN. Another mostly correct way is to use
HTML::FormatText which not only removes HTML but also attempts
to do a little simple formatting of the resulting plain text.
Many folks attempt a simple-minded regular expression approach,
like s/<.*?>//g, but that fails in many cases because the
tags may continue over line breaks, they may contain quoted
angle-brackets, or HTML comment may be present. Plus, folks
forget to convert entities--like &lt; for example.
Here's one "simple-minded" approach, that works for most files:
#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
If you want a more complete solution, see the 3-stage striphtml
program in http://www.cpan.org/authors/Tom_Christiansen/scripts/striphtml.gz.
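To make the FAQ's warning concrete, here is a minimal sketch comparing the naive pattern with the FAQ's "simple-minded" one. The sub names strip_naive and strip_faq and the sample strings are made up for illustration; only the two regexes come from the text above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Naive pattern: stops at the first '>' it sees, even inside a quoted
# attribute, and '.' does not match a newline without /s.
sub strip_naive {
    my ($html) = @_;
    $html =~ s/<.*?>//g;
    return $html;
}

# The FAQ's pattern: skips over quoted attribute values, and the negated
# character class lets a tag continue across line breaks.
sub strip_faq {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# Hypothetical inputs: a quoted angle bracket, and a tag split over two lines.
my $quoted = qq{<img alt="a > b" src="x.png">Hello};
my $broken = qq{<a\nhref="x">link</a>};

print "naive: [", strip_naive($quoted), "]\n";   # leaves part of the tag behind
print "faq:   [", strip_faq($quoted),   "]\n";   # leaves only the text
print "naive: [", strip_naive($broken), "]\n";   # misses the multi-line tag
print "faq:   [", strip_faq($broken),   "]\n";
```

Neither version converts entities, so for anything beyond quick one-offs the parser modules the FAQ names are still the safer choice.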
Also, with Super Search or Google, you can find hundreds of answers.
See also How (Not) To Ask A Question.
Juerd
# { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }
Re: Remove HTML tags from document
by ido50 (Scribe) on Aug 03, 2003 at 20:37 UTC
|
If you want a good module with good documentation, I suggest you try HTML::TokeParser. O'Reilly offers a free full chapter from "Perl & LWP" that deals with this module exclusively. You can find it as a PDF at http://www.oreilly.com/catalog/perllwp/.
------------------------
Live fat, die young
Re: Remove HTML tags from document
by trs80 (Priest) on Aug 03, 2003 at 20:09 UTC
|
You might want to try w3m; it also preserves the formatting of tables fairly well in plain text. It's not Perl, but it works :)
|
This is an old package. Is it really any good?
|
I use this package to convert my HTML reports into text so they can be emailed to users whose email clients don't support HTML. It works well with the content I deal with. I don't think the value of a package should be judged by its age if it solves the problem at hand.
Re: Remove HTML tags from document
by LazerRed (Pilgrim) on Aug 03, 2003 at 22:12 UTC
|
Here's something I've been playing with lately. Maybe it'll help you.
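use HTML::PullParser;

# Return the text content of an HTML string, with all tags dropped.
sub strip {
    my $html = shift;
    # Ask the parser for text events only; each token is an array ref
    # whose first element is the raw text ('dtext' would decode entities).
    my $p = HTML::PullParser->new(
        doc  => $html,
        text => 'text',
    );
    my $result = '';
    while ( my $t = $p->get_token ) {
        $result .= $t->[0];
    }
    return $result;
}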
I use this sub in a script that checks a status page on many different servers. It feeds the raw stats pages through the above sub, then parses the output text to generate a consolidated status report.
Whip me, Beat me, Make me use Y-ModemG.
Re: Remove HTML tags from document
by daeve (Deacon) on Aug 04, 2003 at 03:52 UTC
|
And in the spirit of TIMTOWTDI...
If you just need to strip all the HTML tags from a page, and are on a platform with lynx, you can use:
#! /usr/bin/perl
use strict;
use warnings;
# lynx -dump renders the page and writes it as plain text to STDOUT
my $text = `lynx -dump htmlDocument.html`;
print $text;
HTH
Daeve
|
How can I get this to write to a file instead of STDOUT? I have very large HTML files.
|
perldoc -f open
perldoc -f print
perldoc perlopentut
Abigail
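For what it's worth, the docs listed above boil down to something like the sketch below. It is only one way to do it: the sub name copy_to_file and the command-line arguments are made up for illustration, and the list form of the pipe open assumes a Unix-like platform. Reading lynx's output line by line through a pipe, rather than capturing it all with backticks, keeps memory use flat for very large files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy everything readable from the handle $in to the file at $out_path,
# one line at a time, so the whole document never sits in memory.
sub copy_to_file {
    my ($in, $out_path) = @_;
    open my $out, '>', $out_path or die "Cannot write $out_path: $!";
    while (my $line = <$in>) {
        print {$out} $line;
    }
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Usage: script.pl input.html output.txt (both names are placeholders).
if (@ARGV == 2) {
    my ($html_file, $text_file) = @ARGV;
    # List-form open runs lynx directly, without going through a shell.
    open my $lynx, '-|', 'lynx', '-dump', $html_file
        or die "Cannot run lynx: $!";
    copy_to_file($lynx, $text_file);
    close $lynx or die "lynx failed: $! (exit status $?)";
}
```

If a plain shell redirect is enough, `lynx -dump htmlDocument.html > out.txt` does the same job without any Perl.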