Re: Strip HTML tags again
by Ovid (Cardinal) on Jun 30, 2002 at 20:20 UTC
|
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser::Simple;
use HTML::Tagset;

my $html = <<'END_HTML';
<a href="mylink">text1</a>
<this is normal text>
END_HTML

my $p = HTML::TokeParser::Simple->new( \$html );
while ( my $token = $p->get_token ) {
    # Skip tokens that are tags with a name HTML::Tagset knows about
    next if ! $token->is_text
        and exists $HTML::Tagset::isKnown{ $token->get_tag };
    # Text and unknown "tags" are printed exactly as they appeared
    print $token->as_is;
}
Result:
text1
<this is normal text>
Cheers,
Ovid
Join the Perlmonks Setiathome Group or just click on the link and check out our stats.
|
sub strip_html {
    my $renew = "";
    my $p = HTML::TokeParser::Simple->new(\$_[0]);
    no warnings "uninitialized";
    while ( my $token = $p->get_token ) {
        next if ! $token->is_text
            and exists $HTML::Tagset::isKnown{ $token->get_tag };
        $renew .= $token->as_is;
    }
    $_[0] = $renew;
}
Re: Strip HTML tags again
by merlyn (Sage) on Jun 30, 2002 at 15:54 UTC
|
Here's an example from the eg directory in the HTML::Parser distribution:
#!/usr/bin/perl -w
# Extract all plain text from an HTML file
use strict;
use HTML::Parser 3.00 ();

my %inside;

sub tag
{
    my($tag, $num) = @_;
    $inside{$tag} += $num;
    print " ";  # not for all tags
}

sub text
{
    return if $inside{script} || $inside{style};
    print $_[0];
}

HTML::Parser->new(api_version => 3,
                  handlers    => [start => [\&tag, "tagname, '+1'"],
                                  end   => [\&tag, "tagname, '-1'"],
                                  text  => [\&text, "dtext"],
                                 ],
                  marked_sections => 1,
    )->parse_file(shift) || die "Can't open file: $!\n";
-- Randal L. Schwartz, Perl hacker
Re: Stripping HTML tags from a document
by cjf (Parson) on Jun 30, 2002 at 15:55 UTC
|
Have a look at HTML::Tagset; it contains various lists of valid HTML tags for different sections of a document.
Update: ++ to Ovid for providing the working example elsewhere in this thread.
|
Thanks!!! This is just what I was looking for. Now I'd like to know how to use it in a 'perl' manner. Currently I have the following code (straight from perlfaq):
sub strip_html {
    my $t = shift;
    $t =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $t;
}
It seems I have to use the %HTML::Tagset::isKnown hash, but how do I apply it to my sub? I can't find any quick way...
--dda
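One way the tag list could be bolted onto a regex-based sub, as a rough sketch only: match anything that looks like a tag, then drop it only when its name is a known HTML tag. The %is_known hash below is a small hand-written stand-in (an assumption, so the sketch runs without CPAN modules); in real code the lookup would go against %HTML::Tagset::isKnown instead. The name strip_known_tags is hypothetical too.

```perl
use strict;
use warnings;

# Stand-in for %HTML::Tagset::isKnown (assumption: a tiny subset, for illustration only).
my %is_known = map { $_ => 1 } qw(a b i u p br em strong font img div span);

# Hypothetical helper: remove only tags whose name is a known HTML element.
sub strip_known_tags {
    my ($text) = @_;
    # First capture is the whole tag-like chunk, second is the candidate tag name.
    $text =~ s{(</?([A-Za-z][A-Za-z0-9]*)[^>]*>)}{ exists $is_known{ lc $2 } ? '' : $1 }ge;
    return $text;
}

print strip_known_tags(qq{<a href="mylink">text1</a> <Hehe> <this is normal text>}), "\n";
# prints: text1 <Hehe> <this is normal text>
```

A regex like this is still fooled by a ">" inside an attribute value, and by chat text that happens to start with a real tag name (something like "<i am here>"), so the parser-based answers in this thread remain the more robust route.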
Re: Strip HTML tags again
by tachyon (Chancellor) on Jun 30, 2002 at 20:35 UTC
|
|
heh. i ended up writing HTML::TagFilter because tachyon shouted at me so loud. Which only does part of what you want, sadly, so I wouldn't recommend it. But i'm usefully reminded to finish the next version, which does the rest. And lots of other exciting things, i feel sure.
Re: Strip HTML tags again
by ides (Deacon) on Jun 30, 2002 at 15:47 UTC
|
This will probably do the trick, but it does not handle HTML tags that span multiple lines; for that you'll most likely have to join all the lines together into one scalar. It also won't catch multiple HTML tags on the same line, so you'll need to modify it to suit your needs.
What it does is find the text between a tag and the next closing tag, and keep only that text.
Here is the code ($l is the scalar holding the line of text):
if( $l =~ /<.*?>(.*?)<\/.*?>/ ) {
    $l = $1;
}
-----------------------------------
Frank Wiles <frank@wiles.org>
http://frank.wiles.org
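To cover the two gaps mentioned in the post (tags spanning lines, several tags per line), one possible sketch is to join the lines into a single scalar and substitute globally with the /s modifier. The @lines sample is made up for illustration. Like the snippet it extends, this knows nothing about valid tag names, so it removes anything in angle brackets.

```perl
use strict;
use warnings;

# Made-up sample input: the last tag is split across two lines.
my @lines = (
    "<b>bold</b> and <i>italic</i>\n",
    "<a\n",
    "  href=\"mylink\">text1</a>\n",
);

my $text = join '', @lines;   # one scalar, so a tag may span line breaks
$text =~ s/<.*?>//gs;         # /g: every tag; /s: let "." match newlines too

print $text;
```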
|
Thanks, but I need a solution which 'knows' about possible HTML tags. What I need is to filter HTML out of a chat message, and with this approach, if someone types '<Hehe>', it would be wiped out as well.
--dda
Re: Strip HTML tags again
by hacker (Priest) on Jul 01, 2002 at 10:33 UTC
|
Re: Strip HTML tags again
by Mask (Pilgrim) on Jul 01, 2002 at 11:38 UTC
|
Hi monks, I am a little bit disappointed in this whole discussion.
If the input from the chat is displayed in an HTML page, then any "<" or ">" in the displayed text will have been transformed to &lt; and &gt;. So if you can see <this is a normal text> in your web browser, then in the source of the HTML page it will be &lt;this is a normal text&gt;. In that case you need not bother about knowing all the tags, and if you want to see the text as it appears in the browser, you just need to replace "&lt;" with "<" and "&gt;" with ">" in your Perl code.
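A minimal sketch of that escaping idea, using plain substitutions. The helper names escape_html and unescape_html are hypothetical, and handling "&" is an addition not mentioned in the post (an assumption: without it, a message containing a literal "&lt;" would not survive a round trip).

```perl
use strict;
use warnings;

# Hypothetical helper: make a chat message safe to embed in an HTML page.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;   # must come first, or the entities added below get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Reverse step, for showing the stored text as it was typed.
sub unescape_html {
    my ($text) = @_;
    $text =~ s/&lt;/</g;
    $text =~ s/&gt;/>/g;
    $text =~ s/&amp;/&/g;   # must come last, so "&amp;lt;" becomes "&lt;" rather than "<"
    return $text;
}

print escape_html('<this is a normal text> & more'), "\n";
# prints: &lt;this is a normal text&gt; &amp; more
```

With this approach nothing is stripped at all: '<Hehe>' and '<b>' alike are shown literally, which sidesteps the question of which tags are real.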
Re: Strip HTML tags again
by mousey (Scribe) on Jul 01, 2002 at 06:20 UTC
|
$foo =~ s/<(.|\n)+?>//g;
This is great! From up here we can throw lots and lots of stuff! But uh... how do we get down? --Goblin Balloon Brigade