There have been many questions posted to PerlMonks
recently asking
about cleaning up HTML
(either removing specific tags, or removing
all tags except for a given few). Most people respond
with one of two suggestions:
1. A regular expression. This may not work, because > and < can
legitimately appear inside the HTML (in quoted attribute values,
for example).
2. Advice to check out <cpan://HTML::Parser> and
use it as the basis for solving the problem.
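To make the first point concrete, here is a minimal sketch of how a naive tag-stripping regex can go wrong. The sub name naive_strip and the sample markup are made up for illustration; the pattern is just a typical first attempt, not one taken from any particular post.

```perl
use strict;
use warnings;

# Hypothetical helper: strip anything that looks like <...>.
sub naive_strip {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;
    return $text;
}

# The ">" inside the quoted alt attribute ends the match too early,
# so the rest of the img tag leaks into the output.
my $html = '<p>See <img alt="a > b" src="x.png"> here</p>';
print naive_strip($html), "\n";
```

The regex treats the first > it meets as the end of the tag, which is exactly what a real parser avoids by tracking quoted attribute values.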
I've decided to delve in and write some code that would
serve as an example of how to properly filter out
unwanted HTML tags from a document. I actually
use <cpan://HTML::Filter> which is distributed with
<cpan://HTML::Parser>.
My code uses a hash of tags to keep; it could be
easily adapted to work with a hash of tags to drop
instead.
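As a rough sketch of that adaptation, the test inside output() could simply be inverted. The %drop hash, its contents, and the should_emit helper are hypothetical names used only for this illustration; in the real subclass the same check would sit inside output().

```perl
use strict;
use warnings;

# Hypothetical deny-list: tags to remove; everything else is kept.
my %drop = map { $_ => 1 } qw(script font blink);

# Returns true when a chunk (tag, text, comment) should be emitted.
sub should_emit {
    my ($chunk) = @_;
    if ($chunk =~ /^<\s*\/?\s*(\w+)/) {
        # A tag: emit it only if its name is NOT on the deny-list.
        return !exists $drop{ lc $1 };
    }
    return 1;    # not a tag: always keep
}

for my $chunk ('<p>', '<SCRIPT src="x.js">', '</font>', 'plain text') {
    printf "%-22s %s\n", $chunk, should_emit($chunk) ? 'keep' : 'drop';
}
```

Note that a deny-list lets through any tag you did not think of, so the keep-list version is the safer default when the input is untrusted.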
As always, any comments, criticism, or advice on doing
this better are appreciated.
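package HTML::Sanitizer;
use strict;
use warnings;
require HTML::Filter;
our @ISA = qw(HTML::Filter);

# The filtered document accumulates here.
my $data = '';

# Tags to keep (lowercase); every other tag is dropped.
my %keep = (
    a   => 1,
    p   => 1,
    img => 1,
);

# HTML::Filter passes each piece of the document (tags, text,
# comments, declarations) to output() in its original form.
sub output {
    my ($self, $d) = @_;
    # Anchored at the start of the chunk, so a "<" in the middle
    # of a text chunk is not mistaken for a tag.
    if ($d =~ /^<\s*\/?\s*(\w+)/) {
        $data .= $d if exists $keep{lc $1};
    }
    else {
        # Text, comments and declarations pass through unchanged.
        $data .= $d;
    }
}

my $p = HTML::Sanitizer->new();
$p->parse_file("index.html") or die "Cannot open index.html: $!";
print $data;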
From a user interface standpoint, you might want to simplify the configuration section:
## BEGIN CONFIGURE
my @KEEP = qw(a p img br);
## END CONFIGURE
...
...
# just before the business loop:
my %keep; @keep{@KEEP} = ();
...
... "it's good" if exists $keep{$1}; # or whatever
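For anyone unfamiliar with the hash-slice idiom in that suggestion, here is a small self-contained sketch of it. The is_kept helper and the sample tag names are assumptions added for illustration; only @KEEP and %keep come from the snippet above.

```perl
use strict;
use warnings;

## BEGIN CONFIGURE
my @KEEP = qw(a p img br);
## END CONFIGURE

# Hash slice: every name in @KEEP becomes a key whose value is undef,
# so membership has to be tested with exists, not with truth.
my %keep;
@keep{@KEEP} = ();

# Hypothetical helper: 1 when a tag name is on the keep-list, else 0.
sub is_kept {
    my ($tag) = @_;
    return exists $keep{ lc $tag } ? 1 : 0;
}

print is_kept($_) ? "keep $_\n" : "drop $_\n" for qw(IMG br script);
```

Lowercasing the tag name before the lookup keeps the check case-insensitive, as in the original code.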
It's been too long since I wrote that code. In it I am subclassing HTML::Filter into HTML::Sanitizer, replacing its output sub with the one I wrote. I never explicitly call it; HTML::Filter calls it for each piece of the document as it is parsed.
There are slicker ways to solve this problem today, for instance the HTML::Sanitizer module that didn't even exist when I wrote this example.