PerlMonks  

HTML Sanitizer (removes unwanted tags)

by lhoward (Vicar)
on Aug 08, 2000 at 07:08 UTC (#26728=snippet)

Description: There have been many questions posted to PerlMonks recently about cleaning up HTML (either removing specific tags, or removing all tags except for a given few). Most people respond with one of two suggestions:
  1. a regular expression. The problem is that it may fail, because < and > can legitimately appear inside the HTML (in a quoted attribute value or a comment, for example), so a pattern cannot reliably tell where a tag ends
  2. advice to check out HTML::Parser and use it as the basis for solving the problem
I've decided to delve in and write some code that serves as an example of how to properly filter unwanted HTML tags out of a document. I actually use HTML::Filter, which is distributed with HTML::Parser. My code uses a hash of tags to keep; it could easily be adapted to work with a hash of tags to drop instead.
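
To illustrate the first point, here is a minimal sketch of how a simple tag-stripping regex can go wrong. The helper name strip_tags_naive and the sample markup are made up for this illustration; they are not part of the snippet.

```perl
use strict;
use warnings;

# Hypothetical helper: a naive tag stripper that removes "<",
# everything up to the next ">", and that ">" itself.
sub strip_tags_naive {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;
    return $text;
}

# The ">" inside the quoted alt attribute ends the match too early,
# so the tail of the <img> tag leaks into the "cleaned" output.
my $html = '<p>See <img src="a.png" alt="x > y"> here</p>';
print strip_tags_naive($html), "\n";
```

A parser that understands quoting, such as HTML::Parser, treats the whole img element as one tag and does not have this problem.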

As always, any comments, criticism, or advice on doing this better are appreciated.

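package HTML::Sanitizer;
use strict;
use warnings;
require HTML::Filter;
our @ISA = qw(HTML::Filter);

# The filtered document accumulates here
my $data = '';

# Tags to keep (lowercase); every other tag is dropped
my %keep = (
  a   => 1,
  p   => 1,
  img => 1,
);

# HTML::Filter calls output() once for each tag, run of text, comment, etc.
sub output {
  my $self = shift;
  my $d = $_[0];
  # Anchored, so only a chunk that starts as a tag is treated as one
  if ($d =~ /^<\/?(\w+)/) {
    # start or end tag: pass it through only if it is on the keep list
    if (exists $keep{lc($1)}) {
      $data .= $d;
    }
  } else {
    # text, comments, and declarations pass through unchanged
    $data .= $d;
  }
}

my $p = HTML::Sanitizer->new();
$p->parse_file("index.html") or die "Can't parse index.html: $!";

print $data;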
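
The description mentions that the same approach could work from a hash of tags to drop. A minimal sketch of that variant follows, under a few assumptions: the package name HTML::DropFilter, the contents of %drop, and the sample input are all invented for this example, and it parses a string with parse() and eof() rather than reading a file.

```perl
use strict;
use warnings;

package HTML::DropFilter;   # hypothetical name, not a CPAN module
require HTML::Filter;
our @ISA = qw(HTML::Filter);

# Tags to drop (lowercase); everything else is kept
my %drop = map { $_ => 1 } qw(font blink);

# The filtered document accumulates here
my $out = '';

sub output {
    my ($self, $chunk) = @_;
    # Skip start and end tags that are on the drop list
    return if $chunk =~ /^<\/?(\w+)/ && exists $drop{lc $1};
    $out .= $chunk;
}

package main;

my $p = HTML::DropFilter->new;
$p->parse('<p>Hi <font color="red">there</font></p>');
$p->eof;    # flush any text the parser is still holding
print "$out\n";
```

As in the keep-list version, only the tags themselves are removed; the text between a dropped start tag and its end tag stays in the output.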
RE: HTML Sanitizer (removes unwanted tags)
by merlyn (Sage) on Aug 08, 2000 at 07:20 UTC
    From a user interface standpoint, you might want to simplify the configuration section:
    ## BEGIN CONFIGURE
    my @KEEP = qw(a p img br);
    ## END CONFIGURE
    ...
    ...
    # just before the business loop:
    my %keep;
    @keep{@KEEP} = ();
    ...
    ...
    "it's good" if exists $keep{$1}; # or whatever
    A data structure that requires them to spell out the whole hash is troublesome.

    -- Randal L. Schwartz, Perl hacker
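
A runnable sketch of the suggestion above, for readers who have not met hash slices. The is_kept helper and the tags in the test loop are additions for illustration only; the @KEEP list and the slice assignment are as suggested.

```perl
use strict;
use warnings;

## BEGIN CONFIGURE
my @KEEP = qw(a p img br);
## END CONFIGURE

# Build the lookup hash with a hash slice. Every value is undef,
# so membership must be tested with exists, not by truth.
my %keep;
@keep{@KEEP} = ();

# Hypothetical helper: true if the tag name is on the keep list
sub is_kept {
    my ($tag) = @_;
    return exists $keep{lc $tag} ? 1 : 0;
}

for my $tag (qw(a BR script)) {
    printf "%-6s %s\n", $tag, is_kept($tag) ? 'keep' : 'drop';
}
```

The configuration is now a plain list of tag names, which is easier to edit than a hash with a dummy value for each key.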

Re: HTML Sanitizer (removes unwanted tags)
by ehdonhon (Curate) on Apr 20, 2005 at 01:10 UTC
    Where does sub output() get invoked?
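      It's been too long since I wrote that code. In it I am subclassing HTML::Filter as HTML::Sanitizer and replacing its output sub with the one I wrote. I never explicitly call it; HTML::Filter's parsing handlers pass each tag or chunk of text to output as the document is parsed.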

      There are slicker ways to solve this problem today, for instance the HTML::Sanitizer module that didn't even exist when I wrote this example.

      L
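
To make the answer above concrete, here is a small sketch showing that output is called by the parser and not by the script. The package name TraceFilter and the sample markup are invented for this illustration. It records every chunk handed to output and then prints them.

```perl
use strict;
use warnings;

package TraceFilter;   # hypothetical name, for illustration only
require HTML::Filter;
our @ISA = qw(HTML::Filter);

# Every chunk passed to output() is recorded here
my @calls;

# Override HTML::Filter's output(), which prints by default
sub output {
    my ($self, $chunk) = @_;
    push @calls, $chunk;
}

package main;

my $p = TraceFilter->new;
$p->parse('<p>Hello <b>world</b></p>');   # output() is called from inside here
$p->eof;                                  # and here, for any buffered text

print scalar(@calls), " calls to output:\n";
print "  [$_]\n" for @calls;
```

Each start tag, end tag, and run of text arrives as its own call, which is why the snippet can decide tag by tag what to keep. The parser may split a run of text across more than one call.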
