SilasTheMonk has asked for the wisdom of the Perl Monks concerning the following question:
I need to be able to take a fragment of HTML and clean it up. We can assume that the outermost element is a "div",
though if it was not necessary to assume that it would be a bonus. By "clean up" I mean
- The fragment should be valid XML. In particular I know that my source is not closing img tags properly.
- Any tags apart from: p, a, img, h3, div, em, strong should be stripped.
- For any supported tag all but certain attributes must be stripped off.
- Some attributes may require further processing such as removing non-local src or href attributes.
- We should reject tags that do not have certain mandatory attibutes. No "a" without an href for example.
- It must either already be in Debian or just require perl packaging.
- It must be configurable and if possible extendible.
- I would rather not be starting a new open source project. I know this is an old problem.
- Degenerate things like empty paragraphs should be removed.
- I should be able to turn it into a Data::FormValidator::Filters though I cannot really see how something could meet all the other criteria and not this one.
HTML::Scrubber
This module seemed at first to meet all the criteria -- until I spotted the issue with unclosed "img" tags. According to the bug reports there is also an issue with it not recognising self-closing tags there is an easy work around rt://25477 for that. That work around does seem to help at all. I guess if all fails I use this module and apply another filter to fix the "img" tags but this is ugly.HTML::Tidy
This is not itself in Debian but the underlying library is so I could easily get it added. The perl library itself has appalling reviews. It looks to me as if it may not do everything I want but combined with HTML::Scrubber may be it would.XML::SAX
This was my experiment of the day. From what I could read of the docs it looks as if it could do everything I want though requiring some code to be written. However the docs are extremely light -- mostly point at Java documentation. I was experimenting with this as shown below. However once I introduced an attempt to close off open img tags it started giving an errormismatched tag at line 2, column 53, byte 141 at /usr/lib/perl5/XML/Parser.pm line 187As you can see I am not experienced with SAX.
#!/usr/bin/perl use warnings; use strict; use Perl6::Slurp; my $output = ""; use XML::SAX::Machines qw(Pipeline); #use XML::SAX::ParserFactory; my $machine = Pipeline(MySAXHandler => \$output); $machine->parse_string( join "", slurp $ARGV[0] ); print "$output\n"; package MySAXHandler; use base qw(XML::SAX::Base); sub start_document { my $self = shift; $self->{_supported} = { img=>{ alt=>1, width=>1, height=>1, src=>1, title=>1, }, a=>{ href=>1, title=>1, }, p=>{}, h3=>{}, em=>{}, strong=>{}, div=>{}, }; return $self->SUPER::start_document(shift); } sub start_element { my ($self, $el) = @_; my $localName = $el->{LocalName}; if (exists $self->{_pending_img}) { my %el = %{$self->{_pending_img}}; delete $self->{_pending_img}; delete $el{Attributes}; $self->SUPER::end_element(\%el); } if (exists $self->{_supported}->{$localName}) { my $attributes = $self->{_supported}->{$localName}; foreach my $attr (keys %{$el->{Attributes}}) { my $key = $attr; $key =~ s[\A{}][]xms; if (not exists $attributes->{$key}) { delete $el->{Attributes}->{$attr}; } } if ($localName eq 'img') { $self->{_pending_img} = $el; } return $self->SUPER::start_element($el); } } sub end_element { my ($self, $el) = @_; my $localName = $el->{LocalName}; if (exists $self->{_pending_img} and $localName ne 'img') { my %el = %{$self->{_pending_img}}; delete $self->{_pending_img}; delete $el{Attributes}; $self->SUPER::end_element(\%el); } if (exists $self->{_supported}->{$localName}) { return $self->SUPER::end_element($el); } } sub character { my ($self, $el) = @_; if (exists $self->{_pending_img}) { my %el = %{$self->{_pending_img}}; delete $self->{_pending_img}; delete $el{Attributes}; $self->SUPER::end_element(\%el); } return $self->SUPER::character($el); } 1
HTML::TreeBuilder
This does quite a good job of closing off the "img" tag but it does no cleaning. It also puts in "html" and "body" tags which actually I don't want but can at least be easily cleaned off. I have tried combining it with HTML::Scrubber but that just demonstrates the issues with self closing tags.#!/usr/bin/perl use strict; use warnings; use Carp; use HTML::TreeBuilder; use HTML::Scrubber; use Perl6::Slurp; my $tidy = HTML::TreeBuilder->new(); my $scrubber = HTML::Scrubber->new( allow => [ qw[ p em stong a img ] ], rules => [ img => { src => 1, alt => 1, title => 1, width => 1, height => 1, }, a => { href=>1, title=>1, }, ], ); $scrubber->{_p}->empty_element_tags(1); my $html = slurp $ARGV[0]; $tidy->no_expand_entities(1); $tidy->p_strict(1); print $scrubber->scrub($tidy->parse_content($html)->as_XML);
Edit:
HTML::StripScripts::Parser
I saw this on the Chatterbox last night. It looks very good and I am trying it out.XML::LibXML
This has a forgiving mode and a method for reading fragments. I've looked at it but I think the work would move into DTDs.HTML::Parser
Most of the options I have looked at are based upon this. I had steared away from this partly because it is more low-level and partly because I thought SAX ought to be better in principle.Marpa::HTML
Okay now I have a wealth of options.
|
---|
Back to
Seekers of Perl Wisdom