I need to be able to take a fragment of HTML and clean it up. We can assume that the outermost element is a "div",
though if it was not necessary to assume that it would be a bonus. By "clean up" I mean
- The fragment should be valid XML. In particular I know that my source is not closing img tags properly.
- Any tags apart from: p, a, img, h3, div, em, strong should be stripped.
- For any supported tag all but certain attributes must be stripped off.
- Some attributes may require further processing such as removing non-local src or href attributes.
- We should reject tags that do not have certain mandatory attibutes. No "a" without an href for example.
- It must either already be in Debian or just require perl packaging.
- It must be configurable and if possible extendible.
- I would rather not be starting a new open source project. I know this is an old problem.
- Degenerate things like empty paragraphs should be removed.
- I should be able to turn it into a Data::FormValidator::Filters though I cannot really see how something could meet all the other criteria and not this one.
Below is a list of things I have tried. Things mentioned in
Simplify HTML programatically but not mentioned below I have consciously
for some reason. The
XML::SAX looks the most promising to me but the one I have made least progress with.
Any ideas?
This module seemed at first to meet all the criteria -- until I spotted the issue with unclosed "img" tags. According to the bug reports
there is also an issue with it not recognising self-closing tags there is an easy work around
rt://25477 for that. That
work around does seem to help at all. I guess if all fails I use this module and apply another filter to fix the "img"
tags but this is ugly.
This is not itself in Debian but the underlying library is so I could easily get it added. The perl library itself has appalling reviews.
It looks to me as if it may not do everything I want but combined with
HTML::Scrubber may be it would.
This was my experiment of the day. From what I could read of the docs it looks as if it could do everything I want
though requiring some code to be written. However the docs are extremely light -- mostly point at Java documentation.
I was experimenting with this as shown below. However once I introduced an attempt to close off open img tags it started giving an error
mismatched tag at line 2, column 53, byte 141 at /usr/lib/perl5/XML/Parser.pm line 187
As you can see I am not experienced with SAX.
#!/usr/bin/perl
use warnings;
use strict;
use Perl6::Slurp;
my $output = "";
use XML::SAX::Machines qw(Pipeline);
#use XML::SAX::ParserFactory;
my $machine = Pipeline(MySAXHandler => \$output);
$machine->parse_string( join "", slurp $ARGV[0] );
print "$output\n";
package MySAXHandler;
use base qw(XML::SAX::Base);
sub start_document {
my $self = shift;
$self->{_supported} = {
img=>{
alt=>1,
width=>1,
height=>1,
src=>1,
title=>1,
},
a=>{
href=>1,
title=>1,
},
p=>{},
h3=>{},
em=>{},
strong=>{},
div=>{},
};
return $self->SUPER::start_document(shift);
}
sub start_element {
my ($self, $el) = @_;
my $localName = $el->{LocalName};
if (exists $self->{_pending_img}) {
my %el = %{$self->{_pending_img}};
delete $self->{_pending_img};
delete $el{Attributes};
$self->SUPER::end_element(\%el);
}
if (exists $self->{_supported}->{$localName}) {
my $attributes = $self->{_supported}->{$localName};
foreach my $attr (keys %{$el->{Attributes}}) {
my $key = $attr;
$key =~ s[\A{}][]xms;
if (not exists $attributes->{$key}) {
delete $el->{Attributes}->{$attr};
}
}
if ($localName eq 'img') {
$self->{_pending_img} = $el;
}
return $self->SUPER::start_element($el);
}
}
sub end_element {
my ($self, $el) = @_;
my $localName = $el->{LocalName};
if (exists $self->{_pending_img} and $localName ne 'img') {
my %el = %{$self->{_pending_img}};
delete $self->{_pending_img};
delete $el{Attributes};
$self->SUPER::end_element(\%el);
}
if (exists $self->{_supported}->{$localName}) {
return $self->SUPER::end_element($el);
}
}
sub character {
my ($self, $el) = @_;
if (exists $self->{_pending_img}) {
my %el = %{$self->{_pending_img}};
delete $self->{_pending_img};
delete $el{Attributes};
$self->SUPER::end_element(\%el);
}
return $self->SUPER::character($el);
}
1
This does quite a good job of closing off the "img" tag but it does no cleaning. It also puts in "html" and "body" tags which
actually I don't want but can at least be easily cleaned off. I have tried combining it with
HTML::Scrubber but that
just demonstrates the issues with self closing tags.
#!/usr/bin/perl
use strict;
use warnings;
use Carp;
use HTML::TreeBuilder;
use HTML::Scrubber;
use Perl6::Slurp;
my $tidy = HTML::TreeBuilder->new();
my $scrubber = HTML::Scrubber->new(
allow => [ qw[ p em stong a img ] ],
rules => [
img => {
src => 1,
alt => 1,
title => 1,
width => 1,
height => 1,
},
a => {
href=>1,
title=>1,
},
],
);
$scrubber->{_p}->empty_element_tags(1);
my $html = slurp $ARGV[0];
$tidy->no_expand_entities(1);
$tidy->p_strict(1);
print $scrubber->scrub($tidy->parse_content($html)->as_XML);
Edit:
I saw this on the Chatterbox last night. It looks very good and I am trying it out.
This has a forgiving mode and a method for reading fragments. I've looked at
it but I think the work would move into DTDs.
Most of the options I have looked at are based upon this. I had steared away from this partly because it is more low-level and partly because I thought SAX ought to be better in principle.
Okay now I have a wealth of options.
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.