Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
I need to be able to take a fragment of HTML and clean it up. We can assume that the outermost element is a "div", though if it was not necessary to assume that it would be a bonus. By "clean up" I mean
  1. The fragment should be valid XML. In particular I know that my source is not closing img tags properly.
  2. Any tags apart from: p, a, img, h3, div, em, strong should be stripped.
  3. For any supported tag all but certain attributes must be stripped off.
  4. Some attributes may require further processing such as removing non-local src or href attributes.
  5. We should reject tags that do not have certain mandatory attibutes. No "a" without an href for example.
  6. It must either already be in Debian or just require perl packaging.
  7. It must be configurable and if possible extendible.
  8. I would rather not be starting a new open source project. I know this is an old problem.
  9. Degenerate things like empty paragraphs should be removed.
  10. I should be able to turn it into a Data::FormValidator::Filters though I cannot really see how something could meet all the other criteria and not this one.
Below is a list of things I have tried. Things mentioned in Simplify HTML programatically but not mentioned below I have consciously for some reason. The XML::SAX looks the most promising to me but the one I have made least progress with. Any ideas?

HTML::Scrubber

This module seemed at first to meet all the criteria -- until I spotted the issue with unclosed "img" tags. According to the bug reports there is also an issue with it not recognising self-closing tags there is an easy work around rt://25477 for that. That work around does seem to help at all. I guess if all fails I use this module and apply another filter to fix the "img" tags but this is ugly.

HTML::Tidy

This is not itself in Debian but the underlying library is so I could easily get it added. The perl library itself has appalling reviews. It looks to me as if it may not do everything I want but combined with HTML::Scrubber may be it would.

XML::SAX

This was my experiment of the day. From what I could read of the docs it looks as if it could do everything I want though requiring some code to be written. However the docs are extremely light -- mostly point at Java documentation. I was experimenting with this as shown below. However once I introduced an attempt to close off open img tags it started giving an error
mismatched tag at line 2, column 53, byte 141 at /usr/lib/perl5/XML/Parser.pm line 187
As you can see I am not experienced with SAX.
#!/usr/bin/perl use warnings; use strict; use Perl6::Slurp; my $output = ""; use XML::SAX::Machines qw(Pipeline); #use XML::SAX::ParserFactory; my $machine = Pipeline(MySAXHandler => \$output); $machine->parse_string( join "", slurp $ARGV[0] ); print "$output\n"; package MySAXHandler; use base qw(XML::SAX::Base); sub start_document { my $self = shift; $self->{_supported} = { img=>{ alt=>1, width=>1, height=>1, src=>1, title=>1, }, a=>{ href=>1, title=>1, }, p=>{}, h3=>{}, em=>{}, strong=>{}, div=>{}, }; return $self->SUPER::start_document(shift); } sub start_element { my ($self, $el) = @_; my $localName = $el->{LocalName}; if (exists $self->{_pending_img}) { my %el = %{$self->{_pending_img}}; delete $self->{_pending_img}; delete $el{Attributes}; $self->SUPER::end_element(\%el); } if (exists $self->{_supported}->{$localName}) { my $attributes = $self->{_supported}->{$localName}; foreach my $attr (keys %{$el->{Attributes}}) { my $key = $attr; $key =~ s[\A{}][]xms; if (not exists $attributes->{$key}) { delete $el->{Attributes}->{$attr}; } } if ($localName eq 'img') { $self->{_pending_img} = $el; } return $self->SUPER::start_element($el); } } sub end_element { my ($self, $el) = @_; my $localName = $el->{LocalName}; if (exists $self->{_pending_img} and $localName ne 'img') { my %el = %{$self->{_pending_img}}; delete $self->{_pending_img}; delete $el{Attributes}; $self->SUPER::end_element(\%el); } if (exists $self->{_supported}->{$localName}) { return $self->SUPER::end_element($el); } } sub character { my ($self, $el) = @_; if (exists $self->{_pending_img}) { my %el = %{$self->{_pending_img}}; delete $self->{_pending_img}; delete $el{Attributes}; $self->SUPER::end_element(\%el); } return $self->SUPER::character($el); } 1

HTML::TreeBuilder

This does quite a good job of closing off the "img" tag but it does no cleaning. It also puts in "html" and "body" tags which actually I don't want but can at least be easily cleaned off. I have tried combining it with HTML::Scrubber but that just demonstrates the issues with self closing tags.
#!/usr/bin/perl use strict; use warnings; use Carp; use HTML::TreeBuilder; use HTML::Scrubber; use Perl6::Slurp; my $tidy = HTML::TreeBuilder->new(); my $scrubber = HTML::Scrubber->new( allow => [ qw[ p em stong a img ] ], rules => [ img => { src => 1, alt => 1, title => 1, width => 1, height => 1, }, a => { href=>1, title=>1, }, ], ); $scrubber->{_p}->empty_element_tags(1); my $html = slurp $ARGV[0]; $tidy->no_expand_entities(1); $tidy->p_strict(1); print $scrubber->scrub($tidy->parse_content($html)->as_XML);

Edit:

HTML::StripScripts::Parser

I saw this on the Chatterbox last night. It looks very good and I am trying it out.

XML::LibXML

This has a forgiving mode and a method for reading fragments. I've looked at it but I think the work would move into DTDs.

HTML::Parser

Most of the options I have looked at are based upon this. I had steared away from this partly because it is more low-level and partly because I thought SAX ought to be better in principle.

Marpa::HTML

Okay now I have a wealth of options.

In reply to Dynamically cleaning up HTML fragments by SilasTheMonk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (7)
As of 2024-04-10 21:30 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found