As explained in
this node, I was asked to add an interface to
HTML::StripScripts so that it would return either an
XML::LibXML::Document object or an
XML::LibXML::DocumentFragment object (for a snippet of HTML)
I've done this by subclassing HTML::StripScripts::Parser and overriding the output callbacks in HTML::StripScripts. So the flow is as follows:
HTML::StripScripts::Parser
--> uses HTML::Parser to tokenise the HTML
--> uses callbacks in HTML::StripScripts to filter the XSS an
+d tidy the HTML
--> uses callbacks in HTML::StripScripts::LibXML to build
+ a DOM tree
The code works, it passes the tests for
HTML::StripScripts::Parser (with a couple of modifications, eg
<b></b> in the
XML->toString output is represented as
<b/>).
The questions
I don't do much with XML, so I would appreciate feedback from XML-ers. My questions are:
I currently accept no options to pass to LibXML. What should I accept? What would be useful? Encoding?
See sub output_text in the code. This sub receives an HTML-entity encoded string. Currently I decode &, <, >, ", ' before calling $document->createTextNode($text). Everything else gets passed straight through. Should I be doing something with Encode here, for instance accepting a source character set and a destination character set, and applying a conversion both here and to attribute values?
For purposes of documentation, how do I refer to what is returned? A DOM tree? An XML document?
Should this module be called HTML::StripScripts::LibXML/XML/DOM?
As merlyn said in the CB, an empty <p/> tag isn't valid because what is a paragraph without content. Should I be deleting tags that are empty but aren't defined as empty (eg <br/> or <img />?
Tag callbacks (ie a sub that is called every time a particular tag is found) are different for this module.
In HTML::StripScripts:
$hss = HTML::StripScripts->new({
Rules => { a => \&a_callback }
});
sub a_callback {
my ($filter,$element) = @_;
# where $element = {
# tag => 'a',
# attr => { href => '/index.html' },
# content => 'Go to <b>Home</b> page',
# }
return 1;
}
In HTML::StripScripts::LibXML:
sub a_callback {
my ($filter,$element) = @_;
# where $element = {
# tag => 'a',
# attr => { href => '/index.html' },
# children => [
# XML::LibXML::Text --> 'Go to ',
# XML::LibXML::Element --> 'b'
# with child Text --> 'Home',
# XML::LibXML::Text --> ' page',
# ],
# }
return 1;
}
The reason I don't create the element and add its children before sending it to the callback is that the callback has the option of deleting the tag, which would just remove this tag, but preserve its contents/children.
Would it make more sense to create the element regardless, set its attributes, add the children and then send it to the callback. And if the callback decides to delete the tag, then the children would just get added to the parent element?
The code
I have posted the code below,
but it requires a newer version of HTML::StripScripts than the one currently on CPAN which works with the current version of HTML::StripScripts on CPAN. Also, just uploaded this module to CPAN as HTML::StripScripts::LibXML - available as soon as your mirror syncs.
package HTML::StripScripts::LibXML;
use strict;
use vars qw($VERSION);
$VERSION = '0.10';
=head1 NAME
HTML::StripScripts::LibXML - XSS filter - outputs a LibXML Document o
+r DocumentFragment
=head1 SYNOPSIS
use HTML::StripScripts::LibXML();
my $hss = HTML::StripScripts::LibXML->new(
{
Context => 'Document', ## HTML::StripScripts configur
+ation
Rules => { ... },
},
strict_comment => 1, ## HTML::Parser options
strict_names => 1,
);
$hss->parse_file("foo.html");
$xml_doc = $hss->filtered_document;
OR
$xml_doc = $hss->filter_html($html);
=head1 DESCRIPTION
This class provides an easy interface to C<HTML::StripScripts>, using
C<HTML::Parser> to parse the HTML, and returns an XML::LibXML::Documen
+t
or XML::LibXML::DocumentFragment.
See L<HTML::Parser> for details of how to customise how the raw HTML i
+s parsed
into tags, and L<HTML::StripScripts> for details of how to customise t
+he way
those tags are filtered. This module is a subclass of
L<HTML::StripScripts::Parser>.
=cut
=head1 DIFFERENCES FROM HTML::StripScripts
=over
=item CONTEXT
HTML::StripScripts::LibXML still allows you to specify the C<Context>
+of the
HTML (Document, Flow, Inline, NoTags). If C<Context> is C<Document>, t
+hen it
returns an C<XML::LibXML::Document> object, otherwise it returns an
C<XML::LibXML::DocumentFragment> object.
=item TAG CALLBACKS
HTML::StripScripts allows you to use tag callbacks, for instance:
$hss = HTML::StripScripts->new({
Rules => { a => \&a_callback }
});
sub a_callback {
my ($filter,$element) = @_;
# where $element = {
# tag => 'a',
# attr => { href => '/index.html' },
# content => 'Go to <b>Home</b> page',
# }
return 1;
}
HTML::StripScripts::LibXML still gives you tag callbacks, but they lo
+ok like
this:
sub a_callback {
my ($filter,$element) = @_;
# where $element = {
# tag => 'a',
# attr => { href => '/index.html' },
# children => [
# XML::LibXML::Text --> 'Go to ',
# XML::LibXML::Element --> 'b'
# with child Text --> 'Home',
# XML::LibXML::Text --> ' page',
# ],
# }
return 1;
}
=item SUBCLASSING
The subs C<output>, C<output_start> and C<output_end> are not called.
+ Instead,
this module uses C<output_stack_entry> which handles the tag callback,
+ (and
depending on the result of the tag callback) creates an element and ad
+ds
its child nodes. Then it adds the element to the list of children for
+the
parent tag.
=back
=head1 CONSTRUCTORS
=over
=item new ( {CONFIG}, [PARSER_OPTIONS] )
Creates a new C<HTML::StripScripts::LibXML> object.
See L<HTML::StripScripts::Parser> for details.
=back
=cut
use base 'HTML::StripScripts::Parser';
use XML::LibXML();
use HTML::Entities();
#===================================
sub output_start_document {
#===================================
my ($self) = @_;
$self->{_hsxXML} = XML::LibXML::Document->new();
return;
}
#===================================
sub output_end_document {
#===================================
my ($self) = @_;
my $top = $self->{_hssStack}[0];
my $document = delete $self->{_hsxXML};
if ( $top->{CTX} ne 'Document' ) {
$document = $document->createDocumentFragment();
}
foreach my $child ( @{ $top->{CHILDREN} } ) {
$document->addChild($child);
}
$top->{CONTENT} = $document;
return;
}
#===================================
sub output_start { }
*output_end = \&output_start;
*output_declaration = \&output_start;
*output_process = \&output_start;
*output = \&output_start;
#===================================
my $Entities = { 'amp' => '&',
'lt' => '<',
'gt' => '>',
'quot' => '"',
'#39' => "'",
};
#===================================
sub output_text {
#===================================
my ( $self, $text ) = @_;
HTML::Entities::_decode_entities( $text, $Entities );
push @{ $self->{_hssStack}[0]{CHILDREN} },
$self->{_hsxXML}->createTextNode($text);
return;
}
#===================================
sub output_comment {
#===================================
my ( $self, $comment ) = @_;
$comment =~ s/^\s*<!--//g;
$comment =~ s/-->\s*$//g;
push @{ $self->{_hssStack}[0]{CHILDREN} },
$self->{_hsxXML}->createComment($comment);
return;
}
#===================================
sub output_stack_entry {
#===================================
my ( $self, $tag ) = @_;
my %entry;
$tag->{CHILDREN} ||= [];
@entry{qw(tag attr children)} = @{$tag}{qw(NAME ATTR CHILDREN)};
if ( my $tag_callback = $tag->{CALLBACK} ) {
$tag_callback->( $self, \%entry )
or return;
}
if ( my $tagname = $entry{tag} ) {
my $element = $self->{_hsxXML}->createElement($tagname);
my $attrs = $entry{attr};
foreach my $name ( sort keys %$attrs ) {
$element->setAttribute( $name => $attrs->{$name} );
}
unless ( $tag->{CTX} eq 'EMPTY' ) {
foreach my $children ( @{ $entry{children} } ) {
$element->addChild($children);
}
}
push @{ $self->{_hssStack}[0]{CHILDREN} }, $element;
}
else {
push @{ $self->{_hssStack}[0]{CHILDREN} }, @{ $entry{children}
+ };
}
$tag->{CHILDREN} = [];
}
=head1 BUGS AND LIMITATIONS
=over
=item API - BETA
This is the first draft of this module, and currently there are no con
+figuration
options for the XML. I would welcome feedback from XML users as to how
+ I could
improve the interface.
For this reason, the API may change.
=back
=head1 SEE ALSO
L<HTML::Parser>, L<HTML::StripScripts::Parser>,
L<HTML::StripScripts::Regex>
=head1 AUTHOR
Clinton Gormley E<lt>clint@traveljury.comE<gt>
=head1 COPYRIGHT
Copyright (C) 2007 Clinton Gormley. All Rights Reserved.
=head1 LICENSE
This module is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
=cut
1;
UPDATE: Uploaded this module to CPAN as
HTML::StripScripts::LibXML, and removed the requirement for a new version of
HTML::StripScripts.