Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Script Stripper

by OeufMayo (Curate)
on Dec 26, 2001 at 02:31 UTC ( #134317=sourcecode: print w/ replies, xml ) Need Help??

Category: Web Stuff
Author/Contact Info Briac Pilpré
Description:

The following act can act as a HTML script filter, stripping Javascript, VBScript, JScript, PerlScript, etc. from the HTML code.

This weeds out all the "scriptable" events from the HTML 4.01 specifications and all the <script> elements.

It takes a filename as argument, or if there's no argument, read from STDIN. All the output is done on STDOUT.

This piece of code should be pretty reliable, but I'd be interested to know if there's a flaw in this code.

#!/usr/bin/perl -w
use strict;
use HTML::Parser;

use vars qw(%attribs @elements);

@attribs{qw(
    onblur onchange onclick ondblclick onfocus onkeydown onkeypress
    onkeyup onload onmousedown onmousemove onmouseout onmouseover
    onmouseup onreset onselect onsubmit onunload
)} ++;

@elements = qw( script );

my $parser = HTML::Parser->new(
    default_h       => [ sub { print shift }, 'text' ],
    start_h         => [ \&JSstrip, 'tagname, attr, attrseq' ],
    ignore_elements => \@elements,
);

if ( $ARGV[0] ) { $parser->parse_file( $ARGV[0] ); }
else            { $parser->parse_file( \*STDIN  ); }

sub JSstrip {
    my ( $tagname, $attr, $attrseq ) = @_;
    print "<$tagname";
    foreach (@$attrseq) {

        # The attribute is a script event handler
        unless ( exists $attribs{$_} ) {

            # I'm not sure if this regex is 100% reliable
            # (esp. in case of escaped quotes?)
            my $q = $attr->{$_} =~ /"/ ? "'" : '"';
            print qq' $_=$q$attr->{$_}$q';
        }
    }
    print ">";
}

Comment on Script Stripper
Download Code
Re: Script Stripper
by Juerd (Abbot) on Dec 26, 2001 at 03:37 UTC
    @foo{ qw/a b c/ }++ on an empty %foo, will have $foo{a} be 1 and $foo{b} and $foo{c} be undef. That's no problem, because it's only being used for exists. You can have $foo{a} be undef too, by assigning an empty list to the hash splice: @foo{ qw/a b c/ } = ()

    By the way, think about these:
    <style> span.blah { background-color: expression(alert('Hello, World!' +)) } </style> <a href="javascript:alert('Hello, World!');">hi mom</a> <img src="fourohfour" onerror="alert('Hello, World!');">
    It may be better to define what IS allowed, instead of what's NOT. HTML changes continuously, and browsers don't always follow specs...

    For the quotes: just use HTML::Entities, and have it change double quotes to &quot;, zo you can safely use double quotes.

    2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$

      Long live Perl 6.
      %hash{qw(a b c)} ^= 1; # or maybe... %hash{qw(a b c)}^++;

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

        Perl6 will indeed be great. But undef values are sufficient in this case ;)

        2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$

Back to Code Catacombs

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: sourcecode [id://134317]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (4)
As of 2014-09-17 04:53 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (57 votes), past polls