Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery

Am I javascript or not?

by boo_radley (Parson)
on Mar 30, 2002 at 18:09 UTC ( #155451=CUFP: print w/ replies, xml ) Need Help??

This script demonstrates a use of HTML::Parser to determine if a web page contains a script tag, or if an html tag contains an event attribute.

I realize that this script takes what I'd call an exclusive method of approval (does the page contain something on a banned list?) rather than an inclusive method (does the page contain something not on an approved list) and is therefore more vulnerable to trickery or changes/ additions to what browsers think is an acceptable javascript event.
I hope this script is useful, but you've been warned. As always, your comments, criticisms and baseless accusations are welcome.

Update Removed the event list and added dws' suggestion for a regex to search for /on\w+/
Added belg4mit's suggestions :

  1. elements passed to check_attrs will get URI decoded
  2. added mocha & data as forbidden protocols. If anyone has further suggestions, or a list of pseudo-protocols that should be included, I'd love to see it.
  3. added check for javascript entities
script doesn't die on first error found anymore. Cleaned up output a bit.
#! perl -w use HTML::Parser; use LWP::Simple; use URI::Escape; use strict; my $fn = shift; $fn || die "supply a url to parse\n"; my $f = get ($fn) || die "unable to get $fn\n"; HTML::Parser->new( default_h => [\&check_attrs, 'text, tagname, attr'], + )->parse($f ) || die $!; sub check_attrs { my @forbiddenprotos= qw(javascript mocha data); my $line = shift; my $tagname = shift; return unless $tagname; $tagname = uri_unescape($tagname); print "found script tag.\n\t$line\n" if $tagname eq "script"; my $attr = shift; my $attrs = uri_unescape(join " ", keys %$attr); my $attrvals = uri_unescape(join " ", values %$attr); print "events $1 found.\n\t$line\n" if $attrs=~/\b(on\w+)/; foreach (@forbiddenprotos) { print "$_ protocol found.\n\t$line\n" if $attrvals=~/\b$_:/; } print "javascript entity found.\n\t$line\n" if $attrvals=~/\&\{/; }

Comment on Am I javascript or not?
Download Code
Re: Am I javascript or not?
by dws (Chancellor) on Mar 30, 2002 at 18:16 UTC
    If you want to make this run a bit faster, and aren't worried about false positives on fringe cases (such as people adding non-standard attributes), you can simplify the test to
    my $attr = shift; my @found = { grep /^on[A-Z]/ } keys %$attr;
    This also "future proofs" the code against the possiblity of someone's browser adding yet another new event.

    Update: D'oh. Make that   my @found = { grep /^on/ } keys %$attr;
    since the attributes all come back lower-case.

      dws says :
      my $attr = shift; my @found = { grep /^on[A-Z]/ } keys %$attr;
      I thought about doing this, but I wanted to stick to known events. If that's not a concern, then your grep is acceptable as well.
      I originally started out with an inclusive checker, but the performance was pretty awful as I was checking that each tag had acceptable attributes. (An alt attribute is ok for an img tag, but not a p tag, fr instance).
      Another potential speed boost (on closer examination) would come from making all @events lowercase, and getting rid of the i switch in the current grep, as HTML::Parser passes parameters to its handlers in all lowercase.
(~OT) WARNING: Live Ammo WAS: Re: Am I javascript or not?
by belg4mit (Prior) on Mar 30, 2002 at 19:48 UTC
    If you haven't gotten it yet, there's wild javascript in there (it's all tame though), read the source Luke. From what I've been saving on my pad.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: CUFP [id://155451]
Approved by root
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (7)
As of 2015-01-28 10:47 GMT
Find Nodes?
    Voting Booth?

    My top resolution in 2015 is:

    Results (213 votes), past polls