PerlMonks  

Regex: Strip <script> tags?

by Spidy (Chaplain)
on Mar 21, 2007 at 15:24 UTC (#605862, perlquestion)
Spidy has asked for the wisdom of the Perl Monks concerning the following question:

Hello, all.

I am working on a project that allows users to input their own profile text, using any HTML markup they choose. CSS styles and HTML are all welcome; <script> tags are not. I am looking for a regex that will strip out the <script> tags for me.

I happened to find a solution given by someone named ender, which goes as follows:

s/<script>.*?<\/script>//igs;

However, I am unsure how to modify that regex so that it also catches '<script ' at the beginning of the tag (it is common for the word 'script' to be followed by a space and attributes). Does anyone know what I would need to change?


Thanks,
Spidy

Re: Regex: Strip <script> tags?
by skx (Parson) on Mar 21, 2007 at 15:29 UTC

    There are a lot more things that you'll need to worry about than just raw <script> tags.

    For example:

    <a href="http://example.com" onClick="alert(1);">test</a>
    

    To deal with this complexity properly you should be looking at using one of the filtering modules available from CPAN.

    I've had good experiences with HTML::Scrubber, but there are a few more, including HTML::EscapeEvil and HTML::Sanitizer.
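To make the point about event handlers concrete, here is a minimal sketch using only core Perl. The sample markup is the anchor tag from above; the variable names are illustrative. It shows that a substitution aimed at <script> tags never touches an onClick attribute.

```perl
use strict;
use warnings;

# Profile text with no <script> tag at all, yet it still runs JavaScript on click.
my $html = '<a href="http://example.com" onClick="alert(1);">test</a>';

# The script-stripping substitution from the original question.
(my $stripped = $html) =~ s/<script>.*?<\/script>//igs;

# Nothing matched, so the onClick handler passes through unchanged.
print $stripped eq $html ? "unchanged: $stripped\n" : "modified: $stripped\n";
```

The same holds for any regex that only looks for the word "script" in a tag name, which is why an allow-list filter is the safer route.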

    Steve
    --
Re: Regex: Strip <script> tags?
by rodion (Chaplain) on Mar 21, 2007 at 16:06 UTC
    skx has better advice, but as for the question as you posed it:
    s/<script[^>]*>.*?<\/script>//igs;
    should work. It accepts any characters that are not ">", up to the ">" that terminates the tag. It may not be the best solution to this particular problem, but it's a very handy regex idiom to have ready access to.
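A small sketch of the `[^>]*` idiom in action, assuming two made-up inputs (one bare tag, one with attributes split across lines). It is only meant to show what the pattern matches, not to suggest it is a safe filter.

```perl
use strict;
use warnings;

# Hypothetical inputs: a bare <script> tag and one with attributes on two lines.
my @samples = (
    '<p>hi</p><script>alert(1)</script><p>bye</p>',
    qq{<p>hi</p><script type="text/javascript"\n src="x.js">alert(2)</script><p>bye</p>},
);

my @cleaned;
for my $html (@samples) {
    # [^>]* consumes the attributes (newlines included) up to the first ">".
    (my $copy = $html) =~ s/<script[^>]*>.*?<\/script>//igs;
    push @cleaned, $copy;
}

print "$_\n" for @cleaned;
```

Because a negated character class matches newlines, the attribute part needs no /s flag; the /s is there for the `.*?` that spans the script body.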

      It lets the following through:

      <<script></script>script>...</script>

      It's also a poor regexp in a more general sense, since it doesn't check whether the > actually closes the tag or is inside the quotes of an attribute value.
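The bypass above can be reproduced with a few lines of core Perl. This is a sketch with an illustrative payload (an alert call stands in for the elided body); it applies the substitution once, as the posted one-liner would.

```perl
use strict;
use warnings;

# Payload in the shape shown above: an empty script element nested inside a tag name.
my $payload = '<<script></script>script>alert(1)</script>';

# A single pass with /g removes only the inner, empty element.
(my $once = $payload) =~ s/<script[^>]*>.*?<\/script>//igs;

# The pieces on either side of the removed text now join into a working tag.
print "$once\n";
```

The /g flag does not help here: the engine resumes searching after the removed text, so it never sees the tag that the removal itself created.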

Re: Regex: Strip <script> tags?
by ww (Bishop) on Mar 21, 2007 at 16:19 UTC
    If your only worry were attributes following the start of the tag ... as, for example,
        <script src=....
    you could simply remove the ">" at the end of the first "<script>" in the (cargo-culted) regex, thusly
        <script .*?<\/script>
    which will catch anything inside script tags (unless, illogically, they're miswritten by your users with nested <script ...> tags). (Update: In fact, this is a faq.)

    However, as skx has already pointed out, evil is not restricted to items labeled "<script...>".

    Bottom line: You should probably consider/study security issues (suggestion: start with some examples of why to use -t and move on to more generic considerations) AND should improve your regex-fu before borrowing code.

    You've been here long enough to have seen discussions of the un-wisdom of writing your own .html parsers, and might wish to review some of those (Cliff notes-style summary: you might screw up by rolling your own) and also read these old-but-still-good nodes: Re: How to remove HTML tags from text (by skx, with a more expansive version of his comment above); How do I test for potential security problems?; and Re: Remove HTML tags from document, including Jured's links to asking questions.

Re: Regex: Strip <script> tags?
by duelafn (Priest) on Mar 21, 2007 at 17:33 UTC

    Yes, do use a prepackaged filter. <scr<script>Kiddies</script>ipt> are clever buggers</script>

    Update: In response to the anonymous monk below (in case you think you can win the battle of workarounds): check out the XSS Cheat Sheet. It is quite old, so don't count on it including all XSS exploits. However, look at that list and ask yourself whether your time is better spent researching and fighting these or actually working on something related to your site's business. My advice: find and use a module which scrubs user-submitted HTML. Find one which is maintained and thorough. It typically isn't worth doing it yourself. No, your case is probably not special enough to warrant doing it yourself; you've got better things to do.

    Good Day,
        Dean

      This will fix what you were talking about. You can loop through it as many times as needed to remove unwanted script tags and everything within them.

      // PHP: repeat the replacement until no match remains.
      // Note: "<script\ " only matches tags with a space after "script".
      $bool = true;
      while ($bool) {
          $str = preg_replace('/<script\ .*?<\/.*?script>/i', '', $str);
          if (!preg_match('/<script\ .*?<\/.*?script>/i', $str)) {
              $bool = false;
          }
      }
Re: Regex: Strip <script> tags?
by stonecolddevin (Vicar) on Mar 22, 2007 at 00:31 UTC

    I personally enjoy HTML::Scrubber.

    It allows you to create a pretty detailed profile of what HTML you want allowed/disallowed.

    From the docs:

    (Turns out JavaScript is turned off by default. See the script method for more info.)

    #!/usr/bin/perl -w
    use HTML::Scrubber;
    use strict;

    my $html = q[
    <style type="text/css"> BAD { background: #666; color: #666;} </style>
    <script language="javascript"> alert("Hello, I am EVIL!"); </script>
    <HR>
        a => <a href=1>link </a>
        br => <br>
        b => <B> bold </B>
        u => <U> UNDERLINE </U>
    ];

    my $scrubber = HTML::Scrubber->new( allow => [ qw[ p b i u hr br ] ] );

    print $scrubber->scrub($html);

    $scrubber->deny( qw[ p b i u hr br ] );

    print $scrubber->scrub($html);

    Hope this helps!

    meh.
Re: Regex: Strip <script> tags?
by sanPerl (Friar) on Mar 22, 2007 at 06:26 UTC
    It is better to use a ready-made CPAN module. However, if you want a simple solution, here it is.
    It will work for the following kinds of tags:
    1) <script> ( should be deleted )
    2) <script a="aaa"> (should be deleted )
    3) <script1> (This should be retained by regex)
    4) ANY OTHER TAG other than mentioned above (This should be retained by regex)
    ## This is required so that we can escape processing of tags like <script1>, <scriptabc>, <scriptxyz>, etc. from deletion
    s/<script>/<script >/igs;
    s/<script\ .*?>.*?<\/script>//igs;
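As a possible alternative to the two-step substitution, the four cases listed above can be handled in one pass with a word-boundary assertion. This is a sketch, not sanPerl's code: `strip_script_tags` is a hypothetical helper name, the sample tags are made up, and the caveats raised elsewhere in this thread still apply.

```perl
use strict;
use warnings;

# \b requires a word boundary right after "script", so <script> and
# <script a="aaa"> match while <script1> and <scriptabc> do not.
sub strip_script_tags {
    my ($html) = @_;
    $html =~ s/<script\b[^>]*>.*?<\/script>//gis;
    return $html;
}

# The four kinds of tags from the list above, as illustrative inputs.
my @inputs = (
    '<script>a()</script>',
    '<script a="aaa">b()</script>',
    '<script1>c</script1>',
    '<b>bold</b>',
);

printf "%-32s => '%s'\n", $_, strip_script_tags($_) for @inputs;
```

Unlike a literal space, `\b` also accepts a newline or tab after "script", which the `<script\ ` form would miss.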
Re: Regex: Strip <script> tags?
by hacker (Priest) on Sep 02, 2007 at 14:37 UTC

    I use the following in a piece of code here:

    # Strip <script [..]>..</script> and <style>..</style>
    $content =~ s!<(s(?:cript|tyle))[^>]*>.*?</\1>!!gis;

    backtracking++
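A short sketch of how the backreference in that substitution behaves, using a made-up `$content` string. The `\1` ties the closing tag to whichever name the opening tag captured, so a style block and a script block are each removed as a unit.

```perl
use strict;
use warnings;

# Illustrative input containing one <style> element and one <script> element.
my $content = '<style>p { color: red }</style><p>text</p><script src="a.js">x()</script>';

# (s(?:cript|tyle)) captures "script" or "style"; </\1> demands the same name to close.
$content =~ s!<(s(?:cript|tyle))[^>]*>.*?</\1>!!gis;

print "$content\n";
```

Using `!` as the delimiter avoids having to escape the slash in the closing tag.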

      There are plenty of things that will be missed with your regex. For instance, all of the onclick/focus/load/etc events.

      Have a look at HTML::StripScripts::Parser, which allows you to customise the HTML / CSS that you would like to allow, while removing XSS attacks.

      Clint

Re: Regex: Strip <script> tags?
by Anonymous Monk on May 25, 2012 at 22:28 UTC

    This will fix what duelafn was talking about. You can loop through it as many times as needed to remove unwanted script tags and everything within them.

    // PHP: repeat the replacement until no match remains.
    // Note: "<script\ " only matches tags with a space after "script".
    $bool = true;
    while ($bool) {
        $str = preg_replace('/<script\ .*?<\/.*?script>/i', '', $str);
        if (!preg_match('/<script\ .*?<\/.*?script>/i', $str)) {
            $bool = false;
        }
    }
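The snippet above is PHP. A rough Perl equivalent of the same repeat-until-stable idea might look like the sketch below. It swaps in the `<script\b[^>]*>` pattern (an assumption, so that bare <script> tags are also caught), and the input is adapted from duelafn's nested example.

```perl
use strict;
use warnings;

# Nested payload adapted from duelafn's example.
my $str = '<p>hi</p><scr<script>Kiddies</script>ipt>alert(1)</script>';

# s/// returns the number of replacements made, so this repeats the
# substitution until a pass removes nothing.
1 while $str =~ s/<script\b[^>]*>.*?<\/script>//gis;

print "$str\n";
```

This only closes the nesting trick. Event-handler attributes and the other vectors mentioned in this thread are untouched, so it is not a replacement for a maintained scrubbing module.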

Node Type: perlquestion [id://605862]
Approved by skx
Front-paged by Old_Gray_Bear