PerlMonks  

Regex help

by chuleto1 (Beadle)
on Oct 01, 2002 at 19:46 UTC ( [id://202087]=perlquestion )

chuleto1 has asked for the wisdom of the Perl Monks concerning the following question:

Monks:
Can you suggest a way to delete everything inside <script> tags using a regex? I tried s/<script>(.*?)</script>//g; but it didn't work. I thought (.*?) matched anything. Shed some light if you please,
<script>
function showCalendar( field ){
    gfPop.fPopCalendar(top.document.all[ field ]);
}
var timer=null;
function timerCheck(field, checkbox){
    clearTimeout(timer);
    checkBox(top.document.all[field].value, checkbox);
    timer=setTimeout("timerCheck('"+field+"', '"+checkbox+"');", 1000);
}
function selectShifts(time){
    var i=0;
    while (document.FilterForm.shifts.options[i]!=null){
        if (document.FilterForm.shifts.options[i].value=="")
            document.FilterForm.shifts.options[i].selected=false;
        if (document.FilterForm.shifts.options[i].time==time)
            document.FilterForm.shifts.options[i].selected=true;
        i++;
    }
    checkBox("nonnull", 'useShifts');
}
function checkBox(value, checkbox){
    if (value=="" || value==null)
        document.FilterForm[checkbox].checked = false;
    else
        document.FilterForm[checkbox].checked = true;
}
function doNothing(){}
</script>

Replies are listed 'Best First'.
Re: Regex help
by sauoq (Abbot) on Oct 01, 2002 at 21:32 UTC

    If you are using this to strip potentially malicious code, you should be more liberal in what you match.

    1. Use /i because tags can be upper, lower, or mixed cases.

    <script>CODE</script>
    <SCRIPT>CODE</SCRIPT>
    <ScRiPt>CODE</ScRiPt>

    2. Be careful of whitespace in tags.

    <script >CODE</script>
    <script>CODE</script >

    3. Be careful of what gets left behind after you strip it. (The following example is a good reason not to use a non-greedy match.)

    <<script></script>script>CODE</script>

    I'd use something like jeffa's and eliminate as much as possible. I don't see any immediate problems with this: s#<script.*script\s*>##gis; but I didn't test it very thoroughly and there may be some. You might consider substituting repeatedly until nothing matches in order to be sure you've avoided the 3rd issue above, but that may well be overkill.
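
    A minimal sketch of the "substitute repeatedly until nothing matches" idea, run against the sample tags above. The pattern is the one from this post; the helper name strip_scripts and the non-greedy comparison pattern are assumptions made up for illustration, and none of this has been tested beyond these few strings.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: applies the greedy pattern from this post
    # repeatedly until it no longer matches anything.
    sub strip_scripts {
        my ($html) = @_;
        1 while $html =~ s#<script.*script\s*>##gis;
        return $html;
    }

    my @samples = (
        '<script>CODE</script>',
        '<ScRiPt>CODE</ScRiPt>',
        '<script >CODE</script >',
        '<<script></script>script>CODE</script>',
    );
    printf "%-42s => '%s'\n", $_, strip_scripts($_) for @samples;

    # For comparison: a single non-greedy pass over the nested example
    # leaves a working script tag behind.
    my $nested = '<<script></script>script>CODE</script>';
    (my $non_greedy = $nested) =~ s#<script\s*>.*?</script\s*>##gis;
    print "non-greedy: $non_greedy\n";
    ```

    Keep in mind that the greedy version removes everything between the first opening tag and the last closing tag, including any ordinary markup that sits between two separate script blocks.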

    -sauoq
    "My two cents aren't worth a dime.";
    
(jeffa) Re: Regex help
by jeffa (Bishop) on Oct 01, 2002 at 19:54 UTC
    I have used this one with good success in the past:
    s/<\s*(?:no)?script.*<\/(?:no)?script\s*>//sig;
    It only works on a scalar, so you will need to slurp your entire document into a scalar first. I am sure others have better regexes, but I thought I should share. :)

    UPDATE:
    fixed (i think) ... thanks sauoq :)
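
    A rough sketch of the slurping step, assuming the input comes from a filehandle. An in-memory filehandle and a made-up sample string stand in for a real file here; the substitution itself is the one shown above.

    ```perl
    use strict;
    use warnings;

    # Made-up sample document; an in-memory filehandle stands in for a real file.
    my $source = "<p>before</p>\n<SCRIPT >\nalert('hi');\n</script >\n<p>after</p>\n";
    open my $fh, '<', \$source or die "Cannot open in-memory file: $!";

    # Localizing $/ to undef makes <$fh> return the whole file as one scalar.
    my $html = do { local $/; <$fh> };
    close $fh;

    # /s lets . cross newlines, /i ignores case, /g repeats the match.
    $html =~ s/<\s*(?:no)?script.*<\/(?:no)?script\s*>//sig;
    print $html;
    ```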

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    

      That won't match: Yes. It's fixed now.

      <script> // Malicious code. </script >
      -sauoq
      "My two cents aren't worth a dime.";
      
Re: Regex help
by Helter (Chaplain) on Oct 01, 2002 at 19:55 UTC
    Updated: I missed the multi-line part, so I added an s modifier to my regexp.
    This actually ran? I would think the parser would have halted when it got to the </script> part...it should have ended the regex.

    You probably want to try something like this:
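    #!/usr/bin/perl -wl
    use strict;
    use warnings;

    # Sample input with a script block that spans several lines
    my $test = "before script<script> stuff and more stuff \n stuff\nstuff\nstuff</script> after script";
    # % as the delimiter avoids escaping the / in </script>; /s lets . match newlines
    $test =~ s%<script>(.*?)</script>%%gs;
    print "Test is $test\n";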
    Which outputs:
    ./test.pl
    Test is before script after script
    This uses a nifty trick: you can use (just about) any character as the delimiter of a match or search-and-replace, which avoids conflicts with characters inside the pattern or the replacement string. Here I have used percent signs instead of slashes.
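
    To illustrate the delimiter trick, here is a small sketch that runs the same substitution three ways. The sample string and variable names are made up for the example.

    ```perl
    use strict;
    use warnings;

    my $input = "a<script>x\ny</script>b";

    # Default slash delimiter: the / in </script> has to be escaped.
    (my $escaped = $input) =~ s/<script>.*?<\/script>//gs;

    # Percent signs as delimiters: no escaping needed.
    (my $percent = $input) =~ s%<script>.*?</script>%%gs;

    # Paired braces work too, and tend to read well for longer patterns.
    (my $braces = $input) =~ s{<script>.*?</script>}{}gs;

    print "$escaped $percent $braces\n";
    ```

    All three forms compile to the same pattern, so which one to use is mostly a matter of readability.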

    Have fun!
Re: Regex help
by swiftone (Curate) on Oct 01, 2002 at 19:55 UTC
    I see a few potential problems here.
    1. The regex you show won't work, because you didn't escape the / in </script>.
    2. The dot (.) doesn't match _anything_; by default it doesn't match newlines. The /s modifier at the end of the regex changes that behavior to what you want. (see perlre)
    So  s/<script>(.*?)<\/script>//sg; should do what you want. I can't speak to the validity of what you're trying to do, but that should make the perl work :)
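
    A quick sketch of the second point, using a made-up string with a script block that spans lines. The only difference between the two substitutions is the /s modifier.

    ```perl
    use strict;
    use warnings;

    my $html = "keep<script>\nalert(1);\n</script>this";

    # Without /s the dot stops at each newline, so nothing matches.
    (my $without_s = $html) =~ s/<script>(.*?)<\/script>//g;

    # With /s the dot also matches newlines and the whole block goes away.
    (my $with_s = $html) =~ s/<script>(.*?)<\/script>//sg;

    print $without_s eq $html ? "without /s: unchanged\n" : "without /s: changed\n";
    print "with /s: $with_s\n";
    ```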

    Update: Paren typo corrected per fglock below. (Was (.*)?, which would be a greedy match, with the ? essentially pointless, acting on a * modified group) I left the parens in to show a capture, but fglock is completely correct that you don't need the parens.

      You mean  (.*?)

      Actually you don't need parentheses:

      s/<script>.*?<\/script>//sg;
        Nope. The parentheses are optional, but they can be VERY useful. Say you want to remove everything from <script> to </script>, but still be able to give some sort of warning about what was removed. For example, you may filter out:
        <script> malicious_code_to_do_something_nasty </script>
        If you write your regex as <script>(.*?)</script>, it saves the smallest amount (the ?) of anything (the .*) into a variable. The variable's name depends on how many sets of parentheses you've used: the first set is saved into $1, the second into $2, and so forth. You can use it for something like this:
        my $text = "my name is john q user";
        # Removes "my name is ", saves the next word into $1, removes the rest
        $text =~ s/^my name is (.*?) .*$/$1/;
        print "hello, $text!\n";   # prints "hello, john!\n"
        This is VERY useful in extracting information from strings.
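
        As a sketch of the warning idea, one option is to collect $1 while substituting. The /e modifier, the @removed array and the sample string are assumptions added for illustration; they are not something used elsewhere in this thread.

        ```perl
        use strict;
        use warnings;

        my $html = "<p>hi</p><script>evil()</script><p>bye</p><script>more()</script>";
        my @removed;

        # /e evaluates the replacement as code: remember the captured script
        # body, then substitute the empty string for the whole block.
        $html =~ s{<script>(.*?)</script>}{ push @removed, $1; '' }gse;

        print "removed script: $_\n" for @removed;
        print "$html\n";
        ```

        Without the parentheses there would be no $1 to record, which is where the capture earns its keep.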


        -dingoStick.com

Node Type: perlquestion [id://202087]
Approved by VSarkiss