Regex help

chuleto1 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regex help by sauoq (Abbot) on Oct 01, 2002 at 21:32 UTC
If you are using this to strip potentially malicious code, you should be more liberal in what you match.

1. Use /i because tags can be upper, lower, or mixed case.
`<script>CODE</script>`
`<SCRIPT>CODE</SCRIPT>`
`<ScRiPt>CODE</ScRiPt>`
2. Be careful of whitespace in tags.
`<script >CODE</script>`
`<script>CODE</script >`
3. Be careful of what gets left behind after you strip it. (The following example is a good reason not to use a non-greedy match.)
`<<script></script>script>CODE</script>`

I'd use something like jeffa's and eliminate as much as possible. I don't see any immediate problems with this: `s#<script.*script\s*>##gis;` but I didn't test it very thoroughly and there may be some. You might consider substituting repeatedly until nothing matches in order to be sure you've avoided the 3rd issue above, but that may well be overkill.

-sauoq "My two cents aren't worth a dime.";
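To make the third point and the "substitute until nothing matches" idea concrete, here is a minimal sketch. It assumes the greedy pattern from the reply above is written out as `s#<script.*script\s*>##gis`. The helper names `strip_once_nongreedy` and `strip_until_stable` are made up for illustration, and like the original this has not been tested against real-world hostile HTML.

```perl
use strict;
use warnings;

# Hypothetical helper: a single pass with a NON-greedy match,
# shown only to illustrate what can get left behind.
sub strip_once_nongreedy {
    my ($html) = @_;
    $html =~ s#<script.*?script\s*>##gis;
    return $html;
}

# Hypothetical helper: the greedy pattern, repeated until a pass
# removes nothing. s///g returns the number of substitutions made,
# so the loop ends as soon as that count is zero.
sub strip_until_stable {
    my ($html) = @_;
    1 while $html =~ s#<script.*script\s*>##gis;
    return $html;
}

my $tricky = 'a<<script></script>script>CODE</script>b';

print strip_once_nongreedy($tricky), "\n";   # a<script>CODE</script>b
print strip_until_stable($tricky), "\n";     # a<b
print strip_until_stable('x<SCRIPT >evil()</ScRiPt >y'), "\n";   # xy
```

The non-greedy single pass reassembles a working script tag out of the leftovers. The greedy version does not, but it also removes any ordinary text sitting between two separate script blocks, so it trades precision for safety.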
(jeffa) Re: Regex help by jeffa (Bishop) on Oct 01, 2002 at 19:54 UTC
I have used this one with good success in the past: `s/<\s*(?:no)?script.*<\/(?:no)?script\s*>//sig;` It only works on a scalar, so you will need to slurp your entire JS code into a scalar first. I am sure others have better regexes, but I thought I should share. :)

UPDATE: fixed (I think) ... thanks sauoq :)

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
Re: (jeffa) Re: Regex help by sauoq (Abbot) on Oct 01, 2002 at 21:08 UTC
~~That won't match:~~ Yes. It's fixed now. `<script> // Malicious code. </script >`

-sauoq "My two cents aren't worth a dime.";
Re: Regex help by Helter (Chaplain) on Oct 01, 2002 at 19:55 UTC
Updated: I missed the multi-line part and added an s to my regexp.

This actually ran? I would think the parser would have halted when it got to the </script> part, since the / there should have ended the regex. You probably want to try something like this:

`#!/usr/bin/perl -wl
use strict;
use warnings;
my $test = "before script<script> stuff and more stuff \n stuff\nstuff\nstuff</script> after script";
$test =~ s%<script>(.*?)</script>%%gs;
print "Test is $test\n";`

Which outputs:

`./test.pl
Test is before script after script`

This uses a nifty trick: you can use (just about) anything for the start/end characters of a regexp or search-replace, to avoid conflicts with the matching/replacing string. Here I have used percent signs instead. Have fun!
Re: Regex help by swiftone (Curate) on Oct 01, 2002 at 19:55 UTC
I see a few potential problems here.

The regex you show won't work, because you didn't escape your / in </script>.

The dot (.) doesn't match _anything_; by default it doesn't match newlines. The /s modifier at the end of the regex changes that behavior to what you want. (see perlre)

So `s/<script>(.*?)<\/script>//sg;` should do what you want. I can't speak to the validity of what you're trying to do, but that should make the perl work :)

Update: Paren typo corrected per fglock below. (Was `(.*)?`, which would be a greedy match, with the ? essentially pointless, acting on a modified group.) I left the parens in to show a capture, but fglock is completely correct that you don't need the parens.
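The effect of /s is easy to check on a small multi-line string. The snippet below is only an illustrative sketch: the sample text and the variable names `$without_s` and `$with_s` are made up here and are not from the original question.

```perl
use strict;
use warnings;

# Sample input with the script body spread over several lines.
my $html = "before<script>\nalert(1);\n</script>after";

# Without /s the dot stops at each newline, so the pattern cannot
# reach from <script> to </script> and nothing is removed.
(my $without_s = $html) =~ s/<script>(.*?)<\/script>//g;

# With /s the dot also matches newlines, so the whole block goes.
(my $with_s = $html) =~ s/<script>(.*?)<\/script>//sg;

print $without_s eq $html ? "unchanged\n" : "changed\n";   # unchanged
print "$with_s\n";                                         # beforeafter
```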
Re: Re: Regex help by fglock (Vicar) on Oct 01, 2002 at 20:34 UTC
You mean `(.*?)`

Actually you don't need parentheses: `s/<script>.*?<\/script>//sg;`
Re: Regex help by Anonymous Monk on Oct 01, 2002 at 23:40 UTC
Nope. The parentheses are optional, but can be VERY useful. For example, say you want to remove the <script> and </script>, but be able to give some sort of warning about the script tags. For example, you may filter out:

`<script> malicious_code_to_do_something_nasty </script>`

If you use your regex as <script>(.*?)</script>, it saves the smallest amount (the ?) of anything (the .*) into a variable. That variable name depends on how many sets of parentheses you've used. If it's the first (and only) time you use them, it gets saved into $1. If the second time, $2, and so forth. You can use it for something like this:

`$text = "my name is john q user";
$text =~ s/^my name is (.*?) .*$/$1/;
# removes "my name is ", saves the next word, essentially, into $1, removes the rest
print "hello, $text!\n";
# prints "hello, john!\n"`

This is VERY useful in extracting information from strings.

-dingoStick.com
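The "remove it but warn about it" idea might look something like the sketch below. The helper name `strip_and_report` is hypothetical, and the pattern is the non-greedy form from this reply with /i and optional whitespace added, so the caveats raised elsewhere in the thread about nested tags still apply.

```perl
use strict;
use warnings;

# Hypothetical helper: removes each <script>...</script> block and
# returns the cleaned text followed by the captured script bodies.
sub strip_and_report {
    my ($html) = @_;
    my @removed;
    # /e runs the replacement as Perl code: remember what $1 captured,
    # then substitute the empty string for the whole match.
    $html =~ s{<script\s*>(.*?)</script\s*>}{ push @removed, $1; '' }gise;
    return ($html, @removed);
}

my ($clean, @removed) = strip_and_report("hi<script> nasty() </script>there");

warn "removed script body: $_\n" for @removed;
print "$clean\n";   # hithere
```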

