Milti has asked for the wisdom of the Perl Monks concerning the following question:
My website has a contact form that uses PERL to send messages to me via an email service. I want to prevent sending messages that contain words such as "porn", "sex", etc. The present program looks like this:
$Name=param('Name');
$Organization=param('Organization');
$Email=param('Email');
$Phone=param('Phone');
$Message=param('Message');
$Number=param('Spam Number');
$Spam=param('Spam');
$Robot=param('Robot');
$Result=param('Result');
if
($Name eq ""||$Organization eq ""||$Email eq ""||$Phone eq ""||$Messag
+e eq "")
{
print "The FORM is INCOMPLETE. All Fields Marked With An * Are Require
+d";
exit;
}
if
($Number ne "$Spam") {
print "You Entered An Incorrect Anti Spam Number. Please try again.";
exit;
}
if ($Result ne "5") {
print "You Entered An Incorrect Answer. Please try again.";
exit;
}
if ($Robot ne "1") {
print "The Form Is Not Complete. Please try again.";
exit
}
The part that sends the message was omitted. I want establish an array that will contain prohibited words and then search the $Message. If $Message contains any of those words I want to print a response such as "Bad Words Not Permitted. Please try again." I have seen some examples of how to do all this but have not been able to get anything to work. Either the program responds to the first word in the array or nothing happens. Here's an example that I found and tried:
@forbidden =("porn","sex");
foreach $forbidden (@forbidden) {
if ($Message =~/$forbidden/) {
print "Bad Language Not Permitted";]
exit;}
This works for the first word but not the second, and if neither are present the rest of the program does not work.
Can someone suggest a solution or tell me what I'm doing wrong? Any and all help will be appreciated.Tks!
Re: Bad Language In Contact Messages
by haukex (Archbishop) on Jul 07, 2022 at 17:49 UTC
|
Can someone suggest a solution or tell me what I'm doing wrong?
The exit needs to be inside the if.
For a more efficient way to do what you're doing without foreach, see my tutorial Building Regex Alternations Dynamically.
| [reply] [d/l] [select] |
Re: Bad Language In Contact Messages
by LanX (Saint) on Jul 07, 2022 at 18:30 UTC
|
There are various modules regarding profanity on CPAN
Like https://metacpan.org/pod/Regexp::Profanity::US
So you don't need to implement your own solution.
More general you might need a generic spam word solution, e.g. "Nigerian prince" is not really a profanity as such... ;)
On another note:
I'm not sure if returning a detailed error message is a good approach against spammers, they might use it to train against your filter.
So at least consider a JS based feedback, to make it harder to be spotted by bots.
| [reply] |
Re: Bad Language In Contact Messages
by roho (Bishop) on Jul 08, 2022 at 05:32 UTC
|
That is a noble goal you have, for which I applaud you, but as written your sample code does not compile because the print statement ends with a square bracket instead of a curly brace. Also remember to always use strict and warnings.
"It's not how hard you work, it's how much you get done."
| [reply] |
Re: Bad Language In Contact Messages
by harangzsolt33 (Chaplain) on Jul 08, 2022 at 10:44 UTC
|
Hmm... I hate when I type something, and the website discards my message, because it thinks that I used some kind of bad language. I can think of a few situations. Also, you should maybe remove spaces from the text, because I could still play the program and get past with some bad words by inserting spaces between the letters. However, if your program uses tr to get rid of everything except letters, then the words will flow together, and you will falsely reject some letters which seem like they contain bad words but they don't. For example 'helloworld' starts with the word 'hell' so you should check if that bad word is maybe part of another word. Also, "Holy" in itself is not a bad word. But if you use it together with "Holy sh*t" then it is. Okay, so you only look for the S word. Great. But if they write "weaver's hitch" and remove everything except letters, then you got the bad word. So, it's really easy to get flagged, and it's hard to catch only the bad stuff. Like I said, I hate when some websites prevent me from posting. Facebook does this a lot. They have some sort of filter. I press "Send" and my message simply disappears. There are combination of words they dislike probably, but I haven't been able to figure it out yet.
#!/usr/bin/perl
use strict;
use warnings;
my @BADWORDS = ('hell');
$a = ' H e l l o w o r l d ! ';
$a =~ tr|a-zA-Z||cd;
$a = lc($a);
print $a;
| [reply] [d/l] |
Re: Bad Language In Contact Messages
by JayBee (Scribe) on Jul 09, 2022 at 02:02 UTC
|
My preference is to set a variables to true or assign point system, then judge afterwards.
my $fail=0; my @ErrMsg;
my @forbidden=('porn','sex');
if (map {$Message =~ /$_/i} @forbidden){
$fail=1; # or $fail++;
}
if ($fail){ # or ($fail > 2)
push @ErrMsg,"Bad Language Not Permitted";
}
| [reply] [d/l] |
|
ack -i sex /usr/share/dict/words | nl
1 ambosexous
# ...
45 misexplain
# ...
199 Wessexman
Quite a few false positives there and it will pass through these and their variations or the even more common emoji replacements for the topic I won’t include.
s3x & pr0n
s̸͚͍̺̺͇̝̥͎͕̃e̸̲͆̅̈́̈̈́̆͆̀͆͘͜x̶̧̟̭̘̿̂̈́͝ ̷̨͇͓͖̅̄͑̓̕͝a̵̰̜̣͎͔͛͗̆̋̓n̸͍̳͍̤̊́͌͌͗͂͘d̵͇̱̣̈́̀͒͆̐͠͝ ̷̡͓̥̙͕̦͙̬̠͒̐ͅp̶̡̖̫̱̞̙͔͑̊͛͌̈́͘ô̷̫̰̫̹̘̖̪͙͌͋͗̓̓̂̿r̴͎͇͖̲̦̤͕̉n̸̘̝͇͌̂͒̈́͝
รєא คภ๔ ק๏гภ
ˢᵉˣ ᵃⁿᵈ ᵖᵒʳⁿ
𝚜𝚎𝚡 𝚊𝚗𝚍 𝚙𝚘𝚛𝚗
uɹod puɐ xǝs
🅂🄴🅇 🄰🄽🄳 🄿🄾🅁🄽
🎀 𝓈𝑒𝓍 𝒶𝓃𝒹 𝓅🍑𝓇𝓃 🎀
𝘀𝗲𝘅 𝗮𝗻𝗱 𝗽𝗼𝗿𝗻
Update: I ran those through Text::Unidecode to see if it would untangle them because it’s surprisingly good at that, usually. Not here. Best conversion on any of them was s[?]x [?]n[?] [?][?]rn. The rest were nowhere close.
| [reply] [d/l] [select] |
|
Okay, this is silly and I probably could have found a better way to spend the last couple hours but I was curious and surprised by Text::Unidecode’s lack of efficacy here so… (has to be <pre/> instead of <code/> tags because of content.)
Posting it because I do these things so rarely I’ll lose the memory of how to do it if I don’t. :P
#!/usr/bin/env perl
use 5.14.0; # Not sure this is earliest/best.
use utf8;
use strictures;
use open ":std", ":encoding(utf8)";
use Unicode::UCD "charinfo";
use charnames ":full";
use YAML; # For introspection if you're curious.
my @sex_and_porn = <DATA>;
for my $sap ( @sex_and_porn )
{
chomp $sap;
my @string;
for my $chr ( split "", $sap )
{
next unless $chr =~ /\A[[:print:]]+\z/;
my $info = charinfo(ord($chr));
# print Dump($info);
my $name = $info->{name};
next if $name =~ /COMBINING/;
my $replacement;
if ( $name =~ /\b(SMALL|CAPITAL)?(?: LETTER)?(?: TURNED| SHARP)? ([A-Z])\b/ )
{
no warnings "uninitialized";
my $case = $1 eq "SMALL" ? "SMALL" : "CAPITAL";
$replacement = "LATIN $case LETTER $2";
}
my $final = charnames::string_vianame( $replacement || $name );
push @string, $final =~ /(?:\b[A-Z]\b| )/i ?
$final : "\x{0}";
}
s/\x{0}+/[?]/g for @string;
printf " %15s -> %s\n\n", $sap, join "", @string;
}
__DATA__
s3x & pr0n
s̸͚͍̺̺͇̝̥͎͕̃e̸̲͆̅̈́̈̈́̆͆̀͆͘͜x̶̧̟̭̘̿̂̈́͝ ̷̨͇͓͖̅̄͑̓̕͝a̵̰̜̣͎͔͛͗̆̋̓n̸͍̳͍̤̊́͌͌͗͂͘d̵͇̱̣̈́̀͒͆̐͠͝ ̷̡͓̥̙͕̦͙̬̠͒̐ͅp̶̡̖̫̱̞̙͔͑̊͛͌̈́͘ô̷̫̰̫̹̘̖̪͙͌͋͗̓̓̂̿r̴͎͇͖̲̦̤͕̉n̸̘̝͇͌̂͒̈́͝
รєא คภ๔ ק๏гภ
ˢᵉˣ ᵃⁿᵈ ᵖᵒʳⁿ
𝚜𝚎𝚡 𝚊𝚗𝚍 𝚙𝚘𝚛𝚗
uɹod puɐ xǝs
🅂🄴🅇 🄰🄽🄳 🄿🄾🅁🄽
🎀 𝓈𝑒𝓍 𝒶𝓃𝒹 𝓅🍑𝓇𝓃 🎀
𝘀𝗲𝘅 𝗮𝗻𝗱 𝗽𝗼𝗿𝗻
ßêx and pørñ
ẞӬ𐊜 AƝḊ ⓅỚℝꞐ
Output–
s3x & pr0n -> s[?]x [?] pr[?]n
s̸͚͍̺̺͇̝̥͎͕̃e̸̲͆̅̈́̈̈́̆͆̀͆͘͜x̶̧̟̭̘̿̂̈́͝ ̷̨͇͓͖̅̄͑̓̕͝a̵̰̜̣͎͔͛͗̆̋̓n̸͍̳͍̤̊́͌͌͗͂͘d̵͇̱̣̈́̀͒͆̐͠͝ ̷̡͓̥̙͕̦͙̬̠͒̐ͅp̶̡̖̫̱̞̙͔͑̊͛͌̈́͘ô̷̫̰̫̹̘̖̪͙͌͋͗̓̓̂̿r̴͎͇͖̲̦̤͕̉n̸̘̝͇͌̂͒̈́͝ -> sex and porn
รєא คภ๔ ק๏гภ -> [?][?][?] [?][?][?] [?][?][?][?]
ˢᵉˣ ᵃⁿᵈ ᵖᵒʳⁿ -> sex and porn
𝚜𝚎𝚡 𝚊𝚗𝚍 𝚙𝚘𝚛𝚗 -> sex and porn
uɹod puɐ xǝs -> urod pua xes
🅂🄴🅇 🄰🄽🄳 🄿🄾🅁🄽 -> SEX AND PORN
🎀 𝓈𝑒𝓍 𝒶𝓃𝒹 𝓅🍑𝓇𝓃 🎀 -> [?] sex and p[?]rn [?]
𝘀𝗲𝘅 𝗮𝗻𝗱 𝗽𝗼𝗿𝗻 -> sex and porn
ßêx and pørñ -> sex and porn
ẞӬ𐊜 AƝḊ ⓅỚℝꞐ -> SEX AND PORN
| [reply] [d/l] [select] |
|
And it gets worse if you have to support multiple languages. If the company is in North America, they can probably get away with supporting 3 different languages (spanish, french and english).
If the company in the EU, there are 24 official languages at last count: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish.
You might get away with supporting fewer languages in some private sectors, but if you are running a government system (or a system that wants to receive funding from the government), you might have to support all of those. And as anyone can imagine, a "bad" word in one of those languages might be a standard word with a different meaning in another language.
Plus, even for countries speaking the same language, there might be big cultural differences. For U.S. citizens, a "tea party" means throwing valuable cargo into the harbor because they endorse "taxation without representation" but want to keep all the taxes for their own government(*). For Brits, it means "5 p.m."
(*) Puerto Rico
| [reply] |
|
|
|
|
| [reply] |
|
|