Please help with Regexp::Common

by scorpio17 (Canon)
on Jan 18, 2017 at 23:26 UTC (#1179883=perlquestion)

scorpio17 has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to teach myself how to use Regexp::Common, and I'm having a little trouble.

The following works as expected, and finds the number 1234 embedded in the string aaaa1234cccc:

use strict;
use Regexp::Common;

while ( my $word = <DATA> ) {
    chomp $word;
    if ( $word =~ /$RE{num}{int}/ ) {
        print "Integer detected: \"$word\"\n";
    }
    else {
        print "$word\n";
    }
}

__DATA__
aaaabbbbcccc
aaaa1234cccc
ddddeeeeffff

However, this does NOT work as I would expect:

use strict;
use Regexp::Common;

while ( my $word = <DATA> ) {
    chomp $word;
    if ( $word =~ /$RE{profanity}/ ) {
        print "Profanity detected: \"$word\"\n";
    }
    else {
        print "$word\n";
    }
}

__DATA__
aaaabbbbcccc
aaaaXXXXcccc
ddddeeeeffff

In this case, change XXXX into your favorite 4 letter offensive word. If I change the data string to this: "aaaa XXXX cccc" (i.e., add spaces around the XXXX), then it finds it.

It seems like the profanity patterns have start of word / end of word anchors built into the patterns, and thus don't work if the word is embedded inside another string? Is there any way to control this behavior? I've gone through the docs, but so far I can't find a way.

I'm using perl 5.14 (activestate) on Win7. Thanks for any push in the right direction.

Replies are listed 'Best First'.
Re: Please help with Regexp::Common
by LanX (Archbishop) on Jan 18, 2017 at 23:55 UTC

    > However, this does NOT work as I would expect:

    Really? Well swear words having word boundaries is what I expect.

    > It seems like the profanity patterns have start of word / end of word anchors built into the patterns,

    Well, it seems so. Why don't you just dump the regex to be sure?

    Personally I wouldn't want words like Essex to be flagged. (Or Dickens or zaddick)

    > don't work if the word is embedded inside another string? Is there any way to control this behavior? 

    After browsing thru the code ...

    http://cpansearch.perl.org/src/ABIGAIL/Regexp-Common-2016060801/lib/Regexp/Common/profanity.pm

    I saw this

    pattern name   => [qw (profanity)],
            create => '(?:\b(?k:' . $profanity . ')\b)',
            ;

    So I doubt there is any possible flag to disable the hard coded \b meta character.

    But if you really need this feature you could just copy the code into your own subclass and change the pattern to your needs.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!
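
The suggestion above to roll your own pattern can be sketched roughly as follows. This is only an illustration in core Perl: build_pattern, its boundaries option, and the two-word list are hypothetical names invented here, not part of Regexp::Common. It shows how the same word list behaves with and without the \b anchors, using the Essex and Dickens examples from the post.

```perl
use strict;
use warnings;

# Hypothetical stand-in list; a real filter would use a fuller one.
my @words = qw(sex dick);

# Build a case-insensitive alternation from a word list, longest entries
# first so longer words win over their prefixes. The \b anchors are
# optional (this switch is an assumption, not a Regexp::Common feature).
sub build_pattern {
    my ($words, %opt) = @_;
    my $alt = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) or $a cmp $b } @$words;
    return $opt{boundaries} ? qr/\b(?:$alt)\b/i : qr/(?:$alt)/i;
}

my $anchored = build_pattern(\@words, boundaries => 1);
my $embedded = build_pattern(\@words, boundaries => 0);

# Report, for each sample, whether each variant flags it.
for my $text ('Essex', 'Dickens', 'sex') {
    printf "%-8s anchored=%d embedded=%d\n",
        $text,
        ($text =~ $anchored ? 1 : 0),
        ($text =~ $embedded ? 1 : 0);
}
```

With this list, Essex and Dickens are flagged only by the unanchored variant, while the bare word is flagged by both. That is the trade-off the hard-coded \b makes for you.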

Re: Please help with Regexp::Common
by Paladin (Vicar) on Jan 18, 2017 at 23:52 UTC
    You can see in the source that the \b anchors are embedded in the regex itself. I would imagine this is because of the Scunthorpe Problem.
      > because of the Scunthorpe Problem.

      I once ran into a similar phonetic problem after mentioning Kant in an English conversation :)

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

Re: Please help with Regexp::Common
by AnomalousMonk (Bishop) on Jan 19, 2017 at 00:01 UTC

    You might try to trim the boundary assertions off of the stringized Regexp object (sorry for all the wrap-around):

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use Regexp::Common;
     ;;
     print qq{$RE{profanity}};
     print qq{A: match '$1'} if 'xxxpissxxx' =~ m{ ($RE{profanity}) }xms;
     ;;
     print '--------';
     (my $erp = $RE{profanity}) =~ s{ \A \Q(?:\b\E (.*) \Q\b)\E \z }{$1}xms;
     print qq{'$erp'};
     ;;
     print qq{B: match '$1'} if 'xxxpissxxx' =~ m{ ($erp) }xms;
     "
    (?:\b(?:(?:piss(?:\ take|\-take|take|e(?:rs|[srd])|ing|y)?|quims?|shit(?:t(?:e(?:rs|[dr])|ing|y)|e(?:rs|[sdry])|ing|[se])?|t(?:urds?|wats?)|wank(?:e(?:rs|[rd])|ing|s)?|a(?:rs(?:e(?:\ hole|\-hole|hole|[sd])|ing|e)|ss(?:\ holes?|\-holes?|ed|holes?|ing))|b(?:ull(?:\ shit(?:t(?:e(?:rs|[dr])|ing)|s)?|\-shit(?:t(?:e(?:rs|[dr])|ing)|s)?|shit(?:t(?:e(?:rs|[dr])|ing)|s)?)|low(?:\ jobs?|\-jobs?|jobs?))|c(?:ock(?:\ suck(?:ers?|ing)|\-suck(?:ers?|ing)|suck(?:ers?|ing))|rap(?:p(?:e(?:rs|[rd])|ing|y)|s)?|u(?:nts?|m(?:ing|ming|s)))|dick(?:\ head|\-head|ed|head|ing|less|s)|f(?:uck(?:ed|ing|s)?|art(?:e[rd]|ing|[sy])?|eltch(?:e(?:rs|[rsd])|ing)?)|ha(?:rd[\-\ ]?on|lf(?:\ a[sr]|\-a[sr]|a[sr])sed)|m(?:other(?:\ fuck(?:ers?|ing)|\-fuck(?:ers?|ing)|fuck(?:ers?|ing))|uth(?:a(?:\ fuck(?:ers?|ing|[aaa])|\-fuck(?:ers?|ing|[aaa])|fuck(?:ers?|ing|[aaa]))|er(?:\ fuck(?:ers?|ing)|\-fuck(?:ers?|ing)|fuck(?:ers?|ing)))|erde?)))\b)
    --------
    '(?:(?:piss(?:\ take|\-take|take|e(?:rs|[srd])|ing|y)?|quims?|shit(?:t(?:e(?:rs|[dr])|ing|y)|e(?:rs|[sdry])|ing|[se])?|t(?:urds?|wats?)|wank(?:e(?:rs|[rd])|ing|s)?|a(?:rs(?:e(?:\ hole|\-hole|hole|[sd])|ing|e)|ss(?:\ holes?|\-holes?|ed|holes?|ing))|b(?:ull(?:\ shit(?:t(?:e(?:rs|[dr])|ing)|s)?|\-shit(?:t(?:e(?:rs|[dr])|ing)|s)?|shit(?:t(?:e(?:rs|[dr])|ing)|s)?)|low(?:\ jobs?|\-jobs?|jobs?))|c(?:ock(?:\ suck(?:ers?|ing)|\-suck(?:ers?|ing)|suck(?:ers?|ing))|rap(?:p(?:e(?:rs|[rd])|ing|y)|s)?|u(?:nts?|m(?:ing|ming|s)))|dick(?:\ head|\-head|ed|head|ing|less|s)|f(?:uck(?:ed|ing|s)?|art(?:e[rd]|ing|[sy])?|eltch(?:e(?:rs|[rsd])|ing)?)|ha(?:rd[\-\ ]?on|lf(?:\ a[sr]|\-a[sr]|a[sr])sed)|m(?:other(?:\ fuck(?:ers?|ing)|\-fuck(?:ers?|ing)|fuck(?:ers?|ing))|uth(?:a(?:\ fuck(?:ers?|ing|[aaa])|\-fuck(?:ers?|ing|[aaa])|fuck(?:ers?|ing|[aaa]))|er(?:\ fuck(?:ers?|ing)|\-fuck(?:ers?|ing)|fuck(?:ers?|ing)))|erde?)))'
    B: match 'piss'

    Update: Of course, this gets you right back to the Scunthorpe Problem noted above by Paladin!


    Give a man a fish:  <%-{-{-{-<

      I followed your suggestion and tried this:

      use strict;
      use Regexp::Common;

      (my $reg = $RE{profanity}) =~ s{\A \Q(?:\b\E (.*) \Q\b)\E \z}{$1}xms;

      while ( my $word = <DATA> ) {
          chomp $word;
          if ( $word =~ m/$reg/ ) {
              print "Profanity detected: \"$word\"\n";
          }
          else {
              print "$word\n";
          }
      }

      __DATA__
      aaaabbbbcccc
      aaaashitcccc
      aaaa1234cccc
      ddddeeeeffff

      This way it will find embedded "bad words" without the need for spaces around them, which is what I wanted. I realize the logic in requiring the word boundaries. But I think the fact that $RE{num}{int} finds embedded numbers made me assume that $RE{profanity} should work the same way, or else there might be a switch to toggle the behavior one way or the other.

      The reason I need this is to generate temporary (one-use) passwords (like when someone requests a password reset on a website). The generated password should, ideally, be a jumble of random letters and/or numbers, but I don't want to accidentally send someone a password with an "obvious" obscenity embedded, so a simple filter like this is helpful.

      Thanks!
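
The use case described above (filtering generated one-time passwords) might look something like this generate-and-retry sketch. Everything named here is an assumption for illustration: random_candidate and temp_password are hypothetical helpers, and $blocklist is a harmless stand-in for the trimmed $RE{profanity} pattern from the post. Note that rand() is not cryptographically secure, so a real reset flow would want a stronger source of randomness.

```perl
use strict;
use warnings;

# Hypothetical stand-in blocklist; in the thread this role is played by
# the trimmed $RE{profanity} pattern. Unanchored and case-insensitive.
my $blocklist = qr/(?:darn|heck|sex)/i;

my @alphabet = ('a' .. 'z', 'A' .. 'Z', '0' .. '9');

# Draw $length random characters from @alphabet. rand() is NOT
# cryptographically secure; it only keeps this sketch short.
sub random_candidate {
    my ($length) = @_;
    return join '', map { $alphabet[ int rand @alphabet ] } 1 .. $length;
}

# Retry until a candidate passes the filter, with a cap so that an
# overly broad blocklist cannot cause an endless loop.
sub temp_password {
    my ($length, $reject, $max_tries) = @_;
    $max_tries ||= 100;
    for ( 1 .. $max_tries ) {
        my $candidate = random_candidate($length);
        return $candidate unless $candidate =~ $reject;
    }
    die "no acceptable password after $max_tries tries\n";
}

print temp_password( 10, $blocklist ), "\n";
```

Matching without word boundaries is the right default here, because a generated password has no word boundaries to speak of; a false positive merely costs one extra retry.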

        You might consider adding a test to check if the expected alteration to the original regex was successful. The  \Q(?:\b\E and  \Q\b)\E parts of the substitution are rather fragile IMO and may break if the maintainer(s) of Regexp::Common ever change his/her/their notion of what a proper profane regex should look like.

        c:\@Work\Perl\monks>perl -wMstrict -le
        "use Regexp::Common;
         ;;
         (my $reg = $RE{profanity}) =~ s{\A \Q(?:\b\E (.*) \Q\b)\E \z}{$1}xms
           or die 'profanity anchor trim failed';
         ;;
         print qq{bad: '$1'} if 'Matsushita' =~ m{ ($reg) }xms;
         "
        bad: 'shit'


        Give a man a fish:  <%-{-{-{-<

        Shouldn't you be generating passwords that do not contain any words?
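
One way to read that last suggestion is to draw passwords from an alphabet that cannot spell ordinary words in the first place. A minimal sketch, assuming a vowel-free lowercase alphabet with look-alike characters removed (the alphabet and the wordless_password helper are hypothetical choices, not anything from Regexp::Common); it does not address digit-for-letter substitutions, and rand() is again used only for brevity:

```perl
use strict;
use warnings;

# Assumed alphabet: no vowels (and no 'y'), so ordinary words cannot
# form, and no look-alike characters such as 0/o or 1/l. Adjust to taste.
my @alphabet = split //, 'bcdfghjkmnpqrstvwxz23456789';

# Draw $length characters from the restricted alphabet. rand() is not a
# CSPRNG; it is used here only to keep the sketch short.
sub wordless_password {
    my ($length) = @_;
    return join '', map { $alphabet[ int rand @alphabet ] } 1 .. $length;
}

print wordless_password(10), "\n";
```

The cost is a smaller alphabet (27 symbols instead of 62), so the password needs to be somewhat longer for the same number of possible values.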
