regexp to only allow for formally valid email addresses

by fraktalisman (Hermit)
on Mar 07, 2007 at 17:35 UTC

fraktalisman has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks, excuse me if this is too simple or if it has been asked before. I used search, but it's late after a long day's work, and I need a quick fix against misuse of an existing web feedback script by spammers.

Shouldn't the following
$spammer=1 unless (($f{'Email'} =~ m/^[a-zA-Z_\-.0-9]+@[a-zA-Z_\-.0-9]+$/) or ($f{'Email'} eq ''));
only allow $f{'Email'} to contain a valid email address without newlines in it, and otherwise set $spammer=1? But it doesn't seem to; somehow I seem to be missing the point.

Replies are listed 'Best First'.
Re: regexp to only allow for formally valid email addresses
by Zaxo (Archbishop) on Mar 07, 2007 at 17:50 UTC

    use Email::Valid;
    # . . .
    my $spammer;
    $spammer = 1 unless Email::Valid->address($f{'Email'}) or $f{'Email'} eq '';

    There exists a regex that does what you want, but it is large and complex. Email::Valid uses a small parser.

    Update: Separated my declaration from conditional assignment, was a thinko.

    After Compline,
    Zaxo

      There exists a regex that does what you want, but it is large and complex.

      In another post I mentioned an "impressive example" that works with the newest blead, posted by Abigail in clpmisc; for completeness, I'm pasting it hereafter:

Re: regexp to only allow for formally valid email addresses
by Fletch (Bishop) on Mar 07, 2007 at 17:51 UTC

    See Mail::RFC822::Address which has "the" regex for valid addresses. Right off I see that yours has one of the common problems that tend to tick me off, specifically disallowing "foo+identifier@example.com" style addresses (which lets me have one "foo@example.com" address but give out different "+identifier" tags to different people so I can label/tag/filter/toss accordingly).
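To make the "+identifier" point concrete, here is a minimal sketch comparing the character classes from the question with a hypothetical variant that also admits "+" in the local part. The variable names are made up for the illustration, and neither pattern is a full RFC-style validator.

```perl
use strict;
use warnings;

# Same pattern as in the question: letters, digits, "_", "-" and "." on both sides.
my $original = qr/^[a-zA-Z_\-.0-9]+\@[a-zA-Z_\-.0-9]+$/;

# Hypothetical variant (an assumption, not from the thread): "+" allowed in the local part only.
my $with_plus = qr/^[a-zA-Z_\-.0-9+]+\@[a-zA-Z_\-.0-9]+$/;

for my $addr ('foo@example.com', 'foo+identifier@example.com') {
    printf "%-28s original: %-3s  with plus: %s\n", $addr,
        ($addr =~ $original  ? 'ok' : 'no'),
        ($addr =~ $with_plus ? 'ok' : 'no');
}
```

The tagged address is rejected by the first pattern and accepted by the second; the untagged one passes both.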

    Update: Also see RFC::RFC822::Address for a Parse::RecDescent based parser rather than a regex.

Re: regexp to only allow for formally valid email addresses
by vrk (Chaplain) on Mar 07, 2007 at 17:53 UTC

    Use Mail::RFC822::Address. Also, the Regular Expression Library has some pretty interesting constructs.

    As to your regex, I don't see anything wrong in it, and it works with a couple of test cases as it should:

    $ perl -e 'print "valid\n" if ("foo\@bar" =~ m/^[a-zA-Z_\-.0-9]+@[a-zA-Z_\-.0-9]+$/);'
    valid
    $ perl -e 'print "valid\n" if ("j.random.hacker\@perlmonks.com" =~ m/^[a-zA-Z_\-.0-9]+@[a-zA-Z_\-.0-9]+$/);'

    Of course, tests can never show the absence of errors. But I'm willing to bet you have a problem somewhere else in the program.
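One case the two tests above do not exercise is a trailing newline, which the question specifically wants rejected. In Perl, `$` without /m also matches just before a final "\n", whereas `\z` matches only at the very end of the string. The sketch below compares the two anchors on a few inputs; the variable names and sample strings are made up for the illustration.

```perl
use strict;
use warnings;

# The pattern from the question, anchored with ^ and $.
my $with_dollar = qr/^[a-zA-Z_\-.0-9]+\@[a-zA-Z_\-.0-9]+$/;

# Same character classes, anchored with \A and \z (absolute start and end of string).
my $with_z = qr/\A[a-zA-Z_\-.0-9]+\@[a-zA-Z_\-.0-9]+\z/;

my $clean    = 'foo@bar';
my $trailing = "foo\@bar\n";
my $embedded = "foo\@bar\nBcc: victim\@example.com";

for my $input ($clean, $trailing, $embedded) {
    (my $shown = $input) =~ s/\n/\\n/g;    # make newlines visible in the output
    printf "%-36s \$: %-3s  \\z: %s\n", $shown,
        ($input =~ $with_dollar ? 'yes' : 'no'),
        ($input =~ $with_z      ? 'yes' : 'no');
}
```

Only the middle input differs: the `$` version accepts "foo@bar\n", the `\z` version does not. The input with an embedded newline is rejected by both, because the character classes contain no newline.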

    UPDATE: Seems like others beat me to it... I just remembered that there was some nice discussion about this over at The Daily WTF.

    --
    print "Just Another Perl Adept\n";

      vrk wrote:
      Of course, tests can never show the absence of errors. But I'm willing to bet you have a problem somewhere else in the program.

      If we had bet, you'd have won ;)
      There was another parameter that goes into the email header. The $f{'Email'} field validation was not the actual problem ...
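Since the real hole was a different parameter ending up in the mail header, a check along the following lines could be applied to every value that goes into a header, not just the address. This is only a sketch: `header_safe` and the `Subject` field are hypothetical names, and only `$f{'Email'}` appears in the original script.

```perl
use strict;
use warnings;

# Returns 1 if the value can sit on a single header line,
# i.e. it is defined and contains no carriage return or line feed.
sub header_safe {
    my ($value) = @_;
    return 0 unless defined $value;
    return $value =~ /[\r\n]/ ? 0 : 1;
}

# Hypothetical form data; the Subject value tries to smuggle in an extra header.
my %f = (
    Email   => 'foo@example.com',
    Subject => "Hello\nBcc: victim\@example.com",
);

my $spammer = 0;
for my $field (qw(Email Subject)) {
    $spammer = 1 unless header_safe($f{$field});
}

print $spammer ? "rejected\n" : "accepted\n";
```

With the sample data the submission is rejected because of the newline in the Subject value, even though the Email value itself is fine.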

Re: regexp to only allow for formally valid email addresses
by ikegami (Patriarch) on Mar 07, 2007 at 23:03 UTC
    In Perl 5.10, you'll be able to do
    my $email_address = qr{
        (?(DEFINE)
            (?<addr_spec>       (?&local_part) \@ (?&domain))
            (?<local_part>      (?&dot_atom) | (?&quoted_string))
            (?<domain>          (?&dot_atom) | (?&domain_literal))
            (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)? \] (?&CFWS)?)
            (?<dcontent>        (?&dtext) | (?&quoted_pair))
            (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

            # the hyphen is escaped so it is a literal, not the range from + to /
            (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+\-/=?^_`{|}~])
            (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
            (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
            (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

            (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
            (?<quoted_pair>     \\ (?&text))

            (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
            (?<qcontent>        (?&qtext) | (?&quoted_pair))
            (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                                (?&FWS)? (?&DQUOTE) (?&CFWS)?)

            (?<word>            (?&atom) | (?&quoted_string))
            (?<phrase>          (?&word)+)

            # Folding white space
            (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
            (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
            (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
            (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
            (?<CFWS>            (?: (?&FWS)? (?&comment))*
                                (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

            # No whitespace control
            (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

            (?<ALPHA>           [A-Za-z])
            (?<DIGIT>           [0-9])
            (?<CRLF>            \x0d \x0a)
            (?<DQUOTE>          ")
            (?<WSP>             [\x20\x09])
        )

        (?&addr_spec)
    }x;

    Disallowing CR & LF would simply be a matter of changing
    (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
    to
    (?<FWS>             (?&WSP)+)

    Credit: The regexp was written by Abigail, who also wrote RFC::RFC822::Address.
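For readers who have not met `(?(DEFINE)...)` and `(?&name)` before, here is a much smaller sketch of the same technique. It is a toy grammar, not the RFC one: the rule names `label` and `dotted` are made up, and it only accepts dot-separated runs of ASCII letters and digits on both sides of the "@".

```perl
use strict;
use warnings;
use 5.010;    # named subpatterns and (?(DEFINE)...) need Perl 5.10 or later

# Rules are declared once inside (?(DEFINE)...) and then called by name,
# much like subroutines. Nothing in the DEFINE block matches by itself.
my $toy_address = qr{
    (?(DEFINE)
        (?<label>   [A-Za-z0-9]+ )
        (?<dotted>  (?&label) (?: \. (?&label) )* )
    )
    \A (?&dotted) \@ (?&dotted) \z
}x;

for my $candidate ('j.random.hacker@perlmonks.com', 'foo..bar@example.com', "foo\@bar\n") {
    (my $shown = $candidate) =~ s/\n/\\n/g;    # make newlines visible in the output
    printf "%-32s %s\n", $shown, ($candidate =~ $toy_address ? 'match' : 'no match');
}
```

Only the first candidate matches. The second fails because `dotted` requires a label between two dots, and the third fails because `\z` does not tolerate the trailing newline.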

Re: regexp to only allow for formally valid email addresses
by Moron (Curate) on Mar 07, 2007 at 19:59 UTC
    I agree with the suggestions to use a ready-made regexp. But if for some reason I had to reinvent one, I'd extend the \w token as much as necessary rather than go from scratch, something like...
    $spammer = length($f{'Email'}) && $f{'Email'} !~ /^(\w|\-|\.)+\@(\w|\-|\.)+$/;

    -M

    Free your mind

      Word characters (\w) might include local characters like German Ä ö ü ß on a German webserver. Although these umlaut characters should be, in theory, valid in email addresses (in my interpretation of RFC 822), I know from experience that their occurrence in email addresses usually causes problems sooner or later. At least one German provider (T-Online) used to allow those characters, but I would rather disallow them and have the user enter an email address which is safe for international use.
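The effect can be reproduced without a German locale. The sketch below uses Unicode matching rules rather than `use locale`, which is an assumption and not the setup described above, and it relies on the /a and /u modifiers, which need Perl 5.14 or later.

```perl
use strict;
use warnings;
use v5.14;    # /a and /u regex modifiers, plus say

my $o_umlaut = "\x{F6}";    # LATIN SMALL LETTER O WITH DIAERESIS

# \w under Unicode rules treats the umlaut as a word character.
my $unicode_w = $o_umlaut =~ /\A\w\z/u ? 1 : 0;

# \w under /a is restricted to ASCII word characters.
my $ascii_w = $o_umlaut =~ /\A\w\z/a ? 1 : 0;

# An explicit class never depends on locale or Unicode settings.
my $explicit = $o_umlaut =~ /\A[A-Za-z0-9_]\z/ ? 1 : 0;

say "\\w with /u:    $unicode_w";
say "\\w with /a:    $ascii_w";
say "[A-Za-z0-9_]:  $explicit";
```

Either the /a modifier or an explicit character class keeps such characters out, which is the behaviour argued for above.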

Re: regexp to only allow for formally valid email addresses
by hangon (Deacon) on Mar 07, 2007 at 19:25 UTC

    You need to escape the dots in your regex.

    Update: Never mind this post. I stand corrected and learned Yet Another Perl Nuance. Thanks Thelonius & Fletch.

      $ perl -le '$_ = "oh really?"; print unless /[.]/;'
      oh really?

        Unless you don't think the character classes are a bit redundant. I assume fraktalisman only wants to match \w as well as '-' and '.' since it's for e-mail addresses.

        # my guess is that he's not trying to do this
        =~ /^[.]+@[.]+$/
        # either of these make more sense for matching an e-mail address
        =~ /^[a-zA-Z_\-\.0-9]+@[a-zA-Z_\-\.0-9]+$/
        =~ /^[\w\-\.]+@[\w\-\.]+$/

        Or am I missing something painfully obvious?
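The nuance at issue in this sub-thread is that a dot inside a bracketed character class is already a literal dot, so escaping it there is harmless but unnecessary. A small sketch with made-up sample strings:

```perl
use strict;
use warnings;

my @samples = ('a.b', 'axb');

for my $s (@samples) {
    my $bare    = $s =~ /\Aa.b\z/    ? 'yes' : 'no';    # bare dot: any character except newline
    my $class   = $s =~ /\Aa[.]b\z/  ? 'yes' : 'no';    # dot inside a class: literal dot only
    my $escaped = $s =~ /\Aa[\.]b\z/ ? 'yes' : 'no';    # escaped dot inside a class: same as above
    printf "%-4s bare dot: %-3s  class: %-3s  escaped in class: %s\n",
        $s, $bare, $class, $escaped;
}
```

'a.b' matches all three patterns, while 'axb' matches only the bare-dot pattern, so the dots in the original character classes were behaving as intended.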
