PerlMonks  

Email Validation, Round 2

by TStanley (Canon)
on Aug 24, 2001 at 06:34 UTC ( [id://107596]=perlquestion )

TStanley has asked for the wisdom of the Perl Monks concerning the following question:

After looking at this node by KM and its reply from LeGo, I went to the link in KM's node, picked up the script used to create the regex, and inserted it into a CGI script. I tried putting just the regex in, but it wasn't copying correctly. I'm just curious whether anyone can spot something that I am obviously overlooking here. Code follows:
    #!/usr/bin/perl -w
    use strict;
    use CGI;
    use Fcntl qw(:flock);
    $|++;

    $CGI::DISABLE_UPLOADS=1;
    $CGI::POST_MAX=1*1024;

    my $CGI=new CGI;
    my $Name=$CGI->param("Name");
    my $Email=$CGI->param("Email");
    my $Desc=$CGI->param("Description");
    my $Address="Thomas_J_Stanley\@msn.com";

    #Untaint the parameters
    $Name=~s/[ -\,\;\.]//;
    if($Name=~/\d/){
        die"Tainted Data!\n";
    }

    # This script can be found in Mastering Regular Expressions by
    # Jeff Friedl or at this site:
    # http://public.yahoo.com/~jfriedl/regex/email-unopt.txt

    #Some things for avoiding backslashitis later on.
    my $esc = '\\\\';
    my $Period = '\.';
    my $space = '\040';
    my $tab = '\t';
    my $OpenBR = '\[';
    my $CloseBR = '\]';
    my $OpenParen = '\(';
    my $CloseParen = '\)';
    my $NonASCII = '\x80-\xff';
    my $ctrl = '\000-\037';
    my $CRlist = '\n\015'; # note: this should really be only \015.

    # Items 19, 20, 21
    my $qtext = qq/[^$esc$NonASCII$CRlist\"]/;              # for within "..."
    my $dtext = qq/[^$esc$NonASCII$CRlist$OpenBR$CloseBR]/; # for within [...]
    my $quoted_pair = qq< $esc [^$NonASCII] >;              # an escaped character

    # Item 10: atom
    my $atom_char = qq/[^($space)<>\@,;:\".$esc$OpenBR$CloseBR$ctrl$NonASCII]/;
    my $atom = qq<
        $atom_char+     # some number of atom characters...
        (?!$atom_char)  # ..not followed by something that could be part of an atom
    >;

    # Items 22 and 23, comment.
    # Impossible to do properly with a regex, I make do by allowing at most one level of nesting.
    my $ctext = qq< [^$esc$NonASCII$CRlist()] >;
    my $Cnested = qq< $OpenParen (?: $ctext | $quoted_pair )* $CloseParen >;
    my $comment = qq< $OpenParen
        (?: $ctext | $quoted_pair | $Cnested )*
        $CloseParen >;
    my $X = qq< (?: [$space$tab] | $comment )* >; # optional separator

    # Item 11: doublequoted string, with escaped items allowed
    my $quoted_str = qq<
        \" (?:              # opening quote...
            $qtext          # Anything except backslash and quote
            |               # or
            $quoted_pair    # Escaped something (something != CR)
        )* \"               # closing quote
    >;

    # Item 7: word is an atom or quoted string
    my $word = qq< (?: $atom | $quoted_str ) >;

    # Item 12: domain-ref is just an atom
    my $domain_ref = $atom;

    # Item 13 domain-literal is like a quoted string, but [...] instead of "..."
    my $domain_lit = qq<
        $OpenBR                         # [
        (?: $dtext | $quoted_pair )*    # stuff
        $CloseBR                        # ]
    >;

    # Item 9: sub-domain is a domain-ref or domain-literal
    my $sub_domain = qq< (?: $domain_ref | $domain_lit ) >;

    # Item 6: domain is a list of subdomains separated by dots.
    my $domain = qq<
        $sub_domain             # initial subdomain
        (?:                     #
            $X $Period          # if led by a period...
            $X $sub_domain      # ...further okay
        )*
    >;

    # Item 8: a route. A bunch of "@ $domain" separated by commas, followed by a colon
    my $route = qq<
        \@ $X $domain
        (?: $X , $X \@ $X $domain )*    # further okay, if led by comma
        :                               # closing colon
    >;

    # Item 5: local-part is a bunch of $word separated by periods
    my $local_part = qq<
        $word                           # initial word
        (?: $X $Period $X $word )*      # further okay, if led by a period
    >;

    # Item 2: addr-spec is local@domain
    my $addr_spec = qq< $local_part $X \@ $X $domain >;

    # Item 4: route-addr is <route? addr-spec>
    my $route_addr = qq[
        < $X                # leading <
        (?: $route $X )?    # optional route
        $addr_spec          # address spec
        $X >                # trailing >
    ];

    # Item 3: phrase
    my $phrase_ctrl = '\000-\010\012-\037'; # like ctrl, but without tab

    # Like atom-char, but without listing space, and uses phrase_ctrl.
    # Since the class is negated, this matches the same as atom-char plus space and tab
    my $phrase_char = qq/[^()<>\@,;:\".$esc$OpenBR$CloseBR$NonASCII$phrase_ctrl]/;
    my $phrase = qq<
        $word               # one word, optionally followed by....
        (?:
            $phrase_char |  # atom and space parts, or...
            $comment |      # comments, or...
            $quoted_str     # quoted strings
        )*
    >;

    # Item #1: mailbox is an addr_spec or a phrase/route_addr
    my $mailbox = qq<
        $X                      # optional leading comment
        (?: $addr_spec          # address
            |                   # or
            $phrase $route_addr # name and address
        ) $X                    # optional trailing comment
    >;

    if($Email=~m/^$mailbox$/xo){}else{
        die"Tainted Data!\n";
    }

    $Desc=~s/[*,-,\,,\;,\.]//;
    $Desc=$CGI->escape_html($Desc);

    print $CGI->header();
    print $CGI->start_html('Parameters');
    print $CGI->h3(" Name = $Name");
    print $CGI->end_html();

TStanley
--------
There's an infinite number of monkeys outside who want to talk to us
about this script for Hamlet they've worked out
-- Douglas Adams/Hitchhiker's Guide to the Galaxy

Replies are listed 'Best First'.
Re: Email Validation, Round 2
by maverick (Curate) on Aug 24, 2001 at 06:57 UTC
    In my opinion, the e-mail regexp is most useful in a "yes, we can do this" sort of way. For real-world apps I'd use Email::Valid. It correctly validates an e-mail address and does something this regex can't: verifying that the target domain exists.

    /\/\averick
    perl -l -e "eval pack('h*','072796e6470272f2c5f2c5166756279636b672');"

      You know that Email::Valid uses this same regex, right? It's a nice piece of Perl, but even the module's own POD says:
      Please note that there is no way to determine whether an address is deliverable without attempting delivery (for details, see perlfaq 9).
      Of course, as I said in my other comment, validity and deliverability are two different things, and sometimes you only need one.
      --
      man with no legs, inc.
        If you read the pod a little more carefully, you'll spot this:
        mxcheck ( <TRUE>|<FALSE> ) Specifies whether addresses passed to address() should be checked for a valid DNS entry. The default is false.
        and this:
        If an error is encountered, an exception is raised. This is really only possible when performing DNS queries. Trap any exceptions by wrapping the call in an eval block:
            eval {
                $addr = Email::Valid->address( -address => 'maurice@hevanet.com',
                                               -mxcheck => 1 );
            };
            warn "an error was encountered: $@" if $@;
        I said that Email::Valid will "verify that the target domain exists", not that the email address is ultimately deliverable.

        /\/\averick
        perl -l -e "eval pack('h*','072796e6470272f2c5f2c5166756279636b672');"

Re: Email Validation, Round 2
by legLess (Hermit) on Aug 24, 2001 at 22:56 UTC
    IMHO the O'Reilly Mouse book (CGI Programming with Perl, 2nd edition) has a very good take on this. Brief excerpt (pages 217-8; forgive my typos, if any):
    ...Jeffrey Friedl, in his book Mastering Regular Expressions, tackled creating a regular expression to handle the parsing of RFC 822 email addresses. The book is the best reference for understanding regular expressions in Perl or any other context. Many people cite the regular expression he constructs as the only definitive test of whether an Internet email address is valid. But unfortunately these people have misunderstood what it does; it tests for compliance with RFC 822. According to RFC 822, these are all syntactically valid email addresses:
              Alfred Neuman <Neuman@BBN-TENEXA>
              ":sysmail"@ Some-Group. Some-Org
              Muhammed.(I am the greatest) Ali @(the)Vegas.WBA
    
    Do any of them look like the type of email address you'd want to capture in an HTML form? It is true that RFC 822 has not been superseded by another RFC and is still a standard, but it is equally true that the problem we are trying to solve is radically different in time and context from the problem that it solved in 1982.

    We want an expression to recognize a syntactically valid email address as required on the Internet today. ...

    (The examples from this book contain (in chapter 9) a regex to check addresses for validity on what the authors call "today's Internet." It's not as concerned with RFC 822 compliance as it is with real-world usability. I leave it to you to decide if they were successful.)

    The authors make a very good point: an email address can be syntactically valid yet for all practical purposes useless. Furthermore, syntactic validity is no guarantee of accuracy (when I don't want to give my own address I use 'bill.gates@microsoft.com').
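
    To make the contrast concrete, here is a minimal sketch of a deliberately loose "form-style" check. It is not the regex from the book, nor Friedl's; the helper name looks_like_form_address and the pattern itself are assumptions made up for illustration. It rejects the three RFC 822 examples quoted above and accepts a plain address, which of course says nothing about whether that address belongs to the person typing it.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the book or from Friedl's script):
# accept only the plain "local@host.tld" shape a web form usually wants.
sub looks_like_form_address {
    my ($addr) = @_;
    return 0 unless defined $addr;
    return $addr =~ m{
        \A
        [^\s\@<>()",;:]+          # local part without blanks or RFC 822 specials
        \@
        (?: [A-Za-z0-9] (?: [A-Za-z0-9-]* [A-Za-z0-9] )? \. )+   # labels, each followed by a dot
        [A-Za-z]{2,}              # final label, letters only
        \z
    }x ? 1 : 0;
}

# Try it on the addresses mentioned in this thread.
for my $candidate (
    'bill.gates@microsoft.com',
    'Alfred Neuman <Neuman@BBN-TENEXA>',
    '":sysmail"@ Some-Group. Some-Org',
    'Muhammed.(I am the greatest) Ali @(the)Vegas.WBA',
) {
    printf "%-50s %s\n", $candidate,
        looks_like_form_address($candidate) ? 'accepted' : 'rejected';
}
```

    A check like this trades RFC 822 completeness for predictability: it will turn away some legal addresses, so it suits a form field better than a mail parser.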

    What I've seen others suggest, and what I do myself, is a two-step process: (1) require people to type the address twice, and check those two inputs for consistency, and (2) send an email to that address with a CGI link back to a validation page. If it fails - they should have been more careful. No amount of RFC-checking will guarantee against typos or misinformation.
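
    The two steps can be sketched roughly as follows, using only core modules. Everything named here is an assumption for illustration: the helper names, the $SECRET value, and the confirm.cgi URL are hypothetical, and actually sending the mail is left out.

```perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256_hex);

# Assumption: a private server-side value that is never sent to the browser.
my $SECRET = 'replace-with-a-private-value';

# Step 1: the two typed copies must agree once surrounding whitespace is trimmed.
sub addresses_agree {
    my ($first, $second) = @_;
    return 0 unless defined $first && defined $second;
    s/\A\s+|\s+\z//g for $first, $second;
    return length($first) && $first eq $second ? 1 : 0;
}

# Step 2: derive a token tied to the address; it goes into the mailed link.
sub confirmation_token {
    my ($address) = @_;
    return hmac_sha256_hex($address, $SECRET);
}

# Build the link for the confirmation mail (hypothetical host and script name).
sub confirmation_link {
    my ($address) = @_;
    (my $escaped = $address) =~ s/([^A-Za-z0-9_.~-])/sprintf '%%%02X', ord $1/ge;
    return 'http://www.example.com/cgi-bin/confirm.cgi?addr=' . $escaped
         . '&token=' . confirmation_token($address);
}

# The validation page recomputes the token and compares it with the one received.
sub token_is_valid {
    my ($address, $token) = @_;
    return 0 unless defined $address && defined $token;
    return $token eq confirmation_token($address) ? 1 : 0;
}

print confirmation_link('me@example.com'), "\n"
    if addresses_agree(' me@example.com', 'me@example.com ');
```

    Because the token is derived from the address and the secret, the validation page needs no stored state to check it; a real deployment would probably also want the token to expire.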

    This may not fit your purposes at all - perhaps you're just trying this for fun. It's food for thought, though. In my case all I care about is whether the address "works" (whether it reaches the intended recipient), so this solution is perfect.
    --
    man with no legs, inc.
