TStanley has asked for the wisdom of the Perl Monks concerning the following question:
After looking at this node by KM and its reply from LeGo, I went to the link in KM's node, picked up the script used to create the regex, and inserted it into a CGI script. I tried putting just the regex in, but it wasn't copying correctly. I'm just curious if anyone can spot something that I am obviously overlooking here. Code follows:
#!/usr/bin/perl -w
use strict;
use CGI;
use Fcntl qw(:flock);
$|++;
$CGI::DISABLE_UPLOADS=1;
$CGI::POST_MAX=1*1024;
my $CGI=CGI->new;
my $Name=$CGI->param("Name");
my $Email=$CGI->param("Email");
my $Desc=$CGI->param("Description");
my $Address="Thomas_J_Stanley\@msn.com";
#Untaint the parameters
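# Strip every space, hyphen, comma, semicolon and period (the hyphen is
# escaped so it is not read as a range, and /g removes all occurrences).
$Name=~s/[ \-,;.]//g;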
if($Name=~/\d/){
die"Tainted Data!\n";
}
# This script can be found in Mastering Regular Expressions by
# Jeff Friedl or at this site:
# http://public.yahoo.com/~jfriedl/regex/email-unopt.txt
#Some things for avoiding backslashitis later on.
my $esc = '\\\\'; my $Period = '\.';
my $space = '\040'; my $tab = '\t';
my $OpenBR = '\['; my $CloseBR = '\]';
my $OpenParen = '\('; my $CloseParen = '\)';
my $NonASCII = '\x80-\xff'; my $ctrl = '\000-\037';
my $CRlist = '\n\015'; # note: this should really be only \015.
# Items 19, 20, 21
my $qtext = qq/[^$esc$NonASCII$CRlist\"]/; # for within "..."
my $dtext = qq/[^$esc$NonASCII$CRlist$OpenBR$CloseBR]/; # for within [...]
my $quoted_pair = qq< $esc [^$NonASCII] >; # an escaped character
# Item 10: atom
my $atom_char = qq/[^($space)<>\@,;:\".$esc$OpenBR$CloseBR$ctrl$NonASCII]/;
my $atom = qq<
$atom_char+ # some number of atom characters...
(?!$atom_char) # ..not followed by something that could be part of an atom
>;
# Items 22 and 23, comment.
# Impossible to do properly with a regex, I make do by allowing at most one level of nesting.
my $ctext = qq< [^$esc$NonASCII$CRlist()] >;
my $Cnested = qq< $OpenParen (?: $ctext | $quoted_pair )* $CloseParen >;
my $comment = qq< $OpenParen
(?: $ctext | $quoted_pair | $Cnested )*
$CloseParen >;
my $X = qq< (?: [$space$tab] | $comment )* >; # optional separator
# Item 11: doublequoted string, with escaped items allowed
my $quoted_str = qq<
\" (?: # opening quote...
$qtext # Anything except backslash and quote
| # or
$quoted_pair # Escaped something (something != CR)
)* \" # closing quote
>;
# Item 7: word is an atom or quoted string
my $word = qq< (?: $atom | $quoted_str ) >;
# Item 12: domain-ref is just an atom
my $domain_ref = $atom;
# Item 13 domain-literal is like a quoted string, but [...] instead of "..."
my $domain_lit = qq< $OpenBR # [
(?: $dtext | $quoted_pair )* # stuff
$CloseBR # ]
>;
# Item 9: sub-domain is a domain-ref or domain-literal
my $sub_domain = qq< (?: $domain_ref | $domain_lit ) >;
# Item 6: domain is a list of subdomains separated by dots.
my $domain = qq< $sub_domain # initial subdomain
(?: #
$X $Period # if led by a period...
$X $sub_domain # ...further okay
)*
>;
# Item 8: a route. A bunch of "@ $domain" separated by commas, followed by a colon
my $route = qq< \@ $X $domain
(?: $X , $X \@ $X $domain )* # further okay, if led by comma
: # closing colon
>;
# Item 5: local-part is a bunch of $word separated by periods
my $local_part = qq< $word # initial word
(?: $X $Period $X $word )* # further okay, if led by a period
>;
# Item 2: addr-spec is local@domain
my $addr_spec = qq< $local_part $X \@ $X $domain >;
# Item 4: route-addr is <route? addr-spec>
my $route_addr = qq[ < $X # leading <
(?: $route $X )? # optional route
$addr_spec # address spec
$X > # trailing >
];
# Item 3: phrase
my $phrase_ctrl = '\000-\010\012-\037'; # like ctrl, but without tab
# Like atom-char, but without listing space, and uses phrase_ctrl.
# Since the class is negated, this matches the same as atom-char plus space and tab
my $phrase_char =
qq/[^()<>\@,;:\".$esc$OpenBR$CloseBR$NonASCII$phrase_ctrl]/;
my $phrase = qq< $word # one word, optionally followed by....
(?:
$phrase_char | # atom and space parts, or...
$comment | # comments, or...
$quoted_str # quoted strings
)*
>;
# Item #1: mailbox is an addr_spec or a phrase/route_addr
my $mailbox = qq< $X # optional leading comment
(?: $addr_spec # address
| # or
$phrase $route_addr # name and address
) $X # optional trailing comment
>;
# Reject a missing address as well as one that fails the RFC 822 pattern.
unless(defined $Email && $Email=~m/^$mailbox$/xo){
die"Tainted Data!\n";
}
# Strip every asterisk, hyphen, comma, semicolon and period.
$Desc=~s/[*\-,;.]//g;
$Desc=$CGI->escape_html($Desc);
print $CGI->header();
print $CGI->start_html('Parameters');
print $CGI->h3(" Name = $Name");
print $CGI->end_html();
TStanley
--------
There's an infinite number of monkeys outside who want to talk to us
about this script for Hamlet they've worked out -- Douglas Adams/Hitchhiker's Guide to the Galaxy
Re: Email Validation, Round 2
by maverick (Curate) on Aug 24, 2001 at 06:57 UTC
In my opinion, the e-mail regexp is most useful in a "yes, we can do this" sort of way. For real-world apps I'd use Email::Valid. It correctly validates an e-mail address, and it does something this regex can't: verifying that the target domain exists.
/\/\averick
perl -l -e "eval pack('h*','072796e6470272f2c5f2c5166756279636b672');"
You know that Email::Valid uses this same regex, right? It's a nice piece of Perl, but even the module's own POD says:
Please note that there is no way to determine whether an address is deliverable without attempting delivery (for details, see perlfaq 9).
Of course, as I said in my other comment, validity and deliverability are two different things, and sometimes you only need one.
-- man with no legs, inc.
If you read the pod a little more carefully, you'll spot this:
mxcheck ( <TRUE>|<FALSE> )
Specifies whether addresses passed to address() should
be checked for a valid DNS entry. The default is
false.
and this:
If an error is encountered, an exception is raised. This
is really only possible when performing DNS queries. Trap
any exceptions by wrapping the call in an eval block:
eval {
$addr = Email::Valid->address( -address => 'maurice@hevanet.com',
-mxcheck => 1 );
};
warn "an error was encountered: $@" if $@;
I said that Email::Valid will "verify that the target domain exists", not that the email address is ultimately deliverable.
/\/\averick
perl -l -e "eval pack('h*','072796e6470272f2c5f2c5166756279636b672');"
Re: Email Validation, Round 2
by legLess (Hermit) on Aug 24, 2001 at 22:56 UTC
IMHO the O'Reilly Mouse book (CGI Programming with Perl, 2nd edition) has a very good take on this. Brief excerpt (pages 217-8; forgive my typos, if any):
...Jeffrey Friedl, in his book Mastering Regular Expressions, tackled creating a regular expression to handle the parsing of RFC 822 email addresses. The book is the best reference for understanding regular expressions in Perl or any other context. Many people cite the regular expression he constructs as the only definitive test of whether an Internet email address is valid. But unfortunately these people have misunderstood what it does; it tests for compliance with RFC 822. According to RFC 822, these are all syntactically valid email addresses:
Alfred Neuman <Neuman@BBN-TENEXA>
":sysmail"@ Some-Group. Some-Org
Muhammed.(I am the greatest) Ali @(the)Vegas.WBA
Do any of them look like the type of email address you'd want to capture in an HTML form? It is true that RFC 822 has not been superseded by another RFC and is still a standard, but it is equally true that the problem we are trying to solve is radically different in time and context from the problem that it solved in 1982.
We want an expression to recognize a syntactically valid email address as required on the Internet today. ...
(The examples from this book contain (in chapter 9) a regex to check addresses for validity on what the authors call "today's Internet." It's not as concerned with RFC 822 compliance as it is with real-world usability. I leave it to you to decide if they were successful.)
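For illustration only, a deliberately narrow check in that spirit might look like the sketch below. This is not the regex from the book's chapter 9; the pattern and the sub name looks_like_web_form_address are assumptions made up for this example, and the pattern will reject some addresses that RFC 822 allows.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the book): accept only plain
# "local@host.tld" addresses of the kind people type into web forms.
sub looks_like_web_form_address {
    my ($addr) = @_;
    return 0 unless defined $addr;
    return $addr =~ /\A[A-Za-z0-9._%+-]+\@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}\z/
        ? 1
        : 0;
}

# The three RFC 822 examples quoted above, plus one ordinary address.
for my $addr (
    'Alfred Neuman <Neuman@BBN-TENEXA>',
    '":sysmail"@ Some-Group. Some-Org',
    'Muhammed.(I am the greatest) Ali @(the)Vegas.WBA',
    'bill.gates@microsoft.com',
) {
    printf "%-50s %s\n", $addr,
        looks_like_web_form_address($addr) ? 'accepted' : 'rejected';
}
```

With this pattern the three RFC 822 examples are rejected and the last address is accepted, which shows the trade-off: usability on a form at the cost of strict standards compliance.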
The authors make a very good point: an email address can be syntactically valid yet for all practical purposes useless. Furthermore, syntactic validity is no guarantee of accuracy (when I don't want to give my own address I use 'bill.gates@microsoft.com').
What I've seen others suggest, and what I do myself, is a two-step process: (1) require people to type the address twice, and check those two inputs for consistency, and (2) send an email to that address with a CGI link back to a validation page. If it fails - they should have been more careful. No amount of RFC-checking will guarantee against typos or misinformation.
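A minimal sketch of those two steps, assuming a server-side secret and a confirmation script at a placeholder URL. All sub names, the URL, and the addr/token parameter names are hypothetical, and actually sending the mail is left out. It uses only the core Digest::SHA module.

```perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256_hex);

# Step 1: the two typed copies must agree (surrounding whitespace ignored).
sub addresses_match {
    my ($first, $second) = @_;
    return 0 unless defined $first && defined $second;
    s/\A\s+|\s+\z//g for $first, $second;
    return (length($first) && $first eq $second) ? 1 : 0;
}

# Step 2: a token that ties the address to a secret only the server knows.
sub confirmation_token {
    my ($address, $secret) = @_;
    return hmac_sha256_hex($address, $secret);
}

# Build the link that goes into the confirmation e-mail.
sub confirmation_link {
    my ($base_url, $address, $secret) = @_;
    my $token = confirmation_token($address, $secret);
    # Percent-encode anything outside the unreserved URL characters.
    (my $escaped = $address) =~ s/([^A-Za-z0-9._~-])/sprintf('%%%02X', ord $1)/ge;
    return "$base_url?addr=$escaped&token=$token";
}

# On the validation page: recompute the token and compare.
sub token_is_valid {
    my ($address, $token, $secret) = @_;
    return confirmation_token($address, $secret) eq lc($token // '') ? 1 : 0;
}

my $secret = 'change-me';    # placeholder; keep the real value out of the script
if (addresses_match('user@example.com', ' user@example.com ')) {
    print confirmation_link('http://www.example.com/cgi-bin/confirm.cgi',
        'user@example.com', $secret), "\n";
}
```

Because the token is derived from the address and the secret, the validation page does not need to store pending addresses; whether that is acceptable (the links never expire, for instance) depends on the application.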
This may not fit your purposes at all - perhaps you're just trying this for fun. It's food for thought, though. In my case all I care about is whether the address "works" (whether it reaches the intended recipient), so this solution is perfect.
-- man with no legs, inc.