Re: Practical e-mail address validation

So should I upload this to CPAN? Did my coworker miss something close to this already on CPAN?

There are a lot of modules on CPAN dealing with e-mail addresses. Unfortunately, a lot of them are rubbish. You might want to take a look at the Perl Email Project if you haven't already.

If I understand you correctly (without specifically having worked through the regexen), you consider the following addresses to be equivalent:

    everybody@example.com
    "Eve" <everybody@example.com> (Rybody)
    EveRybody+whatever@EXAMPLE.COM

What about addresses with different top-level domains but the same second-level domain? For example, e-mail from/to

    foo@gmail.com
    foo@gmail.net
    foo@gmail.com.au
    foo@gmail.at
    ...
    foo@gmail.za

are all routed from/to the same mailbox. Are they equivalent to you?

In either case, all you need is a tokeniser. It lets you pull out and compare exactly the parts of each address that matter for your notion of equivalence.

Some of the existing modules tokenise the addresses internally to some extent, but Email::Address also gives you nice accessors for the resulting tokens, so it would be my prime choice for solving what I perceive your problem to be.
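
To make the equivalence question concrete, here is a minimal core-Perl sketch of a comparison key. It assumes the input is already a bare address (display name and comments stripped, for instance by a tokeniser), and `comparison_key` and its `strip_plus` option are hypothetical names invented for illustration, not part of any module mentioned here.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a bare address (no display name, no comments,
# no quoted local part) to a string that can be compared with eq.
sub comparison_key {
    my ($addr, %opt) = @_;

    # Split at the last "@": everything before it is the local part.
    my ($local, $host) = $addr =~ /\A(.+)\@([^\@]+)\z/s
        or return;

    # Policy decision: optionally drop a "+whatever" suffix.
    $local =~ s/\+.*\z//s if $opt{strip_plus};

    # Fold case on both parts, as the thread's examples do.
    return lc("$local\@$host");
}

print comparison_key('EveRybody+whatever@EXAMPLE.COM', strip_plus => 1), "\n";
print comparison_key('EveRybody+whatever@EXAMPLE.COM'), "\n";
```

Whether to pass `strip_plus` is exactly the policy question raised above; the sketch only shows that once the address is split into parts, either policy is a one-line choice.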

Re^2: Practical e-mail address validation (Email::Address) by tye (Sage) on Sep 13, 2008 at 16:35 UTC
I've seen domains where foo.com was used in e-mail addresses for employees while foo.net was used in e-mail addresses of customers.

As for the "+mailbox" convention, there are arguments on both sides of whether to ignore such in determining equivalence of addresses (a customer might legitimately want separate accounts for members of a single group where the correspondence for all accounts just goes to separate mailboxes at the same address, or it might just be a source of confusion or simplify some mild cases of abuse). But we'll be using +mailbox to simplify testing so we'll just use `lc $addr1 eq lc $addr2` as I already noted.

Thanks for the module recommendations. Email::Address notes:

> XXX: This ($phrase) used to just be: my $phrase = qr/$word+/; It was changed to resolve bug 22991, creating a significant slowdown. Given current speed problems. Once 16320 is resolved, this section should be dealt with. -- rjbs, 2006-11-11
>
> XXX: ...and the above solution caused endless problems (never returned) when examining this address, now in a test:
>
>     admin+=E6=96=B0=E5=8A=A0=E5=9D=A1_Weblog-- ATAT --test.socialtext.com
>
> So we disallow the hateful CFWS in this context for now. Of modern mail agents, only Apple Web Mail 2.0 is known to produce obs-phrase. -- rjbs, 2006-11-19

which confirms some of my suspicions/assumptions. Looking at the regexes that the module uses, they appear to have been constructed directly from the RFCs very similarly to how I constructed mine, except fewer features were intentionally dropped.

The note that "Of modern mail agents, only ... is known to produce" leads me to want to use that module if I were trying to parse e-mail addresses received in e-mail messages. An e-mail system would be broken if it required "the hateful CFWS" in order to deliver messages to it. So completely disallowing CFWS (as I did) doesn't prevent any addresses from being used.

The module doesn't appear to provide a way to get the address with quoting and escaping removed so that addresses can be compared. It also doesn't disallow the very common user mistake of "everybody@gmail" (which can be valid as an e-mail address in some situations but isn't a valid address to give to somebody outside of your organization and so is worthwhile for us to disallow).

So it appears that my similar regex has several advantages that I couldn't get from Email::Address as written.

- tye
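
The "everybody@gmail" case can be illustrated with a small core-Perl check. This is only a sketch under stated assumptions, not the regex from the original post: `has_dotted_host` is a hypothetical name, and the rule it applies (at least two dot-separated ASCII labels) deliberately ignores domain literals and internationalized domain names.

```perl
use strict;
use warnings;

# Hypothetical check: does the host part have at least two dot-separated
# labels, so that the address is plausible outside a local organization?
sub has_dotted_host {
    my ($addr) = @_;

    # Take everything after the last "@" as the host.
    my ($host) = $addr =~ /\@([^\@]+)\z/ or return 0;

    # One label: letters/digits, hyphens allowed only in the middle.
    my $label = qr/[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?/;

    return $host =~ /\A$label(?:\.$label)+\z/ ? 1 : 0;
}

for my $addr ('everybody@gmail', 'everybody@gmail.com') {
    printf "%-22s %s\n", $addr, has_dotted_host($addr) ? 'accepted' : 'rejected';
}
```

A check like this can run on its own or on the host token returned by a parser, since it depends only on the text after the final "@".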
Re^3: Practical e-mail address validation (Email::Address) by everybody (Scribe) on Sep 13, 2008 at 20:16 UTC
> The module doesn't appear to provide a way to get the address with quoting and escaping removed so that addresses can be compared.

The following snippet does just that:

    use Email::Address;

    my @addresses = map { Email::Address->parse($_) } <DATA>;
    print is_equivalent(@addresses) ? '' : 'not ', "equivalent\n";

    # Compare two Email::Address objects by bare address, ignoring case.
    sub is_equivalent {
        my ($first, $second) = @_;
        return lc $first->address eq lc $second->address;
    }

    __DATA__
    "John Doe" <jdoe@bla.com> (Johnnie "Two Toes")
    jDOE@BLA.COM

> It also doesn't disallow the very common user mistake of "everybody@gmail" (which can be valid as an e-mail address in some situations but isn't a valid address to give to somebody outside of your organization and so is worthwhile for us to disallow).

That's right. Any validation rules you want to impose beyond what RFC 2822 requires are up to you, but Email::Address will tokenise the addresses so that you can apply them. Here's a snippet showing how it deals with various address formats:

    use Email::Address;

    my @addresses = map { Email::Address->parse($_) } <DATA>;

    # Print every accessor for each address that was parsed.
    for my $address (@addresses) {
        printf("%8s: %s\n", $_, ( $address->$_ or '' ) )
            for ( qw( original address user host name phrase comment format ) );
        print "-------\n";
    }

    __DATA__
    abc@foo.com
    bla@gmail
    "Eve Rybody" <everybody@example.com>
    foo@asdf.com%bar.com
    "Alan B. Combs" <abc@foo.com> (I can't think of anything complex)

From there you can easily validate `$username`, `$host` etc.
Re^4: Practical e-mail address validation (Email::Address) by tye (Sage) on Sep 13, 2008 at 22:11 UTC
I don't see how your proposed solution does what I requested. It doesn't do any canonicalization of quotes or escapes. Yes, it eliminates comments. But your test considers `jdoe@bla.com` to be different from `"jdoe"@bla.com`, for example. Perhaps there are some other "tokens" I should be asking for instead?

Yes, I suppose I could go to the trouble of parsing RFC 2822 (except for not allowing arbitrary nesting of comments and not allowing CFWS in many places) in order to throw away the parts I don't want and then reparse the other parts to do additional validation and canonicalization. I suspect that will be more code (not counting the code for the module) than the solution I've already written. And it won't allow for simple customization such as allowing `/\.@/` as noted elsewhere.

The module almost does RFC 2822 but doesn't do a good job of practical validation of e-mail addresses typed in by external users. I don't see its contribution to my task being much of a "win".

- tye
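
The quote and escape canonicalization asked for here can be sketched in a few lines of core Perl. This is an illustrative assumption, not the regex from the original post: `canonical_address` is a hypothetical name, the input is assumed to be a bare address with comments and display name already removed, and the result is meant only as a comparison key (unquoting a local part that needed its quotes no longer yields a valid address).

```perl
use strict;
use warnings;

# Hypothetical helper: strip quoting and backslash escapes from the local
# part of a bare address so that equivalent spellings compare equal.
sub canonical_address {
    my ($addr) = @_;

    # Split at the last "@"; a quoted local part may itself contain "@".
    my ($local, $host) = $addr =~ /\A(.+)\@([^\@]+)\z/s
        or return;

    # Quoted-string local part: remove the quotes, then undo "\x" escapes.
    if ($local =~ /\A"((?:[^"\\]|\\.)*)"\z/s) {
        $local = $1;
        $local =~ s/\\(.)/$1/gs;
    }

    return lc("$local\@$host");
}

print canonical_address($_), "\n"
    for 'jdoe@bla.com', '"jdoe"@bla.com', '"j\doe"@BLA.com';
```

All three sample inputs reduce to the same key, which is the property the `jdoe@bla.com` versus `"jdoe"@bla.com` example calls for.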


	PerlMonks