PerlMonks  

Re: Practical e-mail address validation

by everybody (Scribe)
on Sep 13, 2008 at 15:42 UTC ( [id://711124]=note )


in reply to Practical e-mail address validation

So should I upload this to CPAN? Did my coworker miss something close to this already on CPAN?

There are a lot of modules on CPAN dealing with e-mail addresses. Unfortunately, a lot of them are rubbish. You might want to take a look at the Perl Email Project if you haven't already.

If I understand you correctly (without specifically having worked through the regexen), you consider the following addresses to be equivalent:

    everybody@example.com
    "Eve" <everybody@example.com> (Rybody)
    EveRybody+whatever@EXAMPLE.COM

What about addresses with different top-level-domains, but equal second-level-domains? For example emails from/to

    foo@gmail.com
    foo@gmail.net
    foo@gmail.com.au
    foo@gmail.at
    ...
    foo@gmail.za

are all routed from/to the same mailbox. Are they equivalent to you?

In either case all you need is a tokeniser. It will allow you to compare exactly what it is you need to determine equivalency.

Some of the existing modules tokenise the addresses internally to some extent, but Email::Address also gives you nice accessors for the resulting tokens, so it would be my prime choice for solving what I perceive your problem to be.

Re^2: Practical e-mail address validation (Email::Address)
by tye (Sage) on Sep 13, 2008 at 16:35 UTC

    I've seen domains where foo.com was used in e-mail addresses for employees while foo.net was used in e-mail addresses of customers.

    As for the "+mailbox" convention, there are arguments on both sides of whether to ignore it when determining equivalence of addresses. A customer might legitimately want separate accounts for members of a single group, where the correspondence for all accounts just goes to separate mailboxes at the same address. Or it might just be a source of confusion, or simplify some mild cases of abuse.

    But we'll be using +mailbox to simplify testing so we'll just use lc $addr1 eq lc $addr2 as I already noted.
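A minimal sketch of that comparison, using only core Perl. The helper name `same_address` is hypothetical and not part of any module mentioned in this thread. Because nothing except case is normalized, addresses that differ only in their +mailbox part are deliberately treated as different.

```perl
use strict;
use warnings;

# Hypothetical helper: two addresses count as the same if they are
# identical after lowercasing. The "+mailbox" part is left untouched.
sub same_address {
    my ($addr1, $addr2) = @_;
    return lc($addr1) eq lc($addr2);
}

for my $pair (
    [ 'EveRybody@EXAMPLE.COM',       'everybody@example.com' ],
    [ 'everybody+test1@example.com', 'everybody@example.com' ],
) {
    my $verdict = same_address(@$pair) ? 'same' : 'different';
    print "$pair->[0] vs $pair->[1]: $verdict\n";
}
```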

    Thanks for the module recommendations. Email::Address notes:

    XXX: This ($phrase) used to just be: my $phrase = qr/$word+/; It was changed to resolve bug 22991, creating a significant slowdown. Given current speed problems. Once 16320 is resolved, this section should be dealt with. -- rjbs, 2006-11-11
    XXX: ...and the above solution caused endless problems (never returned) when examining this address, now in a test:
    admin+=E6=96=B0=E5=8A=A0=E5=9D=A1_Weblog-- ATAT --test.socialtext.com
    So we disallow the hateful CFWS in this context for now. Of modern mail agents, only Apple Web Mail 2.0 is known to produce obs-phrase. -- rjbs, 2006-11-19

    which confirms some of my suspicions/assumptions.

    Looking at the regexes that the module uses, they appear to have been constructed directly from the RFCs very similarly to how I constructed mine, except fewer features were intentionally dropped.

    The note that "Of modern mail agents, only ... is known to produce" would make me want to use that module if I were trying to parse e-mail addresses received in e-mail messages. An e-mail system would be broken if it required "the hateful CFWS" in order to deliver messages to it. So completely disallowing CFWS (as I did) doesn't prevent any addresses from being used.

    The module doesn't appear to provide a way to get the address with quoting and escaping removed so that addresses can be compared. It also doesn't disallow the very common user mistake of "everybody@gmail" (which can be valid as an e-mail address in some situations but isn't a valid address to give to somebody outside of your organization and so is worthwhile for us to disallow).
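As an illustration of the "everybody@gmail" check only, here is a rough sketch in core Perl. The helper name `has_dotted_domain` is an assumption made for this example. It is not the regex discussed in this thread and it does no RFC 2822 validation; it merely requires at least two dot-separated labels after the last "@".

```perl
use strict;
use warnings;

# Hypothetical check: reject addresses whose domain has no dot,
# such as "everybody@gmail". Everything else about the address
# is left unvalidated.
sub has_dotted_domain {
    my ($addr) = @_;

    # Take everything after the last "@" as the domain.
    my ($domain) = $addr =~ /\@([^\@]*)\z/
        or return 0;

    # Require two or more non-empty labels separated by single dots.
    return $domain =~ /\A[^.\s]+(?:\.[^.\s]+)+\z/ ? 1 : 0;
}

for my $addr ('everybody@gmail.com', 'everybody@gmail') {
    my $verdict = has_dotted_domain($addr) ? 'accepted' : 'rejected';
    print "$addr: $verdict\n";
}
```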

    So it appears that my similar regex has several advantages that I couldn't get from Email::Address as written.

    - tye        

      The module doesn't appear to provide a way to get the address with quoting and escaping removed so that addresses can be compared.
      The following snippet does just that:
          use Email::Address;

          my @addresses = map { Email::Address->parse($_) } <DATA>;

          print is_equivalent(@addresses) ? '' : 'not ' , "equivalent\n";

          sub is_equivalent {
              my ($a, $b) = @_;
              return lc $a->address eq lc $b->address;
          }

          __DATA__
          "John Doe" <jdoe@bla.com>
          (Johnnie "Two Toes") jDOE@BLA.COM

      It also doesn't disallow the very common user mistake of "everybody@gmail" (which can be valid as an e-mail address in some situations but isn't a valid address to give to somebody outside of your organization and so is worthwhile for us to disallow).
      That's right. Any validation rules you want to impose beyond what RFC 2822 requires are up to you, but Email::Address will tokenise the addresses so that you can apply them. Here's a snippet showing how it deals with various address formats:
          use Email::Address;

          my @addresses = map { Email::Address->parse( $_ ) } <DATA>;

          for my $address (@addresses) {
              printf("%8s: %s\n", $_, ( $address->$_ or '' ) )
                  for ( qw( original address user host name phrase comment format ) );
              print "-------\n";
          }

          __DATA__
          abc@foo.com
          bla@gmail
          "Eve Rybody" <everybody@example.com>
          foo@asdf.com%bar.com
          "Alan B. Combs" <abc@foo.com> (I can't think of anything complex)

      From there you can easily validate $username, $host etc.

        I don't see how your proposed solution does what I requested. It doesn't do any canonicalization of quotes nor escapes. Yes, it eliminates comments. But your test considers jdoe@bla.com to be different from "jdoe"@bla.com, for example. Perhaps there are some other "tokens" I should be asking for instead?
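To make the canonicalization request concrete, here is a rough sketch in core Perl of what removing quotes and escapes could look like. The helper name `comparison_key` is hypothetical, and the sketch assumes a bare address with no display name, comments, or CFWS. The result is meant only as a key for comparing addresses, not as an address to deliver mail to.

```perl
use strict;
use warnings;

# Hypothetical helper: build a comparison key from a bare address.
# Assumes no display name, no comments and no CFWS in the input.
sub comparison_key {
    my ($addr) = @_;

    # Split on the last "@", so a quoted local part may contain "@".
    my ($local, $domain) = $addr =~ /\A(.*)\@([^\@]*)\z/s
        or return;

    # Strip one pair of surrounding double quotes, then undo
    # backslash escapes (quoted-pairs) inside them.
    if ($local =~ /\A"(.*)"\z/s) {
        $local = $1;
        $local =~ s/\\(.)/$1/gs;
    }

    # Lowercase the whole key, matching the lc comparison used above.
    return lc "$local\@$domain";
}

for my $addr ('jdoe@bla.com', '"jdoe"@bla.com', '"j\\.doe"@BLA.COM') {
    print "$addr => ", comparison_key($addr), "\n";
}
```

With this key, jdoe@bla.com and "jdoe"@bla.com compare as equal, which the lc-only comparison on the module's address accessor does not achieve.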

        Yes, I suppose I could go to the trouble of parsing RFC 2822 (except for not allowing arbitrary nesting of comments and not allowing CFWS in many places) in order to throw away the parts I don't want and then reparse the other parts to do additional validation and canonicalization. I suspect that will be more code (not counting the code for the module) than the solution I've already written. And it won't allow for simple customization such as allowing /\.@/ as noted elsewhere.

        The module almost does RFC 2822 but doesn't do a good job of practical validation of e-mail addresses typed in by external users. I don't see its contribution to my task being much of a "win".

        - tye        
