http://www.perlmonks.org?node_id=711061

We had a disagreement at work. One of our architects wanted us to follow the whole of RFC 2822 when deciding what e-mail addresses to accept. The developers were (at least nearly) unanimous in strongly disliking that idea. Probably the most important reason for the dislike is that (for several reasons) we need to be able to determine whether "these two e-mail addresses are the same".

My argument was "Yes, follow parts of the RFCs so that you produce something that is close to the official specification but, yes, we should certainly reject some parts of the RFCs because they are overly complex and of little value in practice."

Note that we are not talking about e-mail verification. That is done by sending an e-mail to the address in question containing a "magic cookie"...

One of the devs checked CPAN and found nothing that fit both of our very simple main requirements:

  1. Easily determine if two addresses are equivalent
  2. Allow "+" in addresses (which we use heavily in one test environment)

So he did a decent job of combining existing e-mail regexes that we had used already but also noted a bug in his own work before he finished composing the e-mail to suggest this solution.

So I went and loaded the two relevant RFCs into browser tabs and spent only a few minutes cutting and pasting and came up with the following response.

I'm posting it here so that I can get wider feedback on my results, methods, and assumptions. Also because my results are quite simple, are derived almost trivially directly from the RFCs and so are "correct" except in the few areas where I intentionally chose to ignore very specific features (and clearly called those out).

The resulting simple regex seems to support tons of forms of e-mail addresses that I almost never see and yet doesn't allow any illegal e-mail addresses and is quite practical.


Let us follow the RFCs to compose our regex.

I'll start with RFC 1035 that defines what a valid internet domain name looks like. It boils down to this, ignoring the "empty" case:

    my $letter = q<[a-zA-Z]>;
    my $letdig = q<[a-zA-Z0-9]>;
    my $ldh    = q<[-a-zA-Z0-9]>;
    my $label  = "$letter(?:$ldh*$letdig)?";
    my $domain = "$label(?:\\.$label)*";   # "\\." so the regex sees an escaped dot

But we know that there are several things wrong with this in practice:

  1. There are domains with leading digits in their components ($label), such as 360.com
  2. We want to enforce having a top-level domain, that is, we want one dot to be mandatory
  3. All top-level domains are letters-only (and at least 2 characters)

So that gives us the following:

    my $label  = "$letdig(?:$ldh*$letdig)?";
    # "\\." keeps the dot escaped; ${letter} stops Perl reading $letter{...} as a hash element
    my $domain = "$label(?:\\.$label)*\\.${letter}{2,}";

Now on to RFC 2822. We only care about the "internet address specification", which boils down to the following if we drop quoting and whitespace. We probably want to drop both, because keeping them makes canonicalization harder and adds quite a bit of complexity, while it is unlikely to add a significant number of in-use addresses that we could support:

    my $atext = q<[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]>;
    # or $atext = q<[-\w!#$%&'*+/=?^`a-z{|}~]>;
    # (but that raises the question of non-ASCII letters)
    my $atom      = "$atext+";
    my $dot_atom  = "$atom(?:\\.$atom)*";   # the dot between atoms must be escaped
    my $addr_spec = $dot_atom . '@' . $dot_atom;

But we want to use $domain after the @ not $dot_atom and we want to match strings that equal an internet e-mail address specification, not ones that contain such:

my $addr_spec = qr/^$dot_atom\@$domain$/;
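Put together, the pieces can be sketched as one runnable script. This is only a sketch of the steps above, not a tested module: the sample addresses are made up for illustration. Two details matter when the parts are built from double-quoted strings: a literal dot has to be written `\\.` (a lone `\.` collapses to `.` before the regex engine sees it), and `${letter}{2,}` needs the braces so Perl does not look for a hash named `%letter`.

```perl
use strict;
use warnings;

# Domain part: labels may start with a digit, and a letters-only
# top-level domain of at least two characters is mandatory.
my $letter = q<[a-zA-Z]>;
my $letdig = q<[a-zA-Z0-9]>;
my $ldh    = q<[-a-zA-Z0-9]>;
my $label  = "$letdig(?:$ldh*$letdig)?";
my $domain = "$label(?:\\.$label)*\\.${letter}{2,}";

# Local part: dot-separated atoms, no quoting and no whitespace.
my $atext    = q<[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]>;
my $atom     = "$atext+";
my $dot_atom = "$atom(?:\\.$atom)*";

my $addr_spec = qr/^$dot_atom\@$domain$/;

# Hypothetical sample inputs, chosen only for illustration.
my @samples = ('tye+test@example.com', 'user@360.com',
               'everybody@gmail', 'a..b@example.com');
for my $candidate (@samples) {
    printf "%-22s %s\n", $candidate,
        $candidate =~ $addr_spec ? 'accepted' : 'rejected';
}
```

The first two samples are accepted; the last two are rejected (no top-level domain, and an empty atom between two dots).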

So should I upload this to CPAN? Did my coworker miss something close to this already on CPAN? Am I missing something important? Are my expectations or assumptions off base?

BTW, note that the "are these two addresses equivalent?" test is just lc $addr1 eq lc $addr2. Yes, I realize that it is possible to set up a system such that ExpertsExchange@example.com and ExpertSexchange@example.com are completely separate addresses, but anybody who does that deserves to suffer from such a set-up.
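As a small sketch of that comparison (the helper name `same_address` is made up here, and nothing is ever stored in lower case):

```perl
use strict;
use warnings;

# Hypothetical helper: two addresses count as the same when they agree
# after lower-casing. The "+mailbox" part is deliberately left alone.
sub same_address {
    my ($first, $second) = @_;
    return lc($first) eq lc($second);
}

print same_address('ExpertsExchange@example.com', 'expertsexchange@EXAMPLE.com')
    ? "same\n" : "different\n";
print same_address('tye+one@example.com', 'tye+two@example.com')
    ? "same\n" : "different\n";
```

The first pair compares as the same; the second pair stays different because the "+" suffix is part of the comparison.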

Finally, yes, I know that my regex doesn't disallow a trailing newline. I can't remember which of \z and \Z disallows that, and I find that in practice the code has to deal with trailing whitespace anyway, so the regex doesn't need to worry about that detail. If I were to put this in a module, then I'd look up \z and \Z (and still provide alternatives for them since there is probably little reason for such a module to not work with Perl 5.0).

- tye        

Re: Practical e-mail address validation
by Limbic~Region (Chancellor) on Sep 13, 2008 at 16:27 UTC
    tye,
    What I would rather see on CPAN is an email address validation module that allows you to pick and choose (as well as add your own) what rules you want to play by. For instance "do not accept email address that require an open relay to work" or "accept email address that have a period before @ in violation of the RFC".

    Now on to your problem at hand. Are the following email addresses "equivalent"?

    1. foo@bar.com
    2. Foo@bar.com
    3. foo@BAR.com

    It turns out that 1 and 3 are but 2 is not. You have already mentioned this. I only bring it up again to point out another "rule" for this theoretical CPAN module - to consider case in the user portion of the address. Here is another one that may be difficult to tell:
    1. foo@bar.com
    2. foo%bar.com@asdf.com # corrected

    These are functionally equivalent because it expects the MTA at asdf.com to relay the mail to bar.com.

    So I have no practical use for your validation routines but would love to see a more flexible module - for reasons I mention here as well as ones mentioned here, here and there.

    Cheers - L~R

      Having just skimmed the parts of RFC 2822 regarding e-mail addresses, it was pretty clear that "accept email address that have a period before @ in violation of the RFC" isn't based on the copy that I found. It clearly went to some length to document how you can use period before the @ so I'd be quite surprised to find that some other part of the RFC disallowed such usage.

      As for ignoring case before the @, I wrote above:

      Yes, I realize that it is possible to set up a system such that ExpertsExchange@example.com and ExpertSexchange@example.com are completely separate addresses, but anybody who does that deserves to suffer from such a set-up.

      which I think makes my position on that quite clear. Note that I certainly won't be altering case of the user portion of any addresses (I see no reason to alter case of any portion of the address, actually, but I realize that altering the case of the user portion is at least technically allowed to break the e-mail address).

      As for foo@asdf.com%bar.com, I agree that this is a valid e-mail address and can see uses for it in certain situations. But I also feel that requiring such an address format be used by your (internal or external) customers when they request that e-mail be sent to them by random external entities is a good reason to demand a new e-mail service provider. So I don't yet feel guilty about considering not allowing such addresses to be used in order to register for a service we provide on the internet (as my regex above disallows). So (at this point) I won't have a problem with comparing such addresses.

      As for a pick-and-choose module, one reaction I have is that I think it took me less than 10 minutes to cut'n'paste from two RFCs to come up with my simplistic results that seem amply permissive to real-world e-mail addresses meant to be used "at large". So I don't foresee it as particularly difficult to spend 10 minutes to pick and choose the items that fit one's specific situation. And part of my point was to wonder why people seem to never bother to cut'n'paste from the RFCs when they go to roll their own regexes.

      But I will certainly consider producing such a module (or patching an existing module, or, rather, trying to patch an existing module since CPAN so very often makes usefully patching existing modules difficult and extremely slow), especially as more details of what items are likely to be worth picking and choosing between are explained to me explicitly and clearly. I don't claim to be an expert on e-mail addresses, in fact, part of the point of posting this was my shock at how easy it appeared to be to provide something that seemed more practical (for the very common problem of validating e-mail addresses being entered by users of a web page) than what my coworker reported finding on CPAN.

      Is "do not accept email address that require an open relay to work" nothing more than "don't allow % after the @"? In any case, thanks for another justification for not using full-RFC-2822 addresses.

      Thanks also for your assessment of whether this would be a good addition to CPAN. I appreciate your opinion.

      - tye        

        tye,
        It has been 6 years since I worked at the US Dept. Of Justice and had the RFCs memorized but you can see that others agree with me. Neither Email::Valid nor Email::Address believe 'foo.@bar.com' is a valid email address and Email::Valid::Loose only exists to relax the rules of RFC 2822 to allow a period before the at.

        Regarding case sensitivity in the user portion, you did make your position clear. In fact, I indicated you had already mentioned it. I brought it up again because I believe it would be a valuable rule to turn on/off if they were using this theoretical module to identify spammers.

        The reason I suggest such a pick and choose module is thus: The specific reasons for wanting to look for email addresses and then choose to deem them invalid changes from situation to situation. Most folks are completely ignorant of the RFCs and it would be easy for them to say "in my situation, I want to allow X and Y but deny Z" without having to go look anything up.

        Cheers - L~R

Re: Practical e-mail address validation
by everybody (Scribe) on Sep 13, 2008 at 15:42 UTC
    So should I upload this to CPAN? Did my coworker miss something close to this already on CPAN?

    There are a lot of modules on CPAN dealing with e-mail addresses. Unfortunately, a lot of them are rubbish. You might want to take a look at the Perl Email Project if you haven't already.

    If I understand you correctly (without specifically having worked through the regexen), you consider the following addresses to be equivalent:

    everybody@example.com
    "Eve" <everybody@example.com> (Rybody)
    EveRybody+whatever@EXAMPLE.COM

    What about addresses with different top-level-domains, but equal second-level-domains? For example emails from/to
    foo@gmail.com
    foo@gmail.net
    foo@gmail.com.au
    foo@gmail.at
    ...
    foo@gmail.za

    are all routed from/to the same mailbox. Are they equivalent to you?

    In either case all you need is a tokeniser. It will allow you to compare exactly what it is you need to determine equivalency.

    Some of the existing modules tokenise the addresses internally to some extent, but Email::Address also gives you nice accessors for the resulting tokens, so it would be my prime choice for solving what I perceive your problem to be.

      I've seen domains where foo.com was used in e-mail addresses for employees while foo.net was used in e-mail addresses of customers.

      As for the "+mailbox" convention, there are arguments on both sides of whether to ignore such in determining equivalence of addresses (a customer might legitimately want separate accounts for members of a single group where the correspondence for all accounts just go to separate mailboxes at the same address, or it might just be a source of confusion or simplify some mild cases of abuse).

      But we'll be using +mailbox to simplify testing so we'll just use lc $addr1 eq lc $addr2 as I already noted.

      Thanks for the module recommendations. Email::Address notes:

      XXX: This ($phrase) used to just be: my $phrase = qr/$word+/; It was changed to resolve bug 22991, creating a significant slowdown. Given current speed problems. Once 16320 is resolved, this section should be dealt with. -- rjbs, 2006-11-11
      XXX: ...and the above solution caused endless problems (never returned) when examining this address, now in a test:
      admin+=E6=96=B0=E5=8A=A0=E5=9D=A1_Weblog-- ATAT --test.socialtext.com
      So we disallow the hateful CFWS in this context for now. Of modern mail agents, only Apple Web Mail 2.0 is known to produce obs-phrase. -- rjbs, 2006-11-19

      which confirms some of my suspicions/assumptions.

      Looking at the regexes that the module uses, they appear to have been constructed directly from the RFCs very similarly to how I constructed mine, except fewer features were intentionally dropped.

      The note that "Of modern mail agents, only ... is known to produce" leads me to want to use that module if I were trying to parse e-mail addresses received in e-mail messages. An e-mail system would be broken if it required "the hateful CFWS" in order to deliver messages to it. So completely disallowing CFWS (as I did) doesn't prevent any addresses from being used.

      The module doesn't appear to provide a way to get the address with quoting and escaping removed so that addresses can be compared. It also doesn't disallow the very common user mistake of "everybody@gmail" (which can be valid as an e-mail address in some situations but isn't a valid address to give to somebody outside of your organization and so is worthwhile for us to disallow).

      So it appears that my similar regex has several advantages that I couldn't get from Email::Address as written.

      - tye        

        The module doesn't appear to provide a way to get the address with quoting and escaping removed so that addresses can be compared.
        The following snippet does just that:
            use Email::Address;

            # Each DATA line is parsed into an Email::Address object.
            my @addresses = map { Email::Address->parse($_) } <DATA>;
            print is_equivalent(@addresses) ? '' : 'not ' , "equivalent\n";

            # Compare only the bare address part, ignoring case.
            sub is_equivalent {
                my ($first, $second) = @_;
                return lc $first->address eq lc $second->address;
            }

            __DATA__
            "John Doe" <jdoe@bla.com>
            (Johnnie "Two Toes") jDOE@BLA.COM

        It also doesn't disallow the very common user mistake of "everybody@gmail" (which can be valid as an e-mail address in some situations but isn't a valid address to give to somebody outside of your organization and so is worthwhile for us to disallow).
        That's right. Any validation rules you want to impose beyond what RFC 2822 does is up to you, but Email::Address will tokenise the addresses in order to enable you to. Here's a snippet of how it deals with various address formats:
            use Email::Address;

            my @addresses = map { Email::Address->parse( $_ ) } <DATA>;
            for my $address (@addresses) {
                printf("%8s: %s\n", $_, ( $address->$_ or '' ) )
                    for ( qw( original address user host name phrase comment format ) );
                print "-------\n";
            }

            __DATA__
            abc@foo.com
            bla@gmail
            "Eve Rybody" <everybody@example.com>
            foo@asdf.com%bar.com
            "Alan B. Combs" <abc@foo.com> (I can't think of anything complex)

        From there you can easily validate $username, $host etc.
Re: Practical e-mail address validation
by kyle (Abbot) on Sep 13, 2008 at 19:06 UTC

    Use \Z to match end of string or newline and \z to match only end of string.
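A quick way to see the difference, as a minimal sketch with a made-up sample string: `$` and `\Z` both tolerate one trailing newline, while `\z` insists on the true end of the string.

```perl
use strict;
use warnings;

# Hypothetical input with a trailing newline, as read from a file or form.
my $with_newline = "tye\@example.com\n";

# "$" and "\Z" match before a final newline; "\z" only at the very end.
printf "%-3s %s\n", '$',  $with_newline =~ /com$/  ? 'matches' : 'no match';
printf "%-3s %s\n", '\Z', $with_newline =~ /com\Z/ ? 'matches' : 'no match';
printf "%-3s %s\n", '\z', $with_newline =~ /com\z/ ? 'matches' : 'no match';
```

So a validating regex that must reject a trailing newline would end in `\z` rather than `$`.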

    I've long been under the impression that domain names are allowed to have a single trailing dot. That is, "example.com" is the same as "example.com.". As a lame proof of this, dig gives me the same answer either way, but it rejects "example.com.." (not a legal name). I haven't looked closely at RFC1035 for support for this. If you accept this, it screws up the "lc eq lc" test for equality. Maybe you'd want to just s/\.?\s*$// everything before anything else.
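A minimal sketch of that normalization step, assuming it runs before the comparison (the helper name `comparable` is invented for this example):

```perl
use strict;
use warnings;

# Hypothetical helper: drop one optional trailing dot plus any trailing
# whitespace, then lower-case, so both spellings compare as equal.
sub comparable {
    my ($address) = @_;
    (my $copy = $address) =~ s/\.?\s*$//;
    return lc $copy;
}

print comparable("Foo\@Example.COM.\n") eq comparable('foo@example.com')
    ? "equivalent\n" : "not equivalent\n";
```

Note that only a single trailing dot is removed, so "example.com.." still ends in a dot afterwards and would be rejected by any check that requires the name to end in letters.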

    Yes, I realize that it is possible to set up a system such that ExpertsExchange@example.com and ExpertSexchange@example.com are completely separate addresses, but anybody who does that deserves to suffer from such a set-up.

    Unfortunately, it's usually not the people who set it up who suffer but rather the people who have to use it and often have no control over it. I don't think that ignoring the case of the local part of an email address is a bad design decision—I've done it myself at times. I think, rather, that it's better justified on the grounds that the few mistakes aren't worth the extra work to avoid them.

    At YAPC::NA 2008, I attended a talk by Ricardo SIGNES (rjbs) called "Email Hates the Living!" which discussed some of the pitfalls of parsing email addresses strictly according to the standards. Google knows about it. You might pass that along to anyone else who wants to take an approach less practical than yours. Also, it's hilarious.

      The dot at the end of a domain name makes it "absolute" (at least in some situations). Without the final dot, the local domain can be appended to it when trying to resolve it. Compare "nslookup www" vs. "nslookup www." if you are in a domain with a web server, for example ("dig" here appears to just assume a trailing dot if you leave it off). Funnily enough, RFC 2822 doesn't appear to allow a trailing dot (though I didn't read up on the obsolete bits).

      I think, rather, that it's better justified on the grounds that the few mistakes aren't worth the extra work to avoid them.

      It isn't particularly hard to not ignore case only in front of the @. The reason you should ignore case there (but preserve it) is that if you have two addresses that agree except for the case of some letters to the left of the @, the possibilities and their odds are:

      near 0%: The two addresses are different and both valid
      >> 90%: The two addresses are the same
      << 10%: One address is valid and the other is invalid

      So handling the over-90% case correctly is a much better idea than handling the near-0% case correctly. For the under-10% case, the choice doesn't matter much, but even there the "ignore case" choice is likely more convenient for the humans involved (who may well know that they can't successfully send e-mail to ExpertSexchange@example.com, only to ExpertsExchange@example.com, but that is no reason to not recognize which account they want a password reminder for when they enter "expertsexchange@example.com" in the web form).
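One way to ignore case while still preserving it, sketched with a made-up in-memory account table (the names `register` and `find_account` are hypothetical, not from any module):

```perl
use strict;
use warnings;

# Hypothetical account table: keyed by the lower-cased address, while the
# address itself is stored exactly as the user first typed it.
my %account_for;

sub register {
    my ($address) = @_;
    my $key = lc $address;
    return 0 if exists $account_for{$key};    # already registered
    $account_for{$key} = { address => $address };
    return 1;
}

sub find_account {
    my ($address) = @_;
    return $account_for{ lc $address };
}

register('ExpertsExchange@example.com');
my $account = find_account('expertsexchange@example.com');
print "mail goes to $account->{address}\n";
```

The lookup succeeds for the all-lower-case form typed into the web form, yet mail is still sent to the address with its original capitalization.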

      Or perhaps you meant that it is too much trouble to try to determine if some particular e-mail host ignores case or not. That certainly would be a lot of trouble and I certainly see no point in trying. :) Especially since there is still benefit to ignoring case even in addresses for e-mail hosts that don't. Actually, even if I could conveniently determine if a particular e-mail host ignores case or not, I wouldn't use that information. Just because the person who runs that host puts their users through such pain doesn't mean that I should extend that pain to them when they interact with my system.

      Yes, one of my coworkers went to rjbs' talk and the Email::Address comments that I quoted elsewhere are signed "--rjbs". We'll likely review that material again before this is over.

      - tye        

Re: Practical e-mail address validation
by mr_mischief (Monsignor) on Sep 19, 2008 at 21:06 UTC
    You made it clear already you reserve the right to pick and choose which parts of the RFC to support. That's your right within your own internal systems. It's also fair, when properly documented, for software you release. So I certainly wouldn't blame you for also picking and choosing which RFCs are relevant or which parts of those additional RFCs to support. However, just in case you're interested, I thought I'd give you this heads up.

    From RFC 1101:

    From RFC 1123 (STD 3):

    From RFC 2181 :

    Here some confusion creeps in: RFC 2181 grants some leniency, in that "any binary label" must be served properly by DNS but need not be allowed as a hostname in a client application such as SMTP.

    Then RFC 2822 specifies RFC 1035 instead of RFC 1123 as the authoritative DNS RFC. Yet RFC 2822 also states that a domain name for either a hostname or a mail exchanger (MX) name is in the terms of STD 3 (RFCs 1122 and 1123), STD 4 (which is reserved for routing topics and represents RFC 1812 for IPv4 routing), and STD 14 (which is historical). RFC 2822 defers to RFC 2821 for further information about domain names.

    RFC 2821 claims to clarify some of RFC 1123 for purposes of SMTP email and states that domain names for SMTP are limited to letters, digits, and hyphen and MUST NOT contain anything else (and specifically not underscore). It makes no mention of whether or not digits may lead or be the entirety of a label as mentioned in RFC 1123. It lists RFC 1035 in examples and suggests being conservative in DNS naming, but does not actually specify a SHOULD or MUST in relation to preferring RFC 1123 (which along with RFC 1122 is part of the Internet standard STD 3) or RFC 1035.

    It has been not just common practice, then, but some attempt to actually use the standards that allows domain parts other than the top-level domains to start with digits. Like I said before, your company's own decisions are another thing entirely so long as they are internal or clearly documented. You wouldn't be in violation of the RFCs if you were to allow domain name parts to start with digits, though, unless you allow four sections of all-digits in a row or allow top-level domains to start with a digit.

      Thanks very much for that thorough summary of related RFCs that I hadn't reviewed recently. It was quite helpful and informative.

      My main take-away is that my guess based on observations was pretty accurate. I think I'll choose to be encouraged by the statement that domains for e-mail addresses are restricted to mundane characters and to take it as reason to ignore (for now) the hints that arbitrary binary labels might need to be supported.

      As for the comments about my decisions being "internal", they are actually decisions about what forms of e-mail addresses we will accept from any user on the internet who wishes to register for some of our services. So it isn't strictly "internal". But I also think that it isn't something strictly covered by these RFCs (we aren't using the system to implement something that talks SMTP, for example). The protocol involved is "accepting text from random users over the internet", so I think some simplification is warranted.

      - tye        

        As a personal problem example (what else from me, right? ;-) I have some email addresses which work fine but are sometimes rejected on registration forms.

        One is a personal email address of the form '_\w+_\@populardomain.example', and gets rejected for the underscores or the leading underscore. My company also holds four domains of the form '\d{2}[a-z]+\.[a-z]{3,4}'.

        I get somewhat frustrated with sites that refuse to accept those. It's quite alright in the long run if they offer an alternate way to get registered, such as emailing a support contact or leaving a note for registration support through a form when the address is rejected. Both of those take human intervention, though. I can always use a different email address to sign up for something, but I generally use certain ones to group certain kinds of topics. If I don't find an address that works through some method in a couple of tries, I usually start looking for the competition's website.