We had a disagreement at work. One of our architects wanted us to follow the whole of RFC 2822 when deciding what e-mail addresses to accept. The developers were (at least nearly) unanimous in strongly disliking that idea. Probably the most important reason for the dislike is that (for several reasons) we need to be able to determine whether "these two e-mail addresses are the same".

My argument was "Yes, follow parts of the RFCs so that you produce something that is close to the official specification but, yes, we should certainly reject some parts of the RFCs because they are overly complex and of little value in practice."

Note that we are not talking about e-mail verification. That is done by sending an e-mail to the address in question containing a "magic cookie"...

One of the devs checked CPAN and found nothing that fit both of our very simple main requirements:

  1. Easily determine if two addresses are equivalent
  2. Allow "+" in addresses (which we use heavily in one test environment)

So he did a decent job of combining existing e-mail regexes that we had used already but also noted a bug in his own work before he finished composing the e-mail to suggest this solution.

So I went and loaded the two relevant RFCs into browser tabs and spent only a few minutes cutting and pasting and came up with the following response.

I'm posting it here so that I can get wider feedback on my results, methods, and assumptions. I'm also posting it because my results are quite simple and are derived almost trivially from the RFCs, so they are "correct" except in the few areas where I intentionally chose to ignore very specific features (and clearly called those out).

The resulting simple regex seems to support tons of forms of e-mail addresses that I almost never see and yet doesn't allow any illegal e-mail addresses and is quite practical.

Let us follow the RFCs to compose our regex.

I'll start with RFC 1035 that defines what a valid internet domain name looks like. It boils down to this, ignoring the "empty" case:

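    my $letter = q<[a-zA-Z]>;
    my $letdig = q<[a-zA-Z0-9]>;
    my $ldh    = q<[-a-zA-Z0-9]>;
    my $label  = "$letter(?:$ldh*$letdig)?";
    # The backslash is doubled so that a literal "\." reaches the regex;
    # inside a double-quoted string a single "\." collapses to a bare ".".
    my $domain = "$label(?:\\.$label)*";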

But we know that there are several things wrong with this in practice:

  1. There are domains with leading digits in their components ($label), such as
  2. We want to enforce having a top-level domain, that is, we want one dot to be mandatory
  3. All top-level domains are letters-only (and at least 2 characters)

So that gives us the following:

    my $label  = "$letdig(?:$ldh*$letdig)?";
    # ${letter}{2,} keeps Perl from reading $letter{2,} as a hash element.
    my $domain = "$label(?:\\.$label)*\\.${letter}{2,}";
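A minimal sketch of how the relaxed domain pattern behaves on a few sample names. The helper name is_domain and the sample names are assumptions made up for illustration, and anchoring with \A and \z is my choice here rather than anything the RFC dictates.

```perl
use strict;
use warnings;

# Building blocks from RFC 1035, relaxed as described above.
my $letter = q<[a-zA-Z]>;
my $letdig = q<[a-zA-Z0-9]>;
my $ldh    = q<[-a-zA-Z0-9]>;
my $label  = "$letdig(?:$ldh*$letdig)?";
my $domain = "$label(?:\\.$label)*\\.${letter}{2,}";

# Hypothetical helper: true if the whole string is an acceptable domain.
sub is_domain {
    my ($name) = @_;
    return $name =~ /\A$domain\z/ ? 1 : 0;
}

# Sample names are illustrative only.
for my $name (qw( example.com 3com.com a-b.example.org
                  localhost example.c0m -bad.example.com )) {
    printf "%-20s %s\n", $name, is_domain($name) ? 'ok' : 'rejected';
}
```

The first three names should be accepted. The last three should be rejected: one has no dot, one has a digit in the top-level domain, and one has a label that starts with a hyphen.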

Now on to RFC 2822. We only care about the "internet address specification", which boils down to the following if we drop quoting and whitespace. We probably want to drop those because keeping them makes canonicalization harder and adds quite a bit of complexity, while not likely adding a significant number of in-use addresses that we could support:

    my $atext = q<[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]>;
    # or $atext = q<[-\w!#$%&'*+/=?^`a-z{|}~]>;
    # (but that raises the question of non-ASCII letters)
    my $atom      = "$atext+";
    # The dot between atoms must be a literal dot, hence the "\\.".
    my $dot_atom  = "$atom(?:\\.$atom)*";
    my $addr_spec = $dot_atom . '@' . $dot_atom;

But we want to use $domain after the @ not $dot_atom and we want to match strings that equal an internet e-mail address specification, not ones that contain such:

    my $addr_spec = qr/^$dot_atom\@$domain$/;
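Putting the pieces together, here is a self-contained sketch that can be run as-is. The helper name is_address and the sample addresses are assumptions for illustration only. The backslashes in front of the dots are doubled because the pieces are built in double-quoted strings, where a single backslash before a dot would be dropped.

```perl
use strict;
use warnings;

# Domain part (RFC 1035, relaxed as described above).
my $letter = q<[a-zA-Z]>;
my $letdig = q<[a-zA-Z0-9]>;
my $ldh    = q<[-a-zA-Z0-9]>;
my $label  = "$letdig(?:$ldh*$letdig)?";
my $domain = "$label(?:\\.$label)*\\.${letter}{2,}";

# Local part (RFC 2822 dot-atom, without quoting or whitespace).
my $atext    = q<[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]>;
my $atom     = "$atext+";
my $dot_atom = "$atom(?:\\.$atom)*";

my $addr_spec = qr/^$dot_atom\@$domain$/;

# Hypothetical helper: true if the string looks like an address we accept.
sub is_address {
    my ($addr) = @_;
    return $addr =~ $addr_spec ? 1 : 0;
}

# Sample addresses are illustrative only.
for my $addr ('tye@example.com', 'user+tag@example.com',
              'first.last@mail.example.org', 'user@localhost',
              'user..name@example.com', '"quoted"@example.com') {
    printf "%-30s %s\n", $addr, is_address($addr) ? 'ok' : 'rejected';
}
```

The first three addresses should be accepted, including the one with "+". The others should be rejected for a missing top-level domain, an empty atom between two dots, and a quoted local part.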

So should I upload this to CPAN? Did my coworker miss something close to this already on CPAN? Am I missing something important? Are my expectations or assumptions off base?

BTW, note that the "are these two addresses equivalent?" test is just lc $addr1 eq lc $addr2. Yes, I realize that it is possible to set up a system in which two addresses that differ only in letter case are completely separate addresses, but anybody who does that deserves to suffer from such a set-up.
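As a sketch, the equivalence test can be wrapped in a small function. The name same_address is an assumption for illustration, and the function expects both arguments to have already passed validation.

```perl
use strict;
use warnings;

# Hypothetical helper: two already-validated addresses are treated as
# equivalent when they are equal after lower-casing.
sub same_address {
    my ($addr1, $addr2) = @_;
    return lc($addr1) eq lc($addr2) ? 1 : 0;
}

print same_address('Tye@Example.COM', 'tye@example.com')
    ? "same\n" : "different\n";
print same_address('tye+test@example.com', 'tye@example.com')
    ? "same\n" : "different\n";
```

Note that this deliberately does not treat a "+" suffix as insignificant, so the second pair compares as different.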

Finally, yes, I know that my regex doesn't disallow a trailing newline. I can't remember which of \z and \Z disallows that, and I find that in practice the code has to deal with trailing whitespace anyway, so the regex doesn't need to worry about that detail. If I were to put this in a module, then I'd look up \z and \Z (and still provide alternatives for them since there is probably little reason for such a module to not work with Perl 5.0).
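For reference, a quick sketch of how the three end anchors treat a trailing newline. As perlre documents it, \z matches only at the very end of the string, while \Z and a plain $ also match just before a final newline. The sample string is made up for illustration.

```perl
use strict;
use warnings;

my $input = "tye\@example.com\n";   # note the trailing newline

# Record which end anchor accepts the string despite the newline.
my %matched = (
    '$'  => ($input =~ /\.com$/  ? 1 : 0),
    '\Z' => ($input =~ /\.com\Z/ ? 1 : 0),
    '\z' => ($input =~ /\.com\z/ ? 1 : 0),
);

for my $anchor ('$', '\Z', '\z') {
    printf "%-2s %s\n", $anchor,
        $matched{$anchor} ? 'matches' : 'does not match';
}
```

So a regex that must reject a trailing newline would end with \z rather than $ or \Z.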

- tye