Re: Practical e-mail address validation

by mr_mischief (Monsignor)
on Sep 19, 2008 at 21:06 UTC (#712630)

in reply to Practical e-mail address validation

You made it clear already you reserve the right to pick and choose which parts of the RFC to support. That's your right within your own internal systems. It's also fair, when properly documented, for software you release. So I certainly wouldn't blame you for also picking and choosing which RFCs are relevant or which parts of those additional RFCs to support. However, just in case you're interested, I thought I'd give you this heads up.

From RFC 1101:

For these reasons, we assume that the syntax of network names will be the same as the expanded syntax for host names permitted in [HR]. The new syntax expands the set of names to allow leading digits, so long as the resulting representations do not conflict with IP addresses in decimal octet form. For example, 3Com.COM and 3M.COM are now legal, although 26.0.0.73.COM is not. See [HR] for details.

From RFC 1123 (STD 3):

2.1 Host Names and Numbers

The syntax of a legal Internet host name was specified in RFC-952 [DNS:4]. One aspect of host name syntax is hereby changed: the restriction on the first character is relaxed to allow either a letter or a digit. Host software MUST support this more liberal syntax.

Host software MUST handle host names of up to 63 characters and SHOULD handle host names of up to 255 characters. Whenever a user inputs the identity of an Internet host, it SHOULD be possible to enter either (1) a host domain name or (2) an IP address in dotted-decimal ("#.#.#.#") form. The host SHOULD check the string syntactically for a dotted-decimal number before looking it up in the Domain Name System.

This last requirement is not intended to specify the complete syntactic form for entering a dotted-decimal host number; that is considered to be a user-interface issue. For example, a dotted-decimal number must be enclosed within "[ ]" brackets for SMTP mail (see Section 5.2.17). This notation could be made universal within a host system, simplifying the syntactic checking for a dotted-decimal number.

If a dotted-decimal number can be entered without such identifying delimiters, then a full syntactic check must be made, because a segment of a host domain name is now allowed to begin with a digit and could legally be entirely numeric (see Section 6.1.2.4). However, a valid host name can never have the dotted-decimal form #.#.#.#, since at least the highest-level component label will be alphabetic.
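The "check the string syntactically for a dotted-decimal number before looking it up" step quoted above could be sketched roughly like this. It's in Python purely for illustration; `classify_host` and its return values are names I made up, not anything the RFC defines, and the 0-255 range check on each octet is my own assumption about what "dotted-decimal" should mean.

```python
import re

# Four dot-separated runs of ASCII digits, i.e. the "#.#.#.#" form.
DOTTED_DECIMAL = re.compile(r"([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)")


def classify_host(text: str) -> str:
    """Hypothetical helper: return 'address', 'invalid-address', or 'name'."""
    match = DOTTED_DECIMAL.fullmatch(text)
    if match is None:
        # Not of the form #.#.#.#, so hand it on to host name checks / DNS.
        return "name"
    octets = [int(part) for part in match.groups()]
    # Assumption: each octet must fit in 0-255 to count as an IP address.
    if all(octet <= 255 for octet in octets):
        return "address"
    return "invalid-address"
```

Anything that comes back as 'name' still needs the host name checks; this only answers the "is it a dotted-decimal number?" question first, as the RFC suggests.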

From RFC 2181:

Note however, that the various applications that make use of DNS data can have restrictions imposed on what particular values are acceptable in their environment. For example, that any binary label can have an MX record does not imply that any binary name can be used as the host part of an e-mail address. Clients of the DNS can impose whatever restrictions are appropriate to their circumstances on the values they use as keys for DNS lookup requests, and on the values returned by the DNS. If the client has such restrictions, it is solely responsible for validating the data from the DNS to ensure that it conforms before it makes any use of that data.

Here some confusion creeps in. RFC 2181 grants some leniency: "any binary label" needs to be served properly by DNS, but is not necessarily allowed as a hostname in a client application such as SMTP.

Then RFC 2822 specifies RFC 1035 instead of RFC 1123 as the authoritative DNS RFC. Yet RFC 2822 also states that a domain name for either a hostname or a mail exchanger (MX) name is in the terms of STD 3 (RFCs 1122 and 1123), STD 4 (which is reserved for routing topics and represents RFC 1812 for IPv4 routing), and STD 14 (which is historical). RFC 2822 defers to RFC 2821 for further information about domain names.

RFC 2821 claims to clarify some of RFC 1123 for purposes of SMTP email and states that domain names for SMTP are limited to letters, digits, and hyphen and MUST NOT contain anything else (and specifically not underscore). It makes no mention of whether or not digits may lead or be the entirety of a label as mentioned in RFC 1123. It lists RFC 1035 in examples and suggests being conservative in DNS naming, but does not actually specify a SHOULD or MUST in relation to preferring RFC 1123 (which along with RFC 1122 is part of the Internet standard STD 3) or RFC 1035.

So allowing domain parts other than the top-level domain to start with digits is not just common practice; it reflects an actual attempt to follow the standards. As I said before, your company's own decisions are another thing entirely, so long as they are internal or clearly documented. You wouldn't be in violation of the RFCs if you were to allow domain name parts to start with digits, though, unless you allow four all-digit sections in a row or allow top-level domains to start with a digit.
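To make that concrete, here is a minimal sketch of a domain check along those lines. It's in Python only as an illustration, and `is_acceptable_domain` is a hypothetical name. The "no leading or trailing hyphen" rule comes from the older host name syntax rather than from anything quoted above, and requiring at least one dot is purely my assumption for a registration form.

```python
import re

# One label: ASCII letters, digits, and hyphen only (no underscore),
# 1 to 63 characters, not starting or ending with a hyphen.
# A leading digit is allowed, per the RFC 1123 relaxation.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_acceptable_domain(domain: str) -> bool:
    """Hypothetical helper: True if `domain` passes the checks sketched here."""
    if not domain or len(domain) > 255:
        return False
    labels = domain.split(".")
    if len(labels) < 2:
        # Assumption: a bare single label is not useful in an e-mail address.
        return False
    if not all(LABEL.fullmatch(label) for label in labels):
        return False
    # The top-level label must start with a letter. This also rules out
    # anything in dotted-decimal form, since its last label is all digits.
    return labels[-1][0].isalpha()
```

With this, names like 3com.com pass, while 192.0.2.1, foo_bar.example, and anything with a top-level label starting with a digit are rejected.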

Replies are listed 'Best First'.
Re^2: Practical e-mail address validation (other RFCs)
by tye (Sage) on Sep 20, 2008 at 02:05 UTC

    Thanks very much for that thorough summary of related RFCs that I hadn't reviewed recently. It was quite helpful and informative.

    My main take-away is that my guess based on observations was pretty accurate. I think I'll choose to be encouraged by the statement that domains for e-mail addresses are restricted to mundane characters and to take it as reason to ignore (for now) the hints that arbitrary binary labels might need to be supported.

    As for the comments about my decisions being "internal", they are actually decisions about what forms of e-mail addresses we will accept from any user on the internet who wishes to register for some of our services. So it isn't strictly "internal". But I also think that it isn't something strictly covered by these RFCs (we aren't using the system to implement something that talks SMTP, for example). The protocol involved is "accepting text from random users over the internet", so I think some simplification is warranted.

    - tye        

      As a personal problem example (what else from me, right? ;-) I have some email addresses which work fine but are sometimes rejected on registration forms.

      One is a personal email address of the form '_\w+_\@populardomain.example', and gets rejected for the underscores or the leading underscore. My company also holds four domains of the form '\d{2}[a-z]+\.[a-z]{3,4}'.
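For illustration, here is roughly how such addresses trip up an over-strict form check. This is a sketch in Python: the sample addresses and the `STRICT` pattern are made up for the example (not any particular site's validator), and the two "form" patterns are the ones given above with the Perl-style `\@` written as a plain `@`.

```python
import re

# The two address forms described above.
LOCAL_FORM = re.compile(r"_\w+_@populardomain\.example")
DOMAIN_FORM = re.compile(r"\d{2}[a-z]+\.[a-z]{3,4}")

# Hypothetical over-strict check: the local part and every domain label
# must start with a letter.
STRICT = re.compile(r"[A-Za-z][\w.+-]*@(?:[A-Za-z][A-Za-z0-9-]*\.)+[A-Za-z]{2,}")

# Made-up sample addresses matching the two forms.
samples = ["_jdoe_@populardomain.example", "info@42widgets.com"]

for address in samples:
    accepted = STRICT.fullmatch(address) is not None
    print(f"{address}: {'accepted' if accepted else 'rejected'}")
```

Both samples are rejected by the strict pattern even though neither breaks the rules discussed in this thread for the domain part.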

      I get somewhat frustrated with sites that refuse to accept those. It's quite alright in the long run if they offer an alternate way to get registered, such as emailing a support contact or leaving a note for registration support through a form when the address is rejected. Both of those take human intervention, though. I can always use a different email address to sign up for something, but I generally use certain ones to group certain kinds of topics. If I don't find an address that works through some method in a couple of tries, I usually start looking for the competition's website.
