
Don't Use Regular Expressions To Parse IP Addresses!

by ybiC (Prior)
on Dec 20, 2002 at 20:53 UTC ( #221512=perltutorial )

Document Scope

This document is intended to provide numerical and networking information on why Perl regular expressions are inappropriate for effectively parsing IPv4 addresses.

This document does not provide specific Perl programming examples of proper parsing.   A link to one good example can be found in the "See Also" section below.

Short Answer

Regular expressions should not be used to parse IP addresses because IP addresses can be expressed in several forms...   Aaaand the format that people generally expect to see is not the one your computer and OS use, nor well-written applications either...   Sooooo you aren't really doing what you think you are doing...   Aaaaand even if you were, you would still very likely trip over one of many a'remaining trap or snare.

The inet_aton() and inet_ntoa() functions from Socket.pm are generally considered to be the best from a very short list of Proper Ways to parse IP addresses, as they readily handle most all valid formats and representations, as well as being part of the standard Perl install.
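
As a rough sketch of what that looks like in practice (the helper name normalize_ip is invented for this illustration and is not part of Socket), a round trip through inet_aton() and inet_ntoa() turns whatever the platform accepts into canonical dotted-quad. Two caveats to keep in mind: inet_aton() will also try to resolve a hostname, so a non-numeric string may trigger a DNS lookup, and exactly which alternate numeric forms are accepted depends on the underlying C library.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

# normalize_ip is a hypothetical helper name, not part of Socket.
# Returns the canonical dotted-quad string, or undef on failure.
sub normalize_ip {
    my ($text) = @_;
    return unless defined $text && length $text;
    my $packed = inet_aton($text);    # 4 packed bytes in network order, or undef
    return unless defined $packed;
    return inet_ntoa($packed);
}

for my $candidate ('172.31.254.1', '0xac1ffe01', '2887777793') {
    my $ip = normalize_ip($candidate);
    printf "%-14s => %s\n", $candidate, defined $ip ? $ip : '(not parsed)';
}
```

Comparing or filtering addresses is best done on the packed value (or its integer form), not on the string that went in.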

Multiple Representations

The term "IP Address" is commonly (mis)understood to mean a number that looks like this:

  172.31.254.1

and yes, that is indeed a valid human-readable representation of an IP address.

You see, any given IP address is actually a 32 bit binary number, and dotted-quad is merely one convenient convention to ease the strain on binary-impaired people's brains when forced to deal with IP addresses.   It provides a mental model that works well within limits, but the way you mentally process those four octets has very little relationship with the way routing and operating system software does.

Other representations include:
    C-style hex   0xac1ffe01
    dotted hex   0xac.0x1f.0xfe.0x01
    decimal   2887777793
    octal   025407777001
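
These alternate spellings can be produced mechanically from the 32-bit value. The following is only a sketch using core pack, unpack and sprintf; the function name ip_representations is hypothetical, and the input is assumed to be a well-formed four-part decimal dotted quad.

```perl
use strict;
use warnings;

# Hypothetical helper: returns several spellings of one IPv4 address.
# Assumes $quad is four decimal parts, each in the range 0..255.
sub ip_representations {
    my ($quad) = @_;
    my @octets = split /\./, $quad;
    my $number = unpack('N', pack('C4', @octets));    # 32-bit unsigned integer
    return (
        c_hex      => sprintf('0x%08x', $number),
        dotted_hex => join('.', map { sprintf '0x%02x', $_ } @octets),
        decimal    => sprintf('%u', $number),
        octal      => sprintf('0%o', $number),
    );
}

my %forms = ip_representations('172.31.254.1');
printf "%-10s %s\n", $_, $forms{$_} for sort keys %forms;
```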

Perhaps the best summary I've seen is from one particular version of man inet:

Internet Address Values specified using dot notation take one of the following four forms:
  a.b.c.d
  a.b.c
  a.b
  a

4 part - each is interpreted as a byte of data and assigned, from left to right, to the four bytes of an Internet address.

3 part - last part is interpreted as a 16-bit quantity and placed in the right-most two bytes of the network address.   This makes the three-part address format convenient for specifying Class B network addresses as in 128.net.host.

2 part - the last part is interpreted as a 24-bit quantity and placed in the right-most three bytes of the network address.   This makes the two-part address format convenient for specifying Class A network addresses as in net.host.

1 part - the value is stored directly in the network address without any byte rearrangement.

All numbers supplied as parts in dot notation can be decimal, octal, or hexadecimal, as specified in the C language (i.e., a leading 0x or 0X implies hexadecimal; a leading 0 implies octal; otherwise, the number is interpreted as decimal).
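
To make those rules concrete, here is a pure-Perl sketch of them. It is an illustration of the man page text quoted above, not a replacement for inet_aton(); the function name parse_inet_notation is made up, and the behaviour on odd input (such as a part with a leading 0 followed by an 8 or 9) is this sketch's own choice.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the quoted rules.
# Returns the address as a 32-bit integer, or undef if it does not parse.
sub parse_inet_notation {
    my ($text) = @_;
    return unless defined $text && length $text;
    my @parts = split /\./, $text, -1;
    return if @parts < 1 || @parts > 4;

    my @values;
    for my $part (@parts) {
        my $value;
        if    ($part =~ /\A0[xX]([0-9A-Fa-f]+)\z/) { $value = hex($1) }
        elsif ($part =~ /\A0([0-7]*)\z/)           { $value = length($1) ? oct($1) : 0 }
        elsif ($part =~ /\A[1-9][0-9]*\z/)         { $value = $part + 0 }
        else                                       { return }
        push @values, $value;
    }

    # The last part fills whatever bytes the leading parts leave over:
    # 8 bits for a.b.c.d, 16 for a.b.c, 24 for a.b, 32 for a.
    my $last      = pop @values;
    my $last_bits = 8 * (4 - @values);
    return if $last >= 2**$last_bits;

    my $address = $last;
    my $shift   = 24;
    for my $byte (@values) {
        return if $byte > 255;
        $address += $byte * 2**$shift;
        $shift   -= 8;
    }
    return $address;
}

for my $form ('172.31.254.1', '172.31.65025', '172.2096641', '0xac1ffe01') {
    my $n = parse_inet_notation($form);
    printf "%-14s => %s\n", $form, defined $n ? $n : '(invalid)';
}
```

All four spellings in the loop describe the same address, which is exactly why matching on the text alone is fragile.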

Buuuuuut... if you really want to grok IP addressing you must start by thinking of them as binary, because that's what they really are, regardless of visible or programmatic representations.

Convert Dotted-Quad to Binary

Let's dissect our dotted-quad example from above.

  172.31.254.1

First step - convert each part to binary
  decimal   binary
    172    1010 1100
    31     0001 1111
    254    1111 1110
    1      0000 0001

Line them all up, in original order
   172         31          254          1
1010 1100   0001 1111   1111 1110   0000 0001

Then strip the spaces, and voila - a real-live IP address in all its 32 bit binary glory!   (we'll leave spaces in later examples for clarity)
  10101100000111111111111000000001
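
The same conversion can be left to pack and unpack. This is a small sketch; quad_to_bits is a hypothetical name, and the input is assumed to be four decimal octets.

```perl
use strict;
use warnings;

# Hypothetical helper: four decimal octets in, 32-character bit string out.
sub quad_to_bits {
    my ($quad) = @_;
    my $packed = pack('C4', split(/\./, $quad));    # four bytes, left to right
    return unpack('B32', $packed);                  # most significant bit first
}

my $bits = quad_to_bits('172.31.254.1');
print "$bits\n";
# The same bits grouped into nibbles for readability.
print join(' ', $bits =~ /(\d{4})/g), "\n";
```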

Traps and Snares

The 32 bit binary number above is what IOS, *nix, or Win32 use to route or process.   Given that, trying to parse while still in dotted-quad form can give unintended results.

  • inclusion of unintended dotted-quad addresses
  • inadvertent inclusion of dotted-quad addresses that are *not* valid as client source/destination
  • unexpected inclusion of addresses in other formats
  • unwitting exclusion of valid addresses in other forms
  • (big|little) endian issues may rise up mightily and smite thee
  • some utilities may want the address in native form (probably not many)

Blocks and Netmasks

If this section makes any sense at all, then you likely see why binary Is The One True Way and how dotted-quad Is The Path To Perdition.

Basic rules for address blocks, whether classful or classless:
  • a block is defined by the combination of starting address and netmask
  • the lowest-numbered address in a block is always the network itself *
  • the highest-numbered address in a block is always broadcast *
  • all the addresses between are usable for hosts
  • each address specifies a physical or virtual interface, not necessarily a host
    (consider a router with multiple LAN and/or WAN interfaces)
*   unless your internal network infrastructure architect has more time than brains and is insanely devious and contrary

Basic rules for (sub|super)netmasking:

  • subnetting divides a single classful block into multiple networks of fewer addresses each
  • supernetting combines multiple classful blocks into a single network of greater number of addresses
  • both are done using netmasks - binary numbers verrrry similar to IP addresses
  • no 0 may precede any 1 in the mask number
  • the ones bits (left side of mask) indicate network
  • the zeros bits (right side of mask) indicate host

Valid Octets for Netmasks
  0     0000 0000
  128   1000 0000
  192   1100 0000
  224   1110 0000
  240   1111 0000
  248   1111 1000
  252   1111 1100
  254   1111 1110
  255   1111 1111

The Classful Netmasks
  255.0.0.0       /8    11111111  00000000  00000000  00000000
  255.255.0.0     /16   11111111  11111111  00000000  00000000
  255.255.255.0   /24   11111111  11111111  11111111  00000000

Some Popular Classless Netmasks
  255.255.255.224   /27   11111111  11111111  11111111  11100000
  255.255.255.248   /29   11111111  11111111  11111111  11111000
  255.255.255.252   /30   11111111  11111111  11111111  11111100
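
The "no 0 may precede any 1" rule and the /n notation in the tables above both reduce to a little bit arithmetic. A sketch follows, with masks held as 32-bit unsigned integers; all three function names are hypothetical, and a Perl built with 64-bit integers is assumed.

```perl
use strict;
use warnings;

# Hypothetical helper: prefix length (0..32) to a 32-bit mask.
sub prefix_to_mask {
    my ($prefix) = @_;
    return 0 if $prefix == 0;
    return (0xFFFFFFFF << (32 - $prefix)) & 0xFFFFFFFF;
}

# Hypothetical helper: true if all the one bits sit to the left of all the zero bits.
sub is_valid_mask {
    my ($mask) = @_;
    my $host_bits = ~$mask & 0xFFFFFFFF;    # the zero bits of the mask, flipped to ones
    # A contiguous mask leaves host bits of the form 2**n - 1,
    # so adding 1 must leave no bit in common with the original.
    return ((($host_bits + 1) & $host_bits) == 0) ? 1 : 0;
}

# Hypothetical helper: 32-bit integer to dotted quad.
sub mask_to_quad {
    my ($mask) = @_;
    return join('.', unpack('C4', pack('N', $mask)));
}

printf "/%-2d  %s\n", $_, mask_to_quad(prefix_to_mask($_)) for (8, 16, 24, 27, 29, 30);
```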

A common example of classful netmasking
  netmask       255.255.255.0    11111111  11111111  11111111  00000000
  bitwise       /24
  network       192.168.0.0      11000000  10101000  00000000  00000000
  first usable  192.168.0.1                                    00000001
  last usable   192.168.0.254                                  11111110
  broadcast     192.168.0.255                                  11111111
  254 usable addresses
  corporate office LAN

Examples of variable length subnet masking (VLSM) (classless)
  netmask       255.255.255.224  11111111  11111111  11111111  11100000
  bitwise       /27
  network       172.31.254.0     10101100  00011111  11111110  00000000
  first usable  172.31.254.1                                      00001
  last usable   172.31.254.30                                     11110
  broadcast     172.31.254.31                                     11111
  30 usable addresses
  remote office LAN


  netmask       255.255.255.248  11111111  11111111  11111111  11111000
  bitwise       /29
  network       172.31.254.32    10101100  00011111  11111110  00100000
  first usable  172.31.254.33                                       001
  last usable   172.31.254.38                                       110
  broadcast     172.31.254.39                                       111
  6 usable addresses
  tiny remote LAN


  netmask       255.255.255.252  11111111  11111111  11111111  11111100
  bitwise       /30
  network       172.31.254.40    10101100  00011111  11111110  00101000
  first usable  172.31.254.41                                        01
  last usable   172.31.254.42                                        10
  broadcast     172.31.254.43                                        11
  2 usable addresses
  point-to-point WAN link
  or PC+router telecommuter LAN
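
The rows in these examples all follow from the address and the mask alone. As a sketch of the arithmetic (describe_block and the two conversion helpers are hypothetical names, and the prefix is assumed to be between 1 and 30 so that network, hosts and broadcast all exist):

```perl
use strict;
use warnings;

sub quad_to_int { return unpack('N', pack('C4', split(/\./, $_[0]))) }
sub int_to_quad { return join('.', unpack('C4', pack('N', $_[0]))) }

# Hypothetical helper: describe the block that contains $quad under a /$prefix mask.
# Assumes 1 <= $prefix <= 30.
sub describe_block {
    my ($quad, $prefix) = @_;
    my $mask      = (0xFFFFFFFF << (32 - $prefix)) & 0xFFFFFFFF;
    my $network   = quad_to_int($quad) & $mask;            # host bits all zero
    my $broadcast = $network | (~$mask & 0xFFFFFFFF);      # host bits all one
    my %block = (
        netmask      => int_to_quad($mask),
        network      => int_to_quad($network),
        first_usable => int_to_quad($network + 1),
        last_usable  => int_to_quad($broadcast - 1),
        broadcast    => int_to_quad($broadcast),
        usable       => $broadcast - $network - 1,
    );
    return \%block;
}

my $block = describe_block('172.31.254.35', 29);
print "$_: $block->{$_}\n"
    for qw(netmask network first_usable last_usable broadcast usable);
```

Note that every step works on the 32-bit integer; the dotted quad only appears when printing.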

Invalid Client Addresses

None of these addresses should be assigned to Internet-attached client device interfaces, and most should not appear on private network client devices.

zero
  n.n.n.0         class C network n.n.n
  n.n.0.0         class B network n.n
  n.0.0.0         class A network n
  0.0.0.0         default route
  
broadcast
  n.n.n.255        all hosts on class C network n.n.n
  n.n.255.255      all hosts on class B network n.n
  n.255.255.255    all hosts on class A network n
  255.255.255.255  all hosts on whatever network I happen to be on

Loopback
  127.0.0.0 through 127.255.255.255
    127.0.0.0/8

Link Local
  169.254.0.0 through 169.254.255.255
    169.254.0.0/16

TEST-NET
  192.0.2.0 through 192.0.2.255
    192.0.2.0/24

Class D - multicast
  224.0.0.0 through 239.255.255.255
  224.0.0.0/4   netmask 240.0.0.0
    224.0.0.0/8   netmask 255.0.0.0
    225.0.0.0/8
    226.0.0.0/8
    227.0.0.0/8
    228.0.0.0/8
    229.0.0.0/8
    230.0.0.0/8
    231.0.0.0/8
    232.0.0.0/8
    233.0.0.0/8
    234.0.0.0/8
    235.0.0.0/8
    236.0.0.0/8
    237.0.0.0/8
    238.0.0.0/8
    239.0.0.0/8

Class E - reserved for future use (experimental)
  240.0.0.0 through 255.255.255.255
  240.0.0.0/4   netmask 240.0.0.0
    240.0.0.0/8   netmask 255.0.0.0
    241.0.0.0/8
    242.0.0.0/8
    243.0.0.0/8
    244.0.0.0/8
    245.0.0.0/8
    246.0.0.0/8
    247.0.0.0/8
    248.0.0.0/8
    249.0.0.0/8
    250.0.0.0/8
    251.0.0.0/8
    252.0.0.0/8
    253.0.0.0/8
    254.0.0.0/8
    255.0.0.0/8
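
Because every fixed block listed above is just a base address plus a prefix length, checking membership comes down to one mask-and-compare on the 32-bit value. The following is a table-driven sketch only; the function names and the informal labels are invented for illustration, and the table covers only the fixed blocks, not the mask-dependent network and broadcast addresses.

```perl
use strict;
use warnings;

# Fixed blocks from the lists above: [ base address, prefix length, informal label ].
# Order matters: the two /32 entries are tested before the wider blocks.
my @SPECIAL_BLOCKS = (
    [ '0.0.0.0',         32, 'default route' ],
    [ '255.255.255.255', 32, 'broadcast' ],
    [ '127.0.0.0',        8, 'loopback' ],
    [ '169.254.0.0',     16, 'link local' ],
    [ '192.0.2.0',       24, 'TEST-NET' ],
    [ '224.0.0.0',        4, 'class D' ],
    [ '240.0.0.0',        4, 'class E' ],
);

sub quad_to_int { return unpack('N', pack('C4', split(/\./, $_[0]))) }

# Hypothetical helper: label of the first block containing $quad, or undef.
sub special_block {
    my ($quad) = @_;
    my $address = quad_to_int($quad);
    for my $block (@SPECIAL_BLOCKS) {
        my ($base, $prefix, $label) = @$block;
        my $mask = (0xFFFFFFFF << (32 - $prefix)) & 0xFFFFFFFF;
        return $label if ($address & $mask) == quad_to_int($base);
    }
    return;
}

for my $quad ('127.0.0.1', '169.254.10.20', '239.1.2.3', '172.31.254.1') {
    my $label = special_block($quad);
    printf "%-15s %s\n", $quad, defined $label ? $label : '(no special block)';
}
```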

See Also
Credits

Big phat props to the following monks:

  • jdporter for inspiration for this node from his CB explanation & post of inet_aton() usage
  • jdporter, Zaxo, tye, Mr. Muskrat and jeffa for mondo pre-post critique, additions and corrections
  • MidLifeXis for catching usable-host-count error
  • other monks what shall remain nameless due to witness protection program restrictions
  • some guy named vroom.
Any mistakes, misinformation, or bold-faced lies contained herein are mine, all mine.

Decimal-to-Binary Conversion Chart

example:

dotted quad    192           9           200           1
binary      1100 0000    0000 1001    1100 1000    0000 0001



* indicates valid netmask

 *0    0000 0000*     64   0100 0000     *128  1000 0000*    *192  1100 0000*
  1    0000 0001      65   0100 0001      129  1000 0001      193  1100 0001
  2    0000 0010      66   0100 0010      130  1000 0010      194  1100 0010
  3    0000 0011      67   0100 0011      131  1000 0011      195  1100 0011
  4    0000 0100      68   0100 0100      132  1000 0100      196  1100 0100
  5    0000 0101      69   0100 0101      133  1000 0101      197  1100 0101
  6    0000 0110      70   0100 0110      134  1000 0110      198  1100 0110
  7    0000 0111      71   0100 0111      135  1000 0111      199  1100 0111
  8    0000 1000      72   0100 1000      136  1000 1000      200  1100 1000
  9    0000 1001      73   0100 1001      137  1000 1001      201  1100 1001
  10   0000 1010      74   0100 1010      138  1000 1010      202  1100 1010
  11   0000 1011      75   0100 1011      139  1000 1011      203  1100 1011
  12   0000 1100      76   0100 1100      140  1000 1100      204  1100 1100
  13   0000 1101      77   0100 1101      141  1000 1101      205  1100 1101
  14   0000 1110      78   0100 1110      142  1000 1110      206  1100 1110
  15   0000 1111      79   0100 1111      143  1000 1111      207  1100 1111
  16   0001 0000      80   0101 0000      144  1001 0000      208  1101 0000
  17   0001 0001      81   0101 0001      145  1001 0001      209  1101 0001
  18   0001 0010      82   0101 0010      146  1001 0010      210  1101 0010
  19   0001 0011      83   0101 0011      147  1001 0011      211  1101 0011
  20   0001 0100      84   0101 0100      148  1001 0100      212  1101 0100
  21   0001 0101      85   0101 0101      149  1001 0101      213  1101 0101
  22   0001 0110      86   0101 0110      150  1001 0110      214  1101 0110
  23   0001 0111      87   0101 0111      151  1001 0111      215  1101 0111
  24   0001 1000      88   0101 1000      152  1001 1000      216  1101 1000
  25   0001 1001      89   0101 1001      153  1001 1001      217  1101 1001
  26   0001 1010      90   0101 1010      154  1001 1010      218  1101 1010
  27   0001 1011      91   0101 1011      155  1001 1011      219  1101 1011
  28   0001 1100      92   0101 1100      156  1001 1100      220  1101 1100
  29   0001 1101      93   0101 1101      157  1001 1101      221  1101 1101
  30   0001 1110      94   0101 1110      158  1001 1110      222  1101 1110
  31   0001 1111      95   0101 1111      159  1001 1111      223  1101 1111
  32   0010 0000      96   0110 0000      160  1010 0000     *224  1110 0000*
  33   0010 0001      97   0110 0001      161  1010 0001      225  1110 0001
  34   0010 0010      98   0110 0010      162  1010 0010      226  1110 0010
  35   0010 0011      99   0110 0011      163  1010 0011      227  1110 0011
  36   0010 0100      100  0110 0100      164  1010 0100      228  1110 0100
  37   0010 0101      101  0110 0101      165  1010 0101      229  1110 0101
  38   0010 0110      102  0110 0110      166  1010 0110      230  1110 0110
  39   0010 0111      103  0110 0111      167  1010 0111      231  1110 0111
  40   0010 1000      104  0110 1000      168  1010 1000      232  1110 1000
  41   0010 1001      105  0110 1001      169  1010 1001      233  1110 1001
  42   0010 1010      106  0110 1010      170  1010 1010      234  1110 1010
  43   0010 1011      107  0110 1011      171  1010 1011      235  1110 1011
  44   0010 1100      108  0110 1100      172  1010 1100      236  1110 1100
  45   0010 1101      109  0110 1101      173  1010 1101      237  1110 1101
  46   0010 1110      110  0110 1110      174  1010 1110      238  1110 1110
  47   0010 1111      111  0110 1111      175  1010 1111      239  1110 1111
  48   0011 0000      112  0111 0000      176  1011 0000     *240  1111 0000*
  49   0011 0001      113  0111 0001      177  1011 0001      241  1111 0001
  50   0011 0010      114  0111 0010      178  1011 0010      242  1111 0010
  51   0011 0011      115  0111 0011      179  1011 0011      243  1111 0011
  52   0011 0100      116  0111 0100      180  1011 0100      244  1111 0100
  53   0011 0101      117  0111 0101      181  1011 0101      245  1111 0101
  54   0011 0110      118  0111 0110      182  1011 0110      246  1111 0110
  55   0011 0111      119  0111 0111      183  1011 0111      247  1111 0111
  56   0011 1000      120  0111 1000      184  1011 1000     *248  1111 1000*
  57   0011 1001      121  0111 1001      185  1011 1001      249  1111 1001
  58   0011 1010      122  0111 1010      186  1011 1010      250  1111 1010
  59   0011 1011      123  0111 1011      187  1011 1011      251  1111 1011
  60   0011 1100      124  0111 1100      188  1011 1100     *252  1111 1100*
  61   0011 1101      125  0111 1101      189  1011 1101      253  1111 1101
  62   0011 1110      126  0111 1110      190  1011 1110     *254  1111 1110*
  63   0011 1111      127  0111 1111      191  1011 1111     *255  1111 1111*

Updates 2003-08-28 2005-10-22

  • additional "special" address blocks in Invalid Client Addresses
  • additional links in See Also
  • more Traps and Snares
  • minor verbiage tweaks
  • fix typos
  • add caveat regarding net=.0 and broadcast=.255

Re: Don't Use Regular Expressions To Parse IP Addresses!
by atcroft (Monsignor) on Dec 20, 2002 at 23:57 UTC

    In private conversations with ybiC, we discussed one of the major problems with trying to use regex-en to test an IP for validity: whether it makes sense to be using it in a particular application. For instance, it would not make sense to use a class D or E address when configuring a PC. Or, in many cases, addresses in RFC-defined private addressing space would not be appropriate, but is the case under examination one where such an address is appropriate? To truly test for validity of the address would thus seem to require knowledge specific to the application and its environment, either coded into the application, or determined by some form of active testing.

    If appropriate, one direction you can go is to remove the user's ability to cause errors by presenting them with a valid grouping of addresses to select from, which is the approach I have taken in one of the applications I have written for work. The listing depends upon the addresses entered into that listing to be valid, and so again the problem raises its head.

    In the case of adding the addresses for the application I mentioned, unfortunately I can only truly depend upon the vigilance of those administrators adding data the application will pull from to make sure it is correct and valid, as I can only test for those cases where the data is formatted incorrectly, not where it is valid but inappropriate.

    In discussing with ybiC, there are cases that fall into ranges that can be useful filters, such as the aforementioned class D/E address space, the localhost addressing space, or the RFC-defined private address space. To that end, I offer what I hope are some useful filters that may aid in this. Assuming we have validated that the format is proper (remembering both the "Traps and Snares" and "Multiple Representations" sections above), let us first convert the address in question to a number (my apologies if there are errors on these, as I generally only use the a.b.c.d format). Having done thus, it is now much easier to filter, or convert to whichever format is needed (by doing much the reverse of the ip2bin? functions). Now, sample code.

    # The ip2bin? routines assume decimal parts separated by dots.

    sub ip2bin4 {
        my $ip = shift;    # ip format: a.b.c.d, four 8-bit parts
        return(unpack("N", pack("C4", split(/\./, $ip))));
    }

    sub ip2bin3 {
        my $ip = shift;    # ip format: a.b.c, where c is a 16-bit quantity
        return(unpack("N", pack("C2n", split(/\./, $ip))));
    }

    sub ip2bin2 {
        my $ip = shift;    # ip format: a.b, where b is a 24-bit quantity
        my ($net, $host) = split(/\./, $ip);
        return(($net << 24) + $host);
    }

    sub ip2bin1 {
        my $ip = shift;    # ip format: a - for consistency
        return($ip);
    }

    sub is_rfc_private {
        my $address = shift;
        return(1) if ((0x0A000000 <= $address) and ($address <= 0x0AFFFFFF));    # 10.0.0.0/8
        return(1) if ((0xAC100000 <= $address) and ($address <= 0xAC1FFFFF));    # 172.16.0.0/12
        return(1) if ((0xC0A80000 <= $address) and ($address <= 0xC0A8FFFF));    # 192.168.0.0/16
        return(0);
    }

    sub is_class_d {
        my $address = shift;
        return(1) if ((0xE0000000 <= $address) and ($address <= 0xEFFFFFFF));    # 224.0.0.0/4
        return(0);
    }

    sub is_class_e {
        my $address = shift;
        return(1) if ((0xF0000000 <= $address) and ($address <= 0xFFFFFFFF));    # 240.0.0.0/4
        return(0);
    }

    sub is_localhost {
        my $address = shift;
        return(1) if ((0x7F000000 <= $address) and ($address <= 0x7FFFFFFF));    # 127.0.0.0/8
        return(0);
    }

    sub is_linklocal {
        my $address = shift;
        return(1) if ((0xA9FE0000 <= $address) and ($address <= 0xA9FEFFFF));    # 169.254.0.0/16
        return(0);
    }

    sub is_testnet {
        my $address = shift;
        return(1) if ((0xC0000200 <= $address) and ($address <= 0xC00002FF));    # 192.0.2.0/24
        return(0);
    }

    Other, similar tests I believe could easily be written from this point; these were examples. Admittedly, while I am sure there are probably modules in CPAN to perform tests of this type, I do not know them off-hand, so I welcome the input of others.

    It is important to remember that to truly validate the representation of an IP address, regex-en are but one part, as one must understand the environment in which it is to be used.

    Update: Extended comments in ip2bin? subroutines.

    Update: Fixed bug in test for 172.16.0.0, because of incorrect CIDR (was /16, is /12).

    Update: Added routines for Link-Local (169.254.0.0/16) and TEST-NET (192.0.2.0/24) address ranges.

    Update: Fixed typo in code.

    Update: (17 Mar 2005) Fixed missing '(' in conditions in is_linklocal and is_testnet functions.

Re: Don't Use Regular Expressions To Parse IP Addresses!
by dws (Chancellor) on Dec 21, 2002 at 02:24 UTC
    Regular expressions should not be used to parse IP addresses because IP addresses can be expressed in several forms...

    "Yes, but..."

    There are plenty of situations, such as parsing server logfiles, where four-part IP addresses are a given. In these cases, using a regex is just fine.

    The only case I've actually run across in practice where using a regex isn't safe is when parsing URLs where some spammer has obscured an IP address by using a non-standard form.

Re: Don't Use Regular Expressions To Parse IP Addresses!
by fokat (Deacon) on Mar 06, 2003 at 01:58 UTC

    May I suggest the use of NetAddr::IP for this purpose?

    Best regards

    -lem, but some call me fokat

Re: Don't Use Regular Expressions To Parse IP Addresses!
by Anonymous Monk on Sep 04, 2004 at 01:48 UTC
    I prefer to use inet_ntop/inet_pton instead of inet_ntoa/inet_aton. These functions also accept IPv6 addresses, making porting to IPv6 much easier. In fact, with the right functions it's not too difficult to write family-independent (IPv4 or IPv6) programs. IPv6 users will thank you.
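
A minimal sketch of that family-independent approach, assuming a Socket module recent enough to export inet_pton and inet_ntop and a platform with IPv6 support; parse_any_family is a hypothetical name. Note that inet_pton() is stricter than inet_aton(): for IPv4 it accepts only the four-part dotted-decimal form, so the hex, octal and short forms discussed in the tutorial are rejected.

```perl
use strict;
use warnings;
use Socket qw(inet_pton inet_ntop AF_INET AF_INET6);

# Hypothetical helper: returns (family_name, canonical_text), or an empty list
# if the text is not a numeric address in either family.
sub parse_any_family {
    my ($text) = @_;
    for my $family ([ AF_INET, 'IPv4' ], [ AF_INET6, 'IPv6' ]) {
        my ($af, $name) = @$family;
        my $packed = inet_pton($af, $text);    # undef if not valid for this family
        return ($name, inet_ntop($af, $packed)) if defined $packed;
    }
    return;
}

for my $text ('172.31.254.1', '::1', '0xac1ffe01') {
    my ($family, $canonical) = parse_any_family($text);
    printf "%-14s => %s\n", $text,
        defined $family ? "$family $canonical" : '(not a numeric address)';
}
```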
Not so Invalid Client Addresses
by Anonymous Monk on Nov 23, 2005 at 22:40 UTC

    The list of invalid addresses you give in the "Invalid Client Addresses" section isn't strictly correct.

    First, there is no such thing as classful addressing. In today's Internet classful means nothing. The subnet mask determines the range of IP addresses that can be considered local (on the same network).

    Second, because of the above x.x.x.0 - or any variation eg x.0.0.0, except 0.0.0.0 - can be a valid IP address. For example the subnet mask 255.255.254.0 means that 2 contiguous /24's are part of the same network. In this case this makes .255 of the first /24 and .0 of the second /24 valid client IP addresses.
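
That claim is easy to check with a few lines of arithmetic. In the sketch below, is_host_address is a hypothetical helper and the 192.168.0.0 addresses are chosen purely for illustration; an address counts as a host address when it is neither the all-zeros nor the all-ones host pattern under the given mask.

```perl
use strict;
use warnings;

sub quad_to_int { return unpack('N', pack('C4', split(/\./, $_[0]))) }

# Hypothetical helper: true if $quad is neither the network nor the
# broadcast address of its block under $mask_quad.
sub is_host_address {
    my ($quad, $mask_quad) = @_;
    my $address   = quad_to_int($quad);
    my $mask      = quad_to_int($mask_quad);
    my $network   = $address & $mask;
    my $broadcast = $network | (~$mask & 0xFFFFFFFF);
    return ($address != $network && $address != $broadcast) ? 1 : 0;
}

for my $quad ('192.168.0.0', '192.168.0.255', '192.168.1.0', '192.168.1.255') {
    printf "%-14s /23: %d   /24: %d\n", $quad,
        is_host_address($quad, '255.255.254.0'),
        is_host_address($quad, '255.255.255.0');
}
```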

    It is fair to say that some OSes seem to have a problem with this. But the fact remains the .0 and .255 addresses can be used by clients. It just all depends on the subnet mask.

    Other than that a quality rant. ;)

Re: Don't Use Regular Expressions To Parse IP Addresses!
by sieve (Initiate) on May 18, 2007 at 20:25 UTC
    If you do use a regular expression to parse IPv4 in decimal, you could use:
    (1?\d\d?|2[0-4]\d|25[0-5])(\.(1?\d\d?|2[0-4]\d|25[0-5])){3}
      Not really - the above regex is wrong. Try it on 10.76.110.219 and it will match only 10.76.110.21 (since 1?\d\d? will match 21 and stop and not look for the 2[0-4]\d rule)
        It works inside the correct context. For example, if you supply a leading ^ and trailing $ then all is well. You have to supply your own context.
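
The difference between the two readings can be shown directly. This sketch reuses the alternation exactly as posted above and only varies the context; the variable names are made up for the example.

```perl
use strict;
use warnings;

my $octet = qr/(?:1?\d\d?|2[0-4]\d|25[0-5])/;    # alternation as posted above
my $quad  = qr/$octet(?:\.$octet){3}/;

my $input = '10.76.110.219';

# Unanchored: the first alternative is happy with "21", so the match stops early.
my ($unanchored) = $input =~ /($quad)/;

# Anchored: \z forces the engine to backtrack into the 2[0-4]\d alternative.
my $anchored = ($input =~ /\A$quad\z/) ? 1 : 0;

print "unanchored capture: $unanchored\n";
print "anchored match:     $anchored\n";
```

Even anchored, this only recognises the four-part decimal form, so the other representations described in the tutorial still slip past it.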
Re: Don't Use Regular Expressions To Parse IP Addresses!
by Anonymous Monk on Oct 27, 2008 at 03:43 UTC
    Something seems to be off on the example of dotted quad representation to binary. If you start with the example of 172.31.254.1 you should stick to it. This has the binary representation of 1010 1100 0001 1111 1111 1110 0000 0001 or 2887777793 in radix 10 (unlike 2887726590 as was stated earlier, which in fact corresponds to 172.31.53.254, which is also what the example continues to assume the ip was, for some reason).
Re: Don't Use Regular Expressions To Parse IP Addresses!
by Anonymous Monk on Jun 26, 2009 at 02:28 UTC
