PerlMonks
Don't Use Regular Expressions To Parse IP Addresses!
by ybiC (Prior) on Dec 20, 2002 at 20:53 UTC ( [id://221512]=perltutorial )
This document is intended to provide numerical and networking information on why Perl regular expressions are inappropriate for effectively parsing IPv4 addresses. This document does not provide specific Perl programming examples of proper parsing. A link to one good example can be found in the "See Also" section below.
Regular expressions should not be used to parse IP addresses because IP addresses can be expressed in several forms... Aaaand the format that people generally expect to see is not the one your computer and OS use, nor well-written applications either... Sooooo you aren't really doing what you think you are doing... Aaaaand even if you were, you would still very likely trip over one of many a'remaining trap or snare. The inet_aton() and inet_ntoa() functions from Socket.pm are generally considered to be the best from a very short list of Proper Ways to parse IP addresses, as they readily handle most all valid formats and representations, as well as being part of the standard Perl install.
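As a minimal sketch of the Socket.pm approach named above (not a definitive implementation): the helper names canonical_ip and ip_to_int are made up for illustration, while inet_aton(), inet_ntoa() and unpack 'N' are the standard calls. Be aware that inet_aton() will also try to resolve hostnames, so handing it non-numeric text may trigger a DNS lookup.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

# Hypothetical helper: canonical dotted-quad for any address string
# that inet_aton() accepts, or undef if it accepts nothing.
sub canonical_ip {
    my ($text) = @_;
    return undef unless defined $text && length $text;
    my $packed = inet_aton($text);    # 4 packed bytes, network order
    return undef unless defined $packed;
    return inet_ntoa($packed);
}

# Hypothetical helper: the same address as a 32-bit unsigned integer.
sub ip_to_int {
    my ($text) = @_;
    my $packed = inet_aton($text);
    return defined $packed ? unpack('N', $packed) : undef;
}

print canonical_ip('172.31.254.1'), "\n";
print ip_to_int('172.31.254.1'),    "\n";
```

The point of going through the packed form is that every later comparison or calculation works on the 32 bits, not on the text the human typed.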
The term "IP Address" is commonly (mis)understood to mean a number that looks like this: 172.31.254.1 and yes, that is indeed a valid human-readable representation of an IP address. You see, any given IP address is actually a 32-bit binary number, and dotted-quad is merely one convenient convention to ease the strain on binary-impaired people's brains when forced to deal with IP addresses. It provides a mental model that works well within limits, but the way you mentally process those four octets has very little relationship with the way routing and operating system software does.
Other representations exist as well. Perhaps the best summarization I've seen is from one particular version of man inet:
Internet Address

Values specified using dot notation take one of the following four forms:

  4 part - each is interpreted as a byte of data and assigned, from left to right, to the four bytes of an Internet address.
  3 part - last part is interpreted as a 16-bit quantity and placed in the right-most two bytes of the network address. This makes the three-part address format convenient for specifying Class B network addresses as in 128.net.host.
  2 part - the last part is interpreted as a 24-bit quantity and placed in the right-most three bytes of the network address. This makes the two-part address format convenient for specifying Class A network addresses as in net.host.
  1 part - the value is stored directly in the network address without any byte rearrangement.

All numbers supplied as parts in dot notation can be decimal, octal, or hexadecimal, as specified in the C language (i.e., a leading 0x or 0X implies hexadecimal; a leading 0 implies octal; otherwise, the number is interpreted as decimal).

Buuuuuut... if you really want to grok IP addressing you must start by thinking of them as binary, because that's what they really are, regardless of visible or programmatic representations.
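To see the four forms and three number bases from the man page in action, here is a hedged sketch that feeds several spellings of 172.31.254.1 to inet_aton(). The spellings were worked out by hand (254*256 + 1 = 65025, and 31*65536 + 65025 = 2096641), the helper name same_address is invented for this example, and acceptance of the shorter forms depends on your platform's inet_aton() behaving as the man page describes.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

# Hand-computed spellings of one address (an illustration, not a full list).
my @spellings = (
    '172.31.254.1',          # 4 part, decimal
    '172.31.65025',          # 3 part: last part is a 16-bit quantity
    '172.2096641',           # 2 part: last part is a 24-bit quantity
    '2887777793',            # 1 part: the whole 32-bit value
    '0xAC.0x1F.0xFE.0x1',    # hexadecimal parts
    '0254.037.0376.01',      # octal parts
);

# Hypothetical helper: true if every string packs to the same 4 bytes.
sub same_address {
    my @texts = @_;
    my %seen;
    for my $text (@texts) {
        my $packed = inet_aton($text);
        return 0 unless defined $packed;
        $seen{ inet_ntoa($packed) } = 1;
    }
    return keys(%seen) == 1 ? 1 : 0;
}

for my $text (@spellings) {
    my $packed = inet_aton($text);
    printf "%-20s => %s\n", $text,
        defined $packed ? inet_ntoa($packed) : '(not accepted here)';
}
```

A dotted-quad regex would reject most of these strings even though the operating system treats them all as the same address.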
Let's dissect our dotted-quad example from above.

  172.31.254.1

First step - convert each part to binary

  decimal   binary
  172       1010 1100
  31        0001 1111
  254       1111 1110
  1         0000 0001

Line them all up, in original order

  172        31         254        1
  1010 1100  0001 1111  1111 1110  0000 0001

Then strip the spaces, and voila - a real-live IP address in all its 32-bit binary glory! (we'll leave spaces in later examples for clarity)

  10101100000111111111111000000001
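The dissection of 172.31.254.1 can be sketched in a couple of lines of Perl, assuming Socket.pm is available. The helper names ip_to_bits and spaced_bits are made up for this example; unpack with the B32 template is the standard way to get a bit string out of packed bytes.

```perl
use strict;
use warnings;
use Socket qw(inet_aton);

# Hypothetical helper: the 32 bits of an address as a string of 0s and 1s.
sub ip_to_bits {
    my ($ip) = @_;
    my $packed = inet_aton($ip);
    return undef unless defined $packed;
    return unpack 'B32', $packed;
}

# Hypothetical helper: insert a space after every 4 bits for readability.
sub spaced_bits {
    my ($bits) = @_;
    return join ' ', unpack '(A4)*', $bits;
}

my $bits = ip_to_bits('172.31.254.1');
print "$bits\n";
print spaced_bits($bits), "\n";
```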
The 32 bit binary number above is what IOS, *nix, or Win32 use to route or process. Given that, trying to parse while still in dotted-quad form can give unintended results.
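One small illustration of such unintended results, offered as a sketch and not the only failure mode: sorting addresses as text versus sorting them as the numbers they really are. The sample addresses and the helper names are invented for this example. Packed addresses are four bytes in network order, so a plain string comparison of the packed form matches numeric order.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

# Hypothetical sample data.
my @addresses = qw(10.0.0.10 10.0.0.2 9.255.255.255 10.0.0.1);

# Treating dotted-quads as text: '10.0.0.10' sorts before '10.0.0.2'.
sub sort_as_text {
    return sort @_;
}

# Treating them as addresses: pack first, compare bytes, unpack last.
sub sort_as_addresses {
    return map  { inet_ntoa($_) }
           sort { $a cmp $b }
           grep { defined }
           map  { inet_aton($_) } @_;
}

print "text:    @{[ sort_as_text(@addresses) ]}\n";
print "numeric: @{[ sort_as_addresses(@addresses) ]}\n";
```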
If this section makes any sense at all, then you likely see why binary Is The One True Way and how dotted-quad Is The Path To Perdition.

Basic rules for address blocks, whether classful or classless:
Basic rules for (sub|super)netmasking:
  128   1000 0000
  192   1100 0000
  224   1110 0000
  240   1111 0000
  248   1111 1000
  252   1111 1100
  255   1111 1111

The Classful Netmasks

  255.0.0.0         /8    11111111 00000000 00000000 00000000
  255.255.0.0       /16   11111111 11111111 00000000 00000000
  255.255.255.0     /24   11111111 11111111 11111111 00000000

Some Popular Classless Netmasks

  255.255.255.224   /27   11111111 11111111 11111111 11100000
  255.255.255.248   /29   11111111 11111111 11111111 11111000
  255.255.255.252   /30   11111111 11111111 11111111 11111100

A common example of classful netmasking

  netmask        255.255.255.0     11111111 11111111 11111111 00000000
  bitwise        /24
  network        192.168.0.0       11000000 10101000 00000000 00000000
  first usable   192.168.0.1                                  00000001
  last usable    192.168.0.254                                11111110
  broadcast      192.168.0.255                                11111111
  254 usable addresses
  corporate office LAN

Examples of variable length subnet masking (VLSM) (classless)

  netmask        255.255.255.224   11111111 11111111 11111111 11100000
  bitwise        /27
  network        172.31.254.0      10101100 00011111 11111110 00000000
  first usable   172.31.254.1                                    00001
  last usable    172.31.254.30                                   11110
  broadcast      172.31.254.31                                   11111
  30 usable addresses
  remote office LAN

  netmask        255.255.255.248   11111111 11111111 11111111 11111000
  bitwise        /29
  network        172.31.254.32     10101100 00011111 11111110 00100000
  first usable   172.31.254.33                                     001
  last usable    172.31.254.38                                     110
  broadcast      172.31.254.39                                     111
  6 usable addresses
  tiny remote LAN

  netmask        255.255.255.252   11111111 11111111 11111111 11111100
  bitwise        /30
  network        172.31.254.40     10101100 00011111 11111110 00101000
  first usable   172.31.254.41                                      01
  last usable    172.31.254.42                                      10
  broadcast      172.31.254.43                                      11
  2 usable addresses
  point-to-point WAN link or PC+router telecommuter LAN
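The netmask examples above boil down to a few bitwise operations on the 32-bit value. Here is a minimal sketch, assuming a prefix length between 1 and 30 (the range the examples cover); subnet_info and int_to_ip are hypothetical names made up for this illustration, not part of Socket.pm.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

# Hypothetical helper: 32-bit integer back to dotted-quad.
sub int_to_ip { return inet_ntoa(pack 'N', $_[0]) }

# Hypothetical helper: given any address in a block plus a prefix
# length (assumed 1..30), work out the block's landmarks.
sub subnet_info {
    my ($ip, $prefix) = @_;
    my $packed = inet_aton($ip);
    return undef unless defined $packed && $prefix >= 1 && $prefix <= 30;

    my $addr      = unpack 'N', $packed;
    my $mask      = (0xFFFFFFFF << (32 - $prefix)) & 0xFFFFFFFF;  # ones, then zeros
    my $network   = $addr & $mask;                                # host bits cleared
    my $broadcast = $network | (~$mask & 0xFFFFFFFF);             # host bits set

    return {
        netmask      => int_to_ip($mask),
        network      => int_to_ip($network),
        first_usable => int_to_ip($network + 1),
        last_usable  => int_to_ip($broadcast - 1),
        broadcast    => int_to_ip($broadcast),
        usable       => $broadcast - $network - 1,
    };
}

my $info = subnet_info('172.31.254.35', 29);
printf "%-13s %s\n", $_, $info->{$_}
    for qw(netmask network first_usable last_usable broadcast usable);
```

Note that nothing here looks at dots or octets: the mask is applied to the whole 32-bit number, which is why a /27 or /29 boundary in the middle of an octet is no harder than a /24.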
None of these addresses should be assigned to Internet-attached client device interfaces, and most should not appear on private network client devices.

zero
  n.n.n.0            class C network n.n.n
  n.n.0.0            class B network n.n
  n.0.0.0            class A network n
  0.0.0.0            default route

broadcast
  n.n.n.255          all hosts on class C network n.n.n
  n.n.255.255        all hosts on class B network n.n
  n.255.255.255      all hosts on class A network n
  255.255.255.255    all hosts on whatever network I happen to be on

Loopback
  127.0.0.0 through 127.255.255.255
  127.0.0.0/8

Link Local
  169.254.0.0 through 169.254.255.255
  169.254.0.0/16

TEST-NET
  192.0.2.0 through 192.0.2.255
  192.0.2.0/24

Class D - multicast (239.0.0.0/8 is the administratively scoped block)
  224.0.0.0 through 239.255.255.255
  224.0.0.0/4  netmask 240.0.0.0
  224.0.0.0/8  netmask 255.0.0.0
  225.0.0.0/8  226.0.0.0/8  227.0.0.0/8  228.0.0.0/8  229.0.0.0/8
  230.0.0.0/8  231.0.0.0/8  232.0.0.0/8  233.0.0.0/8  234.0.0.0/8
  235.0.0.0/8  236.0.0.0/8  237.0.0.0/8  238.0.0.0/8  239.0.0.0/8

Class E - reserved for future use
  240.0.0.0 through 255.255.255.255
  240.0.0.0/4  netmask 240.0.0.0
  240.0.0.0/8  netmask 255.0.0.0
  241.0.0.0/8  242.0.0.0/8  243.0.0.0/8  244.0.0.0/8  245.0.0.0/8
  246.0.0.0/8  247.0.0.0/8  248.0.0.0/8  249.0.0.0/8  250.0.0.0/8
  251.0.0.0/8  252.0.0.0/8  253.0.0.0/8  254.0.0.0/8  255.0.0.0/8
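Checking an address against the CIDR blocks listed above is another job for masks. The following is a rough sketch covering only the five blocks that are given with a prefix length; the table name @SPECIAL and the function special_range are hypothetical, and the per-network zero and broadcast addresses are left out because they depend on the netmask in use.

```perl
use strict;
use warnings;
use Socket qw(inet_aton);

# Name, network address, prefix length: taken from the list above.
my @SPECIAL = (
    [ 'Loopback',   '127.0.0.0',   8  ],
    [ 'Link Local', '169.254.0.0', 16 ],
    [ 'TEST-NET',   '192.0.2.0',   24 ],
    [ 'Class D',    '224.0.0.0',   4  ],
    [ 'Class E',    '240.0.0.0',   4  ],
);

# Hypothetical helper: name of the special block holding $ip, or undef.
sub special_range {
    my ($ip) = @_;
    my $packed = inet_aton($ip);
    return undef unless defined $packed;
    my $addr = unpack 'N', $packed;

    for my $range (@SPECIAL) {
        my ($name, $net, $prefix) = @$range;
        my $mask = (0xFFFFFFFF << (32 - $prefix)) & 0xFFFFFFFF;
        return $name if ($addr & $mask) == unpack('N', inet_aton($net));
    }
    return undef;
}

for my $ip (qw(127.0.0.1 169.254.10.20 224.0.0.5 172.31.254.1)) {
    printf "%-15s %s\n", $ip, special_range($ip) // 'not in the list';
}
```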
Big phat props to the following monks:
example:

  dotted quad   192        9          200        1
  binary        1100 0000  0000 1001  1100 1000  0000 0001

* indicates valid netmask

   0 0000 0000     64 0100 0000   *128 1000 0000*  *192 1100 0000*
   1 0000 0001     65 0100 0001    129 1000 0001    193 1100 0001
   2 0000 0010     66 0100 0010    130 1000 0010    194 1100 0010
   3 0000 0011     67 0100 0011    131 1000 0011    195 1100 0011
   4 0000 0100     68 0100 0100    132 1000 0100    196 1100 0100
   5 0000 0101     69 0100 0101    133 1000 0101    197 1100 0101
   6 0000 0110     70 0100 0110    134 1000 0110    198 1100 0110
   7 0000 0111     71 0100 0111    135 1000 0111    199 1100 0111
   8 0000 1000     72 0100 1000    136 1000 1000    200 1100 1000
   9 0000 1001     73 0100 1001    137 1000 1001    201 1100 1001
  10 0000 1010     74 0100 1010    138 1000 1010    202 1100 1010
  11 0000 1011     75 0100 1011    139 1000 1011    203 1100 1011
  12 0000 1100     76 0100 1100    140 1000 1100    204 1100 1100
  13 0000 1101     77 0100 1101    141 1000 1101    205 1100 1101
  14 0000 1110     78 0100 1110    142 1000 1110    206 1100 1110
  15 0000 1111     79 0100 1111    143 1000 1111    207 1100 1111
  16 0001 0000     80 0101 0000    144 1001 0000    208 1101 0000
  17 0001 0001     81 0101 0001    145 1001 0001    209 1101 0001
  18 0001 0010     82 0101 0010    146 1001 0010    210 1101 0010
  19 0001 0011     83 0101 0011    147 1001 0011    211 1101 0011
  20 0001 0100     84 0101 0100    148 1001 0100    212 1101 0100
  21 0001 0101     85 0101 0101    149 1001 0101    213 1101 0101
  22 0001 0110     86 0101 0110    150 1001 0110    214 1101 0110
  23 0001 0111     87 0101 0111    151 1001 0111    215 1101 0111
  24 0001 1000     88 0101 1000    152 1001 1000    216 1101 1000
  25 0001 1001     89 0101 1001    153 1001 1001    217 1101 1001
  26 0001 1010     90 0101 1010    154 1001 1010    218 1101 1010
  27 0001 1011     91 0101 1011    155 1001 1011    219 1101 1011
  28 0001 1100     92 0101 1100    156 1001 1100    220 1101 1100
  29 0001 1101     93 0101 1101    157 1001 1101    221 1101 1101
  30 0001 1110     94 0101 1110    158 1001 1110    222 1101 1110
  31 0001 1111     95 0101 1111    159 1001 1111    223 1101 1111
  32 0010 0000     96 0110 0000    160 1010 0000   *224 1110 0000*
  33 0010 0001     97 0110 0001    161 1010 0001    225 1110 0001
  34 0010 0010     98 0110 0010    162 1010 0010    226 1110 0010
  35 0010 0011     99 0110 0011    163 1010 0011    227 1110 0011
  36 0010 0100    100 0110 0100    164 1010 0100    228 1110 0100
  37 0010 0101    101 0110 0101    165 1010 0101    229 1110 0101
  38 0010 0110    102 0110 0110    166 1010 0110    230 1110 0110
  39 0010 0111    103 0110 0111    167 1010 0111    231 1110 0111
  40 0010 1000    104 0110 1000    168 1010 1000    232 1110 1000
  41 0010 1001    105 0110 1001    169 1010 1001    233 1110 1001
  42 0010 1010    106 0110 1010    170 1010 1010    234 1110 1010
  43 0010 1011    107 0110 1011    171 1010 1011    235 1110 1011
  44 0010 1100    108 0110 1100    172 1010 1100    236 1110 1100
  45 0010 1101    109 0110 1101    173 1010 1101    237 1110 1101
  46 0010 1110    110 0110 1110    174 1010 1110    238 1110 1110
  47 0010 1111    111 0110 1111    175 1010 1111    239 1110 1111
  48 0011 0000    112 0111 0000    176 1011 0000   *240 1111 0000*
  49 0011 0001    113 0111 0001    177 1011 0001    241 1111 0001
  50 0011 0010    114 0111 0010    178 1011 0010    242 1111 0010
  51 0011 0011    115 0111 0011    179 1011 0011    243 1111 0011
  52 0011 0100    116 0111 0100    180 1011 0100    244 1111 0100
  53 0011 0101    117 0111 0101    181 1011 0101    245 1111 0101
  54 0011 0110    118 0111 0110    182 1011 0110    246 1111 0110
  55 0011 0111    119 0111 0111    183 1011 0111    247 1111 0111
  56 0011 1000    120 0111 1000    184 1011 1000   *248 1111 1000*
  57 0011 1001    121 0111 1001    185 1011 1001    249 1111 1001
  58 0011 1010    122 0111 1010    186 1011 1010    250 1111 1010
  59 0011 1011    123 0111 1011    187 1011 1011    251 1111 1011
  60 0011 1100    124 0111 1100    188 1011 1100   *252 1111 1100*
  61 0011 1101    125 0111 1101    189 1011 1101    253 1111 1101
  62 0011 1110    126 0111 1110    190 1011 1110    254 1111 1110
  63 0011 1111    127 0111 1111    191 1011 1111   *255 1111 1111*
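If you'd rather not trust a hand-typed table, any row of the decimal-to-binary chart can be regenerated with sprintf. This is only a sketch; octet_bits is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: an octet (0..255) as two 4-bit groups.
sub octet_bits {
    my ($n) = @_;
    my $bits = sprintf '%08b', $n;    # zero-padded to 8 binary digits
    return substr($bits, 0, 4) . ' ' . substr($bits, 4, 4);
}

printf "%3d %s\n", $_, octet_bits($_) for 109, 172, 237, 255;
```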