PerlMonks  

Re: Beast of the Number: Parsing the Feral Phone

by demerphq (Chancellor)
on Apr 17, 2002 at 16:21 UTC (#159874=note)


in reply to Beast of the Number: Parsing the Feral Phone

Big time ++ dude!

A couple of quick comments before I start running your code against the 10 million German CLIs (calling line identifiers) that I have access to, and the 100k or so UK numbers that are on hand as well.

Regarding parsing extensions: in some countries (like Germany) you aren't allowed to have extensions. I believe this is because the authorities need to be able to uniquely identify the location of every handset in the country. This of course means that if you can find the list of countries that have such a law, you can simplify the logic for parsing out extensions.

Regarding number formats, I believe that you can take advantage of the +1 code. All of these numbers follow a 3-3-4 pattern (with an optional extension), so they should be easy to parse. OTOH Germany uses a floating format: anywhere from 6 digits (maybe fewer!) for a local number to a full-blown 14 characters (including the +, country code and area code) for my own phone number, and they can get longer.
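To make the 3-3-4 point concrete, here is a minimal sketch of a +1 parser. The name parse_nanp, the set of accepted separators and the "x"/"ext" extension markers are my own assumptions for illustration, not anything taken from the original node's code.

```perl
use strict;
use warnings;

# Hypothetical helper: split a +1 number into its 3-3-4 parts.
# Returns a hash ref, or undef if the input does not fit the pattern.
sub parse_nanp {
    my ($raw) = @_;
    return unless $raw =~ m{
        ^\s*
        (?:\+?1[\s.-]*)?               # optional country code, with or without +
        \(?(\d{3})\)?[\s.-]*           # 3-digit area code, optionally in parens
        (\d{3})[\s.-]*                 # 3-digit exchange
        (\d{4})                        # 4-digit line number
        (?:\s*(?:x|ext\.?)\s*(\d+))?   # optional extension (assumed markers)
        \s*$
    }xi;
    return { area => $1, exchange => $2, line => $3, ext => $4 };
}

my $n = parse_nanp('+1 (212) 555-0123 x45');
print "$n->{area} $n->{exchange} $n->{line} ext $n->{ext}\n";
```

Anything that does not fit the pattern (a German number, say) simply comes back undefined, so it can be handed to some other country-specific routine.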

Which brings me to area codes. These are/should be easy to parse in the +1 area. But there's no way to do so in a country that uses floating-length area codes (like Germany, with 2-5 digit area codes) short of knowing the full list for that country. Of course that's not really feasible considering that Germany alone has 5226 of them... (I know, I converted the DTAG list into the AOC data used on our switches...) (Actually I've always thought it interesting that Germany has so many, but the entire NA uses fewer than a thousand. I guess that's why extensions are so common in NA: to work around the (currently) antiquated telecoms industry that is the result of NA's early lead in the area.)
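If you do have the full list for a country, splitting off a floating-length area code becomes a table lookup. Below is a rough sketch, assuming the national number has already had its trunk 0 and country code removed. The helper name split_area_code and the nine-code sample table are made up for illustration; a real table would hold every code.

```perl
use strict;
use warnings;

# Tiny illustrative subset of 2-5 digit German area codes.
my %area_code = map { $_ => 1 } qw(30 40 69 89 201 211 221 2041 33051);

# Hypothetical helper: return (area code, subscriber number), trying the
# longest candidate prefix first. Returns an empty list if nothing matches.
sub split_area_code {
    my ($national) = @_;
    for my $len (reverse 2 .. 5) {
        next if length($national) <= $len;   # must leave a subscriber part
        my $prefix = substr($national, 0, $len);
        return ($prefix, substr($national, $len)) if $area_code{$prefix};
    }
    return;
}

my ($code, $subscriber) = split_area_code('204112345');
print "$code / $subscriber\n";
```

Trying the longest prefix first is just a cautious default here; if the code list is prefix-free, any order gives the same answer.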

Anyway, these are just quick off-the-cuff comments. A node this big and serious will need a lot more time for thought.

Big ++ once again!

Oh, BTW, here's a list of the German area codes in ranged form (i.e. 2051-2054 means 2051, 2052, 2053, 2054).

:-)

<super>

my @zones=qw( 201-203  2041  2043  2045  2051-2054  2056  2058  2064-2066  208-209  2102-2104  211  2120-2129  2131-2133 
 2137  214  2150-2154  2156-2159  2161-2166  2171  2173-2175  2181-2183  2191-2193  2195-2196  2202-2208  221  2222-2228 
 2232-2238  2241-2248  2251-2257  2261-2269  2271-2275  228  2291-2297  2301-2309  231  2323-2325  2327  2330-2339  234  2351-2355 
 2357-2369  2371-2375  2377-2379  2381-2385  2387-2389  2391-2395  2401-2409  241  2421-2429  2431-2436  2440-2441  2443-2449 
 2451-2456  2461-2465  2471-2474  2482  2484-2486  2501-2502  2504-2509  251  2520-2529  2532-2536  2538  2541-2543  2545-2548 
 2551-2558  2561-2568  2571-2575  2581-2588  2590-2599  2601-2608  261  2620-2628  2630-2639  2641-2647  2651-2657  2661-2664 
 2666-2667  2671-2678  2680-2689  2691-2697  271  2721-2725  2732-2739  2741-2745  2747  2750-2755  2758-2759  2761-2764 
 2770-2779  2801-2804  281  2821-2828  2831-2839  2841-2845  2850-2853  2855-2859  2861-2867  2871-2874  2902-2905  291  2921-2925 
 2927-2928  2931-2935  2937-2938  2941-2945  2947-2948  2951-2955  2957-2958  2961-2964  2971-2975  2977  2981-2985  2991-2994 
 30  3301-3304  33051  33053-33056  3306-3307  33080  33082-33089  33093-33094  331  33200-33209  3321-3322  33230-33235 
 33237-33239  3327-3329  3331-3332  33331-33338  3334-3335  33361-33369  3337-3338  33393-33398  3341-3342  33432-33439  3344 
 33451-33452  33454  33456-33458  3346  33470  33472-33479  335  33601-33609  3361-3362  33631-33638  3364  33652-33657  3366 
 33671-33679  33701-33704  33708  3371-3372  33731-33734  33741-33748  3375  33760  33762-33769  3377-3379  3381-3382  33830-33839 
 33841  33843-33849  3385-3386  33870  33872-33878  3391  33920-33926  33928-33929  33931-33933  3394-3395  33962-33979  33981-33984 
 33986  33989  340-341  34202-34208  3421  34221-34224  3423  34241-34244  3425  34261-34263  34291-34299  3431  34321-34322 
 34324-34325  34327-34328  3433  34341-34348  3435  34361-34364  3437  34381-34386  3441  34422-34426  3443  34441  34443-34446 
 3445  34461-34467  3447-3448  34491-34498  345  34600-34607  34609  3461-3462  34632-34633  34635-34639  3464  34651-34654 
 34656  34658-34659  3466  34671-34673  34691-34692  3471  34721-34722  3473  34741-34743  34745-34746  3475-3476  34771-34776 
 34779  34781-34783  34785  34901  34903-34907  34909  3491  34920-34929  3493-3494  34953-34956  3496  34973  34975-34979 
 3501  35020-35028  35032-35033  3504  35052-35058  351  35200-35209  3521-3523  35240-35249  3525  35263-35268  3528-3529 
 3531  35322-35327  35329  3533  35341-35343  3535  35361-35365  3537  35383-35389  3541-3542  35433-35436  35439  3544  35451-35456 
 3546  35471-35478  355  35600-35609  3561-3564  35691-35698  3571  35722-35728  3573-3574  35751-35756  3576  35771-35775 
 3578  35792-35793  35795-35797  3581  35820  35822-35823  35825-35829  3583  35841-35844  3585-3586  35872-35877  3588  35891-35895 
 3591-3592  35930-35939  3594  35951-35955  3596  35971  35973-35975  3601  36020-36029  3603  36041-36043  3605-3606  36071-36072 
 36074-36077  36081-36085  36087  361  36200-36209  3621-3624  36252-36259  3628-3629  3631-3632  36330-36338  3634-3636 
 36370-36379  3641  36421-36428  3643-3644  36450-36454  36458-36459  36461-36465  3647  36481-36484  365  36601-36608  3661 
 36621-36626  36628  3663  36640  36642-36649  36651-36653  36691-36695  36701-36705  3671-3672  36730-36739  36741-36744 
 3675  36761-36762  36764  36766  3677  36781-36785  3679  3681-3683  36840-36849  3685-3686  36870-36871  36873-36875  36878 
 3691  36920-36929  3693  36940-36941  36943-36949  3695  36961-36969  371  37200  37202-37204  37206-37209  3721-3727  37291-37298 
 3731  37320-37329  3733  37341-37344  37346-37349  3735  37360-37369  3737  37381-37384  3741  37421-37423  37430-37439 
 3744-3745  37462-37465  37467-37468  375  37600-37609  3761-3765  3771-3774  37752  37754-37757  381  38201-38209  3821 
 38220-38229  38231-38234  38292-38297  38300-38309  3831  38320-38328  38331-38334  3834  38351-38356  3836  38370-38379 
 3838  38391-38393  3841  38422-38429  3843-3844  38450-38459  38461-38462  38464  38466  3847  38481-38486  38488  385  3860-3861 
 3863  3865-3869  3871  38720-38729  38731-38733  38735-38738  3874  38750-38759  3876-3877  38780-38785  38787-38789  38791-38794 
 38796-38797  3881  38821-38828  3883  38841-38845  38847-38848  38850-38856  38858-38859  3886  38871-38876  39000-39009 
 3901-3902  39030-39039  3904  39050-39059  39061-39062  3907  39080-39089  3909  391  39200-39209  3921  39221-39226  3923 
 39241-39248  3925  39262-39268  3928  39291-39298  3931  39320-39325  39327-39329  3933  39341-39349  3935  39361-39366 
 3937  39382-39384  39386-39409  3941  39421-39428  3943-3944  39451-39459  3946-3947  39481-39485  39487-39489  3949  395 
 39600-39608  3961-3969  3971  39721-39724  39726-39728  3973  39740-39749  39751-39754  3976  39771-39779  3981  39820-39829 
 39831-39833  3984  39851-39859  39861-39863  3987  39881-39889  3991  39921-39929  39931-39934  3994  39951-39957  39959 
 3996  39971-39973  39975-39978  3998  39991-39999  40  4101-4109  4120-4129  4131-4144  4146  4148-4149  4151-4156  4158-4159 
 4161-4169  4171-4189  4191-4195  4202-4209  421  4221-4224  4230-4249  4251-4258  4260-4269  4271-4277  4281-4289  4292-4298 
 4302-4303  4305  4307-4308  431  4320-4324  4326-4340  4342-4344  4346-4349  4351-4358  4361-4367  4371-4372  4381-4385 
 4392-4394  4401-4409  441  4421-4423  4425-4426  4431-4435  4441-4447  4451-4456  4458  4461-4469  4471-4475  4477-4489 
 4491-4499  4501-4506  4508-4509  451  4521-4529  4531-4537  4539  4541-4547  4550-4559  4561-4564  4602-4609  461  4621-4627 
 4630-4639  4641-4644  4646  4651  4661-4668  4671-4674  4681-4684  4702-4708  471  4721-4725  4731-4737  4740-4749  4751-4758 
 4761-4779  4791-4796  4802-4806  481  4821-4830  4832-4839  4841-4849  4851-4859  4861-4865  4871-4877  4881-4885  4892-4893 
 4902-4903  491  4920-4929  4931-4936  4938-4939  4941-4948  4950-4959  4961-4968  4971-4977  5021-5028  5031-5037  5041-5045 
 5051-5056  5060  5062-5069  5071-5074  5082-5086  5101-5103  5105  5108-5109  511  5121  5123  5126-5132  5135-5139  5141-5149 
 5151-5159  5161-5168  5171-5177  5181-5187  5190-5199  5201-5209  521  5221-5226  5228  5231-5238  5241-5242  5244-5248 
 5250-5255  5257-5259  5261-5266  5271-5278  5281-5286  5292-5295  5300-5309  531  5320-5329  5331-5337  5339  5341  5344-5347 
 5351-5358  5361-5368  5371-5379  5381-5384  5401-5407  5409  541  5421-5429  5431-5439  5441-5448  5451-5459  5461-5462 
 5464-5468  5471-5476  5481-5485  5491-5495  5502-5509  551  5520-5525  5527-5529  5531-5536  5541-5546  5551-5556  5561-5565 
 5571-5574  5582-5586  5592-5594  5601-5609  561  5621-5626  5631-5636  5641-5648  5650-5659  5661-5665  5671-5677  5681-5686 
 5691-5696  5702-5707  571  5721-5726  5731-5734  5741-5746  5751-5755  5761  5763-5769  5771-5777  5802-5808  581  5820-5829 
 5831-5846  5848-5855  5857-5859  5861-5865  5872-5875  5882-5883  5901-5909  591  5921-5926  5931-5937  5939  5941-5948 
 5951-5957  5961-5966  5971  5973  5975-5978  6002-6004  6007-6008  6020-6024  6026-6029  6031-6036  6039  6041-6059  6061-6063 
 6066  6068  6071  6073-6074  6078  6081-6087  6092-6096  6101-6109  611  6120  6122-6124  6126-6136  6138-6139  6142  6144-6147 
 6150-6152  6154-6155  6157-6159  6161-6167  6171-6175  6181-6188  6190  6192  6195-6196  6198  6201-6207  6209  6211-6218 
 62190-62199  6220-6224  6226-6229  6231-6239  6241-6247  6249  6251-6258  6261-6269  6271-6272  6274-6276  6281-6287  6291-6298 
 6301-6308  631  6321-6329  6331-6349  6351-6353  6355-6359  6361-6364  6371-6375  6381-6387  6391-6398  6400-6409  641  6420-6436 
 6438-6447  6449  6451-6458  6461-6462  6464-6468  6471-6479  6482-6486  6500-6509  651  6522-6527  6531-6536  6541-6545 
 6550-6559  6561-6569  6571-6575  6578  6580-6589  6591-6597  6599  661  6620-6631  6633-6639  6641-6648  6650-6661  6663-6670 
 6672-6678  6681-6684  6691-6698  6701  6703-6704  6706-6709  671  6721-6728  6731-6737  6741-6747  6751-6758  6761-6766 
 6771-6776  6781-6789  6802-6806  6809  681  6821  6824-6827  6831-6838  6841-6844  6848-6849  6851-6858  6861  6864-6869 
 6871-6876  6881  6887-6888  6893-6894  6897-6898  69  7021-7026  7031-7034  7041-7046  7051-7056  7062-7063  7066  7071-7073 
 7081-7085  711  7121-7136  7138-7139  7141-7148  7150-7154  7156-7159  7161-7166  7171-7176  7181-7184  7191-7195  7202-7204 
 721  7220-7229  7231-7237  7240  7242-7269  7271-7277  7300  7302-7309  731  7321-7329  7331-7337  7340  7343-7348  7351-7358 
 7361-7367  7371  7373-7376  7381-7389  7391-7395  7402-7404  741  7420  7422-7429  7431-7436  7440-7449  7451-7459  7461-7467 
 7471-7478  7482-7486  7502-7506  751  7520  7522  7524-7525  7527-7529  7531-7534  7541-7546  7551-7558  7561-7579  7581-7587 
 7602  761  7620-7629  7631-7636  7641-7646  7651-7657  7660-7669  7671-7676  7681-7685  7702-7709  771  7720-7729  7731-7736 
 7738-7739  7741-7748  7751  7753-7755  7761-7765  7771  7773-7775  7777  7802-7808  781  7821-7826  7831-7839  7841-7844 
 7851-7854  7903-7907  791  7930-7955  7957-7959  7961-7967  7971-7977  8020-8029  8031-8036  8038-8039  8041-8043  8045-8046 
 8051-8057  8061-8067  8071-8076  8081-8086  8091-8095  8102  8104-8106  811  8121-8124  8131  8133-8139  8141-8146  8151-8153 
 8157-8158  8161  8165-8168  8170-8171  8176-8179  8191-8196  8202-8208  821  8221-8226  8230-8234  8236-8239  8241  8243 
 8245-8254  8257-8259  8261-8263  8265-8269  8271-8274  8276  8281-8285  8291-8296  8302-8304  8306  831  8320-8338  8340-8349 
 8361-8370  8372-8389  8392-8395  8402-8407  841  8421-8424  8426-8427  8431-8435  8441-8446  8450  8452-8454  8456-8469 
 8501-8507  8509  851  8531-8538  8541-8558  8561-8565  8571-8574  8581-8586  8591-8593  861  8621-8624  8628-8631  8633-8642 
 8649-8652  8654  8656-8657  8661-8667  8669-8671  8677-8679  8681-8687  8702-8709  871  8721-8728  8731-8735  8741-8745 
 8751-8754  8756  8761-8762  8764-8766  8771-8774  8781-8785  8801-8803  8805-8809  881  8821-8825  8841  8845-8847  8851 
 8856-8858  8860-8862  8867-8869  89  906  9070-9078  9080-9094  9097  9099  9101-9107  911  9120  9122-9123  9126-9129  9131-9135 
 9141-9149  9151-9158  9161-9167  9170-9199  9201-9209  921  9220-9223  9225  9227-9229  9231-9236  9238  9241-9246  9251-9257 
 9260-9289  9292-9295  9302-9303  9305-9307  931  9321  9323-9326  9331-9360  9363-9367  9369  9371-9378  9381-9386  9391-9398 
 9401-9409  941  9420-9424  9426-9429  9431  9433-9436  9438-9439  9441-9448  9451-9454  9461-9469  9471-9474  9480-9482 
 9484  9491-9493  9495  9497-9499  9502-9505  951  9521-9529  9531-9536  9542-9549  9551-9556  9560-9569  9571-9576  9602-9608 
 961  9621-9622  9624-9628  9631-9639  9641-9648  9651-9659  9661-9666  9671-9677  9681-9683  9701  9704  9708  971  9720-9729 
 9732-9738  9741-9742  9744-9749  9761-9766  9771-9779  9802-9805  981  9820  9822-9829  9831-9837  9841-9848  9851-9857 
 9861  9865  9867-9869  9871-9876  9901  9903-9908  991  9920-9929  9931-9933  9935-9938  9941-9948  9951-9956  9961-9966 
 9971-9978 );
</super>
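The ranged form is easy to unpack back into individual codes. Here is a small sketch; expand_zones is a name I made up, and it is run on a four-entry excerpt rather than the full @zones list so that it stands alone.

```perl
use strict;
use warnings;

# Hypothetical helper: expand entries such as "2051-2054" into
# 2051, 2052, 2053, 2054; single codes pass through unchanged.
sub expand_zones {
    my @codes;
    for my $spec (@_) {
        my ($from, $to) = split /-/, $spec;
        $to = $from unless defined $to;   # a lone code is a range of one
        push @codes, $from .. $to;
    }
    return @codes;
}

my @sample = expand_zones(qw(201-203 2041 2051-2054 30));
print scalar(@sample), " codes: @sample\n";
# prints "9 codes: 201 202 203 2041 2051 2052 2053 2054 30"
```

Feeding the whole @zones array to the same routine and counting the result is a quick sanity check against the expected number of codes.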

Yves / DeMerphq
---
Writing a good benchmark isn't as easy as it might look.


Re: Re: Beast of the Number: Parsing the Feral Phone
by mojotoad (Monsignor) on Apr 17, 2002 at 22:25 UTC
    Interesting data regarding Germany -- I had no idea.

    It was for precisely this sort of reason, however, that I made no attempt to try and figure out area codes for numbers from various countries around the world. The result of this parsing could easily be passed along to a country-specific module for more appropriate parsing and beautification.

    There are a couple of things to point out here that I did not mention in the article (due to the 64k limit on nodes). I made no attempt to parse valid IDD prefixes even though lists for each country are available on the net. The reason is that the IDD prefixes are not mutually exclusive with the Country Codes. Nor, unfortunately or unexpectedly, are area codes for a particular locale.

    This reality produces ambiguous areas where I could be slurping up an area code or IDD as a country code. What's needed in *that* case is some concept of the natural phone number length for that locality. Rather than get that specific, though, I relied on a threshold length and size percentages as measured against the remainder of the number. It's not perfect, but for my data set it worked surprisingly well.

    Given your information about the variability of German numbers, particularly the 14-digit monsters, this technique might fail if the area/province codes happen to match valid country codes elsewhere. This of course only applies to numbers that are presented *without* their Country Code.

    Once this code has its hands on what it thinks is the local number, it's just stored as a single number. I chunk it for display purposes, but generically and in a U.S.-centric kind of way: 4 digits on the suffix, preceded by groups of three digits as long as there are digits left.
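A rough sketch of that display chunking, for the curious. This is not the code from the article: chunk_local is a hypothetical name, the "-" separator is an arbitrary default, and I am assuming the three-digit groups are taken right to left, so any leftover short group ends up at the front.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the last 4 digits as the suffix, then peel
# groups of up to 3 digits off the remainder, working right to left.
sub chunk_local {
    my ($digits, $sep) = @_;
    $sep = '-' unless defined $sep;
    return $digits if length($digits) <= 4;

    my $suffix = substr($digits, -4);
    my $rest   = substr($digits, 0, length($digits) - 4);
    my @groups;
    while (length $rest) {
        my $take = length($rest) >= 3 ? 3 : length($rest);
        unshift @groups, substr($rest, -$take);
        $rest = substr($rest, 0, length($rest) - $take);
    }
    return join $sep, @groups, $suffix;
}

print chunk_local('5551234'),  "\n";   # 555-1234
print chunk_local('12345678'), "\n";   # 1-234-5678
```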

    Also, keep in mind that this code is intended to operate on raw, unrestricted data fields. Typos, blippoes (???? or xxxx) and all of it are present in the data. There's not a whole lot you can do in these cases to pull out a valid number without knowing in excruciating detail the particulars of the intended country.

    GIGO, GIGO, it's off to the dumpster we go!

    BTW, I suspect this code might take quite a while to run on 10 million numbers, even if they are well-behaved.

    Thanks for the comments and feedback. Any thoughts on whether any of this should be CPAN-bound once cleaned up? (new names and POD, obviously, but beyond that...)

    Matt
