http://www.perlmonks.org?node_id=268182


in reply to Parse mailing addresses with a regex

Personally, I think I'd first break this into chunks using split with capturing brackets, so that the delimiters are not discarded. Using /([\d-]+)/ as the delimiter splits the line at each run of digits (the '-' is there to keep the telephone number in one chunk).

#! perl -sw
use strict;

while( <DATA> ) {
    my @chunks = split /([\d-]+)/;
    print join '|', @chunks;
}
__DATA__
141 Martha Lynn Amblynoster 12345 New Pickle Drive MoreTown PA 98765 654 555-1212 no detail
178 Edgar Bimblybum Jr. 23456 Highfiddle Road Acheville Ma 24680 345-789-1234 no detail
161 Joyce W. Wogerbung 18 Lily Piffle Lane Middleton PA 34567 610-678-2345 no detail
188 Alex Shmogle 6543 Bibblyboo St NW Apt B Washington DC 20009 202-987-6543 no detail
__OUTPUT__
|141| Martha Lynn Amblynoster |12345| New Pickle Drive MoreTown PA |98765| |654| |555-1212| no detail
|178| Edgar Bimblybum Jr. |23456| Highfiddle Road Acheville Ma |24680| |345-789-1234| no detail
|161| Joyce W. Wogerbung |18| Lily Piffle Lane Middleton PA |34567| |610-678-2345| no detail
|188| Alex Shmogle |6543| Bibblyboo St NW Apt B Washington DC |20009| |202-987-6543| no detail

As you can see, the only chunk that needs much further processing is the address ($chunks[4]), which only requires the last two words to be broken off to give you the city and state. At least as far as your examples go.
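Breaking off those last two words might look something like the sketch below. The helper name split_address is made up for illustration, and it assumes the city is a single word and the state is the last word of the chunk, which holds for the sample data but not in general.

```perl
use strict;
use warnings;

# Hypothetical helper: split an address chunk such as
# " New Pickle Drive MoreTown PA " into street, city and state.
# Assumes a one-word city followed by the state as the last word.
sub split_address {
    my ($chunk) = @_;
    my @words = split ' ', $chunk;    # split ' ' ignores leading whitespace
    return if @words < 3;             # need at least street + city + state
    my $state = pop @words;
    my $city  = pop @words;
    return ( join( ' ', @words ), $city, $state );
}

my $line   = '141 Martha Lynn Amblynoster 12345 New Pickle Drive MoreTown PA 98765 654 555-1212 no detail';
my @chunks = split /([\d-]+)/, $line;

# $chunks[3] is the house number, $chunks[4] the rest of the address.
my ( $street, $city, $state ) = split_address( $chunks[4] );
print "$chunks[3] $street | $city | $state\n";
```

For the first sample line this should print "12345 New Pickle Drive | MoreTown | PA".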

How you would recognise city names with more than one word (e.g. Salt Lake City) is up to you. Probably the best way would be to grab a dictionary of town/city names from somewhere, put them in a hash, strip the state, and look up the last word, then the last two words, then the last three words, until you get a match. Subdividing the hash by state first would further increase your reliability.
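A rough sketch of that lookup follows. The %cities table, the find_city helper, and the "South Temple" test address are all invented for illustration; a real table would be loaded from whatever list of place names you can get hold of.

```perl
use strict;
use warnings;

# Illustrative data only: city names subdivided by state.
my %cities = (
    UT => { 'Salt Lake City' => 1, 'Provo' => 1 },
    PA => { 'MoreTown' => 1, 'Middleton' => 1 },
    DC => { 'Washington' => 1 },
);

# Hypothetical helper: returns (street, city, state), or an empty list
# if no known city is found within the last $max_words words.
sub find_city {
    my ( $chunk, $max_words ) = @_;
    $max_words ||= 3;
    my @words = split ' ', $chunk;
    return unless @words;
    my $state = uc pop @words;                # strip the state ('Ma' becomes 'MA')
    my $known = $cities{$state} or return;    # nothing recorded for this state
    for my $n ( 1 .. $max_words ) {           # last word, last two, last three
        last if $n > @words;
        my $city = join ' ', @words[ -$n .. -1 ];
        next unless $known->{$city};
        my $street = join ' ', @words[ 0 .. $#words - $n ];
        return ( $street, $city, $state );
    }
    return;
}

print join( ' | ', find_city(' South Temple Salt Lake City UT ') ), "\n";
```

This tries the shortest candidate first, as described above, so a one-word city that happens to be the tail of a longer city name in the same state would win. Trying the longest candidate first is the obvious variation if that turns out to matter.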


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller