Personally, I think I'd use split with capturing brackets (so the delimiters are not discarded) to break this into chunks first. Using /([\d-]+)/ as the delimiter breaks the line up between the numbers. (The '-' is there to keep the telephone number in one chunk.)
#! perl -sw
use strict;
while( <DATA> ) {
    my @chunks = split /([\d-]+)/;
    print join '|', @chunks;
}
__DATA__
141 Martha Lynn Amblynoster 12345 New Pickle Drive MoreTown PA 98765 654 555-1212 no detail
178 Edgar Bimblybum Jr. 23456 Highfiddle Road Acheville Ma 24680 345-789-1234 no detail
161 Joyce W. Wogerbung 18 Lily Piffle Lane Middleton PA 34567 610-678-2345 no detail
188 Alex Shmogle 6543 Bibblyboo St NW Apt B Washington DC 20009 202-987-6543 no detail
__OUTPUT__
|141| Martha Lynn Amblynoster |12345| New Pickle Drive MoreTown PA |98765| |654| |555-1212| no detail
|178| Edgar Bimblybum Jr. |23456| Highfiddle Road Acheville Ma |24680| |345-789-1234| no detail
|161| Joyce W. Wogerbung |18| Lily Piffle Lane Middleton PA |34567| |610-678-2345| no detail
|188| Alex Shmogle |6543| Bibblyboo St NW Apt B Washington DC |20009| |202-987-6543| no detail
As you can see, the only chunk that then needs much further processing is the address ($chunks[4]), which only requires the last two words to be broken off to give you city and state. At least as far as your examples go.
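Breaking the last two words off the address chunk might look something like this. It is only a sketch: split_address is a made-up helper name, and it assumes the city is a single word and the state is the final word, which holds for the sample data but not in general.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): split an address
# chunk into street, city and state, assuming a one-word city followed
# by the state as the last word.
sub split_address {
    my ($chunk) = @_;
    my @words = split ' ', $chunk;    # split ' ' also drops leading whitespace
    return unless @words >= 3;        # need at least street, city and state
    my $state = pop @words;
    my $city  = pop @words;
    return ( join( ' ', @words ), $city, $state );
}

my ( $street, $city, $state ) = split_address(' New Pickle Drive MoreTown PA ');
print "$street|$city|$state\n";    # New Pickle Drive|MoreTown|PA
```

In the loop above you would pass it $chunks[4].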
How you would recognise city names with more than one word (e.g. Salt Lake City) is up to you. Probably the best way would be to grab a dictionary of town/city names from somewhere, put them in a hash, strip the state, and look up the last word, then the last two words, then the last three words, until you get a match. Subdividing the hash by state first would further increase your reliability.
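A rough sketch of that lookup, assuming a hash of hashes keyed by state and then by city name. The %cities table, its handful of entries and the split_city name are all invented for illustration; a real table would be loaded from whatever list of place names you can find.

```perl
use strict;
use warnings;

# Hypothetical lookup table: state => { city name => 1 }.
my %cities = (
    UT => { 'Salt Lake City' => 1, 'Provo'    => 1 },
    PA => { 'Middleton'      => 1, 'MoreTown' => 1 },
);

# Strip the state, then try the last word, the last two words, and so on
# until the candidate matches a known city for that state.
sub split_city {
    my ($chunk) = @_;
    my @words = split ' ', $chunk;
    return unless @words >= 2;
    my $state = pop @words;
    my $known = $cities{ uc($state) } or return;    # uc copes with 'Ma' style input
    for my $n ( 1 .. scalar @words ) {
        my $city = join ' ', @words[ -$n .. -1 ];
        next unless $known->{$city};
        my $street = join ' ', @words[ 0 .. $#words - $n ];
        return ( $street, $city, $state );
    }
    return;    # no known city found
}

print join( '|', split_city(' Example Ave Salt Lake City UT ') ), "\n";
```

Trying the shortest candidate first follows the order described above, but it means a one-word name would win over a longer name ending in the same word if both are in the table for that state. Trying the longest candidate first is the obvious alternative.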
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller