 
PerlMonks  


Personally, I think I'd use split with capturing brackets, so that the delimiters are not discarded, to break this into chunks first. Using /([\d-]+)/ as the delimiter breaks the line up between the numbers. (The '-' is there to keep the telephone number in one chunk.)

#! perl -sw
use strict;

while( <DATA> ) {
    my @chunks = split /([\d-]+)/;
    print join '|', @chunks;
}

__DATA__
141 Martha Lynn Amblynoster 12345 New Pickle Drive MoreTown PA 98765 654 555-1212 no detail
178 Edgar Bimblybum Jr. 23456 Highfiddle Road Acheville Ma 24680 345-789-1234 no detail
161 Joyce W. Wogerbung 18 Lily Piffle Lane Middleton PA 34567 610-678-2345 no detail
188 Alex Shmogle 6543 Bibblyboo St NW Apt B Washington DC 20009 202-987-6543 no detail
__OUTPUT__
|141| Martha Lynn Amblynoster |12345| New Pickle Drive MoreTown PA |98765| |654| |555-1212| no detail
|178| Edgar Bimblybum Jr. |23456| Highfiddle Road Acheville Ma |24680| |345-789-1234| no detail
|161| Joyce W. Wogerbung |18| Lily Piffle Lane Middleton PA |34567| |610-678-2345| no detail
|188| Alex Shmogle |6543| Bibblyboo St NW Apt B Washington DC |20009| |202-987-6543| no detail

As you can see, the only chunk that needs much further processing is the address ($chunks[4]), which only requires the last two words to be broken off to give you the city and state. At least as far as your examples go.
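Breaking off the last two words might look something like this. It is only a sketch: the sub name split_address is made up for illustration, and it assumes a one-word city and a one-word state, which holds for the sample data but not in general.

```perl
use strict;
use warnings;

# Split an address chunk such as " New Pickle Drive MoreTown PA " into
# street, city and state. ASSUMPTION: city and state are one word each.
sub split_address {
    my ($address) = @_;
    my @words = split ' ', $address;   # split ' ' skips leading/trailing whitespace
    return unless @words >= 3;         # need at least street, city and state
    my $state = pop @words;
    my $city  = pop @words;
    return ( join( ' ', @words ), $city, $state );
}

my ( $street, $city, $state ) = split_address(' New Pickle Drive MoreTown PA ');
print "$street|$city|$state\n";        # prints: New Pickle Drive|MoreTown|PA
```

Anything shorter than three words comes back as an empty list, so the caller can tell that the chunk did not look like an address.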

How you would recognise city names with more than one word (e.g. Salt Lake City) is up to you. Probably the best way would be to grab a dictionary of town/city names from somewhere, put them in a hash, strip the state and look up the last word, then the last two words, then the last three words, until you get a match. Subdividing the hash by state first would further increase your reliability.
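A rough sketch of that lookup, under loud assumptions: the %cities table is a tiny hypothetical stand-in for a real dictionary of town/city names, the sub name split_city is invented, and the street address in the usage line is made up purely to exercise the code.

```perl
use strict;
use warnings;

# HYPOTHETICAL lookup table, subdivided by state: state => { lc city => 1 }.
# A real one would be loaded from a dictionary of town/city names.
my %cities = (
    UT => { 'salt lake city' => 1, 'provo'     => 1 },
    PA => { 'moretown'       => 1, 'middleton' => 1 },
);

# Returns (street, city, state), or an empty list if no known city matches.
sub split_city {
    my ($address) = @_;
    my @words = split ' ', $address;
    return unless @words >= 2;
    my $state = uc pop @words;             # strip the state first
    my $known = $cities{$state} or return; # no table for this state

    # Try the last word, then the last two, then the last three.
    for my $n ( 1 .. 3 ) {
        last if $n > @words;
        my $candidate = join ' ', @words[ -$n .. -1 ];
        next unless $known->{ lc $candidate };
        return ( join( ' ', @words[ 0 .. $#words - $n ] ), $candidate, $state );
    }
    return;
}

print join( '|', split_city('100 Temple Square Salt Lake City UT') ), "\n";
# prints: 100 Temple Square|Salt Lake City|UT
```

Trying the shortest candidate first follows the order described above. If your dictionary contains one city name that is the tail of another, trying the longest candidate first may be the safer variation.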


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller



In reply to Re: Parse mailing addresses with a regex by BrowserUk
in thread Parse mailing addresses with a regex by data67
