
Once upon a time I tried to parse addresses and came up with some fugly code. You'd first want to check out Lingua::EN::AddressParse and see if you can use it as-is or modify it to suit your needs. If all else fails, here's the code I was using. Keep in mind this is two files. The first takes the standard US Postal Service abbreviations and expands them. The second probably reads one address per line. You will definitely need to modify this to use it.

Also, this is not how you should program. The prototypes should go, as should the direct access of @_, etc. Don't use this as a style guide. Please. This is not how I write Perl anymore, for production or otherwise. Ugly, ugly, ugly.
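For anyone wondering what dropping the prototypes and the direct @_ access would look like, here is a minimal sketch of two of the helper subs in that style. It is only an illustration under `strict` and `warnings`, not the code I actually ran; the `? 1 : 0` return values are my own choice for the example.

```perl
use strict;
use warnings;

# Sketch only: the same checks as the posted is_ns/is_ew helpers,
# without prototypes and with the argument unpacked into a lexical
# instead of reading $_[0] directly.
sub is_ns {
    my ($word) = @_;
    return $word =~ /^(?:north|south)$/ ? 1 : 0;
}

sub is_ew {
    my ($word) = @_;
    return $word =~ /^(?:east|west)$/ ? 1 : 0;
}

print is_ns('north') ? "north is n/s\n" : "north is not n/s\n";
print is_ew('north') ? "north is e/w\n" : "north is not e/w\n";
```

The behaviour is the same for lowercase, already-expanded words; the only difference is that the argument has a name and the call is not constrained by a prototype.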

#
# Fixed the file up some.

# Expand every whitespace-separated word of an address string.
sub uword {
    join ' ', map unabrev($_), split ' ', shift;
}

# Expand a single (lowercase) word; the first substitution that matches wins.
sub unabrev {
    local $_ = shift;
    return $_ unless /\w/;

    # One really big expression
    s/^e$/east/ or
    s/^w$/west/ or
    s/^(?:n|no)$/north/ or
    s/^(?:s|so)$/south/ or
    s/^ne$/north east/ or
    s/^nw$/north west/ or
    s/^se$/south east/ or
    s/^sw$/south west/ or
    s/^(?:avs|aves)$/avenue south/ or
    s/^beachrd$/beach road/ or
    s/^ccedar$/cedar/ or
    s/^(?:adn|add'n)$/addition/ or
    s/^appache$/apache/ or
    s/^apt$/apartment/ or
    s/^apts$/apartments/ or
    s/^(?:av|ave)$/avenue/ or
    s/^(?:bch|bchch|beac)/beach/ or     # not anchored at the end: rewrites the prefix only
    s/^(?:bx|b0x)/box / or              # was 'box $1'; no capture group, so $1 was always empty
    s/^blvd$/boulevard/ or
    s/^brg$/burg/ or
    s/^bldg$/building/ or
    s/^cen$/center/ or
    s/^(?:centeral|cental)$/central/ or
    s/^char$/character/ or
    s/^chas$/chase/ or
    s/^ches$/chesapeake/ or
    s/^chig$/chicago/ or
    s/^cir$/circle/ or
    s/^(?:cty|co|cnty)$/county/ or
    s/^(?:ct|crt|cour)/court/ or        # not anchored at the end: 'court' becomes 'courtt'
    s/^cr$/curve/ or
    s/^crk$/creek/ or
    s/^crl$/curl/ or
    s/^(?:crystaln|crytl)$/crystal/ or
    s/^ctr$/center/ or
    s/^dist$/district/ or
    s/^(?:drv|drve|dr)$/drive/ or
    s/^est$/estate/ or
    s/^fst$/forest/ or
    s/^ft$/fort/ or
    s/^(?:govt|govern|gov't)$/government/ or
    s/^(?:grv|grov)$/grove/ or
    s/^hgld$/highland/ or
    s/^hglds$/highlands/ or
    s/^(?:hgt|hht|height|ht|hghtss|hghts)$/heights/ or
    s/^(?:hy|hyw|hwy)$/highway/ or
    s/^isl$/island/ or
    s/^(?:jct|jction|jctn|junctn|juncton)$/junction/ or
    s/^(?:jctns|jcts)$/junctions/ or
    s/^l00p$/loop/ or
    s/^(?:lk|lak)$/lake/ or
    s/^lks$/lakes/ or
    s/^li'l$/lil/ or
    s/^(?:la|lanes|ln)$/lane/ or
    s/^ml$/mill/ or
    s/^mls$/mills/ or
    s/^mkt$/market/ or
    s/^(?:mt|mnt)$/mount/ or
    s/^mpls$/minneapolis/ or
    s/^(?:mtn|mntain|mntn)$/mountain/ or
    s/^(?:mntns|mtns)$/mountains/ or
    s/^(?:nth|nrth)$/north/ or
    s/^nrthbrk$/northbrook/ or
    s/^(?:unorg|unorgized)$/unorganized/ or
    s/^ph$/penthouse/ or
    s/^(?:pk|prk)$/park/ or
    s/^(?:pkwy|parkwy|pkway|pky)$/parkway/ or
    s/^pl$/place/ or
    s/^plaz$/plaza/ or
    s/^(?:pobox|po)$/box/ or
    s/^prct$/precinct/ or
    s/^pres$/president/ or
    s/^pt$/point/ or
    s/^pts$/points/ or
    s/^qtr$/quarter/ or
    s/^qtrs$/quarters/ or
    s/^(?:r|rt)$/route/ or
    s/^rd$/road/ or
    s/^rdg$/ridge/ or
    s/^resor$/resort/ or
    s/^(?:ri|rv|riv|rvr)$/river/ or
    s/^(?:rte|rr|rural)$/route/ or
    s/^(?:rs|rst)$/rest/ or
    s/^rverview$/riverview/ or
    s/^(?:shr|shoar)$/shore/ or
    s/^(?:shoars|shrs)$/shores/ or
    s/^(?:spgs|spngs|sprngs)$/springs/ or
    s/^(?:st|str)$/street/ or
    s/^svc$/service/ or
    s/^terr$/terrace/ or
    s/^twp$/township/ or
    s/^(?:tr|trl|trails|trls)$/trail/ or
    s/^trlr$/trailer/ or
    s/^vac$/vacation/;

    return $_;
}

1;

#
#!/usr/bin/perl

$ID = 0;
$ADDRESS = 1;

# Input: one record per line, "id<TAB>address", words already expanded.
while ($record = <>) {
    ($id,$address) = split /\t/, $record;
    @words = split /\s+/, $address;
    %record = ();

    # Leading house number, with an even/odd flag, and an optional "1/2".
    HOUSE:
    if ($words[0] =~ /^\d+$/) {
        $record{house} = shift @words;
        ($record{odd}) = (($record{house} % 2) == 0 ? 'e' : 'o');
    }
    if ($words[0] =~ /^1\/2$/) {
        $record{fraction} = shift @words;
    }

    # Unit: everything from the last unit keyword to the end of the line.
    UNIT:
    for ($i = $#words; $i >= 0; $i--) {
        if (is_unit($words[$i])) {
            $record{unit} = join ' ', @words[$i .. $#words];
            $#words = $i - 1;
            last UNIT;
        }
    }
    unless (defined $record{unit}) {
        if ($words[$#words] =~ /\d+$/) {
            $record{unit} = pop @words;
        }
    }

    # Direction: trailing "east"/"west", optionally preceded by "north"/"south".
    $t = $words[$#words];
    if (is_ew($t)) {
        $record{direction} = pop @words;
        $t = $words[$#words];
        if (is_ns($t)) {
            $t = pop @words;
            $record{direction} = "$t ".$record{direction};
        }
    } elsif (is_ns($t)) {
        $record{direction} = pop @words;
    }
    unless (exists $record{direction}) {
        for ($i = 0; $i < @words; $i++) {
            if (is_ns($words[$i]) or is_ew($words[$i])) {
                $record{direction} .= ' '.$words[$i];
                $words[$i] = '';
            }
        }
    }

    # Street type: the last word that looks like one.
    @words = grep /\w/, @words;
    for ($i = $#words;$i>=0;$i--) {
        if (is_type($words[$i])) {
            $record{type} = $words[$i];
            $words[$i] = '';
            goto DIR;
        }
    }

    # A split "p o" / "p 0" is folded into the unit as "po".
    DIR:
    @words = grep /\w/, @words;
    for ($i = 0; $i < @words-1; $i++) {
        if ($words[$i] eq 'p' and ($words[$i+1] =~ /^(?:o|0)$/)) {
            $words[$i] = '';
            $words[$i+1] = '';
            if (exists $record{unit}) {
                $record{unit} = 'po '.$record{unit};
            } else {
                $record{unit} = 'po';
            }
        }
    }

    # Whatever is left is the street name.
    $t = join ' ', @words;
    $record{street} = $t if $t;

    # Output: tab-separated fields, \N for missing values.
    if (1) {
        $line = join "\t", map {defined $_?$_:'\\N'} ($id, @record{qw(house odd fraction street direction type unit )});
        for (undef,undef) {    # run the whitespace cleanup twice
            $line =~ s/ +/ /g;
            $line =~ s/ +\t/\t/g;
            $line =~ s/\t +/\t/g;
        }
        print $line,"\n";
    }
}

sub is_ns ($) {
    return $_[0] =~ /^(?:north|south)$/;
}

sub is_ew ($) {
    return $_[0] =~ /^(?:east|west)$/;
}

sub is_unit ($) {
    return $_[0] =~ /(?:box|apartment|lot|suite|campus|lower|upper|floor|gymnasium|hall|building)\s*/;
}

sub is_type ($) {
    return $_[0] =~ /(?:alley|avenue|aveue|avneu|bay|boulevard|circle|court|courtt|courttt|cove|crest|curve|dale|drive|grove|highway|hill|knoll|lane|mall|orchard|park|parkway|pass|pines|place|plaza|raod|ridge|road|route|square|state|street|summit|ter|terrace|trail|walk|way)/o;
}
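If you do end up modifying this, one option is to replace the "one really big expression" with a lookup table. The sketch below is an assumption about how that might look, not something I ran against real data: `%EXPANSION`, `expand_word` and `expand_words` are hypothetical names, and the table holds only a handful of the whole-word entries from the list above. The unanchored prefix rules (bx, ct, bch) would still need a regex.

```perl
use strict;
use warnings;

# Hypothetical lookup table: a small subset of the whole-word
# abbreviations from the posted code, keyed by the abbreviation.
my %EXPANSION = (
    'e'    => 'east',
    'w'    => 'west',
    'n'    => 'north',
    'no'   => 'north',
    's'    => 'south',
    'so'   => 'south',
    'av'   => 'avenue',
    'ave'  => 'avenue',
    'blvd' => 'boulevard',
    'st'   => 'street',
    'str'  => 'street',
    'rd'   => 'road',
    'ln'   => 'lane',
    'apt'  => 'apartment',
);

# Expand one lowercase word, or return it unchanged if it is not in the table.
sub expand_word {
    my ($word) = @_;
    return exists $EXPANSION{$word} ? $EXPANSION{$word} : $word;
}

# Expand every whitespace-separated word in a string.
sub expand_words {
    my ($text) = @_;
    return join ' ', map { expand_word($_) } split ' ', $text;
}

print expand_words('123 n main st apt 4'), "\n";   # prints "123 north main street apartment 4"
```

A hash makes it obvious when two abbreviations collide and is easier to extend than a chain of substitutions, at the cost of only handling exact whole-word matches.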

In reply to Re: Parse mailing addresses with a regex by diotalevi
in thread Parse mailing addresses with a regex by data67
