
Suggestions requested: module to standardize postal address components?

by atcroft (Monsignor)
on Jun 30, 2010 at 07:26 UTC ( #847270=perlquestion )
atcroft has asked for the wisdom of the Perl Monks concerning the following question:

As I mentioned in a reply to this poll, the inspiration for the question was something to standardize the abbreviations used in a lengthy list of postal addresses, with the possibility of turning it (someday) into a module.

But since then I've started to wonder: can anyone recommend a module that does what I originally intended when I started this? That is, I give it an address part (such as an address/street line, or a US state) that may use a common but non-standard abbreviation, and it returns that part with the non-standard abbreviations replaced by the standard ones.

I'm almost thinking of an example of something like (but with no real interface in mind, so excuse the crude example):

my %address = (
    name              => q{},
    address           => q{1600 Pennsylvania Avenue Northwest},
    city              => q{Washington},
    state_or_province => q{District of Columbia},
    zip               => 20500,
);

convert(\$address{address});            # 1600 Pennsylvania Ave NW
convert(\$address{city});               # No change
convert(\$address{state_or_province});  # D.C.

Just curious. Any suggestions appreciated.


Replies are listed 'Best First'.
Re: Suggestions requested: module to standardize postal address components?
by Corion (Pope) on Jun 30, 2010 at 07:55 UTC

    I think some way of canonicalization is nice. But the meat of canonicalization is the data of replacements to make and the list of exceptions to these. I'm not aware of any set of rules, be they US-centric or not, and I'm also not aware of any (database) schema to manage addresses at all.

    Maybe looking at FOAF might provide such a schema. Maybe you can also structure your canonicalization rules in a general way as pairs (key,replacement) and have a generic driver that looks at each key and does the replacement:

    sub canonicalize {
        my ($rules, $element) = @_;
        for my $rule (@$rules) {
            my ($key, $action) = @$rule;
            if (exists $element->{ $key }) {
                if (ref $action eq 'CODE') {
                    $action->( $element->{ $key } );
                } else {
                    warn "Unknown rule type '$action' for element '$key'";
                }
            }
        }
    }

    my $en_us = [
        [ 'address' => sub { $_[0] =~ s/\bAvenue\b/Ave/ } ],
        [ 'address' => sub { $_[0] =~ s/\bNorthwest$/NW/ } ],
        ...
    ];

    canonicalize($en_us, \%address);

      I really appreciate the feedback, Corion. Thank you.

      Actually, the US Postal Service has a list of standard abbreviations for use with postal addressing, at least in the US. What I did was to create a set of regexes for those, so as they sit now they just consist of the regexes and the common abbreviations they refer to, in a form I could generate the tests from. I haven't put them into a more usable form yet, due in part to a lack of to-its.

      I'll take a look at the FOAF project link you indicated, to see if there seems to be anything there that might be of use, as well as look over your recommendations when I have neurons firing a little more in tune.

Re: Suggestions requested: module to standardize postal address components?
by DrHyde (Prior) on Jun 30, 2010 at 10:22 UTC

    I can't think of any module that does what you want. If you decide to modularise your code and release it, do please remember that address formats vary widely from one place to another, and assuming that all addresses are of the form address/city/state/zip is incorrect. For example, in the UK there is no state field; what you have as a single line for "address" we would call "street address" (the whole thing is the address), and it may be split over several lines; and your "city" field may also be split over two (or more) lines. In general, the only structure you can assume is address/country, where address is free-form text over several lines. You can, of course, have country-specific code for picking that free-form text apart once you know what the country is.

    Obviously you're only interested in US addresses, which is fine, but to make your code more useful to others (and hence make them more likely to give you bug fixes and cool new features) it would be a good idea to define a common interface which knows how to dispatch to country-specific modules, and to put the US-specific code in one of those that you bundle with the generic front-end.
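The generic front end with country-specific back ends described above could be sketched roughly as follows. This is only an illustration, not an existing module: the names `%normalizer_for` and `normalize_address`, the `country` key, and the two US rules are all assumptions made for the example, and a real distribution would more likely load one module per country than keep code refs in a hash.

```perl
use strict;
use warnings;

# Hypothetical registry: country code => code ref that normalises an
# address hash in place. Plain subs stand in for country-specific modules.
my %normalizer_for = (
    US => sub {
        my ($addr) = @_;
        if ( defined $addr->{address} ) {
            $addr->{address} =~ s/\bAvenue\b/Ave/g;
            $addr->{address} =~ s/\bNorthwest$/NW/;
        }
        return $addr;
    },
);

# Generic front end: the only field it relies on is 'country'; everything
# else is left to whichever country-specific handler is registered.
sub normalize_address {
    my ($addr) = @_;
    my $handler = $normalizer_for{ uc( $addr->{country} // '' ) };
    return $addr unless $handler;    # unknown country: leave untouched
    return $handler->($addr);
}

my %us = (
    address => '1600 Pennsylvania Avenue Northwest',
    country => 'us',
);
normalize_address( \%us );
print "$us{address}\n";
```

An address whose country has no registered handler is returned unchanged, which keeps the free-form "address/country" assumption intact for places the code knows nothing about.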

      There also tends to be some kind of postal code, but the format of the code and the usual placement within the address differs.

      Enoch was right!
      Enjoy the last years of Rome.

        Apparently so. After a previous suggestion about researching international formats, I found a posting where someone was *cough*talking*cough* about how most everyone uses (if I remember this correctly) a general format running from the most specific part (recipient) at the top to the most general (country) at the bottom, with no particular standard for the ordering in between, except for a few who want to do it in the other direction.

        Because of that, my thought was (should I get that far on this as releasable code) just to provide a way to deal with parts of an address component, and let the person using it be able to call what they need where they think they need it.

      Points well taken. My experience with international postal addresses is limited to probably only a handful of instances, the remainder being strictly US addresses. I realize my example was probably very simplistic/incorrect, but it was only meant to get an initial idea across. The USPS listing I based what I have played with so far upon had 3 types of address components [state or province, address unit, and secondary address unit], so what I had in mind was to provide only a few conversion functions (assuming that other types of components might be pointed out later) that users could then call as they saw fit on whatever portion of the entry they needed.

      I very much like the idea of the dispatch interface and country-specific modules to make it more flexible (though it also means I will have to learn how to do such a thing :-).

        I did something similar in Number::Phone, which provides a generic interface and dispatches the hard work of parsing phone numbers to country-specific modules, so maybe you could get some ideas from that.
Re: Suggestions requested: module to standardize postal address components?
by Marshall (Abbot) on Jun 30, 2010 at 14:01 UTC
    You are attempting to do something that is very hard to do in a general sense. Human beings just input the most amazing things! As a thought for you, I have one app that translates a bunch of aliases into "the standard term". It's not fancy or elegant, but after going through a few million records over the years, I've got a pretty stable DB for this specific application.

    input DB is like:

    The program generates a hash translation table and makes sure that no variant is also a "standard term". A human guides entries into this table, e.g. deciding whether a variant is "close enough" to the "standard term" that it should be recognized as equivalent. I suspect that some amount of human guidance will be needed in your app also. A "translation table", as opposed to regex substitution, can work well for something like translating a "state" into the standard 2 letter US Postal code, e.g. for TX:TX,Texas,Tex,Texass or whatever! Over time, the DB would evolve to have a very high probability of correctly translating anything that a human would recognize as "Texas". Any algorithm will have some limit, no matter how "smart" it is, so consider letting a "look-up" do a lot of the "heavy lifting".
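A rough sketch of such a translation table, including the check that no variant is also a "standard term", might look like this. The data layout (`%variants_of`), the function names `build_table` and `standardize`, and the case-insensitive, whitespace-trimmed matching are assumptions for illustration; the actual app's DB format is not shown in the post.

```perl
use strict;
use warnings;

# Hypothetical alias data: standard term => list of accepted variants.
my %variants_of = (
    TX => [ 'Texas', 'Tex', 'Texass' ],
    DC => [ 'District of Columbia', 'D.C.' ],
);

# Build the lookup table (lower-cased variant => standard term) and refuse
# any variant that collides with a standard term or with another variant.
sub build_table {
    my ($variants_of) = @_;
    my %standard = map { lc($_) => $_ } keys %$variants_of;
    my %table    = %standard;    # a standard term maps to itself
    for my $std ( sort keys %$variants_of ) {
        for my $variant ( @{ $variants_of->{$std} } ) {
            my $key = lc $variant;
            die "Variant '$variant' is also a standard term\n"
                if exists $standard{$key};
            die "Variant '$variant' listed for both $table{$key} and $std\n"
                if exists $table{$key};
            $table{$key} = $std;
        }
    }
    return \%table;
}

# Look a term up; return undef for anything unknown so a human can review it.
sub standardize {
    my ( $table, $input ) = @_;
    ( my $key = lc $input ) =~ s/^\s+|\s+$//g;
    return $table->{$key};
}

my $table = build_table( \%variants_of );
print standardize( $table, ' Texass ' ), "\n";
print defined standardize( $table, 'Texarkana' ) ? "known\n" : "needs review\n";
```

Returning undef for unknown input, rather than guessing, leaves room for the human review step: unrecognized terms can be collected and added to the table by hand once someone decides they are "close enough".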

    I haven't researched this, but the US Postal Service is one of the top services in the world, if not the top one, in terms of automated sorting. Whatever these folks are using, it works pretty well, and there are probably some public domain papers out there describing the algorithms and methods that they use.

Re: Suggestions requested: module to standardize postal address components?
by Mr. Muskrat (Canon) on Jun 30, 2010 at 20:02 UTC

      Thank you. I will definitely check them out. If I am reinventing a wheel when instead I could perhaps aid by being able to provide a nicer tire-balancing weight, I would not be opposed to offering what meager code I have so far if it would be of use.
