PerlMonks  

Parsing "real world" addresses...

by eduardo (Curate)
on May 01, 2003 at 14:35 UTC ( #254640=perlquestion )
eduardo has asked for the wisdom of the Perl Monks concerning the following question:

Greetings. I have been looking for a module (or a Parse::RecDescent grammar) to parse real world "mailing addresses." I have a text file that looks something like:
    150 Main Street Nashville, Tennesse 37201
    27 Breck Lane #27 Plingo, Maine 44343
    1 2nd Av S. Apt 166 Memphis, TN 37373
And whatnot... and I was wondering if there were any modules that had been written to break it up into its component parts (street, apartment(?), city, zip).

(jeffa) Re: Parsing "real world" addresses...
by jeffa (Chancellor) on May 01, 2003 at 14:39 UTC
Re: Parsing "real world" addresses...
by hardburn (Abbot) on May 01, 2003 at 14:40 UTC

    I'm not saying that it's impossible, but it seems that there are so many special cases that you'll have a hard time getting through them.

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    Note: All code is untested, unless otherwise stated

Re: Parsing "real world" addresses...
by dragonchild (Archbishop) on May 01, 2003 at 14:43 UTC
    I was working on a data migration project and they spent something like 6 man-months working on this and had a 90% solution. *shrugs* That's good enough, I think. (Of course, they refused to upload their stuff, not that you'd've wanted it anyways ...)

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

    Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Parsing "real world" addresses...
by benn (Priest) on May 01, 2003 at 14:53 UTC
    Lingua::EN::AddressParse is pretty good - biased slightly towards Australian addresses, but should be a good start.

    Cheers
    Ben.

      I hope this gets me a 90% solution without 6 man-months of effort, gotta love code re-use eh, dragonchild? :) I will report my findings later on toda... I should have known it would be in the Lingua namespace. Thanks for the pointer benn.
Re: Parsing "real world" addresses...
by halley (Prior) on May 01, 2003 at 16:59 UTC
    All of your example addresses are USA form, but consider the non-USA case. Every country has a different postal address scheme, and some of it's just "conventional" and not a rigid format.

    --
    [ e d @ h a l l e y . c c ]

Re: Parsing "real world" addresses...
by crenz (Priest) on May 01, 2003 at 21:41 UTC

    I'd try an iterative approach:

    • Skim through the list and develop a number of basic regexes to match the addresses. Keep them rather strict and make your code fail (or output the non-matching records) if a record doesn't match any of them.
    • Run it, then skim through the list of non-matching records and add more special cases to your rules (again, keep the rules rather strict).
    • Go back to the previous step until you have reached acceptable accuracy :)

    I found this approach to be quite helpful in ensuring that I don't misunderstand records -- i.e., thinking your rules match when in reality it is just pure chance and they're producing garbage without you noticing.
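
    A minimal sketch of that loop, for illustration only. It assumes one address per line and Perl 5.10+ (for named captures). The rule name, the list of street types, the parse_address helper and the "PO Box" sample record are all made up here and are not part of the original post. The single rule is deliberately strict, so anything unexpected lands in the reject list instead of being silently mis-parsed.

```perl
use strict;
use warnings;

# Hypothetical list of street types; extend it as rejected records show up.
my $street_type = qr/(?:Street|St|Lane|Ln|Avenue|Ave|Av|Road|Rd)\.?/;

# Each rule is [ name, strict anchored regex with named captures ].
# Add new rules here after reviewing the rejects from the previous run.
my @rules = (
    [ street_city_state_zip => qr{
        \A (?<street> \d+ \s+ .+? \s+ $street_type (?: \s+ [NSEW]\. )? )
        (?: \s+ (?<unit> (?: Apt \s+ | \# ) \d+ ) )?
        \s+ (?<city> [A-Z][a-z]+ (?: \s+ [A-Z][a-z]+ )* ) ,
        \s+ (?<state> [A-Z]{2} | [A-Z][a-z]+ )
        \s+ (?<zip> \d{5} ) \z
    }x ],
);

# Returns a hashref of components for the first matching rule,
# or undef if no rule matches.
sub parse_address {
    my ($line) = @_;
    for my $rule (@rules) {
        my ($name, $re) = @$rule;
        if ($line =~ $re) {
            # %+ holds only the named groups that actually captured.
            return { rule => $name, %+ };
        }
    }
    return;
}

# Sample records: the three from the question plus one made-up
# record that the rule above is not expected to handle.
my @lines = (
    '150 Main Street Nashville, Tennesse 37201',
    '27 Breck Lane #27 Plingo, Maine 44343',
    '1 2nd Av S. Apt 166 Memphis, TN 37373',
    'PO Box 12 Springfield, IL 62701',
);

my (@parsed, @rejected);
for my $line (@lines) {
    my $rec = parse_address($line);
    if ($rec) { push @parsed, $rec }
    else      { push @rejected, $line }
}

# Print what matched, then the records that still need a rule.
for my $rec (@parsed) {
    print join(', ', map { "$_=$rec->{$_}" } sort keys %$rec), "\n";
}
print "NO MATCH: $_\n" for @rejected;
```

    Each pass, read the NO MATCH lines, decide whether they justify a new strict rule, add it to @rules, and run again. Recording the rule name with each parsed record makes it easier to spot-check whether a rule is matching for the right reason.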
