Well, I am starting with Canada. I am creating lookup structures that hold things like the valid words in a French address, the valid words in an English address, and the valid abbreviations for each street type (e.g. street=st, abbey=abbey, etc.). The majority of this will come from information compiled off the web, just by searching for postal information on each individual country. Very manual, I know. Oh, and to throw another wrench into it, I have to be able to handle multiple character sets as well, so Unicode is a must.
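
To make that a bit more concrete, here is a rough sketch of the kind of lookup structure I have in mind, written in Python just for illustration. Only the street=st and abbey=abbey entries come from what I described above; the other entries, and the names `STREET_TYPES`, `normalize` and `street_type_abbreviation`, are made-up placeholders until the real per-country tables are compiled. It assumes NFC normalization plus case folding is enough to make differently encoded spellings compare equal.

```python
import unicodedata


def normalize(word):
    """Trim punctuation, NFC-normalize and case-fold a word.

    This makes composed and decomposed accented characters compare equal.
    """
    return unicodedata.normalize("NFC", word.strip(" .,")).casefold()


# Illustrative sample data only (an assumption, not an official table).
_RAW_STREET_TYPES = {
    "en": {"street": "st", "abbey": "abbey", "avenue": "ave"},
    "fr": {"rue": "rue", "allée": "allée", "chemin": "ch"},
}

# Normalize the keys once so lookups do not depend on how this file is encoded.
STREET_TYPES = {
    language: {normalize(name): abbreviation for name, abbreviation in table.items()}
    for language, table in _RAW_STREET_TYPES.items()
}


def street_type_abbreviation(word, language):
    """Return the abbreviation for a street type, or None if it is not listed."""
    return STREET_TYPES.get(language, {}).get(normalize(word))
```

A lookup that returns `None` only means the word is not in the table yet, not that the address is wrong.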

    Are you trying to parse this out of freeform text, or do you already know which piece of data is which?

    As an aside, I posted a module here for looking up ISO country codes. I don't know if you will find it useful or not. I have one for US states too that I will post later.


    I am not sure what you mean by "valid words". Does this mean that an address must contain one of the special words, such as "street", "st", "lane" or "road", somewhere in the address field to be OK?

    I'd say that is an impossible task, unless some countries actually have such strict policies for what is a valid address. Speaking for Sweden, for one thing, we have lots of addresses containing "gata", which means "street" for instance, but lots of addresses don't - and some addresses are just the name of a village, or something smaller than that, with or without a number after it. Yet other addresses are something that would translate to "Mailbox XXX", which is not the same thing as a PO box (we have those too), etc... frankly, I can't see any other match to our addresses than /.+/.

    Either I misunderstood what you mean, or I think it will be impossible to create these rules, unless you do as some e-commerce sites do and check addresses against where people live according to central government registers. And that was clearly not your goal... :)

      You are correct about the whole valid words thing. Hopefully each country documents what its valid types of thoroughfares are. I can't check against the central government registers because the majority of the world's governments don't let their postal files out of the country: some because they don't want to, others because they have no idea where everyone lives. What is the difference between Mailbox and Box? I have Box down as what a PO Box is called in Sweden. Is this right?
        In conclusion, you can say that some addresses are correct (they include one or several words in the right place, etc.), but you can never tell that one particular address is invalid.

        In the countryside, at least in some places, people can have a "Postlåda" - which is a mailbox - with a number on it as the address. The other kind - "Box" - is the one you hire at the post office itself. This is definable and parseable if one really wants to. But addresses here can be utterly freeform; even in the cities there are lots of combinations that do not include any form of "street", "square" or similar.
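
        If it helps, this is roughly how I would express that in code: a small sketch in Python (any language would do) that can recognise a few known forms but answers "unknown" rather than "invalid" for everything else. The three patterns and the names `PATTERNS` and `classify` are only my own illustration, not any official rule set, and the "gata" pattern is deliberately loose.

```python
import re
import unicodedata

# Illustrative patterns only (an assumption, not an official list).
# Each entry is (label, compiled pattern), tried in order.
PATTERNS = [
    ("rural mailbox", re.compile(r"^postlåda\s+\d+$")),
    ("post office box", re.compile(r"^box\s+\d+$")),
    ("street", re.compile(r"gatan?\b")),
]


def classify(address):
    """Return a label for a recognised address form, else "unknown".

    "unknown" does not mean invalid: a freeform address such as a bare
    village name simply matches none of the known forms.
    """
    text = unicodedata.normalize("NFC", address).casefold().strip()
    for label, pattern in PATTERNS:
        if pattern.search(text):
            return label
    return "unknown"
```

        Here `classify("Postlåda 12")` gives "rural mailbox" and `classify("Storgatan 5")` gives "street", while a village-style address like "Lillbyn 3" comes back as "unknown", which is the best any such rule set can honestly say about it.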

        I don't mean to put you down, or anything, but you should know exactly what you are up against. :)

