
Extracting a (UK) Address

by ropey (Hermit)
on Jan 02, 2009 at 10:14 UTC ( [id://733725]=perlquestion )

ropey has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks,

I am faced with a challenge: extracting clients' names and addresses from a bunch of Word documents.

I came to the conclusion that processing raw text would be easier than trying to parse a Word-formatted document, so I use Win32::OLE to open the documents and save them as text only. Now I come to the part of extracting the address data, and before I start I would like to ask for some advice.

So has anyone done something similar to this before? The obvious choice would be a regex, but the format of a name could vary considerably (consider MR and Mrs D.M Smith, Mrs & Mr D Smith-Brown etc.) and an address could vary even more. So before I re-invent the wheel: has this been done before? Searching CPAN, there are modules such as Geo::PostalAddress or Lingua::EN::AddressParse which do something similar, but they do not 'extract' the address from a raw text document.

Has anyone faced a similar problem, and could you advise on how to resolve it?

Replies are listed 'Best First'.
Re: Extracting a (UK) Address
by Perlbotics (Archbishop) on Jan 02, 2009 at 12:39 UTC

    Seems that for this kind of task, you are better off with a state machine. Something that allows you to identify an interesting section of your document which can be analysed / scrutinised for the interesting stuff. So the problem will be to identify keywords that mark the start of an address field and to have an idea of how an address field ends. If you're lucky, the address field has a fixed number of lines.

    Something along these lines:

        use strict;

        sub flush_address {
            # may need more code to narrow down the address
            print "FOUND:\n>> ", join(">> ", @{$_[0]}), "======\n";
        }

        my @address;
        my $line_span = -1; # -1 disabled; otherwise extract $line_span lines

        while (<>) {
            # identify start of an address section (upd.: regexp incomplete)
            push(@address,$_), next if /^\s*(Miss|Mister|Mr\.?|Ms\.?|Her|His)\s+/;

            if (@address) {
                push @address, $_;
                # identify end of an address section
                # regexp matches empty line here but should match something
                # like "BN2 ..."
                if (@address == $line_span or /^\s*$/) {
                    flush_address(\@address);
                    @address = ();
                }
            }
        }
        flush_address(\@address) if @address;
        __END__

        pb> perl invoice.txt
        FOUND:
        >> Miss ***** ******
        >> 1** Elm ****,
        >> Bri*****
        >> E*** ******
        >> BN2 ***
        >>
        ======
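The comment in the posted code notes that the end-of-address test should really match something like "BN2 ..." rather than an empty line. A minimal sketch of that variation follows. The sub name `extract_addresses`, the title list, the `$max_lines` limit and the simplified postcode pattern are all assumptions made for illustration, and the sample text is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of address blocks (array refs of lines).
sub extract_addresses {
    my ($text) = @_;

    # Illustrative, incomplete title list (same caveat as the posted code).
    my $start_re = qr/^\s*(?:Miss|Mister|Mrs\.?|Mr\.?|Ms\.?)\s+/;

    # Simplified postcode shape such as "AB1 2CD"; a shape check only,
    # not a validation against real postcodes.
    my $end_re = qr/^\s*[A-Z]{1,2}[0-9][A-Z0-9]?\s*[0-9][A-Z]{2}\s*$/i;

    # Assumed safety limit so a missed postcode cannot swallow the file.
    my $max_lines = 8;

    my (@found, @address);
    for my $line (split /\n/, $text) {
        if (!@address) {
            # State 1: outside an address, waiting for a titled name.
            push @address, $line if $line =~ $start_re;
            next;
        }
        # State 2: inside a candidate address.
        push @address, $line;
        if ($line =~ $end_re) {
            push @found, [@address];    # postcode seen: keep the block
            @address = ();
        }
        elsif ($line =~ /^\s*$/ or @address >= $max_lines) {
            @address = ();              # no postcode seen: discard the candidate
        }
    }
    return @found;
}

my $sample = <<'END';
Invoice

Miss A Example
1 Sample Road,
Sampletown
East Sussex
AB1 2CD

DESCRIPTION AMOUNT TOTAL
END

print join("\n", @$_), "\n" for extract_addresses($sample);
```

Discarding a candidate when a blank line arrives before any postcode is a design choice; it trades missed addresses (those with no postcode) for fewer false positives.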

    Some free associations of potentially useful links: Parsing addresses, Efficient Fuzzy Matching Of An Address, Extracting Bibliography Citations, validate a postal code, Extracting information from a MS WORD Document, Pull all text from msword document, ...

    Update: ... I assumed, that the sample invoice was anonymised already, but - just in case - made the sample output unreadable.

      Studying the invoice closely, it appears that a unique comma terminates the first line of the address. If this is constant, you could possibly 'vector' yourself in from there?

      (Just a stray thought - does Miss Hocker mind her name and home address being published on the web?!)

        shrdlu said:
        (Just a stray thought - does Miss Hocker mind her name and home address being published on the web?!)

        I'm glad to hear that I'm not the only one with this worry.

        @ ropey: you should really dummy up the name and address here.

        Haha, if someone lives at that address I would be very surprised... it was made up.
Re: Extracting a (UK) Address
by Bloodnok (Vicar) on Jan 02, 2009 at 12:21 UTC
    I may be teaching grannies to suck eggs, but here goes anyway ... IMO, you need to identify and use, one, or more, invariant properties of the address block e.g. ...
    • Always in the same relative/absolute location in the invoice &/or ...
    • Has an identifying header &/or ...
    • .
    • .

    Even better is if, of the invariants thus identified, at least one can be demonstrated to be unique for the address block.

    In your supplementary data example, it would appear that the address block is the 3rd block where each block is separated from the next by \n{2,}.

    Having identified and isolated the address block, it then becomes a simpler matter of parsing the address details...
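The "3rd block" idea above can be sketched as follows, assuming the text export keeps blank lines between blocks. `nth_block` is a hypothetical helper name and the sample invoice is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return the Nth (1-based) block of text, where
# blocks are separated by one or more blank lines. Returns undef when
# the text has fewer than N blocks.
sub nth_block {
    my ($text, $n) = @_;
    $text =~ s/\r\n?/\n/g;                        # normalise line endings
    my @blocks = grep { /\S/ } split /\n\s*\n/, $text;
    return $blocks[$n - 1];
}

my $invoice = <<'END';
Invoice

Invoice No: X0001-2009
VAT No: 000 0000 00

Miss A Example
1 Sample Road,
Sampletown
AB1 2CD

DESCRIPTION AMOUNT TOTAL
END

my $address = nth_block($invoice, 3);
print defined $address ? "$address\n" : "no third block\n";
```

This relies entirely on the block position being invariant, so it is only as reliable as that assumption for the older invoice layouts.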

    Thinx: OTOH, there may be a chance that the address is stored as a formatted block in the Word doc - so using Win32::OLE may be a first step to read the object direct from the doc...

    Thinx again: It's highly probable that a combination of the 2 would be required to handle the inconsistencies introduced by the evolution of the doc...

    A user level that continues to overstate my experience :-))
Re: Extracting a (UK) Address
by u671296 (Sexton) on Jan 02, 2009 at 10:38 UTC


    The solution will depend very much upon the data you need to extract the address from. Are there any delimiters ? fixed length fields ? It would help if you could include some example data, though I appreciate you may want to change it for security purposes.

      Format may vary from time to time (as the invoices have evolved) one example would be something like
          Invoice

          Invoice No: C0331-2008
          Invoice Date:27/02/2008
          VAT No: 679 7113 03

          Miss Carol Hocker
          177 Elm Road,
          Brighton
          East Sussex
          BN2 7HB

          DESCRIPTION AMOUNT TOTAL
          Corian worktops supply and fit 3083.15
          Neff double oven 599.00
          Neff gas hob 298.00
          Baumatic extractor hood 419.00
          Neff dishwasher 420.00
          Ducting kit 30.00
          Franke swing spray tap 175.00
          Baumatic Microwave 219.00
          Double sockets x 4 160.00
          Single sockets x 3 108.00
          Fused spurs x 2 72.00
          Cooker control panel 55.00
          5 triangle lights 120.00
          Supply and fit new fuse board
          The electrics will be invoiced by electrician and are plus vat 5243.15

          PAYMENTS RECEIVED AMOUNT TOTAL
          Payment now due 2184.02

        So you are looking for three or more lines together, the last ending in something that looks like a post code...

        $letter =~ m/((?:[^\n]+\n){2,}[^\n]*?[a-zA-Z]+[0-9]+\s+[0-9]+[a-zA-Z]+\s*?\n)\s*?\n/

        ...seemed to do the trick, where the entire letter was read into $letter. Obviously this will miss addresses with no post code or really rubbish post codes. You could just extract all groups of 3 or more lines, and then apply some more cunning address recogniser to the result -- perhaps from one of the modules recommended elsewhere.

        (I haven't tried to figure out how much work this is asking the regex engine to do on difficult input. I'd worry about that only if it becomes a problem.)
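For readability, here is the same regex spread out with /x and commented, followed by a sketch of the alternative mentioned above (collect every group of three or more lines and leave recognition to a later step). The sub names `find_address` and `candidate_blocks` are hypothetical, and the sample letter is made up.

```perl
use strict;
use warnings;

# The regex from the post, unchanged apart from /x layout and comments.
my $address_re = qr/
    (                           # capture the whole address block
        (?: [^\n]+ \n ){2,}     # two or more non-empty lines
        [^\n]*?                 # start of the final line
        [a-zA-Z]+ [0-9]+ \s+    # first half of the post code, e.g. "BN2"
        [0-9]+ [a-zA-Z]+        # second half, e.g. "7HB"
        \s*? \n                 # end of the final line
    )
    \s*? \n                     # followed by a blank line
/x;

# Returns the first block that ends in something post-code-shaped, or undef.
sub find_address {
    my ($letter) = @_;
    return $letter =~ $address_re ? $1 : undef;
}

# Alternative: every run of three or more non-empty lines, unfiltered.
sub candidate_blocks {
    my ($letter) = @_;
    my @keep;
    for my $block (split /\n\s*\n/, $letter) {
        my @lines = grep { /\S/ } split /\n/, $block;
        push @keep, $block if @lines >= 3;
    }
    return @keep;
}

my $letter = <<'END';
Invoice

Invoice No: X0001-2009
VAT No: 000 0000 00

Miss A Example
1 Sample Road,
Sampletown
AB1 2CD

DESCRIPTION AMOUNT TOTAL
END

my $found = find_address($letter);
print defined $found ? $found : "no address found\n";
print scalar(candidate_blocks($letter)), " candidate block(s)\n";
```

The candidates returned by `candidate_blocks` could then be handed to a stricter recogniser, for example one of the CPAN modules mentioned elsewhere in the thread.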

        I have little to add to the other suggestions already made. I think it is unlikely you can home in on the address without it being delimited in some way.

        The example above clearly delimits with "VAT No:" and "DESCRIPTION". I think you'll need to know what variations the invoices have had over the years and code for all of them.
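For the layout in the example, the delimiter idea can be sketched like this. It assumes the "VAT No:" text finishes its line and that "DESCRIPTION" starts a later line; `between_markers` is a hypothetical name, and older invoice layouts would need their own marker pairs.

```perl
use strict;
use warnings;

# Hypothetical helper: return the non-blank lines found between the line
# containing "VAT No:" and the line starting with "DESCRIPTION".
# Returns an empty list when either marker is missing.
sub between_markers {
    my ($text) = @_;
    return unless $text =~ /VAT \s+ No: [^\n]* \n (.*?) ^ \s* DESCRIPTION \b/xms;
    my @lines = grep { /\S/ } split /\n/, $1;
    s/^\s+|\s+$//g for @lines;      # trim each line in place
    return @lines;
}

my $invoice = <<'END';
Invoice

Invoice No: X0001-2009 VAT No: 000 0000 00

Miss A Example
1 Sample Road,
Sampletown
AB1 2CD

DESCRIPTION AMOUNT TOTAL
Worktops 10.00
END

print "$_\n" for between_markers($invoice);
```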

        Other tricks might help,

        e.g. is there always a titled name (Mr, Mrs, Miss, Ms etc.) at the start of the address? If so, analyze all the names in your dataset to identify all unique titles.
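One way to do that analysis, as a rough sketch: tally the first word of every candidate name line and review the counts by eye. `tally_titles` is a hypothetical name, and the input lines are made up (two of them are the variants from the original question).

```perl
use strict;
use warnings;

# Hypothetical helper: count the first word of each line, ignoring case
# and a trailing dot, so "MR", "Mr" and "Mr." are counted together.
# Non-title first words will show up too; the tally is meant for review.
sub tally_titles {
    my (@name_lines) = @_;
    my %count;
    for my $line (@name_lines) {
        next unless $line =~ /^\s*([A-Za-z]+)\.?\s/;
        $count{ ucfirst lc $1 }++;
    }
    return \%count;
}

my @first_lines = (
    'MR and Mrs D.M Smith',
    'Mrs & Mr D Smith-Brown',
    'Miss A Example',
    'Mr. J Bloggs',
);

my $count = tally_titles(@first_lines);
printf "%-5s %d\n", $_, $count->{$_}
    for sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
```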

        177 Elm Road, is the only line that starts with a number so the address is in that block

        Address lines are the only ones that end with a comma, so use that block

        The address is always in the first n lines of the invoice ?

        The address always has a county in it ?


        Also if you have access to postcode validation (database ?) that could help
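Since any single one of these tricks can misfire (in the sample invoice, "5 triangle lights" also starts with a number), one option is to score each block on several weak signals and take the best. The sketch below assumes blank-line-separated blocks; `score_block` and `best_block` are hypothetical names, and the postcode pattern is a simplified shape check, not a substitute for a postcode database.

```perl
use strict;
use warnings;

# Simplified UK postcode shape, e.g. "AB1 2CD". A shape check only.
my $postcode_re = qr/\b[A-Z]{1,2}[0-9][A-Z0-9]?\s*[0-9][A-Z]{2}\b/i;

# Hypothetical helper: one point per signal that the block is an address.
sub score_block {
    my ($block) = @_;
    my @lines = grep { /\S/ } split /\n/, $block;
    return 0 unless @lines;

    my $score = 0;
    $score++ if $lines[0] =~ /^\s*(?:Mr|Mrs|Miss|Ms)\b/i;   # titled name first
    $score++ if grep { /^\s*\d+\s+\S/ } @lines;             # a line starting with a number
    $score++ if grep { /,\s*$/ } @lines;                    # a line ending in a comma
    $score++ if $lines[-1] =~ $postcode_re;                 # postcode-shaped last line
    return $score;
}

# Returns (block, score) for the highest-scoring block, or (undef, 0).
sub best_block {
    my ($text) = @_;
    my ($best, $best_score) = (undef, 0);
    for my $block (split /\n\s*\n/, $text) {
        my $s = score_block($block);
        ($best, $best_score) = ($block, $s) if $s > $best_score;
    }
    return ($best, $best_score);
}

my $invoice = <<'END';
Invoice

Invoice No: X0001-2009
VAT No: 000 0000 00

Miss A Example
1 Sample Road,
Sampletown
AB1 2CD

DESCRIPTION AMOUNT TOTAL
Worktops 10.00
END

my ($block, $score) = best_block($invoice);
print "score $score:\n$block\n" if defined $block;
```

A minimum score could serve as the cut-off for invoices that need manual handling.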

        Whatever, I assume you will end up with many invoices that can't be correctly handled, so you'll need to agree how to handle those exceptions.

Re: Extracting a (UK) Address
by Sagacity (Monk) on Jan 03, 2009 at 08:44 UTC

    It has been a while, but I once tried saving the Word docs as RTF.
    The added RTF codes gave me the regexes to use. It just seemed easier at the time, and it blocked the data into recognizable patterns that could then be used to suck out the needed information.
    Give one a try, and you'll see what I'm talking about.
    Good luck!
