Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Data Salad Address Problem

by SamCG (Hermit)
on Jul 28, 2005 at 14:28 UTC ( #478987=perlquestion: print w/replies, xml ) Need Help??

SamCG has asked for the wisdom of the Perl Monks concerning the following question:

Okay, I've got some data that, due to our vendor, isn't exactly neatly packaged when we insert it into our database. . .here's a (modified) excerpt:

ACCT_TEXT1ACCT_TEXT2ACCT_TEXT3ACCT_TEXT4ACCT_TEXT5ACCT_TEXT6
BIGGLE EQUITY MGMT ASSOC. PIGGLE BIGGLING PLAN FUTURES I & II TWO LONDDDD CENTER TRENTON NJ 97302-5115
AGRICULTURAL LIFE INSURANCE CO MAIN 111 ELEVEN AVE PO BOX 123 MORRISTOWN NJ 97631-0633
MILLENA BANC MILLINER BANK & TRUST CO 456 THAUMATURGE AVENUE NEWSOME MA 07145-6316
BRICKABRACK ADVISORS LIBERTARIAN BANK 888 CRAZED CREEK ROAD STE 11 MEMPHIS KY 77882-5394
FOO COMMUNITY BANCORP THE FOOVILLE SAVINGS BANK 18 FOOL AVENUE FOO NY 10567-1735
ELROND ASSET MANAGEMENT BRA3255 CSHYTFD ATTN SETTLEMENTS DEPT. 466 LURCHING AVENUE NEW YORK NY 10347
CRIPPLE CREEK INVESTMENT ADVISORS GLIB TALKER 666 9TH AVENUE 8TH FLOOR NEW YORK NY 10235-2302
SEPARATIST STATE BANK ATTN: JOSE KWAN XYZ FINANCIAL CENTER 145 TALKING FAST ROAD NEWBURGH NY 07083-2340


Essentially, I need to separate this out into recognizable name/address relations, complete with city/state/postal code divided into identifiable columns.

My first inclination would be to use a regex of some kind, but I'm fairly certain Merlyn would warn (correctly, I think) that this is one of those verboten topics for regexes. About the only thing I can be relatively sure of is that city/state/zip will be last (though not always in the same column).

Any advice appreciated. Even "avoid doing this like the plague".

Replies are listed 'Best First'.
Re: Data Salad Address Problem
by gam3 (Curate) on Jul 28, 2005 at 16:09 UTC
    Assuming tab seperated data this will do the job.
    use strict; while (<DATA>) { my @data = split(/\t/); my $city_data; my ($city, $state, $zip); do { $city_data = pop(@data); } while($city_data =~ /^\s*$/); if (($city, $state, $zip) = ($city_data =~ m/^(.*)\s+([A-Z][A-Z])\ +s+([0-9-]+)/)) { print "$city, $state $zip\n"; } else { die "Could not parse $city_data\n"; } my $x = 1; for my $line (@data) { print $x, " $line\n"; $x++; } printf("%d %s, %s %s\n", $x, $city, $state, $zip); print "\n"; } __DATA__ BIGGLE EQUITY MGMT ASSOC. PIGGLE BIGGLING PLAN FUTURES I & II + TWO LONDDDD CENTER TRENTON NJ 97302-5115 AGRICULTURAL LIFE INSURANCE CO MAIN 111 ELEVEN AVE PO BOX 123 + MORRISTOWN NJ 97631-0633 MILLENA BANC MILLINER BANK & TRUST CO 456 THAUMATURGE AVENUE + NEWSOME MA 07145-6316 BRICKABRACK ADVISORS LIBERTARIAN BANK 888 CRAZED CREEK ROAD +STE 11 MEMPHIS KY 77882-5394 FOO COMMUNITY BANCORP THE FOOVILLE SAVINGS BANK 18 FOOL AVENUE + FOO NY 10567-1735 ELROND ASSET MANAGEMENT BRA3255 CSHYTFD ATTN SETTLEMEN +TS DEPT. 466 LURCHING AVENUE NEW YORK NY 10347 CRIPPLE CREEK INVESTMENT ADVISORS GLIB TALKER 666 9TH AVENUE + 8TH FLOOR NEW YORK NY 10235-2302 SEPARATIST STATE BANK ATTN: JOSE KWAN XYZ FINANCIAL CENTER + 145 TALKING FAST ROAD NEWBURGH NY 07083-2340
    Depending on the number of records there are you might think of verifying the addresses (using HTML::Form) at http://zip4.usps.com/zip4/welcome.jsp
    -- gam3
    A picture is worth a thousand words, but takes 200K.
Re: Data Salad Address Problem
by jcoxen (Deacon) on Jul 28, 2005 at 15:45 UTC
    Maybe you could try working both ends against the middle. Based on the data you posted, the Company name is always field 1, the City, St, Zip combo is always the last field and the Street address is always the next to last. Given that, parse for the Company name, Street address and City, St, Zip, then deal with the middle 1-3 fields if they exist or pad with delimiters if they don't. BTW, I am assuming that your data uses consistent delimiters. If not, then step 0 would be to clean up the delimiters.

    I've had to clean up 'dirty' lists before, although nothing this bad, and that's the approach that I took. Hope this helps,

    Jack

Re: Data Salad Address Problem
by davidrw (Prior) on Jul 28, 2005 at 16:22 UTC
    The above all looks like decent advice .. One additional thought/approach i had that might save some steps is to "right-align" the data. First, this is assuming the the last non-blank column has the city/state/zip and the second-to-last has the address (street name/number), and thus saves the "if zip_in_6 elsif zip_in_5 elsif zip_in_4 ..." branching).
    my @data = ( [...], [...], ... ); # assume AoA foreach my $row (@data){ while(scalar(@$row) && ! $row->[-1]){ # if last item is blank pop @$row; #then remove it } my $city_state_zip_str = $->row[-1]; my $address_str = $row->[-2]; # parse those two vars w/regexp's here }
    (Note you could reverse the @$row array and work with ($row->[0], $row->[1]) instead)
      I would probably use something like this:

      my @fields; ... foreach my $row (@data) { @string = split(/ +/); for (0..$#string) { push @{ $fields[$_] }, $string[$_]; } } ...
      That should get the multiline record string, process it by-line, and push the corresponding field into an AoA. What needs to be added is the processing for ZIP code since it's the last field of the last row in a record.

       

      --------------------------------
      An idea is not responsible for the people who believe in it...

Re: Data Salad Address Problem
by socketdave (Curate) on Jul 28, 2005 at 14:41 UTC
    This looks pretty horrible... Junk in, junk out.

    That said, if you know that you'll always have the same format for the city and state, and always have a nine digit zip, you may have a chance. work from the last field forward. Find the ZIP with a regex, then the city and state. After that, you'll have to make the assumption that whatever is in the next field over contains the street address. Good luck!

    Update: I just noticed that you do have a five digit zip in there. It won't make that much difference in the accuracy ;)
      I agree it's horrible. . . and unfortunately, I can be sure of very little regarding the formatting. I see a number of records that put commas between city and state (which isn't really a big problem), and some which abbreviate state names with things like "MASS", and "WASH" (oh, joy).

      Thanks for the good wishes. . .
        You're basically going to have to quantify the different possibilities and allow for them individually. I was able to get the zip codes accurately from your sample data:

        unless ( ($zip) = ($field5 =~ /(\d{5}-\d{4})/)) { unless ( ($zip) = ($field5 =~ /(\d{5})/)) { unless ( ($zip) = ($field4 =~ /(\d{5}-\d{4})/) +) { ($zip) = ($field4 =~ /(\d{5})/); + } } }


        but that's already pretty nasty...
Re: Data Salad Address Problem
by bofh_of_oz (Hermit) on Jul 28, 2005 at 16:02 UTC
    First, determine the logical formatting rules for the data. In your case:

    - Records seem to be separated by a blank line (two \n)
    - Every field takes a certain number of characters on every line
    - Every field can take multiple lines
    - ZIP code is in the format /\d{5}-\D{4}/

    You can use a multiline regexp, process each line with substr pushing elements into corresponding array(s) or appending to the strings/whatever. I'm not clear about ZIP codes - if they can be in field 4 or 5, use regex; if they are only in field 5 (and we do not see it because of HTML scrambling the text separators), then you'll be fine.

    HTH

    P.S. If you want, we can work on the code later...

    --------------------------------
    An idea is not responsible for the people who believe in it...

      Don't assume that you can derive the logical formatting rules correctly from the data supplied. Make every effort to get the supplier of the data to provide you with the rules they used to create the data.
        I agree. However, that works only in about 30% of the situations as data suppliers make every effort not to open their "private and confidential" data formats. That is, if they indeed understand them... Often, studying the data is the only way to grab the logic. Granted, one'd need a lot more data for statistical analysis than provided here, but I just outlined the idea...

        --------------------------------
        An idea is not responsible for the people who believe in it...

Re: Data Salad Address Problem
by nothingmuch (Priest) on Jul 28, 2005 at 21:17 UTC
    IMHO the most reliable solution is to have the computer do human assisted parsing...

    Make some independant patterns:

    my %regexes = ( zip => { 100 => qr/(\d{4,5})/, 20 => qr/(\d+)/ }, city_state => { 200 => qr/(\w+)\s+([A-Z]{2})/, 10 => qr/(\w+)\s+(( +?:\w+)+)/), city => { 25 => qr/(\w+)/, 2 => qr/((?:\w+)+)/ }, name => { 50 => qr/^\s+(\w+)/ }, ... );
    Once you have these make a loop that will match all the patterns on each entry, where pattern groups like 'city_state' are per match...

    The numbers keying the regexes are scores to assign for each match.

    Have the computer display all possible interpretations of the address in a very well formed manner (print "Zipcode: <<$zip_match>>";), sorted by the match score, and let the user choose which one to use.

    As you go through some records you can tweak your scores a little, and maybe introduce a threshold parameter (automatically select the first match if it's score is 10 times more than the next).

    This should let you avoid cutting and pasting, or other tedious mouse/kebyoard work which is probably going to be the bulk of your time. If you can get 50% of the addresses handled automatically I guess you could weed through even 10,000-20,000 in less than a few hours of work - just add some checkpointing mechanism to the script so that you can go back and redo a single entry, or stop the script, delete bad parses from the database, and start the script again, and it will continue to parse the next unparsed entry.

    Good luck!

    Update: some guages to give a score - use less variadic types, try to anchor the matches, and align them on field numbers... The more constraints you put on a pattern, the higher it's score should be, but do keep around low scoring fallback matches. Try to match patterns in groups, like city_state, street_city, street_city_state, and give the latter one more than the sum of the other two, so that it will be preferred to fill those fields.

    Also - don't be too generous with the number of alternatives - you have to permutate N arrays of matches to produce N*N possible parses, and even if you clip the lowes scoring matches, it might be overwhelming.

    If you know the list of cities in advance, programmatically create a regex to match them as alternatives. Do this for states regardless - if there is a two letter word and it's not a statename, it better not be parsed as such. Do this for long words too, e.g. "New York" (even though that one is ambiguous).

    Also make a "Postpone" choice, that just appends the entry to the file, and moves to the next one (possibly making a stub entry in the database), so that you can delay anything that isn't obvious at once till later, and deal with it when you're not in auto-pilot mode.

    -nuffin
    zz zZ Z Z #!perl
Re: Data Salad Address Problem
by snoopy (Curate) on Jul 29, 2005 at 00:42 UTC
    As a final sanity check, you may want to verify your addresses:

    The quickest way with US addresses is to use Scrape::USPS::ZipLookup to scrape the United States Postal Service Zip Code Service.

    It'll verify that your resultant combination of street address/city/state/zip against the UPS's own postal address database.

    From the pod doco:

    #!/usr/bin/perl use Scrape::USPS::ZipLookup::Address; use Scrape::USPS::ZipLookup; my $addr = Scrape::USPS::ZipLookup::Address->new( 'Focus Research, Inc.', # Firm '', # Urbanization '8080 Beckett Center Drive Suite 203', # Delivery Address 'West Chester', # City 'OH', # State '45069-5001' # ZIP Code );
Re: Data Salad Address Problem
by jimX11 (Friar) on Jul 28, 2005 at 22:47 UTC
Re: Data Salad Address Problem
by JamesNC (Chaplain) on Jul 29, 2005 at 10:59 UTC
    Fun exercise, here was my approach, I make the assumption that the last block of data is your city_state_zip jumble and the first block is your customer, everything else in the middle is addressing info. Then I pull everything apart from its ends city_state zip then I have to make the horrible assumption that there is a state code... in either case, I can still put city_state together which you can see in the dump if you uncomment it. About as fancy as I care to get in 25 mins. Cheers.
    use strict; use Data::Dumper; my @customers; while(<DATA>){ my %customer; my (@data) = split /[\t]/, $_; my $city_state_zip; my $which = 0; for( 1..$#data){ $which = $#data -$_; $city_state_zip = $data[-$_]; last if ( $city_state_zip =~/\w+/ig ); } $customer{name} = $data[0]; for( 1..5){ if( $_ <= $which ){ $customer{qq/address$_/} = $data[$_]; } else { $customer{qq/address$_/} = "NULL"; } } $customer{city_state_zip} = $city_state_zip; push @customers, \%customer; } foreach my $c ( @customers ){ my ($city_state, $zip ) = ($1, $2, $3) if $c->{city_state_zip} =~/(.*)\s+([\d\-]+)/g; my ($city, $state) = ($`, $1) if $city_state =~/\s(\w+)$/; $c->{city} = $city; $c->{state} = $state; $c->{zip} = $zip; print " $c->{name}$c->{address1} $c->{address2} $c->{address3} $city, $state $zip\n"; } #print Dumper(\$customers[0]); __DATA__ BIGGLE EQUITY MGMT ASSOC. PIGGLE BIGGLING PLAN FUTURES I & II + TWO LONDDDD CENTER TRENTON NJ 97302-5115 AGRICULTURAL LIFE INSURANCE CO MAIN 111 ELEVEN AVE PO BOX 123 + MORRISTOWN NJ 97631-0633 MILLENA BANC MILLINER BANK & TRUST CO 456 THAUMATURGE AVENUE + NEWSOME MA 07145-6316 BRICKABRACK ADVISORS LIBERTARIAN BANK 888 CRAZED CREEK ROAD ST +E 11 MEMPHIS KY 77882-5394 FOO COMMUNITY BANCORP THE FOOVILLE SAVINGS BANK 18 FOOL AVENUE + FOO NY 10567-1735 ELROND ASSET MANAGEMENT BRA3255 CSHYTFD ATTN SETTLEMENTS DEPT. + 466 LURCHING AVENUE NEW YORK NY 10347 CRIPPLE CREEK INVESTMENT ADVISORS GLIB TALKER 666 9TH AVENUE 8 +TH FLOOR NEW YORK NY 10235-2302 SEPARATIST STATE BANK ATTN: JOSE KWAN XYZ FINANCIAL CENTER + 145 TALKING FAST ROAD NEWBURGH NY 07083-2340

    I wanted to add that, I think the way Perl allows you to develop solutions to real world problems like this so quickly and elegantly speaks volumes of praise for the mountain of programmers that built this language. I don't think I have seen a closer knit community of folks so willing to help other programmers learn and develop than right here on Perl Monks either. :0) Super!
    JamesNC
      Wow, I'm truly overwhelmed by the volume and intelligence in the responses! As it is most of the time when I ask a question on PM, the answers will cause me to stretch my knowledge of perl and programming a little more. Thank you all...

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://478987]
Approved by Tanalis
Front-paged by planetscape
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (3)
As of 2021-05-16 15:50 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Perl 7 will be out ...





    Results (152 votes). Check out past polls.

    Notices?