http://www.perlmonks.org?node_id=268165

data67 has asked for the wisdom of the Perl Monks concerning the following question:

Background
I am trying to read a flat file that contains some name and address information. What makes this tricky is that the file is not delimited by anything.

Here's an example:

141 Martha Lynn Amblynoster 12345 New Pickle Drive MoreTown PA 98765 654 555-1212 no detail
178 Edgar Bimblybum Jr. 23456 Highfiddle Road Acheville Ma 24680 345-789-1234 no detail
161 Joyce W. Wogerbung 18 Lily Piffle Lane Middleton PA 34567 610-678-2345 no detail
188 Alex Shmogle 6543 Bibblyboo St NW Apt B Washington DC 20009 202-987-6543 no detail

Problem
I was trying to read the file into an array and then go through each line, pulling out each section of information.
Namely, Name-CustomerName-CustomerAddress (broken down into: Street_City_State)-Telephone-Comments.

Here is what I have so far:

foreach my $line (@data_file) {
    if ($line =~ m!^(\d+)\s+(([A-Za-z]+\s+[A-Za-z].\s+[A-Za-z]+)|([A-Za-z]+\s+[A-Za-z]+) )!) {
        print "$1 - $2 \n";
        $custNum      = $1; # First number field.
        $custName     = $2; # Name styles can vary so match everything between two numbers.
        $custStreet   = $3; # Street is everything after name and before CITY.
        $custCity     = $4; # City is after address and before the TWO char state identifier.
        $custState    = $5; # State is after address and before FIVE digit zip number.
        $custZip      = $6; # Zip is before telephone number and after State id.
        $custTel      = $7; # Telephone no. is after zip and before comments field.
        $custComments = $8; # Last remaining part after telephone number.
    }
}

The regex you see above so far matches the CustName and some names. Other than that I am still trying to figure this one out. Any help would be great. Thx.

Obfuscated data - dvergin 2003-06-23

edited: Tue Jun 24 01:28:48 2003 by jeffa - title change (was: Help with Regular Expression)

Replies are listed 'Best First'.
(jeffa) Re: Parse mailing addresses with a regex
by jeffa (Bishop) on Jun 23, 2003 at 14:20 UTC
    The first mistake most make when creating a complex regex is to use $1,$2, et al. Just capture everything into an array. This is not guaranteed to work for everything you throw at it, but it works for the data you have given:
    my @line;
    foreach my $line (@data_file) {
        @line = $line =~ /
            (\d+)\s+                  # first numbers
            ([^\d]+)                  # full name
            (.*)?(?:\w\w)\s+          # street address
            (\w\w)\s+                 # state
            (\d{5})\s+                # zip
            (\d{3}-\d{3}-\d{4})\s+    # phone
            (.*)                      # the rest
        /x;
        print join('|',@line),"\n";
    }
    And here is a play by play breakdown ;)
    • (\d+)\s+: one or more digits followed by at least one white space
    • ([^\d]+): everything up to a digit
    • (.*)?(?:\w\w)\s+: tricky (and fragile) - this gets everything up to two consecutive alphas that are followed by at least one whitespace
    • (\w\w)\s+: two consecutive alphas followed by at least one whitespace
    • (\d{5})\s+: exactly 5 digits followed by at least one whitespace
    • (\d{3}-\d{3}-\d{4})\s+: you get the picture ;)
    Hope this helps, anything more might require something smarter like Parse::RecDescent.
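As a minimal sketch of the "capture everything into a list" idea, the captures can be assigned straight into a hash slice so each field gets a name. The field names, the `parse_record` helper, and the pattern itself are illustrative assumptions, not code from this thread; the pattern assumes a one-word city and a phone number written as three groups.

```perl
use strict;
use warnings;

# Hypothetical field names; they mirror the variables in the question.
my @FIELDS = qw(num name street city state zip phone comments);

# Returns a hash reference on success, undef (or an empty list) on failure.
sub parse_record {
    my ($line) = @_;
    my @captures = $line =~ /
        ^(\d+)\s+                     # customer number
        (\D+?)\s+                     # name, lazily, up to the next number
        (\d.*?)\s+                    # street, starting at the house number
        (\w+)\s+                      # city, assumed to be one word
        ([A-Za-z]{2})\s+              # state, two letters
        (\d{5})\s+                    # zip, five digits
        (\d{3}[-\s]\d{3}-\d{4})\s*    # phone
        (.*)$                         # comments, the rest
    /x;
    return unless @captures;

    my %record;
    @record{@FIELDS} = @captures;     # hash slice: one name per capture
    return \%record;
}

my $rec = parse_record(
    '161 Joyce W. Wogerbung 18 Lily Piffle Lane Middleton PA 34567 610-678-2345 no detail'
);
print join('|', map { $rec->{$_} } @FIELDS), "\n" if $rec;
```

A match in list context returns the captures in order, so adding or removing a group only means editing `@FIELDS`, not renumbering `$1` through `$8`.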

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
(don't use live data!) Re: Parse mailing addresses with a regex
by particle (Vicar) on Jun 23, 2003 at 14:37 UTC

    please don't use live data in your examples. take the time to obfuscate it. you've just done your customers a disservice, and may have violated one or more privacy policies. i've put in a request to the Editors to change it.

    please be more careful in the future.

    ~Particle *accelerates*

      ++particle,
      I'd be absolutely furious if I came across 'my' details in a public forum such as this. Given the proliferation of spam (Digital & Snailmail) this is a topic that should be stressed in the most stringent way possible.

      Thankfully at last individuals are to be made responsible for their actions regarding divulgence of personal data (at least in the UK that is). So, data (and others) take heed, don't do it again!... "for your own and others sakes", please.
      </rant> - barrd

      Update: Any chance dvergin you could change BrowserUK's Node as well? As that contains (at present) the same "real" data...

      Ah, apparently the data wasn't real after all... bugger... (who's red faced now? ;)
      /me scuttles off into a dark corner

Re: Parse mailing addresses with a regex
by diotalevi (Canon) on Jun 23, 2003 at 14:27 UTC

    Once upon a time I tried to parse addresses and came up with some fugly code. You'd first want to check out Lingua::EN::AddressParse and see if you can use it as-is or modify it to suit your needs. If all else fails here's the code I was using. Keep in mind this is two files. unabbrev.pm takes the standard US postal service abbreviations and expands them. parse_address.pl probably read one address per line. You will definitely need to modify this to use it.

    Also, this is not how you should program. The prototypes should go, the direct access of @_, etc. Don't use this as a style guide. Please. This is not how I write perl anymore. For production or otherwise. Ugly, ugly, ugly.

    # unabbrev.pm
    # Fixed the unabbrev.pm file up some.
    sub uword {
        join ' ', map unabrev($_), split ' ', shift;
    }

    sub unabrev {
        local $_ = shift;
        return $_ unless /\w/;

        # One really big expression
        s/^e$/east/ or
        s/^w$/west/ or
        s/^(?:n|no)$/north/ or
        s/^(?:s|so)$/south/ or
        s/^ne$/north east/ or
        s/^nw$/north west/ or
        s/^se$/south east/ or
        s/^sw$/south west/ or
        s/^(?:avs|aves)$/avenue south/ or
        s/^beachrd$/beach road/ or
        s/^ccedar$/cedar/ or
        s/^(?:adn|add'n)$/addition/ or
        s/^appache$/apache/ or
        s/^apt$/apartment/ or
        s/^apts$/apartments/ or
        s/^(?:av|ave)$/avenue/ or
        s/^(?:bch|bchch|beac)/beach/ or
        s/^(?:bx|b0x)/box $1/ or
        s/^blvd$/boulevard/ or
        s/^brg$/burg/ or
        s/^bldg$/building/ or
        s/^cen$/center/ or
        s/^(?:centeral|cental)$/central/ or
        s/^char$/character/ or
        s/^chas$/chase/ or
        s/^ches$/chesapeake/ or
        s/^chig$/chicago/ or
        s/^cir$/circle/ or
        s/^(?:cty|co|cnty)$/county/ or
        s/^(?:ct|crt|cour)/court/ or
        s/^cr$/curve/ or
        s/^crk$/creek/ or
        s/^crl$/curl/ or
        s/^(?:crystaln|crytl)$/crystal/ or
        s/^ctr$/center/ or
        s/^dist$/district/ or
        s/^(?:drv|drve|dr)$/drive/ or
        s/^est$/estate/ or
        s/^fst$/forest/ or
        s/^ft$/fort/ or
        s/^(?:govt|govern|gov't)$/government/ or
        s/^(?:grv|grov)$/grove/ or
        s/^hgld$/highland/ or
        s/^hglds$/highlands/ or
        s/^(?:hgt|hht|height|ht|hghtss|hghts)$/heights/ or
        s/^(?:hy|hyw|hwy)$/highway/ or
        s/^isl$/island/ or
        s/^(?:jct|jction|jctn|junctn|juncton)$/junction/ or
        s/^(?:jctns|jcts)$/junctions/ or
        s/^l00p$/loop/ or
        s/^(?:lk|lak)$/lake/ or
        s/^lks$/lakes/ or
        s/^li'l$/lil/ or
        s/^(?:la|lanes|ln)$/lane/ or
        s/^ml$/mill/ or
        s/^mls$/mills/ or
        s/^mkt$/market/ or
        s/^(?:mt|mnt)$/mount/ or
        s/^mpls$/minneapolis/ or
        s/^(?:mtn|mntain|mntn)$/mountain/ or
        s/^(?:mntns|mtns)$/mountains/ or
        s/^(?:nth|nrth)$/north/ or
        s/^nrthbrk$/northbrook/ or
        s/^(?:unorg|unorgized)$/unorganized/ or
        s/^ph$/penthouse/ or
        s/^(?:pk|prk)$/park/ or
        s/^(?:pkwy|parkwy|pkway|pky)$/parkway/ or
        s/^pl$/place/ or
        s/^plaz$/plaza/ or
        s/^(?:pobox|po)$/box/ or
        s/^prct$/precinct/ or
        s/^pres$/president/ or
        s/^pt$/point/ or
        s/^pts$/points/ or
        s/^qtr$/quarter/ or
        s/^qtrs$/quarters/ or
        s/^(?:r|rt)$/route/ or
        s/^rd$/road/ or
        s/^rdg$/ridge/ or
        s/^resor$/resort/ or
        s/^(?:ri|rv|riv|rvr)$/river/ or
        s/^(?:rte|rr|rural)$/route/ or
        s/^(?:rs|rst)$/rest/ or
        s/^rverview$/riverview/ or
        s/^(?:shr|shoar)$/shore/ or
        s/^(?:shoars|shrs)$/shores/ or
        s/^(?:spgs|spngs|sprngs)$/springs/ or
        s/^(?:st|str)$/street/ or
        s/^svc$/service/ or
        s/^terr$/terrace/ or
        s/^twp$/township/ or
        s/^(?:tr|trl|trails|trls)$/trail/ or
        s/^trlr$/trailer/ or
        s/^vac$/vacation/;

        return $_;
    }

    1;

    # address_parse.pl
    #!/usr/bin/perl
    $ID      = 0;
    $ADDRESS = 1;

    while ($record = <>) {
        ($id,$address) = split /\t/, $record;
        @words = split /\s+/, $address;
        %record = ();

        HOUSE:
        if ($words[0] =~ /^\d+$/) {
            $record{house} = shift @words;
            ($record{odd}) = (($record{house} % 2) == 0 ? 'e' : 'o');
        }
        if ($words[0] =~ /^1\/2$/) {
            $record{fraction} = shift @words;
        }

        UNIT:
        for ($i = $#words; $i >= 0; $i--) {
            if (is_unit($words[$i])) {
                $record{unit} = join ' ', @words[$i .. $#words];
                $#words = $i - 1;
                last UNIT;
            }
        }
        unless (defined $record{unit}) {
            if ($words[$#words] =~ /\d+$/) {
                $record{unit} = pop @words;
            }
        }

        $t = $words[$#words];
        if (is_ew($t)) {
            $record{direction} = pop @words;
            $t = $words[$#words];
            if (is_ns($t)) {
                $t = pop @words;
                $record{direction} = "$t ".$record{direction};
            }
        } elsif (is_ns($t)) {
            $record{direction} = pop @words;
        }

        unless (exists $record{direction}) {
            for ($i = 0; $i < @words; $i++) {
                if (is_ns($words[$i]) or is_ew($words[$i])) {
                    $record{direction} .= ' '.$words[$i];
                    $words[$i] = '';
                }
            }
        }
        @words = grep /\w/, @words;

        for ($i = $#words;$i>=0;$i--) {
            if (is_type($words[$i])) {
                $record{type} = $words[$i];
                $words[$i] = '';
                goto DIR;
            }
        }

        DIR:
        @words = grep /\w/, @words;
        for ($i = 0; $i < @words-1; $i++) {
            if ($words[$i] eq 'p' and ($words[$i+1] =~ /^(?:o|0)$/)) {
                $words[$i] = '';
                $words[$i+1] = '';
                if (exists $record{unit}) {
                    $record{unit} = 'po '.$record{unit};
                } else {
                    $record{unit} = 'po';
                }
            }
        }

        $t = join ' ', @words;
        $record{street} = $t if $t;

        if (1) {
            $line = join "\t", map {defined $_?$_:'\\N'} ($id, @record{qw(house odd fraction street direction type unit )});
            for (undef,undef) {
                $line =~ s/ +/ /g;
                $line =~ s/ +\t/\t/g;
                $line =~ s/\t +/\t/g;
            }
            print $line,"\n";
        }
    }

    sub is_ns ($) { return $_[0] =~ /^(?:north|south)$/; }
    sub is_ew ($) { return $_[0] =~ /^(?:east|west)$/; }
    sub is_unit ($) { return $_[0] =~ /(?:box|apartment|lot|suite|campus|lower|upper|floor|gymnasium|hall|building)\s*/; }
    sub is_type ($) { return $_[0] =~ /(?:alley|avenue|aveue|avneu|bay|boulevard|circle|court|courtt|courttt|cove|crest|curve|dale|drive|grove|highway|hill|knoll|lane|mall|orchard|park|parkway|pass|pines|place|plaza|raod|ridge|road|route|square|state|street|summit|ter|terrace|trail|walk|way)/o; }

Re: Parse mailing addresses with a regex
by tilly (Archbishop) on Jun 23, 2003 at 14:23 UTC
    I strongly recommend getting a database. It will make life a lot easier in the end, otherwise any hack you have for this will be broken.

    However if the data is very uniform, you can just wildcard the name and rely on everything else to lock down the position. Like this untested RE: /^(\d+)\s+(.*?)\s+(\d+.*?)\s+(\w\w)\s+(\d{3}-\d{3}-\S+)\s*(.*)/ The first capture will be the customer code, then the name, then street address, state, then telephone number (with allowance for extensions, as in 223-456-1234x56), then comment.

    Looking at that again, a database would be far preferable. (If you don't do that, then add some validation checks. Because the data WILL be entered badly, and that will be a constant battle to face.)
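The validation checks mentioned above could look something like the following sketch. The `record_problems` helper and the hash keys are hypothetical names chosen for illustration; the phone check allows an optional extension of the form x56, and the `//` operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Two-letter codes for the 50 US states plus DC.
my %US_STATE = map { $_ => 1 } qw(
    AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN
    MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA
    WV WI WY
);

# Returns a list of problems found; an empty list means the record passed.
sub record_problems {
    my ($rec) = @_;
    my @problems;

    push @problems, 'customer number is not numeric'
        unless ($rec->{num} // '') =~ /^\d+$/;
    push @problems, 'unknown state code'
        unless $US_STATE{ uc($rec->{state} // '') };
    push @problems, 'zip is not five digits'
        unless ($rec->{zip} // '') =~ /^\d{5}$/;
    push @problems, 'phone is not NNN-NNN-NNNN with optional extension'
        unless ($rec->{phone} // '') =~ /^\d{3}-\d{3}-\d{4}(?:x\d+)?$/;

    return @problems;
}

my @found = record_problems(
    { num => '178', state => 'Ma', zip => '24680', phone => '345-789-1234' }
);
print @found ? join("\n", @found) . "\n" : "record looks plausible\n";
```

Checks like these only catch records that are structurally wrong; they say nothing about whether the zip actually belongs to the state.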

      While "getting a database" is a good idea, it may not solve this person's problem. The problem is, given a large volume of legacy, unparsed, free-form address data, how do you parse it to put it into the database in the first place?

      Unfortunately, that's difficult. Lingua::EN::AddressParse is good if you know what country the address information is for, but it isn't sufficient by itself if you also need to extract country codes from international address data.

      I'm actually about to solve a similar problem myself. If I can't find consistently exploitable patterns in the data, my next tactic will be using Lingua::EN::AddressParse in combination with state/zipcode verification to try to catch all the US addresses, and then to try to exploit patterns in the remaining (international) addresses that AddressParse can't parse effectively.

      Alan

        True. In that case, as you indicate, you try to avoid working with the legacy data. Instead you do multiple passes, in each pass you look for things that you can parse, and divide the data into stuff that you just figured out, and leftovers. After a few rounds, the number of leftovers hopefully becomes manageable by hand, you load your database, and then go from there.

        If acquiring legacy data is an ongoing process, you can semi-automate this. But it would be unwise to try to avoid having the final manual pass. A 95% solution is easy. 99.5% is doable. 100% is pretty much impossible.
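The multi-pass approach described above might be sketched as follows. The `multi_pass` driver and the two toy passes are assumptions made for illustration: each pass is a code reference that returns a record for lines it understands and nothing otherwise, and whatever no pass can handle is left over for manual review.

```perl
use strict;
use warnings;

# Run each pass over the lines still unparsed; return parsed records and leftovers.
sub multi_pass {
    my ($lines, @passes) = @_;
    my @parsed;
    my @leftovers = @$lines;

    for my $pass (@passes) {
        my @still_unparsed;
        for my $line (@leftovers) {
            my $record = $pass->($line);
            if ($record) { push @parsed, $record }
            else         { push @still_unparsed, $line }
        }
        @leftovers = @still_unparsed;
    }
    return (\@parsed, \@leftovers);
}

# Strict pass: upper-case state code only.
my $strict_pass = sub {
    my ($line) = @_;
    return unless $line =~ /^(\d+)\s+(\D+?)\s+(\d.*)\s+([A-Z]{2})\s+(\d{5})\b/;
    return +{ num => $1, name => $2, address => $3, state => $4, zip => $5 };
};

# Looser pass: accept any letter case for the state and normalise it.
my $loose_pass = sub {
    my ($line) = @_;
    return unless $line =~ /^(\d+)\s+(\D+?)\s+(\d.*)\s+([A-Za-z]{2})\s+(\d{5})\b/;
    return +{ num => $1, name => $2, address => $3, state => uc $4, zip => $5 };
};

my ($parsed, $leftovers) = multi_pass(
    [
        '141 Martha Lynn Amblynoster 12345 New Pickle Drive MoreTown PA 98765 654 555-1212 no detail',
        '178 Edgar Bimblybum Jr. 23456 Highfiddle Road Acheville Ma 24680 345-789-1234 no detail',
        'a line no pass understands',
    ],
    $strict_pass, $loose_pass,
);
printf "%d parsed, %d left for manual review\n", scalar @$parsed, scalar @$leftovers;
```

Ordering the passes from strictest to loosest keeps the most trustworthy matches from being claimed by a sloppier pattern.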

      Even if he gets a database, won't he still need to parse the data in order to get it into the DB?


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: Parse mailing addresses with a regex
by BrowserUk (Patriarch) on Jun 23, 2003 at 14:35 UTC

    Personally, I think I'd use split with capturing brackets, so that the delimiters are not discarded, to break this into chunks first. Using /([\d-]+)/ as the delimiter breaks the line up between the numbers (the '-' is there to keep the telephone number in one chunk).

    #! perl -sw
    use strict;

    while( <DATA> ) {
        my @chunks = split /([\d-]+)/;
        print join'|',@chunks;
    }

    __DATA__
    141 Martha Lynn Amblynoster 12345 New Pickle Drive MoreTown PA 98765 654 555-1212 no detail
    178 Edgar Bimblybum Jr. 23456 Highfiddle Road Acheville Ma 24680 345-789-1234 no detail
    161 Joyce W. Wogerbung 18 Lily Piffle Lane Middleton PA 34567 610-678-2345 no detail
    188 Alex Shmogle 6543 Bibblyboo St NW Apt B Washington DC 20009 202-987-6543 no detail

    __OUTPUT__
    |141| Martha Lynn Amblynoster |12345| New Pickle Drive MoreTown PA |98765| |654| |555-1212| no detail
    |178| Edgar Bimblybum Jr. |23456| Highfiddle Road Acheville Ma |24680| |345-789-1234| no detail
    |161| Joyce W. Wogerbung |18| Lily Piffle Lane Middleton PA |34567| |610-678-2345| no detail
    |188| Alex Shmogle |6543| Bibblyboo St NW Apt B Washington DC |20009| |202-987-6543| no detail

    As you can see, the only chunk that needs much further processing is then the address ($chunks[4]), which only requires the last two words to be broken off to give you city and state. At least as far as your examples go.

    How you would recognise city names with more than one word (e.g. Salt Lake City) is up to you. Probably the best way would be to grab a dictionary of town/city names from somewhere, put them in a hash, strip the state and look up the last word, the last two words, the last three words until you get a match. Subdividing the hash by the state first would further increase your reliability.
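The dictionary lookup described above can be sketched roughly as follows. The `%CITIES` table is a tiny made-up stand-in for a real city list, and `split_city` is a hypothetical helper: it strips the state, then tries the last word, the last two words, and the last three words against the cities known for that state.

```perl
use strict;
use warnings;

# Assumed sample data: lower-cased city names, subdivided by state.
my %CITIES = (
    UT => { 'salt lake city' => 1 },
    PA => { 'middleton' => 1, 'moretown' => 1 },
    DC => { 'washington' => 1 },
);

# Takes the address chunk ("street ... city STATE"); returns (street, city, state)
# or an empty list when no known city is found.
sub split_city {
    my ($text, $cities) = @_;
    my @words = split ' ', $text;
    return unless @words >= 2;

    my $state = uc pop @words;               # last word is the state code
    my $known = $cities->{$state} or return; # no city list for this state

    my $max = @words < 3 ? scalar @words : 3;
    for my $n (1 .. $max) {                  # last word, last two, last three
        my $city = join ' ', @words[-$n .. -1];
        next unless $known->{ lc $city };
        my $street = join ' ', @words[0 .. $#words - $n];
        return ($street, $city, $state);
    }
    return;
}

my ($street, $city, $state) = split_city(' 100 Main St Salt Lake City UT ', \%CITIES);
print "$street | $city | $state\n" if defined $city;
```

Trying the shortest candidate first follows the order given in the post; with a real dictionary it may be safer to try the longest candidate first, so that a city whose last word is itself a city name is not cut short.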


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: Parse mailing addresses with a regex
by hsmyers (Canon) on Jun 23, 2003 at 14:23 UTC
    Given that you've got a grip on most of the problem, I'd take a look at Lingua::EN::NameParse which might give you what you are looking for. I've used it in the past, and I seem to remember that it works fairly well.

    --hsm

    "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
Re: Parse mailing addresses with a regex
by TomDLux (Vicar) on Jun 23, 2003 at 14:33 UTC

    You might consider working backwards, or from the middle out, first extracting things you are more confident about. For example:

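    # Strip city, state, zip, phone and comments off the end of the line first.
    $line =~ s/(\w+)\s+([[:upper:]]{2})\s+(\d{5})\s+(\d{3}[-\s]*\d{3}-\d{4})\s*(.*)$//;
    ($custCity, $custState, $custZip, $custTel, $custComments) = ($1, $2, $3, $4, $5);
    # What remains is the customer number, the name, and the street.
    $line =~ /^(\d+)\s+(\D+?)\s+(\d.*?)\s*$/;
    ($custNum, $custName, $custStreet) = ($1, $2, $3);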

    This assumes that cities are one word, so it won't work if any of your customers are in New York City or Atlantic City or Boca Raton or Las Vegas.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

Re: Parse mailing addresses with a regex
by tos (Deacon) on Jun 23, 2003 at 15:03 UTC
    Hi,

    you only have to translate your remarks into regex-speech. For more clarity it's helpful to use the x-modifier. Here is my result with your remarks.

    while (my $line = <DATA>) {
        #if ($line =~ m!^(\d+)\s+(([A-Za-z]+\s+[A-Za-z].\s+[A-Za-z]+)|([A-Za-z]+\s+[A-Za-z]+) )!) {
        if ($line =~ m!
            ^(\d+)\s+
            ([^\d]+)
            ((?:\w+\s)+)
            (\w+)\s
            (\w\w)\s
            (\d{5})\s
            ([\d\-]+)\s
            (.*)\s$
            !x) {
            print "\n";
            print "\$1: $1 \n";
            print "\$2: $2 \n";
            print "\$3: $3 \n";
            print "\$4: $4 \n";
            print "\$5: $5 \n";
            print "\$6: $6 \n";
            print "\$7: $7 \n";
            $custNum      = $1; # First number field.
            $custName     = $2; # Name styles can vary so match everything between two numbers.
            $custStreet   = $3; # Street is everything after name and before CITY.
            $custCity     = $4; # City is after address and before the TWO char state identifier.
            $custState    = $5; # State is after address and before FIVE digit zip number.
            $custZip      = $6; # Zip is before telephone number and after State id.
            $custTel      = $7; # Telephone no. is after zip and before comments field.
            $custComments = $8; # Last remaining part after telephone number.
        }
    }
    __DATA__
    141 Martha Lynn Costello 11750 Old Mill Drive Media PA 19063 610-555-1212 no detail
    178 Edgar Jones Jr. 18013 Highfield Road Ashton Ma 20861 323-774-1339 no detail
    161 Joyce W. Whang 18 Long Point Lane Media PA 19063 610-891-2344 no detail
    188 Alex Smith 1979 Biltmore St NW Apt B Washington DC 20009 202-913-6685 no detail
    produces
    # perl re

    $1: 141
    $2: Martha Lynn Costello
    $3: 11750 Old Mill Drive
    $4: Media
    $5: PA
    $6: 19063
    $7: 610-555-1212

    $1: 178
    $2: Edgar Jones Jr.
    $3: 18013 Highfield Road
    $4: Ashton
    $5: Ma
    $6: 20861
    $7: 323-774-1339

    $1: 161
    $2: Joyce W. Whang
    $3: 18 Long Point Lane
    $4: Media
    $5: PA
    $6: 19063
    $7: 610-891-2344

    $1: 188
    $2: Alex Smith
    $3: 1979 Biltmore St NW Apt B
    $4: Washington
    $5: DC
    $6: 20009
    $7: 202-913-6685
    Greetings, tos
      And how do you get around the problem when the streets are actually numbers, like "19010 20th Ave NE Apt. 505" or "19010 SE 20th Ave Apt. 505"? And what if, instead of "lettered" apartments, you have numbered apartments? Addresses like this are very common in the state of Washington.
Re: Parse mailing addresses with a regex
by data67 (Monk) on Jun 23, 2003 at 15:08 UTC
    This is not live data, IT IS MADE UP (i.e., NOT REAL). Thanks for the tip though.
Re: Parse mailing addresses with a regex
by CountZero (Bishop) on Jun 23, 2003 at 15:15 UTC

    Slightly off-topic: rather than reading the file as a whole into an array and then going through the array, it is far less memory and resources-consuming (not to say more elegant) to read the file line by line and immediately handle each record as soon as it is read in.

    while (<CUSTOMERFILE>) {
        # YOUR REGEX AND OTHER THINGS HERE
        # Note: record just read is in $_ ; works nice with m//
    }
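A slightly fuller sketch of the same line-by-line idea, using a lexical filehandle and a callback. The `each_record` helper is a hypothetical name, and the demo reads from an in-memory string only so that it runs without a data file; with a real file you would open the path instead.

```perl
use strict;
use warnings;

# Read one record at a time and hand it to $handler; returns the number handled.
sub each_record {
    my ($fh, $handler) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;    # skip blank lines
        $handler->($line);
        $count++;
    }
    return $count;
}

# Demo input held in a string; a real script would open the customer file here.
my $sample = "141 first record\n\n178 second record\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $seen = each_record($fh, sub { print "got: $_[0]\n" });
close $fh;
print "$seen records handled\n";
```

Only one line is held in memory at a time, which is the point of the suggestion above.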

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Parse mailing addresses with a regex
by Theo (Priest) on Jun 23, 2003 at 18:51 UTC
Re: Parse mailing addresses with a regex
by reclaw (Curate) on Jun 24, 2003 at 07:18 UTC

    You might want to check the USPS website for address validation tools. You could use this to check some of your results.

    Your mileage may vary.