PerlMonks  

phone number parsing refuses to work

by Anonymous Monk
on Mar 13, 2004 at 23:02 UTC ( #336426=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am getting so frustrated. This is a repost of two previous posts. I've tried every example given, with no success. They all pick up numbers that aren't phone numbers and miss most of the numbers that are. I tried ALL of them, even mixed and matched with tr/0-9//cd; which I think is a bad idea, because (from what I THINK it does) it strips everything but the digits, runs them together in one huge line and makes a single number out of them. I can't do this because there are more numbers on a line than just my phone number.
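For reference, a minimal sketch of what tr/0-9//cd does to a single line. The sample line is made up (not taken from the log files), and digits_only is a hypothetical helper name, not part of any module:

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the text with every non-digit removed,
# which is exactly what tr/0-9//cd does (c = complement the set, d = delete).
sub digits_only {
    my ($text) = @_;
    ( my $copy = $text ) =~ tr/0-9//cd;
    return $copy;
}

# Made-up line in the style of the sample data.
my $line   = 'LO #/Name: 1234 / Example Realty (555) 555-0100';
my $digits = digits_only($line);
print "$digits\n";    # the office id and the phone number are now fused

# A ten-digit pattern applied to the fused string starts at the wrong digits.
if ( $digits =~ /(\d{3})(\d{3})(\d{4})/ ) {
    print "($1) $2-$3\n";
}
```

With the digits fused, the first ten digits are the office id followed by part of the phone number, which is why matching after tr/0-9//cd reports numbers that are not in the file.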

Here is some sample data (I have different log files, but here are two so you can see):

FILE1

Residential MLS #: 2094044 Status: Active-NORMLS LP: $125,204 SP: $
9962 BEVERLY LANE STREETSBORO OH 44241- Unit/Lot #: Area: 1909 Unit Floor #: Map Coordinate: P13A2
Subdivision/Complex: VANTAGE POINT Photos: Media: 6 Acres: 1/2 Yr. Tax : 732 County: Portage Owner/Agent: No Parcel ID# (PIN): TBA
Year Built: 2003 Lot Dimensions: 18X52 School District: 6709/Streetsboro City List Type: ERS Irregular: N High School: MLS Cross Ref #:
Sub Property Type: One Family List Date: 6/20/2003 MT: 253
Directions: CORNER FROST RD & ST RT 43
# Rooms: 4 # Bedrooms: 2 Total Baths: 1.1 Finished SqFt: 1080
LO #/Name: 2380 / Realty One (440) 248-2700 Office Web Site: www.realtyone.com
LA #/Name: 417391 / Mark J. Abbott (440) 975-0537 LA Email: m.abbott@realtyone.com
LA 2 #/Name: / LA 2 Email:
SAC: 0 BAC: 2.5 OAC: None LockBox Desc: Compensation Explain: Fixer Upper: N
Remarks: WILLIAM THOMAS HOMES VANTAGE PT CLUSTER TOWNHOMES! TWO BEDROOMS,ONE & HALF BATHS,FULL BASEMENT! FIREPLACE! KITCHEN & LAUNDRY APPLIANCES! WOOD RAILINGS! COMMON AREA MAINTENANCE! 56 HILLSIDE & PATIO UNITS, TAXES ESTIMATED, EXTRA WINDOWS! 90% EFFIC FURNACE! PRIVACY FENCE! PATIO.FURNISHED MODEL 9941 BEVERLY
Broker Remarks: COMMISSION PAID ON BASE OF $114,900. CALL LISTING AGENT FOR INFORMATION ON TITLE WORK.
--------------------------------------------------------------------------------
Residential MLS #: 2130518 Status: Active-NORMLS LP: $125,500 SP: $
1244 Meadow Run Copley OH 44321- Unit/Lot #: 20 Area: 1820 Unit Floor #: Map Coordinate: S27B3
Subdivision/Complex: Meadows of Copley Photos: Media: 1 Acres: 1/2 Yr. Tax : 9999 County: Summit Owner/Agent: Parcel ID# (PIN): 0
Year Built: 2004 Lot Dimensions: School District: 7703/Copley-Fairlawn City List Type: ERS Irregular: N High School: Copley MLS Cross Ref #:
Sub Property Type: Condominium List Date: 2/17/2004 MT: 11
Directions: Ridgewood Road to Jacoby Rd. to Copley Rd. east to The Meadows
# Rooms: 5 # Bedrooms: 2 Total Baths: 2.1 Finished SqFt:
LO #/Name: 2817 / Smythe, Cramer Co. (330) 836-9300 Office Web Site: www.smythecramer.com
LA #/Name: 302709 / Sheila Eaton (330) 864-5741 LA Email: sheilaeaton45@aol.com
LA 2 #/Name: / LA 2 Email:
SAC: 0 BAC: 2.5 OAC: None LockBox Desc: Compensation Explain: Fixer Upper: N
Remarks: Beautiful new constructionin The Meadows of Copley*1st class amenities*448 sq ft finished lower level family rm*Vaulted ceilings*Fully applianced*Spacious master suite*Bright, open and airy*10x10 patio.
Broker Remarks:
--------------------------------------------------------------------------------

FILE2

Donna I. Stoner, ABR GRI
Bolton-Johnston Associates of Grosse Pointe
Phone 1: (313)884-6400, Email: donnastoner@realtor.com
Buyers, Relocation, Residential, Sellers, Waterfront Property
Add to Scratch Pad Contact me now Go to my site

DONNA L. GORMLEY
Johnstone & Johnstone
Office: (313) 884-0600, Mobile: (313) 590-9253, Email: johnstone@realestateone.com
buyer's agent, Listing agent, residential properties
Add to Scratch Pad Contact me now Go to my site

As you can see, some lines MAY have more than one set of numbers, so I need it to be picky and only select things that are phone numbers. Someone suggested the Number::Phone::US module, but there is no documentation for what I need. It shows how to validate one variable, which I can't get to work, much less how to trim an entire text file down to numbers it'll validate.

This is frustrating me so much because I checked everything I could on Phone Numbers in the super search and nothing helped, they all died in one way or another. Can someone give me a different perspective or show how to use that module? My last attempt was:

#!/usr/bin/perl
use strict;

# change the below line to the file you are reading FROM (your junk file)
my $read_from = "test2.txt";

# Change the below line to where you want your neat phone numbers to be printed
my $save_to = "saved.txt";

my %seen;

open(FILE, '<', "$read_from") or die "Unable to open file.txt for reading, $!";
while (<FILE>) {
    #s /[\n|\r]//g;
    tr/0-9//cd;
    #print "Testing with $_, result is ";
    m/(1[-| ]?)?\(?(\d{3})\)?[-| ]?(\d{3})[-| ]?(\d{4})/;
    #m|(1-)?\(?(\d{3})\)?-?(\d{3})-(\d{4})|;
    my $areacode = $2;
    my $exchange = $3;
    my $line = $4;
    print "($areacode) $exchange-$line\n";
    $seen{"$areacode-$exchange-$line"}++;
}
close(FILE);

open(SAVED, '>', "$save_to") or die "Unable to open $!";
print SAVED "$_\n" for (sort keys %seen);
close(SAVED);

Edited by Chady -- formatting and readmore tags.

Re: phone number parsing refuses to work
by Happy-the-monk (Monsignor) on Mar 13, 2004 at 23:11 UTC

    If I observed correctly, you are looking for numbers of these formats:

    1. (123) 234-3456
    2. (123)234-3456

    m/
      (          # start capture
        \(       # open parenthesis
        \d{3}    # 3 digits
        \)       # close parenthesis
        \s?      # 0 or 1 whitespace
        \d{3}    # 3 digits
        \-       # 1 dash
        \d{4}    # 4 digits
      )          # end capture
    /xg;
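    A rough sketch of how a pattern like this can be applied: in list context, a match with /g returns every captured group, so one line can yield several numbers. The helper name find_paren_numbers and the sample text are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return every "(ddd) ddd-dddd" or "(ddd)ddd-dddd" found.
# Call it in list context so that m//g returns all of the captures.
sub find_paren_numbers {
    my ($text) = @_;
    return $text =~ m/
        (
            \( \d{3} \)   # area code in parentheses
            \s?           # optional single whitespace
            \d{3}         # exchange
            -             # dash
            \d{4}         # line number
        )
    /xg;
}

# Made-up sample text with three phone numbers and one unrelated number.
my $sample = 'Office: (555) 555-0101, Mobile: (555) 555-0102, '
           . 'Phone 1: (555)555-0103, MLS #: 1234567';
my @found = find_paren_numbers($sample);
print "$_\n" for @found;
```

    The seven-digit MLS number is skipped because it has no parenthesized area code.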

    Sören

Re: phone number parsing refuses to work
by Anonymous Monk on Mar 13, 2004 at 23:19 UTC
    The link disappeared, the module I want to try to use is Number/Phone/US.pm but there's no documentation for what I need to do. I need to parse an entire junk file and take all valid numbers OUT of it.
      use strict;
      use Number::Phone::US qw(is_valid_number);

      my $data = <<'EOF';
      all your data goes here
      EOF

      my @results = ( $data =~ m/
        (          # start capture
          \(       # open parenthesis
          \d{3}    # 3 digits
          \)       # close parenthesis
          \s?      # 0 or 1 whitespace
          \d{3}    # 3 digits
          \-       # 1 dash
          \d{4}    # 4 digits
        )          # end capture
      /xg );

      foreach ( @results ) {
          print "valid: $_\n" if is_valid_number( $_ );
      }

      Sören

Re: phone number parsing refuses to work
by BrowserUk (Pope) on Mar 14, 2004 at 00:32 UTC

    Pasting your sample data into the following 1-liner (wrapped for posting only) produced the following output. (Season to taste:)

    perl -0777pe " s[[^0-9() -]+][\n]g; s[\s{3,}][\n]g; s[---][]g; for my$n(1..4){ s[\n.{1,7}\n][\n]msg;} print" -
    [PASTE SNIPPED]
    ^Z
    (440) 248-2700
    (440) 975-0537
    (330) 836-9300
    (330) 864-5741
    (313)884-6400
    (313) 884-0600
    (313) 590-9253
    (440) 248-2700
    (440) 975-0537
    (330) 836-9300
    (330) 864-5741
    (313)884-6400
    (313) 884-0600
    (313) 590-9253

    If your data file is very large you would need to drop the slurp (-0777) which would mean that as-is, the filtering wouldn't be as effective. But the principle of first throwing away as much as possible (safely--replacing with spaces or newlines so that you don't run good data together) is a useful first pass at extracting small pieces from large volumes.
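    A minimal sketch of the same principle applied one line at a time, for files too large to slurp. This is not the one-liner above: the helper name candidate_chunks, the made-up sample lines, and the "at least ten digits" cutoff are all assumptions chosen for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: replace every run of characters that cannot appear in a
# phone number with a newline (never with nothing, so separate numbers are not
# run together), then keep only the pieces that contain at least ten digits.
sub candidate_chunks {
    my ($line) = @_;
    ( my $copy = $line ) =~ s/[^0-9() -]+/\n/g;
    my @chunks;
    for my $chunk ( split /\n/, $copy ) {
        $chunk =~ s/^\s+|\s+$//g;                               # trim whitespace
        push @chunks, $chunk if ( $chunk =~ tr/0-9// ) >= 10;   # tr counts digits here
    }
    return @chunks;
}

# Made-up lines in the style of the sample data; a real script would loop
# over a filehandle instead of a list.
my @lines = (
    'MLS #: 1234567 Status: Active LP: $125,000',
    'LO #/Name: 1234 / Example Realty (555) 555-0100 Office Web Site: www.example.com',
);
for my $line (@lines) {
    print "$_\n" for candidate_chunks($line);
}
```

    Because each line is handled on its own, a number that is split across a line break would be missed, which is the loss of effectiveness mentioned above.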


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
Re: phone number parsing refuses to work
by graff (Chancellor) on Mar 14, 2004 at 00:33 UTC
    Those input files are pretty noisy. If all you need to do is extract and print the phone numbers -- that is, if you don't need to associate each phone number with some name and/or address that's next to it in the data -- then it would help to pre-condition the text so as to eliminate all the stuff you know you don't need, and isolate the potential phone numbers to make them easier to pick out.

    Perhaps you can take it for granted that a phone number will never be broken up by a line break (a single line contains one or more complete phone numbers, or contains no relevant data at all). You could also take for granted that all phone numbers use a limited set of punctuation patterns. Here is one possible way to handle the preconditioning:

    while (<>)  # read one line at a time
    {
        s/[a-z;:\@]+//gi;        # these aren't used for numbers
        s/(?<=\d\)) (?=\d)//g;   # remove space in "\d) \d"

        # split the line on whitespace (that's why we got rid of
        # any spaces that might be within a given phone number);
        # for each thing coming out of the split, print it if it
        # looks like a phone number:
        for my $num ( split /\s+/ ) {
            next unless ( $num =~ /\D*(\d{3})\D(\d{3})-(\d{4})\D*/ );
            print "$1-$2-$3\n";
        }
    }
    That won't be much use if you do have to preserve information about each phone number along with the number itself -- given the nature of the data, that's a slightly more tricky problem. (But not too tricky... your data is messy, but there are patterns in it that can be used to guide a more intelligent form of data extraction; you use the same sort of approach -- skip or remove things that are not relevant, and use simple patterns to isolate the things that are relevant.)
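    As a rough illustration of that "more intelligent" extraction, here is a sketch that keeps the name next to each number by anchoring on the "LO #/Name:" and "LA #/Name:" labels. The helper name listing_contacts, the returned hash keys, and the sample text are made up, and the pattern assumes the label, id, name and phone always appear in that order:

```perl
use strict;
use warnings;

# Hypothetical helper: pull (role, id, name, phone) records out of MLS-style text.
sub listing_contacts {
    my ($text) = @_;
    my @contacts;
    while ( $text =~ m{
            \b (LO|LA) \s+ \#/Name: \s*   # "LO #/Name:" or "LA #/Name:" label
            (\d+) \s* / \s*               # numeric id, then the slash
            ([^()]+?) \s*                 # name: anything up to the opening paren
            \( (\d{3}) \) \s? (\d{3}) - (\d{4})
        }xg )
    {
        push @contacts, { role => $1, id => $2, name => $3, phone => "$4-$5-$6" };
    }
    return @contacts;
}

# Made-up text in the style of the sample data.
my $text = 'LO #/Name: 1234 / Example Realty (555) 555-0100 Office Web Site: www.example.com '
         . 'LA #/Name: 567890 / Pat Q. Sample (555) 555-0101 LA Email: pat@example.com '
         . 'LA 2 #/Name: / LA 2 Email:';
for my $c ( listing_contacts($text) ) {
    print "$c->{role} $c->{id}: $c->{name} => $c->{phone}\n";
}
```

    The empty "LA 2 #/Name:" entry is skipped because the "2" breaks the label match. A record that has a name but no phone number could still confuse this pattern, so treat it as a starting point only.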
Re: phone number parsing refuses to work
by etcshadow (Priest) on Mar 14, 2004 at 01:57 UTC
    Here ya go:
    push(@list, "($1) $2-$3") while /\(?(\d{3})?\)?\s*[-.]*\s*(\d{3})\s*[-.]*\s*(\d{4})/g;
    I pasted your big block of text into it and got this out:
    () 209-4044
    (440) 248-2700
    (440) 975-0537
    () 213-0518
    (330) 836-9300
    (330) 864-5741
    (313) 884-6400
    (313) 884-0600
    (313) 590-9253
    Good enough?

    Update: I should explain a little... The regexp breaks down into, basically: optional parens around optional 3 digits, minimal separating junk, 3 digits, minimal separating junk, 4 digits. It may look ugly, but it's actually quite straightforward, although you could obviously tinker with it a little if you wanted. Oh... and the while ... /g means to count it each time it appears on a line (so if multiple phone #'s are on one line, it'll count each one).

    For example, I'll turn the "separating garbage" chunk into just [-.\s]*, which is more permissive as well as shorter to write out. Still gets the same results on your sample data.

    [me@host]$ perl -ne 'push(@list, "($1) $2-$3") while /\(?(\d{3})?\)?[-.\s]*(\d{3})[-.\s]*(\d{4})/g; END{print join("\n",@list)."\n";}' data.txt
    () 209-4044
    (440) 248-2700
    (440) 975-0537
    () 213-0518
    (330) 836-9300
    (330) 864-5741
    (313) 884-6400
    (313) 884-0600
    (313) 590-9253
    [me@host]$
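    The two "()" entries in that output (209-4044 and 213-0518) are the seven-digit MLS # values from the sample data, picked up because the area code is optional. One possible tweak, sketched here as an assumption rather than as part of the original answer, is to make the area code mandatory and refuse to start or stop in the middle of a longer run of digits. The helper name full_numbers and the sample text are made up:

```perl
use strict;
use warnings;

# Hypothetical helper: same idea as the one-liner, but the three-digit area
# code is required, and the lookarounds keep a match from beginning or ending
# inside a longer digit string.
sub full_numbers {
    my ($text) = @_;
    my @list;
    push @list, "($1) $2-$3"
        while $text =~ /(?<!\d)\(?(\d{3})\)?[-.\s]*(\d{3})[-.\s]*(\d{4})(?!\d)/g;
    return @list;
}

# Made-up text in the style of the sample data.
my $text = 'MLS #: 1234567 Status: Active LO #/Name: 1234 / Example Realty '
         . '(555) 555-0100 Phone 1: (555)555-0101';
print "$_\n" for full_numbers($text);
```

    The trade-off is that genuine seven-digit local numbers would no longer be reported.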
    ------------ :Wq Not an editor command: Wq
Re: phone number parsing refuses to work
by converter (Priest) on Mar 14, 2004 at 15:54 UTC

    This may be "off topic", but are these real data you've posted here? If so, bad form. I don't think the folks whose information appears here would appreciate it at all. I never allow customers' data to escape my network if they include information about specific companies or people, even if it's information that could be found in any phone book. In the future, you should take a few minutes to create dummy data for any examples you want to include in your posts.

Re: phone number parsing refuses to work
by mojotoad (Monsignor) on Mar 23, 2004 at 00:35 UTC
    It won't necessarily help much with extracting phone numbers from surrounding prose, but the following may give you some ideas for parsing the numbers once you think you have one:

    Beast of the Number: Parsing the Feral Phone

    Matt
