comment on

Those input files are pretty noisy. If all you need to do is extract and print the phone numbers -- that is, if you don't need to associate each phone number with some name and/or address that's next to it in the data -- then it would help to pre-condition the text so as to eliminate all the stuff you know you don't need, and isolate the potential phone numbers to make them easier to pick out.

Perhaps you can take it for granted that a phone number will never be broken up by a line break (a single line contains one or more complete phone numbers, or contains no relevant data at all). You could also take for granted that all phone numbers use a limited set of punctuation patterns. Here is one possible way to handle the preconditioning:

while (<>)  # read one line at a time
{
    s/[a-z;:\@]+//gi;  # these aren't used for numbers
    s/(?<=\d\)) (?=\d)//g;   # remove space in "\d) \d"

# split the line on whitespace (that's why we got rid of
# any spaces that might be within a given phone number);
# for each thing coming out of the split, print it if it
# looks like a phone number:

    for my $num ( split /\s+/ ) {
        next unless ( $num =~ /\D*(\d{3})\D(\d{3})-(\d{4})\D*/ );
        print "$1-$2-$3\n";
    }
}
[download]

That won't be much use if you do have to preserve information about each phone number along with the number itself -- given the nature of the data, that's a slightly more tricky problem. (But not too tricky... your data is messy, but there are patterns in it that can be used to guide a more intelligent form of data extraction; you use the same sort of approach -- skip or remove things that are not relevant, and use simple patterns to isolate the things that are relevant.)

In reply to Re: phone number parsing refuses to work by graff
in thread phone number parsing refuses to work by Anonymous Monk

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Your skill will accomplish what the force of many cannot
	PerlMonks