http://www.perlmonks.org?node_id=351465

Theo has asked for the wisdom of the Perl Monks concerning the following question:

This is the second Perl script I've done from scratch and I'm havin' a ball! I'd like some constructive criticism over all and in two specific areas.

The objective is to convert an MS Word document of the company phone list to an MS Excel spreadsheet for sorting and conversion to several web pages (of differing sort orders).

The Word file is saved to a text file where each person's info is spread over 3 lines. The script pulls everything into one line per person with tabs separating each item. Having been hand entered, the data is somewhat "dirty". Most of the regexp work is cleaning linefeeds, carriage returns and the random non word characters around the valid data.

Question 1: Is there a more perlish way to count the three lines per person? (What I'm doing looks so much like C.)

Question 2: If you look at the input data for "Bo Jo Le Much" you'll see there is no space after the comma separator and there is a CR (^M) after the first name in the output. If I insert a space after the comma in the input data, the CR in the output goes away. Why?

#!/opt/bin/perl #perl 5.8.3 built for sun4-solaris-64int-ld use warnings; use strict; my $T = "\t"; my $fname = "U1"; my $lname = "U2"; my $phone = "U3"; my $room = "U4"; my $bldg = "U5"; my $email = "U6"; my $temp = "Ut"; my $count = 0; print "\nLast First Phone Bldg Room email\n"; open FH => "<testdata.txt" or die "can't find the data file: $!\n"; #open FH => "<phonedata.txt" or die "can't find the data file: $!\n"; while (<FH>) { chomp; $count++; if ($count == 1){ ($lname, $fname) = split(","); $fname =~ s/^\s+(.+?)\W*$/$1/; }elsif ($count == 2){ ($phone, $temp) = split("/"); #line 28 if ( ($temp =~ m[^\d.*]) || ($temp =~ m[--- --]) ){ ($room, $bldg) = split(" ", $temp); } else { $bldg = $temp; $bldg =~ s/^(.*?)\W*$/$1/; #$bldg includes a 'cr' $room = ""; #no room number, just bldg name } }elsif ($count == 3){ $email = $_; $email =~ s/^(.*?)\W*$/$1/; $count = 0; pA(); }else{ print "Got an overflow - $count $_\n"; } } print "\n" if ($count == 0); close FH; sub pA{print "$lname$T$fname$T$phone$T$bldg$T$room$T$email\n";} ------------------------------------------------------- __Input__ Alanon, Bart 5590/EL ---- O'Lewis, John. ----/--- -- john Le Much,Bo Jo 3406/165 NS ed@a.nl Abe-Jen, Mar-Jo 3421/164D NS cbest __Output__ Last First Phone Bldg Room email Alanon Bart 5590 EL O'Lewis John ---- -- --- john Le Much Bo Jo^M 3406 NS 165 ed@a.nl Abe-Jen Mar-Jo 3421 NS 164D cbest

-Theo-
(so many nodes and so little time ... )

Replies are listed 'Best First'.
Re: Need a better way to count input lines
by japhy (Canon) on May 07, 2004 at 14:51 UTC
    You could do:
    # while (not eof FILE) { ... } # or... until (eof FILE) { my (@lines) = map scalar <FILE>, 1 .. 3; # work with $lines[0], $lines[1], $lines[2] }
    You need the scalar there, because you only want to get one line at a time.
    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
      I've never used map before. It looks like @lines ends up with three elements because of the "1 .. 3" range. .oO(Verrry useful)
      Would the loop work the same if you used until (eof FILE) instead of the while?

      Update: Thanks for the update!

      -Theo-
      (so many nodes and so little time ... )

Re: Need a better way to count input lines
by BrowserUk (Patriarch) on May 07, 2004 at 17:18 UTC

    I'd probably do it this way.

    #! perl -slw use strict; until( eof( DATA ) ) { no warnings 'uninitialized'; printf '%-7.7s %-7.7s ', <DATA> =~ m[ ([^,]+) , \s* (.+) $]x; printf '%7.7s %7.7s %7.7s ', <DATA> =~ m[ ([^/]+) / (?:(\S+)\s+)? +(\S+) $]x; printf "%s\n", <DATA> =~ m[^(\S+)]; } __DATA__ Alanon, Bart 5590/EL ---- O'Lewis, John. ----/--- -- john Le Much,Bo Jo 3406/165 NS ed@a.nl Abe-Jen, Mar-Jo 3421/164D NS cbest

    Output

    P:\test>351465 Alanon Bart 5590 EL ---- O'Lewis John. ---- --- -- john Le Much Bo Jo 3406 165 NS ed@a.nl Abe-Jen Mar-Jo 3421 164D NS cbest

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
      That looks a lot more perlish!. I changed it slightly to use the same data in an external file & I get an error for each line of the file that has a "/". (Of course, to get those errors I uncommented the 'no warnings' line)
      Last First Phone Bldg Room email Use of uninitialized value in printf at PL.pl line 11, <FH> line 2. Use of uninitialized value in printf at PL.pl line 11, <FH> line 2. Use of uninitialized value in printf at PL.pl line 11, <FH> line 2. ---- Use of uninitialized value in printf at PL.pl line 11, <FH> line 5. Use of uninitialized value in printf at PL.pl line 11, <FH> line 5. Use of uninitialized value in printf at PL.pl line 11, <FH> line 5. john Use of uninitialized value in printf at PL.pl line 11, <FH> line 8. Use of uninitialized value in printf at PL.pl line 11, <FH> line 8. Use of uninitialized value in printf at PL.pl line 11, <FH> line 8. ed@a.nl Use of uninitialized value in printf at PL.pl line 11, <FH> line 11. Use of uninitialized value in printf at PL.pl line 11, <FH> line 11. Use of uninitialized value in printf at PL.pl line 11, <FH> line 11. cbest
      If I redirect the output to a file, it looks like this:
      Last First Phone Bldg Room email Alanon Bart^M ---- O'Lewis John.^M john Le Much Bo Jo^M ed@a.nl Abe-Jen Mar-Jo^M cbest
      The file, as I ran it is:
      #!/opt/bin/perl -slw use strict; print "\nLast First Phone Bldg Room email\n"; open FH => "<testdata.txt" or die "can't find the data file: $!\n"; until( eof( FH ) ) { # no warnings 'uninitialized'; printf '%-7.7s %-7.7s ', <FH> =~ m[ ([^,]+) , \s* (.+) $]x +; printf '%7.7s %7.7s %7.7s ', <FH> =~ m[ ([^/]+) / (?:(\S+)\s+) +? (\S+) $]x; printf "%s\n", <FH> =~ m[^(\S+)]; }
      I definitely don't understand what's happening in the second half of the middle regexp.

      -Theo-
      (so many nodes and so little time ... )

        Weird! The ^Ms that are getting left behind on the end of the first names suggests that the contents of the data file is weird, though given the source that's no real surprise.

        The reason for having the no warnings 'uninitialized' is to avoid the need for a special case to deal with location lines that don't contain the room number. The regex is saying:

        m[ ([^/]+) / # capture the phone number before the slash (?: #optionally capture the room number if it is exists (\S+) # capture all the non-spaces between the / \s+ #and one or more spaces )? # but only if there are two sets of non-spaces # separated by a one or more spaces before EOL (\S+) #capture the building code. $ ]x;

        If the room number isn't present, the second capture ($2) will be undefined, hence the need to suppress the warning. However, that you are getting three warnings means that all 3 captures are undefined (ie. the regex failed to match), which suggests that the data in the file is formatted somewhat differently to the sample data you posted.

        Without being able to see the actual contents of the file it is a little difficult to diagnose the problem.

        Perhaps you could run this one liner on the data file to dump the first few lines in hex and post the output here?

        perl -nle" exit if $. == 15; print unpack 'H*', $_" testdata.txt

        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
Re: Need a better way to count input lines
by hv (Prior) on May 07, 2004 at 16:15 UTC

    For Bo Jo Le Much, the problem occurs when you try to clean up $fname:

    ($lname, $fname) = split(","); $fname =~ s/^\s+(.+?)\W*$/$1/;

    The /\s+/fragment requires one or more whitespace characters at the start, and when that isn't present the substitution fails.

    The cleanest fix would probably be to absorb optional whitespace after the comma within the split, so that you don't need to look for it afterwards:

    ($lname, $fname) = split /,\s*/, $_, 2; $fname =~ s/\W+$//;

    Note that I've extended the split command slightly: specifying a LIMIT of 2 ensures that trailing text will not be thrown away if for some reason there is more than one comma in the name. It also allows the split to execute slightly faster, since it doesn't need to scan for more commas after the first is found.

    Also, removing the need to check for whitespace in the substitution means that the only thing it is still doing is stripping non-word characters from the end, allowing a somewhat simpler pattern.

    Hugo

      Ahh! The lights come on! (Then they go dim again) I changed the
      $fname =~ s/^\s+(.+?)\W*$/$1/;
      to
      $fname =~ s/^\s*(.+?)\W*$/$1/;
      and the problem went away. (Of course it did. I should have seen that.) I can see why with the '+' it didn't work right, but why would the failure manifest as a CR at the end of $fname?
      But I am perplexed about the split itself. I replaced the (",") with (",\s*") and got the error:
      Unrecognized escape \s passed through at phonelist.pl line 28.
      The paren/quote format worked fine with just a comma but broke with the addition of the space char class. Did I just get lucky the first way?

      As to the LIMIT option, my Llama book doesn't explain it. Does the 2 relate the expectation of having two char within the match?

      -Theo-
      (so many nodes and so little time ... )

        ... but why would the failure manifest as a CR at the end of $fname?

        I suspect the text, created under Windows, has CRLF as the line terminator; running the script under Solaris the chomp is removing the LF but not the CR, so the CR was getting removed instead by the $fname substitution - but only when the substitution actually happened.

        As for split, please take a look through perldoc -f split. First, note that the first parameter is shown as /PATTERN/; if you supply this parameter as a pattern (eg /,\s*/) it gets parsed as a pattern and does the right thing. If you supply a quoted string instead (eg ",\s*") perl will treat that first as a quoted string - in which case "\s" gives the warning, and gets converted to "s" - and the resulting string is then converted to a regexp further down the line.

        You'll also find there far more than you ever wanted to know about the LIMIT argument.

        Hugo