Need a better way to count input lines

Theo has asked for the wisdom of the Perl Monks concerning the following question:

This is the second Perl script I've done from scratch and I'm havin' a ball! I'd like some constructive criticism over all and in two specific areas.

The objective is to convert an MS Word document of the company phone list to an MS Excel spreadsheet for sorting and conversion to several web pages (of differing sort orders).

The Word file is saved to a text file where each person's info is spread over 3 lines. The script pulls everything into one line per person with tabs separating each item. Having been hand entered, the data is somewhat "dirty". Most of the regexp work is cleaning linefeeds, carriage returns and the random non word characters around the valid data.

Question 1: Is there a more perlish way to count the three lines per person? (What I'm doing looks so much like C.)

Question 2: If you look at the input data for "Bo Jo Le Much" you'll see there is no space after the comma separator and there is a CR (^M) after the first name in the output. If I insert a space after the comma in the input data, the CR in the output goes away. Why?

#!/opt/bin/perl
#perl 5.8.3 built for sun4-solaris-64int-ld
 
use warnings;
use strict;
 
my $T = "\t";
my $fname = "U1";
my $lname = "U2";
my $phone = "U3";
my $room  = "U4";
my $bldg  = "U5";
my $email = "U6";
my $temp  = "Ut";
my $count = 0;
 
print "\nLast   First   Phone   Bldg    Room    email\n";
 
 open FH => "<testdata.txt" or die "can't find the data file: $!\n";
#open FH => "<phonedata.txt" or die "can't find the data file: $!\n";
 
while (<FH>) {
   chomp;
   $count++;
   if ($count == 1){
        ($lname, $fname) = split(",");
        $fname =~ s/^\s+(.+?)\W*$/$1/;
   }elsif ($count == 2){
        ($phone, $temp) = split("/");           #line 28
        if ( ($temp =~ m[^\d.*]) || ($temp =~ m[--- --]) ){
            ($room, $bldg) = split(" ", $temp);
        } else {
            $bldg = $temp;
            $bldg =~ s/^(.*?)\W*$/$1/;      #$bldg includes a 'cr'
            $room = "";             #no room number, just bldg name
        }
   }elsif ($count == 3){
        $email = $_;
        $email =~ s/^(.*?)\W*$/$1/;
        $count = 0;
        pA();
 
   }else{
        print "Got an overflow - $count $_\n";
   }
}
print "\n" if ($count == 0);
close FH;

sub pA{print "$lname$T$fname$T$phone$T$bldg$T$room$T$email\n";}
-------------------------------------------------------
__Input__
Alanon, Bart
5590/EL
---- 
O'Lewis, John.
----/--- --
john 
Le Much,Bo Jo
3406/165 NS 
ed@a.nl 
Abe-Jen, Mar-Jo
3421/164D NS
cbest

__Output__
Last    First   Phone   Bldg    Room    email
Alanon  Bart    5590    EL
O'Lewis John    ----    --      ---     john
Le Much Bo Jo^M 3406    NS      165     ed@a.nl
Abe-Jen Mar-Jo  3421    NS      164D    cbest
[download]

-Theo-
(so many nodes and so little time ... )

Comment on Need a better way to count input lines Download Code

Replies are listed 'Best First'.
Re: Need a better way to count input lines by japhy (Canon) on May 07, 2004 at 14:51 UTC
You could do: `# while (not eof FILE) { ... } # or... until (eof FILE) { my (@lines) = map scalar <FILE>, 1 .. 3; # work with $lines[0], $lines[1], $lines[2] }` [download] You need the `scalar` there, because you only want to get one line at a time. _____________________________________________________ Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area) `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply] [d/l]
Re: Re: Need a better way to count input lines by Theo (Priest) on May 07, 2004 at 20:23 UTC
I've never used map before. It looks like @lines ends up with three elements because of the "1 .. 3" range. .oO(Verrry useful) Would the loop work the same if you used `until (eof FILE)` instead of the while? Update: Thanks for the update! -Theo- (so many nodes and so little time ... )	[reply] [d/l]
Re: Need a better way to count input lines by BrowserUk (Patriarch) on May 07, 2004 at 17:18 UTC
I'd probably do it this way. `#! perl -slw use strict; until( eof( DATA ) ) { no warnings 'uninitialized'; printf '%-7.7s %-7.7s ', <DATA> =~ m[ ([^,]+) , \s* (.+) $]x; printf '%7.7s %7.7s %7.7s ', <DATA> =~ m[ ([^/]+) / (?:(\S+)\s+)? +(\S+) $]x; printf "%s\n", <DATA> =~ m[^(\S+)]; } __DATA__ Alanon, Bart 5590/EL ---- O'Lewis, John. ----/--- -- john Le Much,Bo Jo 3406/165 NS ed@a.nl Abe-Jen, Mar-Jo 3421/164D NS cbest` [download] Output `P:\test>351465 Alanon Bart 5590 EL ---- O'Lewis John. ---- --- -- john Le Much Bo Jo 3406 165 NS ed@a.nl Abe-Jen Mar-Jo 3421 164D NS cbest` [download] Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail	[reply] [d/l] [select]
Re: Re: Need a better way to count input lines by Theo (Priest) on May 07, 2004 at 19:53 UTC
That looks a lot more perlish!. I changed it slightly to use the same data in an external file & I get an error for each line of the file that has a "/". (Of course, to get those errors I uncommented the 'no warnings' line) Last First Phone Bldg Room email Use of uninitialized value in printf at PL.pl line 11, <FH> line 2. Use of uninitialized value in printf at PL.pl line 11, <FH> line 2. Use of uninitialized value in printf at PL.pl line 11, <FH> line 2. ---- Use of uninitialized value in printf at PL.pl line 11, <FH> line 5. Use of uninitialized value in printf at PL.pl line 11, <FH> line 5. Use of uninitialized value in printf at PL.pl line 11, <FH> line 5. john Use of uninitialized value in printf at PL.pl line 11, <FH> line 8. Use of uninitialized value in printf at PL.pl line 11, <FH> line 8. Use of uninitialized value in printf at PL.pl line 11, <FH> line 8. ed@a.nl Use of uninitialized value in printf at PL.pl line 11, <FH> line 11. Use of uninitialized value in printf at PL.pl line 11, <FH> line 11. Use of uninitialized value in printf at PL.pl line 11, <FH> line 11. cbest [download] If I redirect the output to a file, it looks like this: `Last First Phone Bldg Room email Alanon Bart^M ---- O'Lewis John.^M john Le Much Bo Jo^M ed@a.nl Abe-Jen Mar-Jo^M cbest` [download] The file, as I ran it is: `#!/opt/bin/perl -slw use strict; print "\nLast First Phone Bldg Room email\n"; open FH => "<testdata.txt" or die "can't find the data file: $!\n"; until( eof( FH ) ) { # no warnings 'uninitialized'; printf '%-7.7s %-7.7s ', <FH> =~ m[ ([^,]+) , \s* (.+) $]x +; printf '%7.7s %7.7s %7.7s ', <FH> =~ m[ ([^/]+) / (?:(\S+)\s+) +? (\S+) $]x; printf "%s\n", <FH> =~ m[^(\S+)]; }` [download] I definitely don't understand what's happening in the second half of the middle regexp. -Theo- (so many nodes and so little time ... )	[reply] [d/l] [select]
Re: Re: Re: Need a better way to count input lines by BrowserUk (Patriarch) on May 07, 2004 at 20:57 UTC
Weird! The ^Ms that are getting left behind on the end of the first names suggests that the contents of the data file is weird, though given the source that's no real surprise. The reason for having the `no warnings 'uninitialized'` is to avoid the need for a special case to deal with location lines that don't contain the room number. The regex is saying: `m[ ([^/]+) / # capture the phone number before the slash (?: #optionally capture the room number if it is exists (\S+) # capture all the non-spaces between the / \s+ #and one or more spaces )? # but only if there are two sets of non-spaces # separated by a one or more spaces before EOL (\S+) #capture the building code. $ ]x;` [download] If the room number isn't present, the second capture ($2) will be undefined, hence the need to suppress the warning. However, that you are getting three warnings means that all 3 captures are undefined (ie. the regex failed to match), which suggests that the data in the file is formatted somewhat differently to the sample data you posted. Without being able to see the actual contents of the file it is a little difficult to diagnose the problem. Perhaps you could run this one liner on the data file to dump the first few lines in hex and post the output here? `perl -nle" exit if $. == 15; print unpack 'H*', $_" testdata.txt` [download] Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail	[reply] [d/l] [select]
Re: Re: Re: Re: Need a better way to count input lines by Theo (Priest) on May 07, 2004 at 22:19 UTC
Re: Re: Re: Re: Re: Need a better way to count input lines by BrowserUk (Patriarch) on May 07, 2004 at 22:22 UTC
Some notes below your chosen depth have not been shown here
Re: Need a better way to count input lines by hv (Prior) on May 07, 2004 at 16:15 UTC
For Bo Jo Le Much, the problem occurs when you try to clean up `$fname`: `($lname, $fname) = split(","); $fname =~ s/^\s+(.+?)\W$/$1/;` [download] The `/\s+/`fragment requires one or more whitespace characters at the start, and when that isn't present the substitution fails. The cleanest fix would probably be to absorb optional whitespace after the comma within the split, so that you don't need to look for it afterwards: `($lname, $fname) = split /,\s/, $_, 2; $fname =~ s/\W+$//;` [download] Note that I've extended the split command slightly: specifying a LIMIT of 2 ensures that trailing text will not be thrown away if for some reason there is more than one comma in the name. It also allows the split to execute slightly faster, since it doesn't need to scan for more commas after the first is found. Also, removing the need to check for whitespace in the substitution means that the only thing it is still doing is stripping non-word characters from the end, allowing a somewhat simpler pattern. Hugo	[reply] [d/l] [select]
Re: Re: Need a better way to count input lines by Theo (Priest) on May 07, 2004 at 17:38 UTC
Ahh! The lights come on! (Then they go dim again) I changed the `$fname =~ s/^\s+(.+?)\W$/$1/;` [download] to `$fname =~ s/^\s(.+?)\W$/$1/;` [download] and the problem went away. (Of course it did. I should have seen that.) I can see why with the '+' it didn't work right, but why would the failure manifest as a CR at the end of $fname? But I am perplexed about the split itself. I replaced the `(",")` with `(",\s")` and got the error: `Unrecognized escape \s passed through at phonelist.pl line 28.` [download] The paren/quote format worked fine with just a comma but broke with the addition of the space char class. Did I just get lucky the first way? As to the LIMIT option, my Llama book doesn't explain it. Does the 2 relate the expectation of having two char within the match? -Theo- (so many nodes and so little time ... )	[reply] [d/l] [select]
Re: Re: Re: Need a better way to count input lines by hv (Prior) on May 07, 2004 at 20:09 UTC
... but why would the failure manifest as a CR at the end of $fname? I suspect the text, created under Windows, has CRLF as the line terminator; running the script under Solaris the `chomp` is removing the LF but not the CR, so the CR was getting removed instead by the `$fname` substitution - but only when the substitution actually happened. As for `split`, please take a look through `perldoc -f split`. First, note that the first parameter is shown as `/PATTERN/`; if you supply this parameter as a pattern (eg `/,\s/`) it gets parsed as a pattern and does the right thing. If you supply a quoted string instead (eg `",\s"`) perl will treat that first as a quoted string - in which case `"\s"` gives the warning, and gets converted to `"s"` - and the resulting string is then converted to a regexp further down the line. You'll also find there far more than you ever wanted to know about the LIMIT argument. Hugo	[reply] [d/l] [select]

Back to Seekers of Perl Wisdom