Theo has asked for the wisdom of the Perl Monks concerning the following question:
This is the second Perl script I've done from scratch and I'm havin' a ball!
I'd like some constructive criticism over all and in two specific areas.
The objective is to convert an MS Word document of the company phone list to
an MS Excel spreadsheet for sorting and conversion to several web pages
(of differing sort orders).
The Word file is saved to a text file where each
person's info is spread over 3 lines.
The script pulls everything into one line per person with tabs separating
each item. Having been hand entered, the data is somewhat "dirty". Most of
the regexp work is cleaning linefeeds, carriage returns and the random non word characters around the valid data.
Question 1: Is there a more perlish way to count the three lines per person?
(What I'm doing looks so much like C.)
Question 2: If you look at the input data for "Bo Jo Le Much" you'll see there
is no space after the comma separator and there is a CR (^M) after the first
name in the output. If I insert a space after the comma in the input data, the CR in the output goes away. Why?
#!/opt/bin/perl
#perl 5.8.3 built for sun4-solaris-64int-ld
use warnings;
use strict;
my $T = "\t";
my $fname = "U1";
my $lname = "U2";
my $phone = "U3";
my $room = "U4";
my $bldg = "U5";
my $email = "U6";
my $temp = "Ut";
my $count = 0;
print "\nLast First Phone Bldg Room email\n";
open FH => "<testdata.txt" or die "can't find the data file: $!\n";
#open FH => "<phonedata.txt" or die "can't find the data file: $!\n";
while (<FH>) {
chomp;
$count++;
if ($count == 1){
($lname, $fname) = split(",");
$fname =~ s/^\s+(.+?)\W*$/$1/;
}elsif ($count == 2){
($phone, $temp) = split("/"); #line 28
if ( ($temp =~ m[^\d.*]) || ($temp =~ m[--- --]) ){
($room, $bldg) = split(" ", $temp);
} else {
$bldg = $temp;
$bldg =~ s/^(.*?)\W*$/$1/; #$bldg includes a 'cr'
$room = ""; #no room number, just bldg name
}
}elsif ($count == 3){
$email = $_;
$email =~ s/^(.*?)\W*$/$1/;
$count = 0;
pA();
}else{
print "Got an overflow - $count $_\n";
}
}
print "\n" if ($count == 0);
close FH;
sub pA{print "$lname$T$fname$T$phone$T$bldg$T$room$T$email\n";}
-------------------------------------------------------
__Input__
Alanon, Bart
5590/EL
----
O'Lewis, John.
----/--- --
john
Le Much,Bo Jo
3406/165 NS
ed@a.nl
Abe-Jen, Mar-Jo
3421/164D NS
cbest
__Output__
Last First Phone Bldg Room email
Alanon Bart 5590 EL
O'Lewis John ---- -- --- john
Le Much Bo Jo^M 3406 NS 165 ed@a.nl
Abe-Jen Mar-Jo 3421 NS 164D cbest
-Theo-
(so many nodes and so little time ... )
Re: Need a better way to count input lines
by japhy (Canon) on May 07, 2004 at 14:51 UTC
|
# while (not eof FILE) { ... }
# or...
until (eof FILE) {
my (@lines) = map scalar <FILE>, 1 .. 3;
# work with $lines[0], $lines[1], $lines[2]
}
You need the scalar there, because you only want to get one line at a time.
_____________________________________________________
Jeff [japhy]Pinyan:
Perl,
regex,
and perl
hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
| [reply] [d/l] |
|
| [reply] [d/l] |
Re: Need a better way to count input lines
by BrowserUk (Patriarch) on May 07, 2004 at 17:18 UTC
|
#! perl -slw
use strict;
until( eof( DATA ) ) {
no warnings 'uninitialized';
printf '%-7.7s %-7.7s ', <DATA> =~ m[ ([^,]+) , \s* (.+) $]x;
printf '%7.7s %7.7s %7.7s ', <DATA> =~ m[ ([^/]+) / (?:(\S+)\s+)?
+(\S+) $]x;
printf "%s\n", <DATA> =~ m[^(\S+)];
}
__DATA__
Alanon, Bart
5590/EL
----
O'Lewis, John.
----/--- --
john
Le Much,Bo Jo
3406/165 NS
ed@a.nl
Abe-Jen, Mar-Jo
3421/164D NS
cbest
Output P:\test>351465
Alanon Bart 5590 EL ----
O'Lewis John. ---- --- -- john
Le Much Bo Jo 3406 165 NS ed@a.nl
Abe-Jen Mar-Jo 3421 164D NS cbest
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
| [reply] [d/l] [select] |
|
That looks a lot more perlish!. I changed it slightly to use the same data in an external file & I get an error for each line of the file that has a "/". (Of course, to get those errors I uncommented the 'no warnings' line)
Last First Phone Bldg Room email
Use of uninitialized value in printf at PL.pl line 11, <FH> line 2.
Use of uninitialized value in printf at PL.pl line 11, <FH> line 2.
Use of uninitialized value in printf at PL.pl line 11, <FH> line 2.
----
Use of uninitialized value in printf at PL.pl line 11, <FH> line 5.
Use of uninitialized value in printf at PL.pl line 11, <FH> line 5.
Use of uninitialized value in printf at PL.pl line 11, <FH> line 5.
john
Use of uninitialized value in printf at PL.pl line 11, <FH> line 8.
Use of uninitialized value in printf at PL.pl line 11, <FH> line 8.
Use of uninitialized value in printf at PL.pl line 11, <FH> line 8.
ed@a.nl
Use of uninitialized value in printf at PL.pl line 11, <FH> line 11.
Use of uninitialized value in printf at PL.pl line 11, <FH> line 11.
Use of uninitialized value in printf at PL.pl line 11, <FH> line 11.
cbest
If I redirect the output to a file, it looks like this:
Last First Phone Bldg Room email
Alanon Bart^M ----
O'Lewis John.^M john
Le Much Bo Jo^M ed@a.nl
Abe-Jen Mar-Jo^M cbest
The file, as I ran it is:
#!/opt/bin/perl -slw
use strict;
print "\nLast First Phone Bldg Room email\n";
open FH => "<testdata.txt" or die "can't find the data file: $!\n";
until( eof( FH ) ) {
# no warnings 'uninitialized';
printf '%-7.7s %-7.7s ', <FH> =~ m[ ([^,]+) , \s* (.+) $]x
+;
printf '%7.7s %7.7s %7.7s ', <FH> =~ m[ ([^/]+) / (?:(\S+)\s+)
+? (\S+) $]x;
printf "%s\n", <FH> =~ m[^(\S+)];
}
I definitely don't understand what's happening in the second half of the middle regexp.
-Theo-
(so many nodes and so little time ... )
| [reply] [d/l] [select] |
|
Weird! The ^Ms that are getting left behind on the end of the first names suggests that the contents of the data file is weird, though given the source that's no real surprise.
The reason for having the no warnings 'uninitialized' is to avoid the need for a special case to deal with location lines that don't contain the room number. The regex is saying:
m[
([^/]+) / # capture the phone number before the slash
(?: #optionally capture the room number if it is exists
(\S+) # capture all the non-spaces between the /
\s+ #and one or more spaces
)? # but only if there are two sets of non-spaces
# separated by a one or more spaces before EOL
(\S+) #capture the building code.
$
]x;
If the room number isn't present, the second capture ($2) will be undefined, hence the need to suppress the warning. However, that you are getting three warnings means that all 3 captures are undefined (ie. the regex failed to match), which suggests that the data in the file is formatted somewhat differently to the sample data you posted.
Without being able to see the actual contents of the file it is a little difficult to diagnose the problem.
Perhaps you could run this one liner on the data file to dump the first few lines in hex and post the output here?
perl -nle" exit if $. == 15; print unpack 'H*', $_" testdata.txt
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
| [reply] [d/l] [select] |
|
|
|
Re: Need a better way to count input lines
by hv (Prior) on May 07, 2004 at 16:15 UTC
|
For Bo Jo Le Much, the problem occurs when you try to clean up $fname:
($lname, $fname) = split(",");
$fname =~ s/^\s+(.+?)\W*$/$1/;
The /\s+/fragment requires one or more whitespace characters at the start, and when that isn't present the substitution fails.
The cleanest fix would probably be to absorb optional whitespace after the comma within the split, so that you don't need to look for it afterwards: ($lname, $fname) = split /,\s*/, $_, 2;
$fname =~ s/\W+$//;
Note that I've extended the split command slightly: specifying a LIMIT of 2 ensures that trailing text will not be thrown away if for some reason there is more than one comma in the name. It also allows the split to execute slightly faster, since it doesn't need to scan for more commas after the first is found.
Also, removing the need to check for whitespace in the substitution means that the only thing it is still doing is stripping non-word characters from the end, allowing a somewhat simpler pattern.
Hugo | [reply] [d/l] [select] |
|
Ahh! The lights come on! (Then they go dim again) I changed the
$fname =~ s/^\s+(.+?)\W*$/$1/;
to
$fname =~ s/^\s*(.+?)\W*$/$1/;
and the problem went away. (Of course it did. I should have seen that.) I can see why with the '+' it didn't work right, but why would the failure manifest as a CR at the end of $fname?
But I am perplexed about the split itself. I replaced the (",") with (",\s*") and got the error:
Unrecognized escape \s passed through at phonelist.pl line 28.
The paren/quote format worked fine with just a comma but broke with the addition of the space char class. Did I just get lucky the first way?
As to the LIMIT option, my Llama book doesn't explain it. Does the 2 relate the expectation of having two char within the match?
-Theo-
(so many nodes and so little time ... )
| [reply] [d/l] [select] |
|
... but why would the failure manifest as a CR at the end of $fname?
I suspect the text, created under Windows, has CRLF as the line terminator; running the script under Solaris the chomp is removing the LF but not the CR, so the CR was getting removed instead by the $fname substitution - but only when the substitution actually happened.
As for split, please take a look through perldoc -f split. First, note that the first parameter is shown as /PATTERN/; if you supply this parameter as a pattern (eg /,\s*/) it gets parsed as a pattern and does the right thing. If you supply a quoted string instead (eg ",\s*") perl will treat that first as a quoted string - in which case "\s" gives the warning, and gets converted to "s" - and the resulting string is then converted to a regexp further down the line.
You'll also find there far more than you ever wanted to know about the LIMIT argument.
Hugo
| [reply] [d/l] [select] |
|
|