http://www.perlmonks.org?node_id=1002964

killersquirel11 has asked for the wisdom of the Perl Monks concerning the following question:

Hey all,

I've hacked a few perl scripts in my day, but this is the first time I've ever written one from scratch. Thus I want to get some feedback on the good, the bad, and the ugly of what I've done here

Essentially, the script's job is to parse a file containing a column of part names and several columns of description (the columns are fixed-width space-delimited), and output how many parts match each description. Below is the code:

#!/usr/bin/perl -w use strict; use warnings; print "ARGV: $#ARGV\n"; if ($#ARGV != 0) { print "One, and only one (not $#ARGV), command line parameter is +expected\n"; exit; } my $inputFile=$ARGV[0]; if ( not -e $inputFile) { die "$inputFile doesn't exist!!!\nExiting.\n"; } print "Attempting to open $inputFile\n"; open my $file, $inputFile or die "Could not open $inputFile: $!"; my $headerLine= -1; my $prtIndex = -1; my $shpIndex = -1; my $clrIndex = -1; my $sizIndex = -1; my %values = (); my $index = 0; my @parts = (); while( my $line = <$file>) { if ( $line =~ "Part *Shape *Color *Size" ){ if($headerLine < 0){ $headerLine=$.; $prtIndex=index $line, "Part"; $shpIndex=index $line, "Shape"; $clrIndex=index $line, "Color"; $sizIndex=index $line, "Size"; } next } if(not checkIndices($prtIndex, $shpIndex, $clrIndex, $sizIndex)){ $headerLine=-1; next; } if(length($line)<$sizIndex){ next; } my $prt = substr $line, $prtIndex, $shpIndex-1-$prtIndex; my $shp = substr $line, $shpIndex, $clrIndex-1-$shpIndex; my $clr = substr $line, $clrIndex, $sizIndex-1-$clrIndex; my $siz = substr $line, $sizIndex; trimall($prt,$shp,$clr,$siz); my $superstr=$shp.$clr.$siz; if(not exists $values{$superstr}){ $values{$superstr}=$index; $parts[$index][0][0]=$prt; #List of all parts $parts[$index][1]=$superstr; #Value of the parts $parts[$index][2]=1; #Total qty of these parts $index++; } else { my $i=$values{$superstr}; $parts[$i][0][$parts[$i][2]] = $prt; $parts[$i][2]++; } } for my $i (0..$index-1){ my @plist=$parts[$i][0]; my $value=$parts[$i][1]; my $count=$parts[$i][2]; for my $j (0..$parts[$i][2]-1){ printf $parts[$i][0][$j]." "; } printf "- "; print $value." - "; print $count.".\n"; } exit; # Check if the indices for a number of input variables are all >= 0 sub checkIndices { my $count; for ($count=0; $count<=$#_; $count++){ if($_[$count] < 0){ return (1==0); } } return (1==1); } # Remove leading and trailing whitespace from all arguments sub trimall { my $count; for ($count=0; $count<=$#_; $count++){ $_[$count] =~ s/^\s*//; $_[$count] =~ s/\s*$//; } return; }


And here is the test input file

Part Shape Color Size P1 Circle Red 1 P2 Square Green 3 P3 Rectangle Red 4 P4 Circle Red 1 P5 Square Blue 1 P6 Square Green 3 P7 Rectangle Blue 4 P8 Square Red 2 P9 Circle Green 3 P10 Circle Blue 4


When all is said and done, it should output that P1 and P4 are equal, as are P2 and P6, which it does:

ARGV: 0 Attempting to open testFile.txt P1 P4 - CircleRed1 - 2. P2 P6 - SquareGreen3 - 2. P3 - RectangleRed4 - 1. P5 - SquareBlue1 - 1. P7 - RectangleBlue4 - 1. P8 - SquareRed2 - 1. P9 - CircleGreen3 - 1. P10 - CircleBlue4 - 1.


So now, I ask of ye, oh keepers of the Scrolls Scripts, what you would change about my work?

Thanks in advance!

Replies are listed 'Best First'.
Re: My first perl script is working, what did I do wrong?
by toolic (Bishop) on Nov 08, 2012 at 19:52 UTC
    If your part numbers will never have spaces in them, you could split your lines and build up a hash of arrays:
    use strict; use warnings; print "ARGV: $#ARGV\n"; if ($#ARGV != 0) { print "One, and only one (not $#ARGV), command line parameter is +expected\n"; exit; } my $inputFile=$ARGV[0]; if ( not -e $inputFile) { die "$inputFile doesn't exist!!!\nExiting.\n"; } print "Attempting to open $inputFile\n"; open my $file, $inputFile or die "Could not open $inputFile: $!"; my %parts; while (<$file>) { next if /^Part/; my ($part, $info) = split /\s+/, $_, 2; $info =~ s/\s//g; push @{ $parts{$info} }, $part; } for my $info (sort keys %parts) { print "@{ $parts{$info} } - $info - ", scalar @{ $parts{$info} }, +"\n"; } __END__ ARGV: 0 Attempting to open 1.txt P10 - CircleBlue4 - 1 P9 - CircleGreen3 - 1 P1 P4 - CircleRed1 - 2 P7 - RectangleBlue4 - 1 P3 - RectangleRed4 - 1 P5 - SquareBlue1 - 1 P2 P6 - SquareGreen3 - 2 P8 - SquareRed2 - 1

    With a little more work, you could sort on hash values to get the output in the order you want.

      If you can depend on the fixed columns, use unpack the same way as split.
      Bill
        Yes, I can depend on fixed columns more so than I can on the contents being nicely-formatted. Thanks for the quick replies.

      1 == @ARGV or die "One, and only one argument expected\n";

      1 == @ARGV or die Usage();

Re: My first perl script is working, what did I do wrong?
by rjt (Curate) on Nov 08, 2012 at 21:16 UTC

    You have a lot of options in your approach, here. I humbly offer my somewhat more Perl-ish approach to this problem.

    The basis of what I did is to get rid of hard-coded fields and allow you to specify them near the top of the script in @fields, building a regexp to capture them in the header line. I relaxed the usage of the program as well, to allow for specifying a filename on the command line, or, if none is specified, accepting input from STDIN. You can easily change the output format by modifying the final for (sort keys %inv) { ... } loop, and you can change how records are grouped by changing the key() function to suit your tastes. In both cases, I tried to stay with what you had, since I don't know what you want to ultimately do with the data.

    use strict; use warnings; use 5.12.00; die "usage: $0 [filename]\n" if (@ARGV > 1); # Set required fields and determine fixed widths my @fields = qw/Part Shape Color Size/; my $re; $re .= qr/(?<$_>$_\s*)/ for @fields; my $header = <>; die "Header must match " . join(' ', @fields) unless ($header =~ /^$re +$/); my $expected_len = length($header); my $tmpl = join(' ', map { "A[".length($+{$_})."]" } @fields); my %inv; # Inventory; $inv{key(%rec)} while (<>) { if (length != $expected_len) { die sprintf("Got length of %d, expecting %d", length, $expecte +d_len); } my @rec = map { /(.+?)\s*$/ } unpack $tmpl; # Get (trimmed) record +s my %rec = map { $_ => shift @rec } @fields; # Add the part to our inventory push @{$inv{key(%rec)}}, $rec{Part}; } # Now print out the inventory in the desired format for (sort keys %inv) { say join(' ',@{$inv{$_}}) . " - $_ - " . scalar @{$inv{$_}}; } # Our custom hashing function for inventory items. Expects %rec argume +nt sub key { my %rec = @_; $rec{Shape} . $rec{Color} . $rec{Size}; }
      Thanks for the reply, it actually looks like Perl! A couple quick questions on this:
      1) Are both 'use 5.12.00' and 'use strict' necessary? From what I read, 'use 5.12.00' implies 'use strict'.
      2) With the following line, I'm having a bit of difficulty understanding it.
      $re .= qr/(?<$_>$_\s*)/ for @fields;
      Is it essentially this?:
      for my $field (@fields) $re=$re.qr/(?<$_>$_\s*)/
        1. You are correct, use strict is not required here. Sorry, old habits die hard. :-)
        2. Not quite. Your example will not compile, but a similarly expanded translation of what I wrote would look like:
        for my $field (@fields) { $re .= qr/(?<$field>$field\s*)/; }

        Read $_ for a description of the $_ special variable, but the short version is, if you do not supply a variable name to for, Perl will automatically assume $_.

        The regex itself might benefit from a bit more explanation. It's capturing all of the column names including trailing whitespace, and saving those as named captures in %+ for use in the line that builds $tmpl:

            my $tmpl = join(' ', map { "A[".length($+{$_})."]" } @fields);

        That generates an (un)pack template string based on the field lengths read from the header, determining the length of each field in @fields by taking the length of the same-named capture in %+ . I'm using map here to "map" the values in @fields to the list that I want: a list of the lengths of each field.

Re: My first perl script is working, what did I do wrong?
by marquezc329 (Scribe) on Nov 08, 2012 at 22:20 UTC
    Hello killersquirel11. Looks like you've already been given some full rewrites. I just figured I'd give you some smaller points to think about in the future.

    if() vs unless() i.e.
    if ($#ARGV != 0) -> unless ($#ARGV == 0)
    if (not -e $inputFile) -> unless (-e $inputFile)

    Multiple Variable Declaration can be broken down from:

    my $headerLine= -1; my $prtIndex = -1; my $shpIndex = -1; my $clrIndex = -1; my $sizIndex = -1;
    to:

    my ($headerLine, $prtIndex, $shpIndex, $clrIndex, $sizIndex) = -1
     my ($headerLine, $prtIndex, $shpIndex, $clrIndex, $sizIndex) = (-1) x 5;

    C style for loops like:

    sub checkIndices { my $count; for ($count=0; $count<=$#_; $count++){ if($_[$count] < 0){ return (1==0); } } return (1==1); }
    Can be "Perl-ified" like:
    sub checkIndices { for my $item (@_){ return (1==0) if ($item < 0) } return (1==1); }

      Multiple Variable Declaration can be broken down from:

      my $headerLine= -1; my $prtIndex = -1; my $shpIndex = -1; my $clrIndex = -1; my $sizIndex = -1;

      to:

      my ($headerLine, $prtIndex, $shpIndex, $clrIndex, $sizIndex) = -1;

      No, this won't do what you expect. Given:

        my ($a, $b, $c) = -1;

      only $a will receive the value ($b and $c will be undef). You can, however, use:

        my ($a, $b, $c) = (-1, -1, -1);

      That being said, I've found more often than not, when I have to start assigning the same value to a bunch of variables, there's probably a deeper design decision I need to question.

      In this case, -1 is used as some sort of magic value to indicate the state of something. Although even better refactoring can be done in this case, at the very least leaving the variables as undef provides better information, especially if any of those variables might legitimately contain -1 in the future.

        Thanks for the replies. In case you couldn't tell, I typically program in C (and sometimes Java, although really when I use Java I treat it a lot like C). I will definitely keep these pointers in mind as I continue to learn how to hack together Perl scripts.

        Thank you for your correction rjt. Crackers2 also notified me of this mistake. I edited my response to reflect an alternate method of accomplishing this.

Re: My first perl script is working, what did I do wrong?
by eyepopslikeamosquito (Archbishop) on Nov 09, 2012 at 07:13 UTC

    Some suggestions below on how to improve the early part of your script.

    1) Replace:

    if ($#ARGV != 0) { print "One, and only one (not $#ARGV), command line parameter is +expected\n"; exit; }
    with:
    @ARGV == 1 or die "usage: $0 inputfile\n";
    When a script encounters an error it should exit with a non-zero value (a bald exit returns a zero exit value), a usage statement is conventional, and or die is idiomatic.

    BTW, you should rarely need to use the cumbersome $#array notation. Using it here is unnecessary and the code is clearer using @ARGV instead. Another example, from Perl Best Practices, chapter 5, "Use negative indices when counting from the end of an array" is to prefer $frames[-2] to $frames[$#frames-1] because it is easier on the eye and avoids the undesirable repetition of the variable name (DRY).

    2) Replace:

    my $inputFile=$ARGV[0];
    with:
    my $inputFile = shift;
    Using shift makes maintenance easier if you later add or remove command line arguments because you don't need to manually shuffle the ARGV indices (which are "magic numbers").

    3) Replace:

    if ( not -e $inputFile) { die "$inputFile doesn't exist!!!\nExiting.\n"; }
    with:
    -f $inputFile or die "input file '$inputFile' does not exist.\n";
    Using -f is more precise than -e because the first argument must be a file (not a directory, say). Using or die is idiomatic and more concise.

    4) Replace:

    open my $file, $inputFile or die "Could not open $inputFile: $!";
    with:
    open my $file, '<', $inputFile or die "Could not open $inputFile: $!";
    As for why the three-argument form of open is preferred, the old two-argument form of open is subject to various security exploits as described at Opening files securely in Perl.

Re: My first perl script is working, what did I do wrong?
by space_monk (Chaplain) on Nov 09, 2012 at 01:40 UTC
    First, its a good idea to seek advice and review. I regularly write stuff that I think is good until one of my colleagues points out that he would have done it better/ faster/ in less code by doing it another way.... :-)

    Second, the main mistake you made is that you didn't ask yourself what tasks in the above script could be done by standard PERL libraries instead of lovingly hand crafting every aspect. There are lots of text parsing libraries, which if used, would have the benefit of substantially simplifying your code.

    In future, spend a few minutes before commencing any programming task, asking yourself what someone is likely to have done before, and what you can borrow from others to simplify your task and to make your code more understandable and maintainable.

    A quick browse of CPAN or PerlMonks before you program will often save someone dissecting where you went wrong afterwards. That said, good luck in the future! :-)

    A Monk aims to give answers to those who have none, and to learn from those who know more.
Re: My first perl script is working, what did I do wrong?
by tospo (Hermit) on Nov 09, 2012 at 09:31 UTC
    ++ for the title :-)