http://www.perlmonks.org?node_id=863434

rhiridflaidd has asked for the wisdom of the Perl Monks concerning the following question:

Right, I've been using Perl off and on for a few months, building bits of code as I go along. I never even did a GCSE in IT. I've never read a book about programing, had a lesson or anything. So in a nutshell, my coding is a mess. Here's an example of what I write - It's parsing code to take a set of lines and convert it into a csv.
if ($#ARGV != 2){die "usage $0 filenamein filenameout\n";} open( INFILE, "< $ARGV[0]" ) || die "Cannot open $ARGV[0] $!\n"; open( OUTFILE, "> $ARGV[1]" ) || die "Cannot open $ARGV[1] $!\n"; $nopat=0; $lookup =""; while (<INFILE>) { $line=$_; if($line=~m/^Patient ID.*?(\d+)/) { $ID=$1; $nopat=$nopat+1; $blank=""; print $ID."\n"; %hash = (); } if($line=~m/^NAD.*?(\d+)/) { chomp; $nhs=$1; } if($line=~m/^Date Of Birth/) { $dob=$line; $dob=~s/Date Of Birth ://; chomp $dob; } if($line=~m/^Sex/) { $sex=$line; $sex=~s/Sex ://; chomp $sex; } if($line=~m/^Post/) { $postcode=$line; $postcode=~s/Postcode :(\w+)/$1/; chomp $postcode; $lookup=$lookup.$ID.",".$nhs.",".$postcode.",".$dob.",".$sex."\n"; #$age = <STDIN>; # pause #print $lookup; } if ($line=~m/^\d\d\/\d\d.*?/) { #print "$line"; @names = split(',',$line); $date=@names[0]; $index=@names[1]; $nonsense=@names[2]; $val=@names[3]; $test=@names[3]; $orgtest=@names[3]; $orgindex=@names[1]; chomp $orgtest; chomp $val; chomp $test; #print "$val\n"; #if ($index=~m/O\/E -/) { #$index=~s/\d.*//; #search for digits and deletes everything after it #} $test=~s/\s$//; #substitute all blank spaces at end of line with nothi +ng $test=~s/\s$//; #substitute all blank spaces at end of line with nothi +ng #$index=~s/(\.\d+)|(\d+\.\d+)//;#delete number.number $index=~s/(\d)(\.00)(\s)/$1.$3/; #delet .00s $index=~s/(\d\.\d)(0)(\s)/$1.$3/; #delet .x0s $index=~s/(\d)(\.0)(\s)/$1.$3/; #delet e.0s $index=~s/\s(\.\d)/0.$1/; #add a 0 in front of a period This is so th +at the units match the old units and that replace works $index=~s/\<|(\)|\()|\/|\^|\.|\*|//g; #substitute all operators $test=~s/\<|(\)|\()|\/|\^|\.|\*|//g; #substitute all operators $test=~s/\s0//; #$test=~s\s0$\\; $testother="$test"."0"; $index=~s/\s$testother\s.*//; $index=~s/\s$test\s.*//; $presub=$index; $index=~s/$testother//; $index=~s/$test//; #search for last #$index=~s/(\.\d+)|(\d+\.\d+)//;#delete number.number #print "-$val- substituted -$index- \n"; #index=~s/x109l|x109 L|109L|x1012L|gdL|iul|iuL|gdl|ugL|uL|gL|mmolL|mmo +ll| Ul|1012L| gl |mmh|mm|.0|IUL|Uml|ngmL| UL//; $index=~s/\s+$//; #substitute all blank spaces at end of line with not +hing $index=~s/\.$//; #substitute all . spaces at end of line with nothing $index=~s/\s$//; #substitute all blank spaces at end of line with noth +ing $index=~s/(OE -\w*)\d/$1/; #substitute all blank spaces at end of line + with nothing $index=~s/\s\d+\scm$//; #$index=~ s/10\*.*//; #Search for 10*whatever and delete it #$index=~s/x109l|x109 L|109L|x1012L//; $val=~ s/mL\/min\/1.73m2//; $val=~ s/10\*.*//; #Search for 10*whatever and delete it $val=~s/109L//; $val=~tr/[0-9][.]//cd; #$index=~s/$val//; #and re dof or final clear up $index=~s/\s+$//; #substitute all blank spaces at end of line with not +hing # print "-$test- substituted -$index- \n"; if ($val != "") { $hash{$index} =$hash{$index}.$val.","; #push @lines, $ID.",".$index.",".$date.",".$val.",".$orgtest.",".$orgi +ndex.",".$presub.",".$test.",".$testother."\n"; #HanDY TO TEST push @lines, $ID.",".$index.",".$date.",".$val."\n"; #HanDY TO TEST push @keys, $index."\n"; } } } #print "Sorting and collating records.....takes some time.....\n Wait +for this window to close before continuing \n "; my @unique = (); my %seen = (); my @unique = grep { ! $seen{ $_ }++ } @keys; @sorted = sort { $a cmp $b } @unique ; #print @sorted; print OUTFILE @lines; print "\n$. lines of data Processed. "; close (OUTFILE); open( OUTFILE2, "> $ARGV[2]" ) || die "Cannot open $ARGV[2] $!\n"; print OUTFILE2 @sorted; close (OUTFILE2); close (INFILE); $path="lookup.CSV"; open (OUTFILE3, "> $path") || die "Cannot open LOOKUP.CSV $!\n"; print OUTFILE3 $lookup; close (OUTFILE3);
So - what basic errors am I making. Basically all the above is cobbled together. What should decent code look like??

Replies are listed 'Best First'.
Re: Code Critique
by chromatic (Archbishop) on Oct 04, 2010 at 21:12 UTC

    Several comments.

    First, use something like Perl::Tidy to fix the indentation and formatting. It'll be easier to read that way. (I know that copying and pasting code in here sometimes changes the formatting, but it's a good habit anyway.)

    Second, use strict and warnings. They can help prevent making silly mistakes and their warnings can help you to think about what you're doing and why. In particular, you use undeclared global variables and you use array slices in scalar context (which is rarely what you mean).

    Third, use lexical filehandles and three-argument open. This helps avoid further global variables and avoids a potential problem (though not in this case) where untrusted input can change how your program performs IO.

    Fourth, embrace Perlishness. For example, assigning to @names and then extracting elements into named variables directly is much easier if you assign to a list of those variables from the split in a single step. Likewise string interpolation is often clearer than lots of small concatenations. Similarly there's no reason to assign initial empty values to variables if you declare them as lexicals.

    Once you've done that, you can reduce the busy work—I'm sure many of those substitutions are superfluous, and there's no reason to push values onto @keys if you're going to treat it as a hash anyway; make it a hash to start.

    If you don't already have a good Perl 5 book, you might like to read the draft of my Modern Perl.

Re: Code Critique
by TomDLux (Vicar) on Oct 04, 2010 at 21:26 UTC

    step zero ... declare your variables. That limits the scope, so you can tell where something exists, where it gets modified. Pass variables in and out of routines, don't just change things globally.

    The first thing I would say is the program needs indentation, so you can tell what happens when. Probably the web site mucked up the nice indentation you had going. if you didn't, check out perltidy, you can download it from CPAN.

    A long linear program like this hard to understand. Break it into subroutines, and it becomes simpler, more compact. Instead of having to understand a dozen lines of code, the subroutine name explains what is happening ... and at a separate time, you can look at the subroutine body to determine how it's implemented.

    Bareword file handles are out-of-date, as is the two-argument open(). Real variables are easier to pass to subroutines or embed in an object, and three-arg open is safer in terms of hacks. Modern style is ...

    open my $infile, '<', @ARGV[0] or die ....

    Your while() loop uses the auto assignment to $_, then you assign to a variable. Why not just assign into the variable, and save a line?

    while ( my $line = <INFILE>) { chomp $line; ...

    Your many if() blocks are truly independent, only one will be processed for any line. Use if/elsif to clarify that only one will be true. Put the block bodies into subroutines, to hide away the details. Use a subroutine to determine the 'type' of the current line, and pass back a symbolic representation.

    # at the top use Readonly; Readonly my $PATIENT_ID => 0; Readonly my $DOB => 1; # and in the main code ... my %data; my $linetype = determine_type( $line ); if ( $linetype == $PATIENT_ID ) { process_patient( $line, \%data ); elsif ( $linetype == $DOB ) { process_dob( $line, \%data ); elsif ...

    If you're using Perl 5.10 or 5.12, you can use a given{} block, or you could use a hash of subroutine references.

    given ( determine_type( $line ) ) { when { $PATIENT_ID } { process_patient( $line, \%data ); } when { $DOB } { process_dob( $line, \%data ); } ...

    Or in older Perls, if all the options use the input line and update a %data data structure, use a hash:

    # define the process_hash near the top # Readonly my %PROCESS => ( $PATIENT_ID => \&process_patient, $DOB => \&process_dob, ... ); # and use it in the loop # my $type = determine_type( $line ); $PROCESS{$type}->( $line, \%data );

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

      Indentation No I haven't been doing that. At all. As I said, this is all written purely intuitively. It started as a 10 liner and is now a mess.

      Subroutines. Yes, Understanding how to use them sensibly would be immensely sensible. This script's partrner has hit 700 lines and is truly ridiculous. elseif That's sensible

      Readonly Hmm at the moment I'm not even doing simple subroutines let alone hashes of subroutines...... But that's why I posted here - I knew something big was amiss Thanks for all fo this. Incredibly helpful
      I'm only getting to this now. And am a bit muddled. while (<$infile>) Is massively slower than while (<INFILE>) What do you advise. edit - ignore it's something else
Re: Code Critique
by Marshall (Canon) on Oct 04, 2010 at 22:46 UTC
    The lack of any systematic indenting and spacing of the code is a very serious flaw. Trying to read your code is like trying to read a book which has no chapters, paragraphs or even punctuation marks! This is so serious that this code would not get more than a "D" in a college class even if it did work. I'm not saying that to be mean or condescending, you may not be aware of just how important this.

    There is more than one way to do the indenting. As a beginner at this, I would suggest the "line up the braces method", like the example below. Use 3 or 4 extra spaces between levels. Alternately, you can use the diagonal brace method - its just as valid, but I suggest trying this method for awhile and see how you like it. Judiciously applied whitespace is one the most important things that you can do to improve readability!

    while (condition) { .... if () { .... } .... }
    Another systemic flaw is that your code doesn't have locality of functionality - related things are in widely separated areas. Some examples... I would open all of the files at the beginning of the code. BTW, I suspect some better names than "OUTPUT3" or "OUTPUT2"can be thought of... like maybe CSV or SORTED? I see print OUTFILE3 $lookup;. Don't defer output til later. Do right after you calculate it! I don't think that you even need this $lookup variable...print the lines as you go. print CSV "$ID,$nhs,$postcode,$dob,$sex\n"; Likewise, print OUTFILE2 @sorted;. Put that right after the sort.

    A whole bunch of these chomp; statements appear to unnecessary. Try to factor out common code. Perhaps;

    while (my $line = <INFILE>) { $line =~ s/^\s*//; # remove leading whitespace $line =~ s/\s*$//; # remove ending whitespace (includes \n) # chomp is not necessary ... }
    Anyway think about the suggestions you've gotten and take another stab at this. Get it running with "use strict; use warnings;" and clean-up formatting.

      That example is tame.What I'm writing now is running into 700 lines and, yep, you guessed it, understanding what the heck I was thinking is now becoming the problem - hence the need to get some rhyme and reason into what I'm doing.

      And re it working - course it works!! It's a funny data set in this format

      20/08/2007,Erythrocyte sedimentation rate,,3 mm/h 20/08/2008,Total white blood count,,6.7 10*9/L and sometimes 04/04/2007,Haemoglobin estimation 12.9 g/dL,,12.9 g/dL And a myriad other units.
      The task is to strip away the units and leave me with Date, Test, Data Being a pragmatist, getting things working is the challenge. But before packing my bags on this, I need to make it understandable if I ever need to revisit it. A dataset is around half a million lines, and the diferent types of units legion.

        Ah, your programs are becoming complex enough that the realization that there are always 2 people who need to understand the code is starting to sink in. Those 2 people are: (a) you when you write it and (b) you when you have to extend it or fix at some time in the future! And if it is a useful program, one of these 2 things and probably both will happen!

        Some simplifications to the code will be possible, based upon your first example, I recoded the first part with some straightforward techniques and with just borrowing pretty much your regex'es. Your actual record is more complex. However, I hope that the lesson about spacing and indenting is becoming more clear - it helps a LOT.

        #!/usr/bin/perl -w use strict; my $patient_id; #patient record for the CSV file my $nhs; my $dob; my $sex; my $postcode; my $total_patients=0; while (<DATA>) { s/^\s*//; #remove leading space s/\s*$//; #remove trailing space $patient_id = $1 if m/^Patient ID .*?(\d+)$/; $nhs = $1 if m/^NAD .*?(\d+)$/; $dob = $1 if m/^Date Of Birth .*?([\/\d]+)$/; $sex = $1 if m/^Sex .*?(\w+)$/; #record ends with postal code -> output to the CSV file if ( ($postcode )= m/^Postcode .*?(\w+)$/ ) { print "$patient_id,$nhs,$postcode,$dob,$sex\n"; #CSV line $total_patients++; # this is here instead of with $patient_id # because it is here that we have an # "official" patient record # this starts variables "over again". A runtime warning will # happen if for some reason one of these winds up still # being undefined when the record is printed to the CSV file # # note: doing this for $postcode is not necessary, but I # treat this variable same as the others because this lessens # the chance of a mistake when modifying the program later. $patient_id = $nhs = $dob = $sex = $postcode = undef; + } } print "Total Patients: $total_patients\n"; =prints 1234,5678,BH78,01/01/1965,yes 4321,999,XYZ98,10/10/2007,no Total Patients: 2 =cut __DATA__ Patient ID 1234 NAD 5678 Date Of Birth : 01/01/1965 Sex : yes Postcode : BH78 Patient ID 4321 NAD 999 Date Of Birth : 10/10/2007 Sex : no Postcode : XYZ98
        I'm not sure what you want in this second programming problem. Maybe you could explain it a bit more?

        This is a very simple CSV file (no quote characters or embedded quotes within quotes), so just split on the comma's and then do another split on space to get the basic value and measurement units from the line. I used what is called a list slice to assign directly to variables without any numeric subscript stuff.

        The "formula" for this sort of thing is: open files, read and separate the input data into one "record" (a "quanta" if you will that makes sense for processing), either save that for processing in a collection of those "records" or process the records as you go.

        In your first example, I don't think there is a need to save anything past the "current record". For this one, I don't know.. explain more and show more...

        #!/usr/bin/perl use strict; use warnings; # the same as #!/usr/bin/perl -w in first line while (<DATA>) { next if /^\s*$/; # skip blank lines # sometimes a last blank line # causes problems! s/\s*$//; # or chomp; fine also my ($date, $description, $measurement) = (split (/,/,$_))[0,1,3]; my ($value, $units) = split (/\s+/,$measurement); print "date = $date\n", "descript = $description\n", "value = $value\n", "units = $units\n\n"; # instead of print, do something here ... } =prints date = 20/08/2007 descript = Erythrocyte sedimentation rate value = 3 units = mm/h date = 20/08/2008 descript = Total white blood count value = 6.7 units = 10*9/L date = 04/04/2007 descript = Haemoglobin estimation 12.9 g/dL value = 12.9 units = g/dL =cut __DATA__ 20/08/2007,Erythrocyte sedimentation rate,,3 mm/h 20/08/2008,Total white blood count,,6.7 10*9/L 04/04/2007,Haemoglobin estimation 12.9 g/dL,,12.9 g/dL
        If a new problem is being discussed, please start a new node.
Re: Code Critique
by JavaFan (Canon) on Oct 04, 2010 at 21:14 UTC
    What should decent code look like??
    I looked at your first line:
    if ($#ARGV != 2){die "usage $0 filenamein filenameout\n";}
    You require the program to have exactly three arguments. Your die message suggest two are needed. Looking through the code, it turns out you indeed use all three. But the last one almost at the bottom. Better to assign your arguments to meaningful variables at the top of your program.

    As for the rest of the code, the almost total lack of indentation makes it too hard to read. I'm not going to bother.

Re: Code Critique
by halfcountplus (Hermit) on Oct 05, 2010 at 16:14 UTC
    >>I've never read a book about programing.

    In other words, you prefer to do things the hard way? If someone said to you, I'm learning algebra, but I've never even read a book on it, would you consider that person:

    1. Shrewd and wise
    2. Some kind of masochist
    3. Just eccentric
    4. A bit of a fool

    I'll pick 2 and 4. Reading a book may seem like a time consuming and costly endeavour, but I GUARANTEE it will end up saving you time, and unless that time is worth $0/hour to you, the expense will probably seem trivial in a couple of months when you realize you have now spend scores or hundreds of learning from and consulting a $40 book.

    If you live in a major city, go into the public library and search the catalogue for "Perl" -- you'll be surprised. Another guarantee: every single beginner perl title on Amazon will have "used" copies available for <$20 US. I've order lots of used books from their resellers and they are always in great shape.

    The only thing I see demonstrated in your posted code is a basic understanding of datatypes, conditional structures, and simple regular expressions: how about loops and functions? As Marshall partially demonstrated, a more complete understanding of programming concepts will make your life easier -- something probably most easily gained by working thru some exercises, like those you'd find in a book. On-line tutorials are seldom as thorough and not infrequently just plain wrong. Also, I don't think the base perl documentation itself (the manpages, POD, etc, also online at perldoc.org) is very good except as a reference, which is what it is intended as. (All apologies to the authors of that documentation, of course, but I would not expect anyone to enjoy learning both programming and perl from it.)

    Perl is my first language, and I learned it via the "Perl Cookbook" (used at Amazon $15); it is a decent intro to important concepts with concrete examples, altho 1) there are no exercises, and exercises are great, 2) it suffers from the sketchy style and other problems common in "Cookbook" titles (which I've stopped bothering with). I notice "Learning Perl" is available used for as little as $1 (4th edition, which should be fine), that looks like it might be decent too, also apparently includes exercises. Which I just have to repeat: exercises are great for learning. Beg for your mom's credit card number and go order it now. You'll get it in about a week and it will cost you the same as renting a movie...

    My #1 tip:

    Learn to indent properly

    I was terrible with indentation and formatting originally and considered it trivial and subjective. I was very wrong ;)

      The thing about all this is that I'm not a coder and never had been. This all just started off as using a few lines of perl to add commas into a file to turn it into a csv - and it snowballed from there - resulting in this. I'm now at a point where I'm realising that I need to stop and take stock. If you look at the datastructures I'm analysing, you can guess what my dayjob is - I don't need to ask my mum! The real question is - which is the right book to buy.

        The thing about all this is that I'm not a coder and never had been. This all just started off as using a few lines of perl to add commas into a file to turn it into a csv - and it snowballed from there - resulting in this.

        Once it snowballs, you lose the ability to not be a coder :)

        (OK, actually before it snowballs. Because it always snowballs...)

Re: Code Critique
by toolic (Bishop) on Oct 04, 2010 at 23:27 UTC
Re: Code Critique
by TomDLux (Vicar) on Oct 05, 2010 at 17:44 UTC

    My first programming language was C, which I learned from Kernighan & Ritchie's book. That was my concept of the correct way to indent and format programs. Gradually my ideas evolved ... different languages create situations that didn't exist elsewhere. But my current job, maintaining software that has been around for a long time, convinces me the best solution is to have a perltidy configuration file that everyone used. I can cope with just about any arrangement, if it's consistent.

    Your 700 line problem is that you begin at the beginning, and add lines until you reach the end. Instead, describe the task on a top-level basis.

    # Read patient records from file, one by one # parse the data # output summary

    Well, there's an 'open file' and a loop hidden in that first element, but there are essentially a few routines there.

    my $options = parse_cmd_args(); my $patient_file = open_patient_file( $options ); my $result_file = open_result_file( $options ); while ( my @record_lines = get_record( $patient_file ) ) { my $record_hash = parse_record( @record_lines ); output_results( $result_file, $record_hash ); }

    I'm not saying the above is the ideal result, but it IS simple, localized, readable. Expand each subroutine into a few steps, actually coding items that are simple, using abstract steps that need to be defined for more complex steps.

    Before you start coding, write a top level description of what you want, focus on generalities, rather than details. Some of this will change as you begin implementing. Even after years of experience, I find some steps become very complicated, and need to be split into multiple steps, while other steps wither away. But without any toplevel concept, your program will be a tangle of spaghetti.

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

      Agreed - subroutines has to be the way forwards. So... How do I learn to write them?

        I am a big fan of design by comments, at least for relatively simple projects.

        What this means is, start with an English (or whatever language makes sense to you) description of what the program should do. Keep it all very simple and high level. Don't go into great detail about formats and so forth at first.

        # This program takes a list of foo data files # for each file it: # reads the file # sanitizes the data fields # converts the file to csv (name is based on foo file name)

        Now you can write some code. Keep everything related to the current idea all on one screen of text. If it gets too long, wave your hands and invent a subroutine name.

        # This program takes a list of foo data files my @foo_files = @ARGV; for my $foo_file (@foo_files) { save_foo_file_as_csv( $foo_file ); } sub save_foo_file_as_csv { my $foo_file = shift; my $csv_file = convert_foo_name_to_csv_name($foo_file); my $foo_data = read_foo_file( $foo_file ); $foo_data = sanitize_data( $foo_data ); write_as_csv( $csv_file, $foo_data ); return; }

        Now, the top level of abstraction is done. Of course the code won't work yet. We've waved our hands an awful lot. We now need to implement each sub we invented to avoid having to think about the details of what we're doing.

        Each sub is effectively its own little program. Repeat this process for each subroutine until they are all written. For example, we can be very concrete about how we convert a foo file name into a csv file name. No need for additional subs:

        sub convert_foo_name_to_csv_name { my $name = shift; $name =~ s/(.foo)?$/.csv/c; return $name; }

        Something like sanitize_data() may be more complex, and even need a good number of subs of its own.

        Once you've fleshed out all the subs, your code will be done.

        For existing code, work in reverse. Label things. Describe what is happening. Take big blocks of code and move them into a sub and replace them with descriptively named routines.


        TGI says moo