dimmesdale has asked for the wisdom of the Perl Monks concerning the following question:

Monks, I am trying to parse the data at virginia.edu/.../webcod.enf.txt. An example of the format is:
92861 APMA 109 0001 GI LC CALCULUS I 4.0 0900-0950 M W F OLS 011 OBERHAUSER JP 055 002 O 0830-0920 T OLS 005 90063 APMA 109 0002 GI LC CALCULUS I 4.0 1000-1050 M W F OLS 120 BECK M 055 004 O 0830-0920 R OLS 120 91589 APMA 109 0003 GI LC CALCULUS I 4.0 1100-1150 M W F OLS 120 BECK M 055 006 O 0830-0920 T MEC 205 93778 APMA 109 0004 GI LC CALCULUS I 4.0 1200-1250 M W F OLS 120 BECK M 055 004 O 0830-0920 T MEC 205
(Looking at the html version is easier to parse mentally, ..., but seemed like a harder problem initially.)
The lines that begin with a 5 digit number denote the beginning of a particular section (i.e., an instance of a certain course being taught). APMA 109 would be a mneumonic for the course. Then the section number, the course name, etc. The issue is made more difficult because certain information is sometimes absent. I've made an attempt at it, but on dumping out my hash, it's not anywhere close to what it should be. Your help is most appreciated. Code ...
#!/usr/bin/perl # parseCOD.perl # Attempts to parse .txt COD files for UVa use strict; use warnings; use Data::Dumper; # coures{mneumonic}->[sections] = # [ "Name", "ID", "Credit", "CurrEnroll", "MaxEnroll", [start time], + [end time], [days], [location], [instructor] ] # # So, for instance, to find out all the professors of each section of +SPAN411: # for (@{$coures{SPAN411}->[9]}) { print } # my %courses; my ($id, $mneumonic, $sect, $maxEnroll, $currEnroll, $name, $credit); my (@startTime, @endTime, @days, @location, @instructor); my $line; my $file = "webcod.enf.txt"; my $i = 1; open COD,$file or die "No go fo' $file\n"; open OUT,">out.txt" or die "No go fo' out.txt\n"; for (1..10) { my $junk = <COD>; } # Don't need first ten lines ... while ($line = <COD>) { SECTION: if ($line =~ /\d\d\d\d\d/) { # Start of a course section if ($line =~ /(\d\d\d\d\d)\s((?:\W|\S){2,5}\S*?\d\d\d\w?)\s+(\d\d\ +d\d)\s(.*)(\d\.\d)/) { $id = $1; $mneumonic = $2; $sect = $3; $name = $4; $credit = $5; } else { die "Line not properly parsed! Choking, choking ... dead +. Line:\n'$line'"; } # Now, get times, dates, etc. $line = <COD>; if ($line =~ /TBA/) { $line = <COD>; goto SECTION; } if ($line =~ /\s*(\d+)-(\d+)\s*([MTWRFS ]+?)\s*(\w\w\w\s*\w?\d*\w? +)\s*(\w*[,'-]?\w*\s*\w*)\s*(\d\d\d)\s*(\d\d\d)/) { $startTime[0] = $1; $endTime[0] = $2; $days[0] = $3; $location[0 +] = $4; $instructor[0] = $5; $maxEnroll = $6; $currEnroll = $7; } else { die "Oof! Malformed line. Line:\n'$line'"; } while ($line = <COD>) { if ($line =~ /\d\d\d\d\d/) { $line = <COD>; goto SECTION; } if ($line =~ /GRAD ENGR/) { $line = <COD>; goto SECTION; } if ($line =~ /TBA/) { $line = <COD>; goto SECTION; } if ($line =~ /RESTRICTED TO/) { $line = <COD>; goto SECTION; } if ($line =~ /^$/) { $line = <COD>; goto SECTION; } if ($line =~ /\s+([a-zA-Z]+\s+[a-zA-Z])/) { push @instructor, $1; } elsif ($line =~ /\s*(\d+)-(\d+)\s*([MTWRFS ]+?)\s*(\w\w\w\s*\w +?\d*)\s*(\w*[,'-]?\w*\s*\w*)/) { push @startTime, @$1; push @endTime, @$2; push @days, @$3; pus +h @location, @$4; if ($5) { push @instructor, @$5; } else { push @instructor, @{$instructor[0]}; } } else { die "Doh! Silly out of format line. Line:\n'$line'"; +} } } } continue { # Remove spaces from $mneumonic $mneumonic =~ s/\s+//g; $courses{$mneumonic}->[$sect] = [$name, $id, $credit, $currEnroll, $ +maxEnroll, \@startTime, \@endTime, \@days, \ +@location, \@instructor]; print "Assigned ", $i++, "\n"; }

Replies are listed 'Best First'.
Re: Parsing COD text help
by tachyon (Chancellor) on Jul 26, 2004 at 01:18 UTC

    There is a lot of extraneous material in that datafile. The key part (to me) with data munging is to first break out the records, then process each record in turn. This snippet breaks all the courses out into records, skipping all the dross. From there it should be easy to parse $course reliably which contains one complete course entry.

    my $file = "c:/webcod_enf.txt"; open F, $file or die $!; local $/ = "\n\n"; # break up into rough 'records' at blank lines # often you can get good records just by setting $ +/ # not with this data though ;-) while (<F>) { # skip the extraneous data, valid chunks will start with ^\s*\d{5} unless ( m/^\s*\d{5}/) { #print "Skipping:\n$_"; next; } # now we have chunks of real data to parse. we split it on the uni +que m/^\s*\d{5}/m # numeric feature to break out the individual records. we use a lo +okahead assertion # to do this so we don't loose that data in the split for my $course( split /(?=^\s*\d{5})/m, $_ ) { next if $course =~ m/^\s*$/; # we possibly get a null record t +o start so skip print "$course\n\n"; } } __DATA__ 92861 APMA 109 0001 GI LC CALCULUS I 4.0 0900-0950 M W F OLS 011 OBERHAUSER JP 055 002 O 0830-0920 T OLS 005 90063 APMA 109 0002 GI LC CALCULUS I 4.0 1000-1050 M W F OLS 120 BECK M 055 004 O 0830-0920 R OLS 120 91589 APMA 109 0003 GI LC CALCULUS I 4.0 1100-1150 M W F OLS 120 BECK M 055 006 O 0830-0920 T MEC 205

    cheers

    tachyon

Re: Parsing COD text help
by Aristotle (Chancellor) on Jul 26, 2004 at 01:47 UTC

    These look like fixed-width records to me. That's a job for unpack.

    Something like this would tell lines that start a new record apart from those which don't:

    while(<>) { chomp; my ( $cnum, $details ) = unpack "x a5 x a*", $_; if( $cnum =~ /\d/ ) { # new record } else { # follow-up line } }
    The string left in $details for follow-up lines seem to roughly conform to the unpack format
    a4 x a4 x a5 x2 a3 x3 a3 x2 a18 x a3 x a3 x a
    Something similar would have to be devised for the $details in record-starting lines. Also, the else-branch will need to test $details against /^\*\*/ to tell apart lines with course details from those with constraints. You'll need a few variables to store some state when you detect a new record, so you can access it in the following iterations on the detail lines. The logic is simple enough that you need not and should not use gotos.

    Makeshifts last the longest.

Re: Parsing COD text help
by graff (Chancellor) on Jul 26, 2004 at 01:53 UTC
    To me, nesting a while (<FH>) loop inside another one never seems like the right approach. The condition that should cause you to exit the inner loop happens to be one that is supposed to be picked up in the outer loop, so that you can get into the inner loop again to handle the next record. That makes things too complicated.

    I'll suggest an alternative, but first I'd like to point out that the input data appears to consist of fixed-length records. Having looked at the cited page, there seem to be three basic types of data lines -- one has digits in columns 1-5, the other two don't; among the latter, there are a few that are "category" headings (e.g. "SYSTEMS ENGINEERING", "COMPUTER SCIENCE", etc), and the rest are "detail" records about a given course/section. (Actually the latter type probably breaks down into two or three sub-types, presenting different sorts of information.

    Fixed-width data can be handled either with regex matching (using things like / (.{5}) (.{4}) (.{4})/), or with unpack. The latter is really simpler (even though it seems more complicated when you look it up in the "perlfunc" man page). It would go something like this, in your case:

    my %courses; my $mnemonic; # this is the correct spelling :) while (<COD>) { # let's use $_, shall we? next unless ( /\S/ ); # skip blank lines; my @fields; my ($id, $rest) = unpack("xA5xA*", $_); # break line into 2 pieces if ( $id =~ /^\d{5}$/ ) { # it's the start of a record ($mnemonic,@fields = unpack("A4xA4xA4xA2xA2xA28A*", $rest); # work out what to do with @fields; $mnemonic will retain # it's current value till the next one is encountered, # so sub-records after this one will be added to the # correct hash element. } elsif ( $rest =~ /^\d+-\d+/ ) { # it's a sub-record my ($time,$days,$bldg,$room,$end) = unpack("A9xA6xA4xA4xA*", $re +st); # you need to work out what to do with $end, # and push stuff into the current $courses{$mnemonic} structure } else { # do something else with (or ignore) other stuff } }
    I hope that will get you started. Note that by using "unpack", the "DAYS" portion of the sub-records will always be taken as a string of six characters, some of which happen to be spaces ("M W F " vs. " T R " etc) -- you could get the same result with a suitable regex instead of unpack, but plain-old split will do it wrong. Personally, I think this is one situation where unpack is relatively easier to do than a regex; it's just a natural for fixed-length ASCII records.
Re: Parsing COD text help
by davidj (Priest) on Jul 26, 2004 at 02:11 UTC
    I would suggest you use the html source and use HTML::TokeParser to parse it out. The following might be a good place to start. (I saved the html source into the file "APMA.txt")

    #!/usr/bin/perl use HTML::TokeParser; use strict vars; my $file = 'APMA.txt'; my $stream = HTML::TokeParser->new($file); my $tok; while( $tok = $stream->get_token) { if( $tok->[1] eq 'a' && $tok->[2]{'href'} =~ m/course_nbr/ ) { print "Class: " . $stream->get_text('/a') . " "; } if( $tok->[1] eq 'span' && exists($tok->[2]{'class'}) && $tok->[2] +{'class'} eq "title") { print $stream->get_text('/span') . "\n"; } if( $tok->[1] eq 'span' && exists($tok->[2]{'title'}) && !exists($ +tok->[2]{'class'}) ) { print "Title = $tok->[2]{'title'}: "; print $stream->get_text('/span'), "\n"; } } exit;
    partial output:

    Class: APMA 109 Calculus I Title = Schedule Number: 92861 Title = Section Number: 0001 Title = Credit Hours: 04.0 Title = Time: 0900-0950 Title = Day:MTWRFS: MWF Title = Olsson Hall: OLS 011 Title = Instructor: Oberhauser, James P. Title = Enrollment:Authorized/Actual: 55/2 Title = Grading Method: O Title = Time: 0830-0920 Title = Day:MTWRFS: T Title = Olsson Hall: OLS 005 Title = Instructor: Title = Schedule Number: 90063 Title = Section Number: 0002 Title = Credit Hours: 04.0 Title = Time: 1000-1050 Title = Day:MTWRFS: MWF Title = Olsson Hall: OLS 120 Title = Instructor: Beck, Mary Title = Enrollment:Authorized/Actual: 55/4 Title = Grading Method: O Title = Time: 0830-0920 Title = Day:MTWRFS: R Title = Olsson Hall: OLS 120 Title = Instructor: ... Class: APMA 111 Single Variable Calculus Title = Schedule Number: 93433 Title = Section Number: 0001 Title = Credit Hours: 04.0 Title = Time: 1100-1150 Title = Day:MTWRFS: MWF Title = Olsson Hall: OLS 011 Title = Instructor: Title = Enrollment:Authorized/Actual: 55/55 Title = Grading Method: O Title = Time: 0830-0920 Title = Day:MTWRFS: T Title = Instructor: Castiglione, Jason ...
    Further parsing is easily done to get exactly what you want.
    Take a look at Perl & LWP for more information.

    Hope this helps,
    davidj
      Of course, you should never say HTML::TokeParser without the ::Simple. :)
      #!/usr/bin/perl use strict; use warnings; use HTML::TokeParser::Simple; my $file = 'APMA.txt'; my $stream = HTML::TokeParser::Simple->new( $file ); while( my $t = $stream->get_token ) { if( $t->is_start_tag( 'a' ) and $t->return_attr( 'href' ) =~ m/course_nbr/ ) { print "Class: ", $stream->get_text( '/a' ), " "; } elsif( $t->is_start_tag( 'span' ) ) { my $class = $t->return_attr( 'class' ); my $title = $t->return_attr( 'title' ); if( defined $title and not defined $class ) { print "Title = $title: ", $stream->get_text( '/span' ), "\ +n", } elsif( defined $class and $class eq 'title' ) { print $stream->get_text( '/span' ), "\n"; } } }

      Makeshifts last the longest.

Re: Parsing COD text help
by dimmesdale (Friar) on Jul 26, 2004 at 23:46 UTC
    My sincere thanks to all those who contributed. I feel, first, I need to explain for the sloppiness of the code above (with the GOTO and otherwise messy logic). I started out trying to parse a few lines with a few regexes, then as I started checking what was being assigned I quickly saw there were tons of special cases to be handlded: the number of sections varies, sometimes data isn't present, or sometimes it is marked with TBD, some data is marked "Reserved", etc. So my attempt put hack upon hack (the gotos started out as next, but that was when while was an infinite loop!)

    Well, enough excuses. I came up with something that seems to work pretty well. I guess I should explain why I am doing this. I want to make a program that will ask me what classes I want to take and then tell me all the possible schedule combinations (if any) I can have with those classes. The schedule combinations part I already finished in Java (which I did to teach myself the language, because C/C++, perl, scheme, and Q/PBASIC aren't good enough for UVa -- but that's another discusssion!).

    Anyways, the code, for those interested. I ended up using the TokeParse::Simple, which I refrained from at first, not having used it before and wanting to get something tested as quick as possible (a missaplication of laziness I suppose), but the lovely examples helped me through it ...

    #!/usr/bin/perl use strict; use warnings; use HTML::TokeParser::Simple; use Data::Dumper; # $courses{mneumonic}{sectID} = # [ "Section number", "Credit", "CurrEnroll", "MaxEnroll", # [start time], [end time], [days], [location], [instructor] ] # use constant SECT_NUMBER => 0; use constant CREDIT_HOURS => 1; use constant CURR_ENROLL => 2; use constant MAX_ENROLL => 3; use constant START_TIME => 4; use constant END_TIME => 5; use constant DAYS => 6; use constant LOCATION => 7; use constant INSTRUCTOR => 8; my $file = 'APMA.txt'; my $stream = HTML::TokeParser::Simple->new( $file ); my ($class,$title); my (%courses, $mneumonic, $sectID); # Flag to tell program if last $title was a match against /Day/ # If so, location follows next (no consistent marker otherwise) my $wasJustDays = 0; while( my $t = $stream->get_token ) { if( $t->is_start_tag( 'a' ) and $t->return_attr( 'href' ) =~ m/course_nbr/ ) { # And thus begins a new Course ... $mneumonic = $stream->get_text( '/a' ); } elsif( $t->is_start_tag( 'span' ) ) { $class = $t->return_attr( 'class' ); $title = $t->return_attr( 'title' ); if( defined $title and not defined $class ) { # These would be all the rest of the fields, # ... Schedule number, credit hours, etc. if ($title =~ /Schedule Number/) { $sectID = $stream->get_text( '/span' ); } elsif ($title =~ /Section Number/) { $courses{$mneumonic}{$sectID}->[SECT_NUMBER] = $stream->get_text( '/span' ); } elsif ($title =~ /Credit Hours/) { $courses{$mneumonic}{$sectID}->[CREDIT_HOURS] = $stream->get_text( '/span' ); } elsif ($title =~ /Time/) { $stream->get_text( '/span' ) =~ /(\d+)-(\d+)/; push @{$courses{$mneumonic}{$sectID}->[START_TIME]}, $1; push @{$courses{$mneumonic}{$sectID}->[END_TIME]}, $2; } elsif ($title =~ /Day/) { push @{$courses{$mneumonic}{$sectID}->[DAYS]}, $stream->get_text( '/span' ); $wasJustDays = 1; # See note at variable declaration } elsif ($title =~ /Instructor/) { push @{$courses{$mneumonic}{$sectID}->[INSTRUCTOR]}, $stream->get_text( '/span' ); } elsif ($title =~ m<Enrollment:Authorized/Actual>) { $stream->get_text( '/span' ) =~ m<(\d+)/(\d+)>; $courses{$mneumonic}{$sectID}->[MAX_ENROLL] = $1; $courses{$mneumonic}{$sectID}->[CURR_ENROLL] = $2; } else { if ($wasJustDays == 1) { push @{$courses{$mneumonic}{$sectID}->[LOCATION]}, $title . ": " . $stream->get_text( '/span' ); $wasJustDays = 0; # See note at variable declaration } } } elsif( defined $class and $class eq 'title' ) { # This is the name of the course; e.g., Linear Algebra # Ignore for the time being ... #print $stream->get_text( '/span' ), "\n"; } } }
    edit: removed readmore