Design hints for a file processor

PhilHibbs has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that reads a structured file, looks for various elements, and does a few search/replace operations, writes out a translated version of the file, and also records various bits of useful information and writes them out to separate files.

The file in question is an export from IBM DataStage, which is a graphical data processing tool. The file contains the "jobs" that we have created, most of which consist of reading and writing CSV files and hashed files.

The design for the script is rather haphazard at the moment, built up over time from a very simple single search-replace operation into something of a nine-headed hydra of a script.

I'm after some tips about how to go about rewriting this script. It needs to be able to identify various elements in the file based on context. For instance, a chunk might look like this:

BEGIN DSJOB
   Identifier "AP_CDBS_Vendor_Summary"
[download]

This is an easy one - any line starting with three spaces and the word "Identifier" is the Job Name.

Here's a more complex example:

   BEGIN DSRECORD
      Identifier "ROOT"
      DateModified "1899-12-30"
      TimeModified "00.00.01"
      OLEType "CJobDefn"
      Readonly "0"
      Name "AP_CDBS_Vendor_Summary"
      Description "Collates all of the data for the customer master mi
+gration."
      NextID "194"
      Container "V0"
      FullDescription "The first part of the routine gathers data from
+ the ABAP which extracts the necessary data from the SAP tables KNA1 
+and KNB1 (NB the keys of the link between KNA1 and KNB1 will form the
+ basis of all the ABAP queries)."
      JobVersion "50.0.0"
      ControlAfterSubr "0"
      Parameters "CParameters"
      BEGIN DSSUBRECORD
         Name "ROOT"
         Prompt "/home/migration/Dev root "
         Default "/home/migration/Dev"
         ParamType "0"
         ParamLength "0"
         ParamScale "0"
      END DSSUBRECORD
      BEGIN DSSUBRECORD
         Name "SITE"
         Prompt "Business Unit, ie WHUB"
         Default "CDBS"
         ParamType "0"
         ParamLength "0"
         ParamScale "0"
      END DSSUBRECORD
      BEGIN DSSUBRECORD
         Name "AOW"
         Prompt "Area of Work ie AP"
         Default "AP"
         ParamType "0"
         ParamLength "0"
         ParamScale "0"
      END DSSUBRECORD
      BEGIN DSSUBRECORD
         Name "DMR"
         Prompt "DMR/Spec ie Vendors"
         Default "Vendors"
         ParamType "0"
         ParamLength "0"
         ParamScale "0"
      END DSSUBRECORD
      MetaBag "CMetaProperty"
      BEGIN DSSUBRECORD
         Owner "APT"
         Name "AdvancedRuntimeOptions"
         Value "#DSProjectARTOptions#"
      END DSSUBRECORD
      NULLIndicatorPosition "0"
      IsTemplate "0"
      NLSLocale ",,,,"
      JobType "0"
      Category "1.FSS\\2.AP\\6.CDBS\\1.Vendors\\3.Reports"
      CenturyBreakYear "30"
...another 20 lines of stuff I don't need...
   END DSRECORD
[download]

From here, I need to pick up the Category "1.FSS\\2.AP\\6.CDBS\\1.Vendors\\3.Reports", and all of the parameter Name and Default values, "ROOT" = "/home/migration/Dev", "SITE" = "CDBS", etc.

There are some more complex cases, where I need to pick up one or two paths from a DSRECORD section, and then one or more pairs of filenames from DSSUBRECORD sections.

Any hints on how I should structure this program, so that it is clear and maintainable to someone who is new to Perl? Any modules that would make this easier?

By way of an apology, it's been a long time since I did any serious procedural or OO programming, 10 years ago I'd have been able to do this with my eyes closed.

Comment on Design hints for a file processor Select or Download Code

Replies are listed 'Best First'.
Re: Design hints for a file processor by moritz (Cardinal) on Jul 07, 2008 at 12:09 UTC
If you have no other occurences of `Category`, you can simply search for that: `#!/usr/bin/perl use strict; use warnings; for (<DATA>){ if (m/^\sCategory (.)$/){ print $1, $/; } }` [download] If you want to really parse the file, I'd recommend a simple recursive descending parser, see Parsing with Regexes and Beyond for an explanation. The tokens would be just the lines.	[reply] [d/l] [select]
Re^2: Design hints for a file processor by PhilHibbs (Hermit) on Jul 07, 2008 at 12:18 UTC
Yes, that's exactly what I do currently - the script is basically a whole load of special cases with no real structure to it. Well, this is my actual code for that: `$cat = $1 if /^\s+Category "(.+)"/;` I prefer this notation, I know some people don't.	[reply] [d/l]
Re^3: Design hints for a file processor by moritz (Cardinal) on Jul 07, 2008 at 12:32 UTC
If you want structure, use a real parser. Here is one, albeit a bit hacked up: Read more... (4 kB) It returns a sort of parse tree with an array ref for each block or line, where blocks look like `['BLOCK', $name_of_block, @lines_in_this_block]` and lines look like `['LINE', $key, $value]`. Depending on your exact data format and what you want to extract, hashes might be more suitable.	[reply] [d/l] [select]
Re^4: Design hints for a file processor by PhilHibbs (Hermit) on Jul 07, 2008 at 13:13 UTC
Re^5: Design hints for a file processor by moritz (Cardinal) on Jul 07, 2008 at 13:21 UTC
Some notes below your chosen depth have not been shown here
Re: Design hints for a file processor by Tanktalus (Canon) on Jul 07, 2008 at 14:54 UTC
Actually, this kinda looks like XML to me, so I'd be tempted to try to use XML::Twig. Ok, so some XML purists out there may cry, "but it has no tags!" At this point, they're nitpicking. It DOES have tags. Just not in the XML format. "Oh, but so you admit it isn't XML!" I never claimed it was XML, merely that it kinda looks like XML. What I'd be tempted to do is write an IO::Handle-derived object to convert the input file, line by line, into XML. And then use XML::Twig to handle the actual data. Especially as you say this is "up to half a gigabyte" - XML::Twig can handle that just fine. What I don't know is if XML::Twig needs to backtrack in a file... but if it doesn't, you just basically have to convert: `s/BEGIN \s (\w+)/<$1>/x or s/END \s (\w+)/</$1>/x or s[(\w+) \s ("\w+")][<$1 value=$2 />]x;` [download] and then pass the line to XML::Twig. (Ok, you may need to exclude the quotes on the last one, and then use proper XML escaping in case there are funny characters there, but the idea is here.) By putting your actual code in the proper end-tag handlers for XML::Twig, and flushing as appropriate, you should use very little memory while not having to do much heavy lifting yourself. Why write your own parser, when there already is a perfectly good parser already out there? :-) PS: I'd also want to ask IBM when they'll start supporting XML output here ;-)	[reply] [d/l]
Re^2: Design hints for a file processor by PhilHibbs (Hermit) on Jul 09, 2008 at 12:43 UTC
PS: I'd also want to ask IBM when they'll start supporting XML output here ;-) They kind of do - you can export the whole project as XML, or a whole category of jobs from within their GUI client, but you can't export individual jobs as XML, only as the legacy DSX format. The latest version might have improved on this.	[reply]
Re: Design hints for a file processor by pc88mxer (Vicar) on Jul 07, 2008 at 12:40 UTC
Since your file appears to be line-oriented and each line has the same simple syntax, you can write a simple parser that converts this output into a perl data structure: `my $root = {}; my $node = $root; my @stack; while (<DATA>) { chomp; if (/^\sBEGIN\s+(.)/) { my $child = {}; $child->{_type} = $1; push(@{$node->{_children}}, $child); push(@stack, $node); $node = $child; } elsif (/^\sEND\s+(.)/) { # check: $node->{_type} eq $1 $node = pop(@stack); } elsif (/^\s(\S+)\s+(.)/) { # remove quotes from $2 here? $node->{$1} = $2; } else { die "unexpected input: $_" } } use Data::Dumper; print Dumper($root);` [download] And on your sample input this produces: Read more... (7 kB)	[reply] [d/l] [select]
Re: Design hints for a file processor by GrandFather (Saint) on Jul 07, 2008 at 22:55 UTC
I'd combine some light weight OO, recursive descent parsing and method dispatch. Consider (most of OP's data represented by ellipsis): use strict; use warnings; my $data = <<DATA; BEGIN DSRECORD Identifier "ROOT" ... CenturyBreakYear "30" END DSRECORD DATA open my $fh, '<', \$data or die "Unable to open data file: $!"; my $obj = main->new (undef, fh => $fh); 1 while defined $obj->nextLine (); close $fh; sub new { my ($class, $context, %params) = @_; $class = ref $class \|\| $class; my $self = bless {%params}, $class; if (defined $context && ref $context) { $self->{$_} = $context->{$_} for keys %$context; } $self->{context} = $context; return $self; } sub nextLine { my $self = shift; my $fh = $self->{fh}; while (defined (my $line = <$fh>)) { chomp $line; return $line unless $line =~ /^\sbegin\s+(\w+)/i; if ($self->{skipping}) { $self->skip (); next; } my $recType = lc $1; my $handler = $self->can ("rec_$recType"); my $nested = $self->new ($self); $handler ? $nested->$handler () : $self->skip ($recType); next; } return undef; } sub skip { my ($self, $recType) = @_; warn ">>> Can't handle $recType records (line $.)\n" if defined $recType; ++$self->{skipping}; 1 while defined $self->nextLine (); --$self->{skipping}; } sub rec_dsrecord { my ($self) = @_; my $fh = $self->{fh}; my @wantedFields = qw/identifier name description/; my $matchStr = join '\|', @wantedFields; my $fieldsMatch = qr/$matchStr/i; while (defined (my $line = $self->nextLine ())) { next unless $line =~ /($fieldsMatch)\s+(.)/; $self->{lc $1} = $2; } my @missingFields = grep {! exists $self->{$_}} @wantedFields; my @gotFields = grep {exists $self->{$_}} @wantedFields; warn "Missing @missingFields fields in DSRECORD ending line $." if @missingFields; print join (', ', map {"$_: $self->{$_}"} @gotFields), "\n"; } [download] Prints: `>>> Can't handle dssubrecord records (line 15) identifier: "ROOT", name: "AP_CDBS_Vendor_Summary", description: "The +first part of the routine gathers data from the ABAP which extracts t +he necessary data from the SAP tables KNA1 and KNB1 (NB the keys of t +he link between KNA1 and KNB1 will form the basis of all the ABAP que +ries)."` [download] Handlers for new record types are easily added as a `sub rec_newrectype` sub. A context dumper can be easily implemented using the "stack" formed by the context links in the objects (see `sub new` implementation). The code can be refactored from a light weight main based object into a proper base class with an object factory to create derived classes for each interesting record type. Common output code could be added to the base class. Perl is environmentally friendly - it saves trees	[reply] [d/l] [select]


good chemistry is complicated, and a little bit messy -LW
	PerlMonks