PerlMonks  

Re: advice with Parse::RecDescent

by TheDamian (Vicar)
on Dec 11, 2001 at 01:56 UTC


in reply to advice with Parse::RecDescent

There has been plenty of good advice already, but I suppose I should offer mine anyway. ;-)

RecDescent is overkill for this project, unless you expect it to grow in complexity (i.e. not just in the number of tags you're handling, but in the structural complexity of the data).

A good indicator that a grammar is overkill is when it:

  • doesn't have many levels of rules
  • doesn't have many rules with two or more productions
  • doesn't construct a complex, multi-level data structure as it parses
  • does the vast majority of its work with rules that consist of a single regex

Moreover, when the data is line-based (i.e. each low-level rule in the grammar parses exactly one line), RecDescent is probably not needed.

Your grammar seems to meet most of those criteria.

On the other hand, the parsing task you have is very well suited for learning RecDescent.

If I were implementing a parser for this in real life, rather than as a teaching exercise, I would probably bundle the regexes for each line type into a hash, and then iterate over the lines, testing each one against the various alternatives. Like so:

```perl
use strict;
use warnings;
use Data::Dumper 'Dumper';

# Building blocks shared by the line patterns
my $name = qr/(?:\w+)/;
my $data = qr/(?:\w+)/;
my $num  = qr/(?:\d+)/;

# One regex per line type
my %line_is = (
    header        => qr/HDR($name) ($data)/,
    trailer       => qr/TLR($num)/,
    additive      => qr/(ADDRANGE|ADD|DELETERANGE|DELETE),/,
    additive_data => qr/($num),($num?),($name)/,
);

# Anchor every pattern at the current match position
$_ = qr/\G(?:$_)/ foreach values %line_is;

my %data;
while (<DATA>) {
    if (/$line_is{header}/gc) {
        $data{header} = { company => $1, code => $2 };
    }
    elsif (/$line_is{trailer}/gc) {
        $data{trailer} = { count => $1 };
    }
    elsif (/$line_is{additive}/gc) {
        my $cmd = $1;
        # Skip the record if the rest of the line is malformed,
        # so stale captures never end up in the result
        unless (/$line_is{additive_data}/) {
            warn "Bad $cmd: ", substr($_, pos($_));
            next;
        }
        push @{$data{record}}, [ $cmd, $1, $2 || undef, $3 ];
    }
    else {
        # Nothing matched, so the match position is still the start of the line
        warn "Unparsable data: ", $_;
    }
}

print Dumper [ \%data ];

__DATA__
HDRCOMPNAME BIG000OLD111IDENTIFIER1020301WITH1010LOTS1010OF1010CRAP
ADD,1234567890,,COMPNAME
ADDRANGE,2468,4680,COMPNAME
DELETE,987654321,,COMPNAME
DELETERANGE,13579,13599,COMPNAME
TLR000004
```

The result is quite readable and maintainable. And fast. Provided, of course, the data remains line-oriented.
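The code above leans on two regex features: the \G anchor, which pins a match to the current pos() of the string, and the /c modifier, which stops a failed /g match from resetting that position. Here is a minimal sketch of how the pair lets you try several alternatives at the same spot. The tokens() subroutine and its three token types are made up for this illustration and are not part of the parser above:

```perl
use strict;
use warnings;

# Hypothetical helper: break one (already chomped) line into typed tokens
# by trying each pattern in turn at the current match position.
sub tokens {
    my ($line) = @_;
    my @tokens;
    for ($line) {    # alias $_ to $line so the matches and pos() apply to it
        while (1) {
            # A failed /gc match leaves pos() alone, so the next
            # alternative is attempted at exactly the same offset.
            if    (/\G(\d+)/gc)         { push @tokens, [ num   => $1 ] }
            elsif (/\G([A-Za-z]\w*)/gc) { push @tokens, [ name  => $1 ] }
            elsif (/\G,/gc)             { push @tokens, [ comma => ',' ] }
            elsif (/\G\z/gc)            { last }    # reached end of input
            else {
                die "Unparsable at offset ", pos($_) || 0, "\n";
            }
        }
    }
    return @tokens;
}
```

Calling tokens("ADD,123,,X") should return six tokens: a name, a comma, a number, two more commas and a final name. The string itself is never modified; only its match position moves.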

Finally, I do have big plans to rewrite RecDescent to make it much faster (though probably still Pure Perl). The original module was only supposed to be a quick-hack proof-of-concept for self-modifying parsers. It predates the /gc flag; hence the clunky (and slow!) parsing-by-substitution-of-copies idiom.

But it somehow escaped the lab and has subsequently infested a huge number of organizations, which now rely on it.

There's probably a lesson in that. ;-)
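For anyone who hasn't met the two idioms contrasted above, here is a rough side-by-side sketch. It is not RecDescent's actual code, just an illustration of the general idea, and both subroutine names are invented for the example:

```perl
use strict;
use warnings;

# Older idiom: consume input by deleting the matched prefix from a copy.
# Every attempt copies the remaining text, which gets expensive on long input.
sub take_by_substitution {
    my ($text, $pattern) = @_;
    my $copy = $text;                           # backtracking just discards the copy
    return unless $copy =~ s/\A($pattern)//;    # nothing returned on failure
    return ($1, $copy);                         # (matched token, remaining text)
}

# /gc idiom: leave the string untouched and advance pos() instead.
# Takes a reference so pos() is tracked on the caller's own string.
sub take_by_position {
    my ($text_ref, $pattern) = @_;
    return unless $$text_ref =~ /\G($pattern)/gc;    # failure keeps pos() as it was
    return $1;
}
```

With the first version the caller has to carry the shrinking remainder around, e.g. my ($tok, $rest) = take_by_substitution("ADD,123", qr/[A-Z]+/). With the second, repeated calls on the same \$string simply walk forward through it, and a failed attempt costs no copying at all.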
