http://www.perlmonks.org?node_id=244329

spurperl has asked for the wisdom of the Perl Monks concerning the following question:

Friend monks,

We have a certain internal language (a programming language, or more accurately a Hardware Description Language, like Verilog). There are full-fledged parsers for it (C + Lex + Yacc), but I have to do some simple preprocessing and wonder what is the easiest way to go...

Suppose I have some keyword, say "env", which is followed by a body within braces:
env { ... ... }

I want to rip the contents out (from the braces) into some string. Needless to say, there may be other "bodies" delimited by braces, nested inside to an arbitrary level, so regexes aren't much help (even a regex that assumes just one level of nesting already looks very scary).

Thus my question: what is the easiest way to handle it? Use some parser module from CPAN? I don't need, and don't want, to define the full grammar of this language; I just want what's inside the braces.

TIA

Re: Parsing... possible w/o too much stress ?
by Corion (Patriarch) on Mar 19, 2003 at 13:40 UTC

    If the file is simple enough, Text::Balanced could work, as could a well-tuned set of REs (but then again, Text::Balanced is easier to use). If you have strings that contain unbalanced parentheses, then problems could ensue (or rather, you have to do more work).
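
A minimal sketch of the Text::Balanced route, assuming the input looks like the `env { ... }` snippet in the question. The helper name `env_body` is made up for illustration, and braces inside quoted strings are not treated specially here, which is exactly the "more work" case mentioned above.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Return the text between the braces of the first "env { ... }" block,
# or undef if no balanced block is found. (Hypothetical helper name.)
sub env_body {
    my ($text) = @_;

    # A scalar-context //g match leaves pos($text) just after "env" and any
    # whitespace, which is where extract_bracketed starts looking.
    return undef unless $text =~ /\benv\s*(?=\{)/g;

    # In list context the first element is the balanced "{ ... }" substring.
    my ($block) = extract_bracketed($text, '{}');
    return undef unless defined $block;

    return substr($block, 1, -1);    # drop the outer braces
}

print env_body("env { foo { bar {} } } other { x }"), "\n";
```

The trailing `other { x }` block is left alone because extraction stops at the brace that balances the first one.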

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
      If the format is more complex than what Text::Balanced can handle, another option is the Parse::RecDescent module (also by Damian Conway). It is widely regarded as *the* way to perform complex parsing and pattern extraction. However, as with all powerful tools, there is a learning curve, particularly if you are not used to writing formal grammars.

      Some useful sources of information for reference when learning to use Parse::RecDescent include Parse::RecDescent::FAQ and this article from The Perl Journal (Google cache).

      Using this module, I was recently able to translate complex proprietary-format files supplied by an external source into an internal format for print processing with only a few hours' work, by writing a formal grammar for Parse::RecDescent. Previously this kind of job took days, and in some cases weeks, while dedicated parsing code was written, tested and deployed.

       

      perl -le 'print+unpack("N",pack("B32","00000000000000000000001000111110"))'

Re: Parsing... possible w/o too much stress ?
by broquaint (Abbot) on Mar 19, 2003 at 13:43 UTC
    Thus my question: what is the easiest way to handle it?
    If it's really a simple matter of matching braces then you could indeed use a regex
        use Regexp::Common;

        my $str = <<CODE;
        env { foo { bar {} } }
        CODE

        print $str =~ /($RE{balanced}{-parens => '{}'})/;

        __output__
        { foo { bar {} } }
    Check out Re: Graph File Parsing for some regex style parsing, or if you'd prefer to stay away from regexes there's always Text::Balanced.
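
For comparison, here is a sketch of the same idea without Regexp::Common, using a recursive pattern. This is a swapped-in technique, not what the post above uses: it assumes Perl 5.10 or later, and the names `$env_re` and `env_block` are made up for illustration.

```perl
use strict;
use warnings;
use 5.010;    # recursive patterns need Perl 5.10 or later

# Group 1 matches one balanced "{ ... }" block: an opening brace, then any
# mix of non-brace runs and nested blocks (via the (?1) recursion), then the
# closing brace. Use this qr// on its own: embedding it in a larger pattern
# would change which group (?1) refers to.
my $env_re = qr/\benv\s*(\{(?:[^{}]++|(?1))*\})/;

# Return the braced block following the first "env", braces included,
# or undef if there is no balanced block.
sub env_block {
    my ($text) = @_;
    return $text =~ $env_re ? $1 : undef;
}

print env_block("env { foo { bar {} } } tail { x }"), "\n";
```

Like the Regexp::Common version, this returns the block with its outer braces and knows nothing about braces inside string literals.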
    HTH

    _________
    broquaint

Re: Parsing... possible w/o too much stress ?
by hsmyers (Canon) on Mar 19, 2003 at 15:47 UTC
    If there is the slightest chance that your braces or whatever will be nested, then avoid regexen for that part of the problem. I've used all of the solutions mentioned so far and while they will do the job with varying degrees of ease, the one I've found easiest is Text::DelimMatch. It does the obvious and will handle nesting as well.

    --hsm

    "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
Re: Parsing... possible w/o too much stress ?
by BrowserUk (Patriarch) on Mar 19, 2003 at 16:39 UTC

    If the snippet you show indicates that the bit you wish to remove is the outermost level, and it is not embedded within, or alongside other structures, this is one occasion when the greediness of dot-star comes into its own.

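        #!perl -slw
        use strict;

        # Capture in list context rather than "my $body = $1 if ...",
        # whose behaviour is not well defined.
        my ($body) = join('', <DATA>) =~ m[env\s*\{(.*)\}$]s;
        print $body;

        __DATA__
        env { "{"; F { "'{\"\{" } g { '}'; } }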

    Output

        C:\test>244329
        "{"; F { "'{\"\{" } g { '}'; }
        C:\test>

    Of course, I can imagine any number of scenarios in which this would not work, but in the absence of further info, attempting to compensate for them would simply be guesswork. If you have a better description of the application, I would relish the opportunity to practice my regex skills on real data.

    As effective as Parse::RecDescent is for complex grammars, it seems overkill for this application as described. What's the point in having the much vaunted Perl 5 regex engine, if no one is going to learn to use it?

    The regex notation is a mini-language of its own. Like any language, it takes time to learn. Like any language it takes practice to master.


    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.

      This will work as long as the brace that closes the env block is the last one. To wit:

      env { anything you want here } not-env { other stuff }

      is not going to work.
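
      A small demonstration of that failure, using essentially the pattern from the parent post (braces escaped) on the input above. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $input = "env { anything you want here } not-env { other stuff }";

# Greedy .* runs to the last closing brace in the input, so the capture
# spills over into the block that follows the env block.
my ($captured) = $input =~ m/env\s*\{(.*)\}$/s;

print "captured: [$captured]\n";
# prints: captured: [ anything you want here } not-env { other stuff ]
```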

      Now, it's possible that his input might look as simple as your example...but then he (hopefully) would have figured out a solution for that case already; something brain-dead like cat input | awk '{if (first == 1) {print l; l = $0;} else {first=1} }' for instance.

      A poster above said it already -- if it nests, don't use regexes. Even if you do get it to work, you'll wish you hadn't.

      ---
      "I hate it when I think myself into a corner."
      Matt Mitchell
Re: Parsing... possible w/o too much stress ?
by Rudif (Hermit) on Mar 19, 2003 at 22:34 UTC
    Here is a quick hack that does the job, on a data set that I constructed based on my understanding of your problem statement.
        #!/usr/bin/perl
        use strict;
        use warnings;

        my $verbose = 0;
        my @lines = <DATA>;
        my @items = split /(env\n\s*\{|\{|\})/, join '', @lines;
        for (@items) {
            print "###$_###" if $verbose;
        }

        my $depth = 0;
        my $inenv = 0;
        my @envblocks;
        my @envblock;
        for (@items) {
            print ">>>$inenv.$depth---$_\n" if $verbose;
            ++$inenv if ($depth == 0 && /env\n\s*\{/);
            ++$depth if (/\{/);
            if ($inenv){
                push @envblock, $_;
                if ($depth == 1 && /\}/) {
                    push @envblocks, join '', @envblock;
                    @envblock = ();
                    --$inenv;
                }
            }
            --$depth if (/\}/);
            print "<<<$inenv.$depth---$_\n" if $verbose;
        }
        for (@envblocks) {
            print "===\n$_\n===\n";
        }

        __DATA__
        somestuff
        morestuff
        env
        {
            111
            env
            {
                333
            }
        }
        and more stuff
        and things
        zut
        {
            222
            env
            {
                444
            }
        }
        env
        {
            777
            env
            {
                555
            }
        }
        finally
        a few more things

    and the output - does it look like what you expect?

        ===
        env
        {
            111
            env
            {
                333
            }
        }
        ===
        ===
        env
        {
            777
            env
            {
                555
            }
        }
        ===
    I won't try to explain it. perlfunc and perlre do a better job of that.
    HTH.
    Rudif
Re: Parsing... possible w/o too much stress ?
by Anonymous Monk on Mar 19, 2003 at 16:10 UTC
    Couldn't you simply keep count of left and right braces to know when you are out of env{}? Save the in-between text as you go, and do what you need with it once you are out.
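
A minimal sketch of that brace-counting idea, assuming the whole input fits in one string. The sub name `env_bodies` is made up for illustration, and braces inside quoted strings or comments are counted like any others.

```perl
use strict;
use warnings;

# Collect the body of each "env { ... }" block by counting braces.
# An env nested inside another env is skipped (it is part of the outer
# body); braces in strings or comments are NOT treated specially.
sub env_bodies {
    my ($text) = @_;
    my @bodies;

    while ($text =~ /\benv\s*\{/g) {
        my $start = pos($text);    # first character after the opening brace
        my $depth = 1;

        # Walk brace by brace until the opening brace has been closed.
        while ($depth > 0 && $text =~ /([{}])/g) {
            $depth += $1 eq '{' ? 1 : -1;
        }
        last if $depth > 0;        # ran out of input: unbalanced block

        # pos() now sits just after the matching closing brace.
        push @bodies, substr($text, $start, pos($text) - $start - 1);
    }
    return @bodies;
}

print "[$_]\n" for env_bodies("a env { 1 { 2 } } b env {3}");
```

Scanning resumes after each closing brace, so the saved "in-between stuff" is available as soon as the counter returns to zero.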
Re: Parsing... possible w/o too much stress ?
by I0 (Priest) on Mar 20, 2003 at 15:11 UTC
        $_=qq{ env { {}{} {{}} } };
        (my $re=$_)=~s/((\{)|(\})|.)/${[')','']}[!$3]\Q$1\E${['(','']}[!$2]/gs;
        $re = join'|',map{quotemeta}eval{/$re/};
        die $@ if $@;
        print +(/env\s*\{($re)\}/)[0];