Re: How would you parse this? (rec-descent)

by tye (Sage)
on Oct 26, 2013 at 08:44 UTC


in reply to How would you parse this?

After reading the replies, I think that I may have a feel for what you are hoping to do. My initial response to that desire was pretty much "Use any of the standard techniques that they taught in 'Compiler Writing' class". So the rest of what I'm writing is based on the supposition that you have never taken such a class. If you have, then I'm pretty sure that I've seriously misunderstood what you are asking for and you should probably just ignore this reply. Though, I end up advising against most of those standard techniques below.

My guess is that the approach that would most easily suit you is writing a recursive-descent parser.

I wouldn't suggest that you use a framework (such as Parse::RecDescent or Marpa). There are technical reasons why such frameworks can have disadvantages (reasons that I find many people gloss over when trying to discourage the re-inventing of wheels by those who just want to move their vehicle and are neither focused on, nor perhaps well suited to, actually designing excellent wheels). Some of the reasons have to do with personality traits that I suspect you share with me.

But the biggest reason is that it just really isn't terribly hard to write a decent rec-descent parser from scratch, especially in Perl. I'm pretty sure that your ample general programming skill is well above that required to succeed at this. There is a chance that there will be certain aspects of the task that will seem foreign, but I don't expect many or significant difficulties greater than that for you.

I'm not going to try to teach you (that would surely end badly) or even other readers how to write such a parser. I don't have any handy links to particularly good resources or tutorials that cover this topic (and I suspect such might not be a great route forward for this case anyway). I'm not going to try to write such a parser in this node.

But I have recently been looking at JSON::Tiny because it actually comes pretty close to how I would have implemented a JSON parser. Now, JSON is likely trivial compared to what you are hoping to parse. But JSON::Tiny is a decent example of how you build a rec-descent parser. So go browse http://cpansearch.perl.org/src/DAVIDO/JSON-Tiny-0.35/lib/JSON/Tiny.pm (the parts that have to do with parsing, of course).

You'll see the basic idea of having a function whose job is to consume the text that represents some structure of your grammar. And that function calls other functions that consume sub-structures. JSON::Tiny's _decode_value(), _decode_array(), _decode_object(), _decode_string(), and decode() are just such subs. And they are simple, working code that parses something easy to understand, so I hope they make it easy to "get" the concept of rec-descent parsing.

So you'll write _parse_program() which will need to call things like _parse_include() (my guess at what the first few lines in your example represent) and _parse_statement(). For most languages, the structure of each such routine ends up closely matching a BNF definition in a grammar describing your language. So you may end up writing something like BNF as you work out the definition of this language and/or the implementation of your parser.
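
To make that shape concrete, here's a toy sketch against a grammar I just invented (program ::= include* statement*). None of it is your actual language, and it is no better tested than anything else in this node, but it shows each _parse_*() consuming its construct from pos($_) and delegating to sub-constructs:

#!/usr/bin/perl
use strict;
use warnings;

# program   ::= include* statement*
# include   ::= "include" name ";"
# statement ::= name "=" number ";"

sub _parse_program {
    my %prog = ( includes => [], statements => [] );
    push @{ $prog{includes} }, _parse_include()
        while /\G\s*(?=include\b)/gc;
    push @{ $prog{statements} }, _parse_statement()
        while /\G\s*(?=\w)/gc;
    /\G\s*\z/gc  or die "Garbage at offset ", pos(), "\n";
    return \%prog;
}

sub _parse_include {
    /\G\s*include\s+(\w+)\s*;/gc
        or die "Bad include at offset ", pos(), "\n";
    return $1;
}

sub _parse_statement {
    /\G\s*(\w+)\s*=\s*(\d+)\s*;/gc
        or die "Bad statement at offset ", pos(), "\n";
    return [ $1, $2 ];
}

$_ = "include foo; include bar; x = 42;";
my $ast = _parse_program();
print "x = $ast->{statements}[0][1]\n";    # x = 42

Notice how each routine reads like its BNF line; that is the point.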

The way that writing working code can also serve to create/refine the definition/design of your grammar is part of why I think this approach may work well for you.

When I try to reverse-engineer the key markers that indicate that those first few lines are supposed to be the equivalent of "#include" or use lines/statements, I think I'm close to guessing wildly. In C, you know such a thing because the lexer gave the following tokens in exactly this order: start-of-line, '#', 'include'. In Perl, you know such because you were expecting the start of a statement when you found the token "use". These are typical of how such items of syntax are defined and parsed in an LALR(1) language or similar. "LALR(1)" just means that you can tell what type of broad syntaxy construct you have started parsing by strictly moving forward in the text and only looking at the next 2 tokens (Looking Ahead, Left-to-Right, 1 token past the next). You never have to "back up" more than one token. That's why a C parser needs to know which words represent data types vs. other things and why Perl has prefixing keywords like 'sub'.

So you may not be dealing with a language that fits in the LALR(1) world. This is another reason why an existing tutorial or existing framework isn't what I'm recommending. I won't discourage you from "straying" beyond LALR(1). In writing the parser, you'll see where the language is "too ambiguous" and adjust it.

One of the challenges in writing a parser (including rec-descent) is how to implement "consume the text". The classic solution is to write a lexer that returns tokens and to give yourself a way to look at the next 2 tokens (without consuming them) or a way to "unconsume" at least one token that you consumed. But you don't have to do it that way (and I don't think you should start with that for this case).
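
For reference, the classic shape might look something like this sketch (my invented helper names, lexing from $_ via pos()):

my @Pending;    # tokens already lexed but not yet consumed

sub _lex {
    /\G\s+/gc;                          # skip whitespace
    /\G(\w+|.)/gcs  or return undef;    # a word or a single char; undef at EOF
    return $1;
}

sub peek_token {
    my( $n ) = @_;                      # 0 => next token, 1 => the one after
    push @Pending, _lex()  while @Pending <= $n;
    return $Pending[$n];
}

sub next_token {
    return @Pending ? shift @Pending : _lex();
}

"Unconsuming" a token is then just unshift @Pending, $token;.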

The easiest solution is what JSON::Tiny does: read the whole document into a buffer and run regexes over it, always using /\G.../gc in a scalar context (so pos() tracks your progress in "consuming"). Fairly early in the development, you may want to encapsulate the applying of /.../gc to the buffer if you ever expect to parse documents that don't easily fit in RAM.
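
If the /\G.../gc idiom is new to you: a successful match advances pos() on the buffer, and a failed match (thanks to /c) leaves pos() where it was, so pos() serves as your "consumed so far" cursor. A tiny demonstration:

my $buf = "foo bar";
print "got $1, at ", pos($buf), "\n"    if $buf =~ /\G(\w+)/gc;   # got foo, at 3
print "no match, at ", pos($buf), "\n"  if $buf !~ /\G\d+/gc;     # no match, at 3
$buf =~ /\G\s+/gc;                      # consume the space
print "got $1, at ", pos($buf), "\n"    if $buf =~ /\G(\w+)/gc;   # got bar, at 7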

I also recommended completely avoiding capturing parens, believing that they kill the performance of this approach (as can be seen by using Parse::RecDescent). Never mind! That advice turned out to be wrong; see the replies below.

At long last, I'll give you some custom concrete code to demonstrate one place JSON::Tiny gets this wrong (IMHO) and how I'd do it better. JSON::Tiny includes:

sub _decode_string {
    my $pos = pos;

    # Extract string with escaped characters
    m!\G((?:(?:[^\x00-\x1f\\"]|\\(?:["\\/bfnrt]|u[0-9a-fA-F]{4})){0,32766})*)!gc;
        # segfault under 5.8.x in t/20-mojo-json.t #83
    my $str = $1;

Disadvantages of that code:

  1. Capturing parens force the entire document to be copied (slow)
  2. It gets in the way when trying to switch to not reading the whole document into RAM
  3. It packs too much complexity into the one regex (and that even led to a segfault in some Perl versions)
  4. It either ends up complaining about the " at the start of the string when the real problem is a specific malformed escape much farther along, or it encourages duplication of code.

A big improvement, IMHO, is more like:

sub _decode_string {
    my $pos = pos();
    m{\G(?:[^\\"]+|\\.)*}gc;
    my $str = substr( $_, $pos, pos()-$pos );
    Fail( "Unclosed string", $pos )  if ! m{"}gc;

Then go through parsing, validating, replacing, and complaining about escapes after that.
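
That pass might look like this sketch (my code, untested, reusing the hypothetical Fail() from above):

sub _expand_escapes {
    my( $raw, $pos ) = @_;   # $pos = doc offset where the string began
    my %simple = ( '"' => '"', '\\' => '\\', '/' => '/',
                   b => "\b", f => "\f", n => "\n", r => "\r", t => "\t" );
    my $str = '';
    while( ( pos($raw) // 0 ) < length $raw ) {
        if( $raw =~ /\G([^\\]+)/gc ) {
            $str .= $1;                         # a run with no escapes
        } elsif( $raw =~ /\G\\(u[0-9a-fA-F]{4}|.)/gcs ) {
            my $esc = $1;
            if( 1 < length $esc ) {             # \uXXXX
                $str .= chr hex substr( $esc, 1 );
            } elsif( exists $simple{$esc} ) {
                $str .= $simple{$esc};
            } else {
                Fail( "Bad escape '\\$esc'",
                    $pos + pos($raw) - length($esc) - 1 );
            }
        } else {
            Fail( "Trailing backslash", $pos + pos($raw) );
        }
    }
    return $str;
}

Note that the error now points at the offending escape, not at the opening ".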

If you decide to deal with "won't fit in memory" documents, then parsing string literals might look more like:

sub parse_value {
    if( peek( qr{['"]} ) ) {
        parse_string();
    ...

sub parse_string {
    my $start = get_pos();  # now a seek position, not a string offset
    my $str = '';
    if( eat("'") ) {
        # Inside '...', '' is a literal ', all else is just literal:
        while( ! peek("'(?!')",1) ) {
            #                   ^ keep leading whitespace
            if( eat("''",1) ) {
                $str .= "'";
            } else {
                swallow("[^']+",1,\$str,1);  # fatal if no match
                # Can match to $Buf end  ^
            }
        }
        swallow("'",1);
    } else {
        swallow('"');
        while( peek('[^"]',1) ) {
            while( peek('\\',1) ) {
                $str .= parse_escape();
            }
            eat( qr{[^\\"]*}, 1, \$str, 1 );
        }
        swallow('"',1,'',0,'Unclosed string',$start);
    }
    return $str;
}

Here's a stab at most of the "consume the document stream" functions used above (nothing tested, in case that wasn't as obvious as I think it would be):

my $FH;
my $Buf;

# Consume matching text, if found next:
sub eat {
    my(
        $re,        # Regex or string to match against next part of doc.
        $keep_ws,   # Don't skip over whitespace first?
        $sv,        # Append matched text to referenced scalar (if any).
        $to_end,    # Allow match to hit end of $Buf?
        $peek,      # Keep pos() unchanged?
    ) = @_;
    $Buf =~ /\G\s+/gc   if ! $keep_ws;
    my $start = pos( $Buf );
    $re = _compile( $re );  # Prepends \G also
    do {
        return 0 if $Buf !~ /$re/gc;
    } while( ! $to_end && _hit_end(\$start) );
    $$sv .= substr( $Buf, $start, pos($Buf)-$start ) if $sv;
    pos($Buf) = $start if $peek;
    return 1;
}

# Tell me if next text matches (without consuming, except maybe whitespace):
sub peek {
    my(
        $re,        # Regex or string to match against next part of doc.
        $keep_ws,   # Don't skip over whitespace first.
        $sv,        # Append matched text to referenced scalar (if any).
        $to_end,    # Allow match to hit end of $Buf.
    ) = @_;
    return eat( $re, $keep_ws, $sv, $to_end, 1 );
}

# Consume matching text!  If not found next, then die:
sub swallow {
    my(
        $re,        # Regex or string to match against next part of doc.
        $keep_ws,   # Don't skip over whitespace first.
        $sv,        # Append matched text to referenced scalar (if any).
        $to_end,    # Allow match to hit end of $Buf.
        $err,       # Error name if no match.
        @args,      # Extra details for above error.
    ) = @_;
    return 1 if eat( $re, $keep_ws, $sv, $to_end );
    throw( $err ||= 'Internal bug in parser', @args );
}

sub _compile {
    my( $re ) = @_;
    our %Compiled;
    return $Compiled{$re} ||= qr/\G$re/;
}

# Keep sliding window of document in $Buf:
sub _hit_end {
    my( $sv_start ) = @_;
    my $pos = pos( $Buf );
    return 0 if $pos < length($Buf)-1024;
    # Darn!  Match got too close to end of $Buf.
    if( 2048 < $$sv_start ) {
        # Copy to left as little of $Buf as reasonable:
        my $skip = $$sv_start - 1024;
        $$sv_start -= $skip;
        # This way may just keep consuming memory:
        # substr( $Buf, 0, $skip, '' );
        $Buf = substr( $Buf, $skip );
    }
    sysread( $FH, $Buf, 1024*1024, length($Buf) );
    # Tell caller to re-do match; have it compare at same part of $Buf again:
    pos($Buf) = $$sv_start;
    return 1;
}

I sincerely hope some of that was helpful.

- tye        

Replies are listed 'Best First'.
Re^2: How would you parse this? (bug)
by tye (Sage) on Oct 26, 2013 at 16:18 UTC

    I noticed a bug in:

    # Consume matching text, if found next:
    sub eat {
        my(
            $re,        # Regex or string to match against next part of doc.
            $keep_ws,   # Don't skip over whitespace first?
            $sv,        # Append matched text to referenced scalar (if any).
            $to_end,    # Allow match to hit end of $Buf?
            $peek,      # Keep pos() unchanged?
        ) = @_;
        $Buf =~ /\G\s+/gc   if ! $keep_ws;
        my $start = pos( $Buf );
        $re = _compile( $re );  # Prepends \G also
        do {
            return 0 if $Buf !~ /$re/gc;
        } while( ! $to_end && _hit_end(\$start) );
        $$sv .= substr( $Buf, $start, pos($Buf)-$start ) if $sv;
        pos($Buf) = $start if $peek;
        return 1;
    }

    In trying to minimize how often $Buf gets extended, and thinking mostly about the cases I ran into in my examples, I neglected the more obvious case of looking for the next token when you are right up against the end of $Buf.

    There is no perfect solution to this problem. My preferred solution is to just avoid matching potentially long strings as tokens. Then I can pick an arbitrary threshold like "1024 more chars must be present in the buffer" that is tiny compared to how much I read into the buffer each time (to reduce the percentage of document text that I end up copying as I slide the window along) and yet way too long to worry about a single token not fitting into it.

    You can see that approach in parse_string().

    The consequence of such an approach is that, for example, when parsing XML you couldn't match a whole XML tag as a single token. Even something as simple as "</b>" you'd end up matching in at least two steps, first '</' then '\w+\s*>', for example. More likely you'd match '<', '/', '\w+', then '>' (which works nicely because you get skipping of whitespace automatically and you can easily pull out the tag name without using capturing parens).

    But that still potentially imposes limitations like not allowing tag names that are longer than 1024 characters. Luckily, the '\w+' case is easily handled by the version of eat() above as not having the full tag name in the buffer will not prevent '\w+' from matching the first part of the tag name so the code will extend the buffer and run the match again, at least eventually reading in the whole 8GB tag name. ;)
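
    For example, with the eat()/swallow() helpers from my first reply, a closing tag might be consumed in steps like this sketch (parse_close_tag() is a name I just made up):

    sub parse_close_tag {
        swallow( '<' );                  # leading whitespace skipped for free
        swallow( '/', 1 );               # 1 = no whitespace allowed after '<'
        my $name = '';
        swallow( '\w+', 1, \$name, 1 );  # to_end: may extend $Buf mid-name
        swallow( '>' );                  # whitespace before '>' is fine
        return $name;
    }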

    So a better version of eat() would be more like:

    # Consume matching text, if found next:
    sub eat {
        my(
            $re,        # Regex or string to match against next part of doc.
            $keep_ws,   # Don't skip over whitespace first?
            $sv,        # Append matched text to referenced scalar (if any).
            $to_end,    # Allow match to hit end of $Buf?
            $peek,      # Keep pos() unchanged?
        ) = @_;
        0 while ! $keep_ws && eat( '\s+', 1, '', 1 );
        my $start = pos( $Buf );
        _hit_end( \$start );
        $re = _compile( $re );  # Prepends \G also
        do {
            return 0 if $Buf !~ /$re/gc;
        } while( ! $to_end && _hit_end(\$start) );
        $$sv .= substr( $Buf, $start, pos($Buf)-$start ) if $sv;
        pos($Buf) = $start if $peek;
        return 1;
    }

    Also, _hit_end() surely needs to keep track of line numbers for the sake of reporting syntax errors and should track when end-of-file is hit so it can short-circuit and just return false.
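
    A sketch of those two refinements (a line counter plus an end-of-file flag), no more tested than the rest:

    my $Line = 1;   # line number where $Buf's window starts
    my $EOF;        # true once sysread() has returned 0

    # Keep sliding window of document in $Buf:
    sub _hit_end {
        my( $sv_start ) = @_;
        return 0 if $EOF;               # input exhausted; the match stands
        my $pos = pos( $Buf );
        return 0 if $pos < length($Buf)-1024;
        if( 2048 < $$sv_start ) {
            my $skip = $$sv_start - 1024;
            my $gone = substr( $Buf, 0, $skip );
            $Line += $gone =~ tr/\n//;  # count lines scrolled out of the window
            $$sv_start -= $skip;
            $pos       -= $skip;
            $Buf = substr( $Buf, $skip );   # note: assigning resets pos($Buf)
        }
        if( ! sysread( $FH, $Buf, 1024*1024, length($Buf) ) ) {
            $EOF = 1;
            pos($Buf) = $pos;           # restore where the match ended
            return 0;
        }
        # Tell caller to re-do match; have it compare at same part of $Buf again:
        pos($Buf) = $$sv_start;
        return 1;
    }

    throw() can then report $Line plus a count of the newlines between the start of $Buf and the failing offset.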

    - tye        

      I noticed a bug in:

      Just a heads up. I won't be parsing documents that will come even close to being too big for memory; so don't expend further effort in this area on my behalf.

      An update on my thinking so far is that I'm caught in the dilemma of hand-coding the parser to just do what I need, whilst seeing the potential for abstracting out some of the common components and writing code to generate it.

      And then I come full circle and realise that I'm just as likely to make all the same mistakes as those modules I've decried :)


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: How would you parse this? (placeholder for a more considered reply.)
by BrowserUk (Patriarch) on Oct 26, 2013 at 10:19 UTC

    Thank you! tye

    I'm running on sub-continental time this week, so it'll be a while before I can digest all of that, but I'm pretty convinced that you've outlined the method I will use here.

Re^2: How would you parse this? (rec-descent)
by smls (Friar) on Oct 26, 2013 at 21:26 UTC
    Capturing parens force the entire document to be copied (slow)

    Can you explain?
    Where and why does this happen?

      When you run $str =~ /(\w+)/, in order to make $1 available, the first thing Perl does (sometimes) is to hide away a copy of the value of $str. $1 just points to a substring of that copy. This is so that $1's value doesn't get corrupted just because you modify $str:

      $str = "===test==="; if( $str =~ /(\w+)/ ) { print "[$str] ($1)\n"; substr( $str, 5, 2, 'mp' ); print "[$str] ($1)\n"; } __END__ [===test===] (test) [===temp===] (test)

      This isn't a big deal in this case (nor in most cases). And there are cases where it is a design choice that makes things more efficient. But it can have a significant impact on performance when you have a 1GB document in a string and then pull out some small part of it with a regex, especially if you do it over and over again:

      use strict;
      use Benchmark qw< cmpthese >;

      my $doc = ' "a string",' x 1_024_000;

      cmpthese( -1, {
          copy => sub {
              return $1 if $doc =~ /(['"])/;
          },
          substr => sub {
              return substr( $doc, pos($doc)-1, 1 )
                  if $doc =~ /['"]/g;
          },
      } );
      __END__
                 Rate   copy  substr
      copy     38.6/s     --   -100%
      substr 382480/s 991549%     --

      Re^3: regexp - repeatedly delete words expressed in alternation from end of string (speed) notes that this penalty (in modern versions of Perl) never applies when you use /g on your regex. So this shouldn't matter at all for the cases we are talking about here. So thanks for prompting me to dig up the details again!

      I had made some incremental changes to JSON::Tiny and run some benchmarks. I saw that removing some grouping parens did actually make JSON::Tiny a little bit faster. But I was wrong about the underlying reason.

      This also means that capturing parens aren't the reason that Parse::RecDescent has been considered significantly slower than it should be. I recall hearing that it was significantly slower than it could be because it ends up copying the whole document being parsed over and over, but that memory might be wrong. Another way that parsers can waste a lot of time copying text over and over is the classic mistake of repeatedly removing matched tokens from the front of the document string:

      if( $doc =~ s/^\s*(['"])// ) {
          parse_string( $1 );
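
      The non-copying alternative is the same /\G.../gc idiom from earlier in this thread; a sketch of the equivalent:

      if( $doc =~ /\G\s*(['"])/gc ) {
          parse_string( $1 );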

      But I don't know whether Parse::RecDescent suffers from that particular problem much or not. And it might not suffer from any significant performance problems at all at this point.

      - tye        
