Parse::RecDescent: problem with grammar and error reporting

by kikuchiyo (Pilgrim)
on Jan 20, 2012 at 10:32 UTC ( #948931=perlquestion )
kikuchiyo has asked for the wisdom of the Perl Monks concerning the following question:

I want to write a program for creating presentations. The presentation style I have in mind is very simple: few lines per slide, few words per line with a font size as large as possible, occasionally an image. (This style is close to what's called the Takahashi style).

LaTeX with the beamer class is a good candidate for making such presentations, but writing \begin{frame}...\end{frame} every time is tedious. So I wanted a program that takes the bare outline (with just the lines I want displayed) and produces the LaTeX source for me.

One required feature of this program would be that, by default, the font size for a particular slide should be chosen so that the longest line fills the available screen width. However, it should be possible to override this mechanism and set absolute font sizes for individual lines.

Here is an example of an outline file:

First slide
Test line
Second line
Line with explicit font size @20

The "@20", or more generally the "@" character and an integer at the end of the line is the font size override.

I actually wrote the program that does all this, but it has become an unmaintainable mess, so this time I want to do it the Right Way(TM).

I figured that when it comes to Perl and parsing text files with a pre-determined structure, the Right Way is Parse::RecDescent. (I also have the ulterior motive that I want to learn P::RD.)

As a first step I set out to write a grammar that parses outline files with a structure similar to the example above.

Here is what I have so far:

#!/usr/bin/perl
use strict;
use warnings;

use Parse::RecDescent;
use Data::Dumper;

$::RD_ERRORS = 1;
$::RD_WARN   = 1;
$::RD_HINT   = 1;
$::RD_TRACE  = 1;

my $grammar = <<'END_GRAMMAR';
#<autotree>
startrule: slide(s)
slide: <skip: qr/[ \t]*/> line(s) line_end(s)
line: text fontspec(?) line_end
text: /[^@\n]+(?:\b@[^@\n]*)*/
fontspec: <skip: ''> "@" fontspec_size
fontspec_size: /\d+/ | <error: Invalid fontspec or unescaped '\@' at:\n$text.>
line_end: "\n"
END_GRAMMAR

my $good_text = <<'SAMPLE_TEXT';
Test Line @20
Even more test @20
Escaped \@text
SAMPLE_TEXT

my $bad_text = <<'SAMPLE_TEXT';
Bad line @invalid
Also bad @foo @20
SAMPLE_TEXT

my $parser = new Parse::RecDescent ($grammar);
#$parser->startrule($text);
$parser->startrule($good_text);
print STDERR "\n" x 3;
$parser->startrule($bad_text);

I've run into problems with the font size specifiers. I want to allow "@" characters within the text if they are not at word-initial positions (for e-mail addresses and whatnot), but I want an error to be generated if there is an unescaped, word-initial "@" that does not introduce a font size specifier, or if the font size specifier is invalid (not an integer). The version above fails at "\@", even though I want to allow that, and it does not print the error text on failure; it just fails silently, as seen from the P::RD trace.

How can I rephrase the grammar so that it passes the lines in $good_text but fails on the lines in $bad_text, printing the error message?

Eventually I want to collect the parsed lines into a data structure for further processing, so I'd appreciate tips on how to do that, too.

Replies are listed 'Best First'.
Re: Parse::RecDescent: problem with grammar and error reporting
by moritz (Cardinal) on Jan 20, 2012 at 11:12 UTC

    I think that Parse::RecDescent is a bit overkill for such a simple, block+line based format. Here's my take that uses just regexes, and assembles a data structure from the input:

    use 5.010;
    use strict;
    use warnings;

    sub parse_file {
        my $fh = shift;
        local $/ = '';    # paragraph mode
        my @paragraphs;
        my $line_no = 0;
        while (my $para = <$fh>) {
            push @paragraphs,
                [ map parse_line($_, ++$line_no), split /\n/, $para ];
        }
        return \@paragraphs;
    }

    sub parse_line {
        # Use a named variable; lexical "my $_" was removed in Perl 5.24.
        my ($input, $line_no) = @_;
        return unless $input =~ /\S/;
        if ($input =~ /^(.*)\s\@(\w+)/) {
            my $line      = "$1";
            my $font_size = "$2";
            die "Invalid \@fontsize declaration in line $line_no ($font_size is not a number)\n"
                if $font_size =~ /\D/;
            return { text => $line, font_size => $font_size };
        }
        else {
            return { text => $input };
        }
    }

    use Data::Dumper;
    print Dumper parse_file(\*DATA);

    __DATA__
    First slide
    Test line
    Second line
    Line with explicit font size @20

    You can decide for yourself if it's too much of an unmaintainable mess to use :-).

      Yeah, I started out with something like this. Then I added provisions for embedding images, literal TeX syntax that could extend to more than one line, inlined gnuplot scripts, etc., and it became pretty much unmaintainable.

      But you are perhaps right, I should at least have a go at refactoring the existing script using idioms I'm familiar with, before I jump into rewriting it with a tool I don't know (P::RD).

        Not so fast. (IMHO...) It does not take very long for code such as this, written without P::RD, to become very unmaintainable. P::RD takes what you already know (regular expressions ...) and puts it into a very complete framework that you would otherwise have to build yourself. It rather sneaks up on you ... suddenly there is some little twist on the format that you need to support, and something inside your lovingly-fashioned code goes, “snap.”


        P::RD does take some time to get to know. (One thing that I quickly discovered is that you need to write a package of helper subroutines that you can use within the body of your generated grammar, so that you do not repeat yourself. Grammar handlers should be short, should of course include use warnings; use strict;, but generally should leverage code that is loaded with use, not embedded verbatim.) But, having spent the time to learn it, I find that I use it constantly. Even “simple” requirements tend to grow, and it is not pleasant to discover that you have run out of tool. With P::RD, that will never happen.

Re: Parse::RecDescent: problem with grammar and error reporting
by JavaFan (Canon) on Jan 20, 2012 at 10:51 UTC
    So, you basically want to allow a @ as long as it's not preceded by whitespace? \b will not do that -- that just forces a word character to be present.

    Suggestion (untried):

    text: /[^@\n]*(?:\S@[^@\n]*)*/
    That still allows text like:
    foo bar@
    I cannot deduce whether you want to allow that or not.

    It's not the most efficient regex, as it typically will backtrack one character on each (valid) @ character encountered. But since Parse::RecDescent is a massive backtracking engine written in Perl, this is likely to be acceptable.

      The suggestion doesn't work: it fails at "Even more test @20". I guess it's because \S is not a zero-width assertion; it actually wants a non-whitespace character, but those were already gobbled up by [^@\n]*.

      "foo bar@" should be allowed.

      Interestingly, this version does print the error messages, but I don't understand why.

        Ah, it's because if one has /PAT1*PAT2*/ Perl prefers to match as many PAT1s as possible, even to the extent of matching less in total. Witness the difference:
        $ perl -wE 'q{Even more test @20} =~ /[^@\n]*(?:\S@[^@\n]*)*/ and say $&'
        Even more test address
        $ perl -wE 'q{Even more test @20} =~ /[^@\n]*(?:\S@[^@\n]*)+/ and say $&'
        Even more test
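The greedy behaviour described here can be reproduced with a short script. This is only a sketch: `matched` is a hypothetical helper, the e-mail sample string is made up for illustration, and the third pattern, which swaps `\S` for the zero-width lookbehind `(?<=\S)`, is an assumption that nobody in the thread proposed.

```perl
use strict;
use warnings;

# Returns the substring a pattern matches, or undef when it does not match.
sub matched {
    my ($re, $str) = @_;
    return $str =~ /($re)/ ? $1 : undef;
}

# The two patterns compared above, differing only in the final quantifier.
my $star = qr/[^\@\n]*(?:\S\@[^\@\n]*)*/;
my $plus = qr/[^\@\n]*(?:\S\@[^\@\n]*)+/;

# Assumption, not from the thread: a zero-width lookbehind in place of \S,
# so the character before the @ can stay inside the leading [^@\n]* run.
my $behind = qr/[^\@\n]*(?:(?<=\S)\@[^\@\n]*)*/;

for my $str ('mail user@example.com', 'Even more test @20') {
    for my $case ([star => $star], [plus => $plus], [behind => $behind]) {
        my ($name, $re) = @$case;
        my $got = matched($re, $str);
        printf "%-6s %-24s %s\n", $name, "'$str'",
            defined $got ? "'$got'" : 'no match';
    }
}
```

With `*`, the leading `[^@\n]*` keeps every character up to the `@` and the group simply matches zero times, so the mid-word `@` is never reached. With `+`, the engine is forced to give one character back so that `\S@` can match. The lookbehind variant needs no backtracking for the mid-word case and still stops before a whitespace-preceded `@`, which leaves that `@` available to a following fontspec rule.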
        Try this:
Re: Parse::RecDescent: problem with grammar and error reporting
by tobyink (Abbot) on Jan 20, 2012 at 16:13 UTC

    Seems to me like the best way to define text would be something like:

    /( \\@ | \\\\ | [^\\@\r\n] )*/x

    In other words, text is zero or more occurrences of any of:

    • backslash then @-sign
    • double backslash
    • a character which is not backslash, @-sign, carriage return or line break
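As a rough illustration of the definition above, the pattern can be tried in plain Perl, outside any grammar. This is a sketch under assumptions: `split_line` is a hypothetical helper, the optional trailing `@<digits>` part and the whitespace trimming are additions for the font size override, and the whole pattern is anchored to a single line. Under this definition every `@` inside the text has to be escaped, including the ones in e-mail addresses.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): splits one outline line into
# its text and an optional trailing font size. Returns a hash reference,
# or nothing when the line contains an unescaped @ that is not a valid
# font size specifier.
sub split_line {
    my ($line) = @_;

    # Text: backslash-@, double backslash, or any character that is not a
    # backslash, @, CR or LF. An optional "@<digits>" may end the line.
    if ($line =~ /\A ( (?: \\\@ | \\\\ | [^\\\@\r\n] )* ) (?: \@ (\d+) )? \z/x) {
        my ($text, $size) = ($1, $2);
        $text =~ s/\s+\z//;    # drop the spaces in front of the font spec
        return { text => $text, (defined $size ? (font_size => $size) : ()) };
    }
    return;
}

for my $line ('Test Line @20', 'Escaped \@text', 'Bad line @invalid') {
    my $parsed = split_line($line);
    if ($parsed) {
        my $size = defined $parsed->{font_size} ? $parsed->{font_size} : 'default';
        print "ok:  '$parsed->{text}' (font size: $size)\n";
    }
    else {
        print "bad: '$line'\n";
    }
}
```

Because the three alternatives never start with the same character, the text part is matched without ambiguity, and a line is rejected as soon as an unescaped `@` is followed by anything other than digits up to the end of the line.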

      And the other approach says that the text is a series of “tokens,” using e.g. the definitions you give here. That is the first stage of a good parser ... the so-called “lexer.” The fun begins when you have to deal with the structures that this stream-of-tokens can take. That is the upper-half of a parser ... and the point where its real power comes into play.

      I say this, merely to describe situations where a full-blown Parser is extremely useful. If “all that you really need is a lexer,” and the structure of the “language” is so straightforward that you really are not in need of decision-making to decide what you have and/or what to do, then a Parser might (or, might not ...) be overkill. The time to make these decisions is early.

Node Type: perlquestion [id://948931]
Approved by Corion
Front-paged by McDarren