
Using Parse::RecDescent to parse Perl-ish strings without resorting to string eval

by polypompholyx (Chaplain)
on Feb 29, 2008 at 16:32 UTC (#671209=perlquestion)
polypompholyx has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I'm in the middle of updating a module that grades Excel spreadsheets by comparing the contents of cells to a model text file parsed by Parse::RecDescent. Strings are allowed in the comparisons, but in the previous version of the module, they were hacked in by simply returning a manually unescaped version of the matched text, e.g.:

quoted_string: '"' m{(([\\]"|[^"])*)} '"'
    {
        $item[2] =~ s{\\"}{"}g;      # Unescape quotes
        $item[2] =~ s{\\\\}{\\}g;    # Unescape backslashes
        $item[2];
    }

I want to use the <perl_quotelike> production (a wrapper around Text::Balanced) for greater flexibility with quoted strings and regexes. The problem is that <perl_quotelike> extracts the Perl-ish string/regex, but the only way I can think of to interpret it correctly (it could contain quotes, backslashes, Unicode hex escapes, regex modifiers, etc.) is to eval the relevant bits, e.g.:

quoted_string: <perl_quotelike>
    {
        my ( $name, $ldelim, $text, $rdelim ) = @{ $item[1] };
        if ( $name eq 'qq' ) {
            $text = eval 'qq' . $ldelim . $text . $rdelim;
        }
        # etc...
    }

Which is nasty, as the model file text could then contain:

A1 mean(B1:B10) && A2 "Something innocuous" && A3 C1/C2 && A4 qq(Oh dear @{[ system 'rm -rf *' ]})

Am I missing something, or am I trapped between either implementing my own interpolator/unescaper (which certainly won't be able to replicate all the useful features of perl quoting and regex modifiers), or using string eval (and hoping that no-one does anything nasty)?
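For the "own unescaper" route, a minimal sketch might look like the following. It assumes a deliberately small whitelist of escapes (`\\`, `\"`, `\n`, `\t` and `\x{...}`), and the names `unescape_string`, `_decode_escape` and `%ESCAPE` are hypothetical, not part of Parse::RecDescent or Text::Balanced. It decodes in a single pass, so an escaped backslash is never re-read as the start of another escape, and nothing in the input is ever evaluated as Perl.

```perl
use strict;
use warnings;

# Whitelist of single-character escapes this sketch understands (an assumption;
# extend it to whatever the model file format should allow).
my %ESCAPE = (
    n    => "\n",
    t    => "\t",
    '"'  => '"',
    '\\' => '\\',
);

# Hypothetical helper: decode the body of a double-quoted string
# (delimiters already stripped) without evaluating it as Perl.
sub unescape_string {
    my ($text) = @_;

    # One left-to-right pass: either \x{HEX} or a backslash plus one character.
    $text =~ s/\\(?:x\{([0-9A-Fa-f]{1,6})\}|(.))/_decode_escape($1, $2)/gse;
    return $text;
}

# Map one matched escape to its character, or die on anything unlisted.
sub _decode_escape {
    my ( $hex, $char ) = @_;
    return chr( hex $hex ) if defined $hex;
    return $ESCAPE{$char}  if exists $ESCAPE{$char};
    die "Unsupported escape sequence: \\$char\n";
}

# Interpolation syntax is left alone, because nothing is evaluated.
print unescape_string('Oh dear @{[ 1 + 1 ]}'), "\n";
```

Rejecting unknown escapes with `die` is a design choice for a grading tool: a typo in the model file is reported instead of silently producing the wrong expected string.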

Re: Using Parse::RecDescent to parse Perl-ish strings without resorting to string eval
by sundialsvc4 (Abbot) on Feb 29, 2008 at 17:06 UTC

    Is it feasible to output the Excel data in a known structured format, such as XML? That would open up the world of XML::XPath, where you could query the XML-formatted data ... in effect letting something else (namely, XPath) do the dirty work of pulling out the data you need.

    The “model file” would not need to be in the same format. If it consists of a series of lines containing “cell-address, and what should be in it,” then your code could work by constructing an XPath string, letting XPath dredge the file for that piece of data, comparing it, and moving on.

    “The model file” is the part that you can easily control; the student's homework is the Great Unknown. But if the complexity were hidden from view (XPath is very good at what it does, and XPath expressions are very powerful), the complexity of your program would be reduced by at least half. I'd call that “a win.”

      Pulling the data out of Excel isn't too difficult (Win32::OLE); it's the parsing of the model file that's the problem. The model file, in its simplest form, is exactly what you say: a list of cells and what should be in them, in terms of strings, calculations, etc. I leverage Perl's regexes to grade things like graph axis titles. As you guessed, there's a hideous number of ways for them to write µmol min⁻¹ mg⁻¹, which makes me ache for the /x modifier. Using <perl_quotelike> seemed like an easy way out for both strings and regexes, but it looks increasingly like it'd be better to hand-roll something more limited.
Re: Using Parse::RecDescent to parse Perl-ish strings without resorting to string eval
by ikegami (Pope) on Feb 29, 2008 at 17:10 UTC
    Don't. It might be a lot of work to convert from Perl, but it's even more work to parse Perl. If you insist, you probably should start by taking a look at String::Interpolate (uses a Safe compartment to confine the interpolation to specified VAR => $value pairings) and PPI (a Perl parser).

    quoted_string: '"' m{(([\\]"|[^"])*)} '"'

    I hope you never have spaces after the first '"' in the text you are parsing, or did you change <skip>?

    It's best to avoid separating a token into multiple items. It's rarely necessary or even easier.

      Yes, in the original code, there's some fiddling with <skip> (since whitespace has syntactic significance in the grammar), and unescaping of Unicode hex character escapes. I left them out for clarity.

      I can probably live with hand-rolled code that parses its own definition of a double-quoted string, since most of the interpolation is unnecessary, but I don't see any way out with regexes. I'd like to be able to do something like:

      $text = qr/$text/$modifiers

      but $modifiers doesn't interpolate, and I can't think of any easy way to simulate this without having a very large switch statement.

        my $mods_p = '';
        my $mods_m = '';
        ($opts{$_} ? $mods_p : $mods_m) .= $_ for qw( x i s m );
        $re = qr/(?$mods_p-$mods_m:$re)/;
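        The same inline-modifier idea can be wrapped up so that it takes the modifiers as a plain string (for instance, whatever modifier text the grammar extracted) and validates them first. This is only a sketch: `build_regex` is a hypothetical name, and restricting the flags to x, i, s and m is an assumption carried over from the snippet above.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a pattern body and a modifier string into a
# regex object without string eval. Only x, i, s and m are accepted.
sub build_regex {
    my ( $pattern, $modifiers ) = @_;
    $modifiers = '' unless defined $modifiers;

    die "Unsupported regex modifiers: '$modifiers'\n"
        unless $modifiers =~ /\A[xism]*\z/;

    # Split the four flags into "switched on" and "switched off" groups.
    my %on    = map { $_ => 1 } split //, $modifiers;
    my $plus  = join '', grep {  $on{$_} } qw( x i s m );
    my $minus = join '', grep { !$on{$_} } qw( x i s m );
    my $flags = length($minus) ? "$plus-$minus" : $plus;

    # Under /x a trailing "# comment" would swallow the closing paren,
    # so end the pattern with a real newline in that case.
    my $tail = $on{x} ? "\n" : '';

    return qr/(?$flags:$pattern$tail)/;
}

my $re = build_regex( 'mol \s* min', 'xi' );
print "matched\n" if 'MOL   MIN' =~ $re;
```

        A malformed pattern still dies when it is compiled, so the caller would probably want to wrap the call in a block eval and report the offending line of the model file. As far as I know, Perl also refuses `(?{ ... })` code blocks in interpolated patterns unless `use re 'eval'` is in effect, which is worth confirming on the Perl version in use.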

Node Type: perlquestion [id://671209]
Approved by Corion