http://www.perlmonks.org?node_id=703751

Hercynium has asked for the wisdom of the Perl Monks concerning the following question:

So, I've been happily learning how to use grammars for parsing with Parse::RecDescent, and I've been very pleased with its power and flexibility so far... but I'm stumbling over a problem, and for the life of me I can't understand why it's happening!

I highly doubt that this could be a bug in PRD, since it's used by too many people... but even the barest code demonstrates this frustrating problem:

Basically, it's this: Changing the prefix pattern has NO effect!

If I print out $skip it shows that it is set as expected, but the behavior of PRD does not change from the default.

This happens whether I am using a skip: directive, setting $skip from within an Action, or setting $Parse::RecDescent::skip from outside the grammar code.

Here's a little demonstration of what I'm getting...

Code like this:
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Parse::RecDescent;
    use Data::Dumper;

    my $grammar = <<'END_GRAMMAR';
    file: <skip:qr/#\w+/> 'foo' 'bar'
        { [ @item ] }
    END_GRAMMAR

    my $data = <<'END_DATA';
    #skdjslkdjsakdjadjlksa
    foobar
    END_DATA

    our $parse = new Parse::RecDescent($grammar)
        || die "Couldn't generate parser from grammar: $!";

    my $parse_tree = $parse->file($data);
    print Dumper $parse_tree;

Outputs this:
$VAR1 = undef;


I'm pretty certain it's not a problem with the regexes I'm using because when I do something like this instead:
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Parse::RecDescent;
    use Data::Dumper;

    my $grammar = <<'END_GRAMMAR';
    file: /#\w+/ 'foo' 'bar'
        { [ @item ] }
    END_GRAMMAR

    my $data = <<'END_DATA';
    #skdjslkdjsakdjadjlksa
    foobar
    END_DATA

    our $parse = new Parse::RecDescent($grammar)
        || die "Couldn't generate parser from grammar: $!";

    my $parse_tree = $parse->file($data);
    print Dumper $parse_tree;

I get this output:
    $VAR1 = [
              'file',
              '#skdjslkdjsakdjadjlksa',
              'foo',
              'bar'
            ];

I've scoured Google, PM, the Docs, the FAQ, and RT for more info about this, but it looks like I'm the only soul to have this problem... Is there any advice on how to track down the source of this conundrum?

Update:

As I suspected, the "skip" or "terminal prefix" functionality is *not* broken... but it is not quite as DWIMmy as I was expecting with regards to how the regular expression specified is used.

I still don't think I understand the subtle details, but as far as I can tell, the thing to keep in mind is that the skip regex (aka the terminal prefix) is matched ONLY ONCE before each terminal. Therefore, one should probably wrap the whole pattern in a group followed by an asterisk, so that *everything* one wants to skip is consumed in *one pass*.

To further show what I mean, here is one of the many non-working regexes that brought me here:
/(?: \# .*? \n? | \s* )?/msx
It will match only ONE INSTANCE of a comment or repeated whitespace. My example text has several adjoining instances of comments and whitespace, and only the first match was being consumed!

Here is the regex that does what I want:
/(?: \# .*? \n | \s )*/msx
As you can see, it consumes ALL Comments AND whitespace until nothing matches. SMALL change, BIG difference!
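For anyone who wants to see the difference without PRD in the picture, here is a minimal core-Perl sketch. It assumes the prefix is applied as a single anchored substitution at the front of the remaining text; strip_prefix_once is a hypothetical helper written for this illustration, not part of Parse::RecDescent, and the sample text is made up.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of Parse::RecDescent): remove a single
# anchored match of $skip from the front of $text, which is my assumption
# about how a terminal prefix gets applied before a terminal is tried.
sub strip_prefix_once {
    my ($skip, $text) = @_;
    $text =~ s/\A(?:$skip)//;
    return $text;
}

# Made-up sample input: adjoining comments and whitespace before the token.
my $text = "# first comment\n\n# second comment\n   foobar\n";

my $single   = qr/(?: \# .*? \n? | \s* )?/msx;   # the non-working pattern
my $repeated = qr/(?: \# .*? \n | \s )*/msx;     # the working pattern

my $after_single   = strip_prefix_once($single,   $text);
my $after_repeated = strip_prefix_once($repeated, $text);

# Report whether the next terminal ('foo') is now at the front of the text.
printf "single:   %s\n",
    $after_single   =~ /\Afoo/ ? "reaches 'foo'" : "stuck before 'foo'";
printf "repeated: %s\n",
    $after_repeated =~ /\Afoo/ ? "reaches 'foo'" : "stuck before 'foo'";
```

With the first pattern the text is still stuck before 'foo'. In this sketch it actually removes only the leading "#", because the lazy .*? followed by an optional \n? is satisfied without consuming anything more. The second pattern loops over comments and whitespace until 'foo' is at the front.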

I now have this working the way I want, by assigning it to $skip in the "start-up actions":
$skip = '(?msx: \# .*? \n | \s )*'

This has been another fun and edifying expedition, and if anyone reading this has any additional questions, I am happy to share whatever meager knowledge I have gained :)