perlquestion
Anonymous Monk
<p>I have a Marpa::R2 parser that needs to distinguish between IP addresses and hostnames, with no leading keyword to tell them apart. My actual grammar is too complicated to reproduce here, but a minimal reproducible example of the same problem is below:</p>
<code>
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper;
use Term::ANSIColor qw(:constants);
use Marpa::R2;
my $rules = <<'END_OF_GRAMMAR';
lexeme default = latm => 1
:default ::= action => [name,values]
:start ::= <entry>
<entry> ::= <op> (SP) <hostaddr4>
<op> ::= 'add' | 'remove'
<ipv4> ::= NUMBER ('.') NUMBER ('.') NUMBER ('.') NUMBER
<hostname> ::= NAME
<hostaddr4> ::= <ipv4> | <hostname>
SP ~ [\s]+
NAME ~ [\S]+
NUMBER ~ [\d]+
END_OF_GRAMMAR
my $input = <<'END_OF_INPUT';
add 192.0.2.1
add www.example.org
remove 192.0.2.2
END_OF_INPUT
my $grammar = Marpa::R2::Scanless::G->new({source => \$rules});
for (split /^/m, $input) {
    chomp;
    if (length $_) {
        print "\n\n$_\n";
        my $recce = Marpa::R2::Scanless::R->new({
            grammar => $grammar,
            ranking_method => 'rule'
        });
        eval { $recce->read(\$_ ) };
        print ($@ ? (RED . "$@\n") : GREEN);
        print $recce->show_progress(), "\n";
        print Dumper($recce->value), "\n\n", RESET;
    }
}
</code>
<p>From what I can tell, Marpa always picks the <code><hostname></code> form of the grammar, even on lines that look more like IPs. I assume this is because the character class <code>[\S]+</code> <i>also</i> includes the characters which make up an IP address.</p>
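<p>A quick check with plain Perl regexes (not Marpa itself) seems consistent with that assumption. The two patterns below are only stand-ins for the NAME and NUMBER lexemes, both tried at the start of the address. If the lexer prefers the longest match, which is how I understand <code>latm => 1</code> (longest acceptable tokens match), then NAME would win before the rule ranking is ever consulted:</p>
<code>
use strict;
use warnings;

my $text = '192.0.2.1';

# Plain-regex stand-ins for the two lexemes, both anchored at offset 0.
# This only illustrates match lengths; it does not run Marpa's lexer.
my ($as_name)   = $text =~ /\A(\S+)/;   # like NAME   ~ [\S]+
my ($as_number) = $text =~ /\A(\d+)/;   # like NUMBER ~ [\d]+

printf "NAME would take %d chars: %s\n",   length $as_name,   $as_name;
printf "NUMBER would take %d chars: %s\n", length $as_number, $as_number;
</code>
<p>Here the NAME stand-in consumes the whole address, while the NUMBER stand-in stops at the first dot.</p>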
<p>So far, in my grammar definition, I've tried:</p>
<code>
<hostaddr4> ::= <ipv4> | <hostname>

<hostaddr4> ::= <ipv4> || <hostname>

<hostaddr4> ::= <hostname> | <ipv4>

<hostaddr4> ::= <hostname> || <ipv4>

<hostaddr4> ::= <ipv4> rank => 2
              | <hostname> rank => 1

<hostaddr4> ::= <ipv4> rank => 1
              | <hostname> rank => 2

<hostaddr4> ::= <ipv4> rank => 1
<hostaddr4> ::= <hostname> rank => 2

<hostaddr4> ::= <hostname> rank => 1
<hostaddr4> ::= <ipv4> rank => 2
</code>
<p>...and none seem to make a difference. They all yield the <code>['hostname', '192.0.2.1']</code> array.</p>
<p>The only change that makes a difference is removing the <code><hostname></code> alternative from <code><hostaddr4></code> (which means the grammar no longer matches the data I am parsing). With that alternative removed, the result changes to <code>['ipv4', '192', '0', '2', '1']</code>.</p>
<p>Can anyone advise the correct approach in this (seemingly) simple case?</p>
J.
14