Cannot get Marpa::R2 to prioritise one rule over another

by Anonymous Monk
on Jan 20, 2021 at 23:58 UTC (#11127191)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a Marpa::R2 parser that needs to differentiate between IP addresses and hostnames when both follow the same leading keywords. My actual grammar is too complicated to reproduce here, but a minimal reproducible example of the same problem is below:

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use Data::Dumper;
    use Term::ANSIColor qw(:constants);
    use Marpa::R2;

    my $rules = <<'END_OF_GRAMMAR';
    lexeme default = latm => 1
    :default ::= action => [name,values]
    :start ::= <entry>
    <entry> ::= <op> (SP) <hostaddr4>
    <op> ::= 'add' | 'remove'
    <ipv4> ::= NUMBER ('.') NUMBER ('.') NUMBER ('.') NUMBER
    <hostname> ::= NAME
    <hostaddr4> ::= <ipv4> | <hostname>
    SP ~ [\s]+
    NAME ~ [\S]+
    NUMBER ~ [\d]+
    END_OF_GRAMMAR

    my $input = <<'END_OF_INPUT';
    add 192.0.2.1
    add www.example.org
    remove 192.0.2.2
    END_OF_INPUT

    my $grammar = Marpa::R2::Scanless::G->new({source => \$rules});

    for (split /^/m, $input) {
        chomp;
        if (length $_) {
            print "\n\n$_\n";
            my $recce = Marpa::R2::Scanless::R->new({
                grammar        => $grammar,
                ranking_method => 'rule'
            });
            eval { $recce->read(\$_ ) };
            print ($@ ? (RED . "$@\n") : GREEN);
            print $recce->show_progress(), "\n";
            print Dumper($recce->value), "\n\n", RESET;
        }
    }

From what I can tell, Marpa always picks the <hostname> form of the grammar, even on lines that look more like IPs. I assume this is because the character class [\S]+ also includes the characters which make up an IP address.

So far, in my grammar definition, I've tried:

    <hostaddr4> ::= <ipv4> | <hostname>
    <hostaddr4> ::= <ipv4> || <hostname>
    <hostaddr4> ::= <hostname> | <ipv4>
    <hostaddr4> ::= <hostname> || <ipv4>
    <hostaddr4> ::= <ipv4> rank => 2 | <hostname> rank => 1
    <hostaddr4> ::= <ipv4> rank => 1 | <hostname> rank => 2

    <hostaddr4> ::= <ipv4> rank => 1
    <hostaddr4> ::= <hostname> rank => 2

    <hostaddr4> ::= <hostname> rank => 1
    <hostaddr4> ::= <ipv4> rank => 2

...and none seem to make a difference. They all yield the ['hostname', '192.0.2.1'] array.

The only thing that changes the result is removing the <hostname> alternative from <hostaddr4> (which then no longer matches the grammar of the data I am parsing); the representation then changes to ['ipv4', '192', '0', '2', '1'].

Can anyone advise the correct approach in this (seemingly) simple case?

J.

Replies are listed 'Best First'.
Re: Cannot get Marpa::R2 to prioritise one rule over another
by Discipulus (Abbot) on Jan 21, 2021 at 07:57 UTC
    Hello J.

    UPDATE: I probably missed your point. Now I see that the IP rule is never reached and the address is always treated as if it were a hostname. I'll think a bit more about it.

    The problem seems to be fixed if you use NAME ~ [\D]+, but it will fail again with a hostname like 42.perl.org. Perhaps a stricter rule definition is needed to tell an IP from a hostname.
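    To see why 42.perl.org trips up the [\D]+ variant, here is a minimal plain-Perl sketch (no Marpa involved) that measures how far each character class gets at the start of a string. The helper match_len is hypothetical and only mimics what a lexeme's character class can match; it is not how Marpa lexes internally.

```perl
use strict;
use warnings;

# Hypothetical helper: length of the match a character class gets at the
# start of $text; 0 means the class cannot match there at all.
sub match_len {
    my ($class_re, $text) = @_;
    return $text =~ /\A($class_re)/ ? length $1 : 0;
}

my $NAME   = qr/[\D]+/;    # the proposed replacement for NAME
my $NUMBER = qr/[\d]+/;

for my $text ('192.0.2.1', 'www.example.org', '42.perl.org') {
    printf "%-16s NAME=%d NUMBER=%d\n", $text,
        match_len($NAME, $text), match_len($NUMBER, $text);
}
# 192.0.2.1        NAME=0 NUMBER=3
# www.example.org  NAME=15 NUMBER=0
# 42.perl.org      NAME=0 NUMBER=2
```

    For 42.perl.org only NUMBER can start a token, so the parse is already committed to the <ipv4> rule by the time it reaches "perl".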

    Original reply: My help can only be very limited, because I still do not understand Marpa::R2 and am just taking my first baby steps. I don't understand the rank or the show_progress part (at the moment), so I reduced the example to something I know (removing the colors).

    Doesn't the hostname come from your :default ::= action => [name,values]? This is what is proposed in the synopsis, but I find it a bit misleading.

    If you look at "A dice roller system with Marpa::R2" and its prequel "First steps with Marpa::R2 and BNF", you will see that an anonymous hash is used and populated during the parsing phase. Maybe you can use a pattern like this (you can then check the validity of an IP or of a hostname in a separate part of the code).

    I ended with the following code that seems to produce the expected result

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Cannot get Marpa::R2 to prioritise one rule over another
by Discipulus (Abbot) on Jan 21, 2021 at 12:32 UTC
    Hello again,

    Maybe a second attempt is better than the first one. I had to specify what a hostname is in an ugly way, but it seems viable.

    I'm going mad trying to understand why the dot . is passed in for IPs and not for hostnames! (Because ipv4 ends with an action?)

        #!/usr/bin/env perl
        use warnings;
        use strict;
        use Data::Dump;
        use Marpa::R2;

        my $rules = <<'END_OF_GRAMMAR';
        lexeme default = latm => 1
        :default ::= action => ::first

        entry ::= op hostaddr4 action => dump_entry
        op ::= 'add' action => add_op
             | 'remove' action => add_op
        hostaddr4 ::= hostname | ipv4
        hostname ::= DOMAIN EXT action => add_hostname
                   | DOMAIN DOMAIN EXT action => add_hostname
                   | DOMAIN DOMAIN DOMAIN EXT action => add_hostname
        DOMAIN ::= NAME '.'
        NAME ~ [\d\w]+
        EXT ~ 'org' | 'net'
        ipv4 ::= NUMBER '.' NUMBER '.' NUMBER '.' NUMBER action => add_ip
        NUMBER ~ [\d]+
        :discard ~ SP
        SP ~ [\s]+
        END_OF_GRAMMAR

        my $input = <<'END_OF_INPUT';
        add example.org
        add www.perl.org
        add 42.perl.net
        add 192.0.2.1
        remove 192.0.2.2
        END_OF_INPUT

        my $grammar = Marpa::R2::Scanless::G->new({source => \$rules});

        for (split /^/m, $input) {
            chomp;
            if (length $_) {
                print "\nPARSING: $_\n";
                my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar, });
                my $value_ref = $grammar->parse( \$_, 'main');
            }
        }

        sub dump_entry{
            print "dump_entry received: ";
            dd shift @_;
        }
        sub add_op{
            my $self = shift @_;
            print "add_op received: ";
            dd @_;
            $$self{operator} = join '',@_;
            return $self;
        }
        sub add_ip{
            my $self = shift @_;
            print "add_ip received: ";
            dd @_;
            $$self{type} = 'IP';
            $$self{value} = join '',@_;
            return $self;
        }
        sub add_hostname{
            my $self = shift @_;
            print "add_hostname received: ";
            dd @_;
            $$self{type} = 'hostname';
            $$self{value} = join '.',@_;
            return $self;
        }

        __DATA__

        PARSING: add example.org
        add_op received: "add"
        add_hostname received: ("example", "org")
        dump_entry received: { operator => "add", type => "hostname", value => "example.org" }

        PARSING: add www.perl.org
        add_op received: "add"
        add_hostname received: ("www", "perl", "org")
        dump_entry received: { operator => "add", type => "hostname", value => "www.perl.org" }

        PARSING: add 42.perl.net
        add_op received: "add"
        add_hostname received: (42, "perl", "net")
        dump_entry received: { operator => "add", type => "hostname", value => "42.perl.net" }

        PARSING: add 192.0.2.1
        add_op received: "add"
        add_ip received: (192, ".", 0, ".", 2, ".", 1)
        dump_entry received: { operator => "add", type => "IP", value => "192.0.2.1" }

        PARSING: remove 192.0.2.2
        add_op received: "remove"
        add_ip received: (192, ".", 0, ".", 2, ".", 2)
        dump_entry received: { operator => "remove", type => "IP", value => "192.0.2.2" }

    L*

      The dot is ignored because the default action (::first) is used for the DOMAIN rule. Mixing the lexer and grammar rules is not a good idea; they're very different. Using consistent capitalization for the non-terminals also helps: I usually use a different convention for the grammar symbols and the lexer ones.
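      A minimal plain-Perl model of that explanation, assuming only what the thread shows: an action receives a per-parse object followed by one value per right-hand-side symbol, and ::first returns the first of those values. The subs first_action and keep_all are hypothetical stand-ins written for illustration, not Marpa code.

```perl
use strict;
use warnings;

# Stand-in for the built-in ::first action: return the first child value.
sub first_action { my ($per_parse, @values) = @_; return $values[0] }

# Stand-in for a custom action that keeps every child value it is given.
sub keep_all { my ($per_parse, @values) = @_; return [@values] }

# DOMAIN ::= NAME '.'   (default ::first action, so the '.' is dropped)
my @domains = map { first_action({}, $_, '.') } qw(www perl);
print "hostname children: @domains org\n";
# hostname children: www perl org

# ipv4 ::= NUMBER '.' NUMBER '.' NUMBER '.' NUMBER   (custom action)
my $ip_children = keep_all({}, 192, '.', 0, '.', 2, '.', 1);
print "ipv4 children: @$ip_children\n";
# ipv4 children: 192 . 0 . 2 . 1
```

      The dots never reach add_hostname because each DOMAIN has already been reduced to its NAME, whereas the ipv4 rule hands its own literals straight to add_ip.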

      I usually build the grammar from the top to the bottom, i.e. from the starting symbol to the L0 rules. I start with the default action of [name,values] and replace it with individual actions from the bottom to the top.

      The result might be something like

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        Hello choroba,

        could you be so kind as to explain this in more detail:

        > Mixing the lexer and grammar rules is not a good idea, they're very different.

        I'm reading the Marpa-R2 vocabulary and I am not able to define them strictly. Where does my code mix them?

        L*


        Thanks for demonstrating how to recompose the dotted components of hostnames and IPs, using a custom action. I had been wondering how best to go about that, and you have given me a starting point.

        One question, regarding your concat subroutine, if I may: Is it possible to generalise it to return the [rulename,concatted-string] pair, so it conforms to the tokens emitted by the default action [name,values], or would I have to have a separate subroutine for each rule (and return the rulename literally)?

        I had originally thought there might be context in the first argument, which you shift over, but that appears to be an empty hashref in all cases I've seen.

      Thanks for this attempt, but I'm not sure that defining hostname as a fixed number of DOMAIN components, or defining a limited set of EXT suffixes, is the right way to go. Hostnames can be arbitrarily long, at least in terms of subdomains, and the list of top-level domains is growing by the day.

      I'm probably going to settle for just capturing NAME and leaving the semantics of IPv4, (later) IPv6, or neither of those to a custom action. Given the complexity of the problem (especially IPv6), that is likely the best way forward.

      J.

Re: Cannot get Marpa::R2 to prioritise one rule over another
by duelafn (Vicar) on Jan 21, 2021 at 11:54 UTC

    I don't have a fix for your actual problem, but the reason it refuses to select the ipv4 is longest-token matching. At that position NAME matches a longer token than NUMBER, therefore it always wins.
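    A toy model of that longest-token rule in plain Perl, using the character classes from the thread. The function longest_tokens is hypothetical and is not Marpa's lexer; it only tries each candidate lexeme at one position and reports the longest match length together with every lexeme that reaches it.

```perl
use strict;
use warnings;

# Toy model (not Marpa's real lexer): try every candidate lexeme at $pos
# and return the longest length plus all lexeme names that reach it.
sub longest_tokens {
    my ($lexemes, $text, $pos) = @_;
    my $rest = substr $text, $pos;
    my ($best_len, @winners) = (0);
    for my $name (sort keys %$lexemes) {
        my $re = $lexemes->{$name};
        next unless $rest =~ /\A($re)/;
        my $len = length $1;
        if    ($len > $best_len)  { ($best_len, @winners) = ($len, $name) }
        elsif ($len == $best_len) { push @winners, $name }
    }
    return ($best_len, @winners);
}

my $line = 'add 192.0.2.1';
my $pos  = 4;    # just past "add "

my %original = (NAME => qr/[\S]+/,   NUMBER => qr/[\d]+/);
my %no_dots  = (NAME => qr/[^\s.]+/, NUMBER => qr/[\d]+/);

print join(' ', longest_tokens(\%original, $line, $pos)), "\n";    # 9 NAME
print join(' ', longest_tokens(\%no_dots,  $line, $pos)), "\n";    # 3 NAME NUMBER
```

    With NAME ~ [\S]+ the whole address is consumed as a single NAME token before any rule alternative comes into play, which would also explain why reordering the alternatives or adding rank leaves nothing to choose between. With NAME ~ [^\s.]+ the two lexemes tie at three characters; my understanding is that Marpa then offers both tokens to the grammar, but treat that as an assumption to check against the Marpa::R2 documentation.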

    Update: Change NAME to not accept a dot (and update hostname rule) and then it will work as you originally had:

        my $rules = <<'END_OF_GRAMMAR';
        lexeme default = latm => 1
        :default ::= action => [name,values]
        :start ::= <entry>
        <entry> ::= <op> (SP) <hostaddr4>
        <op> ::= 'add' | 'remove'
        <ipv4> ::= NUMBER ('.') NUMBER ('.') NUMBER ('.') NUMBER
        <hostname> ::= NAME+ separator => DOT
        <hostaddr4> ::= <ipv4> | <hostname>
        DOT ~ '.'
        SP ~ [\s]+
        NAME ~ [^\s.]+
        NUMBER ~ [\d]+
        END_OF_GRAMMAR

    Good Day,
        Dean

      Thanks. I won't say the reasoning makes absolute sense to me yet, but I do confirm that your approach does allow differentiation between IPv4 and not-IPv4 at parse-time -- albeit one that I have to recompose back to a single string.
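      One way to do that recomposition without touching the grammar is to post-process the [name,values] tree in plain Perl. The sketch below is only an illustration: rejoin is a hypothetical helper, and the tree is written by hand. Only the ['ipv4', '192', '0', '2', '1'] node is taken from this thread; the surrounding entry/op/hostaddr4 shape is an assumption about what the default action produces.

```perl
use strict;
use warnings;

# Walk a [name, values...] tree and collapse every node whose name is a
# key of %$join_on into [name, "dot.joined.string"].
sub rejoin {
    my ($node, $join_on) = @_;
    return $node unless ref $node eq 'ARRAY';    # leaf string: keep as-is
    my ($name, @kids) = @$node;
    @kids = map { rejoin($_, $join_on) } @kids;
    return [ $name, join('.', @kids) ] if $join_on->{$name};
    return [ $name, @kids ];
}

# Hand-written tree; its outer shape is an assumption (see above).
my $tree = [ 'entry',
             [ 'op', 'add' ],
             [ 'hostaddr4', [ 'ipv4', '192', '0', '2', '1' ] ] ];

my $flat = rejoin($tree, { ipv4 => 1, hostname => 1 });
print join(' ', @{ $flat->[2][1] }), "\n";
# ipv4 192.0.2.1
```

      This keeps the [rulename, string] pair shape of the default action, so code that consumes the tree does not need to know the node was once several tokens. Note that $recce->value returns a reference to the tree, so it would need dereferencing before being passed in.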

      Given that I'm also going to have to handle IPv6 at some point, and the complexities involved in that, I'm probably going to flatten the <hostaddr4> rule to simply <hostaddr> ::= NAME and lay off to a custom action to determine IPv4, IPv6 or hostname in my larger grammar. It would have been nice to formalise support for the three types in the grammar definition, but I'm not going to lose sleep over it.

      That said, if you (or anyone) can shed light on why the order of rules, or use of || vs | makes no difference, I would be keen to understand. Props to the package author for taking the time to document thoroughly, but it's not an easy read for someone for whom this isn't going to be a full-time gig!

Re: Cannot get Marpa::R2 to prioritise one rule over another
by Anonymous Monk on Jan 21, 2021 at 05:07 UTC

      While these regex libraries are useful, I don't think I can fold them directly into a Marpa DSL definition. From what I can tell, I can only use individual character classes, and would have to compose a number of lexeme tokens to match the relative complexity of the IPv4 (and especially IPv6) expressions.

      In my more specific application, I am indeed intending to validate the IPs externally, probably with Net::IP. To that end, I may well alter the DSL just to accept NAME, and leave the exact semantics to a subroutine with an if (is_ip4(...)) { ... } elsif (is_ip6(...)) { ... } else { assume_hostname(...) } flavour.
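      A rough sketch of that dispatch in core Perl, for illustration only. The names is_ip4 and is_ip6 come from the sentence above; classify and both checks are hypothetical stand-ins, and the IPv6 test in particular is a deliberately crude placeholder for whatever validation module ends up being used.

```perl
use strict;
use warnings;

# Dotted quad: four decimal fields of 1-3 digits, each no larger than 255.
sub is_ip4 {
    my ($s) = @_;
    my @octets = $s =~ /\A([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\z/;
    return 0 unless @octets;
    my $in_range = grep { $_ <= 255 } @octets;
    return $in_range == 4 ? 1 : 0;
}

# Crude placeholder, NOT real IPv6 validation: only hex digits and colons,
# with at least two colons.
sub is_ip6 {
    my ($s) = @_;
    return 0 unless $s =~ /\A[0-9A-Fa-f:]+\z/;
    return ($s =~ tr/://) >= 2 ? 1 : 0;
}

# Return a [type, value] pair, falling back to "hostname".
sub classify {
    my ($name) = @_;
    if    (is_ip4($name)) { return [ ipv4     => $name ] }
    elsif (is_ip6($name)) { return [ ipv6     => $name ] }
    else                  { return [ hostname => $name ] }
}

for my $name (qw(192.0.2.1 2001:db8::1 www.example.org 42.perl.org)) {
    my ($type, $value) = @{ classify($name) };
    print "$type $value\n";
}
# ipv4 192.0.2.1
# ipv6 2001:db8::1
# hostname www.example.org
# hostname 42.perl.org
```

      Anything that is neither address form falls through to hostname, so a malformed address such as 999.0.2.1 would be labelled a hostname here unless a further check is added.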

      Thanks for looking in...

      J.
