Lexing: how to define tokens based on "context"

by three18ti (Scribe)
on Oct 16, 2013 at 13:01 UTC ( #1058438=perlquestion )
three18ti has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks

At work we have a giant cluster... of a sudoers file. I was able to reuse code from my last monks post, Iterator to parse multiline string with \\n terminator, since lines can be continued with a "\\n" in sudoers just as in my last project (which is really cool!)

Now that I'm able to grab each line, I'm trying to figure out how to define my tokens. I really like MJD's approach to Lexing and have been referencing HOP::Lexer::Article (as well as Higher Order Perl) and the general wisdom is we should break down each line into the applicable tokens.

Ok, so I think I understand the code as presented (though I have no idea why HOP::Lexer "uses" (imports?) HOP::Stream but doesn't actually use the module... but that's really irrelevant to my use of the module) and I think I get the gist of why we want our tokens in "TYPE", "TOKEN" format.

What I'm really not grokking is the how of defining/identifying tokens.

For my sudoers file, my lines can be one of three types, comment, alias definition, or rule definition. Comments _should_ be easy since the line is just prefixed with a "#" (though I just thought of an edge case where rules have been commented out and might potentially end with a \\n... I may want to parse comments as rules. "should" is a funny word...), so I'm currently trying to tackle parsing alias definitions.

There are four types of aliases, "Host_Alias", "User_Alias", "Runas_Alias", and "Cmnd_Alias". Alias definitions use the format:

    ALIASTYPE ALIASNAME = PARAMETER, PARAMETER, \
        PARAMETER, PARAMETER

For example:

    Host_Alias HA_FOO_GROUP = abc123, eigh456, \
        foo987, bar654
    Runas_Alias RA_FOO_SVCACCT = www-data, ceph, \
        sshd, memcache, \
        xab123

(users don't really need to run as sshd, this is for example purposes, but the accounts that a user would sudo as could be any service or user account)

HOP::Lexer::make_lexer takes an iterator then a list of array refs in the form $label, $pattern, $transform_sub . The keywords are easy since we can just match against text, e.g.: (My::Sudoers::Iterator returns an iterator that grabs a line that is continued with a \\n)

    my $lexer = make_lexer(
        My::Sudoers::Iterator->new('/etc/sudoers'),
        [ 'ALIASTYPE'     => qr/(?:Host_Alias|User_Alias|Runas_Alias|Cmnd_Alias)/, ],
        [ 'COMMA'         => qr/,/, ],
        [ 'DIVIDER'       => qr/=/, ],
        [ 'LINECONTINUER' => qr/\\\n/, ],
    );

(open to better names than "LINECONTINUER"...)
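For readers without the earlier post to hand, here is a rough sketch of what such a line-joining iterator might look like. It is not the poster's My::Sudoers::Iterator; the name make_line_iterator is made up for illustration, and it assumes the backslash-newline pairs are left in the returned chunk so that a rule like LINECONTINUER can still see them.

```perl
use strict;
use warnings;

# Hypothetical helper (not the poster's My::Sudoers::Iterator): returns a
# closure that yields one logical line per call, or undef at end of input.
sub make_line_iterator {
    my ($fh) = @_;
    return sub {
        my $logical = '';
        while ( defined( my $line = <$fh> ) ) {
            $logical .= $line;
            # A backslash right before the newline means the next physical
            # line continues this one, so keep reading.
            return $logical unless $line =~ /\\\n\z/;
        }
        # End of input: hand back any unterminated remainder, then undef.
        return length($logical) ? $logical : undef;
    };
}

# Demo on an in-memory filehandle instead of /etc/sudoers.
my $text = "Host_Alias HA_FOO_GROUP = abc123, eigh456, \\\n"
         . "    foo987, bar654\n"
         . "# a comment\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my $next_line = make_line_iterator($fh);
while ( defined( my $chunk = $next_line->() ) ) {
    print "CHUNK: $chunk";
}
```

Because it returns a plain code ref that yields strings and then undef, something of this shape should be usable wherever a simple iterator is expected.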

Where I'm stuck is how do I define my token for "ALIASNAME" and the "PARAMETER"?

Since the rule name can be any alphanumeric string including _ a simple "\w+" won't suffice. I was thinking something along the lines of:

[ 'ALIASNAME' => qr/ALIASTYPE \s+ (.*+) \s+ =/msx, ]

The big problem here is that HOP::Lexer uses capturing parentheses to extract the token, so the above code will break the module. Additionally, "(.*+)" is typically a bad idea, but I couldn't figure out how to define that better. Also, I don't think HOP::Lexer will be able to "see" tokens in the line that have already been consumed.
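One way around this, sketched below, is to keep context out of the lexer altogether: emit the same generic token for alias names and parameters, then relabel by position afterwards. This is plain Perl using \G and /gc rather than HOP::Lexer, so tokenize, classify, and the WORD label are made-up names, and the keyword list assumes the sudoers manual's spelling (Cmnd_Alias) for command aliases.

```perl
use strict;
use warnings;

# Token table in the same [ LABEL => pattern ] spirit as above. Order matters:
# the keyword rule is tried before the generic WORD rule.
my @rules = (
    [ ALIASTYPE     => qr/(?:Host|User|Runas|Cmnd)_Alias\b/ ],
    [ LINECONTINUER => qr/\\\n/ ],
    [ COMMA         => qr/,/ ],
    [ DIVIDER       => qr/=/ ],
    [ WORD          => qr/[^\s,=\\]+/ ],    # alias names and parameters alike
);

# Hypothetical helper: turn one logical line into a list of [ LABEL, text ] pairs.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    pos($line) = 0;
    TOKEN: while ( pos($line) < length($line) ) {
        next TOKEN if $line =~ /\G[ \t\r\n]+/gc;    # skip plain whitespace
        for my $rule (@rules) {
            my ( $label, $pattern ) = @$rule;
            if ( $line =~ /\G($pattern)/gc ) {
                push @tokens, [ $label, $1 ];
                next TOKEN;
            }
        }
        die "Cannot tokenize at offset " . pos($line) . "\n";
    }
    return @tokens;
}

# Context lives here, not in the lexer: a WORD before DIVIDER is the alias
# name, every WORD after it is a parameter. Continuation markers are dropped.
sub classify {
    my @tokens = @_;
    my ( $seen_divider, @out ) = (0);
    for my $token (@tokens) {
        my ( $label, $text ) = @$token;
        next if $label eq 'LINECONTINUER';
        $seen_divider = 1 if $label eq 'DIVIDER';
        $label = $seen_divider ? 'PARAMETER' : 'ALIASNAME' if $label eq 'WORD';
        push @out, [ $label, $text ];
    }
    return @out;
}

my $line = "Host_Alias HA_FOO_GROUP = abc123, eigh456, \\\n    foo987, bar654\n";
print join( ' ', map { "$_->[0]($_->[1])" } classify( tokenize($line) ) ), "\n";
```

The same split should carry over to a HOP::Lexer token table: none of the patterns needs to look at what came before, so nothing depends on already-consumed input.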

The way I'm currently dealing with aliases is splitting on the equals, then splitting the left half on the spaces to get the alias type and name, and splitting the right half on the commas. I don't think this approach is necessarily appropriate, as it requires further logic to make sense of the mess (as opposed to just lexing the string to obtain tokens... obviously I will need to make use of the tokens at a later point in my application, but I think trying to do too much at once is causing me headaches when debugging edge cases. Parsing tokens will more easily allow me to determine what each piece of the statement means)...

Thanks all for your help.

Re: Lexing: how to define tokens based on "context"
by sundialsvc4 (Monsignor) on Oct 16, 2013 at 13:47 UTC

    That looks like a really nice Lexer, but I must admit that I have become very fond of Parse::RecDescent.   Yes, a full-on parser, driven by a grammar.   I’ve asked that tool to do fairly-ridiculous things, like parsing a conglomeration of SAS programs, Korn shell scripts and Tivoli schedule files ... hundreds of ’em ... and it Just Did It™ with style and grace.   I would basically take that approach instead of building my own program to navigate through the file’s semantic structure, even with a good Lexer by my side.

    Furthermore, you can find an EBNF grammar-description for the Sudoers file here:   http://www.sudo.ws/sudoers.man.html.   No, P::RD does not handle such grammars directly (although other Perl parsers do ...), but it shows you outright what the proper grammar structure ought to be.   I think that this might save you a lot of messy coding.
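To give a feel for how a grammar rule turns into code, here is a minimal hand-written sketch in plain Perl (not Parse::RecDescent) for a simplified rule of roughly the shape `alias ::= ALIASTYPE NAME '=' item ( ',' item )*`. The token labels and the parse_alias name are assumptions for illustration, and the uppercase-only name check reflects my reading of the manual's NAME rule, so verify it against the man page.

```perl
use strict;
use warnings;

# Simplified rule, loosely following the shape of the manual's alias grammar:
#   alias ::= ALIASTYPE NAME '=' item ( ',' item )*
# Tokens are [ LABEL, text ] pairs; the labels here are assumptions.
sub parse_alias {
    my @tokens = @_;

    # Consume the next token, insisting on a particular label.
    my $expect = sub {
        my ($want) = @_;
        my $token = shift @tokens
            or die "Expected $want but ran out of tokens\n";
        die "Expected $want but found $token->[0] ($token->[1])\n"
            unless $token->[0] eq $want;
        return $token->[1];
    };

    my $type = $expect->('ALIASTYPE');
    my $name = $expect->('WORD');

    # Assumption: alias names start with an uppercase letter and contain only
    # uppercase letters, digits and underscores.
    die "Bad alias name '$name'\n" unless $name =~ /\A[A-Z][A-Z0-9_]*\z/;

    $expect->('DIVIDER');

    # item ( ',' item )*
    my @members = ( $expect->('WORD') );
    while ( @tokens && $tokens[0][0] eq 'COMMA' ) {
        shift @tokens;
        push @members, $expect->('WORD');
    }
    die "Unexpected trailing token $tokens[0][0]\n" if @tokens;

    return { type => $type, name => $name, members => \@members };
}

my $alias = parse_alias(
    [ ALIASTYPE => 'Host_Alias' ], [ WORD => 'HA_FOO_GROUP' ], [ DIVIDER => '=' ],
    [ WORD => 'abc123' ], [ COMMA => ',' ], [ WORD => 'eigh456' ],
);
print "$alias->{type} $alias->{name}: @{ $alias->{members} }\n";
```

Each grammar element becomes one "expect" call, which is the part a tool like Parse::RecDescent would generate from the grammar text.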

      >> but I must admit that I have become very fond of Parse::RecDescent

      Wow... that looks... complicated. I'm sure if I study the manual long enough I can figure it out. Looks fairly powerful though. Thanks for the link.

      >> Furthermore, you can find an EBNF grammar-description for the Sudoers file here: http://www.sudo.ws/sudoers.man.html.

      You know... when I first started on this project, I found that page but had no idea what an EBNF was. Thanks for pointing that out again, it certainly does help me define my rules.

Re: Lexing: how to define tokens based on "context"
by marinersk (Chaplain) on Oct 16, 2013 at 14:09 UTC
    Hello three18ti,

    Since the rule name can be any alphanumeric string including _ a simple "\w+" won't suffice.

    I should think something like  [\w\_]+ would work:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $testval = "KUNG_FOO=bar";
        if ($testval =~ /([\w\_]+)/)
        {
            my $keyname = $1;
            print "KEY = [$keyname]\n";
        }
        exit;
        __END__
        C:\Steve\Dev\PerlMonks\P-2013-10-16@0800-Underscore>testregex.pl
        KEY = [KUNG_FOO]

      Since the rule name can be any alphanumeric string including _ a simple "\w+" won't suffice.

      I should think something like [\w\_]+ would work:

      \w already matches underscore.

      Dave.
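A quick check of that point, as a small sketch: \w alone already picks up the underscore, while the hyphen in a value such as www-data is not a word character and has to be added to the class explicitly.

```perl
use strict;
use warnings;

# \w covers letters, digits and underscore, so [\w\_] is just a longer \w.
my $testval = "KUNG_FOO=bar";
my ($plain)   = $testval =~ /(\w+)/;
my ($classed) = $testval =~ /([\w\_]+)/;
print "plain=[$plain] classed=[$classed]\n";

# Hyphens are NOT word characters, which matters for values like www-data.
my ($word)   = "www-data, ceph" =~ /(\w+)/;
my ($member) = "www-data, ceph" =~ /([\w-]+)/;
print "word=[$word] member=[$member]\n";
```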

        D'oh!

        /me slinks off to the closest dark corner to try to hide.

        :-)

Re: Lexing: how to define tokens based on "context" (Marpa)
by Anonymous Monk on Oct 16, 2013 at 14:35 UTC
      open to better names than "LINECONTINUER"

      "CONTINUATION_CHARACTER" ?

        APPEND_NEXT?
Re: Lexing: how to define tokens based on "context"
by VincentK (Beadle) on Oct 16, 2013 at 15:08 UTC

    Hello.

    I realize this is not the Lexing solution you are going after, but I want to throw this out there. I also realize that this is probably not the most elegant solution, but it does seem to work with the input that you specified.

    In any case I hope this helps.

    Input

        Host_Alias HA_FOO_GROUP = abc123, eigh456, \
            foo987, bar654
        #comment line
        Runas_Alias RA_FOO_SVCACCT = www-data, ceph, \
            sshd, memcache, \
            xab123
        #comment line
        User_Alias CA_FOO_CSVACCT = www-foo, ceph, \
            sshd, memcache, \
            xab456

    Output

        C:\monks>perl parse.pl input.txt
        Alias type : Host_Alias
        Alias name : HA_FOO_GROUP
        Values : abc123 eigh456 foo987 bar654
         ****
        Alias type : Runas_Alias
        Alias name : RA_FOO_SVCACCT
        Values : www-data ceph sshd memcache xab123
         ****
        Alias type : User_Alias
        Alias name : CA_FOO_CSVACCT
        Values : www-foo ceph sshd memcache xab456
         ****

        C:\monks>

    Script

        #!/usr/bin/perl
        use strict;
        use warnings;

        die "Error. Usage \'perl parse.pl inputfile.txt\'\n $!" unless $#ARGV == 0;

        my $in_filename   = shift @ARGV;
        my $complete_line = "";

        open(my $IN,"<",$in_filename) || die "Cannot find input '$in_filename'\n $!";

        while(<$IN>)
        {
            chomp;
            next if ($_ eq "");
            next if (m/^\#/);

            # line ends with a backslash: strip it and keep accumulating
            if ($_ =~ m/\\$/)
            {
                $_ =~ s/\\$//;
                $complete_line .= $_;
                next
            }

            $complete_line .= $_;

            # get alias type and name
            my $index_pos = index($complete_line,'=');
            my @alias_type_and_name = split(/ /, substr($complete_line,0,$index_pos-1) );

            # get alias values
            my @alias_values = split(/\,/, substr($complete_line,$index_pos+1) );

            # trim surrounding whitespace from each value
            foreach my $v (@alias_values)
            {
                $v =~ s/^\s+//;
                $v =~ s/\s+$//;
            }

            # print values
            print "Alias type : $alias_type_and_name[0]\n";
            print "Alias name : $alias_type_and_name[1]\n";
            print "Values : @alias_values\n ****\n";

            $complete_line = "";
        }
        close($IN);
