Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Parsing with Perl 6

by jryan (Vicar)
on Jul 05, 2002 at 01:20 UTC ( [id://179555]=perlmeditation: print w/replies, xml ) Need Help??

In Apocalypse 5, Larry Wall stated that our "regex culture has gone wrong in a variety of ways". When I first read that, I thought: "What on earth is he talking about? Perl regex do exactly what I need; so what's wrong with them?" Sometimes you need to see solution to realize how blatant the problem really is.

In general, readability and modularity is stressed to the extreme, yet, not for regexes. In fact, the complete opposite usually is. Far from clear and organized, our regexes tend to look like strings of crap. As a result, anything more advanced than mid-level-logic becomes seemingly impossible to do. Thats where perl 6 comes to the rescue.

Perl 6 has loads of new syntactical changes that help regex writers to clean up their act. New perl 6 regexes are now readable, modular, and even easy to write.

Advantages of Perl 6 Regex

  • Improved Syntax increases readability and writeability

    Larry has completely refactored and redone the perl 6 regex syntax from scratch. From the way you think about spacing, to the very way you structure your logic; almost everything is new. Check out the summary of changes.

  • Rules help create modular designs

    Rules are defined assertions. Similar to, but more powerful than, the (??{}) assertion from perl5. You can think of a rule as a form of limited subroutine. Arguments are passed in the header, and can then be used within the rule. Instead of containing code, the rule contains a regex. Modifiers can also be used with rules.

  • Grammars allow for abstraction

    As rules are to subroutines, grammars are to classes. Groups of rules and assertions can be grouped together into a regex class called a grammar. Grammars have similar attributes to classes; for instance, they create their own namespace, can use inheritance and polymorphism, and group together their members. Grammars make it easy to define common rules.

Enough of the hoopla; lets see it in action.

The Situation

Say you work at a web design establishment. Your current assignment is to catalog all javascript functions that have been used in many of the various sites the company has done over the years. Each function is to be catagorized as either complex or simple; a function is to be considered complex if it contains some sort of loop. The question is, how do you start?

Developing the Logic

The answer is to start mapping out logic and strategy for a parser. More precisely, a recursive parser. One strategy that we can use is to extract each nested set of information that we need, and then extract the next level from that. For instance, against the raw HTML, extract the <script></script> elements; from that, extract all functions; from that, check for the existance of a loop for each match. Each of these definitions can be broken into separate grammars, and common elements among them can be grouped into a base grammar from which each of the sub grammars can inherit from.

Even with the benefits of OO organization, writing a parser is no easy task. Perl6 provides many useful tools to make the job easier, but that doesn't change the fact logic needs to be mapped out before the regex is written. Or, at least, it should be. Anyways, before the parsing begins lets define our definitions:

(Note: the script definition, as well as a few definition to code explanations, are going to be skipped because their explanation is mundane for the purposes of this article; however, they is still encluded in the final code below)

The Definitions

First, a function:

- A header made up of: - the word function - a function name - an arguments list - A body made up of a block of code

Next, our loops.

A while loop:

- A header of: - the word while - a condition - A body of a block of code

A for loop:

- A header of: - the word for - enclosed in parenthesis: - a declaration and then a semicolon - a condition and then a semi-colon - an increment - A body of a block of code

A do loop:

- A header of the word do - A body of a block of code - A footer of: - the word while - a condition

Finding Common Ground

Now that the parsing is planned, the common elements can be located and modularized, such as:

- a header - a body - a block of code - a condition

Each of these common elements can now be placed into our base class. Although the header and body set for each object isn't quitethe same, a generic set can be defined in the base class, and the sub-grammars can overload if needed.

Defining a Block

Descending one level further, the code block and condition need to be defined. A code block is a set of balanced brackets, of which subsets may be nested. Borrowing from perl5's perlre, we can turn:

$block = qr/ \{ (?: (?> [^{}]* ) | (??{ $block }) )* \}

into

rule block { \{ [ <-[{}]>*: | <block> ]* \} }

Defining a condition is not much different than defining a code block. Since validating is of no concern, a condition can be defined as a set of balanced parenthesis; or, as:

rule block { \( [ <-[()]>*: | <block> ]* \) }

Since the block of code and condition are similarly defined, we can bind them to a single assertion that takes 2 arguments that define the delimeters.

rule block ($left,$right) { $left [ <-[$left $right]>*: | <block $left $right> ]* $right }

However, we'll also need to account for several other things to completely define our block; for instance, ignoring our delimiters within comments and strings, so that they don't interupt our block finding. Before any further progress can be made, comments and strings will need to be defined; their definitions will also be placed into the base class.

Defining Quotes

Quotes can be defined as:

- start quote - string of (non-quote or backslashed quote) characters - end quote

Which can easily be translated into:

rule quoted_string ($type) { " [ <-["]>+: | [ <after \\ > " ] ]* " }

However, two types of quoting exists in javascript; single and double. Instead of creating a rule for each, the above rule can easily be generalized into:

rule quoted_string ($type) { $type [ <-[$type]>+: | [ <after \\ > $type ] ]* $type }

Single Quotes can now be called as: <quoted_string '>
Dobule Quotes can be called as <quoted_string ">

We can now bind them as:

rule string { <quoted_string "> | <quoted_string '> }

Ignoring Your Comments

Javascript uses C++ style comments: // for single line, and /* */ for multi-line. Therefore, a comment is:

A single-line comment or A multi-line comment

A single-line comment is easy to define: its nothing more than a // and the rest of the line:

rule single_line_comment { // \N*: }

A multi-line comment is a bit more complicated. It can be defined as:

- A leading /* - A string of characters that includes non asterix and astrix not foll +owed by a / - A closing */

Translated to regex:

rule multi_line_comment { /\* [ <- [*] >* : | <before \*/ > | \* ]* \*/ }

Rounding Out the Block

Now that the sub-definitions are completed, the block definition can be finished. Its a simple matter of adding our new pieces to the alternation:

rule block ($left,$right) { $left [ <-[$left$right"'/]>*: | <comment> | <string> | <block $left $right> ]* $right }

Enough is Enough; Time for Action!

Enough with boring definitions; its pretty easy to match up our parts with the definitions we created earlier. Its time to see the completed parser, which is below. However, a few things to notice:

  • Look closer as to how the script elements are extracted. If you follow backwards, you'll notice that the arrayref is refering to an array that is tied to standard input! Since perl will concatanate all files supplied on the command line into stdin, a huge list of files can be placed into the input stream and perl will take care of them for you. Also note how the rules in the script grammar use <cut> instead of :; this is to snip the infinite string that is created by applying a regex directly on the input stream.
  • Instead of perl 5 style extraction of:

    @matches = $var =~ /(match)/g;

    Perl 6 style extraction is used by binding an array to the match itself:

    $var =~ m:e/@matches:=(match)/;

    This has the added advanatage of not "capturing" non captures as undefined values since perl 6 matches are treated as hypothetical variables.

The Parser

#!/usr/bin/perl -w use strict; grammar javascript { rule single_line_comment { // \N* : } rule multi_line_comment { /\* [ <- [*] >* : | <before \*/ > | \* ]* \*/ } rule comment { <single_line_comment> | <multi_line_comment> } rule quoted_string ($type) { $type [ <- [$type] >+ : | [ <after \\ > $type ] ]* $type } rule string { <quoted_string "> | <quoted_string '> } rule block ($left,$right) { $left [ <- [$left$right"'/] >* : | <comment> | <string> | <block $left $right> ]* $right } rule code { <block \{ \}> } rule cond { <block \( \)> } rule header { } rule body :w { <code> } rule match { <header> <body> } } grammar script is javascript { rule header :wi { \< script <- [>] >* <cut> \> [ \< ! -- ]? } rule body :wi { [ <- [<("'/] >* <cut> | <comment> | <quoted_string "> | <quoted_string '> | <cond> | <!before :: [ \</ ] | <null> > ]* } rule footer :wi { \</ script \> } rule match { <header> <body> <footer> } } grammar function is javascript { rule arg_list :w { \( [ \w+ <!before \) :: , | <null> > ]* \) } rule name { \w+ } rule header :wi { function <name> <arg_list> } } grammar while is javascript { rule header :wi { while <cond> } } grammar for is javascript { rule decl { <- [;] >* : } rule cond { <- [;] >* : } rule incr { <- [)] >* : } rule header :wi { for \( <decl> ; <cond> ; <incr> \) } } grammar do is javascript { rule header :wi { do } rule footer :wi { while <cond> } rule match { <header> <body> <footer> } } module main; my (@scripts, @functions, @html); @html := <>; my $html is from(\@html); $html =~ m:e/ @scripts := (<javascript::script::match>) /; @scripts =~ m:e/ @functions := (<javascript::function::match>) /; for @functions -> $function { given $function { when / <javascript::while::match> | <javascript::for::match> | <javascript::do::match> / { catalog_complex ($_) } default { catalog_simple ($_) } } } sub catalog_simple { ... } sub catalog_complex { ... }

A subset as Perl 5

A few of the main items described in terms of perl 5 regex. I tried doing a direct translation of the parser above with perl 5 objects; however, everything got incredibly messy :(

use re 'eval'; sub quoted_string { my $type = quotemeta shift; return qr/ $type (?: [^$type]+ | (?: (?<= \\ ) $type ) )* $type /x; } my $comment = qr~ (?: // (?> [^\n]* ) ) # single line | (?: # multi line /\* (?: (?> [^*]+ ) | (?= \*/ ) | \* )* \*/ ) ~x; sub block { my $left = quotemeta shift; my $right = quotemeta shift; my $block = qr! $left (?: (?> [^$left$right"'/] *) | (??{$comment}) | (??{ quoted_string( qq(") ) }) | (??{ quoted_string( qq(') ) }) | (??{$block}) )* $right !x; return $block; } my $arg_list = qr/ \( (?: [^,)]* (,|) )* \) /x; # ugly! my $block; my $code = qr/ (??{$block = block ( "\{" , "\}" ); "$block"}) /x; my $cond = qr/ (??{$block = block ( "(" , ")" ); "$block"}) /x;

Update: Removed the \Q\E and replaced with quotemeta in the dynamic assertions in the perl5 section. Also, there was a paste error with $block in the perl5 section that would have caused it to break had it been used.

Update: Fixed a mistake in the multi-line comment regex that was pointed out by kelen.

Update: Updated Perl6 code to accurate reflect changes in the language since this article was written (4-27-03).

Replies are listed 'Best First'.
Re: Parsing with Perl 6
by Juerd (Abbot) on Jul 05, 2002 at 05:44 UTC

    Wow, very impressive. ++!

    // <-[\n]>*:

    I'd really want to take advantage of the new \N:

    // \N*

    foreach my $function (@functions)

    In Perl6, foreach will just be for. The C-style for loop will be called loop. And the for syntax will slightly change:

    for @functions -> my $function { ... }

    rule quoted_string ($type) { $type [ <-[$type]>*: | [ <after \\ > $type ] ]* $type }

    Why not already have both types there?

    rule quoted_string { $delimiter := ( <['"]> ) [ <!before $delimiter> . | <after \\> $delimiter ]* $delimiter }
    Hmm - In current regex stuff, I'd probably have written: /(['"]) (?: (?! \\ | \1) . | \\ . )* \1/x. Why not use something like that? (Or is it just a matter of taste?)
    rule quoted_string { $delimiter := ( <['"]> ) [ <!before \\ | $delimiter> . | \\ . ] $delimiter }

    - Yes, I reinvent wheels.
    - Spam: Visit eurotraQ.
    

      Ah, you are completely right about \N, of course. I'm just a blockhead ;(

      Concerning quotes: I think that the way I structured mine is easier to convert from logic. For instance, I defined a quoted string as:

      - start quote - string of (non-quote or backslashed quote) characters - end quote

      Which I then translated into (comments added this time):

      rule quoted_string { " # start quote [ # string <-["]>+: # non quote characters | # or [ <after \\ > " ] # backslashed quote character ]* # end string " # end quote }

      It was then shown how to make it "dynamic" by simply adding an argument and subsituting that argument in for the quote delimeter. While yours is much more elegant and efficient, I thought that my example was easier to follow in the current context.

Re: Parsing with Perl 6
by FoxtrotUniform (Prior) on Jul 05, 2002 at 05:29 UTC

    <keanu>Whoa.</keanu>

    Thanks for the more complete writeup of the new regex syntax and semantics. I found it difficult to get a "big picture" idea of how things are expected to work from the Apocalypse, even though that made for some excellent reading. This should keep me satisfied until Exegesis 5 comes out.

    I'm going to be very interested to find out how well these grammar-based regular expressions play with Perl 6's increased focus on functional and logical programming (or at least my perception of increased focus). In my admittedly limited experience, thinking in "functional" terms, in "logical" terms, and in "grammar" terms require similar mindsets and methods of reasoning, and I hope that the Perl 6 developers can unify (heh) these facets of the language.

    Update: For instance, there are a few hints on dev.perl.org about the Perl 6 parser being retargetable, mungeable, and tied in with the regex engine. With luck, this'll make it easier to treat code as data, which seems to have worked out fairly well for Lisp. (And I'd love to see it in Perl, despite the fact that dealing with Perl code is a much hairier task than dealing with a bunch of (possibly nested (like this)) lists.)

    --
    The hell with paco, vote for Erudil!
    :wq

Re: Parsing with Perl 6
by erikharrison (Deacon) on Jul 05, 2002 at 03:46 UTC

    For lots of reasons, this is very impressive. As I am still learning Perl 5, I've had a hard time internalizing Perl 6 regexen. Overall, you've shown the nay sayers and doubters what can be done with the new syntax (and given me something to analyze while I wait for Exegesis 5 :-).

    A nit, though. I don't know diddly 'bout JavaScript, but are backslashed backslashes allowed in strings? If so, this grammar doesn't allow for this.

    Cheers,
    Erik

    Light a man a fire, he's warm for a day. Catch a man on fire, and he's warm for the rest of his life. - Terry Pratchet

      A nit, though. I don't know diddly 'bout JavaScript, but are backslashed backslashes allowed in strings? If so, this grammar doesn't allow for this.

      Sure it does, try the perl 5 version that I have at the end:

      sub quoted_string { my $type = quotemeta shift; return qr/ $type (?: [^$type]+ | (?<= \\ ) $type )* $type /x; } my $data = qq(This "is a quoted string" and so is "this" and this one "has \\\\ \\" backslashes" and other unrelated stuff); my @matches = $data =~ / ((??{ quoted_string( qq(") ) })) /xg; print join("\n\n",@matches);

      Also, here are a few Javascript code snippets for those unfamiliar with Javascript:

      <script> function Some_Function (arg1, arg2) { } do { } while (1) for (i=0; i < 10; i++) { } while (1) { } </script>

      Its pretty much just like C, for those who are familiar with that (except Javascript variables don't have types).

        Right... but I am not sure that you cover this case:

          q{"this string ends with a backslash \\"}

        does it? That is, \\" and \" are different.
Re: Parsing with Perl 6
by BrowserUk (Patriarch) on Jul 05, 2002 at 06:29 UTC

    Maybe all that time I spent learning "Backus Naur Form" in college will finally find some utility.

    Problem is: do I continue trying to understand how to use Perl 5 regexs? With my rate of progress to date, I'll just have got comfortable with them and the knowledge will be become redundant. Still, thats nothing new in the 'puter industry.

      Problem is: do I continue trying to understand how to use Perl 5 regexs? With my rate of progress to date, I'll just have got comfortable with them and the knowledge will be become redundant.

      Yes! Don't stop or pause your learning process. It will take a long while before Perl 6 is actually here, and you'll have to use Perl 5 regexes until then. And of course, knowing Perl 5 regexes makes learning Perl 6 ones easier.

      - Yes, I reinvent wheels.
      - Spam: Visit eurotraQ.
      

Perl6: too bad non-greediness is not made the default
by stefp (Vicar) on Jul 05, 2002 at 13:01 UTC
    The only thing I regret is that matching non greediness has not been made the default. Larry is discussing this in perl6-language. His concern is that greediness is disconcerting for beginners but he laters goes on saying that non greediness is too. Strangely his favorite argument "Huffman Coding" is not invoked here. Personally, I add the "?" modifier by default and supress it when it would break my regexep... like in the exemple given by Larry
    my ($num) = /(\d*)/; # greediness needed

    So, because in my use, it is prevalent, I think that non greediness should be made the default but I have not checked other people code to see if they use non greediness more than greediness. Also I don't know if changing the default would affect the ratio of greediness/non greediness usage claimed to be 10/1 for perl 5.8 in an answer to this nodelet. Anyway, this ratio seems to show that criteria of Huffman coding would lead to a different grammar decision choice for me than to most of people. Also, I don't think that new feature introduced in perl6 regexp would change this pattern.

    -- stefp -- check out TeXmacs wiki

      I try to write my regexen backtracking-free, as, I believe, anyone should. And in that case greediness is very desired and useful most of the time. Non-greediness basically means that you match broader than you really need to - it works because you "forwardtrack", you gobble the string one submatch at a time. It is better to match more narrowly and greedily, since a greedy match will gobble up a lot of the string in one fell swoop and do less superfluous searching. In simple cases the regex optimizer is smart enough to simplify a non-greedy match into a Boyer-Moore search, but when you're working with a complex regex you really want to match narrowly and greedily.

      Regexen are a tricky art.

      Makeshifts last the longest.

        I beg to differ. My general principle is to be lax on what I receive and strict on what I emit. So I try to write my regexen as lax as possible and to "synchronize" my match on characters/sequences I am sure will be present in the input. This means that I often use .*? in my regexen. Using the greedy counterpart will lead to a lot of backtracking and more so geometrically with the number of .*?. Also I incrementally build my regexen testing them on samples: non greedy match also is less a nuisance here when examining the matches.

        Regexen are a tricky art and I like to abuse it at (*) the risk of being called demented. (*) And I beg to disagree with Felix Gallo, France is the lang of semiology, not of semiotics... and hair splitting too, BTW. And, on a related field, the main tagmemics foray in France is collateral to the introduction of american camelides but has yet to appear widely in French. tagmemique is indeed French neologism that googlewhacks until the present node referencing of a node

        -- stefp (qui aime couper les cheveux en quatre)-- check out TeXmacs wiki

      The only thing I regret is that matching non greediness has not been made the default.

      Non-greediness is only useful if you have something following the non-greediness. For example, /(\d*)/ would not be useful at all: it would always succeed, matching and capturing an empty string. /(\d+)/ would be equivalent to /(\d)/, because a non-greedy expression doesn't take more than the absolute minimum.

      If you use non-greedy quantifiers now, think about the efficiency that is gained by re-writing for greediness. Suppose you have /"(.*?)"/ - it can be written as /"([^"]*)"/, which is much more efficient. With backtracking disabled, as jryan did in his example (with the new : meta character), it's even more efficient (in case the subject string is not well formed).

      - Yes, I reinvent wheels.
      - Spam: Visit eurotraQ.
      

        I guess TIMTOWTDI. It is a question of emphasis, yours is on performance, mine is on readability and incremental building.

        If I combine my idea of (ab)?using .* with the concept that one can hack perl6 to make one's own language, I would define a modifier that would turn consecutive non breaking line spaces into an implicit .*? subregex with ? having its traditional meaning of non greediness. I am not sure it would be a so good idea though. Will be a good exercice anyway.

        -- stefp -- check out TeXmacs wiki

      I do not think your high use of non-greediness is anywhere near the norm. A rough and ready regex tromp through the .pl and .pm files in the 5.8.0 distribution shows greedy usage to exceed non-greedy usage by around 10 to 1. So the huffman argument would appear to be against you on this one.
        But does greedy usage exceed non-greedy usage because people wanted to be greedy, or because they didn't care and omitting the ? is easier?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://179555]
Approved by Zaxo
Front-paged by jeffa
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others taking refuge in the Monastery: (5)
As of 2024-03-28 08:35 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found