In Apocalypse 5, Larry Wall stated that our "regex culture has gone wrong in a variety of ways". When I first read that, I thought: "What on earth is he talking about? Perl regex do exactly what I need; so what's wrong with them?" Sometimes you need to see solution to realize how blatant the problem really is.

In general, readability and modularity is stressed to the extreme, yet, not for regexes. In fact, the complete opposite usually is. Far from clear and organized, our regexes tend to look like strings of crap. As a result, anything more advanced than mid-level-logic becomes seemingly impossible to do. Thats where perl 6 comes to the rescue.

Perl 6 has loads of new syntactical changes that help regex writers to clean up their act. New perl 6 regexes are now readable, modular, and even easy to write.

Advantages of Perl 6 Regex

Enough of the hoopla; lets see it in action.

The Situation

Say you work at a web design establishment. Your current assignment is to catalog all javascript functions that have been used in many of the various sites the company has done over the years. Each function is to be catagorized as either complex or simple; a function is to be considered complex if it contains some sort of loop. The question is, how do you start?

Developing the Logic

The answer is to start mapping out logic and strategy for a parser. More precisely, a recursive parser. One strategy that we can use is to extract each nested set of information that we need, and then extract the next level from that. For instance, against the raw HTML, extract the <script></script> elements; from that, extract all functions; from that, check for the existance of a loop for each match. Each of these definitions can be broken into separate grammars, and common elements among them can be grouped into a base grammar from which each of the sub grammars can inherit from.

Even with the benefits of OO organization, writing a parser is no easy task. Perl6 provides many useful tools to make the job easier, but that doesn't change the fact logic needs to be mapped out before the regex is written. Or, at least, it should be. Anyways, before the parsing begins lets define our definitions:

(Note: the script definition, as well as a few definition to code explanations, are going to be skipped because their explanation is mundane for the purposes of this article; however, they is still encluded in the final code below)

The Definitions

First, a function:

- A header made up of: - the word function - a function name - an arguments list - A body made up of a block of code

Next, our loops.

A while loop:

- A header of: - the word while - a condition - A body of a block of code

A for loop:

- A header of: - the word for - enclosed in parenthesis: - a declaration and then a semicolon - a condition and then a semi-colon - an increment - A body of a block of code

A do loop:

- A header of the word do - A body of a block of code - A footer of: - the word while - a condition

Finding Common Ground

Now that the parsing is planned, the common elements can be located and modularized, such as:

- a header - a body - a block of code - a condition

Each of these common elements can now be placed into our base class. Although the header and body set for each object isn't quitethe same, a generic set can be defined in the base class, and the sub-grammars can overload if needed.

Defining a Block

Descending one level further, the code block and condition need to be defined. A code block is a set of balanced brackets, of which subsets may be nested. Borrowing from perl5's perlre, we can turn:

$block = qr/ \{ (?: (?> [^{}]* ) | (??{ $block }) )* \}

into

rule block { \{ [ <-[{}]>*: | <block> ]* \} }

Defining a condition is not much different than defining a code block. Since validating is of no concern, a condition can be defined as a set of balanced parenthesis; or, as:

rule block { \( [ <-[()]>*: | <block> ]* \) }

Since the block of code and condition are similarly defined, we can bind them to a single assertion that takes 2 arguments that define the delimeters.

rule block ($left,$right) { $left [ <-[$left $right]>*: | <block $left $right> ]* $right }

However, we'll also need to account for several other things to completely define our block; for instance, ignoring our delimiters within comments and strings, so that they don't interupt our block finding. Before any further progress can be made, comments and strings will need to be defined; their definitions will also be placed into the base class.

Defining Quotes

Quotes can be defined as:

- start quote - string of (non-quote or backslashed quote) characters - end quote

Which can easily be translated into:

rule quoted_string ($type) { " [ <-["]>+: | [ <after \\ > " ] ]* " }

However, two types of quoting exists in javascript; single and double. Instead of creating a rule for each, the above rule can easily be generalized into:

rule quoted_string ($type) { $type [ <-[$type]>+: | [ <after \\ > $type ] ]* $type }

Single Quotes can now be called as: <quoted_string '>
Dobule Quotes can be called as <quoted_string ">

We can now bind them as:

rule string { <quoted_string "> | <quoted_string '> }

Ignoring Your Comments

Javascript uses C++ style comments: // for single line, and /* */ for multi-line. Therefore, a comment is:

A single-line comment or A multi-line comment

A single-line comment is easy to define: its nothing more than a // and the rest of the line:

rule single_line_comment { // \N*: }

A multi-line comment is a bit more complicated. It can be defined as:

- A leading /* - A string of characters that includes non asterix and astrix not foll +owed by a / - A closing */

Translated to regex:

rule multi_line_comment { /\* [ <- [*] >* : | <before \*/ > | \* ]* \*/ }

Rounding Out the Block

Now that the sub-definitions are completed, the block definition can be finished. Its a simple matter of adding our new pieces to the alternation:

rule block ($left,$right) { $left [ <-[$left$right"'/]>*: | <comment> | <string> | <block $left $right> ]* $right }

Enough is Enough; Time for Action!

Enough with boring definitions; its pretty easy to match up our parts with the definitions we created earlier. Its time to see the completed parser, which is below. However, a few things to notice:

The Parser

#!/usr/bin/perl -w use strict; grammar javascript { rule single_line_comment { // \N* : } rule multi_line_comment { /\* [ <- [*] >* : | <before \*/ > | \* ]* \*/ } rule comment { <single_line_comment> | <multi_line_comment> } rule quoted_string ($type) { $type [ <- [$type] >+ : | [ <after \\ > $type ] ]* $type } rule string { <quoted_string "> | <quoted_string '> } rule block ($left,$right) { $left [ <- [$left$right"'/] >* : | <comment> | <string> | <block $left $right> ]* $right } rule code { <block \{ \}> } rule cond { <block \( \)> } rule header { } rule body :w { <code> } rule match { <header> <body> } } grammar script is javascript { rule header :wi { \< script <- [>] >* <cut> \> [ \< ! -- ]? } rule body :wi { [ <- [<("'/] >* <cut> | <comment> | <quoted_string "> | <quoted_string '> | <cond> | <!before :: [ \</ ] | <null> > ]* } rule footer :wi { \</ script \> } rule match { <header> <body> <footer> } } grammar function is javascript { rule arg_list :w { \( [ \w+ <!before \) :: , | <null> > ]* \) } rule name { \w+ } rule header :wi { function <name> <arg_list> } } grammar while is javascript { rule header :wi { while <cond> } } grammar for is javascript { rule decl { <- [;] >* : } rule cond { <- [;] >* : } rule incr { <- [)] >* : } rule header :wi { for \( <decl> ; <cond> ; <incr> \) } } grammar do is javascript { rule header :wi { do } rule footer :wi { while <cond> } rule match { <header> <body> <footer> } } module main; my (@scripts, @functions, @html); @html := <>; my $html is from(\@html); $html =~ m:e/ @scripts := (<javascript::script::match>) /; @scripts =~ m:e/ @functions := (<javascript::function::match>) /; for @functions -> $function { given $function { when / <javascript::while::match> | <javascript::for::match> | <javascript::do::match> / { catalog_complex ($_) } default { catalog_simple ($_) } } } sub catalog_simple { ... } sub catalog_complex { ... }

A subset as Perl 5

A few of the main items described in terms of perl 5 regex. I tried doing a direct translation of the parser above with perl 5 objects; however, everything got incredibly messy :(

use re 'eval'; sub quoted_string { my $type = quotemeta shift; return qr/ $type (?: [^$type]+ | (?: (?<= \\ ) $type ) )* $type /x; } my $comment = qr~ (?: // (?> [^\n]* ) ) # single line | (?: # multi line /\* (?: (?> [^*]+ ) | (?= \*/ ) | \* )* \*/ ) ~x; sub block { my $left = quotemeta shift; my $right = quotemeta shift; my $block = qr! $left (?: (?> [^$left$right"'/] *) | (??{$comment}) | (??{ quoted_string( qq(") ) }) | (??{ quoted_string( qq(') ) }) | (??{$block}) )* $right !x; return $block; } my $arg_list = qr/ \( (?: [^,)]* (,|) )* \) /x; # ugly! my $block; my $code = qr/ (??{$block = block ( "\{" , "\}" ); "$block"}) /x; my $cond = qr/ (??{$block = block ( "(" , ")" ); "$block"}) /x;

Update: Removed the \Q\E and replaced with quotemeta in the dynamic assertions in the perl5 section. Also, there was a paste error with $block in the perl5 section that would have caused it to break had it been used.

Update: Fixed a mistake in the multi-line comment regex that was pointed out by kelen.

Update: Updated Perl6 code to accurate reflect changes in the language since this article was written (4-27-03).


In reply to Parsing with Perl 6 by jryan

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":