PerlMonks
Parsing with Perl 6
by jryan (Vicar) on Jul 05, 2002 at 01:20 UTC ( [id://179555]=perlmeditation )
In Apocalypse 5, Larry Wall stated that our "regex culture has gone wrong in a variety of ways". When I first read that, I thought: "What on earth is he talking about? Perl regexes do exactly what I need, so what's wrong with them?" Sometimes you need to see the solution to realize how blatant the problem really is.
In general, readability and modularity are stressed to the extreme, yet not for regexes. In fact, the complete opposite is usually true. Far from clear and organized, our regexes tend to look like strings of crap. As a result, anything more advanced than mid-level logic becomes seemingly impossible to do. That's where Perl 6 comes to the rescue. Perl 6 has loads of syntactical changes that help regex writers clean up their act. Perl 6 regexes are readable, modular, and even easy to write.
Advantages of Perl 6 Regex
Enough of the hoopla; let's see it in action.
The Situation
Say you work at a web design establishment. Your current assignment is to catalog all JavaScript functions that have been used in the various sites the company has done over the years. Each function is to be categorized as either complex or simple; a function is considered complex if it contains some sort of loop. The question is, how do you start?
Developing the Logic
The answer is to start mapping out logic and strategy for a parser. More precisely, a recursive parser. One strategy is to extract each nested set of information that we need, and then extract the next level from that. For instance, from the raw HTML, extract the <script></script> elements; from those, extract all functions; then, for each function, check for the existence of a loop. Each of these definitions can be broken into a separate grammar, and common elements among them can be grouped into a base grammar from which each of the sub-grammars can inherit.
Even with the benefits of OO organization, writing a parser is no easy task. Perl 6 provides many useful tools to make the job easier, but that doesn't change the fact that the logic needs to be mapped out before the regex is written. Or, at least, it should be. Anyway, before the parsing begins, let's lay out our definitions:
(Note: the script definition, as well as a few definition-to-code explanations, are going to be skipped because their explanation is mundane for the purposes of this article; however, they are still included in the final code below.)
The Definitions
First, a function:
Next, our loops. A while loop:
A for loop:
A do loop:
Finding Common Ground
Now that the parsing is planned, the common elements can be located and modularized, such as:
Each of these common elements can now be placed into our base class. Although the header and body set for each object isn't quite the same, a generic set can be defined in the base class, and the sub-grammars can override it if needed.
Defining a Block
Descending one level further, the code block and condition need to be defined. A code block is a set of balanced brackets, of which subsets may be nested. Borrowing from Perl 5's perlre, we can turn:
into
Defining a condition is not much different from defining a code block. Since validation is of no concern, a condition can be defined as a set of balanced parentheses; or, as:
Since the code block and the condition are defined so similarly, we can bind them to a single assertion that takes two arguments defining the delimiters.
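As a rough illustration of the "one assertion, two delimiter arguments" idea, here is a minimal Python sketch (a Python analogue, not the Perl 6 rule from this article). The name `match_balanced` and its return convention are assumptions made for this example. It recurses for each nested group, mirroring the recursive definition of a block, and does not yet know about comments or strings.

```python
def match_balanced(text, pos, open_delim, close_delim):
    """Return the index just past the balanced group that opens at `pos`.

    Returns -1 if `pos` is not at an opening delimiter or the group
    is never closed.
    """
    if not text.startswith(open_delim, pos):
        return -1
    pos += len(open_delim)
    while pos < len(text):
        if text.startswith(close_delim, pos):
            # Found the delimiter that closes this group.
            return pos + len(close_delim)
        if text.startswith(open_delim, pos):
            # A nested group: recurse, then continue after it.
            pos = match_balanced(text, pos, open_delim, close_delim)
            if pos == -1:
                return -1
        else:
            pos += 1
    return -1


source = "if (a && (b || c)) { x(); }"
cond_start = source.index("(")
cond_end = match_balanced(source, cond_start, "(", ")")
condition = source[cond_start:cond_end]
```

The same function serves for a code block by passing "{" and "}" instead. Very deep nesting would hit Python's recursion limit, which is acceptable for a sketch.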
However, we'll also need to account for several other things to completely define our block; for instance, ignoring delimiters that appear within comments and strings, so that they don't interrupt our block finding. Before any further progress can be made, comments and strings need to be defined; their definitions will also be placed into the base class.
Defining Quotes
Quotes can be defined as:
Which can easily be translated into:
However, two types of quoting exist in JavaScript: single and double. Instead of creating a rule for each, the above rule can easily be generalized into:
Single quotes can now be called as: <quoted_string '>
We can now bind them as:
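The generalization over the quote character can be sketched in Python as a small pattern factory (an analogue of the idea, not the article's Perl 6 rule). The name `quoted_string` is borrowed from the assertion above; the assumption that a backslash escapes the next character is mine.

```python
import re


def quoted_string(quote):
    """Build a pattern for a string delimited by `quote`.

    Assumes a backslash escapes the following character, so an
    escaped quote does not end the string.
    """
    q = re.escape(quote)
    # Opening quote, then escaped characters or anything that is
    # neither the quote nor a backslash, then the closing quote.
    return re.compile(rf"{q}(?:\\.|[^{q}\\])*{q}", re.DOTALL)


# "Binding" the two concrete cases to names.
SINGLE_QUOTED = quoted_string("'")
DOUBLE_QUOTED = quoted_string('"')
```

Building both patterns from one factory keeps the escape handling in a single place, which is the same motivation as parameterizing the rule.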
Ignoring Your Comments
JavaScript uses C++-style comments: // for single-line, and /* */ for multi-line. Therefore, a comment is:
A single-line comment is easy to define: it's nothing more than a // and the rest of the line:
A multi-line comment is a bit more complicated. It can be defined as:
Translated to regex:
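For comparison, the two comment forms described above can be written with Python's `re` module roughly as follows. This is a sketch, not the article's Perl 6 or Perl 5 regex; the names `LINE_COMMENT`, `BLOCK_COMMENT`, and `find_comments` are made up for the example.

```python
import re

# "//" followed by the rest of the line; the newline is not consumed.
LINE_COMMENT = r"//[^\n]*"
# "/*", then as little as possible, up to the first "*/".
BLOCK_COMMENT = r"/\*.*?\*/"
# DOTALL lets "." cross newlines inside a multi-line comment.
COMMENT = re.compile(LINE_COMMENT + "|" + BLOCK_COMMENT, re.DOTALL)


def find_comments(source):
    """Return every comment in `source`, in order of appearance."""
    return [m.group(0) for m in COMMENT.finditer(source)]
```

On its own this pattern would also fire on a `//` inside a string literal (a URL, for example), which is one reason strings have to be recognized by the same scanner.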
Rounding Out the Block
Now that the sub-definitions are completed, the block definition can be finished. It's a simple matter of adding our new pieces to the alternation:
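A hedged Python sketch of that finished alternation: comments and strings are listed alongside the two delimiters, so a delimiter inside either one is consumed as part of the skipped text. The helper names (`block_pattern`, `match_block`) and the backslash-escape assumption for strings are mine, not the article's.

```python
import re

DOUBLE_QUOTED = r'"(?:\\.|[^"\\])*"'
SINGLE_QUOTED = r"'(?:\\.|[^'\\])*'"


def block_pattern(open_delim, close_delim):
    """One alternation: comment | string | opener | closer."""
    return re.compile(
        "|".join([
            r"(?P<comment>//[^\n]*|/\*.*?\*/)",
            r"(?P<string>" + DOUBLE_QUOTED + "|" + SINGLE_QUOTED + ")",
            r"(?P<open>" + re.escape(open_delim) + ")",
            r"(?P<close>" + re.escape(close_delim) + ")",
        ]),
        re.DOTALL,
    )


def match_block(text, start, open_delim="{", close_delim="}"):
    """Return the index just past the block opening at `start`, or -1."""
    if not text.startswith(open_delim, start):
        return -1
    depth = 0
    for tok in block_pattern(open_delim, close_delim).finditer(text, start):
        if tok.lastgroup == "open":
            depth += 1
        elif tok.lastgroup == "close":
            depth -= 1
            if depth == 0:
                return tok.end()
        # Comments and strings fall through: they are simply skipped.
    return -1
```

Calling `match_block(text, start, "(", ")")` gives the condition matcher from the same code. JavaScript regex literals are not handled here, so a brace inside one would still be counted.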
Enough is Enough; Time for Action!
Enough with boring definitions; it's pretty easy to match up our parts with the definitions we created earlier. It's time to see the completed parser, which is below. However, a few things to notice:
The Parser
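To make the overall task concrete, here is a compact Python sketch of the same job: pull out the <script> elements, find each top-level named function, and mark it complex if a loop keyword appears in its body. It swaps in a single-pass token scanner for the inherited grammars described in this article, and every name in it (`TOKEN`, `SCRIPT`, `classify_functions`, `classify_page`) is an assumption of the sketch.

```python
import re

# Comments and strings come first so that braces or keywords inside
# them are consumed as part of the skipped text.
TOKEN = re.compile(
    r"""
      (?P<skip>  //[^\n]* | /\*.*?\*/ | "(?:\\.|[^"\\])*" | '(?:\\.|[^'\\])*' )
    | (?P<func>  \bfunction\s+(?P<name>[A-Za-z_$][\w$]*) )
    | (?P<loop>  \b(?:while|for|do)\b )
    | (?P<open>  \{ )
    | (?P<close> \} )
    """,
    re.DOTALL | re.VERBOSE,
)

SCRIPT = re.compile(r"<script\b[^>]*>(.*?)</script\s*>", re.DOTALL | re.IGNORECASE)


def classify_functions(code):
    """Map each top-level named function to "simple" or "complex"."""
    results = {}
    current = None   # top-level function whose body is being scanned
    pending = None   # function name seen, waiting for its opening brace
    depth = 0        # current brace nesting depth
    for tok in TOKEN.finditer(code):
        if tok.group("skip"):
            continue
        if tok.group("func"):
            if depth == 0:
                pending = tok.group("name")
        elif tok.group("loop"):
            if current is not None:
                results[current] = "complex"
        elif tok.group("open"):
            if depth == 0 and pending is not None:
                current, pending = pending, None
                results.setdefault(current, "simple")
            depth += 1
        elif tok.group("close"):
            depth = max(depth - 1, 0)
            if depth == 0:
                current = None
    return results


def classify_page(html):
    """Classify the functions in every <script> element of `html`."""
    results = {}
    for script in SCRIPT.finditer(html):
        results.update(classify_functions(script.group(1)))
    return results
```

Nested and anonymous functions are folded into the enclosing top-level function, so a loop inside a nested function makes the outer one complex. That is a simplification of this sketch, not a claim about the original parser.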
A subset as Perl 5
A few of the main items, described in terms of Perl 5 regex. I tried doing a direct translation of the parser above with Perl 5 objects; however, everything got incredibly messy :(
Update: Removed the \Q\E and replaced it with quotemeta in the dynamic assertions in the Perl 5 section. Also, there was a paste error with $block in the Perl 5 section that would have caused it to break had it been used.
Update: Fixed a mistake in the multi-line comment regex that was pointed out by kelen.
Update: Updated the Perl 6 code to accurately reflect changes in the language since this article was written (4-27-03).