|There's more than one way to do things|
Structural Regular Expressionsby Erez (Priest)
|on Feb 02, 2008 at 13:12 UTC||Need Help??|
(This post is based on a lightning talk I gave on the 2007 Israeli Perl WorkShop).
The term "Structural Regular Expressions" was coined by Rob Pike in an article by the same name, written during the initial release of the text editor Sam, and published on May, 1987.
In this sense, a file can be delimited by many elements, determining its structure. A file may consist of paragraphs, that are delimited by empty lines, or only by a tab character at the beginning. A record file can be delimited by special characters, like commas, at-signs, etc. C-style source code has lines that are delimited by semi-colons and curly brackets, and so on.
Using structural regular expression, we can split a file according to its structure, using regular expressions to identify each part of the structure.
For this, I need 2 commands which I'll call "extract" and "given".
by &-delimited record and will return only the records that inlcude authors names.
2 complementary commands can make the process easier, "delimit" (returns all text that doesn't include the regex) and "ignore" (returns all previously extracted text that doesn't include the regex).
For example, taking the full text of David Copperfield by Charles Dickens, I wish to locate every line where the name 'Miss Trotwood' appears (Dialogue lines might span over a line break). To accomplish this I split the file by paragraphs, then locate those that have dialogue (delimited by single quotes), and search those that have this name.
The structural specification for this are: split the file by paragraphs - sequential non-empty lines (/^\n/); extract only text that is located between single quotes ('.' ); return only the lines that have the name (Miss Trotwood) in them.
In Sam, this will be written as:
0,$ y/^\n/ x/'.' / g/Miss Trotwood/.
To make it even more interesting, substitute every appearance of Copperfield with Trotwood, but only if the name 'Miss Trotwood' appear in a line in that paragraph, and only Copperfields that appear outside dialog lines:
0,$ y/^\n/ g/Miss Trotwood/ y/'.' / x/Copperfield/c/Trotwood/
(c substitutes the end result with the regex after it)
These simple examples show the basic concept behind structural regular expressions, or, more to the point, of the way these force a structure in a piping-like way, which is a metaphor familiar to any *nix user.
How does all that apply to Perl?
The regular expression features in Perl have more than enough operators and elements that allows us to use this technique. For starters, assigning the $/ variable allows us to process the file's content in chunks that are delimited by a delimiter of our choice. For instance, assigning an empty string to $/ ($/ = '') will treat empty lines as an input terminator, similar to Sam's y/^\n/ (keeping in mind that we can't assign a regular expression to $/). Once $/ is set, we can perform more operations using the while (<>) idiom (or similar) and continue processing the text. Each iteration of the loop can be matched (=~, or !~ m//, similar to Sam's x,y) against a regular expression, split according to another (Sam's v operator), and so on. Similarly, inside the match, lookarounds, backreferences and substitution operators can be used to further dissect and locate the areas we need until the final result is located, reported, returned or removed.
While this probably appears as no-news to most Perl programmers, it is the concept I find interesting most of all.