Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Structural Regular Expressions

by Erez (Priest)
on Feb 02, 2008 at 13:12 UTC ( #665719=perlmeditation: print w/replies, xml ) Need Help??

(This post is based on a lightning talk I gave on the 2007 Israeli Perl WorkShop).

The term "Structural Regular Expressions" was coined by Rob Pike in an article by the same name, written during the initial release of the text editor Sam, and published on May, 1987.
In the article, Pike describes an interesting problem he encountered using Unix tools, most notably sed and awk. His initial argument was, that while both Unix and C treat text as a byte stream (that is, an array of characters), the actual tools used on Unix handle text only as an array of lines. This metaphor, while fitting well with the way we humans view text, has many shortcomings when it is used for text processing and data extraction.
Pike started looking for a way to logically divide a file according to its structure (hence the name) using Regular Expressions, and continue processing each part until the needed data was found, extracted, changed, or deleted.

In this sense, a file can be delimited by many elements, determining its structure. A file may consist of paragraphs, that are delimited by empty lines, or only by a tab character at the beginning. A record file can be delimited by special characters, like commas, at-signs, etc. C-style source code has lines that are delimited by semi-colons and curly brackets, and so on.

Using structural regular expression, we can split a file according to its structure, using regular expressions to identify each part of the structure.

For this, I need 2 commands which I'll call "extract" and "given".
Extract returns every appearance of a certain regular expression. For instance, extract/^&[^&]+/ will return every instance of text that begins with an ampersand and the following non-ampersand text, similar to m/(^&[^&]+)/ (a capture-all match).
Given returns all previously extracted text that include a certain regular expression, so given/Author:/ will filter out all extracted matches that don't have "Author:" in them. Both commands, used in segue, will split the following text:

&Book: Programming Perl &Author: Larry Wall&Author: Tom Christiansen&Author: Jon Orwant &ISBN: 0-596-00027-8

by &-delimited record and will return only the records that inlcude authors names.

2 complementary commands can make the process easier, "delimit" (returns all text that doesn't include the regex) and "ignore" (returns all previously extracted text that doesn't include the regex).
In Sam, those commands are called x and g (eXtract and Guard/Given) and their complements are y and v.

For example, taking the full text of David Copperfield by Charles Dickens, I wish to locate every line where the name 'Miss Trotwood' appears (Dialogue lines might span over a line break). To accomplish this I split the file by paragraphs, then locate those that have dialogue (delimited by single quotes), and search those that have this name.
In single line mode, there would be a need to make sure those dialogue "lines" won't cross line breaks, forcing us to manually mark every instance of a dialogue, raising and lowering flags, before even checking for any appearance of the actual text we wish to match. A structural way of looking at the dialogue is called for.

The structural specification for this are: split the file by paragraphs - sequential non-empty lines (/^\n/); extract only text that is located between single quotes ('.' ); return only the lines that have the name (Miss Trotwood) in them.

In Sam, this will be written as:

0,$  y/^\n/ x/'.' / g/Miss Trotwood/.

To make it even more interesting, substitute every appearance of Copperfield with Trotwood, but only if the name 'Miss Trotwood' appear in a line in that paragraph, and only Copperfields that appear outside dialog lines:

0,$  y/^\n/ g/Miss Trotwood/ y/'.' / x/Copperfield/c/Trotwood/

(c substitutes the end result with the regex after it)

These simple examples show the basic concept behind structural regular expressions, or, more to the point, of the way these force a structure in a piping-like way, which is a metaphor familiar to any *nix user.

How does all that apply to Perl?

The regular expression features in Perl have more than enough operators and elements that allows us to use this technique. For starters, assigning the $/ variable allows us to process the file's content in chunks that are delimited by a delimiter of our choice. For instance, assigning an empty string to $/ ($/ = '') will treat empty lines as an input terminator, similar to Sam's y/^\n/ (keeping in mind that we can't assign a regular expression to $/). Once $/ is set, we can perform more operations using the while (<>) idiom (or similar) and continue processing the text. Each iteration of the loop can be matched (=~, or !~ m//, similar to Sam's x,y) against a regular expression, split according to another (Sam's v operator), and so on. Similarly, inside the match, lookarounds, backreferences and substitution operators can be used to further dissect and locate the areas we need until the final result is located, reported, returned or removed.

While this probably appears as no-news to most Perl programmers, it is the concept I find interesting most of all.
Part of the necessity of the structural regular expression concept emerged because existing regex engines were usually an implementation of Ken Thompson's DFA algorithm. Perl uses a much more powerful engine that can allow, and exceed, the abilities of the DFA engines without heeding to the text's actual structure (albeit with a performance price).
However, I find that applying this metaphor to text processing helps "unlocking" the exact, and most correct, method of dissecting and matching the text, and may result in better understanding, and as result of better implementation, of the solution to the text processing problems I tend to encounter.

Software speaks in tongues of man; I debug, therefore I code.
Stop saying 'script'. Stop saying 'line-noise'.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://665719]
Approved by ww
Front-paged by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (3)
As of 2019-10-19 15:33 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Notices?