Re: Apply A Set Of Regexes To A String

I often prefer solutions that are declarative in nature. Rather than writing code to do the work, I write code to interpret or compile a description of the work into the code that does the actual work.

In your problem, for example, we have the following situation:

We have regex operations.
We want to apply certain of the operations to certain pages.
Given a page, we want to know which regex operations to apply, and then we want to apply them.

Since I don't know the specifics of your situation, let's say that you're working on books and that you deal with three kinds of pages: front matter, body, and index. Let's further say that each page has two properties: (1) its content (the text to appear on the page) and (2) its page type (one of the three we listed earlier).

Now, let's say that we have the following rules for processing the pages:

All pages are expected to have a ::PAGENUM:: placeholder that shall be replaced by the page number during processing. On front-matter pages, however, the page number shall be displayed as a roman numeral.
Front-matter pages may contain ::COPYRIGHT:: and ::PRINTING:: placeholders that shall be replaced by copyright and printing information. These placeholders are ignored on other kinds of pages.
Body pages require no additional processing for now (but might later).
Index pages require no additional processing for now (but might later).

I would probably convert the rules into a simple text-based specification that is easy for humans to understand and edit:

    body:
        +all_pages

    front_matter:
        s/::PAGENUM::/roman_numeral($page_number)/eg;
        s/::COPYRIGHT::/Copyright 2004 blah, blah/g;
        s/::PRINTING::/1st printing, Blah Blah Press/g;
        +all_pages

    index:
        +all_pages

    all_pages:
        s/::PAGENUM::/$page_number/eg;

The spec's meaning is straightforward. Each page type is represented by a labeled section. Each section contains a bit of Perl code that gives the substitutions to be performed on pages of that type. Further, to make reuse easy, we define lines of the form +label to mean "and now do the stuff specified in the section labeled label, too."

The idea is to convert this specification into an engine that makes it easy to process pages given their page types. For example, to process and print out a book, this is about as complicated as our code should need to get:

    my $page_engine = make_regex_engine_from_spec( $spec_fh );
    my $page_number = 1;

    for my $page (@book_pages) {
        print $page_engine->( @$page{'content','page_type'} ), "\n";
        $page_number++;
    }
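
The loop above takes for granted that $spec_fh is an open filehandle on the spec. As a minimal sketch, assuming the spec is kept in a string (the $spec_text variable and the cut-down spec here are made up for illustration), Perl's in-memory filehandles are one way to get such a handle without a separate file:

```perl
use strict;
use warnings;

# Hypothetical: a shortened spec held in a string rather than in a file.
# The heredoc is single-quoted so that $page_number is kept literally.
my $spec_text = <<'END_SPEC';
body:
    +all_pages

all_pages:
    s/::PAGENUM::/$page_number/eg;
END_SPEC

# Opening a reference to a scalar yields an in-memory filehandle.
open my $spec_fh, '<', \$spec_text
    or die "couldn't open in-memory spec: $!";

# Read it line by line, as an engine builder would, collecting the labels.
my @labels;
while (my $line = <$spec_fh>) {
    push @labels, $1 if $line =~ /^(\w+):/;
}
close $spec_fh;

print "sections found: @labels\n";   # prints "sections found: body all_pages"
```

A handle opened on a real file, or the DATA handle, would serve just as well; all that matters is that it can be read line by line.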

That's pretty simple, right? But like most things in life, this simplicity comes at a price: We must write the code that reads the spec and converts it into an engine for us. Fortunately, the price isn't too high:

    sub make_regex_engine_from_spec
    {
        my $fh = shift;  # filehandle contains spec
        my %sections;
        my $label;

        # read in spec

        while (<$fh>) {
            chomp;
            next unless /\S/; # skip blanks
            if (/^(\w+):/) {
                $label = $1;
            }
            else {
                die "syntax error: need a section label\n"
                    unless $label;
                push @{$sections{$label}}, $_;
            }
        }

        # compile spec into code
        
        my $interpret = sub {
            local $_ = shift;
            if ( /^ \s* \+ (\w+) /x ) {
                if ($sections{$1}) {
                    return '$sections{'.$1.'}->();';
                }
                die "there is no section named '$1'";
            }
            return $_;
        };

        while (($label, my $section) = each %sections) {
            my $generated_code =
                join "\n", 'sub {', 
                    (map $interpret->($_), @$section), "}\n";
            $sections{$label} = eval $generated_code
                or die "couldn't eval section $label: $@";
        }

        # return processor engine that embodies compiled spec

        return sub {
            # args: page content, page type
            (local $_, my $page_type) = @_;
            my $processor = $sections{$page_type};
            $processor->() if $processor;
            return $_;
        }
    }

That might seem like a lot of code. However, it's of constant size and won't change as our regex needs grow and become more complicated. All we'll need to do is change our spec, which we expect will be easier than writing the equivalent code by hand. We're hoping that the simplicity and cost savings of the specification language more than pay for the one-time cost of having to write that function above.
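
For comparison, here is a rough sketch of what the equivalent hand-written code might look like for the three page types. The names process_all_pages, process_front_matter, %processor_for, and process_page are invented for this sketch, and roman_numeral is a small stand-in that only covers a few values:

```perl
use strict;
use warnings;

my $page_number = 1;

# Stand-in helper for this sketch only; covers just a few values.
sub roman_numeral {
    my $n = shift;
    return (qw/0 i ii iii iv v/)[$n] || "?";
}

# Hand-written counterpart of the all_pages section; works on $_.
sub process_all_pages {
    s/::PAGENUM::/$page_number/g;
}

# Hand-written counterpart of the front_matter section; works on $_.
sub process_front_matter {
    s/::PAGENUM::/roman_numeral($page_number)/eg;
    s/::COPYRIGHT::/Copyright 2004 blah, blah/g;
    s/::PRINTING::/1st printing, Blah Blah Press/g;
    process_all_pages();
}

# Dispatch table: page type => code that processes pages of that type.
my %processor_for = (
    front_matter => \&process_front_matter,
    body         => \&process_all_pages,
    index        => \&process_all_pages,
);

sub process_page {
    my ($content, $page_type) = @_;
    local $_ = $content;
    my $processor = $processor_for{$page_type};
    $processor->() if $processor;
    return $_;
}

print process_page("This is a body page (::PAGENUM::).\n", 'body');
# prints "This is a body page (1)."
```

Every new page type or shared rule set means another sub and another table entry here, whereas with the spec it means a few more lines of text.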

To test out the spec-based system, let's create some pages of various types:

my @book_pages = (

   { page_type => 'front_matter',
     content   => "This is the copyright page (::PAGENUM::).\n"
                . "::COPYRIGHT::\n"
                . "::PRINTING::\n" },

   { page_type => 'body',
     content   => "This is a body page (::PAGENUM::).\n" },

   { page_type => 'index', 
     content   => "This an index page (::PAGENUM::).\n" },
);

And here's what the pages look like when processed sequentially as a book using the for loop from earlier:

    This is the copyright page (i).
    Copyright 2004 blah, blah
    1st printing, Blah Blah Press

    This is a body page (2).

    This an index page (3).

Each of the page types was processed as expected. All of the expected placeholders were replaced on all pages. The copyright page (which is front matter) has a roman-numeral page number.
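
The roman_numeral helper in the complete program is only a small lookup table, which is plenty for this demo. If your front matter runs longer, a fuller converter could be dropped in instead. The following is a sketch of one, not part of the original program; the @ROMAN table and the 1..3999 range limit are my own choices, and it returns lowercase numerals to match the output above:

```perl
use strict;
use warnings;

# Value/symbol pairs in descending order, including the subtractive forms.
my @ROMAN = (
    [1000, 'm'], [900, 'cm'], [500, 'd'], [400, 'cd'],
    [100,  'c'], [90,  'xc'], [50,  'l'], [40,  'xl'],
    [10,   'x'], [9,   'ix'], [5,   'v'], [4,   'iv'],
    [1,    'i'],
);

sub roman_numeral {
    my $n = shift;

    # Fall back to "?" for anything outside the supported range.
    return "?"
        unless defined $n && $n =~ /^\d+$/ && $n >= 1 && $n <= 3999;

    # Greedily subtract the largest value that still fits.
    my $numeral = '';
    for my $pair (@ROMAN) {
        my ($value, $symbol) = @$pair;
        while ($n >= $value) {
            $numeral .= $symbol;
            $n -= $value;
        }
    }
    return $numeral;
}

print join(' ', map { roman_numeral($_) } 1, 4, 9, 14, 2004), "\n";
# prints "i iv ix xiv mmiv"
```

Because the spec calls roman_numeral by name, swapping implementations requires no change to the spec or to the engine.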

Looks like we're ready to print our book. :)

So that's how I might do it: (1) Write a spec. (2) Write code to convert the spec into worker code. (3) Use the worker code to do the work.

Cheers,
Tom

P.S. The complete code, ready to run, is below for your convenience:

#!/usr/bin/perl

use warnings;
use strict;

# Tom Moertel <tom@moertel.com> 2004-10-11

# here are some fake pages

my @book_pages = (

   { page_type => 'front_matter',
     content   => "This is the copyright page (::PAGENUM::).\n"
                . "::COPYRIGHT::\n"
                . "::PRINTING::\n" },

   { page_type => 'body',
     content   => "This is a body page (::PAGENUM::).\n" },

   { page_type => 'index', 
     content   => "This an index page (::PAGENUM::).\n" },
);


# sample code that shows how to build
# an engine from spec and use it

my $page_number = 1;
my $page_engine = make_regex_engine_from_spec(\*DATA);
for my $page (@book_pages) {
    print $page_engine->(@$page{'content','page_type'}), "\n";
    $page_number++;
}

sub roman_numeral {
    my $index = shift;
    return (qw/0 i ii iii iv v ... /)[$index] || "?";
}

# the following code generates the worker code from the spec

sub make_regex_engine_from_spec
{
    my $fh = shift;
    my %sections;
    my $label;

    # read in spec

    while (<$fh>) {
        chomp;
        next unless /\S/; # skip blanks
        if (/^(\w+):/) {
            $label = $1;
        }
        else {
            die "syntax error: need a section label\n"
                unless $label;
            push @{$sections{$label}}, $_;
        }
    }

    # compile spec into code
    
    my $interpret = sub {
        local $_ = shift;
        if ( /^ \s* \+ (\w+) /x ) {
            if ($sections{$1}) {
                return '$sections{'.$1.'}->();';
            }
            die "there is no section named '$1'";
        }
        return $_;
    };

    while (($label, my $section) = each %sections) {
        my $generated_code =
            join "\n", 'sub {', 
                (map $interpret->($_), @$section), "}\n";
        # uncomment below line to see generated code
        # print STDERR "$label => $generated_code\n";
        $sections{$label} = eval $generated_code
            or die "couldn't eval section $label: $@";
    }

    # return processor engine that embodies compiled spec

    return sub {
        # args: page content, page type
        (local $_, my $page_type) = @_;
        my $processor = $sections{$page_type};
        $processor->() if $processor;
        return $_;
    }
}


# our spec follows

__DATA__
body:
    +all_pages

front_matter:
    s/::PAGENUM::/roman_numeral($page_number)/eg;
    s/::COPYRIGHT::/Copyright 2004 blah, blah/g;
    s/::PRINTING::/1st printing, Blah Blah Press/g;
    +all_pages

index:
    +all_pages

all_pages:
    s/::PAGENUM::/$page_number/eg;

	PerlMonks