PerlMonks  

Re: Apply A Set Of Regexes To A String

by tmoertel (Chaplain)
on Oct 11, 2004 at 05:37 UTC ( [id://398100] )


in reply to Apply A Set Of Regexes To A String

I often prefer solutions that are declarative in nature. Rather than writing code to do the work, I write code to interpret or compile a description of the work into the code that does the actual work.

In your problem, for example, we have the following situation:

  1. We have regex operations.
  2. We want to apply certain of the operations to certain pages.
  3. Given a page, we want to know which regex operations to apply, and then we want to apply them.

Since I don't know the specifics of your situation, let's say that you're working on books and that you deal with three kinds of pages: front matter, body, and index. Let's further say that each page has two properties: (1) its content (the text to appear on the page) and (2) its page type (one of the three we listed earlier).

Now, let's say that we have the following rules for processing the pages:

  1. All pages are expected to have a ::PAGENUM:: placeholder that shall be replaced by the page number during processing. On front-matter pages, however, the page number shall be displayed as a roman numeral.
  2. Front-matter pages may contain ::COPYRIGHT:: and ::PRINTING:: placeholders that shall be replaced by copyright and printing information. These placeholders are ignored on other kinds of pages.
  3. Body pages require no additional processing for now (but might later).
  4. Index pages require no additional processing for now (but might later).
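
For contrast, coding the four rules directly might look something like the sketch below. This is only an illustration of the hand-written route: the name process_page_by_hand and the tiny roman-numeral lookup are hypothetical stand-ins, not code from the original question.

```perl
use strict;
use warnings;

# Hypothetical hand-coded version of the rules, for contrast only.
sub process_page_by_hand {
    my ($content, $page_type, $page_number) = @_;

    if ($page_type eq 'front_matter') {
        # Rule 1 (front matter): show the page number as a roman numeral.
        # The lookup list is a small stand-in covering pages 1..5.
        my $roman = (qw/? i ii iii iv v/)[$page_number] || '?';
        $content =~ s/::PAGENUM::/$roman/g;

        # Rule 2: copyright and printing placeholders.
        $content =~ s/::COPYRIGHT::/Copyright 2004 blah, blah/g;
        $content =~ s/::PRINTING::/1st printing, Blah Blah Press/g;
    }

    # Rule 1 (all pages): any remaining placeholder gets the arabic number.
    $content =~ s/::PAGENUM::/$page_number/g;

    # Rules 3 and 4: body and index pages need nothing more for now.
    return $content;
}

print process_page_by_hand("Page ::PAGENUM::\n", 'front_matter', 2);
print process_page_by_hand("Page ::PAGENUM::\n", 'body', 3);
```

Every new rule or page type means another branch in this function, which is the growth the spec-based approach tries to avoid.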

I would probably convert the rules into a simple text-based specification that is easy for humans to understand and edit:

    body:
        +all_pages

    front_matter:
        s/::PAGENUM::/roman_numeral($page_number)/eg;
        s/::COPYRIGHT::/Copyright 2004 blah, blah/g;
        s/::PRINTING::/1st printing, Blah Blah Press/g;
        +all_pages

    index:
        +all_pages

    all_pages:
        s/::PAGENUM::/$page_number/eg;

The spec's meaning is straightforward. Each page type is represented by a labeled section. Each section contains a bit of Perl code that gives the substitutions to be performed on pages of that type. Further, to make reuse easy, we define lines of the form +label to mean "and now do the stuff specified in the section labeled label, too."
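
One way to picture a +label line is as a call to the code compiled for that label. The sketch below turns the lines of one section into Perl source text so the result can be inspected. The helper name translate_spec_line is hypothetical, and the %sections hash named in the generated text is assumed to hold one compiled sub per label.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one spec line into one line of Perl source.
# A "+label" line becomes a call to that label's compiled section;
# any other line is passed through unchanged as Perl code.
sub translate_spec_line {
    my ($line) = @_;
    if ($line =~ /^\s*\+(\w+)/) {
        return '$sections{' . $1 . '}->();';
    }
    return $line;
}

# Two lines from the front_matter section, single-quoted so that
# nothing is interpolated before translation.
my @front_matter = (
    's/::PAGENUM::/roman_numeral($page_number)/eg;',
    '+all_pages',
);

# Wrap the translated lines in an anonymous sub, as source text only.
my $source = join "\n",
    'sub {', (map { translate_spec_line($_) } @front_matter), '}';

print "$source\n";
```

The printed text is an anonymous sub whose body is the PAGENUM substitution followed by the statement $sections{all_pages}->();. Nothing is evaluated here; the sketch only shows the source that a later eval would compile.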

The idea is to convert this specification into an engine that makes it easy to process pages given their page types. For example, to process and print out a book, this is about as complicated as things should need to get:

    my $page_engine = make_regex_engine_from_spec( $spec_fh );
    my $page_number = 1;
    for my $page (@book_pages) {
        print $page_engine->( @$page{'content','page_type'} ), "\n";
        $page_number++;
    }

That's pretty simple, right? But like most things in life, this simplicity comes at a price: we must write the code that reads the spec and converts it into an engine for us. Fortunately, the price isn't too high:

    sub make_regex_engine_from_spec {

        my $fh = shift;  # filehandle contains spec
        my %sections;
        my $label;

        # read in spec

        while (<$fh>) {
            chomp;
            next unless /\S/;  # skip blanks
            if (/^(\w+):/) {
                $label = $1;
            }
            else {
                die "syntax error: need a section label\n"
                    unless $label;
                push @{$sections{$label}}, $_;
            }
        }

        # compile spec into code

        my $interpret = sub {
            local $_ = shift;
            if ( /^ \s* \+ (\w+) /x ) {
                if ($sections{$1}) {
                    return '$sections{'.$1.'}->();';
                }
                die "there is no section named '$1'";
            }
            return $_;
        };

        while (($label, my $section) = each %sections) {
            my $generated_code = join "\n",
                'sub {', (map $interpret->($_), @$section), "}\n";
            $sections{$label} = eval $generated_code
                or die "couldn't eval section $label: $@";
        }

        # return processor engine that embodies compiled spec

        return sub {
            # args: page content, page type
            (local $_, my $page_type) = @_;
            my $processor = $sections{$page_type};
            $processor->() if $processor;
            return $_;
        }
    }

That might seem like a lot of code. However, it's of constant size and won't change as our regex needs grow and become more complicated. All we'll need to do is change our spec, which we expect will be easier than writing the equivalent code by hand. We're hoping that the simplicity and cost savings of the specification language more than pay for the one-time cost of having to write that function above.
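
The function only needs something it can read lines from. If the spec lives in a string rather than in a file, one option is an in-memory filehandle (available since Perl 5.8). The sketch below is an assumption about how $spec_fh might be obtained, not something the code above requires. It repeats just the reading loop, to show the sections that loop would collect.

```perl
use strict;
use warnings;

# A shortened spec held in a string (single-quoted heredoc: no interpolation).
my $spec_text = <<'END_SPEC';
body:
    +all_pages

all_pages:
    s/::PAGENUM::/$page_number/eg;
END_SPEC

# Open a read-only filehandle on the string itself.
open my $spec_fh, '<', \$spec_text
    or die "cannot open in-memory spec: $!";

# Same reading logic as in make_regex_engine_from_spec:
# collect the lines that follow each "label:" line.
my (%sections, $label);
while (my $line = <$spec_fh>) {
    chomp $line;
    next unless $line =~ /\S/;    # skip blanks
    if ($line =~ /^(\w+):/) {
        $label = $1;
    }
    else {
        die "syntax error: need a section label\n" unless $label;
        push @{ $sections{$label} }, $line;
    }
}
close $spec_fh;

print "$_ has ", scalar @{ $sections{$_} }, " line(s)\n"
    for sort keys %sections;
```

A handle opened this way could be passed to make_regex_engine_from_spec in place of a file or DATA handle.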

To test out the spec-based system, let's create some pages of various types:

    my @book_pages = (
        { page_type => 'front_matter',
          content   => "This is the copyright page (::PAGENUM::).\n"
                     . "::COPYRIGHT::\n"
                     . "::PRINTING::\n" },
        { page_type => 'body',
          content   => "This is a body page (::PAGENUM::).\n" },
        { page_type => 'index',
          content   => "This an index page (::PAGENUM::).\n" },
    );

And here's what the pages look like when processed sequentially as a book using the for loop from earlier:

    This is the copyright page (i).
    Copyright 2004 blah, blah
    1st printing, Blah Blah Press

    This is a body page (2).

    This an index page (3).

Each of the page types was processed as expected. All of the expected placeholders were replaced on all pages. The copyright page (which is front matter) has a roman-numeral page number.
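
The spec relies on a roman_numeral function for front-matter page numbers. For a book with more than a handful of front-matter pages, a fuller version might look like the sketch below. The supported range (1 to 3999), the lowercase output, and the "?" fallback for anything else are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Sketch of a fuller roman_numeral(): lowercase numerals for 1..3999,
# "?" for anything outside that range or not a whole number.
sub roman_numeral {
    my $n = shift;
    return "?"
        unless defined $n && $n =~ /^\d+$/ && $n >= 1 && $n <= 3999;

    # Value/glyph pairs in descending order, including subtractive forms.
    my @table = (
        [1000, 'm'], [900, 'cm'], [500, 'd'], [400, 'cd'],
        [100,  'c'], [90,  'xc'], [50,  'l'], [40,  'xl'],
        [10,   'x'], [9,   'ix'], [5,   'v'], [4,   'iv'],
        [1,    'i'],
    );

    # Greedily take the largest value that still fits.
    my $out = '';
    for my $pair (@table) {
        my ($value, $glyph) = @$pair;
        while ($n >= $value) {
            $out .= $glyph;
            $n   -= $value;
        }
    }
    return $out;
}

# prints: i iv ix xiv mmiv
print join(' ', map { roman_numeral($_) } 1, 4, 9, 14, 2004), "\n";
```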

Looks like we're ready to print our book. :)

So that's how I might do it: (1) Write a spec. (2) Write code to convert the spec into worker code. (3) Use the worker code to do the work.

Cheers,
Tom

P.S. The complete code, ready to run, is below for your convenience:

    #!/usr/bin/perl

    use warnings;
    use strict;

    # Tom Moertel <tom@moertel.com> 2004-10-11

    # here are some fake pages

    my @book_pages = (
        { page_type => 'front_matter',
          content   => "This is the copyright page (::PAGENUM::).\n"
                     . "::COPYRIGHT::\n"
                     . "::PRINTING::\n" },
        { page_type => 'body',
          content   => "This is a body page (::PAGENUM::).\n" },
        { page_type => 'index',
          content   => "This an index page (::PAGENUM::).\n" },
    );

    # sample code that shows how to build
    # an engine from spec and use it

    my $page_number = 1;
    my $page_engine = make_regex_engine_from_spec(\*DATA);

    for my $page (@book_pages) {
        print $page_engine->(@$page{'content','page_type'}), "\n";
        $page_number++;
    }

    sub roman_numeral {
        my $index = shift;
        return (qw/0 i ii iii iv v ... /)[$index] || "?";
    }

    # the following code generates the worker code from the spec

    sub make_regex_engine_from_spec {

        my $fh = shift;
        my %sections;
        my $label;

        # read in spec

        while (<$fh>) {
            chomp;
            next unless /\S/;  # skip blanks
            if (/^(\w+):/) {
                $label = $1;
            }
            else {
                die "syntax error: need a section label\n"
                    unless $label;
                push @{$sections{$label}}, $_;
            }
        }

        # compile spec into code

        my $interpret = sub {
            local $_ = shift;
            if ( /^ \s* \+ (\w+) /x ) {
                if ($sections{$1}) {
                    return '$sections{'.$1.'}->();';
                }
                die "there is no section named '$1'";
            }
            return $_;
        };

        while (($label, my $section) = each %sections) {
            my $generated_code = join "\n",
                'sub {', (map $interpret->($_), @$section), "}\n";
            # uncomment below line to see generated code
            # print STDERR "$label => $generated_code\n";
            $sections{$label} = eval $generated_code
                or die "couldn't eval section $label: $@";
        }

        # return processor engine that embodies compiled spec

        return sub {
            # args: page content, page type
            (local $_, my $page_type) = @_;
            my $processor = $sections{$page_type};
            $processor->() if $processor;
            return $_;
        }
    }

    # our spec follows

    __DATA__

    body:
        +all_pages

    front_matter:
        s/::PAGENUM::/roman_numeral($page_number)/eg;
        s/::COPYRIGHT::/Copyright 2004 blah, blah/g;
        s/::PRINTING::/1st printing, Blah Blah Press/g;
        +all_pages

    index:
        +all_pages

    all_pages:
        s/::PAGENUM::/$page_number/eg;
