PerlMonks  

I often prefer solutions that are declarative in nature. Rather than writing code to do the work, I write code to interpret or compile a description of the work into the code that does the actual work.

In your problem, for example, we have the following situation:

  1. We have regex operations.
  2. We want to apply certain of the operations to certain pages.
  3. Given a page, we want to know which regex operations to apply, and then we want to apply them.

Since I don't know the specifics of your situation, let's say that you're working on books and that you deal with three kinds of pages: front matter, body, and index. Let's further say that each page has two properties: (1) its content (the text to appear on the page) and (2) its page type (one of the three we listed earlier).

Now, let's say that we have the following rules for processing the pages:

  1. All pages are expected to have a ::PAGENUM:: placeholder that shall be replaced by the page number during processing. On front-matter pages, however, the page number shall be displayed as a roman numeral.
  2. Front-matter pages may contain ::COPYRIGHT:: and ::PRINTING:: placeholders that shall be replaced by copyright and printing information. These placeholders are ignored on other kinds of pages.
  3. Body pages require no additional processing for now (but might later).
  4. Index pages require no additional processing for now (but might later).

I would probably convert the rules into a simple text-based specification that is easy for humans to understand and edit:

body:
  +all_pages

front_matter:
  s/::PAGENUM::/roman_numeral($page_number)/eg;
  s/::COPYRIGHT::/Copyright 2004 blah, blah/g;
  s/::PRINTING::/1st printing, Blah Blah Press/g;
  +all_pages

index:
  +all_pages

all_pages:
  s/::PAGENUM::/$page_number/eg;

The spec's meaning is straightforward. Each page type is represented by a labeled section. Each section contains a bit of Perl code that gives the substitutions to be performed on pages of that type. Further, to make reuse easy, we define lines of the form +label to mean "and now do the stuff specified in the section labeled label, too."
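
To make that reading concrete, here is a rough hand-written equivalent of the front_matter and all_pages sections. It is only a sketch: the sub names, the wrapper process_front_matter_page, the package variable $page_number, and the shortened roman_numeral stub are assumptions made for illustration.

```perl
use strict;
use warnings;

# Stand-in for the book's current page number (assumed for this sketch).
our $page_number = 1;

# Tiny lookup stub, good enough for the first few pages.
sub roman_numeral {
    my $n = shift;
    return (qw/0 i ii iii iv v/)[$n] || "?";
}

# "all_pages:" section; operates on $_.
sub all_pages {
    s/::PAGENUM::/$page_number/eg;
}

# "front_matter:" section: its own substitutions, then "+all_pages".
sub front_matter {
    s/::PAGENUM::/roman_numeral($page_number)/eg;
    s/::COPYRIGHT::/Copyright 2004 blah, blah/g;
    s/::PRINTING::/1st printing, Blah Blah Press/g;
    all_pages();
}

# Hypothetical wrapper: run the front_matter rules on one page's text.
sub process_front_matter_page {
    local $_ = shift;
    front_matter();
    return $_;
}
```

Note that the order of lines within a section matters. Because the roman-numeral substitution runs first, the arabic-numeral substitution pulled in by +all_pages finds no ::PAGENUM:: left to replace on front-matter pages.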

The idea is to convert this specification into an engine that makes it easy to process pages given their page types. For example, to process and print out a book, this is about as complicated as things should need to get:

my $page_engine = make_regex_engine_from_spec( $spec_fh );
my $page_number = 1;

for my $page (@book_pages) {
    print $page_engine->( @$page{'content','page_type'} ), "\n";
    $page_number++;
}

That's pretty simple, right? But like most things in life, this simplicity comes at a price: we must write the code that reads the spec and converts it into an engine for us. Fortunately, the price isn't too high:

sub make_regex_engine_from_spec {

    my $fh = shift;  # filehandle contains spec
    my %sections;
    my $label;

    # read in spec

    while (<$fh>) {
        chomp;
        next unless /\S/;  # skip blanks
        if (/^(\w+):/) {
            $label = $1;
        }
        else {
            die "syntax error: need a section label\n"
                unless $label;
            push @{$sections{$label}}, $_;
        }
    }

    # compile spec into code

    my $interpret = sub {
        local $_ = shift;
        if ( /^ \s* \+ (\w+) /x ) {
            if ($sections{$1}) {
                return '$sections{'.$1.'}->();';
            }
            die "there is no section named '$1'";
        }
        return $_;
    };

    while (($label, my $section) = each %sections) {
        my $generated_code = join "\n",
            'sub {', (map $interpret->($_), @$section), "}\n";
        $sections{$label} = eval $generated_code
            or die "couldn't eval section $label: $@";
    }

    # return processor engine that embodies compiled spec

    return sub {
        # args: page content, page type
        (local $_, my $page_type) = @_;
        my $processor = $sections{$page_type};
        $processor->() if $processor;
        return $_;
    }
}

That might seem like a lot of code. However, it's of constant size and won't change as our regex needs grow and become more complicated. All we'll need to do is change our spec, which we expect will be easier than writing the equivalent code by hand. We're hoping that the simplicity and cost savings of the specification language more than pay for the one-time cost of having to write that function above.
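
To see what the compile step actually hands to eval, here is a small standalone sketch that mirrors the interpret-and-join logic on a two-line section. The helper name generate_section_code and the $known_labels argument are hypothetical; the original does this work inline against %sections.

```perl
use strict;
use warnings;

# Sketch: turn one section's spec lines into the source text of an
# anonymous sub. generate_section_code is a made-up name for illustration.
sub generate_section_code {
    my ($lines, $known_labels) = @_;
    my @body;
    for my $line (@$lines) {
        if ( $line =~ /^ \s* \+ (\w+) /x ) {
            # "+label" becomes a call to that section's compiled sub
            die "there is no section named '$1'\n"
                unless $known_labels->{$1};
            push @body, '$sections{' . $1 . '}->();';
        }
        else {
            # anything else is passed through as plain Perl
            push @body, $line;
        }
    }
    return join "\n", 'sub {', @body, "}\n";
}

my $code = generate_section_code(
    [ '  s/::PRINTING::/1st printing, Blah Blah Press/g;', '  +all_pages' ],
    { all_pages => 1 },
);
print $code;
```

The printed text is an anonymous sub whose body is the substitution line followed by $sections{all_pages}->();. Since the lookup in %sections happens when the generated sub runs rather than when it is compiled, sections can refer to one another regardless of the order in which each returns them.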

To test out the spec-based system, let's create some pages of various types:

my @book_pages = (
    { page_type => 'front_matter',
      content   => "This is the copyright page (::PAGENUM::).\n"
                 . "::COPYRIGHT::\n"
                 . "::PRINTING::\n"
    },
    { page_type => 'body',
      content   => "This is a body page (::PAGENUM::).\n"
    },
    { page_type => 'index',
      content   => "This an index page (::PAGENUM::).\n"
    },
);

And here's what the pages look like when processed sequentially as a book using the for loop from earlier:

This is the copyright page (i).
Copyright 2004 blah, blah
1st printing, Blah Blah Press

This is a body page (2).

This an index page (3).

Each of the page types was processed as expected. All of the expected placeholders were replaced on all pages. The copyright page (which is front matter) has a roman-numeral page number.

Looks like we're ready to print our book. :)

So that's how I might do it: (1) Write a spec. (2) Write code to convert the spec into worker code. (3) Use the worker code to do the work.

Cheers,
Tom

P.S. The complete code, ready to run, is below for your convenience:

#!/usr/bin/perl

use warnings;
use strict;

# Tom Moertel <tom@moertel.com> 2004-10-11

# here are some fake pages

my @book_pages = (
    { page_type => 'front_matter',
      content   => "This is the copyright page (::PAGENUM::).\n"
                 . "::COPYRIGHT::\n"
                 . "::PRINTING::\n"
    },
    { page_type => 'body',
      content   => "This is a body page (::PAGENUM::).\n"
    },
    { page_type => 'index',
      content   => "This an index page (::PAGENUM::).\n"
    },
);

# sample code that shows how to build
# an engine from spec and use it

my $page_number = 1;
my $page_engine = make_regex_engine_from_spec(\*DATA);

for my $page (@book_pages) {
    print $page_engine->(@$page{'content','page_type'}), "\n";
    $page_number++;
}

sub roman_numeral {
    my $index = shift;
    return (qw/0 i ii iii iv v ... /)[$index] || "?";
}

# the following code generates the worker code from the spec

sub make_regex_engine_from_spec {

    my $fh = shift;
    my %sections;
    my $label;

    # read in spec

    while (<$fh>) {
        chomp;
        next unless /\S/;  # skip blanks
        if (/^(\w+):/) {
            $label = $1;
        }
        else {
            die "syntax error: need a section label\n"
                unless $label;
            push @{$sections{$label}}, $_;
        }
    }

    # compile spec into code

    my $interpret = sub {
        local $_ = shift;
        if ( /^ \s* \+ (\w+) /x ) {
            if ($sections{$1}) {
                return '$sections{'.$1.'}->();';
            }
            die "there is no section named '$1'";
        }
        return $_;
    };

    while (($label, my $section) = each %sections) {
        my $generated_code = join "\n",
            'sub {', (map $interpret->($_), @$section), "}\n";
        # uncomment below line to see generated code
        # print STDERR "$label => $generated_code\n";
        $sections{$label} = eval $generated_code
            or die "couldn't eval section $label: $@";
    }

    # return processor engine that embodies compiled spec

    return sub {
        # args: page content, page type
        (local $_, my $page_type) = @_;
        my $processor = $sections{$page_type};
        $processor->() if $processor;
        return $_;
    }
}

# our spec follows

__DATA__

body:
  +all_pages

front_matter:
  s/::PAGENUM::/roman_numeral($page_number)/eg;
  s/::COPYRIGHT::/Copyright 2004 blah, blah/g;
  s/::PRINTING::/1st printing, Blah Blah Press/g;
  +all_pages

index:
  +all_pages

all_pages:
  s/::PAGENUM::/$page_number/eg;
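
The roman_numeral helper in the listing is only a lookup stub covering the first few page numbers. If front matter might run longer, something like the following could stand in for it. This is a sketch, not part of the original post: it assumes lowercase numerals, a range of 1 to 3999, and the same "?" fallback the stub uses.

```perl
use strict;
use warnings;

# Sketch of a general converter for 1..3999. Anything outside that
# range returns "?", mirroring the stub's fallback.
sub roman_numeral {
    my $n = shift;
    return "?"
        unless defined $n && $n =~ /^\d+$/ && $n >= 1 && $n <= 3999;

    # value/glyph pairs, largest first, including subtractive forms
    my @table = (
        [ 1000, 'm' ], [ 900, 'cm' ], [ 500, 'd' ], [ 400, 'cd' ],
        [ 100,  'c' ], [ 90,  'xc' ], [ 50,  'l' ], [ 40,  'xl' ],
        [ 10,   'x' ], [ 9,   'ix' ], [ 5,   'v' ], [ 4,   'iv' ],
        [ 1,    'i' ],
    );

    my $out = '';
    for my $pair (@table) {
        my ($value, $glyph) = @$pair;
        # greedily take the largest value that still fits
        while ( $n >= $value ) {
            $out .= $glyph;
            $n   -= $value;
        }
    }
    return $out;
}
```

Because the spec calls roman_numeral($page_number) by name, swapping in a different implementation needs no change to the spec or to the engine.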

In reply to Re: Apply A Set Of Regexes To A String by tmoertel
in thread Apply A Set Of Regexes To A String by Cody Pendant
