Polyglot: I don't know if the following will be of any use to you, but I was curious to play with some different approaches to what I conceive to be your problem. You may as well have the results. All these work (for some definition of 'work').
The first new approach is a variation on something I've already posted: two different replacement strings for the sequential versus non-sequential page number cases. In the case of sequential page numbers, the replacement string is the empty string, which may be something the regex engine can effectively 'optimize away' at run time.
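To make the idea concrete, here is a minimal sketch of the empty-replacement variant on a toy string. The sample text and the '[gap] ' marker are my own stand-ins, not part of the script below; the pattern itself has the same shape as the one in the script.

```perl
use strict;
use warnings;

# Same page-number pattern as in the full script (note the embedded capture).
my $pn   = qr{ pg[.] \s+ (\d+) }xms;

# Hypothetical sample text: page 2 is followed directly by page 5.
my $text = 'pg. 1 aaa pg. 2 bbb pg. 5 ccc';

# \K discards everything matched so far, so the replacement is inserted
# just before the next 'pg.'. Sequential pages get an empty string.
(my $marked = $text) =~ s{ $pn .*? \K (?= $pn) }
                          { $2 - $1 > 1 ? '[gap] ' : '' }xmsge;

print "$marked\n";   # pg. 1 aaa pg. 2 bbb [gap] pg. 5 ccc
```

The replacement code still runs once for every pair of adjacent page numbers, even when it only returns the empty string.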
The second new approach tries to avoid the replacement clause of the substitution altogether in the case of sequential page numbers. This approach uses some of the newer, more exotic regex constructs introduced with Perl 5.10, such as the backtracking control verbs (*SKIP) and (*FAIL). The problem with these is that their newness means they may not be as efficiently recognized and optimized by the regex compiler, and hence may be slower overall. I have done no benchmarking whatsoever.
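A small sketch of what "avoiding the replacement clause" means in practice: count how often the replacement code actually runs with and without the (*SKIP)(*FAIL) branch. The sample text, the counter names and the '[gap] ' marker are assumptions of mine for illustration only.

```perl
use strict;
use warnings;

my $pn   = qr{ pg[.] \s+ (\d+) }xms;           # embedded capture
my $text = 'pg. 1 a pg. 2 b pg. 4 c pg. 5 d';  # hypothetical: page 3 missing

my ($calls_plain, $calls_skip) = (0, 0);

# Variant 1: the replacement runs for every pair of adjacent page numbers.
(my $plain = $text) =~ s{ $pn .*? \K (?= $pn) }
    { $calls_plain++; $2 - $1 > 1 ? '[gap] ' : '' }xmsge;

# Variant 2: sequential pairs fail the match via (*SKIP)(*FAIL), so the
# replacement runs only where a gap exists.
use re 'eval';
(my $skip = $text) =~ s{
    $pn .*? \K (?= $pn)
    (?(?{ $2 - $1 == 1 }) (*SKIP) (*FAIL) )
}
    { $calls_skip++; '[gap] ' }xmsge;

print "plain: $calls_plain replacement call(s)\n";
print "skip:  $calls_skip replacement call(s)\n";
print "same result\n" if $plain eq $skip;
```

Fewer replacement calls does not by itself mean faster: the conditional and the embedded code block have their own cost, which only timing can settle.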
use warnings FATAL => 'all';
use strict;
use constant DEBUG => 0;
my $book = <<'ENDBOOK';
pg. 1 one two pg. 2 two three four pg.
4 four five pg. 5 five
six pg. 6 six seven eight nine pg. 9
nine ten pg. 10
ten eleven twelve thirteen fourteen pg. 14 fourteen
fifteen pg.
15 fifteen sixteen seventeen pg.
17 seventeen eighteen nineteen
pg. 19 nineteen and out
ENDBOOK
print qq{[[$book]] \n\n};
# all these solutions use \K of 5.10+
# # works
# # this solution works (insofar as i understand what Polyglot
# # wants), but is 'inefficient' in that it involves substitution
# # of a substring with an identical substring in most cases
# # (assuming sequential page numbers are the most common case).
#
# my $pn = qr{ pg[.] \s+ }xms;
# $book =~
# s{ $pn (\d+) \K (.*?) (?= $pn (\d+)) }
# { my $m = missing($1, $3); $m ? qq{$2$m } : $2; }xmsge;
# # works. extracts/classifies pg. number/matter ok. subst. ok.
# # this solution works (with caveat given above), but in the case
# # of sequential page numbers will insert an empty string into
# # the target string, which may or may not be 'efficient'.
# my $pn = qr{ pg[.] \s+ (\d+) }xms; # CAUTION: embedded capture
# $book =~ s{
# $pn # capture pg. number to $1
# .*? \K # ignore pg. number/matter in replace
# (?= $pn) # overlap capture next pg. number to $2
# }
# { my $m = missing($1, $2);
# print "rr'$1' '$2' s/${^MATCH}/$m/rr \n" if DEBUG;
# $m;
# }xmspge;
# use exotic 5.10+ regex constructs to avoid 'useless' substitution.
# # works. extracts/classifies pg. number/matter ok. subst. ok.
# $book =~ s{
# pg[.] \s+ (\d+) # capture pg. number to $1
# .*? \K # ignore pg. number/matter in replace
# (?= pg[.] \s+ (\d+)) # overlap capture next pg. number to $2
# (?(?{ $2 - $1 == 1 }) # sequential pages?
# # sequential: no replacement, advance to next pg.
# (?{ print "++'$1' '$2'++ \n" if DEBUG; })
# (*SKIP) (*FAIL)
# |
# # non-sequential: replace/insert missing pg(s)., advance
# (?{ print "--'$1' '$2'-- \n" if DEBUG; })
# # null regex always true
# )
# }
# { my $m = missing($1, $2);
# print "rr'$1' '$2' s/${^MATCH}/$m/rr \n" if DEBUG;
# $m;
# }xmspge;
# # works. extracts/classifies pg. number/matter ok. subst. ok.
# my $pn = qr{ pg[.] \s+ (\d+) }xms; # CAUTION: embedded capture
# use re 'eval';
# $book =~ s{
# $pn # capture pg. number to $1
# .*? \K # ignore pg. number/matter in replace
# (?= $pn) # overlap capture next pg. number to $2
# (?(?{ $2 - $1 == 1 }) # sequential pages?
# # sequential: no replacement, advance to next pg.
# (?{ print "++'$1' '$2'++ \n" if DEBUG; })
# (*SKIP) (*FAIL)
# |
# # non-sequential: replace/insert missing pg(s)., advance
# (?{ print "--'$1' '$2'-- \n" if DEBUG; })
# # null regex always true
# )
# }
# { my $m = missing($1, $2);
# print "rr'$1' '$2' s/${^MATCH}/$m/rr \n" if DEBUG;
# $m;
# }xmspge;
# works. extracts/classifies pg. number/matter ok. subst. ok.
my $pn = qr{ pg[.] \s+ (\d+) }xms; # CAUTION: embedded capture
use re 'eval';
$book =~ s{
    $pn                          # capture pg. number to $1
    .*? \K                       # ignore pg. number/matter in replace
    # advance (i.e., skip) matching to this point if pages sequential
    (?= $pn)                     # overlap capture next pg. number to $2
    (?(?{ $2 - $1 == 1 })        # sequential pages?
        # sequential: no replacement, advance to next pg.
        (?{ print "++'$1' '$2'++ \n" if DEBUG; })
        (*SKIP)                  # skip past current page on failure
        (*FAIL)                  # fail the match: no replacement
    )
}
{   my $m = missing($1, $2);
    print "rr'$1' '$2' s/${^MATCH}/$m/rr \n" if DEBUG;
    $m;
}xmspge;
print "\n";
print "(($book)) \n";
sub missing {
    my ($i, $j) = @_;
    die "bad page sequence $i-$j" if $i >= $j;
    return '' if $j - $i < 2;            # no missing page(s)
    my ($ii, $jj) = ($i + 1, $j - 1);    # figure the gap
    return $ii == $jj
        ? qq{(PAGE $ii MISSING) }        # just one page missing
        : qq{(PAGES $ii - $jj MISSING) } # multiple pages missing
        ;
}
Output:
c:\@Work\Perl\monks\Polyglot>perl non_sequential_pages_1.pl
[[pg. 1 one two pg. 2 two three four pg.
4 four five pg. 5 five
six pg. 6 six seven eight nine pg. 9
nine ten pg. 10
ten eleven twelve thirteen fourteen pg. 14 fourteen
fifteen pg.
15 fifteen sixteen seventeen pg.
17 seventeen eighteen nineteen
pg. 19 nineteen and out
]]
((pg. 1 one two pg. 2 two three four (PAGE 3 MISSING) pg.
4 four five pg. 5 five
six pg. 6 six seven eight nine (PAGES 7 - 8 MISSING) pg. 9
nine ten pg. 10
ten eleven twelve thirteen fourteen (PAGES 11 - 13 MISSING) pg. 14 fourteen
fifteen pg.
15 fifteen sixteen seventeen (PAGE 16 MISSING) pg.
17 seventeen eighteen nineteen
(PAGE 18 MISSING) pg. 19 nineteen and out
))
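Since I did no benchmarking, here is a rough sketch of how one might compare the empty-replacement and (*SKIP)(*FAIL) variants with the core Benchmark module. The generated test data (500 pages with every 7th page dropped), the sub names and the compact gap() helper are assumptions of mine; results will depend on the Perl version and on how common gaps are in the real text.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: pages 1..500 with every 7th page number dropped.
my $book = join ' ',
    map  { "pg. $_ some text for page $_" }
    grep { $_ % 7 } 1 .. 500;

my $pn = qr{ pg[.] \s+ (\d+) }xms;   # embedded capture, as in the script

# Compact stand-in for missing() from the script above.
sub gap {
    my ($i, $j) = @_;
    return '' if $j - $i < 2;
    my ($ii, $jj) = ($i + 1, $j - 1);
    return $ii == $jj ? "(PAGE $ii MISSING) " : "(PAGES $ii - $jj MISSING) ";
}

# Variant A: always substitute; sequential pages get the empty string.
sub empty_replacement {
    my $text = shift;
    $text =~ s{ $pn .*? \K (?= $pn) }{ gap($1, $2) }xmsge;
    return $text;
}

# Variant B: sequential pages fail the match and never reach the replacement.
sub skip_fail {
    my $text = shift;
    use re 'eval';
    $text =~ s{
        $pn .*? \K (?= $pn)
        (?(?{ $2 - $1 == 1 }) (*SKIP) (*FAIL) )
    }{ gap($1, $2) }xmsge;
    return $text;
}

# Sanity check before timing: both variants must agree.
die "variants disagree" if empty_replacement($book) ne skip_fail($book);

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    empty_replacement => sub { empty_replacement($book) },
    skip_fail         => sub { skip_fail($book) },
});
```

Each sub works on its own copy of the text, so repeated timing runs always start from the unmodified input.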