
Polyglot: I don't know whether the following will be of any use to you, but I was curious to try some different approaches to what I take to be your problem, so you may as well have the results. All of these work (for some definition of 'work').

The first new approach is a variation on something I've already posted: it uses two different replacement strings, one for the sequential and one for the non-sequential page number case. For sequential page numbers the replacement is the empty string, which the regex engine may be able to effectively 'optimize away' at run time.
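A stripped-down sketch of this first idea, for illustration only: the sample string, the '(GAP)' marker and the variable names are made up here and are not from the thread. The replacement expression returns the empty string for sequential pages and a marker otherwise.

```perl
use strict;
use warnings;

# Hypothetical sample text, much shorter than the test book in the script.
my $text = 'pg. 1 a  pg. 2 b  pg. 5 c';

my $pn = qr{ pg[.] \s+ (\d+) }xms;  # embedded capture, as in the full script

(my $marked = $text) =~ s{
    $pn .*? \K      # page number to $1; keep everything matched so far
    (?= $pn)        # peek at the next page number, captured to $2
}{ $2 - $1 > 1 ? '(GAP)  ' : '' }xmsge;

print "$marked\n";
```

Note that the replacement code still runs once per page boundary here, including the sequential one; it just inserts nothing there.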

The second new approach tries to avoid executing the replacement clause of the substitution altogether when page numbers are sequential. It uses some of the newer, more exotic regex constructs introduced with Perl 5.10 (here, the backtracking control verbs (*SKIP) and (*FAIL)). Because these constructs are new, the regex compiler may not recognize and optimize them as efficiently, so they may be slower overall. I have done no benchmarking whatsoever.
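As a rough sketch of the second idea (again with a made-up sample string, a made-up '(GAP)' marker, and a hypothetical $calls counter that is not part of the full script), the conditional fails the match for sequential pages, so the replacement code should run only where there is a gap:

```perl
use strict;
use warnings;

# Hypothetical sample text: pages 1 and 2 are sequential, 2 and 5 are not.
my $text  = 'pg. 1 a  pg. 2 b  pg. 5 c';
my $calls = 0;   # counts how often the replacement code is executed

(my $marked = $text) =~ s{
    pg[.] \s+ (\d+)  .*? \K     # page number to $1; keep what was matched
    (?= pg[.] \s+ (\d+) )       # next page number to $2
    (?(?{ $2 - $1 == 1 })       # sequential pages?
        (*SKIP) (*FAIL)         # yes: fail, resume scanning from here
    )
}{ $calls++; '(GAP)  ' }xmsge;

print "$marked\n";
print "replacement ran $calls time(s)\n";
```

The pattern is written out literally rather than interpolated from a qr// object, so this sketch does not need use re 'eval'.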

use warnings
    FATAL => 'all'
    ;
use strict;

use constant DEBUG => 0;


my $book = <<'ENDBOOK';
pg. 1 one two  pg. 2 two three four  pg.
4 four five  pg. 5 five
six  pg. 6 six seven eight nine  pg. 9
 nine ten  pg. 10
ten eleven twelve thirteen fourteen  pg. 14 fourteen
 fifteen  pg.
15 fifteen sixteen seventeen  pg.
17 seventeen eighteen nineteen
pg. 19 nineteen and out
ENDBOOK

print qq{[[$book]] \n\n};

# all these solutions use  \K  of 5.10+

# # works
# # this solution works (insofar as I understand what Polyglot
# # wants), but is 'inefficient' in that it involves substitution
# # of a substring with an identical substring in most cases
# # (assuming sequential page numbers are the most common case).
#
# my $pn = qr{ pg[.] \s+ }xms;
# $book =~
#     s{ $pn (\d+) \K  (.*?)  (?= $pn (\d+))             }
#      { my $m = missing($1, $3);  $m ? qq{$2$m  } : $2; }xmsge;

# # works.  extracts/classifies pg. number/matter ok.  subst. ok.
# # this solution works (with caveat given above), but in the case
# # of sequential page numbers will insert an empty string into
# # the target string, which may or may not be 'efficient'.
# my $pn = qr{ pg[.] \s+ (\d+) }xms;  # CAUTION: embedded capture
# $book =~ s{
#     $pn                     # capture pg. number to $1
#     .*? \K                  # ignore pg. number/matter in replace
#     (?= $pn)                # overlap capture next pg. number to $2
#   }
#   { my $m = missing($1, $2);
#     print "rr'$1'  '$2'  s/${^MATCH}/$m/rr \n" if DEBUG;
#     $m;
#   }xmspge;


# use exotic 5.10+ regex constructs to avoid 'useless' substitution.

# # works.  extracts/classifies pg. number/matter ok.  subst. ok.
# $book =~ s{
#     pg[.] \s+ (\d+)         # capture pg. number to $1
#     .*? \K                  # ignore pg. number/matter in replace
#     (?= pg[.] \s+ (\d+))    # overlap capture next pg. number to $2
#     (?(?{ $2 - $1 == 1 })   # sequential pages?
#         # sequential: no replacement, advance to next pg.
#         (?{ print "++'$1'  '$2'++ \n" if DEBUG; })
#         (*SKIP) (*FAIL)
#         |
#         # non-sequential: replace/insert missing pg(s)., advance
#         (?{ print "--'$1'  '$2'-- \n" if DEBUG; })
#         # null regex always true
#         )
#   }
#   { my $m = missing($1, $2);
#     print "rr'$1'  '$2'  s/${^MATCH}/$m/rr \n" if DEBUG;
#     $m;
#   }xmspge;

# # works.  extracts/classifies pg. number/matter ok.  subst. ok.
# my $pn = qr{ pg[.] \s+ (\d+) }xms;  # CAUTION: embedded capture
# use re 'eval';
# $book =~ s{
#     $pn                     # capture pg. number to $1
#     .*? \K                  # ignore pg. number/matter in replace
#     (?= $pn)                # overlap capture next pg. number to $2
#     (?(?{ $2 - $1 == 1 })   # sequential pages?
#         # sequential: no replacement, advance to next pg.
#         (?{ print "++'$1'  '$2'++ \n" if DEBUG; })
#         (*SKIP) (*FAIL)
#         |
#         # non-sequential: replace/insert missing pg(s)., advance
#         (?{ print "--'$1'  '$2'-- \n" if DEBUG; })
#         # null regex always true
#         )
#   }
#   { my $m = missing($1, $2);
#     print "rr'$1'  '$2'  s/${^MATCH}/$m/rr \n" if DEBUG;
#     $m;
#   }xmspge;

# works.  extracts/classifies pg. number/matter ok.  subst. ok.
my $pn = qr{ pg[.] \s+ (\d+) }xms;  # CAUTION: embedded capture
use re 'eval';
$book =~ s{
    $pn                     # capture pg. number to $1
    .*? \K                  # ignore pg. number/matter in replace
    # advance (i.e., skip) matching to this point if pages sequential
    (?= $pn)                # overlap capture next pg. number to $2
    (?(?{ $2 - $1 == 1 })   # sequential pages?
        # sequential: no replacement, advance to next pg.
        (?{ print "++'$1'  '$2'++ \n" if DEBUG; })
        (*SKIP)  # skip past current page on failure
        (*FAIL)  # fail the match: no replacement
        )
  }
  { my $m = missing($1, $2);
    print "rr'$1'  '$2'  s/${^MATCH}/$m/rr \n" if DEBUG;
    $m;
  }xmspge;

print "\n";
print "(($book)) \n";


sub missing {

    my ($i, $j) = @_;

    die "bad page sequence $i-$j" if $i >= $j;

    return '' if $j - $i < 2;  # no missing page(s)

    my ($ii, $jj) = ($i + 1, $j - 1);  # figure the gap
    return $ii == $jj
         ? qq{(PAGE $ii MISSING)  }         # just one page missing
         : qq{(PAGES $ii - $jj MISSING)  }  # multiple pages missing
         ;

    }

Output:

c:\@Work\Perl\monks\Polyglot>perl non_sequential_pages_1.pl
[[pg. 1 one two  pg. 2 two three four  pg.
4 four five  pg. 5 five
six  pg. 6 six seven eight nine  pg. 9
 nine ten  pg. 10
ten eleven twelve thirteen fourteen  pg. 14 fourteen
 fifteen  pg.
15 fifteen sixteen seventeen  pg.
17 seventeen eighteen nineteen
pg. 19 nineteen and out
]]

((pg. 1 one two  pg. 2 two three four  (PAGE 3 MISSING)  pg.
4 four five  pg. 5 five
six  pg. 6 six seven eight nine  (PAGES 7 - 8 MISSING)  pg. 9
 nine ten  pg. 10
ten eleven twelve thirteen fourteen  (PAGES 11 - 13 MISSING)  pg. 14 fourteen
 fifteen  pg.
15 fifteen sixteen seventeen  (PAGE 16 MISSING)  pg.
17 seventeen eighteen nineteen
(PAGE 18 MISSING)  pg. 19 nineteen and out
))

In reply to Re^5: How to use "less than" and "greater than" inside a regex for a $variable number by AnomalousMonk
in thread How to use "less than" and "greater than" inside a regex for a $variable number by Polyglot
