PerlMonks  


Polyglot: I don't know if the following will be of any use to you, but I was curious to play with some different approaches to what I conceive to be your problem. You may as well have the results. All these work (for some definition of 'work').

The first new approach is a variation on something I've already posted: two different replacement strings for the sequential versus non-sequential page number cases. In the case of sequential page numbers, the replacement string is the empty string, which may be something the regex engine can effectively 'optimize away' at run time.
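A minimal sketch of that first idea, cut down to a three-page string. The helper name `gap_note` and the sample text are made up for this illustration; the real script further down uses `missing()`.

```perl
use strict;
use warnings;

# Hypothetical helper for this sketch: the text to insert between
# page $i and page $j (empty string when the pages are sequential).
sub gap_note {
    my ($i, $j) = @_;
    return '' if $j - $i < 2;                # sequential: empty replacement
    my ($from, $to) = ($i + 1, $j - 1);
    return $from == $to
        ? "(PAGE $from MISSING) "
        : "(PAGES $from - $to MISSING) ";
}

my $text = 'pg. 1 a pg. 2 b pg. 5 c';        # assumed sample data
my $pn   = qr{ pg[.] \s+ (\d+) }xms;         # embedded capture group

# \K keeps the current page out of the replaced text; the lookahead
# captures the next page number into the second group without consuming it.
$text =~ s{ $pn .*? \K (?= $pn) }{ gap_note($1, $2) }xmsge;

print "$text\n";    # pg. 1 a pg. 2 b (PAGES 3 - 4 MISSING) pg. 5 c
```

The substitution still fires for the sequential pair (1, 2); it just inserts nothing there.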

The second new approach tries to avoid the replacement clause of the substitution altogether when the page numbers are sequential. It uses some of the newer, more exotic regex constructs introduced with Perl 5.10: \K and the backtracking control verbs (*SKIP) and (*FAIL), combined with a code-block conditional. Because these constructs are relatively new, the regex compiler may not recognize and optimize them as efficiently, so this version may well be slower overall. I have done no benchmarking whatsoever.
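Here is a cut-down sketch of the second idea, again on made-up sample text. The counter `$calls` is only there to show that the replacement code is not run for sequential pages; it is not part of the real script.

```perl
use strict;
use warnings;

my $text  = 'pg. 1 a pg. 2 b pg. 5 c pg. 6 d';   # assumed sample data
my $calls = 0;                                   # times the replacement code runs

$text =~ s{
    pg[.] \s+ (\d+)          # current page number (capture 1)
    .*? \K                   # keep the page text out of the replacement
    (?= pg[.] \s+ (\d+) )    # peek at the next page number (capture 2)
    (?(?{ $2 - $1 == 1 })    # sequential pages?
        (*SKIP) (*FAIL)      # yes: abandon this match, resume from here
    )
}{
    $calls++;
    my ($from, $to) = ($1 + 1, $2 - 1);
    $from == $to ? "(PAGE $from MISSING) " : "(PAGES $from - $to MISSING) ";
}xmsge;

print "$text\n";                            # pg. 1 a pg. 2 b (PAGES 3 - 4 MISSING) pg. 5 c pg. 6 d
print "replacement ran $calls time(s)\n";   # replacement ran 1 time(s)
```

The pattern here is a literal with no interpolated variables, so `use re 'eval'` is not needed; it becomes necessary once a `qr//` object is interpolated into a pattern containing code blocks.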

use warnings FATAL => 'all' ;
use strict;

use constant DEBUG => 0;

my $book = <<'ENDBOOK';
pg. 1 one two pg. 2 two three four pg. 4 four five
pg. 5 five six pg. 6 six seven eight nine pg. 9 nine ten
pg. 10 ten eleven twelve thirteen fourteen pg. 14 fourteen fifteen
pg. 15 fifteen sixteen seventeen pg. 17 seventeen eighteen nineteen
pg. 19 nineteen and out
ENDBOOK

print qq{[[$book]] \n\n};

# all these solutions use \K of 5.10+

# # works
# # this solution works (insofar as i understand what Polyglot
# # wants), but is 'inefficient' in that it involves substitution
# # of a substring with an identical substring in most cases
# # (assuming sequential page numbers are the most common case).
# #
# my $pn = qr{ pg[.] \s+ }xms;
# $book =~
#     s{ $pn (\d+) \K (.*?) (?= $pn (\d+)) }
#      { my $m = missing($1, $3); $m ? qq{$2$m } : $2; }xmsge;

# # works. extracts/classifies pg. number/matter ok. subst. ok.
# # this solution works (with caveat given above), but in the case
# # of sequential page numbers will insert an empty string into
# # the target string, which may or may not be 'efficient'.
# my $pn = qr{ pg[.] \s+ (\d+) }xms;  # CAUTION: embedded capture
# $book =~ s{
#     $pn        # capture pg. number to $1
#     .*? \K     # ignore pg. number/matter in replace
#     (?= $pn)   # overlap capture next pg. number to $2
#     }
#     { my $m = missing($1, $2);
#       print "rr'$1' '$2' s/${^MATCH}/$m/rr \n" if DEBUG;
#       $m;
#     }xmspge;

# use exotic 5.10+ regex constructs to avoid 'useless' substitution.

# # works. extracts/classifies pg. number/matter ok. subst. ok.
# $book =~ s{
#     pg[.] \s+ (\d+)          # capture pg. number to $1
#     .*? \K                   # ignore pg. number/matter in replace
#     (?= pg[.] \s+ (\d+))     # overlap capture next pg. number to $2
#     (?(?{ $2 - $1 == 1 })    # sequential pages?
#         # sequential: no replacement, advance to next pg.
#         (?{ print "++'$1' '$2'++ \n" if DEBUG; })
#         (*SKIP) (*FAIL)
#       |
#         # non-sequential: replace/insert missing pg(s)., advance
#         (?{ print "--'$1' '$2'-- \n" if DEBUG; })
#         # null regex always true
#     )
#     }
#     { my $m = missing($1, $2);
#       print "rr'$1' '$2' s/${^MATCH}/$m/rr \n" if DEBUG;
#       $m;
#     }xmspge;

# # works. extracts/classifies pg. number/matter ok. subst. ok.
# my $pn = qr{ pg[.] \s+ (\d+) }xms;  # CAUTION: embedded capture
# use re 'eval';
# $book =~ s{
#     $pn                      # capture pg. number to $1
#     .*? \K                   # ignore pg. number/matter in replace
#     (?= $pn)                 # overlap capture next pg. number to $2
#     (?(?{ $2 - $1 == 1 })    # sequential pages?
#         # sequential: no replacement, advance to next pg.
#         (?{ print "++'$1' '$2'++ \n" if DEBUG; })
#         (*SKIP) (*FAIL)
#       |
#         # non-sequential: replace/insert missing pg(s)., advance
#         (?{ print "--'$1' '$2'-- \n" if DEBUG; })
#         # null regex always true
#     )
#     }
#     { my $m = missing($1, $2);
#       print "rr'$1' '$2' s/${^MATCH}/$m/rr \n" if DEBUG;
#       $m;
#     }xmspge;

# works. extracts/classifies pg. number/matter ok. subst. ok.
my $pn = qr{ pg[.] \s+ (\d+) }xms;  # CAUTION: embedded capture
use re 'eval';
$book =~ s{
    $pn                      # capture pg. number to $1
    .*? \K                   # ignore pg. number/matter in replace
    # advance (i.e., skip) matching to this point if pages sequential
    (?= $pn)                 # overlap capture next pg. number to $2
    (?(?{ $2 - $1 == 1 })    # sequential pages?
        # sequential: no replacement, advance to next pg.
        (?{ print "++'$1' '$2'++ \n" if DEBUG; })
        (*SKIP)  # skip past current page on failure
        (*FAIL)  # fail the match: no replacement
    )
    }
    { my $m = missing($1, $2);
      print "rr'$1' '$2' s/${^MATCH}/$m/rr \n" if DEBUG;
      $m;
    }xmspge;

print "\n";
print "(($book)) \n";

sub missing {
    my ($i, $j) = @_;

    die "bad page sequence $i-$j" if $i >= $j;
    return '' if $j - $i < 2;  # no missing page(s)

    my ($ii, $jj) = ($i + 1, $j - 1);  # figure the gap
    return $ii == $jj
        ? qq{(PAGE $ii MISSING) }         # just one page missing
        : qq{(PAGES $ii - $jj MISSING) }  # multiple pages missing
        ;
    }

Output:

c:\@Work\Perl\monks\Polyglot>perl non_sequential_pages_1.pl
[[pg. 1 one two pg. 2 two three four pg. 4 four five
pg. 5 five six pg. 6 six seven eight nine pg. 9 nine ten
pg. 10 ten eleven twelve thirteen fourteen pg. 14 fourteen fifteen
pg. 15 fifteen sixteen seventeen pg. 17 seventeen eighteen nineteen
pg. 19 nineteen and out
]]


((pg. 1 one two pg. 2 two three four (PAGE 3 MISSING) pg. 4 four five
pg. 5 five six pg. 6 six seven eight nine (PAGES 7 - 8 MISSING) pg. 9 nine ten
pg. 10 ten eleven twelve thirteen fourteen (PAGES 11 - 13 MISSING) pg. 14 fourteen fifteen
pg. 15 fifteen sixteen seventeen (PAGE 16 MISSING) pg. 17 seventeen eighteen nineteen
(PAGE 18 MISSING) pg. 19 nineteen and out
))
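Since none of this was benchmarked, here is one way the empty-replacement and (*SKIP)(*FAIL) variants might be compared, using the core Benchmark module. This is only a sketch: the generated test data, the helper `gap_note`, and the sub names `empty_replace` and `skip_fail` are assumptions for illustration, and real timings will depend on the Perl version and on how often gaps occur in the actual text.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test data: pages 1..200 with every tenth page missing.
my $book = join ' ', map { "pg. $_ text" } grep { $_ % 10 } 1 .. 200;

# Hypothetical helper: text to insert between page $i and page $j.
sub gap_note {
    my ($i, $j) = @_;
    return '' if $j - $i < 2;
    my ($from, $to) = ($i + 1, $j - 1);
    return $from == $to
        ? "(PAGE $from MISSING) "
        : "(PAGES $from - $to MISSING) ";
}

# Variant 1: always substitute, inserting '' for sequential pages.
sub empty_replace {
    my $s = shift;
    $s =~ s{ pg[.] \s+ (\d+) .*? \K (?= pg[.] \s+ (\d+) ) }
           { gap_note($1, $2) }xmsge;
    return $s;
}

# Variant 2: fail the match for sequential pages, so no replacement runs.
sub skip_fail {
    my $s = shift;
    $s =~ s{
        pg[.] \s+ (\d+) .*? \K (?= pg[.] \s+ (\d+) )
        (?(?{ $2 - $1 == 1 }) (*SKIP) (*FAIL) )
    }{ gap_note($1, $2) }xmsge;
    return $s;
}

# Sanity check before timing: both variants must agree.
die "variants disagree" unless empty_replace($book) eq skip_fail($book);

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    empty_replace => sub { empty_replace($book) },
    skip_fail     => sub { skip_fail($book) },
});
```

A negative count tells `cmpthese` to run each sub for at least that many CPU seconds rather than a fixed number of iterations.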


In reply to Re^5: How to use "less than" and "greater than" inside a regex for a $variable number by AnomalousMonk
in thread How to use "less than" and "greater than" inside a regex for a $variable number by Polyglot
