
Polyglot: I don't know if the following will be of any use to you, but I was curious to play with some different approaches to what I conceive to be your problem. You may as well have the results. All these work (for some definition of 'work').

The first new approach is a variation on something I've already posted: two different replacement strings for the sequential versus non-sequential page number cases. In the case of sequential page numbers, the replacement string is the empty string, which may be something the regex engine can effectively 'optimize away' at run time.
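To make the mechanics of that first approach concrete, here is a minimal sketch on a tiny made-up string (the sample text, the `(GAP)` marker and the `@seen` array are my own illustration, not part of Polyglot's data). It only shows how `\K` plus a lookahead lets the replacement see both page numbers while replacing nothing but the zero-width spot between two pages.

```perl
use strict;
use warnings;

# Hypothetical sample text, not from the original post.
my $text = 'pg. 1 a pg. 2 b pg. 4 c';

my $pn = qr{ pg[.] \s+ (\d+) }xms;   # CAUTION: embedded capture

my @seen;   # records each (current, next) page pair the regex visits

$text =~ s{
    $pn .*? \K     # $1 = current page; \K keeps everything matched so far
    (?= $pn )      # $2 = next page, captured without consuming it
}{
    push @seen, "$1-$2";
    $2 - $1 > 1 ? '(GAP) ' : '';   # empty string when pages are sequential
}xmsge;

print "$text\n";
print "@seen\n";
```

With this input the marker should land just before `pg. 4`, and the sequential pair 1-2 still goes through the replacement block, which simply returns the empty string.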

The second new approach tries to avoid the replacement clause of the substitution altogether when page numbers are sequential. It uses some of the newer, more exotic regex constructs introduced with 5.10, namely the (*SKIP) and (*FAIL) backtracking control verbs. Because these are new, the regex compiler may not recognize and optimize them as efficiently, so this approach may turn out slower overall. I have done no benchmarking whatsoever.
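For anyone who has not met the (*SKIP) (*FAIL) pair before, here is a small sketch of the usual idiom on an unrelated, made-up string (the input and variable names are assumptions for illustration only). The left alternative matches something we want left alone and then forces failure, and (*SKIP) makes the engine resume after that text rather than retry inside it.

```perl
use strict;
use warnings;

# Hypothetical input, not from the original post.
my $s = 'say "foo" and foo';

# Left alternative: match a double-quoted string; (*SKIP) marks the
# current position and (*FAIL) forces failure, so the next match attempt
# starts after the quoted string.
# Right alternative: the text we actually want to replace.
(my $t = $s) =~ s{ " [^"]* " (*SKIP) (*FAIL) | foo }{bar}xmsg;

print "$t\n";
```

Only the unquoted `foo` is replaced; the quoted one is skipped without any replacement being performed for it.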

use warnings FATAL => 'all' ;
use strict;

use constant DEBUG => 0;

my $book = <<'ENDBOOK';
pg. 1 one two
pg. 2 two three four
pg. 4 four five
pg. 5 five six
pg. 6 six seven eight nine
pg. 9 nine ten
pg. 10 ten eleven twelve thirteen fourteen
pg. 14 fourteen fifteen
pg. 15 fifteen sixteen seventeen
pg. 17 seventeen eighteen nineteen
pg. 19 nineteen and out
ENDBOOK

print qq{[[$book]] \n\n};

# all these solutions use \K of 5.10+

# # works
# # this solution works (insofar as i understand what Polyglot
# # wants), but is 'inefficient' in that it involves substitution
# # of a substring with an identical substring in most cases
# # (assuming sequential page numbers are the most common case).
# #
# my $pn = qr{ pg[.] \s+ }xms;
# $book =~
#     s{ $pn (\d+) \K (.*?) (?= $pn (\d+)) }
#      { my $m = missing($1, $3);  $m ? qq{$2$m } : $2; }xmsge;

# # works.  extracts/classifies pg. number/matter ok.  subst. ok.
# # this solution works (with caveat given above), but in the case
# # of sequential page numbers will insert an empty string into
# # the target string, which may or may not be 'efficient'.
# my $pn = qr{ pg[.] \s+ (\d+) }xms;  # CAUTION: embedded capture
# $book =~ s{
#     $pn        # capture pg. number to $1
#     .*? \K     # ignore pg. number/matter in replace
#     (?= $pn)   # overlap capture next pg. number to $2
#     }
#     { my $m = missing($1, $2);
#       print "rr'$1' '$2' s/${^MATCH}/$m/rr \n" if DEBUG;
#       $m;
#     }xmspge;

# use exotic 5.10+ regex constructs to avoid 'useless' substitution.

# # works.  extracts/classifies pg. number/matter ok.  subst. ok.
# $book =~ s{
#     pg[.] \s+ (\d+)         # capture pg. number to $1
#     .*? \K                  # ignore pg. number/matter in replace
#     (?= pg[.] \s+ (\d+))    # overlap capture next pg. number to $2
#     (?(?{ $2 - $1 == 1 })   # sequential pages?
#         # sequential: no replacement, advance to next pg.
#         (?{ print "++'$1' '$2'++ \n" if DEBUG; })
#         (*SKIP) (*FAIL)
#       |
#         # non-sequential: replace/insert missing pg(s)., advance
#         (?{ print "--'$1' '$2'-- \n" if DEBUG; })
#         # null regex always true
#       )
#     }
#     { my $m = missing($1, $2);
#       print "rr'$1' '$2' s/${^MATCH}/$m/rr \n" if DEBUG;
#       $m;
#     }xmspge;

# # works.  extracts/classifies pg. number/matter ok.  subst. ok.
# my $pn = qr{ pg[.] \s+ (\d+) }xms;  # CAUTION: embedded capture
# use re 'eval';
# $book =~ s{
#     $pn                     # capture pg. number to $1
#     .*? \K                  # ignore pg. number/matter in replace
#     (?= $pn)                # overlap capture next pg. number to $2
#     (?(?{ $2 - $1 == 1 })   # sequential pages?
#         # sequential: no replacement, advance to next pg.
#         (?{ print "++'$1' '$2'++ \n" if DEBUG; })
#         (*SKIP) (*FAIL)
#       |
#         # non-sequential: replace/insert missing pg(s)., advance
#         (?{ print "--'$1' '$2'-- \n" if DEBUG; })
#         # null regex always true
#       )
#     }
#     { my $m = missing($1, $2);
#       print "rr'$1' '$2' s/${^MATCH}/$m/rr \n" if DEBUG;
#       $m;
#     }xmspge;

# works.  extracts/classifies pg. number/matter ok.  subst. ok.
my $pn = qr{ pg[.] \s+ (\d+) }xms;  # CAUTION: embedded capture
use re 'eval';
$book =~ s{
    $pn                     # capture pg. number to $1
    .*? \K                  # ignore pg. number/matter in replace
    # advance (i.e., skip) matching to this point if pages sequential
    (?= $pn)                # overlap capture next pg. number to $2
    (?(?{ $2 - $1 == 1 })   # sequential pages?
        # sequential: no replacement, advance to next pg.
        (?{ print "++'$1' '$2'++ \n" if DEBUG; })
        (*SKIP)             # skip past current page on failure
        (*FAIL)             # fail the match: no replacement
      )
    }
    { my $m = missing($1, $2);
      print "rr'$1' '$2' s/${^MATCH}/$m/rr \n" if DEBUG;
      $m;
    }xmspge;

print "\n";
print "(($book)) \n";

sub missing {
    my ($i, $j) = @_;

    die "bad page sequence $i-$j" if $i >= $j;
    return '' if $j - $i < 2;  # no missing page(s)

    my ($ii, $jj) = ($i + 1, $j - 1);  # figure the gap
    return $ii == $jj
        ? qq{(PAGE $ii MISSING) }         # just one page missing
        : qq{(PAGES $ii - $jj MISSING) }  # multiple pages missing
        ;
    }

Output:

c:\@Work\Perl\monks\Polyglot>perl non_sequential_pages_1.pl
[[pg. 1 one two
pg. 2 two three four
pg. 4 four five
pg. 5 five six
pg. 6 six seven eight nine
pg. 9 nine ten
pg. 10 ten eleven twelve thirteen fourteen
pg. 14 fourteen fifteen
pg. 15 fifteen sixteen seventeen
pg. 17 seventeen eighteen nineteen
pg. 19 nineteen and out
]]


((pg. 1 one two
pg. 2 two three four
(PAGE 3 MISSING) pg. 4 four five
pg. 5 five six
pg. 6 six seven eight nine
(PAGES 7 - 8 MISSING) pg. 9 nine ten
pg. 10 ten eleven twelve thirteen fourteen
(PAGES 11 - 13 MISSING) pg. 14 fourteen fifteen
pg. 15 fifteen sixteen seventeen
(PAGE 16 MISSING) pg. 17 seventeen eighteen nineteen
(PAGE 18 MISSING) pg. 19 nineteen and out
))
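Since none of this was benchmarked, here is a rough sketch of how the empty-replacement and (*SKIP) (*FAIL) variants might be compared with the core Benchmark module. The generated test data (200 pages, every tenth one missing) and the sub names `empty_replace` and `skip_fail` are assumptions of mine; real timings will depend on the perl version and on how the actual book text looks.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: pages 1..200 with every tenth page missing.
my $book = join '', map { "pg. $_ text of page $_\n" }
                    grep { $_ % 10 } 1 .. 200;

my $pn = qr{ pg[.] \s+ (\d+) }xms;   # CAUTION: embedded capture

# Same helper as in the post above.
sub missing {
    my ($i, $j) = @_;
    die "bad page sequence $i-$j" if $i >= $j;
    return '' if $j - $i < 2;              # no missing page(s)
    my ($ii, $jj) = ($i + 1, $j - 1);      # figure the gap
    return $ii == $jj ? qq{(PAGE $ii MISSING) }
                      : qq{(PAGES $ii - $jj MISSING) };
}

# Variant 1: always substitute, inserting '' for sequential pages.
sub empty_replace {
    my $copy = $book;
    $copy =~ s{ $pn .*? \K (?= $pn) }{ missing($1, $2) }xmsge;
    return $copy;
}

# Variant 2: fail the match for sequential pages, so no replacement runs.
sub skip_fail {
    use re 'eval';   # pattern mixes an interpolated qr// with a code block
    my $copy = $book;
    $copy =~ s{
        $pn .*? \K (?= $pn)
        (?(?{ $2 - $1 == 1 }) (*SKIP) (*FAIL) )
    }{ missing($1, $2) }xmsge;
    return $copy;
}

# Sanity check before timing: both variants must agree.
die "variants disagree" unless empty_replace() eq skip_fail();

# Run each variant for at least one CPU second and print a comparison.
cmpthese(-1, {
    empty_replace => \&empty_replace,
    skip_fail     => \&skip_fail,
});
```

The sanity check matters more than the numbers: a variant that is faster but produces different text is not worth timing.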


In reply to Re^5: How to use "less than" and "greater than" inside a regex for a $variable number by AnomalousMonk
in thread How to use "less than" and "greater than" inside a regex for a $variable number by Polyglot
