PerlMonks  

How to enforce match priority irrespective of string position

by Polyglot (Hermit)
on Mar 07, 2021 at 11:34 UTC (#11129253=perlquestion)

Polyglot has asked for the wisdom of the Perl Monks concerning the following question:

According to the Perl documentation, Perl will always match at the earliest possible position within the string. This is true of any typical alternation, as can be seen here: https://perldoc.perl.org/perlrequick.
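As a small illustration of that rule (the sample string is made up for this sketch), listing the period first in an alternation does not help when a colon occurs earlier in the string:

```perl
use strict;
use warnings;

my $text = "wait: then stop.";

# The period is listed first in the alternation, but the colon occurs
# earlier in the string, so the colon is what matches.
my ($hit) = $text =~ /(\.|:)/;
my $offset = $-[0];    # start offset of the last successful match

print "matched '$hit' at offset $offset\n";    # matched ':' at offset 4
```

Order inside the alternation only decides between branches that can match at the same starting position.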

However, this is not what I need. I need to specify a priority of match without regard to position. An approximation of what I am needing is illustrated in the following (perhaps poor) example:

$line = qq~I'm looking for the end of a sentence, where possible. However, in some cases, I'll need to go with a non-conventional "end" to it, such as:
"Here's a quote by a famous person which is supposed to exceed forty words and is therefore required to be set apart as a separate, indented paragraph per APA style." (Famous, 1999)

Note that the regex needs to look for the full end of the sentence, if it exists: it cannot simply stop at the colon unless there is no further part to the sentence provided in that paragraph.~;

$line =~ s/^
    (.*?)
    (
        (?:[.?!"])          #FIRST PRIORITY
      |
        (?:[:;-])           #SECOND PRIORITY
      |
        (?:\n|\r|\z|$)      #LAST PRIORITY
    )
/<span class="s">$1$2</span>/gmx;

For the above, the desired sentence matches should be:

  1. I'm looking for the end of a sentence, where possible.
  2. However, in some cases, I'll need to go with a non-conventional "end" to it, such as:
  3. "Here's a quote by a famous person which is supposed to exceed forty words and is therefore required to be set apart as a separate, indented paragraph per APA style."
  4. (Famous, 1999)

  5. Note that the regex needs to look for the full end of the sentence, if it exists: it cannot simply stop at the colon unless there is no further part to the sentence provided in that paragraph.

As the example illustrates, the sentences should break at the first colon, but not at the second, as there is a higher-priority break-point, the period.

Is it possible to mandate a match priority such that the first one, irrespective of position, will be looked for first, and only upon failure would the next priority be sought, and so on? I have a case where my entire regex is failing on this issue, and I just cannot think of a good way to resolve it. It would not work, in my case, to use two separate regexes, as it would destroy the correct ordering of the sequences matched.

Edit:

Perhaps this will be a better example/illustration.

 

Point 1.3.4: A piece of text.

Point 1.3.5: A piece of text.

Point 1.3.6: Another piece of text. Point 1.3.6: For some reason this piece of text isn't finished yet.

Point 1.3.6: In fact, this piece of text even broke into a new line.

Point 1.3.7: Finally, a new piece of text.

 

Now, it's easy to see that there are four points here. But the computer might not "see" four as it reads each of the "Point" notations. How could these points be captured such that each substitution will operate on the FULL point at once, not just a portion of a point? In other words, Point 1.3.6 needs to include three such notations spanning two separate lines.

I have coded it something like this:

$line =~ s~^
    (
        Point\s(\d+)\.(\d+)\.(\d+)
        (.*?)
    )
    (?=
        (?:Point\s (?:\d+)\.(?:\d+)\.(?!\4) )   #1 Priority
      |
        (?:\z|$)                                #2 Priority
    )
~$processthis->()~egmx;

However, the #2 Priority match, because it matches first, ends up trumping the #1 Priority match, and any of the chunks of the form illustrated by Point 1.3.6 end up truncated.

Again, this is just an illustration, but perhaps it more clearly explains the priority issues. Moving the (.*?) into the forward-looking assertion(s), as some suggested I try, did not bring about the desired results for me.

Blessings,

~Polyglot~

Replies are listed 'Best First'.
Re: How to enforce match priority irrespective of string position
by tybalt89 (Monsignor) on Mar 08, 2021 at 00:49 UTC

    Finally, a "sort of" test case :)

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11129253
    use warnings;

    local $_ = <<END;
    Point 1.3.4: A piece of text.

    Point 1.3.5: A piece of text.

    Point 1.3.6: Another piece of text. Point 1.3.6: For some reason this piece of text isn't finished yet.

    Point 1.3.6: In fact, this piece of text even broke into a new line.

    Point 1.3.7: Finally, a new piece of text.

    END

    my @parts;
    push @parts, $& while /
      (Point\s[\d.]+:)
      .*?
      (?=Point|\z)
      (?!\1)
      /gsx;

    use Data::Dump 'dd'; dd \@parts;

    Outputs four chunks, just like you asked for:

    [
      "Point 1.3.4: A piece of text.\n\n",
      "Point 1.3.5: A piece of text.\n\n",
      "Point 1.3.6: Another piece of text. Point 1.3.6: For some reason this piece of text isn't finished yet.\n\nPoint 1.3.6: In fact, this piece of text even broke into a new line.\n\n",
      "Point 1.3.7: Finally, a new piece of text.\n\n",
    ]

      And that method worked! (Though I've had to restructure a bit to accommodate, as that was not in a simple substitution form.) I don't mind doing whatever is necessary to get things working, though...so thank you very much! I'll certainly upvote this when I get my next day's rations.
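For anyone who wants to stay in substitution form, the same pattern can probably be dropped into s///ge with one extra capture group around the whole chunk. This is only a sketch: `process_point` is a hypothetical stand-in for the real processing, and the sample text is shortened.

```perl
use strict;
use warnings;

my $text = <<'END';
Point 1.3.5: A piece of text.

Point 1.3.6: Another piece. Point 1.3.6: Not finished yet.

Point 1.3.6: Still the same point.

Point 1.3.7: Finally, a new point.
END

# Hypothetical stand-in for the real per-point processing.
sub process_point {
    my ($chunk) = @_;
    return "<div>$chunk</div>";
}

my $count = $text =~ s{
    (                      # group 1: the whole point, repeats included
        (Point\s[\d.]+:)   # group 2: the label of this point
        .*?
    )
    (?=Point|\z)           # stop before the next label or at the end
    (?!\2)                 # unless that next label is the same one
}{ process_point($1) }gsxe;

print "$count points processed\n";
```

Because group 1 now wraps the label, the backreference in the negative lookahead becomes \2 rather than \1.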

      This part seems to be the crucial bit: (?=Point|\z) (?!\1). I find this sort of syntax confusing because it always seems to me that the "Point" here should have precedence over anything coming afterward in the regex sequence, in this case the "\1" backreference. If "Point" is already detected from the forward assertion, why can it be matched again (overlapped) by this reference, even if in the negative?

      Well, no complaints at the moment, certainly, as at least the script is now past this hurdle. Thank you.

      Blessings,

      ~Polyglot~

        Because (?= and (?! are ZERO-WIDTH assertions.
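A minimal sketch of what zero-width means in practice (the sample strings are invented for the illustration): a lookahead inspects the text ahead without consuming it, so a second lookahead placed right after it is tested at the very same position.

```perl
use strict;
use warnings;

# A lookahead checks the text ahead without consuming it.
my $word = "foobar";
$word =~ /foo(?=bar)/ or die "no match";
my $matched = $&;       # "foo": "bar" was inspected, not consumed
my $end     = $+[0];    # 3: the match ends just before "bar"

# Two lookaheads in a row are therefore tested at the same position:
# "is there a Point label (or the end) here?" and "is it a different label?"
my $text = "Point 1: a Point 1: b Point 2: c";
my @parts;
push @parts, $& while $text =~ /(Point\s[\d.]+:).*?(?=Point|\z)(?!\1)/gs;

print "'$_'\n" for @parts;
# 'Point 1: a Point 1: b '
# 'Point 2: c'
```

So (?=Point|\z) does not "use up" the word Point; it only asserts that it is there, and (?!\1) then vetoes the position if the label that starts there is the same one captured in group 1.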

Re: How to enforce match priority irrespective of string position
by Takeshi Kovacs (Beadle) on Mar 07, 2021 at 12:10 UTC
    I have trouble fully grasping your intention, especially because your example text and your description overlap.

    Could it be that you are looking for recursive parsing, where anything in "quotes" won't be broken up at a period?

    The perldocs have examples of how to implement this.

      I am, of course, dealing with some exceptions in a body of text. The text has some irregularities, but could be parsed correctly if only I am able to impose a strict ordering of match priority. It isn't an issue of quotes, nor is nesting involved; it's actually an issue of some potential "false positives" that must be initially skipped in favor of a more favorable match unless that more favorable match cannot be found--in which case the "false positive" might be the correct match. Does this make sense?

      Blessings,

      ~Polyglot~

        I'd say use Hippo's SSCCE template (see "Re: Matching a string in a parenthesized block (regex help)") to write some tests for
        • what you want and
        • what you don't want.
        This would certainly be beneficial for you too.

        Other than that, an alternation (|) with a branch that swallows protected regions, such as "quoted" ones, can prioritize those areas. Demo:

          DB<132> $_ = 'phrase. "phrase1.phrase2" phrase. phrase'
        0  'phrase. "phrase1.phrase2" phrase. phrase'
          DB<133> split /(".*?"|\.)/
        0  'phrase'
        1  '.'
        2  ' '
        3  '"phrase1.phrase2"'
        4  ' phrase'
        5  '.'
        6  ' phrase'
          DB<134>
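The pieces from that split can then be glued back into sentences. A rough sketch, assuming only a bare period should end a sentence (the reassembly loop is an addition for illustration, not part of the demo above):

```perl
use strict;
use warnings;

my $text = 'phrase. "phrase1.phrase2" phrase. phrase';

my @sentences;
my $current = '';
for my $piece (split /(".*?"|\.)/, $text) {
    $current .= $piece;
    if ($piece eq '.') {    # only a bare period ends a sentence
        push @sentences, $current;
        $current = '';
    }
}
push @sentences, $current if length $current;    # trailing fragment

print "[$_]\n" for @sentences;
# [phrase.]
# [ "phrase1.phrase2" phrase.]
# [ phrase]
```

The quoted branch wins at the position of the opening quote because the bare-period branch cannot match there, and once the quote branch has matched, the period inside it is already consumed.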
Re: How to enforce match priority irrespective of string position
by jcb (Parson) on Mar 09, 2021 at 00:17 UTC

    If you can advance incrementally through the text, you could try anchoring all of your patterns at pos with \G. Since pos is an lvalue, you could store the previous match position, try each pattern in priority order starting at the same previous position, take whichever match you prefer, store that into pos, and repeat for the next chunk. Something like: (untested)

    my $lastpos = 0;
    while ($lastpos < length $_) {
      my @matches = (undef) x 3;
      pos = $lastpos;
      $matches[0] = pos if m/\G([^.?!"]+[.?!"])/gc; #FIRST PRIORITY
      pos = $lastpos;
      $matches[1] = pos if m/\G([^:;-]+[:;-])/gc; #SECOND PRIORITY
      pos = $lastpos;
      $matches[2] = pos if m/\G(.*(?:\n|\r|\z|$))/gc; #LAST PRIORITY
      # somehow choose which match to use for the next cycle and set $lastpos here
      # substr $_, $lastpos, ($matches[$chosen] - $lastpos)
      # should yield the selected chunk between choosing a match and updating $lastpos
    }
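One possible way to fill in the "choose" step, as a sketch only: take the first pattern in priority order that matches at the current position. The helper name `chunk_by_priority`, the sample text, and the three patterns are all assumptions made for this illustration, and each pattern is assumed to consume at least one character.

```perl
use strict;
use warnings;

# Hypothetical helper: cut $text into chunks, trying the patterns in
# priority order at each position and taking the first one that matches.
sub chunk_by_priority {
    my ($text, @patterns) = @_;
    my @chunks;
    pos($text) = 0;
    CHUNK: while (pos($text) < length $text) {
        for my $re (@patterns) {
            # \G anchors at pos($text); /c keeps pos unchanged on failure
            if ($text =~ /\G($re)/gc) {
                push @chunks, $1;
                next CHUNK;
            }
        }
        last;    # nothing matched here: stop rather than loop forever
    }
    return @chunks;
}

my @chunks = chunk_by_priority(
    "alpha: beta. gamma; delta",
    qr/[^.]*\./,       # first priority: up to a period
    qr/[^:;]*[:;]/,    # second priority: up to a colon or semicolon
    qr/.+/s,           # last priority: whatever is left
);

print "[$_]\n" for @chunks;
# [alpha: beta.]
# [ gamma;]
# [ delta]
```

The first chunk runs through the colon to the period because the period pattern is tried first; the colon/semicolon pattern only gets a turn once no period remains.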
Re: How to enforce match priority irrespective of string position
by rsFalse (Chaplain) on Mar 07, 2021 at 13:33 UTC
    >> I need to specify a priority of match without regard to position.

    Try a look-ahead search. Maybe this sketch will be of some help:

    $line =~ s/^
        (
            (?= .*? $regex_1 ) .*? $regex_1    #FIRST PRIORITY
          |
            (?= .*? $regex_2 ) .*? $regex_2    #SECOND PRIORITY
        )
    /something/gmx;

    Edit: removed the remark that the caret is obsolete.

    Upd.: I think my example (now struck through) simply reduces to the same thing without the look-ahead; see the comment by Lanx.
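To make the reduced form concrete, here is a sketch with invented sample text. It contrasts a shared lazy prefix with a prefix repeated inside each branch. Without /s, the dot does not cross a newline, so each branch is confined to the current line.

```perl
use strict;
use warnings;

my $line = "Second: third.";

# Shared prefix: the alternation is retried at each length of .*?,
# so whichever terminator comes first in the text wins.
my ($nearest) = $line =~ /^(.*?(?:[.?!"]|[:;-]))/;    # "Second:"

# Prefix repeated in each branch: the first branch runs to completion
# before the second branch is tried at all.
my ($priority) = $line =~ /^(.*?[.?!"]|.*?[:;-])/;    # "Second: third."

# The same idea applied repeatedly over several lines.
my $text = "First. Second: third.\nFourth:\nfifth";
my @chunks;
push @chunks, $& while $text =~ / .*?[.?!"] | .*?[:;-] | .+ /gx;

print "[$_]\n" for @chunks;
# [First.]
# [ Second: third.]
# [Fourth:]
# [fifth]
```

The earliest-position rule still applies, but all three branches start at the same place, so the order of the branches decides which terminator is preferred.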
      > (?= .*? $regex_1 ) .*? $regex_1

      does it make sense to repeat the regex?

      isn't it rather

      (?= $re_cond_1 ) $re_match_1

      ?

      Cheers Rolf
      (addicted to the Perl Programming Language :)

        Of course, '$re_cond_1' may not be equal to '$re_match_1'. But I wanted to show the simplest example. Further, '\1' can be used to avoid repeating the pattern.

        I must say that this syntax confuses me. I am already using a lookahead to define the forward edge of the match (versus where the next match will start in the global substitution), and everything up to but not including that lookahead needs to be captured. I've never thought one could capture from a lookahead...but perhaps I'd misunderstood. I'm also using backslash lookaround assertions, because some of what is matched will be matched again (these are the false positives) and for an unpredictable number of times (fewer than 20).

        I tried putting rsFalse's suggestion to use but was unable to get the match to succeed. I don't think I understand it well enough.

        Blessings,

        ~Polyglot~

Node Type: perlquestion [id://11129253]
Approved by Corion
Front-paged by haukex