Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Can I determine index point of failure in string during regex match attempt

by tj_thompson (Monk)
on May 06, 2014 at 21:44 UTC ( [id://1085239]=perlquestion: print w/replies, xml ) Need Help??

tj_thompson has asked for the wisdom of the Perl Monks concerning the following question:

Hello again Monks!

In addition to my long question below, I've come up with a shorter more theoretical question that should be easier to jump into.

Given a string $s, and a regex $r anchored at both front and end of the string with '^' and '$', is it possible to write a function could_match( $r, $s ) that will determine whether $r could match if additional characters were appended to $s?

One requirement: the entirety of $s must be matched, not a sub string of $s. $r could always match with additional characters if it was not anchored.

It seems this should be easy enough to answer if you can answer the following question: How much of string $s was consumed in the failed match attempt against $r? If the entire string was consumed and $r remained unmatched, then it would seem $r could match if characters were appended to $s. On the other hand, if only part of $s was consumed, then adding additional characters would not help and $r could not be matched.

Perhaps consumed is poor verbiage. You could also look at this as the index point of failure during the match attempt.

Examples:

# $s does not match, but '12' would be consumed in the # match attempt. In this case, all of $s is consumed # so $s should be able to match if additional characters # were added $s = '12' $r = qr/^123$/; could_match( $r, $s ) # returns TRUE # $s does not match and none of $s would be consumed in the # attempt as the '1' in the regex could not match. Since # $s was not consumed in it's entirety in the match attmempt # there are no characters that could be added to $s to allow # it to match $r $s = '23' $r = qr/^123$/; could_match( $r, $s ) # returns FALSE # $r in this case is not anchored, so it could always match # if more characters were added. However, since none of $s # was used in the match attempt, the entirety of $s could # not match even if more characters were added. $s = '123'; $r = qr/456/; could_match( $r, $s ) # returns FALSE
  • Comment on Can I determine index point of failure in string during regex match attempt
  • Download Code

Replies are listed 'Best First'.
Re: Can I determine index point of failure in string during regex match attempt
by AnomalousMonk (Archbishop) on May 06, 2014 at 23:32 UTC

    If  $r is really just a plain, simple string of characters, it seems to me you're asking questions that can easily be answered by index. If  $r is a regex of limitless complexity, it seems much more difficult, perhaps impossible, to answer the question "could there be a match if more characters were added to the right end of $s?"

    c:\@Work\Perl>perl -wMstrict -le "my $r = '123'; ;; for my $s ('', qw(1 12 123 1234 2 23 234)) { my $left_same = 0 == index($r, $s); my $all_same = $r eq $s; printf qq{%-6s with more stuff could %smatch with '$r' \n}, qq{'$s'}, ($left_same and not $all_same) ? '' : 'NOT ' ; } " '' with more stuff could match with '123' '1' with more stuff could match with '123' '12' with more stuff could match with '123' '123' with more stuff could NOT match with '123' '1234' with more stuff could NOT match with '123' '2' with more stuff could NOT match with '123' '23' with more stuff could NOT match with '123' '234' with more stuff could NOT match with '123'
    Updates:
    1. Maybe the idea of Levenshtein distance (LD) could also be brought to bear. I.e., if the two strings (again, I'm assuming they're both just simple strings) are the same at the left, the LD gives an idea of how much one must change the shorter string to make it the same as the longer. (LD == 0 means both strings were exactly the same to begin with.)
    2. OTOH, If you just want to answer the question "can  $s with one or more characters added and anchored at the start of  $r match?" (again assuming  $s $r to be plain strings), the regex
          $r =~ m{ \A \Q$s\E .+ }xms
      would seem to do the trick (note reversal of $s and $r).

      For the case of a simple string of characters, that would be an easy way to handle it. And as Oiskuu mentioned below, full regex capability probably does essentially make this a variant of the Halting problem.

      So perhaps a simpler question. Can you get the index of the last successful matching character for a failed regex? Judging from what I've seen, I don't think so, but I'm often wrong :)

        My limited understanding of regex behavior is that the RE tries to match the regex  'aaaa' =~ /b/ everywhere, so the last 'failed' match position would always be at the end of the  'aaaa' string. (I believe that on-going regex optimization efforts have led to an RE that will abandon matching very quickly, perhaps not even start, if faced with such a simple regex as the one in the example. But as the regex begins to be even a little more complex, attempts at such optimizations are soon frustrated.)

        In the case of an anchored or unanchored regex, the RE must, of course, 'know' in some sense when and where matches fail, and where the last attempt ends. But since the RE is only concerned with reporting successful matches and not unsuccessful ones, of which there may be many and many, this information is not, as far as I am aware, preserved.

        In any event, it still seems to me that questions like "what is the last offset at which 'efg' matches in 'abcdefghi'?" or "what is the last offset at which 'abc' anchored at the start of 'abcdef' matches?" can easily be answered by index.

Re: Can I determine index point of failure in string during regex match attempt
by oiskuu (Hermit) on May 06, 2014 at 23:36 UTC

    If you can compile your regex into DFA, you will be able to tell (trivially). However, perl regex have backreferences, can embed code, and so on. Basically the Halting Problem? In that case, the answer is no.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1085239]
Approved by davido
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (7)
As of 2024-04-23 14:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found