http://www.perlmonks.org?node_id=471541

dbrockman has asked for the wisdom of the Perl Monks concerning the following question:

The other day, someone posted a message to the Ruby mailing list saying he was writing an application with a text box whose contents were supposed to match a certain regular expression. When the user “made a mistake,” the poster wanted his application to display a warning message. For this purpose, “making a mistake” can be defined as typing a character that, regardless of any further input, definitely prevents the input string from matching the given regular expression.

I thought this problem was interesting enough to ask the Perl community for help. (I have nothing against Ruby people, but in all honesty, when it comes to answering difficult questions about regular expressions, I have rather much more faith in you guys.)

So, my question is this: How can you, given a string s and a regular expression p, determine whether s is a prefix of any string that matches p? (Please note that s itself does not have to match p; if it does, there is no problem.)

For example, let p = /\d{3}-\d{3}/. Then the string "123-4" is the prefix of a match, while the string "123-x" is not.

I’m not sure, but I think that the regular expression engine might already have all information required to solve the problem, but it probably lacks the API needed to share it.

When trying to match, if the engine never reaches the end of the string before failing, I think it is safe to conclude that there is an “error” in the string. This is based on the assumption that if the end of the string is never reached, then there can be no way to make it match by appending stuff to the end of it, because the regular expression engine doesn’t even look at that part!

Using the same example as above, when trying to match "123-4" against /\d{3}-\d{3}/, the regular expression engine will consume all five input characters, reach for another one, and only then conclude that there is no match. On the other hand, when the engine is trying to match "123-x", it will stop (well, I guess it will start backtracking for a while and then stop) once it hits the ‘x’; it will never hit the end of the input and reach for more.

Hovewer, I am less certain about the converse: If there is an “error” in the string, does that necessarily imply that the end of it can never be reached when attempting to match?

That is, just because the end of the string "123-4" was in fact reached, is that sufficient reason to conclude that there can be no error in the string?

Practical aside: Is there any way to find out how much of a string the regular expression engine was able to consume when attempting to match it against some pattern?

I eagerly await a reply from anyone who can share some insight into this problem. Oh, by the way, you may assume that the regular expression does not contain any obscure extensions, if that makes the problem easier.

  • Comment on Checking whether a string is a prefix of any string matching a given regular expression
  • Select or Download Code

Replies are listed 'Best First'.
Re: Checking whether a string is a prefix of any string matching a given regular expression
by crashtest (Curate) on Jul 01, 2005 at 05:54 UTC

    In terms of Perl regular expressions, I believe you're thinking about the matching process the wrong way. The way you're asking the question implies that the regular expression provides a road map for the engine to track along the input string, character for character, to determine a match. That's how you could "reach the end of the (input) string". However, the matching process is done the other way around: The regular expression engine tracks along the regular expression, using the input string as its "map".

    I think that means the question you're asking about reaching the end of a prefix string can't really be answered, because the Perl regular expression engine never asks it. As an example, take the expression /^abc(?<=d)/, which can never match any string. If your input string starts with "abc", the Perl regex engine will track the expression successfully up to ^abc, moving all the way through the prefix "abc". Although the "end" of the prefix is reached, there is no way to realize that the full string cannot match without continuing to step through the regular expression.

    The distinction of matching by moving along the input string vs. moving along the regular expression is the difference between a DFA regular expression engine and an NFA regular expression engine. My understanding is that the NFA engine is by far the most popular version, simply because the DFA cannot perform many of the more popular regular expression features, like backreferences or look-around.

    Although some programs (like grep) have hybrid implementations that use both the DFA and the NFA technique, Perl I believe is exclusively NFA. Meaning that the engine does not, in fact, have the information necessary to answer your question, even for simpler regular expressions that omit backrefences or look-around.

    The more interesting question might be whether Ruby can do this. I have no inkling what kind of engine it uses.

    I highly recommend Mastering Regular Expressions by Jeffrey Friedl. Almost everybody comes away from it saying "Well, I thought I understood regular expressions, but I didn't." I also hope I got most of this right, and that the many wise monks out there can correct any mistake I might have made.

      You say that the question of whether the end of the input string was reached cannot be answered because “the Perl regular expression engine never asks it.” I believe that it, in fact, does. It obviously tracks along both the pattern and the input string. How does it matter which is the “road map” and which is the “road”?

      Your example of trying to match the string "abc" against the pattern /^abc(?<=d)/ is interesting. First of all, for the purpose of my original question, I don't consider this matching process to “reach the end of the input string.” I should have made this clear.

      For the engine to “reach the end” in the sense I intended, it would have to consume all input characters and then try to consume another one. That is, it has to care about a potential suffix. Maybe I should have said “bump into the end of the input” rather than the less evocative “reach the end of the input” (although I do believe I used the verb “hit” once).

      In your example, the engine would give up after consuming all three characters in the input string and seeing that the lookbehind fails. It would not try to consume another character and thus not “bump into” the end of the input string. So using the conjectured implication I described in my original post (i.e., string does not match and engine never bumped into the end ⇒ string is not a prefix of any matching string), I would predict that the input string is not the prefix of any matching string, and I would be correct—because, as you said, in this case there are no matching strings!

      I am not familiar with the DFA and NFA techniques and their relative merits, but frankly, I don’t see how those details are relevant. You say that Perl does not have the information necessary to answer my question, which makes me wonder what you think that information is. (Surely it must know at any given time which part of the input string it is working with, and whether it ever tried to read past the end of the input.)

      My reasoning is as follows: If the engine decides not to look at some suffix of the input, that can only mean that the match attempt has already failed and that the engine will then start backtracking. Hence, if there is some character in the input string that the engine never looks at, then the string certainly does not match (but not the other way around—when the engine looks at every character, the match can clearly either succeed or fail). From this it follows that if (a) the string does not match, and (b) the engine never tries to look past the end of the string, then adding characters to the end of the string cannot make it start matching (because the engine does not even look there).

      Does that make sense to you? Please let me know if you think I’m being unclear on some point.

Re: Checking whether a string is a prefix of any string matching a given regular expression
by hv (Prior) on Jul 01, 2005 at 11:52 UTC

    Hmm, interesting problem.

    1. I think you need to start with the assumption that the pattern is anchored at the start: /a/ might match "bbbbb..." as long as you throw an "a" in there at some point in the future, but /^a/ certainly cannot.

    2. To get this directly out of the regexp engine you'd need to disable all its optimisations (and that's not something it is possible to do right now) - matching /^abcdef/ against "abc" will never read individual characters of the string, since an optimisation will see that the string is shorter than the minimum possible length before getting that far.

    3. Defining what you want may become trickier with lookaheads: matching "abcd" =~ /^ab(?=.*foo)cd/ makes defining the "right" answer rather more difficult.

    4. I think the definitions need some work in any case: "this is a substring of something that would match" will be difficult to know in general, and may be impossible in some cases. To offer a variant of crashtest's example: "ab" =~ /^abc(?<=d)/ is something that an unoptimised regexp engine might see as a prefix, even though no string actually matches the pattern.

    Better, perhaps, would be to look for the longest prefix of the string that matches some prefix of the pattern. This might even be doable in perl5 - you'd need a function that takes a pattern and removes the last element, and then try a variety of prefixes of the string against each prefix of the pattern. Note that chomping /^\d{3}-\d{3}/ should give you /^\d{3}-\d{2}/, so it isn't entirely trivial, but as long as you're prepared to accept that the granularity might not always be perfect I think the code wouldn't be too bad even to cope with quite complex regular expresssions.

    5. It is possible that at some point the perl5 regexp engine may become improved to the point that you can a) disable optimisations and b) provide some code for the "get next character in string" hook. This would allow you to achieve at least a variant of what you want with little pain, but I wouldn't hold your breath for the facilities.

    6. It is likely this will be possible in perl6 before it is in perl5, since the perl6 authors have the benefit of starting from scratch with the lessons of perl5. You might want to get involved in that effort, at least enough to make clear what facilities you would want - but I think the facilities I suggest in (5) would be enough, and I'm pretty certain those are already planned.

    Hugo

      Thank you for your insightful reply, Hugo.

      Regarding /^a/ and /a/: I don’t see any problem with either. You can make errors when typing against the former, as you said, and my algorithm detects those errors. On the other hand, you can never make an error when typing against the latter, as you also said (and my algorithm will never report a false positive).

      I’m not very familiar with the implications of lookaheads, but your example of "abcd" =~ /^ab(?=.*foo)cd/ seems rather unproblematic to me. The engine would go looking for “foo” and thereby bump into the end of string, which means that "abcd" cannot be ruled out as a potential prefix of some match of the given pattern (which is indeed correct, because it is a prefix of, e.g., "abcdfoo"). Can you think of a case in which the algorithm gives the wrong answer?

      Your next example is more interesting. I agree that applying the algorithm to "ab" =~ /^abc(?<=d)/ would fail to establish that "ab" is not a prefix of any matching string. But this is an imaginary/academic problem, because the given regular expression would never occur in the real world except as an outright bug. I wonder if the problem can occur non-pathological cases (i.e., for regular expressions that can actually match things). Can you think of any example?

      I find intriguing the idea of iterating through the Cartesian product of input string prefixes and pattern prefixes (the latter of which will be non-trivial to generate, as you point out), and attempting to match each string–pattern pair. Though obviously inefficient, I think the fact that this is something you could implement without touching any C code makes it the most attractive solution so far. :-)

        I’m not very familiar with the implications of lookaheads, but your example of "abcd" =~ /^ab(?=.*foo)cd/ seems rather unproblematic to me. The engine would go looking for “foo” and thereby bump into the end of string, which means that "abcd" cannot be ruled out as a potential prefix of some match of the given pattern (which is indeed correct, because it is a prefix of, e.g., "abcdfoo"). Can you think of a case in which the algorithm gives the wrong answer?

        Compare the results against "abxy" =~ /^ab(?=.*foo)cd/. The Cartesian product approach will say for both cases that "ab" =~ /^ab/ is the longest matching prefix, while checking what characters are looked at would claim the whole string is ok in both cases.

        Your next example is more interesting. I agree that applying the algorithm to "ab" =~ /^abc(?<=d)/ would fail to establish that "ab" is not a prefix of any matching string. But this is an imaginary/academic problem, because the given regular expression would never occur in the real world except as an outright bug. I wonder if the problem can occur non-pathological cases (i.e., for regular expressions that can actually match things). Can you think of any example?

        Well unmatchable patterns could quite reasonably arise in some situations, particularly if they are themselves generated. But I think the likely real-world problems would occur with backreferences, of the /^(something)other\1/ variety, but I can't think of a concrete example off the top of my head.

        Hugo

Re: Checking whether a string is a prefix of any string matching a given regular expression
by BrowserUk (Patriarch) on Jul 01, 2005 at 14:42 UTC

    I don't think that you can expect the regex engine to help much here.

    If your regex always match fixed length strings, which is fairly typical for this kind field entry data validation application, then you might get away with padding the partial string from a known good pattern something like this:

    Otherwise, if your patterns can contain variable width quantifiers, but not unbounded quantifiers (eg. .? or .{1,3} would okay but .*, .+ and .{3,} would not be), then it would not be too hard to programmically break up simple regex into a set of equivalent regex that could be applied sequentially or as an alternation to do the vaidation. I say not too hard, but non-trivial enough that you would need to apply a set of restrictions on what patterns where acceptable in the original regex. In other words, non-trivial enough that I wasn't prepared to expend the effort trying to do it for the purposes of demonstaration :). Hence, I hand coded the break up for your sample regex in the code below.

    Better code would require a better spec.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Checking whether a string is a prefix of any string matching a given regular expression
by bsb (Priest) on Jan 16, 2006 at 17:13 UTC
    I think the technique used in Generating regex strings with a regex might be adapted to your problem. Using (?{}) blocks allows you to get inside regex processing to a degree.

    The Regexp::Genex module based on the above transforms a regex using YAPE::Regex, you could write a similar transform to answer your question. The transformed version would have to be peppered with an illegible sequence of (?{}) terms although I don't know what form it would take. Something like:

    /^(\d(\d(\d(-(\d(\d(\d$)|$)|$)|$)|$)|$)|$)/