dbrockman has asked for the wisdom of the Perl Monks concerning the following question:
The other day, someone posted a message to the Ruby mailing list saying he was writing an application with a text box whose contents were supposed to match a certain regular expression. When the user “made a mistake,” the poster wanted his application to display a warning message. For this purpose, “making a mistake” can be defined as typing a character that, regardless of any further input, definitely prevents the input string from matching the given regular expression.
I thought this problem was interesting enough to ask the Perl community for help. (I have nothing against Ruby people, but in all honesty, when it comes to answering difficult questions about regular expressions, I have rather much more faith in you guys.)
So, my question is this: How can you, given a string s and a regular expression p, determine whether s is a prefix of any string that matches p? (Please note that s itself does not have to match p; if it does, there is no problem.)
For example, let p = /\d{3}-\d{3}/. Then the string "123-4" is the prefix of a match, while the string "123-x" is not.
I’m not sure, but I think that the regular expression engine might already have all information required to solve the problem, but it probably lacks the API needed to share it.
When trying to match, if the engine never reaches the end of the string before failing, I think it is safe to conclude that there is an “error” in the string. This is based on the assumption that if the end of the string is never reached, then there can be no way to make it match by appending stuff to the end of it, because the regular expression engine doesn’t even look at that part!
Using the same example as above, when trying to match "123-4" against /\d{3}-\d{3}/, the regular expression engine will consume all five input characters, reach for another one, and only then conclude that there is no match. On the other hand, when the engine is trying to match "123-x", it will stop (well, I guess it will start backtracking for a while and then stop) once it hits the ‘x’; it will never hit the end of the input and reach for more.
Hovewer, I am less certain about the converse: If there is an “error” in the string, does that necessarily imply that the end of it can never be reached when attempting to match?
That is, just because the end of the string "123-4" was in fact reached, is that sufficient reason to conclude that there can be no error in the string?
Practical aside: Is there any way to find out how much of a string the regular expression engine was able to consume when attempting to match it against some pattern?
I eagerly await a reply from anyone who can share some insight into this problem. Oh, by the way, you may assume that the regular expression does not contain any obscure extensions, if that makes the problem easier.
|
---|