Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Comparative satisfiability of regexps.

by Meowse (Beadle)
on Jan 19, 2005 at 20:06 UTC ( #423487=perlquestion: print w/ replies, xml ) Need Help??
Meowse has asked for the wisdom of the Perl Monks concerning the following question:

This isn't exactly a Perl question, but Perl gurus do regexps better than anyone else I know, so I humbly request your wisdom.

The challenge: given two regexps A and B, determine (ideally in polynomial time) whether all strings that match A will also match B.

The motivation: to determine whether a process whose output is characterized by regexp A can guarantee to correctly drive a process whose input is characterized by regexp B.

The details: mildly crippled subsets of regexp are acceptable, if full-strength regexps make the problem computationally intractable or impossible. Execution time is not a huge issue; the regexps in question will be short, and the evaluation will be done at design time rather than at runtime (thus "ideally in polynomial time" above).

Thanks much!

Mickey.

Comment on Comparative satisfiability of regexps.
Re: Comparative satisfiability of regexps.
by CountZero (Bishop) on Jan 19, 2005 at 20:56 UTC
    As the numbers of strings is infinite, brute forcing the two regexes over all the strings is not possible, so you are down to finding some theoretical solution (and I have no clue even how to start with it).

    Only if the regexes determine a limited number of strings, you can perhaps work backward and calculate the set of acceptable strings for the "A" regex and the "B" regex. If both sets are the same, you have proven your point.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Comparative satisfiability of regexps.
by tall_man (Parson) on Jan 19, 2005 at 21:04 UTC
    In "Mastering Regular Expressions" p. 156-157, there is a comparison of nondeterministic finite automaton (NFA) and determinisitic finite automaton (DFA):

    "There could be a hundred ways to acheive the same result, but since the DFA keeps track of them all simultaneously (almost magically -- more on this later), it doesn't matter which form the regex takes. To a pure DFA, even expressions that appear as different as abc and [aa-a](b|b{1}|b)c are indistinguishable."

    If you accept only regular expressions recognizable by a DFA (no captures to match later in the string, no embedded code) as your "mildly cripped subset" then your problem should be solveable.

      Actually Friedl is making a point about how the engine would match, not that the state tables involved would be the same. IOW there is no guarantee that a DFA constructed from abc would have the same state transition tables as from [aa-a](b|b{1}|b)c.

      Instead of trying to build a DFA regex engine (a non trivial task) I'd suggest just writing some code that can normalize two regexes to the same representational format. If you can normalize two regexes to the same pattern then they are the same. Of course this doesnt help when it turns out that the receivers regex can accept a superset of the emitters pattern generator. Ie: the emitting regex could be /a/ and the receiving regex could be /a|b/ which means the receiving regex can handle anything that the emitting regex can produce but vice versa is not true. IMO if you figure that problem out you are well on your way to solving the halting problem.

      ---
      demerphq

        Any technology provably equivalent to a perpetual motion machine is junk. :-)

        Actually, what I'm trying to resolve is the "receivers regex can accept a superset of the emitters pattern generator" question. Specifically, whether that is equivalent to the halting problem or not.

        As I said in another note, I believe that DFAs require a memory strip to be Turing-complete, and thus analysis of regexps doesn't run into the halting problem. But I'm not certain of that, and my last Comp. Sci. course was very long ago.

        Resolving to canonical form is a good first step, but it's not entirely clear how to algorithmically determine that, for example, aaaab?aaaa is a subset of aaaaab*aaaaa. That one took a moment to prove and double-check in my head, and it's a *really* trivial case compared to some I'd like to verify.

        Brain-achingly,

        Mickey.

        It is true that the two DFAs may not be exactly alike. However, any two DFAs can be proved equivalent (or not) in finite time, according to this article (in PowerPoint format).

        Of course, "finite" is not the same thing as "practical". The brute-force solution mentioned in that article is exponential (test all strings of length equal to the number of states). There is a second method that XORs the two DFAs that looks more promising.

        It looks like either method could be adapted to test whether one DFA accepts a subset of the other.

        Update: blokhead below gives the same solution in a more thoroughly-explained way.

      *nod* Yours is the second pointer to "Mastering Regular Expressions" I've received, and theorbtwo said substantially the same thing (doable if the regexps are deterministic) in the Chatterbox earlier.

      *ponder* of course, any NFA can be mechanistically converted to a DFA (by creating a state for every reachable combination of NFA states), so really the answers should be equivalent. But since NFA/DFA's aren't actually Turing-complete (barring a Turing-machine-style memory tape), we don't run into the halting problem. Huh. Sounds like it should be solvable.

      But it's sounding like I may have to write the actual code myself; at least, nobody has spoken up and said "Oh, yes, that's been solved, and here's a link" yet. Which is, honestly, what I was hoping for. :-)

      Of course, even the fact that two deterministic regexps can be proven to be identical doesn't actually answer the original question. abc and ab[c|d] will have different DFA representations, but all strings that abc matches will also be matched by ab[c|d]. But it does suggest some approaches for resolving the issue.

      Anyone got more advice?

      Thanks 10e6,

      Mickey.

Re: Comparative satisfiability of regexps.
by Solo (Deacon) on Jan 19, 2005 at 21:32 UTC
    This problem is basically asking whether 2 algorithms do the same thing just by examining the coded representations. This concept problem must exist in academia, does anyone know the name for it?

    --Solo

    --
    You said you wanted to be around when I made a mistake; well, this could be it, sweetheart.

      "Compiler optimizations" :)

      Seriously, an optimizer's job is to anaylize code, realize that it is computationally equivilent to another peice of code, and then using the better version ("better" could be one or more of less processor time, RAM, space on disk, etc.).

      I'm not sure if this is useful here or not, though.

      "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

        Which brings to mind a good point, for which I thank you. That is--the challenge "compare two algorithms, and decide if they do the same thing" is provably incomputable in the general case (c.f. "the Halting Problem")--and yet, compilers replace code with more efficient but guaranteed-identical code every day. Thus, while it might not be possible to determine for all regexps whether one is a subset of another, it may well be possible for most regexps to determine if one is a subset of another.

        And this is all that is needed in our case, because we're trying to derive as much information as possible, not all information. If we can only prove nine of ten steps in a process, we can't say that we've proven the process, but we're a lot better off than if we couldn't prove any of them.

        So. If we restate the problem as "For most simple regexps, determine whether regexp A matches a subset of the strings that regexp B matches", does that make it seem more tractable?

        Hopefully,

        Mickey.

Re: Comparative satisfiability of regexps.
by blokhead (Monsignor) on Jan 19, 2005 at 22:56 UTC
    As long as you mean regular expressions in the CS-theory sense (no backrefs, etc), yes it's perfectly computable, although it takes exponential time (and space) in the general case. No, this is not as general as the undecidable question "are these two algorithms equivalent" that others have mentioned in this thread. The underlying critters we are dealing with are regular languages (not recursive languages), and all decision properties of regular languages (that I can think of, at least) are decidable.

    Here's how a theory person would check if two regexes were equivalent. In fact, this is a very good question for a first exam in a computational models/formal languages class.

    1. Convert both regular expressions to equivalent NFAs. This is probably the most annoying part as you need a regex parser. This can be done in linear time.
    2. Convert the NFAs to DFAs. This is the part that takes exponential time in the general case.
    3. Do a product construction on the two DFAs, setting the accept states appropriately so the result accepts the symmetric difference of the two languages. This is done in quadratic time.
    4. Check to see if the resultant DFA accepts the empty language. This is easily done in linear time with a DFS/BFS traversal. If so, then the two regular expressions you started with were equivalent. If not, you can report back any string which this DFA accepts as a counterexample to their equivalence (such a string would be in one language but not the other).
    A review of the first few chapters of your favorite automata book might be helpful to fill in the gaps. I have the old classic Intro to Automata Theory, Languages & Computation by Hopcroft/Ullman. It goes through these constructions with an algorithmic feel, which you might appreciate.

    Of course, this begs the question, is there a regular language playground module, where I can take a regular language, perform all the closure property constructions, decision problems, and conversion between NFA/DFA/regex representations? And if not, why not? ;)

    Update: Fixed ISBN link (for some reason, the ISBN on the back of my edition wasn't recognized.

    blokhead

      Thank you, my friend. That is precisely the level, type, and depth of answer I was looking for. There's lots in there that I'll need to research, but that's where the meat of the problem is--in the learning.

      My, but it's been a long time since I formally studied regular expressions and finite automata. I can remember, vaguely, remembering being good at them. But I can't even remember being good at them, and I'm certainly not good at them now.

      Ullman? I TA'd Introduction to Computer Science for Dr. Ullman, back in the day. Smart, smart man. And now I'm heading back to the well to tank myself up. Small world. :-)

      Thanks again for the help, blokhead. And if/when I actually get some working code out of this in the direction of your above-mentioned "regular language playground module", I'll be sure to post about it here.

      Gratefully yours,

      Mickey.

      Just to append to your CS theory regexp, you can't use memory beyond storage for the original string and the regular expression itself. Backrefs require memory to store values and do manips on them.

      ----
      Give me strength for today.. I will not talk it away..
      Just for a moment.. It will burn through the clouds.. and shine down on me.

Re: Comparative satisfiability of regexps.
by BrowserUk (Pope) on Jan 20, 2005 at 06:31 UTC

    I cannot add anything to the debate whether this is doable, but I have to ask why you would want to do this?

    Other than an academic answer, "To see if it is possible."; or an enigmatic one "Because I can."; both of which are perfectly fine answers, I cannot see a reason for doing this.

    I cannot imagine the situation in which characterising a routine's inputs and or outputs in terms of some restricted subset of regex is the best, or even the easiest, way of characterisation. I guess it could be the most concise, but you generally know much more about an input ot output field than can easily be captured into a regex. Especially at design time?


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.

      The reason I was guessing at is that they have a lot of data to validate based on a complex regex and were considering using another faster regex as a precheck to save on processing.

      If this was the case then it's to prove that the precheck won't reject anything that the main check would allow.

        That's a pretty good application, though I have to say that I would probably opt for a statistical approach to verification.

        Generate a bunch of random strings, pass them through the stricter test, and then pass it's output through the quicker. If the second results set equals the first, you're good to go.

        With a good knowledge of what the regex are attempting to match, it should be relatively easy to generate random datasets that exercise most of the edgecases, or at least a statistically computable subset thereof.

        And that was my point. Using the knowledge of what the regex are trying to match, and statistics, is probably much easier than try to cross-compare to regex, for which the comparison is only valid if you can first prove that at least one of the regex is itself "correct"--which is an extremely hard task to start with.

        Unless you can:

        • use a definite dataset to prove one of them.

          In which case you could use the same dataset to prove the other.

        • You accept a statistical proof of one.

          Which again, could be used to prove the other.

        Proving the compatibility of two unprovable entities doesn't really buy you a lot unless the comparability test is very quick and cheap--which doesn't appear to be the case here.

        There is no point in measuring something, unless you are pretty sure that your ruler is accurate!


        Examine what is said, not who speaks.
        Silence betokens consent.
        Love the truth but pardon error.
      Oh, man, you guys are gonna have a field day with this one. *wry grin*

      I'm attempting to answer the question, "Does any XML document described by A.xsd satisfy the constraints of B.xsd?" Which is, as far as I can tell, calculable iff one can calculate the answer to the question, "Does regexp A satisfy the constraints of regexp B?"

      I was quite surprised to find that nobody is apparently even considering the question as it relates to .xsd's, which to my mind misses one of the really great opportunities for proving the correctness of loosely coupled, distributed application networks. Apparently, people are only concerned either with tightly-coupled, pre-negotiated application networks (where correctness is known because application A's output and application B's input are constrained by the same .xsd), or with ad-hoc, one-off verification that this particular instance of application A's output is consistent with application B's input requirements.

      I think this is regrettable, and should be fixed. Hence the preceding discussion.

      Thanks for your patience, and your knowledge, and your help!

      Mickey.

      UPDATE:An .xsd (XML Schema Definition) file defines the allowable names, structures, and contents of XML elements and attributes in an XML file. One might write a calendaring application that took XML input, and whose input was specified by an .xsd ("a .xsd"? "an xsd"? "an .xsd"? Bah!) that required the input XML to contain a date element, a time element, and a schedule_item_name element. And so on.

        I'm attempting to answer the question, "Does any XML document described by A.xsd satisfy the constraints of B.xsd?" ...

        Ouch! Well that told me :)

        I have to say that the thought that crosses my mind is an old joke about the guy that answered the question:

        How can I get to X*, from here?

        *Avoiding racial sterotypes here!

        And got the reply.

        If I were you, I wouldn't be starting from here!

        I happen to think that the benefits of XML have been way oversold, and I'm not even certain that I know what an .xsd is.

        In any case, I wish you the very best of luck in your endevour.


        Examine what is said, not who speaks.
        Silence betokens consent.
        Love the truth but pardon error.
        Like XML, xsd appears to be a language with arbitrary nesting of tagged regions. In addition, it seems to define symbols which are used in other expressions. Regular expressions are not a suitable tool for such a language. You need something like an LALR or recursive descent parser.

        For example, this is a fragment I found:

        <xsd:complexType name="Items"> <xsd:sequence> <xsd:element name="item" minOccurs="0" maxOccurs="unbounded"> <xsd:complexType> <xsd:sequence> <xsd:element name="productName" type="xsd:string"/> <xsd:element name="quantity"> <xsd:simpleType> <xsd:restriction base="xsd:positiveInteger"> <xsd:maxExclusive value="100"/> </xsd:restriction> </xsd:simpleType> </xsd:element> <xsd:element name="USPrice" type="xsd:decimal"/> <xsd:element ref="comment" minOccurs="0"/> <xsd:element name="shipDate" type="xsd:date" minOccurs="0" +/> </xsd:sequence> <xsd:attribute name="partNum" type="SKU" use="required"/> </xsd:complexType> </xsd:element> </xsd:sequence> </xsd:complexType> <!-- Stock Keeping Unit, a code for identifying products --> <xsd:simpleType name="SKU"> <xsd:restriction base="xsd:string"> <xsd:pattern value="\d{3}-[A-Z]{2}"/> </xsd:restriction> </xsd:simpleType>

        It does embed regular expressions here and there to define new data types (the definition of "SKU" for example). If those are all you are trying to validate, our suggestions should work. Otherwise, this is a much more difficult problem.

        Update: I'm also having trouble seeing the utility of this proposed checker. Suppose I have two arbitrary web applications A and B, and both were kind enough to supply xsd schemas. Can I determine if they can interact meaniningfully? Not really. It's a matter of semantics, not just matching schemas. Say both define a "Price" field. How do I know one is not in dollars and the other in euros? I think that's why the usual practice is "pre-negotiated" schemas.

        Um, stupid question time: Since it sounds like you are already using XML (or could), and already have an XSD (or two), why aren't you either:

        • Changing the input and output programs to use the SAME .XSD. Then either end can validate the document against the XSD and assume that if it is valid it will work. Or,
        • Using an XSLT to get from one format to the other. If you have XSD for both, then you should only have to build the XSL transform once, and just run it on the data as it goes through.

        It just really sounds like you're making this too hard. Or I completely don't get it.

        Footnote: For those wondering what XSLT is, it is an XML application that defines transformations that can be applied to XML documents. It can convert XML to XML, HTML, or just about anything else.


        --
        Spring: Forces, Coiled Again!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://423487]
Approved by Limbic~Region
Front-paged by Enlil
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (4)
As of 2014-10-21 00:56 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (94 votes), past polls