Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Why machine-generated solutions will never cease to amaze me

by grinder (Bishop)
on Nov 10, 2004 at 11:24 UTC ( [id://406643]=perlmeditation: print w/replies, xml ) Need Help??

Consider the set of strings construction, contorsion, destruction, transaction. Using the module Regexp::Optimizer, you can generate the pattern (?:con(?:struct|tors)ion|(?:destru|transa)ction) as a result. This is an excellent pattern, since no backtracking is required (notwithstanding the inspections required to examine different a|b|c alternations). As soon as this pattern fails to track the target string being matched it can fail quickly. This is a very desirable property.

A contrario, the simple-minded pattern construction|contorsion|destruction|transaction will backtrack on a target like conduct, because it will have to try both "con-" alternations. (And apologies to RE engine internals specialists, I know this explanation skips over a few details).

All this is well and good, but consider the set of four words again. They all end in "-ion", and it would be nice to be able to factor that out. Because when you have lots and lots of strings (read: thousands) with common tails, the savings are considerable. My algorithm produces (?:con(?:struct|tors)|(?:destru|transa)ct)ion. The tradeoff is that at the cost of making the engine jump through a few more hoops, the pattern is shorter.

In putting the finishing touches on the module, and making sure my test suite covers all the edge cases, I prepared a particular test to cover what I expected the module should produce. When I ran the test, the module produced something else.

Consider the set of strings sad, salad, spread. Quick, think of a pattern that will match these strings, factoring out the trailing "-ad". My test case solution was that it should generate s(?:al|pre)?ad. When I ran the test, it failed. Instead, it produced s(?:(?:al)?|pre)ad. That looks so bizarre I had to paste in manually into a one-liner to prove to myself that it worked.

And the crazy thing is that it does work. What is more, when running with use re 'debug', to my untutored eye, it appears to do so more efficiently.

The (al|pre)? pattern produces the following debug output:

atching REx `^s(?:al|pre)?ad$' against `sad' Setting an EVAL scope, savestack=6 0 <> <sad> | 1: BOL 0 <> <sad> | 2: EXACT <s> 1 <s> <ad> | 4: CURLYX[0] {0,1} 1 <s> <ad> | 13: WHILEM 0 out of 0..1 cc=bfbfe980 Setting an EVAL scope, savestack=12 1 <s> <ad> | 6: BRANCH Setting an EVAL scope, savestack=22 1 <s> <ad> | 7: EXACT <al> failed... 1 <s> <ad> | 10: EXACT <pre> failed... Clearing an EVAL scope, savestack=12..22 failed, try continuation... 1 <s> <ad> | 14: NOTHING 1 <s> <ad> | 15: EXACT <ad> 3 <sad> <> | 17: EOL 3 <sad> <> | 18: END Match successful! Guessing start of match, REx `^s(?:al|pre)?ad$' against `salad'... Found floating substr `ad'$ at offset 3... Found anchored substr `s' at offset 0... Guessed: match at offset 0 Matching REx `^s(?:al|pre)?ad$' against `salad' Setting an EVAL scope, savestack=6 0 <> <salad> | 1: BOL 0 <> <salad> | 2: EXACT <s> 1 <s> <alad> | 4: CURLYX[0] {0,1} 1 <s> <alad> | 13: WHILEM 0 out of 0..1 cc=bfbfe980 Setting an EVAL scope, savestack=12 1 <s> <alad> | 6: BRANCH Setting an EVAL scope, savestack=22 1 <s> <alad> | 7: EXACT <al> 3 <sal> <ad> | 13: WHILEM 1 out of 0..1 cc=bfbfe980 3 <sal> <ad> | 14: NOTHING 3 <sal> <ad> | 15: EXACT <ad> 5 <salad> <> | 17: EOL 5 <salad> <> | 18: END Match successful!

Whereas the ((al)?|pre) pattern produces:

Matching REx `^s(?:(?:al)?|pre)ad$' against `sad' Setting an EVAL scope, savestack=6 0 <> <sad> | 1: BOL 0 <> <sad> | 2: EXACT <s> 1 <s> <ad> | 4: BRANCH Setting an EVAL scope, savestack=16 1 <s> <ad> | 5: CURLYM[0] {0,1} 1 <s> <ad> | 7: EXACT <al> failed... matched 0 times, len=0... Setting an EVAL scope, savestack=16 trying tail with n=0... 1 <s> <ad> | 15: EXACT <ad> 3 <sad> <> | 17: EOL 3 <sad> <> | 18: END Match successful! Guessing start of match, REx `^s(?:(?:al)?|pre)ad$' against `salad'... Found floating substr `ad'$ at offset 3... Found anchored substr `s' at offset 0... Guessed: match at offset 0 Matching REx `^s(?:(?:al)?|pre)ad$' against `salad' Setting an EVAL scope, savestack=6 0 <> <salad> | 1: BOL 0 <> <salad> | 2: EXACT <s> 1 <s> <alad> | 4: BRANCH Setting an EVAL scope, savestack=16 1 <s> <alad> | 5: CURLYM[0] {0,1} 1 <s> <alad> | 7: EXACT <al> 3 <sal> <ad> | 9: SUCCEED could match... matched 1 times, len=2... Setting an EVAL scope, savestack=16 trying tail with n=1... 3 <sal> <ad> | 15: EXACT <ad> 5 <salad> <> | 17: EOL 5 <salad> <> | 18: END Match successful! Freeing REx: `"^s(?:(?:al)?|pre)ad$"'

And try as I might, I can't find a string that behaves differently between the two. I've tried all the possibilities I can think of, like sadad salalad saladad slad sal, but they behave identically. I still have a nagging doubt that there may be a difference; if anyone can point it out I be most appreciative.

Before I embarked on this journey into regular expressions, I would not have guessed that foo|(bar)?|rat is logically equivalent to (foo|bar|rat)?. It took a machine-generated solution to the problem for me to realise. You learn something every day.

- another intruder with the mooring of the heart of the Perl

Replies are listed 'Best First'.
Re: Why machine-generated solutions will never cease to amaze me
by demerphq (Chancellor) on Nov 10, 2004 at 14:24 UTC

    Ive written a number of these regex optimizers and I never saw a satisfactory way of handling the tail. Im really interested in how you are doing your thing, and also how it handles the following:

    foobar foabar foal foaler fool fooler foil foilar foiliar foo-bar foiler

    Also, in experiments ive done, building a Trie and then using it for these tasks outperforms such precompiled regexes as soon as the number of words involved becomes more than a small number. Iirc i saw pureperl Trie solutions outperform regex solutions when the dictionary size was more than a few hundred words. Backtracking over all of those alternations is expensive, whereas a trie solution is entirely backtracking free.

    ---
    demerphq

      how it handles the following

      Well, plugging in that list, I get:

      fo(?:o(?:l(?:er)?|-?bar)|a(?:l(?:er)?|bar)|il(?:(?:i?a|e)r)?)

      Regexp::Optimizer arrives at the same result, more or less:

      fo(?:o(?:l(?:er)?|\-?bar)|a(?:l(?:er)?|bar)|il(?:(?:i?a|e)r)?)

      In other words, there were no common tails in any of the alternations. Remember, it's the forward-looking tree that drives things. Tail folding is a secondary consideration. Pretty freaky that the alternation orders came out the same, though.

      The main problem for me is that a pure Perl solution is useless, since I need to write the patterns for use in another program that uses the PCRE library (otherwise there is the excellent Text::Trie module).

      If the first character of an alternation was an index into a jump table, the cost becomes O(1), at the expense of a slight increase in memory, but as if perl ever cared about that :)

      Im really interested in how you are doing your thing

      My approach is to take the words, insert them into a tree, and then in a second pass do a depth-first search, reversing the leaves and inserting the reversed leaves into another tree. If I get a tree back (as opposed to a forest), I percolate the common trunk upwards towards the root. When there's a subtree has nothing in common I just unreverse the subtree and unwind the results.

      - another intruder with the mooring of the heart of the Perl

        Ah ok, your implementation appears to do pretty much the same thing as mine, disabling common suffix handling when there are multiple suffixes. (I strived _hard_ with little luck to overcome that problem). Just in case you havent thought of it already do you collapse things like /fo(?:o|a)l/ to /fo[oa]l/?

        I have to admit ive never used a module for this stuff. Ive always handrolled. :-) I have a specialized Inline::C Trie matcher for digits that is thousands of times faster than the regex engine (when match times alone are relevent).

        ---
        demerphq

Re: Why machine-generated solutions will never cease to amaze me
by steves (Curate) on Nov 10, 2004 at 13:04 UTC

    It took a machine-generated solution to the problem for me to realise.

    I would reword that as:

    It took building a machine-generated solution to the problem for me to realise.

    There is definitely something about building code generators that gives you much more insight into a problem. I consider the time I've spent building such meta-tools to be about the most instructive time I've spent coding.

    Thanks also for pointing out Yet Another Interesting CPAN Module I want to have a look at.

Re: Why machine-generated solutions will never cease to amaze me
by tmoertel (Chaplain) on Nov 10, 2004 at 18:27 UTC
    (Updated 2004-11-11: Pointed out all-or-none matching constraint that applies here. If you read this post, please also read the follow-on thread started by hv for more regexp-probing fun.)

    (Updated: Added quotation for context; added conclusion.)

    Regarding:

    And try as I might, I can't find a string that behaves differently between the two. I've tried all the possibilities I can think of, like sadad salalad saladad slad sal, but they behave identically. I still have a nagging doubt that there may be a difference; if anyone can point it out I be most appreciative.
    There's no difference because we can prove that the regular expressions are equivalent for any atomic regexes x and y (when constrained to all-or-none matching, which is the case in the original poster's question):
    (x|y)? === (x?|y) === (x|y?)
    The proof goes as follows. (Note: I use (x) instead of (?:x) to denote grouping.)

    1. We start with (x|y)?
    2. We can rewrite as (x|y)|E because, by definition, r? is equivalent to r|E where E matches the empty string.
    3. We can rewrite as x|(y|E) because regex union | is associative.
    4. And rewrite as x|(y?), again by definition of ?
    5. And as x|(y?|0) because the regex that never matches anything – 0 – is the identity for union.
    6. And as (x|y?)|0 by associativity.
    7. And finally as (x|y?), again, because 0 is the identity for union.

    (The proof for the (x?|y) case follows immediately from regex union being commutative.)

    To use your example, let x = (?:al) and y = (?:pre) and the same proof applies. That's why you can't find a string that matches s(?:al|pre)?ad differently than s(?:(?:al)?|pre)ad – there isn't one.

    Sometimes it's easier to go after the proof than the counterexample.   : )

    Cheers,
    Tom

      The proof for the (x?|y) case follows immediately from regex union being commutative.

      Regexp union is not commutative when one of the alternates is a leading substring of another: then order becomes important - (E|x) will always match E in preference to x.

      It is the presence of the outer anchors in the original pattern that disambiguates and thus makes it commutative.

      Hugo

        hv wrote:
        Regexp union is not commutative when one of the alternates is a leading substring of another: then order becomes important - (E|x) will always match E in preference to x.
        Excellent point. (However, union is commutative even in the face of overlapping operands when the union is forced to match all of the target string or not match at all.)
        It is the presence of the outer anchors in the original pattern that disambiguates and thus makes (regexp union) commutative.
        Even anchoring isn't sufficient. What if one of the anchors matches the difference between the union operands? Consider (a|aa) in:
        "aaaa" =~ /a(a|aa)a/

        Fortunately, in the case of the OP's regexes – s(?:al|pre)?ad and s(?:(?:al)?|pre)ad) – the union operands don't overlap, and so classical equivalence (do the regexps generate the same language?) predicts the equivalence of Perl's matching behavior (do the regexps match the same strings identically?)

        Cheers,
        Tom

Re: Why machine-generated solutions will never cease to amaze me
by bart (Canon) on Nov 10, 2004 at 16:05 UTC
    Just for the record, I'd like to point out that there's the module Regex::PreSuf (short for prefix+suffix), which does something similar — namely, combining strings into a single pattern that matches them all:
    #! /usr/local/bin/perl -wl use Regex::PreSuf; print presuf(qw(sad salad spread));
    It produces:
    s(?:al|pre)?ad
    Heh. I think you might feel right at home with it. :)

      Yes, I know about that module. Its biggest shortcoming is that it works on single characters, which makes it pretty useless for applying it to patterns that contain meta-characters.

      For instance, the two patterns foo\d+bar and foo\s+1 will produce foo\\(?:d\+bar|s\+1), and that's not going to match anything the individual patterns match.

      - another intruder with the mooring of the heat of the Perl

Re: Why machine-generated solutions will never cease to amaze me
by FoxtrotUniform (Prior) on Nov 10, 2004 at 15:37 UTC

    Interesting. One of the common complaints levelled against some machine learning techniques (artificial neural networks and genetic algorithms spring to mind) is that their solutions are usually so bizarre that it's impossible to learn anything about the problem domain from them. You've gone the other way, which is pretty cool.

    --
    Yours in pedantry,
    F o x t r o t U n i f o r m

    "Anything you put in comments is not tested and easily goes out of date." -- tye

Re: Why machine-generated solutions will never cease to amaze me
by tilly (Archbishop) on Dec 31, 2008 at 03:05 UTC
    Ir is never too late to point out a mistake. :-)

    /foo|(bar)?|rat/ is not logically equivalent to /(foo|bar|rat)?/. They both match every possible string, but they will put different things in $1 and $&.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://406643]
Approved by Corion
Front-paged by demerphq
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (6)
As of 2024-03-19 11:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found