Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Consider the set of strings construction, contorsion, destruction, transaction. Using the module Regexp::Optimizer, you can generate the pattern (?:con(?:struct|tors)ion|(?:destru|transa)ction) as a result. This is an excellent pattern, since no backtracking is required (notwithstanding the inspections required to examine different a|b|c alternations). As soon as this pattern fails to track the target string being matched it can fail quickly. This is a very desirable property.

A contrario, the simple-minded pattern construction|contorsion|destruction|transaction will backtrack on a target like conduct, because it will have to try both "con-" alternations. (And apologies to RE engine internals specialists, I know this explanation skips over a few details).

All this is well and good, but consider the set of four words again. They all end in "-ion", and it would be nice to be able to factor that out. Because when you have lots and lots of strings (read: thousands) with common tails, the savings are considerable. My algorithm produces (?:con(?:struct|tors)|(?:destru|transa)ct)ion. The tradeoff is that at the cost of making the engine jump through a few more hoops, the pattern is shorter.

In putting the finishing touches on the module, and making sure my test suite covers all the edge cases, I prepared a particular test to cover what I expected the module should produce. When I ran the test, the module produced something else.

Consider the set of strings sad, salad, spread. Quick, think of a pattern that will match these strings, factoring out the trailing "-ad". My test case solution was that it should generate s(?:al|pre)?ad. When I ran the test, it failed. Instead, it produced s(?:(?:al)?|pre)ad. That looks so bizarre I had to paste in manually into a one-liner to prove to myself that it worked.

And the crazy thing is that it does work. What is more, when running with use re 'debug', to my untutored eye, it appears to do so more efficiently.

The (al|pre)? pattern produces the following debug output:

atching REx `^s(?:al|pre)?ad$' against `sad' Setting an EVAL scope, savestack=6 0 <> <sad> | 1: BOL 0 <> <sad> | 2: EXACT <s> 1 <s> <ad> | 4: CURLYX[0] {0,1} 1 <s> <ad> | 13: WHILEM 0 out of 0..1 cc=bfbfe980 Setting an EVAL scope, savestack=12 1 <s> <ad> | 6: BRANCH Setting an EVAL scope, savestack=22 1 <s> <ad> | 7: EXACT <al> failed... 1 <s> <ad> | 10: EXACT <pre> failed... Clearing an EVAL scope, savestack=12..22 failed, try continuation... 1 <s> <ad> | 14: NOTHING 1 <s> <ad> | 15: EXACT <ad> 3 <sad> <> | 17: EOL 3 <sad> <> | 18: END Match successful! Guessing start of match, REx `^s(?:al|pre)?ad$' against `salad'... Found floating substr `ad'$ at offset 3... Found anchored substr `s' at offset 0... Guessed: match at offset 0 Matching REx `^s(?:al|pre)?ad$' against `salad' Setting an EVAL scope, savestack=6 0 <> <salad> | 1: BOL 0 <> <salad> | 2: EXACT <s> 1 <s> <alad> | 4: CURLYX[0] {0,1} 1 <s> <alad> | 13: WHILEM 0 out of 0..1 cc=bfbfe980 Setting an EVAL scope, savestack=12 1 <s> <alad> | 6: BRANCH Setting an EVAL scope, savestack=22 1 <s> <alad> | 7: EXACT <al> 3 <sal> <ad> | 13: WHILEM 1 out of 0..1 cc=bfbfe980 3 <sal> <ad> | 14: NOTHING 3 <sal> <ad> | 15: EXACT <ad> 5 <salad> <> | 17: EOL 5 <salad> <> | 18: END Match successful!

Whereas the ((al)?|pre) pattern produces:

Matching REx `^s(?:(?:al)?|pre)ad$' against `sad' Setting an EVAL scope, savestack=6 0 <> <sad> | 1: BOL 0 <> <sad> | 2: EXACT <s> 1 <s> <ad> | 4: BRANCH Setting an EVAL scope, savestack=16 1 <s> <ad> | 5: CURLYM[0] {0,1} 1 <s> <ad> | 7: EXACT <al> failed... matched 0 times, len=0... Setting an EVAL scope, savestack=16 trying tail with n=0... 1 <s> <ad> | 15: EXACT <ad> 3 <sad> <> | 17: EOL 3 <sad> <> | 18: END Match successful! Guessing start of match, REx `^s(?:(?:al)?|pre)ad$' against `salad'... Found floating substr `ad'$ at offset 3... Found anchored substr `s' at offset 0... Guessed: match at offset 0 Matching REx `^s(?:(?:al)?|pre)ad$' against `salad' Setting an EVAL scope, savestack=6 0 <> <salad> | 1: BOL 0 <> <salad> | 2: EXACT <s> 1 <s> <alad> | 4: BRANCH Setting an EVAL scope, savestack=16 1 <s> <alad> | 5: CURLYM[0] {0,1} 1 <s> <alad> | 7: EXACT <al> 3 <sal> <ad> | 9: SUCCEED could match... matched 1 times, len=2... Setting an EVAL scope, savestack=16 trying tail with n=1... 3 <sal> <ad> | 15: EXACT <ad> 5 <salad> <> | 17: EOL 5 <salad> <> | 18: END Match successful! Freeing REx: `"^s(?:(?:al)?|pre)ad$"'

And try as I might, I can't find a string that behaves differently between the two. I've tried all the possibilities I can think of, like sadad salalad saladad slad sal, but they behave identically. I still have a nagging doubt that there may be a difference; if anyone can point it out I be most appreciative.

Before I embarked on this journey into regular expressions, I would not have guessed that foo|(bar)?|rat is logically equivalent to (foo|bar|rat)?. It took a machine-generated solution to the problem for me to realise. You learn something every day.

- another intruder with the mooring of the heart of the Perl


In reply to Why machine-generated solutions will never cease to amaze me by grinder

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (5)
As of 2024-03-19 09:35 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found