Beefy Boxes and Bandwidth Generously Provided by pair Networks DiBona
"be consistent"
 
PerlMonks  

Re: Why machine-generated solutions will never cease to amaze me

by demerphq (Chancellor)
on Nov 10, 2004 at 14:24 UTC ( #406675=note: print w/ replies, xml ) Need Help??


in reply to Why machine-generated solutions will never cease to amaze me

Ive written a number of these regex optimizers and I never saw a satisfactory way of handling the tail. Im really interested in how you are doing your thing, and also how it handles the following:

foobar foabar foal foaler fool fooler foil foilar foiliar foo-bar foiler

Also, in experiments ive done, building a Trie and then using it for these tasks outperforms such precompiled regexes as soon as the number of words involved becomes more than a small number. Iirc i saw pureperl Trie solutions outperform regex solutions when the dictionary size was more than a few hundred words. Backtracking over all of those alternations is expensive, whereas a trie solution is entirely backtracking free.

---
demerphq


Comment on Re: Why machine-generated solutions will never cease to amaze me
Download Code
Re^2: Why machine-generated solutions will never cease to amaze me
by grinder (Bishop) on Nov 10, 2004 at 15:33 UTC
    how it handles the following

    Well, plugging in that list, I get:

    fo(?:o(?:l(?:er)?|-?bar)|a(?:l(?:er)?|bar)|il(?:(?:i?a|e)r)?)

    Regexp::Optimizer arrives at the same result, more or less:

    fo(?:o(?:l(?:er)?|\-?bar)|a(?:l(?:er)?|bar)|il(?:(?:i?a|e)r)?)

    In other words, there were no common tails in any of the alternations. Remember, it's the forward-looking tree that drives things. Tail folding is a secondary consideration. Pretty freaky that the alternation orders came out the same, though.

    The main problem for me is that a pure Perl solution is useless, since I need to write the patterns for use in another program that uses the PCRE library (otherwise there is the excellent Text::Trie module).

    If the first character of an alternation was an index into a jump table, the cost becomes O(1), at the expense of a slight increase in memory, but as if perl ever cared about that :)

    Im really interested in how you are doing your thing

    My approach is to take the words, insert them into a tree, and then in a second pass do a depth-first search, reversing the leaves and inserting the reversed leaves into another tree. If I get a tree back (as opposed to a forest), I percolate the common trunk upwards towards the root. When there's a subtree has nothing in common I just unreverse the subtree and unwind the results.

    - another intruder with the mooring of the heart of the Perl

      Ah ok, your implementation appears to do pretty much the same thing as mine, disabling common suffix handling when there are multiple suffixes. (I strived _hard_ with little luck to overcome that problem). Just in case you havent thought of it already do you collapse things like /fo(?:o|a)l/ to /fo[oa]l/?

      I have to admit ive never used a module for this stuff. Ive always handrolled. :-) I have a specialized Inline::C Trie matcher for digits that is thousands of times faster than the regex engine (when match times alone are relevent).

      ---
      demerphq

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://406675]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (7)
As of 2014-04-21 08:48 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (492 votes), past polls