http://www.perlmonks.org?node_id=471434

thor has asked for the wisdom of the Perl Monks concerning the following question:

Hello esteemed monks,

In reading a document internal to our company, it mentions something about a process that determines the best fit for something based on how it matches a pattern. This process basically has a config file that has a bunch of regexes, and for each regex, an action. Items are passed to the process, and it determines the "best fit" and executes that action associated with it. However, in reading the documentation, it seems that the metric that they use for "best fit" is the pattern with the fewest wildcard characters. It got me to thinking. The pattern "*" has only one wildcard character, but is the most general regex of all. However, by the aforementioned notion of "best fit", it could certainly qualify as the most specific pattern. My question to you is: can you think of a better metric? This is more of a thought experiment for me than anything.

thor

Feel the white light, the light within
Be your own disciple, fan the sparks of will
For all of us waiting, your kingdom will come

Update: Just to be clear...I had shell wildcards on the brain when I wrote that '*' was the most liberal regex. Of course I meant '.*'. I'm leaving it as is so as not to orphan those kind enough to respond so far.

Replies are listed 'Best First'.
Re: Most specific pattern (edit distance)
by Ovid (Cardinal) on Jun 30, 2005 at 21:14 UTC

    Now I'm just guessing, but it seems to me that what they want is the text that most closely corresponds to a given action. The concern is the text, not the regex. Since I haven't seen the data you work with, this might be way off base, but perhaps a better way would be to find all regexes that will match the text and then choose the text with the smallest "edit distance" against a target text?

    Edit distance is generally concerned with the number of additions, deletions and substitutions necessary to transform one string into another. There are a number of CPAN modules which can calculate this for you. See Text::LevenshteinXS for a representative example.
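
    For instance, a rough sketch of that idea (the rules, the "canonical" texts and the actions here are all invented; Text::LevenshteinXS exports distance()):

        use strict;
        use warnings;
        use Text::LevenshteinXS qw(distance);

        # Each rule pairs a regex with a canonical text and an action.
        my @rules = (
            { re => qr/disk\s+full/, canon => 'error: disk full on /var', action => 'clean_tmp'   },
            { re => qr/error/,       canon => 'error: generic failure',   action => 'page_oncall' },
            { re => qr/.*/,          canon => q{},                        action => 'log_only'    },
        );

        my $input = 'error: disk full on /var';

        # Keep only the rules whose regex matches, then pick the one whose
        # canonical text is closest (smallest edit distance) to the input.
        my ($best) =
            sort { distance($input, $a->{canon}) <=> distance($input, $b->{canon}) }
            grep { $input =~ $_->{re} } @rules;

        print "chosen action: $best->{action}\n";    # clean_tmp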

    Cheers,
    Ovid

    New address of my CGI Course.

Re: Most specific pattern
by jhourcle (Prior) on Jun 30, 2005 at 20:19 UTC

    I don't think that it's necessarily a function of the pattern. It's a function of the process.

    For instance, if there's a pattern that's more preferable to use, no matter if it's the shortest, or longest pattern, it should be tested first.

    In much of the work that I do, I test against the patterns that are most likely to match, and then test all of the outlying conditions that I still know how to handle.

    Typically, I find that longer regexes, although they might be more time consuming to match, are a more precise fit, and so I'd qualify them as 'better' in the types of processes that I deal with.

    A few metrics that I can think of that I might use would be:

    • Most desired result
    • Number of capturing parentheses
    • Number of fixed (non-varying) characters
    • Number of fixed characters, or zero-width items
    • Odds of matching
    • Number of characters, zero-width items, or character classes

    There are probably dozens of other metrics ... I'd have to judge based on the process. Typically, I manually order things -- if regexA will match everything that regexB will match, but not vice versa, I'd have to decide if there are different actions to be taken by the two, or if one outcome is preferable to the other.
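
    As a concrete (and entirely made-up) illustration of the manual-ordering approach, the config can just be an ordered list of pattern/action pairs, most specific first:

        use strict;
        use warnings;

        # Ordered dispatch table: most specific patterns first, catch-all last.
        # Patterns and actions are invented for the sake of the example.
        my @dispatch = (
            [ qr/^ERROR: disk full/ => sub { print "clean up /tmp\n" } ],
            [ qr/^ERROR:/           => sub { print "notify the operator\n" } ],
            [ qr/./                 => sub { print "just log it\n" } ],
        );

        sub handle {
            my ($item) = @_;
            for my $rule (@dispatch) {
                my ($re, $action) = @$rule;
                return $action->() if $item =~ $re;
            }
            return;
        }

        handle('ERROR: disk full on /var');    # clean up /tmp
        handle('ERROR: cpu on fire');          # notify the operator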

    Oh -- and a minor nitpick. /\*/ matches an asterisk (a bare * on its own isn't a valid pattern -- the quantifier has nothing to follow). /.*/ is zero or more characters (non-newlines, unless you add modifiers).

Re: Most specific pattern
by Roy Johnson (Monsignor) on Jun 30, 2005 at 19:49 UTC
    I would think that the most specific would be the one with the most non-wildcard chars.
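
    A very rough way to count those, treating the pattern as a plain string (it ignores escapes and character classes entirely, so it's only a starting point):

        use strict;
        use warnings;

        # Strip the common metacharacters and count what's left.
        sub non_wildcard_count {
            my ($pattern) = @_;
            (my $literals = $pattern) =~ s/[.*+?()\[\]{}|^\$\\]//g;
            return length $literals;
        }

        printf "%-15s %d\n", $_, non_wildcard_count($_)
            for ('.*', 'blah', 'blah.*blah');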

    Caution: Contents may have been coded under pressure.
Re: Most specific pattern
by ww (Archbishop) on Jun 30, 2005 at 20:07 UTC
    Aside: In a regex, * is not a wildcard in the sense you seem to be using it; it's a quantifier.

    . is a wildcard and a character class may well be regarded as a (restricted) wildcard.

    Re your underlying question, the dot_star is perhaps the most general regex, precisely because it is minimally specific...

    Suggest meditating on Roy Johnson's view

Re: Most specific pattern
by GrandFather (Saint) on Jun 30, 2005 at 20:46 UTC

    If you can calculate the fitness at run time, then try all of the regexes and choose the one that matches using the fewest wildcard-matched characters. Are matches using a character set a lesser evil than those using .? Probably it's a function of the number of different characters that could have matched at each character location.

    The fitness should also be modified by providing some weighting for alternate matches. Would you need to account for backtracking too?
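
    As a starting point, the "try them all and pick the best" part might look like this (the patterns are invented, and the literal-character count is only a crude stand-in for a real wildcard weighting):

        use strict;
        use warnings;

        # Try every regex, keep the ones that match, then pick a winner with a
        # scoring routine. Here the score is just "literal characters left in
        # the pattern after stripping metacharacters".
        my @patterns = ( qr/blah/, qr/blah.*blah/, qr/.*/ );
        my $input    = 'blah and blah again';

        sub score {
            my ($re) = @_;
            (my $literals = "$re") =~ s/^\(\?[-^\w]*:|\)$//g;    # drop the qr// (?^:...) wrapper
            $literals =~ s/[.*+?()\[\]{}|\\]//g;                 # crude: drop metacharacters
            return length $literals;
        }

        my @matched = grep { $input =~ $_ } @patterns;
        my ($best)  = sort { score($b) <=> score($a) } @matched;
        print "best: $best\n";    # the blah.*blah pattern wins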


    Perl is Huffman encoded by design.
Re: Most specific pattern
by trammell (Priest) on Jun 30, 2005 at 20:08 UTC
    However, in reading the documentation, it seems that the metric that they use for "best fit" is the pattern with the fewest wildcard characters.

    The pattern // has zero wildcard characters. A perfect fit!

Re: Most specific pattern
by Ultra (Hermit) on Jul 01, 2005 at 06:48 UTC

    Disclaimer: This may not fit your purposes ...

    A rough brute-force method would be to match all the REs against a given string and measure the time consumed. This way you'll have a best fit.
    read more before flaming...

    Actually this doesn't necessarily reveal the best fit, but rather benchmarks the RE engine.
    The point is that if you want to determine the best fit, you need to write some sort of RE engine. But that's why you use one: so you don't have to write it yourself :-)
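
    If you did want to try it anyway, the timing loop would look something like this (the string, patterns, and repeat count are invented; Time::HiRes has been core for a long time):

        use strict;
        use warnings;
        use Time::HiRes qw(gettimeofday tv_interval);

        # Time each regex against one string. Repeating the match makes the
        # numbers measurable, but as noted above this really benchmarks the
        # RE engine rather than measuring how well the pattern "fits".
        my $string   = 'blah blah blah';
        my @patterns = ( qr/blah/, qr/blah.*blah/, qr/^\w+(?: \w+)*$/ );

        for my $re (@patterns) {
            my $t0 = [gettimeofday];
            $string =~ $re for 1 .. 100_000;
            printf "%-25s %.4f s\n", "$re", tv_interval($t0);
        }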

    Dodge This!
Re: Most specific pattern
by tphyahoo (Vicar) on Jul 01, 2005 at 16:44 UTC
    I agree with a part of what jhourcle said above. The best fit would be the regex that matches but has the LOWEST odds of matching.

    I.e., let's say you have regex "blah.*blah" and regex "blah".

    And a string that matches both. I would say that the best match would be the first, because this has smaller odds of matching any given string of length n.

    This seems to cover all cases to me. Where would this not be true?

      I agree with this...however, the devil's in the details. Given an arbitrary regex, how does one create a metric around how many non-wildcarded characters are in it? Is there a module that takes care of this? Or at least one that one could bend to make it fit?

      thor

      Feel the white light, the light within
      Be your own disciple, fan the sparks of will
      For all of us waiting, your kingdom will come

        That was just one of the metrics that I could think of... unless the number of regexes were so large that you couldn't rank them yourself (going with the assumption that I know more about the process than a regex does).

        If I had to go completely on just odds of matching, I would think it'd be easiest to take a representative sample of inputs, and test them against each of the regexes, and build a table with the odds.
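
        Something like this, say (the sample file name and the patterns are invented):

            use strict;
            use warnings;

            # Build the "table of odds" from a sample of past inputs.
            my @patterns = ( qr/blah/, qr/blah.*blah/, qr/./ );

            open my $fh, '<', 'sample_inputs.txt' or die "can't open sample: $!";
            chomp(my @sample = <$fh>);
            close $fh;

            my %odds;
            for my $re (@patterns) {
                my $hits = grep { /$re/ } @sample;
                $odds{"$re"} = @sample ? $hits / @sample : 0;
            }

            # Lowest odds of matching first -- the "most specific" by this metric.
            printf "%-25s %.3f\n", $_, $odds{$_}
                for sort { $odds{$a} <=> $odds{$b} } keys %odds;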

        If you don't have a log of those inputs for testing, then we'd have to get more creative ... I might use something like the following --

        • Any character or zero-width assertion gets 1 point (unless the assertion is pointless, like '\W\b\w').
        • Any character class of n characters gets f(n) points, where f(n) yields a number less than one and decreases as n increases (maybe 1/n, or sqrt(1/n)).
        • Quantifiers reduce the value of the items they modify ... perhaps as multipliers ... (? = 0.5; + = 0.6; * = 0.25; +? = 0.7; *? = 0.35) (I'm just pulling numbers out of the air ... you'd want to tweak the numbers 'till you get good results for your situation).
        • Alternations provide something less than the point value of each of their possibilities. (I have no clue on a formula for this one...)

        I'm not aware of a module to do this sort of thing, but that doesn't mean that there isn't one out there.
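
        Roughly, though, a scorer built on those rules might look like this (it handles only a tiny subset of regex syntax -- plain characters, \-escapes, simple [...] classes, and the ?, +, * quantifiers -- and the weights are as arbitrary as the ones above):

            use strict;
            use warnings;

            my %quant_weight = ( '?' => 0.5, '+' => 0.6, '*' => 0.25, '+?' => 0.7, '*?' => 0.35 );

            sub specificity {
                my ($pat) = @_;
                my ($score, $pending) = (0, 0);    # $pending is the previous atom's value

                while (length $pat) {
                    if ($pat =~ s/^([*+?]\??)//) {           # quantifier scales the previous atom
                        $pending *= $quant_weight{$1} // 0.5;
                        next;
                    }
                    $score += $pending;                       # bank the previous atom
                    if    ($pat =~ s/^\[([^\]]*)\]//) { $pending = 1 / (length($1) || 1) }  # [...] class
                    elsif ($pat =~ s/^\\(.)//)        { $pending = 1 }                      # \d, \b, ...
                    elsif ($pat =~ s/^(.)//s)         { $pending = $1 eq '.' ? 0 : 1 }      # char or dot
                }
                return $score + $pending;
            }

            printf "%-15s %.2f\n", $_, specificity($_)
                for ('.*', 'blah', 'blah.*blah', '[abc]+x');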

Re: Most specific pattern
by graff (Chancellor) on Jul 03, 2005 at 07:04 UTC
    This process basically has a config file that has a bunch of regexes, and for each regex, an action. Items are passed to the process, and it determines the "best fit" and executes that action associated with it. However, in reading the documentation, it seems that the metric that they use for "best fit" is the pattern with the fewest wildcard characters.

    Are you sure the documentation is a correct description of what the code does? (Are you sure you are interpreting it correctly?)

    When a given input is able to match two or more distinct regexes in a set (and the chosen action differs for each regex), then there are two ways to choose among the possible paths:

    • If the regexes are applied in a specific order and the first one to match determines the action, then there's a simple and unavoidable rule, similar to the idea in tphyahoo's reply: the regexes need to be ordered from longest specific match to shortest. Otherwise, the longer, more specific regexes would never get a chance to apply.

    • If all regexes are tested before any action is taken, and their successes are scored in some way before deciding which action to take, this would provide some flexibility in defining how the scoring works (you could even introduce probabilities, if you wanted results that were non-deterministic but could still correlate in some way with the inputs).

      But for most systems (where people want the results to be deterministic), it's more likely that the best way to score would be based on the length of the match, or rather, the difference between the length of the input and the length of the match. If two regexes yield the same minimum length difference between input and match, then the next criterion might be some sort of ranking among the regexes in terms of their relative literalness or specificity -- e.g. the one using more literals, smaller character classes and/or tighter quantifiers would rank higher. But it's hard to conceive of a programmatic approach to ranking regexes this way.
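
      A minimal sketch of that length-difference scoring (the literalness tie-break is left out, and the patterns are invented):

          use strict;
          use warnings;

          # Score by how much of the input the match covers: the smaller the
          # difference between input length and match length, the better.
          my @patterns = ( qr/blah/, qr/blah.*blah/, qr/blah.*again/ );
          my $input    = 'blah and blah again';

          my @scored;
          for my $re (@patterns) {
              next unless $input =~ $re;
              push @scored, [ length($input) - length($&), $re ];   # $& is the matched text
          }

          my ($best) = sort { $a->[0] <=> $b->[0] } @scored;
          printf "best fit: %s (length difference %d)\n", $best->[1], $best->[0];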