Re: generating regexes?

Based on the responses so far, I thought it would be useful to give a brief overview of the approach the example code uses.

It starts out by checking the training set to see if all of the exmples share a common word (determined by \b). If they do, it divides the training set into 2 sub-sets by splitting each of the examples based on the common word. It then attacks each sub-set independently.

The code then takes the first character from the first example, and produces a set of candidate regexes that match this character. The candidates are purposefuly orderd from most specific to least specific.

From the candidate set, it picks the most suitable candidate. Most suitable in this case means the first of the candidates that matches each of the examples from the training set. The selected regex is then appended to the current regex.

The current regex is then tried against each of the examples. If it matches each of the examples exactly (/^pat$/), the algorithm terminates.

If the current regex doesn't match, the remainder ($1 from: /^pat(.+)$/) is extracted, and the process continues refining the regex.

There is an artifical limit of 25 iterations in the code to stop it in case it's not reaching a viable regex.

There are several issues with the code. One is that the next candidate might match more of the current example, but match less of other examples (this usualy causes an infinite loop). Another issue is that the candidate pattern might match the current example, but no others, thus producing an empty candidate set for matching the other examples. This could be solved by backtracking (removing the last regex atom off of the current candidate), but would require a major reworking of the example code (and still may never terminate).

That's the basic high-level overview of how the sample code works. As suggested, a genetic programming approach would probably produce better results. For the problem domain I've been attacking, the example code has actualy worked rather well.

Kyle

Comment on Re: generating regexes?


Don't ask to ask, just ask
	PerlMonks