Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

generating regexes?

by mortis (Pilgrim)
on Nov 19, 2001 at 20:13 UTC ( #126287=perlmeditation: print w/ replies, xml ) Need Help??

This is a copy of an email that I sent to phl.pm - I was prompted to additionaly post this here by Blake Mills, because he felt that perlmonks would be interested in it.

------

I've been doing a bit of reading on machine learning. One of the things I've been toying with is the ability to generate a regex to match a given example set of data. My particualr examples would be for things like phone numbers, or zip codes, or information that consists of single data elements.

I've looked on CPAN for any possible existing work, but haven't been able to find anything. Does anyone know of anything along the lines of what I'm describing? The Regexp package provides some common examples, but what I really want is a tool I can use to generate regexes for data in a generic, automated fashion.

I've tried writing some simplistic code, and it has some success with data that has a consistient format - though it creates some horrible looking regexes for less consistient data, and fails completely for inconsistient data. I'm almost embarassed to offer this up, but if you're interested the code I wrote to try this out is available here.

Any advice or pointers would be great.

Thanks,
Kyle R. Burton

Edited by footpad, ~Tue Nov 20 15:25:42 2001

Comment on generating regexes?
(jeffa) Re: generating regexes?
by jeffa (Chancellor) on Nov 19, 2001 at 20:30 UTC
    "I've looked on CPAN for ... but haven't been able to find anything."

    Point your browser on over to Regexp::Common!

    And mortis - please surround URL's in square brackets to make them links, much more convenient: http://www.bgw.org/projects/perl/machine_learning/

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    F--F--F--F--F--F--F--F--
    (the triplet paradiddle)
    
(dkubb) Re: (1) generating regexes?
by dkubb (Deacon) on Nov 19, 2001 at 21:56 UTC

    Regex::Presuf is the closest match to automatic regex generation on CPAN, that I know of. It generates a regex based on a word list, where the resulting regex is usually more efficient than just concatenating all the words together with a "|".

    If you haven't already, you should check out Mastering Regular Expressions. It has an excellent chapter on machine generated regexes for matching email addresses. The resulting regex can be seen in the CPAN module Email::Valid.

Re: generating regexes?
by japhy (Canon) on Nov 19, 2001 at 22:54 UTC
    You guys missed the point, I think. First, Regexp::Common is just a module that provides common regexes instantly. Second, MRE does not contain a regex for matching email addresses from examples -- it builds the regex in pieces, so that it's easier to follow.

    What Kyle wants is something I've wanted and realized is nigh impossible to create without HEAVY input from the user. Finding patterns is easy for humans, when we know what pattern we want. Telling computers to figure out a pattern is not so easy. The goal here is, given a set of input, determine the best pattern for matching the content. Phone numbers of the form 555-5555 are easy to generate a pattern for, if you know ahead of time that you're dealing with phone numbers!. And as soon as you get a 1-800 number, you're screwed.

    The concept is good, and Kyle makes a good attempt, but the problem is very hard. I've not attempted a solution yet, and you can be sure that I would have had I had a good idea of how to approach this problem.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: generating regexes?
by FoxtrotUniform (Prior) on Nov 19, 2001 at 22:57 UTC

    This sounds like a good problem domain for genetic algorithms, which I unfortunately don't know much about. (The machine learning course I took at uni was supposed to get into GAs, but of course we ran out of time....)

    Here's the basic theory: (for more info, look at geneticprogramming.com)

    1. Generate a population of possible solutions pretty much at random.
    2. Run some sort of fitness test on the solutions (in this case, try matching them against your data, and see how many matches you get, and how close those matches are).
    3. Generate a new population: copy the best solutions over verbatim, mutate some of the solutions, and "breed" (cross-over) some of the solutions.
    4. Repeat until you get a "close enough" solution.
    gumpu has done some genetic programming in Perl before.

    It strikes me that, since Perl's regexes are built around a backtracking finite automaton, it might be possible to analytically compute a regex from "representative" data, using either some sort of search or constraint-satisfaction techniques. It also seems plausible that you'd be able to use Markov chains to describe the data: I've seen this technique work fairly well at finding potential coding regions in DNA, which looks like a similar problem (looking for patterns in connected data), and since Markov processes are pretty close to state machines, they might mesh well with the regex engine....

    Great problem! Thanks for bringing this up. If I have time today, I'll hunt down some useful-looking papers and update this node.

    --
    :wq
      The only problem with the Genetic Algorithm approach is the scoring function. There are two possible types of bad answers.
      1. RE does not match at all
      2. RE matches too much (ie /^.*$/)

      You need a function to tell you if answer a was closer than b even though they were type 1 answers or a matches a small number of sets than b even though they were type 2 answers. It seems that you would have to have access to the actual RE code in perl to write such a function.
      ----
      I always wanted to be somebody... I guess I should have been more specific.
        This was actually the direction the mailing-list discussion took. The final suggestion was that you'd need two sets of data - one set that should match, and one set that shouldn't match. The scoring function would be a combination betwen correctly matching those that should match, and correctly *not* matching those that shouldn't.

        -Blake

        I think the regex engine provides enough hooks that you could write such a function w/o access to the underlying C code. For instance, some automatically placed (?{}) assertions might allow the scoring routine to offer "partial credit" for regexes that only match part of the string. Therefore, you could allow for a finer granularity than just 1=match 0=nomatch.

        (?{}) is a relativly new feature that allows arbritrary code to be executed inside your regex..... it is (ab)used in the rebug regex debugger that japhy mentioned a while back.

        -Blake

Re: generating regexes?
by atlantageek (Monk) on Nov 19, 2001 at 23:55 UTC
    Great problem, it could also be used to figure out what portions of a web page change so an bot could rip out stories from news sites. I suggest thinking of this along the lines of a diff. First determine you record delimiter. In a diff the delimiter is a new line. However with regular expressions you might go with white space (or this could be a command line option). Look up diff and use a similar algorithm. Once you find the components that are different look at the differences. Would they both fit in the same character class. Maybe just go down the following list to see which describes both first.
    /^[0-9]$/ /^[0-9]+$/ /^[0-9]*$/ /^[0-9A-Za-z]$/ /^[0-9A-Za-z]+$/ /^[0-9A-Za-z]*$/ /^[0-9A-Za-z.,]$/ /^[0-9A-Za-z.,]+$/ /^[0-9A-Za-z.,]*$/ /^.$/ /^.+$/ #Giving up /^.*$/
    This might be good for a first pass.
    ----
    I always wanted to be somebody... I guess I should have been more specific.
      Actualy, that is what I want it for. I've got code that's parsing apart web pages to extract data, and I want it to know when it's not extracting the data correctly. The prototype regex code is used so the parser can generate a 'signature' (regex) that describes the data to be extracted (based on an example set) which it can use to validate that further information matches the same 'signature'.

      As far as the parsing logic, we're using landmark based location identification. Move forward past 'New Questions', move forward past 'lastnode_id', move forward past '>', extract to '<'. And so on...

      Kyle

Re: generating regexes?
by mortis (Pilgrim) on Nov 20, 2001 at 00:24 UTC
    Based on the responses so far, I thought it would be useful to give a brief overview of the approach the example code uses.

    It starts out by checking the training set to see if all of the exmples share a common word (determined by \b). If they do, it divides the training set into 2 sub-sets by splitting each of the examples based on the common word. It then attacks each sub-set independently.

    The code then takes the first character from the first example, and produces a set of candidate regexes that match this character. The candidates are purposefuly orderd from most specific to least specific.

    From the candidate set, it picks the most suitable candidate. Most suitable in this case means the first of the candidates that matches each of the examples from the training set. The selected regex is then appended to the current regex.

    The current regex is then tried against each of the examples. If it matches each of the examples exactly (/^pat$/), the algorithm terminates.

    If the current regex doesn't match, the remainder ($1 from: /^pat(.+)$/) is extracted, and the process continues refining the regex.

    There is an artifical limit of 25 iterations in the code to stop it in case it's not reaching a viable regex.

    There are several issues with the code. One is that the next candidate might match more of the current example, but match less of other examples (this usualy causes an infinite loop). Another issue is that the candidate pattern might match the current example, but no others, thus producing an empty candidate set for matching the other examples. This could be solved by backtracking (removing the last regex atom off of the current candidate), but would require a major reworking of the example code (and still may never terminate).

    That's the basic high-level overview of how the sample code works. As suggested, a genetic programming approach would probably produce better results. For the problem domain I've been attacking, the example code has actualy worked rather well.

    Kyle

Re: generating regexes?
by MadraghRua (Vicar) on Nov 20, 2001 at 06:04 UTC
    Actually there's another approach to this, not using necessarily using Genetic algorithms or regex - weight matrices. So you would want to look through a set of data, basically scoring it for the occurance of pattern(s). As you find the common pattern, you create a weight matrix, based upon the number of occurances and the frequency of occurance of a particular letter/resiidue at a position in the pattern. Once you have some examples of the pattern you can rescan your new data for the occurance of the pattern, reinforce the matrix by rescoring it for new data, etc. Check out this link: TFBS. Boris lenhard of the Karolinska Institute presented this as a way of scanning for promoter sequence patterns in July. It should be available for download.

    MadraghRua
    yet another biologist hacking perl....

Any other interest parties?
by GermanHerman (Sexton) on Jul 22, 2003 at 03:03 UTC
    I decided to work something together in perl/tk that "rely's
    on user input", I have been handcoding some of
    these regexes myself for four days now and it's getting to me. Please email me at

    $_ = "douglasmjames1@comcast.net";
    s/ames//;

    It seems like I am going to need all the help I can get on
    this, so mail me right nnnnnnnnow
    Thank You,
    Douglas Marshall Johnson

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://126287]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (6)
As of 2014-08-21 11:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (134 votes), past polls