This is a copy of an email that I sent to phl.pm - I was prompted to
additionaly post this here by Blake Mills, because he felt that perlmonks would
be interested in it.
------
I've been doing a bit of reading on machine learning. One of the things I've been
toying with is the ability to generate a regex to match a given example
set of data. My particualr examples would be for things like phone numbers,
or zip codes, or information that consists of single data elements.
I've looked on CPAN for any possible existing work, but haven't been able
to find anything. Does anyone know of anything along the lines of what
I'm describing? The Regexp package provides some common examples, but what
I really want is a tool I can use to generate regexes for data in a generic,
automated fashion.
I've tried writing some simplistic code, and it has some success with
data that has a consistient format - though it creates some horrible looking
regexes for less consistient data, and fails completely for inconsistient
data. I'm almost embarassed to offer this up, but if you're interested the
code I wrote to try this out is available here.
Any advice or pointers would be great.
Thanks,
Kyle R. Burton
Edited by footpad, ~Tue Nov 20 15:25:42 2001
Re: generating regexes?
by japhy (Canon) on Nov 19, 2001 at 22:54 UTC
|
You guys missed the point, I think. First, Regexp::Common is just a module that provides common regexes instantly. Second, MRE does not contain a regex for matching email addresses from examples -- it builds the regex in pieces, so that it's easier to follow.
What Kyle wants is something I've wanted and realized is nigh impossible to create without HEAVY input from the user. Finding patterns is easy for humans, when we know what pattern we want. Telling computers to figure out a pattern is not so easy. The goal here is, given a set of input, determine the best pattern for matching the content. Phone numbers of the form 555-5555 are easy to generate a pattern for, if you know ahead of time that you're dealing with phone numbers!. And as soon as you get a 1-800 number, you're screwed.
The concept is good, and Kyle makes a good attempt, but the problem is very hard. I've not attempted a solution yet, and you can be sure that I would have had I had a good idea of how to approach this problem.
_____________________________________________________
Jeff[japhy]Pinyan:
Perl,
regex,
and perl
hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??; | [reply] |
Re: generating regexes?
by FoxtrotUniform (Prior) on Nov 19, 2001 at 22:57 UTC
|
This sounds like a good problem domain for genetic
algorithms, which I unfortunately don't know much about.
(The machine learning course I took at uni was supposed
to get into GAs, but of course we ran out of time....)
Here's the basic theory: (for more info, look at
geneticprogramming.com)
- Generate a population of possible solutions pretty
much at random.
- Run some sort of fitness test on the solutions (in
this case, try matching them against your data, and see
how many matches you get, and how close those matches
are).
- Generate a new population: copy the best solutions
over verbatim, mutate some of the solutions, and "breed"
(cross-over) some of the solutions.
- Repeat until you get a "close enough" solution.
gumpu has done some genetic programming
in Perl before.
It strikes me that, since Perl's regexes are built
around a backtracking finite automaton, it might be
possible to analytically compute a regex from
"representative" data, using either some sort of search
or constraint-satisfaction techniques. It also seems
plausible that you'd be able to use Markov chains to
describe the data: I've seen this technique work fairly
well at finding potential coding regions in DNA, which
looks like a similar problem (looking for patterns in
connected data), and since Markov processes are pretty
close to state machines, they might mesh well with the
regex engine....
Great problem! Thanks for bringing this up. If I
have time today, I'll hunt down some useful-looking
papers and update this node.
--
:wq
| [reply] |
|
The only problem with the Genetic Algorithm approach is the scoring function. There are two possible types of bad answers.
1. RE does not match at all
2. RE matches too much (ie /^.*$/)
You need a function to tell you if answer a was closer than b even though they were type 1 answers or a matches a small number of sets than b even though they were type 2 answers.
It seems that you would have to have access to the actual RE code in perl to write such a function.
----
I always wanted to be somebody... I guess I should have been more specific.
| [reply] |
|
This was actually the direction the mailing-list discussion took. The final suggestion was that you'd need two sets of data - one set that should match, and one set that shouldn't match. The scoring function would be a combination betwen correctly matching those that should match, and correctly *not* matching those that shouldn't.
-Blake
| [reply] |
|
|
|
| [reply] [d/l] [select] |
(jeffa) Re: generating regexes?
by jeffa (Bishop) on Nov 19, 2001 at 20:30 UTC
|
"I've looked on CPAN for ... but haven't been able
to find anything."
Point your browser on over to Regexp::Common!
And mortis - please surround URL's in square brackets
to make them links, much more convenient:
http://www.bgw.org/projects/perl/machine_learning/
jeffa
L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
F--F--F--F--F--F--F--F--
(the triplet paradiddle)
| [reply] |
(dkubb) Re: (1) generating regexes?
by dkubb (Deacon) on Nov 19, 2001 at 21:56 UTC
|
Regex::Presuf is the closest match to automatic
regex generation on CPAN, that I know of. It generates a regex based on a word
list, where the resulting regex is usually more efficient than just concatenating all the words together with a "|".
If you haven't already, you should check out Mastering Regular Expressions. It has an
excellent chapter on machine generated regexes for matching email
addresses. The resulting regex can be seen in the CPAN module
Email::Valid.
| [reply] [d/l] |
Re: generating regexes?
by mortis (Pilgrim) on Nov 20, 2001 at 00:24 UTC
|
Based on the responses so far, I thought it would be useful to
give a brief overview of the approach the example code uses.
It starts out by checking the training set to see if all
of the exmples share a common word (determined by \b). If they
do, it divides the training set into 2 sub-sets by splitting
each of the examples based on the common word. It then attacks
each sub-set independently.
The code then takes the first character from the first example, and
produces a set of candidate regexes that match this character.
The candidates are purposefuly orderd from most specific to
least specific.
From the candidate set, it picks the most suitable candidate. Most
suitable in this case means the first of the candidates that
matches each of the examples from the training set. The
selected regex is then appended to the current regex.
The current regex is then tried against each of the examples.
If it matches each of the examples exactly (/^pat$/), the
algorithm terminates.
If the current regex doesn't match, the remainder ($1 from: /^pat(.+)$/)
is extracted, and the process continues refining the regex.
There is an artifical limit of 25 iterations in the code to stop
it in case it's not reaching a viable regex.
There are several issues with the code. One is that the next
candidate might match more of the current example, but match
less of other examples (this usualy causes an infinite loop).
Another issue is that the candidate pattern might match the current
example, but no others, thus producing an empty candidate set
for matching the other examples. This could be solved by backtracking (removing
the last regex atom off of the current candidate), but would
require a major reworking of the example code (and still may never terminate).
That's the basic high-level overview of how the sample code works.
As suggested, a genetic programming approach would probably
produce better results. For the problem domain I've been attacking,
the example code has actualy worked rather well.
Kyle | [reply] |
Re: generating regexes?
by atlantageek (Monk) on Nov 19, 2001 at 23:55 UTC
|
Great problem, it could also be used to figure out what portions of a web page change so an bot could rip out stories from news sites. I suggest thinking of this along the lines of a diff. First determine you record delimiter. In a diff the delimiter is a new line. However with regular expressions you might go with white space (or this could be a command line option). Look up diff and use a similar algorithm. Once you find the components that are different look at the differences. Would they both fit in the same character class. Maybe just go down the following list to see which describes both first.
/^[0-9]$/
/^[0-9]+$/
/^[0-9]*$/
/^[0-9A-Za-z]$/
/^[0-9A-Za-z]+$/
/^[0-9A-Za-z]*$/
/^[0-9A-Za-z.,]$/
/^[0-9A-Za-z.,]+$/
/^[0-9A-Za-z.,]*$/
/^.$/
/^.+$/
#Giving up
/^.*$/
This might be good for a first pass.
----
I always wanted to be somebody... I guess I should have been more specific. | [reply] [d/l] |
|
Actualy, that is what I want it for. I've got code
that's parsing apart web pages to extract data, and I
want it to know when it's not extracting the data correctly.
The prototype regex code is used so the
parser can generate a 'signature' (regex) that describes
the data to be extracted (based on
an example set) which it
can use to validate that further information matches the
same 'signature'.
As far as the parsing logic, we're using landmark based
location identification. Move forward past 'New Questions',
move forward past 'lastnode_id',
move forward past '>', extract to '<'. And so on...
Kyle
| [reply] |
Re: generating regexes?
by MadraghRua (Vicar) on Nov 20, 2001 at 06:04 UTC
|
| [reply] |
Any other interest parties?
by GermanHerman (Sexton) on Jul 22, 2003 at 03:03 UTC
|
I decided to work something together in perl/tk that "rely's on user input", I have been handcoding some of these regexes myself for four days now and it's getting to me. Please email me at
$_ = "douglasmjames1@comcast.net";
s/ames//;
It seems like I am going to need all the help I can get on this, so mail me right nnnnnnnnow
Thank You,
Douglas Marshall Johnson
| [reply] |
|
|