Regular Expression Builder

Rich36 has asked for the wisdom of the Perl Monks concerning the following question:

Does anyone know of any kind of regular expression builder that generates regexes from a given string? What I'm looking for is something like this...

Given a string like "Rich36!", it would produce \w{4}\d{2}\! or \w+\d+\!.

The driving force behind this is that I'm working on a tool that uses regular expressions to grab data out of some files. Part of the interface sometimes requires the users to input regular expressions to capture the necessary text. Since most of them are not familiar with regular expressions, I'm looking for a way to allow them to input a string of text and then output a regular expression with metacharacters that would grab text that was like the string they inputted.

I did come across the excellent Regex::PreSuf, which does that, but it doesn't use metacharacters to any great degree.
For instance,

use Regex::PreSuf;
my $re = presuf({anychar => 1}, qw(@foo @bar @baz));
print qq($re\n);
__RESULT__
\@(?:ba[rz]|foo)

Which is great, but I'm looking for a mechanism that would produce a regex that would capture something like @oof as well (a regex like \@\w{3}).

Any suggestions or information would be greatly appreciated.

«Rich36»


Re: Regular Expression Builder by tommyw (Hermit) on Aug 30, 2002 at 15:36 UTC
Programming Perl includes:

    #!/usr/bin/perl
    $vowels = 'aeiouy';
    $cons   = 'bcdfghjklmnpqrstvwxzy';
    %map    = (C => $cons, V => $vowels);   # C and V stand for a whole class

    for $class ($vowels, $cons) {           # for each type of letter
        for (split //, $class) {            # map every letter of that type
            $map{$_} .= $class;             # back to its class
        }
    }

    for $char (split //, shift) {           # for each letter in the template word
        $pat .= "[$map{$char}]";            # add the matching character class
    }

    $re = qr/^${pat}$/i;
    print "REGEX is $re\n";
    @ARGV = ('/usr/dict/words') if -t && !@ARGV;

    while (<>) {
        print if /$re/;                     # print every line that matches
    }

Which takes a word and builds a template from it with the same pattern of vowels and consonants (the original in the book is commented throughout).

Extending this to handle digits should be easy. The cunning part will be collapsing the repeated character classes down and using a quantifier instead. This is, of course, left as an exercise for the reader ;-)

-- Tommy
Too stupid to live. Too stubborn to die.
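One way to approach the exercise tommyw leaves open is to classify each character and then merge neighbouring characters of the same class into a single counted class. The following is only a sketch: the helper names `class_of` and `collapse_classes`, and the choice of `\d`, `\w` and `\s` as the classes, are assumptions made for illustration and do not come from the Programming Perl script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Classify one character. The classes and their priority order are
# assumptions for this sketch (digits are tested before \w on purpose).
sub class_of {
    my ($ch) = @_;
    return '\d' if $ch =~ /\d/;
    return '\w' if $ch =~ /\w/;
    return '\s' if $ch =~ /\s/;
    return quotemeta $ch;              # anything else stays literal
}

# Build a pattern, collapsing runs of the same class into class{n}.
sub collapse_classes {
    my ($string) = @_;
    my @runs;                          # list of [class, count]
    for my $ch (split //, $string) {
        my $class = class_of($ch);
        if (@runs && $runs[-1][0] eq $class) {
            $runs[-1][1]++;            # same class as before: extend the run
        } else {
            push @runs, [$class, 1];   # new class: start a new run
        }
    }
    return join '', map {
        $_->[1] > 1 ? $_->[0] . '{' . $_->[1] . '}' : $_->[0]
    } @runs;
}

print collapse_classes('Rich36!'), "\n";   # \w{4}\d{2}\!
```

Because digits are tested before `\w`, a string such as "abc123" yields two runs rather than one `\w{6}`; swapping the order of the tests in `class_of` changes that behaviour.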
Re: Regular Expression Builder by erikharrison (Deacon) on Aug 30, 2002 at 16:34 UTC
The challenge here is asking yourself "What kind of regexes do I want my tool to generate?" This makes things a little harder and is one of the reasons that this kind of tool isn't on the market.

A computer program cannot read your mind, obviously. So the regexes generated from a single simple string will be rather simple: there isn't enough data to work with to create a complex expression. For example, should the regex retain length? When should a regex generalize a character into a character class, and when should it match exactly? If we generalize to a character class, what about a character that could be placed in several different character classes?

While the tool could produce more useful regexes from additional data (such as multiple strings), the question remains: by what rules do we generate a regex from the given data? The rules will vary from project to project, so a tool that has rules built in will not be very useful to others, and as such won't be out there in the market. If you want a tool you can program regex-generating rules into, you get into a layer of abstraction which makes things harder, not easier, on the programmer. You'd be better off generating the regexes yourself.

Some tools that might help you out: Parse::RecDescent, Parse::Yapp and perhaps Regexp::English.

Cheers,
Erik

Light a man a fire, he's warm for a day. Catch a man on fire, and he's warm for the rest of his life. - Terry Pratchett
OT: DWIM by BorgCopyeditor (Friar) on Aug 30, 2002 at 19:11 UTC
"A computer program cannot read your mind, obviously."

Darn. It sure would be helpful to be able to write:

    #!/usr/bin/perl
    use Read::Mind qw(disambiguate implement);
    $script = new Read::Mind;
    $script->do_what_I_mean();
    exit;

BCE
--Dude, you're getting assimilated!
Re: Regular Expression Builder by Brutha (Friar) on Sep 02, 2002 at 12:04 UTC
"What kind of regexes do I want my tool to generate"

E.g. give that generator a handful of strings, have it recognize a possible pattern behind them, and generate the regular expression that recognizes these strings. That would cut the number of possible solutions down to a reasonable amount. The problem is the pattern recognition, or is there a module for that?

And it came to pass that in time the Great God Om spake unto Brutha, the Chosen One: "Psst!" -- (Terry Pratchett, Small Gods)
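A rough sketch of what Brutha describes might look like the following: each sample is split into runs of character classes, and if all samples share the same sequence of classes, the run lengths are merged into `{min,max}` quantifiers. The function names `runs_of` and `pattern_from_samples` and the three classes used (digits, ASCII letters, everything else as a literal) are assumptions for this illustration, not an existing module.

```perl
use strict;
use warnings;

# Split a string into runs of [class, length]. Digits and ASCII letters
# form runs; every other character is kept as a one-character literal.
sub runs_of {
    my ($string) = @_;
    my @runs;
    while ($string =~ /\G(?:(\d+)|([A-Za-z]+)|(.))/gs) {
        if    (defined $1) { push @runs, ['\d',         length $1] }
        elsif (defined $2) { push @runs, ['[A-Za-z]',   length $2] }
        else               { push @runs, [quotemeta $3, 1]         }
    }
    return @runs;
}

# Merge several samples into one pattern, or return undef when the
# samples do not share the same sequence of classes.
sub pattern_from_samples {
    my @samples = @_;
    my @merged;                        # [class, min length, max length]
    for my $sample (@samples) {
        my @runs = runs_of($sample);
        if (!@merged) {
            @merged = map { [$_->[0], $_->[1], $_->[1]] } @runs;
            next;
        }
        return undef if @runs != @merged;
        for my $i (0 .. $#runs) {
            return undef if $runs[$i][0] ne $merged[$i][0];
            my $len = $runs[$i][1];
            $merged[$i][1] = $len if $len < $merged[$i][1];
            $merged[$i][2] = $len if $len > $merged[$i][2];
        }
    }
    return join '', map {
        my ($class, $min, $max) = @$_;
        $min != $max ? sprintf('%s{%d,%d}', $class, $min, $max)
      : $min > 1     ? sprintf('%s{%d}', $class, $min)
      :                $class;
    } @merged;
}

my $pattern = pattern_from_samples('@foo', '@quux', '@ab');
print defined $pattern ? $pattern : 'no common shape', "\n";
```

Returning undef for samples with different shapes is a deliberate simplification; a real tool would have to decide how to generalize further in that case, which is exactly the open question raised in this thread.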
Re: Regular Expression Builder by demerphq (Chancellor) on Aug 30, 2002 at 16:19 UTC
I doubt that there is a robust way to do this, but here's a really simple way:

    my $string = "123 abcdef";
    $string =~ s{(\d+)|(\w+)|(\s+)}
                { defined($1) ? '\\d{' . length($1) . '}'
                : defined($2) ? '\\w{' . length($2) . '}'
                :               '\\s{' . length($3) . '}' }ge;
    print $string;
    __END__
    \d{3}\s{1}\w{6}

But I don't think this will scale very well... (and it probably has subtle problems anyway)

Yves / DeMerphq
---
Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)
Re: Re: Regular Expression Builder by Anonymous Monk on Aug 30, 2002 at 17:05 UTC
"But i dont think this will scale very well... (and probably has subtle problems anyway)"

One quibble is that because \d is a subset of \w, a string such as "abc123def" will get `\w{9}` in your version. Here's a slightly improved version (for some definition of improved):

    my $string = " \aabc123def!*#\n";
    $string =~ s{
          ([[:digit:]]+)
        | ([[:alpha:]]+)
        | ([[:punct:]]+)
        | ([[:space:]]+)
        | ([[:cntrl:]]+)
        | (.)
    }{
          defined($1) ? '[[:digit:]]{' . length($1) . '}'
        : defined($2) ? '[[:alpha:]]{' . length($2) . '}'
        : defined($3) ? '[[:punct:]]{' . length($3) . '}'
        : defined($4) ? '[[:space:]]{' . length($4) . '}'
        : defined($5) ? '[[:cntrl:]]{' . length($5) . '}'
        :               "\Q$+\E"    # anything else?
    }gex;
    print $string;

But it still has problems. For example, \n is in both :space: and :cntrl:, so "\n\a" produces `[[:space:]]{1}[[:cntrl:]]{1}`, but "\a\n" produces `[[:cntrl:]]{2}`.
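The order dependence described above comes from letting each alternative grab a greedy run. One possible way around it, sketched here as an assumption rather than as the poster's method, is to assign every character to the first class it matches in a fixed priority list and only then merge neighbours. The names `posix_class_of` and `posix_pattern` are invented for this example.

```perl
use strict;
use warnings;

# POSIX classes in priority order, each paired with a compiled test.
# A character belongs to the first class in this list that it matches.
my @CLASSES = map {
    my $bracket = '[[:' . $_ . ':]]';
    [ $bracket, qr/$bracket/ ]
} qw(digit alpha punct space cntrl);

sub posix_class_of {
    my ($ch) = @_;
    for my $class (@CLASSES) {
        return $class->[0] if $ch =~ $class->[1];
    }
    return quotemeta $ch;              # no class matched: keep it literal
}

sub posix_pattern {
    my ($string) = @_;
    my @runs;                          # list of [class, count]
    for my $ch (split //, $string) {
        my $class = posix_class_of($ch);
        if (@runs && $runs[-1][0] eq $class) { $runs[-1][1]++ }
        else                                 { push @runs, [$class, 1] }
    }
    return join '', map { $_->[0] . '{' . $_->[1] . '}' } @runs;
}

print posix_pattern("\n\a"), "\n";   # [[:space:]]{1}[[:cntrl:]]{1}
print posix_pattern("\a\n"), "\n";   # [[:cntrl:]]{1}[[:space:]]{1}
```

With this scheme "\n" is always reported as :space: regardless of what precedes it, at the cost of never producing a single `[[:cntrl:]]{2}` for "\a\n".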
Re: Re: Re: Regular Expression Builder by demerphq (Chancellor) on Aug 30, 2002 at 17:18 UTC
"One quibble is that because \d is a subset of \w then a string such as "abc123def" will get \w{9} in your version."

Yup. But personally I consider that a feature, not a bug. :-) After all, ldkjdlkjf2098kklls probably isn't `[[:alpha:]]+\d+[[:alpha:]]+`.

But we are both in agreement that there isn't a good way to do this, although as we both have shown there are a variety of bad ways to do it...

BTW, is the . really necessary? I don't think it is, as the s/// will just skip the char if it doesn't match.

Oh, and I considered using something like what you post here, but I figured that since I tend not to use the POSIX char classes that much, others probably wouldn't either. :-)

Yves / DeMerphq
---
Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)
Re: Re: Re: Re: Regular Expression Builder by Anonymous Monk on Aug 30, 2002 at 17:29 UTC
Re(4): Regular Expression Builder by Dog and Pony (Priest) on Sep 02, 2002 at 20:11 UTC
Re: Regular Expression Builder by Anonymous Monk on Aug 30, 2002 at 16:19 UTC
"Given a string like "Rich36!", it would produce `\w{4}\d{2}\!` or `\w+\d+\!`."

And why wouldn't it produce one of these:

    /\w{6}!/
    /\w+!/
    /[A-Z][a-z]{3}\d\d!/
    /Rich36!/
    /......./
    /\S+/
    /./

I mean, the tightest or least general thing it could produce when given a $string is just `/\Q$string\E/` and the most general thing would be `/./s`, and between those is a rather large space of candidates.
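To make the size of that candidate space a little more concrete, the sketch below checks that several of the listed patterns really do match "Rich36!" and counts how many unrelated strings each one also accepts. The helper name `generality` and the extra sample strings are made up for the illustration.

```perl
use strict;
use warnings;

# Count how many sample strings a candidate pattern matches (unanchored,
# as in the list above). A higher count means a more general pattern.
sub generality {
    my ($candidate, @samples) = @_;
    return scalar grep { /$candidate/ } @samples;
}

my $string  = 'Rich36!';
my @samples = ('Bob12!', 'Rich36?', '@oof', 'x');   # made-up test strings

for my $candidate (quotemeta($string), '[A-Z][a-z]{3}\d\d!', '\w{4}\d{2}!',
                   '\w+\d+!', '\w+!', '\S+', '.') {
    # Every candidate must at least match the string it was derived from.
    die "candidate $candidate does not match $string\n"
        unless $string =~ /$candidate/;
    printf "%-20s matches %d of %d other strings\n",
        $candidate, generality($candidate, @samples), scalar @samples;
}
```

All of the candidates are "correct" for the original string; they differ only in how much else they let through, which is the choice a generator cannot make without further rules or further examples.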
Re: Regular Expression Builder by zentara (Archbishop) on Aug 30, 2002 at 16:17 UTC
There is a bash script at txt2regex that lets you make regexes based on a simple question and answer menu. It might give you an idea.
Re: Regular Expression Builder by fruiture (Curate) on Aug 30, 2002 at 16:50 UTC
Well, 'rich36' could be translated to '\w{4}\d{2}' or to '\w{6}' or '.{6}' ... You need to specify that [a-zA-Z] must become \w and [0-9] must become \d ... A try:

    #!/usr/bin/perl
    use strict;
    use warnings;

    {
        # [ pattern to look for => what to emit ], tried in this order
        my @classes = (
            ['[a-zA-Z]' => '\w'],
            ['[0-9]'    => '\d'],
            ['\w'       => '_'],    # that's why order matters
            ['.'        => '.'],
        );

        sub make_regex {
            local $_ = @_ ? shift : $_;
            my $result = '';
            my $i = -1;
            while ( ++$i < @classes ) {
                my $p = pos($_) || 0;       # remember where we are
                my ($re, $su) = @{ $classes[$i] };
                if ( /\G($re+)/g ) {
                    $result .= $su . '{' . length($1) . '}';
                    $i = -1;                # start over with the first class
                } else {
                    pos($_) = $p;           # a failed match resets pos()
                }
            }
            $result
        }
    }

    printf "%s => %s\n", $_, make_regex for (
        'abc12', '123', '#+#+#',
    );

update: corrected [ and ] again (twice)...

-- http://fruiture.de
Re: Regular Expression Builder by bart (Canon) on Aug 30, 2002 at 17:41 UTC
Just a thought: replace all letters by "A" and all digits by "9". Then apply the Regex::PreSuf thing, or just quotemeta(). And in that result, replace "A" with '\w' and "9" with '\d'.

Intermediate steps, as an example:

`@foo23 -> @AAA99 -> \@AAA99 -> \@\w\w\w\d\d`
Re: Re: Regular Expression Builder by belg4mit (Prior) on Aug 31, 2002 at 01:39 UTC
Clever, but add the step (actually, merge it with the A9 -> metachar translation):

    s%((?:\\w)+)%'\w{'. length($1)/2 .'}'%eg;
    ...

`-- perl -pew "s/\b;([mnst])/'$1/g"`
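Putting bart's placeholder idea and belg4mit's counting step together might look like the sketch below. It uses plain quotemeta() rather than Regex::PreSuf, restricts itself to ASCII letters and digits, and counts the placeholders directly instead of counting `\w` pairs afterwards; the function name `placeholder_regex` is an assumption for this example.

```perl
use strict;
use warnings;

sub placeholder_regex {
    my ($string) = @_;
    (my $masked = $string) =~ s/[A-Za-z]/A/g;   # every letter becomes A
    $masked =~ s/[0-9]/9/g;                     # every digit becomes 9
    my $pattern = quotemeta $masked;            # escape everything else

    # Replace each run of placeholders with a counted metacharacter.
    # Doing both kinds in a single pass matters: the counts written by
    # one substitution must not be rescanned as "9" placeholders.
    $pattern =~ s{(A+)|(9+)}{
        defined $1 ? '\w{' . length($1) . '}'
                   : '\d{' . length($2) . '}'
    }ge;
    return $pattern;
}

print placeholder_regex('@foo23'), "\n";   # \@\w{3}\d{2}
```

Since quotemeta() leaves "A" and "9" untouched and escapes only non-word characters, the placeholders survive the escaping step intact.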
Re: Re: Regular Expression Builder by BigLug (Chaplain) on Sep 03, 2002 at 04:37 UTC
I was going to suggest the same thing. This is similar to how the old dBase used to work with its 'patterns' to authenticate data. I can't remember it completely, but I'd suggest adding extra wildcards to the above:

A or a: Any alpha character
Z: Uppercase character
z: Lowercase character
9: Numeral
*: Any string of characters
?: Any single character

Anything apart from the above would be a literal, as would escaping the above with a backslash. ALSO: note that `/\w/` is not the same as `/[a-z]/i`.

These combined would result in:

    USER: @foo29
    RE:   /\@foo2\d/

    USER: @zzz99
    RE:   /\@[a-z]{3}\d{2}/

    USER: @AAA99
    RE:   /\@[a-zA-Z]{3}\d{2}/
    # Note that 'A' becomes [a-zA-Z] rather than [a-z] with /i
    # because there may later be a 'z' in your user's pattern :)

The code for parsing this shouldn't be too hard to create, but I'd suggest wrapping the following comment in at an earlier stage and parsing the user's pattern looking for repeats as you go.
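A parser for the template language BigLug outlines could be sketched roughly as follows. The wildcard table follows the list in the post, on the assumption that the "any string of characters" wildcard is `*`; the function name `template_to_regex` and the decision to collapse repeats only for the character-class wildcards (not for literals or `*`) are choices made for this illustration.

```perl
use strict;
use warnings;

# Wildcards from the post; anything not listed here is a literal.
my %WILDCARD = (
    'A' => '[a-zA-Z]', 'a' => '[a-zA-Z]',
    'Z' => '[A-Z]',    'z' => '[a-z]',
    '9' => '\d',       '?' => '.',
    '*' => '.*',
);

sub template_to_regex {
    my ($template) = @_;
    my @runs;                          # [fragment, count, collapsible?]
    # Read either a backslash-escaped character or a single character.
    while ($template =~ /\G(?:\\(.)|(.))/gs) {
        my ($piece, $collapsible);
        if (defined $1) {                       # escaped: always literal
            ($piece, $collapsible) = (quotemeta($1), 0);
        } elsif (exists $WILDCARD{$2}) {        # wildcard; '*' never repeats
            ($piece, $collapsible) = ($WILDCARD{$2}, $2 ne '*');
        } else {                                # ordinary literal
            ($piece, $collapsible) = (quotemeta($2), 0);
        }
        if ($collapsible && @runs && $runs[-1][2] && $runs[-1][0] eq $piece) {
            $runs[-1][1]++;                     # repeat of the same wildcard
        } else {
            push @runs, [$piece, 1, $collapsible];
        }
    }
    return join '', map {
        $_->[1] > 1 ? $_->[0] . '{' . $_->[1] . '}' : $_->[0]
    } @runs;
}

printf "USER: %-8s RE: /%s/\n", $_, template_to_regex($_)
    for '@foo29', '@zzz99', '@AAA99';
```

Under this scheme a user who wants a literal "a", "z" or "9" has to write it with a backslash, which is the escaping rule described in the post.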
Re: Regular Expression Builder by hiseldl (Priest) on Aug 30, 2002 at 18:46 UTC
There's also regexEvaluater, written using Perl/Tk. This may not be exactly what you are looking for, but it will help you develop and capture regexes. Here's an excerpt from the web page:

regexEvaluater.pl helps users to write (perl) regular expressions for filtering text data. Especially the interactive testing (Tk-GUI) of regular expressions including the immediate visualization of the resulting output makes regexEvaluater.pl a helpful tool for daily use. The program can be used in 5 different ways:

Script generator: allows to write the current expression to a executable perl script
Developing tool: pasting of data (from clipboard or selection (by middle mouse click)) into input area. Copying of the regular expression to the clipboard
Filter program: modifies input by applying (stepwise) regular expressions
Extracting tool: extracting useful information by writing the return values of regular expressions to a file
Browser: browsing structured data by changing input separator

-- hiseldl
Re: Regular Expression Builder by Boots111 (Hermit) on Aug 30, 2002 at 15:51 UTC
All~

Komodo from ActiveState includes a regular expression toolkit that allows you to see what a regex does as you have it, on a sample output. I know this is not exactly what you are looking for but it might be helpful...

Boots
---
Computer science is merely the post-Turing decline of formal systems theory. --???
Re: Regular Expression Builder by mojotoad (Monsignor) on Aug 31, 2002 at 20:58 UTC
Thought I'd plug one of my nodes, Deriving Regular Expressions, from a while back. It might be of interest. I wasn't aware of Regex::PreSuf at the time; that seems worth using as a springboard, as bart suggested.

Matt
