PerlMonks  

Regular Expression Builder

by Rich36 (Chaplain)
on Aug 30, 2002 at 15:23 UTC (#194140, perlquestion)
Rich36 has asked for the wisdom of the Perl Monks concerning the following question:

Does anyone know of any kind of regular expression builder that generates regexes from a given string? What I'm looking for is something like this...

Given a string like "Rich36!", it would produce \w{4}\d{2}\! or \w+\d+\!.

The driving force behind this is that I'm working on a tool that uses regular expressions to grab data out of some files. Part of the interface sometimes requires the users to enter regular expressions to capture the necessary text. Since most of them are not familiar with regular expressions, I'm looking for a way to let them enter a sample string and get back a regular expression, built with metacharacters, that matches text shaped like that sample.

I did come across the excellent Regex::PreSuf, which does that, but it doesn't use metacharacters to any great degree.
For instance,

    use Regex::PreSuf;
    my $re = presuf({anychar => 1}, qw(@foo @bar @baz));
    print qq($re\n);
    __RESULT__
    \@(?:ba[rz]|foo)

Which is great, but I'm looking for a mechanism that would produce a regex that would capture something like @oof as well (a regex like \@\w{3}).
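To make the difference concrete, here is a minimal sketch. It assumes the presuf output shown above is pasted in by hand as a literal pattern (so Regex::PreSuf itself is not loaded), and the anchors and the list of candidates are added purely for illustration.

```perl
use strict;
use warnings;

# Pattern printed by presuf() in the example above, copied in by hand.
my $presuf_re = qr/^\@(?:ba[rz]|foo)$/;
# The looser, metacharacter-based pattern described in the question.
my $wanted_re = qr/^\@\w{3}$/;

for my $candidate (qw(@foo @bar @baz @oof)) {
    printf "%-5s presuf: %-3s wanted: %s\n",
        $candidate,
        ($candidate =~ $presuf_re ? 'yes' : 'no'),
        ($candidate =~ $wanted_re ? 'yes' : 'no');
}
```

Only the last candidate, @oof, is rejected by the first pattern and accepted by the second.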

Any suggestions or information would be greatly appreciated.


«Rich36»

Re: Regular Expression Builder
by tommyw (Hermit) on Aug 30, 2002 at 15:36 UTC

    Programming Perl includes:

        #!/usr/bin/perl
        $vowels = 'aeiouy';
        $cons   = 'bcdfghjklmnpqrstvwxzy';
        %map    = (C => $cons, V => $vowels);   # template letters C and V

        for $class ($vowels, $cons) {           # map every real letter to
            for (split //, $class) {            # the class(es) it belongs to
                $map{$_} .= $class;
            }
        }

        for $char (split //, shift) {           # one character class per
            $pat .= "[$map{$char}]";            # letter of the template word
        }

        $re = qr/^${pat}$/i;
        print "REGEX is $re\n";
        @ARGV = '/usr/dict/words' if -t && !@ARGV;   # default word list

        while (<>) {
            print if /$re/;
        }

    This takes a word and builds a pattern from it that matches words with the same sequence of vowels and consonants (the original in the book is commented). Extending this to handle digits should be easy. The cunning part will be collapsing runs of identical character classes into a single class followed by a repeat count.

    This is, of course, left as an exercise for the reader ;-)
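One possible take on that exercise, as a rough sketch and not anything from the book: classify each character, then merge adjacent characters that fall into the same class. The helper names class_for and template_regex are made up here, "y" is treated as a vowel only (the original puts it in both lists), and anything that is neither a letter nor a digit is kept as an escaped literal.

```perl
use strict;
use warnings;

# Hypothetical helper: map one character to a character class.
sub class_for {
    my ($char) = @_;
    return '[aeiouy]'               if $char =~ /[aeiouy]/i;
    return '[bcdfghjklmnpqrstvwxz]' if $char =~ /[a-z]/i;
    return '\d'                     if $char =~ /[0-9]/;
    return quotemeta($char);        # everything else stays literal
}

# Build a pattern, collapsing runs of the same class into class{n}.
sub template_regex {
    my ($word) = @_;
    my @runs;                       # each entry: [class, count]
    for my $char (split //, $word) {
        my $class = class_for($char);
        if (@runs && $runs[-1][0] eq $class) {
            $runs[-1][1]++;
        }
        else {
            push @runs, [ $class, 1 ];
        }
    }
    return join '', map {
        my ($class, $count) = @$_;
        $count > 1 ? $class . '{' . $count . '}' : $class
    } @runs;
}

my $pattern = template_regex('Rich36');
print "$pattern\n";
print "matches Mash42\n" if 'Mash42' =~ /^$pattern$/i;
```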

    --
    Tommy
    Too stupid to live.
    Too stubborn to die.

Re: Regular Expression Builder
by Boots111 (Hermit) on Aug 30, 2002 at 15:51 UTC
    All~

    Komodo from ActiveState includes a regular expression toolkit that lets you see what a regex does, as you write it, against sample text.

    I know this is not exactly what you are looking for but it might be helpful...

    Boots
    ---
    Computer science is merely the post-Turing decline of formal systems theory.
    --???
Re: Regular Expression Builder
by zentara (Archbishop) on Aug 30, 2002 at 16:17 UTC
    There is a bash script, txt2regex, that lets you build regexes through a simple question-and-answer menu. It might give you an idea.
Re: Regular Expression Builder
by Anonymous Monk on Aug 30, 2002 at 16:19 UTC
    Given a string like "Rich36!", it would produce \w{4}\d{2}\! or \w+\d+\!.

    And why wouldn't it produce one of these

        /\w{6}!/
        /\w+!/
        /[A-Z][a-z]{3}\d\d!/
        /Rich36!/
        /......./
        /\S+/
        /.*/

    I mean, the tightest or least general thing it could produce when given a $string is just /\Q$string\E/ and the most general thing would be /.*/s, and between those is a rather large space of candidates.
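A small sketch of how wide that space is. The probe strings are invented for illustration, and the anchors are added so that each candidate has to match the whole string.

```perl
use strict;
use warnings;

my $string = 'Rich36!';

# Candidates from tightest to loosest; the first and last are the two extremes.
my @candidates = (
    qr/^\Q$string\E$/,
    qr/^[A-Z][a-z]{3}\d\d!$/,
    qr/^\w{4}\d{2}!$/,
    qr/^\w+!$/,
    qr/^\S+$/,
    qr/^.*$/s,
);

# Hypothetical other inputs, to see how far each candidate generalizes.
my @probes = ('Rich36!', 'Bart99!', 'abcd12!', 'x!', 'no bang');

for my $re (@candidates) {
    my @hits = grep { $_ =~ $re } @probes;
    printf "%-28s matches %d of %d probes\n", $re, scalar @hits, scalar @probes;
}
```

Every candidate accepts the original string; they differ only in how many of the other probes they let through.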

Re: Regular Expression Builder
by demerphq (Chancellor) on Aug 30, 2002 at 16:19 UTC
    I doubt that there is a robust way to do this, but here's a really simple way:

        my $string = "123 abcdef";
        $string =~ s{(\d+)|(\w+)|(\s+)}
                    {
                        defined($1) ? '\\d{' . length($1) . '}'
                      : defined($2) ? '\\w{' . length($2) . '}'
                      :               '\\s{' . length($3) . '}'
                    }ge;
        print $string;
        __END__
        \d{3}\s{1}\w{6}

    But I don't think this will scale very well... (and it probably has subtle problems anyway)

    Yves / DeMerphq
    ---
    Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)

      But I don't think this will scale very well... (and it probably has subtle problems anyway)

      One quibble is that because \d is a subset of \w then a string such as "abc123def" will get \w{9} in your version. Here's a slightly improved version (for some definition of improved)

          my $string = " \aabc123def!*#\n";
          $string =~ s{
                ([[:digit:]]+)
              | ([[:alpha:]]+)
              | ([[:punct:]]+)
              | ([[:space:]]+)
              | ([[:cntrl:]]+)
              | (.)
          }
          {
                defined($1) ? '[[:digit:]]{' . length($1) . '}'
              : defined($2) ? '[[:alpha:]]{' . length($2) . '}'
              : defined($3) ? '[[:punct:]]{' . length($3) . '}'
              : defined($4) ? '[[:space:]]{' . length($4) . '}'
              : defined($5) ? '[[:cntrl:]]{' . length($5) . '}'
              :               "\Q$+\E"    # anything else?
          }gex;
          print $string;

      But it still has problems (for example, \n is in both :space: and :cntrl: so "\n\a" produces [[:space:]]{1}[[:cntrl:]]{1}, but "\a\n" produces [[:cntrl:]]{2}).
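One way around that particular asymmetry, sketched under the assumption that each character should belong to exactly one class: classify characters one at a time using a fixed priority order, and only then merge neighbours with the same class. The names classify and string_to_regex are hypothetical, and this is a different technique from the single s/// above.

```perl
use strict;
use warnings;

# Ordered list of [name, single-character test]; the first match wins,
# so "\n" is always :space: and never :cntrl:.
my @classes = (
    [ digit => qr/[[:digit:]]/ ],
    [ alpha => qr/[[:alpha:]]/ ],
    [ punct => qr/[[:punct:]]/ ],
    [ space => qr/[[:space:]]/ ],
    [ cntrl => qr/[[:cntrl:]]/ ],
);

sub classify {
    my ($char) = @_;
    for my $class (@classes) {
        return '[[:' . $class->[0] . ':]]' if $char =~ $class->[1];
    }
    return quotemeta($char);    # no class applies: keep the character literally
}

sub string_to_regex {
    my ($string) = @_;
    my @runs;                   # each entry: [class, count]
    for my $char (split //, $string) {
        my $class = classify($char);
        if (@runs && $runs[-1][0] eq $class) { $runs[-1][1]++ }
        else                                  { push @runs, [ $class, 1 ] }
    }
    return join '', map { $_->[0] . '{' . $_->[1] . '}' } @runs;
}

print string_to_regex("\n\a"), "\n";    # [[:space:]]{1}[[:cntrl:]]{1}
print string_to_regex("\a\n"), "\n";    # [[:cntrl:]]{1}[[:space:]]{1}
```

The two orderings now produce mirror-image results, at the cost of walking the string character by character.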

        One quibble is that because \d is a subset of \w then a string such as "abc123def" will get \w{9} in your version.

        Yup. But personally I consider that a feature, not a bug. :-) After all, ldkjdlkjf2098kklls probably isn't [[:alpha:]]+\d+[[:alpha:]]+

        But we are both in agreement that there isn't a good way to do this, although as we both have shown there are a variety of bad ways to do it... BTW, is the . really necessary? I don't think it is, as the s/// will just skip the char if it doesn't match.

        Oh, and I considered using something like what you posted here, but I figured that since I tend not to use the POSIX char classes that much, others probably wouldn't either.

        :-)

        Yves / DeMerphq
        ---
        Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)

Re: Regular Expression Builder
by erikharrison (Deacon) on Aug 30, 2002 at 16:34 UTC

    The challenge here is asking yourself, "What kind of regexes do I want my tool to generate?" This makes things a little harder and is one of the reasons that this kind of tool isn't on the market.

    A computer program cannot read your mind, obviously. So, the regexes generated from a single simple string will be rather simple - there isn't enough data to work with to create a complex expression there. For example, should the regex retain length? When should a regex generalize a character into a character class or match exactly? If we generalize out to a character class, what about when a character could be placed in several different character classes?

    While the tool could produce more useful regexes from additional data (such as multiple strings), the question remains: by what rules do we generate a regex from the given data? The rules will vary from project to project, so a tool that has rules built in will not be very useful to others, and as such won't be out there in the market. If you want a tool you can program regex-generating rules into, you get into a layer of abstraction which makes things harder, not easier, on the programmer. You'd be better off generating the regexes yourself.

    Some tools that might help you out: Parse::RecDescent, Parse::Yapp and perhaps Regexp::English.

    Cheers,
    Erik

    Light a man a fire, he's warm for a day. Catch a man on fire, and he's warm for the rest of his life. - Terry Pratchett

      A computer program cannot read your mind, obviously.

      Darn. It sure would be helpful to be able to write:

          #!/usr/bin/perl
          use Read::Mind qw(disambiguate implement);
          $script = new Read::Mind;
          $script->do_what_I_mean();
          exit;

      BCE
      --Dude, you're getting assimilated!

      What kind of regexes do I want my tool to generate

      E.g. give that generator a handful of strings, have it recognize a possible pattern behind them, and generate the regular expression that recognizes these strings. That would cut the number of possible solutions down to a reasonable amount. The problem is the pattern recognition. Or is there a module?
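A very rough sketch of that idea, not an existing module: split every sample into runs of letters, digits and literals, and if all samples share the same sequence of classes, emit each class with the smallest and largest run length seen. The names runs_for and regex_from_samples are hypothetical, at least one sample is assumed, and samples with different shapes simply fall back to a plain alternation.

```perl
use strict;
use warnings;

# Split a sample into runs of [class, length]; anything that is not an
# ASCII letter or digit is kept as an escaped literal.
sub runs_for {
    my ($sample) = @_;
    my @runs;
    for my $char (split //, $sample) {
        my $class = $char =~ /[0-9]/    ? '\d'
                  : $char =~ /[A-Za-z]/ ? '[A-Za-z]'
                  :                       quotemeta($char);
        if (@runs && $runs[-1][0] eq $class) { $runs[-1][1]++ }
        else                                  { push @runs, [ $class, 1 ] }
    }
    return \@runs;
}

sub regex_from_samples {
    my @samples = @_;
    my @all = map { runs_for($_) } @samples;

    # Signature = sequence of classes, ignoring run lengths.
    my %signatures;
    for my $runs (@all) {
        $signatures{ join ' ', map { $_->[0] } @$runs } = 1;
    }

    # Different shapes: fall back to a plain alternation of the samples.
    if (keys(%signatures) > 1) {
        return '(?:' . join('|', map { quotemeta($_) } @samples) . ')';
    }

    my $pattern = '';
    for my $i (0 .. $#{ $all[0] }) {
        my @lengths = sort { $a <=> $b } map { $_->[$i][1] } @all;
        my ($min, $max) = @lengths[0, -1];
        $pattern .= $all[0][$i][0] . ($min == $max ? "{$min}" : "{$min,$max}");
    }
    return $pattern;
}

print regex_from_samples('Rich36!', 'Bob7!', 'Yves2002!'), "\n";
# prints [A-Za-z]{3,4}\d{1,4}\!{1}
```

More samples narrow or widen the ranges, but the choice of classes is still a rule someone has to pick up front.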

      And it came to pass that in time the Great God Om spake unto Brutha, the Chosen One: "Psst!"
      -- (Terry Pratchett, Small Gods)

Re: Regular Expression Builder
by fruiture (Curate) on Aug 30, 2002 at 16:50 UTC

    Well, 'rich36' could be translated to '\w{4}\d{2}' or to '\w{6}' or '.{6}' ... You need to specify that [a-zA-Z] must become \w and [0-9] must become \d ...

    A try:

        #!/usr/bin/perl
        use strict;
        use warnings;

        {
            my @classes = (
                ['[a-zA-Z]' => '\w'],
                ['[0-9]'    => '\d'],
                ['\w'       => '_'],    # that's why order matters
                ['.'        => '.'],
            );

            sub make_regex {
                local $_ = @_ ? shift : $_;
                my $result = '';
                my $i = -1;
                while ( ++$i < @classes ) {
                    my $p = pos($_) || 0;
                    my ($re, $su) = @{ $classes[$i] };
                    if ( /\G($re+)/g ) {
                        $result .= $su . '{' . length($1) . '}';
                        $i = -1;
                    }
                    else {
                        pos($_) = $p;
                    }
                }
                $result
            }
        }

        printf "%s => %s\n", $_, make_regex for (
            'abc12',
            '123',
            '#+#+#',
        );

    update: corrected [ and ] again (twice)...

    --
    http://fruiture.de
Re: Regular Expression Builder
by bart (Canon) on Aug 30, 2002 at 17:41 UTC
    Just a thought: replace all letters by "A" and all digits by "9". Then apply the Regex::PreSuf thing — or just quotemeta(). And in that result, replace "A" with '\w' and "9" with '\d'.

    Intermediate steps, as an example: @foo23  ->  @AAA99  ->  \@AAA99  ->  \@\w\w\w\d\d
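Those steps can be sketched as follows, using plain quotemeta() rather than Regex::PreSuf. The function name shape_regex is made up for this example, and only ASCII letters and digits are treated as placeholders.

```perl
use strict;
use warnings;

sub shape_regex {
    my ($sample) = @_;
    (my $shape = $sample) =~ s/[A-Za-z]/A/g;    # step 1: every letter becomes "A"
    $shape =~ s/[0-9]/9/g;                      #         every digit becomes "9"
    my $re = quotemeta($shape);                 # step 2: escape the remaining literals
    $re =~ s/A/\\w/g;                           # step 3: placeholders become metachars
    $re =~ s/9/\\d/g;
    return $re;
}

print shape_regex('@foo23'), "\n";    # \@\w\w\w\d\d
```

Because step 1 leaves no letters or digits other than the placeholders, step 3 cannot accidentally rewrite a literal "A" or "9".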

      Clever, but add the step (actually, merge it with the A9 -> metachar translation):
      s%((?:\\w)+)%'\w{'. length($1)/2 .'}'%eg; ...

      --
      perl -pew "s/\b;([mnst])/'$1/g"

      I was going to suggest the same thing. This is similar to how the old dBase used to work with its 'patterns' to validate data. I can't remember it completely, but I'd suggest adding extra wildcards to the above:
      • A or a: Any alpha character
      • Z: Uppercase character
      • z: Lowercase character
      • 9: Numeral
      • *: Any string of characters
      • ?: Any single character
      Anything apart from the above would be a literal .. as would escaping the above with a backslash.
      ALSO: Note that /\w/ ne /[a-z]/i, since \w also matches digits and the underscore.

      These combined would result in:

          USER: @foo29  RE: /\@foo2\d/
          USER: @zzz99  RE: /\@[a-z]{3}\d{2}/
          USER: @AAA99  RE: /\@[a-zA-Z]{3}\d{2}/
          # Note that 'A' becomes [a-zA-Z] rather than [a-z] with /i,
          # because there may later be a 'z' in your user's pattern :)

      The code for parsing this shouldn't be too hard to create, but I'd suggest wrapping the following comment in at an earlier stage and parsing the user's pattern looking for repeats as you go.
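A sketch of such a parser, under the assumption that the wildcard table is exactly the list above (it is the list as remembered in this post, not dBase's actual rules). The name template_to_regex is hypothetical. Only runs of the same wildcard are collapsed into a repeat count, literals are left as typed, and a repeated * is never given a count.

```perl
use strict;
use warnings;

# Wildcard table taken from the list above.
my %wild = (
    'A' => '[a-zA-Z]',
    'a' => '[a-zA-Z]',
    'Z' => '[A-Z]',
    'z' => '[a-z]',
    '9' => '\d',
    '*' => '.*',
    '?' => '.',
);

sub template_to_regex {
    my ($template) = @_;
    my @runs;    # each entry: [regex atom, repeat count, is_wildcard]

    # Read either a backslash-escaped character or a single plain character.
    while ($template =~ /\G(?:\\(.)|(.))/gs) {
        my ($atom, $is_wild);
        if    (defined $1)        { ($atom, $is_wild) = (quotemeta($1), 0) }
        elsif (exists $wild{$2})  { ($atom, $is_wild) = ($wild{$2}, 1) }
        else                      { ($atom, $is_wild) = (quotemeta($2), 0) }

        if ($is_wild && @runs && $runs[-1][2] && $runs[-1][0] eq $atom) {
            $runs[-1][1]++;                      # same wildcard again: count it
        }
        else {
            push @runs, [ $atom, 1, $is_wild ];
        }
    }

    return join '', map {
        my ($atom, $count) = @$_;
        ($atom eq '.*' || $count == 1) ? $atom : $atom . '{' . $count . '}'
    } @runs;
}

print template_to_regex($_), "\n" for '@foo29', '@zzz99', '@AAA99';
```

For the three user patterns in the example this prints \@foo2\d, \@[a-z]{3}\d{2} and \@[a-zA-Z]{3}\d{2}.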
Re: Regular Expression Builder
by hiseldl (Priest) on Aug 30, 2002 at 18:46 UTC
    There's also regexEvaluater written using Perl/Tk. This may not be exactly what you are looking for, but it will help you develop and capture regexes. Here's an excerpt from the web page:

      regexEvaluater.pl helps users to write (perl) regular expressions for filtering text data. Especially the interactive testing (Tk-GUI) of regular expressions including the immediate visualization of the resulting output makes regexEvaluater.pl a helpful tool for daily use.

      The program can be used in 5 different ways:

      1. Script generator: allows to write the current expression to a executable perl script
      2. Developing tool: pasting of data (from clipboard or selection (by middle mouse click)) into input area. Copying of the regular expression to the clipboard
      3. Filter program: modifies input by applying (stepwise) regular expressions
      4. Extracting tool: extracting useful information by writing the return values of regular expressions to a file
      5. Browser: browsing structured data by changing input separator

    --
    hiseldl

Re: Regular Expression Builder
by mojotoad (Monsignor) on Aug 31, 2002 at 20:58 UTC

Node Type: perlquestion [id://194140]
Approved by shadox
Front-paged by shadox