A Regexp Assembler/Compiler

PetaMem has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: A Regexp Assembler/Compiler
by Abigail-II (Bishop) on Jun 19, 2002 at 15:18 UTC

Given a set of regexps and given a wanted logical concatenation (lets start with AND,OR) of these regexps, is there any mechanism to create a smaller, more eficient set of regexps that will have the same effect?

Very unlikely given the power of Perl regular expressions. One of the questions that you would need to answer is whether two grammars (I use grammar here instead of regular expression to avoid confusion later on) are equivalent; that is, whether the will match the same language (where a language is the set of all strings matched by a grammar). For regular expressions (not Perl regular expressions, but for traditional ones) this question is decidable. But for more powerful grammars on the Chomsky hierarchy, this question is undecidable.

However, Perl regular expressions are hard to place on the Chomsky hierarchy. Without (?{ }) or (??{ }), Perl regular expressions cannot be used to create all context free grammars (take the language of strings with balanced parenthesis for instance). On the other hand, it can be used to create grammars that are context sensitive on the Chomsky hierarchy. (With (?{ }) it could very well be that Perl regular expressions are at least as powerful as context free grammars, but I don't see an obvious proof.)

My guts say that such a mechanism does not exist, that the problem is unsolvable. If the problem is solvable, it's going to be incredibly hard.

Abigail

[reply]
[d/l]
[select]

Re(2): A Regexp Assembler/Compiler

by gumby (Scribe) on Jun 19, 2002 at 15:29 UTC

I don't think that there is anything wrong in posing this question in the language of boolean logic and set theory. Any solution set of the above subroutine will be the union of the complement of the solution set of the first regexp and the solution sets of the other regexps. Although finding a regexp whose solution set this is would be non-trivial, as Abigail-II points out.

[reply]

Re(3): A Regexp Assembler/Compiler

by gumby (Scribe) on Jun 19, 2002 at 15:56 UTC

On further reflection, it's actually quite likely that algorithms have been developed for similar problems (ie. Boolean algebra's, elimination theory etc.).

[reply]

Re: Re: A Regexp Assembler/Compiler

by PetaMem (Priest) on Jun 19, 2002 at 20:53 UTC

My guts say that such a mechanism does not exist, that the problem is unsolvable. If the problem is solvable, it's going to be incredibly hard.

I agree with Abigails intuition here, but only for the language space perl regexps span in general. Considering the examples I've given above, one sees, that there are regexps (hard ones) and regexps (next to trivial).

So my guts tell me, that for a certain subset of regexps this problem must have a more or less achievable solution, the more expression you'd like to handle the more "incredibly hard" it will get. But hey - then again it'll be just incredibly possible if you allow me the s///ing of a well known citation. :-)

Bye
PetaMem

[reply]
[d/l]

Re: A Regexp Assembler/Compiler
by RMGir (Prior) on Jun 19, 2002 at 15:27 UTC

Regex::PreSuf

You'd have to be able to enumerate the roman numerals you want to match to be able to use it directly, though...

$ perl -MRegex::PreSuf -e'my $re = presuf(qw(I II III IV V VI VII VIII
+ IX X));
print "$re\n";'
(?:I(?:II|[IVX])|VI(?:II|I)?|[IVX])
[download]

[reply]
[d/l]

Re: A Regexp Assembler/Compiler
by demerphq (Chancellor) on Jun 19, 2002 at 15:16 UTC

Yves / DeMerphq
---
Writing a good benchmark isnt as easy as it might look.

[reply]

Re:x2 A Regexp Assembler/Compiler (Whats a 'trie'?)

by demerphq (Chancellor) on Jun 20, 2002 at 10:40 UTC

crazyinsomniac

trie

RMGir

Regexp::PreSuf

So what is a trie? Well the first thing is that 'trie' is short for a 'patricia trie'. And yes crazyinsomniac its a form of tree. First off lets just consider a classic tree, such as a binary tree. (What all this has to do with regexes will become clear a bit later).

In common tree implementations such as a binary tree, each node holds a key. Searching the tree involves comparing the key we are looking for against the key inside of the current node. Based on the relationship (as in greater than less than etc) between the keys we follow the appropriate branch. If the keys are equal we have found our node. If we get to a leaf node without finding the key we are looking for we know the key is not present in the tree. This type of tree has a number of fairly efficient properties, such as inserting and deleting. The efficiency (look ups especially) is bound to N. (Actually its O(log₂N) ). This means that the number of comparisons involved is related to the number of elements in the over all data structure. Also, it may be useful to think of a binary tree as being a tree representation of a binary search (divide and conquor).

A patricia trie (or trie or digit tree or digit trie) is also a tree, however it differs quite radically (no pun intended, (you'll see :-)) from a binary tree in that internal nodes usually do not carry a payload (as in they do not hold a 'key'), only external (leaf nodes) contain data. Instead the structure of the tree (er trie :-) is determined by the digit of the key (at the position correlating to the depth in the trie). In a sense it may be helpful to think of a digit trie as being a tree version of a radix sort (although it doesnt sort explicitly, reading off subkeys in a sorted order would (note the pun)) This may seem a bit opaque so lets look at an example. Lets say we have the following strings 'FOOL','FOOD','FOAM','FEAR','FOLLY' we would turn them into the following structure:

F--+-O--+-O--+-L        (FOOL)
   |    |    |
   |    |    +-D        (FOOD)
   |    |
   |    +-A----M        (FOAM)
   |    |
   |    +-L----L----Y   (FOLLY)
   |
   +-E----A----R        (FEAR)

not

O(1)

¹²

/^F(O(O[LD]|AM|LLY)|EAR)/
[download]

Tie::Hash::Trie

RE (tilly) 4: SAS log scanner

So tries are trees with some funky twists, and with an application toward prefix and suffix matching. (although only one or the other... :-)

Hope that clears things up,

Updated: uc()'d the example. Minor typographic corrections, and left anchored the regex.

Yves / DeMerphq
---
Writing a good benchmark isnt as easy as it might look.

[reply]
[d/l]

(trie exploration) Re: Re:x2 A Regexp Assembler/Compiler (Whats a 'trie'?)

by crazyinsomniac (Prior) on Jun 20, 2002 at 12:11 UTC

#!/usr/bin/perl -w
#file://letrie.pl
use strict;
$\="\n";
my %T=();

for my $word( qw/ aba abba abbda abbaracadabra bam bamara barbara / ) 
+{
    print $word;

    my $R = \%T;
    my $C = -1;
    my( @word ) = $word =~ m{.}g;
    my $p = "";
    while(++$C < @word) {
        $p  = $word[$C];
        if(exists $R->{$p}) {
            if(ref $R->{$p}) {
                $R = $R->{$p};
            } else {
                $R = $R->{$p} = {};
            }
        }else{
            $R = $R->{$p} = {};
        }
    }

    $R->{"\\NULL"}=1; # for exact match testing

}#endof for
use Data::Dumper;
$Data::Dumper::Indent=1;
$Data::Dumper::Purity=1;
$Data::Dumper::Quotekeys=1; 
print Dumper \%T;
print "Is 'abba' in a trie?  ",in_a_trie(\%T,'abba');
print "Is 'abbda' in a trie? ",in_a_trie(\%T,'abbda');
print "Is 'abbdr' in a trie? ",in_a_trie(\%T,'abbdr');
print "Is 'a' in a trie?     ",in_a_trie(\%T,'a');
print "Is 'b' in a trie?     ",in_a_trie(\%T,'b');
print "Is 'ba' in a trie?    ",in_a_trie(\%T,'ba');
print "Is 'bam' in a trie?   ",in_a_trie(\%T,'bam');
print "Is 'fsck' in a trie?  ",in_a_trie(\%T,'fsck');
exit;

sub in_a_trie {
    my( $T, $W ) = @_;
    my $R = 0;
    my $SH = "";

    for $SH ($W =~ m{.}g) {
        if(exists $T->{$SH} ) {
            $T = $T->{$SH};
            $R++;
        } else {
            return 'no';
        }
    }

    return ($R ? 'yes' : 'no').' '
          .(ref $T and $T->{"\\NULL"} ? 'exact' : 'partial' );
}

__DATA__

aba
abba
abbda
abbaracadabra
bam
bamara
barbara
$VAR1 = {
  'a' => {
    'b' => {
      'a' => {
        '\\NULL' => 1
      },
      'b' => {
        'a' => {
          '\\NULL' => 1,
          'r' => {
            'a' => {
              'c' => {
                'a' => {
                  'd' => {
                    'a' => {
                      'b' => {
                        'r' => {
                          'a' => {
                            '\\NULL' => 1
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        },
        'd' => {
          'a' => {
            '\\NULL' => 1
          }
        }
      }
    }
  },
  'b' => {
    'a' => {
      'm' => {
        '\\NULL' => 1,
        'a' => {
          'r' => {
            'a' => {
              '\\NULL' => 1
            }
          }
        }
      },
      'r' => {
        'b' => {
          'a' => {
            'r' => {
              'a' => {
                '\\NULL' => 1
              }
            }
          }
        }
      }
    }
  }
};

Is 'abba' in a trie?  yes exact
Is 'abbda' in a trie? yes exact
Is 'abbdr' in a trie? no
Is 'a' in a trie?     yes partial
Is 'b' in a trie?     yes partial
Is 'ba' in a trie?    yes partial
Is 'bam' in a trie?   yes exact
Is 'fsck' in a trie?  no
[download]

______crazyinsomniac_____________________________
Of all the things I've lost, I miss my mind the most.
perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

[reply]
[d/l]

Re^3: A Regexp Assembler/Compiler (Whats a 'trie'?)

by particle (Vicar) on Jun 20, 2002 at 11:57 UTC

Tie::Hash::Trie

~Particle *accelerates*

[reply]
[d/l]

Re: Re:x2 A Regexp Assembler/Compiler (Whats a 'trie'?)

by RMGir (Prior) on Jun 20, 2002 at 12:12 UTC

You should copy & paste this as a new post in Tutorials...
--
Mike

[reply]

Re: A Regexp Assembler/Compiler
by boo_radley (Parson) on Jun 19, 2002 at 16:19 UTC

Deriving Regular Expressions

[reply]


Problems? Is your data what you think it is?
	PerlMonks