Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot

Re: A Regexp Assembler/Compiler

by demerphq (Chancellor)
on Jun 19, 2002 at 15:16 UTC ( #175718=note: print w/replies, xml ) Need Help??

in reply to A Regexp Assembler/Compiler

Well, you could construct a trie of the patterns, this would reduce the duplication and marginally improve the efficiency. There is at least one module on CPAN that will do this. I believe it is under the String:: heirarchy, but I dont remember right now.

Yves / DeMerphq
Writing a good benchmark isnt as easy as it might look.

Replies are listed 'Best First'.
Re:x2 A Regexp Assembler/Compiler (Whats a 'trie'?)
by demerphq (Chancellor) on Jun 20, 2002 at 10:40 UTC
    crazyinsomniac asked me to clarify what a trie is, and also the module that I meant was the one that RMGir pointed out Regexp::PreSuf, which almost for sure uses a trie implementation to do its business (I havent checked, but...)

    So what is a trie? Well the first thing is that 'trie' is short for a 'patricia trie'. And yes crazyinsomniac its a form of tree. First off lets just consider a classic tree, such as a binary tree. (What all this has to do with regexes will become clear a bit later).

    In common tree implementations such as a binary tree, each node holds a key. Searching the tree involves comparing the key we are looking for against the key inside of the current node. Based on the relationship (as in greater than less than etc) between the keys we follow the appropriate branch. If the keys are equal we have found our node. If we get to a leaf node without finding the key we are looking for we know the key is not present in the tree. This type of tree has a number of fairly efficient properties, such as inserting and deleting. The efficiency (look ups especially) is bound to N. (Actually its O(log2N) ). This means that the number of comparisons involved is related to the number of elements in the over all data structure. Also, it may be useful to think of a binary tree as being a tree representation of a binary search (divide and conquor).

    A patricia trie (or trie or digit tree or digit trie) is also a tree, however it differs quite radically (no pun intended, (you'll see :-)) from a binary tree in that internal nodes usually do not carry a payload (as in they do not hold a 'key'), only external (leaf nodes) contain data. Instead the structure of the tree (er trie :-) is determined by the digit of the key (at the position correlating to the depth in the trie). In a sense it may be helpful to think of a digit trie as being a tree version of a radix sort (although it doesnt sort explicitly, reading off subkeys in a sorted order would (note the pun)) This may seem a bit opaque so lets look at an example. Lets say we have the following strings 'FOOL','FOOD','FOAM','FEAR','FOLLY' we would turn them into the following structure:

    F--+-O--+-O--+-L        (FOOL)
       |    |    |
       |    |    +-D        (FOOD)
       |    |
       |    +-A----M        (FOAM)
       |    |
       |    +-L----L----Y   (FOLLY)
       +-E----A----R        (FEAR)
    Now this structure has a number of interesting properties. First off, search efficiency is not a function of the number of items the structure holds, but rather the length of the key we are looking for, which means that it is O(1) for look ups. Unfortunately this wonderful search time (think of it, you could have 1012 records and still find any element in the same number of steps as there are digits in the longest key) is paid for by a very inefficient use of memory (unless the data is extraordinarily dense and the alphabet for the keys is small,) although a hybrid approach can greatly improve the memory efficiency without impacting the speed significantly. Beside the above interesting properties tries are perfect for creating an efficient regex:
    Actually in experiments ive done with a module I should have put on CPAN ages ago, (I'll do it today) Tie::Hash::Trie, when there are many keys it actually turns out that using a normal trie lookup outperforms the equivelent regex (only when anchored at the beginning though). A simple implementation of a trie for this purpose (generating regexes) can be found at RE (tilly) 4: SAS log scanner.

    So tries are trees with some funky twists, and with an application toward prefix and suffix matching. (although only one or the other... :-)

    Hope that clears things up,

    Updated: uc()'d the example. Minor typographic corrections, and left anchored the regex.

    Yves / DeMerphq
    Writing a good benchmark isnt as easy as it might look.

      Interesting. I explored the idea a little. The browser didn't like \0 too much, so i typed \\NULL. Thanks.
      #!/usr/bin/perl -w #file:// use strict; $\="\n"; my %T=(); for my $word( qw/ aba abba abbda abbaracadabra bam bamara barbara / ) +{ print $word; my $R = \%T; my $C = -1; my( @word ) = $word =~ m{.}g; my $p = ""; while(++$C < @word) { $p = $word[$C]; if(exists $R->{$p}) { if(ref $R->{$p}) { $R = $R->{$p}; } else { $R = $R->{$p} = {}; } }else{ $R = $R->{$p} = {}; } } $R->{"\\NULL"}=1; # for exact match testing }#endof for use Data::Dumper; $Data::Dumper::Indent=1; $Data::Dumper::Purity=1; $Data::Dumper::Quotekeys=1; print Dumper \%T; print "Is 'abba' in a trie? ",in_a_trie(\%T,'abba'); print "Is 'abbda' in a trie? ",in_a_trie(\%T,'abbda'); print "Is 'abbdr' in a trie? ",in_a_trie(\%T,'abbdr'); print "Is 'a' in a trie? ",in_a_trie(\%T,'a'); print "Is 'b' in a trie? ",in_a_trie(\%T,'b'); print "Is 'ba' in a trie? ",in_a_trie(\%T,'ba'); print "Is 'bam' in a trie? ",in_a_trie(\%T,'bam'); print "Is 'fsck' in a trie? ",in_a_trie(\%T,'fsck'); exit; sub in_a_trie { my( $T, $W ) = @_; my $R = 0; my $SH = ""; for $SH ($W =~ m{.}g) { if(exists $T->{$SH} ) { $T = $T->{$SH}; $R++; } else { return 'no'; } } return ($R ? 'yes' : 'no').' ' .(ref $T and $T->{"\\NULL"} ? 'exact' : 'partial' ); } __DATA__ aba abba abbda abbaracadabra bam bamara barbara $VAR1 = { 'a' => { 'b' => { 'a' => { '\\NULL' => 1 }, 'b' => { 'a' => { '\\NULL' => 1, 'r' => { 'a' => { 'c' => { 'a' => { 'd' => { 'a' => { 'b' => { 'r' => { 'a' => { '\\NULL' => 1 } } } } } } } } } }, 'd' => { 'a' => { '\\NULL' => 1 } } } } }, 'b' => { 'a' => { 'm' => { '\\NULL' => 1, 'a' => { 'r' => { 'a' => { '\\NULL' => 1 } } } }, 'r' => { 'b' => { 'a' => { 'r' => { 'a' => { '\\NULL' => 1 } } } } } } } }; Is 'abba' in a trie? yes exact Is 'abbda' in a trie? yes exact Is 'abbdr' in a trie? no Is 'a' in a trie? yes partial Is 'b' in a trie? yes partial Is 'ba' in a trie? yes partial Is 'bam' in a trie? yes exact Is 'fsck' in a trie? no

      Of all the things I've lost, I miss my mind the most.
      perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

      this is a fantastic intro to Trie. with a little beefing up and some code samples from Tie::Hash::Trie, this would make a fine tutorial.

      ~Particle *accelerates*

      Nice writeup!

      You should copy & paste this as a new post in Tutorials...

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://175718]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (8)
As of 2017-02-22 15:13 GMT
Find Nodes?
    Voting Booth?
    Before electricity was invented, what was the Electric Eel called?

    Results (330 votes). Check out past polls.