Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re: Patter Finding

by lemming (Priest)
on Sep 11, 2001 at 04:17 UTC ( #111636=note: print w/ replies, xml ) Need Help??


in reply to Pattern Finding

update: Look at my second offering, it's better

Here's an inefficient subroutine that I've used in the past. Probably time for an overhaul
Output when called with $string, 2, 2:
4 : (hello)
2 : (world)
2 : (hi)

sub get_pattern { my ($string, $min_len, $min_num) = @_; my $str_len = length($string); my $srch_max = int($str_len/2); my %patterns; # First we find all patterns that are up to 1/2 the length of the stri +ng foreach my $len ($min_len..$srch_max) { my $eol = $str_len - $len; foreach my $ind1 (0..$eol) { my $pat = substr($file, $ind1, $len); unless ( defined($patterns{$pat}) ) { $patterns{$pat} = 0; my $index = 0; do { $index = index($file, $pat, $index); unless ($index < 0) { $index += length($pat); $patterns{$pat}++; } } while ($index >= 0); } } } # We then dump all patterns that do not occur min_num times foreach my $key (keys %patterns) { delete $patterns{$key} if ($patterns{$key} < $min_num); } # We then go through the patterns by order and remove those # that are invalidated by better patterns foreach my $key (sort { $patterns{$b} * (length($b)-1) <=> $patterns{$a} * (length($a)-1) or length($b) <=> length($a) or $a cmp $b } keys %patterns) { my $check = 0; $patterns{$key} = 0; my $index; do { $index = index($file, $key, 0); unless ($index < 0) { $check = 1; $patterns{$key}++; substr($file, $index, length($key)) = "\000"; } } while ($index >= 0); delete $patterns{$key} if ($patterns{$key} < $min_num); } foreach my $key (sort { $patterns{$b} * (length($b)-1) <=> $patterns{$a} * (length($a)-1) or length($b) <=> length($a) or $a cmp $b } keys %patterns) { (my $pat = $key) =~ s/\n/\\n/g; printf("%3d : (%s)\n", $patterns{$key}, $pat); } }


Comment on Re: Patter Finding
Download Code
Re: Re: Patter Finding
by demerphq (Chancellor) on Sep 11, 2001 at 16:57 UTC
    Hi Lemming,

    Im a little confused. As posted your code goes into an infinte loop. When I s/$file/\$string/g I get the output as you said we would (impressive) However if minlen is 0 it goes into an infinite loop!

    Also when I try the string:'hellohiothellobrakerakerashash' I only get one of the many words contained, and a couple that arent words.

    2 : (hello) 2 : (aker) 2 : (ash)
    I would expect any of the following:
    hello,hi,othello,brake,rake,raker,rash,ash,hash,ohio, the,lob,bra,hell,era # I get this using substr counts: ak,ake,aker,akera,as,ash,el,ell,ello,er,era, he,hel,hell,hello,ke,ker,kera,ll,llo,lo, ra,rak,rake,raker,rakera,sh
    So my guess is that the above results are coincidental or am I missing something? Yves

    --
    You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

      I would expect any of the following: hello,hi,othello,brake,rake,raker,rash,ash,hash,ohio, the,lob,bra,hell,era

      I don't see how you could expect some of those strings, because some only appear once in the string (e.g. "othello", "ohio"), so you really couldn't call them a "pattern" unless you're matching against a dictionary file.

      My solution near the top of this thread sort of assumes that the string is a contiguous series of patterns (one of the original constraints was "String contains nothing but patterns"), so it only finds "hello" from your test string, but if you change this line:

      # From this if (/\G(.{2,})(?=.*?\1)/g) { # To this if (/\G.*?(.{2,})(?=.*?\1)/g) {
      Then it does better and finds "ash", "rake", and "hello" from your test string, which is about as good as it gets, I believe.
Re:^2 Pattern Finding
by lemming (Priest) on Sep 14, 2001 at 00:06 UTC

    Ok. Here's a better version. While I haven't benchmarked it, my feeling are that it's a hog, but I bullet proofed several areas. It's less than a hog than my earlier post. I'm posting the new version for an easier compare.

    #!/usr/bin/perl # string, min_len of pattern, min_num of patterns use strict; use warnings; my $string = "bookhelloworldhellohellohihellohiworldhihelloworldhihe +llobookpenbookpenworld"; get_pattern($string, 2, 2); exit; sub get_pattern { my ($string, $min_len, $min_num) = @_; my $str_len = length($string); my $srch_max = int($str_len/2); my %patterns; # First we find all patterns that are up to 1/2 the length of the stri +ng print "length : $str_len\n"; my %tmp_hash; foreach my $len ($min_len..$srch_max) { my $eol = $str_len - $len; foreach my $ind1 (0..$eol) { my $pat = substr($string, $ind1, $len); unless ( defined($tmp_hash{$pat}) ) { $tmp_hash{$pat} = 0; $tmp_hash{$pat}++ while ($string =~ /\Q$pat\E/g); $patterns{$pat} = $tmp_hash{$pat} if ($tmp_hash{$pat} >= $min_ +num); } } } undef %tmp_hash; print "Patterns: ", scalar (keys %patterns), "\n"; # We then go through the patterns by order and remove those # that are invalidated by better patterns # Longer strings that occur more often are considered better my $mod_str = $string; foreach my $key (sort { $patterns{$b} * (length($b)-1) <=> $patterns{$a} * (length($a)-1) or length($b) <=> length($a) } keys %patterns) { my $tstr = $mod_str; # We null out any area with pattern and count $patterns{$key} = ($tstr =~ s/\Q$key\E/\000/g); if ($patterns{$key} >= $min_num) { # If it hits threshold we keep $mod_str = $tstr; } else { # If not we toss pattern delete $patterns{$key}; } } print "Valid : ", scalar (keys %patterns), "\n"; # We finally print results foreach my $key (sort { $patterns{$b} * (length($b)-1) <=> $patterns{$a} * (length($a)-1) or length($b) <=> length($a) or $a cmp $b } keys %patterns) { (my $pat = $key) =~ s/\n/\\n/g; printf "%3d: (%s)\n", $patterns{$key}, $pat; } }

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://111636]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others about the Monastery: (17)
As of 2015-07-06 20:04 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (82 votes), past polls