Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Re: Patter Finding

by lemming (Priest)
on Sep 11, 2001 at 04:17 UTC ( #111636=note: print w/ replies, xml ) Need Help??


in reply to Pattern Finding

update: Look at my second offering, it's better

Here's an inefficient subroutine that I've used in the past. Probably time for an overhaul
Output when called with $string, 2, 2:
4 : (hello)
2 : (world)
2 : (hi)

sub get_pattern { my ($string, $min_len, $min_num) = @_; my $str_len = length($string); my $srch_max = int($str_len/2); my %patterns; # First we find all patterns that are up to 1/2 the length of the stri +ng foreach my $len ($min_len..$srch_max) { my $eol = $str_len - $len; foreach my $ind1 (0..$eol) { my $pat = substr($file, $ind1, $len); unless ( defined($patterns{$pat}) ) { $patterns{$pat} = 0; my $index = 0; do { $index = index($file, $pat, $index); unless ($index < 0) { $index += length($pat); $patterns{$pat}++; } } while ($index >= 0); } } } # We then dump all patterns that do not occur min_num times foreach my $key (keys %patterns) { delete $patterns{$key} if ($patterns{$key} < $min_num); } # We then go through the patterns by order and remove those # that are invalidated by better patterns foreach my $key (sort { $patterns{$b} * (length($b)-1) <=> $patterns{$a} * (length($a)-1) or length($b) <=> length($a) or $a cmp $b } keys %patterns) { my $check = 0; $patterns{$key} = 0; my $index; do { $index = index($file, $key, 0); unless ($index < 0) { $check = 1; $patterns{$key}++; substr($file, $index, length($key)) = "\000"; } } while ($index >= 0); delete $patterns{$key} if ($patterns{$key} < $min_num); } foreach my $key (sort { $patterns{$b} * (length($b)-1) <=> $patterns{$a} * (length($a)-1) or length($b) <=> length($a) or $a cmp $b } keys %patterns) { (my $pat = $key) =~ s/\n/\\n/g; printf("%3d : (%s)\n", $patterns{$key}, $pat); } }


Comment on Re: Patter Finding
Download Code
Re: Re: Patter Finding
by demerphq (Chancellor) on Sep 11, 2001 at 16:57 UTC
    Hi Lemming,

    Im a little confused. As posted your code goes into an infinte loop. When I s/$file/\$string/g I get the output as you said we would (impressive) However if minlen is 0 it goes into an infinite loop!

    Also when I try the string:'hellohiothellobrakerakerashash' I only get one of the many words contained, and a couple that arent words.

    2 : (hello) 2 : (aker) 2 : (ash)
    I would expect any of the following:
    hello,hi,othello,brake,rake,raker,rash,ash,hash,ohio, the,lob,bra,hell,era # I get this using substr counts: ak,ake,aker,akera,as,ash,el,ell,ello,er,era, he,hel,hell,hello,ke,ker,kera,ll,llo,lo, ra,rak,rake,raker,rakera,sh
    So my guess is that the above results are coincidental or am I missing something? Yves

    --
    You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

      I would expect any of the following: hello,hi,othello,brake,rake,raker,rash,ash,hash,ohio, the,lob,bra,hell,era

      I don't see how you could expect some of those strings, because some only appear once in the string (e.g. "othello", "ohio"), so you really couldn't call them a "pattern" unless you're matching against a dictionary file.

      My solution near the top of this thread sort of assumes that the string is a contiguous series of patterns (one of the original constraints was "String contains nothing but patterns"), so it only finds "hello" from your test string, but if you change this line:

      # From this if (/\G(.{2,})(?=.*?\1)/g) { # To this if (/\G.*?(.{2,})(?=.*?\1)/g) {
      Then it does better and finds "ash", "rake", and "hello" from your test string, which is about as good as it gets, I believe.
Re:^2 Pattern Finding
by lemming (Priest) on Sep 14, 2001 at 00:06 UTC

    Ok. Here's a better version. While I haven't benchmarked it, my feeling are that it's a hog, but I bullet proofed several areas. It's less than a hog than my earlier post. I'm posting the new version for an easier compare.

    #!/usr/bin/perl # string, min_len of pattern, min_num of patterns use strict; use warnings; my $string = "bookhelloworldhellohellohihellohiworldhihelloworldhihe +llobookpenbookpenworld"; get_pattern($string, 2, 2); exit; sub get_pattern { my ($string, $min_len, $min_num) = @_; my $str_len = length($string); my $srch_max = int($str_len/2); my %patterns; # First we find all patterns that are up to 1/2 the length of the stri +ng print "length : $str_len\n"; my %tmp_hash; foreach my $len ($min_len..$srch_max) { my $eol = $str_len - $len; foreach my $ind1 (0..$eol) { my $pat = substr($string, $ind1, $len); unless ( defined($tmp_hash{$pat}) ) { $tmp_hash{$pat} = 0; $tmp_hash{$pat}++ while ($string =~ /\Q$pat\E/g); $patterns{$pat} = $tmp_hash{$pat} if ($tmp_hash{$pat} >= $min_ +num); } } } undef %tmp_hash; print "Patterns: ", scalar (keys %patterns), "\n"; # We then go through the patterns by order and remove those # that are invalidated by better patterns # Longer strings that occur more often are considered better my $mod_str = $string; foreach my $key (sort { $patterns{$b} * (length($b)-1) <=> $patterns{$a} * (length($a)-1) or length($b) <=> length($a) } keys %patterns) { my $tstr = $mod_str; # We null out any area with pattern and count $patterns{$key} = ($tstr =~ s/\Q$key\E/\000/g); if ($patterns{$key} >= $min_num) { # If it hits threshold we keep $mod_str = $tstr; } else { # If not we toss pattern delete $patterns{$key}; } } print "Valid : ", scalar (keys %patterns), "\n"; # We finally print results foreach my $key (sort { $patterns{$b} * (length($b)-1) <=> $patterns{$a} * (length($a)-1) or length($b) <=> length($a) or $a cmp $b } keys %patterns) { (my $pat = $key) =~ s/\n/\\n/g; printf "%3d: (%s)\n", $patterns{$key}, $pat; } }

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://111636]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (11)
As of 2014-10-24 18:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (134 votes), past polls