OK, here's a better version. I haven't benchmarked it,
and my feeling is that it's still a hog, but I
bulletproofed several areas, and it's less of a hog
than my earlier post. I'm posting the new version
in full so the two are easier to compare.
#!/usr/bin/perl
# string, min_len of pattern, min_num of patterns
use strict;
use warnings;
my $string = "bookhelloworldhellohellohihellohiworldhihelloworldhihellobookpenbookpenworld";
get_pattern($string, 2, 2);
exit;
sub get_pattern {
my ($string, $min_len, $min_num) = @_;
my $str_len = length($string);
my $srch_max = int($str_len/2);
my %patterns;
# First we find all patterns that are up to 1/2 the length of the string
print "length : $str_len\n";
my %tmp_hash;
foreach my $len ($min_len..$srch_max) {
my $eol = $str_len - $len;
foreach my $ind1 (0..$eol) {
my $pat = substr($string, $ind1, $len);
unless ( defined($tmp_hash{$pat}) ) {
$tmp_hash{$pat} = 0;
$tmp_hash{$pat}++ while ($string =~ /\Q$pat\E/g);
$patterns{$pat} = $tmp_hash{$pat} if ($tmp_hash{$pat} >= $min_num);
}
}
}
undef %tmp_hash;
print "Patterns: ", scalar (keys %patterns), "\n";
# We then go through the patterns by order and remove those
# that are invalidated by better patterns
# Longer strings that occur more often are considered better
my $mod_str = $string;
foreach my $key (sort { $patterns{$b} * (length($b)-1) <=>
$patterns{$a} * (length($a)-1)
or length($b) <=> length($a) }
keys %patterns) {
my $tstr = $mod_str;
# We null out any area with pattern and count
$patterns{$key} = ($tstr =~ s/\Q$key\E/\000/g);
if ($patterns{$key} >= $min_num) {
# If it hits threshold we keep
$mod_str = $tstr;
}
else {
# If not we toss pattern
delete $patterns{$key};
}
}
print "Valid : ", scalar (keys %patterns), "\n";
# We finally print results
foreach my $key
(sort { $patterns{$b} * (length($b)-1) <=>
$patterns{$a} * (length($a)-1)
or length($b) <=> length($a)
or $a cmp $b } keys %patterns) {
(my $pat = $key) =~ s/\n/\\n/g;
printf "%3d: (%s)\n", $patterns{$key}, $pat;
}
}
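For anyone trying to follow the ordering, here is a minimal standalone sketch of the ranking used in both sorts above: score = count * (length - 1), with ties broken by longer pattern and then alphabetically. The score() helper and the counts in %counts are made up for illustration and are not output from the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): the same sort key,
# count * (length - 1), so a 2-character pattern scores just its count.
sub score {
    my ($pat, $count) = @_;
    return $count * (length($pat) - 1);
}

# Made-up counts, purely for illustration.
my %counts = (hello => 7, hi => 5, world => 4, bookpen => 2);

# Higher score first, then longer pattern, then alphabetical order.
my @ranked = sort {
    score($b, $counts{$b}) <=> score($a, $counts{$a})
        or length($b) <=> length($a)
        or $a cmp $b
} keys %counts;

printf "%-8s count=%d score=%d\n", $_, $counts{$_}, score($_, $counts{$_})
    for @ranked;
```

With these sample counts, "bookpen" (2 hits, score 12) outranks "hi" (5 hits, score 5), which is the intended bias toward longer patterns.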