update: Look at my second offering, it's better
Here's an inefficient subroutine that I've used in
the past. Probably time for an overhaul
Output when called with $string, 2, 2:
4 : (hello)
2 : (world)
2 : (hi)
sub get_pattern {
my ($string, $min_len, $min_num) = @_;
my $str_len = length($string);
my $srch_max = int($str_len/2);
my %patterns;
# First we find all patterns that are up to 1/2 the length of the stri
+ng
foreach my $len ($min_len..$srch_max) {
my $eol = $str_len - $len;
foreach my $ind1 (0..$eol) {
my $pat = substr($file, $ind1, $len);
unless ( defined($patterns{$pat}) ) {
$patterns{$pat} = 0;
my $index = 0;
do {
$index = index($file, $pat, $index);
unless ($index < 0) {
$index += length($pat);
$patterns{$pat}++;
}
} while ($index >= 0);
}
}
}
# We then dump all patterns that do not occur min_num times
foreach my $key (keys %patterns) {
delete $patterns{$key} if ($patterns{$key} < $min_num);
}
# We then go through the patterns by order and remove those
# that are invalidated by better patterns
foreach my $key
(sort { $patterns{$b} * (length($b)-1) <=>
$patterns{$a} * (length($a)-1)
or length($b) <=> length($a)
or $a cmp $b } keys %patterns) {
my $check = 0;
$patterns{$key} = 0;
my $index;
do {
$index = index($file, $key, 0);
unless ($index < 0) {
$check = 1;
$patterns{$key}++;
substr($file, $index, length($key)) = "\000";
}
} while ($index >= 0);
delete $patterns{$key} if ($patterns{$key} < $min_num);
}
foreach my $key
(sort { $patterns{$b} * (length($b)-1) <=>
$patterns{$a} * (length($a)-1)
or length($b) <=> length($a)
or $a cmp $b } keys %patterns) {
(my $pat = $key) =~ s/\n/\\n/g;
printf("%3d : (%s)\n", $patterns{$key}, $pat);
}
}
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.
|