And Now For Something Completely Different... Re: Pattern Finding

I'd like to show a failed attempt at this interesting but ill-posed problem. I say ill-posed because the expected output is not deducible from the input alone, but expects patterns which form English words.

The LZx family of compression algorithms depend on finding and cataloging repeated substrings. I had the notion to use a modified LZW compression routine to find a list of candidate patterns. Only the dictionary is built, and I generate no compressed stream (take that, Unisys!). The dictionary keeps frequencies rather than unique numeric identifiers.

After preliminaries,

#!/usr/bin/perl -w --
use strict;

# Usage: ./Pattern [n [string]]
[download]

I shift in arguments or set to a default, then clear out control characters. $depth provides for multiple scanning of the string, part of why this approach is flawed. As a stream-oriented algorithm, LZW is does not predict frequent substrings at the beginning, and is greedy about tacking extra characters onto a candidate. We'll see these effects later in the output listing.

my $depth = shift || 2;
my $string = shift || "helloworldhellohellohihellohiworld";
$string =~ s/[\000-\037\0177]/ /g; # strip non-printable
[download]

Per LZW, we prime the dictionary with our alphabet. We set up a current working string, $j, then scan the input string one character ($_) at a time. If $j.$_ has been seen, we go on to the next character. If it has not, we increment $j's count, add $j.$_ to the dictionary, and reset $j to $_.

# seed dictionary with printable characters...
my %dict = map {(chr$_ => 0)} '32'..'126';
# and populate it from the input string
for (1..$depth) {
    my $j = '';
    for (split //, $string) {
        my $tmp = $j . $_;
        $j = defined $dict{$tmp} ? $tmp
            : do{
                $dict{$j}++;
                $dict{$tmp}=0;
                $_};
    }
}
[download]

Print the collected substrings, filtering out the single characters. I make a crude attempt to sort by desirability of a pattern, accounting for both frequency and length. A Data::Dumper spill of the dictionary can be uncommented for closer study.

# the following sort routine is arbitrary,
# chosen because it behaves the way I want... sort of.
{
    local $, = " ";
    local $\ = "\n";
    print sort {
        $dict{$a}*length($a)**2 <=> $dict{$b}*length($b)**2 #||
        } grep {length$_>1} keys %dict;
}

#use Data::Dumper;
#print Dumper(\%dict);

__END__
[download]

Run without arguments, the program prints:

ow ohi llo worl dh ldh orl rld hellohi hi hellow lohi iwo ihe ell ld h
+e ll lo iw rl oh ih el or wo hel wor loh hell helloh hello
[download]

Clearly, the limited view of the data taken by a stream oriented algorithm is not good enough to recognize the two occurances of 'world' disjoint to 'hello''s four. The lookaside and capture facilities of perl's regex engine are superior for this task.

Props to jepri, who motivated me to look at the LZ clan a while back. I didn't use his code for this; these warts are all mine. I originally was going to try a similar trick in a cryptanalytic tookit, but this exercise has showed me that I need to find a better idea. I think I'll find several in the other replies here.

++artist for a stimulating question to think about.

After Compline,
Zaxo

Comment on And Now For Something Completely Different... Re: Pattern Finding Select or Download Code


No such thing as a small change
	PerlMonks