Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical

And Now For Something Completely Different... Re: Pattern Finding

by Zaxo (Archbishop)
on Sep 11, 2001 at 17:29 UTC ( #111746=note: print w/replies, xml ) Need Help??

in reply to Pattern Finding

I'd like to show a failed attempt at this interesting but ill-posed problem. I say ill-posed because the expected output is not deducible from the input alone, but expects patterns which form English words.

The LZx family of compression algorithms depend on finding and cataloging repeated substrings. I had the notion to use a modified LZW compression routine to find a list of candidate patterns. Only the dictionary is built, and I generate no compressed stream (take that, Unisys!). The dictionary keeps frequencies rather than unique numeric identifiers.

After preliminaries,

#!/usr/bin/perl -w -- use strict; # Usage: ./Pattern [n [string]]
I shift in arguments or set to a default, then clear out control characters. $depth provides for multiple scanning of the string, part of why this approach is flawed. As a stream-oriented algorithm, LZW is does not predict frequent substrings at the beginning, and is greedy about tacking extra characters onto a candidate. We'll see these effects later in the output listing.
my $depth = shift || 2; my $string = shift || "helloworldhellohellohihellohiworld"; $string =~ s/[\000-\037\0177]/ /g; # strip non-printable
Per LZW, we prime the dictionary with our alphabet. We set up a current working string, $j, then scan the input string one character ($_) at a time. If $j.$_ has been seen, we go on to the next character. If it has not, we increment $j's count, add $j.$_ to the dictionary, and reset $j to $_.
# seed dictionary with printable characters... my %dict = map {(chr$_ => 0)} '32'..'126'; # and populate it from the input string for (1..$depth) { my $j = ''; for (split //, $string) { my $tmp = $j . $_; $j = defined $dict{$tmp} ? $tmp : do{ $dict{$j}++; $dict{$tmp}=0; $_}; } }
Print the collected substrings, filtering out the single characters. I make a crude attempt to sort by desirability of a pattern, accounting for both frequency and length. A Data::Dumper spill of the dictionary can be uncommented for closer study.
# the following sort routine is arbitrary, # chosen because it behaves the way I want... sort of. { local $, = " "; local $\ = "\n"; print sort { $dict{$a}*length($a)**2 <=> $dict{$b}*length($b)**2 #|| } grep {length$_>1} keys %dict; } #use Data::Dumper; #print Dumper(\%dict); __END__
Run without arguments, the program prints:
ow ohi llo worl dh ldh orl rld hellohi hi hellow lohi iwo ihe ell ld h +e ll lo iw rl oh ih el or wo hel wor loh hell helloh hello

Clearly, the limited view of the data taken by a stream oriented algorithm is not good enough to recognize the two occurances of 'world' disjoint to 'hello''s four. The lookaside and capture facilities of perl's regex engine are superior for this task.

Props to jepri, who motivated me to look at the LZ clan a while back. I didn't use his code for this; these warts are all mine. I originally was going to try a similar trick in a cryptanalytic tookit, but this exercise has showed me that I need to find a better idea. I think I'll find several in the other replies here.

++artist for a stimulating question to think about.

After Compline,

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://111746]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (5)
As of 2018-06-20 19:03 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (117 votes). Check out past polls.