|We don't bite newbies here... much|
I'd like to show a failed attempt at this interesting but ill-posed problem. I say ill-posed because the expected output is not deducible from the input alone, but expects patterns which form English words.
The LZx family of compression algorithms depend on finding and cataloging repeated substrings. I had the notion to use a modified LZW compression routine to find a list of candidate patterns. Only the dictionary is built, and I generate no compressed stream (take that, Unisys!). The dictionary keeps frequencies rather than unique numeric identifiers.
I shift in arguments or set to a default, then clear out control characters. $depth provides for multiple scanning of the string, part of why this approach is flawed. As a stream-oriented algorithm, LZW is does not predict frequent substrings at the beginning, and is greedy about tacking extra characters onto a candidate. We'll see these effects later in the output listing.
Per LZW, we prime the dictionary with our alphabet. We set up a current working string, $j, then scan the input string one character ($_) at a time. If $j.$_ has been seen, we go on to the next character. If it has not, we increment $j's count, add $j.$_ to the dictionary, and reset $j to $_.
Print the collected substrings, filtering out the single characters. I make a crude attempt to sort by desirability of a pattern, accounting for both frequency and length. A Data::Dumper spill of the dictionary can be uncommented for closer study.
Run without arguments, the program prints:
Clearly, the limited view of the data taken by a stream oriented algorithm is not good enough to recognize the two occurances of 'world' disjoint to 'hello''s four. The lookaside and capture facilities of perl's regex engine are superior for this task.
Props to jepri, who motivated me to look at the LZ clan a while back. I didn't use his code for this; these warts are all mine. I originally was going to try a similar trick in a cryptanalytic tookit, but this exercise has showed me that I need to find a better idea. I think I'll find several in the other replies here.
++artist for a stimulating question to think about.
In reply to And Now For Something Completely Different... Re: Pattern Finding