Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

I'd like to show a failed attempt at this interesting but ill-posed problem. I say ill-posed because the expected output is not deducible from the input alone, but expects patterns which form English words.

The LZx family of compression algorithms depend on finding and cataloging repeated substrings. I had the notion to use a modified LZW compression routine to find a list of candidate patterns. Only the dictionary is built, and I generate no compressed stream (take that, Unisys!). The dictionary keeps frequencies rather than unique numeric identifiers.

After preliminaries,

#!/usr/bin/perl -w -- use strict; # Usage: ./Pattern [n [string]]
I shift in arguments or set to a default, then clear out control characters. $depth provides for multiple scanning of the string, part of why this approach is flawed. As a stream-oriented algorithm, LZW is does not predict frequent substrings at the beginning, and is greedy about tacking extra characters onto a candidate. We'll see these effects later in the output listing.
my $depth = shift || 2; my $string = shift || "helloworldhellohellohihellohiworld"; $string =~ s/[\000-\037\0177]/ /g; # strip non-printable
Per LZW, we prime the dictionary with our alphabet. We set up a current working string, $j, then scan the input string one character ($_) at a time. If $j.$_ has been seen, we go on to the next character. If it has not, we increment $j's count, add $j.$_ to the dictionary, and reset $j to $_.
# seed dictionary with printable characters... my %dict = map {(chr$_ => 0)} '32'..'126'; # and populate it from the input string for (1..$depth) { my $j = ''; for (split //, $string) { my $tmp = $j . $_; $j = defined $dict{$tmp} ? $tmp : do{ $dict{$j}++; $dict{$tmp}=0; $_}; } }
Print the collected substrings, filtering out the single characters. I make a crude attempt to sort by desirability of a pattern, accounting for both frequency and length. A Data::Dumper spill of the dictionary can be uncommented for closer study.
# the following sort routine is arbitrary, # chosen because it behaves the way I want... sort of. { local $, = " "; local $\ = "\n"; print sort { $dict{$a}*length($a)**2 <=> $dict{$b}*length($b)**2 #|| } grep {length$_>1} keys %dict; } #use Data::Dumper; #print Dumper(\%dict); __END__
Run without arguments, the program prints:
ow ohi llo worl dh ldh orl rld hellohi hi hellow lohi iwo ihe ell ld h +e ll lo iw rl oh ih el or wo hel wor loh hell helloh hello

Clearly, the limited view of the data taken by a stream oriented algorithm is not good enough to recognize the two occurances of 'world' disjoint to 'hello''s four. The lookaside and capture facilities of perl's regex engine are superior for this task.

Props to jepri, who motivated me to look at the LZ clan a while back. I didn't use his code for this; these warts are all mine. I originally was going to try a similar trick in a cryptanalytic tookit, but this exercise has showed me that I need to find a better idea. I think I'll find several in the other replies here.

++artist for a stimulating question to think about.

After Compline,
Zaxo


In reply to And Now For Something Completely Different... Re: Pattern Finding by Zaxo
in thread Pattern Finding by artist

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others studying the Monastery: (4)
    As of 2014-08-30 20:34 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      The best computer themed movie is:











      Results (293 votes), past polls