PerlMonks  

Markov Chain Program

by sacked (Hermit)
on Feb 02, 2001 at 01:38 UTC ( #55851=sourcecode )

Category: Text Processing
Author/Contact Info sacked
Description: Taking the suggestion from Kernighan and Pike's The Practice of Programming, I wrote another version of their Markov Chain program (chapter 3) that allows for different length prefixes. It works best with shorter prefixes, as they are more likely to occur in the text than longer ones.

Please offer any tips for improving/enhancing this script. Thanks!
#!/usr/bin/perl -w
#
# based on kernighan and pike's markov chain program
# in _the practice of programming_, chapter 3
# (http://cm.bell-labs.com/cm/cs/tpop/markov.pl)
#
# Usage: markov.pl [prefix length] [max words to print] <infile

use strict;

my @words;    # a line of text
my %wordlist; # key: prefix, value: array of suffixes

my $pref_len = shift @ARGV || 2;
my $maxwords = shift @ARGV || 10000;

# build word list
#
# 'Blessed is the man that walketh not in the counsel'
# %wordlist = ( 'blessed is' => [ 'the',  ],
#               'is the'     => [ 'man',  ],
#               'the man'    => [ 'that', ],
#             );
while (<>) {
    push @words, split;

    while ( @words > $pref_len )  {
        # build prefix of $pref_len words
        # join(' ', @array) is faster than qq(@array) or "@array"
        #
        my $pref = join(' ', @words[0..($pref_len-1)]);

        # add suffix to list
        #
        push @{ $wordlist{$pref} }, $words[$pref_len];

        shift @words; # next word on this line
    }
}

# starting point
#
my $pref = (keys %wordlist)[rand keys %wordlist];

print "$pref";

# dump out listings
#
for (0..($maxwords-1)) {
  last if not defined($wordlist{$pref});

  my $index = rand @{ $wordlist{$pref} };
  my $suf   = $wordlist{$pref}[$index];

  print ' '. $suf;

  print "\n" if ( $_ % 10 == 0);

  # skip past first word in prefix
  #
  #$pref = (split(' ', $pref))[1..$pref_len-1] . ' ' . $suf;
  $pref =~ s/^[^ ]+ (.+)$/$1 $suf/;
}

__END__
two good samples generated from the book of psalms:

"For, lo, thine enemies, and the horn of David the son of
thine house hath eaten me up"

"His mouth is full of troubles"

Re: Markov Chain Program
by dws (Chancellor) on Feb 02, 2001 at 04:10 UTC
    To improve on this, you may have to rethink your data structures. I've had good results by weighting the random choice of the next token based on its frequency of occurrence after the prior n tokens (where n == 2 in your code).

    One approach would be to extend your data structure along the lines of

    %wordlist = ( 'blessed is' => ['the', 47, 'a', 23, 'vroom', 1], ... );
    and then to use the frequency counts to weight the random choice.
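    One way to act on this suggestion, as a minimal sketch rather than dws's actual code: keep each suffix next to its count in the flat list shown above, then walk the list subtracting counts from a random draw. The helper name pick_weighted is mine, not from the thread.

    ```perl
    use strict;
    use warnings;

    # Count-weighted pick over a flat (word, count) list.
    # pick_weighted is a hypothetical helper name for illustration.
    sub pick_weighted {
        my ($pairs) = @_;            # e.g. [ 'the', 47, 'a', 23, 'vroom', 1 ]
        my $total = 0;
        $total += $pairs->[$_] for grep { $_ % 2 } 0 .. $#$pairs;
        my $r = rand $total;         # uniform in [0, total)
        for (my $i = 0; $i < @$pairs; $i += 2) {
            return $pairs->[$i] if ($r -= $pairs->[$i + 1]) < 0;
        }
        return $pairs->[-2];         # guard against float rounding
    }

    print pick_weighted([ 'the', 47, 'a', 23, 'vroom', 1 ]), "\n";
    ```

    A suffix with count 47 is then drawn 47 times as often as one with count 1, which is exactly the weighting dws describes.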

    For an object-oriented approach that builds n-deep trees, take a look at babelize.

      Thanks, I like the idea of using frequency counts very much. I turned to The Perl Cookbook for some help, and the results are below. Note that the default value for $maxwords is much lower for this version to prevent the program from taking a very long time to complete, as the subroutine weighted_suffix() often needs to loop many times when there is a low number of suffixes for a given prefix.
      #!/usr/bin/perl -w
      #
      # based on kernighan and pike's markov chain generator
      # in _the practice of programming_, chapter 3
      # (http://cm.bell-labs.com/cm/cs/tpop/markov.pl)
      #
      use strict;

      my @words;    # words on a line
      my %wordlist; # key: prefix, value: anon hash (k: suffix,
                    #                                v: frequency)

      my $pref_len = shift @ARGV || 2;
      my $maxwords = shift @ARGV || 100;
      my $entries  = 0;

      # build word list
      #
      # 'Blessed is the man that walketh not in the counsel'
      # %wordlist = ( 'blessed is' => { 'the' => 1, },
      #               'is the'     => { 'man' => 1, },
      #               'the man'    => { 'that'=> 1, },
      #             );
      #
      while (<>) {
          my $suf;
          push @words, split;

          while ( @words > $pref_len ) {
              # build prefix of $pref_len words
              # join(' ', @array) is faster than qq(@array) or "@array"
              #
              my $pref = join(' ', @words[0..($pref_len-1)]);

              # add suffix to list
              #
              $suf = $words[$pref_len];
              $wordlist{$pref}{$suf}++;

              shift @words; # next word on this line
              $entries++;
          }
      }

      # change frequency count to a percentage
      # (with help from pcb, recipe 2.10)
      #
      foreach my $href ( values %wordlist ) {
          foreach ( values %$href ) {
              $_ /= $entries;
          }
      }

      # starting point
      #
      my $pref = (keys %wordlist)[rand keys %wordlist];

      print "$pref";

      # dump out listings
      #
      for (0..($maxwords-1)) {
          last unless (exists $wordlist{$pref});

          my $suf = weighted_suffix();

          print ' '. $suf;

          print "\n" if ( $_ % 10 == 0);

          # skip past first word in prefix
          #
          $pref =~ s/^[^ ]+ (.+)$/$1 $suf/;
      }

      exit;

      # from pcb (recipe 2.10)
      #
      sub weighted_suffix {
          my ($suf, $weight, $rand);
          while (1) {
              $rand = rand;
              while ( ($suf, $weight) = each %{ $wordlist{$pref} } ) {
                  return $suf if ($rand -= $weight) < 0;
              }
          }
      }
Re: Markov Chain Program
by extremely (Priest) on Feb 02, 2001 at 04:13 UTC
    There are some good examples of Markov chain code floating around on here that break up the text by letter rather than word. In those cases I've seen the best results with 4|1 partitioning. Sifting through it is sometimes tedious but some really funny stuff pops up once in a while. The letter by letter mode has a creepy ability to create new words that seem to make sense.
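    A letter-by-letter chain with the 4|1 split extremely mentions can be sketched like this; the variable names and the sample string are mine, and a real run would feed in a whole text rather than one sentence.

    ```perl
    use strict;
    use warnings;

    my %next;   # 4-char prefix => list of chars seen right after it
    my $text = "the cat sat on the mat";
    for my $i (0 .. length($text) - 5) {
        push @{ $next{ substr($text, $i, 4) } }, substr($text, $i + 4, 1);
    }

    # generate: start from a known prefix, extend one character at a time
    my $out = "the ";
    for (1 .. 10) {
        my $sufs = $next{ substr($out, -4) } or last;   # dead end: stop
        $out .= $sufs->[ rand @$sufs ];
    }
    print "$out\n";
    ```

    Because the prefix window slides one character at a time, the chain can glue word fragments together at any letter boundary, which is where the plausible-looking invented words come from.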

    --
    $you = new YOU;
    honk() if $you->love(perl)

      I'm currently implementing one right now (looking at as few others as possible) that is just words - boring and makes many mistakes.

      But I'm planning on rewriting it shortly after I'm done, and will be doing it with digraphs (two-character pairs, including punctuation and spaces, though probably not line breaks) after learning how nice those can be for old-school cryptography - digraphs should theoretically work better than trigraphs or single letters.

      Just a thought. (My current problems lie less in the theory or programming above - all very easy - than in how I strip the text and where I get it from... I'm trying several boards, as well as doing it in an amusing way in newsgroups.)
        I need some help with a Markov algorithm for the following question: write a Markov chain program that will analyze your current publication's texts and generate random text that uses phrases in a manner similar to the input text.

        You ask how this works and she explains:

        Find some body of text (in our case, text files) that you want to imitate. For every pair of words that occurs in the text, keep track of each word that can follow that pair of words. So, for every pair of words, you would know a) which words followed that pair of words AND b) know at what probability those words might follow the pair of words. (See examples below.)

        Using the information gathered in the previous step, start with a pair of two consecutive words ($word_one and $word_two) that occur in the text, print those two words, then randomly choose the next word ($next_word) according to the probability that it would follow those two words. Print that word. Now use the second word ($word_two) and the new word ($next_word) as your two consecutive words, and repeat this process until you have generated the amount of text you want or hit a word pair that has no next word.

        Let us look at an example from The New Testament According to Dr. Seuss:

        He didn't come in a plane.
        He didn't come in a Jeep.
        He didn't come in a pouch
        Of a high jumping Voveep.
        If we were to analyze the word pairs, we see the following pairs of words in the text:
        He didn't come [3, 100.0%]
        Jeep. He didn't [1, 100.0%]
        Of a high [1, 100.0%]
        a Jeep. He [1, 100.0%]
        a high jumping [1, 100.0%]
        a plane. He [1, 100.0%]
        a pouch Of [1, 100.0%]
        come in a [3, 100.0%]
        didn't come in [3, 100.0%]
        high jumping Voveep. [1, 100.0%]
        in a Jeep. [1, 33.3%]
             pouch [1, 33.3%]
             plane. [1, 33.3%]
        plane. He didn't [1, 100.0%]
        pouch Of a [1, 100.0%]
        We can see that the word pair He didn't occurred three times, each time followed by the word come (at 100% probability). And the word pair in a occurred three times, followed by either Jeep., pouch, or plane (each of these with a 33.3% probability).
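        The two-word walk just described can be sketched as follows; the %follow table is hand-built here from the example above, where a real program would harvest it from the input text.

        ```perl
        use strict;
        use warnings;

        # word pair => list of words seen after that pair
        my %follow = (
            "He didn't"   => ['come'],
            "didn't come" => ['in'],
            "come in"     => ['a'],
            "in a"        => [ 'plane.', 'Jeep.', 'pouch' ],
        );

        my ($word_one, $word_two) = ('He', "didn't");
        my @out = ($word_one, $word_two);
        while ( my $sufs = $follow{"$word_one $word_two"} ) {
            my $next_word = $sufs->[ rand @$sufs ];
            push @out, $next_word;
            ($word_one, $word_two) = ($word_two, $next_word);  # slide the pair
            last if @out >= 20;   # word budget; dead-end pairs also stop us
        }
        print "@out\n";
        ```

        Picking a random element of the suffix list gives each stored occurrence equal weight, so a word stored three times is three times as likely - the 33.3% probabilities above fall out of this automatically.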

        Your task is write a program called babble that will read text from <> and apply the Markov Chain algorithm to generate random text that reads like the input text.

        Your program will also take three options (you are advised to use Getopt::Long qw(GetOptions) but you may use other methods if you insist):

        --words (the total number of words to generate)
        --paragraphs (the number of words per paragraph)
        --show_pairs (show the word pairs and frequencies as in the example above, which is sorted alphabetically by word pairs, then by decreasing frequency for the next words). If --show_pairs is given as an option, your program should not do any babbling, just output the table and exit.

        You are advised to implement --show_pairs first. This will require designing a data structure to store the "word pair" to "next word" mappings (when you hear "map", you might think "hash" or "hashref") and then writing a subroutine to load/build this data structure from the input text. Don't worry about capitalization and punctuation -- you can treat anything that's not whitespace as word characters (i.e., @words = split() is a perfectly acceptable construct to use to get your words). Once you have --show_pairs working, you should be able to do something like this:

        (Thu Oct 18 21:54:27): ~/e13cvs/users/solutions/hw3/
        austin@elmo 21 $ ./babble --show_pairs data/example.txt
        He didn't come [3, 100.0%]
        Jeep. He didn't [1, 100.0%]
        Of a high [1, 100.0%]
        a Jeep. He [1, 100.0%]
        a high jumping [1, 100.0%]
        a plane. He [1, 100.0%]
        a pouch Of [1, 100.0%]
        come in a [3, 100.0%]
        didn't come in [3, 100.0%]
        high jumping Voveep. [1, 100.0%]
        in a Jeep. [1, 33.3%]
             pouch [1, 33.3%]
             plane. [1, 33.3%]
        plane. He didn't [1, 100.0%]
        pouch Of a [1, 100.0%]
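        The counting behind the --show_pairs table can be sketched like this; the sample sentence and variable names are mine, and this is one possible shape for the homework's data structure, not the official solution.

        ```perl
        use strict;
        use warnings;

        my %pairs;   # "w1 w2" => { next_word => count }
        my @words = split ' ',
            "He didn't come in a plane. He didn't come in a Jeep.";
        for my $i (0 .. $#words - 2) {
            $pairs{"$words[$i] $words[$i+1]"}{ $words[ $i + 2 ] }++;
        }

        # print pairs alphabetically, next words by decreasing frequency
        for my $pair (sort keys %pairs) {
            my $h     = $pairs{$pair};
            my $total = 0;
            $total += $_ for values %$h;
            for my $next (sort { $h->{$b} <=> $h->{$a} || $a cmp $b } keys %$h) {
                printf "%s %s [%d, %.1f%%]\n",
                       $pair, $next, $h->{$next}, 100 * $h->{$next} / $total;
            }
        }
        ```

        The percentage is per pair (each next-word's count over that pair's total), which is why "in a" shows 33.3% lines while pairs with a single successor show 100.0%.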
        Edit by dws to rescue formatting
