Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

ngram

by TheEnigma (Pilgrim)
on Sep 18, 2004 at 23:48 UTC ( #392067=sourcecode: print w/ replies, xml ) Need Help??

Category: Text Processing
Author/Contact Info TheEnigma Mike Phenicie
Description: This is a program that performs what I think is called Markov Chaining of a text, on a letter by letter basis. This is based on something I read in Byte about 20 years ago, they called their program Travesty.

I've seen other programs that do this on a word basis, for instance the program in the Text Processing section of the Code Catacombs called markov.pl; but I havent't seen many that do it on a letter by letter basis.

The comments at the beginning of the program should give good enough instructions on how to use it.

I am planning to keep improving this program, and any and all comments|suggestions|critiques are welcome.

#!/usr/local/bin/perl -w
######################################################################
+########################
#       ngram.plx
#       v 0.2
#       by Mike Phenicie (TheEnigma on PerlMonks - www.perlmonks.org)
#       09-18-04
######################################################################
+########################
#  Background:  An analysis can be made of text samples, in which the 
+frequency of the various
#  ngrams is calulated.  An ngram is my term for character sequences o
+f length n.  I have seen
#  2 character sequences refered to as both digrams and digraphs; and 
+3 character sequences
#  refered to as trigrams and trigraphs.  So to refer in a generic way
+ to sequences of any
#  length, I chose ngram.
#
#  This program scans through a text file, and looks at every sequence
+ of length n-1, noting
#  what character comes after it.  It will create a hash of hashes tha
+t contains the number
#  of occurances of a particular character following a particular n-1 
+length sequence.  In
#  other words, if there are 10 places in the text where 'te' is follo
+wed by 'r', then
#  $freq{'te'}->{'r'} == 10.
#
#  This frequency table can be used to create a new text that has simi
+larities to the original
#  text file.  The new file will contain text with a very similar rati
+o of ngrams.  For instance,
#  if the original text had a lot of occasions where the letters 'ne' 
+were followed by
#  an 's' (with n=3), but not many occasions where it was followed by 
+a 'w', then that should
#  be true in the new text as well.
#
#  With low numbers of n (three or four, for instance), the only simil
+arity might be that it
#  seems to be in the same language as the input.  With higher values 
+for n, the output will 
#  start to resemble the input more and more.  With a high enough valu
+e it's possible you might
#  get the exact same text for output as was input.
######################################################################
+########################
#                                    Switches
#
#             -i <input text file>              file to analyze
#             -o <output text file>             computer generated tex
+t
#             -d <depth (2-10)>                 default 3
#             -v <verbosity> (1 to 10k)         default 100 (how long 
+the output is)
#             -s <frequency file>               save frequency hash to
+ a file
#             -l <frequency file>               use existing frequency
+ file for input
#                                               (can't specify -i or -
+s with this)
#             -t <"starting text">              default is random by s
+cript
#             -a <human readable freq file>     a human readable frequ
+ency file
#             -e <frequency file>               examine - ignores all 
+other switches
#             -D (no display on monitor)        if switch present, no 
+display on monitor
#             -A (append)                       append to output file
#
#  Usage:  You must specify an -i or an -l (but not both) to tell the 
+program what to use as input
#          You may specify an -s, but if you do, you can't use -l
#          You may specify an -o, and if you optionally also use -A, t
+he file will be appended
#          The generated text will by default go to the monitor, turn 
+this off with -D
#          You may specify -t to tell it what text you want to start w
+ith
#          You may specify -a to store the frequency data in a human r
+eadable file
#          You may specify an -e (which will ignore all other switches
+) to examine a freq file
#          If you don't specify -d, a default of 3 will be used
#          If you don't specify -v, a default of 100 will be used
######################################################################
+#########################
#         to do
# add help
# make line breaks occur on spaces
# try mixing two or more frequency files to create one output
# prevent it from printing bad start text to the output file
#
# and...?
######################################################################
+####
use strict;
use Getopt::Std;
use Storable qw(nstore); 

$| = 1;

srand;
my $name = "ngram.plx v 0.2";
my $wrap = 80;
my(%freq, $ngram, $next_char, $freq_file, $freq_depth, $freq_name, $fr
+eq, @freq);

# Get command line options.
my($input, $output, $depth, $verbosity, $save, $load, $start_text, $as
+cii, $examine, $no_display, $append) = get_options();

if($examine){
  $freq = Storable::retrieve("$examine") or die "Can't load frequency 
+analysis data from $examine: $!";
  ($freq_name, $freq_file, $freq_depth, %freq) = @$freq;
  print "\n\nFrequency file $examine created by $freq_name\n";
  print "Based on text from $freq_file\n";
  print "Depth = $freq_depth\n\n";
  exit;
}

##################################################################
# Put frequency data in %freq, either by analyzing an input file,
# or reading a previously saved frequency file.
##################################################################
if($input){
  print "\n\nOutput based on the file $input.  Depth = $depth.\n\n";
  %freq = create_freq_hash($input, $depth);
} elsif($load){
  eval{$freq = Storable::retrieve("$load") or die "Can't load frequenc
+y analysis data from $load: $!";};
  if($@){print "ERROR: $load does not appear to be a valid frequency f
+ile\n"; exit;}
  ($freq_name, $freq_file, $freq_depth, %freq) = @$freq;
  $depth = $freq_depth;
  if($freq_name !~ /ngram\.plx/){
    print "\n\nERROR: $load was not created by this script\n";
    exit;
  }
  print "\n\nOutput based on the file $freq_file.  Depth = $freq_depth
+.\n\n";
}

##############################################
# If -s, save %freq to a file.
##############################################
if($save){
  @freq = ($name, $input, $depth, %freq);
  nstore(\@freq, "$save") or die "Can't store frequency analysis data 
+to $save: $!";
}

########################################################
# If -a, store human readable version of frequency file
########################################################
if($ascii){
  my($key, $value, $k, $v);
  
  open (DBG, ">$ascii");
  while(($key, $value) = each %freq){
    print DBG "\n\n#$key#\n";
    while(($k, $v) = each %$value){
      print DBG "#$k $v#   ";
    }
  }
  close(DBG);
}

##################################
# Create and print the start text
##################################
$start_text = create_start_text() unless $start_text;
$ngram = substr($start_text, 1-$depth, $depth-1);

open(OFH, ">$output") or die "Can't open $output for writing: $!" if $
+output && ! $append;
open(OFH, ">>$output") or die "Can't open $output for appending: $!" i
+f $output && $append;
print "\n\n$start_text" unless $no_display;
print OFH "$start_text" if $output;

############################
# Let's create some text!
############################
for my $i (length($start_text)+1 .. $verbosity-1){
  $next_char = get_next_char($ngram);
  print "$next_char" unless $no_display;
  print OFH "$next_char" if $output;
  print OFH "\n" if($output && !($i % $wrap));
  $ngram = substr($ngram, 1) . $next_char;
}

print "\n\n" unless $no_display;
close(OFH);

######################################################################
+######################
#                                    SUBROUTINES                      
+                     #
######################################################################
+######################
######################################################################
+###########
#                          Get next character
#
#  Given a $depth-1 length character string, and the frequency hash (%
+freq), this
#  routine will pick the next character in the output.  It finds the v
+alue stored
#  in %freq that is keyed by $ngram.  This value is another hash that 
+has as its
#  keys all the possible letters that followed $ngram in the original 
+text.  The
#  value of each of these keys is the number of times that letter foll
+owed $ngram
#  in the original text.  One of these letters is picked at random, wi
+th a higher
#  probability of picking a letter that appeared more often in the ori
+ginal text.
#
#  There are several ways to do this (TMTOWTDI). This routine does it 
+as follows:
#  Let's say the ngram is 'ceed'; and that 'ceed' is followed by 'e' 2
+ times, 's'
#  1 time, and 'i' 5 times.  An array with 8 elements (2+1+5) will be 
+created.
#  2 elements will contain 'e', 1 element will contain 's', and 5 elem
+ents will
#  contain 'i'.  An element is picked at random to be returned.  Thus,
+ letters
#  that were more likely to follow the ngram in the original text will
+ be more
#  likely to follow in the created text.
######################################################################
+###########
sub get_next_char {
  my($ngram) = @_;
  
  my $ptr = 0;
  my(@ary, $sub_hash);
  
  if(defined $freq{$ngram}){
    $sub_hash = $freq{$ngram};
  } else {
    print "\n\nERROR:  The text you entered does not end with characte
+rs\n";
    print "that are in the frequency hash.\n\n";
    exit;
  }
  while(my($key, $value) = each %$sub_hash){
    for($ptr..$value+$ptr-1){
      $ary[$_] = $key;
    }
    $ptr += $value;
  }
  
  return $ary[rand($ptr)];
}

######################################################################
+###########
#                           Create start text
#
#  This routine will first attempt to create an array, @keys, consisti
+ng of all
#  the first level keys of %freq that start with an uppercase letter f
+ollowed 
#  by a lowercase letter.  This is so the output text will have a 'pro
+per' start.
#  If this array is empty, meaning there are no capitalized words in t
+he input
#  text, it will repopulate @keys with all the keys of %freq.  In eith
+er case,
#  it will select one of the keys at random, to be used as the startin
+g text
#  for the output.
######################################################################
+###########
sub create_start_text {
  my(@keys);
  
  for(keys %freq){
    push(@keys, $_) if /^[A-Z][a-z]/;
  }
  
  @keys = keys %freq unless @keys;
  return $keys[rand(@keys)];
}

######################################################################
+###########
#                           Create frequency hash                     
+           
#
#  This routine opens an input file, and puts the entire contents into
+ $text,
#  converting instances of more than one whitespace character in a row
+ into one
#  whitespace character.  It creates a hash with keys consisting of al
+l the
#  ngrams (of length $depth-1) in the input text.  The value will be a
+nother hash
#  with keys consisting of all the possible characters that follow tha
+t ngram in
#  the text.  The value of each of those keys will be the number of ti
+mes that
#  particular combination of ngram and following letter occur in the t
+ext.
######################################################################
+###########
sub create_freq_hash {
  my($input, $depth) = @_;
  my(@input, %freq, @text, $text, $ngram, $ptr, $next_char);
  
  # Make one long string, collapsing all multiple whitespace into one 
+space
  open (IFH, "$input") or die "Can't open $input for reading: $!";
  @text = <IFH>;
  $text = join('', @text);
  $text =~ s/\s+/ /g;
  close (IFH);
  
  $ngram = substr($text,0,$depth-1);
  $ptr = $depth-1;
  
  for(0..length($text)-$depth){
    $next_char = substr($text,$ptr,1);
    $freq{"$ngram"}->{"$next_char"}++;
    $ngram = substr($ngram,1) . $next_char;
    $ptr++;
  }
  
  return %freq;  
}

######################################################################
+###########
#                           Get command line switches
#
#  -i: Specifies the input file to run frequency analysis on. Mutually
+ exclusive
#      with -l, so if -l is also specified, it will quit with an error
+ message.
#
#  -o: Specifies an output file to write results to.  If switch is not
+ present
#      results will not be written to a file.  If -A is specified, res
+ults will
#      be appended to whatever the output file is.
#
#  -d: Specifies the depth of analysis.  In other words, if depth is '
+3', the
#      program will look at the last two letters output, and based on 
+the
#      frequency analysis for that digram, pick a letter to follow.
#
#  -v: Specifies the length of the output (verbosity).
#
#  -s: Specifies a file in which to store the frequency analysis.  Als
+o stored
#      in the file will be the name and version of this script, the na
+me of the
#      input file the analysis was based on, and the depth of the anal
+ysis.            
#      Mutually exclusive with -l, so if -l is also specified, it will
+ quit with
#      an error message.
#
#  -l: Specifies a frequency file to load and use.  Mutually exclusive
+ with -i
#      and -s, so if either -i or -s is specified, it will quit with a
+n error message.
#
#  -t: Specifies starting text for the output.  The script will start 
+with the
#      last d-1 characters of the text (where d is the depth), and do 
+its thing
#      from there.  If -t is not specified, the script will start with
+ a random
#      d-1 character sequence from the frequency hash.  If you specify
+ your own
#      text and it contains spaces, then the text must be enclosed in 
+"".
#
#  -a: Specifies the name of a file in which to store a human readable
+ version
#      of the frequency analysis.  This works just like the -s switch,
+ except
#      the frequency data is human readable.  An excerpt of a file fol
+lows:
#   
#                  #ki#
#                  #l 1#   #n 20#   
#
#                  #tr#
#                  #y 3#   #e 22#   #u 17#   #a 21#   #i 18#   #o 5#
#
#      This would mean that:
#
#      sequence   is followed by    this many times in the original te
+xt
#         ki            l                     1
#         ki            n                    20
#         tr            y                     3
#         tr            e                    22
#         tr            u                    17
#         tr            a                    21
#         tr            i                    18
#         tr            o                     5
#                   (for $64,000, Name That Text! ;)
#
#  -e: Specifies the name of a frequency file to examine.  It will cau
+se
#      the program to print out the following embedded information fro
+m the
#      frequency file: the name and version of this script, the name o
+f the input
#      file the frequency analysis is based on, and the depth used in 
+the
#      analysis.  If this switch is present, all other switches are ig
+nored.
#
#  -D: If specified, will not display output on monitor.
#
#  -A: Output file will be appended to, not overwritten.
######################################################################
+###########
sub get_options {
  my(%opts, $input, $output, $depth, $verbosity, $save, $load, $start_
+text, $ascii, $examine, $no_display, $append);
 
  getopts('i:o:d:v:s:l:t:a:e:DA', \%opts);
  
  #### check for mutually exclusive cases and lack of both -i and -l
  if(defined $opts{i} && defined $opts{l}){
    print "You may not specify both the -i and -l switches\n";
    exit
  }
  
  if(! defined $opts{i} && ! defined $opts{l}){
    print "You have to specify one of the -i or -l switches\n";
    exit
  }
  
  if(defined $opts{s} && defined $opts{l}){
    print "You may not specify both the -s and -l switches\n";
    exit;
  }

  ###########################  check for switches  ###################
  
  if(defined $opts{e}){
    $examine = $opts{e};
    if(! -e $examine){
      print "Frequency file $examine does not exist.\n";
      exit;
    }
    return ($input, $output, $depth, $verbosity, $save, $load, $start_
+text, $ascii, $examine, $no_display, $append);
  }
  
  $no_display = $opts{D} || 0;
  $append = $opts{A} || 0;
     
  if(defined $opts{i}){
    $input = $opts{i};
    do {print "Input file $input does not exist\n"; exit} unless -e $i
+nput;
    do {print "Input file $input does not appear to be a text file\n";
+ exit} unless -T "$input";
  }
  
  if(defined $opts{l}){
    $load = $opts{l};
    if(! -e $load){
      print "Frequency file $load does not exist.\n";
      exit;
    }
  }
  
  if(defined $opts{s}){
    $save = "$opts{s}";
    if(-e $save){
      print "$save already exists.  Overwrite? n\b";
      my $answer = <STDIN>;
      chomp($answer);
      $answer = lc($answer);
      exit unless $answer eq "y";
    }
  }

  if(defined $opts{o}){
    $output = $opts{o};
    if(-e $output && ! $append){
      print "$output already exists.  Overwrite? n\b";
      my $answer = <STDIN>;
      chomp($answer);
      $answer = lc($answer);
      exit unless $answer eq "y";
    }
  }
  
  if(defined $opts{d}){
    $depth = $opts{d};
    if($depth < 2 || $depth > 10){
      print "Depth must be from 2 to 10, inclusive\n";
      exit;
    }
  } else {
    $depth = 3;
  }
  
  if(defined $opts{v}){
    $verbosity = $opts{v};
    if($verbosity < 10 || $verbosity > 10000){
      print "Length must be from 10 to 10,000 characters, inclusive\n"
+;
      exit;
    }
  } else {
    $verbosity = 100;
  }

  if(defined $opts{t}){
    $start_text = $opts{t};
    if(length($start_text) < $depth){
      print "\n\nERROR: The length of the text you supplied with the -
+t option must be\n";
      print "at least $depth characters long, because that is what you
+ set depth to.\n\n";
      exit;
    }
  }
  
  if(defined $opts{a}){
    $ascii = "$opts{a}";
    if(-e $ascii){
      print "$ascii already exists.  Overwrite? n\b";
      my $answer = <STDIN>;
      chomp($answer);
      $answer = lc($answer);
      exit unless $answer eq "y";
    }
  }

  return ($input, $output, $depth, $verbosity, $save, $load, $start_te
+xt, $ascii, $examine, $no_display, $append);
}

Comment on ngram
Download Code
Replies are listed 'Best First'.
Re: ngram
by hossman (Prior) on Sep 19, 2004 at 01:00 UTC
      Thanks hossman! I'll check those out.

      TheEnigma

Re: ngram
by ambrus (Abbot) on Sep 19, 2004 at 09:10 UTC

    You might want to know that emacs has a markov-chaining module that can work in a character-by-character mode (and a word-by-word mode too). I think it's called dissociated press. (Google "dissociated press" emacs)

    (Sidenote: this is the kind of thing why I think that emacs is a noteworthy program although not as a text-editor.)

      Thanks! I've never really used emacs, I've heard it's a steep learning curve, but I should probably check it out.

      One nice thing about my program, I hope, is that it's standalone.

      TheEnigma

      You sometimes hear emacs weenies say "Sure, I like unix. It makes a nice program loader for emacs!"

      Of course, the same could be said for Perl.

      Update: I mean it could be said that unix makes a nice program loader for perl, not that perl makes a nice program loader for emacs. ;-)

Re: ngram
by planetscape (Chancellor) on Jun 07, 2005 at 05:04 UTC

Back to Code Catacombs

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: sourcecode [id://392067]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (19)
As of 2015-07-28 13:46 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (255 votes), past polls