The recent node Reduce RAM required by onlyIDleft asks for efficient shuffling of a DNA sequence. As an answer, hdb suggested (Re: Reduce RAM required) building the probability distribution of the DNA sequence at hand (simply as a hash of the four DNA bases ATGC, each holding the count/probability of that base appearing). One can then ask the built distribution to output symbols according to their probability. The result will reflect the statistical properties of the input.
I added that a multi-dimensional probability distribution could create DNA sequences closer to the original because it would count the occurrence not only of single bases, ATGC, but also of pairs of bases AT,AG,AC,.., triplets, and so on. So I concocted this module, released here, which reads some input data consisting of symbols, optionally separated by a separator, and builds its probability distribution (and some other structures) for a specified number of dimensions, i.e. the ngram-length. That statistical data can then be used to predict output for a given seed.
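The counting step can be sketched in a few lines. This is a toy illustration (not the module's API, and the variable names are mine): count each ngram of length 2 in a short DNA string, then normalise the counts into a probability distribution.

```perl
use strict;
use warnings;

# Toy sketch: count ngram frequencies in a DNA string for
# ngram-length = 2, then normalise counts to probabilities.
my $dna = 'ATGCGATTAT';
my $n   = 2;

my %counts;
# every window of $n consecutive bases is one ngram
$counts{ substr($dna, $_, $n) }++ for 0 .. length($dna) - $n;

my $total = 0;
$total += $_ for values %counts;

# %dist is the normalised frequency of each ngram
my %dist = map { $_ => $counts{$_} / $total } keys %counts;
```

For the string above, `AT` occurs 3 times out of 9 windows, so its probability comes out as 1/3; the same idea extends to any ngram-length.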
For example, for ndim = 3 the cumulative distribution will look like this:
CT => ["A", 0.7, "G", 1], GA => ["T", 0.8, "C", 1], GC => ["A", 0.2, "T", 1], GT => ["A", 1],
meaning that CT is followed by A 7/10 of the time, whereas the only other alternative, CT followed by G, appears 3/10 of the time. The above structure (and especially the fact that it holds cumulative probabilities) makes it quite easy to return A or G weighted on their probabilities, by just checking whether rand() falls below or above 0.7. And so I learned that the likelihood of certain sequences of bases is much larger, even an order of magnitude, than others. For example:
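The weighted pick over one of those flat (symbol, cumulative-probability) arrays can be sketched like this (the sub name is mine, not the module's):

```perl
use strict;
use warnings;

# Walk the flat (symbol, cumulative-prob) array, e.g. ["A", 0.7, "G", 1],
# and return the first symbol whose cumulative probability exceeds rand().
sub weighted_pick {
    my ($cum) = @_;
    my $r = rand();
    for (my $i = 1; $i < @$cum; $i += 2) {
        return $cum->[$i - 1] if $r < $cum->[$i];
    }
    return $cum->[-2];  # guard; the last cumulative entry is forced to 1.0
}
```

With `["A", 0.7, "G", 1]` this yields A roughly 7 times in 10, exactly the behaviour described above.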
AAA => 0.0335, CCC => 0.0158, GGG => 0.0158, TTT => 0.0350, GCG => 0.0030, ...
Another feature of the module is that one can use a starting seed, e.g. AT, and ask the system to predict which base follows according to the statistical data already built. And so a DNA sequence with statistical properties similar to the original's can be built.
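The generation loop is just a sliding window: predict the next symbol from the last (n-1) symbols, append it, slide forward. A minimal sketch (the `%cum` data and names here are made up for illustration; each chain has a single continuation so the output is deterministic):

```perl
use strict;
use warnings;

# Illustrative cum-twisted-dist for ngram-length = 3: each 2-base key
# maps to a flat (symbol, cumulative-prob) array. Single entries with
# cumulative prob 1 make this toy chain deterministic.
my %cum = (
    'AT' => ['G', 1],
    'TG' => ['C', 1],
    'GC' => ['A', 1],
    'CA' => ['T', 1],
);

my $seed = 'AT';
my $out  = $seed;
for (1 .. 6) {
    my $c = $cum{$seed} or last;   # unseen window: stop
    my $r = rand();
    my $next;
    for (my $i = 1; $i < @$c; $i += 2) {
        if ($r < $c->[$i]) { $next = $c->[$i - 1]; last }
    }
    last unless defined $next;
    $out .= $next;
    $seed = substr($out, -2);      # slide the 2-base window forward
}
```

After six steps the toy chain above spells out `ATGCATGC`; with real learned probabilities each step would instead be a weighted random draw.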
The module I am describing can also read any other type of sequence, not just DNA sequences (fasta files). Text, for example. The module then becomes a random text generator emulating the style (i.e. the statistical distribution in n dimensions) of a specific corpus or literary opus.
The internal data structure used is a hash of hashes, which even for 4 dimensions and literature is kept reasonably small, because it is usually very sparse. For example, the 3-dimensional statistical distribution of Shelley's Frankenstein can be serialised to a 1.5MB file. So huge data can be compressed to the multi-dimensional probability distribution I am describing, if all one wants is to keep creating clones of that data with respect to the statistical properties of the original (and not an exact replica of the original).
Of course finite input data may not encompass all the details of the process which produced it, and such a probability distribution, even over n dimensions, may prove insufficient for emulating the process.
There are a few modules on CPAN already for doing something similar, but I wanted to be able to read huge datasets without resorting to intermediate arrays, for obvious reasons. And I wanted to have access to the internal data representing the probability distribution of the data.
Also, I wanted to read the data once, build the statistical distribution and save it, serialised to a file. Then I could make as many predictions as I wanted without re-reading huge data files.
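The save-once/predict-many workflow relies on Storable, the same mechanism the module's save_state()/load_state() use. A small self-contained sketch (the `$state` hashref here just stands in for what learn() would return, and the temp-file handling is mine):

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

# A stand-in for learn()'s return value
my $state = { 'N' => 3, 'counts' => { 'ATG' => 5 } };

# serialise once ...
my (undef, $tmpfile) = tempfile(UNLINK => 0);
store($state, $tmpfile) or die "store failed: $!";

# ... restore as many times as needed, no re-reading of the corpus
my $restored = retrieve($tmpfile);
unlink $tmpfile;
```

Storable round-trips nested hashrefs intact, which is exactly what the n-dimensional distribution is.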
Lastly, I wanted to implement an efficient-to-store n-dimensional histogram, so to speak, using the very simple method of a hash-of-hashes, with the twist that one can also interrogate the HoH by asking: what follows the phrase "I took my"?
package Markov::Ndimensional;

# by bliako
# for https://perlmonks.org/?node_id=1228191
# 20/01/2019

use strict;
use warnings;

our $VERSION = 0.1;

use Exporter qw/import/;
our @EXPORT = qw/learn predict range save_state load_state/;

use Storable;
use Data::Dump;

# * reads in some data (from file or string),
# * splits it to symbols according to separator
#   or it works character-by-character if the separator is left undef,
# * it eventually will have symbols.
# These symbols can be words in text (if separator = word boundary='\b')
# Or they can be letters in text (if separator is left undefined)
# (in this case you must arrange that the input has each word in a line on its own)
# Symbols are then put together to form "ngrams" and the frequency of
# each ngram is counted.
# Input parameter 'ngram-length' refers to the number of symbols in an ngram.
# For example, with an ngram-length=2 and a word-boundary separator we get
#   'hello','there' => 5
#   'hello','you' => 10
#   'goodbye','all' => 5
# etc. The numbers are the counts for each ngram as read from the input data.
# Input parameter 'counts' can be used to supply such frequency data
# without re-reading a huge corpus each time. The param must be a hashref
# where key=ngrams (separated by anything you like, no quotes necessary
# but use some separator in order not to confuse composite words).
# With that frequency data, we proceed to calculate a probability
# distribution ('dist') which is just normalised frequencies.
# Then we build the cumulative probability distribution but with a twist:
# the ngrams are broken into (n-1)grams and 1grams (from left to right for latin).
# The first ngram is the key. For all possible endings of that (i.e.
# all the 1grams with which it ends) we count frequencies, normalise to
# probabilities and then build the cumulative probability.
# For example consider the following 'counts'
# (let's assume we are dealing with letters, ngram-length=3):
#   HEL => 10
#   HEA => 20
#   HEN => 5
#   ALA => 10
#   ALB => 20
# The cumulative-probability-distribution will be something like this:
#   HE => {
#     L => 10/(10+20+5)
#     A => 10/(10+20+5) + 20/(10+20+5)
#     N => 10/(10+20+5) + 20/(10+20+5) + 5/(10+20+5)
#   }
#   AL # left as an exercise for the reader
# The sub returns a hashref made of these hashrefs:
#   'counts' => frequency(=counts) of each ngram as it occurred in the input data
#   'dist' => probability distribution (i.e. normalised frequency/counts) of the above
#   'twisted-dist' => as explained before.
#   'cum-twisted-dist' => as explained before.
# Refer to the 'predict()' sub for how the 'cum-twisted-dist' hash is used to make a prediction.
sub learn {
	my $params = $_[0] // {};

	my $parent = ( caller(1) )[3] || "N/A";
	my $whoami = ( caller(0) )[3];

	my $ngram_length = $params->{'ngram-length'} // 1;
	# optional separator, can be left undef in which case we work letter-by-letter
	# if defined it has the form of a regex, e.g. '.' or '\b'
	my $separator = $params->{'separator'};
	my $internal_separator = defined $separator
		? $params->{'internal-separator'} // '\|/'
		: '';
	# optional debug parameter
	my $DEBUG = $params->{'debug'} // 0;
	# optionally specify here 'counts' or 'twisted-dist' in order to return them
	my $need = $params->{'need-to-return'} // {};
	# optionally specify here what not to return
	my $avoid = $params->{'avoid-to-return'} // {};

	# counts the occurrence of each combination of symbols of length 'ngram_length'
	# if specified in params, data will be appended and distribution
	# re-calculated with old and new data:
	my $counts = undef;
	my $remove_these_characters_from_input = $params->{'remove-these-characters'};

	# TODO: separator for symbols
	if( defined($counts=$params->{'counts'}) ){
		my ($anykey) = keys %$counts;
		my $le = length($anykey);
		if( $le != $ngram_length ){ print STDERR "$whoami (via $parent) : input counts data was for ngrams-length of $le and not $ngram_length which you requested.\n"; return undef }
	} else { $counts = {} } # <<< fresh counts

	my ($infile, $instring, $howmany, $m);
	my $remove_chars_regex = defined($remove_these_characters_from_input)
		? qr/${remove_these_characters_from_input}/ : undef;

	if( defined($infile=$params->{'input-filename'}) ){
		$howmany = 0;
		$counts = {};
		my $FH;
		if( ! open $FH, '<', $infile ){ print STDERR "$whoami (via $parent) : could not open file '$infile' for reading, $!"; return undef }
		while( my $line = <$FH> ){
			if( defined $remove_chars_regex ){ $line =~ s/${remove_chars_regex}/ /g; }
			$howmany += _process_line(\$line, $counts, $ngram_length, $separator, $internal_separator)
		}
		close($FH);
		if( $DEBUG ){ print "$whoami (via $parent) : file read OK '$infile': $howmany items read.\n" }
	} elsif( defined($instring=$params->{'input-string'}) ){
		$howmany = 0;
		$counts = {};
		foreach my $line (split /^/, $instring){
			$howmany += _process_line(\$line, $counts, $ngram_length, $separator, $internal_separator)
		}
		if( $DEBUG ){ print "$whoami (via $parent) : string read OK: $howmany items read.\n" }
	} elsif( defined($m=$params->{'input-array-ngrams'}) ){
		$counts = {};
		$howmany = scalar(@$m) - $ngram_length + 1;
		my ($ngram, $i);
		for($i=0;$i<$howmany;$i++){
			# slice must cover exactly ngram_length symbols
			$ngram = join($internal_separator, @{$m}[$i..($i+$ngram_length-1)]);
			$counts->{$ngram}++;
		}
		if( $DEBUG ){ print "$whoami (via $parent) : array-ngrams read OK: $howmany items read.\n" }
	} elsif( defined($m=$params->{'input-array-symbols'}) ){
		$counts = {};
		$howmany = scalar(@$m);
		my $i;
		for($i=0;$i<$howmany;$i++){ $counts->{$m->[$i]}++; }
		if( $DEBUG ){ print "$whoami (via $parent) : array-symbols read OK: $howmany items read.\n" }
	} elsif( ! defined($counts=$params->{'counts'}) ){ print STDERR "$whoami (via $parent) : either 'input-filename', 'input-string' or 'input-array-symbols' or 'input-array-ngrams' or 'counts' (which is a hashref of symbol counts) must be specified.\n"; return undef }

	# create normalised counts = probability distribution (into %dist)
	my %dist = %$counts;
	my $sum = 0;
	$sum += $_ for values %dist;
	$dist{$_} /= $sum for keys %dist;
	if( $DEBUG ){ print "$whoami (via $parent) : normalised counts OK:\n".Data::Dump::dump(%dist)."\n" }

	my ($asy, $lasy, $c, $i, $l, $arr);
	# make a cumulative distribution but break it into 2 parts:
	# the last part is the last symbol and the first part are the rest.
	# We do a lookup by using the first part so that we have all the available
	# choices for what may follow that first part as cum-prob-dist
	my %twisted_dist = ();
	foreach $asy (keys %$counts){
		$c = $counts->{$asy};
		# internal_separator must be escaped (\Q)...
		my @lasys = split /\Q${internal_separator}/, $asy;
		$lasy = pop @lasys;
		$asy = join $internal_separator, @lasys;
		# ngram-length == 1 then ...
		if( $asy eq '' ){ $asy = '<empty>' }
		if( ! exists $twisted_dist{$asy} ){ $twisted_dist{$asy} = [] }
		push @{$twisted_dist{$asy}}, ($lasy, $c);
	}
	# normalise it, remember: it's an array of 'symbol','count','symbol','count', ...
	foreach $asy (keys %twisted_dist){
		$arr = $twisted_dist{$asy};
		$l = scalar(@$arr);
		$sum = 0;
		for($i=1;$i<$l;$i+=2){ $sum += $arr->[$i] }
		for($i=1;$i<$l;$i+=2){ $arr->[$i] /= $sum }
	}
	if( $DEBUG ){ print "$whoami (via $parent) : made twisted distribution OK:\n".Data::Dump::dump(\%twisted_dist)."\n" }

	my $cum_twisted_dist = undef;
	if( $need->{'all'} or $need->{'twisted-dist'} ){
		# a copy is needed because the twisted-dist is also returned
		$cum_twisted_dist = {};
		foreach $asy (keys %twisted_dist){ $cum_twisted_dist->{$asy} = [@{$twisted_dist{$asy}}] }
	} else { $cum_twisted_dist = \%twisted_dist }

	# do the cumulative thing:
	foreach $asy (keys %$cum_twisted_dist){
		$arr = $cum_twisted_dist->{$asy};
		$l = scalar(@$arr);
		for($i=3;$i<$l;$i+=2){ $arr->[$i] += $arr->[$i-2] }
		# make last item 1 because due to rounding errors it might be .9999
		$arr->[-1] = 1.0;
	}
	if( $DEBUG ){ print "$whoami (via $parent) : made cumulative twisted distribution OK:\n".Data::Dump::dump($cum_twisted_dist)."\n" }

	my %ret = (
		'cum-twisted-dist' => $cum_twisted_dist,
		'dist' => \%dist,
		'N' => $ngram_length
	);
	if( $need->{'all'} or $need->{'twisted-dist'} ){ $ret{'twisted-dist'} = \%twisted_dist }
	if( ! $avoid->{'counts'} ){ $ret{'counts'} = $counts }
	return \%ret
}

# return the next in sequence weighted on probabilities
# and given the previous symbols (inp)
# 'dat' must be built using learn(), it is the 'cum-twisted-dist' entry (in the returned hash)
# returns undef on failure.
#
# The input is a 'cum-twisted-dist' hashref (as described in 'learn()')
# Assuming that we processed our data with ngram-length=3,
# then this function will take as input the first 2 symbols
# and will predict the 3rd based on the probability distribution
# built from the input data during the 'learn()' stage.
# For example, if built with ngram-length=3 the distribution of
# triplets of bases of the human genome (A,T,C,G), then
# during the 'learn()' stage we would have built the following data
#   'counts', e.g. ATC => 1000
#   'dist', e.g. ATC => 1000/(total number of triplets) # << the normalised frequency
#   'cum-twisted-dist', with a twist, e.g.
#     AT => {C=>0.2, G=>0.3, A=>0.8, T=>1.0} # << cumulative probabilities, always sum to 1
# This last 'cum-twisted-dist' is our 'dat' input parameter.
# Then we can ask the system to predict the third base
# given the first two bases, 'AT'.
# The prediction will be random but will be weighted on the cum-twisted-dist.
# In our example, predict(AT) will yield an A most of the times, followed by
# equally-probable C and T and then G (1 in 10).
sub predict {
	my $dat = $_[0]; # the hashref with the cum-prob distribution ('cum-twisted-dist')
	my $inp = $_[1]; # must be a string containing symbols separated by the internal separator, e.g. ATGC or I|want|to
	my $c = $dat->{$inp};
	if( ! defined $c ){ return undef }
	my $l = scalar(@$c);
	my $r = rand;
	my $i;
	for($i=1;$i<$l;$i+=2){
		if( $r < $c->[$i] ){ return $c->[$i-1] }
	}
	return undef
}
sub save_state {
	my $dat = $_[0];
	my $outfile = $_[1];
	if( ! Storable::store($dat, $outfile) ){ print STDERR 'save_state()'.": call to ".'Storable::store()'." has failed for output file '$outfile'.\n"; return 0 }
	return 1
}
sub load_state {
	my $infile = $_[0];
	my $ret = Storable::retrieve($infile);
	if( ! defined($ret) ){ print STDERR 'load_state()'.": call to ".'Storable::retrieve()'." has failed for input file '$infile'.\n"; return 0 }
	return $ret
}
# Same as 'predict()' but it does not make a prediction.
# It returns all the possible options the (n-1)grams input
# has as a hashref of cumulative probabilities.
# Will return undef if input has never been seen and does
# not exist in 'dat' (which is a 'cum-twisted-dist', see 'learn()')
sub range {
	#my $dat = $_[0]; # the hashref with the cum-prob distribution ('cum-twisted-dist')
	#my $inp = $_[1]; # must be a string of length = ngram_length
	#return $dat->{$inp};
	return $_[0]->{$_[1]}
}

# private subs

# processes a line of text which consists of ngram_length words separated by optional separator.
# By specifying a separator we can process sentences where words are our basic symbols. In this
# way we can have arbitrary-length symbols.
# On the other hand, the separator can be left undef. In this case we treat a line character-by-character
# and each symbol consists of exactly ngram_length letters.
# It returns the number of symbols found in this line (>=0)
# It assumes that a line contains only complete words, no half word can end the line.
sub _process_line {
	my $line = ${$_[0]}; # expecting a scalar ref to the line to process
	my $output_ref = $_[1];
	my $ngram_length = $_[2];
	my $sep = $_[3]; # optional, can be left undefined for no separator and work letter-by-letter
	my $intsep = $_[4]; # optional and only if sep is defined, it sets the internal separator between symbols

	$line =~ s/>.*$//;
	$line =~ s/\s+/ /g;
	$line =~ s/\s+$//;
	return 0 if $line =~ /^\s*$/; # nothing to do

	my ($howmany);
	if( defined $sep ){
		# we have a separator, so we need to regex
		$howmany = 0;
		my $temp;
		while( $line =~ /(((.+?)(?=${sep}+|$)){${ngram_length}})/g ){
			$temp = $1;
			$temp =~ s/^${sep}+//;
			$temp =~ s/${sep}+/${intsep}/g;
			$output_ref->{$temp}++;
			$howmany++;
		}
	} else {
		$howmany = length($line)-$ngram_length+1;
		for(my $from=$howmany;$from-->0;){
			$output_ref->{substr $line, $from, $ngram_length}++;
		}
	}
	return $howmany;
}
1;
And here are three scripts for reading DNA or text sequences and doing a prediction. Available options can be inferred by looking at the scripts or at the examples further down.
analyse_DNA_sequence.pl:
#!/usr/bin/env perl

# FILE: analyse_DNA_sequence.pl
# by bliako

use strict;
use warnings;

use Getopt::Long;
use Data::Dump qw/dump/;

use Markov::Ndimensional;

my $input_fasta_filename = undef;
my $input_state_filename = undef;
my $output_state_filename = undef;
my $output_stats_filename = undef;
my $separator = undef;
my $ngram_length = -1;

if( ! Getopt::Long::GetOptions(
	'input-fasta=s' => \$input_fasta_filename,
	'input-state=s' => \$input_state_filename,
	'output-state=s' => \$output_state_filename,
	'output-stats=s' => \$output_stats_filename,
	'ngram-length=i' => \$ngram_length,
	'separator=s' => \$separator,
	'help|h' => sub { print STDERR usage($0); exit(0) }
) ){ print STDERR usage($0) . "\n\nSomething wrong with command-line parameters...\n"; exit(1); }

if( $ngram_length <= 0 ){ print STDERR "$0 : ngram-length must be a positive integer.\n"; exit(1) }

my %params = ();
# these are the parameter names learn() expects:
if( defined($output_state_filename) ){ $params{'need-to-return'} = {'all'=>1} }
else { $params{'avoid-to-return'} = {'counts'=>1} }

my $state = undef;
if( defined($input_state_filename) ){
	$state = load_state($input_state_filename);
	if( ! defined($state) ){ print STDERR "$0 : call to ".'load_state()'." has failed.\n"; exit(1) }
	$params{'counts'} = $state->{'counts'};
}
if( defined($input_fasta_filename) ){
	$state = learn({
		%params,
		'ngram-length' => $ngram_length,
		'separator' => $separator,
		'input-filename' => $input_fasta_filename,
	});
	if( ! defined($state) ){ print STDERR "$0 : call to ".'learn()'." has failed.\n"; exit(1) }
}
if( ! defined($state) ){ print STDERR "$0 : --input-state and/or --input-fasta must be specified.\n"; exit(1) }

if( defined($output_state_filename) ){
	if( ! save_state($state, $output_state_filename) ){ print STDERR "$0 : call to ".'save_state()'." has failed.\n"; exit(1) }
}
if( defined($output_stats_filename) ){
	# dump the stats to the specified file instead of stdout
	open my $STATS, '>', $output_stats_filename or die "$0 : could not open '$output_stats_filename' for writing, $!";
	print $STATS Data::Dump::dump($state);
	close $STATS;
} else { print Data::Dump::dump($state); }

print "\n$0 : done.\n";
exit(0);

sub usage { return "Usage : $0 <options>\n"; }
1;
analyse_text.pl:
#!/usr/bin/env perl

# FILE: analyse_text.pl
# by bliako

use strict;
use warnings;

use Getopt::Long;
use Data::Dump qw/dump/;

use Markov::Ndimensional;

my $input_corpus_filename = undef;
my $input_state_filename = undef;
my $output_state_filename = undef;
my $output_stats_filename = undef;
my $separator = '\s';
my $internal_separator = '|';
my $ngram_length = -1;

if( ! Getopt::Long::GetOptions(
	'input-corpus=s' => \$input_corpus_filename,
	'input-state=s' => \$input_state_filename,
	'output-state=s' => \$output_state_filename,
	'output-stats=s' => \$output_stats_filename,
	'ngram-length=i' => \$ngram_length,
	'separator=s' => \$separator,
	'help|h' => sub { print STDERR usage($0); exit(0) }
) ){ print STDERR usage($0) . "\n\nSomething wrong with command-line parameters...\n"; exit(1); }

if( $ngram_length <= 0 ){ print STDERR "$0 : ngram-length must be a positive integer.\n"; exit(1) }

my %params = ();
# these are the parameter names learn() expects:
if( defined($output_state_filename) ){ $params{'need-to-return'} = {'all'=>1} }
else { $params{'avoid-to-return'} = {'counts'=>1} }

my $state = undef;
if( defined($input_state_filename) ){
	$state = load_state($input_state_filename);
	if( ! defined($state) ){ print STDERR "$0 : call to ".'load_state()'." has failed.\n"; exit(1) }
	$params{'counts'} = $state->{'counts'};
}
if( defined($input_corpus_filename) ){
	$state = learn({
		%params,
		'ngram-length' => $ngram_length,
		'separator' => $separator,
		'internal-separator' => $internal_separator,
		'remove-these-characters' => '[^a-zA-Z]',
		'input-filename' => $input_corpus_filename,
	});
	if( ! defined($state) ){ print STDERR "$0 : call to ".'learn()'." has failed.\n"; exit(1) }
}
if( ! defined($state) ){ print STDERR "$0 : --input-state and/or --input-corpus must be specified.\n"; exit(1) }

if( defined($output_state_filename) ){
	if( ! save_state($state, $output_state_filename) ){ print STDERR "$0 : call to ".'save_state()'." has failed.\n"; exit(1) }
}
if( defined($output_stats_filename) ){
	# dump the stats to the specified file instead of stdout
	open my $STATS, '>', $output_stats_filename or die "$0 : could not open '$output_stats_filename' for writing, $!";
	print $STATS Data::Dump::dump($state);
	close $STATS;
} else { print Data::Dump::dump($state); }

print "\n$0 : done.\n";
exit(0);

sub usage { return "Usage : $0 <options>\n"; }
1;
predict_text.pl:
#!/usr/bin/env perl

# FILE: predict_text.pl
# by bliako

use strict;
use warnings;

use Getopt::Long;
use Data::Dump qw/dump/;

use Markov::Ndimensional;

my $input_state_filename = undef;
my $separator = '\s';
my $internal_separator = '|';
my $seed = undef;
my $num_iterations = 100;

if( ! Getopt::Long::GetOptions(
	'input-state=s' => \$input_state_filename,
	'separator=s' => \$separator,
	'num-iterations=i' => \$num_iterations,
	'seed=s' => \$seed,
	'help|h' => sub { print STDERR usage($0); exit(0) }
) ){ print STDERR usage($0) . "\n\nSomething wrong with command-line parameters...\n"; exit(1); }

if( ! defined($input_state_filename) ){ print STDERR "$0 : --input-state must be used to specify a state file to read. In order to produce a state use analyse_text.pl\n"; exit(1); }

my $state = load_state($input_state_filename);
if( ! defined($state) ){ print STDERR "$0 : call to ".'load_state()'." has failed.\n"; exit(1) }
my $ngram_length = $state->{'N'};
print "$0 : read state from '$input_state_filename', ngram-length is $ngram_length.\n";

if( defined $seed ){
	my @crap = split /${separator}/, $seed;
	if( scalar(@crap) != ($ngram_length-1) ){ print STDERR "$0 : the ngram-length from state file '$input_state_filename' is $ngram_length but the provided seed assumes length ".(scalar(@crap)+1)."\n"; exit(1) }
	$seed = join $internal_separator, @crap;
} else { $seed = (keys %{$state->{'cum-twisted-dist'}})[0] }
print "$0 : starting with seed '$seed' ...\n";

my $w = $state->{'cum-twisted-dist'};
print $seed;
for(1..$num_iterations){
	my $ret = predict($w, $seed);
	if( ! defined($ret) ){ last }
	print " $ret";
	# slide the window: drop the first symbol, append the prediction.
	# The state's keys use the internal separator, so split/join on that:
	my @y = split /\Q${internal_separator}/, $seed;
	shift @y;
	$seed = join $internal_separator, @y, $ret;
}
print "\n$0 : done.\n";
exit(0);

sub usage { return "Usage : $0 <options>\n"; }
1;
Here is some usage:
$ wget ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/CHR_20/hs_ref_GRCh38.p12_chr20.fa.gz # warning ~100MB
# this will build the 3-dim probability distribution of the input DNA seq and serialise it to the state file
$ analyse_DNA_sequence.pl --input-fasta hs_ref_GRCh38.p12_chr20.fa --ngram-length 3 --output-state hs_ref_GRCh38.p12_chr20.fa.3.state --output-stats stats.txt
# now work with some text, e.g. http://www.gutenberg.org/files/84/84-0.txt (easy on the gutenberg servers!!!)
$ analyse_text.pl --input-corpus ShelleyFrankenstein.txt --ngram-length 2 --output-state shelley.state
$ predict_text.pl --input-state shelley.state
I am looking for comments/feature-requests before publishing this. Once it is published I will replace the code in this node with links. Please let me know asap if I am abusing resources by posting this code here.
bw, bliako