Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

Re: Reduce RAM required

by bliako (Monsignor)
on Jan 08, 2019 at 11:26 UTC ( [id://1228201]=note: print w/replies, xml ) Need Help??


in reply to Reduce RAM required

I have exactly the same suggestion as hdb. To put it more formally, build the statistical distribution of your data and then sample it. If you need a more complex model which will also depend not only on frequency of single letters but also on frequency of pairs and triplets and so on, then build a multi-dimensional distribution or use a markov-chain. The latter is the tool being used for random text production which more or less does what you want: create new words from a corpus respecting their neighbouring properties as well as frequency of individual letters. So, it does one step more than what you do as it counts also the properties of neighbours. For example, with your method, "TTTTTTT" is as probable as "ATCGACACGT" in the shuffled sub-sequences, but in the real world the first one is a freak sequence and the second one just a normal looking one...

Comment on your code: I think reversing the array will not add anything to the shuffle (if shuffle() is good, which is good). Secondly, you can eliminate a lot of split()/join() (after the shuffle) and just use substr() to break the whole genome to sub-sequences. Thirldy, refactoring your code to use subs, as hippo suggested will be good because then you can assess the efficiency of each step (RAM/CPU-wise as well as statistically) and you can try different methods by writing a new sub and plugging it in.

Update: Eily also makes the same comment about statistics.

bw, bliako

Replies are listed 'Best First'.
Re^2: Reduce RAM required
by onlyIDleft (Scribe) on Jan 09, 2019 at 16:22 UTC

    Thanks for your response. I don't think the bioinformatics tool HMMER3 (making and using HMMs) can generate a whole shuffled genome sequence based on reading in its intact version.

    But if you know of any software package that can do this, that would be good to know

    About frequencies of 1 letters, 2 letters etc., REALLY good question - but I think I am happy with simple shuffling, without consideration for frequencies of 2 letters or higher order, because there is no theoretical or empirical indication that it would make a difference while assessing background signal from these shuffled sequences. But I might run that experiment in the future, for curiosity sake :)

      I have done an investigation of my own (on file ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/CHR_20/hs_ref_GRCh38.p12_chr20.fa.gz).

      Unless my method for calculating the frequencies is wrong or my script has a bug (quite likely actually), empirically at least, bases like to be next to some bases more than others, and the more they gather the more fussy they become. Almost a tenfold! Also note that this is for just 1 chromosome of HS. Maybe overall the differences even out which I don't believe.

      CG => 0.0121 TT => 0.0902 TAC => 0.0098 TGA => 0.0195

      Results when examining pairs of letters:

      Results when examining triplets of letters:

      I will post the script I used shortly on the Meditations section as it may be of more general use.

      bw, bliako

      @bliako - thank you for your investigation.

      I think I misspoke when I said "there is no theoretical or empirical indication that it would make a difference while assessing background signal from these shuffled sequences"

      What should I have instead said is that "periodicities and abundance biases of di-, tri-, or multi-mer sequences when completely destroyed are more likely to be truly random shuffles, rather than shuffles that retain frequencies of di-mer, or tri-mer etc. This is obvious from your own empirical observation from one human chromosome

      Anyways, I have an old script that uses shuffle from List::Util Perl module and I remember it never ran out of RAM, so I think I will use that. Though I would have preferred something that concats the entirety of the genome and then shuffles and fragments as per length distribution of chromosomes in input - but various Perl scripts that do this run out of RAM even on my cluster. Need to take a weekend break to figure out RAM usage and fix it.

        I would have preferred something that concats the entirety of the genome and then shuffles and fragments as per length distribution of chromosomes in input

        Why? Why not consume only as much data as required per current "length" (btw, are they really constant (1e6)?), and then dispose of data no longer needed as you move along? Reversing every second fragment before shuffling is kind of mad science... (well, in positive sense)

        Here's a take, that, while "concatenating in entirety", tries hard not to allocate any more memory. And even less if id_lines can be re-built during output, i.e. not stored. Though I didn't profile.

        The problem I found interesting for myself to look into was this: if there's huge chunk of bytes to, e.g., shuffle, why split or make copies or construct huge Perl lists etc. I hope code below shuffles "in place". C code is more or less ripped from PDL::API. It's puzzling to me why is it that "wrapping existing data into piddle in-place" exists (for 20+ years?) as somewhat obscure part of documentation and not implemented in pure Perl.

        The RNG ('taus') was chosen arbitrarily because of synopsis, there are plenty of others to choose from, more fun than simple "fragment reversing", so have a look :)

        use strict; use warnings; use PDL; use Inline with => 'PDL'; use PDL::GSL::RNG; my $genome = ''; my @id_lines; my @runs; ########################## # Read ########################## while ( <DATA> ) { chomp; if ( /^>/ ) { push @id_lines, $_ } else { push @runs, length; $genome .= $_ } } ########################## # Shuffle ########################## my $rng = PDL::GSL::RNG-> new( 'taus' ); $rng-> set_seed( time ); my $start = 0; my $stop = length( $genome ) - 1; my $window = 3; while ( $start < $stop ) { my $len = $start + $window > $stop ? $stop - $start : $window; my $p = mkpiddle( $genome, $start, $len ); $rng-> ran_shuffle( $p ); $start += $window } ########################## # Output ########################## $start = 0; for ( 0 .. $#runs ) { print $id_lines[ $_ ], "\n", substr( $genome, $start, $runs[ $_ ]), "\n"; $start += $runs[ $_ ] } ########################## # Guts ########################## use Inline C => <<'END_OF_C'; static void default_magic( pdl *p, int pa ) { p-> data = 0; } pdl* mkpiddle( char* data, int ofs, int len ) { PDL_Indx dims[] = { len }; pdl* npdl = PDL-> pdlnew(); PDL-> setdims( npdl, dims, 1 ); npdl-> datatype = PDL_B; npdl-> data = data + ofs; npdl-> state |= PDL_DONTTOUCHDATA | PDL_ALLOCATED; PDL-> add_deletedata_magic( npdl, default_magic, 0 ); return npdl; } END_OF_C __DATA__ >Chr1 CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT >Chr2 TATGACGTTTAGGGACGATCTTAATGACGTTTAGGGTTTTATCGATCAGCGACGTAGGGA >Chr3 GTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTT >Chr4 AACAAGGTACTCTCATCTCTTTACTGGGTAAATAACATATCAACTTGGACCTCATTCATA >Chr5 AACATGATTCACACCTTGATGATGTTTTTAGAGAGTTCTCGTGTGAGGCGATTCTTGAGG

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1228201]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (6)
As of 2024-04-23 11:43 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found