Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Loading only portions of a file at any given time

by TJCooper (Beadle)
on Nov 30, 2017 at 13:23 UTC ( #1204589=perlquestion: print w/replies, xml ) Need Help??
TJCooper has asked for the wisdom of the Perl Monks concerning the following question:

I currently load a FASTA formatted file, which contains DNA sequence separated by headers denoted by '>' i.e.:

>header1 AGACATGATCGACTGGACACAATTTACGTAGCTG >header2 AAATACTAGGGCAACACACACACACACCACACAC >header3 AGTTAGTCCAGTAGTTTACAGTACGACTGATCGT

via bioPerl (Bio::SeqIO), and store the entire FASTA file into memory as a hash:

my $fasta = $ARGV[0]; my $seqio = Bio::SeqIO->new(-file => $fasta); while(my $seqobj = $seqio->next_seq) { my $id = $seqobj->display_id; my $seq = $seqobj->seq; $sequences{$id} = $seq; } open my $IN2, '<', $ARGV[1] or die "$!"; while (<$IN2>) { # Subsample the FASTA sequences }

Or I could code it manually (which is likely quicker given the amount of overhead in Bio::SeqIO).

Based on a second input file, I then subsample the FASTA sequences and do some processing.

Each line of $IN2 includes the header from its associated FASTA sequence.

This works fine, and on modern computers the memory usage isn't too much of a problem even with larger genomes (Human etc.). Nevertheless, i'd like to take advantage of the fact that my second input file ($IN2) is always sorted by header and thus, for example, the first 5000 lines of it only need the FASTA sequence from >header1. This could potentially save me having to load the entire FASTA file into memory and instead only load the sequence that is currently being used.

I'm however having a tough time trying to figure out how to actually do this so any suggestions or code would be greatly appreciated.

Replies are listed 'Best First'.
Re: Loading only portions of a file at any given time
by 1nickt (Monsignor) on Nov 30, 2017 at 13:38 UTC

    Hi, if you won't know which header section you need until you start parsing the second file, I'd think you'd have a hard time doing what you want efficiently on the fly. Can you simply break your FASTA file into separate files in advance, one per header?


    The way forward always starts with a minimal test.
      Indeed, I found that to be the main issue when trying to think of a way to do this. I could definitely split the FASTA file beforehand, although I try to minimise the amount of steps required to run a script hence the question! Thanks for your input.
Re: Loading only portions of a file at any given time
by kcott (Chancellor) on Dec 01, 2017 at 05:54 UTC

    G'day TJCooper,

    I've provided an example script below which shows some techniques that you might use for this. I've assumed your IN2 file just contains headers (your description wasn't 100% clear on that). If you're going to be reusing IN2, you may want to consider running the first part of the script separately, and storing the @all_order data: a database, Storable, or other serialisation methods are all possible candidates (depends on your requirements and what you have available).

    #!/usr/bin/env perl use strict; use warnings; use autodie; use Data::Dump; my @all_order; { open my $in2_fh, '<', 'pm_1204589_IN2'; while (<$in2_fh>) { chomp; push @all_order, $_; } } for my $range ([0..0], [0..2], [1..3], [3..3]) { my %wanted = map { $_ => 1 } @all_order[@$range]; { open my $in1_fh, '<', 'pm_1204589_IN1'; local $/ = "\n>"; while (<$in1_fh>) { chomp; substr $_, 0, 1, '' if $. == 1; my ($head, $seq) = split; $wanted{$head} = $seq if $wanted{$head}; } } print "Range: @$range\n"; dd \%wanted; }

    With these dummy files (yes, I know, the sequences are totally bogus — just used for demonstration purposes):

    $ cat pm_1204589_IN1 >Z ABC >Y DEF >X GHI >W JKL
    $ cat pm_1204589_IN2 W X Y Z

    I get this output:

    Range: 0 { W => "JKL" } Range: 0 1 2 { W => "JKL", X => "GHI", Y => "DEF" } Range: 1 2 3 { X => "GHI", Y => "DEF", Z => "ABC" } Range: 3 { Z => "ABC" }

    See also: Data::Dump.

    — Ken

Re: Loading only portions of a file at any given time
by duff (Parson) on Nov 30, 2017 at 15:31 UTC

    What TKCooper said, but ... you could also do something more complex like build an "index" and store it in another file and then use that index to determine which parts to load when. But, the cost of that method (in time and complexity and developing the thing that reads based on index and in maintenance) may out-weigh the benefit you gain. Though, if this is a commonish thing to want, maybe everyone who uses FASTA could benefit from a module that does this indexing? I dunno.

      Actually, a compromise suggestion might not be so bad ... Since the file is sorted, one would only need to read the file sequentially once, and note the binary seek() position where each sequence begins. (It ends where the next one begins.) You might even be able to keep that data in memory. Or, write it to a separate file that each script loads. Each script could now seek directly to the right spot, read the specified number of bytes, and do a couple of quick debugging-checks to make sure that what it just read looks okay. (i.e. that the index-file is not out of date.) Simple and easy to do, and it just might save a lot of time. (If the region is very-large, consider also File::Map.)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1204589]
Approved by 1nickt
Front-paged by 1nickt
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (9)
As of 2018-07-20 20:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    It has been suggested to rename Perl 6 in order to boost its marketing potential. Which name would you prefer?















    Results (441 votes). Check out past polls.

    Notices?