Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

Re: Searching array against hash

by abualiga (Scribe)
on Aug 22, 2013 at 01:07 UTC ( #1050435=note: print w/ replies, xml ) Need Help??


in reply to Searching array against hash

Hi, out of curiosity, as I work with fasta files a bit, why not use the Bio::DB::Fasta module from BioPerl for indexed access to any number of fasta files. Your example is precisely what this module is for, unless you're exercising OO programming. I used your sample data for the code below, if interested.

update: Initially, creating the db may take a moment depending on size of input files, but it works pretty quick every time it's accessed after that.

#!/usr/bin/perl use Modern::Perl '2011'; use autodie; use Bio::DB::Fasta; die "Usage: $0 <fastaFile> <queryFile>\n" unless my( $fastaFile, $quer +yFile ) = @ARGV; open my $fh, '<', $queryFile; chomp( my @queries = <$fh> ); close $fh; # trim header @queries = map{ local $_ = $_; s/^>(\S+)\s+\d+/$1/; $_ } @queries; # create indexed database from fasta file my $db = Bio::DB::Fasta->new( $fastaFile ); open my $fh2, '>', 'querySeqs.fa'; for my $query ( @queries ) { my $seqheader = $db->header( $query ); my $seqdna = $db->seq( $seqheader ); say {$fh2} ">$seqheader"; print {$fh2} format_sequence( $seqdna, 60 ); } close $fh2; sub format_sequence { my($sequence, $length) = @_; # sequence and length of line open my $string_fh, '>', \ my $string; # print sequence to string for( my $pos = 0; $pos < length( $sequence ); $pos += $length ) { print {$string_fh} substr( $sequence, $pos, $length ), "\n"; } $string; } output in file querySeqs.fa >Chr1.ID=PAC:19650373;Name=AT1G11900.1;pacid=19650373;longest=1;Parent +=AT1G11900.len361.i1.1_111 GTATGAATTCCAAAAATCCAGAACCGTTTTCGTGATTCATGTTATGCTCTCGTTGTTGTT TTCTGATTGTTACTGCTCAGCGAGTTTCTTCTATCAATGTTTGATTCGATGAAGATGCGA AATTTCGAACCATTGCTGTTCTTCTGAGTTTGATCGTTTTTTAGTTTCGGGGCTTTCACG CTTCAGCTAGTGTTTGATTACGAAGTTTTCTGATTAAATGTGTGAGTTTTTTTGTAGTCA TCTCGAAAACTGAGAAATCCATTTTTATAGATTTACATTGTTCATAGTTATATGTGGAAG TTGATGATTGATGGTGATTCTGCAAATTGATGATTTGGTTTTCTGTTTATTGGCATTGCA G >Chr1.ID=PAC:19657430;Name=AT1G76660.1;pacid=19657430;longest=1;Parent +=AT1G76660.len490.i1.2_394 GTTTCTGTTTAATTCTCTCTATTTTCGTTTGATTTCGACTTCTTGAGCTTTTACTTCTCT CTGTCTCAGTTCTAACTTCTTCAGATTTTAAAGCTTTCGTTTTTTTGGCAAGTTGTTTTT TTTTCCTACTTAGATCTGACTACTCCGACTCTGTTCACACTAATGTTCGTTAGGGTTTAT GTTGAATCTCTCCTTTGATCATTATGTTATTGTAAAAATCCCAGCTTTATGCTAAATCGA GCTAGTGATTCTTGAGAATTGAACAAAAAAGTTTTACATTTTTCTGAATTGCCATTCAAT TAGAAGAAGAAAAAATTCAACCTTTTACTGGTTATGATCTAGATTCGATGCGTGTAAGCT ATAAGATCACCATTTCGTGCTTTAGATCCATAATCATTGATTCACTATATGGCAATTATC TTCTTGCTTCACAGATCTCTTTTACACTTACATGTCAAGTGTCTGAGTGTGTGTGTGTGT CCTTTTGCAG >Chr1.ID=PAC:19652608;Name=AT1G31420.1;pacid=19652608;longest=1;Parent +=AT1G31420.len415.i1.13_217 GTTCTTATAATTCTTACTAATTCTGTTTGCTTTTAAAGCTATAACCTTTGATATTGTTGG AAAGAGTGGGGTTTTGGGTCTTTTGCTGAATCGTTTTTTGGATTTGTTATATTGTTCGAA TCTTCAGTTGTTATTGTGTTATAGGATTGGATGGTATCTGGGAATTTCGTAATCTATGTT AAGCTAGAGCTGTTTTTGAGCTCTTTTGTTGATGATGATTTTGAGATTGTTGGCCGAATT TAGCTCTCGTTTTCTGATTTTAGCAATTGGAAAGTGTGTATTGGTTCTTGTGAGGCAATT TCACTGTTTTGAGTACTCAAAATGTAGATGAGAGCATGCATAAGTTGTGTGGAGACTGAG CTTAATGTGTAGTGTAATTGACAATTAGTTTTGTGGGCTTTCCTTTGTTTTTCAG


Comment on Re: Searching array against hash
Download Code
Re^2: Searching array against hash
by BrowserUk (Pope) on Aug 22, 2013 at 01:55 UTC

    Without prejudicing the OPs response in any way, can I ask you, what you think using Bio::DB::Fasta would have over the OPs 50 line module?

    Given that to get it, he would have to try and install it and its 897 codependants -- not to mention AnyDBM_File and at least one of DB_File GDBM_File NDBM_File SDBM_File; all of which are a known problem on my platform -- and somehow resolve the 95 "cannot resolve"s (whatever they are?):

    all to install an indexing module that he doesn't need, and won't benefit from, in order to complete his task in "a couple seconds".


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      It will help if you are looking to retrieve a subsequence from the human genome, the FASTA file of which is about 5 Gb; in the case of parsing a smaller FASTA file with just a few entries, it makes less of a difference.

      Bioinformatics
        It will help if you are looking to retrieve a subsequence from the human genome, the FASTA file of which is about 5 Gb;

        I guess things have moved on. The version I have is just under 3GB and came in 25 files chr(1-22, M, X, Y).

        That said, if his 3 posted sequences are representative of his 900,000; that means his file is a tad under 900MB.

        Which if he can process that in "a few seconds"; means he could process your 5GB file in 5+bit * "a few seconds".

        But, and here is the point. It will take Bio::DB::Fasta at least that same 5+bit*"a few seconds" to construct an index; before he can start processing anything. So for a one-off process, there is a net loss.

        Now the real crux. Given all the additional layers and overheads; how many times does he have to redo the process in order to obtain a net gain? (If ever.)

        Then add to that the (potential) problems with installation; and the learning curve of finding your way around the documentation for 897 modules to find the one that you want; and then learning how to use it to do what you want; and suddenly the reason why so many bioinformaticians are looking for Lite alternatives to the Bio::Behemoth and simple procedures in order to get their work done; rather than becoming technical debt slaves to the byzantine Bio::Empire.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      Without prejudicing the OPs response in any way, can I ask you, what you think using Bio::DB::Fasta would have over the OPs 50 line module?

      Not having a CS background, I cannot analyze an algorithm for efficiency. However, Bio::DB::Fasta is an exisiting and, among biologists, widely used solution to a common task, presented by the OP. So, like you, I exercise the privilege to offer my solution.

      Given that to get it, he would have to try and install it and its 897 codependants -- not to mention AnyDBM_File and at least one of DB_File GDBM_File NDBM_File SDBM_File; all of which are a known problem on my platform -- and somehow resolve the 95 "cannot resolve"s (whatever they are?):

      So far, I've installed BioPerl on Ubuntu, RedHat, and Mac OS X without problems. If the OP happens to be in a biomedical field, it may be worthwhile to install and use BioPerl as it does solve many genomic tasks.

      all to install an indexing module that he doesn't need, and won't benefit from, in order to complete his task in "a couple seconds".

      "Without prejudicing the OPs response in any way"

        "Without prejudicing the OPs response in any way"

        Meaning: he will speak for himself if he so chooses. My inquiry was for my benefit.

        Having helped a growing list of bio-guys to avoid the byzantine Bio::Empire, I'm always looking to maintain my knowledge of the reasons for its highly complicated, intricate and involved nature.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1050435]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (4)
As of 2014-10-25 17:03 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (146 votes), past polls