Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

how do i obtain blast result from the given file

by bingalee (Acolyte)
on Jun 17, 2013 at 17:08 UTC ( [id://1039417]=perlquestion: print w/replies, xml ) Need Help??

bingalee has asked for the wisdom of the Perl Monks concerning the following question:

I have a file of blast results of different sequences. Some of them have "no hits found", and the others have hits. How do I obtain the lines which have the results , ignoring the initial lines.

Here"s what a file looks like

BLASTX 2.2.26 [Sep-21-2011] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaff +er, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database sear +ch programs", Nucleic Acids Res. 25:3389-3402. Query= XLOC_000039-chr1:983051-985037 (1987 letters) Database: nr 26,236,801 sequences; 9,088,244,489 total letters Searching..................................................done Score + E Sequences producing significant alignments: (bits +) Value ref|XP_002322127.1| autoinhibited H+ ATPase [Populus trichocarpa... +195 2e-93 emb|CAD29314.1| plasma membrane H+-ATPase [Oryza sativa Japonica... +213 1e-55 ref|NP_001048647.1| Os03g0100800 [Oryza sativa Japonica Group] >... +213 1e-55 gb|EEC74325.1| hypothetical protein OsI_09609 [Oryza sativa Indi... +213 1e-55 gb|EMT10609.1| ATPase 11, plasma membrane-type [Aegilops tauschii] +208 3e-54 ref|XP_002511598.1| H(\+)-transporting atpase plant/fungi plasma... +207 3e-54 ref|XP_002274074.1| PREDICTED: ATPase 7, plasma membrane-type is... +207 5e-54 ref|XP_003633895.1| PREDICTED: ATPase 7, plasma membrane-type is... +207 6e-54 gb|AAA34099.1| plasma membrane H+ ATPase, partial [Nicotiana plu... +197 8e-54 ref|XP_003562242.1| PREDICTED: plasma membrane ATPase 1-like [Br... +207 8e-54 ref|XP_002326870.1| autoinhibited H+ ATPase [Populus trichocarpa... +206 1e-53 ref|XP_003529038.1| PREDICTED: plasma membrane ATPase 1-like iso... +205 4e-53

There are many hits, but if I want the top ten only, what do i do?

here"s the code i got so far for my entire task, I managed to do the first part, but am stuck at this one

#!usr/bin/perl -w open(IN,"/home/maize/sequence-ID.txt"); open(OUT,">abc.txt"); open(OUT1,">blastresults.txt"); while($id=<IN>) { $id=~s/\n|\r//g; open(BLAST,"output/$id"); while($file=<BLAST>) { if ($file=~ m/No hits found/) { print OUT "$id \n"; } } close(BLAST); } close(IN); close(OUT); close(OUT1);

so now i've created a text file which has the names of all those files without any hits. Now i have to create another file with the names of those which have significatn hits, along with the first ten hits. Please help :(

EDIT

I figured one way of doing it,

#!usr/bin/perl -w open(IN,"/home/adithie/datasets/maize/sequence-ID.txt"); open(OUT,">lncRNA3.txt"); open(OUT1,">blastresults2.txt"); while($id=<IN>) { $id=~s/\n|\r//g; open(BLAST,"output/$id"); for($i=0;$i<16;$i++){<BLAST>;} $raw=<BLAST>; if ($raw=~ m/No hits found/) { print OUT "$id\n"; } else { <BLAST>; <BLAST>; <BLAST>; <BLAST>; $raw=<BLAST>; print OUT1 "$id\n"; $count=0; while(length($raw)>10 && $count<=10) { $count++; print OUT1 "$raw"; $raw=<BLAST>; } } close(BLAST); } close(IN); close(OUT); close(OUT1);

Replies are listed 'Best First'.
Re: how do i obtain blast result from the given file
by rjt (Curate) on Jun 17, 2013 at 19:15 UTC

    You can print the matching lines with a regular expression that matches only the lines you're interested in:

    while (<$fh>) { print if /^(.+?)\|(.+?)\| (.+?)\s+(\d+)\s+(\d+e[+-]\d+)$/; }

    The above will also handle arbitrarily large files, as it reads one line at a time into memory.

    There are many hits, but if I want the top ten only, what do i do?

    I'm not sure what you mean by top ten, exactly (highest "Score"? "E Value"? Order within the file?), but a likely approach to this would be to modify the above while loop and instead of printing, place each line in a hash. My regular expression already does basic parsing of the input line into capture variables $1 through $5, so you can try something like this:

    $hash{$2} = { col1 => $1, desc => $3, score => $4, E => $5 } if /.../; # Use regex above

    Then, after the loop finishes, you can sort and display the results however you like. For example:

    my $how_many = 10; for (sort { $hash{$b}->{score} <=> $hash{$a}->{score} } keys %hash +) { printf "%-4s %-55s %5s\n", $hash{$_}->{col1}, $hash{$_}->{desc}, $hash{$_}->{score}; last if --$how_many == 0; }

    Outputs:

    emb plasma membrane H+-ATPase [Oryza sativa Japonica... 213 gb hypothetical protein OsI_09609 [Oryza sativa Indi... 213 ref Os03g0100800 [Oryza sativa Japonica Group] >... 213 gb ATPase 11, plasma membrane-type [Aegilops tauschii] 208 ref PREDICTED: ATPase 7, plasma membrane-type is... 207 ref H(\+)-transporting atpase plant/fungi plasma... 207 ref PREDICTED: plasma membrane ATPase 1-like [Br... 207 ref PREDICTED: ATPase 7, plasma membrane-type is... 207 ref autoinhibited H+ ATPase [Populus trichocarpa... 206 ref PREDICTED: plasma membrane ATPase 1-like iso... 205

      correction- I'm so sorry, I meant the first ten hits

        In that case, just modify the initial loop to stop after ten hits:

        my $how_many = 10; while (<$fh>) { if (/^(.+?)\|(.+?)\| (.+?)\s+(\d+)\s+(\d+e[+-]\d+)$/) { print; last if --$how_many == 0; } }

        If that's all you want to do, you don't need the hash. Note that you can still format the output with printf: printf "%-4s %-55s %5s\n", $1, $3, $4;

        You could still use the hash, as well, if you need to do further processing on the results. In that case, to preserve the order, either add an index to the hash ref for use with sort, or, simpler, push each key to an array as you find them:

        my @hits; # Keys in order my $how_many = 10; while (<$fh>) { if (/^(.+?)\|(.+?)\| (.+?)\s+(\d+)\s+(\d+e[+-]\d+)$/) { $hash{$2} = { col1 => $1, desc => $3, score => $4, E => $5, key => $2 }; push @hits, $2; last if --$how_many == 0; } } # Go through the first ten hits, in order for (map { $hash{$_} } @hits) { # $_ contains the hash ref for each record printf "%-5s %-55s %5s\n", $_->{col1}, $_->{desc}, $_->{score}; }
Re: how do i obtain blast result from the given file
by hbm (Hermit) on Jun 17, 2013 at 19:45 UTC

    Another way:

    ... my @seq; # read past the header while(<IN>){ last if /^Sequences producing/; } # assuming remaining data is fixed while(<IN>){ chomp; next if /^\s*$/; # capture the score /^.{67}\s*(\d+)/; # push anonymous array - [score,line] push @seq, [$1,$_]; } close IN; print join $/, ( map { $_->[1] } # print original line sort { $b->[0] <=> $a->[0] } # sorted by score @seq )[0..9]; # top ten

      thank you!

Re: how do i obtain blast result from the given file
by zork42 (Monk) on Jun 18, 2013 at 05:56 UTC
    Your homepage says you're new to programming, so I thought I'd add some missing bits of code (in bold) in case you were not aware that they should be there.
    NB I have not changed/fixed the logic of your code.
    ---------------------------------------------------------------------
    #!usr/bin/perl -w
    use strict;
    use warnings;
    
    open(IN,"/home/maize/sequence-ID.txt") || die $!;
    open(OUT,">abc.txt") || die $!;
    open(OUT1,">blastresults.txt") || die $!;
    
    while(my $id=<IN>)
    {
        $id=~s/\n|\r//g;
        open(BLAST,"output/$id") || die $!;
            
        while(my $file=<BLAST>)
        {
            if ($file=~ m/No hits found/)
            {
                print OUT "$id \n";
            }
        }
        close(BLAST) || die $!;
    }
    close(IN) || die $!;
    close(OUT) || die $!;
    close(OUT1) || die $!; 
    ---------------------------------------------------------------------
    If you prefer you can use normal Perl variables instead of file handles
    eg
    ---------------------------------------------------------------------
    
        open(my $blast,"output/$id") || die $!;
            
        while(my $file=<$blast>)
        {
            if ($file=~ m/No hits found/)
            {
                print OUT "$id \n";
            }
        }
        close($blast) || die $!;
    
    ---------------------------------------------------------------------


    If you're looking for some good books, I can highly recommend these 3 books which form a 3 part series:
    Learning Perl: Making Easy Things Easy and Hard Things Possible
    Intermediate Perl: Beyond The Basics of Learning Perl
    Mastering Perl: Creating Professional Programs with Perl

    I just happened to notice this. I have no idea what it it like, but it might be just what you want:
    Mastering Perl for Bioinformatics

      Hey, thanks a lot for the books. I have the book by Tisdall, its ok. But right now I'm reading beginning perl by Simon Cozens.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1039417]
Approved by Happy-the-monk
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (5)
As of 2024-03-19 05:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found