Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

get full hit descrption from blast output (xml)

by ejbiers (Initiate)
on Nov 29, 2012 at 21:58 UTC ( #1006367=perlquestion: print w/ replies, xml ) Need Help??
ejbiers has asked for the wisdom of the Perl Monks concerning the following question:

Hello wise monks!

I am trying to parse an xml file like the one below using the following script, and everything works well except that I cannot get the full hit description (the "Hit_def" parameter in this xml format).

What I want that output to read is "43989.cce_0262 (Cyanothece ATCC 51142)", but my output only has "(Cyanothece ATCC 51142)". I played around a bit and found that if I put non-numerical text infront of this information, then I get the entire hit description. However, I'd like to be able to retrieve it without modifying the xml file. Any suggestions?

Here is a sample of the xml input

<?xml version="1.0"?> <!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://ww +w.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd"> <BlastOutput> <BlastOutput_program>blastx</BlastOutput_program> <BlastOutput_version>BLASTX 2.2.27+</BlastOutput_version> <BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejan +dro A. Sch&amp;auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, an +d David J. Lipman (1997), &quot;Gapped BLAST and PSI-BLAST: a new gen +eration of protein database search programs&quot;, Nucleic Acids Res. + 25:3389-3402.</BlastOutput_reference> <BlastOutput_db>/Applications/blast-2.2.27+/db/COG_Nov2012/protein.s +equences.v9.0</BlastOutput_db> <BlastOutput_query-ID>Query_1</BlastOutput_query-ID> <BlastOutput_query-def>HKUN3Y301D9XQX</BlastOutput_query-def> <BlastOutput_query-len>508</BlastOutput_query-len> <BlastOutput_param> <Parameters> <Parameters_matrix>BLOSUM62</Parameters_matrix> <Parameters_expect>10</Parameters_expect> <Parameters_gap-open>11</Parameters_gap-open> <Parameters_gap-extend>1</Parameters_gap-extend> <Parameters_filter>L;</Parameters_filter> </Parameters> </BlastOutput_param> <BlastOutput_iterations> <Iteration> <Iteration_iter-num>1</Iteration_iter-num> <Iteration_query-ID>Query_1</Iteration_query-ID> <Iteration_query-def>HKUN3Y301D9XQX</Iteration_query-def> <Iteration_query-len>508</Iteration_query-len> <Iteration_hits> <Hit> <Hit_num>1</Hit_num> <Hit_id>gnl|BL_ORD_ID|1515029</Hit_id> <Hit_def>43989.cce_0262 (Cyanothece ATCC 51142)</Hit_def> <Hit_accession>1515029</Hit_accession> <Hit_len>65</Hit_len> <Hit_hsps> <Hsp> <Hsp_num>1</Hsp_num> <Hsp_bit-score>40.0466</Hsp_bit-score> <Hsp_score>92</Hsp_score> <Hsp_evalue>0.00664016</Hsp_evalue> <Hsp_query-from>155</Hsp_query-from> <Hsp_query-to>253</Hsp_query-to> <Hsp_hit-from>12</Hsp_hit-from> <Hsp_hit-to>44</Hsp_hit-to> <Hsp_query-frame>-1</Hsp_query-frame> <Hsp_hit-frame>0</Hsp_hit-frame> <Hsp_identity>17</Hsp_identity> <Hsp_positive>27</Hsp_positive> <Hsp_gaps>0</Hsp_gaps> <Hsp_align-len>33</Hsp_align-len> <Hsp_qseq>LRGAICSMEHIEEALGKLKDWARKLIELLLGPR</Hsp_qseq> <Hsp_hseq>ITGAVCLMDYLEKVLEKLRELAQKLIETLLGPQ</Hsp_hseq> <Hsp_midline>+ GA+C M+++E+ L KL++ A+KLIE LLGP+</Hsp_midl +ine> </Hsp> </Hit_hsps> </Hit>

and here is the perl script:

#!/usr/local/bin/perl # Usage information #Usage: $0 -i <BLAST-report-file> -o <output-file> -n <number-of-top-h +its> -b <min_bit_score> # -t <trashed.output_queries_without_hits> use strict; use warnings; use Bio::SearchIO; use Getopt::Std;#needed for flagging parameters sub main{ my %opt; #note: colons after letter mean the flag expects an argument getopt('i:o:n:b:t:', \%opt); print "Parsing the BLAST result ...\n"; my $in = Bio::SearchIO->new(-format => 'blastxml', -file => $opt{i}); open (OUT,">$opt{o}") or die "Cannot open $opt{o}: $!"; open (OUT2,">$opt{t}") or die "Cannot open $opt{t}: $!"; open (OUT3, ">$opt{o}.header") or die "Cannot open $opt{o}.header: $!" +; # print the header info for tab-deliminated columns print OUT "query_name\tquery_length\taccession_number\tsubject_length\ +tsubject_description\tE value\tbit score\tframe\tquery_start\t"; print OUT "query_end\thit_start\thit_end\t%_conserved\t%_identical\n"; print OUT2 "query_name\tquery_length\taccession_number\tsubject_length +\tsubject_description\tE value\tbit score\tframe\tquery_start\t"; print OUT2 "query_end\thit_start\thit_end\t%_conserved\t%_identical\n" +; # extraction of information for each result recursively while ( my $result = $in->next_result ) { #prints query info for reads WITHOUT hits into -t ="bad" file if ( $result->num_hits == 0 ) { print OUT2 $result->query_description . "\t"; print OUT2 $result->query_length . "\t"; print OUT2 "No hits found\n"; } else { my $count = 0; # process each hit recursively while (my $hit = $result->next_hit) { #prints query info for reads WITH hits BELOW bit-score inp +ut value into -t = "bad" file if ( $hit->bits < $opt{b}) { print OUT2 $result->query_description . "\t"; print OUT2 $result->query_length . "\t"; print OUT2 "below bit score\n";} #prints query and other info for reads WITH hits ABOVE bit +-score input into -o = "good" file elsif ( $hit->bits >= $opt{b}) { print OUT $result->query_description . "\t"; print OUT3 $result->query_description . "\n"; print OUT $result->query_length . "\t"; print OUT $hit->accession . "\t"; print OUT $hit->length . "\t"; print OUT $hit->description . "\t"; print OUT $hit->significance . "\t"; print OUT $hit->bits . "\t"; my $hspcount = 0; # process the top HSP for the top number of hits (user + defined) into -o file while (my $hsp = $hit->next_hsp) { if ($hit->bits >= $opt{b}) { print OUT "\t\t\t\t\t\t\t", if ($hspcount > 0) +; print OUT $hsp->query->frame . "\t"; print OUT $hsp->start('query') . "\t" . $hsp-> +end('query'). "\t"; print OUT $hsp->start('hit') . "\t" . $hsp->en +d('hit') . "\t"; printf OUT "%.1f" , ($hsp->frac_conserved * 10 +0); print OUT "%\t"; printf OUT "%.1f" , ($hsp->frac_identical * 10 +0); print OUT "%\n"; $hspcount++; } } } $count++; # flow control for the number of hits needed last if ($count == $opt{n}); } } } close OUT; close OUT2; } main(); print " DONE!!!\n";

and what I put into the command line:

perl scriptname.pl -i inputfile.xml -o goodoutputfile.txt -d badoutputfile.txt -n number_of_hits_to_keep -b bit-score_cutoff

and here is the output I get from "goodoutputfile.txt"

query_name query_length accession_number subject_length su +bject_description E value bit score frame query_start +query_end hit_start hit_end %_conserved %_identical HKUN3Y301D9XQX length=508 xy=1636_1159 region=1 run=R_2012_03_16_06_53 +_48_ 508 1515029 65 (Cyanothece ATCC 51142) 0.00664016 + 40.0466 0 155 253 12 44 81.8% 51.5%

Comment on get full hit descrption from blast output (xml)
Select or Download Code
Re: get full hit descrption from blast output (xml)
by remiah (Hermit) on Nov 30, 2012 at 02:20 UTC

    Hello ejbiers.

    print "###:".$hit->name . $hit->description."\n"; __DATA__ ###:43989.cce_0262(Cyanothece ATCC 51142)
    This is first time I looked into BioPerl. I have no idea for why name method stores it, just found it while looking around modules of bio perl.

    Your XML lacks several end tags.
    And real command line, I guess ,
    perl scriptname.pl -i inputfile.xml -o goodoutputfile.txt -t test.txt -n 3 -b 1
    
    I hope you fixing them and get more proper answers from bio perl monks.

    regards.

Re: get full hit descrption from blast output (xml)
by graff (Chancellor) on Nov 30, 2012 at 02:34 UTC
    I want to help, but based on what you've posted, it's hard. First, you didn't post a complete XML file -- I have to wonder if there's anything else missing besides a bunch of close tags at the end of your sample data.

    Then, your command line doesn't really give us enough info. What would be reasonable values for the "-n number_of_hits_to_keep" and "-b bit_score_cutoff" in order to get relevant results? Also, your command line uses a "-d" where I think the script is expecting a "-t".

    And I'm sorry to seem picky, but you should be able to find an easy way to get your indention right (emacs? vi? some decent IDE or other programmer-savvy editor? perltidy?). Trust me, it really helps.

    Anyway, I did manage to get the output you reported, but I'm a stranger to Bio::SearchIO, and I really can't tell what line or portion of the code actually has something to do with the "Hit_def" parameter in the xml file.

    Just using a straightforward XPath extraction for "//Hit_def" on your (fixed) xml file does indeed return the full string you want - "43989.cce_0262 (Cyanothece ATCC 51142)".

    In order to figure it out, I had to add this at the top:

    use Data::Dumper 'Dumper';
    Then step through it with the debugger until I got inside this block:
    while (my $hit = $result->next_hit) {
    Then my next debugger command was:
    p Data::Dumper::Dumper($hit)
    Looking through the resulting output, I found the missing string -- see if you can find it too... Once you do, you should be able to figure out how to print it to your output file as desired:
    $VAR1 = bless( { '_hsps' => [ { '-query_start' => 253, '-algorithm' => 'BLASTX', '-gaps' => '0', '-hit_seq' => 'ITGAVCLMDYLEKVLEKLRELAQ +KLIETLLGPQ', '-hit_length' => '65', '-query_length' => '508', '-query_desc' => 'HKUN3Y301D9XQX', '-query_frame' => -1, '-rank' => 1, '-hit_desc' => '43989.cce_0262 (Cyanot +hece ATCC 51142)', '-query_end' => 155, '-hit_name' => 'gnl|BL_ORD_ID|1515029' +, '-identical' => '17', '-query_name' => 'Query_1', '-evalue' => '0.00664016', '-score' => '92', '-conserved' => '27', '-hit_frame' => 0, '-hsp_length' => '33', '-query_seq' => 'LRGAICSMEHIEEALGKLKDW +ARKLIELLLGPR', '-hit_start' => '12', '-homology_seq' => '+ GA+C M+++E+ L KL +++ A+KLIE LLGP+', '-hit_end' => '44', '-bits' => '40.0466' } ], '_iterator' => 0, '_description' => '(Cyanothece ATCC 51142)', '_significance' => '0.00664016', '_query_length' => '508', '_accession' => '1515029', '_length' => '65', '_psiblast_iteration' => '1', '_name' => '43989.cce_0262', '_rank' => 1, '_algorithm' => 'BLASTX', '_root_verbose' => 0, '_hashes' => { '0' => 1 }, '_hsp_factory' => bless( { 'interface' => 'Bio::Searc +h::HSP::HSPI', 'type' => 'Bio::Search::HS +P::GenericHSP', '_loaded_types' => { 'Bio: +:Search::HSP::GenericHSP' => 1 }, '_root_verbose' => 0 }, 'Bio::Factory::ObjectFact +ory' ) }, 'Bio::Search::Hit::GenericHit' );

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1006367]
Approved by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (6)
As of 2014-08-30 00:38 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (289 votes), past polls