Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical

get full hit descrption from blast output (xml)

by ejbiers (Initiate)
on Nov 29, 2012 at 21:58 UTC ( #1006367=perlquestion: print w/replies, xml ) Need Help??
ejbiers has asked for the wisdom of the Perl Monks concerning the following question:

Hello wise monks!

I am trying to parse an xml file like the one below using the following script, and everything works well except that I cannot get the full hit description (the "Hit_def" parameter in this xml format).

What I want that output to read is "43989.cce_0262 (Cyanothece ATCC 51142)", but my output only has "(Cyanothece ATCC 51142)". I played around a bit and found that if I put non-numerical text infront of this information, then I get the entire hit description. However, I'd like to be able to retrieve it without modifying the xml file. Any suggestions?

Here is a sample of the xml input

<?xml version="1.0"?> <!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://ww"> <BlastOutput> <BlastOutput_program>blastx</BlastOutput_program> <BlastOutput_version>BLASTX 2.2.27+</BlastOutput_version> <BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejan +dro A. Sch&amp;auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, an +d David J. Lipman (1997), &quot;Gapped BLAST and PSI-BLAST: a new gen +eration of protein database search programs&quot;, Nucleic Acids Res. + 25:3389-3402.</BlastOutput_reference> <BlastOutput_db>/Applications/blast-2.2.27+/db/COG_Nov2012/protein.s +equences.v9.0</BlastOutput_db> <BlastOutput_query-ID>Query_1</BlastOutput_query-ID> <BlastOutput_query-def>HKUN3Y301D9XQX</BlastOutput_query-def> <BlastOutput_query-len>508</BlastOutput_query-len> <BlastOutput_param> <Parameters> <Parameters_matrix>BLOSUM62</Parameters_matrix> <Parameters_expect>10</Parameters_expect> <Parameters_gap-open>11</Parameters_gap-open> <Parameters_gap-extend>1</Parameters_gap-extend> <Parameters_filter>L;</Parameters_filter> </Parameters> </BlastOutput_param> <BlastOutput_iterations> <Iteration> <Iteration_iter-num>1</Iteration_iter-num> <Iteration_query-ID>Query_1</Iteration_query-ID> <Iteration_query-def>HKUN3Y301D9XQX</Iteration_query-def> <Iteration_query-len>508</Iteration_query-len> <Iteration_hits> <Hit> <Hit_num>1</Hit_num> <Hit_id>gnl|BL_ORD_ID|1515029</Hit_id> <Hit_def>43989.cce_0262 (Cyanothece ATCC 51142)</Hit_def> <Hit_accession>1515029</Hit_accession> <Hit_len>65</Hit_len> <Hit_hsps> <Hsp> <Hsp_num>1</Hsp_num> <Hsp_bit-score>40.0466</Hsp_bit-score> <Hsp_score>92</Hsp_score> <Hsp_evalue>0.00664016</Hsp_evalue> <Hsp_query-from>155</Hsp_query-from> <Hsp_query-to>253</Hsp_query-to> <Hsp_hit-from>12</Hsp_hit-from> <Hsp_hit-to>44</Hsp_hit-to> <Hsp_query-frame>-1</Hsp_query-frame> <Hsp_hit-frame>0</Hsp_hit-frame> <Hsp_identity>17</Hsp_identity> <Hsp_positive>27</Hsp_positive> <Hsp_gaps>0</Hsp_gaps> <Hsp_align-len>33</Hsp_align-len> <Hsp_qseq>LRGAICSMEHIEEALGKLKDWARKLIELLLGPR</Hsp_qseq> <Hsp_hseq>ITGAVCLMDYLEKVLEKLRELAQKLIETLLGPQ</Hsp_hseq> <Hsp_midline>+ GA+C M+++E+ L KL++ A+KLIE LLGP+</Hsp_midl +ine> </Hsp> </Hit_hsps> </Hit>

and here is the perl script:

#!/usr/local/bin/perl # Usage information #Usage: $0 -i <BLAST-report-file> -o <output-file> -n <number-of-top-h +its> -b <min_bit_score> # -t <trashed.output_queries_without_hits> use strict; use warnings; use Bio::SearchIO; use Getopt::Std;#needed for flagging parameters sub main{ my %opt; #note: colons after letter mean the flag expects an argument getopt('i:o:n:b:t:', \%opt); print "Parsing the BLAST result ...\n"; my $in = Bio::SearchIO->new(-format => 'blastxml', -file => $opt{i}); open (OUT,">$opt{o}") or die "Cannot open $opt{o}: $!"; open (OUT2,">$opt{t}") or die "Cannot open $opt{t}: $!"; open (OUT3, ">$opt{o}.header") or die "Cannot open $opt{o}.header: $!" +; # print the header info for tab-deliminated columns print OUT "query_name\tquery_length\taccession_number\tsubject_length\ +tsubject_description\tE value\tbit score\tframe\tquery_start\t"; print OUT "query_end\thit_start\thit_end\t%_conserved\t%_identical\n"; print OUT2 "query_name\tquery_length\taccession_number\tsubject_length +\tsubject_description\tE value\tbit score\tframe\tquery_start\t"; print OUT2 "query_end\thit_start\thit_end\t%_conserved\t%_identical\n" +; # extraction of information for each result recursively while ( my $result = $in->next_result ) { #prints query info for reads WITHOUT hits into -t ="bad" file if ( $result->num_hits == 0 ) { print OUT2 $result->query_description . "\t"; print OUT2 $result->query_length . "\t"; print OUT2 "No hits found\n"; } else { my $count = 0; # process each hit recursively while (my $hit = $result->next_hit) { #prints query info for reads WITH hits BELOW bit-score inp +ut value into -t = "bad" file if ( $hit->bits < $opt{b}) { print OUT2 $result->query_description . "\t"; print OUT2 $result->query_length . "\t"; print OUT2 "below bit score\n";} #prints query and other info for reads WITH hits ABOVE bit +-score input into -o = "good" file elsif ( $hit->bits >= $opt{b}) { print OUT $result->query_description . "\t"; print OUT3 $result->query_description . "\n"; print OUT $result->query_length . "\t"; print OUT $hit->accession . "\t"; print OUT $hit->length . "\t"; print OUT $hit->description . "\t"; print OUT $hit->significance . "\t"; print OUT $hit->bits . "\t"; my $hspcount = 0; # process the top HSP for the top number of hits (user + defined) into -o file while (my $hsp = $hit->next_hsp) { if ($hit->bits >= $opt{b}) { print OUT "\t\t\t\t\t\t\t", if ($hspcount > 0) +; print OUT $hsp->query->frame . "\t"; print OUT $hsp->start('query') . "\t" . $hsp-> +end('query'). "\t"; print OUT $hsp->start('hit') . "\t" . $hsp->en +d('hit') . "\t"; printf OUT "%.1f" , ($hsp->frac_conserved * 10 +0); print OUT "%\t"; printf OUT "%.1f" , ($hsp->frac_identical * 10 +0); print OUT "%\n"; $hspcount++; } } } $count++; # flow control for the number of hits needed last if ($count == $opt{n}); } } } close OUT; close OUT2; } main(); print " DONE!!!\n";

and what I put into the command line:

perl -i inputfile.xml -o goodoutputfile.txt -d badoutputfile.txt -n number_of_hits_to_keep -b bit-score_cutoff

and here is the output I get from "goodoutputfile.txt"

query_name query_length accession_number subject_length su +bject_description E value bit score frame query_start +query_end hit_start hit_end %_conserved %_identical HKUN3Y301D9XQX length=508 xy=1636_1159 region=1 run=R_2012_03_16_06_53 +_48_ 508 1515029 65 (Cyanothece ATCC 51142) 0.00664016 + 40.0466 0 155 253 12 44 81.8% 51.5%

Replies are listed 'Best First'.
Re: get full hit descrption from blast output (xml)
by graff (Chancellor) on Nov 30, 2012 at 02:34 UTC
    I want to help, but based on what you've posted, it's hard. First, you didn't post a complete XML file -- I have to wonder if there's anything else missing besides a bunch of close tags at the end of your sample data.

    Then, your command line doesn't really give us enough info. What would be reasonable values for the "-n number_of_hits_to_keep" and "-b bit_score_cutoff" in order to get relevant results? Also, your command line uses a "-d" where I think the script is expecting a "-t".

    And I'm sorry to seem picky, but you should be able to find an easy way to get your indention right (emacs? vi? some decent IDE or other programmer-savvy editor? perltidy?). Trust me, it really helps.

    Anyway, I did manage to get the output you reported, but I'm a stranger to Bio::SearchIO, and I really can't tell what line or portion of the code actually has something to do with the "Hit_def" parameter in the xml file.

    Just using a straightforward XPath extraction for "//Hit_def" on your (fixed) xml file does indeed return the full string you want - "43989.cce_0262 (Cyanothece ATCC 51142)".

    In order to figure it out, I had to add this at the top:

    use Data::Dumper 'Dumper';
    Then step through it with the debugger until I got inside this block:
    while (my $hit = $result->next_hit) {
    Then my next debugger command was:
    p Data::Dumper::Dumper($hit)
    Looking through the resulting output, I found the missing string -- see if you can find it too... Once you do, you should be able to figure out how to print it to your output file as desired:
    $VAR1 = bless( { '_hsps' => [ { '-query_start' => 253, '-algorithm' => 'BLASTX', '-gaps' => '0', '-hit_seq' => 'ITGAVCLMDYLEKVLEKLRELAQ +KLIETLLGPQ', '-hit_length' => '65', '-query_length' => '508', '-query_desc' => 'HKUN3Y301D9XQX', '-query_frame' => -1, '-rank' => 1, '-hit_desc' => '43989.cce_0262 (Cyanot +hece ATCC 51142)', '-query_end' => 155, '-hit_name' => 'gnl|BL_ORD_ID|1515029' +, '-identical' => '17', '-query_name' => 'Query_1', '-evalue' => '0.00664016', '-score' => '92', '-conserved' => '27', '-hit_frame' => 0, '-hsp_length' => '33', '-query_seq' => 'LRGAICSMEHIEEALGKLKDW +ARKLIELLLGPR', '-hit_start' => '12', '-homology_seq' => '+ GA+C M+++E+ L KL +++ A+KLIE LLGP+', '-hit_end' => '44', '-bits' => '40.0466' } ], '_iterator' => 0, '_description' => '(Cyanothece ATCC 51142)', '_significance' => '0.00664016', '_query_length' => '508', '_accession' => '1515029', '_length' => '65', '_psiblast_iteration' => '1', '_name' => '43989.cce_0262', '_rank' => 1, '_algorithm' => 'BLASTX', '_root_verbose' => 0, '_hashes' => { '0' => 1 }, '_hsp_factory' => bless( { 'interface' => 'Bio::Searc +h::HSP::HSPI', 'type' => 'Bio::Search::HS +P::GenericHSP', '_loaded_types' => { 'Bio: +:Search::HSP::GenericHSP' => 1 }, '_root_verbose' => 0 }, 'Bio::Factory::ObjectFact +ory' ) }, 'Bio::Search::Hit::GenericHit' );
Re: get full hit descrption from blast output (xml)
by remiah (Hermit) on Nov 30, 2012 at 02:20 UTC

    Hello ejbiers.

    print "###:".$hit->name . $hit->description."\n"; __DATA__ ###:43989.cce_0262(Cyanothece ATCC 51142)
    This is first time I looked into BioPerl. I have no idea for why name method stores it, just found it while looking around modules of bio perl.

    Your XML lacks several end tags.
    And real command line, I guess ,
    perl -i inputfile.xml -o goodoutputfile.txt -t test.txt -n 3 -b 1
    I hope you fixing them and get more proper answers from bio perl monks.


Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1006367]
Approved by ww
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (7)
As of 2018-06-24 03:50 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (126 votes). Check out past polls.