Hello wise monks!
I am trying to parse an xml file like the one below using the following script, and everything works well except that I cannot get the full hit description (the "Hit_def" parameter in this xml format).
What I want that output to read is "43989.cce_0262 (Cyanothece ATCC 51142)", but my output only has "(Cyanothece ATCC 51142)". I played around a bit and found that if I put non-numerical text infront of this information, then I get the entire hit description. However, I'd like to be able to retrieve it without modifying the xml file. Any suggestions?
Here is a sample of the xml input
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://ww
+w.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastx</BlastOutput_program>
<BlastOutput_version>BLASTX 2.2.27+</BlastOutput_version>
<BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejan
+dro A. Sch&auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, an
+d David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new gen
+eration of protein database search programs", Nucleic Acids Res.
+ 25:3389-3402.</BlastOutput_reference>
<BlastOutput_db>/Applications/blast-2.2.27+/db/COG_Nov2012/protein.s
+equences.v9.0</BlastOutput_db>
<BlastOutput_query-ID>Query_1</BlastOutput_query-ID>
<BlastOutput_query-def>HKUN3Y301D9XQX</BlastOutput_query-def>
<BlastOutput_query-len>508</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_matrix>BLOSUM62</Parameters_matrix>
<Parameters_expect>10</Parameters_expect>
<Parameters_gap-open>11</Parameters_gap-open>
<Parameters_gap-extend>1</Parameters_gap-extend>
<Parameters_filter>L;</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>Query_1</Iteration_query-ID>
<Iteration_query-def>HKUN3Y301D9XQX</Iteration_query-def>
<Iteration_query-len>508</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gnl|BL_ORD_ID|1515029</Hit_id>
<Hit_def>43989.cce_0262 (Cyanothece ATCC 51142)</Hit_def>
<Hit_accession>1515029</Hit_accession>
<Hit_len>65</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>40.0466</Hsp_bit-score>
<Hsp_score>92</Hsp_score>
<Hsp_evalue>0.00664016</Hsp_evalue>
<Hsp_query-from>155</Hsp_query-from>
<Hsp_query-to>253</Hsp_query-to>
<Hsp_hit-from>12</Hsp_hit-from>
<Hsp_hit-to>44</Hsp_hit-to>
<Hsp_query-frame>-1</Hsp_query-frame>
<Hsp_hit-frame>0</Hsp_hit-frame>
<Hsp_identity>17</Hsp_identity>
<Hsp_positive>27</Hsp_positive>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>33</Hsp_align-len>
<Hsp_qseq>LRGAICSMEHIEEALGKLKDWARKLIELLLGPR</Hsp_qseq>
<Hsp_hseq>ITGAVCLMDYLEKVLEKLRELAQKLIETLLGPQ</Hsp_hseq>
<Hsp_midline>+ GA+C M+++E+ L KL++ A+KLIE LLGP+</Hsp_midl
+ine>
</Hsp>
</Hit_hsps>
</Hit>
and here is the perl script:
#!/usr/local/bin/perl
# Usage information
#Usage: $0 -i <BLAST-report-file> -o <output-file> -n <number-of-top-h
+its> -b <min_bit_score>
# -t <trashed.output_queries_without_hits>
use strict;
use warnings;
use Bio::SearchIO;
use Getopt::Std;#needed for flagging parameters
sub main{
my %opt;
#note: colons after letter mean the flag expects an argument
getopt('i:o:n:b:t:', \%opt);
print "Parsing the BLAST result ...\n";
my $in = Bio::SearchIO->new(-format => 'blastxml', -file => $opt{i});
open (OUT,">$opt{o}") or die "Cannot open $opt{o}: $!";
open (OUT2,">$opt{t}") or die "Cannot open $opt{t}: $!";
open (OUT3, ">$opt{o}.header") or die "Cannot open $opt{o}.header: $!"
+;
# print the header info for tab-deliminated columns
print OUT "query_name\tquery_length\taccession_number\tsubject_length\
+tsubject_description\tE value\tbit score\tframe\tquery_start\t";
print OUT "query_end\thit_start\thit_end\t%_conserved\t%_identical\n";
print OUT2 "query_name\tquery_length\taccession_number\tsubject_length
+\tsubject_description\tE value\tbit score\tframe\tquery_start\t";
print OUT2 "query_end\thit_start\thit_end\t%_conserved\t%_identical\n"
+;
# extraction of information for each result recursively
while ( my $result = $in->next_result ) {
#prints query info for reads WITHOUT hits into -t ="bad" file
if ( $result->num_hits == 0 ) {
print OUT2 $result->query_description . "\t";
print OUT2 $result->query_length . "\t";
print OUT2 "No hits found\n";
}
else {
my $count = 0;
# process each hit recursively
while (my $hit = $result->next_hit) {
#prints query info for reads WITH hits BELOW bit-score inp
+ut value into -t = "bad" file
if ( $hit->bits < $opt{b}) {
print OUT2 $result->query_description . "\t";
print OUT2 $result->query_length . "\t";
print OUT2 "below bit score\n";}
#prints query and other info for reads WITH hits ABOVE bit
+-score input into -o = "good" file
elsif ( $hit->bits >= $opt{b}) {
print OUT $result->query_description . "\t";
print OUT3 $result->query_description . "\n";
print OUT $result->query_length . "\t";
print OUT $hit->accession . "\t";
print OUT $hit->length . "\t";
print OUT $hit->description . "\t";
print OUT $hit->significance . "\t";
print OUT $hit->bits . "\t";
my $hspcount = 0;
# process the top HSP for the top number of hits (user
+ defined) into -o file
while (my $hsp = $hit->next_hsp) {
if ($hit->bits >= $opt{b}) {
print OUT "\t\t\t\t\t\t\t", if ($hspcount > 0)
+;
print OUT $hsp->query->frame . "\t";
print OUT $hsp->start('query') . "\t" . $hsp->
+end('query'). "\t";
print OUT $hsp->start('hit') . "\t" . $hsp->en
+d('hit') . "\t";
printf OUT "%.1f" , ($hsp->frac_conserved * 10
+0);
print OUT "%\t";
printf OUT "%.1f" , ($hsp->frac_identical * 10
+0);
print OUT "%\n";
$hspcount++;
}
}
}
$count++;
# flow control for the number of hits needed
last if ($count == $opt{n});
}
}
}
close OUT;
close OUT2;
}
main();
print " DONE!!!\n";
and what I put into the command line:
perl scriptname.pl -i inputfile.xml -o goodoutputfile.txt -d badoutputfile.txt -n number_of_hits_to_keep -b bit-score_cutoff
and here is the output I get from "goodoutputfile.txt"
query_name query_length accession_number subject_length su
+bject_description E value bit score frame query_start
+query_end hit_start hit_end %_conserved %_identical
HKUN3Y301D9XQX length=508 xy=1636_1159 region=1 run=R_2012_03_16_06_53
+_48_ 508 1515029 65 (Cyanothece ATCC 51142) 0.00664016
+ 40.0466 0 155 253 12 44 81.8% 51.5%
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.