Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
HI All,
I would be greatful if you could share me your knowledge in parsing the blastn file. I have a blastn output file, which is something like this
ALIGNMENTS
>lcl|14079 ref|NC_000009.11|:4900000-5300000 Homo sapiens chromosome 9
+,
GRCh37 primary reference assembly
Length=400001
Score = 270 bits (146), Expect = 2e-74
Identities = 148/149 (99%), Gaps = 0/149 (0%)
Strand=Plus/Minus
Query 1 TGGGCAAGGACTTCATGTCTAAAACACCAAAAGCAATGGCAACAAAAGCCAAAATT
+GACA 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
+||||
Sbjct 48784 TGGGCAAGGACTTCATGTCTAAAACACCAAAAGCAATGGCAACAAAAGCCAAAATT
+GACA 48725
Query 61 AATGGGATCTAATTAAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATCAGA
+GTGA 120
|||||||||||||| |||||||||||||||||||||||||||||||||||||||||
+||||
Sbjct 48724 AATGGGATCTAATTCAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATCAGA
+GTGA 48665
Query 121 ACAGGCAACCTACAGAATGGGAGAACATT 149
|||||||||||||||||||||||||||||
Sbjct 48664 ACAGGCAACCTACAGAATGGGAGAACATT 48636
I would like to create a summary of the position of mismatch, and the type (insertion/deletion) from the blast output. In this case, the position 75, and the alleles A-C.
I tried with BioSearchIO, which parses the percentage, start and the end position of the alignment.Obviously, I dont want to have them in my summary rather the corodinates of mismatch, type of variation. Have any one know about any modules in perl/or a simpler(even harder) way of finding the position of mismatch and the type of variation?
Re: parsing mismatch from blast output
by BioLion (Curate) on Aug 09, 2010 at 14:56 UTC
|
Boulder::Blast is probably one standard way of parsing BLASTN results (and other tag/val reports), and will do a good job of simplifying the data held in the record, from where you could split the match and reference sequences into individual characters, and compare them, outputting mismatched characters (and their positions within the match).
You might also want to check out bioperl: parse BLAST HSPs which gives some example code on handling the records, to get at the info you want.
Hope this helps - have a go, and if you get into further trouble re-post and I am sure many a Monk will rush to your aid - remember to include input/code/outpur/error messages etc...!
Just a something something...
| [reply] [Watch: Dir/Any] |
Re: parsing mismatch from blast output
by AnomalousMonk (Archbishop) on Aug 09, 2010 at 23:45 UTC
|
If the sequence strings being compared are always the same length, the string bitwise-xor trick is pretty quick and simple:
>perl -wMstrict -le
"my $seq1 = 'AATGGGATCTAATTAAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATC';
my $seq2 = 'AATGGGATCTAATTAAACTCAAGAGCTTCTGCACAGCAAAAGAAACTACCATC';
my $diff = $seq1 ^ $seq2;
$diff =~ tr{\x00-\xff}{.*};
print $seq1;
print $seq2;
print $diff;
"
AATGGGATCTAATTAAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATC
AATGGGATCTAATTAAACTCAAGAGCTTCTGCACAGCAAAAGAAACTACCATC
...................*.................................
If the string length are not the same (i.e., characters were inserted or deleted, not just altered in place), this trick will at least get you the position of the first difference, and you're on your own thereafter to re-synchronize and find any other diffs. (I'm sure there are already many tools on CPAN and elsewhere to do this sort of thing!)
| [reply] [Watch: Dir/Any] [d/l] |
Re: parsing mismatch from blast output
by BrowserUk (Patriarch) on Aug 10, 2010 at 01:46 UTC
|
I couldn't find any files of the correct format. I tried blastn at NCBI on NC_000009.11, but after waiting 45 minutes it said: "Informational Message: blastsrv4.REAL: Error: CPU usage limit was exceeded, resulting in SIGXCPU (24)." which probably just goes to show I don't know what the .... I'm doing.
So I based this upon your sample:
#! perl -slw
use strict;
my( $ID, $query, $off );
while( <DATA> ) {
if( m[^>] ) {
( $ID ) = (split '\|', $_)[ 1 ];
next;
}
if( m[^Query] ) {
( $query ) = m[^Query\s+(\d+)];
my $top = substr $_, 15;
my $pipes = substr <DATA>, 15;
my $bot = substr <DATA>, 15;
my $p = 0;
while( $p = 1+index $pipes, ' ', $p ) {
printf "%20s :(%d) %1s/%1s\n",
$ID, $query+$p,
substr( $top, $p-1, 1 ),
substr( $bot, $p-1, 1 );
}
}
}
__DATA__
>lcl|14079 ref|NC_000009.11|:4900000-5300000 Homo sapiens chromosome 9
+,
GRCh37 primary reference assembly
Length=400001
Score = 270 bits (146), Expect = 2e-74
Identities = 148/149 (99%), Gaps = 0/149 (0%)
Strand=Plus/Minus
Query 1 TGGGCAAGGACTTCATGTCTAAAACACCAAAAGCAATGGCAACAAAAGCCAAAATT
+GACA 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
+||||
Sbjct 48784 TGGGCAAGGACTTCATGTCTAAAACACCAAAAGCAATGGCAACAAAAGCCAAAATT
+GACA 48725
Query 61 AATGGGATCTAATTAAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATCAGA
+GTGA 120
|||||||||||||| |||||||||||||||||||||||||||||||||||||||||
+||||
Sbjct 48724 AATGGGATCTAATTCAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATCAGA
+GTGA 48665
Query 121 ACAGGCAACCTACAGAATGGGAGAACATT 149
|||||||||||||||||||||||||||||
Sbjct 48664 ACAGGCAACCTACAGAATGGGAGAACATT 48636
>lcl|14080 ref|NC_000009.11|:4900000-5300000 Homo sapiens chromosome 9
+,
GRCh37 primary reference assembly
Length=400001
Score = 270 bits (146), Expect = 2e-74
Identities = 148/149 (99%), Gaps = 0/149 (0%)
Strand=Plus/Minus
Query 1 TGGGCAAGGACTTCATGTCTAAAACACCAAAAGCAATGGCAACAAAAGCCAAAATT
+GACA 60
|||||||||||||| |||||||||||||||||||||||||||||||||||||||||
+||||
Sbjct 48784 TGGGCAAGGACTTCATGTCTAAAACACCAAAAGCAATGGCAACAAAAGCCAAAATT
+GACA 48725
Query 61 AATGGGATCTAATTAAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATCAGA
+GTGA 120
||||||||||||||||||||||||||||| ||||||||||||||||||||||||||
+||||
Sbjct 48724 AATGGGATCTAATTCAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATCAGA
+GTGA 48665
Query 121 ACAGGCAACCTACAGAATGGGAGAACATT 149
||| |||| |||||| |||||||||||
Sbjct 48664 ACAGGCAACCTACAGAATGGGAGAACATT 48636
>lcl|14081 ref|NC_000009.11|:4900000-5300000 Homo sapiens chromosome 9
+,
GRCh37 primary reference assembly
Length=400001
Score = 270 bits (146), Expect = 2e-74
Identities = 148/149 (99%), Gaps = 0/149 (0%)
Strand=Plus/Minus
Query 1 TGGGCAAGGACTTCATGTCTAAAACACCAAAAGCAATGGCAACAAAAGCCAAAATT
+GACA 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
+||||
Sbjct 48784 TGGGCAAGGACTTCATGTCTAAAACACCAAAAGCAATGGCAACAAAAGCCAAAATT
+GACA 48725
Query 61 AATGGGATCTAATTAAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATCAGA
+GTGA 120
|||||||||||||| |||||||||||||||||||||||||||||||||||||||||
+||||
Sbjct 48724 AATGGGATCTAATTCAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATCAGA
+GTGA 48665
Query 121 ACAGGCAACCTACAGAATGGGAGAACATT 149
|||||||||||||||||||||||||||||
Sbjct 48664 ACAGGCAACCTACAGAATGGGAGAACATT 48636
Outputs: c:\test\blast>..\853819.pl
14079 ref :(75) A/C
14080 ref :(15) A/A
14080 ref :(90) T/T
14080 ref :(124) G/G
14080 ref :(129) C/C
14080 ref :(136) A/A
14081 ref :(75) A/C
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
HI Thanks for that! Insted of getting the position in reference to Query, I qould like to have them in reference to Subject. So, the output should be something like this.
14079 ref 487239 A/C.
my tweaks other than your suggestions are here
#!/usr/local/bin/perl -sl
#
use strict;
use Getopt::Long;
my $home = $ENV{'HOME'};
my($ID, $query, $off, $idi, $subject);
print "ID\tposition\tvariation\n";
while(<DATA>) {
next if $. <20;
if(m[^>]) {
($ID) = (split '\|',$_)[1];
#!/usr/local/bin/perl -sl
#
use strict;
use Getopt::Long;
my $home = $ENV{'HOME'};
my($ID, $query, $off, $idi, $subject);
print "ID\tposition\tvariation\n";
while(<DATA>) {
next if $. <20;
if(m[^>]) {
($ID) = (split '\|',$_)[1];
}
if(/^\s+Identities/){
my($identity, undef) = split/,/ ;
($idi) = $identity =~ /\sIdentities\s\=\s\d{3}\/\d{3}\s\((\d{2,3}\%
+)\)$/;
}
if($idi eq "95%") {
if(m/^Query/) {
($query) = m[^Query\s+(\d+)];
my $top = substr $_, 15;
my $pipes = substr <DATA>,15;
my $bot = substr <DATA>, 15;
my $p = 0 ;
while ($p = 1+index $pipes,' ', $p) {
printf "%20s\t%d\t%1s/%1s\n",$ID,$query+$p, substr( $top, $p-1, 1
+),substr( $bot, $p-1, 1 );
}
}
}
}
}
if(/^\s+Identities/){
my($identity, undef) = split/,/ ;
($idi) = $identity =~ /\sIdentities\s\=\s\d{3}\/\d{3}\s\((\d{2,3}\%
+)\)$/;
}
if($idi eq "95%") {
if(m/^Query/) {
($query) = m[^Query\s+(\d+)];
my $top = substr $_, 15;
my $pipes = substr <DATA>,15;
my $bot = substr <DATA>, 15;
my $p = 0 ;
while ($p = 1+index $pipes,' ', $p) {
printf "%20s\t%d\t%1s/%1s\n",$ID,$query+$p, substr( $top, $p-1, 1
+),substr( $bot, $p-1, 1 );
}
}
}
}
Thanks. | [reply] [Watch: Dir/Any] [d/l] |
|
OH Yes, I managed find a way through it thanks for al wonderful suggestions.
| [reply] [Watch: Dir/Any] |
|
|