Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

parsing mismatch from blast output

by Anonymous Monk
on Aug 09, 2010 at 13:46 UTC ( [id://853819]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

HI All,
I would be greatful if you could share me your knowledge in parsing the blastn file. I have a blastn output file, which is something like this
ALIGNMENTS >lcl|14079 ref|NC_000009.11|:4900000-5300000 Homo sapiens chromosome 9 +, GRCh37 primary reference assembly Length=400001 Score = 270 bits (146), Expect = 2e-74 Identities = 148/149 (99%), Gaps = 0/149 (0%) Strand=Plus/Minus Query 1 TGGGCAAGGACTTCATGTCTAAAACACCAAAAGCAATGGCAACAAAAGCCAAAATT +GACA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||| +|||| Sbjct 48784 TGGGCAAGGACTTCATGTCTAAAACACCAAAAGCAATGGCAACAAAAGCCAAAATT +GACA 48725 Query 61 AATGGGATCTAATTAAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATCAGA +GTGA 120 |||||||||||||| ||||||||||||||||||||||||||||||||||||||||| +|||| Sbjct 48724 AATGGGATCTAATTCAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATCAGA +GTGA 48665 Query 121 ACAGGCAACCTACAGAATGGGAGAACATT 149 ||||||||||||||||||||||||||||| Sbjct 48664 ACAGGCAACCTACAGAATGGGAGAACATT 48636

I would like to create a summary of the position of mismatch, and the type (insertion/deletion) from the blast output. In this case, the position 75, and the alleles A-C.

I tried with BioSearchIO, which parses the percentage, start and the end position of the alignment.Obviously, I dont want to have them in my summary rather the corodinates of mismatch, type of variation. Have any one know about any modules in perl/or a simpler(even harder) way of finding the position of mismatch and the type of variation?

Replies are listed 'Best First'.
Re: parsing mismatch from blast output
by BioLion (Curate) on Aug 09, 2010 at 14:56 UTC

    Boulder::Blast is probably one standard way of parsing BLASTN results (and other tag/val reports), and will do a good job of simplifying the data held in the record, from where you could split the match and reference sequences into individual characters, and compare them, outputting mismatched characters (and their positions within the match).

    You might also want to check out bioperl: parse BLAST HSPs which gives some example code on handling the records, to get at the info you want.

    Hope this helps - have a go, and if you get into further trouble re-post and I am sure many a Monk will rush to your aid - remember to include input/code/outpur/error messages etc...!

    Just a something something...
Re: parsing mismatch from blast output
by AnomalousMonk (Archbishop) on Aug 09, 2010 at 23:45 UTC

    If the sequence strings being compared are always the same length, the string bitwise-xor trick is pretty quick and simple:

    >perl -wMstrict -le "my $seq1 = 'AATGGGATCTAATTAAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATC'; my $seq2 = 'AATGGGATCTAATTAAACTCAAGAGCTTCTGCACAGCAAAAGAAACTACCATC'; my $diff = $seq1 ^ $seq2; $diff =~ tr{\x00-\xff}{.*}; print $seq1; print $seq2; print $diff; " AATGGGATCTAATTAAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATC AATGGGATCTAATTAAACTCAAGAGCTTCTGCACAGCAAAAGAAACTACCATC ...................*.................................

    If the string length are not the same (i.e., characters were inserted or deleted, not just altered in place), this trick will at least get you the position of the first difference, and you're on your own thereafter to re-synchronize and find any other diffs. (I'm sure there are already many tools on CPAN and elsewhere to do this sort of thing!)

Re: parsing mismatch from blast output
by BrowserUk (Patriarch) on Aug 10, 2010 at 01:46 UTC

    I couldn't find any files of the correct format. I tried blastn at NCBI on NC_000009.11, but after waiting 45 minutes it said: "Informational Message: blastsrv4.REAL: Error: CPU usage limit was exceeded, resulting in SIGXCPU (24)." which probably just goes to show I don't know what the .... I'm doing.

    So I based this upon your sample:

    #! perl -slw use strict; my( $ID, $query, $off ); while( <DATA> ) { if( m[^>] ) { ( $ID ) = (split '\|', $_)[ 1 ]; next; } if( m[^Query] ) { ( $query ) = m[^Query\s+(\d+)]; my $top = substr $_, 15; my $pipes = substr <DATA>, 15; my $bot = substr <DATA>, 15; my $p = 0; while( $p = 1+index $pipes, ' ', $p ) { printf "%20s :(%d) %1s/%1s\n", $ID, $query+$p, substr( $top, $p-1, 1 ), substr( $bot, $p-1, 1 ); } } } __DATA__
    >lcl|14079 ref|NC_000009.11|:4900000-5300000 Homo sapiens chromosome 9 +, GRCh37 primary reference assembly Length=400001 Score = 270 bits (146), Expect = 2e-74 Identities = 148/149 (99%), Gaps = 0/149 (0%) Strand=Plus/Minus Query 1 TGGGCAAGGACTTCATGTCTAAAACACCAAAAGCAATGGCAACAAAAGCCAAAATT +GACA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||| +|||| Sbjct 48784 TGGGCAAGGACTTCATGTCTAAAACACCAAAAGCAATGGCAACAAAAGCCAAAATT +GACA 48725 Query 61 AATGGGATCTAATTAAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATCAGA +GTGA 120 |||||||||||||| ||||||||||||||||||||||||||||||||||||||||| +|||| Sbjct 48724 AATGGGATCTAATTCAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATCAGA +GTGA 48665 Query 121 ACAGGCAACCTACAGAATGGGAGAACATT 149 ||||||||||||||||||||||||||||| Sbjct 48664 ACAGGCAACCTACAGAATGGGAGAACATT 48636 >lcl|14080 ref|NC_000009.11|:4900000-5300000 Homo sapiens chromosome 9 +, GRCh37 primary reference assembly Length=400001 Score = 270 bits (146), Expect = 2e-74 Identities = 148/149 (99%), Gaps = 0/149 (0%) Strand=Plus/Minus Query 1 TGGGCAAGGACTTCATGTCTAAAACACCAAAAGCAATGGCAACAAAAGCCAAAATT +GACA 60 |||||||||||||| ||||||||||||||||||||||||||||||||||||||||| +|||| Sbjct 48784 TGGGCAAGGACTTCATGTCTAAAACACCAAAAGCAATGGCAACAAAAGCCAAAATT +GACA 48725 Query 61 AATGGGATCTAATTAAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATCAGA +GTGA 120 ||||||||||||||||||||||||||||| |||||||||||||||||||||||||| +|||| Sbjct 48724 AATGGGATCTAATTCAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATCAGA +GTGA 48665 Query 121 ACAGGCAACCTACAGAATGGGAGAACATT 149 ||| |||| |||||| ||||||||||| Sbjct 48664 ACAGGCAACCTACAGAATGGGAGAACATT 48636 >lcl|14081 ref|NC_000009.11|:4900000-5300000 Homo sapiens chromosome 9 +, GRCh37 primary reference assembly Length=400001 Score = 270 bits (146), Expect = 2e-74 Identities = 148/149 (99%), Gaps = 0/149 (0%) Strand=Plus/Minus Query 1 TGGGCAAGGACTTCATGTCTAAAACACCAAAAGCAATGGCAACAAAAGCCAAAATT +GACA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||| +|||| Sbjct 48784 TGGGCAAGGACTTCATGTCTAAAACACCAAAAGCAATGGCAACAAAAGCCAAAATT +GACA 48725 Query 61 AATGGGATCTAATTAAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATCAGA +GTGA 120 |||||||||||||| ||||||||||||||||||||||||||||||||||||||||| +|||| Sbjct 48724 AATGGGATCTAATTCAACTAAAGAGCTTCTGCACAGCAAAAGAAACTACCATCAGA +GTGA 48665 Query 121 ACAGGCAACCTACAGAATGGGAGAACATT 149 ||||||||||||||||||||||||||||| Sbjct 48664 ACAGGCAACCTACAGAATGGGAGAACATT 48636

    Outputs:

    c:\test\blast>..\853819.pl 14079 ref :(75) A/C 14080 ref :(15) A/A 14080 ref :(90) T/T 14080 ref :(124) G/G 14080 ref :(129) C/C 14080 ref :(136) A/A 14081 ref :(75) A/C

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      HI Thanks for that! Insted of getting the position in reference to Query, I qould like to have them in reference to Subject. So, the output should be something like this. 14079 ref 487239 A/C. my tweaks other than your suggestions are here
      #!/usr/local/bin/perl -sl # use strict; use Getopt::Long; my $home = $ENV{'HOME'}; my($ID, $query, $off, $idi, $subject); print "ID\tposition\tvariation\n"; while(<DATA>) { next if $. <20; if(m[^>]) { ($ID) = (split '\|',$_)[1]; #!/usr/local/bin/perl -sl # use strict; use Getopt::Long; my $home = $ENV{'HOME'}; my($ID, $query, $off, $idi, $subject); print "ID\tposition\tvariation\n"; while(<DATA>) { next if $. <20; if(m[^>]) { ($ID) = (split '\|',$_)[1]; } if(/^\s+Identities/){ my($identity, undef) = split/,/ ; ($idi) = $identity =~ /\sIdentities\s\=\s\d{3}\/\d{3}\s\((\d{2,3}\% +)\)$/; } if($idi eq "95%") { if(m/^Query/) { ($query) = m[^Query\s+(\d+)]; my $top = substr $_, 15; my $pipes = substr <DATA>,15; my $bot = substr <DATA>, 15; my $p = 0 ; while ($p = 1+index $pipes,' ', $p) { printf "%20s\t%d\t%1s/%1s\n",$ID,$query+$p, substr( $top, $p-1, 1 +),substr( $bot, $p-1, 1 ); } } } } } if(/^\s+Identities/){ my($identity, undef) = split/,/ ; ($idi) = $identity =~ /\sIdentities\s\=\s\d{3}\/\d{3}\s\((\d{2,3}\% +)\)$/; } if($idi eq "95%") { if(m/^Query/) { ($query) = m[^Query\s+(\d+)]; my $top = substr $_, 15; my $pipes = substr <DATA>,15; my $bot = substr <DATA>, 15; my $p = 0 ; while ($p = 1+index $pipes,' ', $p) { printf "%20s\t%d\t%1s/%1s\n",$ID,$query+$p, substr( $top, $p-1, 1 +),substr( $bot, $p-1, 1 ); } } } }
      Thanks.
        OH Yes, I managed find a way through it thanks for al wonderful suggestions.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://853819]
Approved by Corion
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (4)
As of 2024-03-29 07:08 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found