
Re: Parsing BLAST

by MadraghRua (Vicar)
on Apr 24, 2006 at 19:19 UTC ( #545363=note )

in reply to Parsing BLAST

Depends on what you're trying to do. OK, you appear to be searching for 20mers against a virus library. Is the purpose to identify how often a given 20mer comes up in the library? To see if the 20mer is unique in a given sequence? To see if the 20mer is represented at all in the library? To map the position of 20mer hits to features within the library? Answers depend upon the purpose.

I suggest that you go to the Pasteur web site that provides BioPerl training and have a look at the examples they give for running BLAST. It has several very good examples of how to do this, with variations on the parameters. I agree that BioPerl can be a bit of a beast at the beginning, but I happen to like it. The alternative is to try out a copy of the Tisdall book.

Another question - given that you're looking for 20mers, is BLAST even the best tool to be using for this exercise? You're going to end up with many hits (they're 20mers after all and they're everywhere) and many HSPs based upon each individual hit and your e-values are going to be crap.

Given this, would you maybe be better off taking your library of sequences, walking down each one with a 20-base window and scoring each 20mer pattern as you observe it? This is more of a regex approach to the problem. Each 20mer is represented as a key in a hash, and you simply increment its count by one every time that pattern occurs.
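A minimal sketch of that hash idea, assuming the window slides one base at a time so that every overlapping 20mer is counted. The sub name `count_kmers` and the toy sequence are made up for illustration; the demo uses k = 4 only to keep the output short.

```perl
use strict;
use warnings;

# Count every k-mer in a sequence by sliding a window one base at a time.
# Returns a hash reference: k-mer => number of occurrences.
# (count_kmers is a hypothetical helper name, not from any library.)
sub count_kmers {
    my ($seq, $k) = @_;
    my $s = uc $seq;                      # treat upper and lower case alike
    my %count;
    for my $i (0 .. length($s) - $k) {    # empty range if the sequence is shorter than k
        $count{ substr($s, $i, $k) }++;
    }
    return \%count;
}

# Toy data with k = 4; use 20 for real 20mers.
my $counts = count_kmers('ACGTACGTAA', 4);
for my $kmer (sort keys %$counts) {
    print "$kmer\t$counts->{$kmer}\n";
}
```

For a whole library you would call this once per sequence and add the counts together, memory permitting.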

If you're trying something more ambitious, you'll need to provide more information.

yet another biologist hacking perl....

Replies are listed 'Best First'.
Re^2: Parsing BLAST
by cumurph (Novice) on Apr 25, 2006 at 00:04 UTC
    I am trying to find which 20mers are unique to my sequence. I've read the stuff at Pasteur and it doesn't really seem to help with my particular problem. I also have to do this search using FASTA, and have no clue where to even start with that, but that's another bird to kill. Thanks!
      I always parse blast in its -m 8 or -m 9 tabular output format. Much easier to parse.
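As a rough sketch of why the tabular format is easy to handle: each hit is one tab-separated line with 12 standard columns, and -m 9 only adds comment lines starting with `#`. The field names in `@FIELDS`, the sub name `parse_hit`, and the sample lines below are all invented for illustration.

```perl
use strict;
use warnings;

# Labels for the 12 standard columns of BLAST -m 8 / -m 9 output.
# (The labels themselves are my own choice, not something BLAST prints.)
my @FIELDS = qw(query subject pct_identity aln_length mismatches gap_opens
                q_start q_end s_start s_end evalue bit_score);

# Parse one line of tabular output.
# Returns a hash ref, or undef for blank lines, comments, or odd column counts.
sub parse_hit {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^\s*(?:#|$)/;     # -m 9 comment line or empty line
    my @cols = split /\t/, $line;
    return unless @cols == @FIELDS;
    my %hit;
    @hit{@FIELDS} = @cols;                # hash slice: field name => value
    return \%hit;
}

# Made-up sample lines, standing in for a real results file.
my @lines = (
    "# Fields: Query id, Subject id, % identity, ...",
    "kmer_1\tvirusA\t100.00\t20\t0\t0\t1\t20\t105\t124\t0.002\t40.1",
    "kmer_1\tvirusB\t100.00\t20\t0\t0\t1\t20\t9\t28\t0.002\t40.1",
    "kmer_2\tvirusA\t95.00\t20\t1\t0\t1\t20\t300\t319\t0.5\t32.2",
);

# Count full-length, 100% identical hits per query 20mer.
my %perfect;
for my $line (@lines) {
    my $hit = parse_hit($line) or next;
    $perfect{ $hit->{query} }++
        if $hit->{pct_identity} == 100 && $hit->{aln_length} == 20;
}
print "$_\t$perfect{$_}\n" for sort keys %perfect;
```

With a real file you would read the lines from a filehandle instead of the array.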
      Unless your sequence is quite large (and so you have many thousands of unique 20mers), I would go the hash route. It will be VERY fast if memory isn't limiting. If that isn't feasible, break your sequence into fasta sequences of size 20 base pairs and give each a unique ID. Then, blast away using tabular output. Then, you can parse to your heart's content using simple perl.
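One way to do the splitting step, as a sketch only: it assumes you want every overlapping window rather than non-overlapping chunks, and it builds each ID from a prefix plus the 1-based start position. The sub name `kmers_to_fasta` and the `kmer` prefix are made up; the demo uses k = 4 to keep it short.

```perl
use strict;
use warnings;

# Turn a sequence into FASTA records, one per overlapping k-mer.
# Each ID is "<prefix>_<1-based start position>", so it is unique
# and tells you where the k-mer came from.
# (kmers_to_fasta is a hypothetical helper name.)
sub kmers_to_fasta {
    my ($seq, $k, $prefix) = @_;
    my $fasta = '';
    for my $i (0 .. length($seq) - $k) {
        $fasta .= sprintf(">%s_%d\n%s\n", $prefix, $i + 1, substr($seq, $i, $k));
    }
    return $fasta;
}

# Toy data with k = 4; use 20 for real 20mers.
print kmers_to_fasta('ACGTACGTAA', 4, 'kmer');
```

Redirect the output to a file and use it as the query, e.g. something along the lines of `blastall -p blastn -i kmers.fa -d viruses -m 8 -o hits.tab` with legacy BLAST, where `kmers.fa`, `viruses` and `hits.tab` are placeholder names. In BLAST+ the equivalent tabular formats are `-outfmt 6` and `-outfmt 7`.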

      Is this homework?
        Any suggestions on implementing the hashing method, or web sites with code I might be able to use/modify? This is part of a class project for the bioinformatics class I'm in. The rest of my classmates and I (seven of us) are all trying to figure this out. The professor has given us some leads, but the code he gave us isn't working right. Thanks! -Rob
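For what it's worth, here is a minimal sketch of how the hash route could look for the "unique to my sequence" question, assuming "unique" means "occurs in the query but nowhere in the library". The sub names `kmer_set` and `unique_kmers` and the toy sequences are made up for illustration, and the demo uses k = 4 rather than 20.

```perl
use strict;
use warnings;

# Record every k-mer seen in a list of sequences as a hash key.
# (kmer_set and unique_kmers are hypothetical helper names.)
sub kmer_set {
    my ($k, @seqs) = @_;
    my %seen;
    for my $seq (@seqs) {
        my $s = uc $seq;
        $seen{ substr($s, $_, $k) } = 1 for 0 .. length($s) - $k;
    }
    return \%seen;
}

# Return the k-mers of $query that are absent from the library set,
# in order of first appearance, each reported once.
sub unique_kmers {
    my ($query, $k, $library_set) = @_;
    my $q = uc $query;
    my (%reported, @unique);
    for my $i (0 .. length($q) - $k) {
        my $kmer = substr($q, $i, $k);
        next if $library_set->{$kmer} || $reported{$kmer}++;
        push @unique, $kmer;
    }
    return @unique;
}

# Toy data with k = 4; use 20 for real 20mers.
my $library = kmer_set(4, 'ACGTACGTAA');
print "$_\n" for unique_kmers('ACGTTTGTAA', 4, $library);
```

Note that this only looks at the strand as given, whereas blastn searches both strands by default. If that matters for the project, the reverse complement of each library sequence would need to go into the set as well.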
