Perhaps he gave it to you like that so you could learn how to read and debug the code? The code tutorials will really help you if you stop, breathe, and then take the time to go through them, understand them, and then use them.
The Monks don't usually do your homework for you - it's a point of principle that having someone else do your homework doesn't help you learn the language. I'm going to give you some pointers on how you might tackle the problem - it's up to you to do something with them. Or not.
I could structure it something like this:
1. Create a hash of all possible 20mers
a. Start by making an array containing four strings A,T,G,C
b. Count the number of array elements you have
c. For each array element use shift to get it from the left side of the array
d. add each of the four nucleotides to the shifted element
e. add each new string back into the right side of the array with push
f. repeat for each of the original elements in the array, then repeat the whole pass until every string is 20 bases long
g. You should end up with 4^20 array elements - about 1.0995e12
h. Use each array element as a hash key and set the value of the key to zero
i. Thinking about it, the size of the array will get pretty large, so maybe start with four arrays, each containing a nucleotide. This will cut the final size of each individual array to a quarter. You can break it down even further by creating more arrays earlier, such as individual arrays for the first 64 combinations (3mers), and then carry on from there. Play with it and see what works best.
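To make the shift/push idea in step 1 concrete, here is a rough sketch. It is only an illustration: kmers() is a name I made up, and I run it with k = 3 rather than 20, because a trillion 20mers will not fit in memory on an ordinary machine.

```perl
use strict;
use warnings;

# kmers() is a hypothetical helper name, not something from this thread.
# It builds every k-mer over A,T,G,C using the shift/push passes of step 1.
sub kmers {
    my ($k) = @_;
    my @queue = qw(A T G C);                    # step a
    for my $pass (2 .. $k) {                    # one pass per extra base
        my $count = scalar @queue;              # step b
        for (1 .. $count) {
            my $prefix = shift @queue;          # step c
            # steps d and e: extend the prefix and push the results back
            push @queue, map { $prefix . $_ } qw(A T G C);
        }
    }
    return @queue;
}

# step h, with k = 3 to keep the example small
my %hits = map { $_ => 0 } kmers(3);
print scalar(keys %hits), " keys\n";            # prints "64 keys"
```

Each pass multiplies the array size by four, so you can watch how quickly it grows by raising k a little at a time.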
2. Read the files in from your directory:
a. Read a directory of file names
b. For each file
a. grab the sequence and the name
c. close the file
d. Process the sequence and the file before starting the next one
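Step 2 might look something like the sketch below. I am assuming each file holds a single FASTA-style record (a ">name" header line followed by sequence lines), which may not match your files, and read_sequence() is a made-up helper name. It also joins the sequence lines into one string as it goes.

```perl
use strict;
use warnings;

# Assumption: one FASTA-style record per file (">name" header, then sequence).
# read_sequence() is a hypothetical helper name.
sub read_sequence {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my ($name, $seq) = ('', '');
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\r$//;                       # tolerate Windows line endings
        if ($line =~ /^>\s*(\S+)/) {            # header line: keep the first word
            $name = $1;
            next;
        }
        $line =~ s/\s+//g;                      # drop stray whitespace
        $seq .= uc $line;                       # build one long string
    }
    close $fh;
    return ($name, $seq);
}

# Only runs when a directory is given on the command line.
if (my $dir = shift @ARGV) {
    opendir my $dh, $dir or die "Cannot open directory $dir: $!";
    my @files = sort grep { -f "$dir/$_" } readdir $dh;
    closedir $dh;
    for my $file (@files) {
        my ($name, $seq) = read_sequence("$dir/$file");
        print "$file\t$name\t", length($seq), "\n";
    }
}
```

Run it with the directory as its only argument and it prints the file name, sequence name and sequence length for each file, which is a quick way to check the parsing before you go any further.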
3. Process the file as follows:
a. Make the sequence one long concatenated string
b. You know you want to look at a window of 20 bases, so you have to decide how many bases to step down the sequence between windows, e.g. read the first 20 base window, step down 5 bases, read the next 20 base window, and so on
c. For each window, match the window to a hash key and autoincrement the value of the hash key
d. If you run out of sequence, end the processing
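The windowing in step 3 could be sketched like this. Note that count_windows() is a hypothetical name, and that it starts from an empty hash and counts only the windows that actually turn up, rather than incrementing a pre-filled hash of every possible 20mer; any pattern missing from the result simply has zero hits. The toy call uses a window of 3 and a step of 1 so the result is easy to check by hand.

```perl
use strict;
use warnings;

# count_windows() is a hypothetical helper name.
# $size is the window length (20 in the plan above), $step the stride.
sub count_windows {
    my ($seq, $size, $step) = @_;
    my %count;
    # Stop as soon as a full window no longer fits (step d).
    for (my $pos = 0; $pos + $size <= length($seq); $pos += $step) {
        $count{ substr($seq, $pos, $size) }++;  # step c: autoincrement
    }
    return \%count;
}

my $count = count_windows('ATGATGATGC', 3, 1);
print "$_\t$count->{$_}\n" for sort keys %$count;
```

For your real data you would call it as count_windows($seq, 20, 5), or with whatever step you settle on.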
4. Reporting on the matches
a. Use the hash to find keys with a value of 0, 1, 2, 3, 4, etc.
b. You have the sequence name, so print the output as sequence name, patterns with 0 hits, patterns with 1 hit and so on. If you're only interested in single hits for that sequence, then only print those out.
c. If you use tabs between each value, you can open it in Excel as tab-delimited text.
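One possible shape for the report in step 4 is sketched below. The layout is my own assumption: one line per sequence, one tab-separated column per hit count, with the patterns in each column joined by commas. report_line() is a made-up name.

```perl
use strict;
use warnings;

# report_line() is a hypothetical helper name; %$count maps pattern => hits.
# Assumed layout: name, then one column each for 0 .. $max_hits hits.
sub report_line {
    my ($name, $count, $max_hits) = @_;
    my @columns = ($name);
    for my $hits (0 .. $max_hits) {
        my @patterns = sort grep { $count->{$_} == $hits } keys %$count;
        push @columns, join(',', @patterns);
    }
    return join("\t", @columns);
}

my %count = (AAA => 0, ATG => 1, TGC => 1, GAT => 2);
print report_line('seq1', \%count, 2), "\n";
```

If you only care about single hits, drop the loop and grep for a value of 1.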
This is a fairly straightforward project - really. You should be able to figure it out with the first five chapters of Merlyn's Learning Perl book, which is pretty compact.
yet another biologist hacking perl....