Clear questions and runnable code
get the best and fastest answer
I seek your help in verifying whether the following code intends what I want it to do, please.
The input file has multiple alternating lines of IDs formatted as >ID1 and DNA sequences as strings of A/T/G/C. I want to deal with each pair of ID followed by its sequence, at a time. For a given sequence, I want to randomly shuffle the sequence in a window of length = 10^6. The move on to the next window until that sequence has been operated upon. Store this shuffled sequence under the same ID. Then move into the next pair of ID and its corresponding sequence. Process until all pairs of IDs and sequences are processed.
The reason I ask for your stamp of approval that my script is OK is to be doubly sure I am not making a mistake here. The downstream results I am getting from these randomly shuffled DNA sequences is SO VERY different from what has been published that if/when I claim that those results are wrong, the burden of proof lays heavy on me, and I want to be extra careful making such claims!UPDATE 1
Thank you for all your responses. All useful. I see that there is no point in 10X shuffling, and I see the logic for why that is the case. JohnGG thanks for re-working the script. I will use that version now. BrowserUK, yes, I can exchange private messages if you are OK with that
But before future runs for shuffling the DNA sequences randomly, may be some more detail may help you help me better. This is more to do with logic than actual scripting... The "element" I am identifying in the genome has two ends. These two ends are each around 50-100 long. The separation between these ends is between 200 and 20,000. The length of the ends, and the separation between them is NOT known a priori. So, I wonder if there are two size ranges or periodicities whose non randomness needs to be broken down, with two successive shuffles, one in the ~ 100 length window and again in a ~1000 length window? Or some other window lengths? Or should I seed the shuffle start position differently? I am confused about this....
A question more related to the Perl modules. Would folks in the know about random shuffling point me to where I can simply understand any differences between List::Util qw(shuffle) and Array::Shuffle qw(shuffle_array). Or point to me whether they are same in algorithm, just different in any other way...UPDATE 2
The identification of the two ends is NOT deterministic, but based on finding a pattern, where each character in the pattern could be A or T or G or C or some combination of those 2, 3 or all 4 letters. when you search for a perfect match, and when even 1 character gets randomly shuffled you cannot detect it any more, so that is straight forward, right? But here we are NOT looking for JUST perfect matches, but there exists degeneracy in each matching position, so that when you shuffle out one letter and replace it with another, it might still be a match, albeit a different one that "scores" better or worse - that makes it a bit more complicated IMO, doesn't it? This is the actual scenario. I had forgotten to mention this earlier, apologies! How does this change things when it comes to shuffling and periodicities I was referring to in Update 1? THANK YOU! :)UPDATE 3
Hi Laurent and BrowserUK et al. - thank you for your replies. I am not trying to match the sequences to the query myself, the software I am using does it and returns the score. And BrowserUK's claim that genomicists have no use of score is not true to say the least. While it may be true that for some tasks, it might become necessary from a practical point of view, to impose a threshold score, it doesn't mean scores do not matter. Quite the contrary.... prime case in point - the BLAST tool. OK, I am going on a tangent, so back to the question at hand:
BrowserUK, I am sure understand your point about randomness being randomness, so my bone of contention or argument is NOT about how many numbers of shuffles, because it is now clear to me that doing it multiple times is not better, but a waste of time, no argument there
I am getting more numbers of matches in the shuffled DNA sequences than in the intact DNA sequences! This is an absurd result, which can only mean that there is so much noise being picked up by the software and reported as a match from intact sequences. When the sequence is shuffled, there is more noise that is being generated due to the shuffle per se, and since the software does not do a good job of discrimination, it reports a higher number of matches! This is the only conclusion I can arrive at based on my discussion with you folks. DNA random shuffling that I am doing is not a problem, the approach the software used in separating signal from noise is not efficient enough, is what I am thinking....
Unless any of you folks have a different opinion about my conclusion regarding the software I am using, and why it is reported more number of matches despite DNA sequence random shuffling, I consider this matter closed. Thanks to all who participated in this discussion, I am grateful for your inputs, suggestions, thoughts and code. Cheers!(hopefully) FINAL UPDATE
In response to most recent post by BrowserUK, there is nothing in that post that I disagree with. However, the main point I am making regarding the software I am using, might have been lost on readers. So I will reiterate: When DNA sequence is shuffled, there will be some loss of "signal". Signal here meaning millions of years of selection for meaningful sequences. So more signal and less noise in intact genomes. And less signal and more noise in shuffled genomes. I dont think there is going to be any disagreement here. The problem arises - as I see it, because the software I am using does not have sufficient "discriminatory power" to efficiently separate signal from noise. If it could, then it would report more true matches proportional to signal in intact genomes, and fewer true matches proportional to reduced signal in shuffled genomes. But in reality, the behavior is opposite. So this is a limitation of the software, or the training set used on it, or the library of sequences that it is searching for or some combination thereof, and not of the shuffling at all. I think it would be very hard to argue against my conclusions. Whether you are a genomicist or not :). Thank you all for your inputs!LATEST UPDATE & SEEKING ADVICE REGARDING WHERE TO SEEK HELP WITH JAVA PROGRAM
I requested a friend to check my results on the intact genomes which for me are different from the published results that talk about this bioinformatic software. He gets totally different results from me or the published work. I dont know what machine the authors used, I used a UBUNTU based compute cluster managed via SLURM, my friend used a local UNIX machine with 32GB RAM. This software is Java based. So the executable is a *.jar file. Since there is some issue with even intact genomes returning different numbers for different users / machines, may be FDR > 100% on randomly shuffled genome is may be an artifact of the software?!#$%^& Not sure!!! But I looked up the internet to look for hints. May be these are related to the sort of problem I am having? http://bit.ly/1BIIoiv, http://bit.ly/1LqQ8Zm. Software is at : http://bit.ly/1fvIz6y. Where do you advice me to seek help about whether the software is messed up or not. Asking authors has not elicited a useful reply, and I dont think it will in the future. Is there a Java forum like this is for Perl users? Thanks in advance!