The assumption made here by the author, and I am advancing that notion, (not sure if it is entirely correct or not, nevertheless) that when I discover elements on a scrambled genome, it has to be, by definition, a false positive
No quibble with that. The result of shuffling the DNA is that it is no longer DNA. Anything detected in it is just random chance.
But combining the number of hits in real DNA samples with the number of hits found by chance in non-DNA samples in a single mathematical equation (your %FDR) is extremely dubious, if not outright bogus.
At the very best, all it gives you is some measure of the possibility that some percentage of the hits you find in the real DNA might be down to chance. But it doesn't tell you whether any of them actually are down to chance; and even if some of them are, it gives you no way to determine which ones.
As such, it is a useless statistic. It's like knowing that any given pick of 6 numbers in the (UK) lottery has a 1 in 53.66 chance of picking up some prize. That doesn't help you pick a winning combination, much less pick one that will win a major prize.
So at least in this context, I am seeing what is 'expected' in terms of the shuffled genome serving as a negative control, and yielding fewer # of elements than for randomly shuffled genomes.
Your original question asks whether using a larger window when shuffling your DNA samples reduces the chances of false positives, as appeared to be indicated by your graph.
But for that to be true, the random state of your non-DNA sample would have to somehow influence the hits found in your unshuffled, real-DNA sample. And that simply cannot be. So, the answer must be: NO!
The only effect that using a larger window might have is that, by shuffling the characters over a wider base, it might(*) be less likely to randomly produce matches to your header/trailer libraries. But even if it does, that tells you exactly nothing about whether the hits found in the unshuffled, real-DNA sample are good or bad, because the two have literally nothing in common.
You might just as well start with the digits of PI, map them to ACGT and search the result for matches, for all the bearing the results -- whatever they might come out to -- will have upon the efficacy of any matches you find in your real DNA samples. I.e. none whatsoever.
*DNA is not random. I ran a crude process on a copy of the full human genome I already had on my hard disk, scanned the 2,861,343,839 (non-N) characters and collated all the unique 16-base subsequences therein. Amongst the 2,861,343,824 subsequences in the file (drawn from the roughly 4.3 billion (4^16 = 2^32) possible 16-base subsequences), only 1,130,866,232 unique subsequences actually appear.
Of those, 633,492,754 appear exactly once; another 188,580,306 appear fewer than 8 times; and 8,793,172 fewer than 256 times. The remaining 236,135 subsequences appear more than 256 times.
The most frequent subsequences, 'aaaaaaaaaaaaaaaa' & 'tttttttttttttttt' appear nearly 1 million times each. These are the frequencies of the next 30 most frequent:
332362 328795 327795 324360 203697 203018 201263 199475 199235 198964 198412 197732 197184 196340 195806
195132 194474 194028 193476 192956 191843 191242 191019 190768 190628 190278 182911 182452 180801 179857
As you can see, real DNA is heavily biased when compared to purely random DNA, so shuffling real DNA is a good way to produce a random sequence with a DNA-like overall mix of bases.
But once shuffled, it has no relationship to the real DNA it was derived from; and thus there can be no correlation between results obtained from the one and results obtained from the other.
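For anyone wanting to repeat that kind of count on a small scale, here is a minimal Python sketch of the idea. It is not the crude process I ran; the function names (`kmer_counts`, `summarise`) are made up for illustration, and it assumes the sequence fits in memory as a plain string. A full genome would need a far more compact representation than a dictionary of strings.

```python
from collections import Counter

VALID_BASES = set("acgt")

def kmer_counts(seq, k=16):
    """Count every overlapping k-base subsequence of seq.

    Subsequences containing anything other than a/c/g/t (e.g. 'n')
    are skipped. Hypothetical helper, for illustration only.
    """
    seq = seq.lower()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= VALID_BASES:
            counts[kmer] += 1
    return counts

def summarise(counts):
    """Return (total subsequences, unique subsequences, those seen exactly once)."""
    total = sum(counts.values())
    unique = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return total, unique, singletons

# Number of possible 16-base subsequences: 4**16 == 2**32
possible = 4 ** 16

# Tiny demonstration with k=4 on a toy string (not real data)
demo = kmer_counts("acgtacgtacgtnacgt", k=4)
print(summarise(demo))
print(demo.most_common(2))
```

On a uniformly random sequence the counts would be spread far more evenly than the frequencies listed above; the skew is what shows the bias.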
As for the size of the "sliding window" over which you shuffle the bases: there is some visual (but uncorroborated) evidence that real DNA also has some locality bias. That is, when a subsequence is repeated, the repeats tend to appear in relatively close proximity to one another. On that basis, it is probable that the effect of the larger window is to 'more thoroughly mix' the bases, so that the result tends to be less DNA-like, with the knock-on effect that you are less likely to find matches to subsequences drawn from real DNA. That could explain the graph you posted elsewhere.
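To make the window idea concrete, here is a rough Python sketch. I don't know exactly how your shuffling is implemented, so this assumes the simplest reading: the sequence is cut into consecutive, non-overlapping windows and the bases within each window are shuffled independently. The name `shuffle_in_windows` and its parameters are hypothetical.

```python
import random

def shuffle_in_windows(seq, window, rng=None):
    """Shuffle the bases of seq within consecutive, non-overlapping windows.

    Assumption: 'window' is a fixed number of bases and the final window
    may be shorter. Each window keeps its own base composition; only the
    order of bases inside it changes.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    rng = rng or random.Random()
    pieces = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)          # in-place shuffle of this window only
        pieces.append("".join(chunk))
    return "".join(pieces)

# Toy example (not real data): a seeded generator makes the run repeatable
rng = random.Random(1)
original = "aaaaccccggggtttt"
print(shuffle_in_windows(original, 4, rng))   # each block of 4 is one base, so unchanged
print(shuffle_in_windows(original, 16, rng))  # bases can now move anywhere in the string
```

With a small window, any local clustering of bases survives the shuffle; with a window as large as the sequence, it is mixed away.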
But, and I cannot emphasise this enough: regardless of how well you mix the bases, any matches (or lack thereof) found in the shuffled DNA have exactly no correlation with, no influence upon, and no predictive or diagnostic utility for the matches found in the real DNA.
I have no knowledge of the experience/prowess/standing of the author of the paper you cited; nor do I understand its contents; but I am really very sure that combining numbers derived from real & shuffled DNA into a single equation is completely bogus math.