PerlMonks  

Re^6: Window size for shuffling DNA?

by onlyIDleft (Scribe)
on May 20, 2015 at 00:01 UTC


in reply to Re^5: Window size for shuffling DNA?
in thread Window size for shuffling DNA?

The Y-axis is the # of elements reported on a given genome that has been scrambled. The X-axis is the size of the sliding window within which the random shuffling was performed. The assumption made here by the author, which I am adopting (though I am not sure it is entirely correct), is that any element I discover on a scrambled genome has to be, by definition, a false positive.

Conversely, the elements that I discover and report on the original, unshuffled genome, have to be, by definition, true positives

The software author reported FDR using just the comparison of original vs. scrambled genomes, in terms of the # of elements reported in each case: FDR (%) = (# elements in shuffled genome) / (# elements in original genome) * 100

My chart does NOT show the # of elements in the original genome without any DNA random shuffling. Those numbers are as follows:

A. thaliana (original genome, no DNA shuffle) - 885 elements

B. rapa (original genome, no DNA shuffle) - 3686 elements

M. truncatula (original genome, no DNA shuffle) - 1808 elements

As expected, the numbers above, for the unshuffled genomic DNA as input, are higher than the # of elements for the same genomes after random DNA shuffling (irrespective of the sliding window size). So at least in this context, I am seeing what is 'expected': the shuffled genome serves as a negative control and yields fewer elements than the original, unshuffled genome.
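
The FDR arithmetic described above can be sketched as follows. This is only an illustration: the function name `fdr_percent` and the shuffled-genome count of 100 are hypothetical (the real shuffled counts are in the chart, not in this post); only the three original-genome counts come from the figures listed above.

```python
def fdr_percent(shuffled_count, original_count):
    """FDR (%) = (# elements in shuffled genome) / (# elements in original genome) * 100."""
    if original_count <= 0:
        raise ValueError("original_count must be positive")
    return 100.0 * shuffled_count / original_count


# Element counts on the original, unshuffled genomes (taken from the post).
ORIGINAL_COUNTS = {
    "A. thaliana": 885,
    "B. rapa": 3686,
    "M. truncatula": 1808,
}

# HYPOTHETICAL shuffled-genome count, for illustration only.
example_shuffled = 100

for species, original in ORIGINAL_COUNTS.items():
    fdr = fdr_percent(example_shuffled, original)
    print(f"{species}: {fdr:.1f}% FDR if {example_shuffled} elements "
          f"were reported after shuffling")
```

Because the denominator is fixed per species, every change in the reported FDR across window sizes comes entirely from the shuffled-genome count.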

Replies are listed 'Best First'.
Re^7: Window size for shuffling DNA?
by BrowserUk (Patriarch) on May 20, 2015 at 05:42 UTC
    The assumption made here by the author, which I am adopting (though I am not sure it is entirely correct), is that any element I discover on a scrambled genome has to be, by definition, a false positive.

    No quibble with that. The result of shuffling the DNA, is that it is no longer DNA. Anything detected is just random chance.

    But combining the number of hits in real DNA samples with numbers of hits found by chance in non-DNA samples, in a mathematical equation (your %FDR), is extremely dubious; if not just outright bogus.

    At the very best, all it gives you is some measure of the possibility that of the hits you find in the real DNA; some percentage of them might be down to chance. But it doesn't tell you if they are down to chance; and even if some of them are; it doesn't give you any informational way to determine which ones are down to chance.

    As such, it is a useless statistic. It's like knowing that any given pick of 6 numbers in the (UK) lottery has a 1 in 53.66 chance of picking up some prize. It doesn't help you pick a winning combination; much less pick one that will win a major prize.

    So at least in this context, I am seeing what is 'expected': the shuffled genome serves as a negative control and yields fewer elements than the original, unshuffled genome.

    Your original question asks if using a larger window when shuffling your DNA samples, reduces the chances of false positives; as appeared to be indicated by your graph.

    But for that to be true, the random state of your non-DNA sample would have to somehow influence the hits found in your unshuffled, real-DNA sample. And that simply cannot be. So, the answer must be: NO!

    The only effect that using a larger window might have is that, by shuffling the characters over a wider base, it might(*) be less likely to randomly produce matches to your header/trailer libraries. But even if it does; that tells you exactly nothing about whether the hits found in the unshuffled, real-DNA sample are good or bad; because the two have literally nothing in common.

    You might just as well start with the digits of PI and map them to ACGT and search the result for matches, for all the bearing the results -- whatever they might come out to -- will have upon the efficacy of any matches you find in your real DNA samples. I.e., none whatsoever.

    *DNA is not random. I ran a crude process on a copy of the full human genome I already had on my harddisk and scanned the 2,861,343,839 (non-N) characters and collated all the unique 16-base subsequences therein. Amongst the 2,861,343,824 subsequences in the file (drawn from the 4 billion (2^32) possible 16-base subsequences), only 1,130,866,232 unique subsequences actually appear.

    Of those, 633,492,754 appear exactly once; another 188,580,306 appear fewer than 8 times; and 8,793,172 fewer than 256 times. The remaining 236,135 subsequences appear more than 256 times.

    The most frequent subsequences, 'aaaaaaaaaaaaaaaa' & 'tttttttttttttttt' appear nearly 1 million times each. These are the frequencies of the next 30 most frequent:

    332362 328795 327795 324360 203697 203018 201263 199475 199235 198964
    198412 197732 197184 196340 195806 195132 194474 194028 193476 192956
    191843 191242 191019 190768 190628 190278 182911 182452 180801 179857

    As you can see, real DNA is heavily biased when compared to purely random DNA, so shuffling real-DNA is a good way to produce a DNA-like overall mix of random bases;

    But once shuffled, it has no relationship to the real DNA it was derived from; and thus, there can be no correlation between any data derived by comparing the two.
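
    The 16-base subsequence tally described in the footnote above can be sketched roughly as below. This is not the crude process that was actually run; it is a small in-memory illustration, and the names `kmer_counts` and `summarize` are made up for it. It assumes that N characters are simply dropped before windowing, which is what the quoted figures suggest (2,861,343,839 - 15 = 2,861,343,824). A full genome would need a far more memory-frugal approach than a Python `Counter`.

```python
from collections import Counter


def kmer_counts(seq, k=16):
    """Count every overlapping k-base subsequence after dropping non-ACGT characters."""
    # Assumption: unknown bases (N etc.) are removed first, then the window slides.
    bases = "".join(c for c in seq.lower() if c in "acgt")
    return Counter(bases[i:i + k] for i in range(len(bases) - k + 1))


def summarize(counts):
    """Return (total subsequences, unique subsequences, subsequences seen exactly once)."""
    total = sum(counts.values())
    unique = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return total, unique, singletons


# Toy input with k=4 so the result is easy to check by hand.
toy = "acgtacgtacgtnnacgtaaaa"
counts = kmer_counts(toy, k=4)
print(summarize(counts))
print(counts.most_common(3))
```

    Running the same tally on a real sequence and on its shuffled copy would show how much of the repeat structure the shuffle destroys.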

    As for the size of the "sliding window" over which you shuffle the bases: there is some visual (but uncorroborated) evidence that real DNA has some locality bias also. That is, the same subsequences, if repeated, tend to appear in relatively close proximity to one another. On that basis, it is probable that the effect of the larger window is to 'more thoroughly mix' the bases, and thus the result tends to be less DNA-like; with the knock-on effect that you are less likely to find matches to subsequences drawn from real DNA. That could explain the graph you posted elsewhere.
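
    To make the window effect concrete, here is a minimal sketch of shuffling within windows. The thread does not say exactly how the shuffling tool moves its window, so this assumes consecutive, non-overlapping windows; the function name `shuffle_in_windows` and the toy sequence are hypothetical.

```python
import random


def shuffle_in_windows(seq, window, rng=None):
    """Shuffle bases independently inside consecutive, non-overlapping windows."""
    if window < 1:
        raise ValueError("window must be >= 1")
    rng = rng or random.Random()
    pieces = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)  # bases never leave their own window
        pieces.append("".join(chunk))
    return "".join(pieces)


# Toy sequence with strong local structure: runs of 8 identical bases.
genome = "aaaaaaaacccccccc" * 4
rng = random.Random(42)
for w in (2, 8, 64):
    print(w, shuffle_in_windows(genome, w, rng))
```

    With windows no longer than the runs, the toy sequence comes out unchanged; only windows that span several runs can mix them, which is the 'more thoroughly mixed' effect suggested above.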

    But, and I cannot emphasise this enough; regardless of how well you mix the bases; any matches (or lack thereof) found in the shuffled DNA have exactly no correlation with; nor influence upon; nor any predictive or diagnostic utility when compared to the matches found in the real DNA.

    I have no knowledge of the experience/prowess/standing of the author of the paper you cited; nor do I understand its contents; but I am really very sure that combining numbers derived from real & shuffled DNA into a single equation is completely bogus math.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      Thank you for your detailed and patient explanation

      Several of my suspicions are confirmed

      I must point out that at the outset I already knew that comparing original and shuffled DNA would NOT allow identification and/or elimination of the false positives, but only a calculation of the percentage of elements that are likely to be false positives. So your final reply confirms that unequivocally.

      You assert that because there is no predictive power due to the shuffling, this is useless. As a biologist, I would argue against that claim. When you need to experimentally verify a set of predictions, I would take a method that yields ~ 10% FDR over another method that suffers from ~ 40%. That way time, energy and resources are better utilized. So it does not matter so much which ones are real or not; if a large enough sample is cross-verified experimentally, it should check out as per the theoretical FDR predictions. If it does not, something is wrong with the computation pipeline or the experimental verification protocol or both.

      What is interesting however is your clear statement that since one sample is original and another is shuffled, the FDR calculation as # elements found in shuffled DNA vs. the original DNA is completely bogus! :) In my experience, this is how FDRs for sequence based analyses have been reported in published literature. I do not know if there are other viable methods to assess FDR. But your statement is of concern, in terms of any disconnect existing between the theory of FDR and how it might be applied by biologists

      In any case, your replies were all enlightening. Thank you for taking the time to reply with patient explanations. I have enough fodder to go beyond my confusion and proceed with my analyses. Cheers!

        As a biologist, I would argue against that claim. When you need to experimentally verify a set of predictions, I would take a method that yields ~ 10% FDR over another method that suffers from ~ 40%. That way time, energy and resources are better utilized.

        You're right. I'm not a biologist, but, please think again.

        For each of your 3 species, you have a single "actual discoveries" figure; but 7 different %FDRs.

        The original data, and discoveries don't change, so at best, only one of those 7 numbers could possibly be right; and which one could be different for each of the three species. Or they could all be wrong.

        Picking any of them because it is convenient is just wishful thinking.

        And basing your experimental strategy upon a guess -- it is nothing more -- because it will involve less work; completely subverts the scientific method.

        I'll shut up now.


