Re^2: Window size for shuffling DNA?
by onlyIDleft (Scribe) on May 18, 2015 at 17:29 UTC ( [id://1127038] )
Please allow me to confirm your understanding of my problem:

1. The DNA sequences I feed the software vary in size but are large, often ~250 MB or more. So yes, you understood correctly: the input is larger than the sliding window by at least two orders of magnitude.

2. Yes, I have already run this software on DNA from several different species; I now want to estimate the FDR for each of those species.

3. There is no unstated aim to this analysis beyond reporting the number of elements for each species (that part is done) AND the FDR for each species (which is where I am having issues and seek help). I am NOT trying to identify header and trailer terminal sequences common to all species. That is a good segue into the next point...

4. Indeed, I DO supply the software with two separate libraries of LCVs: one for the header sequences and another for the trailers, which are supposed to be 'bona fide' based on independent verification, either experimental or from some other bioinformatic approach. LCVs are supposed to be similar to profile HMMs, but that is all I know about them at this point.

5. This is an important point: you ask whether I want to eliminate false positives. That might open a can of worms, BUT the short answer is NO. What I am REALLY trying to do is count and compare the number of hits on regular vs. randomized DNA inputs, simply to assess and report the FDR. Because of the shuffling, IMO it would be quite complex to "identify" preserved elements versus "lost" elements. Rather than "identify" true elements, I just want to report how many of the predictions are likely false positives.

Problem 1: Because the shuffling is random, I imagine that every time I shuffle and THEN predict the number of elements, I will obtain a different result. Ideally the workaround would be to shuffle a large number of times, so that the FDR estimate is more reliable. But given the software's run time, that is not viable.
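For concreteness, the windowed shuffle I have in mind looks roughly like the sketch below. This is my own minimal illustration, not code from the prediction software; `window_shuffle` is a hypothetical name, and the window length is just a parameter (1 MB in the author's recommendation, much shorter in my experiments).

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Shuffle a DNA string within fixed-size, non-overlapping windows.
# Base composition is preserved inside each window, so it is also
# preserved globally; only the local order of bases is destroyed.
sub window_shuffle {
    my ($seq, $win) = @_;
    my $out = '';
    for (my $i = 0; $i < length $seq; $i += $win) {
        my @bases = split //, substr($seq, $i, $win);
        $out .= join '', shuffle @bases;
    }
    return $out;
}
```

The smaller the window, the more local structure survives the shuffle; at window length 1 the output is identical to the input, which is one way to see why the predicted element count should trend with window size.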
So I am concerned about the statistical validity of a single random shuffle. I don't think this can be circumvented by shuffling the same input sequence 20 times in succession and then providing that as input: although such iterative shuffling would no doubt randomize the sequence more thoroughly, IMO it would still produce only ONE FDR value, which would still be unreliable. Right?

Problem 2: The observation I report in the math Stack Exchange post, that the number of predicted elements follows a trend as the sliding-window length changes, makes me worry about using the 1 MB window recommended by the author. The FDR is lower at a 1 MB window than at 10 bp, 50 bp, or 100 bp, and I wonder what a 'valid' window length for shuffling DNA would be. In other words, there are also biological criteria that need to be imposed so that the shuffle is biologically meaningful. Lured by the prospect of reporting a lower FDR, did the author incorrectly use a 1 MB window for shuffling? Is the FDR actually higher, and should it be based on a window length comparable to the length of the header and trailer sequences?

I do not know whether problems 1 and 2 above are real or imagined. If they are real, then I am NOT tied to the idea of shuffling DNA. If math/biology-proficient Monks can think of any other way to assess and report the FDR, I am all eyes and ears. Thank you!
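To show what I mean by one shuffle giving one unreliable number: if run time ever permitted several independent shuffles, the hit counts could be turned into a mean FDR with a spread. This is a hypothetical helper of my own (`fdr_stats` is not part of the software), assuming the FDR is estimated as hits-on-shuffled divided by hits-on-real.

```perl
use strict;
use warnings;

# Given the hit count on the real sequence and hit counts from several
# independent shuffles, return the mean FDR estimate and its sample
# standard deviation. One shuffle gives a point estimate with no spread.
sub fdr_stats {
    my ($real_hits, @shuffled_hits) = @_;
    my @fdrs = map { $_ / $real_hits } @shuffled_hits;
    my $mean = 0;
    $mean += $_ for @fdrs;
    $mean /= @fdrs;
    my $var = 0;
    $var += ($_ - $mean) ** 2 for @fdrs;
    $var /= (@fdrs - 1 || 1);
    return ($mean, sqrt $var);
}
```

With, say, 100 hits on real DNA and 5, 7, and 6 hits on three shuffles, this reports a mean FDR of 0.06 with a standard deviation of 0.01; a single shuffle would have reported 0.05, 0.07, or 0.06 with no way to tell them apart.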
In Section: Seekers of Perl Wisdom