Window size for shuffling DNA?

by onlyIDleft (Scribe)
on May 18, 2015 at 04:58 UTC

onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:

I am using a Perl script to randomly shuffle DNA. This script uses the Array::Shuffle module (http://search.cpan.org/~salva/Array-Shuffle-0.03/lib/Array/Shuffle.pm), which appears to work OK for my purposes.
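
For illustration, here is a stripped-down sketch of the kind of windowed shuffle I mean (not my actual script; shuffle_windows() and the toy input are just for this example). Each window's bases are permuted only among themselves, so local base composition is preserved:

    use strict;
    use warnings;
    use Array::Shuffle qw(shuffle_array);

    # Shuffle a DNA string within fixed-length windows.
    sub shuffle_windows {
        my ($seq, $win) = @_;
        my $out = '';
        for (my $pos = 0; $pos < length $seq; $pos += $win) {
            my @bases = split //, substr($seq, $pos, $win);
            shuffle_array(@bases);    # in-place Fisher-Yates shuffle
            $out .= join '', @bases;
        }
        return $out;
    }

    my $dna = 'ACGTACGTGGGTTTACGATCGATCGAACCGG';    # toy input
    print shuffle_windows($dna, 10), "\n";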

The problem is that I am not sure whether the theory behind what I am doing (randomly shuffling a DNA sequence) is statistically and biologically valid for the purposes of my research.

I've explained my entire problem at http://math.stackexchange.com/questions/1279699/how-to-best-randomly-shuffle-dna-sequence, because I thought it was more math/stats related than Perl related, but I've received no useful feedback there.

I've always received useful advice here, so I am trying my luck. Thank you!

Re: Window size for shuffling DNA?
by BrowserUk (Patriarch) on May 18, 2015 at 11:49 UTC

    Let's see if I understand your question, by summarising the descriptions you've given.

    You start with a real DNA sequence of ~100MB length which, when you pass it to a 3rd-party program, is searched for a particular sequence (or sequences?) between 20 bytes and 20k bytes in length, identified by the presence of two sequences (of ~50 bytes each) at either end of the wanted sequence.

    Eg.

    ...xxxxxxxxHEADERxxxxx 20-20k bytes xxxxxTRAILERxxxxxxxxxx....

    From your graph, I suspect that you run this process on several (3 shown) real DNA sequences?

    Further, I suspect that the (unstated) aim of this process is to identify the ~50-byte header and trailer sequences that are common to all the different DNA sequences, and that delimit a 'common subsequence' of DNA across species?

    (Do you supply the header & trailer sequences to the 3rd party program?)

    The purpose of your windowed shuffling process is to mix up the real DNA in a locally statistically similar, but randomised, way, in order to eliminate false positives: if the header and trailer sequences you've previously identified are still found in the randomised sequences, then they are probably not good candidates for identifying common sequences, because good header/trailer sequences will not (or will be unlikely to) occur randomly.

    And your question is asking whether the way you are randomising the sequences, via this windowing mechanism, is statistically valid.



      Please allow me to confirm your understanding of my problem:

      1. The DNA sequence that I am feeding the software ranges in size but is large, often ~250MB or larger. So you understood that right: it is larger than the sliding window size by at least 2 orders of magnitude.

      2. Yes, I've already run this software on DNA from several different species. I now want to estimate the FDR for each of these species.

      3. There is no (unstated) aim of this analysis other than reporting the # of elements for each species (that analysis is done) AND the FDR for each species (which is what I am having issues with, and seek help for here). I am NOT trying to identify header and trailer terminal sequences common to all species. That is a good segue into the next point...

      4. Indeed, I DO supply the software with 2 separate libraries of LCVs: one for the headers, another for the trailer sequences, which are supposed to be 'bona fide' based on independent verification, either experimental or some other bioinformatic approach. LCVs are supposed to be similar to profile HMMs, but that is all I know about LCVs at this point.

      5. This is an important point: you ask if I want to eliminate false positives. This might be opening a can of worms, BUT the short answer is NO. What I am REALLY trying to do is to count and compare the # of hits with regular vs. randomized DNA inputs, to simply assess and report the FDR. Due to the shuffling, IMO it would be quite complex to "identify" preserved elements and "lost" elements. Rather than "identify" true elements, I just want to report how many of them are likely false positives.

      Problem 1: As I see it, because the shuffling is random, I imagine that every time I shuffle and THEN predict the # of elements, I will obtain different results. Ideally the workaround would be to shuffle a large number of times, to assess an FDR that is more reliable; but because of the software's run time, this is not viable. So I am concerned about the statistical validity of 1 random shuffle. I don't think this can be circumvented by shuffling the same input DNA sequence 20 times over and then providing that as input: though such iterative shuffling would no doubt randomize the input sequence much better IMO, it would still produce ONLY 1 FDR value. So that would still not be reliable. Right?
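
      For concreteness, repeating the shuffle would look roughly like the sketch below, reusing shuffle_windows() from my original post. run_predictor() is a dummy stand-in for invoking the 3rd-party software, NOT its real interface:

          use strict;
          use warnings;
          use List::Util qw(sum);

          # Dummy stand-in for the 3rd-party element finder; in practice this
          # would be a system() call plus parsing of the tool's output.
          sub run_predictor { my ($seq) = @_; return int rand 100 }

          my $dna    = 'ACGT' x 25;    # toy input; a whole genome in practice
          my $n_reps = 20;             # independent shuffles of the same input
          my @counts;
          for my $rep (1 .. $n_reps) {
              my $shuffled = shuffle_windows($dna, 1_000_000);
              push @counts, run_predictor($shuffled);
          }

          # One count per replicate gives the FDR a mean and a spread,
          # rather than a single unqualified value.
          my $mean = sum(@counts) / @counts;
          my $sd   = sqrt( sum(map { ($_ - $mean)**2 } @counts) / (@counts - 1) );
          printf "hits on shuffled input: mean %.1f, sd %.1f (n=%d)\n",
                 $mean, $sd, $n_reps;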

      Problem 2: The observation I make and report in the math.stackexchange post, that the # of elements follows a trend as the length of the sliding window is changed, worries me about using the 1MB recommended by the author. The FDR is lower at a sliding window length of 1MB than at 10bp, 50bp or 100bp... and I wonder what the 'valid' length for shuffling DNA would be. In other words, there are also biological criteria that need to be imposed so that the shuffle is biologically meaningful. Lured by being able to report a lower FDR, did the author incorrectly use a 1MB sliding window within which the DNA is shuffled? Is the FDR actually higher, and should it be based on a sliding window whose length is comparable to the lengths of the header and trailer sequences? (The sweep behind my graph is sketched below.)
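
      The sweep is essentially this (same stand-ins as above), with the extra weakness that each window size gets only one shuffle:

          # One shuffle + prediction per window length; the trend in these
          # counts is what my graph plots.
          for my $win (10, 50, 100, 1_000, 10_000, 1_000_000) {
              my $hits = run_predictor( shuffle_windows($dna, $win) );
              print "window=$win\thits=$hits\n";
          }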

      I do not know if problems 1 and 2 above are real or whether I've imagined them. If they are real, then I am NOT tied to the idea of shuffling DNA. If there is any other solution that math/biology proficient Monks can think of to assess and report the FDR, I am all eyes and ears. Thank you!

        Rather than "identify" true elements, I just want to report how many of them are likely false positives

        An answer: there is no (none, zero, zilch) statistical basis for predicting the FDR, based upon your process, regardless of the window length.

        Further, I do not believe that there ever could be any correlation between the length of the window in which you randomise, and any meaningful statistic about real-world DNA.

        Basis of conclusion: instinct. My gut feel for the requirements for Monte Carlo simulations to produce statistically valid results, having constructed and run many hundreds of such simulations over the years.

        Caveat: Not one of my simulations had anything to do with DNA or genomics; and I know next to nothing about the subject.

        You cannot draw a statistically valid conclusion based upon 1 (or even a few) random trials, when shuffling just 50 bytes of your sequences can have 1,267,650,600,228,229,401,496,703,205,376 possible outcomes.
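
        (For scale: that figure is simply 4**50, i.e. 2**100, the number of distinct 50-base sequences. Easy to check:)

            use bigint;
            print 4 ** 50, "\n";    # 1267650600228229401496703205376 == 2**100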

        However, if you want a mathematically sound, statistical assessment of your question, then you are going to have to describe much more of the detail of the process; and far more accurate assessments of the ranges of the numbers involved. See below for some of the questions arising.


        Warning: what follows may come across as "angry". It isn't. It's expressed this way to make a point.

        How do you expect to get assistance when you ask a statistics question of a bunch of programmers, and conceal everything in genomics lingo?

        What the &**&&^% are:

        • FDR -- Franklin D Roosevelt? Final Drive Ratio? Firearm Discharge Residue?
        • LCV -- Least Common Variable? Light commercial Vehicle? Lymphocytic Choriomeningitis Virus? Luncheon Vouchers?
        • HMM -- Half Molecule Model? Hazardous Materials Management? Hand Maid May?
        • Why do you (earlier) describe things as "base pairs", when you mean bytes or characters or symbols?

        You say "I DO supply the software with 2 separate libraries of LCVs one for the headers, another for the trailer sequences that are supposed to be 'bona fide' based on independent verification".

        That is an almost entirely useless description:

        • What is a LCV?
        • How many of them are there in a "library"?
        • How long are they?
        • Do they get used (by the 3rd party process) as matched pairs; or does any sequence starting with one from the first library and ending in one from the second library constitute a "discovery"?
        • Is the length of discoveries bounded? If so, is that an input parameter to that 3rd party process?

        Conclusion: based upon the information you've supplied so far, and my own experience of drawing conclusions from random simulations, I see no basis for any meaningful conclusions with regard to false discovery rates.

        But:

        1. I've only understood about 70% of the information you supplied.
        2. You've only supplied about 10% of the information required to make a proper assessment of the process.

        If you were to describe that 3rd-party process in detail (what are the inputs: a genome and 2 libraries, but how big, and with what other constraints; and what are its outputs), then a proper assessment might become possible. The fact that your graph appears to tail off as the length of the window increases is, of itself, meaningless. It also seems to initially increase. And both could simply be artifacts of the particular set of randomisations that occurred in this run.

        How many runs would be required to draw a conclusion? There is no way to determine that from the information you have provided so far.


