Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked

Re: remove part of string (DNA)

by sundialsvc4 (Abbot)
on Apr 12, 2011 at 15:45 UTC ( #898974=note: print w/replies, xml ) Need Help??

in reply to remove part of string (DNA)

In most systems today, which have lots of RAM and which typically have only one “big” program busy at one time (not much contention for that RAM...), it is reasonable to use (virtual...) memory as what it is – a large disk file.   Your design does need to be mindful of the fact that, with virtual memory, everything works very well until a so-called “thrash point” is reached, at which time performance abruptly starts to go to hell fast.

As has been said, the largest piece of data is probably the input dataset(s), and so you want to process that one record at a time.   My further suggestion is that perhaps you can arrange things so that the output is effectively generated in “batches,” such that, at some particular turning-point, the program is able to release some of the memory it has been holding up to that time.   Let me now, specifically, explain what I mean and how you might be able to accomplish this.

I am not familiar with the task of DNA sequencing, but let me now suggest that perhaps it would be possible, and therefore advantageous, to design your algorithm so that it assumes (and requires, and checks) that the input file is sorted.   (Caution:   Any program which relies upon this assumption must be made to check the sort-order and to die() outright if an error is found.)   When a file is known to be sorted, the data is now quite-naturally grouped, and the boundaries of each group can be determined without searching.   All of the sequences beginning with (say) AGAGTTT (or any number of leading-characters desired) will occur all-at-once, such that, when you observe that this leading-character sequence has changed, you know that this group of records has ended and that there will be no more.   (At each gap in the sequence, you know, again without searching, that the gap is empty.)   The payback comes if this means that a turning-point has been reached such that in-memory data structures related to the group just-ended can now be released.   The capacity of the program is no longer bounded in any way by the worst-case size of the file, but by the worst-case size of each group.

On-disk sorting is one of the most-studied algorithms in all of computer science.   (Before that, it was done with tapes; before that, and before computers, it was done with punched cards.)   Even files consisting of millions of records can be sorted in a few seconds.   The time required to sort the data, then, is pragmatically inconsequential.   Meanwhile, the operational paybacks of designing algorithms which require (identically) sorted input streams can be huge.

Replies are listed 'Best First'.
Re^2: remove part of string (DNA)
by BrowserUk (Pope) on Apr 12, 2011 at 15:52 UTC

    Nothing you have said above is even vaguely relevant to the OPs question.

    You get more fatuous by the day. Why? What are you hoping to achieve?

      I am not a Perl expert, and have learned from many authors on PerlMonks, including both BrowserUK and sundialsvc4.

      If the comments by sundialsvc4 are inaccurate or not appropriate for the OP's problem, please tell us why. Pre-sorting the data seems like good advice to me, but I don't have your experience and knowledge of Perl. Please enlighten.


        Pre-sorting the data seems like good advice to me,

        It is quite the opposite of good advice. Ie very bad advice.

        1. Sorting is O(N logN). The OPs described processing is O(N).

          Sorting does not help the OPs processing at all.

        2. FASTA file are multi-line record format files.

          If you sorted a FASTA file with the system sort utility, it would screw the file up in a completely irrecoverable way.

          Eg. This:

          c:\test>type 845226.fasta >uc002yje.1 chr21:13973492-13976330 cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga acgcgccctacactctggcatgggggaacccggccccgcagagccctgga CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG >uc002yje.1 chr21:13973492-13976330 cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga acgcgccctacactctggcatgggggaaaaaacccggccccgcagagccctgga CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG >uc002yje.1 chr21:13973492-13976330 cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga acgcgccctacactctggcatgggggaacccggccccgcagagggccctgga CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG

          becomes this:

          c:\test>sort 845226.fasta >uc002yje.1 chr21:13973492-13976330 >uc002yje.1 chr21:13973492-13976330 >uc002yje.1 chr21:13973492-13976330 acgcgccctacactctggcatgggggaaaaaacccggccccgcagagccctgga acgcgccctacactctggcatgggggaacccggccccgcagagccctgga acgcgccctacactctggcatgggggaacccggccccgcagagggccctgga cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG

          And so is rendered entirely useless.

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        BrowserUK is right (identifier and sequence should remain together), but they'll be sorted before exporting them to a fasta file. But I don't think it would really make a difference in speed anyway.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://898974]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (10)
As of 2018-07-19 16:04 GMT
Find Nodes?
    Voting Booth?
    It has been suggested to rename Perl 6 in order to boost its marketing potential. Which name would you prefer?

    Results (411 votes). Check out past polls.