PerlMonks  

Re^2: Reading files n lines a time

by naturalsciences (Beadle)
on Dec 06, 2012 at 14:00 UTC ( [id://1007576]=note )


in reply to Re: Reading files n lines a time
in thread Reading files n lines a time

Thanks. I already know about the record separator. Unfortunately, my files don't contain anything as useful as a newline for delimiting blocks, at least as far as I can tell.

Re^3: Reading files n lines a time
by ww (Archbishop) on Dec 06, 2012 at 20:27 UTC
    Perhaps you can post a real (or bowdlerized sample) snippet of your actual data. It's amazing what a bit of exposure to regular expressions can help one spot, and here you'll have many such well-educated eyes looking for proxy paragraph markers.

    I'm actually surprised -- no, very surprised -- that this request hasn't been posted higher in the thread.

      Right now it is simply a FASTA file. FASTA files store DNA sequence information and are formatted as follows:

      >nameofsequence\n
      ATCGTACGTTGCTE\n
      >anothername\n
      GTCTGT\n

      So a line starting with > and containing a sequence name is followed by a line containing that sequence's nucleotide information.
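
      For the thread's general question, here is a minimal sketch of reading a file n lines at a time. It is written in Python rather than Perl, and `read_n_lines` and the sample data are hypothetical names made up for illustration. It assumes the layout above, with each record taking exactly two lines.

```python
import io
from itertools import islice

def read_n_lines(handle, n):
    """Yield successive lists of up to n lines (newlines stripped) from handle."""
    while True:
        # islice pulls at most n lines from the handle without rewinding it.
        chunk = [line.rstrip("\n") for line in islice(handle, n)]
        if not chunk:
            return  # end of input
        yield chunk

# Hypothetical sample in the two-line-per-record layout; a real script
# would pass an open file handle instead of StringIO.
sample = io.StringIO(">seq1\nATCG\n>seq2\nGTCTGT\n>seq3\nGTCTGT\n")
chunks = list(read_n_lines(sample, 4))
```

      Note that the last chunk can be shorter than n when the number of lines is not a multiple of n, so the caller has to allow for that.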

      I am thinking of dredging them in 4 lines at a time, because I have reason to suspect that, due to certain earlier operations, there might be sequences directly following each other with different names (on the >sequencename\n line) but exactly the same sequence information (on the following ATGCTGT\n line). Right now I'm looking to identify and remove such duplicates, but I might also make use of scripts that compare, extract, etc. neighbouring sequences in my files. (Two neighbours means four lines.)
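
      A rough Python sketch of the adjacent-duplicate removal described above. It compares each record with the one just before it instead of using fixed 4-line blocks, because fixed blocks would miss a duplicate pair that straddles a block boundary (lines 3-4 and 5-6, say). `read_records` and `drop_adjacent_duplicates` are hypothetical names, and the code assumes strict two-line records.

```python
import io

def read_records(handle):
    """Yield (name, sequence) pairs from a two-line-per-record FASTA handle."""
    # zip(handle, handle) pulls two consecutive lines per iteration.
    # A trailing header with no sequence line is silently dropped.
    for header, seq in zip(handle, handle):
        # Assumes every header line starts with '>'.
        yield header.rstrip("\n")[1:], seq.rstrip("\n")

def drop_adjacent_duplicates(records):
    """Yield records, skipping any whose sequence equals the one just before it."""
    previous_seq = None
    for name, seq in records:
        if seq != previous_seq:
            yield name, seq
        previous_seq = seq

# Hypothetical sample: b, c and d carry the same sequence back to back.
sample = io.StringIO(">a\nATCG\n>b\nGTCTGT\n>c\nGTCTGT\n>d\nGTCTGT\n>e\nATCG\n")
kept = list(drop_adjacent_duplicates(read_records(sample)))
```

      Only the first record of each run is kept, so runs of three or more identical neighbours collapse to one. The non-adjacent repeat (a and e here) is deliberately left alone.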
        SuperSearch (done already; here's the link: ?node_id=3989;BIT=FASTA) will give you a short list of recent discussions on dealing with FASTA files.

        My notion that your paragraphing might be identifiable with a regex is pretty useless here. However, there's no reason you can't read 2 lines at a time and use hashes to ensure that the two "neighbors" have distinct values.
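
        One way the hash idea might look, sketched in Python (where a dict plays the role of a Perl hash). `dedupe_by_sequence` is a hypothetical name, and the code assumes strict two-line records with a leading '>' on each header line.

```python
import io

def dedupe_by_sequence(handle):
    """Read two lines at a time; keep the first name seen for each sequence."""
    first_name_for = {}  # sequence -> name of the first record carrying it
    dropped = []         # (dropped name, name of the record that was kept)
    while True:
        header = handle.readline()
        seq = handle.readline()
        if not header or not seq:
            break  # end of input, or a header with no sequence line
        name = header.rstrip("\n")[1:]  # strip newline and the leading '>'
        seq = seq.rstrip("\n")
        if seq in first_name_for:
            dropped.append((name, first_name_for[seq]))
        else:
            first_name_for[seq] = name
    return first_name_for, dropped

# Hypothetical sample: c repeats a's sequence without being adjacent to it.
sample = io.StringIO(">a\nATCG\n>b\nGTCTGT\n>c\nATCG\n")
kept, dropped = dedupe_by_sequence(sample)
```

        Unlike a neighbour-only comparison, this catches repeats anywhere in the file, but memory grows with the number of distinct sequences, since every one is kept as a key.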

        That too, however, breaks down if the dups appear other than adjacent to one another, given the size of your data.

        So if none of the above helps, you may wish to read about BioPerl in both the Wikipedia article, http://en.wikipedia.org/wiki/BioPerl, and on the project page, http://www.bioperl.org/wiki/Main_Page.
