<?xml version="1.0" encoding="windows-1252"?>
<node id="1007807" title="Re^4: Reading files n lines a time" created="2012-12-07 12:45:17" updated="2012-12-07 12:45:17">
<type id="11">
note</type>
<author id="823119">
naturalsciences</author>
<data>
<field name="doctext">
Right now it is simply a FASTA file.
FASTA files store DNA sequence information and are formatted as follows.
&lt;p&gt;&gt;nameofsequence\n&lt;/p&gt;
&lt;p&gt;ATCGTACGTTGCTE\n&lt;/p&gt;
&lt;p&gt;&gt;anothername\n&lt;/p&gt;
&lt;p&gt;GTCTGT\n&lt;/p&gt;
&lt;p&gt;So a line starting with &gt; and containing a sequence name is followed by a line containing that sequence's nucleotide information.&lt;/p&gt;
I am thinking of reading them in four lines at a time, because I have reason to suspect that, due to some previous operations, there may be sequences directly following each other with different names (on the &gt;sequencename\n line) but exactly the same sequence information (on the following ATGCTGT\n line). Right now I'm looking to identify and remove such duplicates, but I may later use scripts that do many comparisons, extractions, etc. on neighbouring sequences in my files. (Two neighbours means four lines.)
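&lt;p&gt;To make the idea concrete, here is a rough sketch of the duplicate removal (in Python rather than Perl, purely as an illustration). The names read_records and drop_adjacent_duplicates and the sample records are made up for this example, and it assumes every record is exactly two lines: a header starting with &gt;, then one sequence line.&lt;/p&gt;

```python
from itertools import islice


def read_records(lines):
    """Yield (header, sequence) pairs, assuming strict two-line records."""
    it = iter(lines)
    while True:
        pair = list(islice(it, 2))  # take the next two lines
        if len(pair) != 2:          # end of input (a dangling last line is ignored)
            return
        header, seq = (s.rstrip("\n") for s in pair)
        if not header.startswith(">"):
            raise ValueError("expected a header line, got: " + header)
        yield header, seq


def drop_adjacent_duplicates(lines):
    """Keep a record unless its sequence equals that of the record directly before it."""
    kept = []
    previous_seq = None
    for header, seq in read_records(lines):
        if seq != previous_seq:
            kept.append((header, seq))
        previous_seq = seq
    return kept


# Hypothetical sample: seq2 repeats the sequence of seq1 and should be dropped.
sample = [">seq1\n", "ATCG\n", ">seq2\n", "ATCG\n", ">seq3\n", "GTCTGT\n"]
for header, seq in drop_adjacent_duplicates(sample):
    print(header)
    print(seq)
```

&lt;p&gt;Remembering only the previous sequence means two lines at a time would be enough for this particular job. Reading four lines at a time may still be the better fit for the other neighbour comparisons.&lt;/p&gt;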
</field>
<field name="root_node">
1007560</field>
<field name="parent_node">
1007643</field>
</data>
</node>
