<?xml version="1.0" encoding="windows-1252"?>
<node id="1007813" title="Re^5: Reading files n lines a time" created="2012-12-07 14:22:46" updated="2012-12-07 14:22:46">
<type id="11">
note</type>
<author id="352046">
ww</author>
<data>
<field name="doctext">
SuperSearch (done already; here's the link: [href://?node_id=3989;BIT=FASTA]) will give you a short list of recent discussions on dealing with FASTA files.

&lt;p&gt;My notion that your paragraphing might be identifiable with a regex is pretty useless here. However, there's no reason you can't read 2 lines at a time and use hashes to ensure that the two "neighbors" have distinct values.&lt;/p&gt;

That, too, breaks down if the duplicates are not adjacent to one another, given the size of your data.
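
&lt;p&gt;A minimal sketch of the two-lines-at-a-time idea, assuming each record is a header line followed by one data line and that a "duplicate" means a repeated data line. The name unique_pairs and that record layout are my assumptions, not something taken from your data.&lt;/p&gt;

```perl
use strict;
use warnings;

# Hypothetical helper: reads records of two lines each (assumed layout:
# a header line followed by a data line) and returns the records whose
# data line has not been seen before.
sub unique_pairs {
    my ($fh) = @_;
    my %seen;    # data line => number of times it has been read
    my @kept;
    while ( defined( my $header = readline($fh) ) ) {
        my $data = readline($fh);
        last unless defined $data;    # ignore a trailing unpaired line
        chomp( $header, $data );
        next if $seen{$data}++;       # skip a data line already recorded
        push @kept, [ $header, $data ];
    }
    return @kept;
}

# Usage: read records from standard input and print the ones kept.
print "$_->[0]\n$_->[1]\n" for unique_pairs( \*STDIN );
```

&lt;p&gt;Note that %seen keeps one key per distinct data line, so its memory use grows with the input.&lt;/p&gt;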

&lt;p&gt;So if none of the above helps, you may wish to read about BioPerl in both the Wikipedia article, [http://en.wikipedia.org/wiki/BioPerl], and on the project page, [http://www.bioperl.org/wiki/Main_Page].&lt;/p&gt;</field>
<field name="root_node">
1007560</field>
<field name="parent_node">
1007807</field>
</data>
</node>
