<?xml version="1.0" encoding="windows-1252"?>
<node id="1007942" title="Best Way To Parse Concordance DAT File Using Modern Perl?" created="2012-12-08 21:14:53" updated="2012-12-08 21:14:53">
<type id="115">
perlquestion</type>
<author id="546548">
Jim</author>
<data>
<field name="doctext">
&lt;p&gt;A Concordance DAT file is essentially a CSV-style delimited text file that uses the following metacharacters in place of the usual comma, double quote, and embedded newline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;tt&gt;U+0014&lt;/tt&gt; Field Separator ("Comma") &lt;tt&gt;&amp;#91;DEVICE CONTROL FOUR&amp;#93;&lt;/tt&gt;&lt;/li&gt;
&lt;li&gt;&lt;tt&gt;U+00FE&lt;/tt&gt; String Delimiter ("Quote") &lt;tt&gt;&amp;#91;LATIN SMALL LETTER THORN&amp;#93;&lt;/tt&gt;&lt;/li&gt;
&lt;li&gt;&lt;tt&gt;U+00AE&lt;/tt&gt; Newline Placeholder ("Newline") &lt;tt&gt;&amp;#91;REGISTERED SIGN&amp;#93;&lt;/tt&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What's the best way to parse a Concordance DAT file using Modern Perl?&lt;/p&gt;

&lt;p&gt;Assume the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The file is encoded in UTF-8&lt;/li&gt;
&lt;li&gt;The file starts with the Unicode byte order mark (&lt;tt&gt;U+FEFF&lt;/tt&gt;), which must be handled properly&lt;/li&gt;
&lt;li&gt;Records are terminated by &lt;tt&gt;&amp;lt;CR&amp;gt;&amp;lt;LF&amp;gt;&lt;/tt&gt; pairs&lt;/li&gt;
&lt;li&gt;Despite the newline placeholder convention, fields &lt;i&gt;can&lt;/i&gt; contain literal &lt;tt&gt;&amp;lt;CR&amp;gt;&lt;/tt&gt;, &lt;tt&gt;&amp;lt;LF&amp;gt;&lt;/tt&gt;, and paired &lt;tt&gt;&amp;lt;CR&amp;gt;&amp;lt;LF&amp;gt;&lt;/tt&gt; characters&lt;/li&gt;
&lt;li&gt;The text in fields can be arbitrarily large (many megabytes)&lt;/li&gt;
&lt;/ul&gt;
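
&lt;p&gt;To make the rules above concrete, here is a minimal sketch of the naive approach they imply. It is not offered as the best way. It assumes that every field is wrapped in the string delimiter, that the delimiter never occurs inside a field, and that the whole file fits in memory, which may not hold for very large files. The name &lt;tt&gt;parse_dat&lt;/tt&gt; and the sample data are hypothetical.&lt;/p&gt;

&lt;code&gt;
use strict;
use warnings;
use Encode qw(decode encode);

# Metacharacters from the list above.
my $SEP     = "\x{14}";   # field separator
my $QUOTE   = "\x{FE}";   # string delimiter
my $NEWLINE = "\x{AE}";   # newline placeholder

# Hypothetical helper: takes the raw bytes of a DAT file and returns a
# list of records, each an array reference of field strings.
sub parse_dat {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);
    $text =~ s/\A\x{FEFF}//;              # drop the byte order mark

    my (@records, @fields);
    pos($text) = 0;
    # One quoted field per iteration. Literal CR and LF inside the quotes
    # are ordinary field text; only a CRLF after a closing quote (or the
    # end of input) terminates a record.
    while ($text =~ /\G$QUOTE([^$QUOTE]*)$QUOTE($SEP|\r\n|\z)/gc) {
        my ($field, $end) = ($1, $2);
        $field =~ s/$NEWLINE/\n/g;        # placeholder back to a newline
        push @fields, $field;
        if ($end ne $SEP) {
            push @records, [@fields];
            @fields = ();
        }
    }
    # Anything left unparsed, or a dangling separator, is treated as an error.
    die 'Malformed input at offset ' . pos($text) . "\n"
        if pos($text) != length($text) or @fields;
    return @records;
}

# Two records; the second has a placeholder and a literal CRLF in a field.
my $sample = encode('UTF-8',
    "\x{FEFF}${QUOTE}DocID${QUOTE}${SEP}${QUOTE}Text${QUOTE}\r\n"
  . "${QUOTE}001${QUOTE}${SEP}${QUOTE}one${NEWLINE}two\r\nthree${QUOTE}\r\n");

my @records = parse_dat($sample);
printf "%d records, %d fields in the first\n",
    scalar(@records), scalar(@{ $records[0] });
&lt;/code&gt;

&lt;p&gt;The sketch decodes and scans the whole input in one pass, so it says nothing about how to stream records when individual fields run to many megabytes.&lt;/p&gt;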

&lt;p&gt;&lt;i&gt;Thanks!&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;Jim&lt;/p&gt;
</field>
</data>
</node>
