<?xml version="1.0" encoding="windows-1252"?>
<node id="1008172" title="Re^2: Best Way To Parse Concordance DAT File Using Modern Perl?" created="2012-12-10 17:09:11" updated="2012-12-10 17:09:11">
<type id="11">
note</type>
<author id="546548">
Jim</author>
<data>
<field name="doctext">
&lt;blockquote&gt;&lt;i&gt;If it's a UTF-8 file, isn't it meant to have a 3 byte BOM? Your BOM indicates that it's a UTF-16 file, not UTF-8.&lt;/i&gt;&lt;/blockquote&gt;

&lt;p&gt;It &lt;i&gt;is&lt;/i&gt; a Unicode BOM encoded in three bytes in the UTF-8 character encoding scheme. But it's just &lt;i&gt;one&lt;/i&gt; character (&lt;i&gt;one&lt;/i&gt; Unicode code point), represented in Perl as &lt;tt&gt;\x{FEFF}&lt;/tt&gt; or &lt;tt&gt;\N{BYTE ORDER MARK}&lt;/tt&gt;. In a decoded, abstract Unicode string, distinctions between various encodings (serializations) of the string don't exist.&lt;/p&gt;
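
&lt;p&gt;A minimal sketch of the point, assuming only the core &lt;tt&gt;Encode&lt;/tt&gt; module (the variable names are just illustrative): the same single code point is serialized three ways, and each serialization decodes back to the same one-character string.&lt;/p&gt;

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One abstract character: U+FEFF, the byte order mark.
my $bom = "\x{FEFF}";

# Serialize the same code point under three encoding schemes.
# (The BE/LE variants are used because plain 'UTF-16' would
# prepend a BOM of its own when encoding.)
my $utf8    = encode('UTF-8',    $bom);
my $utf16be = encode('UTF-16BE', $bom);
my $utf16le = encode('UTF-16LE', $bom);

# Show each serialization as uppercase hex.
printf "%-9s %s\n", 'UTF-8',    uc(unpack('H*', $utf8));     # EFBBBF
printf "%-9s %s\n", 'UTF-16BE', uc(unpack('H*', $utf16be));  # FEFF
printf "%-9s %s\n", 'UTF-16LE', uc(unpack('H*', $utf16le));  # FFFE

# Decoding any of them yields the same one-character string.
for my $pair (['UTF-8', $utf8], ['UTF-16BE', $utf16be], ['UTF-16LE', $utf16le]) {
    my ($name, $bytes) = @$pair;
    my $chars = decode($name, $bytes);
    printf "%-9s decodes to %d character(s), U+%04X\n",
        $name, length($chars), ord($chars);
}
```

&lt;p&gt;The byte sequences differ (three bytes versus two), but once decoded there is only &lt;tt&gt;\x{FEFF}&lt;/tt&gt;.&lt;/p&gt;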

&lt;p&gt;Jim&lt;/p&gt;
</field>
<field name="root_node">
1007942</field>
<field name="parent_node">
1008106</field>
</data>
</node>
