<?xml version="1.0" encoding="windows-1252"?>
<node id="938591" title="Re^3: Converting a Text file to XML" created="2011-11-17 06:03:47" updated="2011-11-17 06:03:47">
<type id="11">
note</type>
<author id="44715">
graff</author>
<data>
<field name="doctext">
You don't seem to understand what [GrandFather]'s code is doing. In particular, this chunk of code determines how the original line of text is divided into tag-able pieces:
&lt;c&gt;
    my ($bibData, $quote, $primary, $sec) = /
        ^([^"]* "[^"]+"[^"]*)
        ([^\@]+)
        \@([^%]+)
        \%(.*)
        /x;
&lt;/c&gt;
That's a regex, expressed on multiple lines (thanks to the "x" modifier at the end), where the first line captures everything &lt;strike&gt;up through the first close-quote&lt;/strike&gt; &lt;i&gt;up to the second open-quote&lt;/i&gt;, and the second line captures everything from that point up to the first "@" (keyword symbol).
&lt;P&gt;
To get the date as a separate item, you just need to divide up the match a little differently, like this:
&lt;c&gt;
    my ($bibData, $date, $quote, $primary, $sec) = /
        ^([^"]* "[^"]+".*?)
        (\d{4})\.\s+
        ([^\@]+)
        \@([^%]+)
        \%(.*)
        /x;

    $xml-&gt;startTag('entry');
    $xml-&gt;dataElement(bib     =&gt; $bibData);
    $xml-&gt;dataElement(date    =&gt; $date);
    $xml-&gt;dataElement(quote   =&gt; $quote);
    $xml-&gt;dataElement(primary =&gt; $primary);
    $xml-&gt;dataElement(sec     =&gt; $sec);
    $xml-&gt;endTag();
&lt;/c&gt;
Note how the first capture changed: it now ends with &lt;c&gt;.*?&lt;/c&gt;, a "non-greedy" match of any characters that stops as soon as the rest of the pattern can match. What follows is the capture I added: 4 digits, which must be followed by a literal period and whitespace (updated to require at least one whitespace character); only the digits end up in $date. Then we also have to add a $date variable to the list of assignments, as well as the &lt;c&gt;$xml-&gt;dataElement()&lt;/c&gt; call that includes the $date value in the output.
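&lt;P&gt;
If you want to check the split without writing any XML, something like the following should do. It is only a sketch: the sub name split_entry and the sample line are made up for illustration, and the sample is my guess at what one of your entries looks like.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of GrandFather's script): split one
# entry line into bib data, date, quote, primary and secondary keyword.
sub split_entry {
    my ($line) = @_;
    my @fields = $line =~ /
        ^([^"]* "[^"]+".*?)   # bib data: quoted title, then lazily up to the year
        (\d{4})\.\s+          # four digit year, then a period and whitespace
        ([^\@]+)
        \@([^%]+)
        \%(.*)
        /x;
    return @fields;           # empty list when the line does not match
}

# Invented sample line; single quotes keep Perl from interpolating the sigils.
my $sample = 'Smith, John. "A Title." Some Journal. 2009. A quoted passage. @keyword %secondary';
my ($bibData, $date, $quote, $primary, $sec) = split_entry($sample);
print "date: $date\n";        # prints "date: 2009"
```

Note that the captured pieces keep their trailing whitespace (for example, $primary comes back as "keyword " with a space at the end), so you may want to trim them before handing them to the XML writer.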
&lt;P&gt;
Bear in mind that if your input ever includes a line of text like this, the method above will do the wrong thing:
&lt;c&gt;
"Big Brother."  Review of Orwell's Novel 1984.  Nov. 2011.  "Tough situation." @tricky %unparsable.
&lt;/c&gt;
That could be "fixed" by making the regex match more explicit -- e.g. looking for any of the 12 month abbreviations before the 4-digit year -- but then some entries might lack a month, or the month might be unabbreviated or misspelled...
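&lt;P&gt;
To make that trade-off concrete, here is a rough comparison of the two patterns. The stricter pattern is just one way the month check might be written, not something from the original script; the names $loose, $strict, $months and year_of, and the second sample line, are all invented for this illustration.

```perl
use strict;
use warnings;

# The pattern from above: the first run of 4 digits followed by a period wins.
my $loose = qr/
    ^([^"]* "[^"]+".*?)
    (\d{4})\.\s+
    ([^\@]+)
    \@([^%]+)
    \%(.*)
    /x;

# Assumed stricter variant: the year must follow a month abbreviation.
my $months = qr/(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)/;
my $strict = qr/
    ^([^"]* "[^"]+".*?)
    $months\.?\s+(\d{4})\.\s+
    ([^\@]+)
    \@([^%]+)
    \%(.*)
    /x;

# Hypothetical helper: return the captured year, or undef if there is no match.
sub year_of {
    my ($re, $line) = @_;
    my @fields = $line =~ $re;
    return @fields ? $fields[1] : undef;
}

my $tricky = q{"Big Brother."  Review of Orwell's Novel 1984.  Nov. 2011.  "Tough situation." @tricky %unparsable.};
my $plain  = q{Smith, John. "A Title." Some Journal. 2009. A quoted passage. @keyword %secondary};

print "loose,  tricky: ", year_of($loose,  $tricky), "\n";               # 1984 (wrong)
print "strict, tricky: ", year_of($strict, $tricky), "\n";               # 2011
print "strict, plain:  ", year_of($strict, $plain) // 'no match', "\n";  # no match
```

So the stricter pattern gets the tricky line right, but it no longer matches an entry that has a year and no month.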
&lt;P&gt;
Any attempt to impose structure like this on plain text has a non-zero probability of failing, because it's impossible to anticipate all the unexpected variations that eventually show up in (human-authored) plain text.</field>
<field name="root_node">
938507</field>
<field name="parent_node">
938516</field>
</data>
</node>
