<?xml version="1.0" encoding="windows-1252"?>
<node id="938677" title="Re^4: Converting a Text file to XML" created="2011-11-17 14:52:44" updated="2011-11-17 14:52:44">
<type id="11">
note</type>
<author id="937512">
strobodyne</author>
<data>
<field name="doctext">
&lt;p&gt;Thanks for your clarification.  I did understand Grandfather's code; I think I just used the wrong terminology in my question.  As you said, what I wanted was the proper regex to match that four-digit year.  Your additions (as well as your modification of the &lt;code&gt;$bibData&lt;/code&gt; field) did that beautifully.&lt;/p&gt;

&lt;p&gt;I do expect to hit a number of rough spots, especially since I plan to edit all of my research notes in a file that is equally human- and machine-readable.  Quite a dream, isn't it?&lt;/p&gt;

&lt;p&gt;One immediate problem I see is that the script only recognizes bibliographic data between quotation marks.  So a journal article title, which is in quotes, will get picked up, while a book title, which conventionally doesn't have quotes, will not.  This effectively excludes about a third of my data from the XML output.&lt;/p&gt;

&lt;p&gt;I think I might go back and edit the raw text file so that the bibliographic info on each line is enclosed between | characters.&lt;/p&gt;

&lt;p&gt;My question is: what regex could I use in place of &lt;code&gt;^([^"]* "[^"]+".*?)&lt;/code&gt; so that &lt;code&gt;$bibData&lt;/code&gt; captures all of the text between the | characters?&lt;/p&gt;

&lt;p&gt;Thanks again.  I'll be sure to show everyone the final product once I'm finished.&lt;/p&gt;</field>
<field name="root_node">
938507</field>
<field name="parent_node">
938591</field>
</data>
</node>
