In reply to Re^2: Converting a Text file to XML
You don't seem to understand what GrandFather's code is doing. In particular, this chunk of code determines how the original line of text is divided into tag-able pieces:
my ($bibData, $quote, $primary, $sec) = /
That's a regex written across multiple lines (thanks to the "x" modifier at the end). Its first line captures everything from the start of the text, through the first close-quote, up to the second open-quote; its second line captures everything from that point up to the first "@" (the keyword symbol).
To get the date as a separate item, you just need to divide up the match a little differently, like this:
my ($bibData, $date, $quote, $primary, $sec) = /
$xml->dataElement(bib => $bibData);
$xml->dataElement(date => $date);
$xml->dataElement(quote => $quote);
$xml->dataElement(primary => $primary);
$xml->dataElement(sec => $sec);
Note how the first capture changed: it now ends with .*? to do a "non-greedy" match of any characters until the next capture can match. That next capture is the one I added, which looks for 4 digits followed by a literal period and whitespace (updated to require at least one whitespace character). We also have to add a $date variable to the list of assignments, and an $xml->dataElement() call to include the $date value in the output.
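To see the non-greedy capture at work on its own, here is a minimal sketch. It is not GrandFather's regex: the sample line, the exact layout of the pattern, and the "@" and "%" keyword captures are my assumptions about what the input looks like, and it just prints the pieces instead of calling $xml->dataElement().

```perl
use strict;
use warnings;

# Hypothetical input line; the layout is assumed, not taken from the real data.
my $line = q{"Some Title." Journal of Examples. 2011. "A short quote." @primary %secondary.};

my ($bibData, $date, $quote, $primary, $sec) = $line =~ /
    ^(.*?)\s*         # bib data: shortest run that still lets the rest match
    (\d{4})\.\s+      # date: 4 digits, a literal period, one or more whitespace
    ([^\@]*?)\s*      # quote: everything up to the first at-sign
    \@(\w+)\s*        # primary keyword
    %(\w+)            # secondary keyword
/x;

print "bib:     $bibData\n";
print "date:    $date\n";
print "quote:   $quote\n";
print "primary: $primary\n";
print "sec:     $sec\n";
```

Because .*? takes as little as it can, the date capture grabs the first "4 digits, period, whitespace" it comes across, which is exactly why the order of the captures matters.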
Bear in mind that if your input ever includes a line of text like this, the method above will do the wrong thing:
"Big Brother." Review of Orwell's Novel 1984. Nov. 2011. "Tough situation." @tricky %unparsable.
That could be "fixed" by making the regex match more explicit -- e.g. looking for any of the 12 month abbreviations before the 4-digit year -- but then some entries might lack a month, or the month will be unabbreviated or misspelled...
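A rough sketch of that trade-off follows. Both patterns are my own stand-ins (assumed layout, assumed "@" and "%" keyword captures), not the code from earlier in the thread, and the second test line is made up. The month-aware version parses the tricky line but silently rejects an entry with no month.

```perl
use strict;
use warnings;

my $tricky   = q{"Big Brother." Review of Orwell's Novel 1984. Nov. 2011. "Tough situation." @tricky %unparsable.};
my $no_month = q{"Some Title." Journal of Examples. 2011. "A short quote." @primary %secondary.};

# Simple pattern: any 4 digits plus period plus whitespace is taken as the date.
my $simple = qr/
    ^(.*?)\s*
    (\d{4})\.\s+
    ([^\@]*?)\s*
    \@(\w+)\s*
    %(\w+)
/x;

# Month-aware pattern: the year must follow an abbreviated month and a period.
my $month  = qr/Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec/;
my $strict = qr/
    ^(.*?)\s*
    ((?:$month)\.\s+\d{4})\.\s+
    ([^\@]*?)\s*
    \@(\w+)\s*
    %(\w+)
/x;

my @simple_fields = $tricky   =~ $simple;   # date comes out as the 1984 in the title
my @strict_fields = $tricky   =~ $strict;   # date comes out as Nov. 2011
my @plain_fields  = $no_month =~ $strict;   # no month in the line, so no match at all

print "simple date: $simple_fields[1]\n";
print "strict date: $strict_fields[1]\n";
print "no-month entry matched: ", (@plain_fields ? "yes" : "no"), "\n";
```

So the stricter pattern just trades one kind of failure for another, and you would still need a fallback for lines it refuses to match.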
Any attempt to impose structure like this on plain text has a non-zero probability of failing, because it's impossible to anticipate all the unexpected variations that eventually show up in (human-authored) plain text.