
Re^3: Converting a Text file to XML

by graff (Chancellor)
on Nov 17, 2011 at 11:03 UTC (#938591)

in reply to Re^2: Converting a Text file to XML
in thread Converting a Text file to XML

You don't seem to understand what GrandFather's code is doing. In particular, this chunk of code determines how the original line of text is divided into tag-able pieces:
    my ($bibData, $quote, $primary, $sec) = /
        ^([^"]* "[^"]+"[^"]*)
        ([^\@]+)
        \@([^%]+)
        \%(.*)
        /x;
That's a regex written across multiple lines (thanks to the "x" modifier at the end, which makes Perl ignore whitespace inside the pattern). The first line captures everything from the start of the text through the first quoted string and on up to, but not including, the second open-quote. The second line captures everything from that point up to the first "@" (the keyword symbol).

To get the date as a separate item, you just need to divide up the match a little differently, like this:

    my ($bibData, $date, $quote, $primary, $sec) = /
        ^([^"]* "[^"]+".*?)
        (\d{4})\.\s+
        ([^\@]+)
        \@([^%]+)
        \%(.*)
        /x;

    $xml->startTag('entry');
    $xml->dataElement(bib => $bibData);
    $xml->dataElement(date => $date);
    $xml->dataElement(quote => $quote);
    $xml->dataElement(primary => $primary);
    $xml->dataElement(sec => $sec);
    $xml->endTag();
Note how the first capture changed: it now ends with .*? to do a "non-greedy" match of any characters until the next capture can match. That next capture is the one I added: it looks for 4 digits followed by a literal period and whitespace (updated to require at least one whitespace character). We also have to add a $date variable to the list of assignments, as well as an $xml->dataElement() call to include the $date value in the output.
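As a rough, self-contained sketch of how the revised regex divides a line, the snippet below applies it to one made-up entry. The sample line, the parse_entry name, and the whitespace trimming are assumptions added for illustration; they are not part of GrandFather's script, and the XML::Writer calls are left out so that the sketch needs only core Perl.

```perl
use strict;
use warnings;

# Split one note line into its fields. Returns a hash ref, or nothing
# when the line does not fit the expected layout.
sub parse_entry {
    my ($line) = @_;
    my ($bib, $date, $quote, $primary, $sec) = $line =~ /
        ^([^"]* "[^"]+" .*?)   # bib data: through the first quoted title, then lazily onward
        (\d{4})\.\s+           # date: four digits, a literal period, whitespace
        ([^\@]+)               # quote: everything up to the first at-sign
        \@([^%]+)              # primary keyword: up to the first percent sign
        \%(.*)                 # secondary keyword: the rest of the line
    /x or return;

    # Trimming is an addition for tidier output; the original code keeps
    # whatever whitespace the captures happen to include.
    s/^\s+|\s+$//g for $bib, $quote, $primary, $sec;

    return {
        bib     => $bib,
        date    => $date,
        quote   => $quote,
        primary => $primary,
        sec     => $sec,
    };
}

# Hypothetical input line, invented for this example.
my $line = q{Smith, J. "A Sample Title." Journal of Examples 5. 2010. "Some quoted passage." @keyword %secondary};

if (my $entry = parse_entry($line)) {
    printf "%-8s %s\n", $_, $entry->{$_} for qw(bib date quote primary sec);
}
else {
    warn "Line did not match: $line\n";
}
```

Each value in the returned hash could then be passed to the corresponding $xml->dataElement() call.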

Bear in mind that if your input ever includes a line of text like this, the method above will do the wrong thing:

    "Big Brother." Review of Orwell's Novel 1984. Nov. 2011. "Tough situation." @tricky %unparsable.
That could be "fixed" by making the regex match more explicit -- e.g. looking for any of the 12 month abbreviations before the 4-digit year -- but then some entries might lack a month, or the month will be unabbreviated or misspelled...
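To make the trade-off concrete, here is a minimal sketch comparing the date match used so far with a stricter one that insists on a month abbreviation before the year. The names find_date_loose and find_date_strict, the $MONTH pattern, and the second sample line are assumptions invented for this illustration.

```perl
use strict;
use warnings;

# The 12 month abbreviations, as suggested in the text.
my $MONTH = qr/(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)/;

# Date match as used so far: the first 4-digit string followed by a
# period and whitespace, anywhere after the first quoted title.
sub find_date_loose {
    my ($line) = @_;
    return $line =~ /^[^"]* "[^"]+" .*? (\d{4})\.\s+/x ? $1 : undef;
}

# Stricter variant: the year must directly follow a month abbreviation.
sub find_date_strict {
    my ($line) = @_;
    return $line =~ /^[^"]* "[^"]+" .*? \b$MONTH\.\s+ (\d{4})\.\s+/x ? $1 : undef;
}

my $tricky   = q{"Big Brother." Review of Orwell's Novel 1984. Nov. 2011. "Tough situation." @tricky %unparsable.};
# Hypothetical entry without a month, invented for this example.
my $no_month = q{Smith, J. "A Sample Title." Journal of Examples 5. 2010. "Some quoted passage." @keyword %secondary};

for my $line ($tricky, $no_month) {
    my $loose  = find_date_loose($line);
    my $strict = find_date_strict($line);
    printf "loose=%s strict=%s\n",
        defined $loose  ? $loose  : 'none',
        defined $strict ? $strict : 'none';
}
```

On the tricky line the loose match picks up 1984 from the title while the strict match finds 2011. On the entry with no month the strict match finds nothing at all, which is exactly the weakness described above.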

Any attempt to impose structure like this on plain text has a non-zero probability of failing, because it's impossible to anticipate all the unexpected variations that eventually show up in (human-authored) plain text.

Re^4: Converting a Text file to XML
by strobodyne (Initiate) on Nov 17, 2011 at 19:52 UTC

    Thanks for your clarification. I did understand GrandFather's code; I think I just used the wrong terminology in my question. As you said, what I wanted was the proper regex to search for that four-digit year. Your additions (as well as your modification of the $bibData field) did that beautifully.

    I do expect to come upon a number of rough spots, especially as I'm expecting to edit all of my research notes in a file that is equally human- and machine-readable. Quite a dream, isn't it?

    One immediate problem I see with this is that the script only recognizes bibliographic data between quotation marks. So, a journal article between quotes will get picked up while a book title, which conventionally doesn't have quotes, will not. This effectively excludes about a third of my data from the xml output.

    I think I might go back and edit the raw text file so that the bibliographic info on each line is between | characters.

    My question is, what regex could I use to replace ^([^"]* "[^"]+".*?) so that $bibData identifies all text between | characters?

    Thanks again. I'll be sure to show everyone the final product once I'm finished.

      If you're going to manually insert field delimiters, then you could just switch to using split:
      my ( $title, $date, $this, $that ) = split /\|/;
      But it's likely that the data will mostly fall into a few dominant format groups, with some long tail of "outliers". You could either apply a list of regex matches (if the first one doesn't work, try the next one, and so on), or you could try some simple diagnostics to divide the data into subsets according to the absence/presence/type of difficulty: if there's more than one 4-digit string, that's one problem; if there are no double quotes (or an odd number of quotes), that's another problem, and so on. This will reduce the number of cases that need to be fixed by hand before they can be parsed.
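The diagnostic idea could be sketched roughly as follows. The diagnose function, its problem labels, and the three sample lines (apart from the "Big Brother" line quoted earlier in the thread) are assumptions made up for illustration; a real run would read the lines from the notes file instead.

```perl
use strict;
use warnings;

# Return a list of problem labels for one line. An empty list means the
# line shows none of the difficulties checked for here.
sub diagnose {
    my ($line) = @_;
    my @problems;

    my $years  = () = $line =~ /\b\d{4}\b/g;   # how many 4-digit strings
    my $quotes = ($line =~ tr/"//);            # how many double quotes

    push @problems, 'more than one 4-digit string' if $years > 1;
    push @problems, 'no double quotes'             if $quotes == 0;
    push @problems, 'odd number of quotes'         if $quotes % 2;

    return @problems;
}

# Sample lines; the first and third are hypothetical.
my @lines = (
    q{Smith, J. "A Sample Title." Journal of Examples 5. 2010. "Some quoted passage." @keyword %secondary},
    q{"Big Brother." Review of Orwell's Novel 1984. Nov. 2011. "Tough situation." @tricky %unparsable.},
    q{Doe, A. A Book Title. 2009. Some note. @keyword %secondary},
);

# Group the lines by the kind of difficulty they show.
my %bucket;
for my $line (@lines) {
    my @problems = diagnose($line);
    my $label = @problems ? join('; ', @problems) : 'ok';
    push @{ $bucket{$label} }, $line;
}

printf "%-32s %d line(s)\n", $_, scalar @{ $bucket{$_} } for sort keys %bucket;
```

Lines in the 'ok' bucket can go straight to the main regex, and each remaining bucket can be handled with its own regex or fixed by hand.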
