PerlMonks  

Re: Converting a Text file to XML

by GrandFather (Cardinal)
on Nov 17, 2011 at 03:29 UTC (#938513)


in reply to Converting a Text file to XML

It may be that XML::Writer is helpful. Consider:

#!/usr/bin/perl
use strict;
use warnings;
use XML::Writer;

my $out;
my $xml = XML::Writer->new(OUTPUT => \$out, DATA_MODE => 1, DATA_INDENT => ' ');

$xml->xmlDecl();
$xml->startTag('doc');

while (<DATA>) {
    chomp;
    next if !length;    # skip blank lines

    # Split the line into bibliographic data, quote, and the two keywords.
    my ($bibData, $quote, $primary, $sec) = /
        ^([^"]* "[^"]+"[^"]*)
        ([^\@]+)
        \@([^%]+)
        \%(.*)
        /x;

    $xml->startTag('entry');
    $xml->dataElement(bib     => $bibData);
    $xml->dataElement(quote   => $quote);
    $xml->dataElement(primary => $primary);
    $xml->dataElement(sec     => $sec);
    $xml->endTag();
}

$xml->endTag();
$xml->end();
print $out;

__DATA__
Leyson, Captain Burr. "With or Without Gadgets." Boys' Life. Nov 1949. p. 6. "An old-timer knew what he had to do in a jam. He didn't need hundreds of those gadgets to guide him to safety." @gauge %trivial
"The Battle Against Baldness." Kiplinger's Personal Finance. Feb 1949. "A little home hair-cutter gadget--a comb with a razor attached-- has zipped its way into fame in recent months. Barbers pooh-pooh it as a threat, but sales are going strong." @tool %american

Prints:

<?xml version="1.0"?>
<doc>
 <entry>
  <bib>Leyson, Captain Burr. "With or Without Gadgets." Boys' Life. Nov 1949. p. 6. </bib>
  <quote>"An old-timer knew what he had to do in a jam. He didn't need hundreds of those gadgets to guide him to safety." </quote>
  <primary>gauge </primary>
  <sec>trivial</sec>
 </entry>
 <entry>
  <bib>"The Battle Against Baldness." Kiplinger's Personal Finance. Feb 1949. </bib>
  <quote>"A little home hair-cutter gadget--a comb with a razor attached-- has zipped its way into fame in recent months. Barbers pooh-pooh it as a threat, but sales are going strong." </quote>
  <primary>tool </primary>
  <sec>american</sec>
 </entry>
</doc>

True laziness is hard work


Re^2: Converting a Text file to XML
by strobodyne (Initiate) on Nov 17, 2011 at 04:32 UTC
    Excellent, this is exactly what I was looking for -- I don't even want to tell you how much time I wasted on trial and error. And to add a $year field that searches for four-digit numbers, would that just be some kind of grep? Perhaps something like ^\d\d\d\d$? Or does XML::Writer not have that capability?
      You don't seem to understand what GrandFather's code is doing. In particular, this chunk of code determines how the original line of text is divided into tag-able pieces:

      my ($bibData, $quote, $primary, $sec) = /
          ^([^"]* "[^"]+"[^"]*)
          ([^\@]+)
          \@([^%]+)
          \%(.*)
          /x;

      That's a regex, expressed on multiple lines (thanks to the "x" modifier at the end). The first line captures everything through the first close-quote and on up to, but not including, the second open-quote; the second line captures everything from that point up to the first "@" (the keyword symbol).
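To see what each capture actually picks up, here is a minimal sketch that runs the same regex against a shortened version of the first sample line. The sub name parse_entry and the shortened quote are assumptions made for illustration; only the pattern comes from GrandFather's code.

```perl
use strict;
use warnings;

# parse_entry is a hypothetical helper name; the pattern is the one quoted above.
sub parse_entry {
    my ($line) = @_;
    return $line =~ /
        ^([^"]* "[^"]+"[^"]*)   # bib: through the quoted title, up to the next quote mark
        ([^\@]+)                # quote: everything up to the at-sign
        \@([^%]+)               # primary keyword: between the at-sign and the percent sign
        \%(.*)                  # secondary keyword: rest of the line
        /x;
}

# Shortened sample line (the quote text is abbreviated here).
my $line = q{Leyson, Captain Burr. "With or Without Gadgets." Boys' Life. Nov 1949. p. 6. "An old-timer knew what he had to do in a jam." @gauge %trivial};

my ($bib, $quote, $primary, $sec) = parse_entry($line);
print "bib:     [$bib]\n";
print "quote:   [$quote]\n";
print "primary: [$primary]\n";
print "sec:     [$sec]\n";
```

The brackets in the printout make the trailing spaces visible: bib, quote and primary each keep the space that precedes the next delimiter, which is why the XML output above shows a space before the closing tags.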

      To get the date as a separate item, you just need to divide up the match a little differently, like this:

      my ($bibData, $date, $quote, $primary, $sec) = /
          ^([^"]* "[^"]+".*?)
          (\d{4})\.\s+
          ([^\@]+)
          \@([^%]+)
          \%(.*)
          /x;

      $xml->startTag('entry');
      $xml->dataElement(bib     => $bibData);
      $xml->dataElement(date    => $date);
      $xml->dataElement(quote   => $quote);
      $xml->dataElement(primary => $primary);
      $xml->dataElement(sec     => $sec);
      $xml->endTag();

      Note how the first capture changed: it now ends with .*? to do a "non-greedy" match of any character until the next capture match is found, which is the one I added to look for 4 digits followed by a literal period and whitespace (updated to require at least one whitespace character). Then we also have to add a $date variable to the list of assignments, as well as the $xml->dataElement() call to include the $date value in the output.
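A small sketch of the date-aware split on its own, without XML::Writer, so the effect of the non-greedy .*? is easy to check. The sub name parse_entry_with_date and the shortened quote text are assumptions for illustration; the pattern is the one shown above.

```perl
use strict;
use warnings;

# parse_entry_with_date is a hypothetical name; the pattern is the date-aware regex above.
sub parse_entry_with_date {
    my ($line) = @_;
    return $line =~ /
        ^([^"]* "[^"]+".*?)   # bib: shortest run that still lets the year match next
        (\d{4})\.\s+          # date: four digits, a literal period, then whitespace
        ([^\@]+)              # quote
        \@([^%]+)             # primary keyword
        \%(.*)                # secondary keyword
        /x;
}

# Shortened version of the second sample line.
my $line = q{"The Battle Against Baldness." Kiplinger's Personal Finance. Feb 1949. "A little home hair-cutter gadget." @tool %american};

my ($bib, $date, $quote, $primary, $sec) = parse_entry_with_date($line);
print "bib:   [$bib]\n";
print "date:  [$date]\n";
print "quote: [$quote]\n";
```

Because .*? stops as soon as the year pattern can match, the bib field here ends just before the year (the month name stays in bib), and the period and whitespace after the year are consumed without being captured.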

      Bear in mind that if your input ever includes a line of text like this, the method above will do the wrong thing:

      "Big Brother." Review of Orwell's Novel 1984. Nov. 2011. "Tough situation." @tricky %unparsable.
      That could be "fixed" by making the regex match more explicit -- e.g. looking for any of the 12 month abbreviations before the 4-digit year -- but then some entries might lack a month, or the month will be unabbreviated or misspelled...
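The failure is easy to reproduce. The sketch below runs the date-aware pattern on the tricky line, then tries one possible month-anchored variant. Both sub names are hypothetical, and the month-anchored pattern is only an assumption about what "more explicit" could look like; it still breaks on entries with no month, or with a full or misspelled month name.

```perl
use strict;
use warnings;

# Hypothetical name; same date-aware pattern as above.
sub split_at_year {
    my ($line) = @_;
    return $line =~ /
        ^([^"]* "[^"]+".*?)
        (\d{4})\.\s+
        ([^\@]+)
        \@([^%]+)
        \%(.*)
        /x;
}

# Hypothetical stricter variant: the year must follow a month abbreviation,
# and the month stays inside the bib capture.
sub split_at_month_year {
    my ($line) = @_;
    return $line =~ /
        ^([^"]* "[^"]+".*?
          \b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\.?\s+)
        (\d{4})\.\s+
        ([^\@]+)
        \@([^%]+)
        \%(.*)
        /x;
}

my $tricky = q{"Big Brother." Review of Orwell's Novel 1984. Nov. 2011. "Tough situation." @tricky %unparsable.};

my (undef, $loose_date,  $loose_quote)  = split_at_year($tricky);
my (undef, $strict_date, $strict_quote) = split_at_month_year($tricky);

print "loose:  date=[$loose_date] quote=[$loose_quote]\n";
print "strict: date=[$strict_date] quote=[$strict_quote]\n";
```

With the looser pattern the "1984." in the title is taken as the date and "Nov. 2011." leaks into the quote field; the month-anchored variant picks 2011 for this particular line.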

      Any attempt to impose structure like this on plain text has a non-zero probability of failing, because it's impossible to anticipate all the unexpected variations that eventually show up in (human-authored) plain text.

        Thanks for your clarification. I did understand GrandFather's code; I think I just used the wrong terminology in my question -- as you said, what I wanted was the proper regex to search for that four-digit year. Your additions (as well as your modification of the $bibData field) did that beautifully.

        I do expect to come upon a number of rough spots, especially as I'm expecting to edit all of my research notes in a file that is equally human- and machine-readable. Quite a dream, isn't it?

        One immediate problem I see with this is that the script only recognizes bibliographic data between quotation marks. So, a journal article between quotes will get picked up while a book title, which conventionally doesn't have quotes, will not. This effectively excludes about a third of my data from the xml output.

        I think I might go back and edit the raw text file so that the bibliographic info on each line is between | characters.

        My question is, what regex could I use to replace ^([^"]* "[^"]+".*?) so that $bibData identifies all text between | characters?

        Thanks again. I'll be sure to show everyone the final product once I'm finished.
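One possible shape for the |-delimited version, offered as a sketch rather than a definitive answer. It assumes each line starts with the bibliographic data wrapped in a single pair of | characters, that no | appears inside the bibliographic data itself, and that the year stays inside that field. The sub name parse_piped_entry and the sample line are made up for illustration.

```perl
use strict;
use warnings;

# parse_piped_entry is a hypothetical name; the sample line below is invented.
sub parse_piped_entry {
    my ($line) = @_;
    return $line =~ /
        ^\| ([^|]+) \| \s*   # bib: everything between the first pair of pipes
        ([^\@]+)             # quote: everything up to the at-sign
        \@([^%]+)            # primary keyword
        \%(.*)               # secondary keyword
        /x;
}

# A book-style entry with no quoted title in the bibliographic data.
my $line = q{|Doe, Jane. A Book Without Quotes. 1950.| "Some quoted passage." @tool %american};

my ($bib, $quote, $primary, $sec) = parse_piped_entry($line);
print "bib:     [$bib]\n";
print "quote:   [$quote]\n";
print "primary: [$primary]\n";
print "sec:     [$sec]\n";
```

Since the delimiters no longer depend on quotation marks, entries with and without a quoted title are split the same way; a line that lacks the leading | simply fails to match and returns an empty list.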
