Converting a Text file to XML

by strobodyne (Initiate)
on Nov 17, 2011 at 02:43 UTC (#938507=perlquestion)
strobodyne has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to parse a text file and convert it to XML. The .txt file consists of a list of entries, separated by line breaks. So, two sample entries look like this:

Leyson, Captain Burr. "With or Without Gadgets." Boys' Life. Nov 1949. p. 6. "An old-timer knew what he had to do in a jam. He didn't need hundreds of those gadgets to guide him to safety." @gauge %trivial

"The Battle Against Baldness." Kiplinger's Personal Finance. Feb 1949. "A little home hair-cutter gadget--a comb with a razor attached-- has zipped its way into fame in recent months. Barbers pooh-pooh it as a threat, but sales are going strong." @tool %american

Each entry sits on a single line and contains bibliographic data, a quotation from that source, and two sets of tags: primary classifications written as @tags and secondary classifications written as %tags.

The most important information for me to extract from each entry is year and tags. So, I came up with the following script:

#!/usr/bin/perl -w
my $year = "";
while (<>) {
    chomp;
    if ($_ eq "") {next;}
    elsif ($_ =~ /^\d\d\d\d$/) {
        $_ = $year;
    }
    else {
        s/\@(\w*)/ <keyword> $1 <\/keyword>/g;
        s/\%(\w*)/ <tag> $1 <\/tag>/g;
        print "<entry>$_ <year> $year </year> </entry>\n";
    }
}

The @tags and %tags are recognized just fine. Problem is, entries and years are not located. My program doesn't differentiate between entries: I get <entry> at the very beginning of the output and </entry> at the very end. Similarly, there's only a single, blank <year></year> right before </entry>.

I realize there's probably a very simple solution to this, but I'm still at the circumference of a circle, knock-knock-joke stage of perl programming, so your expertise would be very much appreciated. Thanks!

Re: Converting a Text file to XML
by GrandFather (Cardinal) on Nov 17, 2011 at 03:29 UTC

    It may be that XML::Writer is helpful. Consider:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Writer;

    my $out;
    my $xml = XML::Writer->new(OUTPUT => \$out, DATA_MODE => 1, DATA_INDENT => ' ');

    $xml->xmlDecl();
    $xml->startTag('doc');

    while (<DATA>) {
        chomp;
        next if !length;

        # Split the line into bibliographic data, quotation, @tag and %tag
        my ($bibData, $quote, $primary, $sec) = /
            ^([^"]* "[^"]+"[^"]*)
            ([^\@]+)
            \@([^%]+)
            \%(.*)
            /x;

        $xml->startTag('entry');
        $xml->dataElement(bib     => $bibData);
        $xml->dataElement(quote   => $quote);
        $xml->dataElement(primary => $primary);
        $xml->dataElement(sec     => $sec);
        $xml->endTag();
    }

    $xml->endTag();
    $xml->end();
    print $out;

    __DATA__
    Leyson, Captain Burr. "With or Without Gadgets." Boys' Life. Nov 1949. p. 6. "An old-timer knew what he had to do in a jam. He didn't need hundreds of those gadgets to guide him to safety." @gauge %trivial

    "The Battle Against Baldness." Kiplinger's Personal Finance. Feb 1949. "A little home hair-cutter gadget--a comb with a razor attached-- has zipped its way into fame in recent months. Barbers pooh-pooh it as a threat, but sales are going strong." @tool %american

    Prints:

    <?xml version="1.0"?>
    <doc>
     <entry>
      <bib>Leyson, Captain Burr. "With or Without Gadgets." Boys' Life. Nov 1949. p. 6. </bib>
      <quote>"An old-timer knew what he had to do in a jam. He didn't need hundreds of those gadgets to guide him to safety." </quote>
      <primary>gauge </primary>
      <sec>trivial</sec>
     </entry>
     <entry>
      <bib>"The Battle Against Baldness." Kiplinger's Personal Finance. Feb 1949. </bib>
      <quote>"A little home hair-cutter gadget--a comb with a razor attached-- has zipped its way into fame in recent months. Barbers pooh-pooh it as a threat, but sales are going strong." </quote>
      <primary>tool </primary>
      <sec>american</sec>
     </entry>
    </doc>
    True laziness is hard work
      Excellent, this is exactly what I was looking for -- I don't even want to tell you how much time I wasted on trial and error. And to add a $year field that searches for four-digit numbers, would that just be some kind of grep? Perhaps something like ^\d\d\d\d$? Or does XML::Writer not have that capability?
        You don't seem to understand what GrandFather's code is doing. In particular, this chunk of code determines how the original line of text is divided into tag-able pieces:
        my ($bibData, $quote, $primary, $sec) = /
            ^([^"]* "[^"]+"[^"]*)
            ([^\@]+)
            \@([^%]+)
            \%(.*)
            /x;
        That's a regex, expressed on multiple lines (thanks to the "x" modifier at the end), where the first line captures everything from the start of the entry through the first close-quote and on up to the second open-quote, and the second line captures everything from that point up to the first "@" (the keyword symbol).
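
        As a rough sketch of how those four captures divide a line, the same pattern can be run on its own with core Perl only (no XML::Writer), with a comment on each piece. The variable $line and the trimming step at the end are additions for illustration and are not part of GrandFather's script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# First sample entry from the question (q{} avoids interpolating @ and %).
my $line = q{Leyson, Captain Burr. "With or Without Gadgets." Boys' Life. Nov 1949. p. 6. "An old-timer knew what he had to do in a jam. He didn't need hundreds of those gadgets to guide him to safety." @gauge %trivial};

my ($bibData, $quote, $primary, $sec) = $line =~ /
    ^( [^"]* "[^"]+" [^"]* )   # bib: through the quoted title, up to the next open-quote
    ( [^\@]+ )                 # quote: everything up to the first at-sign
    \@ ( [^%]+ )               # primary: text after the at-sign, up to the percent sign
    \% ( .* )                  # sec: the rest of the line after the percent sign
/x;

# A failed match in list context leaves every variable undefined.
die "line did not match\n" unless defined $bibData;

# Illustration only: drop the trailing whitespace each capture picks up.
s/\s+\z// for $bibData, $quote, $primary, $sec;

print "bib:     $bibData\n";
print "quote:   $quote\n";
print "primary: $primary\n";
print "sec:     $sec\n";
```

        Each capture runs right up to the next delimiter, so it keeps the whitespace in front of that delimiter. That is why the XML output above shows "gauge " with a trailing space; the trimming in this sketch is optional.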

        To get the date as a separate item, you just need to divide up the match a little differently, like this:

        my ($bibData, $date, $quote, $primary, $sec) = /
            ^([^"]* "[^"]+".*?)
            (\d{4})\.\s+
            ([^\@]+)
            \@([^%]+)
            \%(.*)
            /x;

        $xml->startTag('entry');
        $xml->dataElement(bib     => $bibData);
        $xml->dataElement(date    => $date);
        $xml->dataElement(quote   => $quote);
        $xml->dataElement(primary => $primary);
        $xml->dataElement(sec     => $sec);
        $xml->endTag();
        Note how the first capture changed: it now ends with .*? to do a "non-greedy" match of any character until the next capture match is found, which is the one I added to look for 4 digits followed by a literal period and whitespace (updated to require at least one whitespace character). Then we also have to add a $date variable to the list of assignments, as well as the $xml->dataElement() call to include the $date value in the output.

        Bear in mind that if your input ever includes a line of text like this, the method above will do the wrong thing:

        "Big Brother." Review of Orwell's Novel 1984. Nov. 2011. "Tough situation." @tricky %unparsable.
        That could be "fixed" by making the regex match more explicit -- e.g. looking for any of the 12 month abbreviations before the 4-digit year -- but then some entries might lack a month, or the month will be unabbreviated or misspelled...
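
        To make both points concrete, here is a sketch in core Perl that runs the problem line through the five-capture pattern given above and then through a stricter variant that requires a three-letter month abbreviation (optionally followed by a period) right before the year. The names $tricky, $any_year, $months and $month_year are made up for this example, and the stricter pattern is only one possible shape for such a fix. Entries with no month, a spelled-out month, or a typo would still fail to match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The problem line quoted above.
my $tricky = q{"Big Brother." Review of Orwell's Novel 1984. Nov. 2011. "Tough situation." @tricky %unparsable.};

# Five-capture pattern: the non-greedy part stops at the FIRST
# four digits that are followed by a period and whitespace.
my $any_year = qr/
    ^( [^"]* "[^"]+" .*? )
    ( \d{4} ) \. \s+
    ( [^\@]+ ) \@ ( [^%]+ ) \% ( .* )
/x;

# Assumed list: the twelve three-letter English month abbreviations.
my $months = qr/Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec/;

# Stricter variant: the year must directly follow a month abbreviation,
# which stays part of the bibliographic capture.
my $month_year = qr/
    ^( [^"]* "[^"]+" .*? \b $months \.? \s+ )
    ( \d{4} ) \. \s+
    ( [^\@]+ ) \@ ( [^%]+ ) \% ( .* )
/x;

my (undef, $loose_date)                 = $tricky =~ $any_year;
my (undef, $strict_date, $strict_quote) = $tricky =~ $month_year;

print "first four digits win: $loose_date\n";    # 1984, from the book title
print "month before the year: $strict_date\n";   # 2011
print "quote: $strict_quote\n";
```

        With the first pattern the real date ends up inside the quote field, because the match has already settled on the 1984 in the title. The second pattern skips "Novel 1984" because "Nov" there is not followed by whitespace and a year.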

        Any attempt to impose structure like this on plain text has a non-zero probability of failing, because it's impossible to anticipate all the unexpected variations that eventually show up in (human-authored) plain text.

Re: Converting a Text file to XML
by CountZero (Bishop) on Nov 17, 2011 at 07:24 UTC
    /^\d\d\d\d$/ looks for a string of 4 digits and nothing more, thanks to the ^ and $ anchors. Just use /\d{4}/ and you will find the year.
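
    A small sketch of the difference, using core Perl only. The helper name find_year and the sample string are made up for illustration. Wrapping the pattern in \b word boundaries is an extra assumption here, added so that a longer run of digits is not mistaken for a year.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first standalone four-digit number, or undef.
sub find_year {
    my ($text) = @_;
    my ($year) = $text =~ /\b(\d{4})\b/;
    return $year;
}

my $entry = q{Boys' Life. Nov 1949. p. 6.};

# Anchored: succeeds only when the whole string is exactly four digits.
print "anchored on entry:   ", ($entry =~ /^\d\d\d\d$/ ? "match" : "no match"), "\n";
print "anchored on '1949':  ", ("1949" =~ /^\d\d\d\d$/ ? "match" : "no match"), "\n";

# Unanchored: finds the year anywhere in the line.
print "unanchored on entry: ", find_year($entry), "\n";
```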

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James