I'm trying to parse a text file and convert it to XML. The .txt file consists of a list of entries, separated by line breaks. Two sample entries look like this:
Leyson, Captain Burr. "With or Without Gadgets." Boys' Life. Nov 1949. p. 6. "An old-timer knew what he had to do in a jam. He didn't need hundreds of those gadgets to guide him to safety." @gauge %trivial
"The Battle Against Baldness." Kiplinger's Personal Finance. Feb 1949. "A little home hair-cutter gadget--a comb with a razor attached-- has zipped its way into fame in recent months. Barbers pooh-pooh it as a threat, but sales are going strong." @tool %american
Each entry sits on a single line and contains bibliographic data, a quotation from that source, and two sets of classification tags (primary and secondary), one written as @tags and the other as %tags.
The most important information for me to extract from each entry is the year and the tags. So, I came up with the following script:
#!/usr/bin/perl -w
my $year = "";
while (<>) {
    chomp;
    if ($_ eq "") { next; }
    elsif ($_ =~ /^\d\d\d\d$/) {
        $_ = $year;
    }
    else {
        s/\@(\w*)/ <keyword> $1 <\/keyword>/g;
        s/\%(\w*)/ <tag> $1 <\/tag>/g;
        print "<entry>$_ <year> $year </year> </entry>\n";
    }
}
The @tags and %tags are recognized just fine. The problem is that entries and years are not. My program doesn't differentiate between entries: I get a single <entry> at the very beginning of the output and a single </entry> at the very end. Similarly, there's only one blank <year></year>, right before </entry>.
I realize there's probably a very simple solution to this, but I'm still at the circumference of a circle, knock-knock-joke stage of Perl programming, so your expertise would be very much appreciated. Thanks!
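For reference, here is a minimal sketch of one way the script could be reworked. It is not a definitive fix and rests on a few assumptions. In the script above, `$_ = $year;` overwrites the current line with the empty `$year` instead of storing anything, and `/^\d\d\d\d$/` only matches a line that is nothing but four digits, whereas in the sample data the year sits inside the entry. The single <entry> around the whole output would be consistent with a file that uses bare CR line endings, so that `<>` sees one long line; that is a guess, not something the post confirms. The helper `entry_to_xml` is a hypothetical name, and taking the first four-digit number beginning with 19 or 20 as the year is also an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): turn one entry
# line into an <entry> element, taking the year from the entry text itself.
sub entry_to_xml {
    my ($entry) = @_;

    # Assumption: the first standalone four-digit number starting with
    # 19 or 20 is the publication year.
    my ($year) = $entry =~ /\b((?:19|20)\d\d)\b/;
    $year = '' unless defined $year;

    # Escape XML-special characters before adding any markup.
    $entry =~ s/&/&amp;/g;
    $entry =~ s/</&lt;/g;
    $entry =~ s/>/&gt;/g;

    # Wrap @tags and %tags; \w+ leaves a bare @ or % untouched.
    $entry =~ s{\@(\w+)}{<keyword>$1</keyword>}g;
    $entry =~ s{%(\w+)}{<tag>$1</tag>}g;

    return "<entry>$entry <year>$year</year></entry>";
}

# The two sample entries from the post, used when no file is given.
my $sample = <<'END';
Leyson, Captain Burr. "With or Without Gadgets." Boys' Life. Nov 1949. p. 6. "An old-timer knew what he had to do in a jam. He didn't need hundreds of those gadgets to guide him to safety." @gauge %trivial
"The Battle Against Baldness." Kiplinger's Personal Finance. Feb 1949. "A little home hair-cutter gadget--a comb with a razor attached-- has zipped its way into fame in recent months. Barbers pooh-pooh it as a threat, but sales are going strong." @tool %american
END

# Slurp the files named on the command line, or fall back to the sample.
my $text = @ARGV ? do { local $/; join '', <> } : $sample;

# Assumption: line endings may be CRLF, bare CR, or LF, so split on any of
# them rather than relying on the default input record separator.
for my $line (split /\r\n|\r|\n/, $text) {
    next if $line =~ /^\s*$/;    # skip blank lines
    print entry_to_xml($line), "\n";
}
```

Run as `perl script.pl entries.txt` (both file names are placeholders), this prints one <entry> element per input line, each with its own <year>. If the real file turns out to use ordinary LF line endings, the original `while (<>)` loop would work just as well once the year is captured from the entry text.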