Parsing XML/HTML
by sartzava (Initiate)
on Apr 08, 2005 at 15:25 UTC
sartzava has asked for the wisdom of the Perl Monks concerning the following question:
Okay, I'm new to Perl, so this is probably a simple question...
I am trying to edit an XML/XHTML document that was generated by a Quark extraction utility. The paragraphs in the document use nested span tags to apply formatting, and I want to fix the problems that nesting causes.
In that example, the span with class="type1" applies an italic style to the entire paragraph. The type2 span then applies a non-italic style to large sections of the paragraph, leaving individual words italicized.
Now, instances like this are easy to catch with a regex, but they can also be more involved:
Notice that the "SmallCaps" span is added in the middle of the paragraph and that there are multiple instances of the type2 tags.
Of course, I also have to deal with the possibility of the type2 tags being used to apply an italic style, like in this example I found:
What I would like to do is match the opening and closing tags to each other and make adjustments as necessary to remove the extraneous mark-up. For instance, I want the first instance above to look like this:
I need to know if anyone can help me or direct me to an extremely simple example of/tutorial on the XML::Parser or HTML::Parser module, since I am sure that one of those does what I need to do. Again, I am very new at this, so any help will be greatly appreciated.
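For reference, here is a minimal sketch of how HTML::Parser (version 3 API) might be used to pair each closing span tag with its opening tag. It keeps a stack of open spans, so a `</span>` is always attributed to the most recently opened `<span>`. The function name `strip_span_class` and the sample markup are hypothetical, made up for illustration; only the class names type1 and type2 come from the question. It simply drops one class of span and does not attempt the full italic/non-italic rewrite described above.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: remove every <span class="$drop"> together with its
# matching </span>, leaving the enclosed text and all other markup untouched.
sub strip_span_class {
    my ($html, $drop) = @_;
    my @open;        # one entry per currently open <span>; true if it was dropped
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr, $text) = @_;
            if ($tag eq 'span') {
                my $class   = defined $attr->{class} ? $attr->{class} : '';
                my $dropped = $class eq $drop;
                push @open, $dropped;      # remember the decision for the closing tag
                return if $dropped;        # swallow the opening tag
            }
            $out .= $text;
        }, 'tagname, attr, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            if ($tag eq 'span' && @open) {
                return if pop @open;       # swallow the matching closing tag
            }
            $out .= $text;
        }, 'tagname, text' ],
        # Text, comments, declarations, etc. are copied through unchanged.
        default_h => [ sub {
            my ($text) = @_;
            $out .= $text if defined $text;
        }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;
    return $out;
}

# Made-up sample input, not taken from the Quark output.
my $sample = '<p><span class="type1">One <span class="type2">two</span> three</span></p>';
print strip_span_class($sample, 'type2'), "\n";
# prints: <p><span class="type1">One two three</span></p>
```

The stack is the important part: a regex cannot easily tell which `</span>` belongs to which `<span>`, while the parser's start and end events arrive in document order, so pushing on open and popping on close gives the pairing for free. The same idea should carry over to XML::Parser's Start and End handlers if the input is well-formed XML.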